Overview of text encoding and processing

The main corpus files are transcribed and encoded using the markup scheme of the Text Encoding Initiative (TEI), in its version based on the Extensible Markup Language (XML). TEI is an international standard for the representation of all kinds of literary and linguistic texts, and is used in a large number of text-encoding projects worldwide. TEI files are designed for long-term preservation of the material. The text files are also available here in other formats, which may be of more practical use for particular purposes. However, these files are all derived, in one way or another, from the original XML files. The XML files contain all the information that was encoded about the texts; in the other files, information not relevant to that particular use may have been removed.

Markup schemes

A markup scheme is necessary because, in transcribing a text, we usually want to record information beyond simply what words it contains. For instance, there is information about the structure of the text: where the page and line breaks are, or what chapter divisions it contains. There may be annotations within the text, perhaps scribal corrections or marginal glosses, which need to be recorded as such, along with information about who is or was responsible for them. As editors, we will want to annotate the text in various ways, noting, for instance, places where we believe there to be an error on the part of the scribe or printer, and the nature of that error. For the purposes of research, we may also want to annotate the text in other ways. A markup language allows all of these things to be done, and makes it clear precisely what information has been added and who added it. All of this information is recorded in a standardised format which can be processed reliably by a computer.

The Text Encoding Initiative

TEI is a standard and very flexible markup language, and offers conventions for encoding a large number of possible features. Which of those features are used depends on the needs of the individual user, and not all features of TEI are used in the current project. TEI is independent of any particular piece of software or computer operating system. This means that we can be fairly sure that the corpus files will remain usable in the future.

XML-TEI markup encodes structure rather than form (the way the text is displayed). For instance, two words in italics in a text will be marked up differently depending on the reason for the italics. A piece of text will be marked up as forming a paragraph, but whether the beginning of the paragraph is indented or the size of the indentation does not necessarily have to be recorded. As a consequence, an XML-TEI file will give information about the structure of the text, but need not give information about how to display it. Although it is possible to include such information within an XML-TEI file, we have generally chosen not to include such markup. That is, we have not recorded such details as the font used, the font size or the colour of ink used.

XML-TEI uses a set of predefined elements for encoding various textual features. Most of these elements come in pairs of tags. Scribal additions, for instance, are marked up with the element add:

Here the scribe added <add>an</add> indefinite article.

As in the example, all tags are enclosed in angle brackets. Every element that contains text needs an end tag, which is marked by a forward slash after the opening bracket, as in </add>.

Elements can have attributes, which provide additional information in their values. The place of the addition could be documented, for instance, by adding an attribute place and an attribute value supralinear to the opening tag <add>:

Here the scribe added <add place="supralinear">an</add> indefinite article above the line.
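
To show how software sees this markup, here is a minimal sketch in Python, using only the standard library. It is an illustration rather than part of the corpus tooling, and the <p> element wrapped around the sentence is an assumption made so that the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

# The example sentence, wrapped in a <p> element (an assumption of this
# sketch) so that it can be parsed as a complete XML document.
SAMPLE = (
    '<p>Here the scribe added <add place="supralinear">an</add> '
    'indefinite article above the line.</p>'
)

root = ET.fromstring(SAMPLE)
addition = root.find("add")        # the first <add> element inside <p>

element_name = addition.tag        # name of the element
place = addition.get("place")      # value of the attribute "place"
added_text = addition.text         # text between the start and end tags

print(element_name, place, added_text)
```

The element name, the attribute value and the enclosed text are each available separately, which is what allows later processing to treat scribal additions differently from the surrounding text.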

Some elements are empty, that is, they mark points in a text, rather than stretches of text, and therefore cannot contain text content. One such empty element is the one for encoding line-breaks, lb. These empty elements have the format <element/>; thus a line-break is encoded as <lb/>.

A selection of the commonest elements used in the corpus is given below:

<pb n="10"/>
    a page break, with the page number specified
<lb />
    a line break, number not specified
<add place="supralinear" resp="scribe">addition</add>
    an addition (in this case above the line, by the scribe)
<corr resp="transcr" sic="tetun">testun</corr>
    a correction (in this case from "tetun" to "testun", made by the transcriber/editor)
<del resp="scribe">deletion</del>
    a deletion, in this case made by the scribe
<orig reg="ysgrifennu">sgwennu</orig>
    a regularisation: the text has "sgwennu", but for some purposes we may need to know that it is a form of the more standard "ysgrifennu"
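
Because responsibility is recorded in the resp attribute, a program can list every intervention together with who made it. The following Python sketch (standard library only) illustrates this; the sample fragment is made up for the purpose from the elements listed above and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# A made-up fragment combining several of the elements listed above.
SAMPLE = (
    '<p><pb n="10"/>first '
    '<add place="supralinear" resp="scribe">addition</add><lb/>'
    '<corr resp="transcr" sic="tetun">testun</corr> '
    '<del resp="scribe">deletion</del></p>'
)

root = ET.fromstring(SAMPLE)

# Collect (element name, responsible party, text) for every element
# that carries a resp attribute, in document order.
interventions = [
    (elem.tag, elem.get("resp"), elem.text)
    for elem in root.iter()
    if elem.get("resp") is not None
]

for name, resp, text in interventions:
    print(name, resp, text)
```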

Stylesheets and secondary files

The XML-TEI file does not give detailed instructions about how much of this information is to be displayed, how it should be displayed, or what other use it may be put to. This means that the XML-TEI file of a text is not really meant to be used by the end user. For the researcher, the more useful files are secondary text files, generated from the TEI master file. These will contain the text in some other, more useful format according to the needs of the user. An example would be an HTML file which allows the text to be viewed in a web browser in some appropriate format, for instance, with abbreviations written out in full inside curly brackets or in italics or in whatever format is required.

These secondary files are generated by applying XSLT (Extensible Stylesheet Language Transformations) stylesheets to the XML-TEI master file to create other files that can be displayed directly in a web browser or processed by concordance software.

In this manner it is possible to generate various different output files by applying different XSLT stylesheets to one and the same master file, for instance, a diplomatic text file, an edited text file, and a text file meant to be used with a concordance program. The format of the output file is usually HTML (to be viewed in a web browser), but other formats (plain text, PDF, and XML) are also possible. Most of what you can see of the corpus has been generated in this way (although it may also have been processed further).
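
The principle of deriving several outputs from one master file can be sketched in a few lines. The corpus itself uses XSLT stylesheets for this; the Python below (standard library only) is merely a stand-in to illustrate the idea. The sample sentence is made up, and the assumption that an edited text would show corrected and regularised forms is ours for the purpose of the example.

```python
import xml.etree.ElementTree as ET

# A made-up master fragment with one correction and one regularisation.
SAMPLE = (
    '<p>the <corr resp="transcr" sic="tetun">testun</corr> and '
    '<orig reg="ysgrifennu">sgwennu</orig></p>'
)

def reading(elem, edited):
    """Return the text of elem as a diplomatic or an edited reading."""
    if elem.tag == "corr" and not edited:
        return elem.get("sic", "")     # diplomatic: what the source has
    if elem.tag == "orig" and edited:
        return elem.get("reg", "")     # edited: the regularised form
    # Otherwise: the element's own text, then each child and its tail.
    return (elem.text or "") + "".join(
        reading(child, edited) + (child.tail or "") for child in elem
    )

root = ET.fromstring(SAMPLE)
diplomatic = reading(root, edited=False)
edited_text = reading(root, edited=True)

print(diplomatic)
print(edited_text)
```

Both readings come from the same encoded source; only the rule applied to each element differs.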

How textual features such as scribal additions or line-breaks are to be displayed is specified in an XSLT stylesheet. Stylesheets are separate text files which contain instructions about what the stylesheet processor is to do with TEI elements when it encounters them. In order to be executed, a stylesheet has to be linked to a TEI file (that is, there is a reference to the stylesheet at the beginning of the file). It takes only a matter of seconds to exchange one stylesheet for another and in this way generate a different display. Thus, according to need or preference, the scribal addition above may be displayed in any of these ways (or others):

Here the scribe added \an/ indefinite article above the line.
Here the scribe added an indefinite article above the line.
Here the scribe added an[+ note] indefinite article above the line.
Here the scribe added an indefinite article above the line. [that is, unmarked]
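
The effect of exchanging one set of display rules for another can be imitated with a short Python sketch (standard library only). This is not XSLT and not the stylesheets used for the corpus; the style names and the <p> wrapper are hypothetical, chosen only to reproduce some of the displays shown above.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<p>Here the scribe added <add place="supralinear">an</add> '
    'indefinite article above the line.</p>'
)

# Each style (names are hypothetical) decides how an <add> is shown.
STYLES = {
    "slashes": lambda text: "\\" + text + "/",
    "note": lambda text: text + "[+ note]",
    "unmarked": lambda text: text,
}

def render(elem, style):
    """Flatten elem to plain text, formatting <add> with the chosen style."""
    inner = (elem.text or "") + "".join(
        render(child, style) + (child.tail or "") for child in elem
    )
    return STYLES[style](inner) if elem.tag == "add" else inner

root = ET.fromstring(SAMPLE)
for name in STYLES:
    print(render(root, name))
```

The encoded text is never changed; choosing a different entry in STYLES plays the role of linking a different stylesheet.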

Line-breaks can be reproduced as they are in the original, line-numbers can be inserted (by page or by text) if desired, or the line-breaks can be suppressed, thus generating a running text like this one. These options have been used to display the corpus in different ways. XML-TEI in conjunction with stylesheet transformations is thus a very flexible tool and allows one and the same text to be edited in different formats with relatively little effort.
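
The three treatments of line-breaks can be illustrated as follows. Again this is a Python sketch (standard library only) standing in for an XSLT stylesheet. The sample lines are made up, and replacing a suppressed <lb/> with a single space is a simplifying assumption: words divided across a line-break would need more careful handling.

```python
import xml.etree.ElementTree as ET

# Made-up sample: two lines of text separated by an empty <lb/> element.
SAMPLE = "<p>the first line of text<lb/>the second line of text</p>"

def flatten(elem, keep_line_breaks):
    """Return the text of elem, reproducing or suppressing <lb/>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "lb":
            parts.append("\n" if keep_line_breaks else " ")
        else:
            parts.append(flatten(child, keep_line_breaks))
        parts.append(child.tail or "")   # text following the child
    return "".join(parts)

root = ET.fromstring(SAMPLE)
as_in_original = flatten(root, keep_line_breaks=True)
running_text = flatten(root, keep_line_breaks=False)

# Line numbers are added on output; they are not stored in the markup.
numbered = "\n".join(
    f"{n} {line}"
    for n, line in enumerate(as_in_original.split("\n"), start=1)
)

print(numbered)
print(running_text)
```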

© University of Cambridge 2004