Corpws Hanesyddol yr Iaith Gymraeg 1500-1850 | Historical Corpus of the Welsh Language 1500-1850

Encoding principles

1 Textual structure

1.1 General textual structure

The text itself is enclosed within a text-element. All texts are marked with a unique identifying abbreviation, usually derived from the title, encoded as an id-attribute on the text-element. For instance, Drych y Prif Oesoedd opens with the following identifying tag:

and ends with a corresponding closing text-tag (</text>).

Within the text-element, TEI allows front material (title pages, prefaces etc.) to be represented within a front-element. Within the corpus, front-elements are sometimes used for front material from the original source (e.g. the title page of the original), but are not used for editorial front material, which is placed instead within the TEI-header. The one exception to this is editorially supplied cast lists in plays, which are placed within a front-element (see below).

The main part of the text is enclosed within a body-element. Within the body element, the structure of the text is marked up, typically divided into paragraphs (p-elements) or more neutral divisions (div-elements). Other types of textual divisions are possible, the most important being:

headings: <head> ... </head>
(original) summaries of the content of a passage: <argument> ... </argument>

The argument-element may bear a rend-attribute, indicating the visual rendition of the passage in question. The values "small" and "italic" have been used.

For poetry and drama, the <lg>...</lg> and <l>...</l> tags are used to delimit line groups (stanzas and similar divisions) and lines respectively. For example:

<lg> <l>Dymuno rwy yrwan y truan di drai,</l> <l>Gael fy nwy glocsen wr llawen beth llai:</l> <l>Ni waeth geni wernen na derwen gwir yw,</l> <l>Mae honno 'n ysgafnach a llonnach ei lliw.</l> </lg>

Where an entire stanza is indented, the lg-element may have a rend-attribute (for instance <lg rend="indented">).

In prose texts, lines are not delimited using the <l>...</l> tag, but rather divisions between lines are marked using the <lb/> tag. The <lb/> tag precedes each new line. Note that this includes the first line of any text or piece of text. Prose within poetic texts is generally marked up as in prose texts, using the <lb/> element as the basic means of delimiting lines.

In all texts, page breaks are marked using the <pb/> tag. The page number is recorded as the n-attribute of the tag. For instance, the start of page 10 is represented as <pb n="10"/>.

1.2 Texts within texts

In many cases, there are 'texts within texts'. For instance, one file may represent a selection of ballads from a number of different printed sources, or one file may include a number of stories or poems taken from a single source, where each story or poem is, in some sense, a text in its own right. In these cases, the following markup is used. The entire text is surrounded by the <text> ... </text> element. Within this the subtexts are grouped together using the <group> ... </group> element, and then each subtext is itself surrounded by another <text> ... </text> element. The opening text-tag of the entire text has an id-attribute which gives a standardised abbreviation for the text. For instance the text of ‘Llyfr Gwyn Mechell’ is surrounded by the tag <text id="LlGM">...</text>, where "LlGM" is the standardised abbreviation for this text. Where a text contains subtexts, the opening text-tag of each subtext also has an id-attribute giving some form of standardised abbreviation, usually the standardised abbreviation for the whole text plus some form of identification for the individual subtext. For instance, poem no. 2 of ‘Llyfr Gwyn Mechell’ is surrounded by the tag <text id="LlGM02">...</text>. Ballads are always identified using abbreviations derived from A bibliography of Welsh ballads.
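
The nesting described above can be checked mechanically. The following is a minimal sketch in Python (standard library only), not part of the project's own tooling. The ids "LlGM" and "LlGM02" are taken from the description above; "LlGM01", the empty bodies and the function name subtext_ids are placeholders introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Skeleton only: "LlGM" and "LlGM02" come from the description above;
# "LlGM01" and the empty paragraphs are placeholders for illustration.
SAMPLE = """<text id="LlGM">
  <group>
    <text id="LlGM01"><body><p/></body></text>
    <text id="LlGM02"><body><p/></body></text>
  </group>
</text>"""


def subtext_ids(outer_text):
    """Return the id of every subtext grouped directly inside a text-element."""
    return [sub.get("id") for sub in outer_text.findall("./group/text")]


outer = ET.fromstring(SAMPLE)
print(outer.get("id"), subtext_ids(outer))
```

A text without a group-element simply yields an empty list, so the same function can be run over every file in a collection.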

1.3 Line and page breaks

Individual words frequently cross line or page boundaries. This presents a problem, since it is necessary both to record the fact that a word is broken across two lines or pages, and to preserve the word as a single unit so that it is accessible and can be searched for successfully by computer software. Consider the following text:

hon oedd y daith hyfrydaf o'r holl ffordd y daeth-
um i Utica, 234 o filldiroedd mewn 4 diwrnod.

This is represented in the XML-file as:

<lb/>hon oedd y daith hyfrydaf o'r holl ffordd y <w>daeth<lb rend="shy"/>um</w> i Utica, 234 o filldiroedd mewn 4 diwrnod.

Here the word daethum is split between two lines. The split is represented by a rendition attribute (rend) of the line break tag (<lb rend="shy"/>). The rend-attribute has the value "shy" to indicate that the split was represented in the original text by a (soft) hyphen. The hyphen itself is not reproduced in the body of the transcription. The word that is split across two lines is surrounded by the word tag <w>...</w>.

In the output file for use by Concordance, words split across lines are represented in the following way:

$L 6% hon oedd y daith hyfrydaf o'r holl ffordd y $L 6-7% daethum
$L 7% i Utica, 234 o filldiroedd mewn 4 diwrnod.

Here, a separate line number tag is produced for the divided word. In any search or concordance, it will be (correctly) listed as being on lines 6-7. The exact position of the break is not recorded in the Concordance file, but can be recovered from the diplomatic edition or from the XML-file.
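
As an illustration of how the line range of a divided word can be derived from the markup, here is a minimal Python sketch (standard library only). It is not the conversion actually used to produce the Concordance files; the function name split_words, the first_line parameter and the wrapping <p> element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def split_words(root, first_line=1):
    """List (word, first line, last line) for every <w> element under root."""
    line = first_line - 1          # becomes first_line at the first <lb/>
    found = []
    for elem in root.iter():       # document order: <w> is visited before its children
        if elem.tag == "lb":
            line += 1
        elif elem.tag == "w":
            inner_breaks = len(list(elem.iter("lb")))
            word = "".join(elem.itertext())   # text on both sides of the break
            found.append((word, line, line + inner_breaks))
    return found


# The example line from above, wrapped in a <p> so that it parses on its own.
SAMPLE = ('<p><lb/>hon oedd y daith hyfrydaf o\'r holl ffordd y '
          '<w>daeth<lb rend="shy"/>um</w> i Utica, 234 o filldiroedd mewn 4 diwrnod.</p>')

print(split_words(ET.fromstring(SAMPLE), first_line=6))
```

With first_line=6 the sketch reports daethum as spanning lines 6 to 7, matching the $L 6-7% tag shown above. Words split across a page break would need the page number tracked in the same way.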

The same principles apply where a word crosses a page boundary. In the following example, the page break tag (<pb/>), in addition to its usual n-attribute giving the page number, has a rend-attribute to represent the hyphen. The split word is again marked out as a word by the <w>...</w> tag:

Maent yn credu, yn ol athrawiaeth Zoroaster, i'r
byd gael ei greu mewn chwech o dymhorau, pob un
yn cynnwys hyn a hyn o ddyddiau; sef, yn y 45 di-
-- PAGE BREAK --
wrnod cyntaf i Dduw greu'r nefoedd;

This is represented in the XML-file as:

<lb/>Maent yn credu, yn ol athrawiaeth Zoroaster, i'r <lb/>byd gael ei greu mewn chwech o dymhorau, pob un <lb/>yn cynnwys hyn a hyn o ddyddiau; sef, yn y 45 <w>di<pb rend="shy" n="35"/><lb/>wrnod</w> cyntaf i Dduw greu'r nefoedd;

2 Representation of text

2.1 Fonts and highlighting

Where text is highlighted, it is marked using the hi-element. The form that the highlighting takes is recorded using the rend-attribute. For instance, the following indicates that the initial <H> of the line was rubricated (rendered as a large capital):

<l><hi rend="rubric">H</hi>YD attoch benillion yn gyson i gyd,</l>

The values for the rend-attribute found in the corpus are: "italic", "normal" (e.g. for unitalicised text within text that is otherwise italicised), "rubric", "smallcaps", "underlined".

2.2 Diacritics / Special characters

Although it is basically possible to use characters with diacritics in the XML file itself provided a suitable character encoding is used, problems in the further processing of the file can hardly ever be avoided completely. For this reason and because of the greater flexibility they provide, we have used character entities for all characters with diacritics. Depending on the purpose and the output format, these can be displayed either as Unicode characters or as any other kind of character string, as desired. For instance, if it is necessary to use the output text with a concordance program that cannot deal with accented characters, â can easily be replaced with a+ and so on.

A number of early modern texts used dotted characters where Modern Welsh has digraphs, for instance, d with subscript dot for <dd>. These have been encoded as character entities, for instance, as &ddot; in this case. At this stage it is difficult to display these characters reliably in a word processor or in a web browser in an HTML-text, and the character entities will always need to be replaced in output files by their Modern Welsh equivalents. The distinction between the two characters is, however, recorded fully in the XML-files, and can be recovered if necessary.

The following customised character entities are used in the corpus:

Character entity	Description	Replacement in files used for display	Example texts
&cd;	lower-case <c> with subscript dot	ch	Found in TCh.
&Cd;	upper-case <C> with subscript dot	Ch	Found in TCh.
&dd;	lower-case <d> with subscript dot	dd	Found in CHSM, RhY etc.
&Dd;	upper-case <D> with subscript dot	Dd	Found in CHSM, TCh etc.
&ld;	lower-case <l> with subscript dot	ll	Found in CHSM, RhY etc.
&Ld;	upper-case <L> with subscript dot	Ll	Found in CHSM.
&lhook;	lower-case <l> with a following hook	ll	Found in TN.
&td;	lower-case <t> with subscript dot	th	Found in TCh.
&Td;	upper-case <T> with subscript dot	Th	Found in TCh.
&ud;	lower-case <u> with subscript dot	w	Found in CHSM, RhY etc.
&vd;	lower-case <v> with subscript dot	w	Found in RhY.
&Vd;	upper-case <V> with subscript dot	W	Found in CHSM.
&vv;	lower-case double <v>	w	Found in TN.
&yr;	<r> surrounded by dots	yr	Found in NLW 2722E.
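
The replacement step for display files can be sketched as follows. This is an illustrative Python fragment (standard library only), not the project's stylesheet: the mapping is copied from the table above, while the names DISPLAY and replace_entities and the test strings are invented for the example. It works on the raw file contents, since an XML parser would reject these entities unless they are declared.

```python
import re

# Entity name -> replacement for display, copied from the table above.
DISPLAY = {
    "cd": "ch", "Cd": "Ch", "dd": "dd", "Dd": "Dd",
    "ld": "ll", "Ld": "Ll", "lhook": "ll",
    "td": "th", "Td": "Th", "ud": "w",
    "vd": "w", "Vd": "W", "vv": "w", "yr": "yr",
}

ENTITY = re.compile(r"&([A-Za-z]+);")


def replace_entities(raw):
    """Replace the customised entities; leave any other entity untouched."""
    return ENTITY.sub(lambda m: DISPLAY.get(m.group(1), m.group(0)), raw)


# Invented strings, for illustration only.
print(replace_entities("&ld;yfr"))
print(replace_entities("&Td;omas &amp; &vv;rth"))
```

Keeping the mapping in one table means that a different output (for instance one that preserves the dotted characters) only requires a second dictionary.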

There were one or two other circumstances where it seemed advisable to use character entities. In the text of the Welsh New Testament of 1567, highlighted text is printed in Roman letters, normal text is in Gothic letters. Probably due to a shortage of <w>s in the printer's Roman typeface, in the passages in Roman letters <vv> ("double v") is almost always used instead of <w>. The use of a character entity here <&vv;> allowed us to record the original spelling, but it also means that <vv> can be replaced with <w> in output files intended for concordance software. Basically the same applies to <l'> for modern Welsh <ll> and <&edh;> for modern Welsh <dd> in both this translation of the New Testament and in the 1588 translation of the Bible.

2.3 Abbreviations

Abbreviations are almost always expanded in the corpus, with the expansions always being marked up with the expan-element. For example:

pa<expan>n</expan>n

The first <n> is the result of (editorial) expansion.

As for the exact extent of an abbreviation, "abbreviation" is taken to mean only those letters that are not spelled out in full. If, for instance, the text has an <e> with a stroke above it for <en>, only the <n> is taken to be abbreviated and the markup will look like this:

e<expan>n</expan>

NOT

<expan>en</expan>

The place of the abbreviation is usually obvious except where an abbreviated <n> or <m> is adjacent to an <n> or <m> that is written out in full. In these cases, for the sake of consistency, the abbreviation is always considered to precede the letter, even where it could be argued to follow. For instance, if hynny is abbreviated to hyñy, this will be encoded as:

hy<expan>n</expan>ny

NOT as

hyn<expan>n</expan>y

The exact position of the nasal stroke is often hard to decide. Its position may be due more to scribal haste than to scribal intention, and is ultimately of little importance.

The nature of the original abbreviation is not documented, that is, the expan-element never has an abbr-attribute.

Abbreviations, apart from a few very common types, are not widely used in Welsh manuscripts and early printed books, and their exact expansions are hardly ever ambiguous.

A small number of abbreviations have not been expanded, notably "Mr.", and these have not been marked as abbreviations either.

In the few cases where the expansion is doubtful, usually cases where the exact spelling of a name is uncertain, the abbreviation is left as is, and marked up with the element abbr. Such cases are rare.

2.4 Damaged or unclear text

Damage to manuscripts and printed texts is indicated using the <damage> tag. The nature of the damage may be indicated (but is not necessarily indicated) using the type-attribute. The two values normally used are "worn" and "torn", hence <damage type="worn"> and <damage type="torn">. Where the text of the damaged area cannot be reconstructed easily, the missing text is represented using the character element <c/>. An attempt has been made to indicate the approximate size of the damaged area: the number of <c/> elements is intended to correspond roughly to the number of characters present in the damaged area. This, however, can only be a very rough approximation. For instance, in the following example, the middle part of the line is worn and is difficult to read; in the first part of the damaged area, the word twullo can be made out, but is unclear; the second part of the damaged area contains space for approximately seven characters:

<l>fod i dwu chwaer hyna 'n <damage type="worn"> <unclear>twullo</unclear> <c/><c/><c/><c/><c/><c/><c/></damage> i <unclear>th</unclear>ad</l>

Unclear text is marked using the <unclear> tag. The main reading is given as the text inside the tag. Occasionally, an alternative reading may be given in the alt-attribute of the unclear-tag.
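
The information carried by this markup can be summarised programmatically. The following Python sketch (standard library only) is one possible way of doing so; the function name damage_report and the shape of its result are assumptions, and the sample is the line quoted above.

```python
import xml.etree.ElementTree as ET

# The example line from above.
LINE = ('<l>fod i dwu chwaer hyna \'n <damage type="worn"> <unclear>twullo</unclear> '
        '<c/><c/><c/><c/><c/><c/><c/></damage> i <unclear>th</unclear>ad</l>')


def damage_report(elem):
    """Summarise every damaged area found under elem."""
    report = []
    for dmg in elem.iter("damage"):
        report.append({
            "type": dmg.get("type"),                  # None if not recorded
            "lost_chars": len(list(dmg.iter("c"))),   # rough size of the gap
            "unclear": [u.text for u in dmg.iter("unclear")],
        })
    return report


print(damage_report(ET.fromstring(LINE)))
```

Only unclear readings inside the damaged area are listed, so the later <unclear>th</unclear> is not included in the report.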

2.5 Catchwords

Where transcribed, catchwords are marked using the fw-element, and a type-attribute with the value "catchword". For instance, the following encodes the catchword hwmyr within an fw-element, which is repeated, in slightly different form, as HWmyr at the start of the next page (in this case, page 15):

<fw type="catchword">hwmyr</fw>
<pb n="15"/>
<lb/>HWmyr

Catchwords are not normally transcribed unless the catchword(s) on the first page differs in form in some way from the first word(s) of the following page.

2.6 Regularisation of spaces, apostrophes and capitalisation

Most medieval and early modern Welsh manuscripts differ from modern usage in the division of words, in the (consistent) use of apostrophes and in capitalisation and non-capitalization, and some kind of regularization, either tacit, or explicit in the markup, seems in order.

Due to the diverse character of the texts (prints vs. manuscripts, early texts vs. near-modern ones), it is not easy to be consistent. The corpus is, in fact, not entirely consistent, partly because of the diversity of the texts, and partly because of the immense time and effort that would be needed to implement any solution that does not involve silent regularisation.

Thus, in some of our transcriptions we have chosen an overt solution involving explicit markup of whitespace regularisation and insertion of apostrophes (not, however, capitalisation, although this - and much else in this area - would have been technically possible).

Regarding word separation, we have taken inspiration from the [Aberystwyth project], which in their transcriptions used markup characters for 'space-to-be-inserted' and 'space-to-be-deleted'. A similar approach within the TEI markup scheme seemed advisable; an approach that made use of a 'markup character' instead of using the more time-consuming solution of using the ORIG and/or REG elements, which would have been possible but would treat each case separately (along the lines of <orig reg="y mae">ymae</orig>) when in fact dealing with a frequent and recurring phenomenon.

The use of character entities for 'space-to-be-inserted' and 'space-to-be-deleted' is not viable since a character entity cannot be replaced with nothing, which would have been necessary to do where whitespace in the original text was to be deleted.

Therefore, two new (empty) elements have been introduced for these two common cases of regularisation: <isp/> ("Insert SPace") and <dsp/> (for "Delete SPace"):

y<isp/>mae (MS: ymae; regularisation: y mae)

o<dsp/>hono (MS: o hono; regularization: ohono)

This allows spaces to be inserted or deleted via the XSLT stylesheet. In a diplomatic display <dsp/> will be replaced by a space and <isp/> will simply be ignored; and in a stylesheet generating a regularized display the two elements will be treated inversely.

A similar solution is available for the insertion of apostrophes: <apo/> is a new element which makes their insertion possible where desired:

a<apo/>r (MS: ar; regularization a'r or a 'r)
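
The two treatments of these elements can be sketched in a few lines of Python operating on the raw markup. This is an illustration of the logic only, not the project's XSLT stylesheet; the function names diplomatic and regularised are invented for the example, and <apo/> is rendered here as a plain apostrophe without a preceding space.

```python
def diplomatic(markup):
    """Follow the source: keep deleted spaces, ignore inserted ones."""
    return (markup.replace("<dsp/>", " ")
                  .replace("<isp/>", "")
                  .replace("<apo/>", ""))


def regularised(markup):
    """Follow modern usage: insert spaces and apostrophes, drop deleted spaces."""
    return (markup.replace("<isp/>", " ")
                  .replace("<apo/>", "'")
                  .replace("<dsp/>", ""))


# The three examples from above.
for sample in ("y<isp/>mae", "o<dsp/>hono", "a<apo/>r"):
    print(sample, "|", diplomatic(sample), "|", regularised(sample))
```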

Not all our transcriptions use these elements yet, mostly due to a lack of time to insert them. (Insertion during transcription was too much of a distraction and overall led to too many mistakes.) Some texts were transcribed before we created these elements and have not yet been updated.

As for those (manuscript) texts that do make use of the ISP and DSP elements, it should be understood that the decision whether there is a space in the original or not is often hard to make. This problem only infrequently arises with prints.

While certain types of word-separation (or non-separation) are very common throughout our texts and basically standard, a few texts handle this issue in what could almost be called a bizarre way, tearing entire words apart. This mostly applies to verse texts in manuscripts, where spaces often seem to be connected, in some vague way, with stress: a space often, but inconsistently, appears before or after the stressed syllable. It seemed pointless for our purposes (although presumably of interest in a different context) to follow the manuscripts here. This type of word separation has been tacitly regularized.

Capitalisation in manuscripts poses similar problems. Not all capital letters can easily be distinguished from their lower-case counterparts (<c> vs <C> and others). Regularisation with some new (non-empty) element <cap>...</cap> or so would be possible. We have either tried to preserve the original spelling completely without intervention (most manuscripts and all prints), or completely regularized it (a very few of the manuscript texts, especially our earliest transcriptions).

What applies to capitalisation also applies to punctuation. Generally the original is followed; only in a few of our earliest transcriptions has punctuation been supplied. Punctuation in prints is generally preserved (or, occasionally, corrected where it is obviously faulty).

As these issues show, XML-TEI is a very powerful tool, capable of holding several versions of the same text in parallel, or on top of each other, in the same file. However, only what was put in can be got out, and it should be noted that a heavily marked up text file can become increasingly illegible. (The latter point, however, can largely be mitigated by using XSLT stylesheets to generate various proof version output files of the masterfile.)

3 Emendations to the text

3.1 Corrections

Two types of corrections are marked with the CORR element. The first is where the correction was made by the scribe or a contemporary, in cases where the faulty original text and its correction could easily be related to each other. The second concerns editorial emendations, where the original scribe has clearly made an error.

TEI does not use different elements for these two types of correction, as opposed to additions and deletions, where such a distinction is made. To record the crucial difference between the two types of correction, we have always distinguished scribal corrections from editorial emendations by the value in the element's RESP attribute, which has either the value scribe or transcr (for 'transcriber') and is always present. The original reading that is being corrected is given in the element's SIC attribute. Examples:

<corr resp="transcr" sic="chwre">chware</corr>	Here the faulty manuscript reading chwre was corrected to chware by the transcriber.
<corr resp="scribe" sic="mid">mi</corr>	Here the scribe corrected the text, correcting mid to mi.

For technical reasons the corr-element always surrounds the whole word that contains the error, and not the error alone. Some precision is lost this way, but this markup makes further processing of the text easier. In particular, it allows concordance software to search accurately for either the original reading or the correction. It is not normally difficult to identify the actual mistake by comparing the faulty reading with the correction.
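
Because the corr-element wraps the whole word, either reading of a line can be produced with very little code. The following Python sketch (standard library only) shows the idea; the function name reading is invented, the sample simply places the two corr-elements quoted above side by side in one <l>, and only corr-elements that are direct children of the line are handled.

```python
import xml.etree.ElementTree as ET

# The two corr-elements from the examples above, placed in one line
# purely for illustration.
LINE = ('<l><corr resp="transcr" sic="chwre">chware</corr> '
        '<corr resp="scribe" sic="mid">mi</corr></l>')


def reading(line, original=False):
    """Return the corrected text of a line, or the uncorrected text if original=True."""
    parts = [line.text or ""]
    for child in line:
        if child.tag == "corr" and original:
            parts.append(child.get("sic", ""))        # reading before correction
        else:
            parts.append("".join(child.itertext()))   # reading after correction
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(LINE)
print(reading(line))
print(reading(line, original=True))
```

A filter on the resp-attribute could be added in the same place if only scribal or only editorial corrections are to be undone.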

Occasionally it is necessary to add an editorial correction to a scribal correction. In this case a word may be enclosed in two pairs of CORR tags. For example:

Insert example here.

The sic-element is used to confirm a reading that might otherwise be questioned as an error. For example:

Insert example here.

For errors involving line- and pagebreaks see below.

3.2 Scribal deletions and insertions

Where the relation between a scribal deletion and a correction or addition cannot easily be established, separate <del hand="scribe"> and <add hand="scribe"> tags are used. Scribal insertions and deletions are also encoded in this way.

The place of a scribal insertion is documented in the add-element's place attribute. The following values may occur:

supralinear
previous-line
left-margin
interlinear
inline
above-line
margin
right-margin
unspecified

[Check this list for completeness.]

No resp-attribute (stating who is responsible for identifying the scribal intervention) is provided, as it can be assumed that this is the transcriber. The manner of deletion (crossed out, underdotted, erased etc.) is not noted, unless it appears to be of special interest. The manner of insertion is not normally noted, but may be contained in the rend-attribute (for instance, <add hand="scribe" place="previous-line" rend="brace"> indicates an insertion on the previous line made using a brace).

TEI makes available the tags DELSPAN and ADDSPAN for deletions and additions of larger amounts of text. These tags are not widely used in the corpus, because they make further processing with an XSLT-stylesheet difficult. In general, instead, where, for instance, a deletion starts inside one verse line and continues into the next, and a single DEL element could not be used because it would have violated the proper-nesting rule, two unconnected DEL elements are used. For example:

[Insert example here.]

However, in a few cases where there is a large addition, the addSpan-element has been used. The start of the addition is marked with a complete addSpan-element (including closing tag), with the person responsible for the addition marked with a hand-attribute; a to-attribute gives an arbitrary reference that matches the id-attribute of an anchor-element located at the end of the addition. For instance, the following indicates that two whole lines have been added to the text by "hd2":

<addSpan hand="hd2" place="right-margin" to="addstanzahd2"/> <l>Mae un arall or chwaryddion</l> <l>yn cymryd henw aer y goron</l> <anchor id="addstanzahd2"/>
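
Since the addition is delimited by two empty elements rather than enclosed in one, software has to collect everything between them. A minimal Python sketch of this (standard library only) is given below. It assumes that the addSpan-element and its anchor are siblings within the same parent; the wrapping <lg> and the function name added_elements are introduced for the example.

```python
import xml.etree.ElementTree as ET

# The example from above, wrapped in an <lg> so that it parses on its own.
SAMPLE = ('<lg><addSpan hand="hd2" place="right-margin" to="addstanzahd2"/> '
          '<l>Mae un arall or chwaryddion</l> <l>yn cymryd henw aer y goron</l> '
          '<anchor id="addstanzahd2"/></lg>')


def added_elements(parent, add_span):
    """Return the sibling elements lying between add_span and its anchor."""
    inside, added = False, []
    for child in parent:
        if child is add_span:
            inside = True
        elif inside:
            if child.tag == "anchor" and child.get("id") == add_span.get("to"):
                return added
            added.append(child)
    raise ValueError("no matching anchor among the siblings of addSpan")


lg = ET.fromstring(SAMPLE)
span = lg.find("addSpan")
print(span.get("hand"), [l.text for l in added_elements(lg, span)])
```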

The corr-element is also used where text is to be omitted, for instance because of an inadvertent repetition of words. For instance, the following indicates that the text originally repeated the word y, and that the scribe corrected this error himself:

TEI does not permit illegible deletions to be encoded using the del-element. Instead, in cases where the original text of a deletion is illegible, the gap-element is used, with the reason-attribute "deleted" (that is, <gap reason="deleted"/>).

Punctuation is corrected only in the case of printed texts. In those cases where punctuation is corrected, the correction includes the preceding word. For example:

3.3 Editorial additions

Where text has obviously been omitted in the source text, the word or words to be supplied are marked with the supplied-element. However, where only single letters are missing, the text is annotated as a correction, using a CORR tag as described above. The latter case is in fact far more frequent than the former.

The reason why it has been necessary to supply text may be given as a reason-attribute to the supplied-element. In the following example, the character <d> has been supplied because it is not visible, although it was perhaps once present, in the manuscript:

by<supplied reason="invisible">d</supplied>

In addition to "invisible", the reason-attribute may also have the values "missing" or "damaged".

4 Markup of drama

4.1 List of characters

A list of characters is provided editorially for each work of drama, and is placed inside the front material (<front> ...</front>). It takes the following format:

<div type="supplied-castlist"> <p> <castList> <castItem type="role"> <role id="Ff">Ffwl</role> </castItem> ...repeat for other characters... </castList> </p> </div>

Note that each character is identified in the cast list by an abbreviation which is used to identify the speech of that character in the main part of the text.

4.2 Speakers

The speeches of different characters in drama are delimited by <sp> tags, with the identity of the speaker marked in the who-attribute (e.g. <sp who="Ff">...</sp>). Speakers are indicated in the who-attribute using codes corresponding to those identified in the list of characters. Where the name of the speaker is also given in the text of the source, the text is marked using the <speaker> tag. For instance, the following markup marks off speech by Morgan (abbreviated as Mg); the speech is headed by the word <morgan> in the manuscript:

<sp who="Mg"> <speaker><lb/>morgan</speaker> ... </sp>

In the following example, the stage direction indicates a new speaker. The new speaker is marked in the <sp> tag, but is not indicated in the text itself, except in the stage direction:

<stage><lb/>Enter <name>morgan</name> diwc morganog</stage> <sp who="Mg"> ... </sp>
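
Because every who-attribute is expected to correspond to a role in the cast list, the correspondence can be verified automatically. The following Python sketch (standard library only) is one way of doing so and is not part of the project's tooling. The skeleton document is invented for illustration: the codes "Ff" and "Mg" appear above, "Xx" is a deliberately undeclared code, and the function name unknown_speakers is an assumption.

```python
import xml.etree.ElementTree as ET

# Invented skeleton: "Ff" and "Mg" are codes used above, "Xx" is a
# deliberately undeclared code to show what the check reports.
SAMPLE = """<text>
  <front><div type="supplied-castlist"><p><castList>
    <castItem type="role"><role id="Ff">Ffwl</role></castItem>
    <castItem type="role"><role id="Mg">Morgan</role></castItem>
  </castList></p></div></front>
  <body>
    <sp who="Mg"><speaker><lb/>morgan</speaker></sp>
    <sp who="Xx"/>
  </body>
</text>"""


def unknown_speakers(text_elem):
    """Return who-codes used on <sp> that have no matching <role id=...>."""
    roles = {role.get("id") for role in text_elem.iter("role")}
    used = {sp.get("who") for sp in text_elem.iter("sp") if sp.get("who")}
    return sorted(used - roles)


print(unknown_speakers(ET.fromstring(SAMPLE)))
```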

4.3 Stage directions

Stage directions are marked with the <stage> tag. Different types of stage directions (entrances, exits etc.) are not distinguished, and no use is made of type-attributes for stage directions, or of the <move> element for character movements. Stage directions are normally, and where possible, placed outside of characters' speeches, as delimited by the <sp>...</sp> tags.

5 Markup of authorial or scribal notes

Authorial or scribal notes should be marked up in the following way: the extent of the text to which the note applies is marked up with an lg-element or a seg-element, identified as the target of a note with the type-attribute "note-target" and with a unique id-attribute (e.g. n1, n2 etc.); the note itself appears somewhere else in the TEI-file (usually at the next convenient point), marked up as a note-element. The note-element should have a target-attribute matching the id-attribute of the text to which the note applies. The note-element should also be identified as authorial or scribal by the addition of a resp-attribute (either resp="author" or resp="scribe"). The position of the note should be marked with a place-attribute. Values used are right-margin and foot (left-margin should also be possible). For example:

<lg type="note-target" id="n1"> <l>er mwyn gwr eddigeyddys.</l> <l>Damuniad i phriod:</l> <l>vur holl gyfarfod.</l> <l>Vy meddwl i am kyngor:</l> <l>chwi a wyddoch Antenor.</l></lg> [...] <note resp="scribe" target="n1" place="right-margin">Atteb Antenor sydd <lbNoN/>yn kanlyn ar ol yn <lbNoN/>y :37: dolen kanys <lbNoN/>drwy ryw ddrwc <lbNoN/>ddamwain y gaded <lbNoN/>allan </note>

6 Technical issues

Preservation of endangered whitespace

Because of what appears to be a bug in one of the XSLT-parsers used by the project, there was a danger that significant spaces would be incorrectly deleted in a number of environments. After some experimentation, the easiest, and possibly the only, way to circumvent the problem seemed to be to replace the endangered spaces with a non-breaking space ( ) in the TEI-files. Non-breaking spaces have not been used for any other purpose; as soon as the bug is fixed, or if a different parser is used, these non-breaking spaces can safely be replaced with ordinary spaces.

The affected spaces occur in the following environments:

between elements, or to be precise, between a closing and an opening angular bracket: ...</tag>_<tag>... Such an environment may arise where, for instance, italicized text is followed by an unrelated correction. In principle, the space could have been included within either of the elements, but such a solution was felt to be too error-prone; also, the space does not really belong in either of the elements;
between an element and a character entity, and vice versa: ...</tag>_&CharEnt; and &CharEnt;_<tag>... The position between an ordinary semicolon and a tag is not affected.
between two character entities: &CharEnt;_&CharEnt;. Again, ordinary semicolons are unproblematic.

The unfortunate result of the use of non-breaking spaces in those texts where character entities are already ubiquitous is that it makes the XML-file even more difficult to read to human eyes than it already is.
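
The three environments listed above can be expressed as a single pattern. The following Python sketch (standard library only) illustrates the substitution and its reversal on the raw file contents. It is not the procedure actually used in the project, the names ENDANGERED, protect and unprotect are invented, and the sketch only considers a single ordinary space in each position.

```python
import re

NBSP = "\u00a0"

# A space is treated as endangered when it stands between a closing '>' or a
# character entity on the left and an opening '<' or a character entity on
# the right. An ordinary semicolon is not preceded by '&name', so it is not matched.
ENDANGERED = re.compile(r"(>|&\w+;) (?=<|&\w+;)")


def protect(raw):
    """Replace endangered spaces with non-breaking spaces."""
    return ENDANGERED.sub(lambda m: m.group(1) + NBSP, raw)


def unprotect(raw):
    """Restore ordinary spaces once the workaround is no longer needed."""
    return raw.replace(NBSP, " ")


print(repr(protect("</hi> <corr>")))
print(repr(protect("gair; <hi>")))
```

Since non-breaking spaces are used for no other purpose in the files, unprotect can be applied to a whole file without further checks.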