The shifting ambiguities of formality
In general, one could say that a goal of a formal language like XML is to address the inherent ambiguities of natural language. The promise is that, through the proper application of XML, the implicit meanings of a written text can be made explicit through markup. In my current work with an archive of reviews of dance performances, I'm struck by how the precision-oriented formalities of standardized encodings can themselves introduce new kinds of ambiguity.
In a particular collection of the archive, great care has been taken to follow proper encoding standards, in this case those of the TEI (Text Encoding Initiative). One of the interesting possibilities of the TEI standard is the note element, described in the document "P5: Guidelines for Electronic Text Encoding and Interchange". According to section 3.8 of this document, note elements are to be used for the "encoding of discursive notes, whether already present in the copy text or supplied by the encoder".
The decision was made to use the note element to express information about the subject of a review, such as the titles of performances and the names of related organizations, places and people (dancers, choreographers, musicians). For each review, a single note element with type="performancedata" was created to contain (following the hierarchical logic of XML) the various (duly tagged) titles and names. Looking at the encoding of a particular review, published in 2008 by writer Pieter T'Jonck in the Flemish daily De Morgen, one finds:
The trouble that one runs into when reading this representation (regardless of familiarity with the XML or TEI conventions) is that the isolated names and titles raise the question of who belongs with what. Is Eleanor Bauer related to the title "Eleanor!" (seems likely)? Does Hans Bryssink then relate to the title "At Large", and did "Reverend Billy" indeed have a "One Man Show"?!
Following the note is a section called the "sourceDesc", which according to the TEI "describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as 'born digital' for a text which has no previous existence."
A more subtle problem exists with the date "2008-03-05" (the ISO 8601 representation of March 5, 2008). The date appears both in the "performancedata" note and again within the "sourceDesc". By itself, this is not problematic: two separate dates could represent the date of the performance (though which performance we cannot be sure) and the "birth" date of the review (i.e. the publication date). In looking through the full collection of XML files, however, I have seen that the two dates are nearly always identical (in all but 4 of the 1459 documents, where the differences seem due to encoding errors). The evidence strongly suggests that a single date was fed into the "encoding process" (I imagine the publication date of the article, since that was the primary material being handled) and then copied into both contexts (with the idea that the information is at least close to accurate). Perhaps the decision to duplicate the dates was made earlier in the editorial process, to fit the intermediary format to be fed into the XML-encoding process. However, the repetition itself creates a new kind of ambiguity, as one is left to second-guess the workflow behind the editorial process of the encoding.
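The comparison of the two dates across the collection can be sketched roughly as follows. This is a minimal illustration rather than the script actually used: the function names (date_value, extract_dates, find_mismatches), the nesting of the elements, the use of a "when" attribute on the date element, and the sample documents (including the differing date 2008-03-06) are all assumptions made for the sake of a self-contained example.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; whether the archive's files declare it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def date_value(element):
    """Return the date carried by a <date> element, or None if there is none."""
    if element is None:
        return None
    # Assumption: the date sits in a "when" attribute, else in the element text.
    return element.get("when") or (element.text or "").strip() or None


def extract_dates(xml_text):
    """Return (note_date, source_date) for one document given as a string."""
    root = ET.fromstring(xml_text)
    note_date = root.find(".//tei:note[@type='performancedata']//tei:date", TEI_NS)
    source_date = root.find(".//tei:sourceDesc//tei:date", TEI_NS)
    return date_value(note_date), date_value(source_date)


def find_mismatches(documents):
    """Return the sorted names of documents whose two dates differ."""
    mismatched = []
    for name, xml_text in documents.items():
        note_date, source_date = extract_dates(xml_text)
        if note_date != source_date:
            mismatched.append(name)
    return sorted(mismatched)


# Hypothetical sample documents, invented purely for illustration.
TEMPLATE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <notesStmt>
        <note type="performancedata"><date when="{note_date}"/></note>
      </notesStmt>
      <sourceDesc><bibl><date when="{source_date}"/></bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

documents = {
    "a.xml": TEMPLATE.format(note_date="2008-03-05", source_date="2008-03-05"),
    "b.xml": TEMPLATE.format(note_date="2008-03-05", source_date="2008-03-06"),
}

print(find_mismatches(documents))
```

Note that such a check can only report that the two dates agree or disagree; it cannot say which event either date was meant to record, which is exactly the ambiguity at issue.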
Through my contact with the people involved in the archive, I know that text files were first prepared by one set of editors (SARMA) according to a convention that the order of the lines in the file was significant (i.e. the first line should be the title of the review, the second line should be the name of the publication, and so on). These text files were then used as input to an automatic encoding as XML/TEI, which was then "hand-cleaned" by another editor working on the related research project (University of Antwerp).
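To make the line-order convention concrete, here is a small sketch of what such an automatic conversion might look like. It is not the converter used for the archive. Only the first two line positions (title, publication) are known to me; treating every remaining line as a body paragraph, as well as the element names (review, header, body) and the sample text, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Only these two positions are stated in the convention as I know it;
# the rest of the real field order is not reproduced here.
FIELD_ORDER = ["title", "publication"]


def encode_review(text):
    """Turn a line-ordered plain-text review into a small XML tree."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    root = ET.Element("review")
    header = ET.SubElement(root, "header")
    # The meaning of a header line comes from its position alone.
    for name, value in zip(FIELD_ORDER, lines):
        ET.SubElement(header, name).text = value
    body = ET.SubElement(root, "body")
    # Every remaining line becomes a paragraph, section titles included.
    for line in lines[len(FIELD_ORDER):]:
        ET.SubElement(body, "p").text = line
    return root


# Hypothetical input file contents.
sample = """Example review title
Example Daily
A section heading
First paragraph of the review."""

root = encode_review(sample)
print(ET.tostring(root, encoding="unicode"))
```

In a scheme like this, a single misplaced line silently shifts the meaning of the header fields, and nothing in the resulting XML records that the meaning was ever positional.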
At the end of the XML document is (finally) the original text of the review, itself minimally and uniformly marked up with HTML-style 'p' tags denoting paragraphs. The uniformity (single-sentence section titles are also encoded as paragraphs) again implies an algorithmic treatment of the plain-text formatted input files.
What's noteworthy here is how the implicit "ambiguities" of language in the original written material (which pose no problem for the human reader) are exaggerated by the (algorithmic?) name-extraction process that isolates the names not only from their textual context but also from the semantic structures of sentences and paragraphs. These isolated, and now truly ambiguous, elements are then further distorted as they get mapped to the hyper-precise TEI structures.
Finally, what's interesting here is that in the chain of editorial process, increasingly formal representation structures are successively applied. Each step of "cleaning" erases traces of the editorial process, and this erasure itself introduces new ambiguities: idiosyncrasies in the format structures are left without a context (Was this date copied? From where? Who made that decision? Or is it (now) correct?).
Reading a portion of the original text is revealing in how precise it is in describing various complex relationships in a relatively unambiguous way:
In 2004 American dancer and choreographer Eleanor Bauer, then a student at PARTS, created the hilarious solo Eleanor!, a blatant promotion for her 'product', namely choreography. No surprise then that Vooruit invited her for a festival about the link between art and commerce. ... Her new work At Large, made with dancers Famke Gyselinck and Manon Santkin and videomaker Inneke Van Waeyenberghe, researches the range of an art form such as dance.
In 2004 maakt de Amerikaanse danseres en choreografe Eleanor Bauer, toen nog studente bij PARTS, de hilarische solo Eleanor!, een onverbloemde promotie voor haar 'product', namelijk choreografie. Niet verwonderlijk dat Vooruit haar uitnodigde voor een festival over de band tussen kunst en commercie. Bauer verstaat trouwens uitstekend de kunst om moeilijke kwesties helder voor te stellen. Haar nieuwe werk At Large, gemaakt met danseressen Famke Gyselinck en Manon Santkin en videaste Inneke Van Waeyenberghe, onderzoekt de spreiding van een kunstproduct als dans. ...
Granularity & Order of Text
A text always has order. Despite the potential of HyperText for "non-linear reading", its basis, text itself, is inherently linear, as is the process of reading. The non-linearity comes from how one provides means of moving from one track to another at nodes that are determined in the process of writing.
The process of extracting the titles and names of people related to performances, removing them from their natural order in the text and from the granularity of paragraphs, and "cleaning them" into order-less (XML-encoded) relations is a lossy process. (In this case, the names of performances and performers get summarily dumped into a box called "performancedata".) While it makes sense from the limited point of view of indexing the text, the resulting XML representation actually obscures the original structure, leaving one to wonder which performers belong to which performance.
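The loss can be shown in a few lines. The sketch below is only an illustration: the flatten function is hypothetical, the first grouping is my own reading of the passage quoted above, and the second grouping is deliberately wrong. Both collapse to the same flat collection of titles and names, so the flat form cannot tell them apart.

```python
def flatten(groupings):
    """Collapse a title -> people mapping into unordered titles and names."""
    titles = sorted(groupings)
    names = sorted({name for people in groupings.values() for name in people})
    return {"titles": titles, "names": names}


# My reading of the relationships as described in the review text.
as_written = {
    "Eleanor!": ["Eleanor Bauer"],
    "At Large": [
        "Eleanor Bauer",
        "Famke Gyselinck",
        "Manon Santkin",
        "Inneke Van Waeyenberghe",
    ],
}

# A deliberately incorrect attribution, invented for contrast.
misattributed = {
    "Eleanor!": [
        "Eleanor Bauer",
        "Famke Gyselinck",
        "Manon Santkin",
        "Inneke Van Waeyenberghe",
    ],
    "At Large": ["Eleanor Bauer"],
}

# Both groupings produce the same flat record.
print(flatten(as_written) == flatten(misattributed))
```

Since two different sets of relations map to the same flat record, no amount of later processing can recover from that record alone which one the reviewer wrote.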
Splitting Writing from Encoding
One may be tempted to blame the encoding process, the people doing the encoding, or the decisions made in how the standard would be implemented as an encoding. This misses the more essential point, however. Indeed, in this case the encoding job was done by highly trained and proficient editors, with knowledge of both the subject matter and the encoding sciences. However, given the constraint of needing to encode a large number of articles (in this case close to 1500), endlessly perfecting the representation is not an option. In striving for generality, the rules and related structures for encoding become so complex as to make the work of consistent encoding over any non-trivially large set of documents effectively impossible without resorting to "short cuts" that invariably result in less than ideal representations. So part of the problem is that, because the standards do not take into account the actual work involved in applying them, nor the mental overload and editorial implications the encodings have, they are doomed to be at best incompletely, inconsistently, or fully improperly applied.
The separation of "data encoding" from the main "body" of the text exacerbates the problem by leaving no "margin" space for the exceptional to be explained, or at least left in a state that is likely to be understood as incomplete and possibly improved in the future. The apparent cleanliness of the formal system erases the traces of the doubts in an author's text, and makes subsequent readers of the text unlikely to understand the problem, let alone be in a position to improve it through re-writing.
When encoding data hypertextually, one is confronted with the question of how the data is "read" not in a machinic sense but in the human logic of how a text will appear and be read by a human reader. In many cases, "stylistic" decisions about how a text might appear on the page can give guidance to the "semantic" decisions of encoding. (example?)
Special thanks to SARMA and to Thomas Crombez at the University of Antwerp.
Todo: write about discussion of "#redirect" as sameas, and the Falco as Synonymous to "Rock me Amadeus".