The issue of what text really is, and how that affects our notions of proper text representation, has been with us almost from the beginning of text encoding [Goldfarb 1981, Reid 1980, Coombs 1987, DeRose 1990, Renear]. The simplest reasonable view, that text is fundamentally an ordered hierarchical structure determined by its editor and author, is an early one that has remained prominent ever since, especially as reified by ISO 8879 (SGML). However, this simple model is not enough, as the TEI [Sperberg-McQueen and Burnard 1990, 1993] quickly discovered while moving text encoding from the realm of print production to that of scholarship, textual editing, and linguistic analysis. The initial work of the TEI metalanguage committee identified problems with SGML's simple hierarchical mechanisms, and developed and published techniques for working around them to encode non-hierarchical phenomena [Barnard 1996].
In [Renear] we began to analyze and label the theoretical and ontological foundations underlying many of the kinds of non-hierarchical structures discovered by practitioners using naive hierarchical markup. This paper uncovered some key notions and implicit partial theories underlying most previous theorizing about markup. The most important of these notions is the primacy of "perspectives," which we defined as "... [a] natural family of methodology, theory, and analytical practice." Perspectives are critical entities underlying the presuppositions of the simple hierarchical approach. Each perspective seems to define its own hierarchical view of a document, leading to the sort of multiple hierarchy markup provided by SGML's CONCUR feature.
We also defined the notion of a sub-perspective, reflecting the fact that even a single perspective, defined with reference to disciplinary practice, might in fact require more than one hierarchy to correctly label all the textual phenomena of interest within that perspective. Sub-perspectives are perspectives within a discipline. Sub-perspectives also help to preserve the multiple-hierarchy notion of text structures: even if a perspective has non-hierarchical phenomena to describe, it is frequently decomposable into sub-perspectives that are hierarchical. However, we also showed that there are many cases where meaningful decomposition into sub-perspectives is not possible.
The notion of perspective clarifies the taxonomy of non-hierarchical document structures and allows a clearer definition of the source of the simple hierarchical model's problems. In this paper, we use these theoretical results to examine how the basic notions of hierarchical markup should be extended to allow a more expressive and accurate approach to document markup. At the same time, Renear [Draft] is continuing the philosophical analysis of the underlying ontological issues -- we, however, are pausing to take the theory back to its practical roots and determine requirements and specifications for a new generation of text-description languages.
The following discussion is framed in terms of SGML, because SGML represents the state of the art in document description languages. The features that we propose markup systems should handle can be regarded either as suggestions for improving SGML, as specifications for what a future successor should be able to handle, or even as specifications for a new standard, that, like HyTime, would add additional power to SGML markup by formalizing the representation of phenomena for which SGML has no standard representation. We will not take a position on these thorny standards issues, concentrating rather on the problems to be addressed. In our examples we will use syntax based on SGML for clarity, but we will diverge from that syntax as necessary (and with explanation).
The following (partial) list of non-hierarchical phenomena is based on [Renear, Barnard 1995]:
Despite SGML's limitations, methods of tagging all of these currently exist in the TEI, in the form of particular tags for particular perspectives. But since we now know that the breaking of strict hierarchies is the rule, rather than the exception, it is time to determine what additional features are required from markup systems to make the formal description of such non-hierarchical phenomena straightforward. We propose that in the long run, it is better to integrate the formal properties of these recurring non-hierarchical phenomena into markup systems themselves, rather than hiding a recurring feature of textual analysis in the specific rules for many distinct tags. In addition, since these formal properties can apply to any genre of document and its associated analytical perspectives, their explicit representation will enable more perspicuous, explicit, and consistent descriptions of nonhierarchical tag-relationships and constraints, in the same way that the formal definitions of content models in SGML provide for the description of hierarchical documents.
The feature-structure tags in the TEI [Langendoen 1995] are actually general enough to handle any non-hierarchical structure. This reflects the fact that TEI feature structures can describe arbitrary attributed directed graphs that indicate a point in a text. This is the most general structure that markup could possibly have (since elements cannot contain themselves). However, feature structures are not appealing, because their consistent application to the problem of general document markup would lead to documents containing no tags other than the feature-structure tags, with all the information that is currently represented by tags encoded in them. This would produce extremely verbose encodings that could not be processed easily or effectively by existing tools, unless we assume the widespread distribution of a specialized feature-structure engine capable of interpreting their special usages. In short, feature structures do not solve the problem of improving hierarchical markup systems to deal with non-hierarchical markup structures; rather, they solve the related problem of encoding non-hierarchical structures within a hierarchical markup system.
Our earlier paper analyzed these phenomena at a high level. We now need to specify the features that are needed to create non-hierarchical markup systems. Essentially, we will expand the notion of a document schema to include non-hierarchical structures. One of the most obvious points to start with is the notion of analytical perspectives. The SGML CONCUR feature comes close to expressing the basic notion, but has a number of serious defects:
The problem of arbitrarily overlapping segments, like hypertext anchors, cannot be handled with something like CONCUR, but is usually handled by the use of SGML EMPTY tags and IDREFs. The problem with this approach is that the DTD cannot indicate the usage of such tags. For instance, given two tags <startSeg> and <endSeg>, a DTD cannot indicate:
The issue here has nothing to do with the syntax or the use of cross-references to indicate relationships in the document instance. Rather, it is whether the processing system can automatically perform the obvious useful verification tasks, or enforce user requirements that depend on the intended semantics of the tags (such as any relationships to other tags that should be enforced or forbidden).
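To illustrate the kind of verification at issue, the following is a minimal sketch, in Python, of a check that a processing system might perform if it knew the intended semantics of the two tags. The attribute names (`id` on the start tag, `start` on the end tag), the deliberately permissive tag pattern, and the particular rules enforced are assumptions made for this illustration; they are not taken from SGML or from the TEI.

```python
import re

# Hypothetical syntax: <startSeg id="a1"> opens a segment and
# <endSeg start="a1"> closes it by reference. The pattern is permissive
# and accepts either attribute name on either tag.
TAG = re.compile(r'<(startSeg|endSeg)\s+(?:id|start)="([^"]+)"\s*>')

def check_segments(text):
    """Return error messages for unpaired or misordered segment tags."""
    open_ids = set()    # segments started but not yet ended
    closed_ids = set()  # segments already ended
    errors = []
    for match in TAG.finditer(text):
        kind, ident = match.group(1), match.group(2)
        if kind == "startSeg":
            if ident in open_ids or ident in closed_ids:
                errors.append(f"duplicate startSeg id {ident!r}")
            else:
                open_ids.add(ident)
        elif ident in open_ids:
            open_ids.remove(ident)
            closed_ids.add(ident)
        else:
            errors.append(f"endSeg refers to {ident!r}, which is not open")
    for ident in sorted(open_ids):
        errors.append(f"startSeg {ident!r} is never ended")
    return errors

# Two segments that overlap each other are accepted, since each is paired.
sample = 'a<startSeg id="s1">b<startSeg id="s2">c<endSeg start="s1">d<endSeg start="s2">e'
print(check_segments(sample))
```

The sketch accepts overlapping segments and rejects only unpaired or misordered tags; a schema language with the features discussed here would let such rules be declared rather than programmed separately for each tag pair.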
Ambiguous content, as described in [Barnard 1995], is content that has several differing analyses within a given perspective. It is especially interesting because the phenomenon that it reflects is so important: the points where analytic ambiguity exists are often the most interesting ones in many different disciplines. There are other interesting aspects of the markup of ambiguous texts. Not only are there boundaries that overlap, but we may also have boundaries with significant, or even mandatory, alignments. For instance, a document that records ambiguity precisely, sharing structure down to the lexical level, is different from one that records ambiguous structures at the sentence level, with completely separate structures that share only the leaf text, despite a large number of identical positions.
Segmentations of a text, like reference systems that assign portions of a text to one of a discrete set of regions, can be analyzed as a special case of hierarchical markup that is extremely simple -- a single level with tags that divide an entire document into segments. Since such structures are usually marked when they do not have a natural mapping to a hierarchical perspective, they are candidates for special treatment. This kind of structure is usually marked by empty "milestone" tags.
We have identified several fundamental sorts of non-hierarchical structure in the discussion above:
Each of these has certain implications that recommend new features of markup schemas. We will briefly outline the kinds of additions needed to accommodate each phenomenon.
As a concept, this is a fairly simple idea, similar to using multiple clear overlays of a page, each marked with highlighter. We propose to represent different perspectives by streams. A stream is a markup object that (usually) corresponds to a perspective, in the same way that an SGML element (usually) corresponds to a document object. Streams are actually very similar to CONCUR, except that:
This can be handled in two ways. The simplest is to treat each sub-perspective as an independent stream. The other way would be to encode perspective/sub-perspective relationships explicitly in the document. But this more complex solution is also more problematic, since the relation of fields of inquiry is not clear-cut. Moreover, the real reason that one would want such information is that there are relationships between the relevant streams, which can be expressed just as easily by treating them as independent and adding constraints to them.
We add:
This is easily accommodated by extending the notion of a stream to have several "root elements." A list of possible top-level elements in the stream is given, and those may occur anywhere (as long as they meet the other constraints defined for them). All text not contained by any of the roots is automatically ignored in that stream.
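The effect of multiple root elements can be sketched as follows. This is only an illustration under stated assumptions: elements are modeled as (name, start, end) character spans over a shared text, and the function name `stream_text` is hypothetical, not part of any proposed syntax.

```python
def stream_text(text, elements, roots):
    """Return the portions of `text` that belong to one stream.

    `elements` is a list of (name, start, end) spans over `text`, with
    `end` exclusive. `roots` is the set of element names declared as
    possible top-level elements of the stream. Text outside every root
    span is ignored in this stream.
    """
    # Keep only the spans of root elements, in document order.
    root_spans = sorted((s, e) for name, s, e in elements if name in roots)
    pieces = []
    covered_to = 0
    for start, end in root_spans:
        start = max(start, covered_to)  # skip text already emitted
        if start < end:
            pieces.append(text[start:end])
            covered_to = end
    return pieces

text = "Title. Act one. Aside. Act two."
first = text.index("Act one.")
second = text.index("Act two.")
elements = [("act", first, first + 8), ("act", second, second + 8)]
print(stream_text(text, elements, {"act"}))  # ['Act one.', 'Act two.']
```

Here the title and the aside lie outside every root element, so they simply do not appear in the stream.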
These cannot be handled properly by the stream notion (as we noted in the discussion of CONCUR). Elements like this may overlap themselves. To handle such elements we need only add a way to specify where they can break the hierarchy within their own stream -- any requirements to constrain an element not to break the hierarchy in other streams have already been handled above.
We allow two new hierarchy-breaking declarations:
Since a segmentation is simply a specialized, but common, form of stream, we allow it as a special type. When declared, a segmentation stream can optionally specify an element (of some stream). If specified, this states that the segmentation is valid (and required) in every occurrence of that element. If no such element is specified, at least one of the elements in the segmentation stream must occur before any character content in the document. All these elements are represented by tags without content, and are interpreted as spanning a region from their appearance up to the next occurrence of a tag in the same stream.
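The interpretation of milestone tags as spans can be sketched briefly. We assume, for illustration only, that each milestone of a segmentation stream is given as an (offset, label) pair relative to the character content of the enclosing scope; both function names are hypothetical.

```python
def segments_from_milestones(milestones, scope_end):
    """Turn the empty milestone tags of one segmentation stream into spans.

    Each segment runs from its own tag up to the next tag in the same
    stream; the last segment runs to `scope_end`.
    """
    ordered = sorted(milestones)
    spans = []
    for i, (offset, label) in enumerate(ordered):
        end = ordered[i + 1][0] if i + 1 < len(ordered) else scope_end
        spans.append((label, offset, end))
    return spans

def has_uncovered_prefix(milestones, scope_start):
    """True if character content precedes the first milestone in the scope."""
    return not milestones or min(m[0] for m in milestones) > scope_start

pages = [(0, "p1"), (10, "p2")]
print(segments_from_milestones(pages, 25))  # [('p1', 0, 10), ('p2', 10, 25)]
print(has_uncovered_prefix(pages, 0))       # False
```

The second function corresponds to the requirement that a segmentation element occur before any character content, so that every character falls in some segment.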
These do not have to be specially handled as they are formally similar to self-overlapping tags. All that is required is to declare such elements as self-overlapping within the stream corresponding to their perspective.
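To make the distinction between properly nested and self-overlapping elements concrete, the following sketch finds the pairs of elements in one stream that cross each other, and reports those that have not been declared as permitted to break the hierarchy. The span representation and the rule that a crossing is acceptable when at least one of the two elements is so declared are assumptions of this sketch, not part of the proposal itself.

```python
def crossing_pairs(spans):
    """Return pairs of spans that overlap without one containing the other.

    `spans` is a list of (name, start, end) tuples, `end` exclusive, all
    from one stream. An empty result means the stream is properly nested.
    """
    # Sort by start, then longest first, so containers precede contents.
    ordered = sorted(spans, key=lambda s: (s[1], -s[2]))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # b starts strictly inside a and ends strictly after a.
            if a[1] < b[1] < a[2] < b[2]:
                pairs.append((a, b))
    return pairs

def undeclared_crossings(spans, may_overlap):
    """Crossing pairs in which neither element is declared hierarchy-breaking."""
    return [(a, b) for a, b in crossing_pairs(spans)
            if a[0] not in may_overlap and b[0] not in may_overlap]

spans = [("p", 0, 20), ("anchor", 0, 10), ("anchor", 5, 15)]
print(undeclared_crossings(spans, {"anchor"}))  # []
```

With `anchor` declared self-overlapping the stream is accepted; with an empty declaration set the same two anchors would be reported as a violation.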
We have presented some design requirements, based on earlier analyses of the intellectual situation underlying many cases of non-hierarchical text structure. This intellectual groundwork gives us some confidence that, while the set of features proposed may not be complete, it at least reflects the fundamental problems.
Barnard, David, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. M. Sperberg-McQueen, and Giovanni Batista Varile. "Hierarchical Encoding of Text: Technical Problems and SGML Solutions." Computers and the Humanities. 29 (1995): 211-231.
Coombs, James S., Allen H. Renear, and Steven J. DeRose. "Markup Systems and the Future of Scholarly Text Processing." Communications of the Association for Computing Machinery. 30 (1987): 933-47.
DeRose, Steven J., David Durand, Elli Mylonas, and Allen H. Renear. "What is Text, Really?" Journal of Computing in Higher Education. 1:2 (1990).
Goldfarb, Charles. "A Generalized Approach to Document Markup." In Proceedings of the ACM SIGPLAN--SIGOA Symposium on Text Manipulation. New York: ACM, 1981.
Langendoen, D. Terence, and Gary F. Simons. "Rationale for the TEI Recommendations for Feature-Structure Markup." Computers and the Humanities. 29 (1995): 191-209.
Reid, Brian. "A High-Level Approach to Computer Document Formatting." In Proceedings of the 7th Annual ACM Symposium on Programming Languages. New York: ACM, 1980.
Renear, Allen, David Durand, and Elli Mylonas. "Refining Our Notion of What Text Really Is." Research in Humanities Computing. Oxford: Oxford University Press, forthcoming.
Rohr, P. "The TextBase Paradigm: Architectural Considerations for a Second-Generation Scholar's Workstation." Senior Thesis, University of Chicago (seen in draft, 1991).
Sperberg-McQueen, C. Michael, and Lou Burnard, eds. Guidelines for the Encoding and Interchange of Machine-Readable Texts. Chicago and Oxford: TEI, 1990, 1993.