The master SXML specification file is written in SXML itself. The present web page is the result of translating that SXML code using an appropriate "stylesheet". The master file, its renditions in HTML and other formats, and the corresponding stylesheets are available at
<http://pobox.com/~oleg/ftp/Scheme/xml.html>.
An XML document is essentially a tree structure. The start and the end tags of the root element enclose the whole content of the document, which may include other elements or arbitrary character data. Text with familiar angle brackets is an external representation of an XML document. Applications ought to deal with its internalized form: the XML Information Set, or its specializations. This form lets an application locate specific data or transform an XML tree into another tree, which can then be written out as an XML, HTML, PDF, etc. document.
The XML Information Set (Infoset) [XML Infoset] is an abstract data set that describes the information available in a well-formed XML document. The Infoset is made of "information items", which denote elements, attributes, character data, processing instructions, and other components of the document. Each information item has a number of associated properties, e.g., name, namespace URI. Some properties -- for example, 'children' and 'attributes' -- are (ordered) collections of other information items. The Infoset describes only the information in an XML document that is relevant to applications. The default value of attributes declared in the DTD, parameter entities, the order of attributes within a start-tag, and other data used merely for parsing or validation are not included. Although the Infoset is technically specified for XML, it largely applies to HTML as well.
A hierarchy of containers composed of text strings and other containers lends itself naturally to description by S-expressions. S-expressions [McCarthy] are easy to parse into an internal representation suitable for traversal. They have a simple external notation (albeit with many parentheses), which is relatively easy to compose even by hand. S-expressions have another advantage: given an appropriate design, they can represent Scheme code to be evaluated. This code-data dualism is a distinguishing feature of Lisp and Scheme.
SXML is a concrete instance of the XML Infoset. Infoset's goal is to present in some form all relevant pieces of data and their abstract, container-slot relationships to each other. SXML gives the nest of containers a concrete implementation as S-expressions, and provides a means of accessing items and their properties. SXML is a "relative" of XPath [XPath] and DOM [DOM], whose data models are two other instances of the XML Infoset. SXML is particularly suitable for Scheme-based XML/HTML authoring, SXPath queries, and tree transformations. In John Hughes' terminology, SXML is a term implementation of evaluation of the XML document.
We will use an Extended BNF Notation (EBNF) employed in the XML Recommendation [XML]. The following table summarizes the notation.
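| thing? | An optional thing |
| thing* | Zero or more things |
| thing+ | One or more things |
| thing1 | thing2 | thing3 | A choice of things |
| <thing> | A non-terminal of the grammar |
| thing | A terminal of the grammar that is a Scheme identifier |
| "thing" | A literal Scheme string |
| thing | A literal Scheme symbol |
| ( <A> <B>* ) | An S-expression made of <A> followed by zero or more <B> |
| ( <A> . <B> ) | An S-expression made by prepending <A> to an S-expression denoted by <B> |
| MAKE-SYMBOL(<A>:<B>) | A symbol whose string representation consists of the characters that spell <A> followed by the colon character and by the characters that spell <B>. The MAKE-SYMBOL() notation can be regarded as a meta-function that creates symbols. |

| [1] | <TOP> | ::= | ( *TOP* <namespaces>? <PI>* <comment>* <Element> ) |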
This S-expression stands for the root of the SXML tree, a document information item of the Infoset. It contains the root element of the XML document as its only child element.
| [2] | <Element> | ::= | ( <name> <attributes-list>? <namespaces>? <child-of-element>* ) |
| [3] | <attributes-list> | ::= | ( @ <attribute>* ) |
| [4] | <attribute> | ::= | ( <name> "value"? ) |
| [5] | <child-of-element> | ::= | <Element> | "character data" | <PI> | <comment> | <entity> |
These are the basic constructs of SXML.
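
Productions [2] through [5] can be made concrete with a few list operations. The following is a minimal sketch, not part of SSAX or any SXML library: the helper names (`attr-tag`, `has-attr-list?`, `element-name`, `element-attributes`, `element-children`) are hypothetical, and the sketch assumes the element carries no `<namespaces>` node, which would otherwise show up among the children. The `@` tag is built with `string->symbol` so that the code does not depend on how a particular reader treats a bare `@`.

```scheme
;; Hypothetical helpers for illustration; not an API of SSAX or SXML.
(define attr-tag (string->symbol "@"))   ; the @ tag, built portably

;; An element per productions [2]-[5]: a name, an optional attribute
;; list, and the children (here two child elements with character data).
(define weight
  (list 'weight
        (list attr-tag (list 'unit "pound"))
        (list 'net "67")
        (list 'gross "95")))

;; True if the second item of the element is an <attributes-list>.
(define (has-attr-list? elem)
  (and (pair? (cdr elem))
       (pair? (cadr elem))
       (eq? (car (cadr elem)) attr-tag)))

(define (element-name elem) (car elem))

;; The list of <attribute> nodes, or the empty list if there are none.
(define (element-attributes elem)
  (if (has-attr-list? elem) (cdr (cadr elem)) '()))

;; Everything after the name and the optional attribute list.
(define (element-children elem)
  (if (has-attr-list? elem) (cddr elem) (cdr elem)))

(element-attributes weight)   ; ((unit "pound"))
(element-children weight)     ; ((net "67") (gross "95"))
```

Because the attribute list is optional, every accessor has to test for it first; this is the cost that the normalized form of SXML removes.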
| [6] | <PI> | ::= | ( *PI* pi-target "processing instruction content string" ) |
The XML Recommendation specifies that processing instructions (PI) are distinct from elements and character data; processing instructions must be passed through to applications. In SXML, PIs are therefore represented by nodes of a dedicated type *PI*. DOM Level 2 treats processing instructions in a similar way.
| [7] | <comment> | ::= | ( *COMMENT* "comment string" ) |
| [8] | <entity> | ::= | ( *ENTITY* "public-id" "system-id" ) |
Comments are mentioned for completeness only. The SSAX XML parser [SSAX], among others, transparently skips comments. The XML Recommendation permits the parser to pass comments to an application or to disregard them completely. The present SXML grammar admits comment nodes but does not mandate them by any means.
An <entity> node represents a reference to an unexpanded external entity. This node corresponds to an unexpanded entity reference information item, defined in Section 2.5 of [XML Infoset]. Internal parsed entities are always expanded by the XML processor at the point of their reference in the body of the document.
| [9] | <name> | ::= | <LocalName> | <ExpName> |
| [10] | <LocalName> | ::= | NCName |
| [11] | <ExpName> | ::= | MAKE-SYMBOL(<namespace-id>:<LocalName>) |
| [12] | <namespace-id> | ::= | URI | user-ns-shortcut |
| [13] | <namespaces> | ::= | ( *NAMESPACES* <namespace-assoc>* ) |
| [14] | <namespace-assoc> | ::= | ( <namespace-id> "URI" ) |
An SXML <name> is a single symbol. It is generally an expanded name [XML Namespaces], which conceptually consists of a local name and a namespace URI. The latter part may be empty, in which case <name> is an NCName: a Scheme symbol whose spelling conforms to production [4] of the XML Namespaces Recommendation [XML Namespaces]. <ExpName> is also a Scheme symbol, whose string representation contains an embedded colon that joins the namespace and local parts of the name. A URI is a Namespace URI string converted to a Scheme symbol. Uniform Resource Identifiers (URIs) may contain characters (e.g., parentheses) that are prohibited in Scheme identifiers. Such characters must be %-quoted during the conversion from a URI string to URI. A user-ns-shortcut is a Scheme symbol chosen by the user to represent a namespace URI in the application program. The SSAX parser lets a user define short, mnemonic, unique shortcuts for Namespace URIs, which are often long and unwieldy strings.
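
The MAKE-SYMBOL(<A>:<B>) meta-function of production [11] can be approximated by an ordinary Scheme procedure. The sketch below is only an illustration: `make-exp-name` is a hypothetical name, and it takes both parts as strings so that their case is preserved. It does not perform the %-quoting of characters that are prohibited in Scheme identifiers.

```scheme
;; Hypothetical helper mirroring MAKE-SYMBOL(<namespace-id>:<LocalName>).
;; Both arguments are strings; the result is a single Scheme symbol.
(define (make-exp-name namespace-id local-name)
  (string->symbol (string-append namespace-id ":" local-name)))

;; An expanded name qualified by a namespace URI ...
(define full-name (make-exp-name "http://www.cars.com/xml" "part"))

;; ... and the same local name qualified by a user-chosen shortcut.
(define short-name (make-exp-name "c" "part"))

(symbol->string full-name)    ; "http://www.cars.com/xml:part"
(symbol->string short-name)   ; "c:part"
```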
| [2e] | <Element> | ::= | ( <name> <attributes-list>? <aux>? <child-of-element>* ) |
| [4e] | <attribute> | ::= | ( <name> "value"? <aux>? ) |
| [15e] | <aux> | ::= | ( @@ <namespaces>? <aux-node>* ) |
| [16e] | <aux-node> | ::= | To be defined in the future |
The XML Recommendation and related standards are not firmly fixed, as the long list of errata and the proposed version 1.1 of XML clearly show. Therefore, SXML has to be able to accommodate future changes while guaranteeing backwards compatibility. SXML also ought to permit applications to store various processing information (e.g., cached resolved IDREFs) in an SXML tree. To allow such extensibility, we introduce two new node types, <aux> and <aux-node>. The semantics of the latter is to be established in future versions of SXML. Other candidates for <aux-node> are the unique id of an element or a reference to the element's parent. The structure and the semantics of the <aux> node are similar to those of the attribute list. Applications that do not specifically look for auxiliary nodes can transparently ignore any present and future extensions.
An Infoset information item is a sum of its properties. This makes a list a particularly suitable data structure to represent an item. The head of the list, a Scheme identifier, names the item. For many items this is their (expanded) name. For an information item that denotes an XML element, the corresponding list starts with the element's expanded name, optionally followed by collections of attributes and effective namespaces. The rest of the element item list is an ordered sequence of the element's children -- character data, processing instructions, and other elements. Every child is unique; items never share their children even if the latter have identical content.
Just as XPath does and the Infoset specification explicitly allows, we group character information items into maximal text strings. The value of an attribute is normally a string; it may be omitted (in the case of HTML) for a boolean attribute, e.g., <option checked>.
We consider a collection of attributes an information item in its own right, tagged with a special name @. The character '@' may not occur in a valid XML name; therefore an <attributes-list> cannot be mistaken for a list that represents an element. An XML document renders attributes, processing instructions, namespace specifications and other meta-data differently from the element markup. In contrast, SXML represents element content and meta-data uniformly -- as tagged lists. SXML takes advantage of the fact that every XML name is also a valid Scheme identifier, but not every Scheme identifier is a valid XML name. This observation lets us introduce administrative names such as @, *PI*, *NAMESPACES* without worrying about potential name clashes. The observation also makes the relationship between XML and SXML well-defined. An XML document converted to SXML can be reconstructed into an equivalent (in terms of the Infoset) XML document. Moreover, due to the implementation freedom given by the Infoset specification, SXML itself is an instance of the Infoset.
Since an SXML document is essentially a tree structure, the SXML grammar of Section 3 can be presented in the following, more uniform form:
| [N] | <Node> | ::= | <Element> | <attributes-list> | <attribute> | "character data" | <namespaces> | <TOP> | <PI> | <comment> | <entity> | <aux> |
or as a set of two mutually-recursive datatypes, Node and Nodeset, where the latter is an ordered set of Nodes:
| [N1] | <Node> | ::= | ( <name> . <Nodeset> ) | "text string" |
| [N2] | <Nodeset> | ::= | ( <Node> <Node>* ) |
| [N3] | <name> | ::= | <LocalName> | <ExpName> | @ | *TOP* | *PI* | *COMMENT* | *ENTITY* | *NAMESPACES* | @@ |
The uniformity of the SXML representation for elements, attributes, and processing instructions simplifies queries and transformations. In our formulation, attributes and processing instructions look like regular elements with a distinguished name. Therefore, query and transformation functions dedicated to attributes become redundant.
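
The practical effect of this uniformity can be sketched with a single recursive procedure that follows productions [N1] and [N2]. The code is an illustration under stated assumptions, not the SXPath implementation: `select-by-name` and `attr-tag` are hypothetical names, and the tree is assumed to consist of proper lists only.

```scheme
(define attr-tag (string->symbol "@"))   ; the @ tag, built portably

;; Collect, in document order, every node anywhere in the tree whose
;; head is `name`. Text strings and other atoms contribute nothing.
(define (select-by-name name node)
  (if (not (pair? node))
      '()
      (let ((below (apply append
                          (map (lambda (kid) (select-by-name name kid))
                               (cdr node)))))
        (if (eq? (car node) name)
            (cons node below)
            below))))

(define doc
  (list '*TOP*
        (list 'reservation
              (list 'name (list attr-tag (list 'class "large")) "Layman, A")
              (list 'seat (list attr-tag (list 'class "y")) "33B"))))

;; The same procedure finds elements, attributes and attribute lists.
(select-by-name 'seat doc)     ; ((seat (@ (class "y")) "33B"))
(select-by-name 'class doc)    ; ((class "large") (class "y"))
```

No attribute-specific code is involved: an attribute is found simply because it is a tagged list like any other node.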
The function SSAX:XML->SXML of the functional Scheme XML parsing framework SSAX [SSAX] can convert an XML document or a well-formed part of it into the corresponding SXML form. The parser supports namespaces, character and parsed entities, attribute value normalization, processing instructions, and CDATA sections.
The motivation for XML Namespaces is explained in an excellent article by James Clark [Clark1999]. He says in part:
The XML Namespaces Recommendation tries to improve this situation by extending the data model to allow element type names and attribute names to be qualified with a URI. Thus a document that describes parts of cars can use part qualified by one URI; and a document that describes parts of books can use part qualified by another URI. I'll call the combination of a local name and a qualifying URI a universal name. The role of the URI in a universal name is purely to allow applications to recognize the name. There are no guarantees about the resource identified by the URI. The XML Namespaces Recommendation does not require element type names and attribute names to be universal names; they are also allowed to be local names.
...
The XML Namespaces Recommendation expresses universal names in an indirect way that is compatible with XML 1.0. In effect the XML Namespaces Recommendation defines a mapping from an XML 1.0 tree where element type names and attribute names are local names into a tree where element type names and attribute names can be universal names. The mapping is based on the idea of a prefix. If an element type name or attribute name contains a colon, then the mapping treats the part of the name before the colon as a prefix, and the part of the name after the colon as the local name. A prefix foo refers to the URI specified in the value of the xmlns:foo attribute. So, for example
<cars:part xmlns:cars='http://www.cars.com/xml'/>
maps to
<{http://www.cars.com/xml}part/>
Note that the xmlns:cars attribute has been removed by the mapping.
An SXML <name> is either a local name or an expanded name. Both kinds of names are Scheme identifiers. A local name has no colon characters in its spelling. An expanded name is spelled with at least one colon, which may make the identifier look rather odd. In SXML, James Clark's example will appear as follows:
(http://www.cars.com/xml:part)
or, somewhat redundantly,
(http://www.cars.com/xml:part (@)
  (*NAMESPACES* (cars "http://www.cars.com/xml")))
Such a representation also agrees with the Namespaces Recommendation [XML Namespaces], which says: "Note that the prefix functions only as a placeholder for a namespace name. Applications should use the namespace name, not the prefix, in constructing names whose scope extends beyond the containing document."
An expanded name such as http://www.cars.com/xml:part is long and unwieldy. Therefore, an application that invokes the SSAX parser may tell the parser to map the URI http://www.cars.com/xml to an application-specific namespace shortcut, a user-ns-shortcut, e.g., c. The parser will then produce
(c:part (*NAMESPACES* (c "http://www.cars.com/xml")))
To be more precise, the parser will return just
(c:part)
If an application told the parser how to map http://www.cars.com/xml, the application can keep this mapping in mind and will not need additional reminders.
We must note that there is a 1-to-1 correspondence between user-ns-shortcuts and the corresponding namespace URIs. This is generally not true for XML namespace prefixes and namespace URIs. A user-ns-shortcut uniquely represents the corresponding namespace URI within the document, but an XML namespace prefix does not. For example, different XML prefixes may specify the same namespace URI; XML namespace prefixes may be redefined in children elements. The other difference between user-ns-shortcuts and XML namespace prefixes is that the latter are at the whims of the author of the document whereas the namespace shortcuts are defined by an XML processing application.
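
The relationship between a URI-qualified name and its shortcut form can be sketched as a mapping over symbols. This is not how the SSAX parser applies shortcuts (it does so while parsing); it is an after-the-fact illustration. The procedure names `last-colon-index` and `shorten-name`, and the representation of the mapping as a list of (URI-string . shortcut-string) pairs, are assumptions made for this example.

```scheme
;; Index of the last colon in str, or #f if there is none. The local
;; part is an NCName and contains no colon, so the last colon is the
;; one that separates the namespace part from the local part.
(define (last-colon-index str)
  (let loop ((i (- (string-length str) 1)))
    (cond ((< i 0) #f)
          ((char=? (string-ref str i) #\:) i)
          (else (loop (- i 1))))))

;; Replace the namespace part of an expanded name by its shortcut.
;; Local names and names in unmapped namespaces are returned unchanged.
(define (shorten-name name ns-map)
  (let* ((str (symbol->string name))
         (i (last-colon-index str)))
    (if (not i)
        name
        (let ((entry (assoc (substring str 0 i) ns-map)))
          (if entry
              (string->symbol
               (string-append (cdr entry)
                              (substring str i (string-length str))))
              name)))))

(define ns-map (list (cons "http://www.cars.com/xml" "c")))

(shorten-name (string->symbol "http://www.cars.com/xml:part") ns-map)
;; the symbol spelled c:part
```

Since shortcuts and URIs correspond one-to-one, the inverse mapping can be written the same way with the pairs reversed.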
(symbol->string symbol) returns the name of symbol as a string. If the symbol was part of an object returned as the value of a literal expression (section 4.1.2) or by a call to the read procedure, and its name contains alphabetic characters, then the string returned will contain characters in the implementation's preferred standard case -- some implementations will prefer upper case, others lower case. If the symbol was returned by string->symbol, the case of characters in the string returned will be the same as the case in the string that was passed to string->symbol.
Thus, R5RS explicitly permits case-sensitive symbols: (string->symbol "a") is always different from (string->symbol "A"). SXML uses such case-sensitive symbols for all XML names. SXML-compliant XML parsers must preserve the case of all names when converting them into symbols. A parser may use the R5RS procedure string->symbol or other available means.
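
The guarantee quoted from R5RS can be checked directly. The snippet below is a small sketch; the variable names `upper` and `lower` are chosen for illustration only.

```scheme
;; Symbols created by string->symbol keep the case of the given string,
;; whatever the case convention of the reader happens to be.
(define upper (string->symbol "OPTION"))
(define lower (string->symbol "option"))

(eq? upper lower)          ; #f: the two names are distinct symbols
(symbol->string upper)     ; "OPTION"
(symbol->string lower)     ; "option"
```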
Entering literal case-sensitive SXML names on a case-insensitive Scheme system requires string->symbol or other standard or non-standard ways of producing case-sensitive symbols. The SSAX built-in test suite is a highly portable example of entering literal case-sensitive SXML names. Such workarounds, however simple, become unnecessary on Scheme systems that support DSSSL. According to the DSSSL standard,
7.3.1. Case Sensitivity. Upper- and lower-case forms of a letter are always distinguished. NOTE 5: Traditionally Lisp systems are case-insensitive.
In addition, a great number of Scheme systems offer a case-sensitive reader, which often has to be activated through a compiler option or pragma. A web page [Lisovsky] discusses the case sensitivity of various Scheme systems in detail.
Normalized SXML is a proper subset of SXML that permits efficient processing. An SXML document in the normalized form must satisfy a number of additional restrictions. The first restriction makes <attributes-list> mandatory (cf. Section 3, production [2]). If an element has no attributes, <attributes-list> shall be specified as (@). In normalized SXML, <comment> and <entity> nodes must be absent. Parsed entities should be expanded, even if they are external. A <namespaces> node may appear only in a *TOP* element.
A boolean attribute has two equivalent forms: minimized, <OPTION checked>, and full, <OPTION checked="checked">. XML mandates the full form only, whereas HTML allows both, preferring the former. SXML supports the minimized form along with the full one: (OPTION (@ (checked))) and (OPTION (@ (checked "checked"))). Normalized SXML, however, accepts only the full form.
(some-name) ; An empty element without attributes
(some-name (@)) ; The same but in the normalized form
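
Two of the normalization rules above, the mandatory attribute list and the full form of boolean attributes, can be sketched as a tree rewrite. This is a partial illustration only: the names `normalize`, `normalize-attribute`, `has-attr-list?`, `special-tags`, `top-tag` and `attr-tag` are hypothetical, and the sketch neither removes comment and entity nodes nor moves <namespaces> nodes to *TOP*. It also assumes that an attribute with an omitted value carries no <aux> node.

```scheme
(define attr-tag (string->symbol "@"))
(define top-tag (string->symbol "*TOP*"))

;; Tags of non-element nodes, which this sketch leaves untouched.
(define special-tags
  (map string->symbol
       (list "@" "@@" "*PI*" "*COMMENT*" "*ENTITY*" "*NAMESPACES*")))

;; Expand a minimized attribute (name) into the full form (name "name").
(define (normalize-attribute attr)
  (if (null? (cdr attr))
      (list (car attr) (symbol->string (car attr)))
      attr))

(define (has-attr-list? elem)
  (and (pair? (cdr elem))
       (pair? (cadr elem))
       (eq? (car (cadr elem)) attr-tag)))

(define (normalize node)
  (cond ((not (pair? node)) node)                  ; character data
        ((eq? (car node) top-tag)                  ; descend from the root
         (cons top-tag (map normalize (cdr node))))
        ((memq (car node) special-tags) node)      ; PI, namespaces, ...
        (else                                      ; an element
         (let ((attrs (if (has-attr-list? node) (cdr (cadr node)) '()))
               (kids (if (has-attr-list? node) (cddr node) (cdr node))))
           (cons (car node)
                 (cons (cons attr-tag (map normalize-attribute attrs))
                       (map normalize kids)))))))

;; (br) becomes (br (@)); a minimized attribute gets its full form.
(normalize (list 'br))
(normalize (list 'option (list attr-tag (list 'checked))))
```

After normalization every element has its attribute list in the second position, so accessors no longer need to test for its presence.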
The figure below shows progressively more complex examples.
XML:
<WEIGHT unit="pound">
  <NET certified="certified">67</NET>
  <GROSS>95</GROSS>
</WEIGHT>
SXML:
(WEIGHT (@ (unit "pound"))
  (NET (@ (certified)) "67")
  (GROSS "95"))
XML:
<BR/>
SXML:
(BR)
XML:
<BR></BR>
SXML:
(BR)
XML:
<P><![CDATA[<BR>\r\n<![CDATA[<BR>]]]]>> </P>
SXML:
(*TOP*
  (P "<BR>\n<![CDATA[<BR>]]> "))
An example from the XML Namespaces Recommendation
XML:
<!-- initially, the default
     namespace is 'books' -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace
         for some commentary -->
    <p xmlns='urn:w3-org-ns:HTML'>
      This is a <i>funny</i> book!
    </p>
  </notes>
</book>
SXML:
(*TOP*
  (urn:loc.gov:books:book
    (urn:loc.gov:books:title
      "Cheaper by the Dozen")
    (urn:ISBN:0-395-36341-6:number "1568491379")
    (urn:loc.gov:books:notes
      (urn:w3-org-ns:HTML:p
        "This is a "
        (urn:w3-org-ns:HTML:i "funny")
        " book!"))))
Another example from the XML Namespaces Recommendation
XML:
<RESERVATION
  xmlns:HTML=
   'http://www.w3.org/TR/REC-html40'>
<NAME HTML:CLASS="largeSansSerif">
Layman, A</NAME>
<SEAT CLASS='Y'
  HTML:CLASS="largeMonotype">33B</SEAT>
<HTML:A HREF='/cgi-bin/ResStatus'>
Check Status</HTML:A>
<DEPARTURE>1997-05-24T07:55:00+1
</DEPARTURE></RESERVATION>
SXML:
(*TOP*
  (*NAMESPACES*
    (HTML "http://www.w3.org/TR/REC-html40"))
  (RESERVATION
    (NAME (@ (HTML:CLASS "largeSansSerif"))
      "Layman, A")
    (SEAT (@ (HTML:CLASS "largeMonotype")
             (CLASS "Y"))
      "33B")
    (HTML:A (@ (HREF "/cgi-bin/ResStatus"))
      "Check Status")
    (DEPARTURE "1997-05-24T07:55:00+1")))
Discussions with Kirill Lisovsky of MISA University are gratefully acknowledged. He shares the credit for this page. The errors are all mine.
[McCarthy] John McCarthy. Recursive Functions of Symbolic Expressions
and Their Computation by Machine, Part I. Comm. ACM, 3(4):184-195, April 1960.
<http://www-formal.stanford.edu/jmc/recursive/recursive.html>
[Clark1999] James Clark. XML Namespaces. February 4, 1999.
<http://www.jclark.com/xml/xmlns.htm>
[R5RS] R. Kelsey, W. Clinger, J. Rees (eds.). Revised^5 Report on
the Algorithmic Language Scheme. Higher-Order and
Symbolic Computation, Vol. 11, No. 1, September 1998,
and ACM SIGPLAN Notices, Vol. 33, No. 9, October 1998.
<http://www.schemers.org/Documents/Standards/R5RS/>
[SSAX] Oleg Kiselyov. Functional XML parsing framework: SAX/DOM and
SXML parsers with support for XML Namespaces and validation. September
5, 2001.
<http://pobox.com/~oleg/ftp/Scheme/SSAX.scm>
[SXML-short-paper] Oleg Kiselyov. XML and Scheme. An introduction to SXML and SXPath;
illustration of SXPath expressiveness and comparison with
XPath. September 17, 2000.
<http://pobox.com/~oleg/ftp/Scheme/SXML-short-paper.html>
[Lisovsky] Kirill Lisovsky. Case sensitivity of Scheme systems.
<http://pair.com/lisovsky/scheme/case-sensitivity.html>
[DOM] World Wide Web Consortium. Document Object Model (DOM) Level 1
Specification. W3C Recommendation.
<http://www.w3.org/TR/REC-DOM-Level-1>
[XML] World Wide Web Consortium. Extensible Markup Language (XML)
1.0 (Second Edition). W3C Recommendation. October 6, 2000.
<http://www.w3.org/TR/REC-xml>
[XML Infoset] World Wide Web Consortium. XML Information Set. W3C Recommendation. 24 October 2001.
<http://www.w3.org/TR/xml-infoset>
[XML Namespaces] World Wide Web Consortium. Namespaces in XML. W3C Recommendation. January 14, 1999.
<http://www.w3.org/TR/REC-xml-names/>
[XPath] World Wide Web Consortium. XML Path Language (XPath).
Version 1.0. W3C Recommendation. November 16, 1999.
<http://www.w3.org/TR/xpath>
oleg@pobox.com or oleg@acm.org or oleg@computer.org
Converted from SXML by SXML->HTML