Java™, XML, and JAXP
Arthur Griffith
ISBN: 0-471-20907-4
Contents
Introduction
Attributes
Comments
The Character Entities
The CDATA Section
DTD
Single File
Multiple Files
The DOCTYPE Inline Declaration
The DOCTYPE SYSTEM Declaration
The DOCTYPE PUBLIC Declaration
Comments
The ELEMENT Declaration
The ATTLIST Declaration
The ENTITY Declaration
Parameter Entities
Unparsed Entities
The PI Declaration
Conditional Inclusion with IGNORE and INCLUDE
Namespaces
A Simple Namespace
The Default Namespace
Multiple Namespaces
A Namespace Defined in a DTD
Multiple Namespaces in a DTD
Summary
mkdir
move
Summary
Glossary
Index
Introduction
Prerequisites
This is a beginner’s book for XML, but it is not a beginner’s book for general
computing or programming. The following things are required of the reader:
1. The fundamentals of Java must be understood. This includes the concepts of
inheritance, interfaces, static methods, properties, instantiation, and polymorphism.
If you have successfully written a few classes in Java, and you have a good
reference book, you have everything you need.
2. There should be some familiarity with the basic structure of HTML.
3. The reader must have Internet access and should have some rudimentary
understanding of URLs and the process of transferring files from one location to
another.
CHAPTER 1
Introduction to XML with JAXP
This chapter is an overview of some of the basic things you will need to know before
you can understand the processing of XML documents using Java’s Java API for XML
Processing (JAXP). Although the Java application programming interface (API) is rather
straightforward, you will need to understand how XML is constructed before you can
clearly understand the sort of things that can be done with the API. The fundamental
concepts described in this chapter include the fact that, like HTML, the XML language
is derived from SGML. This kinship between XML and HTML has brought about the ex-
istence of a hybrid known as XHTML. There are two completely distinct parsers, named
DOM and SAX, that can be used to read the contents of an XML document.
The JAXP package is a set of Java classes that implements XML document parsers,
supplies methods that can be used to manipulate the parsed data, and has special
transformation processors to automate the conversion of data from XML to another
form. For example, the other form can be a database record layout ready for storage, an
HTML Web page ready for display, or simply a textual layout ready for printing.
One of the outstanding features of XML is its fundamental simplicity. Once you un-
derstand how tags are used to create elements, it is easy to manually read and write
XML documents. With this basic XML understanding, and with knowledge of the Java
language, it is a straightforward process to understand the relationship between XML
and the Java API for manipulating XML. There are only a few classes in this API, and
it is only a matter of creating the appropriate set of objects, and they will supply the
methods you can call to manipulate the contents of an XML document. With these basic
concepts understood, and with the simplicity of the constructs involved, you can de-
sign and write programs while concentrating mostly on the problem you are trying to
solve, not on the mechanics of getting it done.
chine installed. And, because XML is also portable, the result is an almost uni-
versally portable system and can be used in exactly the same way on any com-
puter.
To fully understand the concepts discussed in the following paragraphs and chap-
ters, you should be familiar with Java, or familiarize yourself with the Java program-
ming language using Java tutorials. To understand how these classes do their jobs, you
will only need to understand Java classes, objects, interfaces, and methods. There is
nothing more complicated than a static method returning an object that implements an
interface; if you understand these fundamentals, you will have no problem with any-
thing in this book.
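The pattern described above, a static method returning an object that in turn hands back objects implementing interfaces, can be sketched with the DOM classes of JAXP. This is only a minimal illustration: the class name FactoryChain and the method rootName are made up for the sketch, and checked exceptions are wrapped to keep it short.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// FactoryChain and rootName are hypothetical names used only for this sketch.
class FactoryChain {
    /** Parses the text and returns the name of its root element. */
    static String rootName(String xml) {
        try {
            // A static method hands back a factory object...
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // ...the factory creates a parser...
            DocumentBuilder builder = factory.newDocumentBuilder();
            // ...and the parser returns an object implementing the Document interface.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints person
        System.out.println(rootName("<person><name>Karan Dirsham</name></person>"));
    }
}
```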
<paragraph>
The purpose of this type of XML document is to use
<italic>tags</italic> in such a way that the software that
reads the document will be able to <underline>organize</underline>
and <underline>format</underline> the text in such a way that
it is more presentable and easier to read.
</paragraph>
This form of XML looks a lot like HTML. In fact, this form of XML and HTML both
serve exactly the same purpose: to allow the software reading the document to extract
things from it and also to use the tags as formatting instructions to create a display
from the extracted text.
The same basic form can be used to package data, as in the following example:
<person>
<name>Karan Dirsham</name>
<street>8080 Holly Lane</street>
<city>Anchor Point</city>
<state>Alaska</state>
<zip>99603</zip>
</person>
This second form is more like a collection of fields that go to make up a data record, and
used this way, it can be a very convenient method for storing data and transmitting
information among otherwise incompatible systems. All that is necessary for successful
reception of transmitted data is for the recipient to understand the meanings of the
tags and be able to extract the data from them. Of course, by using the appropriate
application to read and process the data, any XML document can be easily formatted for
display. The process of extracting and formatting XML data is the primary subject of
this book.
Attributes can be used to specify options that further refine the meaning of the tags
to the process reading the document. These attributes can be used both for data defin-
ition and for formatting. For example, the following code has attributes:
<person font="Courier">
<name type="first" enhance="bold">Janie</name>
<name type="last" enhance="underline">Rorick</name>
</person>
Any program reading this document can apply its own interpretation to the mean-
ings of the tags and the options. No formatting information is included in an XML doc-
ument. All formatting is left entirely to the process reading and interpreting the XML
document. One program could read a document containing this example and take the
bold option to mean a different font, another could take it to mean a larger font, and an-
other could use it as an instruction to underline the text or display it in a different color.
Or the bold option could be ignored altogether. The only thing XML knows about the
attribute is the syntax required to include it with a tag.
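As a rough illustration of a reading program imposing its own interpretation, the following sketch reads the enhance attribute through the DOM API and maps it to plain-text markers. The mapping (asterisks for bold, underscores for underline) and the class name AttributeInterpreter are assumptions made up for this example; another program could just as well ignore the attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader: the meaning given to "enhance" is this program's choice.
class AttributeInterpreter {
    static String render(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList names = doc.getElementsByTagName("name");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < names.getLength(); i++) {
                Element name = (Element) names.item(i);
                String text = name.getTextContent();
                // XML only supplies the attribute value; the reader decides what it means.
                String enhance = name.getAttribute("enhance");
                if ("bold".equals(enhance)) {
                    text = "*" + text + "*";
                } else if ("underline".equals(enhance)) {
                    text = "_" + text + "_";
                }
                if (out.length() > 0) out.append(' ');
                out.append(text);
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<person font=\"Courier\">"
            + "<name type=\"first\" enhance=\"bold\">Janie</name>"
            + "<name type=\"last\" enhance=\"underline\">Rorick</name>"
            + "</person>";
        System.out.println(render(xml));   // prints *Janie* _Rorick_
    }
}
```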
If you have worked with HTML, you can see the similarities in the syntax of HTML
and XML. They are similar enough that it is possible to write an XML document that
uses only tag names known to a particular Web browser and then have that browser
read the document, impose its own interpretation on the tags and options, and
produce a displayed page. Displayable XML documents are written often enough that
a document of this type has a special name; it is called XHTML. There is more
information about XHTML later in this chapter.
XML is a nonprocedural computer language, as opposed to a procedural language such
as Java. A procedural language is one that consists of lists of instructions that are
expected to be obeyed one by one, usually in order from top to bottom. A nonprocedural
language is one that expects all of its instructions to be executed as if they were all
being executed simultaneously and, if necessary, to react to one another to create an
overall state or set of states. An example of nonprocedural processing is a spreadsheet
in which all the cells in the sheet that contain equations have their values calculated at
once, creating a static state of constant values displayed in the cells. This same sort of
thing happens in a Web browser where the HTML tags define the state (layout, colors,
text, pictures, fonts, and so on) that determines the appearance of a Web page.
DTD
DTD stands for Document Type Definition. Although DTD is normally treated sepa-
rately from XML tags, has a different syntax, and serves a different purpose, it is very
much a part of the XML language. The DTD section of an XML document is used to de-
fine the names and syntax of the elements that can be used in the document. There are
several steps involved in the creation and application of a DTD definition, and Chap-
ter 2 contains explanations of those steps along with a number of examples. Its source
can be included inline inside an XML document, or it can be stored in a separate docu-
ment from the XML text and tags. Because its purpose is to specify the correct format
of a marked up document, DTD is most useful if it is made available to several docu-
ments and is most often stored in a file separate from any XML document, which en-
ables it to be accessed from any number of XML documents. For the sake of simplicity,
however, most of the examples in this book have a simple DTD included as part of the
XML document.
DTD enables you to further refine the syntactical requirements of a set of XML
markup rules. In a DTD you can specify the allowed and disallowed content of each tag
that is to appear in an XML document. That takes at least a third of Chapter 2. DTD has
its limitations, but there are many different things that can be done by using it. You can
specify which elements are allowed to appear inside other elements as well as which el-
ements are required and where they are required. You can specify which attributes are
valid for each tag and even specify the set of possible values for each one. You can
create macro-like objects (called entities) that are expanded into text as the XML docu-
ment is parsed. All of this is explored in Chapter 2.
An XML document that conforms to the fundamental syntax of markup tags is
called well-formed. To be a well-formed document, all elements must have matching
opening and closing tags, and the tags must be nested properly. For example, the ex-
pression <p><b>text</b></p> is well-formed because the opening tags, <p><b>,
have closing tags, </b></p>, and they are nested properly. To be well-formed every
closing tag must be a match with the most recent tag that is still open. In short, all tags
must be closed in the exact reverse order in which they were opened. The expressions
<p><b>text</b> and <p><b>text</p></b> are not well-formed.
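The three expressions above can be checked mechanically. The sketch below, which assumes nothing beyond the standard JAXP DOM parser, treats any fatal parse error as "not well-formed"; the class name WellFormedCheck is made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for this sketch.
class WellFormedCheck {
    /** Returns true if the text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing its own messages.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // the parser reported a well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p><b>text</b></p>"));   // prints true
        System.out.println(isWellFormed("<p><b>text</b>"));       // prints false
        System.out.println(isWellFormed("<p><b>text</p></b>"));   // prints false
    }
}
```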
An XML document that conforms to the rules of its DTD is referred to as a valid doc-
ument. For a document to be successfully tested as being valid, it must also be well-
formed. Some parsers can check the document against the DTD definitions and throw
an exception if the document is not valid.
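A validating parse can be sketched as follows. The inline DTD here is a made-up one for a two-field person record, and the class name DtdValidation is hypothetical. The error handler turns validity errors, which the parser would otherwise only report, into exceptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the DTD below is invented for this sketch.
class DtdValidation {
    static final String DTD =
        "<!DOCTYPE person [\n"
      + "  <!ELEMENT person (name, city)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ELEMENT city (#PCDATA)>\n"
      + "]>\n";

    /** Returns true if the document is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);   // request DTD validation
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;   // treat a validity error as a failure
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Matches the DTD: prints true
        System.out.println(isValid(DTD
            + "<person><name>Karan Dirsham</name><city>Anchor Point</city></person>"));
        // Well-formed but missing the required name element: prints false
        System.out.println(isValid(DTD + "<person><city>Anchor Point</city></person>"));
    }
}
```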
Use of DTDs is very important to the portability of XML. If a DTD is well written
(that is, if all the tags are defined properly), a process can be written that will be able to
read and interpret any XML data from any document that conforms to the rules of the
DTD.
A single XML document can use more than one DTD. However, this multiple DTD
use can result in a naming collision. If two or more DTDs define a tag by the same
name, they will more than likely define that tag as having different characteristics. For
example, one could be defined as requiring a font attribute, whereas the other has no
such attribute. This problem is solved by using a device known as a namespace. An
element specified as being from one namespace is distinct from one of the same name
from another namespace. For example, if a pair of DTDs both include a definition for
an element with the tag name selectable, one DTD could be declared in the name-
space max and the other in the namespace scrim; then there would be the two distinct
tag names max:selectable and scrim:selectable available for use in the docu-
ment. Examples using namespaces are explained in Chapter 2.
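From the Java side, the two selectable elements can be told apart with a namespace-aware parser. The sketch below declares the max and scrim prefixes with xmlns attributes rather than in a DTD, and the two namespace URIs are placeholders invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the URIs are made-up placeholders.
class NamespaceDemo {
    static final String MAX_URI = "http://example.com/max";
    static final String SCRIM_URI = "http://example.com/scrim";

    static final String XML =
        "<doc xmlns:max='" + MAX_URI + "' xmlns:scrim='" + SCRIM_URI + "'>"
      + "<max:selectable>one</max:selectable>"
      + "<scrim:selectable>two</scrim:selectable>"
      + "</doc>";

    /** Returns the text of the first element with the given namespace and local name. */
    static String textOf(String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Element e = (Element) doc
                .getElementsByTagNameNS(namespaceUri, localName).item(0);
            return e.getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(MAX_URI, "selectable"));     // prints one
        System.out.println(textOf(SCRIM_URI, "selectable"));   // prints two
    }
}
```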
XSL
XSL stands for Extensible Stylesheet Language. It is used as a set of instructions for the
translation of the content of an XML document into another form, usually a presentation
form intended to be displayed to a human. An XSL program is actually, in itself, a
document that adheres to the syntax of XML. It contains a set of detailed instructions
for extracting data from another XML document and converting it to a new format. Per-
forming such transformation is the subject of Chapter 8.
The process of using XSL to change the format of the data is known as transformation.
Transformation methods are built into the JAXP that can be used to perform any data
format translation you define in an XSL document. These transformations can be pro-
grammed directly into Java instead of using XSL, but XSL simplifies things by taking
care of some of the underlying mechanics, such as walking through the memory-
resident parse tree to examine the source document. It also supplies you with some
built-in methods for doing commonly performed tasks such as configuring the parser
and handling error conditions.
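A transformation of this kind can be sketched with the javax.xml.transform classes. The stylesheet below is a small one written for this illustration (it emits the name and city of a person record as plain text), and the class name XslDemo is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; the stylesheet is invented for this sketch.
class XslDemo {
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/person'>"
      + "<xsl:value-of select='name'/>, <xsl:value-of select='city'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML text and returns the result. */
    static String transform(String xml) {
        try {
            // The stylesheet is itself an XML document, read like any other source.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<person><name>Karan Dirsham</name>"
            + "<street>8080 Holly Lane</street><city>Anchor Point</city></person>";
        System.out.println(transform(xml));   // prints Karan Dirsham, Anchor Point
    }
}
```

Note that the Java code never walks the parse tree itself; the transformer does that on the stylesheet's behalf.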
XSL performs a function—supplying human-readable data—that is every bit as im-
portant as XML itself. With the single exception of robotics, all the software in the
world is ultimately used to display data to humans. Nothing is ever stored in a data-
base without the expectation that it will be extracted and presented in some human-
readable format. In fact, presenting readable data is the entire purpose of the Internet.
Operating systems and computer language compilers only exist to support and create
other programs that, in turn, directly or indirectly present data in a form that can be un-
derstood by humans.
SAX
SAX stands for Simple API for XML. It is a collection of Java methods that can be used
to read an XML document and parse it in such a way that each of the individual pieces
of the input are supplied to your program. It is a very rudimentary form of parsing that
is not much more than a lexical scan: It reads the input, determines the type of things
it encounters (it recognizes the format of the nested tags and separates out the text that
is the data portion of the document), and supplies them to your program in the same
order in which they appear in the document.
The form of the data coming from a SAX parser can be very useful for streaming op-
erations such as a direct translation of tags or text into another form, with no changes
in order. If your application needs to switch things around, however, it will be neces-
sary for it to keep copies of the data so it can be reorganized. In many cases, it would
be easier to parse using the DOM parser. The SAX parser has the advantage of being
fast and small because it doesn’t hold anything in memory once it has moved on to the
next input item in the input document.
There are two versions of SAX. The original version is SAX 1.0 (also called SAX1).
The current version is SAX 2.0 (also called SAX2). SAX2 is an extension of the
definitions of SAX 1.0 to include things such as the ability to specify names using
namespaces. Both SAX1 and SAX2 are a part of JAXP. Because SAX1 is still a part of JAXP,
programs based on it will work, but much of it has been deprecated in the API to
promote use of SAX2 in all newly written programs. Only SAX2 is discussed in the
following chapters because it does everything SAX1 does and more.
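To give a feel for how a SAX2 parser hands pieces of the document to a program in document order, here is a minimal sketch that records each callback in a list. The class name SaxEvents and the "start", "text", and "end" labels are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for this sketch.
class SaxEvents {
    /** Parses the text and returns the SAX callbacks in the order received. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser is allowed to deliver one run of text in several calls.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<person><name>Janie</name></person>")) {
            System.out.println(event);
        }
    }
}
```

Nothing is retained by the parser between callbacks; the list exists only because this program chose to keep one.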
DOM
DOM stands for Document Object Model. It is a collection of Java methods that enable
your program to parse an input document into a memory-resident tree of nodes that
maintains the relationships found in the original input document. There are also meth-
ods that enable your application to walk freely about the tree and extract the informa-
tion stored there.
The internal form of the data tree resulting from a DOM parse is quite convenient if
you are going to be accessing document content out of order. That is, if your program
needs to rearrange the incoming data for its output, or if it needs to move around the
document and select data in random-access order, you should find that the DOM doc-
ument tree will provide what you need for doing this. You can search for things in the
tree and pull out what you need without regard to where it appeared in the input doc-
ument. One disadvantage of DOM is that a large document will take up a lot of space
because the entire document is held in memory. With modern operating systems, how-
ever, the document would need to be extremely large before it would adversely affect
anything. DOM also has the disadvantage of being more complicated to use than SAX.
Because DOM can randomly access the stored data, the API for it is necessarily more
complex. Although DOM is more complicated to use than SAX, it can be used to do
much more. For more details about how DOM works, see Chapters 3, 6, and 7.
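The out-of-order access described above might look like the following sketch, which parses a person record into a DOM tree and then pulls the fields out in a different order from the one in which they appear. The class name DomRandomAccess and the output layout are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class for this sketch.
class DomRandomAccess {
    /** Builds "zip city: name" from a person record, regardless of field order. */
    static String summary(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element person = doc.getDocumentElement();
            // The whole tree is in memory, so fields can be fetched in any order.
            return text(person, "zip") + " " + text(person, "city")
                + ": " + text(person, "name");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<person><name>Karan Dirsham</name>"
            + "<street>8080 Holly Lane</street><city>Anchor Point</city>"
            + "<state>Alaska</state><zip>99603</zip></person>";
        System.out.println(summary(xml));   // prints 99603 Anchor Point: Karan Dirsham
    }
}
```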
Internally in the JAXP, the DOM parser actually uses SAX as its lexical scanner. That
is, a SAX parser is used to read the document and break it down into a stream of its
components, and the DOM software takes this token stream and constructs a tree
from it. This is why it is best to have an understanding of SAX before trying to get a
clear idea of how JAXP DOM works. Although you may never use SAX directly, it’s a
good idea to know how it works and how the incoming document is broken down. At
the very least, you will need to be familiar with the meaning of its error messages and
how to process them in your application, which means you will need to know how
SAX works. For more details, see Chapters 3, 4, and 5.
SGML
SGML stands for Standard Generalized Markup Language. This is the parent markup
language of XML and HTML, which were both derived as special-purpose subsets of
SGML. Included in the 500-page SGML specification document is a definition of the
system for organizing and tagging elements in a document. It became a standard with
the International Organization for Standardization (ISO) in 1986, but the specification had
actually been in use some time before that. It was designed to manage large documents
so that they could be frequently changed and also printed. It is a large language defin-
ition and too difficult to actually implement, which has resulted in the subsets XML
and HTML.
XML works well being a subset of SGML because the complexity of SGML isn’t nec-
essary to do all of the tagging and transforming that needs to be done. Being a practi-
cal subset makes it much easier to write a parser for XML. Because of the reduction in
complexity of the language, XML documents are smaller and easier to create than
SGML documents would be. For example, where SGML always requires the presence
of a DTD, in XML the DTD is largely optional. If you are going to validate the correct-
ness of an XML document, the DTD is necessary, but otherwise it can be omitted.
XML is a bit closer to being like SGML than is HTML. For one thing, HTML is filled
with ambiguities because it allows things like an opening tag that has no closing tag to
match it. This prevents any attempts to standardize HTML because a parser cannot
predict what it will find. And many HTML extensions and modifications apply in one
place but do not apply in another. Although XML is extensible—you can add all the tag
types you wish—it is very strict in the way it allows you to do it. Like SGML, the for-
matting of XML can be controlled by XSL documents used for transformations.
XHTML
XHTML stands for Extensible Hypertext Markup Language. An XHTML document is
a hybrid of XML and HTML in such a way that it is syntactically correct for both of
them. That is, although an XHTML document can be displayed by a Web browser, it
can also be parsed into its component parts by a SAX or DOM parser. Both XML and
HTML are subsets of SGML, so the only problem in combining the two into XHTML
was in dealing with the places where HTML had departed from the standard format.
Most obvious are the many opening tags in HTML that do not have closing tags to
match them and the fact that tag nesting is not required.
XHTML was conceived so that, once Web browsers were capable of dealing with the
strict and standard forms required for XML, a more standardized form of Web page
could evolve. With XML it is relatively easy to introduce new forms by defining addi-
tional elements and attributes, and because this same technique is part of XHTML, it
will allow the smooth integration of new features with the existing ones. This capabil-
ity is particularly attractive because alternate ways of accessing the Internet are con-
stantly being developed. The presence of a standard, parsable Web page will allow
easier modification to the display format for new demands, such as the special re-
quirements of hand-held computers.
There is a fundamental difference between XML and HTML. XML is a subset of SGML,
while HTML is an application of SGML. That is, SGML does not have any tag names
defined and neither does XML. For both XML and SGML, a DTD must be used to define
and provide meanings for element names. On the other hand, HTML has a set of ele-
ment names already defined. The element names of HTML are the ones that have a
meaning to the Web browser attempting to format the page. This fundamental differ-
ence between XML and HTML can be overcome by the creation of a DTD that defines
the syntax for all the elements that are used in HTML. With such a DTD in place, an
XML document that adheres to the DTD’s definitions will also be an HTML document,
and thus it can be displayed using a Web browser.
JAXP
JAXP stands for Java API for XML Processing. It is a set of Java classes and interfaces
specifically designed to be used in a program to make it capable of reading, manipu-
lating, and writing XML-formatted data.
It includes complete parsers for SAX1 and SAX2 and the two types of DOM: DOM
Level 1 and DOM Level 2. Most of this book explores the use of parsers in extracting
data from an XML document. More information on both of these parsers is the subject
of Chapter 3. There are examples of using the SAX parsers in Chapters 4 and 5 and of
the DOM parsers in Chapters 6 and 7. All of the parsers check whether a document is
well-formed, and the parsers can be used in validating or nonvalidating mode, as de-
scribed in the DTD section earlier in this chapter. There is also an extensive API that can
be used to access the data resulting from any of these parsers. And although SAX1 and
DOM Level 1 are both present and working, the API for them is deprecated. Anything
you need to do can be accomplished with just the SAX2 and DOM Level 2 API.
The simplest way to get a copy of JAXP is to download and install the latest copy of
Java from Sun at http://java.sun.com. Beginning with Java version 1.4, the JAXP API is
included as part of Java 2 Standard Edition. It is also a part of the Java 1.3 Enterprise
Edition. If you want to use JAXP with a prior version of Java, you can get a copy of it
from the Web site http://java.sun.com/xml. You can use JAXP version 1.1 with the
Java Software Development Kit (SDK) version 1.1.8 or newer.
If, for some reason, you are staying with an older version of Java and downloading
JAXP as a separate API, you will need to download the documentation separately.
This documentation is in the form of a set of HTML pages generated from the source
code by the standard javadoc utility. It is in the same format as the documentation for
the rest of Java. There are two ways to install the standalone JAXP: You can include its
jar files in the same directories as your Java installation, or you can install them in their
own directories. If you do not elect to include them with Java, you will need to specify
the classpath settings when either compiling or running a JAXP application.
Ant
Ant is a tool used to compile Java classes. It isn’t limited to Java; it can be used for your
entire software development project. It performs the same job as the traditional make
utility by compiling only the programs that need to be compiled, but it has some
special features that cause it to work very well with Java. For one thing, it understands the
Java package organization and can use it when checking dependencies. It is an XML
application and, as such, uses an XML file as its input control file (much like the make
utility uses a makefile) to determine which modules to compile. The Java classes of
Ant are available in source code form, so you can, by extending the existing classes, add
any new commands and processing that you would like.
When compiling Java, Ant compares the timestamps of the source files to the timestamps
of the class files to determine which Java source files need to be compiled. Also, Ant
knows about the relationship between Java directory trees and packages, so it is
capable of descending your source tree properly to create classes within several packages.
For more information about Ant, with examples, see Chapter 9. The Web site for Ant
is http://jakarta.apache.org/ant/.
Summary
This chapter is an overview of the basics. You should now have some idea of what
JAXP is designed to do and a rough idea of how it does it. Using JAXP, a Java program
is capable of reading an XML document and analyzing its various parts for meaning or
formatting. There is more than one way to read a document: You can do it in a se-
quential manner with a SAX parser or browse about the document randomly by using
a DOM parser.
The next chapter takes a detailed look at the syntax of an XML document. As you
will see, its tags and content are straightforward and easy to read and write. The com-
plexities of the syntax are all contained in the DTD portion. Chapter 2 describes the
syntax and is designed to make it easy for you to use as a reference later.
CHAPTER 2
This chapter examines the syntactical format of an XML document. At the top of each
document is a heading making the declaration that it is an XML document. Next there
is an optional Document Type Definition (DTD) that defines some specific syntax rules
that must be followed for the remainder of the document. Finally, the body of the doc-
ument itself consists of text and the tags that are used to mark it up.
The body of an XML document contains text marked with tags that describe the text.
The original intent of XML was to ensure that humans could easily read and write XML
documents, so all XML documents contain only text. When binary data—such as an
image or an audio file—must be included, the binary file is stored separately, and the
XML document contains a reference to that file.
All of the text of an XML document is included between pairs of tags. If you are fa-
miliar with HTML, you know how this works (although XML is much more strict
about it than HTML). A tag can be identified by its name, and every opening tag has a
closing tag with the text itself sandwiched between them. Unlike HTML, when writing
an XML document, you can create your own tag names. In creating your own tag
names, however, it is essential that you use tags known to the process that will receive
and read the XML document. The program must recognize what the tags are and what
they mean. Most of this book concerns itself with processes that read XML documents
and process tag data.
Very much a child of the Internet, XML makes it easy to complete the text of a doc-
ument by including links to data and text files stored in remote locations. In fact, the
XML language itself has the ability to require the presence of data resources in a remote
location. These resources can be simply data, or they can be schema information used
to validate the correctness of the tags and the general format of a document. This means
that the parser—the program that reads an XML document—must be able to read in-
formation from another host on the Internet.
There are really two syntaxes built into XML. One is the tag-based form used for lay-
ing out the document itself, and the other is the DTD syntax that can be used by the
parser to validate the form and content of the text and attributes of tags throughout the
document. In other words, the tags define the content of a document, and the DTD can
be used by the parser as a sort of XML watchdog by making sure that the tags are used
correctly. As you will see in this chapter, the syntax is quite different and the two are
in separate locations in the document. The DTD is optional in that it carries no data; the
exact same XML document can be written with or without the DTD.
Names
Tag names are made up from specific characters. The first character of a name must be
one of the following:
A-Z a-z _ :
Following the first character of the name, the rest of the characters must each be one of
the following:
A-Z a-z 0-9 . - _ :
Upper- and lowercase characters are distinct. These are all characters from the ASCII
character set. If you are using Unicode, you can also use the alphabetic characters from
any other language as name starters and as internal name characters. As you can see,
digits are allowed in a name but not as its first character, and a name also cannot begin
with a period or a hyphen (minus sign). A name can be of any length.
There is a concept in XML that deals with grouping names into namespaces to sim-
plify the organization of complicated documents. There are more details on this later in
the chapter. If a name contains a colon, it can actually be two names; the part before the
colon is the namespace in which the name is found, and the part after the colon is the
actual name. Because the specification of namespace came along after the specification
for XML, and is still in a separate document, the use of colons is valid in any name.
However, namespace uses this fact to add some scoping to names.
Names beginning with the three-letter sequence XML (upper- or lowercase in any
combination) are all reserved for future use. You may be able to use a name starting
with these three letters and find that it is not currently reserved, but it may become re-
served in some future version of XML.
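If you would rather not hand-code the naming rules, one option is to let a DOM implementation judge a candidate name for you. The following is a minimal sketch, assuming the DOM implementation bundled with the JDK, which rejects an illegal element name with a DOMException. The class NameCheck and its two methods are hypothetical names made up for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical helper class; the name is an assumption for this example.
final class NameCheck {

    /** Returns true if the DOM implementation accepts the string as an element name. */
    static boolean isValidName(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.createElement(name);   // throws DOMException for an illegal name
            return true;
        } catch (DOMException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM parser available", e);
        }
    }

    /** Returns true if the name starts with "xml" in any mix of upper- and lowercase. */
    static boolean isReserved(String name) {
        return name.regionMatches(true, 0, "xml", 0, 3);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"folks", "my.tag-name", "-bad", "XmlThing"}) {
            System.out.println(name + " valid=" + isValidName(name)
                    + " reserved=" + isReserved(name));
        }
    }
}
```

Note that the DOM check only covers the character rules. The reserved-prefix test is kept separate because a name beginning with xml is still a legal name as far as the syntax is concerned.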
Strings
Throughout XML you will find quoted strings. The examples in this book almost ex-
clusively use double quotes to create these strings, but it doesn’t have to be that way.
It’s just a matter of style and personal preference. Anywhere you see a pair of double
quotes defining a string, a pair of single quotes (apostrophes) could have been used.
That is:
"This is a string"
'This is a string'
The ability to select the type of quotes makes it convenient for including either single
or double quotes as a character inside a string. For example, if you need to include a
single quote character inside a string, simply create the string using double quotes, like this:
"This string's apostrophe is safe"
The same is true for the opposite situation; you can include double quotes in a string
created using single quotes. For another way to insert quotes, and other special char-
acters, see the discussion on entities later in this chapter.
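Quoted strings turn up most often as attribute values, so a quick way to try out both quoting styles is to parse a small element and read its attributes back. This is only a sketch: the class QuoteDemo, the element note, and the attribute names a and b are invented for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; element and attribute names are assumptions.
final class QuoteDemo {

    // Attribute a is double-quoted and holds an apostrophe;
    // attribute b is single-quoted and holds double quotes.
    static final String SAMPLE = "<note a=\"It's here\" b='She said \"hi\"'/>";

    /** Parses the XML text and returns the value of the named attribute of the root. */
    static String attribute(String xml, String name) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getAttribute(name);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attribute(SAMPLE, "a"));
        System.out.println(attribute(SAMPLE, "b"));
    }
}
```

Either way the surrounding quotes are only delimiters; they never become part of the value the application sees.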
Whitespace
The whitespace characters are listed in Table 2.1. All other characters that are valid
in XML are assumed to be part of the XML document. (Control characters such as
backspace and form feed are not legal in an XML 1.0 document at all.)
space            0x20
tab              0x09
line feed        0x0A
carriage return  0x0D
The handling of whitespace is largely up to the application program reading the
document, but there are some actions that will be taken automatically by the parser unless
you specify otherwise. For example, it’s normal for the parser to strip out some of the
whitespace characters to allow the document to be formatted in a readable fashion. If
you’ve worked with HTML, you’re familiar with this type of action. For example, it’s
customary to indent the tags to help indicate which ones are nested inside others, as follows:
<outer>
    <inner>
        <bold>Text of the inner tag.</bold>
    </inner>
</outer>
In this example it is perfectly all right for the parser to strip out all of the whitespace be-
tween the tags. However, if you have text that should retain all of its whitespace char-
acters exactly as written, you can command the parser to not make any changes by
using the xml:space attribute as follows:
<poem xml:space="preserve">
Algy met a bear.
The bear was bulgy.
The bulge was Algy.
</poem>
If the spacing is allowed to default, however, there is no standard way that white-
space is to be handled by the parser, and your application would have no way of know-
ing how the text was originally formatted. An example is:
<outer>
    <inner>
        Text of the inner tag.
    </inner>
</outer>
This example demonstrates a situation where the newline character at the end of the
text and the blanks in front of it could each be reduced to a single space (as is done in
HTML). The problem is that different parsers may handle this type of situation in dif-
ferent ways; some will leave them in and others will strip them out.
Bottom line: Use xml:space="preserve" in every case where whitespace reten-
tion is important. Otherwise, assume that any sequence of two or more whitespace
characters will be reduced to a single space. But remember that the default action will
vary from one parser to the next; there are even parsers that ignore any settings you
may have for xml:space.
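Because the behavior varies, it can be worth asking your own parser what it does. The sketch below assumes the DOM parser bundled with the JDK in its default, nonvalidating configuration, which hands the whitespace to the application untouched; another parser or another configuration may behave differently. The class WhitespaceDemo and the collapse() policy are assumptions made for this example, not part of JAXP.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; names are assumptions for this example.
final class WhitespaceDemo {

    static final String SAMPLE =
            "<outer>\n  <inner>\n    Text of the inner tag.\n  </inner>\n</outer>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    /** Counts the nodes directly under the root, whitespace-only text nodes included. */
    static int rootChildCount(String xml) {
        return parse(xml).getDocumentElement().getChildNodes().getLength();
    }

    /** Returns the text of the first inner element exactly as the parser delivered it. */
    static String innerText(String xml) {
        return parse(xml).getElementsByTagName("inner").item(0).getTextContent();
    }

    /** One possible application policy: trim the ends and collapse runs to one space. */
    static String collapse(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("children of root: " + rootChildCount(SAMPLE));
        System.out.println("[" + collapse(innerText(SAMPLE)) + "]");
    }
}
```

Doing the collapsing in the application, as collapse() does here, keeps the result the same no matter which parser delivered the text.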
ognizing the form of the following simple document. The data contained in the docu-
ment shown in Listing 2.1, which is the text outside of the tags, lists the names and con-
tact information of a couple of people.
The first line of Listing 2.1 identifies it as an XML document. This is not strictly re-
quired, but it is a very good idea to include it at the top of every document so the type
and version of your document can be verified. This type of instruction is known as a
processing instruction (PI) because its purpose is to pass instructions to the process that
will be reading the document.
The data contained in the document is the text, and each piece of text is enclosed be-
tween a pair of tags—one opening and one closing tag—and the tags specify the mean-
ing of the text. For example, the name of each person is preceded by a <name> tag and
is terminated by a </name> tag. These tags give the data in an XML document identity
Notice how the tag pairs are nested one inside the other. The outermost tag pair,
which is <folks> at the top and </folks> at the bottom, is called the root tag. The
outermost pair of tags in every XML document are the root tags. All of the text of an
XML document is enclosed by the root tag pair, and, inside the pair of root tags, there
are normally other tag pairs that define pieces of the text further.
XML Declaration
Every XML document should have, as its first line, an XML declaration that specifies
the version number. At the time of writing, the only version of XML is 1.0, so a minimum
declaration line looks like the following:
<?xml version="1.0"?>
This declaration is not an actual XML requirement, but it’s a very good idea. It’s im-
portant to include the version number because it’s possible that there will be future ver-
sions of XML that contain features not compatible with version 1.0, and the version
number may be crucial to how future parsers deal with XML documents.
<?xml version="1.0"?>
<folks>
    <person>
        <name>Bertha D. Blues</name>
        <phone>907 555-8901</phone>
        <email>[email protected]</email>
    </person>
    <person>
        <name>Fred Drew</name>
        <phone>907 555-9921</phone>
        <email>[email protected]</email>
    </person>
</folks>
Other information is often included with the declaration. The following example
specifies that the text is encoded as UTF-8, a variable-length Unicode encoding in which
plain ASCII characters occupy a single byte each. Also, this declaration specifies that
the XML document doesn’t make references to external documents, so it can be read
and used as completely self-contained data.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The encoding value is the name of the character set used to write the XML document.
For the encoding value to have any meaning, the name must be recognized by
the program reading and processing the document. Because an XML parser assumes
UTF-8 when no encoding is declared (UTF-16 is recognized by its byte-order mark),
there’s no need to specify the encoding unless you’re going to do something
unusual. If you know the document will be limited to 7-bit ASCII, you can
declare the encoding to be "US-ASCII". Also, a declaration of "UTF-8" works for
standard 7-bit ASCII because UTF-8 includes all of the ASCII characters (values from
0x00 through 0x7F), but it’ll also allow for the recognition of properly encoded Unicode
characters. Java easily reads and writes this format without you having to specify
anything else. If the text uses what is commonly called the extended ASCII characters
(values from 0x80 through 0xFF), the encoding should be one of the names "ISO-8859-1"
through "ISO-8859-9" and cannot be "US-ASCII" or "UTF-8".
There are many other encodings possible. If you’re going to need some special
encoding, it would be best to use one of the Internet Assigned Numbers Authority
(IANA) character-set names. For the official and exhaustive list take a look at
http://www.iana.org/assignments/character-sets.
The standalone attribute specifies whether external documents are referenced
from inside this XML document. If standalone is "yes", it specifies that there are no
external documents referenced, and the parser, knowing it only needs to work with this
one file, can execute in a way that makes the processing more efficient. If it is "no", the
document may—but doesn’t necessarily have to—refer to an external document.
The declaration is in the form of a PI that begins with the character pair <? and is
terminated by ?>. This XML declaration is a very special PI that appears at the top of
every document. There’s more information on the purpose and form of PIs later in this
chapter.
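Once a document has been parsed into a DOM tree, the values from the XML declaration can be read back from the Document object. The following sketch assumes a parser that supports the DOM Level 3 methods getXmlVersion(), getXmlEncoding(), and getXmlStandalone(), as the parser bundled with current JDKs does. The class DeclarationDemo and its describe() method are names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class; names are assumptions for this example.
final class DeclarationDemo {

    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><folks/>";

    /** Parses the XML text as UTF-8 bytes and summarizes its XML declaration. */
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return "version=" + doc.getXmlVersion()
                    + " encoding=" + doc.getXmlEncoding()
                    + " standalone=" + doc.getXmlStandalone();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE));
    }
}
```

The document is fed to the parser as bytes rather than as a character stream so that the parser, not the Java program, is the one interpreting the encoding.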
All tags have both an opening and a closing. In the following example the opening
tag is <paragraph> and the closing tag is </paragraph>. The opening and closing
tags have the same name. An opening tag has no slash character, whereas the closing
tag always has one:
<paragraph>This is the text of the paragraph.</paragraph>
A tag pair, along with any text or other tag pairs it contains, is a single unit known
as an element. An element may contain any amount of text and any number of elements.
An XML document is made up of a collection of elements nested one inside the other.
The outermost element is the root element.
NOTE XML names are case sensitive. The tags <subtitle>, <Subtitle>,
<SubTitle>, and <SUBTITLE> are all different tags. Likewise, keywords of the
XML language are case sensitive; although DOCTYPE and ELEMENT are both
keywords in XML, Doctype and element are not.
An element isn’t required to contain anything between its opening and closing tags.
The following is an example of an empty element:
<paragraph></paragraph>
Empty elements occur often enough that there’s a special shorthand notation for them.
Following is an example of the shorthand notation for an empty element:
<paragraph/>
In this example, there’s only one tag, but because the trailing slash is included, it acts
as both the opening and closing tag of an empty element.
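To convince yourself that the two forms of an empty element are the same thing to the parser, you can parse each one and ask whether the resulting element has any children. This is a small sketch under the assumption that the JDK's default DOM parser is used; the class EmptyElementDemo is a made-up name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the name is an assumption for this example.
final class EmptyElementDemo {

    /** Returns true if the root element of the XML text has no child nodes at all. */
    static boolean isEmpty(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return !root.hasChildNodes();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isEmpty("<paragraph></paragraph>"));
        System.out.println(isEmpty("<paragraph/>"));
        // A single blank between the tags is text, so this one is not empty.
        System.out.println(isEmpty("<paragraph> </paragraph>"));
    }
}
```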
Attributes
It’s possible to specify one or more attributes as part of an opening tag. The attributes
normally modify or amplify the meaning of the tag. For example, the following form
could be used to specify the font to be applied to the text of a paragraph:
All attributes take this same form. There is an attribute name (font in this example), fol-
lowed by an equal sign and a quoted string. The quoted string is the value of the named
attribute. It is possible to include more than one attribute with a tag. An example is:
As with all quoted strings in XML, you can use double quotes or single quotes to
specify the value. An example is:
The ability to use more than one kind of quote is more than a simple convenience. It
provides an easy way to include quote marks inside the attribute value, such as:
Single and double quote characters, along with other special characters, can also be
inserted by using the predefined entities described in the “Entities” section of this chapter.
Attributes can be specified inside the short form of empty elements, like this:
Comments
Anywhere you can put a tag, you can put a comment. A comment is normally ignored
by the process reading the document; it is only used to clarify the following or
surrounding XML. A comment begins with the four-character sequence <!-- and
ends with the three-character sequence -->, such as:
<!-- This is a comment -->
Because XML is a free-form language, a single comment can spread across several lines
and look like the following:
<!--
This is a comment that continues for more than
one line. This form can be used for a descriptive block
of text at the top or to try to make sense of
some XML element that could be somewhat obscure.
-->
Once this element is parsed, the resulting string will look like the following:
Table 2.2 lists the predefined entities that enable you to insert characters that would
otherwise cause problems with the parser.
You can use an entity to insert any character by specifying the Unicode or ASCII nu-
meric value of the character. This comes in handy when you want to insert a character
that has no key on your keyboard, such as a Greek character. This is particularly true of
Unicode because it includes the characters from every alphabet. The example in Listing
2.2 demonstrates the two different ways of specifying a Greek alphabet character:
Just like any other entity, when specifying a numeric value, you have to precede the
entity with an ampersand and terminate it with a semicolon. The # character is used to
indicate a numeric entity. If the first character in the number is a small x, the digits are
interpreted as hexadecimal digits; otherwise, they are assumed to be decimal digits.
You can define entities of your own by including their definitions in the DTD, as de-
scribed later in this chapter.
<?xml version="1.0"?>
<charents>
    <fromhex>
        The character &#x3A3; is an uppercase sigma.
    </fromhex>
    <fromdec>
        The character &#931; is an uppercase sigma.
    </fromdec>
</charents>
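From the application's point of view the two numeric forms are indistinguishable, because the parser replaces each one with the character itself before your code sees the text. The following sketch parses a hexadecimal reference, a decimal reference, and the five predefined entities. The class CharEntityDemo and the element name c are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; names are assumptions for this example.
final class CharEntityDemo {

    /** Parses the XML text and returns the text content of its root element. */
    static String text(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        // Hexadecimal 3A3 and decimal 931 are the same code point, uppercase sigma.
        System.out.println(text("<c>&#x3A3;</c>").equals(text("<c>&#931;</c>")));
        // The five predefined entities become the literal characters.
        System.out.println(text("<c>&lt;&amp;&gt;&quot;&apos;</c>"));
    }
}
```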
<?xml version="1.0"?>
<progs>
    <function>
        <![CDATA[
int frammis() {
    if((a > b) && (a < 3.4))
        return(1);
    return(0);
}
        ]]>
    </function>
</progs>
NOTE There is also a CDATA in DTD that should not be confused with this
one. They are for entirely different purposes and are not related.
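A short experiment shows what the CDATA section buys you. In the sketch below, the same line of program text is parsed twice: once wrapped in a CDATA section and once bare. It assumes the JDK's default DOM parser; the class CdataDemo and its method names are invented for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names are assumptions for this example.
final class CdataDemo {

    static final String CODE = "if((a > b) && (a < 3.4)) return(1);";
    static final String WRAPPED = "<function><![CDATA[" + CODE + "]]></function>";
    static final String BARE = "<function>" + CODE + "</function>";

    private static Document parse(String xml)
            throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Returns true if the text can be parsed as XML without a fatal error. */
    static boolean isWellFormed(String xml) {
        try {
            parse(xml);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the text content of the root element. */
    static String text(String xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text(WRAPPED));       // the program text, character for character
        System.out.println(isWellFormed(BARE));  // the bare ampersands are rejected
    }
}
```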
DTD
The tags in an XML file are, by default, completely free form as long as the basic syn-
tax of the XML language is followed. You can, however, impose syntactic specifications
on each of the tag names by including a DTD in your XML document. Remember, a
DTD doesn’t add any meaning to the elements of a document; it only refines and ex-
tends the syntactic requirements. You can use the DTD to specify the type of data that
can be included in an element, the relative order and position of the elements, and
which elements can be nested inside other elements.
The DTD syntax is quite different from that of the rest of the XML document. It pri-
marily consists of a list of all the tag names and a specification of the form each one will
take.
You can include the text of the DTD as part of the XML document itself, or you can
put the DTD into a separate file and simply refer to the file from the XML document.
The advantage of having the DTD as a separate file is that you can use it to define the
tag format for any number of XML documents. In fact, you can combine formats to-
gether by having a single XML document use the formatting defined in more than one
DTD file. The disadvantage of having the DTD as a separate file is that if you send the
XML document to someone else, they will normally need to use the same DTD file so
that the syntax of both the form and content of the elements can be recognized.
NOTE Even though the syntax of DTD is decidedly different from the rest of
XML, and the presence of DTD is optional, it’s still a part of the XML language.
The DTD text is read and processed by the same parser that reads and
processes the other parts of an XML document.
Single File
The document in Listing 2.4 shows how to use the DOCTYPE keyword to insert the text
of the DTD information inside the XML document.
Notice that standalone is set to "yes" in Listing 2.4 to indicate that everything is
contained in a single file. It would have been perfectly valid to set standalone to
"no" so that the parser would be ready to handle multiple files even though only one
file is used. Some parsers, however, will be more efficient if they know up front that
everything is going to be in one file.
To specify the DTD information, the keyword DOCTYPE is used as shown in Listing
2.4. Everything inside the opening bracket and closing bracket of the DOCTYPE decla-
ration is a part of the DTD. The DOCTYPE declaration itself requires the name of the root
element (in this example it’s the folks element) of the XML document.
In this example DTD there are five ELEMENT declarations. An ELEMENT declaration
specifies the contents of an element. The root element, named folks, is allowed to con-
tain only person elements. The number of person elements that can be contained is
specified by the asterisk modifier, which means there may be zero or more person el-
ements listed inside a folks element. And, because the person element is the only
thing specified for the folks element, a person element is the only thing it can con-
tain. Of course, the person element has a DTD definition of its own, but the content re-
quirements of a person element are independent of the content requirements of the
folks element.
Also in Listing 2.4, the person element must contain the three elements named
name, phone, and email. And because there is no occurrence modifier (like the aster-
isk in the folks element), each of these three elements must appear exactly once. Also,
because they’re separated by commas, they must appear in exactly the order in which
they appear in the DTD.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE folks [
<!ELEMENT folks (person)*>
<!ELEMENT person (name, phone, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
<folks>
    <person>
        <name>Bertha D. Blues</name>
        <phone>907 555-8901</phone>
        <email>[email protected]</email>
    </person>
    <person>
        <name>Fred Drew</name>
        <phone>907 555-9921</phone>
        <email>[email protected]</email>
    </person>
</folks>
The elements name, phone, and email are specified to contain #PCDATA. This
means they can only contain parsed character data. In other words, they can only contain
a character string that does not contain any other elements. Because the character string
is parsed, it may contain things like the character entities that are predefined as part of
the XML language.
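To see the DTD acting as a watchdog, you can ask a JAXP parser to validate as it reads. The sketch below turns validation on with setValidating(true) and collects whatever the parser reports through an error handler. It uses the DTD from this section; the class InlineDtdDemo, the method validationErrors(), and the example.com mail address are assumptions made for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the placeholder data is not from the book.
final class InlineDtdDemo {

    static final String DOCTYPE =
              "<!DOCTYPE folks [\n"
            + "<!ELEMENT folks (person)*>\n"
            + "<!ELEMENT person (name, phone, email)>\n"
            + "<!ELEMENT name (#PCDATA)>\n"
            + "<!ELEMENT phone (#PCDATA)>\n"
            + "<!ELEMENT email (#PCDATA)>\n"
            + "]>\n";

    static final String VALID = DOCTYPE
            + "<folks><person><name>Bertha D. Blues</name>"
            + "<phone>907 555-8901</phone><email>bertha@example.com</email>"
            + "</person></folks>";

    // The phone element is missing, which breaks the (name, phone, email) rule.
    static final String INVALID = DOCTYPE
            + "<folks><person><name>Bertha D. Blues</name>"
            + "<email>bertha@example.com</email></person></folks>";

    /** Parses with validation on and returns the validity errors that were reported. */
    static List<String> validationErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());   // validity errors arrive here
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("document could not be parsed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid document:   " + validationErrors(VALID));
        System.out.println("invalid document: " + validationErrors(INVALID));
    }
}
```

Validity problems are reported to the error handler rather than thrown, so a document that breaks the DTD rules still produces a DOM tree unless your handler decides otherwise.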
Multiple Files
The example in Listing 2.5 is the same as the one in Listing 2.4, except that the DTD is
stored in a separate file. An advantage to using this approach is that a separate file
makes it easy to create several documents based on the same DTD without having to
duplicate the DTD in every document. An additional advantage is that if the recipient
of a transmitted document already has a copy of the DTD, there’s no need to send an-
other one. In fact, because the DTD is only for syntax checking, an application knows
how to read the contents of a document formatted by the rules of a DTD. Therefore,
there is no need for the recipient to refer to the DTD at all unless, for some reason, the
receiver of the document does not trust the sender to format it correctly. The only pur-
pose of the DTD is to check the syntax of a document.
<?xml version="1.0" standalone="no"?>
<!DOCTYPE folks SYSTEM "SimpleDoc.dtd">
<folks>
    <person>
        <name>Bertha D. Blues</name>
        <phone>907 555-8901</phone>
        <email>[email protected]</email>
    </person>
    <person>
        <name>Fred Drew</name>
        <phone>907 555-9921</phone>
        <email>[email protected]</email>
    </person>
</folks>
The DOCTYPE keyword still serves the same purpose as before, but this time the
SYSTEM keyword is used to precede the name of the file containing the DTD
specifications. The DTD file is in the same directory as this XML file, is named
SimpleDoc.dtd, and contains the content shown in Listing 2.6.
This file contains only the definitions that go inside the DOCTYPE block, but not the
DOCTYPE declaration itself (DOCTYPE declarations are discussed in detail in the fol-
lowing sections of this chapter). It does, however, have an XML declaration at the top.
Also, the DTD file must include the encoding because the content of the file is going to
be read and analyzed by the parser, and the parser needs to know what encoding
scheme is being used.
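When the DTD lives outside the document, the parser has to go and find it. In JAXP you can take charge of that step by installing an EntityResolver, which is handy when the DTD is held somewhere other than the location named in the document. The sketch below supplies the text of SimpleDoc.dtd from a string instead of from a file; it assumes the JDK's bundled parser, and the class ExternalDtdDemo, the example.com address, and the fax element used to provoke an error are all made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the DTD text stands in for the file SimpleDoc.dtd.
final class ExternalDtdDemo {

    static final String DTD_TEXT =
              "<!ELEMENT folks (person)*>\n"
            + "<!ELEMENT person (name, phone, email)>\n"
            + "<!ELEMENT name (#PCDATA)>\n"
            + "<!ELEMENT phone (#PCDATA)>\n"
            + "<!ELEMENT email (#PCDATA)>\n";

    static final String HEADER =
            "<?xml version=\"1.0\"?>\n<!DOCTYPE folks SYSTEM \"SimpleDoc.dtd\">\n";

    static final String VALID = HEADER
            + "<folks><person><name>Fred Drew</name><phone>907 555-9921</phone>"
            + "<email>fred@example.com</email></person></folks>";

    // The fax element is not declared anywhere in the DTD.
    static final String INVALID = HEADER
            + "<folks><person><name>Fred Drew</name><phone>907 555-9921</phone>"
            + "<email>fred@example.com</email><fax>907 555-3333</fax></person></folks>";

    /** Validates against the external DTD and returns the reported validity errors. */
    static List<String> validationErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Hand the parser the DTD text whenever it asks for SimpleDoc.dtd;
            // returning null lets the parser locate anything else by itself.
            builder.setEntityResolver((publicId, systemId) ->
                    systemId != null && systemId.endsWith("SimpleDoc.dtd")
                            ? new InputSource(new StringReader(DTD_TEXT))
                            : null);
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("document could not be parsed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid document:   " + validationErrors(VALID));
        System.out.println("invalid document: " + validationErrors(INVALID));
    }
}
```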
declarations directly inline as a part of the XML document. Its basic form is shown in
the following example:
<!DOCTYPE folks [
<!ELEMENT folks (person)*>
. . .
]>
All of the DTD declarations are included between the DOCTYPE opening and closing
brackets. This can include any combination of ELEMENT, ATTLIST, ENTITY, and
NOTATION declarations. (The conditional sections IGNORE and INCLUDE can be used
only in an external DTD file.) And, of course, there can be any number of comments
intermixed with the DTD definitions.
The following example, another relative URI, specifies the file named insub.dtd in
a local subdirectory named diction:
An absolute URI can be used to specify the address of a file anywhere on the Internet.
The following example shows the URI of a DTD document named SimpleDoc.dtd
that’s stored and readily available on the Internet. This very simple technique can be
used to make sure that everyone is using the same DTDs to define the formats of the
same set of documents:
NOTE The URI string will accept both forward and backward slashes to
accommodate the naming requirements of different operating systems.
It is possible to use a SYSTEM DTD and, at the same time, include some DTD modi-
fications that apply to the local document. You can include any statement that could
also be written directly into the existing DTD. That is, you cannot override and replace
an existing definition, but you can add definitions to the DTD. For example, you can
add a new attribute definition for an existing tag, as in Listing 2.7.
This example uses the same DTD file that was used in previous examples, but this
time the phone element is modified by ATTLIST to add the option of specifying an
extension attribute. Details of ATTLIST are described later in this chapter.
<folks>
    <person>
        <name>Bertha D. Blues</name>
        <phone extension="409">907 555-8901</phone>
        <email>[email protected]</email>
        <fax>907 555-3333</fax>
    </person>
</folks>
prefix//owner//description//language ID
If the prefix is ISO, the DTD is an approved ISO standard. If the prefix is +, the DTD
is an approved standard, but it is not an ISO standard. If the prefix is -, the DTD is an
ISO standard proposal that has not yet been approved. The owner is the name, or an
acronym, identifying the owner of the DTD specification. The description is a brief de-
scription of the DTD. The language ID is a two-letter ISO 639 specification of the lan-
guage of the DTD.
The name of the DTD comes before the URI that specifies its location. The following
example specifies that the DTD is to be the strict version of the W3C definition of
HTML version 4.01:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        The heading of the XHTML page.
    </head>
    <body>
        The body of the XHTML page.
    </body>
</html>
In this example the name specifies that the DTD is a proposed ISO standard owned by
the W3C organization. The descriptive name of the DTD is DTD HTML 4.01 and the
language is EN (English). Judging by the URI, it seems this DTD is a very strict imple-
mentation of the standard; there are other versions at the same site of both transitional
and loose implementations.
Also, as you would expect, you can make local additions to the DTD using the same
technique as described earlier for the SYSTEM declaration.
Comments
It doesn’t matter whether the DTD text is in the same file or in a separate file; the for-
mat of comments is the same as in any other section of an XML document. A valid com-
ment looks like this:
<!-- Comments in DTD are in the same format as the rest of XML-->
This means that, as long as the syntax is correct, anything goes. All of the following are
valid nolimit elements:
<!ELEMENT hr EMPTY>
An empty element is one that contains no text and no other elements. The following ex-
amples show the two possible forms of an empty element:
<hr></hr>
<hr/>
Note that even though an empty element cannot contain data between the tags, it
can still have attributes defined for it. Details on declaring and using attributes are de-
scribed later in this chapter. The following is an example of defining and using an op-
tional attribute with the hr element:
<!ELEMENT hr EMPTY>
<!ATTLIST hr style CDATA #IMPLIED>
. . .
<hr style="reversed"/>
#PCDATA must be included in parentheses. The name PCDATA is short for parsed
character data, so-called because the parser actually reads through the text to find em-
bedded tags or entities. At the very least, the parser has to scan the text to find the clos-
ing tag. The textonly tag, however, does not allow for any embedded tags; it can only
be used for text as follows:
If you wish to specify that the content can be text intermixed with tags, you can use
the vertical bar to separate the items and add an asterisk at the end to specify that the
items in the parentheses can each be repeated any number of times as shown in the
following example:
<!ELEMENT textag (#PCDATA | hr)*>
An element of the type textag can contain text with an unlimited number of hr el-
ements embedded in it. You can extend this format to specify any number of element
names that can be embedded in the text by listing them this way:
The vertical bars between the items indicate that the content can be any one of the
items, and the asterisk at the end indicates that the item can be repeated any number
of times. The result is that the content can be any amount of text with the named
elements embedded in it in any order.
The vertical bar between two elements is the OR operator, and it indicates that one
or the other may be used, but not both. As you can see, the OR operator can be used in
a sequence specifying that only one of the members of the list can be selected. The
following example specifies that the content of pkall must be all three of the named
elements:
<!ELEMENT pkall (name, address, phone)>
A comma between two elements is an AND operator, which indicates that both ele-
ments must be included, and they must be included in the order in which they are
listed. The pkall element must include the elements name, address, and phone,
and they must appear in that order. The following examples demonstrate how these
two forms can be combined to create a more complicated rule:
<pkone>
    <name>Fred Drew</name>
</pkone>
<pkone>
    <phone>555-1028</phone>
</pkone>
<pkall>
    <name>Fred Drew</name>
    <address>1313 Luck St</address>
    <phone>555-1028</phone>
</pkall>
<pkchoose>
    <name>Fred Drew</name>
    <address>1313 Luck St</address>
    <phone>555-1028</phone>
</pkchoose>
<pkchoose>
    <name>Quintus Drew</name>
    <address>1315 Luck St</address>
    <email>[email protected]</email>
</pkchoose>
The operators listed in Table 2.3 specify how often an item may be repeated. For ex-
ample, the following definition specifies that the email address is optional, but there
must be at least one (and possibly more) phone number:
The occurrence operators can also be applied to sets of items in parentheses. For
example, the following specifies that a chlist element can contain any number of names
and addresses. Each name and address must be accompanied by at least one phone or
email element but can have any number of phone and email elements:
<!ELEMENT chlist (name, address, (phone | email)+)*>
OPERATOR  DESCRIPTION
?         The item may be omitted but, if included, it can only appear once.
*         The item may be omitted and, if included, it can be repeated any
          number of times.
+         The item must be included at least once, and it may be repeated
          any number of times.
none      The item must be included once.
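The OR operator, the AND operator, and the occurrence operators are easiest to get a feel for by validating small fragments against them. The following sketch does that with a validating JAXP parser. The pkall rule follows the description in the text; the declarations for pkone, pkchoose, and contact are reconstructed assumptions, as are the class ContentModelDemo and the sample data.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; several declarations below are assumptions.
final class ContentModelDemo {

    static final String DECLARATIONS =
              "<!ELEMENT pkone (name | address | phone)>\n"          // exactly one of three
            + "<!ELEMENT pkall (name, address, phone)>\n"            // all three, in order
            + "<!ELEMENT pkchoose (name, address, (phone | email))>\n"
            + "<!ELEMENT contact (name, phone+, email?)>\n"          // 1+ phones, 0-1 email
            + "<!ELEMENT name (#PCDATA)>\n"
            + "<!ELEMENT address (#PCDATA)>\n"
            + "<!ELEMENT phone (#PCDATA)>\n"
            + "<!ELEMENT email (#PCDATA)>\n";

    /** Validates a fragment whose root element is rootName; returns the error count. */
    static int errorCount(String rootName, String fragment) {
        final List<String> errors = new ArrayList<>();
        String xml = "<!DOCTYPE " + rootName + " [\n" + DECLARATIONS + "]>\n" + fragment;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("fragment could not be parsed", e);
        }
        return errors.size();
    }

    public static void main(String[] args) {
        System.out.println(errorCount("pkone", "<pkone><name>Fred Drew</name></pkone>"));
        System.out.println(errorCount("pkall",
                "<pkall><name>Fred Drew</name><phone>555-1028</phone>"
                + "<address>1313 Luck St</address></pkall>"));   // wrong order
    }
}
```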
A single ATTLIST entry can also be used to declare multiple attributes for the same
element. An example is:
In the XML code, a valid rectangle element is required to include the two attrib-
utes declared with it, and it could look like this:
When defining an attribute, it’s necessary to specify the data type and, possibly, an ini-
tial value or set of possible values. Table 2.4 lists the possible default declarations used to
specify the requirement, or lack of requirement, imposed on attributes. Table 2.5 de-
scribes the keywords used to specify the type of data and gives an example of each one.
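An attribute declaration does more than restrict what the document may say; a #FIXED declaration also causes the parser to supply the attribute when the document leaves it out. The following sketch shows this with the JDK's default, nonvalidating DOM parser, which still reads the declarations in an inline DTD. The rectangle element comes from the discussion above, but its attribute names (width, height, units, label) and the class AttributeDefaultDemo are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the attribute names are assumptions.
final class AttributeDefaultDemo {

    static final String SAMPLE =
              "<!DOCTYPE rectangle [\n"
            + "<!ELEMENT rectangle EMPTY>\n"
            + "<!ATTLIST rectangle\n"
            + "    width  CDATA #REQUIRED\n"
            + "    height CDATA #REQUIRED\n"
            + "    units  CDATA #FIXED \"cm\"\n"
            + "    label  CDATA #IMPLIED>\n"
            + "]>\n"
            + "<rectangle width=\"3\" height=\"4\"/>";

    /** Parses the XML text and returns its root element. */
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    /** Returns the value of the named attribute, or an empty string if it is absent. */
    static String attribute(String xml, String name) {
        return root(xml).getAttribute(name);
    }

    /** Returns true if the attribute is present, whether written or supplied by the DTD. */
    static boolean hasAttribute(String xml, String name) {
        return root(xml).hasAttribute(name);
    }

    public static void main(String[] args) {
        System.out.println("units = " + attribute(SAMPLE, "units"));         // from #FIXED
        System.out.println("label present: " + hasAttribute(SAMPLE, "label")); // #IMPLIED, omitted
    }
}
```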
The entity definition cannot be made inside an element, so it must come before the
root element of the document. Once this entity has been defined, the string “Interna-
tional Widget Inc.” will be inserted automatically wherever you use the entity, such as:
<customer>&company;</customer>
CONSTRAINT  DESCRIPTION
#IMPLIED    The attribute may be used in an element of this type, but it is not
            required.
#REQUIRED   This attribute must be specified for all elements of this type.
#FIXED      This is always followed by a quoted string that is the only
            value that can be assigned to the attribute.
You can define a number of entities in a single document and use them to generalize
the contents of the document itself. That is, by just changing the values of the entities,
the content of the document would change. Just as was done with the predefined
entities, you can mix your own entities with text. The document in Listing 2.8 includes
the definition of some entities and some text with a pair of entities embedded in it.
<?xml version="1.0"?>
<agreement>
    <intro>
        The company name is &company; with main offices
        located at &address;.
    </intro>
</agreement>
An even more general form of ENTITY is to store the substitution
text in a separate file. By doing so, you’ll need only to change the content of the entity
file to make a modification to the document. This is done using the keyword SYSTEM
as in the following:
Just as with the SYSTEM keyword in the DOCTYPE declaration, the file can be located
anywhere on the Internet.
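By the time a Java program sees the text of an element, the parser has already replaced each entity reference with its substitution text. The sketch below demonstrates this for an entity defined directly in the DTD, using the company entity from the discussion above; an entity stored in a separate file is not covered here. The class EntityDemo is a name invented for the example, and the JDK's default DOM parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the name is an assumption for this example.
final class EntityDemo {

    static final String SAMPLE =
              "<!DOCTYPE customer [\n"
            + "<!ENTITY company \"International Widget Inc.\">\n"
            + "]>\n"
            + "<customer>&company;</customer>";

    /** Parses the XML text and returns the text content of its root element. */
    static String text(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text(SAMPLE));
    }
}
```

Changing the one ENTITY line changes every place the entity is used, which is what makes entities useful for generalizing a document.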
Parameter Entities
A parameter entity can be used to make substitutions in the DTD instead of in the XML
document. A normal entity does not cause substitutions to be made inside the DTD, so
there’s no way for the parser to recognize any of the DTD keywords. For example, the
following will not work:
A parameter entity can be used to make the substitution work inside the DTD. To
create a parameter entity, you’ll need to insert a percent sign in front of the name of the
entity being defined. An example is: