1.1 Introduction
The title of this book is Querying XML, so we start by introducing XML, describing what we mean by “querying,” and then discussing the special challenges in querying XML.
XML – the Extensible Markup Language – defines a set of rules for adding markup to data. Markup adds structure to data, and gives us a way of talking about the meaning of that data. The family of XML technologies provides a way to standardize the representation of data, so that we can process any data with standard programs, share data across applications, and transfer data from one person or application to another. In this first chapter, we introduce XML by looking at what markup is and what it’s good for. Then we look at a number of different uses for XML – a number of different kinds of XML data. Finally, we give examples of other ways to represent data, and compare them with XML.
1.2 Adding Markup to Data
Let’s take the movies example (Appendix A: The Example) used throughout this book. We have data describing many of our favorite movies. The data includes the title of the movie, the year it was first released, the names of some of the cast members, and other information about the movie. In this section, we look at the data in its raw form, then discuss how that data might be marked up to make it more useful.
1.2.1 Raw Data
We could represent our movie data in raw form, as in Example 1-1.
Example 1-1 movie, Raw Data
Example 1-1 is the raw data for one movie – a single record. In this format, the data doesn’t tell you much about the movie. You can probably spot the title, and, if you are familiar with “An American Werewolf in London,” you may be able to glean some information by means of educated guesswork. But if you wanted to write a program to read this data and do something with it – such as finding the name of the director – you would have to write code specifically for this piece of data (e.g., code that extracts the characters at positions 41 through 44, then the characters at positions 35 through 40, and puts a space between the two). That code would work for this record and no other. What we need is some way to represent the data so that a program (or person) can process any movie record in the same way.
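To make the problem concrete, here is a minimal Python sketch of such record-specific code. The string `raw` is our own assumption, pieced together from the character positions mentioned above (a 30-character title, a 4-digit year, then the director’s last and first names); it is not the full record, and the function name `director_from_raw` is hypothetical.

```python
# Assumed beginning of a raw record; only the first 44 characters are modeled.
raw = "An American Werewolf in London1981LandisJohn"


def director_from_raw(record: str) -> str:
    """Extract the director using hard-coded character positions."""
    first_name = record[40:44]  # positions 41 through 44 (1-based)
    last_name = record[34:40]   # positions 35 through 40 (1-based)
    return f"{first_name} {last_name}"


print(director_from_raw(raw))  # prints: John Landis
```

The offsets are only valid for a record whose title happens to be 30 characters long. Any other movie would need different numbers, which is exactly the shortcoming described above.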
1.2.2 Separating Fields
A simple way to add some rudimentary structure to this record is to add a comma between each of the data items, or fields.
Example 1-2 movie, Fields Separated by Commas
Example 1-2 is the same movie data represented as a comma-separated list. Notice that, even with this simple mechanism, we had to introduce the “\” (backslash) character to “escape” a comma that was actually part of the data.
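The following Python sketch shows one way a program might split such a record while honoring the backslash escape. The sample record and the function name `split_fields` are hypothetical; in particular, the field containing a comma is invented for illustration and is not taken from the movie data.

```python
def split_fields(record: str, sep: str = ",", esc: str = "\\") -> list[str]:
    """Split a record on sep, treating the character after esc as literal data."""
    fields: list[str] = []
    current: list[str] = []
    chars = iter(record)
    for ch in chars:
        if ch == esc:
            # Take the next character as data, even if it is a separator.
            current.append(next(chars, ""))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Hypothetical record: the first field contains an escaped comma.
record = r"Hypothetical Films\, Inc.,1981,Landis,John"
print(split_fields(record))
```

A naive `record.split(",")` would return five fields here instead of four, because it would break the first field at the escaped comma.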
There are other ways to distinguish between fields of a record. In the early days of computing, fixed-length fields were common – each field might occupy, say, 8 bytes. This method makes access simple – if you want to access the beginning of the third field, you can go directly to the 17th byte. But fields smaller than 8 bytes take up more space than they need to, and fields longer than 8 bytes require some indication that they are spread across more than one field (such as a continuation marker).
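The arithmetic behind fixed-length access can be sketched in a few lines of Python. The 8-byte width comes from the discussion above; the sample record, the space padding, and the function name `fixed_field` are assumptions made for illustration.

```python
FIELD_WIDTH = 8  # every field occupies exactly 8 bytes


def fixed_field(record: bytes, n: int) -> bytes:
    """Return the n-th field (1-based), with trailing padding removed."""
    start = (n - 1) * FIELD_WIDTH  # field 3 starts at offset 16, the 17th byte
    return record[start:start + FIELD_WIDTH].rstrip(b" ")


# Hypothetical record of three space-padded fields.
record = b"Landis  John    1981    "
print(fixed_field(record, 3))  # prints: b'1981'
```

The lookup needs no scanning at all, but the padding in the first two fields is wasted space, and a value such as a 30-character movie title would not fit in a single field.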
Let’s continue our discussion with the comma-separated list in Example 1-2. You can spot the fields in this record, but there is no way of knowing which fields go together. For example, the fields “Agutter,” “Jenny,” “female,” and “Alex Price” each describe one aspect of a cast member, but it’s not apparent from the comma-separated list that those fields have anything in common. We have a way of delineating fields; now we need some way of grouping fields together.
1.2.3 Grouping Fields Together
Example 1-3 groups fields together. It also introduces a hierarchy of fields and subfields. Fields are separated by one or more commas, and fields that belong together are bounded by “,” at the start and “$,” at the end.
Example 1-3 movie, Grouped Fields
Example 1-3 is shown with some extra white space – each subfield starts on a new line, and is indented. This is purely for (human) readability.
Now we know that “Agutter, Jenny, female, Alex Price” all belong together and are all related in some way to “An American Werewolf in London.” And if you want to write a program to extract the director of each movie, given that each movie is formatted in the same way as in Example 1-3, you can write some general code that will parse the movie into first, second, and third fields, extract the contents of the third field, and parse that to get the first and last name of the director.
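As a rough illustration of that general code, the Python sketch below assumes the grouped record has already been parsed into nested lists, with each group of fields becoming an inner list. The nested-list shape, the field order (title, year, director), and the function name `director` are our assumptions; the parsing of the delimiters themselves is not shown.

```python
# Assumed result of parsing a grouped record: title, year, then a director
# group holding last name and first name. Later fields are omitted.
movie = [
    "An American Werewolf in London",
    "1981",
    ["Landis", "John"],
]


def director(record: list) -> str:
    """Return 'First Last' from the third field of any movie record."""
    last_name, first_name = record[2]  # the third field is the director group
    return f"{first_name} {last_name}"


print(director(movie))  # prints: John Landis
```

Unlike code based on character positions, this function works for any movie record that follows the same layout, whatever the length of its title.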
We are making progress! But Example 1-3 still has some shortcomings. There is no indication of what a field represents, other than its position within the record, which makes it difficult for humans to read. This has two implications. First, the data is vulnerable to error. If you (or the program generating the data) make a mistake and leave out the year of release, it’s not obvious that anything is missing, and a program processing this data may well return “LandisJohn” when asked for the year of release. Second, it is difficult to talk about the data. Most of the time, when we want to “talk about” the data, we want to describe some manipulation to a program, and with positions alone we are forced to say things like “print the second field of the third field of the movie record, then a space, then the first field of the third field of the movie record.” Our next step is to name the fields and subfields.
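The first of these problems is easy to reproduce. In the Python sketch below, records are modeled as nested lists (an assumption on our part, as is the function name `year_of_release`), and the year is looked up purely by position.

```python
def year_of_release(record: list) -> str:
    """Return the second field, which is assumed to hold the year."""
    field = record[1]
    # A group of subfields is flattened to text, as a naive program might do.
    return "".join(field) if isinstance(field, list) else field


complete = ["An American Werewolf in London", "1981", ["Landis", "John"]]
missing_year = ["An American Werewolf in London", ["Landis", "John"]]

print(year_of_release(complete))      # prints: 1981
print(year_of_release(missing_year))  # prints: LandisJohn
```

Nothing in the second record signals that a field is missing, so the program returns the wrong answer without raising any error.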
1.2.4 Naming Fields
If you read Example 1-3, you can probably guess that “An American Werewolf in London” is the title of the movie, and you may even deduce that Jenny Agutter plays the female lead, a character named Alex Price. But who is Peter Guber? And what does “98” mean? What we need is a way to name each field, to m...