This book is about something we call the Semantic Web. From the name, you can probably guess that it is related somehow to the World Wide Web (WWW) and that it has something to do with semantics. Semantics, in turn, has to do with understanding the nature of meaning, but even the word semantics has a number of meanings. In what sense are we using the word semantics? And how can it be applied to the Web?
This book is for a working ontologist. That is, the aim of this book is not to motivate or pitch the Semantic Web but to provide the tools necessary for working with it. Or, perhaps more accurately, the World Wide Web Consortium (W3C) has provided these tools in the forms of standard Semantic Web languages, complete with abstract syntax, model-based semantics, reference implementations, test cases, and so forth. But these are like any tools: there are some basic tools that are all you need to build many useful things, and there are specialized craftsman’s tools that can produce far more specialized outputs. Whichever tools are needed for a particular task, however, one still needs to understand how to use them. In the hands of someone with no knowledge, they can produce clumsy, ugly, barely functional output, but in the hands of a skilled craftsman, they can produce works of utility, beauty, and durability. It is our aim in this book to describe the craft of building Semantic Web systems. We go beyond merely covering the fundamental tools to also show how they can be used together to create semantic models, sometimes called ontologies, that are understandable, useful, durable, and perhaps even beautiful.
What is a Web?
The idea of a web of information was once a technical idea accessible only to highly trained, elite information professionals: IT administrators, librarians, information architects, and the like. Since the widespread adoption of the World Wide Web, it is now common to expect just about anyone to be familiar with the idea of a web of information that is shared around the world. Contributions to this web come from every source, and every topic you can think of is covered.
Essential to the notion of the Web is the idea of an open community: Anyone can contribute their ideas to the whole, for anyone to see. It is this openness that has resulted in the astonishing comprehensiveness of topics covered by the Web. An information “web” is an organic entity that grows from the interests and energy of the communities that support it. As such, it is a hodgepodge of different analyses, presentations, and summaries of any topic that suits the fancy of anyone with the energy to publish a web page. Even as a hodgepodge, the Web is pretty useful. Anyone with the patience and savvy to dig through it can find support for just about any inquiry that interests them. But the Web often feels like it is “a mile wide but an inch deep.” How can we build a more integrated, consistent, deep Web experience?
Smart Web, Dumb Web
Suppose you consult a web page, looking for a major national park, and you find a list of hotels that have branches in the vicinity of the park. In that list you see that Mongotel, one of the well-known hotel chains, has a branch there. Since you have a Mongotel rewards card, you decide to book your room there. So you click on the Mongotel web site and search for the hotel’s location. To your surprise, you can’t find a Mongotel branch at the national park. What is going on here? “That’s so dumb,” you tell your browsing friends. “If they list Mongotel on the national park web site, shouldn’t they list the national park on Mongotel’s web site?”
Suppose you are planning to attend a conference in a far-off city. The conference web site lists the venue where the sessions will take place. You go to the web site of your preferred hotel chain and find a few hotels in the same vicinity. “Which hotel in my chain is nearest to the conference?” you wonder. “And just how far off is it?” There is no shortage of web sites that can compute these distances once you give them the addresses of the venue and your own hotel. So you spend some time copying and pasting the addresses from one page to the next and noting the distances. You think to yourself, “Why should I be the one to copy this information from one page to another? Why do I have to be the one to copy and paste all this information into a single map?”
Suppose you are investigating our solar system, and you find a comprehensive web site about objects in the solar system: Stars (well, there’s just one of those), planets, moons, asteroids, and comets are all described there. Each object has its own web page, with photos and essential information (mass, albedo, distance from the sun, shape, size, what object it revolves around, period of rotation, period of revolution, etc.). At the head of the page is the object category: planet, moon, asteroid, comet. Another page includes interesting lists of objects: the moons of Jupiter, the named objects in the asteroid belt, the planets that revolve around the sun. This last page has the nine familiar planets, each linked to its own data page.
One day, you read in the newspaper that the International Astronomical Union (IAU) has decided that Pluto, which up until 2006 was considered a planet, should be considered a member of a new category called a “dwarf planet”! You rush to the Pluto page and see that indeed, the update has been made: Pluto is listed as a dwarf planet! But when you go back to the “Solar Planets” page, you still see nine planets listed under the heading “Planet.” Pluto is still there! “That’s dumb.” Then you say to yourself, “Why didn’t someone update the web pages consistently?”
What do these examples have in common? Each of them has an apparent representation of data, whose presentation to the end user (the person operating the Web browser) seems “dumb.” What do we mean by “dumb”? In this case, “dumb” means inconsistent, out of sync, and disconnected. What would it take to make the Web experience seem smarter? Do we need smarter applications or a smarter Web infrastructure?
Smart web applications
The Web is full of intelligent applications, with new innovations coming every day. Ideas that once seemed futuristic are now commonplace; search engines make matches that seem deep and intuitive; commerce sites make smart recommendations personalized in uncanny ways to your own purchasing patterns; mapping sites include detailed information about world geography, and they can plan routes and measure distances. The sky is the limit for the technologies a web site can draw on. Every information technology under the sun can be used in a web site, and many of them are. New sites with new capabilities come on the scene on a regular basis.
But what is the role of the Web infrastructure in making these applications “smart”? It is tempting to make the infrastructure of the Web smart enough to encompass all of these technologies and more. The smarter the infrastructure, the smarter the Web’s performance, right? But it isn’t practical, or even possible, for the Web infrastructure to provide specific support for all, or even any, of the technologies that we might want to use on the Web. Smart behavior in the Web comes from smart applications on the Web, not from the infrastructure.
So what role does the infrastructure play in making the Web smart? Is there a role at all? We have smart applications on the Web, so why are we even talking about enhancing the Web infrastructure to make a smarter Web if the smarts aren’t in the infrastructure?
The reason we are improving the Web infrastructure is to allow smart applications to perform to their potential. Even the most insightful and intelligent application is only as smart as the data that is available to it. Inconsistent or contradictory input will still result in confusing, disconnected, “dumb” results, even from very smart applications. The challenge for the design of the Semantic Web is not to make a web infrastructure that is as smart as possible; it is to make an infrastructure that is most appropriate to the job of integrating information on the Web.
The Semantic Web doesn’t make data smart because smart data isn’t what the Semantic Web needs. The Semantic Web just needs to get the right data to the right place so the smart applications can do their work. So the question to ask is not “How can we make the Web infrastructure smarter?” but “What can the Web infrastructure provide to improve the consistency and availability of Web data?”
Connected data is smarter data
Even in the face of intelligent applications, disconnected data result in dumb behavior. But the Web data don’t have to be smart; that’s the job of the applications. So what can we realistically and productively expect from the data in our Web applications? In a nutshell, we want data that don’t surprise us with inconsistencies that make us want to say, “This doesn’t make sense!” We don’t need a smart Web infrastructure, but we need a Web infrastructure that lets us connect data to smart Web applications so that the whole Web experience is enhanced. The Web seems smarter because smart applications can get the data they need.
In the example of the hotels in the national park, we’d like there to be coordination between the two web pages so that an update to the location of hotels would be reflected in the list of hotels at any particular location. We’d like the two sources to stay synchronized; then we won’t be surprised at confusing and inconsistent conclusions drawn from information taken from different pages of the same site.
In the mapping example, we’d like the data from the conference web site and the data from the hotels web site to be automatically understandable to the mapping web site. It shouldn’t take interpretation by a human user to move information from one site to the other. The mapping web site already has the smarts it needs to find shortest routes (taking into account details like toll roads and one-way streets) and to estimate the time required to make the trip, but it can only do that if it knows the correct starting and endpoints.
We’d like the astronomy web site to update consistently. If we state that Pluto is no longer a planet, the list of planets should reflect that fact as well. This is the sort of behavior that gives a reader confidence that what they are reading reflects the state of knowledge reported in the web site, regardless of how they read it.
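To make the idea of coordinated pages concrete, here is a minimal sketch in Python. It assumes that each "page" is computed from one shared set of statements rather than maintained by hand. Every name in it (`facts`, `has_branch_near`, `category`, `reclassify`, and so on) is hypothetical and chosen only for illustration; it is not one of the standard Semantic Web languages.

```python
# Minimal sketch: one shared set of statements, several "pages" derived from it.
# All names below are hypothetical and exist only for this illustration.

# Each statement is a (subject, predicate, object) tuple.
facts = {
    ("Mongotel", "has_branch_near", "National Park"),
    ("Mercury", "category", "planet"),
    ("Venus", "category", "planet"),
    ("Earth", "category", "planet"),
    ("Mars", "category", "planet"),
    ("Jupiter", "category", "planet"),
    ("Saturn", "category", "planet"),
    ("Uranus", "category", "planet"),
    ("Neptune", "category", "planet"),
    ("Pluto", "category", "planet"),
}


def subjects(predicate, obj):
    """Return every subject related to `obj` by `predicate`, sorted by name."""
    return sorted(s for (s, p, o) in facts if p == predicate and o == obj)


def objects(subject, predicate):
    """Return every object related to `subject` by `predicate`, sorted by name."""
    return sorted(o for (s, p, o) in facts if s == subject and p == predicate)


def reclassify(body, new_category):
    """Replace the category statement for one body in the shared store."""
    for old in [f for f in facts if f[0] == body and f[1] == "category"]:
        facts.discard(old)
    facts.add((body, "category", new_category))


# Hotel example: the park page and the hotel chain page read the same statement,
# so neither can list a branch that the other does not know about.
park_page = subjects("has_branch_near", "National Park")
chain_page = objects("Mongotel", "has_branch_near")

# Astronomy example: a single update changes every page derived from the data.
reclassify("Pluto", "dwarf planet")
planet_list_page = subjects("category", "planet")
pluto_page = objects("Pluto", "category")

print("Hotels near the park:", park_page)
print("Mongotel locations:", chain_page)
print("Planets:", planet_list_page)
print("Pluto is a:", pluto_page)
```

The design choice worth noting is that no page stores its own copy of the data. The list of planets and the Pluto page cannot disagree, because both are answers to questions asked of the same statements.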
None of these things is beyond the reach of current information technology. In fact, it is not uncommon for programmers and system architects, when they first learn of the Semantic Web, to exclaim proudly, “I implemented something very like that for a project I did a few years back. We used….” Then they go on to explain how they used some conventional, established technology such as relational databases, XML stores, or object stores to make their data more connected and consistent. But what is it that these developers are building?
What is it about managing data this way that made it worth their while to create a whole subsystem on top of their base technology to deal with it? And where are these projects two or more years later? When those same developers are asked whether they would rather have built a flexible, distributed, connected data model support system themselves than have used a standard one that someone else optimized and supported, they unanimously choose the latter. Infrastructure is something that one would rather buy than build.