rdfabout.net: Resource Description Framework
What is RDF and what is it good for?

Triples of knowledge

RDF provides a general, flexible method to decompose any knowledge into small pieces, called triples, with some rules about the semantics (meaning) of those pieces.

The foundation is the decomposition of knowledge into what is called a labeled, directed graph, if you know the terminology.

Each edge in the graph represents a fact, or a relation between two things. The edge in the figure above from the node vincent_donofrio labeled starred_in to the node the_thirteenth_floor represents the fact that actor Vincent D'Onofrio starred in the movie “The Thirteenth Floor.” A fact represented this way has three parts: a subject, a predicate (i.e. verb), and an object. The subject is what's at the start of the edge, the predicate is the type of edge (its label), and the object is what's at the end of the edge.

The six documents composing the RDF specification tell us two things. First, they outline the abstract model, i.e. how to use triples to represent knowledge about the world. Second, they describe how to encode those triples in XML. We'll take each subject in turn.

The abstract RDF model: Statements

RDF is nothing more than a general method to decompose information into pieces. The emphasis is on general here because the same method can be used for any type of information. And the method is this: Express information as a list of statements in the form SUBJECT PREDICATE OBJECT. The subject and object are names for two things in the world, and the predicate is the name of a relation between the two. You can think of predicates as verbs.

Here's how I would break down information about my apartment into RDF statements:

SUBJECT       PREDICATE  OBJECT
I             own        my_apartment
my_apartment  has        my_computer
my_apartment  has        my_bed
my_apartment  is_in      Philadelphia

These four lines express four facts. Each line is called a statement or triple.
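A minimal sketch of how the table could be held in a program, written in Python (a choice made for illustration, not something RDF prescribes). The `Triple` type name is hypothetical; each row of the table becomes one three-part tuple.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One RDF statement: a subject, a predicate, and an object."""
    subject: str
    predicate: str
    object: str

# The four statements from the table about the apartment.
statements = [
    Triple("I", "own", "my_apartment"),
    Triple("my_apartment", "has", "my_computer"),
    Triple("my_apartment", "has", "my_bed"),
    Triple("my_apartment", "is_in", "Philadelphia"),
]

for s in statements:
    print(s.subject, s.predicate, s.object)
```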

The subjects, predicates, and objects in RDF are always simple names for things: concrete things, like my_apartment, or abstract concepts, like has. These names don't have internal structure or significance of their own. They're like proper names or variables. It doesn't matter what name you choose for anything, as long as you use it consistently throughout.

Names in RDF statements are said to refer to or denote things in the world. The things that names denote are called resources (dating back to RDF's use for metadata for web resources), nodes (from graph terminology), or entities. These terms are generally all synonymous. For instance, the name my_apartment denotes my actual apartment, which is an entity in the real world. The distinction between names and the entities they denote is subtle but important because two names can be used to refer to the same entity.

Predicates are always relations between two things. Own is a relation between an owner and an 'ownee'; has is a relation between the container and the thing contained; is_in is the inverse relation, between the contained and the container. In RDF, the order of the subject and object is very important.

The next aspect of RDF almost goes without saying, but I want to put everything down in print: If someone refers to something as X in one place and X is used in another place, the two X's refer to the same entity. When I wrote my_apartment in the first line, it's the same apartment that I meant when I wrote it in the other three lines.

The rules so far already get us a lot farther than you might realize. Given this table of statements, I can write a simple program that can answer questions like "who own my_apartment" and "my_apartment has what." The question itself is in the form of an RDF statement, except the program will consider wh-words like who and what to be wild-cards. A simple question-answering program can compare the question to each row in the table. Each matching row is an answer. Here's the pseudocode:

Pseudocode for Question-Answering
question = (my_apartment, has, what)
knowledge = (
		(I, own, my_apartment),
		(my_apartment, has, my_computer),
		(my_apartment, has, my_bed),
		(my_apartment, is_in, Philadelphia)
	)
for each statement in knowledge {
	if ((statement.subject == question.subject
			or question.subject == what)
		  and (statement.predicate == question.predicate
		  	or question.predicate == what)
		  and (statement.object == question.object
		    or question.object == what)) {
		call FoundAnswer(statement)
	}
}

Output:
	Answer: my_apartment has my_computer
	Answer: my_apartment has my_bed
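
The same matching idea can be sketched as runnable Python. This is one possible rendering, not a standard RDF API: the names `WILDCARD`, `matches`, and `ask` are hypothetical, and `None` stands in for wh-words such as who and what.

```python
# Hypothetical wildcard marker standing in for wh-words such as "who" and "what".
WILDCARD = None

knowledge = [
    ("I", "own", "my_apartment"),
    ("my_apartment", "has", "my_computer"),
    ("my_apartment", "has", "my_bed"),
    ("my_apartment", "is_in", "Philadelphia"),
]

def matches(statement, question):
    """True if every non-wildcard part of the question equals the statement's part."""
    return all(q is WILDCARD or q == s for s, q in zip(statement, question))

def ask(question, knowledge):
    """Return every statement in knowledge that matches the question."""
    return [statement for statement in knowledge if matches(statement, question)]

for subject, predicate, obj in ask(("my_apartment", "has", WILDCARD), knowledge):
    print("Answer:", subject, predicate, obj)
```

Asking `(WILDCARD, "own", "my_apartment")` instead corresponds to the question "who own my_apartment" and returns the single statement about the owner.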

The computer doesn't need to know what has actually means in English for this to be useful. That is, it's left up to the application writer to choose appropriate names for things (e.g. my_apartment) and to use the right predicates (own, has). RDF tools are ignorant of what these names mean, but they can still usefully process the information. (I'll get to more useful things later.)

URIs to Name Resources

RDF is meant to be published on the Internet, and so the names I used above have a problem. I shouldn't name something my_apartment because someone else might use the name my_apartment for their apartment too. Following from the last fact about RDF, RDF tools would think the two instances of my_apartment referred to the same thing in the real world, whereas in fact they were intended to refer to two different apartments. The last aspect of RDF is that names must be global, in the sense that you must not choose a name that someone else might conceivably also use to refer to something different. Formally, names for subjects, predicates, and objects must be Uniform Resource Identifiers (URIs). (Technically, names can be Internationalized Resource Identifiers (IRIs) but the distinction is not important.)
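
The name clash can be made concrete with a small Python sketch. Everything in it is made up for illustration: the second document, the city Boston, and the two `example.org` URIs are hypothetical.

```python
# Two hypothetical documents written by different people, both using a local name.
mine = {("my_apartment", "is_in", "Philadelphia")}
theirs = {("my_apartment", "is_in", "Boston")}

# Merging them makes one apartment appear to be in two cities.
merged = mine | theirs
cities = {o for s, p, o in merged if s == "my_apartment" and p == "is_in"}
print(sorted(cities))  # ['Boston', 'Philadelphia']

# With global names the two apartments stay distinct after merging.
mine_global = {("http://example.org/alice#my_apartment", "is_in", "Philadelphia")}
theirs_global = {("http://example.org/bob#my_apartment", "is_in", "Boston")}
subjects = {s for s, p, o in mine_global | theirs_global}
print(len(subjects))  # 2
```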

URIs can have the same syntax or format as website addresses, so you will see RDF files that contain URIs like http://www.w3.org/1999/02/22-rdf-syntax-ns#type, where that URI is the global name for some entity. This happens to be the URI for the concept of "is a" (if you recognize it).

Now, in the SemWeb world, URIs are treated in a somewhat inconsistent way, so bear with me here. On the one hand, URIs are supposed to be opaque. The fact that a URI looks like a web address is totally incidental. There may or may not be an actual website at that address, and it doesn't matter. There are other types of URIs besides http:-type URIs. URNs are a subtype of URI used for things like identifying books by their ISBN number, e.g. urn:isbn:0143034650. TAGs are a general-purpose type of URI. They look like tag:govtrack.us,2005:congress/senators/frist. Not all URIs name web pages. That's the difference between a URI and a URL. URLs are just those URIs that name things on the web that can be retrieved, aka "dereferenced".
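
As a small illustration of the different kinds of URIs, Python's standard `urllib.parse.urlsplit` can report the scheme of each example above. Nothing is fetched from the network; the strings are only taken apart.

```python
from urllib.parse import urlsplit

uris = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "urn:isbn:0143034650",
    "tag:govtrack.us,2005:congress/senators/frist",
]

# The scheme is the part before the first colon.
schemes = [urlsplit(uri).scheme for uri in uris]
for uri, scheme in zip(uris, schemes):
    print(scheme, uri)
```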

So the convention goes: whatever their form, URIs you see in RDF documents are merely verbose names for entities, nothing more. Well, at least, that's how people felt about URIs till around 2007.

Since then, an expectation has grown that if you create an http: URI, or any other dereferenceable URI (a URL), you actually put something at that address so that RDF clients can access the page and get some information. Here's the bottom line: as far as the meaning of a document goes, what the URI is simply doesn't matter, but when you use dereferenceable URIs, there may be an expectation that you put something on the web at that address. We will return to this in the section about Linked Data.

URIs are used as global names because they provide a way to break down the space of all possible names into units that have obvious owners. URIs that start with http://www.govtrack.us/ are implicitly controlled by me, or whoever is running the website at that address. By convention, if there's an obvious owner for a URI, no one but that owner will "mint" a new resource with that URI. This prevents name clashes. If you create a URI in the space of URIs that you control, you can rest assured no one will use the same URI to denote something else. (Of course, someone might use your URIs in a way that you would not appreciate, but this is a subject for another article.)

Since URIs can be quite long, in various RDF notations they're usually abbreviated using the concept of namespaces from XML. As in XML, a namespace is generally declared at the top of an RDF document and then used in abbreviated form later on. Let's say I've declared the abbreviation taubz for the URI http://razor.occams.info/index.html#. In many RDF notations, I can then abbreviate URIs like http://razor.occams.info/index.html#my_apartment by replacing the namespace URI exactly as it is given in the declaration with the abbreviation and a colon, in this case simply as taubz:my_apartment. The precise rules for namespacing depend on the RDF serialization syntax being used.

Importantly, namespaces have no significant status in RDF. They are merely a tool to abbreviate long URIs.
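
A minimal sketch of the abbreviation mechanism in Python, assuming a simple prefix table. The `expand` function is hypothetical, and real serialization syntaxes have more detailed rules than this.

```python
# Prefix declarations, as they might appear at the top of an RDF document.
namespaces = {
    "taubz": "http://razor.occams.info/index.html#",
}

def expand(name, namespaces):
    """Expand an abbreviated name like 'taubz:my_apartment' into a full URI."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return name  # not an abbreviation we know about; leave it unchanged

print(expand("taubz:my_apartment", namespaces))
# http://razor.occams.info/index.html#my_apartment
```

Because expansion is plain string substitution, the prefix itself carries no meaning: two documents can use different prefixes for the same namespace URI and still name the same resources.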

I might rewrite the table about my apartment as it is below, replacing the simple names I first used above with arbitrary URIs:

RDF about My Apartment
Let taubz: abbreviate http://razor.occams.info/index.html#

taubz:me            http://example.org/own    taubz:my_apartment
taubz:my_apartment  http://example.org/has    taubz:my_computer
taubz:my_apartment  http://example.org/has    taubz:my_bed
taubz:my_apartment  http://example.org/is_in  http://example.org/Philadelphia

The table above is just an informal table representing the graph of information that exists at an abstract level, which could just as well be described by the figure below. We will talk more about standard ways of actually writing out RDF later on.

RDF as a Graph

Wrapping It Up So Far

And that's RDF. Everything else in the Semantic Web builds on those three rules, repeated here to hammer home the simplicity of the system:

  1. A fact is expressed as a triple of the form (Subject, Predicate, Object).
  2. Subjects, predicates, and objects are given as names for entities, whether concrete or abstract, in the real world.
  3. Names are in the format of URIs, which are opaque and global.

These concepts form most of the abstract RDF model for encoding knowledge. It's analogous to the common API that most XML libraries provide. If it weren't for us curious humans always peeking into files, the actual format of XML wouldn't matter so much as long as we had our appendChild, setAttribute, etc. Of course, we do need a common file format for exchanging data, and in fact there are two for RDF, which we look at later.

Blank Nodes and Literal Values

There is actually a bit more to RDF than the three rules above. So far I've described three types of things in RDF: resources (things or concepts) that exist in the real world, global names for resources (i.e. URIs), and RDF statements (triples, or rows in a table). There are two more things.

Literals

The first new thing is the literal value. Literal values are raw text that can be used in place of a resource as the object of an RDF triple. Unlike names (i.e. URIs), which are stand-ins for things in the real world, literal values are just raw text data inserted into the graph. Literal values could be used to relate people to their names, books to their ISBNs, etc.:

Some Uses of Literals
taubz:me          foaf:name  "Joshua Ian Tauberer"
book:HarryPotter  dc:title   "Harry Potter"
book:HarryPotter  ex:price   "$18.75"

Blank/Anonymous Nodes

Then there are anonymous nodes, blank nodes, or bnodes. These terms are all synonymous. The words anonymous or blank are meant to indicate that these are nodes in a graph without a name, either because the author of the document doesn't know a name or doesn't want or need to provide one. In a sense, this is like saying “John is friends with someone, but I'm not telling who.” When we say these nodes are nameless, keep in mind two things. First, the real-world thing that the node denotes is not inherently nameless. John's friend, in the example, has a name, after all. Second, when we say nameless here, we are referring to the concept of naming things with URIs. Actual blank nodes in documents may be given “local” identifiers so that they may be referred to multiple times within a document. It is only that these local identifiers are explicitly not global, and have no meaning outside of the document in which they occur.

If you're familiar with formal semantics, blank nodes can often be thought of as existentially bound variables.

Here's one way literal values and anonymous nodes are used. One literal value in the example is "Joshua Tauberer", and the anonymous or blank node is _:anon123.

Blank Nodes and Literal Values
taubz:me               foaf:name    "Joshua Tauberer"
taubz:me               ex:has_read  <urn:isbn:0143034650>
<urn:isbn:0143034650>  dc:title     "Free Culture : The Nature and Future of Creativity"
<urn:isbn:0143034650>  dc:author    _:anon123
_:anon123              foaf:name    "Lawrence Lessig"

To distinguish between URIs, namespaced names (abbreviated URIs), anonymous nodes, and literal values, I used the following common convention:

  • Full URIs are enclosed in angle brackets.
  • Namespaced names are written plainly, but their colons give them away.
  • Anonymous nodes are written like namespaced names, but in the reserved "_" namespace with an arbitrary local name after the colon.
  • Literal values are enclosed in quotation marks.
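
The four conventions can be sketched as a small Python classifier. It applies only to the informal notation used in these tables, and the `classify` function is hypothetical rather than part of any RDF library.

```python
def classify(term):
    """Classify a term written in the informal notation used in the tables."""
    if term.startswith('"'):
        return "literal"  # checked first, since literal text may itself contain colons
    if term.startswith("<") and term.endswith(">"):
        return "full URI"
    if term.startswith("_:"):
        return "blank node"
    if ":" in term:
        return "namespaced name"
    raise ValueError(f"unrecognized term: {term!r}")

for term in ['<urn:isbn:0143034650>', 'dc:title', '_:anon123', '"Lawrence Lessig"']:
    print(term, "->", classify(term))
```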

You should take a moment to try to visualize what graph is described by the table. Picture arrows between nodes.

There is one blank node in this example, _:anon123. What we know about this resource is that it is the author of <urn:isbn:0143034650> and it has the name Lawrence Lessig. Because no global name is used for this resource, we can't really be sure who we're talking about here. And, if we wanted to say more about whatever is denoted by _:anon123, we would have to do it in this very RDF document because we would have no way to refer to this particular Lawrence Lessig outside of the document.

More on Literals: Language Tags and Datatypes

Literal values can be optionally adorned with one of two pieces of metadata. The first is a language tag, to specify what language the raw text is written in. The language tag should be viewed as a vestige of how RDF was used in the early days. Today it is an ugly hack. You may see “ "chat"@en ”, the literal value “chat” with an English language tag, or “ "chat"@fr ”, the same with the French language tag.

Alternatively, a literal value can be tagged with a URI indicating a datatype. The datatype indicates how to interpret the raw text, for example as a number, a URI, or a date or time. Datatypes can be any URI, although the datatypes defined in XML Schema are used by convention. The notation for datatypes is often the literal value in quotes followed by two carets, followed by the datatype URI (possibly abbreviated):

Datatypes
"1234"                This is an untyped literal value. No datatype.
"1234"^^xsd:integer   This is a typed literal value using a namespace.
"1234"^^<http://www.w3.org/2001/XMLSchema#integer>   The same with the full datatype URI.

Datatypes are a bit tricky. Let's think of the datatype for floating-point numbers. At an abstract level, the floating-point numbers themselves are different from the text we use to represent them on paper. For instance, the text “5.1” represents the number 5.1, but so does “5.1000” and “05.10”. Here there are multiple textual representations — what are called lexical representations — for the same value. A datatype tells us how to map lexical representations to values, and vice versa.
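
The lexical-to-value mapping can be sketched in Python. The `DATATYPES` table and `value_of` function are hypothetical, and Python's built-in `int` and `float` parsers only approximate the lexical rules of the XML Schema datatypes.

```python
# Hypothetical table mapping datatype names to lexical-to-value functions.
DATATYPES = {
    "xsd:integer": int,
    "xsd:float": float,
}

def value_of(lexical, datatype):
    """Map a lexical representation to the value it denotes under a datatype."""
    return DATATYPES[datatype](lexical)

# Three different lexical representations collapse to a single value.
values = {value_of(text, "xsd:float") for text in ["5.1", "5.1000", "05.10"]}
print(values)  # {5.1}
```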

The semantics of RDF takes language tags and datatypes into account. This means two things. First, a literal value with neither a language tag nor a datatype is different from a literal with a language tag, which in turn is different from a literal with a datatype. These four statements say four different things and none can be inferred from the others:

Literal Semantics
#john foaf:name "John Jones"              John's name is a language-less,
                                          datatype-less raw text value.
#john foaf:name "John Jones"@en           John's name, in English, is John Jones.
#john foaf:name "Jacque Jones"@fr         John's name, in French, is Jacque Jones.
#john foaf:name "John Jones"^^xsd:string  John's name is a string.

So, an untyped literal with or without a language tag is not the same as a typed literal. The second part of the semantics of literals is that two typed literals that appear different may be the same if their datatype maps their lexical representations to the same value. The following statements are equivalent (at least for an RDF application that has been given the semantics of the XSD datatypes):

Datatype Semantics
#john ex:age "10"^^xsd:float
#john ex:age "10.000"^^xsd:float

These mean John's age is 10. That is, the textual representation of the number is beside the point and is not part of the meaning encoded by the triples. Note that if the float datatype were not specified, the triples would not be inherently equivalent, and the textual representation of the 10 would be maintained as part of the information content.
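
Both parts of the literal semantics can be sketched in Python under the model described here. The `Literal` type and `same_value` function are hypothetical, and only the float datatype is given value semantics in this sketch.

```python
from typing import NamedTuple, Optional

class Literal(NamedTuple):
    lexical: str
    language: Optional[str] = None   # e.g. "en"
    datatype: Optional[str] = None   # e.g. "xsd:float"

def same_value(a, b):
    """Value comparison for a datatype-aware application; only xsd:float is handled."""
    if a.datatype == b.datatype == "xsd:float":
        return float(a.lexical) == float(b.lexical)
    return a == b  # otherwise fall back to comparing the terms part by part

# Part one: the same text with different adornments gives different literals.
plain = Literal("John Jones")
english = Literal("John Jones", language="en")
typed = Literal("John Jones", datatype="xsd:string")
print(plain == english, plain == typed, english == typed)  # False False False

# Part two: typed literals are equal when their values are equal.
print(same_value(Literal("10", datatype="xsd:float"),
                 Literal("10.000", datatype="xsd:float")))  # True
print(same_value(Literal("10"), Literal("10.000")))          # False
```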

More on Blank Nodes: Some Caveats

Unlike URIs, which are global, the local identifiers used to name blank nodes are explicitly not global. The same local bnode identifier used in two separate documents can refer to two different things. Still, the identifier itself is arbitrary, and the actual identifier used in any particular case is not a part of the information content of the document.
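
One consequence is that a program combining two documents should keep their blank nodes apart. A minimal Python sketch, assuming each document is given some identifier to prefix its labels with; the `scope_bnodes` function, the label `_:a`, and both documents are hypothetical.

```python
def scope_bnodes(triples, doc_id):
    """Rename blank node labels so they cannot collide with another document's labels."""
    def rename(term):
        return f"_:{doc_id}_{term[2:]}" if term.startswith("_:") else term
    return {(rename(s), p, rename(o)) for s, p, o in triples}

# Two hypothetical documents that both happen to use the label _:a.
doc1 = {("_:a", "foaf:name", '"Lawrence Lessig"')}
doc2 = {("_:a", "foaf:name", '"Joshua Tauberer"')}

merged = scope_bnodes(doc1, "doc1") | scope_bnodes(doc2, "doc2")
for triple in sorted(merged):
    print(*triple)
```

Since the label is arbitrary, renaming it changes nothing about what either document says.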

Anonymous nodes are often used to avoid having to assign people URIs, as in the example above. They're also often used in representing more complex relations:

Blank Nodes for Complex Relations
taubz:me   ex:hasName     _:anon234
_:anon234  ex:firstName   "Joshua"
_:anon234  ex:middleName  "Ian"
_:anon234  ex:lastName    "Tauberer"

Here the anonymous node was used as an intermediate step in the relation between me and the parts of my name. The node represents my name in a structured way, rather than using a single opaque literal value "Joshua Ian Tauberer". RDF only allows binary relations, so it's necessary to express many-way relations using intermediate nodes, and these nodes are often anonymous.
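
Reading such a structured value back takes two steps: follow the relation to the intermediate node, then read the parts off that node. A minimal Python sketch over the table above; the `objects` helper is hypothetical.

```python
triples = [
    ("taubz:me", "ex:hasName", "_:anon234"),
    ("_:anon234", "ex:firstName", "Joshua"),
    ("_:anon234", "ex:middleName", "Ian"),
    ("_:anon234", "ex:lastName", "Tauberer"),
]

def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Step 1: follow ex:hasName from me to the intermediate (blank) node.
name_node = objects("taubz:me", "ex:hasName")[0]
# Step 2: read the parts of the name off that node.
parts = [objects(name_node, p)[0] for p in ("ex:firstName", "ex:middleName", "ex:lastName")]
full_name = " ".join(parts)
print(full_name)  # Joshua Ian Tauberer
```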

This site is run by Joshua Tauberer.