Tim's Weblog
Tim Strehle’s links and thoughts on Web apps, software development and Digital Asset Management, since 2002.
2013-06-25

XHTML+RDFa for structured data exchange

Last week I wrote on Twitter: “Testing my theory that structured data exchange between apps/parties is best done in #XHTML+#RDFa. Human browseable & usable w/ XSLT, XPath.” This is the long version of that tweet:

As a developer in the enterprise DAM software business, I’ve been doing a lot of integration work – with news agency data and newspaper editorial systems, CMS, syndication, image databases and so on. In the early years, it was a joy to see ugly text formats disappear: Almost everyone switched to XML sooner or later. This made parsing and understanding the data so much easier. And we didn’t just consume XML; very early we went all XML for our APIs and data exports. (I alone am guilty of inventing more than ten simple XML formats in the last decade.)

The explosion of custom XML vocabularies (and XML-based standards) has its drawbacks, of course. Back in 2006, Tim Bray wrote in Don’t invent XML languages: “The smartest thing to do would be to find a way to use one of the perfectly good markup languages that have been designed and debugged and have validators and authoring software and parsers and generators and all that other good stuff.”

I don’t think that, in the short term, agreeing on just a handful of vocabularies would mean a lot less work for developers: they would still understand and use each vocabulary differently. My main gripe is that someone has to write code whenever a non-developer (i.e., someone not happy reading raw XML) simply wants to access, read and search the data. This hurts communication, quality control and bug hunting in lots of projects (in ours, at least).

Transparent, visible, browseable, searchable: That’s increasingly how I want the data that our software receives and emits to be. So I’ve started playing with XHTML+RDFa. Tim Bray again: “If you use XHTML you can feed it to the browsers that are already there on a few hundred million desktops and humans can read it, and if they want to know how to do what it’s doing, they can ‘View Source’—these are powerful arguments.” I’d like to add that using HTML opens the data up to Web search engines, and using XHTML specifically allows us to keep working with the fine toolset of XSLT, XPath and xmllint. (See also my post on Linked Data for better image search. And it was Jon Moore who started it all; watch his talk on Building Hypermedia APIs with HTML!)

This week, a customer requested that I deliver a data dump (a few days’ worth of newspaper articles and images) to another software vendor. The exact format wasn’t important yet. So I took the occasion and had a first shot at modelling the content stored in our DC-X DAM as XHTML, with metadata and data structures expressed as RDFa within. This was a bit more complicated than expected: RDFa feels relatively complex (there are various ways to mark up a statement), and I had to think a lot about the metadata schema (I tried to use schema.org types and properties where applicable).
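To make the idea concrete, here is a minimal Python sketch of rendering one story as XHTML with RDFa attributes. It is not the actual DC-X export: the function name story_to_xhtml, the dictionary keys and the choice of schema.org properties (schema:headline, schema:datePublished, schema:articleBody) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without a prefix (as the default namespace).
ET.register_namespace("", XHTML_NS)


def story_to_xhtml(story):
    """Render one story dict (hypothetical keys) as XHTML with flat RDFa."""

    def el(parent, tag, text=None, **attrs):
        # Create a child element in the XHTML namespace.
        child = ET.SubElement(parent, f"{{{XHTML_NS}}}{tag}", attrs)
        child.text = text
        return child

    # The prefix attribute maps "schema:" to the schema.org vocabulary.
    html = ET.Element(
        f"{{{XHTML_NS}}}html", {"prefix": "schema: http://schema.org/"}
    )
    head = el(html, "head")
    el(head, "title", story["headline"])
    body = el(html, "body")

    # One typed resource per story; its properties are direct children.
    article = el(body, "div", typeof="schema:NewsArticle", resource=story["id"])
    el(article, "h1", story["headline"], property="schema:headline")
    # Humans see the display date; machines read the content attribute.
    el(
        article,
        "p",
        story["date_display"],
        property="schema:datePublished",
        content=story["date"],
    )
    el(article, "p", story["body"], property="schema:articleBody")
    return ET.tostring(html, encoding="unicode")


sample = {
    "id": "#story-1",
    "headline": "Sample headline",
    "date": "2013-06-25",
    "date_display": "June 25, 2013",
    "body": "Sample body text.",
}
print(story_to_xhtml(sample))
```

The result opens in a Web browser as an ordinary page, while the property and content attributes carry the machine-readable statements.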

I created one HTML file for each newspaper page, and one (hyperlinked) file for each story on that page. I ended up with relatively simple RDFa, using only these HTML attributes so far: content, datatype, prefix, property, resource, typeof. (The RDFa / Play visualization was quite helpful, by the way.) I avoided nesting objects. To prove that the data can be converted easily, I built a simple XSLT that searches for properties recursively so that it stays independent of the HTML markup (example: <xsl:for-each select=".//*[@property='schema:datePublished']">), and that recursive search got confused when objects were nested.
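The same recursive lookup can be sketched in Python, whose xml.etree.ElementTree module supports the small XPath subset used in the XSLT example. The two sample documents and the helper property_values are hypothetical, and the nested case is only a guess at the kind of confusion meant above: a descendant search cannot tell which object a property belongs to.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat document: every property belongs to the one article.
FLAT = """<html xmlns="http://www.w3.org/1999/xhtml"
      prefix="schema: http://schema.org/ xsd: http://www.w3.org/2001/XMLSchema#">
  <head><title>Sample story</title></head>
  <body>
    <div typeof="schema:NewsArticle" resource="#story-1">
      <h1 property="schema:headline">Sample headline</h1>
      <p>Published
        <span property="schema:datePublished" datatype="xsd:date"
              content="2013-06-25">June 25, 2013</span>
      </p>
    </div>
  </body>
</html>"""

# Hypothetical nested fragment: an image object inside the article,
# carrying its own schema:datePublished.
NESTED = """<div xmlns="http://www.w3.org/1999/xhtml"
     typeof="schema:NewsArticle" resource="#story-1">
  <span property="schema:datePublished" content="2013-06-25">June 25, 2013</span>
  <div property="schema:image" typeof="schema:ImageObject" resource="#image-1">
    <span property="schema:datePublished" content="2013-06-20">June 20, 2013</span>
  </div>
</div>"""


def property_values(root, name):
    """Return the value of every descendant element with property=name."""
    values = []
    for el in root.findall(f".//*[@property='{name}']"):
        # In RDFa the content attribute, when present, overrides the text.
        value = el.get("content")
        if value is None:
            value = "".join(el.itertext()).strip()
        values.append(value)
    return values


flat_root = ET.fromstring(FLAT)
nested_root = ET.fromstring(NESTED)

print(property_values(flat_root, "schema:headline"))
print(property_values(flat_root, "schema:datePublished"))
# The nested case returns the article's date and the image's date together.
print(property_values(nested_root, "schema:datePublished"))
```

With the flat markup the search returns exactly one date. With the nested markup it returns two, and nothing in the result says which belongs to the article, so a converter would have to stop descending at any element that has its own typeof or resource attribute.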

It feels great that the customer (and I) can easily view that data in a Web browser. But I’m an RDF newbie, so the resulting RDFa source is rather ugly, and lots of things are still missing. If I find a way to publish samples on the Web, I’ll post about it here and would love your feedback! (It feels strange to advocate RDFa, by the way, as I still dislike the RDF data model and prefer Topic Maps…)