{"id":1667,"date":"2013-06-25T00:00:00","date_gmt":"2013-06-24T22:00:00","guid":{"rendered":"https:\/\/wwwneu.strehle.de\/tim\/weblog\/archives\/2013\/06\/25\/1619\/"},"modified":"2013-06-25T00:00:00","modified_gmt":"2013-06-24T22:00:00","slug":"1619","status":"publish","type":"post","link":"https:\/\/www.strehle.de\/tim\/weblog\/archives\/2013\/06\/25\/1619\/","title":{"rendered":"XHTML+RDFa for structured data exchange"},"content":{"rendered":"<p>Last week I <a href=\"https:\/\/twitter.com\/tistre\/status\/347713166429540353\">wrote on Twitter<\/a>: \u201cTesting my theory that structured data exchange between apps\/parties is best done in #XHTML+#RDFa. Human browseable &amp; usable w\/ XSLT, XPath.\u201d This is the long version of that tweet:<\/p>\n<p>As a developer in the enterprise DAM software business, I\u2019ve been doing a lot of integration work \u2013 with news agency data and newspaper editorial systems, CMS, syndication, image databases and so on. In the early years, it was a joy to see ugly text formats disappear: Almost everyone switched to XML sooner or later. This made parsing and understanding the data so much easier. And we didn\u2019t just consume XML; very early we went all XML for our APIs and data exports. (I alone am guilty of inventing more than ten simple XML formats in the last decade.)<\/p>\n<p>The explosion of <strong>custom XML vocabularies<\/strong> (and XML-based standards) has its drawbacks, of course. 
Back in 2006, Tim Bray wrote in <a href=\"http:\/\/www.tbray.org\/ongoing\/When\/200x\/2006\/01\/08\/No-New-XML-Languages\">Don\u2019t invent XML languages<\/a>: \u201cThe smartest thing to do would be to find a way to use one of the perfectly good markup languages that have been designed and debugged and have validators and authoring software and parsers and generators and all that other good stuff.\u201d<\/p>\n<p>I don\u2019t think that, in the short term, everyone agreeing on just a handful of vocabularies would mean a lot less work for developers. Developers would still understand and use each vocabulary differently. My main gripe is that <strong>someone\u2019s got to write code<\/strong> if a non-developer (i.e., someone not happy reading raw XML) simply wants to access, read and search the data. This hurts communication and quality control and bug hunting in lots of projects (in ours, at least).<\/p>\n<p><strong>Transparent, visible, browseable, searchable<\/strong>: That\u2019s increasingly how I want the data that our software receives and emits. So I\u2019ve started playing with <a href=\"http:\/\/www.w3.org\/TR\/2012\/REC-xhtml-rdfa-20120607\/\">XHTML+RDFa<\/a>. Tim Bray again: \u201cIf you use XHTML you can feed it to the browsers that are already there on a few hundred million desktops and humans can read it, and if they want to know how to do what it\u2019s doing, they can \u201cView Source\u201d\u2014these are powerful arguments.\u201d I\u2019d like to add that using HTML opens the data up to Web search engines, and using XHTML specifically allows us to keep working with the fine toolset of XSLT, XPath and xmllint. (See also my post on <a href=\"\/tim\/weblog\/archives\/2013\/05\/26\/1608\">Linked Data for better image search<\/a>. 
And Jon Moore started it all: watch his talk on <a href=\"http:\/\/www.infoq.com\/presentations\/web-api-html\">Building Hypermedia APIs with HTML<\/a>!)<\/p>\n<p>This week, a customer requested that I deliver a data dump (a few days\u2019 worth of newspaper articles and images) to another software vendor. The exact format wasn\u2019t important yet. So I took the occasion and had a first shot at modelling the content stored in our DC-X DAM as <strong>XHTML<\/strong> with metadata and data structures expressed as <strong>RDFa<\/strong> within. This was a bit more complicated than expected: RDFa feels relatively complex (there are various ways to mark up a statement), and I had to think a lot about the metadata schema (I tried to use <a href=\"http:\/\/schema.org\/docs\/full.html\">schema.org types and properties<\/a> where applicable).<\/p>\n<p><span>I created one HTML file for each newspaper page, and one (hyperlinked) file for each story on that page. <\/span>I ended up with relatively simple RDFa, using only these HTML attributes so far: content, datatype, prefix, property, resource, typeof. <span>(The <\/span><a href=\"http:\/\/rdfa.info\/play\/\">RDFa \/ Play<\/a><span> visualization was quite helpful, by the way.)<\/span> I avoided nesting objects: the <strong>simple XSLT<\/strong> I built to prove that the data can be easily converted searches for properties recursively, so that it stays independent of the HTML markup (example: &lt;xsl:for-each select=&quot;.\/\/*[@property=&#39;schema:datePublished&#39;]&quot;&gt;), and it got confused when objects were nested.<\/p>\n<p>It feels great that the customer (and I) can easily view that data in a Web browser. But I\u2019m an RDF newbie, so the resulting RDFa source is rather ugly, and lots of things are still missing. If I find a way to publish samples on the Web, I\u2019ll post about it here and would love your feedback! 
(It feels strange to advocate RDFa, by the way, as I still <a href=\"\/tim\/weblog\/archives\/2013\/02\/08\/1555\">dislike the RDF data model and prefer Topic Maps instead<\/a>\u2026)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last week I wrote on Twitter: \u201cTesting my theory that structured data exchange between apps\/parties is best done in #XHTML+#RDFa. Human browseable &amp; usable w\/ XSLT, XPath.\u201d This is the long version of that tweet: As a developer in the enterprise DAM software business, I\u2019ve been doing a lot of integration work \u2013 with news [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_share_on_mastodon":"0"},"categories":[1],"tags":[],"class_list":["post-1667","post","type-post","status-publish","format-standard","hentry","category-weblog"],"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1667","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/comments?post=1667"}],"version-history":[{"count":0,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1667\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/media?parent=1667"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/categories?post=1667"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/tags?post=1667"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}