Tim’s Weblog
Tim Strehle’s links and thoughts on Web apps, managing software development and Digital Asset Management, since 2002.

Your search for “rdfa” found 25 entries.


W3C DAM Group kick-off today! Why I’m interested in schema.org for DAM

Today’s the first telco of the new W3C DAM Group (official name: Digital Asset Management Industry Business Ontology Community Group). A big thank you to Emily Kolvitz for starting it! There’s still time to join in…

The group currently has 31 participants, including several experts well known in the DAM community.

Its mission statement is “to extend schema.org to increase the expressiveness, utility and interoperability of digital media assets.”

I’m participating because DAM interoperability and Linked Data are among my pet topics – but also because I’m now working for an organization that runs maybe a dozen content-focused systems: DAM, WCMS, content production, workflow systems etc., each of which needs to be able to exchange and/or link content items. Having a simple standard for core DAM metadata exchange should help us identify and link assets, migrate them, and build an “enterprise search engine” and “content picker” (think “File/Open” dialog or the “Dropbox Chooser”) on top of all those content stores.

I wrote about schema.org for DAM here:

See also:

Wed, 01 Nov 2017 12:20:00 +0000

schema.org RDFa markup for a DAM hypermedia API

Just a quick update to my previous schema.org DAM markup example. That example was in RDF/XML, but RDFa – RDF markup embedded in HTML – is pretty interesting as well, so here’s the same record in HTML+RDFa.

Click here to see that markup rendered by your browser. The benefit of RDFa is that it’s human readable by everyone with a Web browser, and at the same time machine-readable structured data. See Publish your data, don’t build APIs for more on hypermedia APIs.

Here’s the RDFa source:

<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" /></head>
<body prefix="schema: http://schema.org/">
<div resource="https://www.flickr.com/photos/archbob/22875195123/" typeof="schema:Photograph">
<img src="http://c1.staticflickr.com/1/668/22875195123_1d0a409a41_n.jpg" />
<h1 property="schema:name">Desert Landscape</h1>
<p property="schema:description">Desert Landscape at Big Bend National Park.</p>
<dl>
<dt>Keywords:</dt>
<dd property="schema:keywords">nature, landscape, outdoors, big, texas, desert, bend, dusk, scenic</dd>
<dt>Date created:</dt>
<dd property="schema:dateCreated" content="2014-01-20">Monday, January 20th, 2014</dd>
<dt>Content location:</dt>
<dd><a href="http://sws.geonames.org/5516970/" property="schema:contentLocation" typeof="schema:Place">
<span property="schema:name">Big Bend National Park</span></a></dd>
<dt>Creator:</dt>
<dd><a href="https://www.flickr.com/people/archbob/" property="schema:creator" typeof="schema:Person">
<span property="schema:name">Yinan Chen</span></a></dd>
<dt>Copyright holder:</dt>
<dd><a href="http://www.goodfreephotos.com/" property="schema:copyrightHolder" typeof="schema:Organization">
<span property="schema:name">Good Free Photos</span></a></dd>
<dt>Copyright year:</dt>
<dd property="schema:copyrightYear">2014</dd>
<dt>License:</dt>
<dd><a href="https://creativecommons.org/licenses/by/2.0/" property="schema:license">
Creative Commons Attribution 2.0 Generic (CC BY 2.0)</a></dd>
<dt>Provider:</dt>
<dd><a href="https://www.flickr.com/" property="schema:provider" typeof="schema:Organization">
<span property="schema:name">Flickr</span></a></dd>
</dl>
<div resource="https://www.flickr.com/photos/archbob/22875195123/#original_file" property="schema:associatedMedia" typeof="schema:ImageObject">
Media file:
<a href="http://c1.staticflickr.com/1/668/22875195123_4fced120f0_k.jpg" property="schema:contentUrl">
(<span property="schema:fileFormat">image/jpeg</span>,
<span property="schema:contentSize" content="911577">911 kB</span>,
<span property="schema:width">2048</span>x<span property="schema:height">1387</span> px)</a>
</div>
<div resource="https://www.flickr.com/photos/archbob/22875195123/#thumbnail_file" property="schema:thumbnail" typeof="schema:ImageObject">
Thumbnail file:
<a href="http://c1.staticflickr.com/1/668/22875195123_1d0a409a41_n.jpg" property="schema:contentUrl">
(<span property="schema:fileFormat">image/jpeg</span>,
<span property="schema:contentSize" content="36171">36 kB</span>,
<span property="schema:width">320</span>x<span property="schema:height">216</span> px)</a>
</div>
</div>
</body>
</html>

You can try pasting the source into the EasyRdf Converter, or inspect the RDFa file using the OpenLink Structured Data Sniffer browser extension.
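For a quick look at the machine-readable side without installing an RDF toolkit, here is a minimal sketch that uses only Python’s standard library. It is not a conforming RDFa processor: it ignores subjects, nesting, prefixes and typeof, and simply lists each property attribute with its value. The class name PropertyCollector and the shortened sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Shortened excerpt of the photo record, used as sample input.
SAMPLE = """
<div resource="https://www.flickr.com/photos/archbob/22875195123/" typeof="schema:Photograph">
  <h1 property="schema:name">Desert Landscape</h1>
  <dd property="schema:dateCreated" content="2014-01-20">Monday, January 20th, 2014</dd>
  <a href="https://creativecommons.org/licenses/by/2.0/" property="schema:license">
    Creative Commons Attribution 2.0 Generic (CC BY 2.0)</a>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collects (property, value) pairs; ignores subjects and nesting."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None   # property whose text content is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        # Simplified precedence: explicit content, then a link target, then text.
        for key in ("content", "resource", "href", "src"):
            if key in attrs:
                self.pairs.append((prop, attrs[key]))
                return
        self._open, self._text = prop, []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Closes at the first end tag, which is fine for flat elements only.
        if self._open is not None:
            self.pairs.append((self._open, "".join(self._text).strip()))
            self._open = None


collector = PropertyCollector()
collector.feed(SAMPLE)
for prop, value in collector.pairs:
    print(prop, "=", value)
```

Note how the content attribute wins over the human-readable date text, which is what lets the same page serve both readers and software. For real use, a proper RDFa parser is the better choice.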

Wed, 13 Jan 2016 20:31:00 +0000

schema.org markup for a DAM system photo record

I’ve been talking about RDF and schema.org for DAM interoperability in a previous blog post. What’s been missing was an example.

Here’s what the actual schema.org markup for a random photograph could look like (in RDF/XML notation):

<?xml version="1.0" encoding="UTF-8"?>
<name>Desert Landscape</name>
<description>Desert Landscape at Big Bend National Park.</description>
<keywords>nature, landscape, outdoors, big, texas, desert, bend, dusk, scenic</keywords>
<Place rdf:about="http://sws.geonames.org/5516970/">
<name>Big Bend National Park</name>
<Person rdf:about="https://www.flickr.com/people/archbob/">
<name>Yinan Chen</name>
<Person rdf:about="https://www.flickr.com/people/archbob/">
<name>Yinan Chen</name>
<Organization rdf:about="https://www.flickr.com/">
<ImageObject rdf:about="https://www.flickr.com/photos/archbob/22875195123/#original_file">
<ImageObject rdf:about="https://www.flickr.com/photos/archbob/22875195123/#thumbnail_file">

Not so bad. Looks like something your Digital Asset Management system could produce (and maybe even consume), doesn’t it?

If you want to see what this looks like after having been processed by an RDF parser, you can paste it into the excellent EasyRDF converter.

Update: Here’s the same record in RDFa markup instead of RDF/XML.

Update: See my follow-up post Where do I put search result context in schema.org?

Related: DAM and the Semantic Web – our webinar on Dec 9th

Thu, 03 Dec 2015 23:45:00 +0000

DAM and the Semantic Web – our webinar on Dec 9th

On December 9th, 2015, Margaret Warren, Demian Hess and I will be doing a webinar hosted by the DAM Guru Program: How the Semantic Web Will Affect Digital Asset Management. You’re welcome to join us, or watch the recording later if you cannot attend live.

If you’re already using Linked Data in a DAM context, I’d love to hear from you soon. Maybe I can work some of your examples into the webinar!

I’m not an expert, but I’ve been thinking about this (and doing some experiments) for years. If you want to do some reading on the topic, here’s a timeline of articles about the intersection of DAM and Semantic Web / Linked Data:

Mike Linksvayer (2007) – Digital Asset Management and the Semantic Web: “Microsoft has apparently gone the next several steps by basing Microsoft Media Manager on RDF and OWL. […] IMM RDF model overcomes traditional barriers to metadata sharing between external systems.”

Naresh Sarwan (2012) – Semantic Web: The Key To More Meaningful Large Scale Digital Asset Management?: “Semantic techniques are likely to become more important as the volume of media stored in DAM systems increases exponentially.”

Nigel Cliffe (2012) – Your Guide to the DAM Road Map Part 4: “Semantic web technologies will find their way into the DAM experience. The development of data about data as well as documents on the web that machines can process, transform and assemble, will enable new value to be gained from that data in really useful ways.”

Margaret Warren (2013) – A new system which can help with protecting images from becoming orphans: “[ImageSnippets] approaches metadata authoring and editing from an entirely new point of view using linked data techniques. Not only does the linked data tagging offer improved querying (which can be experimented with in the application) but the techniques used also enable innovative methods for the storage and transport of an image and it's metadata.”

Me (2013) – Linked Data for better image search on the Web: “Make sure there’s a simple HTML page on the Web for each of your images. With essential metadata (license or offer, description, your contact information) embedded in the HTML source code as semantic RDFa markup.”

Ralph Windsor (2013) – Applying Linked Data Concepts To Derive A Global Image Search Protocol (in response to my post): “I suspect that both commercial interests and the complexity of getting everyone in the world to both agree and implement something like this might mean it is some time before it becomes a reality, but the benefits to image suppliers and users alike are clear and that may generate the required momentum to see some gradual progress towards it over time.”

Ralph Windsor (2013) – The Building Blocks Of Digital Asset Management Interoperability: “None of the methods I have described are complex to implement into DAM products as vendors either use them already or (in the case of the Linked Data techniques) they are fairly simple modifications to existing outputs used to populate existing APIs.”

Me (2014) – Web of information vs DAM, DM, CM, KM silos: “We could find a clever, generic way to link information from various systems together so that we can “surf” it in any direction. Linked data in the form of HTML+RDFa is a great way to do this.”

James Rourke (2014) – DAM Report: Henry Stewart London Conference: “Mark [Davey] spoke on the subject of where DAM is going, postulating that DAM will morph into a knowledge based platform, with assets and content being delivered in new ways. […] It was suggested that schema.org should become an industry standard, so that we are all speaking the same language.”

Demian Hess (2014) – DAM and Flexible Metadata Using Semantics: “Digital asset metadata is a challenge because of its variety. […] Several new types of databases have emerged that provide the flexible solutions that businesses require. While document-oriented databases have commanded the most attention, semantic “triple stores” have been gaining adherents.”

Mark Davey (2015) – What has DAM got to do with Web 3.0?: “I will be discussing the future of DAM as it relates to Web 3.0 and the social-network-infused web of data. Exploring DAM's place in the new machine-readable, knowledge-based economy.” (See also his funny video.)

Kim Davis (2015) – The state of DAM: digital asset management in 2015: “DAM, for [Mark] Davey, is "the beating heart of the semantic web." A "highly controlled vocabulary" should power search, and without the expertise of a librarian to implement that, your DAM will fail.”

Me (2015) – RDF and schema.org for DAM interoperability: “I suggest that we encourage the DAM community to move towards the schema.org vocabulary in an RDF syntax.”

If you want to dig deeper, the DAM Foundation lists several posts tagged with Linked Data or Semantic Web, and DAM News has a Semantic Web category.

Update: Ralph Windsor – Webinar: How the Semantic Web Will Affect DAM, 9th December 2015, 10am PT: “The Semantic Web remains the richest source (and currently largely unexplored) opportunity for metadata innovation in Digital Asset Management.”

Update: Mark Davey (2016) – Where DAM has been and where it's going: “I believe that DAM will become the beating heart of the Semantic Web.”

Wed, 25 Nov 2015 22:45:00 +0000

RDF and schema.org for DAM interoperability

There’s no widely-accepted standard for DAM data yet

Digital Asset Management (DAM) systems are the hubs for organizations’ creative content. DAMs need to exchange data with other systems all the time: import creative works and metadata from external content providers, export digital assets and metadata to Web Content Management systems and so on. Sadly, none of the various DAM-related standards (like the Dublin Core Metadata Element Set Lisa Grimm writes about, or IPTC NewsML G2) have been broadly adopted by DAM vendors. At least not broadly enough that you can expect to exchange data between DAM systems without programming effort. Update: OASIS CMIS4DAM has been “in development” for quite a while now, but I’m not too excited about it as participation costs money and I don’t find CMIS particularly easy to implement.

Do we need a new standard?

Inventing a new standard is rarely a good idea. (You’ve probably seen the XKCD comic on standards.) If there is an existing open standard that more or less matches our use case, we had better use that one to benefit from the existing documentation, tools, and adoption.

I suggest that we encourage the DAM community to move towards the schema.org vocabulary in an RDF syntax. This is the stuff that already powers large parts of the emerging Semantic Web. It introduces the DAM to the world of Linked Data.

Why schema.org?

The schema.org vocabulary seems quite sensible. But the great thing about schema.org is that it is supported by large search engines: Bing, Google, and more. Which means the SEO guys are picking it up, so businesses will (finally) want to invest money in structured data! The chance of a lifetime for librarians who were struggling to prove DAM ROI, isn’t it?

And it’s good that this vocabulary isn’t DAM specific. Most of the time, DAM interoperability is about dealing with non-DAM systems. They’re not too likely to support DAM standards. I bet the Web CMS market will move towards schema.org soon (because SEO).

Update: The DAM specific stuff that’s missing from schema.org, like more detailed file rendition information, could be added in the form of a schema.org extension.

Why RDF?

There are various well-supported syntaxes for representing RDF data: RDF/XML, RDFa (embedding RDF within HTML), JSON-LD, and textual formats like Turtle.

That’s a plus – you’ll likely find a syntax that suits you well – but makes it a bit harder to adopt. Even within the RDF/XML format, the same information can be encoded in many different ways. So you’ll likely have to use RDF-aware software (like EasyRDF for PHP) to produce and process RDF. Directly dealing with, say, RDF/XML via XSLT is too hard.
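To make the “many different ways” point concrete, here is a small sketch with two RDF/XML documents that encode the same two statements: one uses a typed node with a child element, the other uses rdf:Description with a property attribute and an explicit rdf:type. The helper functions expand and triples are hypothetical and cover only these two forms. They are not a general RDF/XML parser, and they do not distinguish literal values from URIs.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

TYPED_NODE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:schema="http://schema.org/">
  <schema:Photograph rdf:about="https://www.flickr.com/photos/archbob/22875195123/">
    <schema:name>Desert Landscape</schema:name>
  </schema:Photograph>
</rdf:RDF>
"""

DESCRIPTION = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:schema="http://schema.org/">
  <rdf:Description rdf:about="https://www.flickr.com/photos/archbob/22875195123/"
                   schema:name="Desert Landscape">
    <rdf:type rdf:resource="http://schema.org/Photograph"/>
  </rdf:Description>
</rdf:RDF>
"""


def expand(qname):
    """Turn ElementTree's '{namespace}local' form into a full URI."""
    namespace, local = qname[1:].split("}")
    return namespace + local


def triples(xml_text):
    """Extract (subject, predicate, object) tuples from the two forms above."""
    found = set()
    for node in ET.fromstring(xml_text):
        subject = node.get(f"{{{RDF}}}about")
        # A typed node element implies an rdf:type statement.
        if node.tag != f"{{{RDF}}}Description":
            found.add((subject, RDF + "type", expand(node.tag)))
        # Non-RDF attributes on the node are literal-valued properties.
        for name, value in node.attrib.items():
            if not name.startswith(f"{{{RDF}}}"):
                found.add((subject, expand(name), value))
        # Child elements are properties, with a resource or text value.
        for child in node:
            obj = child.get(f"{{{RDF}}}resource")
            if obj is None:
                obj = (child.text or "").strip()
            found.add((subject, expand(child.tag), obj))
    return found


print(triples(TYPED_NODE) == triples(DESCRIPTION))
```

Code that looks only at the XML element structure sees two unrelated documents here, which is why an RDF-aware library is the safer route.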

The great thing about RDF is its limitless extensibility. You can easily mix schema.org markup with any schema or vocabulary that specifies URIs, be it IPTC Photo Metadata, RightsML or your own custom schema.
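As a rough illustration of that extensibility, the sketch below builds a JSON-LD record in Python that mixes schema.org terms with a custom vocabulary. The ex: namespace URL and the ex:renditionProfile property are made up for this example, and expand_term is a hypothetical helper that mimics only the simplest kind of prefix expansion.

```python
import json

record = {
    "@context": {
        "schema": "http://schema.org/",
        "ex": "http://example.com/dam-schema/",  # hypothetical custom vocabulary
    },
    "@id": "https://www.flickr.com/photos/archbob/22875195123/",
    "@type": "schema:Photograph",
    "schema:name": "Desert Landscape",
    "schema:copyrightYear": 2014,
    "ex:renditionProfile": "web-large",  # made-up DAM-specific property
}


def expand_term(term, context):
    """Expand 'prefix:local' to a full URI if the prefix is declared."""
    prefix, _, local = term.partition(":")
    return context[prefix] + local if prefix in context else term


print(json.dumps(record, indent=2))
for key in record:
    print(key, "->", expand_term(key, record["@context"]))
```

Both kinds of property end up as plain URIs, so a consumer that only knows schema.org can skip the custom term without breaking.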

Who’s in?

Me, of course. Mark Davey of the DAM Foundation is also in favor of schema.org, apparently, see this tweet by Jeff Lawrence or this video. Update: Margaret Warren, too (ImageSnippets). And Lisa Grimm.

How about you?

Related blog posts: Web of information vs DAM, DM, CM, KM silos. XHTML+RDFa for structured data exchange. Dreaming of a shared content store. Update: DAM and the Semantic Web – our webinar on Dec 9th.

Update: For example schema.org RDF/XML markup, see schema.org markup for a DAM system photo record.

Update: Alfresco’s Ray Gauss II on CMIS4DAM – An Open Digital Asset Management Standard.

Fri, 08 May 2015 12:28:00 +0000

Hunting for well-known Semantic Web vocabularies and terms

As a Semantic Web / Linked Data newbie, I’m struggling with finding the right URIs for properties and values.

Say I have a screenshot as a PNG image file.

If I were to describe it in the Atom feed format, I’d make an “entry” for it, write the file size into the “link/@length” attribute, the “image/png” MIME type into the “link/@type” attribute, and a short textual description into “content” (with “@xml:lang” set to “en”). Very easy for me to produce, and the semantics would be clear to everyone reading the Atom standard.
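A minimal sketch of such an Atom entry, built with Python’s standard library. The file URL, identifier, timestamp, size and description text are hypothetical placeholders, and rel="enclosure" is an assumption about how the link would be marked. The id, title and updated elements are included because Atom requires them on every entry.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}id").text = "http://example.com/files/screenshot.png"
ET.SubElement(entry, f"{{{ATOM}}}title").text = "Screenshot"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2014-12-22T10:38:13Z"

# File size and MIME type live on the link element.
ET.SubElement(entry, f"{{{ATOM}}}link", {
    "rel": "enclosure",
    "href": "http://example.com/files/screenshot.png",  # hypothetical URL
    "type": "image/png",
    "length": "48213",  # hypothetical size in bytes
})

# The description goes into content, with its language declared.
content = ET.SubElement(entry, f"{{{ATOM}}}content", {XML_LANG: "en", "type": "text"})
content.text = "Screenshot of the search results page."

atom_xml = ET.tostring(entry, encoding="unicode")
print(atom_xml)
```

Every property here is just an element or attribute name defined by the Atom standard, so no URIs have to be looked up.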

Now I want to take part in the “SemWeb” and describe my screenshot in RDFa instead. (In order to allow highly extensible data exchange between different vendors’ Digital Asset Management systems, for example.) But suddenly life is hard: For each property (“file size”, “MIME type”, “description”) and some values (“type: file”, “MIME type: image/png”, “language: English”) I’ve got to provide a URL (or URI).

I could make up URLs on my own domain – how about http://strehle.de/schema/fileSize ? But that would be missing the point and prevent interoperability. How to Publish Linked Data on the Web puts it like this: “A set of well-known vocabularies has evolved in the Semantic Web community. Please check whether your data can be represented using terms from these vocabularies before defining any new terms.”

The previous link lists about a dozen vocabularies. There’s a longer list in the State of the LOD Cloud report. And a W3C VocabularyMarket page. These all seem a bit dated and incomplete: None of them link to schema.org, one of the more important vocabularies in my opinion. (Browsing Semantic Web resources in general is no fun, you run into lots of outdated stuff and broken links.) And I haven’t found a good search engine that covers these vocabularies: I don’t want to browse twenty different sites to find out which one defines a “file size” term.

I’m pretty sure the Semantic Web pros know where to look, and how to do this best. Please drop me a line (e-mail or Twitter) if you can help :-)

Update: The answer is the Linked Open Vocabularies site. Check it out!

For the record, here’s what I found so far for my screenshot example:

“file size”: https://schema.org/contentSize

“MIME type”: http://en.wikipedia.org/wiki/Internet_media_type or http://www.wikidata.org/wiki/Q1667978

“description”: http://purl.org/dc/terms/description or https://schema.org/text

“type: file”: http://en.wikipedia.org/wiki/Computer_file or http://www.wikidata.org/wiki/Q82753, or more specific: http://schema.org/MediaObject or http://schema.org/ImageObject or even http://schema.org/screenshot

“MIME type: image/png”: http://purl.org/NET/mediatypes/image/png or http://www.iana.org/assignments/media-types/image/png

“language: English”: http://en.wikipedia.org/wiki/English_language or http://www.lingvoj.org/languages/tag-en.html or https://www.wikidata.org/wiki/Q1860
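Put together, a description of the screenshot could look roughly like the N-Triples produced by the sketch below. The subject URL, file size and description text are hypothetical. The predicate http://purl.org/dc/terms/format is not from the list above; it is swapped in here as an assumption, because the list names the concept “MIME type” but no property for attaching it. The language is expressed as a language tag on the literal rather than as a URI.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
subject = "http://example.com/files/screenshot.png"  # hypothetical file URL


def iri(value):
    """Format a URI as an N-Triples IRI term."""
    return f"<{value}>"


def literal(value, lang=None):
    """Format a string as an N-Triples literal, optionally language-tagged."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


# Note: http://schema.org/ and https://schema.org/ are distinct IRIs in RDF;
# the terms are copied here as they appear in the list above.
statements = [
    (RDF_TYPE, iri("http://schema.org/ImageObject")),
    ("https://schema.org/contentSize", literal("48213")),
    ("http://purl.org/dc/terms/format",  # assumed predicate, not from the list
     iri("http://www.iana.org/assignments/media-types/image/png")),
    ("http://purl.org/dc/terms/description",
     literal("Screenshot of the search results page.", "en")),
]

ntriples = "\n".join(f"{iri(subject)} {iri(p)} {o} ." for p, o in statements)
print(ntriples)
```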

Mon, 22 Dec 2014 10:38:13 +0000

Web of information vs DAM, DM, CM, KM silos

I have spent years of my life making our software work with other software, and I think we have a problem: The “enterprise” is managing overlapping information in disparate systems that don’t interoperate well. There are lots of system flavors: DAM (interesting stuff like photos, videos, articles). DM (boring stuff like forms, business letters, emails). CM for publishing on the Web. KM holds experts’ contact info and instructions. CRM, employee directories, project management tools, file sharing, document collaboration… Each one with a different focus, but with overlapping data.

Now one system’s asset metadata can be another system’s core asset… Take the Contact Info fields from the IPTC Photo Metadata standard, for instance: When a photographer’s phone number changes, will you update it in your DAM system? How many places will you have to update it in the DAM – is it stored in a single place, or has it been copied into each photo? You’ll probably just update your address book and ignore the DAM. A DAM system simply isn’t a good tool for managing contact information. But it still makes sense for it to display it…

For a more complex example, here’s a typical scenario from our customers: A freelance journalist submits a newspaper article with a photo. It’ll be published in print and online, copied into the newspaper archive, and the journalist is going to get paid. Now when an editor sees that nice photo in her Web CMS and wants to reuse it, can she click on it to see 1) the name of the editor who placed it in the print edition, 2) the photo usage rights, 3) the amount paid to the journalist for the current use, and 4) the journalist’s phone number? No, she can’t. The data for 1) is stored in the print editorial system, 2) in the DAM (rights) and the DM system (contracts), 3) in the SAP accounting system, and 4) in the employee directory.

Of course, all of this can be made to work since each system has some sort of API. With one-off interoperability hacks, for which you need a programmer who’s familiar with the systems involved! Incompatible information silos are hurting the business and wasting a lot of developer time. This is a known problem, and the subject of two more acronyms: II = Information Integration, and MDM = Master Data Management. As a software developer, I see two possible solutions:

First, going back to a monolithic system that does everything at once is not a solution. Neither its user interface nor its backend implementation would be well-suited to the host of different tasks that users need software for.

But we could find a clever, generic way to link information from various systems together so that we can “surf” it in any direction. Linked data in the form of HTML+RDFa is a great way to do this, see my post Publish your data, don’t build APIs. (And Lars Marius Garshol on Semantic integration in practice.)

Or a much more complicated (but fascinating) solution: Product developers stop rolling their own databases and assume they’re going to operate on a shared datastore that is created and managed by someone else. Their software accesses it through a configurable data access layer. Imagine running WordPress and Drupal simultaneously on top of the same MySQL database, working on the same content! A shared datastore would allow for centralized business rules and permissions. But for practical reasons (performance!), this is likely not going to happen. (A baby step in the right direction: Use LDAP instead of creating your own users and groups database tables. We’ve done this and it works great.) 

In real life, information doesn’t stand alone – it lives inside a web of interlinked data. Until our systems can handle this reality, we’ve got to break it down, remodel and copy it for each siloed system. Let’s try to improve on that!

Update: See also Ralph Windsor – Digital Asset Management And The Politics Of Metadata Integration. Related blog post by me: Dreaming of a shared content store.

Tue, 25 Feb 2014 19:17:42 +0000

Exchanging texts and images elegantly between newsrooms and news agencies

Texts and images that newsrooms receive from external sources, mainly news agencies and picture agencies, arrive in digital form and usually with good metadata. But getting them into the publisher’s production systems often takes too much manual work, and metadata gets lost along the way. The reason is that the publisher’s software has technical limitations or has not been configured accordingly. A return channel to the supplier after publication is either missing or laborious.

Ideally, the editor would have:
… a comprehensive view of planning: when agencies or in-house staff will deliver content on which topics, with a handover into production planning
… a unified view of digital content from all available sources: a portal or search engine that gives access to self-produced texts and images, agency material, offers from other newsrooms or freelancers, and internal and external archives
… one-click import of all content (planned or already existing) into their own production, with complete metadata (creator information, usage rights and remuneration, captions, keywords, link to planning)
… an automated return channel that informs the provider about the planned or completed publication, which greatly simplifies usage statistics, billing and voucher copies
… an easy way to offer their own content to other newsrooms

All of this is technically feasible. What needs to be programmed are elegant, understandable interfaces for content providers (e.g. news agency portals and image databases) and production systems (editorial systems, CMS). The harder part is the necessary standardization of metadata and protocols:

We need conventions on which metadata formats are used, and how (e.g. NewsML G2, RightsML). At the very least, the metadata essential for production (date, embargo, caption, usage rights, copyright) must be exchangeable in a uniform (or compatible) way. Topic-centric work requires going considerably further and creating a shared metadata vocabulary (for people, places, events, topics). That adds considerable value, but it is difficult: the “Semantic Web” has been struggling with the unification of vocabularies and structures for quite some time. News agencies are in the best position to set standards here.

And protocols need to become established for the interfaces through which images and texts are offered. If every provider and customer runs their own data silo with proprietary APIs, no universal content marketplace can emerge. Let’s copy how the Web does it: Every content provider either uses a service provider or hosts their own website, which offers a separate HTML page (at a permanent URL) for each text (or image) and each topic, with links and semantic HTML markup (for metadata and rights, e.g. RDFa). Various parties can build crawlers and search engines on top of that. As an alternative to crawling, XML sitemaps, RSS feeds and PubSubHubbub can be provided. Content and search engines will mostly not be public but will require a login (which, for example, also makes it possible to personalize rights).

(See also: Software for journalism – two ideas ahead of scoopcamp 2013 and Linked Data for better image search on the Web.)

So, who’s in?

Fri, 27 Sep 2013 13:21:02 +0000

Software for journalism – two ideas ahead of scoopcamp 2013

[Sorry for the German blog post – I’ll publish an English version soon.]

I work for Digital Collections, a vendor of DAM systems (and more) whose customers are mainly in the publishing industry. We are watching that industry’s upheaval at close range and thinking about it… Here are two ideas for the “journalism of the future” that I have from my perspective as a software developer and documentalist, and that I would love to help make real. I welcome additions and questions, by e-mail or Twitter or, preferably, in person at scoopcamp 2013 in Hamburg (which is what prompted me to write these ideas down).

(Note: These are my own views; I am not speaking for my employer here!)

1. “Keeping an eye on your topics” – a topic portal for the reader

The current overview (“what interesting things are happening near me and in the rest of the world?”) is easy to get from the printed newspaper, television and radio, or a news site on the Web. Looking up a specific topic on the spot (“Mesut Özil is transferring? I’ve got to read that”) also works quite well via a news site’s search function or via search engines.

What is hard, however, is staying continuously informed about a topic. I am interested, for example, in the subject area “Jugendamt” (youth welfare office) and the topic “Snowden/NSA affair”. My wish as a reader: “I want to notice when the press, radio, television or blogs publish something important on my topics.” So that I can then read, listen to or watch it. Sometimes even for a fee.

For football fans there are plenty of offerings; they don’t miss much. Otherwise, I find a topic page only with luck (and then only for a single news source, here a SPIEGEL ONLINE example). RSS feeds are usually not topic-specific and also come from just one source. And I only rarely go to the trouble of setting up the rather mediocre Google Alerts.

Andreas Fischer asks: “Why isn’t there long since a joint portal of our daily newspapers that, similar to Google News, takes care of forwarding readers to the individual websites?” Paywalls are on the rise, so perhaps a kind of “iTunes Store for publishers’ content” could emerge. But it would have a problem with how fine-grained the content is. For music, pages for artists and albums are the obvious choice. The flood of several tens of thousands of new articles every day has to be grouped sensibly as well in order to offer reader-friendly access. In my opinion, grouping by topic would be ideal.

That is, a website that lists the topics currently covered in the media and offers a page for each topic that links to matching articles and is updated daily (or even more often). Articles on the well-known news websites, but also from good blogs, from archives, background information on Wikipedia, or pointers to television programmes. With publication date, name of the source, headline, length (long/medium/short), author, and a note if the article is behind a paywall. I can have myself notified via RSS feed or e-mail when new links are added.

That’s what I would wish for as a reader. And I think it’s doable!

Update: Here’s a prototype of a topic page for scoopcamp 2013. See also my blog post Journalism: topic-centric work, interlinked stories and helpful software.

2. An open network for providers of images and texts – and a search engine for the editor

Newspapers have never consisted solely of self-produced content. Freelancers, external contributors, correspondents, news agencies and picture agencies supply material, and a paper’s own production is in turn offered to others. The Internet and digital photography dramatically simplify distributing content, and publishing it. There are more and more potential providers and buyers of images and texts.

Bringing them together and enabling a simple exchange of content (including metadata on publication rights, fees and planning) is not that easy, though. My approach: Providers should make their content available on web pages (usually password-protected), following a few simple conventions for the data format (HTML+RDFa).

This enables others (e.g. publishers, agencies) to build search engines for these offerings using proven “crawler” technology. (Such a search engine can of course also include one’s own internal archives.) Ideally, when in-house or agency images are missing, the editor then finds the pictures by the freelance photographer who happened to be on the scene and offers them to everyone through the network. Or who uses the network to publish an offer to take photos wherever he is today.

Such a network would be open to any participant (who would of course have to come to an agreement on the use of the content) and would not depend on proprietary software or a central authority.

On this topic, see also my blog posts Linked Data for better image search on the Web and Linked Data for public, siloed, and internal images.

What do you think? Nobody needs it, it doesn’t pay off? Or is there something here that we can tackle together?

Wed, 04 Sep 2013 22:10:18 +0000

IndieWebCamp: Principles

IndieWebCamp – Principles:

“Own your data.

Use visible data for humans first, machines second.

[…] Whatever you build should be for yourself. If you aren't depending on it, why should anybody else?

[…] The more your code is modular and composed of pieces you can swap out, the less dependent you are on a particular device, UI, templating language, API, backend language, storage model, database, platform.

[…] We should be able to build web technology that doesn't require us to destroy everything we've done every few years in the name of progress.”

Great principles for all content-centric software, not just the IndieWeb. “Data for humans first, machines second” sounds like RDFa to me…

Thu, 29 Aug 2013 11:44:24 +0000

Publish your data, don’t build APIs

I’m trying to find a good “elevator pitch” for building hypermedia APIs with HTML. How about this:

Don’t build an API – publish your data instead: easy to read for both humans (not just developers) and software, and easy to link to.

After providing read access, the next step is to enable others to modify your data, manually as well as through software. That’s what we would call an API, of course. But I think it helps if you focus on making your data available instead of starting with “let’s build an API”. (I’m tired of APIs, as explained in my Linked Data for better image search blog post.)

Once the data is out there, everyone can “surf your Web of content” (including search engines if you let them). And developers can write code to automate, to glue separate data sources together, to mash them up.

In my opinion, XHTML+RDFa is the best way to reach that goal. But even if you disagree with my choice of format, I hope you can agree with the general point.

Making data more visible has long been a favorite topic of mine. A decade ago, I wrote a simple PHP script that made it easy to browse an Oracle database, because I hated how my valuable data was hidden behind arcane Oracle tools or the sqlplus command line. (Apparently, some people are still using that script. I guess I should start working on it again, and add RDFa and JSON to it.)

Update: Mike Amundsen comments “don't just tell them what's there (data), show what they can do (actions)”. He’s right, this is missing from my pitch. Don’t stop at publishing your data – let people work with it, and make the actions as easy to discover as the data itself!

Update: See also Ruben Verborgh’s The lie of the API.

Mon, 19 Aug 2013 08:10:50 +0000

HTML Hypermedia API resources

One year ago, I wrote on Twitter that “my next API will be semantic XHTML”. Since then, I’ve been thinking a lot about Hypermedia APIs with HTML (and have done some prototyping). My dream API would use XHTML with RDFa, link to Atom feeds and offer an alternative JSON-LD representation.

Here are a few articles on that topic that made me think:

It all started for me with Using HTML as the Media Type for your API by Jon Moore. Make sure to read this. And the “ugly website” Rickard Öberg quote tweeted by Stefan Tilkov.

Combining HTML Hypermedia APIs and Adaptive Web Design by Gustaf Nilsson Kotte is also a great read.

Then watch the full talk (53 minutes) by Jon Moore on Building Hypermedia APIs with HTML.

If you’ve got some time left, I highly recommend the RESTful Web Services book by Leonard Richardson and Sam Ruby. It already said this, back in 2007: “It might seem a little odd to use XHTML […] as a representation format for a web service. I chose it […] because HTML solves many general markup problems and you’re probably already familiar with it. […] Though it’s human-readable and easy to render attractively, nothing prevents well-formed HTML from being processed automatically like XML.” (By the way, the follow-up RESTful Web APIs is going to be published next month.)

I haven’t read the book Building Hypermedia APIs with HTML5 and Node by Mike Amundsen yet, but it sounds interesting.

Please let me know if I missed out on something important…

Update (2017-12-01): See my follow-up post Publish your data, don’t build APIs.

Update (2018-05-31): Jason Desrosiers – The Hypermedia Maturity Model.

Wed, 14 Aug 2013 20:29:51 +0000

RDFa tools and resources

I’m currently learning/exploring RDFa (try searching my blog for “rdfa”). As a total newbie to the world of RDF and RDFa, these tools and resources have been helpful so far:

First, the W3C RDFa 1.1 Primer is easy reading, a great introduction to RDFa. And it links to the full specifications (which are also well-written).

The W3C RDFa 1.1 Distiller and Parser is a Web page where you enter a URL, then it summarizes the RDFa data it finds there. Good for verifying your own Web site’s RDFa. (Or try it with one of my blog posts or my home page, http://www.strehle.de/tim/ …)

If you’re like me and prefer to analyze your RDFa from the command line, install the pyRdfa distiller/parser library and run “scripts/localRDFa.py -p URL” (-p means RDF/XML output).

RDFa / Play is a Web page where you type in HTML+RDFa code and, as you type, see it turned into a pretty graph visualization. Nice for playing around with the RDFa syntax.

I’m trying to use common vocabulary if possible, often from the schema.org hierarchy.

Of course, the nice thing about RDFa is that you can always “view source” on others’ pages to see what they’re doing.

Are you into RDFa? Please let me know if I’m missing out on something!

Thu, 08 Aug 2013 11:59:45 +0000

Jonas Öberg: A distributed metadata registry

Jonas Öberg – Developer’s corner: A distributed metadata registry:

“Anyone should be able to run their own registry for their own works or works in which they have an interest.

[…] Standards such as ccREL provide a way in which a user can look up the rights metadata by visiting a URL associated with the work and making use of RDFa metadata on that URL to validate a license. That’s a useful practice, since RDFa provides a machine readable semantic mapping for the metadata while ensuring that the URL could also contain human readable information.

[…] Let’s further imagine that the unique identifier was always a URL.”

Mon, 08 Jul 2013 14:31:17 +0000

XHTML+RDFa for structured data exchange

Last week I wrote on Twitter: “Testing my theory that structured data exchange between apps/parties is best done in #XHTML+#RDFa. Human browseable & usable w/ XSLT, XPath.” This is the long version of that tweet:

As a developer in the enterprise DAM software business, I’ve been doing a lot of integration work – with news agency data and newspaper editorial systems, CMS, syndication, image databases and so on. In the early years, it was a joy to see ugly text formats disappear: Almost everyone switched to XML sooner or later. This made parsing and understanding the data so much easier. And we didn’t just consume XML; very early we went all XML for our APIs and data exports. (I alone am guilty of inventing more than ten simple XML formats in the last decade.)

The explosion of custom XML vocabularies (and XML-based standards) has its drawbacks, of course. Back in 2006, Tim Bray wrote in Don’t invent XML languages: “The smartest thing to do would be to find a way to use one of the perfectly good markup languages that have been designed and debugged and have validators and authoring software and parsers and generators and all that other good stuff.”

I don’t think that in the short term, everyone agreeing on just a handful of vocabularies means a lot less work for developers. Developers would still understand and use the vocabulary differently. My main gripe is that someone’s got to write code if a non-developer (i.e., someone not happy reading raw XML) simply wants to access, read and search the data. This hurts communication and quality control and bug hunting in lots of projects (in ours, at least).

Transparent, visible, browseable, searchable: That’s how I increasingly want the data our software receives, and that it emits. So I’ve started playing with XHTML+RDFa. Tim Bray again: “If you use XHTML you can feed it to the browsers that are already there on a few hundred million desktops and humans can read it, and if they want to know how to do what it’s doing, they can “View Source”—these are powerful arguments.” I’d like to add that using HTML opens the data up to Web search engines, and using XHTML specifically allows us to keep working with the fine toolset of XSLT, XPath and xmllint. (See also my post on Linked Data for better image search. And Jon Moore who started it all, watch his talk on Building Hypermedia APIs with HTML!)

This week, a customer requested that I deliver a data dump (a few days’ worth of newspaper articles and images) to another software vendor. The exact format wasn’t important yet. So I took the opportunity and had a first shot at modelling the content stored in our DC-X DAM as XHTML with metadata and data structures expressed as RDFa within. This was a bit more complicated than expected: RDFa feels relatively complex (there are various ways to mark up a statement), and I had to think a lot about the metadata schema (I tried to use schema.org types and properties where applicable).

I created one HTML file for each newspaper page, and one (hyperlinked) file for each story on that page. I ended up with relatively simple RDFa, using only these HTML attributes so far: content, datatype, prefix, property, resource, typeof. (The RDFa / Play visualization was quite helpful, by the way.) I avoided nesting objects: The simple XSLT I built to prove that the data can be easily converted searches for properties recursively, to remain independent of the HTML markup (example: <xsl:for-each select=".//*[@property='schema:datePublished']">), and got confused if objects were nested.
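The recursive property lookup described above can also be sketched outside XSLT. This is a minimal Python illustration, not the XSLT from the project: the sample markup and its values are invented, and property_values is a hypothetical helper name. Like the XPath expression, it ignores where in the HTML a property sits, and it shares the same weakness with nested objects.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; markup and values are invented for illustration.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" prefix="schema: http://schema.org/">
<body>
<div resource="#story1" typeof="schema:NewsArticle">
  <h1 property="schema:headline">Example headline</h1>
  <p>Published <span property="schema:datePublished" content="2013-06-20">June 20, 2013</span></p>
</div>
</body>
</html>"""


def property_values(xhtml, name):
    """Return the values of all elements carrying the given RDFa property."""
    root = ET.fromstring(xhtml)
    values = []
    # iter() walks the whole tree, like ".//*" in XPath, so the result
    # does not depend on where in the HTML the property is placed.
    for element in root.iter():
        if element.get("property") == name:
            # In RDFa, a content attribute takes precedence over the element text.
            value = element.get("content")
            if value is None:
                value = "".join(element.itertext()).strip()
            values.append(value)
    return values


print(property_values(SAMPLE, "schema:datePublished"))
print(property_values(SAMPLE, "schema:headline"))
```

Because the lookup is flat, two nested objects using the same property name would end up in one list, which is probably the kind of confusion mentioned above.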

It feels great that the customer (and I) can easily view that data in a Web browser. But I’m an RDF newbie so the resulting RDFa source is rather ugly, and lots of things are still missing. If I find a way to publish samples on the Web, I’ll post about it here and would love your feedback! (It feels strange to advocate RDFa, by the way, as I still dislike the RDF data model and prefer Topic Maps instead…)

Tue, 25 Jun 2013 13:04:30 +0000

Linked Data for public, siloed, and internal images

Ralph Windsor discusses my previous blog post on DAM News – Applying Linked Data Concepts To Derive A Global Image Search Protocol. He finds better words than I did, rephrasing my suggestion as “a universal protocol where images get described like web pages (HTML) so you can crawl them using search engine techniques”.

Ralph points out that large commercial image sellers might not want to participate in an open network: “Allowing their media out into the open for some third party to index – who they probably regard with wary suspicion (e.g. Google) is likely to be a step too far.” Maybe. Although they’ll go where the customers are – a Google Images search for “airport hamburg 92980935” turns up Getty Images image #92980935, so I assume that Getty Images wants Google to crawl their database. If an open image network emerges on the public Web, the commercial platforms will want to become a part of it once it reaches critical mass. What’s more, one of them could even embrace the change and start building the best image search engine that crawls the Web! (A bit like the Getty Images Flickr cooperation but without the need to copy the images over into their database.) 

But “out in the open” is an important point: Many images (and other content types) will always be restricted to limited groups of users. Still, this is no reason to invent a complicated API for accessing them: In intranets, lots of non-public documents are available as HTML, allowing users and internal search engines to easily access them. You can do the same for image metadata – restrict access to the local network, require username and password (or API key, authorization token etc.) as you see fit, but serve it to authenticated search engines (and users) as HTML + RDFa anyway.

A Web of images (to paraphrase Mike Eisenberg) with rich metadata that’s easy to read for machines and humans? I have no idea whether we’ll actually get there in the near future, but that’s what we should aim for!

Wed, 05 Jun 2013 21:21:42 +0000

ImageSnippets | A Metadata Authoring System for Images

“ImageSnippets™ is a system for creating structured, transportable metadata for your images. It can be used as a digital asset management tool as well as an image/metadata publishing platform.”

Take a look at the help pages, and read Margaret Warren’s post introducing ImageSnippets to the iptc-photometadata Yahoo! Group – a new system which can help with protecting images from becoming orphans:

“ImageSnippets is a bit of a swiss-army knife prototype at the moment with many new types of terms and features not typically found in current metadata editing environments.

[…] The system creates an HTML+RDFa file containing a link to the image AND all of it's metadata is represented as structured data in the file.”

I like that it combines public, application-level and personal datasets. That you can reference an image by its URL, i.e. you don’t have to upload it and can still add metadata for it. (Reminds me of the DAM Value Chains – Metadata article by Ralph Windsor: “separate a digital file from metadata and other associated asset data so you could more easily delegate the task of managing it.”) And I love that it publishes RDFa!

Tue, 28 May 2013 20:57:07 +0000

Linked Data for better image search on the Web

Today, searching the Web for an image that you’re allowed to use in public (either at no cost or after paying a license fee) is a suboptimal experience. Web search engines like Google or Bing turn up images with unclear rights or of poor quality. Specialized “silos” like Getty Images or iStock Photos work well for professionals but only find those images that were submitted to them on their terms.

(An interesting alternative approach is the German i-picturemaxx (APIS) network that allows distributed searches across a network of servers, but is closed / “pay to play” and based on proprietary technology.)

I think the future lies in publishing better image metadata on the Web, and better image search engines that make use of that metadata. Whether you’re a pro photographer, a hobbyist or a news agency – make sure there’s a simple HTML page on the Web for each of your images. With essential metadata (license or offer, description, your contact information) embedded in the HTML source code as semantic RDFa markup. Then let the search engine crawlers do their job. If they don’t pick up and make good use of that metadata, let’s build a new image search engine that does!
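To make “a simple HTML page for each of your images” a bit more concrete, here is a rough Python sketch that renders such a page. Everything in it is an assumption for illustration: the image_page function, the example.org URLs, and the particular choice of schema.org properties (contentUrl, description, license, creator). A real page would probably carry more metadata.

```python
from html import escape


def image_page(page_url, image_url, description, license_url, creator_name):
    """Build a minimal HTML+RDFa page describing one image (illustrative only)."""

    def e(text):
        # Escape values so they are safe inside attributes and element content.
        return escape(text, quote=True)

    return f"""<!DOCTYPE html>
<html prefix="schema: http://schema.org/">
<head><meta charset="utf-8" /><title>{e(description)}</title></head>
<body>
<div resource="{e(page_url)}" typeof="schema:ImageObject">
  <img property="schema:contentUrl" src="{e(image_url)}" alt="{e(description)}" />
  <p property="schema:description">{e(description)}</p>
  <p>License: <a property="schema:license" href="{e(license_url)}">{e(license_url)}</a></p>
  <p>Contact: <span property="schema:creator" typeof="schema:Person"><span property="schema:name">{e(creator_name)}</span></span></p>
</div>
</body>
</html>"""


page = image_page(
    "http://example.org/images/rose",            # hypothetical page URL
    "http://example.org/files/rose-large.jpg",   # hypothetical image file
    'Rose & "thorns"',
    "http://creativecommons.org/licenses/by/3.0/",
    "Jane Doe",
)
print(page)
```

The page stays readable in a browser, while a crawler that understands RDFa can pick up the license and contact details from the same markup.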

Sounds too simple? I’m actually a Semantic Web skeptic. Cory Doctorow’s 2001 criticism is still very much valid and explains why the “SemWeb” hasn’t taken off yet. But I think it could work here: Image licensing is an existing market with some money on the table. There is an incentive for both producers and consumers of digital images; finding the right photo is hard and copyright and licensing become increasingly important. (Plus it helps that it’s potentially a global market with few barriers: If you find the perfect photo of a rose, it shouldn’t matter that it was taken by an amateur who lives on a different continent and doesn’t speak English.)

What is difficult, and will remain so, is getting content creators to take the time to add meaningful, structured metadata. And to make their metadata play along well with other creators’. People describe things in different words: There’ll never be perfect alignment. But some common usage should evolve once the benefits become obvious (think folksonomy and SEO).

These things are also difficult, but we can do something about them: Reusing and improving common vocabularies and combining them with our own, custom terms. Building and spreading software that makes metadata editing and vocabulary juggling easy, or even fun. Agreeing on the protocols and formats to be used for publishing metadata on the Web, and having software support them. Getting existing or new image search engines to use the metadata. And helping creators and customers make transactions. 

Lots of work to do. But I think publishing and crawling metadata on the open Web are the critical first step.

The protocol and format should be HTTP and HTML with RDFa: HTTP and HTML (and the ecosystem of browsers and search engines) have proven to work well at “Web scale”, with millions of producers and billions of consumers of information. HTML is readable by any human with a Web browser, which is its killer feature. And RDFa seems to win the race against microdata for semantic markup within HTML. (The current discussion on embedded metadata in image files is important as well, but in HTML it’s so much easier to access and modify that I see it as the primary data source.)

Note that I don’t care whether image distributors offer an API. As a developer, I’m getting tired of APIs (at least for read access). Imagine you have three sources for image metadata; one offering a CMIS API, one implementing OAI-PMH and the third being the Getty Images API. How many pages of documentation are you going to read, how much development time are you going to spend until you can do a simple keyword search and list essential metadata from each? (And once you’re done, how about the other 215 photo Web service APIs?)

What do you think – am I aiming too high, missing something, or am I on the right track? I’d love to hear your thoughts.

Update: Ralph Windsor replies – Applying Linked Data Concepts To Derive A Global Image Search Protocol. My follow-up: Linked Data for public, siloed, and internal images. And an in-depth article by Ralph Windsor: The Building Blocks Of Digital Asset Management Interoperability.

Sun, 26 May 2013 21:46:18 +0000

Image metadata on the Web: URL as identifier

Before you start thinking about common metadata for your images (creator, date created, caption, license), first consider what I think is the most important piece of metadata: A unique identifier for your image. And please make it a URL. Why?

First, you want to avoid duplicates in search engine results. You’ll be using the same image on different Web pages, possibly with slight variations: Different sizes, file formats, or cropping. Which means that the URL to the image file is not the same. A unique identifier makes sure others can find out these are just renditions or variations of the same image. (Current image search engines often show lots of duplicates. If they don’t make use of our nice identifiers once we add them, we can always roll our own search engine… ☺ Yes, I’m serious.)
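A small Python sketch of what a search engine could do with such identifiers: group hits that share an identifier so that renditions show up as one result. The hit structure, the field names "id" and "file", and the example.org URLs are all assumptions made up for this illustration.

```python
from collections import defaultdict

# Hypothetical crawl results: the same image in two sizes, plus an unrelated one.
hits = [
    {"file": "http://example.org/img/rose-large.jpg", "id": "http://example.org/data/rose"},
    {"file": "http://example.net/blog/rose-thumb.png", "id": "http://example.org/data/rose"},
    {"file": "http://example.org/img/tulip.jpg", "id": "http://example.org/data/tulip"},
    {"file": "http://example.com/unknown.jpg"},  # no identifier published
]


def group_renditions(hits):
    """Group image files by their unique identifier."""
    groups = defaultdict(list)
    for hit in hits:
        # Without an identifier there is nothing to match on,
        # so such a file falls back to being its own group.
        key = hit.get("id") or hit["file"]
        groups[key].append(hit["file"])
    return dict(groups)


for identifier, files in group_renditions(hits).items():
    print(identifier, len(files))
```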

Second reason: A well-groomed image will have lots of metadata. Temporal, geographical, creator and licensor related, subject descriptions, licensing terms. You don’t want to add all this baggage to each Web page the image is used on, so you need a separate place to publish all the metadata for that image. And once you have it, it makes perfect sense to use that place as the permanent home for your image and use its URL as the image’s unique identifier.

Suppose that you’re using that URL/identifier whenever you publish or distribute the image: You put it into your HTML, embed it into the image files, and make sure it doesn’t get lost if you register the image with a registry like PLUS or distribute it through third parties like Flickr or Getty Images. What have you just gained? Well, now you can remain the authoritative source of your image’s metadata! You can fix mistakes, add renditions or links or legal notes and change licensing terms at will because you’re in control of that URL. (Third parties probably won’t recognize your self-hosted metadata yet, but let’s move in that direction.)

To practice what I preach, I have added an RDFa resource attribute to the HTML div containing the blog post’s photo (you might want to view the HTML source code of the previous post). An example:

<div resource="http://www.strehle.de/tim/data/document/doc69wpi6bms01kix6470q" typeof="schema:ImageObject">
<img src="/device_strehle/dev1/2013/05-02/72/65/file69wpi6cfox11c7cgw70q.jpg" />
</div>

With this HTML markup, I’m also telling search engines that the referenced URL is about an image, using the schema.org ImageObject type. (I’m a newbie re schema.org and RDFa, suggestions for improvement are welcome!)
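To show that this markup is easy to pick up by software, here is a minimal Python sketch using only the standard library. It is not a real RDFa parser: it merely collects elements that carry both a resource and a typeof attribute. The class name TypedResourceFinder is made up, and the closing div tag is added so the snippet is complete.

```python
from html.parser import HTMLParser


class TypedResourceFinder(HTMLParser):
    """Collect (resource, typeof) pairs from start tags. Not a full RDFa parser."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "resource" in attributes and "typeof" in attributes:
            self.found.append((attributes["resource"], attributes["typeof"]))


MARKUP = """<div resource="http://www.strehle.de/tim/data/document/doc69wpi6bms01kix6470q" typeof="schema:ImageObject">
<img src="/device_strehle/dev1/2013/05-02/72/65/file69wpi6cfox11c7cgw70q.jpg" />
</div>"""

finder = TypedResourceFinder()
finder.feed(MARKUP)
print(finder.found)
```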

What if someone just downloads the image file, ignoring my lovingly-crafted HTML markup? I want them to see my URL as well. So I’m embedding it in the XMP-plus:ImageSupplierImageID metadata field of the JPEG file using ExifTool:

exiftool -XMP-plus:ImageSupplierImageID=http://www.strehle.de/tim/data/document/doc69wpi6bms01kix6470q IMG_1980.jpg

(This is just a first try; there are probably other metadata fields I should write it to. I’m choosing this field for now because you can see and modify it in Photoshop: File / File Info… / IPTC Extension / Supplier’s Image ID.)
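For more than a handful of files, the ExifTool call above could be wrapped in a small script. This Python sketch reuses exactly the tag from the command line above; the function names exiftool_args and embed_identifier are hypothetical, and it assumes that exiftool is installed and on the PATH.

```python
import shutil
import subprocess


def exiftool_args(identifier, image_path):
    """Build the ExifTool command line that writes the identifier into the file."""
    return ["exiftool", f"-XMP-plus:ImageSupplierImageID={identifier}", image_path]


def embed_identifier(identifier, image_path):
    """Run ExifTool; raises if it is missing or reports an error."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool not found on PATH")
    # Passing a list avoids shell quoting problems with special characters in URLs.
    subprocess.run(exiftool_args(identifier, image_path), check=True)


print(exiftool_args(
    "http://www.strehle.de/tim/data/document/doc69wpi6bms01kix6470q",
    "IMG_1980.jpg",
))
```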

Note that the URL I’m pointing to doesn’t yet exist: I’ll create that page in the next step. For now, I have just added a unique identifier that looks like a URL (so the correct name is probably URI or IRI, can’t get used to that).

For reference, here are a few other places that I don’t fully understand yet, but that look like they should also contain the URL/identifier if the image gets distributed in a suitable format:

EXIF ImageUniqueID.
PLUS LDF Terms and Conditions URL / Licensor Image ID / Copyright Owner Image ID / Image Creator Image ID.
ODRL Asset uid.
schema.org url property.
IPTC NewsML G2 newsItem guid attribute / web (Web address) element.
PRISM url element.
XMP xmp:Identifier / xmpRights:WebStatement / xmpMM:DocumentID.
Dublin Core Metadata Element Set identifier.

(I’m sure there’s more. Yes, this makes my head explode as well. Please tell me that it’s much simpler than that.)

What do you think? I’d love to hear your feedback (@tistre on Twitter; for e-mail addresses see my home page).

Update (2018-09-06): Five years later, I still don’t know… There’s also plus:licensorImageID. See also Christian Weiske – Adding the source URL to an image's meta data.

Wed, 08 May 2013 05:44:37 +0000

Heath, Bizer: Linked Data: Evolving the Web into a Global Data Space

Tom Heath, Christian Bizer – Linked Data: Evolving the Web into a Global Data Space:

“This book gives an overview of the principles of Linked Data as well as the Web of Data that has emerged through the application of these principles. The book discusses patterns for publishing Linked Data, describes deployed Linked Data applications and examines their architecture.”

The Web page contains the whole book, for free. I still dislike RDF triples, but there’s heaps of useful information.

I especially like this one:

“Linked Data commits itself to a specific pattern of using the HTTP protocol. This agreement allows data sources to be accessed using generic data browsers and enables the complete data space to be crawled by search engines. In contrast, Web APIs are accessed using different proprietary interfaces.”

Each big corporation’s information silo uses their own API. That’s crazy. If you don’t want to be open to the public, prevent access by requiring authentication. But don’t force developers to reimplement simple data access (search, read). I’m currently in favor of HTML with semantic markup (probably RDFa)…

Mon, 29 Apr 2013 08:14:06 +0000

Why I prefer Topic Maps to RDF

I enjoy modeling data. As students, we were taught the relational data model (as used by SQL databases) and hierarchical database structures. But the real eye-opener was when our professor started modeling a supposedly simple example: an address book. Very soon, we ran into lots of questions with no easy answers: How are persons and addresses, companies, and other persons actually related? How about several persons sharing the same address? What about the temporal dimension, would you want to keep former addresses or employers? We learned what questions to ask, that there’s no silver bullet for the perfect data model, and how to choose a good compromise.

I did a lot of SQL database modeling, which was fun and powerful and easy to code against, but still relatively limited and complicated. (Think multi-valued fields and the need for separate tables for m:n relations.) So when I first read the Topic Maps specification (XTM 1.0 back in the day) and the TAO of Topic Maps article, I was thrilled. The data structures immediately made sense to me: Every thing can have names, types, properties, and identifiers. Then there’s relations between two or more things, where each thing can play a certain role. Metadata can have its own metadata, and scopes help qualify it. That’s all.
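The data structures just described can be sketched in a few lines of Python. This is a loose illustration, not the Topic Maps standard and not the engine in our software: the class names, fields and example.org identifiers are all invented, and “metadata about metadata” is left out except for scopes on names.

```python
from dataclasses import dataclass, field


@dataclass
class Name:
    value: str
    scope: tuple = ()  # e.g. ("de",) marks a name as valid in German


@dataclass
class Topic:
    identifiers: list
    names: list = field(default_factory=list)
    types: list = field(default_factory=list)
    properties: dict = field(default_factory=dict)

    def name_in(self, scope_item):
        """Return the first name valid in the given scope, else the first name."""
        for name in self.names:
            if scope_item in name.scope:
                return name.value
        return self.names[0].value if self.names else None


@dataclass
class Association:
    type: str
    roles: list  # (role, Topic) pairs; two or more players are allowed


germany = Topic(
    identifiers=["http://example.org/country/de"],
    names=[Name("Germany", ("en",)), Name("Deutschland", ("de",))],
    types=["country"],
)
alice = Topic(["http://example.org/person/alice"], [Name("Alice")], ["person"])
acme = Topic(["http://example.org/company/acme"], [Name("ACME")], ["company"])

# One association with three role players, not just a binary link.
employment = Association(
    "employment",
    [("employee", alice), ("employer", acme), ("location", germany)],
)

print(germany.name_in("de"))
print([role for role, _ in employment.roles])
```

Almost everything is optional here: a topic with one identifier and one name is already a usable entry in a simple list of countries or keywords.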

It took a few years before I could sneak a tiny Topic Map engine into our DAM software (see the blog posts). It still isn’t fully standard conformant but serves us very well: People started using it for simple lists of countries or keywords without even knowing anything about Topic Maps. (This works fine because almost every Topic Map feature is optional.) Some time later, they would notice how powerful and flexible it is: Whether hierarchical thesaurus structures, names in multiple languages, subsets of lists or custom metadata for a topic, it’s easy to think up and implement new stuff. And you don’t have to change database structures or throw away existing data.

When I learned about RDF, it totally didn’t “click” for me. Everything’s a triple? How is this better than “everything’s a row in a table”? Yes, I’m simplifying and probably not getting it – but I know that RDF doesn’t help me think. To me, it’s a low-level abstraction, too technical and too theoretical. There’s too many options for implementing basic use cases, which makes interoperability harder. Topic Maps provide me with a way to think about data structures that makes my work easier, that helps clarify my thinking and communicate it to others.

It’s a bit sad that Topic Maps have never been widely used or even known. In terms of adoption, RDF has certainly won (even though the Semantic Web is failing so far). And I love that RDFa allows embedding data structures into HTML: Now Web service APIs can be built in HTML, to be browsed by humans and still be machine readable (the ability to “view source” is a pillar of the Web). So I’ll go with keeping the data in a Topic Map, but will probably make it available through RDFa. (I hope these two can be made to play nicely together…)

Update: My experimental TopicBank engine runs this blog – see strehle.de now powered by Topic Maps.

Update: See my blog post Topic Maps (as a standard) are dead, I’m afraid.

Update: A must-read in this context is Steve Pepper’s 2002 article Ten Theses on Topic Maps and RDF. Sample: “[Topic maps] association roles also make it possible to go beyond binary relationships. In RDF, assertions are always binary.”

Fri, 08 Feb 2013 08:02:42 +0000

My thoughts on DAM value chains

Ralph Windsor and Naresh Sarwan of DAM News have published an excellent article: The Digital Asset Management Value Chain: The Future Direction Of DAM In 2013 And Beyond describes the current market, with literally dozens of similar vendors “commit[ing] major proportions of their development team’s time […] replicating functionality that a section of their competitors already offer so they won’t be seen to be falling behind in RFPs and pitches”. They think that this situation might change when platforms emerge that allow combining software components from different vendors: These “value chains” would allow customers and integrators to “review, select and assemble custom applications using a variety of component choices available”.

I agree with their market analysis (isn’t the Web CMS market similar?), and the “value chain” theory sounds good. It reminds me of social networks, a market with a small number of competing platforms: Your “social” application or feature will have to integrate with at least one of Facebook, Google+, Twitter or LinkedIn to be broadly adopted.

As a DAM software programmer (or “software architect”, if you’re into fancy titles), I’m wondering how this would turn out in practice: Would there be a “Facebook of DAM systems”, a hosted, ready-made application with basic features, and full control over the APIs and UI integration points that others may plug their code into? Or rather a Lego-like toolbox, a development environment, where you start building UI and features from scratch? (Box is an example of the former, Nuxeo IDE of the latter.)

For the vendor willing to go the “value chain” route, it’s not as easy as it sounds: Integrating and reusing software is actually a lot of work. Non-technical people often take it for granted that software components can be, and are, reused. But dependencies on other code, on the data model, storage layer and user interface make it hard. Often so hard that it’s not worth the effort, so you rewrite from scratch or copy and paste and modify instead. (Additionally, software can usually only be reused in applications written in the same programming language.)

Integrating software that runs as a self-contained executable is a lot easier – that’s why many DAM systems already make use of Solr, Tika, ImageMagick, Ghostscript, FFmpeg, ExifTool, dcraw and the like. Maybe there’s something to learn from this list: Open source software and open standards are much more likely to be adopted. Not just because of the price tag, but also because being dependent on commercial software often seems the bigger risk (it might be abandoned by the vendor, evolve in the wrong direction, become much more expensive, or be badly supported). And a competitor might not even be willing to sell to you. (All of this has happened to us at one time or another…) Adobe is one of the exceptions: they’re large enough and trusted by the DAM market (although quite a few of their integration offerings promised too much or were later abandoned).

Software integration can happen on several levels (see my old post on the five faces of a web application). Here are some thoughts on web services, data models and UI components:

Web service APIs are an important part of integration work (mostly backend to backend, sometimes user interface to backend). CMIS promises to standardize APIs for ECM and DAM use cases, although I personally don’t like it – it seems too complex; I believe standards can and should be simpler, otherwise they won’t be broadly or fully adopted. (Software development went from C-based SDKs to complex SOAP services to the RSS-like AtomPub, each step a great and successful simplification. Let’s not go back and overcomplicate things.) But I think backend API integration is the area that is most easily covered. IFTTT is a nice showcase of what’s already possible. (By the way: In addition to a web service API, our DC-X software comes with a host of Unix command line tools for easy data import/export/manipulation. The Unix pipe is the grandfather of software integration.)

The data model is a bigger problem: Each application has its own data format and metadata structure. (Examples: Which metadata fields mean the same, do they support rich text, multiple values, additional attributes?) This is usually not a problem for a one-way export to another application, just reformat and drop the information not supported by the receiving system. But a full two-way integration can cause a lot of headaches. (At least a common structure should be doable: I imagine a future where each web application makes all of its data available through RDFa in HTML. This offers both a nice human-readable representation of the data, and complete, machine-readable data that you can peek into via “view source”.)
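The “reformat and drop” step of a one-way export might look roughly like this in Python. The field names, the mapping and the rule for flattening multiple values are invented for illustration; they don’t describe any particular system.

```python
# Hypothetical mapping from source field names to the receiving system's names.
FIELD_MAP = {"Headline": "title", "Caption": "description", "Byline": "creator"}


def export_record(record, field_map=FIELD_MAP):
    """Rename known fields, flatten multiple values, and report what was dropped."""
    exported, dropped = {}, []
    for name, value in record.items():
        target = field_map.get(name)
        if target is None:
            # The receiving system has no place for this field.
            dropped.append(name)
            continue
        if isinstance(value, list):
            # Assume the target only supports single values.
            value = "; ".join(value)
        exported[target] = value
    return exported, dropped


record = {
    "Headline": "Airport reopens",
    "Byline": ["A. Writer", "B. Photographer"],
    "InternalWorkflowState": "approved",
}
print(export_record(record))
```

The dropped list is what makes the two-way case hard: whatever gets lost on the way out cannot be restored when the record comes back.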

Finally, customers will want to combine different components in the same user interface. Web-based applications make this technologically possible, but to be usable as a UI widget, features must be specifically programmed with this use case in mind. And web widgets aren’t standardized yet. But things like Mashups, Portlets, the Google Gadgets API and Web Intents point in the right direction. Our bigger customers already express interest in getting or building their own custom user interfaces, a lightweight integration in the browser, powered by JavaScript.

This is very interesting territory. I think of it as the “consumerization of software integration”. I hope there will be enough inventiveness, courage, and incentive for interoperability that some of the energy poured into duplicating features can be spent improving the user experience instead. I’m looking forward to seeing how this will play out (and what role our company will play), and to the follow-up posts promised by DAM News (there’s already one on Asset Manipulation).

P.S.: The book What Would Google Do? by Jeff Jarvis advises you not to offer a product, but to “become a platform” instead (to help users create their own products)… Recommended reading.

Update (2015-04-29): See my blog post Web app interoperability – the missing link.

Update (2016-03-08): My follow-up blog post System architecture: Splitting a DAM into Self-Contained Systems.

Tue, 15 Jan 2013 22:23:19 +0000

Semantic markup for “You can license this image”

Searching the web for images you can actually (legally) use, for commercial or non-commercial purposes, is almost impossible: Google or Bing will show you millions of images, but have no clue under which terms you’re allowed to use them. Lots of “information silos”, from iStockphoto to Getty Images, let professionals search for, and license, rights-cleared images. If you want your photos to be found there, you’ll have to copy them into one (or more) of these sites (see the Flickr / Getty Images cooperation), which means more work for the photographer. And the user or buyer has to search through multiple silos. Since a lot of these silos exist, most searches will miss out on most of the photos out there.

While curated image collections are fine and can offer consistent, high quality, spam-free content, I think there should also be usable image search engines with much greater coverage. With more and more images being put on the web, it would be great if image search engines could index the most important information directly off the referencing HTML page: Title, description, date created, whether the image is free for non-commercial or commercial use, whether and where I can buy a license.

To the user, it should be a simple list of options in, say, Google image search: “Only images which are free for non-commercial use”, “Only images that are free or can be licensed”. (And if Google doesn’t implement this, others can roll their own image search engines.)
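As a rough illustration, the two filter options could be sketched in Python like this. The record structure and its field names are my own assumptions about what a crawler might store per image; they are not an existing search engine API. The sketch also assumes that an image that is free for commercial use is free for non-commercial use, too.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexedImage:
    """Hypothetical index entry built from the referencing HTML page."""
    title: str
    free_noncommercial: bool = False
    free_commercial: bool = False
    license_purchase_url: Optional[str] = None  # where a license can be bought


def is_free_for_noncommercial(image: IndexedImage) -> bool:
    # Assumption: free for commercial use implies free for non-commercial use.
    return image.free_noncommercial or image.free_commercial


def free_for_noncommercial(images: List[IndexedImage]) -> List[IndexedImage]:
    """Filter: only images which are free for non-commercial use."""
    return [i for i in images if is_free_for_noncommercial(i)]


def free_or_licensable(images: List[IndexedImage]) -> List[IndexedImage]:
    """Filter: only images that are free or can be licensed."""
    return [
        i for i in images
        if is_free_for_noncommercial(i) or i.license_purchase_url
    ]


index = [
    IndexedImage("Harbour at dusk", free_noncommercial=True),
    IndexedImage("City skyline", license_purchase_url="https://example.com/buy/skyline"),
    IndexedImage("Unknown terms"),
]
print([i.title for i in free_for_noncommercial(index)])
print([i.title for i in free_or_licensable(index)])
```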

The Semantic Web is trending again and offers great options for marking up metadata within HTML, but unfortunately there’s no “one true way”. What exactly should the HTML markup look like? Would one use WhatWG microdata, schema.org microdata, schema.org RDFa Lite? (As far as I know, PLUS and RightsML cannot be embedded in HTML.)

I have created a separate page with four examples of different ways to mark up an image license. Warning: Since I’m a Semantic Web newbie, they may be wrong or suboptimal…

Example #1: To refer to a Creative Commons license in HTML, you can use “RDFa and the rel=license microformat”, according to this Stack Overflow page on “Semantic HTML markup for a copyright notice”.

Example #2: The WhatWG HTML microdata proposal contains a section on “Licensing works”, with a nice example of an image available under both a Creative Commons and the MIT license – using the microdata format with itemprop=license. 

Example #3: The schema.org CreativeWork type has the properties copyrightHolder and copyrightYear, but no license property. IPTC rNews extends schema.org, adding copyrightNotice and usageTerms. The latter sounds like it could refer to a license URL: “xsd:string | xsd:anyURI | owl:thing. A human or machine-readable statement about the usage terms pertaining to the NewsItem.”

Example #4: Same as above, but in RDFa Lite format instead of microdata (RDFa Lite may also become usable for schema.org markup in the future).

The Google Rich Snippets Testing Tool recognizes only #2 and #3. It likes example #3 best (schema.org in microdata format), but complains about the rNews extension: “Warning: Page contains property "usageterms" which is not part of the schema.”
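To get a feeling for how little a crawler needs in order to pick up such markup, here is a minimal Python sketch that collects license URLs from links carrying rel=license (example #1) or itemprop=license (example #2). It uses only the standard library's html.parser. The sample HTML and the function name are made up for illustration, and the sketch ignores everything a real RDFa or microdata parser would handle (item scopes, prefixes, which image a license applies to).

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a>/<link> elements that declare a license."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag not in ("a", "link") or not href:
            return
        # rel and itemprop are space-separated token lists; a valueless
        # attribute is reported as None, hence the "or ''" fallback.
        rel_tokens = (attrs.get("rel") or "").lower().split()
        itemprop_tokens = (attrs.get("itemprop") or "").split()
        if "license" in rel_tokens or "license" in itemprop_tokens:
            self.licenses.append(href)


def find_license_urls(html):
    """Return all license URLs found in the given HTML string."""
    parser = LicenseLinkParser()
    parser.feed(html)
    parser.close()
    return parser.licenses


# Made-up sample page fragment, not one of the four examples above.
SAMPLE = """
<div>
  <img src="photo.jpg" alt="Harbour at dusk" />
  <a rel="license" href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>
  <a href="https://example.com/about">About</a>
</div>
"""
print(find_license_urls(SAMPLE))
```

A real indexer would also have to decide which image on the page a license link belongs to, and that is the part the structured formats (microdata, RDFa Lite) are meant to solve.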

Do you know of a better markup alternative? Does a license-/rights-aware image search engine already exist? I’m looking forward to your feedback!

Thu, 31 May 2012 09:11:24 +0000

Google Announces Support for Microformats and RDFa

Timothy M. O'Brien at O'Reilly Radar – Google Announces Support for Microformats and RDFa:

"On Tuesday, Google introduced a feature called Rich Snippets which provides users with a convenient summary of a search result at a glance. They have been experimenting with microformats and RDFa, and are officially introducing the feature and allowing more sites to participate. While the Google announcement makes it clear that this technology is being phased in over time making no guarantee that your site's RDFa or microformats will be parsed, Google has given us a glimpse of the future of indexing."

Wed, 13 May 2009 07:09:59 +0000

Introducing RDFa

Bob DuCharme – Introducing RDFa:

"For a long time now, RDF has shown great promise as a flexible format for storing, aggregating, and using metadata. Maybe for too long—its most well-known syntax, RDF/XML, is messy enough to have scared many people away from RDF. The W3C is developing a new, simpler syntax called RDFa (originally called "RDF/a") that is easy enough to create and to use in applications that it may win back a lot of the people who were first scared off by the verbosity, striping, container complications, and other complexity issues that made RDF/XML look so ugly."

Wed, 14 Feb 2007 22:58:49 +0000