2015-05-08

RDF and schema.org for DAM interoperability

There’s no widely-accepted standard for DAM data yet

Digital Asset Management (DAM) systems are the hubs for organizations’ creative content. DAMs need to exchange data with other systems all the time: importing creative works and metadata from external content providers, exporting digital assets and metadata to Web Content Management systems, and so on. Sadly, none of the various DAM-related standards (like the Dublin Core Metadata Element Set Lisa Grimm writes about, or IPTC NewsML G2) have been broadly adopted by DAM vendors. At least not broadly enough that you can expect to exchange data between DAM systems without programming effort.

Do we need a new standard?

Inventing a new standard is rarely a good idea. (You’ve probably seen the XKCD comic on standards.) If there is an existing open standard that more or less matches our use case, we’d better use it and benefit from the existing documentation, tools, and adoption.

I suggest that we encourage the DAM community to move towards the schema.org vocabulary in an RDF syntax. This is the stuff that already powers large parts of the emerging Semantic Web. It introduces the DAM to the world of Linked Data.

Why schema.org?

The schema.org vocabulary seems quite sensible. But the great thing about schema.org is that it is supported by the large search engines: Bing, Google, and more. Which means the SEO guys are picking it up, so businesses will (finally) want to invest money in structured data! The chance of a lifetime for librarians who have been struggling to prove DAM ROI, isn’t it?

And it’s good that this vocabulary isn’t DAM specific. Most of the time, DAM interoperability is about dealing with non-DAM systems. They’re not too likely to support DAM standards. I bet the Web CMS market will move towards schema.org soon (because SEO).

Why RDF?

There are various well-supported syntaxes for representing RDF data: RDF/XML, RDFa (embedding RDF within HTML), JSON-LD, and textual formats like Turtle.

That’s a plus – you’ll likely find a syntax that suits you well – but it also makes adoption a bit harder. Even within the RDF/XML format, the same information can be encoded in many different ways. So you’ll likely have to use RDF-aware software (like EasyRDF for PHP) to produce and process RDF. Directly dealing with, say, RDF/XML via XSLT is too hard.
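
To make this less abstract, here’s a minimal sketch of what a schema.org description of a DAM asset could look like when serialized as JSON-LD – plain PHP and json_encode, no RDF library required. The asset URL and the selection of properties are my own illustration, not an agreed-upon DAM profile:

    <?php
    // Hypothetical DAM asset described with schema.org terms, serialized as JSON-LD.
    $asset = [
        '@context'       => 'http://schema.org/',
        '@type'          => 'ImageObject',
        '@id'            => 'http://dam.example.com/assets/12345',
        'name'           => 'Mayor opens new city library',
        'creator'        => ['@type' => 'Person', 'name' => 'Jane Photographer'],
        'datePublished'  => '2015-05-08',
        'encodingFormat' => 'image/jpeg',
        'contentSize'    => '2.4 MB',
        'keywords'       => 'library, opening, city hall',
    ];

    echo json_encode($asset, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";

RDF-aware tools can load this JSON-LD into a graph, while simpler systems can still read it as ordinary JSON – which is part of the appeal.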

Who’s in?

Me, of course. Mark Davey of the DAM Foundation is apparently also in favor of schema.org – see this tweet by Jeff Lawrence or this video. Update: Margaret Warren, too (ImageSnippets). And Lisa Grimm.

How about you?

Related blog posts: Web of information vs DAM, DM, CM, KM silos. XHTML+RDFa for structured data exchange. Dreaming of a shared content store.

Fri, 08 May 2015 12:28:00 +0000
2015-05-06

From Archive to Editorial System: The Hub for Creative Content

I gave the following talk (slightly revised here, not a word-for-word transcript) on May 5, 2015, at the spring conference of the Verein für Medieninformation und Mediendokumentation (vfm) in Bremen, as part of the press panel on the topic “Vom Archiv- zum Redaktionssystem” (from archive to editorial system). The other two speakers were Christian Wagner, Managing Editor at the WESER-KURIER (which uses our DC-X), and André Maerz, project manager at the Neue Zürcher Zeitung (a HUGO user). The panel was moderated by Jutta Heselmann of WDR.

I work as a software architect at DC (“DC” is the abbreviation we commonly use for Digital Collections). By the way, I’m not a computer scientist but a trained documentalist, which is why I’m especially pleased to be here today.

I’ll start by briefly telling you how we got from the archive system to the editorial system, and then say a few things about the hub for creative content.

Many thanks to the previous speaker, Christian Wagner, for describing how newspapers were still being produced at the WESER-KURIER just twenty years ago. I never got to see that myself.

DC was founded in 1991. Our founders had helped publishers introduce editorial systems. They were thrilled by the new possibilities that digitization offered – and wanted to publish digital publications. The vision back then: “Making the world’s knowledge available at the speed of light.”

“The world’s knowledge” was aiming a bit high. But we were able to divert data from the editorial system and distribute it before printing had even begun. It wasn’t the speed of light, of course, but an ISDN line for remote data transfer – still, we were miles ahead of the analog world of printing and shipping.

The publisher Hubert Burda was fascinated by being able to read FOCUS digitally ahead of print. Our founders personally set up the receiving equipment in his office. That made FOCUS one of the first digital publishing products worldwide – and our first customer.

Delivering information quickly required not just remote data transfer but also a search function. Full-text search was something special back then, so DC developed its own full-text search engine. The DC search form looked much like Google does today: a search box, then the result list. (Later, the archivists at our customers turned it into a huge search form, often with a dozen input fields, which helped them but rather scared off regular users. By now we’re almost back to the Google layout.)

The German press agency dpa found the search function exciting and became an important customer. Our DC3 was one of the first systems worldwide that could receive and search agency stories and pictures digitally. Publishers all over the world bought DC systems.

We never really intended to build an archive system. But the data streaming from news agencies and the editorial system into the DC3 database led almost by itself to a digital, conveniently searchable archive – with old and new pictures, wire stories, articles, ads and full pages. To import and export this data we had to support plenty of formats and offer interfaces to other systems. To this day, that is what’s typical of a DC system. Our customers like to call their DC system a “content hub”, the hub for creative content.

From our perspective, the editorial system in which the content is created was the “top dog” at the publishing house – vital, but often also expensive, complex and tightly interwoven with the peculiarities of the analog print world.

Little by little, this picture changed with the triumph of the Web, social media and apps: The end product of editorial work is no longer the printed article but a story that is published at different times in different channels, usually by different systems. And when it comes to supporting formats and interfaces, DC is often better positioned than the editorial system.

Whenever an editorial system was to be replaced (or there was none at all), the question sometimes came up whether one couldn’t write an article directly in the DC system. Why not, actually? The structures and interfaces are there. Only the print-specific layout and other requirements were something we couldn’t cover. With ppi Media we found a partner who took over the connection to InDesign and planning systems.

The combination is called Content-X and is deliberately kept simple – a fairly lean and rather affordable editorial system that several customers already use to produce their newspaper every day.

So we extended our Digital Asset Management system with content creation. Editorial system vendors went the opposite way and extended their software with an archive, agency feeds and connections to other channels.

A central hub for all creative content quite obviously makes sense – to our competitors as well. You don’t necessarily need a DC system for that. (And you don’t strictly have to merge the editorial system and the “content hub”, even if that is today’s topic. The main thing is that the two components work well together.) But I do want to advertise the principle of the “content hub”, the hub for content: search, distribution and content reuse become easier, and you can manage rights and costs centrally.

On the software side, such a “content hub” needs above all: 1) Clean data structures, and as a prerequisite, software that lets you model the data that matters to you. (WordPress would be the wrong software for a newspaper archive – its data model simply can’t express that.)

2) Performance: Look for systems that can cope with your data volume and number of users. You don’t want to introduce a central hub that then slows everything else down.

3) Interfaces are absolutely essential. You need someone who can program new interfaces for you and adapt existing ones – the vendor, a partner, or IT staff in your own company. Your “content hub” software also has to offer ready-made APIs.

4) Flexibility: Output channels and workflows keep changing. Flexible software is important so that the system can quickly be adapted to changed requirements.

Once you have established this central view of content, further exciting possibilities open up beyond the obvious benefits mentioned above:

How about automatically being offered matching articles from your own archive and from other sources while you are writing a text? Yesterday at this conference there was a very nice quote on this from Eckhardt John: “Bring the archives into the middle – better yet, to the beginning – of the production process.” And it’s not just archive content that can be linked, but also content that others are creating right now, perhaps for a different channel or a related topic.

A thematic and regional classification already during production can also generate added value for your customers, e.g. in the form of topic pages.

I don’t believe in artificial intelligence, but I do believe that software can increase the leverage of human intellectual effort. More than ten years ago, together with the company ARCUS, we computed a knowledge base from hundreds of thousands of manually indexed articles; it can suggest quite fitting keywords for new articles. “Big data” happens to be the buzzword of the moment. Using such algorithms sensibly and continuously fine-tuning them is, in my view, part of a documentalist’s job.

Earlier I deliberately did not say “a central system for content” but “a central view of content”: You can never copy all the interesting third-party content into your own system. And you will want to enrich your content with information that belongs in differently structured systems, e.g. from your customer or sports database. In my opinion, the future belongs to Linked Data; at some point your system will become a search engine with access to many different sources. Connecting and merging these sources is also a documentalist’s task.

If everything goes well, your software vendors will hand you a “Swiss Army knife” whose many good tools keep opening up new possibilities. Now I’m looking forward to your experiences and to the next speaker. Thank you very much!

Postscript: See also my blog posts Dreaming of a shared content store, Web of information vs DAM, DM, CM, KM silos, Texte und Bilder elegant austauschen zwischen Redaktionen und Nachrichtenagenturen, and Software für Journalismus – zwei Ideen vor dem scoopcamp 2013.

Wed, 06 May 2015 08:44:00 +0000
2015-04-29

Workflow awareness of DAM systems

Workflow doesn’t

This blog post is inspired by Roger Howard’s excellent, thought-provoking “Workflow doesn’t” critique of the workflow functionality in today’s Digital Asset Management systems.

(I don’t know which systems Roger has worked with. If you want to catch up with the state of DAM workflow engines, here are some links to get you started: Status-based workflows in Canto Cumulus. The MerlinOne workflow engine. ADAM Workflow. DAM News on configurable workflow systems. DAM News on workflow in DAM value chains. Anything else?)

I like Roger’s observations that “exceptions are the rule in production”, exceptions require decisions, and “decision making is something humans excel at”. This reminds me of Jon Udell’s old motto that “human beings are the exception handlers for all automated workflows”. Software that doesn’t embrace this fact will stop the work from flowing.

“Workflow systems are rigid and don’t reflect the constantly changing realities of most businesses” – I can absolutely confirm this. Many times we’ve written code to automate a process and made the customer happy, until new business opportunities required changes and our code kept them from adapting quickly. Some larger customers make sure to have in-house DAM technicians who can code and configure without always having to go through us, the vendor.

Roger argues that DAM software should support power users, not enforce rigid workflow definitions. This is quite interesting: It obviously appeals to me as a developer (i.e., power user), but quite a few of our customers seem rather scared of power users. They’d rather have a “power administrator” and not leave much freedom to the regular user.

To me, another important point in Roger’s article is this: “Valuable integration begins with the user experience, with the frontend.” David Diamond demanded very much the same in Reinventing Digital Asset Management. I have written about frontend Web app interoperability before; it’s hard, but this is the problem we have to solve.

The rest of this post is a scenario that helps me think about what workflow support might mean in practice. It’s a bit lengthy, so feel free to stop reading here :-)

An example DAM workflow

This workflow is common to our newspaper-publishing customers:

A newspaper editor asks a photographer to take pictures of an event. The photographer sends the pictures to the newspaper, and one of those gets printed in the paper.

Workflows consist of processes. Let’s look at the processes involved: First, there’s a planning phase where editors decide which topics and events to cover. One of the available photographers needs to be assigned the task of taking the pictures. The photographer, after shooting, selects the best photos and adds descriptive text and metadata to them. Then she sends the photos to the newspaper. At the newspaper, the favorite picture is chosen, the image cropped and enhanced, placed on the page and sent to the printer. Someone in accounting makes sure the photographer gets paid. And finally, the pictures – with more metadata for better findability – are added to the newspaper’s image archives to allow reuse.

Note that there’s no mention of software so far: This workflow is decades old and doesn’t require any software at all, let alone “workflow engines”. Now let’s see how software can help.

Level 0: Do everything manually

Today, almost every task mentioned above involves software. But often, these tasks are performed in separate systems that don’t talk to each other: Editorial topic planning might happen in Trello, while photographer assignments are tracked in a Google calendar. Photos are sent to the newspaper via e-mail, then manually uploaded into a DAM system. For image retouching, the image is downloaded from the DAM, the retouched version re-uploaded, then manually exported to the editorial system where it gets placed on a newspaper page. The next day, a librarian searches the DAM for each picture that appears in the printed paper and manually adds metadata including the date of publication. Based on that metadata, someone else can search for published images and enter payment data into the accounting system so the photographer gets paid at the end of the month.

In this scenario, there are many isolated software systems. Humans need to know where to look for information, and which data to manually copy between systems.

Level 1: Automation

Pretty soon, people will want to automate parts of these processes: The photographer assignment notification should contain a DAM upload link so that images go directly to the DAM, with automatically-added assignment metadata. The editorial system should tell the DAM which images have been printed in the newspaper, so that the DAM can add publication metadata automatically, move the image into the long-term image archive, and send payment data to the accounting system.
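
Just to make the “publication feedback” part concrete, here’s a toy sketch of such an automated step – the function and field names are invented for illustration, this is not actual DC-X code:

    <?php
    // Sketch: when the editorial system reports that an image was printed,
    // the DAM could add publication metadata and decide which follow-up
    // steps to trigger automatically.
    function onImagePublished(array $assetMetadata, $publicationDate, $pageNumber)
    {
        $assetMetadata['published_on'] = $publicationDate;
        $assetMetadata['page_number']  = $pageNumber;

        // In a real system these would kick off archiving and accounting exports.
        $followUpActions = ['move_to_archive', 'notify_accounting'];

        return ['metadata' => $assetMetadata, 'actions' => $followUpActions];
    }

    // The daily front-page logo gets "published" too - the automation cannot
    // tell it apart from a real news photo, which is exactly the kind of
    // exception a human would catch.
    $result = onImagePublished(['title' => 'Front page logo'], '2015-04-29', 1);
    print_r($result);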

A lot of time can be saved by automating processes. But of course there are drawbacks, too: Automation implies the assumption that we always want the same things to happen. It means taking human oversight and decisions out of the loop.

What can go wrong? The photographer might send pictures not related to the current assignment, so the automatically-attached metadata is wrong. Not every published image may be copied into the archives for reuse. And humans would know that they can skip the logo that’s in the paper every single day, even though it technically is a published picture.

This kind of automation is usually neither visible to end users, nor can they stop it from happening. Changing automated processes – to better align them with always-changing business processes – involves technicians, whether processes are “hardcoded” in software or configurable.

And of course, no software works perfectly all of the time. When an automated process fails, you have a whole new class of problems: possible data loss, follow-up processes that already have run with incomplete input data, the difficulty of manually working around the problem while it persists, and cleaning up afterwards.

Level 2: Workflow awareness

Even with automation, the software has no notion of the overarching workflow. It isn’t aware of the context, so it cannot present the context to users. Human communication revolving around the workflow needs to happen “out of band”, i.e. via phone and e-mail, outside of the systems the information is living in.

Which information can get lost in our example workflow? Well, the photographer might want to communicate something to the Photoshop guy (“make sure to blacken out the license plate”), or to the accounting staff (“the editor agreed on paying twice the standard fee for this assignment”). She might have to alert the paper that these pictures must not be reused, an exception from the rule that all published pictures are marked as “available for reuse”. To make this last one more difficult, let’s assume that she becomes aware of this after she has sent the pictures to the newspaper, but before the automated archiving process has run (so she cannot add this information to the image metadata, and the archivist she might call doesn’t yet see the image in the archive).

Each of the persons (and automated processes) involved may have information to add, or questions to ask, or decisions to make that affect other (possibly automated) processes.

A fully workflow-aware DAM system would treat a workflow instance as an asset-related entity with its own metadata. Each asset used within a workflow would display its own routing sheet that shows what kind of workflow this is, which additional information has been attached to it, what happened so far and what’s to happen next (manually or automatically). With appropriate permissions, the user could modify that sheet to add information, change what happens next, or move the asset out of this workflow instance.

The routing sheet with its workflow and process data would probably live in a separate system because real-life workflows cross system boundaries. And the same sheet would appear in all systems involved in the workflow; the Photoshop guy would see it in Photoshop, the photographer in her photo upload app, the accountant in SAP.
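
To make the routing sheet idea a bit more tangible, here’s an invented sketch of the data such a sheet might carry for the photo assignment above – none of this exists as a product feature, it’s just the shape of the information:

    <?php
    // One routing sheet per workflow instance, referencing the assets involved,
    // free-form notes between participants, the history so far, and the next
    // (manual or automatic) steps including per-instance overrides.
    $routingSheet = [
        'workflow_type' => 'photo_assignment',
        'workflow_id'   => 'assignment-2015-1234',
        'assets'        => ['dam:photo-5678', 'dam:photo-5679'],
        'notes'         => [
            ['from' => 'photographer', 'to' => 'image_editing',
             'text' => 'Please blacken out the license plate.'],
            ['from' => 'photographer', 'to' => 'accounting',
             'text' => 'The editor agreed on twice the standard fee.'],
        ],
        'history'       => [
            ['step' => 'assignment_created', 'at' => '2015-04-27T09:00:00+00:00'],
            ['step' => 'photos_uploaded',    'at' => '2015-04-28T18:30:00+00:00'],
        ],
        'next_steps'    => [
            ['step' => 'select_and_retouch', 'type' => 'manual'],
            ['step' => 'archive_published',  'type' => 'automatic',
             'overrides' => ['reuse_allowed' => false]],
        ],
    ];

    echo json_encode($routingSheet, JSON_PRETTY_PRINT), "\n";

Each system involved would render this same sheet in its own user interface and, with appropriate permissions, let the user add notes or change the next steps.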

Can we do this? Does something like this already exist? Or would it be overkill and we should indeed refrain from workflow features and just let the power users define their own automation?

Wed, 29 Apr 2015 22:09:00 +0000
2015-04-24

Find the latest articles on Digital Asset Management via Planet DAM

There are lots of interesting articles being written on Digital Asset Management topics. How do you keep up to date? That’s easy – do as I do, follow the more than 250 DAM-related Twitter accounts and about 100 RSS feeds, and I promise you won’t miss out on anything important! (I’m suffering from “FOMO”, I guess: “fear of missing out”.)

If this seems like too much work, check out my Planet DAM:

This is the place where I link all new DAM-related articles I come across when scanning Twitter and RSS feeds. A manually curated “river of news” for DAM writing. If you’re using a feed reader (highly recommended), you can even subscribe to the Planet DAM RSS feed (well, Atom feed) and get all the latest stuff without even visiting the Planet DAM Web site.

If you know of an article that should be on Planet DAM but isn’t yet, please send me its URL via e-mail or Twitter. Thanks in advance!

Fri, 24 Apr 2015 07:00:00 +0000
2015-04-13

Is there an XML standard for digital magazine replicas?

Many printed newspapers and magazines offer digital replicas to their subscribers – Web or mobile apps that let readers browse the publication in the exact print layout. Often with added functionality, like fulltext search, PDF download or an optional HTML-formatted article view for better readability. You’ll find lots of examples in the Apple Newsstand or Google Play Kiosk. In Germany, these digital replicas are called “ePaper” and are a must-have for publishers because they count towards the official print circulation figures tracked by the IVW.

Technically, replica editions are usually built from PDF files of the printed pages. A decent editorial system will also provide articles and images with structured metadata separately, which means better quality for added functionality compared to content extracted from the PDF. Really good systems can provide page coordinates for articles and images, so that a tap or click on the page can send the reader to the right article or image. (Remember the good old HTML image map?) Companies like Visiolink, 1000°ePaper or Paperlit help publishers create and publish replicas.

Since our DC-X DAM is used as a PDF, article and image archive by many newspaper and magazine publishers, we often have to make it interoperable with “ePaper systems”. (We even built one or two of these systems ourselves.) The main work is in formatting and packaging page, article and image contents and metadata the way the ePaper system needs it. And sometimes we’re on the receiving end, having to ingest such a feed into the DAM.

To clarify, here’s the information that needs to be transported:

  • Edition/issue level: One object per printed edition of a newspaper or magazine. Properties: Edition name (“My magazine”), publication date, month, or issue number, page count
  • Page level: One object per page (or spread, i.e. two adjacent pages), linked to the printed edition it’s in (see above). Properties: Reference to PDF file, page number, size / physical dimensions, section
  • Article level: One object per article, linked to the pages it appears on. Properties: Title (for table of contents), formatted text (ideally HTML or XHTML), coordinates on the page
  • Media level: One object per image, linked to the articles it appears in. Properties: Reference to media file, title, dimensions, content type
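
For illustration, here’s roughly how that hierarchy might look as structured data – an invented sketch built in PHP and dumped as JSON, not an existing standard:

    <?php
    // Invented edition/page/article/image structure mirroring the list above
    // (only the first page is shown).
    $edition = [
        'edition_name'     => 'My magazine',
        'publication_date' => '2015-04-13',
        'issue_number'     => 15,
        'page_count'       => 32,
        'pages' => [
            [
                'page_number' => 1,
                'pdf_file'    => 'pages/page-001.pdf',
                'width_mm'    => 210,
                'height_mm'   => 297,
                'section'     => 'Front page',
                'articles' => [
                    [
                        'title'       => 'Lead story headline',
                        'html_text'   => '<h1>Lead story headline</h1><p>Article text.</p>',
                        'coordinates' => ['x' => 10, 'y' => 20, 'width' => 180, 'height' => 120],
                        'media' => [
                            [
                                'file'         => 'images/lead-photo.jpg',
                                'title'        => 'Photo caption',
                                'content_type' => 'image/jpeg',
                                'width_px'     => 2000,
                                'height_px'    => 1500,
                            ],
                        ],
                    ],
                ],
            ],
        ],
    ];

    echo json_encode($edition, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";

A real standard would of course need to settle details like coordinate units and how articles spanning multiple pages are represented – which is exactly why we keep reinventing this per customer.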

What’s really painful is that we’ve been doing this kind of integration work for almost two decades, and we keep writing customer-specific code from scratch every time because there seems to be no standardized exchange format for this kind of data. The PRISM standards come very close, and IPTC NewsML G2 might also work – but both seem to miss the edition-level and page-level information. Or am I missing something? What would you recommend? I’d love to hear from you!

Mon, 13 Apr 2015 08:50:00 +0000
2015-03-19

Trust your developers’ opinion on technical debt

While “onboarding” a new developer, we get to see our software product, code, development environment and culture through a fresh pair of eyes. There are moments of pride – when we show something to the new guy and he likes how well-executed it is – and of embarrassment. We’ve cut corners in some places, and they not only make us feel ashamed, but also cost the new employee much more time (and our company, more money) to understand and use than the good, well-documented parts.

Technical debt in software development is a common topic: When writing code, we cut corners now to deliver something faster, just like we borrow money to buy something earlier than we could normally afford. As with financial debt, technical debt isn’t bad in itself but it does accumulate, and usually needs to be repaid someday.

There’s much more to building good software than just writing code. When we neglect other things, we can accrue competitive debt, learning debt, performance debt, testing debt, usability debt, architecture debt, innovation debt etc.

I have experienced both ends of the spectrum: There were times when our company invested heavily in research and new product development. It took us two years to lay the foundation for our current DAM system, almost driving our CEO mad because he had to wait so long for something tangible. But there have also been periods with so many projects and urgent feature requests that we tried to save time by skipping one or more of the following: doing research, innovating, planning strategically, writing documentation, testing, doing quality assurance and UX reviews, reviewing code and software architecture, training colleagues and partners.

What I have learned watching these decisions and their long-term effects over more than 15 years is that you should trust your development team with deciding which shortcuts are okay, which ones will cost too much later on, and what past technical debt must be repaid now.

I vividly remember a discussion where the team wanted to invest more time to get something done properly, but wasn’t allowed to because something else was prioritized higher by management. The week we saved back then has cost us at least six weeks since, and keeps costing us time and money. On the other hand, when we built things the right way – with enough time for collaboration, feedback, and proper architecture and design – they turned out to stand the test of time very well and enabled new features years later without much hassle.

Yes, we developers sometimes waste time because we want to play with fancy new technology, or make code more complex than it needs to be. But we’re grown-ups with an understanding of business requirements too. Make sure to openly communicate the situation our company is in. And if we’re still convinced that investing a few hours or days now will save us much more time and trouble later, you better trust us!

Thu, 19 Mar 2015 11:50:00 +0000
2015-02-28

strehle.de now powered by Topic Maps

If you’ve visited my blog before: Do you notice anything unusual? I hope not – because that’s what I’ve been working on almost every night for four months… 😄

My personal Web site had been hosted on a company-sponsored server for the last 15 years (thanks guys!), but that server is finally going to be shut down. So I rented a virtual server and started to move my stuff over.

Being a developer, I couldn’t just copy the software and data onto the new server. That would have been way too easy! Developers always feel the need to remove cruft, to improve on their old code, and to learn about new technology in the process.

Many parts of my Web site (my blog, a photographer’s portfolio, Planet DAM) had been powered by our company’s DAM software in the background. I learnt a lot by “eating my own DAM dog food”, but now I wanted to explore something else: Topic Maps.

Topic Maps are a standardized way for modeling information – sadly, they never gained traction and are long-forgotten by almost everyone. But I can’t stop loving Topic Maps, and in 2014, I finally started to build my own open-source Topic Maps engine, TopicBank. (It’s very experimental, useful for no-one but me at the moment, and evolves at an extremely slow pace.) I’m also using this side project to learn Elasticsearch, and dive deeper into Linked Data, Amazon S3, PostgreSQL and the latest PHP 5 features.

So I’m now “dogfooding” TopicBank. I rewrote the DAM-powered parts of my Web site to use my Topic Maps engine instead. And today, after months of hard work, I switched my live Web site over to the new server, and to the new code. Some things are still missing (Planet DAM’s RSS feeds, which I plan to reimplement soon). But for a full heart transplant, the patient seems to be doing very well!

Sat, 28 Feb 2015 21:34:00 +0000
2015-01-27

Demian Hess: Managing Digital Rights Metadata with Semantic Technologies

Very interesting webinar recording by Demian Hess: Managing Digital Rights Metadata with Semantic Technologies. (You need to supply your e-mail address to view the recording, and playback requires Windows Media Player, but it’s worth it if you’re into Digital Asset Management and rights metadata.)

Demian explains how complex licenses are, and how people try to simplify them because the complexity / variability doesn’t fit into their DAM systems. And why dumbing down doesn’t work too well in the long term.

His approach is to losslessly store all the licensing terms in a separate RDF database, which is integrated with the DAM system so that terms can be displayed along with the DAM asset information in the user interface. Special licensing reports (using SPARQL in the backend) can list all the different terms for a set of assets.

I only wonder why there’s no mention of RightsML. RightsML, hopefully, is going to become the standard for rights metadata, and it’s built on semantic technology. See my blog post Rights Management in the DC-X DAM – and RightsML.

Demian also wrote an article on Digital Rights and the Cost of "Lousy Record Keeping".

Tue, 27 Jan 2015 09:23:59 +0000
2014-12-22

Hunting for well-known Semantic Web vocabularies and terms

As a Semantic Web / Linked Data newbie, I’m struggling with finding the right URIs for properties and values.

Say I have a screenshot as a PNG image file.

If I were to describe it in the Atom feed format, I’d make an “entry” for it, write the file size into the “link/@length” attribute, the “image/png” MIME type into the “link/@type” attribute, and a short textual description into “content” (with “@xml:lang” set to “en”). Very easy for me to produce, and the semantics would be clear to everyone reading the Atom standard.

Now I want to take part in the “SemWeb” and describe my screenshot in RDFa instead. (In order to allow highly extensible data exchange between different vendors’ Digital Asset Management systems, for example.) But suddenly life is hard: For each property (“file size”, “MIME type”, “description”) and some values (“type: file”, “MIME type: image/png”, “language: English”) I’ve got to provide a URL (or URI).

I could make up URLs on my own domain – how about http://strehle.de/schema/fileSize ? But that would be missing the point and prevent interoperability. How to Publish Linked Data on the Web puts it like this: “A set of well-known vocabularies has evolved in the Semantic Web community. Please check whether your data can be represented using terms from these vocabularies before defining any new terms.”

The previous link lists about a dozen vocabularies. There’s a longer list in the State of the LOD Cloud report. And a W3C VocabularyMarket page. These all seem a bit dated and incomplete: None of them link to schema.org, one of the more important vocabularies in my opinion. (Browsing Semantic Web resources in general is no fun; you run into lots of outdated stuff and broken links.) And I haven’t found a good search engine that covers these vocabularies: I don’t want to browse twenty different sites to find out which one defines a “file size” term.

I’m pretty sure the Semantic Web pros know where to look, and how to do this best. Please drop me a line (e-mail or Twitter) if you can help :-)

Update: The answer is the Linked Open Vocabularies site. Check it out!

For the record, here’s what I found so far for my screenshot example:

“file size”: https://schema.org/contentSize

“MIME type”: http://en.wikipedia.org/wiki/Internet_media_type or http://www.wikidata.org/wiki/Q1667978

“description”: http://purl.org/dc/terms/description or https://schema.org/text

“type: file”: http://en.wikipedia.org/wiki/Computer_file or http://www.wikidata.org/wiki/Q82753, or more specific: http://schema.org/MediaObject or http://schema.org/ImageObject or even http://schema.org/screenshot

“MIME type: image/png”: http://purl.org/NET/mediatypes/image/png or http://www.iana.org/assignments/media-types/image/png

“language: English”: http://en.wikipedia.org/wiki/English_language or http://www.lingvoj.org/languages/tag-en.html or https://www.wikidata.org/wiki/Q1860
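
For what it’s worth, here’s a rough JSON-LD sketch of the screenshot description using the terms found above, plus dcterms:format for the media type (a Dublin Core term I know exists, though it’s not in the list). The file URL and the values are made up:

    <?php
    // Screenshot description using schema.org and Dublin Core terms.
    $screenshot = [
        '@context' => [
            'schema' => 'http://schema.org/',
            'dct'    => 'http://purl.org/dc/terms/',
        ],
        '@id'                => 'http://www.example.com/screenshot.png',
        '@type'              => 'schema:ImageObject',
        'schema:contentSize' => '48 KB',
        'dct:format'         => ['@id' => 'http://purl.org/NET/mediatypes/image/png'],
        'dct:description'    => ['@value' => 'Screenshot of a search form', '@language' => 'en'],
    ];

    echo json_encode($screenshot, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";

Whether this particular combination of terms is the “right” one is exactly the open question – but at least it’s RDF that other systems can merge with their own data.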

Mon, 22 Dec 2014 10:38:13 +0000
2014-12-10

Dreaming of a shared content store

All the content-based software I know (WCMS, DAM and editorial systems) is built the same way: It stashes its data (content, metadata, workflow definitions, permissions) in a private, jealously guarded database. Which is great for control, consistency, performance, simpler development. But when you’re running multiple systems – each of which is an isolated data silo – what are the drawbacks of this approach?

First, you’ve got to copy data back and forth between systems all the time. We’re doing that for our DAM customers, and it’s painful: Copying newspaper articles from the editorial system into the DAM. Then copying them from the DAM into the WCMS, and WCMS data back into the DAM. Developers say “the truth is in the database”, but there are lots of databases that are slightly out of sync most of the time.

You’re also stuck with the user interfaces offered by each vendor. There’s no way you can use the nice WordPress editor to edit articles that are stored inside your DAM. You’d first have to copy the data over, then back again. User interface, application logic and the content store are tightly coupled.

And your precious content suffers from data lock-in: Want to switch to another product? Good luck migrating your data from one silo into the other without losing any of it (and spending too much time and money)! Few vendors care about your freedom to leave.

I don’t believe in a “central content repository” in the sense of one application which all other systems just read off and write to (that’s how I understand CaaS = Content as a Service). No single piece of software is versatile enough to fulfill all other applications’ needs. If we really want to share content (unstructured and structured) between applications without having to copy it, we need a layer that isn’t owned by any application, a shared content store. Think of it like a file system: The file system represents a layer that applications can build on top of, and (if they want to) share directories and files with other software.

Of course, content (media files and text) and metadata are an order of magnitude more complex than hierarchical folders and named files. I’m not sure a generally useful “content layer” can be built in such a way that software developers and vendors start adopting it. Maybe this is just a dream. But at least in part, that’s what the Semantic Web folks are trying to do with Linked Data: Sharing machine-readable data without having to copy it.

P.S.: You don’t want to boil the ocean? For fellow developers, maybe I can frame it differently: Why should the UI that displays search results care where the displayed content items are stored? (Google’s search engine certainly doesn’t.) The assumption that all your data lives in the same local (MySQL / Oracle / NoSQL) database is the enemy of a true service-oriented architecture. Split your code and data structures into self-contained, standalone services that can co-exist in a common database but can be moved out at the flip of a switch. Then open up these data structures to third party data, and try to get other software developers to make use of them. If you can replace one of your microservices with someone else’s better one (more mature, broadly adopted), do so. (We got rid of our USERS table and built on LDAP instead.) How about that?
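
For fellow developers, here’s a toy PHP sketch of that idea (invented interfaces, nothing from a real product): the rendering code depends only on a narrow ContentSource interface, so whether items come from a local database, a remote DAM or someone else’s service becomes an implementation detail.

    <?php
    // Toy sketch: the UI renders ContentItems without knowing which backend produced them.
    interface ContentSource
    {
        /** @return ContentItem[] */
        public function search($query);
    }

    class ContentItem
    {
        public $title;
        public $url;

        public function __construct($title, $url)
        {
            $this->title = $title;
            $this->url   = $url;
        }
    }

    // One possible backend; an Elasticsearch- or SPARQL-backed source would
    // implement the same interface.
    class InMemoryContentSource implements ContentSource
    {
        private $items;

        public function __construct(array $items)
        {
            $this->items = $items;
        }

        public function search($query)
        {
            return array_filter($this->items, function (ContentItem $item) use ($query) {
                return stripos($item->title, $query) !== false;
            });
        }
    }

    // The rendering code doesn't care where the items are stored.
    function renderResults(ContentSource $source, $query)
    {
        foreach ($source->search($query) as $item) {
            echo $item->title, ' - ', $item->url, "\n";
        }
    }

    renderResults(new InMemoryContentSource([
        new ContentItem('Dreaming of a shared content store', 'http://www.example.com/1'),
        new ContentItem('Web of information vs DAM silos',    'http://www.example.com/2'),
    ]), 'content');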

Related posts: Web of information vs DAM, DM, CM, KM silos. Cloud software, local files: A hybrid DAM approach. Linked Data for better image search on the Web.


Wed, 10 Dec 2014 11:40:16 +0000