2017-03-10

Metadata values have metadata, too

Field values – inside an SQL database column, an XML tag, or an image’s embedded metadata – are the “atoms” of a data model: the smallest unit. For example, in this record:

<country>
  <name>Germany</name>
  <population>82175700</population>
</country>

… the number “82,175,700” is the field value in the “Germany” country record’s “population” field.

Just as with atoms, there’s more to field values once you dig a little deeper. Imagine your boss complaining, “This number is wrong – how did it end up in our database?” You might explain that you entered it last week because the Wikipedia page you consulted gave it as the approximate population of Germany, according to a 2015 estimate.

Suddenly, that “atomic”, inconspicuous number has a bunch of metadata attached to it: a validity date (in 2015), a precision qualifier (“approximately”), provenance (Wikipedia), and user information (you entered it).

Obviously, the real world is way more complicated than our database structures. That’s one of the points of data modeling; real-world complexities which aren’t of much use to our business should be left out of the data model to keep it simple. (My post Why I prefer Topic Maps to RDF has a few remarks on data modeling.)

But what if this kind of additional metadata matters to your business? Let’s look at the kinds of metadata which can be applied to data points:

  • Data type and format: You’ll usually have that already; that is, you know whether the value in your “Date created” field is a simple string (“yesterday, I think”), a number (seconds since Jan 1st, 1970) or a proper date/time format (“2017-03-10T15:00:00+01:00”). Programming languages and data exchange formats each have their own data types; see XML Schema Datatypes (XSD) for a good example.
  • Unit: An amount of money requires the currency (“€35.11”), and length, weight etc. have a unit of measurement (“110.5 cm”).
  • Language: Important for names and text fields (“en”).
  • Provenance: the source of the value (“https://en.wikipedia.org/wiki/Germany”).
  • User: who added or edited the value (common in auditing).
  • Last modified: when the value was added or edited.
  • Application: the software which wrote the value. For example, the IPTC IIM has an “Originating Program” field.
  • Transaction: the larger transaction during which the value was written; as described by Ralph Windsor in The Digital Asset Transaction Management System – A Time Machine For Digital Assets: “a unique identifier for each batch operation”.
  • Confidence: how sure you are that the value is correct. Automatic image recognition and classification software usually provides a numeric confidence score as a percentage. In It’s About Time, Kurt Cagle suggests the levels “Approximate”, “Inferred”, “Reported”, “Confirmed”.
  • Validity: the date/time range – often an open range (when the start date is unknown, or when there’s no end date yet) – during which the value was valid. Useful to mark a former e-mail address, or a maiden name. (Could also be used to “surf” past versions of your data if you manage to implement something like the Memento framework.)
  • Accuracy / Precision: how accurate the value is. In historical archives and museums, you’re likely to deal with data you know to be only guesswork. (You might have a photo that’s definitely from Christmas Eve, but you don’t have an exact year.) The Extended Date/Time Format (EDTF) draft offers an extensive syntax for inexact dates.
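
To make the list above a bit more concrete, here is a minimal Python sketch of a field value that carries some of these kinds of metadata. The class name `AnnotatedValue`, its field names, the `is_valid_on` helper and the sample e-mail address are all made up for illustration; a real data model would pick only the kinds of metadata the business actually needs.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AnnotatedValue:
    """A field value plus a few kinds of metadata (hypothetical names)."""
    value: object
    unit: Optional[str] = None          # e.g. "cm" or "EUR"
    language: Optional[str] = None      # e.g. "en"
    provenance: Optional[str] = None    # where the value came from
    user: Optional[str] = None          # who added or edited it
    confidence: Optional[str] = None    # e.g. "Approximate", "Confirmed"
    valid_from: Optional[date] = None   # None means the start is unknown
    valid_to: Optional[date] = None     # None means there is no end date yet

    def is_valid_on(self, day: date) -> bool:
        """Return True if `day` lies inside the (possibly open) validity range."""
        if self.valid_from is not None and day < self.valid_from:
            return False
        if self.valid_to is not None and day > self.valid_to:
            return False
        return True


# The population figure from the example record, with its provenance attached.
population = AnnotatedValue(
    value=82175700,
    provenance="https://en.wikipedia.org/wiki/Germany",
    confidence="Approximate",
)

# A former e-mail address: the start date is unknown, the end date is known.
# Address and date are invented sample data.
former_email = AnnotatedValue(
    value="old.address@example.com",
    valid_to=date(2016, 12, 31),
)
```

With this shape, `former_email.is_valid_on(date(2017, 3, 10))` is false, while a value without any validity dates, such as `population` here, is treated as always valid.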

Adding metadata to your field values takes a lot of work if you use a relational (SQL) database or a simple NoSQL database. In XML, you can often use attributes on the element that holds the value. For full flexibility, Topic Maps, which support “scope” and reification, would be great if they were more widely available. RDF also supports reification.
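
As a rough sketch of the XML attribute approach, the country record from the beginning could carry its metadata like this, read here with Python’s standard `xml.etree.ElementTree` module. The attribute names `source`, `validity`, `precision` and `editor` are invented for this example and do not come from any schema; only `xml:lang` is a standard attribute.

```python
import xml.etree.ElementTree as ET

# The example record, with metadata added as attributes.
# All attribute names except xml:lang are hypothetical.
RECORD = """
<country>
  <name xml:lang="en">Germany</name>
  <population source="https://en.wikipedia.org/wiki/Germany"
              validity="2015"
              precision="approximate"
              editor="me">82175700</population>
</country>
"""

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(RECORD)

name = root.find("name")
name_language = name.get(XML_LANG)

population = root.find("population")
value = int(population.text)          # the "atomic" field value
metadata = dict(population.attrib)    # everything we know about that value

print(value, metadata, name_language)
```

Attributes only hold plain strings, so this works well for simple annotations such as a source or a language code, but less well once the metadata itself needs structure.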

Fri, 10 Mar 2017 12:50:00 +0000