Counting word frequency using NLTK FreqDist()

A pretty simple programming task: find the most-used words in a text and count how often they’re used, with the goal of later creating a pretty Wordle-like word cloud from this data.

I assumed there would be some existing tool or code, and Roger Howard said NLTK’s FreqDist() was “easy as pie”.

So today I wrote the first Python program of my life, using NLTK, the Natural Language Toolkit, with the help of the NLTK tutorial and Stack Overflow. I’m sure it’s terrible Python and bad use of NLTK. Sorry, I’m a total newbie.
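For readers who want a feel for the task, here is a minimal sketch of counting word frequencies. It is not the program from the full article. The function name `top_words`, the regex tokenizer, and the sample sentence are assumptions made for illustration; the only NLTK call used is `FreqDist(...).most_common(n)`, and the sketch falls back to `collections.Counter` (which `FreqDist` is built on) if NLTK is not installed.

```python
import re
from collections import Counter

try:
    # NLTK's FreqDist is a Counter subclass, so it is used the same way here.
    from nltk import FreqDist
except ImportError:
    # Fallback so the sketch still runs without NLTK installed.
    FreqDist = Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Naive tokenizer (an assumption): lowercase runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return FreqDist(tokens).most_common(n)


if __name__ == "__main__":
    sample = "the cat and the hat and the bat"
    for word, count in top_words(sample, 2):
        print(word, count)
```

A real text would likely also need stop-word filtering (so that “the” and “and” don’t dominate the word cloud), which this sketch leaves out.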

Read the full article…

Thu, 03 Sep 2015 19:53:00 +0000

“iTunes for news” – first impressions of Blendle in Germany


You have probably heard of Blendle, the promising “iTunes for News” startup from the Netherlands that’s backed by the NYT and Axel Springer. They’re launching in Germany on September 14th. I’ve been a fan of the Blendle idea for a long time: effortlessly buying print articles from a central “online kiosk”, with an equally easy refund option if the article wasn’t worth the money. So I was extra happy to be able to try out the beta version. Some first impressions:

Read the full article…

Fri, 28 Aug 2015 21:35:00 +0000

Use ppolicy_hash_cleartext to keep OpenLDAP from storing and returning plain text passwords

OpenLDAP slapd is popular open source server software that implements the LDAP protocol – you use it to store users, groups and their attributes, and you can use it for authentication (checking whether a given username/password combination is valid).

The experts will know about this, but to a novice it’s counter-intuitive that by default, OpenLDAP stores passwords in plain text on disk, and even returns plain text passwords in search results. Here’s what this looks like:

Read the full article…
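To illustrate the difference between a plain text password and a hashed one, here is a minimal Python sketch of the salted SHA-1 format that OpenLDAP labels `{SSHA}`: base64 of the SHA-1 digest of password plus salt, followed by the salt. This is not slapd’s code and not taken from the full article; the function names `ssha_hash` and `ssha_verify` and the 4-byte salt length are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import os

PREFIX = "{SSHA}"


def ssha_hash(password, salt=None):
    """Build a {SSHA} value: base64(SHA-1(password + salt) + salt)."""
    if salt is None:
        salt = os.urandom(4)  # random per-password salt (length is an assumption)
    digest = hashlib.sha1(password.encode("utf-8") + salt).digest()
    return PREFIX + base64.b64encode(digest + salt).decode("ascii")


def ssha_verify(password, stored):
    """Check a candidate password against a stored {SSHA} value."""
    raw = base64.b64decode(stored[len(PREFIX):])
    digest, salt = raw[:20], raw[20:]  # a SHA-1 digest is 20 bytes long
    candidate = hashlib.sha1(password.encode("utf-8") + salt).digest()
    return hmac.compare_digest(candidate, digest)


if __name__ == "__main__":
    stored = ssha_hash("secret")
    print(stored)                          # the value differs on every run
    print(ssha_verify("secret", stored))
    print(ssha_verify("wrong", stored))
```

With a hashed value stored, a search result exposes only the `{SSHA}…` string rather than the password itself.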

Thu, 27 Aug 2015 11:28:00 +0000

Listen to your engineers, don’t sink the Vasa

Friends of ours recently visited the Vasa Museum in Stockholm, Sweden. They told us the story of the Vasa, a warship that sank almost immediately after setting out on her maiden voyage in 1628. According to Wikipedia:

“Richly decorated as a symbol of the king's ambitions for Sweden and himself, upon completion she was one of the most powerfully armed vessels in the world. However, Vasa was dangerously unstable due to too much weight in the upper structure of the hull. Despite this lack of stability she was ordered to sea and foundered only a few minutes after encountering a wind stronger than a breeze. The order to sail was the result of a combination of factors. The king […] was impatient to see her take up her station as flagship of the reserve squadron […]. At the same time the king's subordinates lacked the political courage to openly discuss the ship's structural problems or to have the maiden voyage postponed.”

Read the full article…

Sun, 23 Aug 2015 19:24:00 +0000

Dave Winer: Too much linear thinking in news

Dave Winer – Too much linear thinking in news:

“I think they [Circa] were on to something. Starting topics, and then adding stories to each topic as the news comes in. A story isn't something that's published once and done, it's more of a process.

[…] Circa resisted joining the open web. I think that was a fundamental mistake. […] Each story must have a way to get to it through a Web address.

[…] There's never been a long-term thriving tech company that wasn't run by a user.”

See my blog post (in German) on topic centric news publishing: Journalismus: Themenzentriertes Arbeiten, vernetzte Beiträge und hilfreiche Software


Mon, 06 Jul 2015 07:52:00 +0000

DC-X DAM system architecture, data structures, and APIs

Yesterday, we met with a potential customer’s tech and development team who were interested in the backend of our DC-X DAM system.

I gave a quick overview of the system architecture, the most important data structures, and our brand new JSON API.

Here are the slides of my presentation (or download a PDF of the presentation):

Read the full article…

Thu, 02 Jul 2015 09:04:00 +0000

Topic Maps (as a standard) are dead, I’m afraid

[Update: This post got some well-deserved pushback. Thanks to Patrick Durusau, Lars Marius Garshol and Jack Park for the feedback – and sorry for the controversial headline. I’ve added “(as a standard)” and did some editing to make clear that people still use Topic Maps.]

I’m a fan of Topic Maps – the very well-thought-out Topic Maps Data Model standard with an XML serialization called XTM (XML Topic Maps) dating back to the year 2000. (See also the Topic Maps Reference Model, TMRM).

Even as a fan, I must admit that the Topic Maps standards are dead. They have never been widely adopted, and the key contributors have long moved on. Measured by the value you expect to get out of a successful standard – good visibility, adoption, interoperability, tooling, ongoing development – Topic Maps haven’t been the success we were hoping for.

Read the full article…

Sun, 14 Jun 2015 19:57:00 +0000

RDF and schema.org for DAM interoperability

There’s no widely-accepted standard for DAM data yet

Digital Asset Management (DAM) systems are the hubs for organizations’ creative content. DAMs need to exchange data with other systems all the time: import creative works and metadata from external content providers, export digital assets and metadata to Web Content Management systems and so on. Sadly, none of the various DAM-related standards (like the Dublin Core Metadata Element Set Lisa Grimm writes about, or IPTC NewsML G2) have been broadly adopted by DAM vendors. At least not broadly enough that you can expect to exchange data between DAM systems without programming effort.

Do we need a new standard?

Inventing a new standard is rarely a good idea. (You’ve probably seen the XKCD comic on standards.) If there is an existing open standard that more or less matches our use case, we had better use that one to benefit from the existing documentation, tools, and adoption.

I suggest that we encourage the DAM community to move towards the schema.org vocabulary in an RDF syntax. This is the stuff that already powers large parts of the emerging Semantic Web. It introduces the DAM to the world of Linked Data.
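As a rough sketch of what this could look like, the following Python snippet maps a DAM asset record to schema.org terms and serializes it as JSON-LD, which is one of the RDF syntaxes. The input field names (`title`, `url`, `photographer`, `created`, `tags`), the function name `asset_to_jsonld`, and the example.com URL are hypothetical; only the schema.org type and property names (`ImageObject`, `name`, `contentUrl`, `creator`, `dateCreated`, `keywords`) come from the vocabulary itself.

```python
import json


def asset_to_jsonld(asset):
    """Map a DAM asset record (hypothetical field names) to schema.org JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "name": asset["title"],
        "contentUrl": asset["url"],
        "creator": {"@type": "Person", "name": asset["photographer"]},
        "dateCreated": asset["created"],
        "keywords": ", ".join(asset["tags"]),
    }


if __name__ == "__main__":
    # Made-up example record, not taken from any real DAM system.
    asset = {
        "title": "Harbour at dawn",
        "url": "https://example.com/assets/1234.jpg",
        "photographer": "Jane Doe",
        "created": "2015-05-08",
        "tags": ["harbour", "sunrise"],
    }
    print(json.dumps(asset_to_jsonld(asset), indent=2))
```

Any system that understands schema.org could read such a document without knowing anything about the DAM that produced it, which is the interoperability benefit argued for above.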

Read the full article…

Fri, 08 May 2015 12:28:00 +0000

From archive system to editorial system: the hub for creative content

I gave the following talk on May 5, 2015 at the spring conference of the Verein für Medieninformation und Mediendokumentation (vfm) in Bremen, on the topic “From archive system to editorial system”, as part of the press panel (slightly revised; it does not match the exact wording of the talk). The other two speakers were Christian Wagner, managing editor at WESER-KURIER (which uses our DC-X), and André Maerz, project manager at Neue Zürcher Zeitung (a HUGO user). The panel was moderated by Jutta Heselmann of WDR.

Read the full article…

Wed, 06 May 2015 08:44:00 +0000

Workflow awareness of DAM systems

Workflow doesn’t

This blog post is inspired by Roger Howard’s excellent, thought-provoking “Workflow doesn’t” critique of the workflow functionality in today’s Digital Asset Management systems.

(I don’t know which systems Roger has worked with. If you want to catch up with the state of DAM workflow engines, here are some links to get you started: Status-based workflows in Canto Cumulus. The MerlinOne workflow engine. ADAM Workflow. DAM News on configurable workflow systems. DAM News on workflow in DAM value chains. Anything else?)

I like Roger’s observations that “exceptions are the rule in production”, exceptions require decisions, and “decision making is something humans excel at”. This reminds me of Jon Udell’s old motto that “human beings are the exception handlers for all automated workflows”. Software that doesn’t embrace this fact will stop the work from flowing.

Read the full article…

Wed, 29 Apr 2015 22:09:00 +0000