Many printed newspapers and magazines offer digital replicas to their subscribers – Web or mobile apps that let readers browse the publication in the exact print layout, often with added functionality like full-text search, PDF download or an optional HTML-formatted article view for better readability. You’ll find lots of examples in the Apple Newsstand or Google Play Kiosk. In Germany, these digital replicas are called “ePaper” and are a must-have for publishers because they count towards the official print circulation figures tracked by the IVW.
Technically, replica editions are usually built from PDF files of the printed pages. A decent editorial system will also provide articles and images with structured metadata separately, which means better quality for added functionality compared to content extracted from the PDF. Really good systems can provide page coordinates for articles and images, so that a tap or click on the page can send the reader to the right article or image. (Remember the good old HTML image map?) Companies like Visiolink, 1000°ePaper, iApps or Paperlit help publishers create and publish replicas.
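To illustrate the image-map idea, here is a minimal Python sketch of a hit test that maps a tap position to an article. The `Region` class, the `article_at` function, the rectangular shape and the top-left coordinate origin are all assumptions made for this example; real systems may deliver polygons or use PDF coordinates with a bottom-left origin.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular article area on a page (top-left origin assumed)."""
    article_id: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        # A point is inside if it lies within the rectangle's bounds.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def article_at(regions: Sequence[Region], px: float, py: float) -> Optional[str]:
    """Return the ID of the first region containing the point, or None."""
    for region in regions:
        if region.contains(px, py):
            return region.article_id
    return None


# Made-up example regions for a single page.
regions = [
    Region("lead-story", x=20, y=40, width=360, height=200),
    Region("sidebar", x=400, y=40, width=150, height=500),
]
```

Calling `article_at(regions, 100, 100)` would pick the lead story here, while a tap in the page margin returns `None`, so the app can simply ignore it.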
Since our DC-X DAM is used as a PDF, article and image archive by many newspaper and magazine publishers, we often have to make it interoperable with “ePaper systems”. (We even built one or two of these systems ourselves.) The main work is in formatting and packaging page, article and image contents and metadata the way the ePaper system needs it. And sometimes we’re on the receiving end, having to ingest such a feed into the DAM.
To clarify, here’s the information that needs to be transported:
- Edition/issue level: One object per printed edition of a newspaper or magazine. Properties: Edition name (“My magazine”), publication date, month, or issue number, page count
- Page level: One object per page (or spread, i.e. two adjacent pages), linked to the printed edition it’s in (see above). Properties: Reference to PDF file, page number, size / physical dimensions, section
- Article level: One object per article, linked to the pages it appears on. Properties: Title (for table of contents), formatted text (ideally HTML or XHTML), coordinates on the page
- Media level: One object per image, linked to the articles it appears in. Properties: Reference to media file, title, dimensions, content type
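The four levels above could be modelled roughly like this. It is only a sketch in Python, not an existing exchange format: all class and field names, the use of millimetres and pixels, and the JSON output are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Media:
    media_id: str
    file_ref: str        # reference to the media file
    title: str
    width_px: int
    height_px: int
    content_type: str


@dataclass
class Placement:
    """Where an article appears: page number plus a box on that page."""
    page_number: int
    x: float
    y: float
    width: float
    height: float


@dataclass
class Article:
    article_id: str
    title: str           # used for the table of contents
    html: str            # formatted text, ideally HTML or XHTML
    placements: List[Placement] = field(default_factory=list)
    media_ids: List[str] = field(default_factory=list)  # links to Media objects


@dataclass
class Page:
    page_number: int
    pdf_ref: str         # reference to the PDF file
    width_mm: float
    height_mm: float
    section: str


@dataclass
class Edition:
    name: str
    publication_date: str  # ISO date; an issue number could be added as well
    pages: List[Page] = field(default_factory=list)
    articles: List[Article] = field(default_factory=list)
    media: List[Media] = field(default_factory=list)

    @property
    def page_count(self) -> int:
        return len(self.pages)


def to_json(edition: Edition) -> str:
    """Serialize the whole edition, adding the derived page count."""
    data = asdict(edition)
    data["page_count"] = edition.page_count
    return json.dumps(data, indent=2, ensure_ascii=False)


# Made-up sample data, for illustration only.
edition = Edition(
    name="My magazine",
    publication_date="2016-10-28",
    pages=[Page(1, "pages/001.pdf", 210.0, 297.0, "Cover")],
    articles=[Article("a1", "Hello", "<p>Hello</p>",
                      placements=[Placement(1, 20, 40, 170, 100)],
                      media_ids=["m1"])],
    media=[Media("m1", "images/m1.jpg", "Cover photo", 1200, 800, "image/jpeg")],
)
```

Articles and media reference each other by ID rather than being nested, because an article can run across several pages and an image can appear in more than one article.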
What’s really painful is that we’ve been doing this kind of integration work for almost two decades, and we keep writing customer-specific code from scratch every time because there seems to be no standardized exchange format for this kind of data. The PRISM standards come very close, IPTC NewsML G2 might also work – but both seem to lack the edition-level and page-level information. Or am I missing something? What would you recommend? I’d love to hear from you!
Update: I found the EPUB derivative OpenEFT, an Idealliance standard. It seems to match the use case almost perfectly. But I haven’t found anyone implementing it so far. ePaper vendors don’t seem to care about standardization at all. Maybe it’s because US newspapers aren’t into ePapers, as Mario R. García writes in Those European e-papers are hot?
Update (2016-10-28): Related, by Stefan Boddie – What is METS/ALTO?: “The combination of METS and ALTO (often written METS/ALTO) is the current industry standard for newspaper digitization.”