Tim's Weblog
Tim Strehle’s links and thoughts on Web apps, software development and Digital Asset Management, since 2002.
2014-02-18

Questions to ask before building a DAM importer

Our DC-X DAM systems are often the central content hub at large publishers, with lots of data flowing in from photographers, news agencies, editorial systems and Web CMSes. These sources provide data (articles, photos, graphics, ads, pages) in a host of different formats, which means we’re building “importers” all the time to ingest content into the DAM system.

As a developer, I’m often asked to estimate how long building an importer will take. Some information is almost always missing from the request, so here’s my checklist of things I need to know before I can give a rough estimate of the development time:

  • Is the data copied into a local “hotfolder” (DC-X default), or does the importer have to fetch it (via FTP, an RSS feed etc.)?
  • Which file format does the data come in (XML, HTML, CSV, JPEG, PDF, …)? Can it be in different formats?
  • Can you provide the data in a format that the DAM system already supports? (Then we’re done.)
  • How large are the files (typically, and maximum)? How many files are expected per hour/day/week?
  • Is there a naming convention for directories and files? What should the importer do if files don’t follow that convention?
  • Should metadata be read from the file and directory name? If so, which metadata exactly?
  • If some data arrives as a set of multiple files (e.g., a PDF file with an accompanying XML file): When starting with one of the files, how can the importer find the other files in the set (naming convention, file name given in the XML etc.)? Will they arrive roughly at the same time? If the set is incomplete, how long should the importer wait for the missing files to arrive? Should it import anyway when files are missing, or report an error?
  • How about duplicate files coming in? Can they simply be rejected by the importer (DC-X default), or is there a need to update or replace data from previous imports? How can the importer detect duplicates? (DC-X default: A checksum on the file’s contents.)
  • Should preview images be rendered? (DC-X will do this by default.) Or are preview images provided? Any special requirements when rendering preview images (like adding a watermark)?
  • When rendering preview images from graphical file formats, is a colorspace or ICC profile conversion needed? (By default, DC-X will detect CMYK and create RGB previews.)
  • Should text be extracted from textual files (PDF, EPS, Word)? (DC-X will do this by default, details depending on file format specifics.)
  • Are there special requirements for reading file metadata (EXIF, IPTC, XMP etc.)? (DC-X reads and imports common metadata by default.)
  • Have you provided representative samples of the input files?
  • What exactly do your XML / CSV files contain? Have you provided a textual description? (It’s great if you’re using a standardized format, but please describe how exactly you’re using that format – most standards leave room for interpretation or extensions.) What metadata fields should the XML tags be mapped into on import?
  • Are the files linked in some way? How can the importer find out what links where, and must the files be imported in a certain order to be able to establish these links?
  • Does the new data fit in with the existing metadata schema, or will we have to define new fields? Any special expectations regarding searching the new data?
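
To make two of these questions more concrete (duplicate detection and incomplete file sets), here is a minimal Python sketch. It is not how DC-X does it: the post only says that DC-X uses a checksum on the file’s contents, so SHA-256 is an assumption, as are all function names, the “companion file shares the base name” convention and the ten-minute waiting period.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional, Set


def content_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, seen_checksums: Set[str]) -> bool:
    """Report whether identical content was seen before; remember it if not."""
    checksum = content_checksum(path)
    if checksum in seen_checksums:
        return True
    seen_checksums.add(checksum)
    return False


def companion_status(
    primary: Path,
    companion_suffix: str = ".xml",    # assumed naming convention
    max_wait_seconds: float = 600.0,   # assumed waiting period
    now: Optional[float] = None,
) -> str:
    """Classify a file set as 'complete', 'waiting' or 'timed_out'.

    Assumes the companion file has the same base name as the primary
    file and differs only in its suffix (report.pdf -> report.xml).
    """
    if primary.with_suffix(companion_suffix).exists():
        return "complete"
    current = time.time() if now is None else now
    age = current - primary.stat().st_mtime
    return "waiting" if age < max_wait_seconds else "timed_out"
```

An importer loop could call `is_duplicate()` first and skip rejected files, then call `companion_status()` and import only when it returns `complete`. What happens on `timed_out` (import anyway, or report an error) is exactly the kind of decision the checklist asks the customer to make. Note that the set of seen checksums lives in memory here; a real importer would have to persist it.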

I’m sure this list is incomplete – please let me know what I’m missing!