Tuesday, July 20, 2010

image metadata

I thought I'd write a blog post about my Google Summer of Code project. I've never been much of a blogger, but I see lots of my fellow GSoC'ers blogging, so here goes. My project is to improve MediaWiki's support for image metadata. Currently MediaWiki will extract metadata from an image, and put a little table at the bottom of the image page detailing all the metadata (for example, see http://commons.wikimedia.org/wiki/File:%C3%89cole_militaire_2545x809.jpg#metadata ).

However, this is far from all the metadata embedded in an image. In fact, MediaWiki currently only extracts Exif metadata. Exif is arguably the most popular form of metadata, so if you're only going to extract one, Exif is a good choice. Every time you take a picture with your digital camera, it adds Exif data to the picture. Most of this data is technical: f-number, shutter speed, camera model, etc. You can also encode things like artist, copyright, and image description in Exif, but that is much rarer.

What I'm doing is, first of all, fixing up the Exif support a little bit. Currently some of the Exif tags are not supported (Bug 13172). Most of these are fairly obscure tags no one really cares about, but there are some exceptions like GPSLatitude, GPSLongitude, and UserComment.

I'm also (among other things) adding support for IPTC-IIM tags. IPTC-IIM is a very old format for transmitting news stories between news agencies. Adobe adopted parts of this format for embedding metadata in JPEG files with Photoshop. Nowadays it's slowly being replaced by XMP, but many photos still use it. IPTC metadata tends to be more descriptive in nature (stuff like title, author, etc.), whereas Exif metadata is more technical (aperture, shutter speed).
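To give a feel for what IIM data looks like, here's a rough Python sketch (not my actual code, which is PHP inside MediaWiki). Each IIM dataset starts with the marker byte 0x1C, then a record number, a dataset number, and a two-byte big-endian length, followed by the value. The sketch assumes the raw IIM block has already been pulled out of the file, and it skips the extended-length form entirely. The function names and the sample values are made up for illustration.

```python
import struct

def make_dataset(record, dataset, value):
    """Build one IIM dataset (helper used only to create sample data)."""
    return struct.pack(">BBBH", 0x1C, record, dataset, len(value)) + value

def parse_iim(data):
    """Return a list of (record, dataset, value) tuples from raw IIM bytes."""
    pos = 0
    datasets = []
    # Every dataset has a 5-byte header: marker, record, dataset, length.
    while pos + 5 <= len(data) and data[pos] == 0x1C:
        record, dataset, length = struct.unpack_from(">BBH", data, pos + 1)
        if length & 0x8000:
            break  # extended-length datasets are not handled in this sketch
        start = pos + 5
        datasets.append((record, dataset, data[start:start + length]))
        pos = start + length
    return datasets

# 2:05 is the object name (title) and 2:80 is the by-line (author).
sample = make_dataset(2, 5, b"Example title") + make_dataset(2, 80, b"Example Author")
parsed = parse_iim(sample)
```

A real parser also has to deal with character encodings and repeated datasets (keywords, for instance, can appear many times), which this leaves out.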

My code will also try to sort out conflicts. Sometimes the different metadata formats contain conflicting values. If an image has two different descriptions in the Exif and IPTC data, which should be displayed? Exif, IPTC, or both? Luckily for me, several companies involved in imaging got together and thought long and hard about that issue. They then produced a standard for how to act if there is a conflict [1]. For example, if the IPTC and Exif data conflict on the image description, then the Exif data wins.
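The "Exif wins" rule can be sketched roughly like this in Python. This is a deliberate simplification and not my actual implementation: it assumes both formats have already been mapped onto common field names (the names and sample values below are invented), and it uses one fixed priority order, whereas the standard in [1] may well have more nuance than that.

```python
def resolve_conflicts(sources, precedence=("exif", "iptc")):
    """Merge per-format metadata dicts; formats listed earlier win conflicts."""
    merged = {}
    # Apply the lowest-priority format first so higher-priority ones overwrite it.
    for fmt in reversed(precedence):
        merged.update(sources.get(fmt, {}))
    return merged

# Hypothetical, already-normalized field names and values.
exif = {"ImageDescription": "Description from Exif", "Model": "EX-Z55"}
iptc = {"ImageDescription": "Description from IPTC", "Headline": "Some headline"}

result = resolve_conflicts({"exif": exif, "iptc": iptc})
```

Here `result["ImageDescription"]` comes from the Exif data, while fields that only one format supplies (like the headline) are kept as they are.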

Consider [[File:2005-09-17 10-01 Provence 641 St Rémy-de-Provence - Glanum.jpg]]

On Commons the metadata table looks like:

But on my test wiki the table looks like:

Camera manufacturer: CASIO COMPUTER CO.,LTD
Camera model: EX-Z55
Exposure time: 1/800 sec (0.00125)
F Number: f/4.3
Date and time of data generation: 14:21, 28 September 2005
Lens focal length: 5.8 mm
Latitude: 43° 46′ 21.35″ N
Longitude: 4° 50′ 1.34″ E
Horizontal resolution: 72 dpi
Vertical resolution: 72 dpi
Software used: Microsoft Pro Photo Tools
File change date and time: 14:21, 28 September 2005
Y and C positioning: Centered
Exposure Program: Normal program
Exif version: 2.21
Date and time of digitizing: 14:21, 28 September 2005
Meaning of each component:
  1. Y
  2. Cb
  3. Cr
  4. does not exist
Image compression mode: 3.66666666667
Exposure bias: 0
Maximum land aperture: 2.8
Metering mode: Pattern
Light source: Unknown
Flash: Flash did not fire, compulsory flash suppression
Supported Flashpix version: 0,100
Color space: sRGB
File source: DSC
Custom image processing: Normal process
Exposure mode: Auto exposure
White balance: Auto white balance
Focal length in 35 mm film: 35
Scene capture type: Standard
Scene control: None

Most notably, GPS information is now supported. As a note, the Wikipedia links for the camera model are a Commons customization, which is why they don't appear in my test output.
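For the curious, Exif stores a GPS coordinate as three rational numbers (degrees, minutes, seconds) plus a separate reference tag such as N/S or E/W. Here's a rough Python sketch of turning that into the kind of text shown in the Latitude and Longitude rows above, and into decimal degrees. It's an illustration only, not the code in my branch, and the raw rational values are my guess at how the displayed numbers might be encoded in the file.

```python
from fractions import Fraction

def dms_to_decimal(dms, ref):
    """Convert (numerator, denominator) triples to signed decimal degrees."""
    degrees, minutes, seconds = (Fraction(n, d) for n, d in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    return float(-value if ref in ("S", "W") else value)

def format_dms(dms, ref):
    """Format the triple as degrees, minutes, seconds for display."""
    degrees, minutes, seconds = (Fraction(n, d) for n, d in dms)
    return f"{int(degrees)}° {int(minutes)}′ {float(seconds):.2f}″ {ref}"

# Assumed raw values; only the formatted result is taken from the table above.
latitude = [(43, 1), (46, 1), (2135, 100)]
longitude = [(4, 1), (50, 1), (134, 100)]

lat_text = format_dms(latitude, "N")
lon_text = format_dms(longitude, "E")
lat_decimal = dms_to_decimal(latitude, "N")
```

Some cameras store fractional minutes instead of seconds, which this simple formatter would just truncate, so a real implementation needs a bit more care.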

As another example, consider [[file:Pöstlingbahn TFXV.jpg]]. On Commons, it has no metadata extracted. (It does have some information about the image on the page, but this was all entered by hand.) On my test wiki, the following metadata table is generated:

I'm almost done with IIM metadata, and plan to start working on XMP metadata soon. If you're curious, all the code is currently in the img_metadata branch. You can also look at the status page, which I will try to update occasionally.