Getting MimeType subtype with Apache tika

lisak · Aug 21, 2011

I need to get the iana.org media type, rather than application/zip or application/x-tika-msoffice, for documents such as odt, ppt, pptx, xlsx, etc.

If you look at mimetypes.xml, the mime-type elements are composed of the iana.org MIME type and a "sub-class-of" element:

  <mime-type type="application/msword">
    <alias type="application/vnd.ms-word"/>
    ............................
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
  </mime-type>

How can I get the iana.org MIME type name instead of the parent type name?

When testing MIME type detection, I do:

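// detect() returns the full type as a String, e.g. "application/x-tika-msoffice"
MediaType mediaType = MediaType.parse(tika.detect(inputStream));
// getSubtype() returns only the part after the slash, e.g. "x-tika-msoffice"
String mimeType = mediaType.getSubtype();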

Test results:

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>

Is there any way to get the actual subtype from mimetypes.xml, instead of x-tika-msoffice or application/zip?

Moreover, I never get application/x-tika-ooxml for xlsx, docx and pptx documents; I always get application/zip.

Answer

Gagravarr · Jul 1, 2012

Originally, Tika only supported detection by MIME magic or by file extension (glob), as that is all most MIME detection did before Tika.
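To make those two approaches concrete, here is a minimal JDK-only sketch of magic and glob detection. It is not Tika's code: the class MagicGlobSketch and both method names are made up for illustration, and the type table is deliberately tiny. It shows that the leading bytes of a docx only say "zip", while the extension is only as trustworthy as the file name.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration class, not part of Tika.
class MagicGlobSketch {
    // Signature at the start of a typical zip archive: "PK" 0x03 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Magic detection: compare the first bytes of the file with a known signature. */
    static String detectByMagic(byte[] header) {
        if (header.length >= ZIP_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, ZIP_MAGIC.length), ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    /** Glob detection: map the file extension to a type, trusting the name. */
    static String detectByGlob(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".docx")) {
            return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        }
        if (lower.endsWith(".xlsx")) {
            return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
        }
        if (lower.endsWith(".zip")) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // The first bytes of a docx look exactly like any other zip file.
        byte[] docxHeader = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detectByMagic(docxHeader));
        System.out.println(detectByGlob("en.docx"));
        System.out.println(detectByGlob("renamed.bin"));
    }
}
```

Neither method looks inside the archive, so a docx without its extension can be reported as nothing more specific than application/zip.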

Because of the problems with Mime Magic and globs when it comes to detecting container formats, it was decided to add some new detectors to Tika to handle these. The Container Aware Detectors took the whole file, opened and processed the container, and then worked out the exact file type based on the contents. Initially, you needed to call them explicitly, but then they were wrapped up in ContainerAwareDetector which you'll see in some of the answers.
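As a rough idea of what "opening the container" means, here is a simplified sketch using only java.util.zip. It is not how Tika's container detectors are implemented. The class name ContainerSniffSketch, the helper zipOf, and the rule of keying on a few conventional entry names are assumptions made for illustration; the real detectors do considerably more.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class, not part of Tika.
class ContainerSniffSketch {

    /** Opens the zip and guesses a specific type from well-known entry names. */
    static String detect(byte[] data) {
        boolean sawEntry = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                sawEntry = true;
                switch (entry.getName()) {
                    case "mimetype":
                        // OpenDocument files declare their own type in this entry.
                        return new String(readCurrentEntry(zip), StandardCharsets.US_ASCII).trim();
                    case "word/document.xml":
                        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                    case "xl/workbook.xml":
                        return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                    case "ppt/presentation.xml":
                        return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                    default:
                        break; // not a marker entry, keep scanning
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A zip with no marker entry is just a zip; no entries at all means unknown.
        return sawEntry ? "application/zip" : "application/octet-stream";
    }

    /** Reads the remaining bytes of the entry the stream is positioned on. */
    private static byte[] readCurrentEntry(ZipInputStream zip) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = zip.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a small in-memory zip from (entry name, content) pairs, for the demo only. */
    static byte[] zipOf(String... nameContentPairs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < nameContentPairs.length; i += 2) {
                zip.putNextEntry(new ZipEntry(nameContentPairs[i]));
                zip.write(nameContentPairs[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(detect(zipOf("[Content_Types].xml", "<Types/>", "xl/workbook.xml", "<workbook/>")));
        System.out.println(detect(zipOf("readme.txt", "hello")));
    }
}
```

The point of the sketch is only that the answer comes from the archive's contents rather than from its first few bytes or its file name.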

Since then, Tika has added a service loader pattern, initially for Parsers. This allowed classes to be auto-loaded when present, with a general way to identify which ones were appropriate and use those. This support was then extended to cover Detectors too, at which point the old ContainerAwareDetector could be removed in favour of something cleaner.
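For anyone unfamiliar with the service loader idea, here is a small sketch of the general pattern using the JDK's own java.util.ServiceLoader. This shows the pattern only and may differ from how Tika wires it up internally. SketchDetector and ServiceLoaderSketch are hypothetical names, and the "return null when undecided" convention is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical plug-in interface; returns null when the detector cannot decide. */
interface SketchDetector {
    String detect(byte[] header);
}

// Hypothetical illustration class, not part of Tika.
class ServiceLoaderSketch {

    /** Collects every SketchDetector registered under META-INF/services on the classpath. */
    static List<SketchDetector> loadDetectors() {
        List<SketchDetector> found = new ArrayList<>();
        for (SketchDetector detector : ServiceLoader.load(SketchDetector.class)) {
            found.add(detector);
        }
        return found;
    }

    /** Asks each detector in turn and falls back to a generic type if none decides. */
    static String detect(List<SketchDetector> detectors, byte[] header) {
        for (SketchDetector detector : detectors) {
            String type = detector.detect(header);
            if (type != null) {
                return type;
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        List<SketchDetector> detectors = loadDetectors();
        System.out.println(detectors.size() + " detector(s) found");
        System.out.println(detect(detectors, new byte[] {0x50, 0x4B, 0x03, 0x04}));
    }
}
```

Implementations are picked up only if a jar on the classpath lists them in a META-INF/services file named after the interface. With no such file present, the loader yields nothing and the fallback type is returned, so which jars you ship decides which detectors run.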

If you're on Tika 1.2 or later, and you want accurate detection of all formats, including container formats, you want to do something like:

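 import org.apache.tika.config.TikaConfig;
 import org.apache.tika.detect.Detector;
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;

 TikaConfig config = TikaConfig.getDefaultConfig();
 Detector detector = config.getDetector();

 // Wrap the input so the container detectors can inspect the whole file
 TikaInputStream stream = TikaInputStream.get(fileOrStream);

 // The file name is an extra hint for glob (extension) matching
 Metadata metadata = new Metadata();
 metadata.add(Metadata.RESOURCE_NAME_KEY, filenameWithExtension);
 MediaType mediaType = detector.detect(stream, metadata);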

If you run this with only the Core Tika jar (tika-core-1.2-....), then the only detector present will be the MIME magic one, and you'll get the old-style detection based on magic + glob only. However, if you run this with both the Core and Parser Tika jars (plus their dependencies), or from Tika App (which includes core + parsers + dependencies automatically), then the DefaultDetector will use every Detector that is available, including the various Container Detectors. If your file is zip based, detection will then include processing the zip structure to identify the file type from what's in there. This will give you the high-accuracy detection you're after, without needing to call lots of different parsers in turn.