How to accurately determine the MIME type of a file?

Alexei Blue · Dec 13, 2011 · Viewed 26.1k times

I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:

Method 1:

import javax.activation.FileDataSource;

FileDataSource ds = new FileDataSource("~\\Downloads\\777135_new.xls");  
String contentType = ds.getContentType();  
System.out.println("The MIME type of the file is: " + contentType);

//output = The MIME type of the file is: application/octet-stream

Method 2:

import java.io.RandomAccessFile;
import net.sf.jmimemagic.*;

try
{
    RandomAccessFile f = new RandomAccessFile("~\\Downloads\\777135_new.xls", "r");
    byte[] fileBytes = new byte[(int)f.length()];
    // readFully() fills the whole buffer; a plain read() may return early
    f.readFully(fileBytes);
    f.close();
    MagicMatch match = Magic.getMagicMatch(fileBytes);
    System.out.println("The Mime type is: " + match.getMimeType());
}
catch(Exception e)
{
    System.out.println(e);
}

//output = The Mime type is: application/msword

Method 3:

import eu.medsea.mimeutil.*;

MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
File f = new File ("~\\Downloads\\777135_new.xls");
Collection<?> mimeTypes = MimeUtil.getMimeTypes(f);
String mimeType = MimeUtil.getFirstMimeType(mimeTypes.toString()).toString();
String subMimeType = MimeUtil.getSubType(mimeTypes.toString());
System.out.println("The Mime type is: " + mimeTypes + ", " + mimeType + ", " + subMimeType);

//output = The Mime type is: application/msword, application/msword, msword

I found these three methods at http://www.rgagnon.com/javadetails/java-0487.html. My problem is that the file I am testing them on is one I created myself, so I know it's an Excel file, yet methods 2 and 3 both incorrectly report the type as msword. The first method returns application/octet-stream instead, which I believe is because of the limited number of file types in the built-in FileTypeMap that it uses.

I've had a look around and some people say that it's because of the way the offset is detected in the files, so the content type is picked up incorrectly, as pointed out in this wiki on detecting file types in PHP. Unfortunately the wiki then goes on to use the extension to determine the file type, which isn't what I want to do as it's unreliable.

Can anyone point me in the right direction to a method that will detect file types correctly within Java, please?

Cheers, Alexei Blue.

Edit: Looks like there is no specific solution to this, as @IronMensan said in the comment below. I did find this really interesting research paper that applies machine learning in a few ways to help with the issue, but there doesn't seem to be a foolproof answer. I think my best bet here will be to try to pass the file to an Excel file reader and catch any incorrect-format exceptions.

Answer

rodrigo.garcia · Feb 12, 2012

So far, the most accurate tool I've found to determine a file's MIME type is Apache Tika. This is a slight modification of what I currently use (with Tika version 1.0):

import java.io.File;
import java.io.IOException;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;

private static final Detector DETECTOR = new DefaultDetector(
        MimeTypes.getDefaultMimeTypes());

public static String detectMimeType(final File file) throws IOException {
    TikaInputStream tikaIS = null;
    try {
        tikaIS = TikaInputStream.get(file);

        /*
         * You might not want to provide the file's name. If you provide an Excel
         * document with a .xls extension, Tika will get it right straight away; but
         * if you provide an Excel document with a .doc extension, it will guess it
         * to be a Word document.
         */
        final Metadata metadata = new Metadata();
        // metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        return DETECTOR.detect(tikaIS, metadata).toString();
    } finally {
        if (tikaIS != null) {
            tikaIS.close();
        }
    }
}

Since Tika uses magic numbers but also looks at the contents of files when unsure, the process can be a little slow (it took 3.268 seconds for my PC to examine 15 files).
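To illustrate why magic numbers alone are not enough here: legacy .xls and .doc files are both stored in the same OLE2 compound container, so they start with the same eight bytes (D0 CF 11 E0 A1 B1 1A E1). That is probably why magic-only detectors report an Excel file as msword. The following is only a rough sketch using the plain JDK; the class and method names (Ole2Sniffer, isOle2Container) are made up for this example and are not part of Tika or any other library.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class Ole2Sniffer {

    // Signature shared by all OLE2 compound files (.xls, .doc, .ppt, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private Ole2Sniffer() {
    }

    // Returns true if the file begins with the OLE2 signature.
    // A true result does NOT say whether it is Excel or Word; telling
    // them apart means looking inside the container.
    static boolean isOle2Container(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            in.readFully(header);
        } catch (EOFException tooShort) {
            // Fewer than 8 bytes, so it cannot carry the signature.
            return false;
        }
        return Arrays.equals(header, OLE2_MAGIC);
    }
}
```

Calling Ole2Sniffer.isOle2Container(path) on either a real .xls or a real .doc should return true in both cases, which is exactly the ambiguity that a detector has to resolve by inspecting the file's contents.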

Also, don't make the same mistake I did at first. If you get the tika-core JAR, you should also get the tika-parsers JAR. Without tika-parsers you won't get any exceptions; you will simply not get an accurate MIME type, so it is REALLY important to include it.

An alternative is to get the tika-app JAR, which contains tika-core, tika-parsers and all of their dependencies (there are a lot of them: poi, poi-ooxml, xmlbeans and commons-compress, just to name a few).