Using the linux 'file' command to determine type (ie. image, audio, or video)

puk picture puk · Nov 12, 2011 · Viewed 28.2k times · Source

The word file here refers to the shell file command, and not actual files. I want to determine whether a file is a, for example, video file (.mpg, .mkv, .avi). file is pretty good at returning image for image files, video for video files, and audio for audio files (and application/x-empty for some reason for text). My question is how reliable this is for identifying types. If I did a simple

file -ib deliverance.avi | grep video

would that work for all of the main video files outlined here?

Answer

John Flatness picture John Flatness · Nov 12, 2011

The results from file are less than perfect, and it has more problems with some types of files than others. File basically just looks for particular pieces of binary data in predictable patterns to figure out filetypes.

Unfortunately, in particular, some of the filetypes often used for video fall into this "problematic" category. The newer container formats like .mp4 and .mkv usually have several different MIME types that should properly depend on what type of data is being contained. For example, an .mp4 could properly be identified as video/mp4, audio/mp4, or application/mp4 depending on the content.

In practice, file often makes guesses that simply conform with common usage, and it may work perfectly well for you. For example, while I mentioned some theoretical difficulties with identifying Matroska files correctly, file basically just assumes that any Matroska file is a video. On the other hand, the usage of the Ogg container is more evenly split between audio and video, and I believe the current version of file just splits the difference, and identifies Ogg files as application/ogg, which wouldn't fall into any of your categories.

The one thing I can say with certainty is that you want the most up-to-date version of file you can get your hands on. The "magic" files that contain the patterns to match against and the MIME types that will result from a match are updated fairly often to include newer filetypes like WebM, or just to improve accuracy for older types.