Algorithm to compare two images in C#

Byyo picture Byyo · Feb 2, 2016 · Viewed 61.6k times · Source

I'm writing a tool in C# to find duplicate images. Currently I create an MD5 checksum of the files and compare those.

Unfortunately, the images can be:

  • Rotated by 90 degrees.
  • Have different dimensions (smaller image with same content).
  • Have different compression or file types (e.g. jpeg artifacts, see below).

higher resolution koalalower resolution koala

What would be the best approach to solve this problem?

Answer

fubo picture fubo · Feb 2, 2016

Here is a simple approach with a 256 bit image-hash (MD5 has 128 bit)

  1. resize the picture to 16x16 pixel

16x16 resized

  1. reduce colors to black/white (which equals true/false in this console output)

enter image description here

  1. read the boolean values into List<bool> - this is the hash

Code:

public static List<bool> GetHash(Bitmap bmpSource)
{
    List<bool> lResult = new List<bool>();         
    //create new image with 16x16 pixel
    Bitmap bmpMin = new Bitmap(bmpSource, new Size(16, 16));
    for (int j = 0; j < bmpMin.Height; j++)
    {
        for (int i = 0; i < bmpMin.Width; i++)
        {
            //reduce colors to true / false                
            lResult.Add(bmpMin.GetPixel(i, j).GetBrightness() < 0.5f);
        }             
    }
    return lResult;
}

I know, GetPixel is not that fast but on a 16x16 pixel image it should not be the bottleneck.

  1. compare this hash to hash values from other images and add a tolerance.(number of pixels that can differ from the other hash)

Code:

List<bool> iHash1 = GetHash(new Bitmap(@"C:\mykoala1.jpg"));
List<bool> iHash2 = GetHash(new Bitmap(@"C:\mykoala2.jpg"));

//determine the number of equal pixel (x of 256)
int equalElements = iHash1.Zip(iHash2, (i, j) => i == j).Count(eq => eq);

So this code is able to find equal images with:

  • different file formats (e.g. jpg, png, bmp)
  • rotation (90, 180, 270), horizontal /vertical flip - by changing the iteration order of i and j
  • different dimensions (same aspect is required)
  • different compression (tolerance is required in case of quality loss like jpeg artifacts) - you can accept a 99% equality to be the same image and 50% to be a different one.
  • colored changed to geyscaled and the other way round (because brightness is independent of color)

Update / Improvements:

after using this method for a while I noticed a few improvements that can be done

  • replacing GetPixel for more performance
  • using the exeif-thumbnail instead of reading the whole image for a performance improvement
  • instead of setting 0.5f to differ between light and dark - use the distinct median brightness of all 256 pixels. Otherwise dark/light images are assumed to be the same and it enables to detect images which have a changed brightness.
  • if you need fast calculations, use bool[] or List<bool> if you need to store a lot hashes with the need to save memory, use a Bitarray because a Boolean isn't stored in a bit, it takes a byte!