I'm writing a tool in C# to find duplicate images. Currently I create an MD5 checksum of the files and compare those.
Unfortunately, the images can be:
What would be the best approach to solve this problem?
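Here is a simple approach with a 256-bit image hash (MD5 has 128 bits): shrink the image to 16x16 pixels, reduce each pixel to true/false by its brightness, and collect the values in a List<bool> - this is the hash.
Code: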
// requires: using System.Collections.Generic; using System.Drawing;
public static List<bool> GetHash(Bitmap bmpSource)
{
    List<bool> lResult = new List<bool>();
    // create a new 16x16 pixel image; dispose it when done
    using (Bitmap bmpMin = new Bitmap(bmpSource, new Size(16, 16)))
    {
        for (int j = 0; j < bmpMin.Height; j++)
        {
            for (int i = 0; i < bmpMin.Width; i++)
            {
                // reduce colors to true / false (true = dark pixel)
                lResult.Add(bmpMin.GetPixel(i, j).GetBrightness() < 0.5f);
            }
        }
    }
    return lResult;
}
I know GetPixel is not that fast, but on a 16x16 pixel image it should not be the bottleneck.
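If GetPixel does turn out to matter (for example when hashing many thousands of files), one common replacement is to read the raw pixel buffer via LockBits. The following is only a sketch under assumptions: the class name FastImageHash is made up for illustration, the bitmap is locked as 32bpp ARGB, and the stride is assumed to be positive, which is the usual case for bitmaps created this way.

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the original answer.
public static class FastImageHash
{
    public static List<bool> GetHash(Bitmap bmpSource)
    {
        var result = new List<bool>(256);
        using (var bmpMin = new Bitmap(bmpSource, new Size(16, 16)))
        {
            var rect = new Rectangle(0, 0, bmpMin.Width, bmpMin.Height);
            // Lock the pixels once and ask for 4 bytes per pixel (B, G, R, A).
            BitmapData data = bmpMin.LockBits(rect, ImageLockMode.ReadOnly,
                                              PixelFormat.Format32bppArgb);
            try
            {
                // Assumption: data.Stride is positive (top-down bitmap).
                var bytes = new byte[data.Stride * data.Height];
                Marshal.Copy(data.Scan0, bytes, 0, bytes.Length);

                for (int y = 0; y < data.Height; y++)
                {
                    for (int x = 0; x < data.Width; x++)
                    {
                        int p = y * data.Stride + x * 4;
                        Color c = Color.FromArgb(bytes[p + 2], bytes[p + 1], bytes[p]);
                        // Same rule as the GetPixel version: true = dark pixel.
                        result.Add(c.GetBrightness() < 0.5f);
                    }
                }
            }
            finally
            {
                bmpMin.UnlockBits(data);
            }
        }
        return result;
    }
}
```

The row-by-row iteration order matches the GetPixel version, so hashes from both variants should be comparable for opaque images.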
Code:
List<bool> iHash1 = GetHash(new Bitmap(@"C:\mykoala1.jpg"));
List<bool> iHash2 = GetHash(new Bitmap(@"C:\mykoala2.jpg"));
//determine the number of equal pixels (x of 256); Zip and Count need System.Linq
int equalElements = iHash1.Zip(iHash2, (i, j) => i == j).Count(eq => eq);
So this code is able to find equal images with:
i and j
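To turn the count of equal pixels into a duplicate decision, the comparison can be wrapped in a small helper with a tolerance. This is a minimal sketch: the class name HashComparer and the default cutoff of 0.95 are assumptions chosen for illustration, not values from the answer, and the cutoff would need tuning against real images.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class HashComparer
{
    // Fraction of positions (0.0 to 1.0) at which the two hashes agree.
    public static double Similarity(IReadOnlyList<bool> a, IReadOnlyList<bool> b)
    {
        if (a.Count != b.Count || a.Count == 0)
            throw new ArgumentException("Hashes must be non-empty and of equal length.");

        int equal = a.Zip(b, (x, y) => x == y).Count(eq => eq);
        return (double)equal / a.Count;
    }

    // minSimilarity is an assumed, tunable cutoff.
    public static bool IsProbableDuplicate(IReadOnlyList<bool> a, IReadOnlyList<bool> b,
                                           double minSimilarity = 0.95)
    {
        return Similarity(a, b) >= minSimilarity;
    }
}
```

With the hashes from the snippet above, the call would be HashComparer.IsProbableDuplicate(iHash1, iHash2).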
Update / Improvements:
After using this method for a while, I noticed a few improvements that can be made:
Replace GetPixel with a faster way of reading the pixels for more performance.
Instead of a fixed 0.5f threshold to distinguish light from dark, use the median brightness of all 256 pixels. Otherwise, overall dark or light images are assumed to be the same; using the median also makes it possible to detect images whose brightness has been changed.
bool[] or List<bool>: if you need to store a lot of hashes and need to save memory, use a BitArray instead, because a Boolean isn't stored in a bit, it takes a byte!
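The median threshold and the BitArray storage could be combined roughly as follows. This is a sketch under assumptions: the class MedianHash and its method names are hypothetical, and the input is taken to be one brightness value per pixel of the 16x16 image (for example from Color.GetBrightness()).

```csharp
using System;
using System.Collections;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class MedianHash
{
    // Builds the hash by comparing each pixel to the image's own median brightness.
    public static BitArray FromBrightness(float[] brightness)
    {
        if (brightness == null || brightness.Length == 0)
            throw new ArgumentException("At least one brightness value is required.");

        float[] sorted = brightness.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        float median = sorted.Length % 2 == 0
            ? (sorted[mid - 1] + sorted[mid]) / 2f
            : sorted[mid];

        // One bit per pixel: 256 pixels need 32 bytes of payload instead of 256.
        var bits = new BitArray(brightness.Length);
        for (int i = 0; i < brightness.Length; i++)
            bits[i] = brightness[i] < median; // true = darker than the median
        return bits;
    }

    // Number of positions at which both hashes hold the same bit.
    public static int CountEqualBits(BitArray a, BitArray b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Hashes must have equal length.");

        // Xor modifies the instance it is called on, so work on a copy of a.
        BitArray diff = new BitArray(a).Xor(b);
        int differing = 0;
        for (int i = 0; i < diff.Length; i++)
            if (diff[i]) differing++;
        return a.Length - differing;
    }
}
```

Because the threshold moves with the image, a uniformly brightened copy tends to produce the same bits as the original, which a fixed 0.5f cutoff would not guarantee.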