I’m writing a script to calculate the MD5 sum of an image excluding the EXIF tag.
In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.
How can I determine where in the file the tag is located?
The images that I am scanning are in the format TIFF, JPG, PNG, BMP, DNG, CR2, NEF, and some videos MOV, AVI, and MPG.
It is much easier to use the Python Imaging Library to extract the picture data (example in iPython):
In : from PIL import Image
In : import hashlib
In : im = Image.open('foo.jpg')
In : hashlib.md5(im.tobytes()).hexdigest()
Out: '171e2774b2549bbe0e18ed6dcafd04d5'
This works on any type of image that PIL can handle. The tobytes method returns a bytes object containing the pixel data.
BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:
In : hashlib.sha512(im.tobytes()).hexdigest()
Out: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'
On my machine, calculating the MD5 checksum for a 2500×1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:
#!/usr/bin/env python3
from PIL import Image
import hashlib
import sys

# The image path is the first command-line argument.
im = Image.open(sys.argv[1])
print(hashlib.sha512(im.tobytes()).hexdigest(), end="")
For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
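For the video case, here is a rough sketch of how the frames could be piped straight from ffmpeg into a hash without writing them to disk. The helper names (hash_stream, rawvideo_command, hash_video_frames) are made up for illustration, and it assumes an ffmpeg binary is on the PATH. Keep in mind that decoded frames can differ between decoder versions, so the resulting hash is only comparable across machines with the same ffmpeg build.

```python
import hashlib
import subprocess


def hash_stream(stream, algo="sha512", chunk_size=1 << 20):
    """Hash a binary file-like object chunk by chunk."""
    h = hashlib.new(algo)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def rawvideo_command(path):
    """Build ffmpeg arguments that write decoded RGB frames to stdout."""
    return ["ffmpeg", "-v", "error", "-i", path, "-an",
            "-f", "rawvideo", "-pix_fmt", "rgb24", "-"]


def hash_video_frames(path, algo="sha512"):
    """Hash the decoded frames of a video (assumes ffmpeg is installed)."""
    with subprocess.Popen(rawvideo_command(path),
                          stdout=subprocess.PIPE) as proc:
        digest = hash_stream(proc.stdout, algo)
    if proc.returncode != 0:
        raise RuntimeError("ffmpeg failed with exit code %d" % proc.returncode)
    return digest
```

Calling hash_video_frames("clip.mov") would then return a hex digest of the pixel data only, with container metadata left out.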
One simple way to do it is to hash the core image data. For PNG, you could do this by counting only the “critical chunks” (i.e. the ones starting with capital letters). JPEG has a similar but simpler file structure.
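To make the "critical chunks" idea concrete, here is a minimal sketch that walks the chunks of a PNG held in memory and hashes only those whose type starts with an uppercase letter. The function names iter_chunks and critical_hash are my own, it does not verify the CRCs, and it assumes the whole file fits in memory.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(data):
    """Yield (chunk_type, payload) for every chunk in a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC


def critical_hash(data, algo="sha256"):
    """Hash only critical chunks (type begins with an uppercase letter)."""
    h = hashlib.new(algo)
    for ctype, payload in iter_chunks(data):
        if ctype[:1].isupper():
            h.update(ctype)
            h.update(payload)
    return h.hexdigest()
```

Ancillary chunks such as tEXt or eXIf start with a lowercase letter, so adding or editing them leaves the result of critical_hash unchanged.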
The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.
This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean 🙂
import struct
import os
import hashlib

def png(fh):
    hash = hashlib.md5()
    # The 8-byte signature is \x89 P N G \r \n \x1a \n
    assert fh.read(8)[1:4] == b"PNG"
    while True:
        try:
            length, = struct.unpack(">i", fh.read(4))
        except struct.error:
            break  # end of file
        if fh.read(4) == b"IDAT":
            hash.update(fh.read(length))
            fh.read(4)  # CRC
        else:
            # Skip the payload and CRC of every non-IDAT chunk
            fh.seek(length + 4, os.SEEK_CUR)
    print("Hash: %r" % hash.digest())

def jpeg(fh):
    hash = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"  # SOI marker
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # Start of stream
            hash.update(fh.read())
            break
        else:
            # The length field counts its own two bytes
            fh.seek(length - 2, os.SEEK_CUR)
    print("Hash: %r" % hash.digest())

if __name__ == '__main__':
    # Files must be opened in binary mode
    png(open("sample.png", "rb"))
    jpeg(open("sample.jpg", "rb"))
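The same segment walk can also answer the "where is the tag" part of the question for JPEG. The sketch below is only an illustration under simplifying assumptions: split_jpeg is a name I made up, it treats any APP1 segment beginning with "Exif\0\0" as the EXIF tag, and it expects every segment before the scan data to carry a length field (no fill bytes or standalone markers).

```python
import hashlib
import struct


def split_jpeg(data):
    """Return (exif_spans, other_bytes) for a JPEG byte string.

    exif_spans lists (offset, size) for each APP1/Exif segment, marker
    included; other_bytes is the file content with those segments removed.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    spans, kept, pos = [], [data[:2]], 2
    while pos + 4 <= len(data):
        marker, length = struct.unpack(">HH", data[pos:pos + 4])
        if marker & 0xFF00 != 0xFF00:
            raise ValueError("lost sync at offset %d" % pos)
        if marker == 0xFFDA:  # start of scan: the rest is image data
            break
        end = pos + 2 + length  # length field excludes the marker itself
        if marker == 0xFFE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            spans.append((pos, end - pos))
        else:
            kept.append(data[pos:end])
        pos = end
    kept.append(data[pos:])  # scan data (or whatever remains)
    return spans, b"".join(kept)
```

With spans, rest = split_jpeg(open("sample.jpg", "rb").read()), the spans list tells you where the EXIF data sits, and hashlib.md5(rest).hexdigest() gives a checksum of everything else. Unlike the script above, this keeps the quantization and Huffman tables in the hash.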
$ stream -map rgb -storage-type short image.tif - | sha256sum
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  -
$ sha256sum <(stream -map rgb -storage-type short image.tif -)
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  /dev/fd/63
This method reports the same signature hash that the verbose ImageMagick identify command reports:
$ identify -verbose image.tif | grep signature
signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64
(for ImageMagick v6.x; the hash reported by identify on version 7 is different from that obtained using stream, but the latter may be reproduced by any tool capable of extracting the raw bitmap data, such as dcraw for some image types.)
I would use a metadata stripper to preprocess the files before hashing:
From the ImageMagick package you have …
mogrify -strip blah.jpg
and if you do
identify -list format
it apparently works with all the cited formats.