Generation of MD5/SHA File Hashes in Java
Sunday, July 26th, 2009 | Java, Tech-savvy | 1 Comment
This post is about generating file hashes in Java. I needed to generate file hashes for a media application that I am working on, because I wanted a way to identify dupes. The best way IMHO is a hash code of the file contents, which has a constant size (even for large files) and can be easily compared to other hash codes, thus making the identification of dupes a breeze.
Java provides a hashCode() method for all objects, inherited from java.lang.Object, but this is not what we are looking for, as this excerpt of the Java SE6 API Doc states:
The general contract of hashCode is:
- Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
- If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
- It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.
To be perfectly honest, it would be quite silly to believe that the hashCode() method of java.lang.Object would be sufficient for generating file hashes. We might be lucky enough that the hashCode() method of java.io.File overrides the default behaviour of Object with something more suitable for files. Well, it does indeed, but this is still not what we want (API Doc excerpt):
Computes a hash code for this abstract pathname.
Well, java.io.File.hashCode() computes a hash based on the pathname. Again, not suitable.
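To make that concrete, here is a minimal sketch that checks the behaviour. The class name PathHashDemo, its two helper methods and the file names are made up for illustration. It only relies on the documented fact that the hash is derived from the pathname.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

final class PathHashDemo {
    // Two File objects built from the same path are equal and share a hash code.
    // The file does not have to exist, because hashCode() never opens it.
    static boolean samePathSameHash(String path) {
        File first = new File(path);
        File second = new File(path);
        return first.equals(second) && first.hashCode() == second.hashCode();
    }

    // Rewrites a temp file with different bytes and checks that the hash code
    // of the File object stays the same, since the path did not change.
    static boolean hashIgnoresContents() throws IOException {
        File file = File.createTempFile("hash-demo", ".bin");
        file.deleteOnExit();
        Files.write(file.toPath(), new byte[] {1, 2, 3});
        int before = file.hashCode();
        Files.write(file.toPath(), new byte[] {4, 5, 6, 7});
        int after = file.hashCode();
        return before == after;
    }

    public static void main(String[] args) throws IOException {
        // "no-such-file.bin" is a hypothetical name used only for this demo.
        System.out.println(samePathSameHash("no-such-file.bin"));
        System.out.println(hashIgnoresContents());
    }
}
```

Both checks come out true, which is exactly why this hash tells us nothing about whether two files have the same contents.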
What we really need is a method that reads all the bytes of a file and computes a hash of the file contents, not of some metadata. This is how we do it (not my work, just the first snippet Google provided):
    public static String generateHash(File file)
            throws NoSuchAlgorithmException, FileNotFoundException, IOException {
        MessageDigest md = MessageDigest.getInstance("SHA"); // SHA or MD5
        String hash = "";
        byte[] data = new byte[(int) file.length()];
        FileInputStream fis = new FileInputStream(file);
        fis.read(data); // Reads it all at one go. Might be better to chunk it.
        fis.close();
        md.update(data);
        byte[] digest = md.digest();
        for (int i = 0; i < digest.length; i++) {
            String hex = Integer.toHexString(digest[i]);
            if (hex.length() == 1) hex = "0" + hex;
            hex = hex.substring(hex.length() - 2);
            hash += hex;
        }
        return hash;
    }
This worked for me, but there are (at least) two things I don't like about this solution. First, as the comment already states, this method reads the whole file at once, which will give you a java.lang.OutOfMemoryError: Java heap space quite fast once the files get large. Second, the for loop assembles the String representation of the hash by hand, which is error prone and not easily maintainable.
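On the second point, one less fiddly way to build the hex string is to let String.format do the padding. This is only a sketch of that idea, not part of the snippet I found. HexDemo and toHex are names I made up.

```java
final class HexDemo {
    // Converts each byte to exactly two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes are not sign-extended.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(new byte[] {0x00, 0x0f, (byte) 0xff})); // prints 000fff
    }
}
```

The mask and the %02x format replace both the length check and the substring call of the original loop.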
So I looked further and came across this solution:
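    public static String generateBufferedHash(File file)
            throws NoSuchAlgorithmException, FileNotFoundException, IOException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream is = new FileInputStream(file);
        try {
            byte[] buffer = new byte[8192];
            int read;
            // read() returns -1 at the end of the stream
            while ((read = is.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        } finally {
            is.close(); // release the file handle even if reading fails
        }
        byte[] md5 = md.digest();
        BigInteger bi = new BigInteger(1, md5);
        String hex = bi.toString(16);
        // toString(16) drops leading zeros, so pad to two hex digits per byte
        while (hex.length() < md5.length * 2) {
            hex = "0" + hex;
        }
        return hex;
    }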
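Wow, just a small helper method: a buffered read loop that hashes large files without taking too much memory, plus the toString() method that BigInteger already provides. One caveat: BigInteger.toString(16) drops leading zeros, so the result should be left-padded to two hex digits per digest byte. This is just what I was looking for. I hope this post saves some people out there the time of implementing their own file hash solution. Happy coding!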
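To tie this back to the dupe detection I started with, here is a minimal sketch of how the hashes could be used. DupeFinder, hash and groupByHash are names I made up, and the hash method is my own variation of the buffered approach that assumes Java 7 or later (try-with-resources) and pads the hex string to the full digest length.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DupeFinder {
    // Streams the file through the digest in 8 KB chunks and returns
    // a zero-padded lowercase hex string.
    static String hash(File file) throws NoSuchAlgorithmException, IOException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream is = new FileInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = is.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        byte[] digest = md.digest();
        // Width is two hex digits per digest byte, e.g. 32 for MD5.
        return String.format("%0" + (digest.length * 2) + "x", new BigInteger(1, digest));
    }

    // Groups files by content hash. Any list with more than one entry
    // holds files whose contents produced the same hash.
    static Map<String, List<File>> groupByHash(List<File> files)
            throws NoSuchAlgorithmException, IOException {
        Map<String, List<File>> groups = new HashMap<>();
        for (File file : files) {
            String key = hash(file);
            List<File> group = groups.get(key);
            if (group == null) {
                group = new ArrayList<>();
                groups.put(key, group);
            }
            group.add(file);
        }
        return groups;
    }
}
```

A HashMap keyed by the hex string keeps the comparison cheap: every file is read once, and candidates for dupes simply end up in the same list.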
P.S.: If you care about the hash algorithm used (e.g. MD5 or SHA) have a look at java.security.Security.getProviders() and the .getInfo() of each given Provider.
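A quick way to look at this is sketched below. The class name DigestAlgorithms and the helper available() are hypothetical. The calls on Security and Provider are the standard java.security API, but what gets printed depends on the JVM and its installed providers.

```java
import java.security.Provider;
import java.security.Security;
import java.util.Set;
import java.util.TreeSet;

final class DigestAlgorithms {
    // Collects the names of all MessageDigest algorithms that the
    // installed providers register, sorted alphabetically.
    static Set<String> available() {
        Set<String> names = new TreeSet<String>();
        for (Provider provider : Security.getProviders()) {
            for (Provider.Service service : provider.getServices()) {
                if ("MessageDigest".equals(service.getType())) {
                    names.add(service.getAlgorithm());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Print the human-readable description of each provider.
        for (Provider provider : Security.getProviders()) {
            System.out.println(provider.getName() + ": " + provider.getInfo());
        }
        System.out.println(available());
    }
}
```

Any name from that set can be passed to MessageDigest.getInstance() in the methods above.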