MD5 results differ from other MD5 generators


#1

Hi,
when I generate an MD5 checksum from a String, the value returned by toHexString() differs from the values that other MD5 generators calculate.
I checked these three generators:
http://www.adamek.biz/md5-generator.php
http://www.md5generator.com/index.php
http://files.kniebes.net/php/md5/
Their results are consistent.
juce::MD5 returns a 32-digit hex string that differs from the other results (which are also 32-digit hex strings).

Is this a bug or am I missing something?


#2

It could be a character encoding difference - juce treats the strings as wide-character unicode. These other ones might be doing it as utf8, or ascii, or who knows what… Maybe try passing your string as utf8 to the md5 class rather than as a string.


#3

Ok, I compared the result of juce::MD5 with the results of the PHP and MySQL MD5 functions and got the following:
(Let’s call the two different resulting strings ‘A’ and ‘B’, which represent the 32 digit hex strings.)

PHP MD5: A
MySQL MD5: A

MD5(myString).toHexString(): B
MD5(myString.toUTF8()).toHexString(): B
MD5(myString).toHexString().toUTF8(): B
MD5(myString.toUTF8()).toHexString().toUTF8(): B
MD5((const char*) myString, myString.length()).toHexString(): A

So the only way to get the standard MD5 result with the juce MD5 class is to call it with:
MD5((const char*) myString, myString.length())

But I expected that
MD5 (const String &text)
and
MD5 (const char *data, const int numBytes)
would give the same result when both are called with the same String object as source data.
Obviously that isn’t the case, which is really confusing.

What do you think?


#4

[quote]But I expected that
MD5 (const String &text)
and
MD5 (const char *data, const int numBytes)
give the same result, if I call them both using the same String object as source data.[/quote]

Why would you assume that? An MD5 is calculated from raw data, and there are many ways to turn a string into raw data… I decided to do it by treating the string as a series of wide chars, and these others seem to either be using utf8 or ascii or something. It might be the case that they produce different values from each other if you feed them a string containing multi-byte characters, as they might not all be using the same encoding.

Is there a standard for this? If so I’d be happy to change my code to match it!
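
To illustrate the point about raw data: the same text becomes different byte sequences depending on the encoding, and MD5 only ever sees the bytes. The following is a minimal sketch using only the standard library, not juce. The function names toUtf8Bytes and toWideBytes are hypothetical, and the wide layout assumes 32-bit little-endian characters, which is only one possibility since the size of a wide char is platform-dependent.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialise code points as UTF-8 (1 to 4 bytes each).
std::vector<std::uint8_t> toUtf8Bytes (const std::u32string& text)
{
    std::vector<std::uint8_t> out;
    auto push = [&out] (char32_t v) { out.push_back (static_cast<std::uint8_t> (v)); };

    for (char32_t c : text)
    {
        if (c < 0x80)
        {
            push (c);                           // plain ASCII: one byte
        }
        else if (c < 0x800)
        {
            push (0xC0 | (c >> 6));             // two-byte sequence
            push (0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)
        {
            push (0xE0 | (c >> 12));            // three-byte sequence
            push (0x80 | ((c >> 6) & 0x3F));
            push (0x80 | (c & 0x3F));
        }
        else
        {
            push (0xF0 | (c >> 18));            // four-byte sequence
            push (0x80 | ((c >> 12) & 0x3F));
            push (0x80 | ((c >> 6) & 0x3F));
            push (0x80 | (c & 0x3F));
        }
    }

    return out;
}

// Hypothetical helper: serialise each code point as a fixed-width 32-bit
// little-endian unit. This is an assumed wide-char layout, not necessarily
// the one any particular library or platform uses.
std::vector<std::uint8_t> toWideBytes (const std::u32string& text)
{
    std::vector<std::uint8_t> out;

    for (char32_t c : text)
        for (int shift = 0; shift < 32; shift += 8)
            out.push_back (static_cast<std::uint8_t> ((c >> shift) & 0xFF));

    return out;
}
```

For U"abc", toUtf8Bytes gives the 3 bytes 61 62 63, while toWideBytes gives 12 bytes (61 00 00 00 62 00 00 00 63 00 00 00). Hashing those two buffers cannot give the same MD5, even though both represent the same text.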


#5

Good point, there is no standard way of doing this.
So the best approach seems to be to find out what kind of data the target you are comparing against uses, and then choose the appropriate data to call juce::MD5 with.
Thanks


#6

TBH, looking at my code, converting the string to UTF8 would have been a much neater way to do it (though I might actually have written the md5 code before I had a UTF8 converter). I’d change it to work that way, although that risks breaking people’s code that already uses the old method…