MD5 output different from common generators

Hi.

I know there’s already a thread on this, but it’s old and JUCE’s MD5 has been changed since then, so I don’t want to create confusion.
For reference, it’s http://www.rawmaterialsoftware.com/viewtopic.php?f=2&t=5079

Today I tried using MD5 on strings as simple as “dog”, “cat”, etc., and the result of MD5(“dog”).toHexString() is different from what common MD5 generators produce.

MD5(“dog”).toHexString(); = 4dc9a19944bad6ba904b6f5189d6dd0e
http://www.md5.net for “dog” = 06d80eb0c50b49a509b49f2424e8c805

Is there some other internal conversion needed, rather than UTF8?

It seems like there’s a very defined and standardized algorithm for it: http://en.wikipedia.org/wiki/MD5#Algorithm

Sigh… I knew I should have just removed the constructor that takes a string.

Look, I think the other thread explains the situation perfectly, but here are the key facts again:

  • There are an infinite number of ways to turn a string into an MD5. There is no ‘right’ or ‘wrong’ way to do it.
  • The juce MD5 string constructor uses utf-32, not utf-8, so obviously if you compare it to something that used utf-8 it will be different.
  • Yes, I should probably have used utf-8.
  • No, I can’t change it now, because that’d break everybody’s existing code.
  • It’s easy to get the MD5 of a utf-8 string: just give it a pointer to your utf-8 data, rather than using the constructor that takes a string.
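To make the utf-32 versus utf-8 point concrete, here is a minimal sketch in plain C++ (no JUCE) that builds the bytes a hash function would be fed for the same text in each encoding. The helper names utf8Bytes and utf32Bytes are made up for illustration, and the sketch assumes ASCII-only input and native byte order for the UTF-32 data; it is not the code JUCE itself runs.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: raw bytes of an ASCII string as UTF-8
// (one byte per character for ASCII input).
std::vector<std::uint8_t> utf8Bytes (const std::string& ascii)
{
    return std::vector<std::uint8_t> (ascii.begin(), ascii.end());
}

// Hypothetical helper: raw bytes of the same text as UTF-32 in the
// machine's native byte order (four bytes per character).
std::vector<std::uint8_t> utf32Bytes (const std::u32string& text)
{
    std::vector<std::uint8_t> bytes (text.size() * sizeof (char32_t));

    if (! bytes.empty())
        std::memcpy (bytes.data(), text.data(), bytes.size());

    return bytes;
}
```

For “dog”, utf8Bytes ("dog") holds 3 bytes and utf32Bytes (U"dog") holds 12, so a hash computed over one cannot match a hash computed over the other.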

[quote=“jules”]Sigh… I knew I should have just removed the constructor that takes a string.

Look, I think the other thread explains the situation perfectly, but here are the key facts again:

  • There are an infinite number of ways to turn a string into an MD5. There is no ‘right’ or ‘wrong’ way to do it.
  • The juce MD5 string constructor uses utf-32, not utf-8, so obviously if you compare it to something that used utf-8 it will be different.
  • Yes, I should probably have used utf-8.
  • No, I can’t change it now, because that’d break everybody’s existing code.
  • It’s easy to get the MD5 of a utf-8 string: just give it a pointer to your utf-8 data, rather than using the constructor that takes a string.[/quote]

I can understand when you say it is too late to change it (we don’t want to break other projects), but I disagree about the ‘infinite number of ways’. It might be infinite, but I still see MD5 checksums used between programs and across websites without any problems.
If there are infinite options, the most logical one to choose is the most supported/popular one (obviously there is such an option)…

The minimum you can do, imo, is remove that String CTor (or maybe change its implementation to use the char* one)…

Ok, I’m exaggerating with ‘infinite’, but there are [number of possible string encodings] * [number of possible byte-orderings] * [extra formatting options, e.g. length, zero-terminator, etc]. That’s a large number.
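As a rough illustration of how those factors multiply, the sketch below serialises the same text as UTF-32 with a selectable byte order and an optional zero terminator. The names ByteOrder and encodeUtf32 are hypothetical, and only one encoding family is covered; it is meant to show that each combination yields a different byte sequence, not to reproduce any particular library.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical option type for this sketch.
enum class ByteOrder { little, big };

// Serialises each code point as a 32-bit unit in the requested byte order,
// optionally followed by a zero terminator unit.
std::vector<std::uint8_t> encodeUtf32 (const std::u32string& text,
                                       ByteOrder order,
                                       bool includeTerminator)
{
    std::vector<std::uint8_t> out;

    auto appendUnit = [&] (std::uint32_t unit)
    {
        for (int i = 0; i < 4; ++i)
        {
            // Little-endian emits the lowest byte first, big-endian the highest.
            const int shift = (order == ByteOrder::little) ? 8 * i : 8 * (3 - i);
            out.push_back (static_cast<std::uint8_t> ((unit >> shift) & 0xffu));
        }
    };

    for (char32_t c : text)
        appendUnit (static_cast<std::uint32_t> (c));

    if (includeTerminator)
        appendUnit (0);

    return out;
}
```

Two byte orders times two terminator choices already gives four distinct inputs for one encoding, and each would produce its own MD5.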

I’ve already done that, it’ll be in my next check-in. What I’ve done is to move the old constructor to make it more explicit, and to add a constructor that takes a CharPointer_UTF8 so it’s clear what’s actually happening.

Got it… Thanks!

Please, could someone provide a code snippet for this topic?

I’m using this method, but the result is different from common generators when the plain string contains multibyte characters.

String computeMd5(String plain){
	MD5 md5String(plain.toUTF8(), plain.length());
	return md5String.toHexString();
};

Cheers
Emanuele

String::length() returns the number of characters in the string, NOT the number of bytes in its utf-8 encoding.
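The difference between the two counts can be seen without JUCE. The following is a minimal sketch using std::string to hold UTF-8 data; utf8ByteCount and utf8CharCount are illustrative names, and the character count assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 data (terminator not included).
std::size_t utf8ByteCount (const std::string& utf8)
{
    return utf8.size();
}

// Number of code points: count every byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8CharCount (const std::string& utf8)
{
    std::size_t count = 0;

    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++count;

    return count;
}
```

For "caf\xC3\xA9" (“café” in UTF-8) the byte count is 5 while the character count is 4, so passing the character count as a byte length drops the last byte from the hash input.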

WTF?
Since when does it do that?
One of the most important advantages of using a string class is avoiding computing the string length (I mean the memory consumption) for each string operation.
OMG!! I probably have bad code around assuming string.length() == stringBuffer.memorySize().

Can you add a getRequiredBytesForUTF8() method to the string class and change the toUTF8() signature to read toUTF8(const int requiredBytes = 0), so we avoid doing a useless strlen() each time we convert to UTF8?

It has always done that! Since the internal format used by the string may change (and has changed in the past, from UTF32 to UTF8), it wouldn’t make any sense at all for length() to return anything other than the number of characters.

And also note the fact that toUTF8 does nothing except to return the string’s CharPointer_UTF8 object, which already provides methods you can call to get things like the byte size. There would be no point in me adding any new methods to String to do that, since they’re already available in that class.

So:

String computeMd5(String plain){
   MD5 md5String(plain.toUTF8(), plain.toUTF8().sizeInBytes());
   return md5String.toHexString();
};

should do the trick…

[quote=“lelepar”]So:

String computeMd5(String plain){
   MD5 md5String(plain.toUTF8(), plain.toUTF8().sizeInBytes());
   return md5String.toHexString();
};

should do the trick…[/quote]

Yes, as long as you want your checksum to also include the string’s terminating zero. I don’t know if that’s how MD5s are commonly calculated or not.

No, they are not (hence the other thread, with the exact same issue with SHA256).
Anyway, the code should read:

    String computeMd5(String plain){
       MD5 md5String(plain.toUTF8(), strlen(plain.toUTF8()));
       return md5String.toHexString();
    };

…or just

String computeMd5 (const String& plain) { return MD5 (plain.toUTF8()).toHexString(); }

(assuming you’re using the latest modules branch, where there’s an MD5 constructor that takes a CharPointer_UTF8)