URL escape characters


#1

Hi,

I'm having trouble with the removeEscapeChars function in Juce::URL.

E.g.
Suppose I try to remove the escaped chars from the URL below:
http://www.google.co.jp/search?q=テスト&start=0&hl=ja&lr=lang_ja&ie=utf-8&oe=utf-8&client=firefox&rls=org.mozilla:ja:official

removeEscapeChars() simply converts each sequence starting with % into one character (e.g. %20 -> whitespace).
But if the escaped chars are encoded in UTF-8 (which I think is now the standard encoding on the Web), this approach breaks.
In %E3%83%86, the 3 escaped bytes together represent one symbol (the Japanese Katakana character テ). With the current method, they are translated into 3 separate chars…

So, I’m sure that the removeEscapeChars function needs to know the encoding type.

Best regards,
ioue from Japan


#2

I think it’s the right behaviour.
%E3%83%86 should return the 3 chars 0xe3, 0x83, 0x86.
However, they are not decoded as UTF-8, because the function doesn’t know the URL was encoded in UTF-8 (it could be UTF-16, or any other encoding).
What you can do is specify the conversion explicitly, like this:

String url = yourURLHere;
String decodedUrl = String::fromUTF8(URL::removeEscapeChars(url));
// Then use decodedUrl 
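
To make the two steps concrete, here is a minimal sketch in plain standard C++ (no JUCE types) of the first step only: turning each %XX escape into one raw byte. The helper names percentDecodeToBytes and hexDigitValue are hypothetical and not part of JUCE; the point is just that %E3%83%86 comes out as three bytes, and only a later UTF-8 conversion turns them into one character.

```cpp
#include <cstddef>
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
static int hexDigitValue (char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper (not a JUCE function): replaces each valid %XX escape
// with the single raw byte it encodes. No character-set decoding happens here;
// malformed escapes are copied through unchanged.
std::string percentDecodeToBytes (const std::string& escaped)
{
    std::string bytes;
    bytes.reserve (escaped.size());

    for (std::size_t i = 0; i < escaped.size(); ++i)
    {
        if (escaped[i] == '%' && i + 2 < escaped.size())
        {
            const int hi = hexDigitValue (escaped[i + 1]);
            const int lo = hexDigitValue (escaped[i + 2]);

            if (hi >= 0 && lo >= 0)
            {
                bytes.push_back (static_cast<char> (hi * 16 + lo));
                i += 2; // skip the two hex digits just consumed
                continue;
            }
        }

        bytes.push_back (escaped[i]);
    }

    return bytes;
}
```

For the input "%E3%83%86" this yields the bytes 0xE3, 0x83, 0x86, which is the UTF-8 encoding of U+30C6 (テ). Interpreting those bytes as UTF-8 is the separate second step, which is what the String::fromUTF8 call above is for.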

#3

Ah, I see… Ok, I’ll definitely get that fixed for the next release. Thanks for letting me know.


#4

I wrote a function to detect utf8, it might be useful here:

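[code]// Heuristic: guesses whether the block holds UTF-8 by counting
// plausible and implausible pairs of adjacent bytes.
bool isUtf8(MemoryBlock& mb)
{
    if (mb.getSize() < 2)
        return false;

    int goodUtf = 0;
    int badUtf  = 0;

    for (int i = 1; i < (int) mb.getSize(); i++)
    {
        uint8 currByte = (uint8) mb[i];
        uint8 prevByte = (uint8) mb[i - 1];

        if ((currByte & 0xC0) == 0x80)          // current byte is a continuation byte (10xxxxxx)
        {
            if ((prevByte & 0xC0) == 0xC0)      // it follows a lead byte (11xxxxxx)
                goodUtf++;
            else if ((prevByte & 0x80) == 0x00) // it follows a plain ASCII byte
                badUtf++;
        }
        else if ((prevByte & 0xC0) == 0xC0)     // lead byte not followed by a continuation byte
        {
            badUtf++;
        }
    }

    return goodUtf >= badUtf;
}[/code]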

I use it in InputStream::readString(), since the behavior of writeString() changed a while back, breaking my data files.
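
If a yes/no answer is preferred over a guess, a stricter check is also possible. The following is only a sketch in standard C++, with a hypothetical name isStrictUtf8 and a raw pointer/size interface assumed for illustration (it is not a JUCE function). It rejects the buffer at the first malformed sequence instead of counting good and bad pairs.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true only if the whole buffer is well-formed UTF-8.
// Unlike the counting heuristic, a single bad sequence makes it return false.
bool isStrictUtf8 (const std::uint8_t* data, std::size_t size)
{
    std::size_t i = 0;

    while (i < size)
    {
        const std::uint8_t lead = data[i];
        std::size_t extra = 0;       // number of continuation bytes expected
        std::uint32_t codePoint = 0; // value decoded so far
        std::uint32_t minimum = 0;   // smallest value allowed for this length

        if (lead < 0x80)                { ++i; continue; } // plain ASCII
        else if ((lead & 0xE0) == 0xC0) { extra = 1; codePoint = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; codePoint = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; codePoint = lead & 0x07; minimum = 0x10000; }
        else return false; // stray continuation byte or invalid lead byte

        if (size - i <= extra)
            return false; // sequence is cut off at the end of the buffer

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const std::uint8_t next = data[i + k];

            if ((next & 0xC0) != 0x80)
                return false; // expected a continuation byte (10xxxxxx)

            codePoint = (codePoint << 6) | (next & 0x3Fu);
        }

        if (codePoint < minimum)                       return false; // overlong encoding
        if (codePoint > 0x10FFFF)                      return false; // beyond Unicode range
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false; // UTF-16 surrogate

        i += extra + 1;
    }

    return true;
}
```

Note that pure ASCII input passes this check, since ASCII is valid UTF-8, and an empty buffer passes too. Whether that is the right answer for old data files depends on what else they might contain, so the heuristic above may still be the better fit for mixed or damaged data.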


#5

Thank you all. I’ll try to implement your suggestions.
And thank you jules, I look forward to the next version. :slight_smile:


#6