URL escape characters

ioue · May 30, 2007, 12:24pm

Hi,

I have a problem with the removeEscapeChars function in Juce::URL.

For example, suppose I try to remove the escaped characters from the URL below:
http://www.google.co.jp/search?q=テスト&start=0&hl=ja&lr=lang_ja&ie=utf-8&oe=utf-8&client=firefox&rls=org.mozilla:ja:official

In removeEscapeChars(), each sequence starting with % is simply decoded as one character (e.g. %20 -> space).
But if the escaped characters are encoded in UTF-8 (I think UTF-8 is now the standard encoding on the Web), this approach breaks.
%E3%83%86: these three escaped bytes represent one character (the Japanese katakana “テ”). With the current method, they are translated into three chars…

So I’m sure that the removeEscapeChars function needs to take an encoding type.
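To illustrate the behaviour described above, here is a minimal sketch in plain standard C++ (not JUCE's actual implementation). The helpers `hexDigitValue` and `percentDecodeToBytes` are hypothetical names made up for this example. It shows that each %XX escape yields exactly one raw byte, so %E3%83%86 becomes three bytes that still have to be interpreted in some encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of JUCE): value of a hex digit, or -1 if invalid.
static int hexDigitValue (char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: replaces each %XX escape with the single raw byte it encodes.
// The result is a byte sequence, not yet text in any particular encoding.
std::string percentDecodeToBytes (const std::string& escaped)
{
    std::string bytes;
    bytes.reserve (escaped.size());

    for (std::size_t i = 0; i < escaped.size(); ++i)
    {
        if (escaped[i] == '%' && i + 2 < escaped.size())
        {
            const int hi = hexDigitValue (escaped[i + 1]);
            const int lo = hexDigitValue (escaped[i + 2]);

            if (hi >= 0 && lo >= 0)
            {
                bytes.push_back (static_cast<char> (hi * 16 + lo));
                i += 2; // skip the two hex digits just consumed
                continue;
            }
        }

        bytes.push_back (escaped[i]); // ordinary character or malformed escape: copy as-is
    }

    return bytes;
}
```

Calling `percentDecodeToBytes ("%E3%83%86")` gives a three-byte string (0xE3, 0x83, 0x86). Those bytes only become the single character “テ” once they are decoded as UTF-8.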

Best regards,
ioue from Japan

X-Ryl669 · May 30, 2007, 12:45pm

I think that’s the right behaviour.
%E3%83%86 should return the 3 chars 0xe3, 0x83, 0x86.
However, they are not translated from UTF-8, because the function doesn’t know they were encoded as UTF-8 (it could be UTF-16, or any other encoding).
What you can do is specify the conversion explicitly, like this:

String url = yourURLHere;
String decodedUrl = String::fromUTF8(URL::removeEscapeChars(url));
// Then use decodedUrl
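The second step of that approach (interpreting the unescaped bytes as UTF-8) can be sketched in plain standard C++ as follows. This is only an illustration, not how `String::fromUTF8` is implemented. `decodeUtf8` is a hypothetical name, and the sketch assumes malformed input should become U+FFFD. It does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Malformed bytes are replaced with U+FFFD (an assumption made for this sketch).
std::u32string decodeUtf8 (const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;

    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char> (bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back (0xFFFD); ++i; continue; } // stray continuation or invalid lead byte

        bool ok = (i + extra < bytes.size()); // enough bytes left for the whole sequence?

        for (std::size_t k = 1; ok && k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char> (bytes[i + k]);

            if ((c & 0xC0) != 0x80)
                ok = false;                   // not a continuation byte (10xxxxxx)
            else
                cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }

        if (ok) { out.push_back (cp);     i += extra + 1; }
        else    { out.push_back (0xFFFD); ++i; }
    }

    return out;
}
```

With this sketch, the three bytes 0xE3, 0x83, 0x86 decode to the single code point U+30C6, which is the katakana “テ” from the original URL.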

jules · May 30, 2007, 1:11pm

Ah, I see… Ok, I’ll definitely get that fixed for the next release. Thanks for letting me know.

RolandMR · May 30, 2007, 3:32pm

I wrote a function to detect UTF-8; it might be useful here:

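[code]// Heuristic: guesses whether a block of bytes is UTF-8 by counting
// lead/continuation byte pairs that look right versus pairs that look wrong.
bool isUtf8(MemoryBlock& mb)
{
    if (mb.getSize() < 2)
        return false;

    int goodUtf = 0;
    int badUtf  = 0;

    for (int i = 1; i < mb.getSize(); i++)
    {
        uint8 currByte = mb[i];
        uint8 prevByte = mb[i - 1];

        if ((currByte & 0xC0) == 0x80)          // current byte is a continuation byte (10xxxxxx)
        {
            if ((prevByte & 0xC0) == 0xC0)      // it follows a lead byte (11xxxxxx): plausible
                goodUtf++;
            else if ((prevByte & 0x80) == 0x00) // it follows plain ASCII: not valid UTF-8
                badUtf++;
        }
        else if ((prevByte & 0xC0) == 0xC0)     // lead byte not followed by a continuation byte
        {
            badUtf++;
        }
    }

    return goodUtf >= badUtf;
}[/code]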

I use it in InputStream::readString(), since the behavior of writeString() changed a while back, breaking my data files.
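The function above is a heuristic that tolerates some bad byte pairs. For comparison, here is a sketch of a stricter structural check in plain standard C++. It is a different technique from the one posted above, and `isStructurallyValidUtf8` is a hypothetical name. It only checks that each lead byte is followed by the right number of continuation bytes, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical alternative to the heuristic: returns true only if every byte
// sequence has a valid lead byte followed by the expected continuation bytes.
bool isStructurallyValidUtf8 (const std::string& bytes)
{
    std::size_t i = 0;

    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char> (bytes[i]);
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                extra = 0; // plain ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1; // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2; // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) extra = 3; // 11110xxx
        else return false;                         // stray continuation or invalid lead byte

        if (i + extra >= bytes.size())
            return false;                          // sequence is cut off at the end

        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char> (bytes[i + k]) & 0xC0) != 0x80)
                return false;                      // expected a continuation byte (10xxxxxx)

        i += extra + 1;
    }

    return true;
}
```

Unlike the heuristic, this returns true for empty and single-byte ASCII input, and false as soon as one sequence is malformed. Which behaviour is preferable depends on whether the data may contain a few corrupt bytes.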

ioue · May 31, 2007, 2:02am

Thank you all. I’ll try to implement your suggestions.
And thank you jules, I look forward to the next version.
