Problem with URL in OS X


#1

Hey All,

I am having a problem with the URL class. Whenever I request a URL using readEntireTextStream(bool) I always get a couple of garbage lines, at the start of the stream and at the end. The beginning garbage line seems to be always composed of three characters, and the end is usually “0”. This means that if I request an XML file via URL, it ALWAYS fails to parse.

In addition, I get the occasional 3 character garbage line in some webpages, between the beginning and the end.

Is this normal? Can anyone reproduce this?

Also, to reiterate a previous problem I was having, URL doesn’t seem to encode non-ascii characters properly. Things like curly quotes and em-dashes produce the wrong output.

Again, I am using OS X, 10.5.

thanks,

c.


#2

Just a wild guess that is most likely wrong, but is the document using characters not supported (installed) on that mac system?

FlyingIsFun1217


#3

We are talking em dashes, apostrophes and curly quotes here, not anything rare or fancy. Things like these:

— “ ” ′

(which I can see in the browser, in OS X) are totally mangled by URL.

c.


#4

Testing on Windows shows that the garbage lines at the beginning and the end are not present. Also, readEntireXmlStream works there on the same URI that fails with JUCE on OS X. So this seems to be an OS X-only issue.

As for “special characters” (like: — “ ”) that are not escaped in the page source, they get mangled on Windows as well. It would be nice to get the actual character. I thought we would, since pages are usually encoded in UTF-8 or better.

c.


#5

Sounds like a bug to me. Can you give me a URL to try that definitely does this?


#6

You can try:

http://matadata.com/bucket/test.html for escaped and non-escaped curly quotes. Non-escaped quotes produce “â” (plus apparently some other invisible stuff) on my system. This page, however, does not display the initial “garbage line”, and I don’t know why.
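
One possible explanation for the “â”, offered as an assumption rather than something confirmed in this thread: the page is served as UTF-8, but the bytes are being read one byte per character, as if they were Latin-1. A left curly quote (U+201C) is three bytes in UTF-8, E2 80 9C. Read as Latin-1, E2 is “â” and the other two bytes are invisible control characters. A minimal sketch in standard C++ (both function names are hypothetical and not part of JUCE):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Treats every byte as one Latin-1 character (code point == byte value).
// This is the suspected wrong interpretation.
std::vector<std::uint32_t> decodeAsLatin1 (const std::string& bytes)
{
    std::vector<std::uint32_t> codePoints;
    for (unsigned char b : bytes)
        codePoints.push_back (b);
    return codePoints;
}

// Decodes UTF-8 into code points. No validation is done: the input is
// assumed to be well-formed UTF-8.
std::vector<std::uint32_t> decodeAsUtf8 (const std::string& bytes)
{
    std::vector<std::uint32_t> codePoints;
    std::size_t i = 0;

    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char> (bytes[i]);
        std::uint32_t cp = 0;
        int extra = 0; // number of continuation bytes that follow the lead byte

        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }

        ++i;
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char> (bytes[i]) & 0x3F);

        codePoints.push_back (cp);
    }

    return codePoints;
}
```

For the bytes "\xE2\x80\x9C", decodeAsUtf8 gives the single code point U+201C, while decodeAsLatin1 gives three: U+00E2 (â), U+0080 and U+009C.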

my site: http://matadata.com is hopelessly broken by URL. Newlines basically produce garbage, and there are random characters interspersed throughout. I changed the character encoding of the source files to no effect. It might be how PHP is serving the HTML, but I have no control over that.

http://nytimes.com displays the initial garbage line, and the closing “0” line.

http://search.yahooapis.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=madonna&results=3&start=0 returns an XML file that can’t be parsed on my system due to the initial garbage line and the closing 0.

thanks,

c.


#7

It’s probably HTTP/1.1 chunked transfer encoding you’re getting.

Please refer to this page to understand what’s going on…
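
For reference, a chunked HTTP/1.1 body is a sequence of chunks, each introduced by its size in hexadecimal on its own line, and terminated by a zero-size chunk. That matches the symptoms reported above: the short “garbage” line at the start is a chunk size, and the closing “0” is the terminating chunk. A minimal decoding sketch in standard C++, assuming the whole body is already in memory and ignoring any trailers (`decodeChunkedBody` is a hypothetical helper, not a JUCE function):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes an HTTP/1.1 chunked body held entirely in memory.
// Each chunk is: <hex size>[;extensions]\r\n<data>\r\n, ending with a "0" chunk.
std::string decodeChunkedBody (const std::string& raw)
{
    std::string decoded;
    std::size_t pos = 0;

    for (;;)
    {
        const std::size_t lineEnd = raw.find ("\r\n", pos);
        if (lineEnd == std::string::npos)
            throw std::runtime_error ("missing chunk-size line");

        // std::stoul stops at the first non-hex character, so any
        // ";extension" after the size is ignored.
        const std::size_t chunkSize
            = std::stoul (raw.substr (pos, lineEnd - pos), nullptr, 16);
        pos = lineEnd + 2;

        if (chunkSize == 0)
            break; // the lone "0" line seen at the end of the stream

        if (raw.size() - pos < chunkSize)
            throw std::runtime_error ("truncated chunk");

        decoded.append (raw, pos, chunkSize);
        pos += chunkSize + 2; // skip the data and its trailing CRLF
    }

    return decoded;
}
```

Given "5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n", this returns "Hello, world". The size lines and the final "0" are exactly the lines that show up as garbage when the body is read without decoding.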


#8

Hi Jules,

Are you working on this (I see a commented-out line about chunk encoding in juce_mac_HTTPStream.h), or is it worth my spending some time on it?

Update: a trivial fix is of course to replace HTTP/1.1 with HTTP/1.0 in the request headers sent.
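
A sketch of why that works: a server should only use chunked transfer encoding when the request declares HTTP/1.1 or later, so a request that announces HTTP/1.0 gets a plain body back. The helper below is hypothetical, written in standard C++ for illustration only, and does not reflect the actual contents of juce_mac_HTTPStream.h:

```cpp
#include <string>

// Builds a minimal GET request. With the default version "1.0" the server
// is expected to reply without chunked transfer encoding.
std::string buildGetRequest (const std::string& host,
                             const std::string& path,
                             const std::string& httpVersion = "1.0")
{
    return "GET " + path + " HTTP/" + httpVersion + "\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}
```

Calling buildGetRequest ("example.com", "/index.html") produces a request whose first line is "GET /index.html HTTP/1.0". The host and path here are placeholders.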


#9

I hadn’t had time to look at it yet, but I’ll need to very soon. (Good idea about just changing the header version)


#10

I was seeing the same behavior on Mac OS X doing CDDB lookups over HTTP with JUCE 1.45: garbage bytes at the beginning and end of the message (as it turns out, chunked-encoding headers). The same code had worked fine on Windows for months.

Changing the header to HTTP/1.0 in juce_mac_HTTPStream.h seems to have fixed it. Big thanks to jpo for posting that suggestion here. :)

Art


#11

Yes, thanks. The tip version now uses HTTP/1.0, and it seems to work for me.