[solved] save URL::readEntireTextStream() to UTF8 encoded file


#1

Aloha, I'm trying to save UTF8 encoded HTML:

String word = textWord->getText().trim();
String dictUrl = "http://de.thefreedictionary.com/p/" + word;
URL url(dictUrl);    

String path = File::getCurrentWorkingDirectory().getFullPathName() + "/" + word + ".html";
File file(path);
FileOutputStream out(file);

out.writeText(url.readEntireTextStream(), false, false);  // save as UTF-8 ?
// out.writeString(url.readEntireTextStream()); // tried this too
out.flush();

Alas, when I open the file (in Internet Explorer 9, Win7 x64), the characters don't display properly. However, if I load the contents from URL::readEntireTextStream() into a TextEditor, then right-click/copy/paste into Notepad++ and choose Encoding->Encode in UTF-8, I can save the file and re-open it in IE9, and it displays as expected.

I've lurked a bit on the forums and grokked some of the source, and my understanding is that:

1) the internal String representation is UTF8 by default

2) OutputStream::writeText() saves the String as UTF8 or UTF16, depending on parameters

What am I missing? Thanks in advance.


#2

The problem turned out to be that I wasn't interpreting/reading the file correctly. The code I posted works exactly as one would expect--byte for byte, the data was faithfully written as UTF8. For those with UTF8 growing pains like me, here's the rub.

First of all, I don't want to "peel onions for 6 months in a submarine," and if you don't catch my reference, I'll direct you to read Vinnie's post:

http://www.juce.com/comment/272799#comment-272799

Solution 1:

After having a second read myself, I found out that the web pages I was accessing did not have a content type meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

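Because I was dealing with HTML, it was easy enough to inject this tag directly after <head>. When loaded from the web, these pages display correctly in IE9 because the content type (including the charset) is sent in the HTTP response headers before the main HTML content is delivered, so IE doesn't need a meta tag. Once the page is saved to disk, that header is gone and the browser has to guess the encoding. At least that's my understanding, so experts, please correct me if I'm wrong.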
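
For anyone who wants to see the injection step spelled out, here is a minimal sketch in plain C++ (std::string only, no JUCE). The helper name injectCharsetMeta is made up for illustration, and it assumes the page has a bare <head> tag with no attributes; if no such tag is found, the HTML is returned unchanged.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of JUCE): inserts a UTF-8 charset meta tag
// directly after the first bare <head> tag. Assumes the tag has no attributes.
std::string injectCharsetMeta (const std::string& html)
{
    static const std::string meta =
        "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
    static const std::string headTag = "<head>";

    // Search a lower-cased copy so that <HEAD> and <Head> match as well.
    // The copy has the same length, so positions carry over to the original.
    std::string lower (html);
    std::transform (lower.begin(), lower.end(), lower.begin(),
                    [] (unsigned char c) { return (char) std::tolower (c); });

    const std::string::size_type pos = lower.find (headTag);

    if (pos == std::string::npos)
        return html;  // no <head> found: leave the page untouched

    std::string result (html);
    result.insert (pos + headTag.size(), meta);
    return result;
}
```

The result would then be written out the same way as in my first post. Converting between a JUCE String and std::string is left out here on purpose.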

Solution 2:

I viewed the file saved by Notepad++ in a hex editor and found that the UTF8 BOM (byte order mark) had been prepended to the file:

  // three bytes: EF BB BF

I'm guessing that this is less portable than Solution 1, but it does work under Win7:

FileOutputStream out(file);
out.write ("\xEF\xBB\xBF", 3);  // UTF8 BOM: must be the first three bytes of the file
// save rest of file
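
If the data might already start with a BOM (say, it came from a file Notepad++ saved), blindly writing the three bytes would double it up. A minimal sketch of a guard in plain C++ follows; withUtf8Bom is a made-up name, and it assumes the input is already UTF8 encoded bytes held in a std::string.

```cpp
#include <string>

// Hypothetical helper (not part of JUCE): returns the UTF8 bytes with a
// BOM (EF BB BF) in front, unless the data already starts with one.
std::string withUtf8Bom (const std::string& utf8)
{
    // Length given explicitly so the three bytes are taken as-is.
    static const std::string bom ("\xEF\xBB\xBF", 3);

    // compare() only looks at the first bom.size() bytes (or fewer if the
    // input is shorter), so short and empty inputs are handled too.
    if (utf8.compare (0, bom.size(), bom) == 0)
        return utf8;

    return bom + utf8;
}
```

The returned bytes could then be passed to out.write() with their data() and size(), in place of the two separate writes above.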

There might be compatibility issues with this method, so I'll leave you with the Wikipedia entry on UTF8 byte order marks:

http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark


#3

Does File::appendText() still not support UTF-8?


#4

No… appendText has always written UTF8. I think you may have misunderstood this thread.