Problem decoding characters from websites


#1

Hey All,

I am using URL::readEntireTextStream (false) to get text from websites, and it works well except with characters that haven't been encoded as HTML entities (I think that's the right term, correct?). Basically, special characters like em dashes (—) or curly quotes that have been left “in the raw” within the page always get mangled by readEntireTextStream; an em dash, for instance, comes out as “å”. I understand that you are supposed to encode these things, but is there anything that can be done about it on my end?
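For reference, here's roughly what I'm doing, along with the kind of workaround I've been imagining: skipping readEntireTextStream, reading the raw bytes myself, and forcing a UTF-8 conversion. This is just a sketch (fetchAsUtf8 is my own name for it, the JUCE calls are from memory, and it blindly assumes the server is actually sending UTF-8):

    #include "JuceHeader.h"

    // What I'm doing now: let JUCE guess the text encoding.
    String fetchPage (const String& address)
    {
        return URL (address).readEntireTextStream (false); // false => GET, not POST
    }

    // Possible workaround: grab the raw bytes and decode them explicitly
    // as UTF-8 instead of letting readEntireTextStream guess. Assumes the
    // page really is UTF-8 (the Content-Type header should say).
    String fetchAsUtf8 (const String& address)
    {
        ScopedPointer<InputStream> in (URL (address).createInputStream (false));

        if (in == nullptr)
            return String(); // connection failed

        MemoryBlock raw;
        in->readIntoMemoryBlock (raw);

        return String::fromUTF8 (static_cast<const char*> (raw.getData()),
                                 (int) raw.getSize());
    }

Does forcing UTF-8 like that sound like a sane approach, or am I missing something?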

And while you are reading this, I might as well make it a twofer… does anyone know of a cross-platform class or lib to transform HTML entities into String-friendly text (that would be UTF-8, right?). I really don't feel like diving into this myself, and I can live with a dependency.

Sorry if this is basic; I am not that experienced with text encodings and Unicode.

many thanks,

c.


#2

Sorry for necromancing this thread, but I was looking for the same thing.

For a possible solution, check this:

http://stackoverflow.com/questions/1082162/how-to-decode-html-entities-in-c/1082191#1082191

 


#3

Ask and ye shall receive.

I've recently adapted the code at http://www.codecodex.com/wiki/Unescape_HTML_special_characters_from_a_String to work with Juce Strings.

It seems to work nicely for my current use case (building a reader to keep up to date with this forum), but I haven't tested it thoroughly, so YMMV. Let me know if you find anything wrong with it or make any fixes.
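Here's the gist of the adaptation, reconstructed as a rough sketch rather than the exact code (unescapeHtml is just my name for it, and the named-entity table only covers the usual suspects):

    #include "JuceHeader.h"

    // Walk the string, expanding numeric entities (&#8212; / &#x2014;)
    // and a handful of common named ones. Unknown entities are left
    // untouched so nothing gets silently eaten.
    String unescapeHtml (const String& html)
    {
        String out;
        const int len = html.length();
        int i = 0;

        while (i < len)
        {
            const juce_wchar c = html[i];

            if (c == '&')
            {
                const int semi = html.indexOfChar (i, ';');

                // sanity check: a real entity is short and non-empty
                if (semi > i + 1 && semi - i <= 10)
                {
                    const String entity (html.substring (i + 1, semi));

                    if (entity.startsWithChar ('#'))
                    {
                        // numeric entity, decimal (&#8212;) or hex (&#x2014;)
                        const int code = entity.startsWithIgnoreCase ("#x")
                                           ? entity.substring (2).getHexValue32()
                                           : entity.substring (1).getIntValue();

                        out += String::charToString ((juce_wchar) code);
                    }
                    else if (entity == "amp")   out += "&";
                    else if (entity == "lt")    out += "<";
                    else if (entity == "gt")    out += ">";
                    else if (entity == "quot")  out += "\"";
                    else if (entity == "apos")  out += "'";
                    else if (entity == "nbsp")  out += String::charToString ((juce_wchar) 0xa0);
                    else                        out += html.substring (i, semi + 1);

                    i = semi + 1;
                    continue;
                }
            }

            out += String::charToString (c);
            ++i;
        }

        return out;
    }

Usage is just String clean = unescapeHtml (rawPageText); the character-by-character append isn't fast, so consider preallocating if you're feeding it big pages.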

-Andrew