Problem decoding characters from websites


#1

Hey All,

I am using URL::readEntireTextStream (false) to get text from websites, and it works well except with characters that haven't been encoded as HTML entities (I think that's the right term, correct?). Basically, special characters like em dashes (—) or curly quotes that have been left “in the raw” within the page always get mangled by readEntireTextStream; an em dash, for instance, comes out as “å”. I understand that you are supposed to encode these things, but is there anything that can be done about it on my end?
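For reference, here's roughly what I'm doing, along with the kind of workaround I've been imagining: skipping readEntireTextStream, reading the raw bytes myself, and forcing a UTF-8 conversion. This is just a sketch (fetchAsUtf8 is my own name for it, the JUCE calls are from memory, and it blindly assumes the server is actually sending UTF-8):

    #include "JuceHeader.h"

    // What I'm doing now: let JUCE guess the text encoding.
    String fetchPage (const String& address)
    {
        return URL (address).readEntireTextStream (false); // false => GET, not POST
    }

    // Possible workaround: grab the raw bytes and decode them explicitly
    // as UTF-8 instead of letting readEntireTextStream guess. Assumes the
    // page really is UTF-8 (the Content-Type header should say).
    String fetchAsUtf8 (const String& address)
    {
        ScopedPointer<InputStream> in (URL (address).createInputStream (false));

        if (in == nullptr)
            return String(); // connection failed

        MemoryBlock raw;
        in->readIntoMemoryBlock (raw);

        return String::fromUTF8 (static_cast<const char*> (raw.getData()),
                                 (int) raw.getSize());
    }

Does forcing UTF-8 like that sound like a sane approach, or am I missing something?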

And while you are reading this, I might as well make it a twofer… does anyone know of a cross-platform class or lib to transform HTML entities into String-friendly text (that would be UTF-8, right?). I really don't feel like diving into this myself, and I can live with a dependency.

Sorry if this is basic; I am not that experienced with text encodings and Unicode.

many thanks,

c.


#2

Sorry for necromancing this thread, but I was looking for the same thing.

For a possible solution, check this:

http://stackoverflow.com/questions/1082162/how-to-decode-html-entities-in-c/1082191#1082191

 


#3

Ask and ye shall receive.

I've recently adapted the code at http://www.codecodex.com/wiki/Unescape_HTML_special_characters_from_a_String to work with Juce Strings.

It seems to work nicely for my current use case (building a reader to keep up to date with this forum), but I haven't tested it thoroughly, so YMMV. Let me know if you find anything wrong with it or make any fixes.
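Here's the gist of the adaptation, reconstructed as a rough sketch rather than the exact code (unescapeHtml is just my name for it, and the named-entity table only covers the usual suspects):

    #include "JuceHeader.h"

    // Walk the string, expanding numeric entities (&#8212; / &#x2014;)
    // and a handful of common named ones. Unknown entities are left
    // untouched so nothing gets silently eaten.
    String unescapeHtml (const String& html)
    {
        String out;
        const int len = html.length();
        int i = 0;

        while (i < len)
        {
            const juce_wchar c = html[i];

            if (c == '&')
            {
                const int semi = html.indexOfChar (i, ';');

                // sanity check: a real entity is short and non-empty
                if (semi > i + 1 && semi - i <= 10)
                {
                    const String entity (html.substring (i + 1, semi));

                    if (entity.startsWithChar ('#'))
                    {
                        // numeric entity, decimal (&#8212;) or hex (&#x2014;)
                        const int code = entity.startsWithIgnoreCase ("#x")
                                           ? entity.substring (2).getHexValue32()
                                           : entity.substring (1).getIntValue();

                        out += String::charToString ((juce_wchar) code);
                    }
                    else if (entity == "amp")   out += "&";
                    else if (entity == "lt")    out += "<";
                    else if (entity == "gt")    out += ">";
                    else if (entity == "quot")  out += "\"";
                    else if (entity == "apos")  out += "'";
                    else if (entity == "nbsp")  out += String::charToString ((juce_wchar) 0xa0);
                    else                        out += html.substring (i, semi + 1);

                    i = semi + 1;
                    continue;
                }
            }

            out += String::charToString (c);
            ++i;
        }

        return out;
    }

Usage is just String clean = unescapeHtml (rawPageText); the character-by-character append isn't fast, so consider preallocating if you're feeding it big pages.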

-Andrew