I am using URL.readEntireTextStream(false) to get text from websites, and it works well except with characters that haven't been escaped as HTML entities (I think that's the right term — correct me if not). Special characters like em dashes or curly quotes that have been left "in the raw" within the page always come out mangled by readEntireTextStream; an em dash, for instance, comes back as "å". I understand that pages are supposed to escape these things, but is there anything I can do on my end?
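For what it's worth, the symptom described above looks less like an entity problem and more like a charset mismatch: the page's bytes are one encoding (often UTF-8), but the reader decodes them as a single-byte encoding. A minimal sketch of that effect, in Python purely for illustration since I'm not sure what platform readEntireTextStream belongs to:

```python
# An em dash is three bytes in UTF-8.
raw = "\u2014".encode("utf-8")   # b'\xe2\x80\x94'

# Mis-decoding those bytes as Latin-1 yields three garbage characters
# (mojibake) instead of the one intended em dash.
wrong = raw.decode("latin-1")
print(len(wrong))                # 3 characters of junk

# Decoding with the correct charset restores the original character.
right = raw.decode("utf-8")
print(right == "\u2014")         # True
```

The exact garbage characters you see depend on which wrong encoding the reader assumed (Latin-1, MacRoman, Windows-1252, ...), which is presumably why my em dashes surface as "å" specifically.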
And while you're reading this, I might as well make it a twofer: does anyone know of a cross-platform class or library to transform HTML entities into plain String-friendly text? (That would be Unicode text, which I'd then write out as UTF-8 — right?) I'd rather not dive into this myself, and I can live with a dependency.
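To make the second request concrete: what I'm after is the kind of one-call entity decoding that, say, Python's standard library provides with html.unescape — named and numeric entities turned into ordinary Unicode characters. (Python here is just an illustration of the behavior I want, not my actual platform.)

```python
import html

# Named entities (&mdash;, &ldquo;, &eacute;, ...) and numeric ones
# (&#8212;) all decode to plain Unicode characters.
decoded = html.unescape("Caf&eacute; &mdash; &ldquo;quoted&rdquo; &#8212; done")
print(decoded)
```

Something equivalent in a portable, dependency-friendly package is exactly what I'm hoping exists.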
Sorry if this is basic; I'm not very experienced with text encodings and Unicode.