XML Special Characters


#1

Hi!

There is something I don't understand, and I'm sure I am missing something.

I want to save an XML file with special characters in it (like æ, ø, etc.).
I would use "utf-8" encoding, which in theory supports that.

But when the XmlElement class writes the file, it "escapes" all those characters.
My expected behaviour is to get the file without those characters being escaped. Is that possible?

 


Example:
If I execute this:

XmlElement elementList("RootTree");
XmlElement* element = new XmlElement("Element1");

element->addTextElement(String("aeiouåøæñ"));

elementList.addChildElement(element);
elementList.writeToFile( File::getCurrentWorkingDirectory().getChildFile ("test.xml"), "", "UTF-8");

I get this (in the test.xml file):

<?xml version="1.0" encoding="UTF-8"?>

<RootTree>
  <Element1 name1="value1">&#241;&#241;p&#229;p&#248;</Element1>
</RootTree>

But I expect to have something like this:

<?xml version="1.0" encoding="UTF-8"?>

<RootTree>
  <Element1 name1="value1">aeiouåøæñ</Element1>
</RootTree>


 

 

Thanks!

 


#2

String("aeiouåøæñ")

<sigh> See this kind of thing so often... It's not safe to embed unicode in C++ source files - how could the compiler know what encoding to read your source file in, or what encoding to use when converting these characters into an 8-bit const char* ? Always use the Introjucer's UTF8 string literal creator tool to generate safe string literals.
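To illustrate the point about source encodings, here is a minimal sketch of an encoding-independent literal: the non-ASCII characters are written as explicit UTF-8 bytes using hex escapes, so the compiler never has to guess. The function name `safeLiteral` is made up for this example. In JUCE, such a literal would normally be wrapped as `String (CharPointer_UTF8 ("..."))`, which, as far as I know, is the form the Introjucer tool generates; the sketch below avoids JUCE so it can be compiled on its own.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// "aeiouåøæñ" spelled out as explicit UTF-8 bytes, so the result does not
// depend on the encoding the compiler assumes for the source file.
// Each of å, ø, æ and ñ occupies two bytes in UTF-8.
// (safeLiteral is a hypothetical name, not part of any library.)
inline const char* safeLiteral()
{
    return "aeiou\xc3\xa5\xc3\xb8\xc3\xa6\xc3\xb1";
}
```

The literal is 13 bytes long: 5 ASCII letters plus 4 two-byte characters.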

And yes, the XML stuff escapes all the characters down to ascii, to make sure that it can be read by any reader without problems. That's intentional, and the result will be correct once it has been loaded.
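As a rough illustration of what "escaping down to ascii" means, the sketch below decodes UTF-8 and writes every code point above 127 as a decimal character reference. It is a standalone approximation, not JUCE's actual implementation; `escapeNonAscii` is a hypothetical helper, it assumes well-formed UTF-8 input, and it leaves out the escaping of markup characters such as `&` and `<`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a JUCE function): writes every code point above
// 127 as a decimal character reference such as &#241;. ASCII is copied as-is.
inline std::string escapeNonAscii (const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;

    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char> (utf8[i]);

        if (lead < 0x80) { out += static_cast<char> (lead); ++i; continue; }

        // The high bits of the lead byte give the number of continuation bytes.
        int extra = 0;
        std::uint32_t cp = 0;
        if      ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else { out += '?'; ++i; continue; } // stray byte: input was assumed valid

        ++i;
        for (int k = 0; k < extra && i < utf8.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char> (utf8[i]) & 0x3F);

        out += "&#" + std::to_string (cp) + ";";
    }

    return out;
}
```

With this sketch, the UTF-8 bytes for "aeiouåøæñ" become `aeiou&#229;&#248;&#230;&#241;`, which is pure ASCII and therefore independent of whatever encoding the reader assumes.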


#3

Yes, you are completely right. That was just a sample to explain the behaviour. In the app I have, the strings are taken from the file directory, where some folder names contain those chars (it's user data, basically). Thanks for the explanation anyway.

BTW, if I understand correctly, there is not a way to unescape those chars using juce, right?
 


#4

there is not a way to unescape those chars using juce, right?

If you mean "generate raw XML output in which the strings do not use escaping" then there is no way to do that.

If you mean "unescape those strings when reading the XML" then of course they'll get unescaped automatically by the parser.
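For the second case, the sketch below shows roughly what a parser does when it meets a decimal character reference: it converts `&#N;` back into the UTF-8 bytes for code point N. This is a standalone illustration rather than JUCE's parser; `appendUtf8` and `unescapeDecimalRefs` are hypothetical names, and hex references (`&#x...;`) and named entities are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends one code point to 'out' as UTF-8 (hypothetical helper).
inline void appendUtf8 (std::string& out, std::uint32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char> (cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char> (0xC0 | (cp >> 6));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char> (0xE0 | (cp >> 12));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char> (0xF0 | (cp >> 18));
        out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: turns decimal references like &#241; back into UTF-8.
// Anything that is not a complete decimal reference is copied unchanged.
inline std::string unescapeDecimalRefs (const std::string& text)
{
    std::string out;
    std::size_t i = 0;

    while (i < text.size())
    {
        if (text.compare (i, 2, "&#") == 0)
        {
            std::size_t j = i + 2;
            std::uint32_t cp = 0;

            while (j < text.size() && text[j] >= '0' && text[j] <= '9')
                cp = cp * 10 + static_cast<std::uint32_t> (text[j++] - '0');

            // Accept only "&#<digits>;" with a value inside the Unicode range.
            if (j > i + 2 && j < text.size() && text[j] == ';' && cp <= 0x10FFFF)
            {
                appendUtf8 (out, cp);
                i = j + 1;
                continue;
            }
        }

        out += text[i++];
    }

    return out;
}
```

Applied to the element text from the first post, `&#241;&#241;p&#229;p&#248;` comes back as the UTF-8 bytes for "ññpåpø", so no information is lost by the escaping.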


#5

Ok, thanks!


#6

Can't you save the xml-file as utf-8?

I would have expected the same result as ivanslo did, namely an XML file encoded as UTF-8 with human-readable characters, not character codes.

It doesn't really make sense to me to use a character encoding like UTF-8, which was mainly invented to allow storing and displaying all sorts of characters from non-English languages, and end up with a file that's hardly readable!

After all, it's hard to expect anyone to have a unicode translator at hand just to enter a file name like Größenmaßstäbe.txt in an xml-settings file. Or Möbelträgerfüße.doc. Let alone to be able to read such a file correctly.
 


#7

Producing human-friendly output was never the intention of the XML classes - I designed them for storing and retrieving data without any mistakes, and IMHO the only foolproof way to do that is to only use ascii + escapes.


#8

Hi jules.

ascii + escapes.

Maybe it could be made optional?