XmlElement returning trimmed text elements


#1

Hi Jules,

The following example:

XmlDocument doc (T("<body>\n foo </body>"));
XmlElement* e = doc.getDocumentElement();
std::cerr << "'" << e->getChildElement(0)->getText().toUTF8()
          << "' '" << e->getAllSubText().toUTF8() << "'\n";

outputs:
'foo' 'foo'

So JUCE has removed all the whitespace and linefeeds at the beginning and end of the text node. Is it possible to fix that?


#2

But that’s how XML works!

If you need the whitespace preserved, you should put it in an attribute, not in the body text.


#3

Then I’m confused… How can I render some pseudo-XHTML if whitespace is not preserved? When I parse XML from Python, the whitespace is preserved:

from xml.dom import minidom
doc = minidom.parseString('<body> foo </body>')
print(doc.childNodes[0].childNodes[0])

shows ‘<DOM Text node " foo ">’

I am by no means an XML expert, but as far as I know, all XML parsers preserve whitespace inside text nodes. See for example http://www.usingxml.com/Basics/XmlSpace


#4

Hmm. I was under the impression that leading and trailing whitespace was fair game in text nodes… I guess that’s not the case!

Very easy to fix though - it’s just a two-line change, in juce_XmlDocument.cpp, 629:

[code]
        //textElementContent = textElementContent.trim();

        if (textElementContent.trim().isNotEmpty())
            e->setText (textElementContent);
[/code]

Hopefully that’ll sort it for you.


#5

OK great, thanks! To be completely standards-conformant I think the blank text nodes should also be returned (those that contain a single linefeed, etc.). However, I don’t think I need that, and it might break existing XML-reading code from other users who do not expect such nodes, which are generally useless. Or maybe that could be an option of XmlDocument:

if (trim_text_nodes) // defaults to true
    textElementContent = textElementContent.trim();

if (textElementContent.isNotEmpty())
    e->setText (textElementContent);
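As a rough illustration of how such an option would behave, here is a minimal sketch in plain C++ with no JUCE dependency. The names `trimWhitespace` and `prepareNodeText` are made up for this example and are not part of JUCE; `trimWhitespace` only approximates `String::trim()` by stripping space, tab, CR and LF.

```cpp
#include <string>

// Hypothetical stand-in for String::trim(): strips space, tab, CR and LF
// from both ends of the string.
std::string trimWhitespace (const std::string& s)
{
    const char* const ws = " \t\r\n";
    const auto first = s.find_first_not_of (ws);

    if (first == std::string::npos)
        return {};   // nothing but whitespace

    const auto last = s.find_last_not_of (ws);
    return s.substr (first, last - first + 1);
}

// Hypothetical helper mirroring the proposed option. It optionally trims
// the content in place and returns true if a text node should be created.
bool prepareNodeText (std::string& content, bool trimTextNodes = true)
{
    if (trimTextNodes)
        content = trimWhitespace (content);

    return ! content.empty();
}
```

With the default setting, "\n foo " becomes "foo" and a whitespace-only string produces no node. With the option switched off, both strings are kept exactly as they were read, including the blank one.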


#6

Yes - that’s the approach that MS used, and seems pretty sensible.


#7

I think a version of this has crept in again.

XmlElement* XmlDocument::readNextElement (const bool alsoParseSubElements)
{

        // parse the guts of the element…
        if (c == '>')
        {
            ++input;
            skipNextWhiteSpace(); // line 366 as of 17/09/10

            if (alsoParseSubElements)
                readChildElements (node);

            break;
        }


}

The call to skipNextWhiteSpace() means that leading whitespace is trimmed from text elements, while trailing whitespace is left alone. As far as I can tell, both are legal, and the trim is breaking the parsing of a number of files I’ve been handed…


#8

Looking with a bit more care I see that there are essential calls to skipNextWhiteSpace() all over XmlDocument::readNextElement(). I suppose one would need to identify the case of naked text for special treatment. Have to make a chocolate cake right now, but later…


#9

I never actually stopped it trimming the start of the text, just the end… I do think I should change that, but it risks breaking code where people are reading messy xml files and not bothering to trim the text themselves before using it.

Not 100% sure what the best thing to do is, because I do think you’re right that it ought to leave the space on there, and people should already have written robust code that would handle a bit of whitespace, as there was never anything that explicitly said it would be trimmed… I just know that there will inevitably be people who haven’t been careful in that way.


#10

Well, you’ll have much better things to do than this, but it may be worth it sometime. I tried finding ways to pre-process my files, but came up blank. So I have hacked an answer. This is the kind of thing I have to deal with:

Not in the other universe. This Cat is dead because that Cat killed me. Me! In cold blood! After everything I did for her.

The actual text is naked within the element, which may or may not have other named attributes. Clearly, you have to keep all whitespace.

What I’ve hacked is to note the input pointer early in XmlDocument::readChildElements, and then wind back in the case where the content has been identified as a simple text block. This shouldn’t break anything else and, since it only applies to raw text and is technically correct, I doubt that it will upset any existing code.

It solves my problem. But it is hacky. If it’s any use, I’ll send or post the changes.
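For readers who want to picture the wind-back idea, here is a small sketch in plain C++. It is not the actual patch and not JUCE code: `MiniReader`, `skipWhitespace` and `readText` are hypothetical names, and the reader handles only the "text or tag" decision, not real XML parsing.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical miniature of a parser's state: the input and a read position.
struct MiniReader
{
    std::string input;
    std::size_t pos = 0;

    void skipWhitespace()
    {
        while (pos < input.size()
                && std::isspace (static_cast<unsigned char> (input[pos])))
            ++pos;
    }

    // Reads the text that starts at pos. Returns an empty string if what
    // follows (after optional whitespace) is a tag or the end of input.
    std::string readText()
    {
        const std::size_t start = pos;   // note the position before skipping
        skipWhitespace();

        if (pos >= input.size() || input[pos] == '<')
            return {};                   // not raw text: whitespace stays skipped

        pos = start;                     // raw text: wind back to keep leading whitespace

        const std::size_t end = input.find ('<', pos);
        const std::size_t stop = (end == std::string::npos) ? input.size() : end;
        const std::string text = input.substr (pos, stop - pos);
        pos = stop;
        return text;
    }
};
```

Given the input "\n foo </body>", `readText()` returns "\n foo " and leaves `pos` on the '<'. Given "\n  <child/>", it returns an empty string and leaves `pos` on the child's '<', so whitespace between elements is still skipped as before.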


#11

Thanks, I sorted this out yesterday - will check in soon…


#12

You move eerily fast.