XmlElement returning trimmed text elements

jpo · January 15, 2009, 10:09am

Hi Jules,

The following example:

XmlDocument doc (T("<body>\n foo </body>"));
XmlElement* e = doc.getDocumentElement();
std::cerr << "'" << e->getChildElement (0)->getText().toUTF8()
          << "' '" << e->getAllSubText().toUTF8() << "'\n";

outputs:
'foo' 'foo'

So JUCE has removed all the whitespace and linefeeds at the beginning and end of the text node. Is it possible to fix that?

jules · January 15, 2009, 10:23am

But that’s how XML works!

If you need the whitespace preserved, you should put it in an attribute, not in the body text.

jpo · January 15, 2009, 11:22am

Then I’m confused… How can I render some pseudo-XHTML content if whitespace is not preserved? When I parse XML from Python, the whitespace is preserved:

from xml.dom import minidom
doc = minidom.parseString('<body> foo </body>')
print doc.childNodes[0].childNodes[0]

shows ‘<DOM Text node " foo ">’

I am by no means an XML expert, but as far as I know, all XML parsers preserve whitespace inside text nodes. See for example http://www.usingxml.com/Basics/XmlSpace

jules · January 15, 2009, 11:45am

Hmm. I was under the impression that leading and trailing whitespace was fair game in text nodes… I guess that’s not the case!

Very easy to fix though. It’s just a two-line change in juce_XmlDocument.cpp, at line 629:

[code]
//textElementContent = textElementContent.trim();

if (textElementContent.trim().isNotEmpty())
    e->setText (textElementContent);
[/code]

Hopefully that’ll sort it for you.

jpo · January 15, 2009, 12:03pm

OK great, thanks! To be completely standards-conformant, I think the blank text nodes should also be returned (those that contain a single linefeed, etc.). However, I don’t think I need that, and it might break existing XML-reading code from other users who do not expect such nodes, which are generally useless. Or maybe it could be an option of XmlDocument:

if (trim_text_nodes) // defaults to true
    textElementContent = textElementContent.trim();

if (textElementContent.isNotEmpty())
    e->setText (textElementContent);
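For readers who want to try the idea outside JUCE, here is a minimal standalone sketch of the option described above, using std::string instead of juce::String. The names trimWhitespace, prepareTextNode and runTrimOptionExamples are made up for this illustration and are not part of JUCE; the logic simply mirrors the snippet in the post, so with trimming switched off, whitespace-only text is kept as well.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a JUCE function): strips leading and trailing
// whitespace characters from s.
std::string trimWhitespace (const std::string& s)
{
    const char* const ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of (ws);

    if (first == std::string::npos)
        return std::string();   // nothing but whitespace

    const std::string::size_type last = s.find_last_not_of (ws);
    return s.substr (first, last - first + 1);
}

// Hypothetical stand-in for the text-node step of the parser. Returns the
// text that would be passed to setText(); an empty result means that no
// text node is created.
std::string prepareTextNode (std::string content, bool trimTextNodes = true)
{
    if (trimTextNodes)
        content = trimWhitespace (content);

    return content;
}

// Exercises both settings on the text from the first post.
void runTrimOptionExamples()
{
    assert (prepareTextNode ("\n foo ") == "foo");             // default: trimmed
    assert (prepareTextNode ("\n foo ", false) == "\n foo ");  // whitespace kept
    assert (prepareTextNode (" \n ").empty());                 // blank node dropped
    assert (prepareTextNode (" \n ", false) == " \n ");        // blank node kept
}
```

Defaulting the flag to true keeps the old behaviour for existing callers, which is the compatibility concern raised in the post.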

jules · January 15, 2009, 12:36pm

Yes - that’s the approach that MS used, and seems pretty sensible.

PBNV · September 18, 2010, 7:08pm

I think a version of this has crept in again.

XmlElement* XmlDocument::readNextElement (const bool alsoParseSubElements)
{
    …
        // parse the guts of the element…
        if (c == '>')
        {
            ++input;
            skipNextWhiteSpace(); // line 366 as of 17/09/10

            if (alsoParseSubElements)
                readChildElements (node);

            break;
        }
    …
}

The call to skipNextWhiteSpace means that leading whitespace is trimmed from text elements. Trailing whitespace is left alone. As far as I can tell, both are legal, and the trim is breaking the parsing of a number of files I’ve been handed…

PBNV · September 18, 2010, 11:33pm

Looking with a bit more care I see that there are essential calls to skipNextWhiteSpace() all over XmlDocument::readNextElement(). I suppose one would need to identify the case of naked text for special treatment. Have to make a chocolate cake right now, but later…

jules · September 19, 2010, 1:32pm

I never actually stopped it trimming the start of the text, just the end… I do think I should change that, but it risks breaking code where people are reading messy xml files and not bothering to trim the text themselves before using it.

Not 100% sure what the best thing to do is, because I do think you’re right that it ought to leave the space on there, and people should already have written robust code that would handle a bit of whitespace, as there was never anything that explicitly said it would be trimmed… I just know that there will inevitably be people who haven’t been careful in that way.

PBNV · September 19, 2010, 7:11pm

Well, you’ll have much better things to do than this, but it may be worth it sometime. I tried finding ways to pre-process my files, but came up blank. So I have hacked an answer. This is the kind of thing I have to deal with:

Not in the other universe. This Cat is dead because that Cat killed me. Me! In cold blood! After everything I did for her.

The actual text is naked within the element, which may or may not have other named attributes. Clearly, you have to keep all whitespace.

What I’ve hacked is to note the input pointer early in XmlDocument::readChildElements, and then wind back in the case where the content has been identified as a simple text block. This shouldn’t break anything else and, since it only applies to raw text and is technically correct, I doubt that it will upset any existing code.
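The actual changes were not posted, so the following is only a rough standalone sketch of the "note the pointer, then wind back" idea, not the real JUCE code. readLeadingText and runRewindExamples are hypothetical names, the input is assumed to be a null-terminated char buffer positioned just after the '>' of an opening tag, and entities, CDATA and comments are ignored.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical, heavily simplified stand-in for the start of
// XmlDocument::readChildElements. Returns the text found before the next
// '<', or an empty string if markup (or the end of input) comes first.
std::string readLeadingText (const char*& input)
{
    const char* const start = input;   // remember where the content began

    // Skip whitespace, as the parser does before looking for child elements.
    while (*input != 0 && std::isspace (static_cast<unsigned char> (*input)))
        ++input;

    if (*input == '<' || *input == 0)
        return std::string();          // no plain text here: whitespace stays skipped

    // Plain text: wind back so the leading whitespace becomes part of it.
    input = start;

    while (*input != 0 && *input != '<')
        ++input;

    return std::string (start, input);
}

// Shows the two cases: naked text keeps its whitespace, while whitespace
// in front of a child element is still skipped.
void runRewindExamples()
{
    const char* text = "\n foo </body>";
    assert (readLeadingText (text) == "\n foo ");
    assert (*text == '<');

    const char* child = "\n  <child/></body>";
    assert (readLeadingText (child).empty());
    assert (*child == '<');
}
```

Only the plain-text path changes here, which matches the claim in the post that whitespace between elements is still skipped as before.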

It solves my problem. But it is hacky. If it’s any use, I’ll send or post the changes.

jules · September 20, 2010, 8:05am

Thanks, I sorted this out yesterday - will check in soon…

PBNV · September 20, 2010, 3:43pm

You move eerily fast.
