XML parsing problem

Gwynhale · March 21, 2008, 7:17am

I’m trying to parse an HTML page exported from Excel. For our purposes, let’s say it looks like:

My code:

XmlDocument* xmlDocument = new XmlDocument(file);
XmlElement* htmlElement = xmlDocument->getDocumentElement();
XmlElement* bodyElement = htmlElement->getChildByName(T("body"));
XmlElement* divElement = bodyElement->getChildByName(T("div"));
XmlElement* tableElement = divElement->getChildByName(T("table"));

The problem is that divElement exists, but does not contain correct data about its children, so I can’t find tableElement.

In case it helps, I experimented with it a little and found that the JUCE XML parser works fine if in

we have align="center" but fails with Excel’s align=center version.

Does anybody know of a workaround? Thanks.

atom · March 21, 2008, 7:51am

If you have align=center and NOT align="center", and the JUCE parser fails, that means it is working correctly. Every attribute value within a tag must be quoted with single or double quotes, otherwise any decent parser/validator will fail. You are using an XML parser and this is an HTML file, and there is a difference. A web browser will open it, but with errors; you can check this here: http://validator.w3.org/

Gwynhale · March 21, 2008, 7:09pm

Thanks for the lesson, but I think that instead of forcing me to write a new parser by hand, it would be better to have a flexible parser that accepts the realities of life.

Sure, it’s HTML, but it would work perfectly with the XML parser if the parser were a bit more flexible. I won’t have any success convincing Microsoft about the issue, so I tried here.

Sastraxi · March 21, 2008, 8:58pm

How about a pre-processing step? Match any string inside of a <> block that is of the form <.*(\w+)=(\S+).*> and turn the captured pair into \1="\2". I can’t for the life of me remember what the exact regexp syntax would be, but it would probably be the easiest solution.
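For anyone wanting to try the pre-processing idea, here is a minimal sketch in standard C++ using std::regex. It is not part of JUCE; the function names quoteAttributesInTag and quoteUnquotedAttributes are made up for illustration. It assumes that quoted attribute values never contain < or >, and it only deals with unquoted attribute values. Other HTML habits that are not valid XML (unclosed tags, entities such as &nbsp;) are left alone and would still need handling.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps unquoted attribute values inside ONE tag in
// double quotes. Quoted values are matched as well, purely so that any
// name=value text inside them is skipped rather than rewritten.
// Limitation: an unquoted value directly followed by "/>" would swallow the '/'.
std::string quoteAttributesInTag (const std::string& tag)
{
    static const std::regex attribute (R"re(([\w:.-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'<>]+))re");

    std::string result;
    auto last = tag.cbegin();

    for (std::sregex_iterator it (tag.cbegin(), tag.cend(), attribute), end; it != end; ++it)
    {
        const std::smatch& m = *it;
        result.append (last, m[0].first);   // copy the text before this attribute

        const std::string value = m[2].str();
        const bool quoted = value.front() == '"' || value.front() == '\'';

        result += m[1].str() + "=" + (quoted ? value : "\"" + value + "\"");
        last = m[0].second;
    }

    result.append (last, tag.cend());       // copy whatever follows the last attribute
    return result;
}

// Hypothetical entry point: finds every start tag and fixes its attributes.
// Text between tags, closing tags and comments are copied through unchanged.
std::string quoteUnquotedAttributes (const std::string& html)
{
    static const std::regex startTag (R"re(<[A-Za-z][^<>]*>)re");

    std::string result;
    auto last = html.cbegin();

    for (std::sregex_iterator it (html.cbegin(), html.cend(), startTag), end; it != end; ++it)
    {
        const std::smatch& m = *it;
        result.append (last, m[0].first);
        result += quoteAttributesInTag (m[0].str());
        last = m[0].second;
    }

    result.append (last, html.cend());
    return result;
}
```

With this sketch, quoteUnquotedAttributes ("<div align=center>") gives <div align="center">, and the resulting string could then be handed to the XML parser instead of the original file contents.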

Gwynhale · March 21, 2008, 9:38pm

Yes, pre-processing is the only idea I have for the moment.

Still, I think that a single-word value (not containing spaces, so practically not needing quotes) should be accepted by the parser. Maybe with a “flexible” flag, defaulting to false.

atom · March 22, 2008, 12:32am

I’m not saying that it’s a good idea. I’ve been working with some web apps myself and I know it would be nice to have those parsers work more flexibly. I’m just saying this is an XML parser and XML has its own quirks, HTML has different ones. I guess it’s up to Jules to decide. I’m sorry for the “lesson”; it’s hard to say what you meant just from a single post.

zamrate · March 22, 2008, 8:01am

I think JUCE only “eats” single-quoted attributes? At least I had problems with double-quoted attributes. I think the standard allows single- or double-quoted attributes.

jules · March 22, 2008, 1:44pm

Yeah, it’s an XML parser, so I don’t want to bodge it to handle HTML, that’s an entirely different problem.

Nope, it handles either type of quote, as long as the open and close quotes match.
