XML parsing problem


#1

I’m trying to parse an HTML page exported from Excel. For our purposes let’s say it looks like:

My code:

XmlDocument* xmlDocument = new XmlDocument(file);
XmlElement* htmlElement = xmlDocument->getDocumentElement();
XmlElement* bodyElement = htmlElement->getChildByName(T("body"));
XmlElement* divElement = bodyElement->getChildByName(T("div"));
XmlElement* tableElement = divElement->getChildByName(T("table"));

The problem is that divElement exists, but does not contain correct data about its children, so I can’t find tableElement.

In case it helps, I experimented with it a little and found that the JUCE XML parser works fine if in

we have align="center" but fails with Excel’s align=center version.

Does anybody know of a workaround? Thanks.


#2

If you have align=center, not align="center", and the JUCE parser fails, that means it is working correctly. Every attribute value within a tag must be quoted with single or double quotes, otherwise any decent parser/validator will fail. You are using an XML parser and this is an HTML file; there is a difference. A web browser will open it, but with errors. You can check this here: http://validator.w3.org/


#3

Thanks for the lesson, but I think that instead of forcing me to write a new parser by hand, it would be better having a flexible parser accepting realities of life.

Sure, it’s HTML, but it would work perfectly with the XML parser if the parser were a bit more flexible. I won’t have any success convincing Microsoft about the issue, so I tried here.


#4

How about a pre-process step? Match any string inside of a <> block that is of the form <.*(\w+)=(\S+).*> and turn it into \1="\2". I can’t for the life of me remember what the exact regexp syntax would be, but it would probably be the easiest solution.
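For what it's worth, here is a rough sketch of that pre-process step using std::regex from standard C++ rather than anything in JUCE. The function name quoteBareAttributes is made up for illustration. It assumes tags contain no '>' inside quoted values, and it makes no attempt to skip comments or script blocks, so treat it as a starting point rather than a complete HTML fixer.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of JUCE): wraps unquoted attribute values
// in double quotes. Already-quoted values and text outside tags are copied as-is.
std::string quoteBareAttributes (const std::string& html)
{
    // A tag: '<', then anything except '<' or '>', then '>'.
    static const std::regex tagPattern ("<[^<>]+>");
    // name=value, where value is "double-quoted", 'single-quoted' or a bare word.
    static const std::regex attrPattern (R"re(\w+\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))re");

    std::string result;
    auto copiedUpTo = html.cbegin();

    for (std::sregex_iterator tag (html.cbegin(), html.cend(), tagPattern), end; tag != end; ++tag)
    {
        const auto tagBegin = (*tag)[0].first;
        const auto tagEnd   = (*tag)[0].second;
        result.append (copiedUpTo, tagBegin);          // text between tags, unchanged

        auto attrCopiedUpTo = tagBegin;
        for (std::sregex_iterator attr (tagBegin, tagEnd, attrPattern); attr != end; ++attr)
        {
            const std::ssub_match value = (*attr)[1];
            result.append (attrCopiedUpTo, value.first);   // everything up to the value

            const char first = *value.first;
            if (first == '"' || first == '\'')
            {
                result.append (value.first, value.second); // already quoted, keep it
            }
            else
            {
                result.push_back ('"');                    // bare word, add quotes
                result.append (value.first, value.second);
                result.push_back ('"');
            }
            attrCopiedUpTo = value.second;
        }

        result.append (attrCopiedUpTo, tagEnd);        // rest of the tag
        copiedUpTo = tagEnd;
    }

    result.append (copiedUpTo, html.cend());           // trailing text after the last tag
    return result;
}
```

The idea would be to load the file into a string, run it through this function, and hand the result to the XML parser. Matching quoted values in the same pattern as bare ones is deliberate: it stops an '=' inside an existing quoted value from being treated as a new attribute.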


#5

Yes, pre-processing is the only idea I have for the moment.

Still, I think that a single-word value (not containing spaces, so practically not needing quotes) should be accepted by the parser. Maybe with a “flexible” flag, defaulting to false.


#6

I’m not saying that it’s a good idea. I’ve been working with some web apps myself and I know it would be nice to have those parsers work more flexibly. I’m just saying this is an XML parser, and XML has its own quirks while HTML has different ones. I guess it’s up to Jules to decide. I’m sorry for the “lesson”; it’s hard to tell what you meant from just a single post.


#7

I think JUCE only “eats” single-quoted attributes? At least I had problems with double-quoted attributes. I think the standard allows single- or double-quoted attributes.


#8

Yeah, it’s an XML parser, so I don’t want to bodge it to handle HTML; that’s an entirely different problem.

Nope, it handles either type of quote, as long as the opening and closing quotes match.
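To illustrate the matching-quote rule, here is a minimal standalone sketch. It is not JUCE's actual code, and readQuotedValue is a hypothetical name. It accepts a value opened with either quote character, requires the same character to close it, and rejects bare words such as center.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not JUCE's implementation: reads a quoted attribute
// value starting at text[pos]. Returns false if the value is unquoted or
// if no matching closing quote is found.
bool readQuotedValue (const std::string& text, std::size_t pos, std::string& value)
{
    if (pos >= text.size())
        return false;

    const char quote = text[pos];
    if (quote != '"' && quote != '\'')
        return false;                      // bare values like align=center are rejected

    // The closing quote must be the same character as the opening one.
    const std::size_t close = text.find (quote, pos + 1);
    if (close == std::string::npos)
        return false;

    value = text.substr (pos + 1, close - pos - 1);
    return true;
}
```

Because only the opening character is searched for, the other quote type can appear freely inside the value, for example 'say "hi"'.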