Parsing(?) HTML


#1

Hi guys,

I'll need a couple of days to get comfortable with the new look of the forum. It is far more functional now, though.

My current problem is called “HTML parsing”.
I am trying to process the code of some Juce pages (www.juce.com/doc/annotated for example), but I can’t find an HTML parser.

Is there a way to treat HTML pages as XML?

When I try to parse the www.juce.com/doc/annotated page, xmlDocument->getDocumentElement() fails with an “unmatched tags” error.

Thanks in advance

George


#2

HTML looks a lot like XML, but it is not a strict subset of it; it allows some simplifications:
The closing tag for <p> may be omitted. Not so in XML.
The closing tag for <li> may also be omitted, which is not valid in XML.
There is also XHTML, where all tags need to be closed…
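
To illustrate why a strict parser complains about such pages, here is a minimal sketch of a tag-balance check in plain C++ (standard library only, no JUCE). The function name findFirstTagMismatch is made up for this example, and it is deliberately simplified: comments, doctype declarations and '>' characters inside attribute values are not handled.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Returns an empty string if every opening tag has a matching closing tag,
// otherwise a short description of the first problem found.
// Hypothetical helper for illustration only; not part of JUCE.
std::string findFirstTagMismatch (const std::string& text)
{
    std::vector<std::string> openTags;   // stack of currently open tag names
    std::size_t pos = 0;

    while ((pos = text.find ('<', pos)) != std::string::npos)
    {
        const std::size_t end = text.find ('>', pos);
        if (end == std::string::npos)
            return "unterminated tag";

        const bool isClosing     = text[pos + 1] == '/';
        const bool isSelfClosing = end > pos + 1 && text[end - 1] == '/';

        // The tag name runs up to the first whitespace, '/' or '>'.
        const std::size_t nameStart = pos + (isClosing ? 2 : 1);
        std::size_t nameEnd = nameStart;
        while (nameEnd < end
               && ! std::isspace (static_cast<unsigned char> (text[nameEnd]))
               && text[nameEnd] != '/')
            ++nameEnd;
        const std::string name = text.substr (nameStart, nameEnd - nameStart);

        if (isClosing)
        {
            // XML rule: a closing tag must match the most recently opened tag.
            if (openTags.empty() || openTags.back() != name)
                return "unmatched closing tag </" + name + ">";
            openTags.pop_back();
        }
        else if (! isSelfClosing)
        {
            openTags.push_back (name);
        }

        pos = end + 1;
    }

    if (! openTags.empty())
        return "unclosed tag <" + openTags.back() + ">";
    return {};
}
```

With this check, "<ul><li>one</li><li>two</li></ul>" passes, while the perfectly legal HTML "<ul><li>one<li>two</ul>" is rejected, which is the same kind of failure an XML parser reports.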


#3

So there is no JUCE way to parse a JUCE documentation page.
That means I'll have to write some very dirty code to extract the info I want.
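
A rough sketch of what such "dirty" extraction could look like, using std::regex instead of a real parser. It assumes the class list contains Doxygen-style links of the form <a class="el" href="classSomething.html">Something</a>; I have not checked the actual markup of the page, so the pattern is a guess and the names ClassLink and extractClassLinks are invented for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// One class entry scraped from the page (hypothetical structure).
struct ClassLink
{
    std::string href;   // e.g. "classWebBrowserComponent.html"
    std::string name;   // e.g. "WebBrowserComponent"
};

// Collects every link whose target starts with "class" and ends in ".html".
// Assumption: the documentation index uses Doxygen-style class links.
std::vector<ClassLink> extractClassLinks (const std::string& html)
{
    static const std::regex linkPattern (
        R"re(<a\b[^>]*\bhref="(class[^"#]*\.html)"[^>]*>([^<]+)</a>)re");

    std::vector<ClassLink> links;

    for (std::sregex_iterator it (html.begin(), html.end(), linkPattern), last;
         it != last; ++it)
    {
        // Group 1 is the link target, group 2 the visible class name.
        links.push_back ({ (*it)[1].str(), (*it)[2].str() });
    }

    return links;
}
```

This does not care whether the tags are balanced, so it survives the unclosed <p> and <li> tags that trip up an XML parser, but it will break as soon as the page markup changes.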


#4

…well, I don’t speak for JUCE, and I only have limited knowledge of all its features…
Did you have a look at WebBrowserComponent? https://www.juce.com/doc/classWebBrowserComponent
It can obviously parse HTML. And even if it doesn’t help, have a look at the source to see how they parse it…
Good luck…


#5

All I want to do is build a database (using SQLite) of JUCE classes and member functions.
Obviously, I was hoping that the XmlDocument/XmlElement parser would make my job much easier.
That’s all.
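
For reference, a minimal sketch of the kind of database I have in mind. The table and column names are only a first guess, and the helpers sqlQuote and makeInsertClassSql are made up for this example; they just build SQL text, which would then be handed to SQLite. With the real SQLite API, bound parameters would be the safer choice over building strings.

```cpp
#include <string>

// Hypothetical schema: one row per class, one row per member function.
const char* const schemaSql =
    "CREATE TABLE IF NOT EXISTS classes ("
    " id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);"
    "CREATE TABLE IF NOT EXISTS members ("
    " class_id INTEGER NOT NULL REFERENCES classes(id),"
    " name TEXT NOT NULL, signature TEXT);";

// Quotes a value as an SQL string literal by doubling embedded single quotes.
std::string sqlQuote (const std::string& value)
{
    std::string quoted = "'";

    for (char c : value)
    {
        if (c == '\'')
            quoted += '\'';   // escape ' as ''

        quoted += c;
    }

    quoted += '\'';
    return quoted;
}

// Builds the statement that adds one class, skipping names already present.
std::string makeInsertClassSql (const std::string& className)
{
    return "INSERT OR IGNORE INTO classes (name) VALUES ("
           + sqlQuote (className) + ");";
}
```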