Recently I had to write an HTML parser for a project I've been working on for some time now. First I tried translating an open source C++ parser but it really wasn't what I wanted and it was also under the GPL. After contacting the author and realizing (or re-remembering) that I could not use a GPL derivative in a commercial library or application, I scrapped that and went back to the source: the official HTML DTD.
Re-remembering how to read a DTD after not having done so for so long was a chore, but the folks at Autistic Cuckoo helped: I found a very helpful tutorial. I spent the next day or two writing the code in the file linked below. I took some inspiration from a few files I found while browsing the Firefox code under the Mozilla license. The rest of it came from studying the DTD and trying to figure out a way to encapsulate that in a usable object model.
Here's an example of how to use it:
    HtmlDocument doc = new HtmlDocument(url, html);
    StringBuilder sb = new StringBuilder();
    Collection<HtmlTag> pcdata = doc.GetList(DtdElement.A);

    foreach (HtmlTag tag in pcdata)
    {
        if (!tag.EndTag)
        {
            Dictionary<string, string> attributes = doc.GetAttributes(tag);
            sb.AppendLine("");
            sb.AppendLine("A: " + doc.ReadSlice(tag.Slice));

            foreach (KeyValuePair<string, string> pair in attributes)
            {
                sb.AppendLine(" " + pair.Key + "=" + pair.Value);
            }
        }
    }
I'm releasing it under the BSD license, which I like much more than the GPL as I'm not really a "true" free software zealot. The only thing I ask is that if you fix a bug or make an improvement, please share it with me and I'll put up a new version here.