A tiny HTML parser in F#

The parser leverages F# Active Patterns, so fewer than 140 lines of code were needed.
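As a rough illustration of why Active Patterns keep this kind of code short, here is a minimal sketch of matching tag-like input with nested partial active patterns. The names `Prefix`, `Name`, and `describe` are hypothetical and are not taken from FsHtml.fs; the real parser may be organized quite differently, and this sketch does not handle attributes or whitespace.

```fsharp
// Hypothetical sketch: these patterns are NOT the ones defined in FsHtml.fs.

// Partial active pattern: succeeds when the character list starts with
// the given prefix, and yields the characters that follow it.
let (|Prefix|_|) (prefix: string) (input: char list) =
    let rec loop ps cs =
        match ps, cs with
        | [], rest -> Some rest
        | p :: ps', c :: cs' when p = c -> loop ps' cs'
        | _ -> None
    loop (List.ofSeq prefix) input

// Partial active pattern: reads a run of letters/digits as a name and
// yields the name together with the remaining characters.
let (|Name|_|) (input: char list) =
    let name = input |> List.takeWhile (fun c -> System.Char.IsLetterOrDigit c)
    if List.isEmpty name then None
    else Some (System.String(List.toArray name), List.skip name.Length input)

// Nesting the patterns lets one match expression read like a grammar rule.
let describe (text: string) =
    match List.ofSeq text with
    | Prefix "</" (Name (name, Prefix ">" _)) -> sprintf "close %s" name
    | Prefix "<" (Name (name, Prefix ">" _)) -> sprintf "open %s" name
    | _ -> "text"
```

Under these assumptions, `describe "<div>"` yields "open div", `describe "</div>"` yields "close div", and anything else falls through to "text".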

The sample code below round-trips (parses, serializes, and parses again) the HTML from several major sites:

#load "FsHtml.fs"

// Download the raw HTML of a page as a string.
let download (href: string) =
    use ws = new System.Net.WebClient()
    ws.DownloadString(href)

// Round-trip each page: parse, serialize, then parse the serialized output again.
[
    "http://msdn.microsoft.com"
    "http://wordpress.com"
    "http://www.microsoft.com"
    "http://www.google.com"
]
|> List.map (download >> FsHtml.parse >> FsHtml.serialize >> FsHtml.parse)

The parser has not been tested extensively. In particular, it did not work on www.yahoo.com because the HTML returned was invalid. Note that HTML is meant to be hand-crafted, so parsers should be very tolerant of errors; this one is not.

The parsed content is returned as a flat list of tags (see the code for details). Higher-level structure is not inferred from this list because doing so would require handling too many rules. Nevertheless, the format is sufficient for scraping HTML and for injecting additional content (such as scripts).
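To make the scraping and injection use cases concrete, here is a minimal sketch over a flat token list. The `Token` type and the functions `hrefs` and `injectBeforeBodyClose` are assumptions made for illustration; the actual representation in FsHtml.fs may differ, so check the code before adapting this.

```fsharp
// Hypothetical token type: the real type in FsHtml.fs may differ.
type Token =
    | Open of name: string * attrs: (string * string) list
    | Close of name: string
    | Text of string

// Scraping: collect the href attribute of every <a> tag in the flat list.
let hrefs (tokens: Token list) =
    tokens
    |> List.choose (function
        | Open ("a", attrs) ->
            attrs
            |> List.tryFind (fun (key, _) -> key = "href")
            |> Option.map snd
        | _ -> None)

// Injection: splice extra tokens in just before the closing body tag.
let injectBeforeBodyClose (extra: Token list) (tokens: Token list) =
    tokens
    |> List.collect (function
        | Close "body" as t -> extra @ [ t ]
        | t -> [ t ])

// Sample data standing in for parser output.
let page =
    [ Open ("body", [])
      Open ("a", [ "href", "http://example.com" ])
      Text "link"
      Close "a"
      Close "body" ]

let script = [ Open ("script", [ "src", "extra.js" ]); Close "script" ]
let links = hrefs page
let injected = injectBeforeBodyClose script page
```

Neither function needs the document tree: both are single passes over the list, which is why a flat format can be enough for these tasks.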


Last edited Jun 24, 2012 at 9:57 PM by fwaris, version 5