Parsing with HandsomeSoup

I’ve been playing around with scraping web pages in Haskell using HXT and HandsomeSoup; HXT defines a bunch of arrow combinators that traverse the HTML document, and HandsomeSoup builds a layer on top that accepts CSS selectors.

Arrows have fallen out of favour lately; there seems to be a trend towards using weaker structures (Applicative and Functor), of which I approve. For my basic use case, I’m not fond of the IO and State monads being incorporated into the main HXT arrow type – it seems unnecessary for what appears to be a pure transformation.

Being fairly ignorant of CSS selectors in the first place, I struggled with the libraries for a bit – the arrow interface (and Haskell’s declarative style) eschews the explicit traversal commands that jQuery supports, forcing me to actually skim a couple of tutorials (this one was pretty helpful). Nonetheless, I got the hang of most of it fairly quickly.

The toughest thing I struggled with was applying functions to children of nodes while retaining the parent node groupings – the documentation and tutorials weren’t too clear on this. For example, the following would return the paragraph text within content divs, but you’d lose which div each paragraph belonged to:

doc >>> css "div.content" >>> css "p" /> getText

It turns out that you need to wrap the child arrow in the listA combinator, which collects all of an arrow’s results into a single list:

doc >>> css "div.content" >>> listA (css "p" /> getText)

This way, you end up with a list of lists of strings, where each inner list corresponds to one div.
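
To make the difference concrete, here is a minimal sketch of both versions side by side. It assumes the hxt and HandsomeSoup packages are installed; the sample HTML and the names sampleHtml, flat and grouped are made up for illustration, and doc is built with HandsomeSoup’s parseHtml rather than fetched from a real page.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample page: two content divs holding three paragraphs in total.
sampleHtml :: String
sampleHtml =
  "<html><body>\
  \<div class=\"content\"><p>one</p><p>two</p></div>\
  \<div class=\"content\"><p>three</p></div>\
  \</body></html>"

main :: IO ()
main = do
  -- parseHtml turns a String into an arrow that produces the document tree.
  let doc = parseHtml sampleHtml

  -- Without listA: each paragraph text is a separate result,
  -- so runX returns one flat list of strings.
  flat <- runX $ doc >>> css "div.content" >>> css "p" /> getText
  print flat

  -- With listA: the paragraph texts of each div are collected into one result,
  -- so runX returns a list of lists of strings.
  grouped <- runX $ doc >>> css "div.content" >>> listA (css "p" /> getText)
  print grouped
```

The only change between the two pipelines is the listA wrapper around the child arrow, which is what turns the result from [String] into [[String]].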
