I’ve been playing around with scraping web pages in Haskell using HXT and HandsomeSoup; HXT defines a bunch of arrow combinators that traverse the HTML document, and HandsomeSoup builds a layer on top that accepts CSS selectors.
Arrows have fallen out of favour lately; there seems to be a trend towards using weaker structures (Applicative and Functor), of which I approve. For my basic use case, I’m not fond of the IO and State monads being baked into the main HXT arrow type; it seems unnecessary for what appears to be a pure transformation.
Being fairly ignorant of CSS selectors in the first place, I struggled with the libraries for a bit. The arrow interface (and Haskell’s declarative style) eschews the explicit traversal commands that jQuery supports, forcing me to actually skim a couple of tutorials (this one was pretty helpful). Nonetheless I got the hang of most of it fairly quickly.
The toughest part was applying functions to the children of nodes while retaining the parent node groupings; the documentation and tutorials weren’t too clear on this. For example, the following would return the paragraph text within content divs, but you’d lose which div each paragraph belonged to:
doc >>> css "div.content" >>> css "p" /> getText
It turns out that you need to wrap the child arrow in the listA combinator, which collects all the results of an arrow into a single list:
doc >>> css "div.content" >>> listA (css "p" /> getText)
This way, you end up with a list of lists of strings, where each inner list corresponds to one div.
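To make the difference concrete, here is a minimal sketch that runs both arrows against a tiny inline page. The sampleHtml string and its contents are made up for illustration, and I’m assuming doc is built with HXT’s readString in HTML mode; your own doc may well come from a file or a URL instead.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup (css)

-- Hypothetical sample page, invented purely for this example.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body>"
  , "<div class=\"content\"><p>a1</p><p>a2</p></div>"
  , "<div class=\"content\"><p>b1</p></div>"
  , "</body></html>"
  ]

main :: IO ()
main = do
  -- Assumption: parse the string leniently as HTML and silence warnings.
  let doc = readString [withParseHTML yes, withWarnings no] sampleHtml

  -- Without listA: one flat list of paragraph texts, groupings lost.
  flat <- runX (doc >>> css "div.content" >>> css "p" /> getText)

  -- With listA: one inner list of paragraph texts per matching div.
  grouped <- runX (doc >>> css "div.content" >>> listA (css "p" /> getText))

  print (flat :: [String])
  print (grouped :: [[String]])
```

The type annotations on the last two lines are only there to show the shapes: the first result is a [String], the second a [[String]].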