When web-scraping you’re at the mercy of whatever crazed document structure your target pages use; when using HXT and HandsomeSoup you’re stuck inside Arrows, so dealing with messy input is more involved than you’d expect.
In my example, I had paragraphs of text followed by an automatically-inserted paragraph containing a link:
<p>Blah blah blah</p>
Here I couldn’t select the paragraphs of normal text because they didn’t have a class or other attributes; instead it was the paragraph I didn’t want that had the meta-data. Using “.” as a selector was rejected by the selector parser.
After scrounging through CSS selector documentation, I found the pseudo-function “:not()”; unfortunately HandsomeSoup didn’t recognise it (giving me a pattern-match failure!), and using “not” without the colon returned nothing at all.
All I wanted to do was test the node for the presence of an “auto-link” class and invert the selection, but it seemed I might have to learn what the ArrowIf typeclass was and how to use it with “ifA”. After digging through the HXT documentation I found the “neg” function; while its type signature is impenetrable to my eyes, it did the job: it passes a node through unchanged only when the arrow it wraps produces no results.
doc >>> css "p" >>> neg (css ".auto-link") //> getText
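To try this outside a real scrape, here is a minimal sketch that runs the same pipeline over an inline HTML string. The sample markup, the example.com link and the names sampleHtml, notAutoLink and plainParagraphs are my own inventions for illustration; only css and parseHtml (HandsomeSoup) and runX, neg, ifA, none, this, //> and getText (HXT) come from the libraries. The ifA version is there to show what I understand neg to be shorthand for.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup (css, parseHtml)

-- Hypothetical page: two ordinary paragraphs plus an auto-inserted one.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body>"
  , "<p>Blah blah blah</p>"
  , "<p>More blah</p>"
  , "<p class=\"auto-link\"><a href=\"http://example.com/\">Read more</a></p>"
  , "</body></html>"
  ]

-- The filter spelled out with ifA: if the selector matches at or below
-- the node, produce nothing; otherwise pass the node through unchanged.
-- This should behave the same as `neg (css ".auto-link")`.
notAutoLink :: IOSArrow XmlTree XmlTree
notAutoLink = ifA (css ".auto-link") none this

-- Text of every <p> that has no "auto-link" node at or below it.
plainParagraphs :: String -> IO [String]
plainParagraphs html =
  runX $ parseHtml html >>> css "p" >>> neg (css ".auto-link") //> getText

main :: IO ()
main = plainParagraphs sampleHtml >>= mapM_ putStrLn
```

Run against the sample above, this should print only the two ordinary paragraphs. Note that the test looks at the paragraph and everything beneath it, so a normal paragraph that merely contains an element with the “auto-link” class would be dropped too.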