Inverting CSS Selection in HXT

When web-scraping you’re at the mercy of whatever crazed document structure your target pages use; when using HXT and HandsomeSoup you’re stuck inside Arrows, so dealing with messy input is more involved than you’d expect.

In my example, I had paragraphs of text followed by an automatically-inserted paragraph containing a link:

<p>Blah blah blah</p>

<p class="auto-link">Etc</p>

Here I couldn’t select the paragraphs of normal text directly, because they had no class or other attributes; it was the paragraph I didn’t want that carried the metadata. Using “.” as a selector was rejected by the selector parser.

After scrounging through CSS selector documentation, I found the pseudo-function “:not()”; unfortunately HandsomeSoup didn’t recognise it (giving me a pattern-match failure!), and using “not” without the colon returned nothing at all.

All I wanted to do was to test the node for the presence of an “auto-link” class and invert the selection, but it seemed I might have to learn what the ArrowIf typeclass was and how to use it with “ifA”. After digging through the HXT documentation I found the “neg” function; while the type signature is impenetrable to my eyes, it did the job.
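To make the behaviour of “neg” a little less mysterious, here is a minimal sketch using HXT’s pure list arrow LA on plain integers rather than XML. The signature is neg :: ArrowIf a => a b c -> a b b, which I read as: run the wrapped arrow, throw its results away, and pass the original input through only if the wrapped arrow produced nothing. The name keepOdd is made up for illustration.

```haskell
import Text.XML.HXT.Core

-- keepOdd is a hypothetical name, not part of HXT.
-- isA even passes even numbers through and fails on odd ones;
-- neg inverts that, so only odd numbers survive.
keepOdd :: LA Int Int
keepOdd = neg (isA even)

main :: IO ()
main = do
  print (runLA keepOdd 3)  -- [3]
  print (runLA keepOdd 4)  -- []
```

Swap the integers for XML nodes and isA even for a CSS selector, and you have the inverted selection.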

Thus:

doc >>> css "p" >>> neg (css ".auto-link") //> getText
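For context, here is a rough sketch of how that line might sit in a complete program. It assumes the hxt and HandsomeSoup packages are installed; sampleHtml and paragraphTexts are names I have invented for the example, and the input is just the two paragraphs from above wrapped in a minimal page.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample input mirroring the two paragraphs in the post.
sampleHtml :: String
sampleHtml =
  "<html><body><p>Blah blah blah</p><p class=\"auto-link\">Etc</p></body></html>"

-- Hypothetical helper: collect the text of every <p> that does not
-- match the .auto-link selector.
paragraphTexts :: String -> IO [String]
paragraphTexts html =
  let doc = parseHtml html
  in runX $ doc >>> css "p" >>> neg (css ".auto-link") //> getText

main :: IO ()
main = paragraphTexts sampleHtml >>= mapM_ putStrLn
```

The intent is that only the text of the unclassed paragraph comes out, since neg drops any p for which the inner selector finds a match.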
