Downloading Web Pages with Haskell

So it turns out that using Network.HTTP on Windows is really painful: I can download some websites fine, but others come back as a garbled mess thanks to encoding issues that are beyond my patience to understand, and the GHC.IO.Encoding functions setLocaleEncoding, setForeignEncoding, and setFileSystemEncoding weren’t able to solve the problem.
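For reference, the failing attempt looked roughly like this (the URL and filename are placeholders):

    import GHC.IO.Encoding (setFileSystemEncoding, setForeignEncoding, setLocaleEncoding, utf8)
    import Network.HTTP (getRequest, getResponseBody, simpleHTTP)

    -- Force all of GHC's default encodings to UTF-8, then fetch the
    -- page with Network.HTTP. This combination still produced garbled
    -- output for some sites.
    main :: IO ()
    main = do
        setLocaleEncoding utf8
        setForeignEncoding utf8
        setFileSystemEncoding utf8
        body <- simpleHTTP (getRequest "http://example.com") >>= getResponseBody
        writeFile "page.html" body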

I next attempted to install the download-curl package, but it seems that the Haskell curl package (which download-curl depends on) won’t install on Windows. Great. Combined with the console character problems (every time Unicode output hit PowerShell I got an “hPutChar: invalid argument (invalid character)” error followed by flaky behaviour), it’s almost enough to make me move to Linux. I don’t even like Linux, but I really don’t want to be sidetracked by finicky issues.
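One workaround that’s often suggested for that PowerShell symptom (I can’t promise it fixes everything) is to force GHC’s console handles to UTF-8 before printing anything:

    import System.IO (hSetEncoding, stderr, stdout, utf8)

    -- Tell the runtime to emit UTF-8 on stdout and stderr instead of
    -- the console's code page. You may also need to run `chcp 65001`
    -- in the console itself so it can actually display the output.
    main :: IO ()
    main = do
        hSetEncoding stdout utf8
        hSetEncoding stderr utf8
        putStrLn "héllo, wörld"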

Finally I found my way to http-conduit. The API is nicer (the function is called “simpleHttp” instead of “simpleHTTP”, and you pass it your URL directly instead of wrapping the URL in getRequest first), and the documentation points out a Windows gotcha: you need to wrap your network IO action in “withSocketsDo”, otherwise you’ll hit an “InternalIOException getAddrInfo: does not exist (error 10093)” error. This time I managed to pull the page and write it to a file, preserving the UTF-8 content.
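A minimal sketch of the working version (the URL and output file are placeholders; note that withSocketsDo is exported from Network.Socket in recent versions of the network package, and from Network in older ones):

    import qualified Data.ByteString.Lazy as L
    import Network.HTTP.Conduit (simpleHttp)
    import Network.Socket (withSocketsDo)

    -- Fetch the page and write the raw bytes straight to a file,
    -- sidestepping the console entirely. withSocketsDo initialises
    -- Winsock on Windows and is a no-op elsewhere.
    main :: IO ()
    main = withSocketsDo $ do
        body <- simpleHttp "http://example.com"
        L.writeFile "page.html" body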

As I’m not especially concerned with keeping the Unicode content, I ran the text through Data.Text.Encoding.unicodeRemoveNoneAscii (wrapping it in decodeUtf8 and unpack on the way in, and pack and encodeUtf8 on the way out).
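For reference, here’s a hand-rolled equivalent of that pipeline using Data.Char.isAscii; stripNonAscii is my own stand-in name, not a library function:

    import qualified Data.ByteString as B
    import Data.Char (isAscii)
    import Data.Text (pack, unpack)
    import Data.Text.Encoding (decodeUtf8, encodeUtf8)

    -- Decode the UTF-8 bytes, drop every non-ASCII character, and
    -- re-encode. stripNonAscii is a hypothetical helper, not part of
    -- the text package.
    stripNonAscii :: B.ByteString -> B.ByteString
    stripNonAscii = encodeUtf8 . pack . filter isAscii . unpack . decodeUtf8

(simpleHttp hands back a lazy ByteString, so you’d want Data.ByteString.Lazy.toStrict before feeding it through this.)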
