[LJ Genie] Using Scrapy programmatically: maradydd

maradydd

Aug 10, 2009 12:01

Has anyone reading this used Scrapy, the Python HTML-scraping framework, programmatically as part of a larger system? I'm interested in using it to replace BeautifulSoup in a project I'm working on which involves extracting specific, XPath-targetable tags from the contents of a whole bunch of different URLs. BeautifulSoup can do it, but the CPU and memory load is really heavy and I'd like to find a lighter-weight solution. (Scrapy supports XPath out of the box, which was a great design decision on their part.)

The specific problem I'm having with Scrapy is that despite the fact that it supports writing custom scrapers, it's designed as a command-line-driven tool to the exclusion of anything else. I want to instantiate a scraper from within a routine, run it, and hand the contents of the tags it collects off to another routine all within the same process, without having to invoke a separate process or touch the disk -- this system has to consume a lot of network data and I can't afford for it to become I/O bound. (I can queue the inbound network data -- in fact, since my current architecture is completely synchronous, I already am -- but not having to do so is preferable. Scrapy is asynchronous and that's a plus.)

Since it's written in Python, I can trace the control flow and figure out what specific pieces I need to import and/or customise to get it to do what I want, but it's a pretty densely layered system and it would be nice to have some examples to use for guidance. The documentation is unfortunately useless in this regard -- all the examples are for command-line invocation -- and neither Google Code Search nor koders.com turns up anything useful.

N.B.: I'm reluctant to just use libxml2, because most of the pages I'm scraping are not XHTML-compliant. In fact, a surprisingly large number of them have HTML so malformed that BeautifulSoup chokes on them and I have to use an exponential-backoff approach to parse only a piece of the document at a time. (And in practice, that means I sometimes lose data anyway; this is annoying, but frustratingly necessary. Dear web developers who cannot be bothered to make their content machine-readable without lots of massaging: die in a fire.) It is my understanding that Scrapy is quite tolerant of bad markup, but if I'm wrong about that, please correct me.
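
For anyone wondering what the exponential-backoff approach looks like, here is a rough sketch rather than the actual code from the project. `parse_in_pieces` and its parameters are hypothetical names; `parse` stands in for whatever parser is in use and is assumed to raise an exception on input it cannot handle. The sketch also cuts the document at arbitrary character offsets, which is a simplification; real code would want to cut at tag boundaries.

```python
def parse_in_pieces(document, parse, min_size=256):
    """Parse `document` in chunks, halving the chunk size whenever `parse` fails.

    Returns (parsed, skipped): the results for chunks that parsed, and the
    (start, end) offsets of minimum-size chunks that had to be dropped.
    """
    parsed = []
    skipped = []
    pos = 0
    size = len(document)  # optimistic first attempt: the whole document
    while pos < len(document):
        # Never read past the end, and never use an empty chunk.
        size = max(1, min(size, len(document) - pos))
        chunk = document[pos:pos + size]
        try:
            parsed.append(parse(chunk))
        except Exception:  # assumption: any exception means "could not parse"
            if size <= min_size:
                # Cannot shrink further: give up on this span (data is lost).
                skipped.append((pos, pos + size))
                pos += size
            else:
                # Back off: retry the same position with half the chunk size.
                size = max(min_size, size // 2)
            continue
        # Success: move past the chunk and grow the window again.
        pos += size
        size *= 2
    return parsed, skipped
```

The `skipped` list is where the data loss mentioned above shows up: any span that still fails at the minimum chunk size is recorded and dropped rather than retried forever.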

lj genie, software engineering, python