Comments | maradydd: [LJ Genie] Using Scrapy programmatically

maradydd

[LJ Genie] Using Scrapy programmatically

Aug 10, 2009 12:01

Has anyone reading this used Scrapy, the Python HTML-scraping framework, programmatically as part of a larger system? I'm interested in using it to replace BeautifulSoup in a project I'm working on which involves extracting specific, XPath-targetable tags from the contents of a whole bunch of different URLs. BeautifulSoup can do it, but the CPU and ( Read more... )

lj genie, software engineering, python

Comments 2

anonymous August 11 2009, 00:50:06 UTC

Hi Meredith, I'm Pablo Hoffman, a Scrapy developer.

There is currently a proposal for adding programmatic control of Scrapy: http://dev.scrapy.org/wiki/SEP-004

However, it's not gonna happen soon, as we're now focusing on cleaning up the code and documenting the last bits and pieces for the first stable release, which we hope to do in 2-4 weeks.

Btw, Scrapy doesn't perform any extra cleansing on markup; it just uses libxml2 directly, which is probably not as bad as you suspect. Recent versions of libxml2 (the 2.6 series at least) are reasonably good at dealing with bad markup (as long as you're using the HTML parser, not the XML one), and even if it isn't as good as BeautifulSoup, the performance gain probably outweighs the loss of parsing robustness.

I'd recommend trying libxml2 2.6.32 with a few ugly pages and judging for yourself.
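To make the HTML-parser-versus-XML-parser point concrete, here is a minimal sketch. It deliberately uses only Python's standard library (`xml.etree.ElementTree` and `html.parser`) rather than libxml2 itself, so it illustrates the strict/lenient distinction and says nothing about libxml2's actual recovery behavior or speed. The sample markup and the helper names `strict_xml_parse_ok`, `ParagraphCollector`, and `lenient_html_paragraphs` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Deliberately sloppy markup: unquoted attribute, unclosed <p> and <br>.
BAD_MARKUP = "<html><body><p class=intro>Hello<br>world<p>Second</body></html>"


def strict_xml_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class ParagraphCollector(HTMLParser):
    """Lenient HTML parser that collects the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Treat a new <p> as implicitly closing the previous one.
            self.paragraphs.append("")
            self._in_p = True
        elif tag == "br" and self._in_p:
            # Keep words on either side of a line break separated.
            self.paragraphs[-1] += " "

    def handle_endtag(self, tag):
        if tag in ("p", "body", "html"):
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def lenient_html_paragraphs(markup):
    """Return the text of every <p>, tolerating malformed markup."""
    parser = ParagraphCollector()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    print("strict XML parser accepts it:", strict_xml_parse_ok(BAD_MARKUP))
    print("lenient HTML parser found:", lenient_html_paragraphs(BAD_MARKUP))
```

The strict parser rejects the page outright, while the lenient one still recovers both paragraphs. Running the same sort of comparison against a handful of real ugly pages is the quickest way to judge whether a given parser is robust enough for a project.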

Another good library to consider is html5lib: http://code

maradydd August 11 2009, 01:35:52 UTC

Hi Pablo -- thanks for dropping by! (The LJ Genie must be working overtime ( ... )