Has anyone reading this used Scrapy, the Python HTML-scraping framework, programmatically as part of a larger system? I'm interested in using it to replace BeautifulSoup in a project I'm working on which involves extracting specific, XPath-targetable tags from the contents of a whole bunch of different URLs. BeautifulSoup can do it, but the CPU and
Comments 2
There is currently a proposal for adding programmatic control of Scrapy: http://dev.scrapy.org/wiki/SEP-004
However, it's not gonna happen soon, as we're now focusing on cleaning up the code and documenting the last bits and pieces for the first stable release, which we hope to ship in 2-4 weeks.
Btw, Scrapy doesn't perform any extra cleansing on markup; it just uses libxml2 directly, which is probably not as bad as you suspect. Recent versions of libxml2 (the 2.6 series at least) are reasonably good at dealing with bad markup (as long as you're using the HTML parser, not the XML one), and even if it isn't as good as BeautifulSoup, the performance gain probably outweighs the loss in parsing robustness.
I'd recommend trying libxml2 2.6.32 with a few ugly pages and judging for yourself.
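For a quick way to run that kind of test, here is a minimal sketch. It assumes the lxml package, which is a separate Python binding built on libxml2 and not what Scrapy itself calls; the sample markup and the helper names `extract_links` and `parses_as_xml` are made up for illustration.

```python
from lxml import etree, html

# Deliberately sloppy markup: unquoted attributes, unclosed <p>, <b> and <a>
# tags, and a missing </html>. Invented for this example.
UGLY_PAGE = """
<html><body>
<div class=item><p>First <b>bold<p>Second paragraph
<a href="/one">link one
<a href=/two>link two</a>
</body>
"""


def extract_links(markup):
    """Parse with libxml2's forgiving HTML parser and pull hrefs via XPath."""
    tree = html.fromstring(markup)
    return tree.xpath("//div[@class='item']//a/@href")


def parses_as_xml(markup):
    """Return True only if the strict XML parser accepts the markup."""
    try:
        etree.fromstring(markup)
        return True
    except etree.XMLSyntaxError:
        return False


if __name__ == "__main__":
    print("HTML parser found:", extract_links(UGLY_PAGE))
    print("Accepted by XML parser:", parses_as_xml(UGLY_PAGE))
```

Swapping in a few real pages for `UGLY_PAGE` and checking whether the XPath results match what BeautifulSoup gives you should show fairly quickly whether the HTML parser is robust enough for your case.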
Another good library to consider is html5lib: http://code