If you run

lynx -dump -listonly "http://www.google.com/search?q=moose&num=100"

you get a list of links on that page, something that's almost ready to be used as input for wget.
It still needs some cleanup:
- weed out the links within google. You could just pipe the file through "grep -v google.com/" to do that. But you also need to
- keep only the lines that start with a number. And you only need the link, not the number.
Putting these two together:
cat dump-from-lynx | perl -nle 'print $1 if m#\d+\. (.*)# && $1 !~ m#google.com/#'
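
Glued together, the whole thing looks roughly like this (just a sketch; "moose", dump-from-lynx and urls.txt are placeholder names you'd swap for your own):

lynx -dump -listonly "http://www.google.com/search?q=moose&num=100" > dump-from-lynx
cat dump-from-lynx | perl -nle 'print $1 if m#\d+\. (.*)# && $1 !~ m#google.com/#' > urls.txt
wget -i urls.txt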
Also, Google only shows you up to 100 hits at a time, so you'll have to dump search results for the next pages too: append &start=100 (then &start=200, and so on) to the search URL to page through them.
All this can be automated, of course, more or less elegantly. There used to be a web-search API for Google, but that involved writing actual code, and in any case I can't find it now.
You have to throttle your access to these services. Google doesn't mind occasional access like this, but make sure you don't hammer it in a tight loop or you'll get blocked for a while.
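
If you do script the paging, you can build the pause right into the loop. Something along these lines should do (a sketch only; the three pages and the 10-second sleep are arbitrary choices):

for start in 0 100 200; do
  lynx -dump -listonly "http://www.google.com/search?q=moose&num=100&start=$start" >> dump-from-lynx
  sleep 10
done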
wget also has options to rate-limit its requests, though they matter less here because you're visiting many different sites. It also has options to retry when a site doesn't answer the first time; those may be useful to you.
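
For what it's worth, the switches I have in mind are roughly these (a sketch; the particular numbers are arbitrary):

wget -i urls.txt --wait=1 --random-wait --limit-rate=200k --tries=3 --timeout=30

--wait, --random-wait and --limit-rate slow things down; --tries and --timeout deal with sites that don't answer promptly.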
Remember to use separate working areas for each query, if you care about having separate corpora.
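
For example, something like this keeps each query's downloads in its own directory (a sketch; the query names and the per-query URL lists are placeholders):

for q in moose elk; do
  mkdir -p "$q"
  wget -i "$q-urls.txt" -P "$q"
done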
RE: (b) -- there are interesting theoretical approaches to this but I don't know of an off-the-shelf tool you can use. Sounds like you have a bit too much data to check everything manually, so you'll have to sample it to estimate the noise caused by dupes.
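
One crude way to start: exact duplicates fall out of a checksum pass, and for the rest you can pull a random sample to eyeball by hand (a sketch; it assumes the downloaded pages sit under corpus/ and that md5sum and shuf are available):

md5sum corpus/* | sort | uniq -D -w32
ls corpus/ | shuf -n 50 > sample-to-check.txt

The first line lists files whose contents are byte-for-byte identical; the second picks 50 files at random for manual inspection.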