Do you already have the search results available, or do you need a tool to fetch them too?
How many documents do you expect to visit in total?
How much do you care about possible duplication within your corpus? E.g., how bad is it if your initial URL list ends up pointing you to the same document twice?
The basic task is simple, but all of the above make it more complicated to do at scale with fully accurate results. If you have a lot of documents, especially if many of them are hosted on the same site, you have to think about throttling your requests so that the site owners don't hate you and your Internet provider (your university, I suppose) doesn't send you a nasty letter asking you to stop wasting bandwidth. Also, getting a large number of results from Google programmatically is possible, but it is more involved than just ripping the HTML from the result of "http://www.google.com/search?q=moose&num=100".
On a Unix machine (or Windows with Cygwin), the tool wget can fetch multiple documents for you. Create a temporary working directory, then, supposing you have a file with a list of URLs to fetch sitting in your home directory, do "wget -x -i ~/urllist" (the -x makes wget save each document under a separate directory keyed by its hostname). You wanted everything concatenated together, so you then do:
find . -type f -print | xargs cat > ~/corpus
(This lists all the regular files under the current directory, turns the file list into arguments to the con"cat"enate command, and redirects the output to the file "corpus" in your home directory.)
Making this more robust takes more effort, depending on what your precise requirements are. Note that anything relying on the web cannot be 100% reliable (sometimes some of the sites you're hoping to visit are down; do you revisit them, and when?), and if you're really talking about many documents, it can actually get expensive.
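One rough way to handle revisits, assuming the per-host directory layout above, is simply to re-run wget over the same list with no-clobber turned on, so documents you already have are skipped and only the missing ones are retried:
wget -x -nc -t 3 -T 30 -i ~/urllist
(Here -nc skips files that already exist, -t 3 retries each fetch up to three times, and -T 30 gives up on an unresponsive site after 30 seconds. Take this as a sketch rather than a recipe; check the wget man page on your system for the details.)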
I do need the tool to fetch the list of hits from Google given a query, yes. My queries so far usually get a few hundred hits at most, and there could be, I don't know, something on the order of a couple hundred total queries. I don't think bandwidth should be a problem given that scale.
I hadn't thought about the duplication issue - it could skew the counts substantially if there were lots of hits leading to the same page or even just two hits to a page containing lots of tokens of the search string. I guess it depends on the degree of duplication - I don't need 100% reliability, but too much noise would make the results meaningless.
So to use wget I need (a) a way of generating a list of the URLs for a given query and (b) ideally a way of weeding out duplicates - right?
If you run
lynx -dump -listonly "http://www.google.com/search?q=moose&num=100"
you get a list of links on that page, something that's almost ready to be used as input for wget.
It still needs some cleanup:
- weed out the links within google. You could just pipe the file through "grep -v google.com/" to do that. But you also need to
- use just the lines with a serial number leading them. And you only need the link, not the serial number.
Putting these two together:
cat dump-from-lynx | perl -nle 'print $1 if m#\d+\. (.*)# && $1 !~ m#google.com/#'
Also, Google only shows you up to 100 hits at a time, so you'll have to dump the search results for the following pages too: append &start=100 to the search URL to get the second page, &start=200 for the third, and so on.
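To put the pieces together for one query, a rough loop (here for "moose", reusing the cleanup one-liner above and pausing between pages so you don't hammer Google) could look something like this:
for start in 0 100 200 300; do
  lynx -dump -listonly "http://www.google.com/search?q=moose&num=100&start=$start" \
    | perl -nle 'print $1 if m#\d+\. (.*)# && $1 !~ m#google.com/#' >> ~/urllist
  sleep 10
done
(Each pass appends that page's links to ~/urllist, which you can then feed to wget -i as before. The start values and the sleep are just placeholders; adjust them to however many pages your query actually has.)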
All this can be automated, of course, more or less elegantly. There used to be a web search API for Google, but it involved writing actual code, and in any case I can't find it now.
You have to throttle your access to services. Google doesn't mind occasional access like this but make sure you don't hammer it in a tight loop or you'll start getting blocked for a while.
When you use wget, you also have options to rate-limit, but they're probably less important here because you're visiting different sites. There are also options to retry if a site didn't answer the first time; those may be useful to you.
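For reference, the knobs I have in mind look something like this (a sketch; check the man page on your system for the exact behavior):
wget --wait=2 --random-wait --limit-rate=50k --tries=3 -i ~/urllist
(--wait and --random-wait space out successive requests, --limit-rate caps the download speed, and --tries controls how many times a failed fetch is reattempted.)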
Remember to use separate working areas for each query, if you care about having separate corpora.
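That is, something like this per query, where ~/urllist-moose stands in for whichever URL list that query produced:
mkdir moose && cd moose
wget -x -i ~/urllist-moose
That keeps each query's downloads, and the concatenated corpus you build from them, from getting mixed together.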
RE: (b) -- there are interesting theoretical approaches to this, but I don't know of an off-the-shelf tool you can use. It sounds like you have a bit too much data to check everything manually, so you'll have to sample it to estimate how much noise the dupes cause.
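The easy part, exact byte-for-byte duplicates, you can at least catch with a checksum pass over a query's working directory; near-duplicates (the same text with different boilerplate around it) are where it gets hard. A rough sketch, assuming GNU md5sum and uniq:
find . -type f -print | xargs md5sum | sort | uniq -w32 -D
(This hashes every downloaded file, sorts by hash so identical contents end up adjacent, and prints every file whose hash matches another's; uniq -w32 compares only the 32-character checksum at the start of each line.)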