programming help?

Feb 08, 2010 19:53

I wonder if one of you programmers out there would be willing to help with a small desideratum of mine ( ... )

Comments 4

gaal February 9 2010, 07:20:49 UTC
Do you already have the search results available, or do you need a tool to fetch them too ( ... )

goliard February 10 2010, 02:17:54 UTC
I do need the tool to fetch the list of hits from Google given a query, yes. My queries so far usually get a few hundred hits at most, and there could be, I don't know, something on the order of a couple hundred total queries. I don't think bandwidth should be a problem given that scale.

I hadn't thought about the duplication issue - it could skew the counts substantially if there were lots of hits leading to the same page or even just two hits to a page containing lots of tokens of the search string. I guess it depends on the degree of duplication - I don't need 100% reliability, but too much noise would make the results meaningless.

So to use wget I need (a) a way of generating a list of the URLs for a given query and (b) ideally a way of weeding out duplicates - right?

gaal February 10 2010, 09:51:10 UTC
For (a) perhaps the simplest way to get started is to use lynx to dump the result of a search. This is useful because if you do

lynx -dump -listonly "http://www.google.com/search?q=moose&num=100"
you get a list of links on that page, something that's almost ready to be used as input for wget.

It still needs some cleanup:
- weed out the links within Google. You could just pipe the file through "grep -v google.com/" to do that.
- use just the lines with a serial number leading them, and keep only the link, not the serial number.
Putting these two together:

cat dump-from-lynx | perl -nle 'print $1 if m#\d+\. (.*)# && $1 !~ m#google.com/#'

Also, Google only shows you up to 100 hits at a time, so you'll have to dump search results for the next pages: append &start=100 to the search URL to page ( ... )
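
As a rough sketch, the cleanup and paging steps above could be wrapped in a few small shell functions. The function names (search_url, extract_links, fetch_hits) are made up for illustration, the three-page limit is an arbitrary assumption, and the filter assumes the lynx listing numbers each link as "   1. http://...". The sort -u at the end is one simple way to weed out exact duplicate links; it will not catch the same page reached through different URLs.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Build a search URL for a query and a result offset (default 0).
# Spaces become "+"; other special characters are NOT encoded here.
search_url() {
    query=$(printf '%s' "$1" | tr ' ' '+')
    printf 'http://www.google.com/search?q=%s&num=100&start=%s\n' "$query" "${2:-0}"
}

# Read a "lynx -dump -listonly" listing on stdin and print the external
# links: keep only numbered lines, strip the serial number, drop links
# that point back into google.com, and remove exact duplicates.
extract_links() {
    sed -n 's/^ *[0-9][0-9]*\. //p' | grep -v 'google\.com/' | sort -u
}

# Page through one query 100 hits at a time (assumed cap: 3 pages)
# and print the cleaned-up list of links.
fetch_hits() {
    for start in 0 100 200; do
        lynx -dump -listonly "$(search_url "$1" "$start")"
    done | extract_links
}
```

Something like `fetch_hits moose > urls.txt` followed by `wget -i urls.txt` would then fetch the pages themselves (urls.txt is just a placeholder file name).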

goliard February 12 2010, 23:27:44 UTC
Much thanks! It looks like this should basically do what I need. Here's a different question, though: what if I do want to just count Google hits rather than total tokens - can this be automated? That is, can I feed Google a list of queries and get a list of hit counts in return (true ones, not the bizarrely exaggerated ones it gives on the first bunch of results pages)? (This seems like it should be simpler and that there should be programs or sites out there that do it, but I can't seem to find any.)
