How to parse Google search results

topic posted Thu, March 29, 2007 - 9:40 AM by  Glenn
In another thread, in another tribe, B asked: "Do you know of a good way to parse Google search results pages? I am automating some queries."

If I recall correctly, you were developing your DB in Java. I am an almost complete novice to Java, but I have some expertise at doing text mangling operations in Perl. If nothing else, you could use a Perl program to perform a retrieval, extraction and transformation, then pipe the output to your Java app for loading.
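As a rough sketch of the "pipe the output" step: the extraction program only needs to write one record per line in a format the Java side can split easily. The example below is in Python purely for illustration (a Perl version would look much the same), and the function names `to_tsv_line` and `emit`, as well as the choice of tab-separated URL and title fields, are assumptions of mine rather than anything B's loader is known to expect.

```python
import sys


def to_tsv_line(url, title):
    """Format one (url, title) record as a single tab-separated line.

    Whitespace inside the title is collapsed to single spaces so that an
    embedded tab or newline cannot break the one-record-per-line format.
    """
    clean_title = " ".join(title.split())
    return f"{url}\t{clean_title}"


def emit(records, out=sys.stdout):
    """Write every record to `out`, one line each, for a downstream reader."""
    for url, title in records:
        out.write(to_tsv_line(url, title) + "\n")


if __name__ == "__main__":
    # Hypothetical records; in practice these would come from the parser.
    emit([("http://example.com/a", "First   hit"),
          ("http://example.com/b", "Second\nhit")])
```

The Java app could then read its standard input line by line and split each line on the tab character.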

In particular, I like the Perl module HTML::TreeBuilder and its counterpart, XML::TreeBuilder. With these two, you can parse a document into an n-ary tree structure, and then use a set of methods for traversing the tree. HTML::TreeBuilder isa HTML::Element, and XML::TreeBuilder isa XML::Element, so the methods available to the ::Element classes are available to objects of the ::TreeBuilder classes as well.

Taking that to its next step, the ::Element classes have a method called look_down(), which searches the tree for elements matching particular tags or attributes. You can use this to reduce your context to the smallest unit before trying to extract the useful bits from a given web page.
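To make the idea concrete, here is a minimal sketch of the same approach in Python, using only the standard library's `html.parser`. It is not the Perl module's API: the `Node` class, its `look_down()` method, `parse()`, and `extract_links()` are my own illustrative names, loosely modeled on the Perl behavior. The sample markup, including the `results` and `ads` ids, is invented for the example and is not what Google actually serves.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the n-ary tree: tag, attributes, ordered children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

    def look_down(self, tag=None, **attrs):
        """Return this node and all descendants matching tag and attributes."""
        found = []
        tag_ok = tag is None or self.tag == tag
        if tag_ok and all(self.attrs.get(k) == v for k, v in attrs.items()):
            found.append(self)
        for child in self.children:
            if isinstance(child, Node):
                found.extend(child.look_down(tag, **attrs))
        return found

    def text(self):
        """Concatenate all text beneath this node."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract_links(html, container_id):
    """Narrow to the <div> with the given id, then collect its links."""
    links = []
    for container in parse(html).look_down("div", id=container_id):
        for a in container.look_down("a"):
            links.append((a.attrs.get("href"), a.text().strip()))
    return links


# Invented sample page: one block of ads, one block of results.
SAMPLE = """
<html><body>
  <div id="ads"><a href="http://ads.example/1">Buy now</a></div>
  <div id="results">
    <p><a href="http://example.com/a">First hit</a></p>
    <p><a href="http://example.com/b">Second hit</a></p>
  </div>
</body></html>
"""

if __name__ == "__main__":
    for href, title in extract_links(SAMPLE, "results"):
        print(href, title)
```

Narrowing to the container first means the link in the `ads` block never reaches the extraction step, which is the point of reducing the context before pulling out the useful bits.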

Of course, there is a drawback to all this, and that is that if the web page layout changes, it all goes to hell.

My understanding is that Google has some sort of search API, but I don't know whether it is usable with the main Google search engine, or whether it can only be used to access private subsets (e.g. a company that has installed a Google Search Appliance).
posted by:
Glenn
New York
  • B

    Re: How to parse Google search results

    Thu, March 29, 2007 - 9:05 PM
    The Google API is closed to new developers; it requires a key. It would have been cool, because it was a Java interface.

    I already have an HTML connection class in Java, and my own HTML parser, which builds a graph from the HTML tags that can be traversed.

    The problem that I see is that there is so much extraneous stuff mixed in with the results, like all those ads, and no nice clear delineation that I can find so far to show where the links start. I was hoping for a nice comment tag saying "search results start here". But no such luck.