In another thread, in another tribe, B asked: "Do you know of a good way to parse Google search results pages? I am automating some queries."
If I recall correctly, you were developing your DB in Java. I am an almost complete novice at Java, but I have some expertise in text-mangling operations in Perl. If nothing else, you could use a Perl program to do the retrieval, extraction and transformation, then pipe the output to your Java app for loading.
In particular, I like the Perl module HTML::TreeBuilder and its counterpart, XML::TreeBuilder. With these two, you can parse a document into an n-ary tree structure, and then there is a set of methods available for traversing these trees. HTML::TreeBuilder isa HTML::Element, and XML::TreeBuilder isa XML::Element (which is itself an HTML::Element), so the methods available to the ::Element classes are also available to objects of the ::TreeBuilder classes.
Taking that to its next step, the ::Element classes have a method called look_down(), which searches the tree below a node for elements with a particular tag or particular attribute values. You can use this to reduce your context to the smallest unit before trying to extract the useful bits from a given web page.
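For anyone who does not have Perl handy, here is a rough sketch of the same narrowing idea in Python, using only the standard library. To be clear, this is not HTML::TreeBuilder: the Node and TreeBuilder classes, the look_down() method on them, and the sample page with its "ads" and "results" ids are all made up for illustration, and real result pages will not be this tidy.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed as open elements.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element in the n-ary tree: tag, attributes, text and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def look_down(self, tag=None, **attrs):
        """Return this node and every descendant matching tag and attributes."""
        found = []
        tag_ok = tag is None or self.tag == tag
        attrs_ok = all(self.attrs.get(k) == v for k, v in attrs.items())
        if tag_ok and attrs_ok:
            found.append(self)
        for child in self.children:
            found.extend(child.look_down(tag, **attrs))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the HTML fed to it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


# Made-up page: one block of ads, one block of results.
SAMPLE = """
<html><body>
  <div id="ads"><a href="http://ads.example/1">Buy now</a></div>
  <div id="results">
    <p><a href="http://example.com/a">First hit</a></p>
    <p><a href="http://example.com/b">Second hit</a></p>
  </div>
</body></html>
"""

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()

# First reduce the context to the results block, then pull links from it only.
results = builder.root.look_down("div", id="results")[0]
links = [(a.attrs["href"], a.text) for a in results.look_down("a")]
print(links)
```

The point of the two-step search is that the ad link never shows up in the output, because the second look_down() only sees the subtree under the results block.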
Of course, there is a drawback to all this, and that is that if the web page layout changes, it all goes to hell.
My understanding is that Google has some sort of search API, but I don't know if it is usable with the main Google search engine, or if it can only be used to access private subsets (e.g. a company that has installed a Google Search Appliance).
-
Re: How to parse Google search results
Thu, March 29, 2007 - 9:05 PM
The Google API is closed to new developers. It requires a key. It would have been cool, because it was a Java interface.
I already have an HTML connection class in Java and my own HTML parser, which builds a traversable graph from the HTML tags.
The problem I see is that there is so much extraneous stuff in with the results, like all those ads, and no clear delineation that I can find so far to show where the result links start. I was hoping for a nice comment tag saying "search results start here", but no such luck.
-
Re: How to parse Google search results
Fri, March 30, 2007 - 7:36 PM
On further inspection, it also appears that automated querying violates Google's terms of service. I don't know if you care or not, but I thought you should know.
There is some discussion of getting XML out of them at googlesystem.blogspot.com/2007/...h.html
-
Re: How to parse Google search results
Fri, March 30, 2007 - 7:40 PM
On the other hand, it looks like Google News is fair game. Go to news.google.com, plug in a search term, and, along with your search results, you will get an XML feed of those results. That's actually pretty cool.
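Once you have a feed like that saved, pulling the stories out is much less fragile than scraping HTML. Here is a minimal sketch in Python with the standard library, assuming the feed is RSS 2.0 (item elements with title and link children). The sample feed text and the parse_items() name are invented for illustration; I have not checked them against what Google News actually sends.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for a saved news search feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample news search</title>
    <item>
      <title>First story</title>
      <link>http://example.com/story1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/story2</link>
    </item>
  </channel>
</rss>
"""


def parse_items(feed_xml):
    """Return a list of (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


stories = parse_items(SAMPLE_FEED)
for title, link in stories:
    print(title, link)
```

If the real feed turns out to be Atom rather than RSS, the element names (and namespaces) would differ, so treat the tag names here as placeholders.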
-
Re: How to parse Google search results
Fri, March 30, 2007 - 10:19 PM
Hmmm, I didn't think of that. Thanks!
-
Re: How to parse Google search results
Fri, April 6, 2007 - 8:53 PM
It appears that Google News is clean, with no advertising links. This is good for current news stories.
-
Re: How to parse Google search results
Fri, December 14, 2007 - 9:48 PM
You can do it using regular expressions or any DOM parser, in Perl or PHP for example.
Here is a link to the online tool: goohackle.com/scripts/google_parser.php
And the complete post about this script: goohackle.com/get-google-...a-text-file/
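To give a feel for the regular-expression route, here is a small sketch in Python (not the script linked above, which I have not reproduced). The pattern, the extract_links() name and the sample markup are my own assumptions, and a regex like this will miss unquoted attributes and other oddities that a real parser handles.

```python
import re

# Capture the quoted href value of each anchor tag, in either quote style.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return the href values of all anchor tags found in the HTML text."""
    return HREF_RE.findall(html)


# Made-up markup: two links and one anchor without an href.
SAMPLE_PAGE = '''
<div><a class="l" href="http://example.com/one">One</a></div>
<div><A HREF='http://example.org/two'>Two</A></div>
<div><a name="anchor-only">No link here</a></div>
'''

urls = extract_links(SAMPLE_PAGE)
print(urls)
```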
-
-
Re: How to parse Google search results
Sat, December 15, 2007 - 8:10 AM
I like it!
-
Re: How to parse Google search results
Wed, April 23, 2008 - 11:03 PM
-
-
-
To parse Google search results with biterscripting
Fri, February 20, 2009 - 2:13 PM
I use biterscripting for parsing a lot of documents from the web.
The following biterscripting command will get you the Google page.
cat "http://www.google.com/search?q=<some key words>"
You can store it to a file and parse it using any of their stream editors.
To get you started, they have a sample script, SS_URLs (www.biterscripting.com/SS_URLs.html), which will extract the URLs from Google results. So, the following command will show you all the URLs Google is showing for the keywords cheap and laptop.
scr SS_URLs.txt URL("http://www.google.com/search?q=cheap+laptop")
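If you are building that query string from a program rather than typing it, the only fiddly part is the encoding of the keywords. A minimal sketch in Python, where search_url() is a name I made up; it only builds the URL and deliberately does not fetch anything, given the terms-of-service point raised earlier in the thread.

```python
from urllib.parse import urlencode


def search_url(*keywords):
    """Build a search URL whose q parameter joins the keywords with '+'."""
    query = urlencode({"q": " ".join(keywords)})
    return "http://www.google.com/search?" + query


url = search_url("cheap", "laptop")
print(url)
```

urlencode() also takes care of characters such as & or # inside a keyword, which would otherwise break the query string.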
They have a few other sample scripts at www.biterscripting.com/SS_SearchURL , www.biterscripting.com/SS_SearchWeb , etc. The SS_SearchWeb script is basically your own search engine.
Hope this helps. (biterscripting is free at www.biterscripting.com .)
Sen