Given this curl command: curl --user-agent "fogent" --silent -o page.html "http://www.google.com/search?q=insansiate"

* Spelling is intentionally incorrect. I want to grab the suggestion as my result.

I want to be able to either grep into the page.html file perhaps with grep -oE or pipe it right from curl and never store a file.

The result should be: 'instantiate'

I need only the word 'instantiate', or whatever word or phrase Google suggests as the correction.

Here is the basic html that is returned:

<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&amp;ie=UTF-8&amp;&amp;sa=X&amp;ei=VEMUTMDqGoOINraK3NwL&amp;ved=0CB0QBSgA&amp;q=instantiate&amp;spell=1"class=spell><b><i>instantiate</i></b></a>&nbsp;&nbsp;<span class=std>Top 2 results shown</span>

So perhaps from/to of the string below, which I hope is unique enough to cover all my bases.

class=spell><b><i>instantiate</i></b></a>&nbsp;&nbsp;

I keep running into issues with greedy matching in grep; perhaps I should run the page through an HTML prettifier first to get a line break or 50 in there. I don't know of any simple way to do that in bash, which is what I would ideally like this to be in. I really don't want to deal with firing up Perl and making sure I have the correct module.

Any suggestions? Thank you.

A: 

curl --> tidy -asxml --> xmlstarlet sel
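One way that chain might look in practice, as a rough sketch rather than a tested recipe: tidy repairs the markup into XHTML, and xmlstarlet picks out the link whose class is "spell". The function name suggestion_via_xpath is made up for illustration, and the XPath assumes the "Did you mean" link keeps class=spell as in the HTML quoted in the question.

```shell
# suggestion_via_xpath is a hypothetical helper name. It reads HTML on stdin,
# lets tidy repair it into XHTML, and selects the "spell" link with XPath.
suggestion_via_xpath() {
  # -q: quiet, -asxml: emit XHTML, -numeric: numeric entities instead of &nbsp;
  # --force-output yes: still write output when tidy reports errors
  tidy -q -asxml -numeric --show-warnings no --force-output yes 2>/dev/null |
    xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" \
      -t -v "(//x:a[@class='spell'])[1]" -n 2>/dev/null
}

# Intended use (needs network access, so it is left commented out):
# curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | suggestion_via_xpath
```

The namespace prefix x is needed because tidy's XHTML output puts every element in the XHTML namespace, so a bare //a would not match.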

Ignacio Vazquez-Abrams
A: 

Edit: Sorry, did not see your Perl notice.

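#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Misspelled search term; defaults to the example from the question.
# The // operator requires Perl 5.10 or later.
my $arg = shift // 'insansiate';

my $lwp = LWP::UserAgent->new(agent => 'Mozilla');
# get() always returns a response object, so test is_success instead of "or die".
my $c = $lwp->get("http://www.google.com/search?q=$arg");
die $c->status_line, "\n" unless $c->is_success;

my @content = split(/:/, $c->content);

for (@content) {
  # The suggestion is wrapped in <b><i>...</i></b>; the match is non-greedy.
  if (m;<b><i>(.+?)</i></b>;) {
    print "$1\n";
    exit;
  }
}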

Running:

 > perl google.pl 
    instantiate
 > perl google.pl disconect
    disconnect
trapd00r
+1  A: 

As I'm sure you're aware, screen scraping is a delicate business. This command sequence is no exception since it relies on the specific structure of the page which could change at any time without notice.

grep -o 'Did you mean:\([^>]*>\)\{5\}' page.html | sed 's/.*<i>\([^<]*\)<.*/\1/'

In a pipe:

curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'

This relies on finding five ">" characters between "Did you mean:" and the "</i>" after the word you're looking for.
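To try the pattern without hitting Google, the same grep and sed stages can be fed the HTML fragment quoted in the question. This is only a sketch: extract_suggestion is a name made up here, and the head -n 1 is an added safeguard in case the phrase appears more than once on a page.

```shell
#!/usr/bin/env bash
# Sample markup copied from the question; no network access is needed.
sample_html='<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&amp;ie=UTF-8&amp;&amp;sa=X&amp;ei=VEMUTMDqGoOINraK3NwL&amp;ved=0CB0QBSgA&amp;q=instantiate&amp;spell=1"class=spell><b><i>instantiate</i></b></a>&nbsp;&nbsp;<span class=std>Top 2 results shown</span>'

# extract_suggestion is a hypothetical helper name: it reads HTML on stdin
# and prints the text inside the first <i>...</i> after "Did you mean:".
extract_suggestion() {
  grep -o 'Did you mean:\([^>]*>\)\{5\}' |
    sed 's/.*<i>\([^<]*\)<.*/\1/' |
    head -n 1
}

suggestion=$(printf '%s\n' "$sample_html" | extract_suggestion)
printf '%s\n' "$suggestion"   # prints: instantiate
```

The five ">" characters consumed here are the ends of "</span>", the opening "<a ...>", "<b>", "<i>" and "</i>", so grep -o stops right after the suggested word.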

Have you considered other methods of getting spelling suggestions or are you specifically interested in what Google provides?

If you have ispell or aspell installed, you can do:

echo insansiate | ispell -a

and parse the result.
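Parsing could look something like the sketch below. In -a mode, a misspelled word with suggestions is reported on a line of the form "& word count offset: guess, guess, ...", so the first guess is whatever follows the colon up to the first comma. The helper name first_suggestion is made up, and the sample line is invented to show the format; it is not output from a real run, and the guesses a real dictionary returns may differ.

```shell
# first_suggestion is a hypothetical helper name. It reads "ispell -a"
# (or "aspell -a") output on stdin and prints the first suggestion.
first_suggestion() {
  awk '/^&/ {
    sub(/^[^:]*: /, "")       # drop the "& word count offset: " prefix
    split($0, guesses, ", ")  # the rest is a comma-separated list
    print guesses[1]
    exit
  }'
}

# Illustrative input only: this line is made up to show the format,
# it is not captured from a real ispell run.
sample_output='& insansiate 2 0: instantiate, insatiate'
printf '%s\n' "$sample_output" | first_suggestion

# Intended use:
# echo insansiate | ispell -a | first_suggestion
```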

Dennis Williamson
A: 

Do you have lynx?

$ lynx --dump "http://www.google.com/search?q=insansiate" | \
> grep "Did you mean"
    1. Did you mean: [12]instantiate  Top 2 results shown
   Did you mean to search for: [31]instantiate
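To get from that dump down to just the word, something like the following might do. It is a sketch that assumes the layout shown above stays the same: a "[n]" link marker right after "Did you mean: " and two spaces before whatever text follows the suggestion. The name suggestion_from_dump is made up.

```shell
# suggestion_from_dump is a hypothetical helper name. It reads the text
# produced by "lynx --dump" on stdin and prints the first suggestion.
suggestion_from_dump() {
  sed -n 's/.*Did you mean: \[[0-9]*\]//p' | # keep what follows the link marker
    sed 's/  .*//' |                          # cut at the first double space
    head -n 1
}

# The two lines below are the lynx output shown above.
sample_dump='    1. Did you mean: [12]instantiate  Top 2 results shown
   Did you mean to search for: [31]instantiate'
printf '%s\n' "$sample_dump" | suggestion_from_dump   # prints: instantiate

# Intended use:
# lynx --dump "http://www.google.com/search?q=insansiate" | suggestion_from_dump
```

Cutting at a double space rather than at the first space keeps multi-word suggestions intact.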
mobrule