tags:

views:

1471

answers:

8

Hi,

I have this page: http://www.elseptimoarte.net/. The page has a search field; if I type, for instance, "batman", it gives me some search results, each with its own URL: http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978

I would like to parse the HTML to get the URL of, for example, the first link: www.elseptimoarte.net/peliculas/batman-begins-1266.html

The problem is that I use curl (in bash), but when I do curl -L -s 'http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978' it doesn't give me the link.

Any help?

Many thanks, and sorry for my English!

A: 

I'll give you a more thorough command-line answer in a second, but in the meantime, have you considered using Yahoo Pipes? It's little more than a proof of concept right now, but it has everything you need.

Parker
A: 

You don't get the link using cURL because the page uses JavaScript to fetch that data.

Using Firebug, I found the real URL the page requests; it's quite monstrous!

Greg
+1  A: 

This might not be exactly what you're looking for, but it gives me the same response as your example. Perhaps you can adjust it to suit your needs:

From bash, type:

$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n '/href="http:\/\/www\.elseptimoarte\.net/p'

the "</g" starts a new line. Don't include the prompt ($). Someone more familiar with sed might do a better job than me. You can replace the query string 'batman' and/or the duplicate site url strings to suit your needs.

The following was my output:

<a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>
<a href="http://www.elseptimoarte.net/peliculas/batman:-the-dark-knight-30.html" class=l>El Caballero Oscuro (2008) - El Séptimo Arte
<a href="http://www.elseptimoarte.net/-batman-3--y-sus-rumores-4960.html" class=l>&#39;
<a href="http://www.elseptimoarte.net/esp--15-17-ago--batman-es-lider-y-triunfadora-aunque-no-bate-record-4285.html" class=l>(Esp. 15-17 Ago.) 
<a href="http://www.elseptimoarte.net/peliculas/batman-gotham-knight-1849.html" class=l>
<a href="http://www.elseptimoarte.net/cine-articulo541.html" class=l>Se ponen en marcha las secuelas de &#39;
<a href="http://www.elseptimoarte.net/trailers-de-buena-calidad-para--indiana--e--batman--3751.html" class=l>Tráilers en buena calidad de &#39;Indiana&#39; y &#39;
<a href="http://www.elseptimoarte.net/usa-8-10-ago--impresionante--batman-sigue-lider-por-4%C2%AA-semana-consecutiva-4245.html" class=l>(USA 8-10 Ago.) Impresionante. 
<a href="http://www.elseptimoarte.net/usa-25-27-jul--increible--batman-en-su-segunda-semana-logra-75-millones-4169.html" class=l>(USA 25-27 Jul.) Increíble. 
<a href="http://www.elseptimoarte.net/cine-articulo1498.html" class=l>¿Aparecerá Catwoman en &#39;
Parker
A: 

Many Thanks to all!

But I need to dump the HTML into a text file (with a shell script using curl, for instance) so that I end up with the search-result URLs:

for instance:

www.elseptimoarte.net/peliculas/the-other-boleyn-girl-1121.html
www.elseptimoarte.net/foro/index.php?topic=9274.0
www.elseptimoarte.net/esp--15-17-ago--batman-es-lider-y-triunfadora-aunque-no-bate-record-4285.html
etc.

Then I run curl on those URLs to extract some values.

Many thanks, and sorry for my English!

A: 

Pepe,

Here's the command you can use to get what you want:

$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/gp' > myfile.txt

It's a slight alteration of the command above. It puts line breaks between the URLs, but it wouldn't be difficult to change it to give exactly the output you described.
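Since you said you then want to run curl over each result, a minimal follow-up loop could look something like this (the grep for the page title is just a placeholder for whatever values you actually want to extract):

$ while read -r url; do curl -s "$url" | grep '<title>'; done < myfile.txt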

Parker
A: 

Parker many thanks!

From this, I understand that I can get the HTML of the page with wget but not with curl, is that right?

Many thanks

A: 

curl and wget share many uses. I'm sure people have their preferences, but I tend to go to wget first for crawling, as it has auto-following of links to a specified depth and tends to be a bit more versatile with common text web pages, while I use curl when I need a less-common protocol or I have to interact with form data.

You can use curl if you prefer it, though I think wget is more suited here. In the command above, just replace 'wget' with 'curl' and '-U' with '-A', omit '-O -' (I believe curl defaults to stdout; if it doesn't on your machine, use the appropriate flag), and leave everything else the same. You should get the same output.
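Spelled out, the curl version would look like this (untested; it simply applies the substitutions above):

$ curl -A 'Mozilla/5.0' 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/gp' > myfile.txt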

Parker
A: 

There's Watir for Java.

And if you are on .NET (C#/VB), you can use WatiN, which is an awesome browser-manipulation tool.

It is sort of a testing framework, with tools to manipulate the browser DOM and poke around in it, but I believe you can also use those outside of a "testing" context.

chakrit