Hello,
I need to scrape some data from an HTML page:

<div style="margin-top: 0px; padding-right: 5px;" class="lftFlt1">

    <a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>
    <div style="display: inline; margin-right: 10px;"><a href="" onclick="rate('157204');return false;"><img src="http://icdn.raaga.com/3_s.gif" title="RATING: 3.29" style="position: relative; left: 5px;" height="10" width="60" border="0"></a></div>
    </div>

I need to extract the title "USA USA" and the number 157204 from the onclick="setList1(157204);..." attribute.

+1  A: 

Use a regular expression:

/setList1\(([0-9]+)\)[^>]+title="([^"]+)"/si

with preg_match() (first match only) or preg_match_all() (all matches).
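A minimal sketch of how that pattern could be applied, assuming the markup from the question is already in a string (the variable names here are illustrative only):

```php
<?php
// Sample markup copied from the question.
$html = '<a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>';

// Group 1 captures the number passed to setList1(), group 2 the title text.
$pattern = '/setList1\(([0-9]+)\)[^>]+title="([^"]+)"/si';

// PREG_SET_ORDER yields one array per matched link.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], ' => ', $match[2], PHP_EOL;
}
```

Note that the pattern relies on the title attribute appearing after onclick inside the same tag, so it would likely need adjusting if the attribute order changes.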

killer_PL
Regex is not the way to parse HTML. See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 and other responses on that thread.
Mark Trapp
+2  A: 

You should use DOMDocument or XPath. RegEx is generally not recommended for parsing HTML.
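As a rough illustration of the DOMDocument route, assuming the page markup is available as a string and that the links of interest carry exactly the class contentSubHead (as in the question), something like this could work:

```php
<?php
// Simplified version of the markup from the question.
$html = '<div class="lftFlt1"><a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a></div>';

// Real-world pages are rarely valid HTML, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Walk all anchors and keep the title of those with the expected class.
$titles = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    if ($link->getAttribute('class') === 'contentSubHead') {
        $titles[] = $link->getAttribute('title');
    }
}

print_r($titles);
```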

Shubham
+1  A: 

Please go through my previous answers about how to handle HTML with DOM.

XPath to get the Text Content of all anchor elements:

//a/text()

XPath to get the title attribute of all anchor elements:

//a/@title

XPath to get the onclick attribute of all anchor elements:

//a/@onclick

You will have to use some string function to extract the number from the onclick text.
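A possible sketch combining the onclick query above with a string function, assuming preg_match() is an acceptable choice for pulling the number out of the attribute value. Matching on the literal setList1( prefix avoids picking up the digit in the function name or the number in the rate(...) link:

```php
<?php
// Trimmed copy of the markup from the question (two anchors with onclick).
$html = <<<'HTML'
<div class="lftFlt1">
    <a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>
    <div><a href="" onclick="rate('157204');return false;"><img src="http://icdn.raaga.com/3_s.gif" title="RATING: 3.29"></a></div>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$found = [];
// Each result is a DOMAttr node holding the onclick text.
foreach ($xpath->query('//a/@onclick') as $attr) {
    // Only keep handlers of the form setList1(<digits>).
    if (preg_match('/setList1\((\d+)\)/', $attr->nodeValue, $m)) {
        $found[] = [
            'id'    => $m[1],
            'title' => $attr->ownerElement->getAttribute('title'),
        ];
    }
}

print_r($found);
```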

Gordon
A: 

By far the best library for scraping is Simple HTML DOM. It basically uses jQuery selector syntax.

http://simplehtmldom.sourceforge.net/

The way you'd get the data in this example:

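include("simple_html_dom.php");
// Note: str_get_html() parses a string of markup; to load a file or URL,
// use file_get_html() instead, as Ram points out in the comments.
$dom=str_get_html("page.html");
// Text content of the first matching link
$text=$dom->find(".lftFlt1 a.contentSubHead",0)->plaintext;
// or its title attribute
$text=$dom->find(".lftFlt1 a.contentSubHead",0)->title;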
steve
That's your opinion. How many other libs have you tried? Suggested third-party alternatives that actually use DOM instead of string parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html) and [FluentDom](http://www.fluentdom.org).
Gordon
Yes, it helped me, but with one correction: I changed the line to `$dom=file_get_html("page.html");`. Can you please explain the `.lftFlt1 a.contentSubHead` part?
Ram
@Ram it's a CSS Selector, which (contrary to steve's suggestion) is not a jQuery thing but a W3C standard: http://www.w3.org/TR/CSS2/selector.html
Gordon
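To make the selector concrete: `.lftFlt1 a.contentSubHead` matches every `a` element with class contentSubHead that is a descendant of an element with class lftFlt1. A rough XPath equivalent using PHP's DOM extension might look like the sketch below; the sample markup is hypothetical and only meant to show which links match.

```php
<?php
// Hypothetical markup: only the first link satisfies both parts of the selector.
$html = '<div class="lftFlt1 extra">'
      . '<a class="contentSubHead" title="USA USA">USA USA</a>'
      . '<a href="">rate</a>'
      . '</div>'
      . '<div class="other"><a class="contentSubHead" title="Elsewhere">Elsewhere</a></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The contains(concat(...)) idiom matches one class name among several,
// which is what a CSS class selector does.
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " lftFlt1 ")]'
       . '//a[contains(concat(" ", normalize-space(@class), " "), " contentSubHead ")]';

$xpath = new DOMXPath($doc);

$texts = [];
foreach ($xpath->query($query) as $link) {
    $texts[] = $link->textContent;
}

print_r($texts);
```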
OK, but there are several elements with the class lftFlt1 on the web page. I used this code and it didn't work: `$dom=file_get_html("http://www.raaga.com/channels/tamil/moviedetail.asp?mid=T0001923"); foreach($text=$dom->find(".lftFlt1 a.contentSubHead",0) as $a){ echo $a->plaintext; }`
Ram
@Ram Obviously. When you pass a second argument to `find`, it returns only the nth matching element (zero-based) instead of all matches. See the docs: http://simplehtmldom.sourceforge.net/manual.htm
Gordon
Yes, I got it. I need to remove that zero.
Ram
`$str = <<<HTML <a href="" onClick="setList1(50992);return false;" class="contentSubHead" title="Kakha Kakha">Kakha Kakha</a> HTML; $d = str_get_html($str); foreach($d->find('a') as $e) echo $e->onClick.'<br>';` This returns a null value. Why can't I get the value of onClick?
Ram