Hi. I can't parse this URL: http://foldmunka.net

$ch = curl_init("http://foldmunka.net");

//curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // only needed if the URL redirects
$data = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
clearstatcache();
if ($data === false) {
  echo 'cURL failed';
  exit;
}
$dom = new DOMDocument();
$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
$data = preg_replace('/<\!\-\-\[if(.*)\]>/', '', $data);
$data = str_replace('<![endif]-->', '', $data);
$data = str_replace('<!--', '', $data);
$data = str_replace('-->', '', $data);
$data = preg_replace('@<script[^>]*?>.*?</script>@si', '', $data);
$data = preg_replace('@<style[^>]*?>.*?</style>@si', '', $data);

$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
@$dom->loadHTML($data);

$els = $dom->getElementsByTagName('*');
foreach($els as $el){
  print $el->nodeName." | ".$el->getAttribute('content')."<hr />";
  if($el->getAttribute('title'))$el->nodeValue = $el->getAttribute('title')." ".$el->nodeValue;
  if($el->getAttribute('alt'))$el->nodeValue = $el->getAttribute('alt')." ".$el->nodeValue;
  print $el->nodeName." | ".$el->nodeValue."<hr />";
}

I need the alt attributes, the title attributes and the plain text sequentially, in document order, but on this page I cannot access the nodes inside the body tag.

A: 

I'm not sure I understand what this script does. The replace operations look like an attempt at sanitization, but I'm not sure what for if you're just extracting some parts of the markup. Have you tried the PHP Simple HTML DOM Parser? It may be able to handle the parsing part more easily. Check out the examples.

Pekka
I need the plaintext and the alt and title attributes. Example: `<html><title>Hello</title><body>Hello this site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click</a> Some text.</body></html>` I need this output: Hello Hello this site alt attr title attr alt attr title attr click Some text.
turbod
@turbod The Simple HTML DOM Parser can do both. The plaintext should be something like `$html->find("body",0)->plaintext`. See the examples on the site for how to run through a list of all tags to get their `alt` and `title` attributes.
Pekka
Yes, I have now read the examples, but I cannot find how to do it. I need the plaintext and the alt and title attributes at the same time.
turbod
`print file_get_html('http://foldmunka.net')->plaintext;` This prints the plaintext, but not the alt and title attributes.
turbod
@turbod Ah, I see now, you need this *sequentially*. I didn't understand that. Whew, that's going to be more difficult. I have no solution ready for that, sorry. I'll leave my answer in place to prevent others from making the same mistake. You should edit your example into your question to make this clear.
Pekka
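
For the sequential output asked for in the comments, one possible approach is to walk the `DOMDocument` tree recursively and collect each element's `alt` and `title` attributes followed by its text nodes, in document order. The following is only a minimal sketch, assuming PHP 7.1+ with the DOM extension enabled. `collectText` is a hypothetical helper name, and the markup is the sample from the comment above rather than the live page. It deliberately never assigns to `nodeValue`: doing so on an element replaces all of its children with a single text node, which may be why nodes under `body` go missing in the original loop.

```php
<?php
// Sketch only: collect alt, title and text in document order.
// collectText() is a hypothetical helper, not part of any library.
function collectText(DOMNode $node, array &$out): void
{
    if ($node instanceof DOMText) {
        // Plain text (or CDATA): keep it unless it is only whitespace.
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $out[] = $text;
        }
        return;
    }
    if (!($node instanceof DOMElement)) {
        return; // skip comments, processing instructions, etc.
    }
    if (in_array(strtolower($node->nodeName), ['script', 'style'], true)) {
        return; // ignore script and style contents instead of regex-stripping them
    }
    // Attributes first, then the element's children, so the order follows the markup.
    foreach (['alt', 'title'] as $attr) {
        if ($node->getAttribute($attr) !== '') {
            $out[] = $node->getAttribute($attr);
        }
    }
    foreach ($node->childNodes as $child) {
        collectText($child, $out);
    }
}

// Sample markup taken from the comment thread (assumption: the real page is similar).
$html = '<html><title>Hello</title><body>Hello this site'
      . '<img src="asdasd.jpg" alt="alt attr" title="title attr">'
      . '<a href="open.php" alt="alt attr" title="title attr">click</a>'
      . ' Some text.</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of using @
$dom->loadHTML($html);
libxml_clear_errors();

$parts = [];
collectText($dom->documentElement, $parts);
echo implode(' ', $parts), "\n";
```

To try this against the real page, `$html` would be replaced with the `$data` string returned by `curl_exec()` in the question. The character encoding of that page would still need to be handled before calling `loadHTML()`.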