I'm using PHP's libcurl bindings to load a page. Now I need to get the content of the page's <title> tag, along with some other information. I've tried to parse it with SimpleXML, but with no luck, because the page isn't valid XML. Can you suggest another way to easily get the contents of the <title> tag? Thank you.
You can use DOMDocument::loadHTML.
This will echo "The title":
<?php
// Sample page: deliberately invalid HTML (unclosed <html>, <head>,
// and <body>) to show that loadHTML() copes with real-world markup.
$doc = <<<HTML
<html>
<head>
<title>The title</title>
<body>
hhhhhh
HTML;

// Suppress the warnings libxml would otherwise emit for malformed HTML.
libxml_use_internal_errors(true);

$d = new DOMDocument;
$d->loadHTML($doc);

$ts = $d->getElementsByTagName("title");
if ($ts->length > 0) {
    echo $ts->item(0)->textContent;
}
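Since you also need "some other information", a DOMXPath query over the same parsed document is often handier than getElementsByTagName. A minimal sketch, assuming some made-up markup with a description meta tag for illustration:

```php
<?php
// Hypothetical sample markup; in practice this would be the HTML
// you fetched with curl.
$html = <<<HTML
<html>
<head>
<title>The title</title>
<meta name="description" content="A short summary">
<body>
hhhhhh
HTML;

libxml_use_internal_errors(true);
$d = new DOMDocument;
$d->loadHTML($html);

$xp = new DOMXPath($d);

// The <title> text.
$title = $xp->evaluate('string(//title)');

// Any other piece of the page, e.g. the description meta tag.
$desc = $xp->evaluate('string(//meta[@name="description"]/@content)');

echo $title, "\n", $desc, "\n";
```

evaluate() with an XPath string() expression returns a plain PHP string, so there's no NodeList bookkeeping for simple lookups.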
You can use this script to get the title of a page.
# Script Title.txt
var str page, content
cat $page > $content
stex -r -c "^<title&</title&\>^" $content
Save this code in the file C:/Scripts/Title.txt. The code is in biterscripting. Start biterscripting and enter this command.
script "C:/Scripts/Title.txt" page("http://stackoverflow.com/questions/3135488/how-can-i-get-pages-title-tags-content-if-it-cant-be-parsed-as-xml")
It will get the title of this page (the one you are viewing). Use any other URL or local file path as the value of page(), in double quotes. When I executed this command, I got:
How can I get page's <title> tag's content if it can't be parsed as XML? - Stack Overflow
You can call this script from any executable or batch file.
Try using Yahoo's YQL console. You can query almost any URL and get the results back as XML. You can even add an XPath expression to narrow it down.
http://developer.yahoo.com/yql/console/
You could call this service with curl. It's pretty handy.
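A sketch of what such a call might look like from PHP. The query and endpoint below follow Yahoo's documented REST pattern, but treat them as illustrative assumptions rather than a tested integration:

```php
<?php
// Hypothetical YQL query selecting the page's <title> via XPath.
$yql = 'select * from html where url="http://stackoverflow.com" and xpath="//title"';

// Assumed form of Yahoo's public YQL REST endpoint.
$url = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql) . '&format=xml';

// Fetching it would then look just like the curl code the question
// already uses for the page itself.
function fetchYql($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $xml = curl_exec($ch);
    curl_close($ch);
    return $xml; // YQL wraps results in valid XML, so SimpleXML can parse it.
}
```

The appeal here is that YQL's envelope is well-formed XML, so the SimpleXML approach that failed on the raw page works on the response.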