I want to extract the contents of the title tag from an HTML string. I have searched, but so far I have not been able to find such code in VB/C# or PHP. It should also work with both upper- and lower-case tags, e.g. both <title></title> and <TITLE></TITLE>. Thank you.
views: 1078
answers: 2
+2
A:
Sounds like a job for a regular expression. This depends on the HTML being well-formed: it only finds a title element that sits inside a head element.
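// Requires: using System.Text.RegularExpressions;
// Singleline lets . match newlines; lazy quantifiers stop at the first </title>.
Regex regex = new Regex( @"<head\b[^>]*>.*?<title\b[^>]*>(.*?)</title>.*?</head>",
RegexOptions.IgnoreCase | RegexOptions.Singleline );
Match match = regex.Match( html );
// Groups[0] is the entire match; the captured title text is Groups[1].
string title = match.Groups[1].Value;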
I don't have my regex cheat sheet in front of me so it may need a little tweaking. Note that there is also no error checking in the case where no title element exists.
tvanfosson
2009-04-04 13:51:20
"Sounds like a job for ... The More-Than-Regular Expressor!" A developer by day, a superhero by night ;)
Piskvor
2009-04-04 15:46:48
soypunk
2009-04-04 23:07:41
Even worse than what soypunk correctly points out, there are many usable HTML files with a title that are not valid, e.g. <tiTlE>a<boDy>b. You really need to use an HTML parser if you're going to handle real-world HTML.
Alohci
2009-04-04 23:51:45
So can anyone suggest how to use an HTML parser to extract the title?
2009-04-05 17:28:12
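One way to approach the parser route asked about above, sketched in PHP and assuming the bundled DOM extension (libxml) is enabled. The function name get_title_dom is hypothetical, chosen only for illustration. The parser normalises tag case, so <TITLE> and <title> should be treated alike; how it recovers from badly broken markup depends on the installed libxml version.

```php
<?php
// Hypothetical helper (name is an assumption): extract the <title> text
// using PHP's DOM extension instead of a regular expression.
function get_title_dom(string $html): string {
    // loadHTML() rejects an empty string, so handle that case up front.
    if ($html === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Collect parse warnings internally so malformed markup does not
    // emit PHP warnings; restore the previous setting afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Element names are lower-cased by the HTML parser.
    $nodes = $doc->getElementsByTagName('title');

    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
}
```

Usage would look like get_title_dom($html), which returns an empty string when no title element is found. Non-ASCII pages may additionally need their character encoding declared in the markup for the text to come back correctly.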
+2
A:
You can use regular expressions for this but it's not completely error-proof. It'll do if you just want something simple though (in PHP):
function get_title($html) {
    // i: case-insensitive; s: let . match newlines so multi-line titles are found.
    return preg_match('!<title\b[^>]*>(.*?)</title>!is', $html, $matches) ? trim($matches[1]) : '';
}
cletus
2009-04-04 13:52:04