I want to extract the contents of the title tag from an HTML string. I have searched, but so far I have not been able to find such code in VB/C# or PHP. It should also work with both upper- and lower-case tags, e.g. with both <title></title> and <TITLE></TITLE>. Thank you.

+2  A: 

Sounds like a job for a regular expression. This depends on the HTML being well-formed: it only finds a title element inside a head element.

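 using System.Text.RegularExpressions;

 // Singleline lets '.' match newlines; the lazy (.*?) stops at the first </title>.
 Regex regex = new Regex( "<head>.*?<title>(.*?)</title>.*?</head>",
                          RegexOptions.IgnoreCase | RegexOptions.Singleline );
 Match match = regex.Match( html );
 // Groups[0] is the whole match; the captured title text is Groups[1].
 string title = match.Groups[1].Value;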

I don't have my regex cheat sheet in front of me so it may need a little tweaking. Note that there is also no error checking in the case where no title element exists.

tvanfosson
"Sounds like a job for ... The More-Than-Regular Expressor!" A developer by day, a superhero by night ;)
Piskvor
soypunk
Even worse than soypunk correctly points out, there are many usable HTML files with a title that are not valid, e.g. <tiTlE>a<boDy>b. You really need to use an HTML parser if you're going to handle real-world HTML.
Alohci
So can anyone suggest how to use an HTML parser to extract the title?
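One possible way to do this in PHP is with the built-in DOM extension, sketched below. The function name get_title_with_parser is made up for illustration and does not come from any answer in this thread; the sketch also assumes the DOM extension is enabled and does not deal with character encoding.

```php
<?php
// Hypothetical helper (name is an assumption, not from the thread):
// extracts the title text with PHP's DOM extension instead of a regex.
function get_title_with_parser(string $html): string {
    // Avoid passing an empty string to loadHTML.
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Collect libxml parse errors internally instead of emitting warnings,
    // since real-world markup is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The HTML parser lowercases element names, so <TITLE> is found as well.
    $titles = $doc->getElementsByTagName('title');

    return $titles->length > 0 ? trim($titles->item(0)->textContent) : '';
}
```

Calling get_title_with_parser($html) returns an empty string when no title element is found, so the caller can test for that case.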
+2  A: 

You can use regular expressions for this, but it's not completely error-proof. It'll do if you just want something simple, though (in PHP):

function get_title($html) {
  return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}
cletus
It looks like this function is case-sensitive: it does not extract the title if the tag is in upper case. Can you alter the function to ignore case?
The 'i' flag after the pattern makes it case insensitive.
cletus
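To illustrate the point about the 'i' flag, here is a slightly looser variant of the function above, as a sketch only. The name get_title_loose is invented for this example. Besides 'i', it assumes you also want the 's' modifier, so that '.' matches newlines in a title spanning several lines, and it tolerates attributes on the opening tag. It is still a regex and shares the limitations discussed in the comments.

```php
<?php
// Hypothetical variant (name is an assumption, not from the thread).
// 'i' makes the match case-insensitive, 's' lets '.' match newlines,
// and \b[^>]* allows attributes such as <title lang="en">.
function get_title_loose(string $html): string {
    return preg_match('!<title\b[^>]*>(.*?)</title>!is', $html, $matches)
        ? trim($matches[1])
        : '';
}

// Upper-case tags are matched because of the 'i' modifier.
var_dump(get_title_loose('<TITLE>Upper</TITLE>') === 'Upper');
```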