ansaurus

Question

Answer 1

+2 A:

Sounds like a job for a regular expression. This depends on the HTML being well-formed: it only finds a title element that sits inside a head element.

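 using System.Text.RegularExpressions;
 // Singleline lets '.' match newlines, since head and title usually span several lines.
 Regex regex = new Regex( "<head>.*?<title>(.*?)</title>.*?</head>",
                          RegexOptions.IgnoreCase | RegexOptions.Singleline );
 Match match = regex.Match( html );
 // Groups[0] is the whole match; the captured title text is Groups[1].
 string title = match.Groups[1].Value;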

I don't have my regex cheat sheet in front of me so it may need a little tweaking. Note that there is also no error checking in the case where no title element exists.

tvanfosson 2009-04-04 13:51:20

"Sounds like a job for ... The More-Than-Regular Expressor!" A developer by day, a superhero by night ;)

Piskvor 2009-04-04 15:46:48

soypunk 2009-04-04 23:07:41

It's even worse than soypunk correctly points out: there are many usable HTML files with a title that are not valid, e.g. <tiTlE>a<boDy>b. You really need to use an HTML parser if you're going to handle real-world HTML.

Alohci 2009-04-04 23:51:45

So can anyone suggest how to use an HTML parser to extract the title?

2009-04-05 17:28:12
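
As one possible answer to the parser question above, here is a minimal sketch in Python using only the standard library's html.parser module. The thread's answers are in C# and PHP, so the language choice is just for illustration, and the names TitleParser and extract_title are made up for this example. It collects the text of the first title element it sees and returns an empty string when there is none.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it encounters.

    TitleParser is a hypothetical helper for this sketch, not a library class.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False  # True while inside <title>...</title>
        self._done = False      # True once the first title has been closed
        self._chunks = []       # text pieces seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives here as "title".
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate rather than assign.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks).strip()


def extract_title(html):
    """Return the first title's text, or '' if the document has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title("<html><head><TITLE> My Page </TITLE></head><body></body></html>"))
```

Unlike the regex approaches, this does not require a head element or any particular letter case. How badly broken markup is handled depends on the parser and its version, so treat it as a starting point rather than a guarantee.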

Answer 2

+2 A:

You can use a regular expression for this, but it's not completely error-proof. It'll do if you just want something simple, though (in PHP):

function get_title($html) {
  return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}

cletus 2009-04-04 13:52:04

It looks like this function is case-sensitive: it does not extract the title if it's in upper case. Can you alter it to ignore case?

2009-04-05 17:35:41

The 'i' flag after the pattern makes it case-insensitive.

cletus 2009-04-05 20:53:37
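
To illustrate the flag behaviour outside PHP, here is a rough Python equivalent of the function above. It is only a sketch: re.IGNORECASE plays the role of the 'i' modifier, and re.DOTALL is an extra assumption added here (the PHP version has no 's' modifier) so that a title spanning several lines still matches.

```python
import re

# Non-greedy capture between the tags. IGNORECASE mirrors PHP's 'i' modifier;
# DOTALL is an addition in this sketch so '.' also matches newlines.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def get_title(html):
    """Return the text of the first <title>...</title> pair, or '' if none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else ""


print(get_title("<HTML><HEAD><TITLE>Upper Case</TITLE></HEAD></HTML>"))
```

The same caveats as for the PHP version apply: a title tag with attributes or a missing closing tag will not match.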

extract title tag from html