My current XPath expression is "/html/head/title".

In real-world HTML, though, the markup is often broken; for example, a missing <html> tag could cause an exception. Is there a safe way to extract the <title> element (something like getElementsByTagName)?

+2  A: 

"//title" perhaps?

meder
+2  A: 

Because of the unruly nature of HTML markup, you should use an HTML parsing library. You didn't specify a platform or language, but there are a number of open-source libraries available.
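
Since no platform was given, here is a minimal sketch in Python, assuming only the standard library's html.parser module, which tolerates missing and unclosed tags. The names TitleExtractor and extract_title are made up for this illustration; a dedicated library would normally do the same job with less code.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of the first <title> element and ignores later ones."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None  # stays None if no <title> is ever closed

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only collect for the first <title>.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._parts.append(data)


def extract_title(markup):
    """Hypothetical helper: return the first title's text, or None."""
    parser = TitleExtractor()
    parser.feed(markup)
    parser.close()
    return parser.title


# No <html> or <head> wrapper and an unclosed <p>: parsing still succeeds.
broken = "<title>Example page</title><p>Some text"
print(extract_title(broken))
```

Returning None when there is no title keeps the "missing element" case separate from the "empty title" case.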

Paul Sasik
You can use XPath *with* an HTML parsing library. Html Agility Pack is just one example, of many, that supports both.
Matthew Flaschen
@Matthew: Good point. I qualified the xpath statement in my answer.
Paul Sasik
I don't get what "attempting xpath [...] directly on the markup" means. XPath requires that the markup already be parsed into a DOM.
Matthew Flaschen
@Matthew: Fair enough. I was making assumptions (such as some HTML->XML process) with very little context. I'm paring the answer down to just suggest using a library, which I suppose I'm assuming is not being used.
Paul Sasik
A: 

If you have something that an XML parser can parse (which is not the case with most HTML, but needs to be the case to use XPath), then you could use //title to get the element.
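
To illustrate, here is a rough sketch in Python using the standard library's xml.etree.ElementTree, which supports only a subset of XPath and spells "//title" as ".//title". The helper name find_title and the sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def find_title(xml_text):
    """Hypothetical helper: return the first title's text, or None."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The markup is not well-formed XML, so XPath cannot be applied.
        return None
    title = root.find(".//title")  # descendant search, like "//title"
    return title.text if title is not None else None


well_formed = "<html><head><title>Example page</title></head><body/></html>"
broken = "<title>Example page</title><p>Some text"

print(find_title(well_formed))  # the XML parser accepts this document
print(find_title(broken))       # the XML parser rejects this one
```

The second call shows the caveat above: the title is present in the text, but an XML parser refuses the document before any path can be evaluated.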

jwismar
A: 

If you can use JavaScript, you can read the title directly:

document.title
Topera
A: 

Actually /html/head/title should work just fine, even on badly malformed mark-up, assuming:

  • there is a title element;
  • your HTML parser behaves the same way browser parsers do;
  • your HTML parser puts the HTML elements into the null namespace.

You will have to allow for the possibility of there being multiple title elements in invalid HTML, so /html/head/title[1] is possibly better.
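
The effect of the [1] predicate can be sketched in Python with the standard library's xml.etree.ElementTree. This is only a stand-in: it is an XML parser, not a browser-like HTML parser, so it illustrates the predicate rather than the handling of malformed markup. The sample document is made up, has no xmlns attribute (so the elements are in no namespace), and the paths are written relative to the <html> root because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Invalid but well-formed: two <title> elements inside <head>.
doc = (
    "<html><head>"
    "<title>First</title><title>Second</title>"
    "</head><body/></html>"
)
root = ET.fromstring(doc)  # root is the <html> element

# Without a predicate, every matching <title> is returned.
all_titles = root.findall("head/title")

# With [1], only the first <title> under <head> is selected.
first_only = root.findall("head/title[1]")

print([t.text for t in all_titles])
print([t.text for t in first_only])
```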

Alohci