ansaurus

Question

What is the safest way to extract <title> from an HTML file using XPath?

Answer 1

+2 A:

"//title" perhaps?

meder 2010-08-18 01:25:14

Answer 2

+2 A:

Because real-world HTML markup is often malformed, you should use an HTML parsing library. You didn't specify a platform or language, but a number of open source libraries are available.
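As a rough illustration of what a tolerant HTML parser buys you, here is a minimal sketch in Python, which is an assumed language since the question names none. It uses the standard library's html.parser rather than XPath, and the TitleExtractor class and extract_title helper are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of the first <title> element it encounters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False   # currently inside a <title> element
        self.done = False       # the first <title> has already been closed
        self.parts = []         # text chunks seen inside the first <title>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> is matched as well.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the first title's text, or None if no title text was found."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip() if parser.parts else None


# Hypothetical malformed input: unclosed <p>, <b> and <br>, no </head> or </body>.
SAMPLE = "<html><head><title>Example page</title><body><p>unclosed <b>bold<br></html>"
print(extract_title(SAMPLE))
```

The parser does not raise on the unclosed tags, which is the property that matters here. Libraries that build a full tree and expose XPath on top of it follow the same idea, with the tree construction done for you.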

Paul Sasik 2010-08-18 01:25:25

You can use XPath *with* an HTML parsing library. Html Agility Pack is just one of many that support both.

Matthew Flaschen 2010-08-18 01:28:35

@Matthew: Good point. I qualified the xpath statement in my answer.

Paul Sasik 2010-08-18 01:31:48

I don't get what "attempting xpath [...] directly on the markup" means. XPath requires that the markup has already been parsed into a DOM.

Matthew Flaschen 2010-08-18 01:35:23

@Matthew: Fair enough. I was making assumptions (such as some HTML->XML process) with very little context. I'm paring the answer down to just suggest using a library, which I suppose I'm assuming is not already being used.

Paul Sasik 2010-08-18 01:45:42

Answer 3

A:

If you have something that an XML parser can parse (which is not the case with most HTML, but needs to be the case to use XPath), then you could use //title to get the element.
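To make the well-formedness condition concrete, here is a small sketch using Python's standard xml.etree.ElementTree, which supports a limited XPath subset. The language choice, the sample strings, and the title_via_xml helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed input with no namespace declared.
WELL_FORMED = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"

# Typical tag soup: the unclosed <br> and <p> make this ill-formed as XML.
TAG_SOUP = "<html><head><title>Example page</title></head><body><p>Hi<br></body></html>"


def title_via_xml(markup):
    """Return the first <title> text, or None if the markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # An XML parser rejects the document outright; there is no tree to query.
        return None
    # ".//title" is ElementTree's spelling of the descendant search "//title".
    node = root.find(".//title")
    return node.text if node is not None else None


print(title_via_xml(WELL_FORMED))  # Example page
print(title_via_xml(TAG_SOUP))     # None
```

The second call fails before any XPath is evaluated, which is why the parser matters more than the expression.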

jwismar 2010-08-18 01:26:07

Answer 4

A:

If you can use JavaScript, you can read it directly:

document.title

Topera 2010-08-18 01:26:51

Answer 5

A:

Actually, /html/head/title should work just fine, even on badly malformed markup, assuming:

there is a title element;
your HTML parser behaves the same way browser parsers do;
your HTML parser puts the HTML elements into the null namespace.

You will have to allow for the possibility of there being multiple title elements in invalid HTML, so /html/head/title[1] is possibly better.
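The namespace assumption and the [1] predicate can be sketched with Python's standard xml.etree.ElementTree. Note that this is an XML parser standing in for an HTML parser, so the input below is deliberately well-formed; the sample strings and the first_title helper are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical invalid-but-well-formed input with two <title> elements,
# once with elements in no namespace and once in the XHTML namespace.
NO_NS = "<html><head><title>First</title><title>Second</title></head><body/></html>"
WITH_NS = NO_NS.replace("<html>", '<html xmlns="%s">' % XHTML_NS)


def first_title(root):
    """Return the text of the first head/title under the <html> root, or None."""
    # The root element is <html>, so "head/title[1]" relative to it
    # corresponds to the absolute path /html/head/title[1].
    node = root.find("head/title[1]")
    if node is None:
        # Fall back to the XHTML namespace if the parser put elements there.
        node = root.find("x:head/x:title[1]", {"x": XHTML_NS})
    return node.text if node is not None else None


print(first_title(ET.fromstring(NO_NS)))    # First
print(first_title(ET.fromstring(WITH_NS)))  # First
```

Without the fallback, the unprefixed path matches nothing in the namespaced document, which is what the third assumption above guards against.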

Alohci 2010-08-18 08:13:15
