I need some help. I'm a bit of a (read: total) n00b when it comes to regular expressions, and I need help writing one in PHP to find a specific piece of text contained within a specific HTML tag.

The source string looks like this:

<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...

I'd like to extract just the text of the element for a specific language.

Can anyone help?

+6  A: 

There are plenty of HTML parsers available for PHP. I suggest you check out one of those (for example, PHP Simple HTML DOM Parser).

Shooting yourself in the foot by trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).

Tomalak
Thanks for the idea, and you're probably right. I thought that a simple regex would be the quickest and easiest way of achieving this, as I'm not parsing a whole HTML document, just a little string that will always look like the example.
David Heggie
Then you are probably still better off with two calls to strpos() to get the indexes for the substring you need.
Tomalak
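A minimal sketch of the two-strpos() idea, assuming the string always has the flat shape shown in the question (no nested spans, attribute written exactly as lang="xx"). The function name spanTextForLang is made up for illustration, and the nullable type hints assume PHP 7.1 or later:

```php
<?php
// Hypothetical helper: find the text of the first <span lang="..."> using
// plain string search. Assumes the exact attribute layout from the question.
function spanTextForLang(string $html, string $lang): ?string
{
    $open = '<span lang="' . $lang . '">';

    // First strpos(): locate the opening tag for the wanted language.
    $start = strpos($html, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);

    // Second strpos(): locate the next closing tag after that point.
    $end = strpos($html, '</span>', $start);
    if ($end === false) {
        return null;
    }

    return substr($html, $start, $end - $start);
}

$source = '<span lang="en">English Content</span><span lang="fr">French content</span>';
echo spanTextForLang($source, 'fr'), "\n"; // French content
```

Like the non-greedy regex, this stops at the first </span> it meets, so it is only safe while the spans never contain other spans.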
Just to note: PHP 5.x includes the SimpleXML extension, which makes doing this sort of thing nice and straightforward. You can easily use an XPath query to traverse the document and pick out the parts you need.
Rob
Does it handle (possibly ill-formed) HTML as well?
Tomalak
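On the ill-formed HTML question: SimpleXML expects well-formed XML, but PHP's bundled DOM extension has DOMDocument::loadHTML(), which uses libxml's more forgiving HTML parser. Below is a rough sketch of the XPath approach using that instead. It assumes the DOM extension is enabled (it usually is by default), that $lang is a plain language code with no quote characters in it, and the function name spanTextViaDom is made up for illustration:

```php
<?php
// Hypothetical helper: extract the text of the first <span lang="...">
// with DOMDocument and an XPath query.
function spanTextViaDom(string $html, string $lang): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since fragments and sloppy markup tend to trigger a few.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumes $lang contains no quotes; it is interpolated as-is.
    $nodes = $xpath->query('//span[@lang="' . $lang . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // textContent includes the text of any nested elements.
    return $nodes->item(0)->textContent;
}

$source = '<span lang="en">English Content</span><span lang="fr">French content</span>';
echo spanTextViaDom($source, 'fr'), "\n"; // French content
```

Because the parser builds a real tree, a nested span inside the target span does not cut the result short the way a regex or strpos() search would.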
A: 

A (bad, non-working) example that shows why you should not use regex for parsing HTML:

/<span lang="en">(.*)<\/span>/

For the example string, the capture group will contain:

English Content</span><span lang="fr">French content
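To reproduce this, here is a small sketch with preg_match(), assuming the source string is just the two spans from the question (the trailing "... etc ..." is left out). The greedy (.*) runs on to the last </span> in the string rather than the first:

```php
<?php
// The two spans from the question, without the trailing "... etc ...".
$source = '<span lang="en">English Content</span><span lang="fr">French content</span>';

// Greedy quantifier: (.*) grabs as much as it can before the final </span>.
preg_match('/<span lang="en">(.*)<\/span>/', $source, $m);

echo $m[1], "\n";
// English Content</span><span lang="fr">French content
```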

More stuff to read:

Parsing: Beyond Regex

For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS

Karsten
No. That is what I mean by "shooting yourself in the foot".
Tomalak
I agree that using regex for parsing HTML is not the thing you want to do, but I tried to answer the question.
Karsten
That's not the point. Your answer is wrong. :-)
Tomalak
Well, feel free to correct me then :)
Karsten
Sorry but that's wrong on at least two fronts (and if you can't figure out which two, that's a good reason why you should be using a parser).
cletus
@cletus: *lol* :-) @Karsten: If you want to find out the error no. 1, simply run your regex against the given example.
Tomalak
Thanks for the idea, Karsten. I modified it slightly, and it works for me: <span lang="gd">(.*?)</span>
David Heggie
@David, please read my answer again ;)
Karsten
Thanks for the links. I'm suitably castigated. :)
David Heggie
@David Heggie: Yours is better, but it will fail on: "<span lang="gd">foo <span>bar</span> foo</span>"
Tomalak
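A short sketch of both cases, for anyone who wants to see it run. The "Gaelic content" sample string is made up for illustration; the nested string is the one from the comment above. The non-greedy (.*?) stops at the first </span> it finds, which is right for flat input and wrong as soon as a span is nested inside the target:

```php
<?php
// Non-greedy version of the pattern, with the slash escaped for the / delimiter.
$pattern = '/<span lang="gd">(.*?)<\/span>/';

// Flat input (made-up sample text): the lazy match ends at the right tag.
$flat = '<span lang="gd">Gaelic content</span><span lang="fr">French content</span>';
preg_match($pattern, $flat, $flatMatch);
echo $flatMatch[1], "\n"; // Gaelic content

// Nested input: the match ends at the inner </span>, truncating the text.
$nested = '<span lang="gd">foo <span>bar</span> foo</span>';
preg_match($pattern, $nested, $nestedMatch);
echo $nestedMatch[1], "\n"; // foo <span>bar
```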
A: 

There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:

HTML SQL

I've used it a bunch and I love it.

Hope that helps...