I'm looking for a regex that matches all used HTML tags in a text consisting of several lines. It should read out "b", "p" and "script" in the following lines:

<b>
<p class="normalText">
<script type="text/javascript">

Is there such a thing? The start I have is that it should start with a "<" and read until it hits a space or a ">", but at the same time, it should not include the starting "<", since I just want to match the letter/word itself. Thoughts?

+1  A: 

I don't know what system you are using, but it can be done to a certain extent. Look at this online flex-based application. Check out the Published > XML regex examples. You will get an idea.

dirkgently
Can't find an example that helps me with the problem, but it's a great resource! I'm using ASP.net regex.
miccet
+6  A: 

There are many similar questions on SO:

  1. http://stackoverflow.com/questions/37486/filter-out-html-tags-and-resolve-entities-in-python
  2. http://stackoverflow.com/questions/29869/regex-to-match-all-html-tags-except-p-and-p
  3. http://stackoverflow.com/questions/44078/strip-all-html-tags-except-links

etc. The general agreement is that it's best not to use regular expressions to parse HTML, and to do it properly instead by applying a DOM parser and traversing the DOM tree.
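
As a rough illustration of the parser route, here is a minimal sketch in Python using the standard library's `html.parser` (an event-based parser rather than a full DOM, chosen only to keep the example dependency-free). The class name `TagCollector` and the `sample` string are made up for this example; the sample simply repeats the three lines from the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name lowercased, without "<" or attributes.
        self.tags.append(tag)


# The three lines from the question.
sample = '<b>\n<p class="normalText">\n<script type="text/javascript">\n'

collector = TagCollector()
collector.feed(sample)
collector.close()
print(collector.tags)
```

The parser deals with attributes and quoting itself, so the tag names come out without any hand-written pattern.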

David Hanak
You might want to change that link text from the URL to the question text so it's more readable.
cletus
Yea, I have seen them. I'm not really worried about best practice here though since it's not gonna end up in an application anyway. The biggest problem I see with what I want is to match the first char "<" but not include it in the match, if that makes sense.
miccet
@miccet: use parentheses to group the stuff you are interested in.
dirkgently
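
The grouping idea might look like the sketch below. It is written in Python purely for illustration (the asker is on ASP.NET, where the same pattern syntax should carry over), and the names `TAG_NAME` and `sample` are invented for the example. The "<" is part of the overall match, but only the parenthesized part is returned.

```python
import re

# "<" is matched but not captured; the group holds only the tag name.
TAG_NAME = re.compile(r"<(\w+)")

# The three lines from the question.
sample = '<b>\n<p class="normalText">\n<script type="text/javascript">\n'

# With exactly one group in the pattern, findall returns the group contents.
names = TAG_NAME.findall(sample)
print(names)
```

Closing tags such as `</b>` are skipped, because "/" is not a word character and so `\w+` cannot start right after the "<".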
@cletus: I might, but I'm a lazy bastard. Besides, it's not really the title that matters, given that they are all related to the same problem.
David Hanak
+3  A: 

It's virtually impossible to regex HTML once you start considering all the special cases and malformed HTML that browsers sometimes happily parse anyway. That said, I thought it might be fun to get the names without using capture groups, so I present you with the following solution:

(?<=<)\w+(?=[^<]*?>)

For the record I hold little faith in it being at all useful in any but the most trivial of cases.
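
To see what the lookaround pattern does on the lines from the question, here is a small sketch in Python (used only for illustration; the variable names are invented). The lookbehind `(?<=<)` requires a "<" immediately before the match without consuming it, and the lookahead `(?=[^<]*?>)` requires a ">" somewhere ahead before the next "<".

```python
import re

# Lookbehind and lookahead assert context without including it in the match.
TAG_NAME = re.compile(r"(?<=<)\w+(?=[^<]*?>)")

# The three lines from the question.
sample = '<b>\n<p class="normalText">\n<script type="text/javascript">\n'

names = TAG_NAME.findall(sample)
print(names)

# A non-HTML input that still matches, showing why this is fragile.
false_positive = TAG_NAME.findall("x<y and y>z")
print(false_positive)
```

The second call returns `y` even though there is no tag in that text, which is the kind of case the caveat above is about.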

Kit Sunde
It's just made for an example anyway, and doesn't need to be bulletproof. This worked perfectly, and I see how the exclude function works. Thanks a bunch.
miccet
-1 Wrong on so many levels.
cletus
@cletus: On what level is this wrong that he did not already cover?
Ant P.