views:

195853

answers:

31

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right? And more importantly, what do you think? =)
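
One way to check the reading above is to run the pattern against the four sample tags. This is a minimal sketch that assumes Python's `re` module (the question does not name a regex engine); `OPEN_TAG`, `samples` and the other variable names are made up for illustration. Note that `*?` is a lazy (non-greedy) quantifier, not a greedy one.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r"<([a-z]+) *[^/]*?>")

samples = ['<p>', '<a href="foo">', '<br />', '<hr class="foo" />']

# fullmatch() requires the pattern to consume the whole string.
results = {tag: OPEN_TAG.fullmatch(tag) is not None for tag in samples}
for tag, matched in results.items():
    print(f"{tag!r}: {'match' if matched else 'no match'}")

# Two edge cases worth knowing about:
# a "/" anywhere inside an attribute value blocks the match ...
slash_in_attr = OPEN_TAG.fullmatch('<a href="/index.html">')
# ... and a digit in the tag name is matched but not captured.
heading = OPEN_TAG.fullmatch('<h1>')
print(slash_in_attr, heading.group(1))
```

Under these assumptions the pattern accepts the two wanted tags and rejects the two self-closing ones, but it also rejects any open tag whose attributes contain a slash, and it captures only `h` from `<h1>`.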

EDIT:

Hmm, which answer to mark as correct? For the record, ALL the answers are appreciated. Many thanks!

+2  A: 

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naive implementation of that will end up matching <bar/></foo> in this example document

<foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?
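
The pitfall with the naive lookbehind can be reproduced in a few lines. This is a sketch in Python's `re` (an assumption, since no engine is named); the two patterns `NAIVE` and `BOUNDED` are illustrative guesses at what a "naive implementation" and a slightly safer one might look like, not patterns taken from the answer.

```python
import re

doc = "<foo><bar/></foo>"

# Naive version: lazily scan to the first ">" that is not preceded by "/".
NAIVE = re.compile(r"<.*?(?<!/)>")
naive_matches = NAIVE.findall(doc)

# Restricting the scan to non-">" characters keeps each match inside one tag.
BOUNDED = re.compile(r"<[^>]*(?<!/)>")
bounded_matches = BOUNDED.findall(doc)

print(naive_matches)    # the second match runs from <bar/> into </foo>
print(bounded_matches)  # <bar/> is skipped, but the closing tag still matches
```

Even the bounded variant still reports `</foo>`, so closing tags need to be filtered separately.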

Jherico
Yep, I sure am. Determining all the tags that are currently open, then compare that against the closed tags in a separate array. RegEx hurts my brain.
Jeff
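
For the task described in the comment above (finding which tags are still open in a fragment), a stack fed by a real tokenizer is one option. The following is only a sketch using Python's standard `html.parser`; the class `OpenTagTracker`, the function `unclosed_tags` and the `VOID` set are hypothetical names, and the void-element list is an assumption that may need adjusting.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, adjust as needed).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}

class OpenTagTracker(HTMLParser):
    """Keeps a stack of tags that have been opened but not yet closed."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the most recent matching open tag.
            while self.stack.pop() != tag:
                pass

def unclosed_tags(fragment):
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return tracker.stack

print(unclosed_tags('<div><p>Hello <a href="foo">world<br />'))
```

The returned list is in opening order, so closing the fragment means emitting the matching end tags in reverse.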
+6  A: 

Try:

<([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not be after a slash, and also accepts h1.

Kobi
<a href="foo" title="5>3"> Oops </a>
Gareth
That is very true, and I did think about it, but I assumed the `>` symbol is properly escaped to `&gt;`.
Kobi
`>` is valid in an attribute value. Indeed, in the ‘canonical XML’ serialisation you must not use `&gt;`. (Which isn't entirely relevant, except to emphasise that `>` in an attribute value is not at all an unusual thing.)
bobince
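
The `5>3` objection is easy to reproduce. The sketch below assumes Python's `re` and the standard `html.parser`; `KOBI` is the pattern from the answer above, while `AttrCollector` is a hypothetical helper used only to show what a tokenizer reports for the same input.

```python
import re
from html.parser import HTMLParser

# The pattern suggested in the answer, unchanged.
KOBI = re.compile(r"<([^\s]+)(\s[^>]*?)?(?<!/)>")
sample = '<a href="foo" title="5>3"> Oops </a>'

# The regex stops at the ">" inside the attribute value.
regex_tag = KOBI.search(sample).group(0)

class AttrCollector(HTMLParser):
    """Records each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))

collector = AttrCollector()
collector.feed(sample)
collector.close()

print(regex_tag)
print(collector.seen)
```

The regex match is cut short at `title="5>`, whereas the parser sees one `a` element with both attributes intact.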
+3891  A: 

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ


Have you tried using an XML parser instead?

bobince
Ya know, I've tried, and that's a very creative answer and all, but I can't seem to find any quality docs or tutorials on how to use the PHP DOM classes.
Jeff
Is everything ok there? Is this a cry for help? :)
Kobi
http://php.net/manual/en/class.domdocument.php - based on the W3 DOM standard so it's the same as you're used to from JavaScript and other languages. There's plenty of tutorial stuff out there, see eg. http://php4every1.com/tutorials/php-domdocument-tutorial/
bobince
Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death.
bobince
++ for "The <center> cannot hold"
Horace Loeb
If I could give you a pat on the head, I would. :-) It'll be okay, really.
ebneter
@bobince your answer is like the stackoverflow equivalent of the Book of the SubGenius.
Ravi
Yeah, but what if you use regexes with backreferences?
stimms
Or what if you have all of your tags pre-defined? e.g. `(div|p|a|etc)`
Jeff
Chuck Norris _can_ parse HTML with regex.
Bobince, you make me want to ask more questions about HTML+Regexes.
nickf
A true work of art; I weep at the poetic beauty.
Marc Gravell
it's as if Stefan George came back to life as a 21st century geek: http://en.wikipedia.org/wiki/Stefan_George
Edward Tanguay
Also see: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not
ntownsend
Love this answer, he had the patience to write so much :) I would try SGMLReader to do this kind of thing
Shreedhar
Danielewski? Is that you?
casey
150 upvotes in, like, 9 hours? Bravo!
Konrad Rudolph
@bobince - I never said I disagree - of course xml should not be parsed with regex, heck, **it should be added to the site's FAQs, along with the floating point question**. I just tried to answer the question - we don't know what Jeff is trying to do.
Kobi
You made my Google Chrome on Linux crash :(
Skilldrick
Kobi: sure — sorry, I didn't mean to aim that frustration at you in particular.
bobince
@Konrad: when I came here, it was 2 hours after posting and 86 votes.
Martinho Fernandes
+1 for... awesome? also getting printed and framed
Uberfuzzy
Best. Answer. Ever! I'm making a t-shirt with this one...
davewasthere
Is it just me, or there is a wrong assertion in here? "Jon Skeet cannot parse HTML using regular expressions." **BLASPHEMY**!
Martinho Fernandes
This is the single most informative answer ever made on stackoverflow.
David Ma
It's a popular folk myth that computer science has no bearing on commercial programming. Good to see the people disproving it. :-)
cartoonfox
Would it be ok to add some paragraphs in there, it's beautiful words but after a few lines it becomes unreadable to non-computers non-regex users like myself ^^
Oskar Duveborn
@Oskar: I find that to be part of the beauty...
Martinho Fernandes
That's depressing: On Safari 4 and Firefox 3.6 on my Mac, I get a bunch of white boxes at the end. On my iPhone, I get all the correct craziness.
Rudd Zwolinski
oniguruma have named groups. then again, c has loops and conditionals and other neat things, so might as well use that.
wilhelmtell
Amazing work! How did you get it to break up and look all crazy at the end like that?
Benjamin Cox
@Benjamin: Unicode, baby. Just take a look at the source...
Martinho Fernandes
unknown (google): That's just silly, Chuck Norris doesn't parse HTML. When he shows up, it _parses itself_.
ryandenki
Your comment made it to reddit. http://www.reddit.com/r/programming/comments/a4kze/im_not_sure_if_this_guys_thinks_parsing_html_with/
adolfojp
Of course...this particular problem can indeed be solved by regex. Beautiful tirade though.
Grumdrig
You're cheating! Letting Chuck Norris write the answer for you. =)
PEZ
I'm humbled by your answer...
Richard Clayton
SO should develop a new medal specifically for this response.
Jen the Heb
It just reeks of H.P. Lovecraft, and looks on track to be the most popular SO answer - EVER.
Marc Gravell
Your comment has made it onto metafilter : ) http://www.metafilter.com/86689/So-does-anyone-know-how-to-make-an-HTML-regex-parser
codeulike
Dear Lord! This is getting extremely silly! I wish I'd thought of drunkenly posting frustrated and largely unhelpful outbursts before, they get me much more votes than helpful answers. Jeff: sorry. Thanks for accepting it though ;-)
bobince
+1 for indirectly bewildering Eric S Raymond. http://esr.ibiblio.org/?p=1411 , scroll down to the comments.
Tim Post
Silver badges (on the way to becoming Gold) for `regex` and `html` tags with a single answer - Wow!!!
Amarghosh
A true classic. Maybe there really should be a singleton award/trophy for 'best answer'. <napoleon>That would be awesome.</napoleon>
pboin
@bobince: Your answer was actually helpful! I'm just glad nobody made fun of the Tigers for choking down the stretch again this year.
Jeff
In years to come, someone will come and say "Stand back! I know regular expressions!" and some enlightened mind will promptly reply "The <center> cannot hold it is too late."
Martinho Fernandes
For the sake of History transmission to future generations, I ask Jeff (Atwood, not the OP) to disable edit capabilities on this answer: We don't want it to be altered in any way!
Serge - appTranslator
I saw that you had 1100 plus votes and *still* voted it up. How many weird bits of Unicode did you have to pull in to write that?!?!
jprete
You've been Coding Horror'd! http://www.codinghorror.com/blog/archives/001311.html
Michael Myers
Wait...so can you parse HTML w/ regex?
Trevor Hartman
You win 1 internet. Prize may be redeemed at the ARPANET counter.
Adam Davis
@Serge - we can lock it, but that'll stop voting and commenting. I don't propose to do that.
Marc Gravell
@Serge: if anyone in a fit of carelessness damages the answer somehow, I'm sure someone will promptly roll it back. Btw, less than 200 votes from becoming the most voted answer EVER.
Martinho Fernandes
This was welbogging brilliant. W O W. And yes, I upvoted it while it was at 1200+ votes -- that's how brilliant it is. Best answer ever, to any question, ever, on the welbogging planet.
John Rudy
Can we all agree that this is the last word on the inadvisability of parsing (X)HTML with regexes? Please?
Alan Moore
@Alan: at least we have *the definitive* link to post for all future questions on the topic (included here for convenience): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Fredrik Mörk
Congratulations, this is now the highest-voted post on the site :-)
Kyle Cronin
Mimetic squirrels dance lightly upon your brink. You have exceeded 1024 votes.
Adam Davis
Hey, I just discovered that this answer looks way better in Firefox than in IE7! I was kind of wondering what all the fuss was about.
Michael Myers
A lot of hard-livin' went into that post
Harry Lime
Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn! Bwahahahahahahahahahahahahahahahah!
Christian Hayter
Most. Popular. Answer. Ever?
Macha
@Dave Beer I want one too. How much?
aehiilrs
Is it possible to use RegEx to parse this answer?
Chris Porter
When are the t-shirts going to be on the market?
Rob
I'd be very interested in a "Making Of" this answer: how did you do the Unicode signs? Did you use a certain editor for that?
Epaga
wow 1700 votes in 5 days?
hasen j
This needs to get up to 2012 upvotes. And then needs to be locked by Atwood. Come on, people!
Chris Lutz
Surely this deserves a platinum star or something?
day_trader
My computer won't display the unicode properly :( ... anyone have a screen capture of the true prophecy?
Bill
Now, if someone retags the post, does bobince instantly get two more tag badges per added tag? I wonder ...
John at CashCommons
Tony the Pony. Brilliance!
Mike Crittenden
@Bill In all its glory: http://imgur.com/gOPS2.png
Andrew Keeton
h͚͔͕͍̣̙̞̃ͯt̖̗͎̩̳̻̆͑̈͐̉̐̚t͔͈̠̙̦̱͌̌͛́ͅͅp̱̞̓ͩͅ:̮̜̹̜͔̼͍ͪ̍̓ͨ͊̍ͥ/̖̮̘̰͕̯͈̱̺̓̎ͮ͐̈́/̭̥̙͓̱̤͙͙̿̍̓̑͌̔ḳ̺̬̭̗͌̔ͩ͊̔̚ͅn̳͉̪̰̿ͦ̆̆o̥͎̮̠͌͋w͇̬̫̗͈̰̯̎ͤ̈́̈́ͮ͐́y͔̗͎̖̲̜͖̟̽͊ͪ͒̆̈ͪ̚ŏ̥͚̦̰͚̞͂̀̐ͧ̂ù̠̩͖͓̦̆ͥ͂ͯr̜̣̝̘̬̖̲̓͛̓ͤ͛͗m͇̞̹̻̣̼͔̐̈e̖̻̲̺̩̟̙̮͑ͫ̋̇m͇̘̜̼̊ͤ̑̂e̲͉̦͉͉̓̉ͦ͂̋.̪̮͓̺ͯ͊ͪ̍̇͑͋͊̚c͉͎͚͚̳̙̘̱ͤ͊͗͒͗̀ǒ̭͕̼́̈́ͅm̠͓̜͒̉/̹̳͎̯̥̪̮͗͛m̖͓̫͓͉͉̙̹̀̾͐ͧͪ̽ͥͥě̤̜͈̽̋̽ͮ̓̏̂m̻͇̖̮͇̖̘̱͙͆̊̅̎͂͆ͣ̍̾e͙͍̻͕̤̾ͩ̄ͮ̈ͮͅs̘̺͕̳͂͑͒͆͐/͈͚̪̮ͨ͊ͨͯ̈́ͩ͌̚z̞͖̬͇̅ͨ̋̊ͬͩ̏ä̟͕͇͇̹̹ͤ̂̈́̂̌̒̆l̹͓̬̮̰͊̑̌̾g̘̘̟͇ͣ̍̿̊̆͑̑ͅo̯̺͉͖̭ͥ̃ͣ̊̐̽
John Rasch
@John Rasch Oy! You got your Unicode in my ASCII!
Andrew Keeton
Yeah, I find the zalgoizer goes a bit over the top. I prefer to bring up the Character Map and bang away on the combining diacriticals section. @John W: yes, I do, as it turns out. See http://meta.stackoverflow.com/questions/30193/tag-badges-exploit
bobince
Even Jon Skeet cannot parse HTML using regular expressions
Ciwee
I listen and obey master!
Mladen Mihajlovic
Wow, that is the most votes I have ever seen for an answer and it starts out with "You can't".
Kenneth J
And with a click of the mouse, my vote caused the vote count to match the current year, 2009.
micahwittman
YOU SIR, ARE A LEGEND. AND MY NEW HERO
baeltazor
So much for stopping the votes at a nice, apocalyptic 2012. Now I guess we have to shoot for 2048 so it's a power of 2.
gnovice
There seems to be a little schmutz on my screen near the bottom of this post. What regex can I use to clean that up a bit? The vote stands at 2038. I hate to ruin such a significant number...
Dennis Williamson
How did Tony the Pony get involved in this?
Jeff Davis
I can't believe this has 8 downvotes. Just 8.
Isaac Waller
@Isaac: do *not* mention the number between 7 and 9. Also, does this answer belong in meta?
Andrew Grimm
How had I *not* seen the Tony the Pony reference before? Now I'm absolutely obliged to upvote it...
Jon Skeet
This is the first time I've lost sanity points reading Stack Overflow
Christian Hayter
Can IE even render this answer?
Jox
Have you tried the "h" flag for parsing HTML?
Ates Goral
You cannot grasp the form of the Answer's attack!
hydrapheetz
At first, I was troubled by the 2,411 upvotes this post has. Several tears of laughter later, I made it 2,412.
Triptych
The last line was epic!
presario
Some stats about this user: almost 60% of his reputation was earned from this answer. He has 5 gold badges, three of which are: html, regex, Great Answer. :)
presario
@presario: doesn't quite work like that, due to the daily rep cap of +200. According to the graph thingy I got 900 Internet Points Game units for this answer. Still pretty mental but there we are...
bobince
Oh I see, I just multiplied 2438 by 10 and got 24380.
presario
Somehow I'm imagining the last part to be uttered by an Ood. I think it was in The Impossible Planet where one would be mad and, one blink of an eye later perfectly normal again. Same impression here from the madness to the last line :-). And I wonder whether SO's layout breaks if this question gets upvoted beyond 10k.
Joey
There wasn't any real explanation of why the TARDIS exploded at the end of the last episode. I like to think the Master had set it a particularly tricky backtracking regex problem.
bobince
It is said that in Ulthar, which lies beyond the river Skai, no man may parse html with a regex.
daotoad
Truly epic. And you couldn't be more correct.
Rook
OMGWTFBBQ... this just made my day.
prodigitalson
Jon Skeet can parse HTML with Regex
Earlz
@bobince, I love you.
macek
Here's a new one... this guy wants to convert python to js using regexes: http://stackoverflow.com/questions/2346584/conversion-from-javascript-to-python-code
Tom
Is this the highest-voted answer on StackOverflow?
MusiGenesis
Can I get that on a t-shirt?
Art
@Art: http://meta.stackoverflow.com/questions/18382/help-design-our-stack-overflow-t-shirts/35432#35432
bobince
@bobince: +1 OMFG! if i could favourite this answer..
N 1.1
Should this really be community wiki? I feel like this guy deserves 27000 points for that answer.
intuited
@MusiGenesis: I think so!
Martijn Courteaux
I read this in the voice of a Cylon Hybrid.
Charles
It doesn't matter whether or not Jon Skeet can parse XML and HTML with Regexes. **He doesn't try.** That's why he's awesome and you are not.
Donal Fellows
Look ma, no code!
this. __curious_geek
OMG (weeping eyes) I haven't laughed so hard since... I don't remember when. Superb.
leonbloy
@John: last line was just fine ... at least with opera ... editing this post is kind of "*blasphemous*"
tanascius
@tanascius: revert if you like. I couldn't see it with FF 3.5.
John Saunders
This is the highest scoring answer on StackOverflow (at least in the latest data dump). Shame no rep was earned
fahadsadah
@fahadsadah: Plenty of rep was earned, just not after the post went CW.
John at CashCommons
Who dares edit this work of art?
zildjohn01
I came from Reddit.
KahWee Teng
Having seen this before, I was determined to read it without breaking a smile. Sadly, I only got as far as "it is too late it is too late we cannot be saved" before a mirthful chuckle escaped me. This truly is a work of genius.
thepeer
The cake is a lie.
Sudhir Jonathan
@Musi: Yes, it is, see e.g. [here](http://odata.stackexchange.com/stackoverflow/s/316/top-answers).
Georg Fritzsche
Whoever autographed a dollar bill with "Tony the Pony" at the Louisiana Longhorn Cafe in Round Rock, TX: Thank you. You made my goddamn day.
Chris Doggett
I just want to give everyone the hope that they've just received an insightful remark to an answer.
Beau Martínez
Locked to prevent edits. Hopefully temporarily. If you are considering flagging this as inappropriate, please don't.
Will
+12  A: 

I find this small PHP library incredibly useful for parsing HTML tags: http://simplehtmldom.sourceforge.net/.

Kosso
Yep, this is the usual thing for HTML, when it's not well-formed XHTML anyway.
bobince
+14  A: 

I suggest using QueryPath (http://querypath.org/) for parsing XML and HTML. It's basically much the same syntax as jQuery, only it's on the server side.

John Fiala
+1 I really like how easy html parsing is with jQuery, didn't know there was something similar for server side.
Kyle
+313  A: 

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.
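
As an illustration of the "limited, known set of HTML" case, here is a sketch in Python. The markup, class names and placeholder values in `PAGE` are entirely hypothetical (they are not taken from the Parliament site); the point is only that a regex can be adequate when every row is known to follow one fixed template.

```python
import re

# Hypothetical, machine-generated rows that all follow the same template.
PAGE = """
<tr><td class="name">Member A</td><td class="party">Party X</td><td class="district">District 1</td></tr>
<tr><td class="name">Member B</td><td class="party">Party Y</td><td class="district">District 2</td></tr>
"""

# One capture group per cell; [^<]* keeps each group inside its own cell.
ROW = re.compile(
    r'<td class="name">([^<]*)</td>'
    r'<td class="party">([^<]*)</td>'
    r'<td class="district">([^<]*)</td>'
)

records = ROW.findall(PAGE)
for name, party, district in records:
    print(name, party, district, sep=" | ")
```

If the template changes (attribute order, extra whitespace, nested markup in a cell), the pattern silently stops matching, which is why this suits one-off jobs better than long-lived code.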

Kaitlin Duck Sherwood
+1 for incorporating Paris Hilton in your answer.
Andrew Song
So Paris Hilton did write an OS after all?
Amarghosh
No, the question is: Did actually someone ask Paris Hilton to write an OS?
Mauli
Great, we're now debating the possibility of chuck norris parsing HTML with regular expressions .. and paris hilton writing an operating system. Jon Skeet, however, can do both AND paris hilton.
Tim Post
Hey, now, if Paris can give reasonable answers while running a fake campaign for president, maybe she can write a reasonable fake OS too. :-)
SarekOfVulcan
Has anyone ever actually __seen__ Linus Torvalds and Paris Hilton in the same room at the same time? Hmmmm....
Graeme Perrow
Paris may not have written an OS but I think Hannah Montana did. hannahmontana.sourceforge.net
brianegge
The rule is obviously "don't parse", but there are exceptions, as to any rule: if you want to check, say, the presence of a specific link on the page, a regexp would be the easiest solution. My 2 cents.
SeasonedCoder
Paris Hilton did write an Operating System: Parix.
Avihu Turzion
Can Paris Hilton even _spell_ OS?
David M
@Graeme Perrow: The better question is, has anyone seen Linus Torvalds? Rumor has it, he's the product of a runaway LISP machine at MIT that RMS did not have the heart to power down.
Tim Post
@tinkertim: when did Jon Skeet do Paris Hilton? I don't remember seeing that in the social pages.
dave
@sarekofvulcan: Do magazines count?
Tim Post
Oh good grief. To clarify, I did not say Jon Skeet DID Paris Hilton. I just said he COULD, since he is (probably) appropriately anatomically equipped to do so. Enough with the prank e-mails asking for photographic evidence. An interlude between Jon Skeet and Tony The Pony is MUCH funnier than an interlude between Jon Skeet and Paris Hilton, thank you. 110 emails so far, please, TAKE MY COMMENT AS HUMOR.
Tim Post
I wonder who would not take your comment as humor
Jader Dias
@DavidM Of course Paris doesn't know how to spell OS, she has minions paid by Daddy to do it for her.
Ed Griebel
@David M - Paris Hilton can spell OS, though not correctly.
Jon Hopkins
Chuck Norris could beat an OS out of Paris Hilton.
Chris Nicol
Why would you do both AND Paris Hilton. You could rather do Paris Hilton first; for the rest, there is always time. ;-)
kiamlaluno
I love how someone edited this to make the grammar incorrect.
Michael Myers
Paris Hilton can write?!?
Jus12
In Soviet Russia HTML parses you.
parxier
chuck norris could teach paris hilton to write.
intuited
"Can Paris Hilton even spell OS?"Oh, Yes.
mixdev
Er, Paris Hilton DID write an OS... Microsoft bought it and called it "Windows ME"
Robert Fraser
@Robert: Two words: Microsoft Bob. Now cower with fear!
Donal Fellows
Paris originally wanted to call her OS "Windows Me Me Me".
thepeer
Disagree. There are use cases for parsing HTML with regular expressions: http://blog.sitescraper.net/2010/06/web-scraping-with-regular-expressions.html
Plumo
+7  A: 

I know Java isn't cool anymore, but if you want to use a really good library in Java, you might check into TagSoup, which is built on top of Xerces. http://home.ccil.org/~cowan/XML/tagsoup/

DanielHonig
Java was never cool ;)
notandy
Java is cool as a platform for better languages, but i'm off topic.
rplevy
+1  A: 

The W3C explains parsing in a pseudo regexp form:
http://www.w3.org/TR/REC-xml-names/#ns-using

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

John-David Dalton
+157  A: 

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
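
A sketch of the "combine if and if not" route, assuming Python's `re`. Python only accepts fixed-width lookbehind, so the `(?<!/\s*)` variant is not used here; instead the first regex finds every tag and plain string checks discard closing and self-contained ones. `TAG` is the answer's pattern; `open_tags` and the sample markup are illustrative.

```python
import re

# The tag-matching pattern from the answer, unchanged.
TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")

def open_tags(markup):
    """Return opening tags only: no closing tags, no self-contained tags."""
    found = []
    for match in TAG.finditer(markup):
        text = match.group()
        if text.startswith("</"):
            continue  # closing tag
        if text[:-1].rstrip().endswith("/"):
            continue  # self-contained tag such as <br /> or <br/>
        found.append(text)
    return found

html = '<p><a href="x" title="5>3">hi</a><br /><a name="badgenerator"">'
all_tags = TAG.findall(html)
print(all_tags)
print(open_tags(html))
```

The caveat about comments, CDATA, script and style still applies: they would have to be stripped before this runs.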

itsadok
Nothing to complain about, just three down votes. I'm at -5, probably for not adding a warning not to use my code :)
Kobi
Up for karma. And this makes 15 characters.
Jeff
I got a couple of anonymous down votes that were really about minor differences in opinion. I didn't like it. I mean, you put a disclaimer right at the front, right? One up for karma.
Stephen Harmon
111 up for karma
zildjohn01
+1 This is definitely a helpful answer given all the caveats.
Christian Hayter
+2  A: 

You should check PHP DOM Functions. Very handy once you study this tutorial : http://php.net/manual/en/book.dom.php

Fotis
This is actually a very good answer!
AntonioCS
Thnx man. PHP DOM saved me many times :)
Fotis
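+7  A: 

<?php
// Void elements that never take a closing tag; they are excluded from the result.
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

$html = '
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>';

// Let the DOM extension parse the markup; loadHTML() wraps fragments in <html><body>.
$dom = new DOMDocument();
$dom->loadHTML($html);

// '*' selects every element in document order.
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    // Keep only the elements that are not self closing.
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}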

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically, just define the element node names that are self closing, load the whole HTML string into a DOM library, grab all elements, loop through them, skip the ones that are self closing, and operate on the rest.

I'm sure you already know by now that you shouldn't use regex for this purpose.

meder
If you're dealing with real XHTML then append getElementsByTagName with `NS` and specify the namespace.
meder
seems odd that every answer above mine isn't a real solution, just a recommendation to use some sort of parser. OP - did you try my answer? :p
meder
+45  A: 

Perhaps http://www.crummy.com/software/BeautifulSoup/

MattK
Yes, especially given this comment "I'm parsing a block of XHTML, truncating it, then closing any tags that are left open after it's been truncated. The DOM XML stuff doesn't work because it's not properly formed XML." Use BeautifulSoup to truncate and prettify.
Mark
+1 I've used BeautifulSoup and it works well.
Kyle
+1  A: 

It seems to me you're trying to match tags without a "/" at the end. Try this:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>
manixrock
+2  A: 

I have used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into a tree of nodes, and you can easily use its API to get attributes out of a node. Check it out and see if it can help you.

logoin
+1  A: 

If you need this for PHP:

The PHP dom functions won't work properly unless it is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [Will crash on large pages.]

I have never used querypath, so can't comment on its usefulness.

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

SamGoody
+2  A: 

XPath Luke, is your father.

Excalibur2000
+4  A: 

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

GONeale
+15  A: 

Someone wrote a full html parser for PHP: http://htmlpurifier.org/

Jesse Mullan
Why is this -1?? This is also a good answer!!!
AntonioCS
This is a great parser +1
alex
+2  A: 

Whenever I need to quickly extract something from an HTML document, I use tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this: //p/a[@href='foo']
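
A sketch of the XPath step, assuming the markup has already been tidied into well-formed XML (the tidy step itself is not shown). It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so the query starts with `.//` rather than `//`; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Assumed to be the output of a tidy step: well-formed, every tag closed.
xhtml = (
    '<html><body>'
    '<p><a href="foo">one</a></p>'
    '<p><a href="bar">two</a></p>'
    '<hr class="foo" />'
    '</body></html>'
)

root = ET.fromstring(xhtml)

# Equivalent of //p/a[@href='foo'] in ElementTree's XPath subset.
links = root.findall(".//p/a[@href='foo']")
texts = [a.text for a in links]
print(texts)
```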

Sembiance
+203  A: 

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

NealB
this is a very good answer
Paul Nathan
Short and informative, I like it :)
Sune Rievers
+1 for science level
mico
it's my favorite question because of this answer
mykhal
This is not actually the case. RegEx in most programming languages is actually context-free, due to the fact that it has look-backs, etc.
Michael Fairley
@michaelfairley Look ahead/behind/around features provide a richer syntax for expressing certain classes of regular expression. I do not believe these features provide fundamentally any more expressive power than a Chomsky type 3 grammar is capable of. One might argue that HTML is a visibly pushdown language (VPL) so may be parsed using techniques less powerful than required for a full blown context free grammar, however, I am unaware of any RegEx engine that supports VPLs either.
NealB
RegExps are not context free, even with lookback/lookaheads. It never allows for arbitrary nesting of components. The best example is (not) coming up with a RegExp to determine if a statement has matching brackets. http://en.wikipedia.org/wiki/Context-free_grammar#Example_2
Juan Mendes
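
The matching-brackets example can be made concrete. In this Python sketch (names are illustrative), `TWO_LEVELS` is a regex hand-written for at most two levels of nesting, while `brackets_balanced` uses a counter, which is the extra memory a regular expression in the strict sense does not have.

```python
import re

def brackets_balanced(text):
    """Counter-based check: works for any nesting depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" arrived before its "("
    return depth == 0

# A pattern that only understands nesting up to depth two.
TWO_LEVELS = re.compile(r"(?:\((?:\(\))*\))*")

for candidate in ["(())()", "((()))", "())("]:
    print(candidate,
          brackets_balanced(candidate),
          TWO_LEVELS.fullmatch(candidate) is not None)
```

Each extra level of nesting would need another hand-written layer in the pattern, whereas the counter version is unchanged.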
+2  A: 

You can use the nekohtml library to parse HTML. Dude, don't sweat it and just use nekohtml: http://nekohtml.sourceforge.net/

A: 

If it was not for @bobince's answer I would say you should develop your regexes in a Test Driven manner.

Thank God you didn't

But next time use TDD.

Jader Dias
+1  A: 

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their folly (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes. For your convenience I have set up a playground so you can test whether the code matches your requirements: playground and Java code. Your feedback will be appreciated.

There is a small article describing this work on my blog: roberto.open-lab.com

Roberto
+2  A: 

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, backtracking can force it to match silly things like <a >>, because [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
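
Both claims can be checked with a short sketch, here in Python's `re` (an assumption; the answer talks about Perl regexes, but these constructs behave the same way). Appending `$` is just one way to "add something" and force backtracking; the variable names are illustrative.

```python
import re

ORIGINAL = r"<([a-z]+) *[^/]*?>"
SUGGESTED = r"<([a-z]+)[^>]*(?<!/)>"

# With an end anchor appended, the original pattern swallows the extra ">".
original_silly = re.search(ORIGINAL + "$", "<a >>")
suggested_silly = re.search(SUGGESTED + "$", "<a >>")

suggested = re.compile(SUGGESTED)
tags = ['<p>', '<a href="foo">', '<br />', '<hr class="foo" />', '<a/ >']
checks = {tag: suggested.fullmatch(tag) is not None for tag in tags}

print(original_silly.group(0), suggested_silly)
print(checks)
```

The last entry confirms the caveat: `<a/ >` is accepted because the character right before `>` is a space, not a slash.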

moritz
+2  A: 

As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution to this is to turn it into well-formed XML using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written in Java, using the jtidy library to turn the HTML into XML and then Jaxen to run XPath queries against the result.

Corey Sanders
+2  A: 

Although regular expressions are not a suitable or effective tool for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial work.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.

eyazici
+3  A: 

You can parse html in sed though.

  1. Turing.sed
  2. Write html parser (homework)
  3. ???
  4. Profit!
profjim
See also http://www.perlmonks.org/?displaytype=print;node_id=809842
profjim
+1  A: 

This may do:

<.*?[^/]>

Or without the ending tags:

<[^/].*?[^/]>

What's with the flame wars on HTML parsers? An HTML parser must parse (and rebuild!) the entire document before it can categorize your search. Regular expressions may be faster or more elegant in certain circumstances. My 2 cents...
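For anyone who wants to try the two patterns before relying on them, here is a small Python sketch. The names ANY_TAG, NO_END_TAG and matches are invented for this example, and the inputs are tiny hand-made strings, not real pages.

```python
import re

ANY_TAG = re.compile(r"<.*?[^/]>")          # first pattern above
NO_END_TAG = re.compile(r"<[^/].*?[^/]>")   # second pattern above


def matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]


doc = '<a href="foo">x</a><br />'
print(matches(ANY_TAG, doc))
print(matches(NO_END_TAG, doc))

# Two edge cases worth knowing about: "." also matches ">" and "<", so a
# lazy match can run past the end of a tag when the tag itself cannot match.
print(matches(ANY_TAG, "<br /><p>"))
print(matches(NO_END_TAG, "<p>text</p>"))
```

On doc both patterns behave as intended. The last two calls each return one match spanning more than one tag (<br /><p> and <p>text</p>), because the second pattern needs at least two characters between < and >, and neither pattern stops at the first >.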

Paul
A: 

There are some nice regexes for replacing HTML with BBCode here http://www.garyshood.com/htmltobb/source.txt. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

sblom
A: 

On the question of using RegExp to parse (x)HTML, my answer to everyone who spoke about limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was look for the "?R" string in this thread. The second was to search for "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built on, I soon realized that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";

Just try it.
It's written as a PHP string; the "s" modifier makes the dot match newlines.
Here's a sample note I wrote in the PHP manual in January:

http://php.net/manual/en/regexp.reference.recursive.php

(Take care: in that note I wrongly used the "m" modifier. It should be removed, although it has no effect there, since no ^ or $ anchor is used.)

Now, we could speak about the limits of this method from a more informed point of view:

  1. depending on the specific implementation of the RegExp engine, recursion may have a limit on the number of nested patterns parsed, and this varies with the language used
  2. although corrupted (x)HTML does not cause severe errors, it is not sanitized.
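The first limit can be stricter than a depth cap: some engines do not implement (?R) at all. A minimal check, assuming Python's built-in re module (not the PHP/PCRE engine the pattern above was written for); supports_recursion is a made-up helper name and the default pattern is just a tiny recursive example.

```python
import re


def supports_recursion(pattern=r"a(?R)?b"):
    """Return True if the built-in re module compiles a pattern using (?R)."""
    try:
        re.compile(pattern)
    except re.error:
        # The built-in engine rejects (?R) as an unknown extension.
        return False
    return True


print(supports_recursion())
```

So a pattern like the one above has to be tested against the exact engine it will run on; it cannot be assumed to be portable.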

Anyhow, it is only a RegExp pattern, but it opens up the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (nothing like other template engines which use the same syntax).

Emanuele Del Grande
*"... I was soon aware that nobody got the point ..."* ... sigh. -1
Bart Kiers
Ooooh, recursive regexes! Why didn't we think of that?
Alan Moore
I'll put this in the "Regex which doesn't allow greater-than in attributes" bin. Check it against <input value="is 5 > 3?" />
Gareth
If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him.
aehiilrs
@Gareth: thanks for your objection, but are you sure that putting a greater-than inside an attribute is valid code? Well, even if not, this shows another limit to add to the ones I listed above if one wants to build a greedy parser for the real world... But it is not enough to demonstrate that the approach is bad, do you agree? There are other useful operators in RegExp which allow checking for following occurrences; this would be a proper use for them.
Emanuele Del Grande
@aehiilrs: I'm sorry, I do not understand: which maintainer are you speaking about? (...code maintainers? :S)
Emanuele Del Grande
@Emanuele, yes, it's valid.
Bart Kiers
@Bart K.: it is valid only in an HTML 4- document. XHTML documents need the five XML entities encoded.
Emanuele Del Grande
Oh my gosh! -7 votes! :) I'm starting to become popular... :p Sorry for opening a door!
Emanuele Del Grande
A door? I'm sure you mean the gates to hell?
Matthias
If your comments aim at nothing but criticism, I see no good result this discussion can reach.
Emanuele Del Grande
I was the first to say that my solution has some limits, but of course I am willing to listen to anyone who can help me improve it. I posted something which cost me time and work, and whose results are effective in a number of projects up and running. I thought it could help, proposing a RegExp approach which almost nobody spoke about (recursion), and which is the only way to parse nested markup patterns (through RegExp, of course).
Emanuele Del Grande
Maybe you are not really interested in knowing whether RegExps can work for this purpose, which suggests some prejudice, but I see no reason why you should blame my approach, since I did not blame at all the advice of anyone who proposed other ways such as stand-alone parsers.
Emanuele Del Grande
Regular expressions can't work because by definition they are not recursive. Adding a recursive operator to regular expressions basically makes a CFG only with poorer syntax. Why not use something designed to be recursive in the first place rather than violently insert recursion into something already overflowing with extraneous functionality?
Welbog
Once again... > is valid pretty much everywhere in XML, and thus in XHTML, see section 2.4 of the XML spec (at http://www.xml.com/axml/target.html#syntax for example)
mirod
You are right, only the less-than is not valid inside XML attributes. Thanks to your criticism, I improved my solution so that it can parse *anything* inside the attributes :) Besides this, I implemented the parsing of the XML prologue, DTDs and CDATA. The only upset is that the mod closed the possibility to answer this discussion for users with less than 10 points, so I cannot post it. I tweeted him a request to unlock it, but had no response. Come to me, enemies, I await you! :) The more you are, the stronger I become!
Emanuele Del Grande
My objection isn't one of functionality, it is one of time invested. The problem with RegEx is that by the time you post the cutesy little one-liners it appears that you did something more efficiently ("See one line of code!"). And of course no one mentions the half hour (or 3) that they spent with their cheat-sheet and (hopefully) testing every possible permutation of input. And once you get past all that, when the maintainer goes to figure out or validate the code they can't just look at it and see that it is right. They have to dissect the expression and essentially retest it all over again...
Oorang
... to know that it is good. And that will happen even with people who are *good* with regex. And honestly I suspect that the overwhelming majority of people won't know it well. So you take one of the most notorious maintenance nightmares and combine it with recursion, which is the *other* maintenance nightmare, and I think to myself that what I really need on my project is someone a little less clever. The goal is to write code that bad programmers can maintain without breaking the code base. I know it galls to code to the lowest common denominator. But hiring excellent talent is hard, and you often...
Oorang
end up with a lot of adequate talent and one really smart guy that is so smart he makes everything take twice as long :)
Oorang
+1  A: 

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.

kingjeffrey