
Question

RegEx match open tags except XHTML self-contained tags

Answer 1

+2 A:

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naive implementation of that will end up matching <bar/></foo> in this example document:

<foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?
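
The mismatch on the example document can be reproduced with a minimal Python sketch. The pattern `<.+?(?<!/)>` is only an assumed stand-in for "a naive implementation"; the answer does not spell one out.

```python
import re

# Assumed naive pattern: a '<', as few characters as possible, then a '>'
# that is not directly preceded by '/'.
NAIVE_OPEN_TAG = re.compile(r"<.+?(?<!/)>")

document = "<foo><bar/></foo>"
matches = NAIVE_OPEN_TAG.findall(document)

for match in matches:
    print(match)
# The lookbehind rejects the '>' of <bar/>, so the lazy '.+?' keeps growing
# until it reaches the '>' of </foo>, and the second match covers both tags.
```

The lookbehind only says where a match may not end. It does not stop the match from running on into the next tag.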

Jherico 2009-11-13 22:47:17

Yep, I sure am. Determining all the tags that are currently open, then compare that against the closed tags in a separate array. RegEx hurts my brain.

Jeff 2009-11-13 23:04:54

Answer 2

+6 A:

Try:

<([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not come after a slash, and it also accepts h1.

Kobi 2009-11-13 22:50:48

Gareth 2009-11-13 23:11:39

That is very true, and I did think about it, but I assumed the `>` symbol is properly escaped to `&gt;`.

Kobi 2009-11-13 23:16:59

`>` is valid in an attribute value. Indeed, in the ‘canonical XML’ serialisation you must not use `&gt;`. (Which isn't entirely relevant, except to emphasise that `>` in an attribute value is not at all an unusual thing.)

bobince 2009-11-14 00:15:33
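
The point about a literal `>` inside an attribute value can be checked against the pattern from this answer. This is a small Python sketch; the sample tag is invented for illustration.

```python
import re

# Pattern copied from the answer above.
OPEN_TAG = re.compile(r"<([^\s]+)(\s[^>]*?)?(?<!/)>")

# Hypothetical tag whose title attribute contains a literal '>'.
html = '<a title="a > b" href="#">link</a>'

first = OPEN_TAG.search(html)
print(first.group(0))
# The match stops at the '>' inside the attribute value, so the reported
# "tag" is cut short and the href attribute is lost.
```

Any pattern that treats the first `>` as the end of the tag has this problem, unless it also understands quoted attribute values.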

Answer 3

+3891 A:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes~~, the pestilent sl~~ithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expre~~ssion parsing~~ will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮om~~es he co~~mes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Have you tried using an XML parser instead?
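
For well-formed XHTML, the parser route might look like the following Python sketch. It uses the standard library's `xml.etree.ElementTree`. The input fragment and the "has children or text" test are assumptions made for illustration, not something taken from the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment with a single root element.
xhtml = '<div><p><a href="#">foo</a></p><hr/><br/><div>name</div></div>'

root = ET.fromstring(xhtml)

# Treat an element as "not self-contained" if it has children or text.
# This is an approximation: an XML parser reports <br/> and <br></br>
# identically, so emptiness is the closest available signal.
open_tags = [
    el.tag
    for el in root.iter()
    if len(el) > 0 or (el.text or "").strip()
]
print(open_tags)
```

This only works when the input really is well-formed XML. Tag soup from the open web needs an HTML-aware parser instead.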

bobince 2009-11-13 23:04:30

Ya know, I've tried, and that's a very creative answer and all, but I can't seem to find any quality docs or tutorials on how to use the PHP DOM classes.

Jeff 2009-11-13 23:06:40

Is everything ok there? Is this a cry for help? :)

Kobi 2009-11-13 23:07:04

http://php.net/manual/en/class.domdocument.php - based on the W3 DOM standard so it's the same as you're used to from JavaScript and other languages. There's plenty of tutorial stuff out there, see eg. http://php4every1.com/tutorials/php-domdocument-tutorial/

bobince 2009-11-13 23:12:27

Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death.

bobince 2009-11-13 23:18:03

++ for "The <center> cannot hold"

Horace Loeb 2009-11-13 23:27:14

If I could give you a pat on the head, I would. :-) It'll be okay, really.

ebneter 2009-11-13 23:31:43

@bobince your answer is like the stackoverflow equivalent of the Book of the SubGenius.

Ravi 2009-11-13 23:37:48

Yeah, but what if you use regexes with backreferences?

stimms 2009-11-13 23:43:21

Or what if you have all of your tags pre-defined? e.g. `(div|p|a|etc)`

Jeff 2009-11-13 23:51:08

Chuck Norris _can_ parse HTML with regex.

2009-11-14 00:03:42

Bobince, you make me want to ask more questions about HTML+Regexes.

nickf 2009-11-14 00:06:09

A true work of art; I weep at the poetic beauty.

Marc Gravell 2009-11-14 00:29:16

it's as if Stefan George came back to life as a 21st century geek: http://en.wikipedia.org/wiki/Stefan_George

Edward Tanguay 2009-11-14 01:46:57

Also see: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not

ntownsend 2009-11-14 02:41:37

Love this answer, he had the patience to write so much :) I would try SGMLReader to do this kind of thing

Shreedhar 2009-11-14 02:50:24

Danielewski? Is that you?

casey 2009-11-14 04:43:13

150 upvotes in, like, 9 hours? Bravo!

Konrad Rudolph 2009-11-14 08:49:10

@bobince - I never said I disagree - of course xml should not be parsed with regex, heck, **it should be added to the site's FAQs, along with the floating point question**. I just tried to answer the question - we don't know what Jeff is trying to do.

Kobi 2009-11-14 10:20:17

You made my Google Chrome on Linux crash :(

Skilldrick 2009-11-14 10:42:53

Kobi: sure — sorry, I didn't mean to aim that frustration at you in particular.

bobince 2009-11-14 12:15:29

@Konrad: when I came here, it was 2 hours after posting and 86 votes.

Martinho Fernandes 2009-11-14 12:39:09

+1 for... awesome? also getting printed and framed

Uberfuzzy 2009-11-14 13:17:56

Best. Answer. Ever! I'm making a t-shirt with this one...

davewasthere 2009-11-14 13:57:49

Is it just me, or is there a wrong assertion in here? "Jon Skeet cannot parse HTML using regular expressions." **BLASPHEMY**!

Martinho Fernandes 2009-11-14 16:33:14

This is the single most informative answer ever made on stackoverflow.

David Ma 2009-11-14 18:53:41

It's a popular folk myth that computer science has no bearing on commercial programming. Good to see the people disproving it. :-)

cartoonfox 2009-11-14 18:59:21

Would it be ok to add some paragraphs in there, it's beautiful words but after a few lines it becomes unreadable to non-computers non-regex users like myself ^^

Oskar Duveborn 2009-11-14 19:01:39

@Oskar: I find that to be part of the beauty...

Martinho Fernandes 2009-11-14 19:11:30

That's depressing: On Safari 4 and Firefox 3.6 on my Mac, I get a bunch of white boxes at the end. On my iPhone, I get all the correct craziness.

Rudd Zwolinski 2009-11-14 19:18:23

oniguruma have named groups. then again, c has loops and conditionals and other neat things, so might as well use that.

wilhelmtell 2009-11-14 19:48:21

Amazing work! How did you get it to break up and look all crazy at the end like that?

Benjamin Cox 2009-11-14 22:23:21

@Benjamin: Unicode, baby. Just take a look at the source...

Martinho Fernandes 2009-11-14 22:31:37

unknown (google): That's just silly, Chuck Norris doesn't parse HTML. When he shows up, it _parses itself_.

ryandenki 2009-11-15 14:40:09

Your comment made it to reddit. http://www.reddit.com/r/programming/comments/a4kze/im_not_sure_if_this_guys_thinks_parsing_html_with/

adolfojp 2009-11-15 15:53:24

Of course...this particular problem can indeed be solved by regex. Beautiful tirade though.

Grumdrig 2009-11-15 16:08:24

You're cheating! Letting Chuck Norris write the answer for you. =)

PEZ 2009-11-15 16:18:58

I'm humbled by your answer...

Richard Clayton 2009-11-15 16:59:48

SO should develop a new medal specifically for this response.

Jen the Heb 2009-11-15 18:15:48

It just reeks of H.P. Lovecraft, and looks on track to be the most popular SO answer - EVER.

Marc Gravell 2009-11-15 20:21:13

Your comment has made it onto metafilter : ) http://www.metafilter.com/86689/So-does-anyone-know-how-to-make-an-HTML-regex-parser

codeulike 2009-11-15 22:19:58

Dear Lord! This is getting extremely silly! I wish I'd thought of drunkenly posting frustrated and largely unhelpful outbursts before, they get me much more votes than helpful answers. Jeff: sorry. Thanks for accepting it though ;-)

bobince 2009-11-15 22:44:16

+1 for indirectly bewildering Eric S Raymond. http://esr.ibiblio.org/?p=1411 , scroll down to the comments.

Tim Post 2009-11-16 05:10:03

Silver badges (on the way to becoming Gold) for `regex` and `html` tags with a single answer - Wow!!!

Amarghosh 2009-11-16 09:53:27

A true classic. Maybe there really should be a singleton award/trophy for 'best answer'. <napoleon>That would be awesome.</napoleon>

pboin 2009-11-16 10:24:00

@bobince: Your answer was actually helpful! I'm just glad nobody made fun of the Tigers for choking down the stretch again this year.

Jeff 2009-11-16 10:38:18

In years to come, someone will come and say "Stand back! I know regular expressions!" and some enlightened mind will promptly reply "The <center> cannot hold it is too late."

Martinho Fernandes 2009-11-16 15:45:09

For the sake of History transmission to future generations, I ask Jeff (Atwood, not the OP) to disable edit capabilities on this answer: We don't want it to be altered in any way!

Serge - appTranslator 2009-11-16 17:18:07

I saw that you had 1100 plus votes and *still* voted it up. How many weird bits of Unicode did you have to pull in to write that?!?!

jprete 2009-11-16 18:48:32

You've been Coding Horror'd! http://www.codinghorror.com/blog/archives/001311.html

Michael Myers 2009-11-16 18:52:33

Wait...so can you parse HTML w/ regex?

Trevor Hartman 2009-11-16 20:05:06

You win 1 internet. Prize may be redeemed at the ARPANET counter.

Adam Davis 2009-11-16 20:10:20

@Serge - we can lock it, but that'll stop voting and commenting. I don't propose to do that.

Marc Gravell 2009-11-16 20:18:25

@Serge: if anyone in a fit of carelessness damages the answer somehow, I'm sure someone will promptly roll it back. Btw, less than 200 votes from becoming the most voted answer EVER.

Martinho Fernandes 2009-11-16 20:50:04

This was welbogging brilliant. W O W. And yes, I upvoted it while it was at 1200+ votes -- that's how brilliant it is. Best answer ever, to any question, ever, on the welbogging planet.

John Rudy 2009-11-16 20:56:25

Can we all agree that this is the last word on the inadvisability of parsing (X)HTML with regexes? Please?

Alan Moore 2009-11-16 22:11:54

@Alan: at least we have *the definitive* link to post for all future questions on the topic (included here for convenience): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Fredrik Mörk 2009-11-16 23:20:55

Congratulations, this is now the highest-voted post on the site :-)

Kyle Cronin 2009-11-17 02:08:02

Mimetic squirrels dance lightly upon your brink. You have exceeded 1024 votes.

Adam Davis 2009-11-17 03:23:16

Hey, I just discovered that this answer looks way better in Firefox than in IE7! I was kind of wondering what all the fuss was about.

Michael Myers 2009-11-17 05:27:31

A lot of hard-livin' went into that post

Harry Lime 2009-11-17 09:20:27

Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn! Bwahahahahahahahahahahahahahahahah!

Christian Hayter 2009-11-17 10:28:54

Most. Popular. Answer. Ever?

Macha 2009-11-17 16:32:27

@Dave Beer I want one too. How much?

aehiilrs 2009-11-17 18:18:51

Is it possible to use RegEx to parse this answer?

Chris Porter 2009-11-17 18:26:15

When are the t-shirts going to be on the market?

Rob 2009-11-17 20:16:30

I'd be very interested in a "Making Of" this answer: how did you do the Unicode signs? Did you use a certain editor for that?

Epaga 2009-11-18 08:22:34

wow 1700 votes in 5 days?

hasen j 2009-11-18 14:58:22

This needs to get up to 2012 upvotes. And then needs to be locked by Atwood. Come on, people!

Chris Lutz 2009-11-18 18:05:02

Surely this deserves a platinum star or something?

day_trader 2009-11-18 20:18:11

My computer won't display the unicode properly :( ... anyone have a screen capture of the true prophecy?

Bill 2009-11-18 21:31:56

Now, if someone retags the post, does bobince instantly get two more tag badges per added tag? I wonder ...

John at CashCommons 2009-11-19 00:58:54

Tony the Pony. Brilliance!

Mike Crittenden 2009-11-19 14:24:21

@Bill In all its glory: http://imgur.com/gOPS2.png

Andrew Keeton 2009-11-19 14:37:27

h͚͔͕͍̣̙̞̃ͯt̖̗͎̩̳̻̆͑̈͐̉̐̚t͔͈̠̙̦̱͌̌͛́ͅͅp̱̞̓ͩͅ:̮̜̹̜͔̼͍ͪ̍̓ͨ͊̍ͥ/̖̮̘̰͕̯͈̱̺̓̎ͮ͐̈́/̭̥̙͓̱̤͙͙̿̍̓̑͌̔ḳ̺̬̭̗͌̔ͩ͊̔̚ͅn̳͉̪̰̿ͦ̆̆o̥͎̮̠͌͋w͇̬̫̗͈̰̯̎ͤ̈́̈́ͮ͐́y͔̗͎̖̲̜͖̟̽͊ͪ͒̆̈ͪ̚ŏ̥͚̦̰͚̞͂̀̐ͧ̂ù̠̩͖͓̦̆ͥ͂ͯr̜̣̝̘̬̖̲̓͛̓ͤ͛͗m͇̞̹̻̣̼͔̐̈e̖̻̲̺̩̟̙̮͑ͫ̋̇m͇̘̜̼̊ͤ̑̂e̲͉̦͉͉̓̉ͦ͂̋.̪̮͓̺ͯ͊ͪ̍̇͑͋͊̚c͉͎͚͚̳̙̘̱ͤ͊͗͒͗̀ǒ̭͕̼́̈́ͅm̠͓̜͒̉/̹̳͎̯̥̪̮͗͛m̖͓̫͓͉͉̙̹̀̾͐ͧͪ̽ͥͥě̤̜͈̽̋̽ͮ̓̏̂m̻͇̖̮͇̖̘̱͙͆̊̅̎͂͆ͣ̍̾e͙͍̻͕̤̾ͩ̄ͮ̈ͮͅs̘̺͕̳͂͑͒͆͐/͈͚̪̮ͨ͊ͨͯ̈́ͩ͌̚z̞͖̬͇̅ͨ̋̊ͬͩ̏ä̟͕͇͇̹̹ͤ̂̈́̂̌̒̆l̹͓̬̮̰͊̑̌̾g̘̘̟͇ͣ̍̿̊̆͑̑ͅo̯̺͉͖̭ͥ̃ͣ̊̐̽

John Rasch 2009-11-19 15:48:02

@John Rasch Oy! You got your Unicode in my ASCII!

Andrew Keeton 2009-11-19 17:28:06

Yeah, I find the zalgoizer goes a bit over the top. I prefer to bring up the Character Map and bang away on the combining diacriticals section. @John W: yes, I do, as it turns out. See http://meta.stackoverflow.com/questions/30193/tag-badges-exploit

bobince 2009-11-19 18:08:34

Even Jon Skeet cannot parse HTML using regular expressions

Ciwee 2009-11-22 04:56:03

I listen and obey master!

Mladen Mihajlovic 2009-11-26 08:18:01

Wow, that is the most votes I have ever seen for an answer and it starts out with "You can't".

Kenneth J 2009-11-30 22:24:48

And with a click of the mouse, my vote caused the vote count to match the current year, 2009.

micahwittman 2009-12-01 22:37:12

YOU SIR, ARE A LEGEND. AND MY NEW HERO

baeltazor 2009-12-02 10:43:18

So much for stopping the votes at a nice, apocalyptic 2012. Now I guess we have to shoot for 2048 so it's a power of 2.

gnovice 2009-12-03 15:55:59

There seems to be a little schmutz on my screen near the bottom of this post. What regex can I use to clean that up a bit? The vote stands at 2038. I hate to ruin such a significant number...

Dennis Williamson 2009-12-04 01:43:32

How did Tony the Pony get involved in this?

Jeff Davis 2009-12-04 14:39:33

I can't believe this has 8 downvotes. Just 8.

Isaac Waller 2009-12-05 01:27:57

@Isaac: do *not* mention the number between 7 and 9. Also, does this answer belong in meta?

Andrew Grimm 2009-12-05 07:05:18

How had I *not* seen the Tony the Pony reference before? Now I'm absolutely obliged to upvote it...

Jon Skeet 2009-12-07 11:21:37

This is the first time I've lost sanity points reading Stack Overflow

Christian Hayter 2009-12-08 08:32:53

Can IE even render this answer?

Jox 2009-12-12 11:42:11

Have you tried the "h" flag for parsing HTML?

Ates Goral 2010-01-05 20:23:36

You cannot grasp the form of the Answer's attack!

hydrapheetz 2010-01-09 07:18:49

At first, I was troubled by the 2,411 upvotes this post has. Several tears of laughter later, I made it 2,412.

Triptych 2010-01-12 19:15:14

The last line was epic!

presario 2010-01-18 19:04:02

Some stats about this user: almost 60% of his reputation was earned from this answer. He has 5 gold badges, three of which are: html, regex, Great Answer. :)

presario 2010-01-18 19:09:51

@presario: doesn't quite work like that, due to the daily rep cap of +200. According to the graph thingy I got 900 Internet Points Game units for this answer. Still pretty mental but there we are...

bobince 2010-01-18 19:59:30

Oh I see, I just multiplied 2438 by 10 and got 24380.

presario 2010-01-19 05:34:23

Somehow I'm imagining the last part to be uttered by an Ood. I think it was in The Impossible Planet where one would be mad and, one blink of an eye later perfectly normal again. Same impression here from the madness to the last line :-). And I wonder whether SO's layout breaks if this question gets upvoted beyond 10k.

Joey 2010-01-21 16:04:49

There wasn't any real explanation of why the TARDIS exploded at the end of the last episode. I like to think the Master had set it a particularly tricky backtracking regex problem.

bobince 2010-01-21 21:59:42

It is said that in Ulthar, which lies beyond the river Skai, no man may parse html with a regex.

daotoad 2010-01-25 18:30:14

Truly epic. And you couldn't be more correct.

Rook 2010-01-26 07:56:47

OMGWTFBBQ... this just made my day.

prodigitalson 2010-02-11 17:57:30

Jon Skeet can parse HTML with Regex

Earlz 2010-02-14 19:30:19

@bobince, I love you.

macek 2010-02-25 17:12:22

Here's a new one... this guy wants to convert python to js using regexes: http://stackoverflow.com/questions/2346584/conversion-from-javascript-to-python-code

Tom 2010-02-27 08:16:14

Is this the highest-voted answer on StackOverflow?

MusiGenesis 2010-03-10 18:11:57

Can I get that on a t-shirt?

Art 2010-03-14 12:53:19

@Art: http://meta.stackoverflow.com/questions/18382/help-design-our-stack-overflow-t-shirts/35432#35432

bobince 2010-03-14 15:05:54

@bobince: +1 OMFG! if i could favourite this answer..

N 1.1 2010-03-15 15:20:53

Should this really be community wiki? I feel like this guy deserves 27000 points for that answer.

intuited 2010-03-29 01:03:37

@MusiGenesis: I think so!

Martijn Courteaux 2010-04-18 11:57:40

I read this in the voice of a Cylon Hybrid.

Charles 2010-04-19 23:26:36

It doesn't matter whether or not Jon Skeet can parse XML and HTML with Regexes. **He doesn't try.** That's why he's awesome and you are not.

Donal Fellows 2010-04-23 09:51:42

Look ma, no code!

this. __curious_geek 2010-04-26 10:29:39

OMG (weeping eyes) I hadn't laughed so hard since... I don't remember when. Superb.

leonbloy 2010-04-30 00:41:02

@John: last line was just fine ... at least with opera ... editing this post is kind of "*blasphemous*"

tanascius 2010-05-25 14:55:17

@tanascius: revert if you like. I couldn't see it with FF 3.5.

John Saunders 2010-05-25 15:45:01

This is the highest scoring answer on StackOverflow (at least in the latest data dump). Shame no rep was earned

fahadsadah 2010-06-27 16:46:19

@fahadsadah: Plenty of rep was earned, just not after the post went CW.

John at CashCommons 2010-06-28 17:09:34

Who dares edit this work of art?

zildjohn01 2010-07-05 01:14:42

I came from Reddit.

KahWee Teng 2010-07-05 15:39:40

Having seen this before, I was determined to read it without breaking a smile. Sadly, I only got as far as "it is too late it is too late we cannot be saved" before a mirthful chuckle escaped me. This truly is a work of genius.

thepeer 2010-07-05 17:02:27

The cake is a lie.

Sudhir Jonathan 2010-07-05 19:52:42

@Musi: Yes, it is, see e.g. [here](http://odata.stackexchange.com/stackoverflow/s/316/top-answers).

Georg Fritzsche 2010-07-10 07:47:58

Whoever autographed a dollar bill with "Tony the Pony" at the Louisiana Longhorn Cafe in Round Rock, TX: Thank you. You made my goddamn day.

Chris Doggett 2010-08-06 02:30:35

I just want to give everyone the hope that they've just received an insightful remark to an answer.

Beau Martínez 2010-08-25 18:59:36

Locked to prevent edits. Hopefully temporarily. If you are considering flagging this as inappropriate, please don't.

Will 2010-08-27 17:43:17

Answer 4

+12 A:

I find this small PHP library incredibly useful for parsing HTML tags: http://simplehtmldom.sourceforge.net/.

Kosso 2009-11-13 23:27:37

Yep, this is the usual thing for HTML, when it's not well-formed XHTML anyway.

bobince 2009-11-13 23:34:51

Answer 5

+14 A:

I suggest using QueryPath (http://querypath.org/) for parsing XML and HTML. It's basically much the same syntax as jQuery, only it's on the server side.

John Fiala 2009-11-13 23:44:50

+1 I really like how easy html parsing is with jQuery, didn't know there was something similar for server side.

Kyle 2010-09-14 20:33:03

Answer 6

+313 A:

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.
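
A one-off scrape of that kind might look like the sketch below. The markup, class names and member data are all invented for illustration; the answer does not show the real page or the patterns that were used.

```python
import re

# Hypothetical markup: one table row per member, always in this exact shape.
PAGE = """
<tr><td class="name">Jane Example</td><td class="party">Example Party</td><td class="district">Sampleton</td></tr>
<tr><td class="name">John Sample</td><td class="party">Other Party</td><td class="district">Testville</td></tr>
"""

# The pattern relies on the cells appearing in a fixed order with fixed
# attributes, which is only reasonable for a known, limited set of pages.
ROW = re.compile(
    r'<td class="name">([^<]*)</td>'
    r'<td class="party">([^<]*)</td>'
    r'<td class="district">([^<]*)</td>'
)

members = ROW.findall(PAGE)
for name, party, district in members:
    print(name, party, district, sep=" | ")
```

The pattern is tied to one exact page layout. If the markup changes at all, it silently stops matching, which is acceptable for a one-time job and dangerous for anything long-lived.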

Kaitlin Duck Sherwood 2009-11-14 06:27:19

+1 for incorporating Paris Hilton in your answer.

Andrew Song 2009-11-14 18:49:20

So Paris Hilton did write an OS after all?

Amarghosh 2009-11-15 15:16:24

No, the question is: Did actually someone ask Paris Hilton to write an OS?

Mauli 2009-11-15 21:47:32

Great, we're now debating the possibility of chuck norris parsing HTML with regular expressions .. and paris hilton writing an operating system. Jon Skeet, however, can do both AND paris hilton.

Tim Post 2009-11-16 05:12:32

Hey, now, if Paris can give reasonable answers while running a fake campaign for president, maybe she can write a reasonable fake OS too. :-)

SarekOfVulcan 2009-11-16 15:34:57

Has anyone ever actually __seen__ Linux Torvalds and Paris Hilton in the same room at the same time? Hmmmm....

Graeme Perrow 2009-11-16 18:06:14

Paris may not have written an OS but I think Hannah Montana did. hannahmontana.sourceforge.net

brianegge 2009-11-17 02:46:19

The rule is obviously "don't parse", but there are exceptions, as to any rule: if you want to check, say, the presence of a specific link on the page, a regexp would be the easiest solution. My 2 cents.

SeasonedCoder 2009-11-17 03:27:31

Paris Hilton did write an Operating System: Parix.

Avihu Turzion 2009-11-17 08:04:09

Can Paris Hilton even _spell_ OS?

David M 2009-11-17 11:30:10

@Graeme Perrow: The better question is, has anyone seen Linus Torvalds? Rumor has it, he's the product of a runaway LISP machine at MIT that RMS did not have the heart to power down.

Tim Post 2009-11-17 11:31:41

@tinkertim: when did Jon Skeet do Paris Hilton? I don't remember seeing that in the social pages.

dave 2009-11-17 22:32:26

@sarekofvulcan: Do magazines count?

Tim Post 2009-11-24 18:53:05

Oh good grief. To clarify, I did not say Jon Skeet DID Paris Hilton. I just said he COULD, since he is (probably) appropriately anatomically equipped to do so. Enough with the prank e-mails asking for photographic evidence. An interlude between Jon Skeet and Tony The Pony is MUCH funnier than an interlude between Jon Skeet and Paris Hilton, thank you. 110 emails so far, please, TAKE MY COMMENT AS HUMOR.

Tim Post 2009-11-24 18:58:36

I wonder who would not take your comment as humor

Jader Dias 2009-12-04 01:42:36

@DavidM Of course Paris doesn't know how to spell OS, she has minions paid by Daddy to do it for her.

Ed Griebel 2009-12-04 14:47:26

@David M - Paris Hilton can spell OS, though not correctly.

Jon Hopkins 2009-12-09 12:38:08

Chuck Norris could beat an OS out of Paris Hilton.

Chris Nicol 2009-12-18 05:57:38

Why would you do both AND Paris Hilton. You could rather do Paris Hilton first; for the rest, there is always time. ;-)

kiamlaluno 2009-12-21 15:57:43

I love how someone edited this to make the grammar incorrect.

Michael Myers 2010-02-01 22:55:30

Paris Hilton can write?!?

Jus12 2010-02-21 02:22:06

In Soviet Russia HTML parses you.

parxier 2010-03-14 13:01:55

chuck norris could teach paris hilton to write.

intuited 2010-03-29 01:05:53

"Can Paris Hilton even spell OS?"Oh, Yes.

mixdev 2010-05-05 18:10:44

Er, Paris Hilton DID write an OS... Microsoft bought it and called it "Windows ME"

Robert Fraser 2010-05-31 10:34:34

@Robert: Two words: Microsoft Bob. Now cower with fear!

Donal Fellows 2010-07-05 14:44:00

Paris originally wanted to call her OS "Windows Me Me Me".

thepeer 2010-07-05 17:05:11

Disagree. There are use cases for parsing HTML with regular expressions: http://blog.sitescraper.net/2010/06/web-scraping-with-regular-expressions.html

Plumo 2010-09-14 09:05:41

Answer 7

+7 A:

I know Java isn't cool anymore, but if you want to use a really good library in Java, you might check into TagSoup, which is built on top of Xerces. http://home.ccil.org/~cowan/XML/tagsoup/

DanielHonig 2009-11-14 20:24:42

Java was never cool ;)

notandy 2009-12-08 19:54:35

Java is cool as a platform for better languages, but i'm off topic.

rplevy 2010-03-24 23:53:54

Answer 8

+1 A:

The W3C explains parsing in a pseudo regexp form:
http://www.w3.org/TR/REC-xml-names/#ns-using

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

John-David Dalton 2009-11-15 06:18:15

Answer 9

+157 A:

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.
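
A quick way to see what the pattern does is to run it over a small sample. The Python sketch below copies the pattern from this answer; the sample string is invented for illustration.

```python
import re

# Pattern copied from the answer; quoted attribute values are consumed as
# whole units, so a '>' inside quotes does not end the tag.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

# Hypothetical input: a '>' inside an attribute, a closing tag, and the
# stray-quote case mentioned above.
html = '<a title="x > y" href="#">link</a><a name="badgenerator"">'

tags = TAG.findall(html)
for tag in tags:
    print(tag)
```

The pattern matches closing tags such as `</a>` as well, so picking out only the opening tags still needs a further step.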

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.
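
The "combine if and if not" route can be sketched in Python, where it is also the practical choice: Python's `re` module accepts only fixed-width lookbehind, so `(?<!/\s*)` does not compile there. Dropping closing tags is an extra assumption about what the question wants, not something this answer states.

```python
import re

# "if": the tag pattern from the answer above.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
# "if not": a tag that ends in '/' plus optional whitespace before '>'.
SELF_CLOSING = re.compile(r"/\s*>$")


def open_tags(html):
    """Return matched tags that are neither closing nor self-closing."""
    return [
        tag
        for tag in TAG.findall(html)
        if not tag.startswith("</") and not SELF_CLOSING.search(tag)
    ]


# Hypothetical sample input.
sample = '<p><a href="#">foo</a></p><hr/><br /><div>name</div>'
result = open_tags(sample)
print(result)
```

Splitting the check into two simple patterns keeps each one readable. The caveat about comments, CDATA, script and style blocks still applies.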

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...

itsadok 2009-11-15 06:37:18

Nothing to complain about, just three down votes. I'm at -5, probably for not adding a warning not to use my code :)

Kobi 2009-11-16 14:10:35

Up for karma. And this makes 15 characters.

Jeff 2009-11-16 14:31:46

I got a couple of anonymous down votes that were really about minor differences in opinion. I didn't like it. I mean, you put a disclaimer right at the front, right? One up for karma.

Stephen Harmon 2010-03-30 14:53:27

111 up for karma

zildjohn01 2010-05-28 10:11:31

+1 This is definitely a helpful answer given all the caveats.

Christian Hayter 2010-06-14 08:57:20

Answer 10

+2 A:

You should check PHP DOM Functions. Very handy once you study this tutorial : http://php.net/manual/en/book.dom.php

Fotis 2009-11-15 14:25:20

This is actually a very good answer!

AntonioCS 2009-12-28 22:11:52

Thnx man. PHP DOM saved me many times :)

Fotis 2010-01-09 14:59:06

Answer 11

+7 A:

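<?php
// Element names that never take a closing tag.
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

// Sample input.
$html = '
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>';

// loadHTML() parses the fragment as a full document, which is why
// html and body show up in the output.
$dom = new DOMDocument();
$dom->loadHTML($html);
// '*' selects every element in the document.
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    // Report only elements that are not in the self-closing list.
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}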

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically just define the element node names that are self-closing, load the whole html string into a DOM library, grab all elements, loop through, keep the ones which aren't self-closing, and operate on them.
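
For comparison, the same idea can be sketched in Python with the standard library's `html.parser`. This is a rough equivalent rather than a port: the class name `OpenTagCollector` is made up for illustration, and `html.parser` does not build a tree or add implied `html` and `body` elements, so those two names do not appear.

```python
from html.parser import HTMLParser

# Same element names as the $selfClosing list in the PHP code.
VOID_ELEMENTS = set(
    "area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed".split(",")
)


class OpenTagCollector(HTMLParser):
    """Collects the names of start tags that are not void elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <hr/>-style tags via the default handle_startendtag.
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)


html = '<p><a href="#">foo</a></p>\n<hr/>\n<br/>\n<div>name</div>'
collector = OpenTagCollector()
collector.feed(html)
collector.close()
print(collector.open_tags)
```

Because the filter uses element names instead of the `/>` spelling, a bare `<br>` is also excluded.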

I'm sure you already know by now that you shouldn't use regex for this purpose.

meder 2009-11-15 14:37:06

If you're dealing with real XHTML then append getElementsByTagName with `NS` and specify the namespace.

meder 2009-11-15 14:39:44

seems odd that every answer above mine isn't a real solution, just a recommendation to use some sort of parser. OP - did you try my answer? :p

meder 2010-01-09 05:11:34

Answer 12

+45 A:

Perhaps http://www.crummy.com/software/BeautifulSoup/

MattK 2009-11-15 17:06:14

Yes, especially given this comment "I'm parsing a block of XHTML, truncating it, then closing any tags that are left open after it's been truncated. The DOM XML stuff doesn't work because it's not properly formed XML." Use BeautifulSoup to truncate and prettify.

Mark 2009-11-15 18:51:47

+1 I've used BeautifulSoup and it works well.

Kyle 2010-09-14 20:31:12

Answer 13

+1 A:

It seems to me you're trying to match tags without a "/" at the end. Try this:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>

manixrock 2009-11-15 17:13:19
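
A small Python check of the pattern from this answer, run on an invented sample string:

```python
import re

# Pattern copied from the answer above; group 1 captures the tag name.
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>")

# Hypothetical sample input with both <hr/> and <br /> spellings.
html = '<p><a href="#">foo</a></p><hr/><br /><div>name</div>'

names = OPEN_TAG.findall(html)
print(names)
```

Closing tags are skipped because the name has to start with a letter, and both self-closing spellings are rejected by the lookbehind. A literal `>` inside an attribute value would still end the match early.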

Answer 14

+2 A:

I used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into different tree nodes, and you can easily use its API to get attributes out of a node. Check it out and see if this can help you.

logoin 2009-11-16 18:34:50

Answer 15

+1 A:

If you need this for PHP:

The PHP dom functions won't work properly unless the input is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy (it will crash on large pages).

I have never used querypath, so can't comment on its usefulness.

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

SamGoody 2009-11-16 19:02:48

Answer 16

+2 A:

XPath Luke, is your father.

Excalibur2000 2009-11-16 20:11:44

Answer 17

+4 A:

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

GONeale 2009-11-16 23:15:03

Answer 18

+15 A:

Someone wrote a full html parser for PHP: http://htmlpurifier.org/

Jesse Mullan 2009-11-17 00:05:53

Why is this -1?? This is also a good answer!!!

AntonioCS 2009-12-28 23:01:52

This is a great parser +1

alex 2010-02-12 03:26:24

Answer 19

+2 A:

Whenever I need to quickly extract something from an HTML document, I use tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this: //p/a[@href='foo']
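
The XPath step might look like this in Python, assuming the document has already been cleaned up into well-formed XML. The sample markup is invented, and real tidy output may carry an XHTML namespace that the query would then have to account for.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of a tidy-style cleanup step: well-formed XML.
xml_text = (
    "<html><body>"
    '<p><a href="foo">first</a></p>'
    '<p><a href="bar">second</a></p>'
    "</body></html>"
)

root = ET.fromstring(xml_text)

# ElementTree supports a subset of XPath; './/' searches all descendants.
links = root.findall(".//p/a[@href='foo']")
texts = [link.text for link in links]
print(texts)
```

`ElementTree` is enough for simple paths like this one. Full XPath or XSLT needs a more complete library.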

Sembiance 2009-11-18 14:50:26

Answer 20

+203 A:

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

Chomsky Hierarchy

NealB 2009-11-18 18:42:40

this is a very good answer

Paul Nathan 2010-01-04 18:25:32

Short and informative, I like it :)

Sune Rievers 2010-07-05 07:59:49

+1 for science level

mico 2010-07-05 11:07:40

it's my favorite question because of this answer

mykhal 2010-07-05 11:29:26

This is not actually the case. RegEx in most programming languages is actually context-free, due to the fact that it has look-backs, etc.

Michael Fairley 2010-07-05 15:27:22

@michaelfairley Look ahead/behind/around features provide a richer syntax for expressing certain classes of regular expression. I do not believe these features provide fundamentally any more expressive power than a Chomsky type 3 grammar is capable of. One might argue that HTML is a visibly pushdown language (VPL), so it may be parsed using techniques less powerful than those required for a full-blown context free grammar; however, I am unaware of any RegEx engine that supports VPLs either.

NealB 2010-07-05 18:12:34

RegExps are not context free, even with lookback/lookaheads. It never allows for arbitrary nesting of components. The best example is (not) coming up with a RegExp to determine if a statement has matching brackets. http://en.wikipedia.org/wiki/Context-free_grammar#Example_2
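To make the bracket example concrete, here is a rough sketch (the function name is made up for illustration) of the check a classical regular expression cannot express. The point is the unbounded counter, which a finite automaton does not have:

```python
def brackets_balanced(text: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0  # unbounded counter: the part a finite automaton lacks
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


print(brackets_balanced("(()())"))
print(brackets_balanced("(()"))
```

With several kinds of brackets, or with named tags as in HTML, the counter becomes a stack, but the idea is the same.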

Juan Mendes 2010-10-20 18:18:15

Answer 21

+2 A:

You can use the nekohtml library to parse HTML. Dude, don't sweat it, just use nekohtml: http://nekohtml.sourceforge.net/

2009-12-02 14:13:20

Answer 22

A:

If it were not for @bobince's answer, I would say you should develop your regexes in a test-driven manner.

Thank God you didn't

But next time use TDD.

Jader Dias 2009-12-04 01:47:17

Answer 23

+1 A:

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their failings (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes. For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated.

There is a small article describing this work on my blog: roberto.open-lab.com

Roberto 2009-12-10 08:26:53

Answer 24

+2 A:

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, backtracking can force it to match silly things like <a >>; [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
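A quick way to try the suggested pattern, sketched in Python (whose re module also supports this fixed-width negative look-behind). The sample markup is made up for illustration:

```python
import re

# Pattern quoted from the answer above.
OPEN_TAG = re.compile(r"<([a-z]+)[^>]*(?<!/)>")

sample = '<p class="x"><br/><a href="foo">text</a><img src="i.png" /></p>'

# findall returns the captured tag names of every match.
open_tag_names = OPEN_TAG.findall(sample)
print(open_tag_names)

# The caveat mentioned above: a slash followed by a space still matches.
print(OPEN_TAG.findall("<a/ >"))
```

On this input the self-closing <br/> and <img ... /> are skipped, as are the end tags, while <a/ > is still reported as an open tag.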

moritz 2010-01-27 12:54:35

Answer 25

+2 A:

As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution is to turn it into well-formed XML using a tidy program and then use an XML parser to consume the result. There are a lot of good options for this. My program is written in Java, using the jtidy library to turn the HTML into XML and then Jaxen to run XPath queries against the result.

Corey Sanders 2010-02-04 16:22:00

Answer 26

+2 A:

Although regular expressions are neither suitable nor effective for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial tasks.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.

eyazici 2010-02-09 03:59:27

Answer 27

+3 A:

You can parse html in sed though.

Turing.sed
Write html parser (homework)
???
Profit!

profjim 2010-02-15 00:55:24

See also http://www.perlmonks.org/?displaytype=print;node_id=809842

profjim 2010-03-03 12:50:29

Answer 28

+1 A:

This may do:

<.*?[^/]>

Or without the ending tags:

<[^/].*?[^/]>

What's with the flame wars on HTML parsers? An HTML parser must parse (and rebuild!) the entire document before it can categorize your search. Regular expressions may be faster and more elegant in certain circumstances. My 2 cents...
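For anyone who wants to try the two patterns before relying on them, here is a small check in Python on made-up input. It is only a sketch of how they behave on two short strings, not a full evaluation:

```python
import re

ANY_TAG = re.compile(r"<.*?[^/]>")        # first pattern above
NO_END_TAG = re.compile(r"<[^/].*?[^/]>")  # second pattern above

simple = "<div></div>"
with_self_closing = "<br/><p>"

simple_any = ANY_TAG.findall(simple)
simple_no_end = NO_END_TAG.findall(simple)
spill_any = ANY_TAG.findall(with_self_closing)
spill_no_end = NO_END_TAG.findall(with_self_closing)

print(simple_any, simple_no_end)
print(spill_any, spill_no_end)
```

On the simple string the second pattern does drop the end tag. On the second string both patterns return a single match spanning <br/><p>, because the lazy .*? is free to run past the /> until it finds a > that is not preceded by a slash.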

Paul 2010-04-23 06:38:31

Answer 29

A:

There are some nice regexes for replacing HTML with BBCode here http://www.garyshood.com/htmltobb/source.txt. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

sblom 2010-04-25 16:38:42

Answer 30

A:

On the question of using RegExp to parse (x)HTML, my answer to everyone who spoke about its limits is: you have not been trained enough to master the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was look for the "?R" string in this thread. The second was to search for "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built on, I soon realized that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";

Just try it.
It's written as a PHP string; the "s" modifier makes the dot match newlines as well.
Here's a sample note I wrote on the PHP manual in January:

http://php.net/manual/en/regexp.reference.recursive.php

(Take care: in that note I wrongly used the "m" modifier; it should be removed, although it has no effect here, since no ^ or $ anchors are used.)

Now, we could speak about the limits of this method from a more informed point of view:

Depending on the specific implementation of the RegExp engine, recursion may have a limit on the number of nested patterns parsed, but this depends on the language used.
Although corrupted (x)HTML does not cause severe errors, it is not sanitized.

Anyhow, it is only a RegExp pattern, but it opens the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (nothing like other template engines which use the same syntax).

Emanuele Del Grande 2010-07-05 14:16:50

*"... I was soon aware that nobody got the point ..."* ... sigh. -1

Bart Kiers 2010-07-05 14:33:38

Ooooh, recursive regexes! Why didn't we think of that?

Alan Moore 2010-07-05 14:57:01

I'll put this in the "Regex which doesn't allow greater-than in attributes" bin. Check it against <input value="is 5 > 3?" />
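For comparison, a rough sketch of how a real tokenizer copes with that input, using Python's standard html.parser (the class name OpenTagCollector is made up for illustration). The quoted attribute value is consumed as a whole, so the > inside it does not end the tag:

```python
from html.parser import HTMLParser


class OpenTagCollector(HTMLParser):
    """Collect names of start tags, skipping XHTML-style <tag ... /> forms."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are deliberately ignored.
        pass


collector = OpenTagCollector()
collector.feed('<p><input value="is 5 > 3?" /><a href="foo">x</a></p>')
collector.close()
print(collector.open_tags)
```

The <input ... /> is routed to handle_startendtag and therefore left out of the list, which is what the original question asks for.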

Gareth 2010-07-05 16:24:02

If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him.

aehiilrs 2010-07-05 16:33:01

@Gareth: thanks for your objection, but are you sure that putting a greater-than inside an attribute is valid code? Well, even if it is not, this points to another limit to add to the ones I listed above, should one want to create a greedy parser for the real world... But it is not enough to demonstrate that the approach is bad, do you agree? There are other useful operators in RegExp which allow checking for following occurrences; this should be a proper use for them.

Emanuele Del Grande 2010-07-05 17:40:57

@aehiilrs: I'm sorry, I do not understand: which maintainer are you speaking about? (...code maintainers? :S)

Emanuele Del Grande 2010-07-05 17:46:49

@Emanuele, yes, it's valid.

Bart Kiers 2010-07-05 18:21:17

@Bart K.: it is valid only in an HTML 4- document. XHTML documents need the five XML entities encoded.

Emanuele Del Grande 2010-07-06 08:20:09

Oh my gosh! -7 votes! :) I'm starting to become popular... :p Sorry for opening a door!

Emanuele Del Grande 2010-07-06 10:21:22

A door? I'm sure you mean the gates to hell?

Matthias 2010-07-06 15:29:16

If your comments aim at nothing but criticism, I see no good result this discussion may reach.

Emanuele Del Grande 2010-07-06 16:13:25

I was the first to say that my solution has some limits, but of course I am willing to listen to anyone who can help me improve it. I posted something which cost me time and work, and which gives effective results in a number of projects up and running. I thought it could help to propose a RegExp approach which almost nobody spoke about (recursion), and which is the only way to parse nested markup patterns (through RegExp, of course).

Emanuele Del Grande 2010-07-06 16:16:58

Maybe you are not really interested in knowing whether RegExps can work for this purpose, which shows some prejudice, but I see no reason why you should blame my approach, since I did not blame at all the advice of anyone who proposed other ways such as stand-alone parsers.

Emanuele Del Grande 2010-07-06 16:20:19

Regular expressions can't work because by definition they are not recursive. Adding a recursive operator to regular expressions basically makes a CFG only with poorer syntax. Why not use something designed to be recursive in the first place rather than violently insert recursion into something already overflowing with extraneous functionality?

Welbog 2010-07-06 18:38:34

Once again... > is valid pretty much everywhere in XML, and thus in XHTML, see section 2.4 of the XML spec (at http://www.xml.com/axml/target.html#syntax for example)

mirod 2010-07-07 04:55:11

You are right, only the less-than is not valid inside XML attributes. Thanks to your criticism, I improved my solution so that it can parse *anything* inside the attributes :) Besides this, I implemented the parsing of the XML prologue, DTDs and CDATA. The only snag is that the mod locked this discussion against answers from users with fewer than 10 points, so I cannot post it. I tweeted him a request to unlock it, but had no response. Come to me, enemies, I await you! :) The more you are, the stronger I become!

Emanuele Del Grande 2010-07-09 08:14:54

My objection isn't one of functionality, it is one of time invested. The problem with RegEx is that by the time you post the cutesy little one-liners it appears that you did something more efficiently ("See, one line of code!"). And of course no one mentions the half hour (or 3) that they spent with their cheat-sheet and (hopefully) testing every possible permutation of input. And once you get past all that, when the maintainer goes to figure out or validate the code they can't just look at it and see that it is right. They have to dissect the expression and essentially retest it all over again...

Oorang 2010-07-10 15:11:09

... to know that it is good. And that will happen even with people who are *good* with regex. And honestly I suspect that the overwhelming majority of people won't know it well. So you take one of the most notorious maintenance nightmares and combine it with recursion, which is the *other* maintenance nightmare, and I think to myself what I really need on my project is someone a little less clever. The goal is to write code that bad programmers can maintain without breaking the code base. I know it galls to code to the lowest common denominator. But hiring excellent talent is hard, and you often...

Oorang 2010-07-10 15:17:22

end up with a lot of adequate talent and one really smart guy that is so smart he makes everything take twice as long :)

Oorang 2010-07-10 15:18:04

Answer 31

+1 A:

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.

kingjeffrey 2010-07-18 02:52:04
