I need a quick way to put a bunch of HTML attributes into a Dictionary, like so:

<body topmargin=10 leftmargin=0 class="something"> should amount to

attr["topmargin"]="10"
attr["leftmargin"]="0"
attr["class"]="something"

This is to be done server-side, and the tag contents are already available. I just need to weed out the attributes with no value and take into account the different quotation marks, or the lack of them.

I'm guessing regex should be employed. Found some similar questions, but none that really match my need.

Thanks

edit: clarifying server-side

+3  A: 

What about HtmlAgilityPack?
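
HtmlAgilityPack is a .NET library, and its API is not shown here. As a stand-in for the "use a parser" idea, here is a minimal sketch using Python's standard-library `html.parser`, assuming the input is a single start tag. The names `FirstTagAttributes` and `tag_attributes` are made up for illustration.

```python
from html.parser import HTMLParser


class FirstTagAttributes(HTMLParser):
    """Collect the attributes of the first start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            # attrs is a list of (name, value) pairs. The parser lowercases
            # names, strips quotes, and reports value as None when the
            # attribute has no value, so those are filtered out here.
            self.attrs = {name: value for name, value in attrs if value is not None}


def tag_attributes(tag_html):
    """Hypothetical helper: return {name: value} for one start tag."""
    parser = FirstTagAttributes()
    parser.feed(tag_html)
    parser.close()
    return parser.attrs or {}


print(tag_attributes('<body topmargin=10 leftmargin=0 class="something" nowrap>'))
# → {'topmargin': '10', 'leftmargin': '0', 'class': 'something'}
```

The quoting and no-value cases are handled by the parser itself, so there is no pattern to maintain.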

Arnis L.
What about it? I don't want a new framework or HTML parser for this one task that I know a nice regex can solve. It's just that I still suck at regex after all these years.
danijels
Why is this downvoted? It seems relevant and useful.
djna
@danijels - It is notoriously difficult to use a regular expression to parse HTML. I would strongly suggest that you consider this answer. (+1 by the way)
Andrew Hare
You're going to spend a lot of time trying to get a regex working; a library like this is probably the best route, especially considering how malformed most HTML sources can get.
Jon Smock
Regexps are not great for parsing XML-like stuff. The attributes can be in arbitrary order and are optional. The markup doesn't have to be on one line. Sometimes it's better to use a parser that really understands what it's reading.
djna
+1 I'm curious about the aggregated rep value that was generated in this one year of SO by "Use an actual parser" answers to "Which regex to parse HTML?" questions.
Tomalak
I suppose I was not as clear in my question as I should have been. This is to be done server-side, and I already have the attribute string ready. Meaning, I don't need to process anything other than the contents of the one HTML tag.
danijels
@danijels I just provided an idea for how to solve your problem. If it's wrong (and not just undesired), I have no problem removing it.
Arnis L.
A: 

I also think that using specialized parsers will be better, but if you want to use regex, try something like:

\<(?<tag>[a-zA-Z]+)( (?<name>\w+)="?(?<value>\w+)"?)*\>

I just tested it, and it works pretty well.
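
Note that the pattern above only accepts double quotes and `\w+` values, so `class='a b'` would not match. As a rough sketch of how the single-quoted, unquoted, and no-value cases from the question could be covered, here is a Python version that works on the attribute string alone. The pattern and the name `parse_attributes` are my own assumptions, not a tested .NET solution; Python spells named groups `(?P<name>...)` where .NET uses `(?<name>...)`.

```python
import re

# Hypothetical pattern (not the one from the answer above): a name, then an
# optional =value that is double-quoted, single-quoted, or unquoted.
ATTR_RE = re.compile(
    r"""(?P<name>[^\s"'<>/=]+)        # attribute name
        (?:\s*=\s*
           (?:"(?P<dq>[^"]*)"         # "double-quoted"
             |'(?P<sq>[^']*)'         # 'single-quoted'
             |(?P<uq>[^\s"'=<>`]+)    # unquoted
           )
        )?                            # the whole =value part is optional
    """,
    re.VERBOSE,
)


def parse_attributes(attr_string):
    """Return {name: value} for every attribute that has a value."""
    attrs = {}
    for m in ATTR_RE.finditer(attr_string):
        # At most one of the three value groups is set; none means no value.
        value = next((v for v in m.group("dq", "sq", "uq") if v is not None), None)
        if value is None:
            continue  # valueless attribute such as "disabled": skip it
        # Assumption: names are lowercased, since HTML names are case-insensitive.
        attrs[m.group("name").lower()] = value
    return attrs


attr = parse_attributes('topmargin=10 leftmargin=0 class="something" disabled')
print(attr)
# → {'topmargin': '10', 'leftmargin': '0', 'class': 'something'}
```

An explicitly empty value such as `alt=""` is kept as an empty string here; drop it in the loop if that should also count as "no value".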

ALOR