I'm currently playing with the Stack Overflow data dumps and am trying to construct (what I imagine is) a simple regular expression to extract tag names from inside of < and > characters. So, for each question, I have a list of one or more tags like <tagone><tag-two>...<tag-n> and am trying to extract just a list of tag names. Here are a few example tag strings taken from the data dump:
<javascript><internet-explorer>
<c#><windows><best-practices><winforms><windows-services>
<c><algorithm><sorting><word>
<java>
For reference, I don't need to split tag names into words, so for a tag like <best-practices> I would like to get back best-practices (not best and practices). Also, for what it's worth, I'm using Python, if that makes any difference. Any suggestions?
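To make the goal concrete, here is a minimal sketch of one approach that might work, assuming tag names never contain < or > themselves: capture every run of characters between a < and the next > with re.findall. The names TAG_PATTERN and extract_tags are hypothetical, chosen only for illustration.

```python
import re

# Assumption: a tag name is any non-empty run of characters other than '<' and '>'.
# The parentheses form a capture group, so findall returns only the names.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def extract_tags(tag_string):
    """Return the tag names found in a string such as '<c#><windows>'."""
    return TAG_PATTERN.findall(tag_string)


print(extract_tags("<c#><windows><best-practices><winforms><windows-services>"))
```

Because the character class excludes only the angle brackets, hyphenated names such as best-practices and names with symbols such as c# should come back as single items rather than being split.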