views:

116

answers:

3

I'm currently playing with the Stack Overflow data dumps and am trying to construct (what I imagine is) a simple regular expression to extract tag names from inside of < and > characters. So, for each question, I have a list of one or more tags like <tagone><tag-two>...<tag-n> and am trying to extract just a list of tag names. Here are a few example tag strings taken from the data dump:

<javascript><internet-explorer>

<c#><windows><best-practices><winforms><windows-services>

<c><algorithm><sorting><word>

<java>

For reference, I don't need to divide tag names into words, so for examples like <best-practices> I would like to get back best-practices (not best and practices). Also, for what it's worth, I'm using Python if it makes any difference. Any suggestions?

+3  A: 

Since Stack Overflow tag names do not contain embedded < or > characters, you can use the regex:

<(.*?)>

or

<([^>]*)>

Explanation:

  • < : A literal <
  • (..) : To group and remember the match.
  • .*? : To match anything in a non-greedy way.
  • > : A literal >
  • [^>] : A char class to match anything other than a >
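
A minimal Python sketch of both patterns, run against the sample tag strings from the question (the variable names are illustrative only):

```python
import re

# Sample tag strings copied from the question
samples = [
    "<javascript><internet-explorer>",
    "<c#><windows><best-practices><winforms><windows-services>",
    "<c><algorithm><sorting><word>",
    "<java>",
]

lazy = re.compile(r"<(.*?)>")       # non-greedy: stop at the first >
negated = re.compile(r"<([^>]*)>")  # char class: anything except >

for s in samples:
    tags = lazy.findall(s)
    # Both patterns should agree on well-formed tag strings like these
    assert tags == negated.findall(s)
    print(tags)
```

With a single capture group, findall returns just the captured tag names, so hyphenated tags such as best-practices come back whole.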
codaddict
+2  A: 

Here is a quick and dirty solution:

#!/usr/bin/python

import re
pattern = re.compile("<(.*?)>")
data = """
<javascript><internet-explorer>

<c#><windows><best-practices><winforms><windows-services>

<c><algorithm><sorting><word>

<java>
"""

# Print each captured tag name on its own line
for tag in pattern.findall(data):
    print(tag)

Update

Statutory warning: if the data dump is in XML or JSON (as one of the users commented) then you are much better off using a suitable XML or JSON parser.
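
As a rough sketch of that advice, assuming the dump stores each question as an XML row element whose Tags attribute holds the escaped tag string (the sample row and the element and attribute names here are assumptions for illustration, not taken from the question), an XML parser handles the unescaping and the same regex then applies:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical row; the element and attribute names are assumptions
row_xml = '<row Id="1" Tags="&lt;c#&gt;&lt;best-practices&gt;&lt;winforms&gt;" />'

row = ET.fromstring(row_xml)
# The parser unescapes &lt; and &gt; in the attribute value
raw_tags = row.get("Tags", "")
tags = re.findall(r"<([^>]*)>", raw_tags)
print(tags)
```

Letting the parser decode the entities avoids writing a regex against the escaped &lt;...&gt; form.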

Manoj Govindan
+3  A: 

Rather than working with data dumps (whatever they are) and regex, you may be interested in using the Stack Overflow API and JSON.

For example, to cull the tags from this question, you could do this:

import gzip
import io
import json
import urllib.request

# Fetch the question record; the API returns a gzip-compressed body
f = urllib.request.urlopen('http://api.stackoverflow.com/1.0/questions/3708418?type=jsontext')
# Decompress the response, then parse the JSON payload
g = gzip.GzipFile(fileobj=io.BytesIO(f.read()))
j = json.loads(g.read())

print(j['questions'][0]['tags'])
# ['python', 'regex']
unutbu