ansaurus

Question

Regular Expression (Python) to extract strings of text from inside of < and > - e.g. <stringone><string-two> etc...

Answer 1

+3 A:

Since Stack Overflow tag names never contain an embedded < or >, you can use the regex:

<(.*?)>

or

<([^>]*)>

Explanation:

< : A literal <
(..) : A capturing group, so the text matched inside it is remembered.
.*? : Matches any characters in a non-greedy way, i.e. as few as possible.
> : A literal >
[^>] : A char class to match anything other than a >
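
To see both patterns side by side, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the greedy pattern is included only as a contrast to show why the non-greedy form or the negated class is needed:

```python
import re

# Sample tag string in the form used by the question (an assumed example).
data = "<javascript><internet-explorer><c#>"

# Non-greedy version: '.*?' stops at the first '>' it can reach.
lazy = re.findall(r"<(.*?)>", data)

# Negated character class: '[^>]*' can never run past a '>'.
negated = re.findall(r"<([^>]*)>", data)

# Greedy '.*' for contrast: it runs on to the last '>' in the string.
greedy = re.findall(r"<(.*)>", data)

print(lazy)     # ['javascript', 'internet-explorer', 'c#']
print(negated)  # ['javascript', 'internet-explorer', 'c#']
print(greedy)   # ['javascript><internet-explorer><c#']
```

One difference worth knowing: by default . does not match a newline, while [^>] does, so the two patterns can disagree if a tag is ever split across lines.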

codaddict 2010-09-14 11:34:28

Answer 2

+2 A:

Here is a quick and dirty solution:

#!/usr/bin/python

import re
pattern = re.compile(r"<(.*?)>")
data = """
<javascript><internet-explorer>

<c#><windows><best-practices><winforms><windows-services>

<c><algorithm><sorting><word>

<java>
"""

# print(...) with a single argument works in both Python 2 and Python 3
for each in pattern.findall(data):
    print(each)

Update

Statutory warning: if the data dump is in XML or JSON (as one of the users commented) then you are much better off using a suitable XML or JSON parser.
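
As a rough sketch of the parser route, the snippet below assumes the dump is XML with one row element per post and the tags stored in an escaped Tags attribute. That layout, the sample data, and the names tags_by_id and tag_pattern are assumptions for illustration and may not match the real dump. The XML parser handles the escaping, and the regex is only applied to the already-decoded attribute value:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a posts dump; the real layout may differ.
xml_data = """<posts>
  <row Id="1" Tags="&lt;python&gt;&lt;regex&gt;" />
  <row Id="2" Tags="&lt;c#&gt;&lt;winforms&gt;" />
</posts>"""

root = ET.fromstring(xml_data)
tag_pattern = re.compile(r"<([^>]*)>")

tags_by_id = {}
for row in root.iter("row"):
    # The parser has already turned &lt; and &gt; back into < and >.
    tags_by_id[row.get("Id")] = tag_pattern.findall(row.get("Tags", ""))

print(tags_by_id)
```

For a very large dump, ET.iterparse would likely be a better fit than loading the whole document at once.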

Manoj Govindan 2010-09-14 11:35:06

Answer 3

+3 A:

Rather than working from data dumps (whatever they are) with a regex, you may be interested in using the Stack Overflow API and JSON.

For example, to cull the tags from this question, you could do this:

import urllib2
import json
import gzip
import cStringIO

f=urllib2.urlopen('http://api.stackoverflow.com/1.0/questions/3708418?type=jsontext')
g=gzip.GzipFile(fileobj=cStringIO.StringIO(f.read()))
j=json.loads(g.read())

print(j['questions'][0]['tags'])
# [u'python', u'regex']

unutbu 2010-09-14 12:11:20
