ansaurus

Question

Python parsing

Answer 1

+14 A:

Don't let regex scare you off... it's well worth learning.

Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:

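import re
# example title with the trailing parenthesis restored
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())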

('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')

To get at each group individually, call group() on the info object with the group's number:

print(info.group(1)) # or info.groups()[0]

print('"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3)))
"Michael Schenker Group ","House of Blues Dallas ","3/26"

The hard thing about regex in this case is making sure you know every character that can appear in the title. If there are characters other than letters, digits, underscores and spaces in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
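For instance, if band names can contain dashes, ampersands, apostrophes or periods, those characters can be added to the first set. A minimal sketch; the `wide_pat` name, the extra characters and the sample title are assumptions for illustration, not part of the original pattern:

```python
import re

pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
# Assumed variation: the band-name set also accepts & ' . and -
wide_pat = re.compile(r"([\w\s&'.-]+)\(([\w\s]+)(\d+/\d+)\)")

title = 'Some-Band & Co. (House of Blues Dallas 3/26)'  # made-up title
print(pat.match(title))                # no match: '-' is not in [\w\s]
print(wide_pat.match(title).groups())  # matches with the wider set
```

The dash goes last inside the square brackets so that it is read as a literal dash rather than as a range.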

The pattern above breaks down as follows, which is parsed left to right:

([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. \w already covers letters, digits and the underscore; if there can be dashes or other punctuation here, you'll want to add them to the set between the square brackets, which lists the possible characters.

\( : A literal parenthesis. The backslash escapes the parenthesis, since an unescaped one would start a capturing group. This is the "(" part of the string.

([\w\s]+) : Same as the one above, but this time it matches the "House of Blues Dallas " part. In parentheses so it will be captured as the second group.

(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so it will be captured as the third group.

\) : A literal closing parenthesis, escaped for the same reason. This is the ")" at the end of the string.
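The same breakdown can be written into the pattern itself with `re.VERBOSE`, which lets a pattern span several lines and carry comments. This is only a sketch of the pattern described above; the second sample title is made up to show one edge case:

```python
import re

pat = re.compile(r"""
    ([\w\s]+)    # group 1: band name, word and space characters
    \(           # a literal opening parenthesis
    ([\w\s]+)    # group 2: venue, word and space characters
    (\d+/\d+)    # group 3: date, digits/digits
    \)           # a literal closing parenthesis
""", re.VERBOSE)

print(pat.match('Michael Schenker Group (House of Blues Dallas 3/26)').groups())

# Made-up title with a two-digit month
print(pat.match('Some Band (Some Venue 12/26)').groups())
```

Because \w also matches digits, group 2 is free to take the first digit of a two-digit month: the second call yields 'Some Venue 1' and '2/26'. If the feed can contain such dates, requiring a space before the date is one way to avoid that.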

The Python intro to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also check Dive Into Python, which has a friendly introduction: http://diveintopython.org/regular_expressions/index.html.

EDIT: See zacherates below, who has some nice edits. Two heads are better than one!

Jarret Hardie 2009-03-03 19:35:41

Thanks for your answer! That helps a lot! I'm a little confused though... I need to identify fields individually to send to Google and concatenate. How do I call each value? Like, for example, how would I concatenate the values?

Alan 2009-03-03 19:42:54

Your regex leaves trailing spaces on the band and venue names, but that's easy to fix.

Aaron Maenpaa 2009-03-03 19:43:35

Yeah, I noticed that too, but figured I'd just pull the `[0:-1]` trick on the first two values in each `item.title`.

Alan 2009-03-03 19:46:31

zacherates has good suggestions

Jarret Hardie 2009-03-03 19:47:54

And his search for (.*) solves the character problem, so long as you don't set the regex to be greedy

Jarret Hardie 2009-03-03 19:48:27

I edited the post with some info on concatenating... hope I understood your intentions correctly.

Jarret Hardie 2009-03-03 19:51:06

Yes. That makes it clear. I thought that might be the method. Thanks so much! I'm always amazed how much people know when I ask a question here. Do you check "bilingual" on job applications? ;-P

Alan 2009-03-03 19:54:04

With both sets of code (and various mixes of the two) I get the error "AttributeError: 'NoneType' object has no attribute 'groups'." I've tried it in many different ways. What am I doing wrong?

Alan 2009-03-03 20:15:21

If the pattern doesn't match your input, re.match() returns None, and calling .groups() on None raises that AttributeError. You should probably check for None in case the regex failed. What input is it failing on?
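A minimal sketch of that check. The `parse_title` helper is hypothetical, named here only for illustration; it returns None when a title does not fit the pattern instead of raising AttributeError:

```python
import re

pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')

def parse_title(title):
    """Return (band, venue, date), or None if the title does not match."""
    info = pat.match(title)
    if info is None:
        return None
    # strip() drops the trailing space left on the band and venue groups
    return tuple(part.strip() for part in info.groups())

# The second title lacks the closing parenthesis, so it does not match
for title in ['Michael Schenker Group (House of Blues Dallas 3/26)',
              'Michael Schenker Group (House of Blues Dallas 3/26']:
    print(parse_title(title))
```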

Jarret Hardie 2009-03-03 20:21:38

It fails on print info.groups().

Alan 2009-03-03 20:24:09

Sorry.. I meant, what is the title from the RSS feed that you're feeding to re.match()?

Jarret Hardie 2009-03-03 20:26:00

Aha! Left that trailing ')' off... it's always something... :)

Alan 2009-03-03 20:34:02

LOL... sounds like you've got a good handle on regex already if you figured that part out!

Jarret Hardie 2009-03-03 20:35:32

Answer 2

+7 A:

Regular expressions are a great solution to this problem:

>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
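Because `.` matches any character, this pattern does not need to know which characters can appear in the band or venue name, and the literal spaces in the pattern keep them out of the groups. A quick sketch; the second title is made up for illustration and is not taken from the feed:

```python
import re

pat = re.compile(r'(.*) \((.*) (\d+/\d+)')

titles = [
    'Michael Schenker Group (House of Blues Dallas 3/26',
    'Some-Band & Co. (The Venue 12/5)',  # punctuation and a two-digit month
]
results = []
for title in titles:
    m = pat.match(title)
    if m:  # match() returns None when the title does not fit
        results.append(m.groups())
        print(m.groups())
```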

As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.

Edit

Regarding your comment: the strings are occasionally wrapped in "s rather than 's because you're using repr. The repr of a string is usually delimited with 's, unless the string itself contains one or more 's, in which case it uses "s so that the 's don't have to be escaped:

>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"

Notice the different quote styles.
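The quotes only appear when repr() is involved, as it is at the interactive prompt. A short sketch of the distinction, using the same two strings:

```python
plain = "Hello there"
quoted = "it's not its"

print(repr(plain))   # delimited with single quotes
print(repr(quoted))  # delimited with double quotes, since it contains a '
print(quoted)        # printing the string itself adds no quotes at all
```

So the quote style is a property of how the string is displayed, not of the string's contents.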

Aaron Maenpaa 2009-03-03 19:35:51

Thanks for your answer! As to your side note, I have noticed that some of the entries come out with "" at the beginning and end rather than ''. I wonder if this will be a problem. I used the RSS parser available at http://effbot.org/zone/element-rss-wrapper.htm.

Alan 2009-03-03 19:45:17

Answer 3

A:

Regarding the repr(item.title[0:-1]) part: I'm not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which only wraps the result in quotes.

Your code should look something like this:

import re

import feedparser # from www.feedparser.org
from geopy import geocoders # from GeoPy

us = geocoders.GeocoderDotUS()

feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)

lines = []
for entry in feed.entries:
    # split "Band (Venue M/D)" into its three parts
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()

        # band is the name you are searching for; set it before this loop
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # join() only accepts strings, so convert the coordinates
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))

result = "\n".join(lines)

EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
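To try the loop without network access, the feed and the geocoder can be replaced by stand-ins. In this sketch the titles, the `fake_geocode` function and its coordinates are all made up for illustration; only the regex and the joining logic follow the code above, with the coordinates converted to strings so that join() accepts them:

```python
import re

# Made-up titles standing in for the entry titles from the feed
titles = [
    'Michael Schenker Group (House of Blues Dallas 3/26)',
    'Some Other Band (Some Other Venue 4/2)',
    'not a concert listing',
]

def fake_geocode(address):
    """Hypothetical stand-in for a geocoder; returns fixed coordinates."""
    return address, (32.78, -96.80)

band = 'Michael Schenker Group'  # the band being searched for
lines = []
for title in titles:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', title)
    if m:  # skip titles that do not fit the pattern
        band_raw, venue, date = m.groups()
        if band == band_raw:
            place, (lat, lng) = fake_geocode(venue + ", Dallas, TX")
            # join() only accepts strings, so convert the floats
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))

result = "\n".join(lines)
print(result)
```

If venue names can contain commas, the csv module from the standard library would be a safer way to build each line than a plain join.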

itsadok 2009-03-03 20:22:36

:::sigh::: looks like you wrote the whole thing in less lines than I have imports... what modules are you using? especially for the get_geo and list.append? list is a __builtin__, right? get_geo? is that from GeoPy?

Alan 2009-03-03 21:14:36

And the last line adds the newline? That's helpful, too. Thanks for taking the time.

Alan 2009-03-03 21:15:32

Sorry if it wasn't clear, but I made up get_geo. I just used it as a placeholder for whatever function you decide to implement.

itsadok 2009-03-05 12:36:24
