ansaurus

Question

Python regex search - 'wild card' matching a string

Answer 1

+3 A:

import urllib
import re

fbhandle = urllib.urlopen('http://www.facebook.com/Microsoft')
pattern = r"6 of(.*)fans"  # (.*) captures whatever sits between "6 of" and "fans"
compiled = re.compile(pattern)

ms = compiled.search(fbhandle.read())
print ms.group(1).strip()
fbhandle.close()

You need to use re.search() instead of re.match(). re.match() only tries to match the pattern at the very beginning of the document, whereas you are really trying to match a piece somewhere inside the document. The code above prints: 79,110. Of course, this will probably be a different number by the time someone else runs it.
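
To make the distinction concrete, here is a minimal sketch on a made-up string (the `page` text is a stand-in, not the real Facebook markup). `match()` only succeeds if the pattern fits starting at position 0, while `search()` scans forward for the first position where it fits.

```python
import re

# Hypothetical stand-in for the downloaded page; not the real markup.
page = "<html><body>6 of 79,110 fans</body></html>"
pattern = re.compile(r"6 of(.*)fans")

# match() is anchored at the start of the string, so it finds nothing here.
at_start = pattern.match(page)

# search() scans the string for the first place where the pattern fits.
anywhere = pattern.search(page)
fan_count = anywhere.group(1).strip()
```

Here `at_start` is None because the page begins with `<html>`, not with `6 of`, while `fan_count` holds the captured text.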

Evan Fosmark 2010-01-03 22:02:11

Thanks very much, this works just fine. I'm not sure I fully understand the distinction, though: would match() be used in cases where some sort of boolean evaluation is being done on a small(ish) string?

oneAday 2010-01-03 22:13:50

@oneAday: good explanation of the difference between `match` and `search`: http://www.amk.ca/python/howto/regex/regex.html#SECTION000720000000000000000

Adam Bernier 2010-01-03 22:26:18

@oneAday: why not accept this answer if it works for you?

Adam Bernier 2010-01-04 00:58:27

oops, terribly sorry - done.

oneAday 2010-01-04 11:33:30

Answer 2

A:

You don't need a regex for this:

import urllib
fbhandle = urllib.urlopen('http://www.facebook.com/Microsoft')
for line in fbhandle.readlines():
    line=line.rstrip().split("</span>")
    for item in line:
        if ">Fans<" in item:
            rind=item.rindex("<span>")
            print "-->",item[rind:].split()[2]

output

$ ./python.py
--> 79,133

ghostdog74 2010-01-04 01:02:51

Answer 3

+3 A:

Evan Fosmark already gave a good answer. This is just more info.

You have this line:

pattern = "6 of(.*)fans"

In general, this isn't a good regular expression. If the input text was:

"6 of 99 fans in the whole galaxy of fans"

Then the match group (the stuff inside the parentheses) would be:

" 99 fans in the whole galaxy of "

So, we want a pattern that will just grab what you want, even with a silly input text like the above.
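
The greedy behaviour described above can be checked directly. A minimal sketch, using the silly input text from this answer:

```python
import re

text = "6 of 99 fans in the whole galaxy of fans"

# .* is greedy: it first runs to the end of the text, then backs up only
# as far as the LAST occurrence of "fans".
greedy = re.search(r"6 of(.*)fans", text)
greedy_group = greedy.group(1)
```

`greedy_group` ends up holding everything between "6 of" and the final "fans", surrounding spaces included.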

In this case, it doesn't really matter if you match the white space, because when you convert a string to an integer, white space is ignored. But let's write the pattern to ignore white space.

With the * wildcard, it is possible to match a string of length zero. In this case I think you always want a non-empty match, so you want to use + to match one or more characters.
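
A small sketch of the difference between `*` and `+`, using a contrived input that has nothing at all between "6 of" and "fans":

```python
import re

text = "6 offans"  # contrived: no characters between "6 of" and "fans"

# (.*) is satisfied by zero characters, so this matches with an empty group.
star = re.search(r"6 of(.*)fans", text)
star_group = star.group(1)

# (.+) demands at least one character, so the same input does not match.
plus = re.search(r"6 of(.+)fans", text)
```

With `*` you get a match whose group is the empty string; with `+` you get no match at all, which is usually easier to detect and handle.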

Python has non-greedy matching available, so you could rewrite with that. Older regular expression engines may not support non-greedy matching, so I'll also give a pattern that doesn't require it.

So, the non-greedy pattern:

pattern = r"6 of\s+(.+?)\s+fans"

The other one:

pattern = r"6 of\s+(\S+)\s+fans"

\s means "any white space" and will match a space, a tab, and a few other characters (such as "form feed"). \S means "any non-white-space" and matches anything that \s would not match.

The first pattern does better than your first pattern with the silly input text:

"6 of 99 fans in the whole galaxy of fans"

It would return a match group of just 99.

But try this other silly input text:

"6 of 99 crazed fans"

It would return a match group of 99 crazed.

The second pattern would not match at all, because the word "crazed" isn't the word "fans".
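
The comparison above can be reproduced with a short sketch. The helper `first_group` is a made-up name for illustration; it simply returns group 1 or None.

```python
import re

non_greedy = re.compile(r"6 of\s+(.+?)\s+fans")
no_whitespace = re.compile(r"6 of\s+(\S+)\s+fans")

galaxy = "6 of 99 fans in the whole galaxy of fans"
crazed = "6 of 99 crazed fans"


def first_group(compiled, text):
    """Return group 1 of the first match, or None if nothing matches."""
    found = compiled.search(text)
    return found.group(1) if found else None


results = {
    "non_greedy on galaxy": first_group(non_greedy, galaxy),
    "non_greedy on crazed": first_group(non_greedy, crazed),
    "no_whitespace on galaxy": first_group(no_whitespace, galaxy),
    "no_whitespace on crazed": first_group(no_whitespace, crazed),
}
```

Both patterns pull out 99 from the "galaxy" text. On the "crazed" text the non-greedy pattern captures too much, and the `\S+` pattern gives None.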

Hmm. Here's one last pattern that should always do the right thing even with silly input texts:

pattern = r"6 of\D*?(\d+)\D*?fans"

\d matches any digit ('0' to '9'). \D matches any non-digit.

This will successfully match anything that is remotely non-ambiguous:

"6 of 99 fans in the whole galaxy of fans"

The match group will be 99.

"6 of 99 crazed fans"

The match group will be 99.

"6 of 99 41 fans"

It will not match, because there was a second number in there.
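
Here is a quick check of the three cases above, as a sketch. One caveat: because `\d+` stops at the first non-digit, a count written with a thousands separator (such as the 79,110 printed in the first answer) does not match this pattern at all. The `with_commas` pattern below is a hypothetical variant that is not part of the original answer; it accepts commas inside the number. `first_group` is likewise a made-up helper name.

```python
import re

digits_only = re.compile(r"6 of\D*?(\d+)\D*?fans")
# Hypothetical variant (an assumption, not from the answer): a digit followed
# by any mix of digits and commas, so "79,110" is captured whole.
with_commas = re.compile(r"6 of\D*?(\d[\d,]*)\D*?fans")


def first_group(compiled, text):
    """Return group 1 of the first match, or None if nothing matches."""
    found = compiled.search(text)
    return found.group(1) if found else None


galaxy = first_group(digits_only, "6 of 99 fans in the whole galaxy of fans")
crazed = first_group(digits_only, "6 of 99 crazed fans")
two_numbers = first_group(digits_only, "6 of 99 41 fans")

# The separator case: the digits-only pattern fails, the variant succeeds.
separator_strict = first_group(digits_only, "6 of 79,110 fans")
separator_loose = first_group(with_commas, "6 of 79,110 fans")

# int() tolerates surrounding white space but not commas, so strip them first.
fan_count = int(separator_loose.replace(",", ""))
```

Whether to allow commas depends on what the page really contains, so treat the variant as a starting point rather than a drop-in replacement.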

To learn more about Python regular expressions, see the documentation for the re module and the Regular Expression HOWTO. For a quick reminder, inside the Python interpreter, do:

>>> import re
>>> help(re)

When you are "scraping" text from a web page, you might sometimes run afoul of HTML markup. In general, regular expressions are not a good tool for disregarding HTML or XML markup; you would probably do better to use Beautiful Soup to parse the HTML and extract the text, and then use a regular expression to grab the text you really want.
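
As a rough sketch of the "strip the markup first, then apply the regular expression" approach: the example below uses the standard library's `html.parser` module (its Python 3 name) rather than Beautiful Soup, only so that it runs without extra packages. With Beautiful Soup, the text-extraction step would be done by that library instead. The HTML snippet and the `TextExtractor` class are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the markup itself."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


# Made-up markup; the real page will look different.
html = "<div>6 of <span class='count'>79,110</span> <b>fans</b></div>"
pattern = re.compile(r"6 of\s+(\S+)\s+fans")

# Applied to the raw HTML, the pattern trips over the tags and finds nothing.
raw_attempt = pattern.search(html)

# Extract the plain text first, then apply the same pattern.
extractor = TextExtractor()
extractor.feed(html)
extractor.close()
text = "".join(extractor.chunks)

found = pattern.search(text)
fan_count = found.group(1) if found else None
```

Once the tags are gone, the simple `\S+` pattern from earlier is enough, because the number is no longer wrapped in markup.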

I hope this was interesting and/or educational.

steveha 2010-01-04 01:03:01

+1 for sheer breadth and volume

Adam Bernier 2010-01-04 02:09:47

interesting AND educational - fantastic. thanks very much.

oneAday 2010-01-04 11:32:56
