Evan Fosmark already gave a good answer. This is just more info.
You have this line:
pattern = "6 of(.*)fans"
In general, this isn't a good regular expression. If the input text was:
"6 of 99 fans in the whole galaxy of fans"
Then the match group (the stuff inside the parentheses) would be:
" 99 fans in the whole galaxy of "
So, we want a pattern that will just grab what you want, even with a silly input text like the above.
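To see the greedy behaviour described above for yourself, here is a minimal sketch you can paste into the interpreter. The variable names are just for illustration.

```python
import re

text = "6 of 99 fans in the whole galaxy of fans"
pattern = "6 of(.*)fans"

match = re.search(pattern, text)
# .* is greedy, so it runs all the way to the last "fans" in the text.
greedy_group = match.group(1)
print(repr(greedy_group))
```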
In this case, it doesn't really matter if you match the white space, because when you convert a string to an integer, leading and trailing white space is ignored. But let's write the pattern to ignore white space.
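A tiny check of the white space claim, using a made-up string:

```python
# int() strips leading and trailing white space before parsing the digits.
value = int(" 99 ")
print(value)
```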
With the * quantifier, it is possible to match a string of length zero. In this case I think you always want a non-empty match, so you want to use + to match one or more characters.
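The difference between * and + is easy to check. This is only a sketch, and the input string is a hypothetical one with nothing between "of" and "fans":

```python
import re

text = "6 offans"  # hypothetical input: nothing between "of" and "fans"

star_match = re.search(r"6 of(.*)fans", text)
plus_match = re.search(r"6 of(.+)fans", text)

# * is happy with an empty group; + needs at least one character.
print(repr(star_match.group(1)))
print(plus_match)
```

With the * version you would get an empty string back, and int("") raises a ValueError, so a failed match (None) is arguably the more useful result here.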
Python has non-greedy matching available, so you could rewrite with that. Older programs with regular expressions may not have non-greedy matching, so I'll also give a pattern that doesn't require non-greedy.
So, the non-greedy pattern:
pattern = r"6 of\s+(.+?)\s+fans"
The other one:
pattern = r"6 of\s+(\S+)\s+fans"
\s means "any white space" and will match a space, a tab, and a few other characters (such as "form feed"). \S means "any non-white-space" and matches anything that \s would not match.
The non-greedy pattern does better than your original pattern with the silly input text:
"6 of 99 fans in the whole galaxy of fans"
It would return a match group of just 99.
But try this other silly input text:
"6 of 99 crazed fans"
The non-greedy pattern would return a match group of 99 crazed.
The \S+ pattern would not match at all, because the word after the number is "crazed", not "fans".
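Here is a small sketch that runs both of these patterns against both silly inputs. The helper first_group and the variable names are my own inventions for the example, not anything from your code.

```python
import re

non_greedy = r"6 of\s+(.+?)\s+fans"
non_space = r"6 of\s+(\S+)\s+fans"

def first_group(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

galaxy = "6 of 99 fans in the whole galaxy of fans"
crazed = "6 of 99 crazed fans"

results = {
    ("non_greedy", "galaxy"): first_group(non_greedy, galaxy),
    ("non_space", "galaxy"): first_group(non_space, galaxy),
    ("non_greedy", "crazed"): first_group(non_greedy, crazed),
    ("non_space", "crazed"): first_group(non_space, crazed),
}

for key, value in results.items():
    print(key, repr(value))
```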
Hmm. Here's one last pattern that should always do the right thing even with silly input texts:
pattern = r"6 of\D*?(\d+)\D*?fans"
\d matches any digit ('0' to '9'). \D matches any non-digit.
This will successfully match anything that is remotely non-ambiguous:
"6 of 99 fans in the whole galaxy of fans"
The match group will be 99.
"6 of 99 crazed fans"
The match group will be 99.
"6 of 99 41 fans"
It will not match, because there was a second number in there.
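If you want to try those three inputs yourself, something like this should do it. The function name fan_count is hypothetical, and returning None for "no match" is just one reasonable choice.

```python
import re

pattern = re.compile(r"6 of\D*?(\d+)\D*?fans")

def fan_count(text):
    """Return the number as an int, or None if the text does not match."""
    match = pattern.search(text)
    return int(match.group(1)) if match else None

inputs = [
    "6 of 99 fans in the whole galaxy of fans",
    "6 of 99 crazed fans",
    "6 of 99 41 fans",
]

counts = [fan_count(text) for text in inputs]
for text, count in zip(inputs, counts):
    print(repr(text), "->", count)
```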
To learn more about Python regular expressions, you can read various web pages. For a quick reminder, inside the Python interpreter, do:
>>> import re
>>> help(re)
When you are "scraping" text from a web page, you might sometimes run afoul of HTML markup. In general, regular expressions are not a good tool for disregarding HTML or XML markup; you would probably do better to use Beautiful Soup to parse the HTML and extract the text, followed by a regular expression to grab the text you really wanted.
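As a rough sketch of that two-step idea: Beautiful Soup is what I'd suggest, but so that this runs without installing anything I've swapped in html.parser from the standard library. The TextCollector class and the HTML snippet are made up for the example.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: keeps the text between tags, drops the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

html = '<p>6 of <span id="n1">99</span> fans</p>'  # made-up snippet
pattern = r"6 of\D*?(\d+)\D*?fans"

# Applied to the raw HTML, the digit in id="n1" gets in the way.
raw_match = re.search(pattern, html)

# Step 1: strip the markup. Step 2: run the regular expression on the text.
collector = TextCollector()
collector.feed(html)
collector.close()
text = "".join(collector.chunks)

match = re.search(pattern, text)
fan_count = int(match.group(1)) if match else None
print(repr(text), fan_count)
```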
I hope this was interesting and/or educational.