Hi Everyone,

I am new to Python (I don't have any programming training either), so please keep that in mind as I ask my question.

I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says

raise error, v # invalid expression

sre_constants.error: multiple repeat

I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I return any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.

Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!

Brock

import urllib2
from BeautifulSoup import BeautifulSoup
import re

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

pattern = r'<a href="http://forums.epicgames.com/archive/index.php?t-([0-9]+).html">(.?+)</a> <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'

for match in re.finditer(pattern, page, re.S):
    print match(0)
+1  A: 

You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.

Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.

More documentation here.

For your case, try this:

pattern = r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a> <i>\((.+?) replies\)'
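
To see why the escaping matters, here is a minimal sketch that runs both forms against a single sample line. The sample is assembled from the commented-out link in the question, not fetched from the live page, and the dots are escaped as well (an extra tightening that the pattern above does not strictly need).

```python
import re

# Sample line built from the commented-out link in the question
# (an assumption about what the page source looks like).
sample = ('<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
          'Gears of War 2: Horde Gameplay</a> <i>(20 replies)')

# Unescaped '?' makes the preceding 'p' optional; it does not match a literal '?'.
unescaped = r'index.php?t-([0-9]+).html'

# Escaped metacharacters: \? \. \( \) match the literal characters,
# and '+?' is the non-greedy form of '+'.
escaped = (r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-([0-9]+)\.html">'
           r'(.+?)</a> <i>\((.+?) replies\)')

no_match = re.search(unescaped, sample)
match = re.search(escaped, sample)

print(no_match)        # the unescaped form finds nothing
print(match.groups())  # thread id, link text, reply count
```

The unescaped form fails silently rather than raising an error, which is why a pattern can look right and still return no matches.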
retracile
I changed my pattern and ran the script again, and yet no matches were found; at least, nothing is printed in the window when I iterate over my matches and print them. Any ideas?
Btibert3
Look at the content of the file by hand. When I look at it, I don't see the string 'replies' in it anywhere. So the regex won't find any matches.
retracile
pattern = r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>( <i>\(([0-9]+?) replies\))?' might be closer?
retracile
I tried your new pattern, and what I don't get is that it returned no matches. I even shortened the pattern and tried the code below, and when I try to print match.group(0), nothing (I think) gets sent to the console. Any ideas?
pattern = r'<a href="forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>'
for match in re.finditer(pattern, page, re.S):
    print match(0)
Btibert3
+1  A: 

That means your regular expression has an error.

(.?+)</a> <i>((.?+)

What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something.

Unknown
They make sense in the other order. +? is non-greedy matching form of +.
retracile
A: 

To extend on what others wrote:

.? means "one or zero of any character"

.+ means "one or more of any character"

As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your error about "multiple repeats" is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
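
The quantifier forms discussed in this thread can be compared side by side. This is a small sketch on a made-up string, not on the forum page. Note that whether '.?+' compiles depends on the interpreter: Python 2 and Python 3 before 3.11 reject it with "multiple repeat", while Python 3.11 and later read '?+' as a possessive quantifier, so the sketch only reports what happens instead of assuming an error.

```python
import re

# Made-up sample text with two adjacent elements.
text = '<b>one</b><b>two</b>'

# '.?' matches zero or one character.
zero_or_one = re.match(r'.?', text).group(0)

# '.+' is greedy: it grabs as much as possible, so it runs to the last </b>.
greedy = re.findall(r'<b>(.+)</b>', text)

# '.+?' is non-greedy: it stops at the first </b> it can.
lazy = re.findall(r'<b>(.+?)</b>', text)

# '.?+' is version dependent, so catch the error instead of assuming it.
try:
    re.compile(r'.?+')
    question_plus = 'compiles on this interpreter'
except re.error as exc:
    question_plus = 'rejected: %s' % exc

print(zero_or_one)
print(greedy)
print(lazy)
print(question_plus)
```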

machineghost
Except that .+? is non-greedy matching of one or more characters. Which is what he's after.
retracile
A: 

As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note, you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!
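
As a rough illustration of the parser-based approach, here is a sketch that collects each link's href and text and then filters on the href with a small regex. It deliberately uses the standard library's html.parser (Python 3) as a stand-in, so it is not Beautiful Soup's API. The LinkCollector class, the sample HTML, and the "Page 164" link text are all made up for the example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) pairs for <a> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


# Made-up sample markup; the real page may be structured differently.
sample_html = (
    '<a href="http://forums.epicgames.com/archive/index.php?f-356-p-164.html">Page 164</a>'
    '<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
    'Gears of War 2: Horde Gameplay</a>'
)

# Only thread links (t-<digits>.html) are wanted, not forum index pages.
thread_url = re.compile(r'index\.php\?t-(\d+)\.html$')

collector = LinkCollector()
collector.feed(sample_html)
collector.close()

threads = [(href, text) for href, text in collector.links if thread_url.search(href)]
print(threads)
```

The regex only has to describe the URL here, because the parser has already separated the attribute from the link text.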

Ned Deily
I have tried the documentation. As I am new to Python, and even HTML for that matter, I am having a hard time 'easily' finding what I need it to do, although I have no doubt it can do what I need.
Btibert3
A: 
import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

# Get all the links
links = [str(match) for match in soup('a')]

# Keep only thread links and capture the text between <a ...> and </a>
s = r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-\d+\.html">(.+?)</a>'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.group(1)
hughdbrown
Is it possible to filter the links I want...as you can see in my attempt to do a regex, I want a certain set of links. Additionally, and I know I am pushing my luck, I was hoping to get the link text along with it. In short, is it possible to filter the links returned and get the link text with it?
Btibert3
A couple of things: what is the "link text"? The stuff between <a href...> and </a>? Or the href value? Or some stuff after the opening <a> and closing </a>? Or something else? Here's what I don't get: the page you point to, http://forums.epicgames.com/archive/index.php?f-356-p-164.html, doesn't even have a single instance of 'replies' in the HTML source. Are you *sure* you are looking for that? And why have you accepted as an answer a regex that cannot match any links in the data?
hughdbrown
New to Stack Overflow, didn't realize that meant I was done, sorry. By link text, I simply want the text after the link in the source code (the text right before </a>). Since I am new to Python and web scraping, I am starting slow and trying to learn as much as I can. But all I am looking to do is grab the links from that archive (every page), follow each link (discussion), and grab all of the posts for that discussion. I will need to parse the data into a 'dataset', which can be a list, but simply, I want to scrape the archives and collect all of the message titles and posts for each.
Btibert3
Marking a solution as "the one" usually means that you are satisfied with it and responders will not expect to get any credit for further efforts. Also, if you select one of the solutions and it doesn't actually work, what should responders make of that? The new version of the code goes to the web page you cited, scrapes all the links, and then prints all the text between the opening and closing anchor tags. I think that's what you want.
hughdbrown
A: 

Thanks for the help. I knew BeautifulSoup was probably the answer, but I haven't had the best luck with it since I am new to programming, and HTML for that matter. I have had some decent luck with regex, so I started with that route. The first ? never occurred to me, but makes perfect sense now.

All I am looking to do is to get a specific set of links (those that meet a specific regex match) and text description of the link. I figure these are basic HTML principles that can be harnessed by BS, but I am not sure where to go.

Thanks,

Brock

Btibert3