ansaurus

Question

Regex Matching Error

Answer 1

+1 A:

You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.

Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.

For your case, try this:

pattern = r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html"> (.+?)</a> <i>\((.+?) replies\)'

retracile 2009-08-12 21:19:14
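A minimal sketch of why the escaping matters, in Python 3 syntax. The sample line is hypothetical, modeled on the markup the pattern in this answer expects; it is not taken from the real page. The dots are also escaped here, which the answer's pattern does not do; that only tightens the match.

```python
import re

# Hypothetical sample line (an assumption, not real page content).
sample = ('<a href="http://forums.epicgames.com/archive/index.php?t-12345.html">'
          ' Some thread title</a> <i>(7 replies)</i>')

# '?', '(' and ')' are escaped so they match literally; '+?' is non-greedy.
pattern = (r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-([0-9]+)\.html">'
           r' (.+?)</a> <i>\((.+?) replies\)')

m = re.search(pattern, sample)
thread_id, title, replies = m.groups()
print(thread_id, title, replies)

# Unescaped, 'php?' means "ph followed by an optional p", so the literal '?'
# in the URL is never consumed and the search finds nothing.
unescaped = re.search(r'index.php?t-([0-9]+)', sample)
print(unescaped)
```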

I changed my pattern and ran the script again, yet no matches were found; at least, nothing is printed in the window when I iterate over my matches and print them. Any ideas?

Btibert3 2009-08-13 14:24:46

Look at the content of the file by hand. When I look at it, I don't see the string 'replies' in it anywhere. So the regex won't find any matches.

retracile 2009-08-13 14:45:37

pattern = r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>( <i>\(([0-9]+?) replies\))?' might be closer?

retracile 2009-08-13 14:49:43
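A rough illustration of what the optional trailing group buys you, in Python 3 syntax. The two-line `page` string is made up for the example: one link is followed by a reply count and one is not.

```python
import re

pattern = (r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>'
           r'( <i>\(([0-9]+?) replies\))?')

# Hypothetical markup (an assumption): one link with a reply count, one without.
page = ('<a href="http://forums.epicgames.com/archive/index.php?t-111.html">First</a> <i>(3 replies)</i>\n'
        '<a href="http://forums.epicgames.com/archive/index.php?t-222.html">Second</a>\n')

# Group 4 is the reply count; it is None when the optional part is absent.
rows = [(m.group(1), m.group(2), m.group(4)) for m in re.finditer(pattern, page)]
for row in rows:
    print(row)
```

Because the whole `<i>(... replies)` part is wrapped in `( ... )?`, a link without a reply count still matches; it just reports `None` for that group.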

I tried your new pattern, and what I don't get is that it returned no matches. I even shortened the pattern and tried this code, and when I try to print match.group(0), nothing (I think) gets sent to the console. Any ideas?

pattern = r'<a href="forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>'
for match in re.finditer(pattern, page, re.S):
    print match(0)

Btibert3 2009-08-13 21:23:59
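Two things may be worth checking in the loop from the comment above; this is a guess, since the comment may not show the code exactly as it was run. The shortened pattern has no `http://` after `href="`, so it cannot match an href that includes the scheme. Also, a match object is not callable: `match(0)` would raise a TypeError, so seeing no error at all suggests the loop body never ran. A small Python 3 sketch with a hypothetical one-link `page`:

```python
import re

# Hypothetical stand-in for the downloaded page (an assumption).
page = '<a href="http://forums.epicgames.com/archive/index.php?t-42.html">A thread</a>'

with_scheme = r'<a href="http://forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>'
without_scheme = r'<a href="forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>'

# group(0) is the whole matched text; use match.group(0), not match(0).
found = [m.group(0) for m in re.finditer(with_scheme, page, re.S)]
missed = [m.group(0) for m in re.finditer(without_scheme, page, re.S)]

print(len(found), len(missed))
```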

Answer 2

+1 A:

That means your regular expression has an error.

(.?+)</a> <i>((.?+)

What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something.

Unknown 2009-08-12 21:19:26

They make sense in the other order. +? is non-greedy matching form of +.

retracile 2009-08-12 21:23:57
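To see how a given interpreter treats the two orderings, here is a small Python 3 sketch; the helper name `describe` is made up for illustration. On the Python versions current when this thread was written, `?+` is rejected with a "multiple repeat" error. Since Python 3.11 the re module accepts `?+` as a possessive quantifier, so it compiles there, but it still does not mean non-greedy.

```python
import re


def describe(pattern):
    """Report whether `pattern` compiles under the running Python's re module."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return "error: %s" % exc
    return "compiles"


# Result depends on the Python version (see the note above).
print(describe(r'(.?+)</a>'))
# Non-greedy '+?' compiles on every version.
print(describe(r'(.+?)</a>'))
```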

Answer 3

A:

To extend on what others wrote:

.? means "one or zero of any character"

.+ means "one or more of any character"

As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your error about "multiple repeats" is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.

machineghost 2009-08-12 21:24:03

Except that .+? is non-greedy matching of one or more characters. Which is what he's after.

retracile 2009-08-12 21:24:53
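The difference between `.?`, `.+` and `.+?` is easy to see on a tiny made-up string (Python 3 syntax; the `text` value is just an illustration):

```python
import re

text = '<a>one</a> <a>two</a>'

# Greedy: '.+' grabs as much as it can, so it runs to the last '</a>'.
greedy = re.findall(r'<a>(.+)</a>', text)

# Non-greedy: '.+?' stops at the first '</a>' it can reach.
lazy = re.findall(r'<a>(.+?)</a>', text)

# '.?' allows at most one character between the tags, so nothing matches here.
optional = re.findall(r'<a>(.?)</a>', text)

print(greedy)
print(lazy)
print(optional)
```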

Answer 4

A:

As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note, you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!

Ned Deily 2009-08-12 21:46:48

I have tried the documentation. As I am new to Python, and even HTML for that matter, I am having a hard time 'easily' finding what I need it to do, although I have no doubt it can do what I need.

Btibert3 2009-08-13 14:23:59

Answer 5

A:

import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

# Get all the links, serialized back to HTML strings
links = [str(match) for match in soup('a')]

# Keep only thread links (index.php?t-<digits>.html) and capture the link text
s = r'<a href="http://forums.epicgames.com/archive/index.php\?t-\d+.html">(.+?)</a>'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.group(1)

hughdbrown 2009-08-12 22:01:25

Is it possible to filter the links I want...as you can see in my attempt to do a regex, I want a certain set of links. Additionally, and I know I am pushing my luck, I was hoping to get the link text along with it. In short, is it possible to filter the links returned and get the link text with it?

Btibert3 2009-08-13 14:26:18
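One way to filter links by href and keep the link text, sketched in Python 3 with the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs. The class name `ThreadLinkCollector`, the `THREAD_HREF` pattern and the sample `page` are all assumptions made for the example, not part of the original script.

```python
import re
from html.parser import HTMLParser

# Assumed shape of a thread link: .../index.php?t-<digits>.html
THREAD_HREF = re.compile(r'^http://forums\.epicgames\.com/archive/index\.php\?t-(\d+)\.html$')


class ThreadLinkCollector(HTMLParser):
    """Collect (href, link text) for every <a> whose href matches THREAD_HREF."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the matching <a> we are inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if THREAD_HREF.match(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


# Hypothetical markup standing in for the downloaded archive page.
page = ('<a href="http://forums.epicgames.com/archive/index.php?f-356.html">Forum index</a>'
        '<a href="http://forums.epicgames.com/archive/index.php?t-42.html">A thread</a>')

collector = ThreadLinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

The regex is applied only to the href value, and the parser handles the tag structure, which tends to be less fragile than matching whole `<a ...>...</a>` strings.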

A couple of things: what is the "link text"? The stuff between <a href...> and </a>? Or the href value? Or some stuff after the opening <a> and closing </a>? Or something else?

Here's what I don't get: the page you point to, http://forums.epicgames.com/archive/index.php?f-356-p-164.html, doesn't even have a single instance of 'replies' in the HTML source. Are you *sure* you are looking for that? And why have you accepted as an answer a regex that cannot match any links in the data?

hughdbrown 2009-08-13 16:26:26

New to Stack Overflow; I didn't realize that meant I was done, sorry. By link text, I simply mean the text after the link in the source code (the text right before </a>). Since I am new to Python and web scraping, I am starting slow and trying to learn as much as I can. All I am looking to do is grab the links from that archive (every page), follow each link (discussion), and grab all of the posts for that discussion. I will need to parse the data into a 'dataset', which can be a list; put simply, I want to scrape the archives and collect all of the message titles and posts for each.

Btibert3 2009-08-13 21:17:59

Marking a solution as "the one" usually means that you are satisfied with it and responders will not expect to get any credit for further efforts. Also, if you select one of the solutions and it doesn't actually work, what should responders make of that?

The new version of the code goes to the web page you cited, scrapes all the links, and then prints all the text between the opening and closing anchor tags. I think that's what you want.

hughdbrown 2009-08-13 21:55:19

Answer 6

A:

Thanks for the help. I knew BeautifulSoup was probably the answer, but I haven't had the best luck with it since I am new to programming, and HTML for that matter. I have had some decent luck with regex, so I started with that route. The first ? never occurred to me, but makes perfect sense now.

All I am looking to do is to get a specific set of links (those that meet a specific regex match) and text description of the link. I figure these are basic HTML principles that can be harnessed by BS, but I am not sure where to go.

Thanks,

Brock

Btibert3 2009-08-13 03:30:52
