ansaurus

Question

Regular Expression to match a string only when certain characters don't exist

Answer 1

A:

You will need to crawl the pages up to and including the query string (?param=1&param=5),

because param=1 and param=2 could normally give you completely different web pages.

Pick any WordPress website to confirm that.

Try this one; it will match everything up to the # character:

(http://www.example.com/)([^#]*?)

S.Mark 2009-11-25 10:42:09

Yep, the site that I'm crawling uses parameters, but these don't provide any difference in the content of the pages so it would be a waste for both myself and their website if I crawled it (which is why I want to exclude URLs that contain parameters and #)

johneth 2009-11-25 10:50:15

OK, if you're really sure that you don't need the parts after ?, =, or #, use what other people suggest, ([^=\?#]*?), and vote up / accept the other people's answers. Cheers! :-)

S.Mark 2009-11-25 10:57:54

Answer 2

+1 A:

(http://www.example.com/)([^=?#]*?)

Should do it; this will allow any URL that does not contain the characters you don't want.

It might, however, be a little hard to extend this approach. A better option is to make the system two-tiered, i.e. one set of matching regexes and one set of blocking regexes. Only URLs that match the first set and are not caught by the second are allowed. I think this solution will be a bit more transparent and flexible.
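The two-tiered idea could be sketched in Python roughly as follows. This is only an illustration: the `ALLOW_PATTERNS` and `BLOCK_PATTERNS` lists and the `is_allowed` helper are hypothetical names, not something from the asker's crawler.

```python
import re

# Hypothetical pattern sets, chosen only for illustration.
ALLOW_PATTERNS = [re.compile(r"^http://www\.example\.com/")]
BLOCK_PATTERNS = [re.compile(r"[=?#]")]


def is_allowed(url):
    """Return True if url matches an allow pattern and no block pattern."""
    if not any(p.search(url) for p in ALLOW_PATTERNS):
        return False
    return not any(p.search(url) for p in BLOCK_PATTERNS)


print(is_allowed("http://www.example.com/index.php"))
print(is_allowed("http://www.example.com/index.php?param=1"))
```

Extending the filter then means appending a pattern to one of the two lists rather than editing a single large expression.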

Joakim Lundborg 2009-11-25 10:42:21

I never thought of it like that, I'll give that a go

johneth 2009-11-25 10:44:48

If you do, please accept/upvote, otherwise you'll have a never-ending army of regexers answering the question =).

Joakim Lundborg 2009-11-25 10:52:09

The backslash is not necessary inside the character class.

Tim Pietzcker 2009-11-25 10:55:46

Your method almost worked, it just needed a $ on the end (outside the parentheses)! It yields the same results as VoDurden's method (which is the same except for the missing ?). I've updated the question with the answer and accepted VoDurden's as the correct one (because I read it first). Thanks very much, everyone!

johneth 2009-11-25 11:01:37

Answer 3

A:

This expression should be what you're looking for:

(http://www.example.com/subdirectory/)([^=?#]*)$

[^=?#] will match any character except the ones you specified.

For example:

http://www.example.com/subdirectory/ Match
http://www.example.com/subdirectory/index.php Match
http://www.example.com/subdirectory/somepage?param=1&param=5#print No Match
http://www.example.com/subdirectory/index.php?param=1 No Match
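The table above can be checked with a short Python sketch. It assumes the pattern is applied with `re.match`, and it escapes the dots in the host name so they match literally, which is a small adjustment not present in the expression above.

```python
import re

# The pattern from the answer, with the dots escaped so they match literally
# (an adjustment made here, not part of the original expression).
PATTERN = re.compile(r"(http://www\.example\.com/subdirectory/)([^=?#]*)$")

urls = [
    "http://www.example.com/subdirectory/",
    "http://www.example.com/subdirectory/index.php",
    "http://www.example.com/subdirectory/somepage?param=1&param=5#print",
    "http://www.example.com/subdirectory/index.php?param=1",
]

# Map each URL to whether the pattern matches it from start to end.
results = {url: bool(PATTERN.match(url)) for url in urls}
for url, matched in results.items():
    print(url, "Match" if matched else "No Match")
```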

VoDurden 2009-11-25 10:43:05

Your method almost worked. I tried it and it seemed not to work, so I added $ to the end, and now it seems to work (it'll need more testing, but your method has just saved me a lot of time!): (http://www.example.com/subdirectory/)([^=\?#]*)$
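Why the trailing $ matters can be seen in a small Python sketch (using `re.match` is an assumption about how the pattern is applied). Without the anchor, the character class simply stops at the first ?, = or #, and the match still succeeds on the prefix.

```python
import re

url = "http://www.example.com/subdirectory/index.php?param=1"

unanchored = re.compile(r"(http://www\.example\.com/subdirectory/)([^=?#]*)")
anchored = re.compile(r"(http://www\.example\.com/subdirectory/)([^=?#]*)$")

m = unanchored.match(url)
print(m.group(2))           # the part matched before the first excluded character
print(anchored.match(url))  # None, because the whole URL must be free of = ? #
```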

johneth 2009-11-25 10:56:55

Updated the answer with the trailing $. Make sure to leave a comment if you find any other problems during testing :)

VoDurden 2009-11-25 11:07:02

Answer 4

A:

I'm not sure what you want. If you want to match anything that doesn't contain any ?, #, or =, then the regex is

([^=?#]*)

Tristram Gräbener 2009-11-25 10:43:27

You can drop the backslash - inside the character class, the ? is not a special character.
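A quick way to confirm this, sketched in Python with a few made-up sample strings: both spellings of the character class consume exactly the same prefix.

```python
import re

escaped = re.compile(r"[^=\?#]*")    # with the redundant backslash
unescaped = re.compile(r"[^=?#]*")   # without it

# Illustrative inputs only; any strings would do.
samples = ["index.php", "index.php?param=1", "page#print", "a=b"]
for s in samples:
    # Both classes stop at the first =, ? or # in the string.
    assert escaped.match(s).group(0) == unescaped.match(s).group(0)
    print(repr(s), "->", repr(unescaped.match(s).group(0)))
```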

Tim Pietzcker 2009-11-25 10:54:20

Good remark :) I just copy-pasted without thinking

Tristram Gräbener 2009-11-25 13:25:22

Answer 5

A:

As an alternative, there's always the urlparse module, which is designed for parsing URLs.

from urlparse import urlparse

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

for url in urls:
    # in python 2.5+ you can use urlparse(url).query instead
    if not urlparse(url)[4]:
        print url

This prints the following:

http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php
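On Python 3 the same functionality lives in `urllib.parse`. The sketch below is an adaptation rather than the code above: it also rejects URLs that carry only a fragment (#...), since the question wants those excluded too. The `is_plain` helper and the extra fragment-only URL are made up for illustration.

```python
from urllib.parse import urlparse


def is_plain(url):
    """Return True if url has neither a query string nor a fragment."""
    parts = urlparse(url)
    return not parts.query and not parts.fragment


urls = [
    "http://www.example.com/subdirectory/",
    "http://www.example.com/subdirectory/index.php",
    "http://www.example.com/subdirectory/somepage?param=1&param=5#print",
    "http://www.example.com/subdirectory/index.php?param=1",
    "http://www.example.com/subdirectory/index.php#print",  # fragment only
]

plain_urls = [u for u in urls if is_plain(u)]
for u in plain_urls:
    print(u)
```

Note that a bare trailing ? or # parses to an empty component, so such a URL would still pass this check; test the raw string for those characters if that case matters.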

muffinresearch 2009-11-25 21:22:31
