tags:
views: 104
answers: 6

I currently have this regex:

re.compile('userpage')

and hrefs like these:

href="www.example.com?u=userpage&as=233&p=1"
href="www.example.com?u=userpage&as=233&p=2"

I want to get all urls that have both u=userpage and p=1.

How can I modify the regex above to find both u=userpage and p=1?

A: 

/((u=userpage).*?(p=1))|((p=1).*?(u=userpage))/

This will get all strings that contain the two bits you're looking for.
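In Python form, that pattern looks like the sketch below (the sample url list is mine, just to illustrate; the alternation covers either parameter order):

```python
import re

# The answer's pattern as a Python raw string: branch 1 matches
# u=userpage followed by p=1, branch 2 matches the reverse order.
pattern = re.compile(r'(u=userpage).*?(p=1)|(p=1).*?(u=userpage)')

urls = [
    'www.example.com?u=userpage&as=233&p=1',  # both params, in order
    'www.example.com?p=1&x=9&u=userpage',     # reversed order
    'www.example.com?u=userpage&as=233&p=2',  # p=2, not p=1
]
matches = [u for u in urls if pattern.search(u)]
print(matches)  # first two urls only
```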

Borealid
ugly and inefficient, but probably working :) btw, i'd insert `\>` after `p=1`'s
mykhal
@mykhal, then you'd get an unusable RE (since it's not what Python REs use to indicate word boundaries -- there are many dialects of REs, and I think you're thinking about e.g. vim's). If you used instead `\b`, like in my answer, you wouldn't have this problem (since it _is_ what Python REs use for the purpose;-).
Alex Martelli
@Alex Martelli, you *still* do have problems. See my comment on your answer.
Aaron Gallagher
@[Alex Martelli] you're right, sir, thanks. i do not use the boundary in python rex often, if ever.. :)
mykhal
@downvoter: I think this regex works. Did I say something incorrect? Or was it just an answer you ideologically disagreed with? If so, post a comment.
Borealid
@Aaron, seen and answered. And, how can you compare a pattern with `\>`, which will match **no** URLs of interest, with one with `\b`, which will falsely-match some values containing %-escapes?!
Alex Martelli
@Borealid, look at the comments on all of the other regex-using answers. You get a lot of false positives.
Aaron Gallagher
A: 

To make sure you don't accidentally match parts like `bu=userpage`, `u=userpagezap`, `p=111` or `zap=1`, you need abundant use of the `\b` "word-boundary" RE pattern element. I.e.:

re.compile(r'\bp=1\b.*\bu=userpage\b|\bu=userpage\b.*\bp=1\b')

The word-boundary elements in the RE's pattern prevent the above-mentioned, presumably-undesirable "accidental" matches. Of course, if in your application they're not "undesirable", i.e., if you positively want to match p=123 and the like, you can easily remove some or all of the word-boundary elements above!-)
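A quick sketch of the boundaries at work (the second sample URL is mine, built from the "accidental" cases above):

```python
import re

# \b anchors require a word/non-word transition, so p=111, zap=1 and
# bu=userpage cannot satisfy \bp=1\b or \bu=userpage\b.
pat = re.compile(r'\bp=1\b.*\bu=userpage\b|\bu=userpage\b.*\bp=1\b')

print(bool(pat.search('www.example.com?u=userpage&as=233&p=1')))    # True
print(bool(pat.search('www.example.com?bu=userpage&zap=1&p=111')))  # False
```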

Alex Martelli
+1 Forgot about partial matches.
NullUserException
`\b` doesn't protect you from everything. Your code still breaks on `?u=userpage%20whatever`.
Aaron Gallagher
@Aaron, true, %-escapes do introduce a word-boundary. If you need to protect against that, `(\?|\-).
Alex Martelli
@Alex, why do you think `parse_qs` isn't portable? 2to3 fixes `urlparse` imports correctly.
Aaron Gallagher
@Aaron, it's been in `cgi` "since forever", in `urlparse` only recently -- right now, at work, I'm focused on writing code supporting 2.4 to 2.7, _and_ future 2to3 runs, and there's enough headaches that avoidable ones are best avoided (sure, worst case, you can import "conditionally", that is, with try/except support, but I don't know how well that plays with 2to3).
Alex Martelli
@Alex, I pity you having to use 2to3 in ways it was not intended for.
Aaron Gallagher
+4  A: 
import lxml.html, urlparse

d = lxml.html.parse(...)
for link in d.xpath('//a/@href'):
    url = urlparse.urlparse(link)
    if not url.query:
        continue
    params = urlparse.parse_qs(url.query)
    if 'userpage' in params.get('u', []) and '1' in params.get('p', []):
        print link
Aaron Gallagher
+1 nice answer.
nosklo
A: 

It is possible to do this with string hacking, but you shouldn't. It's already in the standard library:

>>> import urllib.parse
>>> urllib.parse.parse_qs("u=userpage&as=233&p=1")
{'u': ['userpage'], 'as': ['233'], 'p': ['1']}

and hence

import urllib.parse
def filtered_urls( urls ):
    for url in urls:
        try:
            attrs = urllib.parse.parse_qs( url.split( "?" )[ 1 ] )
        except IndexError:
            continue

        if "userpage" in attrs.get( "u", [] ) and "1" in attrs.get( "p", [] ):
            yield url

foo = [ "www.example.com?u=userpage&as=233&p=1", "www.example.com?u=userpage&as=233&p=2" ]

print( list( filtered_urls( foo ) ) )

Note that this is Python 3 -- in Python 2, parse_qs is in urlparse instead.

katrielalex
This raises a SyntaxError, and `'userpage' != ['userpage']`. Also, why not urlparse.urlparse to get the query out of the URL?
Aaron Gallagher
True, but a pretty trivial one (= for ==). Forgot about the lists though, thanks. And `urlparse` is fine, but overkill if we just want the query string.
katrielalex
`TypeError: argument of type 'NoneType' is not iterable`; you can't do `in` on `None`. Please *try* your solution before you post it.
Aaron Gallagher
Argh. I tried it and then changed the `None`s. +1 for being thorough, but I should point out that these are basically trivial bugs; if the OP wants to use the code they can easily fix the problems.
katrielalex
@killown: works fine for me. Did you forget to pass in `foo`?
katrielalex
+4  A: 

If you want to use, in my opinion, a more proper approach than a regexp:

from urlparse import *
urlparsed = urlparse('www.example.com?u=userpage&as=233&p=1')
# -> ParseResult(scheme='', netloc='', path='www.example.com', params='', query='u=userpage&as=233&p=1', fragment='')
qdict = dict(parse_qsl(urlparsed.query))
# -> {'as': '233', 'p': '1', 'u': 'userpage'}
qdict.get('p') == '1' and qdict.get('u') == 'userpage'
# -> True
mykhal
Ugh. `import *`. =p
katrielalex
@katrielalex what, you have not seen such a thing yet? :) btw, originally it was: `from urlparse import urlparse, parse_qsl`, but i shortened it for sake of readability (it's not the key part, and `from urlparse import urlparse` is also not very aesthetical)
mykhal
Heh, I know, don't worry. It's just one of those things like `don't parse HTML with regex` that come up an awful lot here. =p
katrielalex
+2  A: 

Regex is not a good choice for this because 1) the params could appear in either order, and 2) you need to do extra checks for query separators so that you don't match potential oddities like "flu=userpage", "sp=1", "u=userpage%20haha", or "s=123". (Note: I missed two of those cases in my first pass! So did others.) Also: 3) you already have a good URL parsing library in Python which does the work for you.

With regex you'd need something clumsy like:

q = re.compile(r'([?&]u=userpage&(.*&)?p=1(&|$))|([?&]p=1&(.*&)?u=userpage(&|$))')
return q.search(href) is not None

With urlparse you can do this. urlparse gives you a little more than you want but you can use a helper function to keep the result simple:

def has_qparam(qs, key, value):
    return value in qs.get(key, [])

qs = urlparse.parse_qs(urlparse.urlparse(href).query)
return has_qparam(qs, 'u', 'userpage') and has_qparam(qs, 'p', '1')
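For reference, the same helper under Python 3, where `urlparse` and `parse_qs` moved into `urllib.parse` (a sketch; the href is the one from the question):

```python
from urllib.parse import urlparse, parse_qs

# Same helper as above: parse_qs maps each key to a *list* of values,
# so membership testing handles repeated parameters too.
def has_qparam(qs, key, value):
    return value in qs.get(key, [])

href = 'www.example.com?u=userpage&as=233&p=1'
qs = parse_qs(urlparse(href).query)
print(has_qparam(qs, 'u', 'userpage') and has_qparam(qs, 'p', '1'))  # True
```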
Owen S.
those `qs['u'] == 'userpage'` won't work, because `parse_qs` dict has list values.. might be then `'userpage' in qs['u']`.. or use `parse_qsl` (tuple) and convert to dict
mykhal
Ah, yes, that's tripped me up in the past too :-P . Fixed.
Owen S.