tags:
views: 104
answers: 6

I currently have this regex:

re.compile('userpage')

and hrefs like these:

href="www.example.com?u=userpage&as=233&p=1"
href="www.example.com?u=userpage&as=233&p=2"

I want to get all urls that have both u=userpage and p=1.

How can I modify the regex above to find both u=userpage and p=1?

A: 

/((u=userpage).*?(p=1))|((p=1).*?(u=userpage))/

This will get all strings that contain the two bits you're looking for.
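In Python form, that pattern looks like the sketch below (the sample url list is mine, just to illustrate; the alternation covers either parameter order):

```python
import re

# The answer's pattern as a Python raw string: branch 1 matches
# u=userpage followed by p=1, branch 2 matches the reverse order.
pattern = re.compile(r'(u=userpage).*?(p=1)|(p=1).*?(u=userpage)')

urls = [
    'www.example.com?u=userpage&as=233&p=1',  # both params, in order
    'www.example.com?p=1&x=9&u=userpage',     # reversed order
    'www.example.com?u=userpage&as=233&p=2',  # p=2, not p=1
]
matches = [u for u in urls if pattern.search(u)]
print(matches)  # first two urls only
```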

Borealid
ugly and inefficient, but probably working :) btw, i'd insert `\>` after `p=1`'s
mykhal
@mykhal, then you'd get an unusable RE (since it's not what Python REs use to indicate word boundaries -- there are many dialects of REs, and I think you're thinking about e.g. vim's). If you used instead `\b`, like in my answer, you wouldn't have this problem (since it _is_ what Python REs use for the purpose;-).
Alex Martelli
@Alex Martelli, you *still* do have problems. See my comment on your answer.
Aaron Gallagher
@[Alex Martelli] you're right, sir, thanks. i do not use the boundary in python rex often, if ever.. :)
mykhal
@downvoter: I think this regex works. Did I say something incorrect? Or was it just an answer you ideologically disagreed with? If so, post a comment.
Borealid
@Aaron, seen and answered. And, how can you compare a pattern with `\>`, which will match **no** URLs of interest, with one with `\b`, which will falsely-match some values containing %-escapes?!
Alex Martelli
@Borealid, look at the comments on all of the other regex-using answers. You get a lot of false positives.
Aaron Gallagher
A: 

To make sure you don't accidentally match parts like `bu=userpage`, `u=userpagezap`, `p=111` or `zap=1`, you need abundant use of the `\b` "word-boundary" RE pattern element. I.e.:

re.compile(r'\bp=1\b.*\bu=userpage\b|\bu=userpage\b.*\bp=1\b')

The word-boundary elements in the RE's pattern prevent the above-mentioned, presumably-undesirable "accidental" matches. Of course, if in your application they're not "undesirable", i.e., if you positively want to match p=123 and the like, you can easily remove some or all of the word-boundary elements above!-)
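A quick sketch of the boundaries at work (the second sample URL is mine, built from the "accidental" cases above):

```python
import re

# \b anchors require a word/non-word transition, so p=111, zap=1 and
# bu=userpage cannot satisfy \bp=1\b or \bu=userpage\b.
pat = re.compile(r'\bp=1\b.*\bu=userpage\b|\bu=userpage\b.*\bp=1\b')

print(bool(pat.search('www.example.com?u=userpage&as=233&p=1')))    # True
print(bool(pat.search('www.example.com?bu=userpage&zap=1&p=111')))  # False
```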

Alex Martelli
+1 Forgot about partial matches.
NullUserException
`\b` doesn't protect you from everything. Your code still breaks on `?u=userpage%20whatever`.
Aaron Gallagher
@Aaron, true, %-escapes do introduce a word-boundary. If you need to protect against that, `(\?|\-).
Alex Martelli
@Alex, why do you think `parse_qs` isn't portable? 2to3 fixes `urlparse` imports correctly.
Aaron Gallagher
@Aaron, it's been in `cgi` "since forever", in `urlparse` only recently -- right now, at work, I'm focused on writing code supporting 2.4 to 2.7, _and_ future 2to3 runs, and there's enough headaches that avoidable ones are best avoided (sure, worst case, you can import "conditionally", that is, with try/except support, but I don't know how well that plays with 2to3).
Alex Martelli
@Alex, I pity you having to use 2to3 in ways it was not intended for.
Aaron Gallagher
+4  A: 
import lxml.html, urlparse

d = lxml.html.parse(...)
for link in d.xpath('//a/@href'):
    url = urlparse.urlparse(link)
    if not url.query:
        continue
    params = urlparse.parse_qs(url.query)
    if 'userpage' in params.get('u', []) and '1' in params.get('p', []):
        print link
Aaron Gallagher
+1 nice answer.
nosklo
A: 

It is possible to do this with string hacking, but you shouldn't. It's already in the standard library:

>>> import urllib.parse
>>> urllib.parse.parse_qs("u=userpage&as=233&p=1")
{'u': ['userpage'], 'as': ['233'], 'p': ['1']}

and hence

import urllib.parse
def filtered_urls( urls ):
    for url in urls:
        try:
            attrs = urllib.parse.parse_qs( url.split( "?" )[ 1 ] )
        except IndexError:
            continue

        if "userpage" in attrs.get( "u", [] ) and "1" in attrs.get( "p", [] ):
            yield url

foo = [ "www.example.com?u=userpage&as=233&p=1", "www.example.com?u=userpage&as=233&p=2" ]

print( list( filtered_urls( foo ) ) )

Note that this is Python 3 -- in Python 2, parse_qs is in urlparse instead.

katrielalex
This raises a SyntaxError, and `'userpage' != ['userpage']`. Also, why not urlparse.urlparse to get the query out of the URL?
Aaron Gallagher
True, but a pretty trivial one (= for ==). Forgot about the lists though, thanks. And `urlparse` is fine, but overkill if we just want the query string.
katrielalex
`TypeError: argument of type 'NoneType' is not iterable`; you can't do `in` on `None`. Please *try* your solution before you post it.
Aaron Gallagher
Argh. I tried it and then changed the `None`s. +1 for being thorough, but I should point out that these are basically trivial bugs; if the OP wants to use the code they can easily fix the problems.
katrielalex
@killown: works fine for me. Did you forget to pass in `foo`?
katrielalex
+4  A: 

If you want to use, in my opinion, a more proper approach than a regexp:

from urlparse import *
urlparsed = urlparse('www.example.com?u=userpage&as=233&p=1')
# -> ParseResult(scheme='', netloc='', path='www.example.com', params='', query='u=userpage&as=233&p=1', fragment='')
qdict = dict(parse_qsl(urlparsed.query))
# -> {'as': '233', 'p': '1', 'u': 'userpage'}
qdict.get('p') == '1' and qdict.get('u') == 'userpage'
# -> True
mykhal
Ugh. `import *`. =p
katrielalex
@katrielalex what, you have not seen such a thing yet? :) btw, originally it was: `from urlparse import urlparse, parse_qsl`, but i shortened it for sake of readability (it's not the key part, and `from urlparse import urlparse` is also not very aesthetical)
mykhal
Heh, I know, don't worry. It's just one of those things like `don't parse HTML with regex` that come up an awful lot here. =p
katrielalex
+2  A: 

Regex is not a good choice for this because 1) the params could appear in either order, and 2) you need to do extra checks for query separators so that you don't match potential oddities like "flu=userpage", "sp=1", "u=userpage%20haha", or "s=123". (Note: I missed two of those cases in my first pass! So did others.) Also: 3) you already have a good URL parsing library in Python which does the work for you.

With regex you'd need something clumsy like:

q = re.compile(r'([?&]u=userpage&(.*&)?p=1(&|$))|([?&]p=1&(.*&)?u=userpage(&|$))')
return q.search(href) is not None

With urlparse you can do this. urlparse gives you a little more than you want but you can use a helper function to keep the result simple:

def has_qparam(qs, key, value):
    return value in qs.get(key, [])

qs = urlparse.parse_qs(urlparse.urlparse(href).query)
return has_qparam(qs, 'u', 'userpage') and has_qparam(qs, 'p', '1')
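For reference, the same helper under Python 3, where `urlparse` and `parse_qs` moved into `urllib.parse` (a sketch; the href is the one from the question):

```python
from urllib.parse import urlparse, parse_qs

# Same helper as above: parse_qs maps each key to a *list* of values,
# so membership testing handles repeated parameters too.
def has_qparam(qs, key, value):
    return value in qs.get(key, [])

href = 'www.example.com?u=userpage&as=233&p=1'
qs = parse_qs(urlparse(href).query)
print(has_qparam(qs, 'u', 'userpage') and has_qparam(qs, 'p', '1'))  # True
```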
Owen S.
those `qs['u'] == 'userpage'` won't work, because `parse_qs` dict has list values.. might be then `'userpage' in qs['u']`.. or use `parse_qsl` (tuple) and convert to dict
mykhal
Ah, yes, that's tripped me up in the past too :-P . Fixed.
Owen S.