I need a regex that will give me the following results from each example and I can't seem to get it right:

example.com yields -> nothing / empty

www.example.com yields -> nothing / empty

account.example.com yields -> account

mywww.example.com yields -> mywww

wwwboys.example.com yields -> wwwboys

cool-www.example.com yields -> cool-www

So, it doesn't matter if they use 'www' in the subdomain, but it can't be only 'www'. It can also contain hyphens.

+1  A: 
x="""example.com yields -> nothing / empty

www.example.com yields -> nothing / empty

account.example.com yields -> account

mywww.example.com yields -> mywww

wwwboys.example.com yields -> wwwboys

cool-www.example.com yields -> cool-www"""

>>> import re
>>> re.findall(r"^([A-Za-z0-9-]+)\.(?<!^www\.)[A-Za-z0-9-]+\.[A-Za-z]+", x, re.MULTILINE)
['account', 'mywww', 'wwwboys', 'cool-www']
S.Mark
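For checking one hostname at a time rather than a multi-line block, the same idea can be sketched with a negative lookahead in place of the lookbehind. This is a variation, not S.Mark's pattern: the name `extract_subdomain` is made up for illustration, and it assumes the host has exactly three labels, as in the question's examples.

```python
import re

# Hypothetical helper: one subdomain label, then a domain and a TLD.
# (?!www\.) rejects a first label that is exactly "www".
SUBDOMAIN_RE = re.compile(r"^(?!www\.)([A-Za-z0-9-]+)\.[A-Za-z0-9-]+\.[A-Za-z]+$")


def extract_subdomain(host):
    """Return the subdomain label, or an empty string if there is none."""
    match = SUBDOMAIN_RE.match(host)
    return match.group(1) if match else ""


for host in ("example.com", "www.example.com", "cool-www.example.com"):
    print(repr(host), "->", repr(extract_subdomain(host)))
```

Because the lookahead only rules out a label that is followed immediately by a dot, names such as `wwwboys` or `mywww` still pass.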
+1  A: 
mystrings="""
example.com
www.example.com
account.example.com
mywww.example.com
wwwboys.example.com
cool-www.example.com
"""

junk = ["example.com", "www.example.com"]
for url in mystrings.split("\n"):
    url = url.strip()
    # skip blank lines and hosts that have no usable subdomain
    if url and url not in junk:
        print("-->", url.split(".", 2)[0])

output

$ ./python.py
--> account
--> mywww
--> wwwboys
--> cool-www
ghostdog74
@ghostdog74 +1 Wow, so no re module needed? Thanks.
orokusaki
@ghostdog74 On second thought, that's way better. Now I can have a configuration setting to add more default not-allowed subdomains (like `api.example.com`, etc).
orokusaki
This fails for other input, like "www.google.com" or "www.stackoverflow.com", because it doesn't really check if the subdomain is "www".
Roger Pate
Right, but the OP's sample strings are just that: samples, all using "example.com". "example.com" may be "google.com" for all you know.
ghostdog74
@Roger @ghost It's OK. It showed me what I needed to make my function. I'll post an extra answer below with my solution built on this.
orokusaki
A: 

Here's my solution based on ghostdog74's example:

OFF_LIMITS = ('api', 'www', 'secure', 'account')

def get_safe_subdomain_or_none(host):
    subdomain = None
    L = host.split('.')
    if len(L) is 3 and not L[0] in OFF_LIMITS:  # 3 ensures that you don't have a sub-sub domain, and that you don't have just `example.com`
        subdomain = L[0]
    return subdomain
orokusaki
Use == instead of *is* with numbers. What about "www.blah.example.com"?
Roger Pate
@Roger In the case of `www.blah.example.com`, it returns `None` as it should, but I could modify it to sort out sub-sub domains. Also, I only used `is` instead of `==` because `is` is sort of like `===` and I know that it needs to be exactly `3`. Is that frowned upon in the Python world for esoteric style reasons or is it bad practice? Either way, I can change it.
orokusaki
I asked about www.blah... because it wasn't clear to me what behavior you wanted in that case. *is* is not sort of like ===; *is* checks object identity, while === (in other languages) checks value and type. You will almost exclusively use *is* with None and similar singletons.
Roger Pate
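A small illustration of the identity-versus-equality point, using lists so the result does not depend on interpreter internals. The variable names are only for the example.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separate object
alias = a       # a second name for the same object

same_value = (a == b)         # True: the values compare equal
same_object = (a is b)        # False: two distinct list objects
alias_is_same = (alias is a)  # True: both names refer to one object

# For counts and other numbers, compare by value with ==.
labels = "account.example.com".split(".")
count_ok = (len(labels) == 3)

# `is` fits singletons such as None.
result = None
is_missing = (result is None)

print(same_value, same_object, alias_is_same, count_ok, is_missing)
```

`len(L) is 3` happens to work in CPython because small integers are cached, but that is an implementation detail, and Python 3.8 and later emit a SyntaxWarning for `is` used with a literal.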
@Roger Roger that. I read that before, but for some reason it didn't stay very long. Thank you.
orokusaki
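Putting the feedback from the comments together, a revised sketch of the function might look like this. It is one possible reading, not orokusaki's final code: the `off_limits` parameter is an assumption added so the blocked list can come from configuration.

```python
OFF_LIMITS = ('api', 'www', 'secure', 'account')


def get_safe_subdomain_or_none(host, off_limits=OFF_LIMITS):
    """Return the subdomain of a three-label host, or None if it is not allowed."""
    labels = host.split('.')
    # Exactly three labels: rules out bare `example.com` and sub-subdomains.
    # Compare the count with ==, not `is`.
    if len(labels) == 3 and labels[0] not in off_limits:
        return labels[0]
    return None


print(get_safe_subdomain_or_none('mywww.example.com'))     # mywww
print(get_safe_subdomain_or_none('www.google.com'))        # None
print(get_safe_subdomain_or_none('www.blah.example.com'))  # None
```

Since only the first label is checked, `www.google.com` is rejected just like `www.example.com`.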