Hi there!

For my users I need to present a screen where they can enter multiple domain names in a textarea. Users can put the domain names on separate lines, or separate them with spaces or commas (maybe even semicolons, I don't know!).

I need to parse the input and identify the individual domain names with their extension (which will be .com; anything else can be ignored).

User input can look like:

asdf.com

qwer.com

AND/OR

wqer.com, gwew.com

AND/OR

ertert.com gdfgdf.com

No one should input a three-level domain like www.abczone.com, but if they do I'm only interested in extracting the abczone.com part. (I can have a separate regex to verify/extract that from each.)

+1  A: 

This will do it:

(\b[a-zA-Z][a-zA-Z0-9-]*)(?=\.com\b)

"Find all sequences of a letter followed by letters, digits, or hyphens, followed by .com then a word break."

(You need the last bit to protect against picking up bim.com from bim.command.com.)

Python test case because I don't have a PHP test environment to hand:

import re

DATA = "asdf.com\nx-123.com, gwew.com bim.command.com 123.com, x_x.com"
# findall returns only the captured name; the lookahead checks for ".com" without consuming it
print(re.findall(r'(\b[a-zA-Z][a-zA-Z0-9-]*)(?=\.com\b)', DATA))
# Prints ['asdf', 'x-123', 'gwew', 'command']
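To see why the trailing \b matters, here is a small side-by-side sketch. The variable names strict and loose are just for illustration; the only difference between the two patterns is the final \b.

```python
import re

DATA = "asdf.com bim.command.com"

# With the trailing \b, ".com" must end at a word boundary.
strict = re.findall(r'\b[a-zA-Z][a-zA-Z0-9-]*(?=\.com\b)', DATA)

# Without it, ".com" may be the start of a longer label such as ".command".
loose = re.findall(r'\b[a-zA-Z][a-zA-Z0-9-]*(?=\.com)', DATA)

print(strict)  # ['asdf', 'command']
print(loose)   # ['asdf', 'bim', 'command']
```

The loose version picks up bim because the text right after it starts with ".com", even though the full label is ".command".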
RichieHindle
Almost, but: 1) a domain name cannot start with numbers, 2) a domain name cannot contain more than 63 chars, 3) a domain name cannot contain "_".
Alix Axel
@eyze: Fixed 1 and 3.
RichieHindle
@RichieHindle: Also, why is the .com inside a non-capturing group? In my view there is no need for it.
Alix Axel
@eyze: No, you're probably right.
RichieHindle
Thanks folks! I modified it slightly to include the .com portion and to work with PHP, and it seems to work great! Thanks again! Here's what I used: preg_match_all('/(\b[a-zA-Z][a-zA-Z0-9-]*)\.com\b/', ...)
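For anyone following along in Python rather than PHP, a rough equivalent of the pattern in this comment might look like the sketch below. The function name extract_domains and the sample input are made up for illustration; the regex itself is the one quoted above.

```python
import re

# The pattern from the comment above, with ".com" consumed by the match.
PATTERN = re.compile(r'(\b[a-zA-Z][a-zA-Z0-9-]*)\.com\b')

def extract_domains(text):
    """Return each matched name with its .com suffix, in input order."""
    # group(0) is the whole match (name plus ".com"); group(1) is the name only.
    return [m.group(0) for m in PATTERN.finditer(text)]

# Hypothetical textarea content mixing newlines, commas, semicolons and spaces.
user_input = "asdf.com\nqwer.com, gwew.com; ertert.com www.abczone.com foo.net"
print(extract_domains(user_input))
# ['asdf.com', 'qwer.com', 'gwew.com', 'ertert.com', 'abczone.com']
```

Note that www.abczone.com comes out as abczone.com without any extra step, because www is followed by ".abczone" rather than ".com", and foo.net is ignored.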
Steve
A: 

Here it is. You can use the i modifier and drop the uppercase A-Z if you want to:

\b([a-zA-Z][0-9a-zA-Z\-]{1,62})\.com\b
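As a quick illustration of the case-insensitive variant, here is a minimal Python sketch, assuming re.IGNORECASE as the counterpart of the i modifier. The sample string is invented for the example.

```python
import re

# re.IGNORECASE plays the role of the "i" modifier, so A-Z can be dropped.
PATTERN = re.compile(r'\b([a-z][0-9a-z-]{1,62})\.com\b', re.IGNORECASE)

result = PATTERN.findall("ASDF.com, x-123.com qwer.COM x.com")
print(result)  # ['ASDF', 'x-123', 'qwer']
```

One thing to be aware of: {1,62} requires at least one character after the leading letter, so a single-letter name such as x.com is skipped. Using {0,62} instead would allow it.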
Alix Axel
Sadly this also fails for "this-domain-name-is-longer-than-63-characters-and-hence-not-valid.com", returning "domain-name-is-longer-than-63-characters-and-hence-not-valid".
RichieHindle
@RichieHindle: I disagree, it finds a substring that can be considered a valid domain. It's either that or nothing, whereas your implementation just returns a domain name that cannot exist.
Alix Axel
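If the "nothing" outcome from the exchange above is what you want for overlong names, one possible variation is to replace the leading \b with a negative lookbehind so a match cannot start in the middle of a longer run of name characters. This is only a sketch under that assumption, not something either answer proposed; the names PATTERN and LONG are illustrative.

```python
import re

# Assumption: a match must not begin right after a letter, digit, underscore
# or hyphen, so an overlong label is skipped rather than trimmed to a suffix.
PATTERN = re.compile(r'(?<![\w-])([a-zA-Z][0-9a-zA-Z-]{1,62})\.com\b')

LONG = "this-domain-name-is-longer-than-63-characters-and-hence-not-valid.com"

result = PATTERN.findall("asdf.com " + LONG + " x_x.com")
print(result)  # ['asdf']
```

With the plain \b version, a match may start right after a hyphen, which is how the 60-character tail of the long name gets returned.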