views:

84

answers:

4

I want a regex to find strings of the following form:

http://anything.abc.tld

where

abc -> abc always remains abc

anything -> it could be any string

tld -> it could be any tld (top-level-domain) like .com .net .co.in .co.uk etc.

Note: The URL must not contain anything else at the end; that is, http://anything.abc.tld/xyz is not acceptable.

Note: The list of TLDs is long, and there is always a chance of forgetting some, so I don't want to write every TLD into the regex. Instead I would like a regex that checks the following (for the TLD):

  • After abc, there is a period (.)

  • After the period (.) there is at least one character

+2  A: 

There are quite a lot of TLDs, and their number is growing. You could use

^http://[\w.-]+\.abc\.(com|net|co\.in|....  )/?$

But this would have to be maintained on a regular basis. Simply using [^/]+ for the TLD might be easier. This would look like

^http://[\w.-]+\.abc\.[^/]+/?$
Jens
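For anyone who wants to try the second pattern, here is a minimal sketch in Python (the question does not name a language, so Python and the helper name `is_abc_url` are assumptions made for illustration):

```python
import re

# Pattern taken from the answer above. In Python 3, \w is Unicode-aware,
# so it accepts more than ASCII letters, digits and underscore.
ABC_URL = re.compile(r"^http://[\w.-]+\.abc\.[^/]+/?$")


def is_abc_url(url: str) -> bool:
    """Return True if url has the shape http://anything.abc.tld (optional trailing slash)."""
    return ABC_URL.match(url) is not None
```

With this sketch, `http://anything.abc.com`, `http://anything.abc.co.uk` and `http://anything.abc.com/` are accepted, while `http://anything.abc.com/xyz` and `http://abc.com` are rejected.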
@Jens Kindly see the updated question.
Yatendra Goel
@Jens What would you say if I use `^http://[^/]+\.abc\.[^/]+/?$`, so that we don't have to think about which characters a URL can contain?
Yatendra Goel
@Yatendra: Sounds good. You may want to think about using `^http://([^/]+\.)?abc\.[^/]+/?$` if you want to allow something like `http://abc.com`.
Jens
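The variant with the optional subdomain group can be sketched in Python as follows (Python and the name `is_abc_url_loose` are assumptions for illustration). The last comment in the code points out a trade-off of `[^/]+`: it accepts every character except a slash, including `?`.

```python
import re

# Variant from the comment above: the "anything." part is optional.
ABC_URL_LOOSE = re.compile(r"^http://([^/]+\.)?abc\.[^/]+/?$")


def is_abc_url_loose(url: str) -> bool:
    """Return True for http://abc.tld as well as http://anything.abc.tld."""
    return ABC_URL_LOOSE.match(url) is not None


# Caveat: [^/]+ also accepts '?', so a URL whose real host is evil.com
# but whose query string contains ".abc.com" still passes the check.
looks_fine_but_is_not = is_abc_url_loose("http://evil.com?x.abc.com")
```

If the check is meant to guard anything security-relevant, a narrower character class for the host part is probably the safer choice.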
+1  A: 
^http://[a-zA-Z0-9.-]+\.abc\.[a-zA-Z.]+/?$

It might differ a little depending on which regex dialect you are using.

gpvos
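As one example of a dialect difference, in Python's `re` module `$` also matches just before a trailing newline, so a string ending in `\n` passes the anchored pattern. A rough sketch, assuming Python (the helper names are made up for illustration), that compares this with `fullmatch`:

```python
import re

# Body of the pattern from the answer above, without the anchors.
HOST = r"http://[a-zA-Z0-9.-]+\.abc\.[a-zA-Z.]+/?"

WITH_DOLLAR = re.compile("^" + HOST + "$")
UNANCHORED = re.compile(HOST)


def matches_with_dollar(s: str) -> bool:
    """Anchored with ^...$; in Python, $ tolerates one trailing newline."""
    return WITH_DOLLAR.search(s) is not None


def matches_strict(s: str) -> bool:
    """fullmatch requires the whole string, newline included, to match."""
    return UNANCHORED.fullmatch(s) is not None
```

Here `matches_with_dollar("http://x.abc.com\n")` is true while `matches_strict("http://x.abc.com\n")` is false. Both reject `http://my_host.abc.com`, because the character class has no underscore.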
Are underscores not allowed in subdomain names?
Jens
I don't think they are
used2could
Nope. They used to be allowed a long time ago, but are forbidden now. Also, I don't think it's likely dashes or numbers will ever be used in a tld, so I left them out there.
gpvos
A: 

First identify which kind of data you will be dealing with: are these line-based records, or XML, for example (they could be anything else)? That will tell you how you need to anchor the matches. If you can anchor them with ^, that makes it easier. Do you need a variable number of strings between "http://" and the top-level domain? If you don't want to write out the top-level domain, then use

\.[a-z]\{2,3\}

The exact form will depend on whether you are using Basic Regular Expressions (sed, grep) or Extended Regular Expressions (awk), or Perl Compatible Regular Expressions.
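To illustrate the dialect point, here is a small sketch in Python (an assumption; the answer itself only mentions sed, grep, awk and PCRE). In Python's `re`, as in PCRE, the interval braces are written unescaped, and `\{` means a literal brace:

```python
import re

# Python/PCRE-style spelling: a dot followed by 2 or 3 lowercase letters at the end.
TLD_SUFFIX = re.compile(r"\.[a-z]{2,3}$")

# The BRE spelling from the answer, compiled by Python: here \{ and \} are
# literal braces, so this looks for the text "{2,3}" instead of repeating.
BRE_STYLE = re.compile(r"\.[a-z]\{2,3\}$")


def ends_with_short_tld(host: str) -> bool:
    """Return True if host ends in a dot plus 2 or 3 lowercase letters."""
    return TLD_SUFFIX.search(host) is not None
```

Note that a 2-to-3-letter limit rejects longer TLDs such as `.info`, so the upper bound may need to be raised or dropped.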

What have you tried already? How have you tested it?

Joel J. Adamson
+1  A: 

^(http://)([^/]+)\.(abc)\.([^/]+)$

All grouped for you too :)
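To see what the groups capture, here is a sketch in Python (an assumption, as is the helper name `split_abc_url`). It uses a grouped pattern with both dots escaped and slashes excluded from the host part, `^(http://)([^/]+)\.(abc)\.([^/]+)$`:

```python
import re
from typing import Optional, Tuple

# Four groups: scheme, the "anything" part, the fixed "abc", and the TLD.
GROUPED = re.compile(r"^(http://)([^/]+)\.(abc)\.([^/]+)$")


def split_abc_url(url: str) -> Optional[Tuple[str, ...]]:
    """Return the four captured groups, or None if url does not match."""
    m = GROUPED.match(url)
    return m.groups() if m else None
```

For `http://a.b.abc.co.uk` this returns `('http://', 'a.b', 'abc', 'co.uk')`, and for `http://anything.abc.com/xyz` it returns `None`.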

I highly suggest using the RegEx tool by gskinner.com

used2could