I have a string with roughly 2k URLs embedded in it and need help with a regular expression pattern to extract them.

Example of a string with an embedded URL:

"blahblahblah;http://subdomain.server.com/index.asp?id=1000;blahblahblah;"

The URL will always begin with "http://subdomain.server.com/" and end with the first ";". The only thing that changes is the subdirectories and pages.

From the example above, I need to capture "http://subdomain.server.com/index.asp?id=1000"

I've tried (http://subdomain.server.com/).*(;), but it doesn't stop at the first ";". It actually grabs "http://subdomain.server.com/index.asp?id=1000;blahblahblah;"
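
For reference, a minimal sketch that reproduces the behaviour. Python's `re` module is an assumption here, since the question doesn't name a language, and `text` is just the example string from above.

```python
import re

text = "blahblahblah;http://subdomain.server.com/index.asp?id=1000;blahblahblah;"

# Greedy .* first consumes the rest of the string, then backtracks
# only as far as the LAST ";" it can find.
greedy = re.search(r"(http://subdomain.server.com/).*(;)", text)
print(greedy.group(0))
```

The whole match runs through to the final ";" rather than stopping at the first one.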

Any help would be greatly appreciated.

Thank you!

A: 

Never mind, I got it. Making the quantifier lazy does the trick: (http://subdomain.server.com/).*?(;)

Joey
fix for capture groups: `(http://subdomain.server.com/.*?)(?:;)`
jnpcl
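
A rough sketch of the difference between the two patterns above, assuming Python's `re` module (the thread doesn't say which language is in use). The variable names are only for illustration.

```python
import re

text = "blahblahblah;http://subdomain.server.com/index.asp?id=1000;blahblahblah;"

# Lazy .*? stops at the first ";", but group 1 holds only the fixed prefix
# and the whole match still includes the trailing ";".
lazy = re.search(r"(http://subdomain.server.com/).*?(;)", text)
print(lazy.group(0))
print(lazy.group(1))

# Moving .*? inside the group captures the full URL without the ";".
fixed = re.search(r"(http://subdomain.server.com/.*?)(?:;)", text)
print(fixed.group(1))
```

With the second pattern, group 1 is the URL on its own, so there is no need to trim the semicolon afterwards.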
A: 

Any reason why you can't just use whatever your language's string.split(';') equivalent is?
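
Something along these lines, sketched in Python as an assumed language. It relies on every URL sitting directly after a ";" (or at the very start of the string), as in the example from the question; if a URL can appear in the middle of a field, this won't find it.

```python
text = "blahblahblah;http://subdomain.server.com/index.asp?id=1000;blahblahblah;"
prefix = "http://subdomain.server.com/"

# Split on ";" and keep only the fields that start with the known prefix.
urls = [field for field in text.split(";") if field.startswith(prefix)]
print(urls)
```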

ceasterday
+1  A: 

A more accurate regular expression would be (http://subdomain\.server\.com/[^;]*);

It matches the domain, then any run of characters other than a semicolon, and then the semicolon at the end. The backslashes before the periods are needed to escape them, as an unescaped period is a special character in regex that matches any character.
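
As a quick sketch of how this could be applied to the whole string, here is a Python version (the language is an assumption, and the second URL in `text` is made-up sample data so there is more than one match to find).

```python
import re

# Sample input: the first URL is from the question, the second is hypothetical.
text = (
    "blahblahblah;http://subdomain.server.com/index.asp?id=1000;"
    "blahblahblah;http://subdomain.server.com/dir/page.asp?id=2;"
)

# Escaped dots match literal periods; [^;]* cannot run past a semicolon.
pattern = re.compile(r"(http://subdomain\.server\.com/[^;]*);")

# With a single capture group, findall returns the group-1 strings.
urls = pattern.findall(text)
print(urls)

# Without the backslashes, "." matches any character, so this also matches.
loose = re.compile(r"http://subdomain.server.com/")
print(loose.search("http://subdomainXserverYcom/") is not None)
```

Because the negated character class can never consume a semicolon, no lazy quantifier is needed and the match always ends at the first ";".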

blwy10