I have a huge list of URLs, in the format:

What regex could I use to match the last three URLs but not the first two, so that every URL without a city attached is matched and the ones with cities are rejected?

Note: I am using Google Analytics, so I need regular expressions to monitor my URLs with its advanced feature. As of right now, Google is rejecting every regular expression I try.

A: 

Generally, the best suggestion I can make for parsing URLs with a regex is: don't.

Your time is much better spent finding a library for your language that is dedicated to processing URLs.

It will have worked out all the edge cases, be fully RFC compliant, be bug free, secure, and have a great user interface so you can just suck out the bits you really want.

In your case, the suggested approach is to use your URL library to extract the elements and then work on them explicitly.

That way, at most you'll have to deal with the path on its own, and not have to worry so much about whether it's

http://site.com/
https://site.com/
http://site.com:80/ 
http://www.site.com/

Unless you really want to.

For the path, you might even wish to use a splitter (or a dedicated path parser) to tokenise it into elements first, just to be sure.
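As a rough illustration of the parse-then-split approach described above, here is a minimal Python sketch using the standard library's `urllib.parse`. The function name `is_country_page`, the `section` parameter, and the sample URLs are all hypothetical (the question's actual URL list is not shown); the sketch assumes country pages look like `/dest/<country>` and city pages like `/dest/<country>/<city>`.

```python
from urllib.parse import urlsplit

def is_country_page(url, section="dest"):
    """Return True when the path is exactly /<section>/<country>,
    with or without a trailing slash (hypothetical helper)."""
    # urlsplit separates scheme, host, and port from the path,
    # so http/https, :80, and www variants need no special handling.
    path = urlsplit(url).path
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [part for part in path.split("/") if part]
    return len(segments) == 2 and segments[0] == section

# Assumed sample URLs, for illustration only.
samples = [
    "http://www.site.com/dest/france/paris",
    "https://site.com:80/dest/france/",
    "http://site.com/dest/spain",
]
for url in samples:
    print(url, is_country_page(url))
```

Because the check works on path segments rather than on the raw string, the scheme, port, and host never enter into it.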

Kent Fredric
People giving me downvotes because my answer is no longer relevant: please consider that it was posted *prior* to the OP stating that this was outside a programming language. In a programming language, using a parsing library *is* still the best way to go.
Kent Fredric
(The only good reason not to simply delete this answer is that others might unwittingly come here thinking the answer is to use a regex, without seeing that Google Analytics is a major part of the question. It stands to try to avert them from certain danger.)
Kent Fredric
A: 

tj111's current solution doesn't work - it matches all your URLs.

Here's one that works (I checked it against your values). It matches whether or not there is a trailing slash:

http:\/\/.*dest\/\w+/?$
Artem Russakovskii
A: 

/http:\/\/www\.site\.com\/dest\/\w+\/?$/i

This matches if they're all on the same site with "dest" in the path. You could also do this:

/\w+:\/\/[^/]+\/dest\/\w+\/?$/i

which will match any site with any protocol (http, ftp) that has /dest/country at the end, followed by an optional /.

Note that this will only work with a subset of what the URLs could legitimately be.
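To make the "subset" caveat concrete, here is a small Python sketch that runs the generic pattern above against a few made-up URLs. The sample URLs and the helper name `matches` are assumptions for illustration. One limitation it shows: `\w+` does not match a hyphen, so a hyphenated country segment is rejected.

```python
import re

# The generic pattern from the answer; the /i flag becomes re.IGNORECASE.
generic = re.compile(r"\w+:\/\/[^/]+\/dest\/\w+\/?$", re.IGNORECASE)

def matches(url):
    """Hypothetical helper: True if the pattern is found in the URL."""
    return generic.search(url) is not None

# Assumed sample URLs, not taken from the question.
for url in [
    "http://www.site.com/dest/france",        # country page
    "ftp://example.org/dest/spain/",          # other protocol, trailing slash
    "http://www.site.com/dest/france/paris",  # city page
    "http://www.site.com/dest/new-zealand",   # hyphenated country name
]:
    print(url, matches(url))
```

If hyphenated names occur in the real data, a character class such as `[^/]+` in place of `\w+` would be one way to allow them.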

Scott Dugas
A: 

Try this regular expression:

^http://www\.example\.com/dest/[^/]+/$

This would only match the last three URLs.
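One detail worth checking with this pattern is that the trailing slash is mandatory. The Python sketch below uses assumed sample URLs on `www.example.com`; the `lenient` variant with `/?` is a hypothetical tweak and not part of the answer.

```python
import re

# The pattern as given: anchored at both ends, trailing slash required.
strict = re.compile(r"^http://www\.example\.com/dest/[^/]+/$")

# Hypothetical variant that makes the trailing slash optional.
lenient = re.compile(r"^http://www\.example\.com/dest/[^/]+/?$")

# Assumed sample URLs, for illustration only.
with_slash = "http://www.example.com/dest/france/"
without_slash = "http://www.example.com/dest/france"
city_page = "http://www.example.com/dest/france/paris/"

for url in (with_slash, without_slash, city_page):
    print(url, bool(strict.match(url)), bool(lenient.match(url)))
```

Both versions reject the city page, because `[^/]+` cannot cross a slash and the `$` anchor forbids anything after the country segment.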

Gumbo