Given an HTML document, what is the most correct and concise regular expression pattern for removing the query string from each URL in the document?

+4  A: 

You can't usefully parse HTML with a regexp. If you know the format of the page in advance, e.g.

  • links are always in the form <a href="url with no unnecessary character escapes">, or
  • all links are absolute, and no other non-link strings beginning with http: exist

then you can just about get away with it, but for general [X]HTML a regexp parser is unsuitable.

Depending on what language you're using, you'd need to find either an HTML parser library (e.g. Python's BeautifulSoup) or an HTML tidier combined with a standard XML parser. Then scan the document for <a> elements (and maybe others, e.g. <img>, if you're interested in those), split each URL attribute value on ‘?’, and keep only the part before it.
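
As a rough sketch of the scan-and-split step, here is a version that uses only Python's standard-library html.parser rather than BeautifulSoup. The names URL_ATTRS, strip_query, LinkCollector and stripped_urls are made up for illustration, and the choice of which tags and attributes to inspect is an assumption. It only collects the stripped URLs; writing them back into the document depends on the parser library you end up using.

```python
from html.parser import HTMLParser

# Which attribute holds the URL for each tag of interest.
# This mapping is an illustrative assumption; extend it as needed.
URL_ATTRS = {"a": "href", "img": "src"}


def strip_query(url):
    """Keep only the part of the URL before the first '?'.

    Note: a fragment that follows the query string is dropped as well.
    """
    return url.partition("?")[0]


class LinkCollector(HTMLParser):
    """Collect query-stripped URLs from the tags listed in URL_ATTRS."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and
        # unescapes character references in attribute values.
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value is not None:
                self.urls.append(strip_query(value))


def stripped_urls(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    sample = (
        "<a href='a.html?x=1'>one</a>"
        '<A HREF="b.html?y=2&amp;z=3">two</A>'
        "<a href=c.html?q>three</a>"
        '<img src="pic.png?v=5">'
    )
    print(stripped_urls(sample))
```

Because the parser normalises quoting and case before the split happens, the same code covers single-quoted, unquoted and upper-case attributes without any special handling.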

bobince
Thank you bobince, I was actually using BeautifulSoup but was looking for a quick and dirty way rather than iterating through all the links.
EoghanM
+1  A: 

Re: bobince's comment, the HTMLAgilityPack is a good HTML parser for .NET; it's more forgiving of incorrect markup than other parsers.

Using this will let you find all the A tags; then you can get each HREF and simply remove everything from the '?' onward.

Andrew Bullock
A: 

Find this:

/href="([^\?"]*?)\?[^\"]*"/

Replace with:

href="\1"

You may have to watch out that it doesn't also strip the query strings from <link> tags, which use href too.
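
For illustration, this is roughly how the find/replace pair above could be applied in Python with re.sub. The function name strip_href_queries is made up for the example, and the pattern is the one shown above with the surrounding slashes dropped.

```python
import re

# The pattern from above, written as a Python raw string.
HREF_QUERY = re.compile(r'href="([^\?"]*?)\?[^\"]*"')


def strip_href_queries(html):
    """Remove the query string from every double-quoted href attribute."""
    return HREF_QUERY.sub(r'href="\1"', html)


if __name__ == "__main__":
    print(strip_href_queries('<a href="page.html?x=1&amp;y=2">link</a>'))
    # <link> tags are affected too, since they also carry an href.
    print(strip_href_queries('<link href="style.css?v=3">'))
```

Only attributes written exactly as href="..." are touched; anything else passes through unchanged.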

nickf
There are quite a few cases that won't match: href = "foo?bar", href = foo?bar (not valid, but it could still appear), and href='foo?bar'.
Greg
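
A more tolerant pattern that also accepts whitespace around the equals sign, single quotes and unquoted values might look like the sketch below. It is an assumption-laden illustration, not a general HTML solution: the names HREF_QUERY and strip_href_queries_tolerant are invented here, and it will still rewrite href attributes on any tag, including <link>.

```python
import re

HREF_QUERY = re.compile(
    r"""
    (href\s*=\s*)              # attribute name, optional spaces around the equals sign
    (?:
        "([^"?]*)\?[^"]*"      # double-quoted value
      | '([^'?]*)\?[^']*'      # single-quoted value
      | ([^\s"'>?]+)\?[^\s>]*  # unquoted value
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def _without_query(match):
    """Rebuild the attribute, keeping the original quoting style."""
    prefix = match.group(1)
    if match.group(2) is not None:
        return f'{prefix}"{match.group(2)}"'
    if match.group(3) is not None:
        return f"{prefix}'{match.group(3)}'"
    return f"{prefix}{match.group(4)}"


def strip_href_queries_tolerant(html):
    return HREF_QUERY.sub(_without_query, html)


if __name__ == "__main__":
    for snippet in ('<a href = "foo?bar">', "<a href=foo?bar>", "<a href='foo?bar'>"):
        print(strip_href_queries_tolerant(snippet))
```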