Given an HTML document, what is the most correct and concise regular expression pattern to remove the query string from each URL in the document?
You can't usefully parse HTML with a regexp. If you know the format of the page in advance, e.g.
- links are always in the form <a href="url with no unnecessary character escapes">, or
- all links are absolute, and no other non-link strings beginning with http: exist
then you can just about get away with it, but for general [X]HTML a regexp parser is unsuitable.
Depending on what language you're using, you'd need to find either an HTML parser library (e.g. Python's BeautifulSoup) or an HTML tidier combined with a standard XML parser. Then scan the document for <a> elements (and maybe others, e.g. <img>, if you're interested in those) and split the relevant attribute value (href, or src for <img>) on ‘?’, keeping the part before it.
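To illustrate the parser route, here is a minimal sketch in Python. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs, and it only collects the cleaned href values of <a> elements instead of rewriting the document. The names strip_query, LinkCollector and stripped_links are made up for this example. It also uses urlsplit instead of a plain split on '?', which is a swap on my part: it keeps any #fragment that follows the query string.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its query string; a #fragment, if any, is kept."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


class LinkCollector(HTMLParser):
    """Collects the href of every <a> element, with the query string removed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and unescapes values.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(strip_query(value))


def stripped_links(html):
    """Hypothetical helper: parse html and return the cleaned link targets."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Because the parser handles quoting, case and character escapes, this copes with markup like <A HREF='/rel?q=1#top'> that a simple regexp would miss. Extending it to <img> would mean checking the src attribute in the same handler.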
Re: Bobince's comment: HTMLAgilityPack is a good HTML parser for .NET; it's more forgiving of incorrect markup than other parsers.
Using it, you can find all the A tags, then get each HREF and simply remove everything from the '?' onwards.
Find this:
/href="([^\?"]*?)\?[^\"]*"/
Replace with:
href="\1"
You may have to watch out that it doesn't also strip the query strings from <link> tags.
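For reference, a rough Python version of this find-and-replace might look like the following. The function name strip_href_queries is made up for the example, and the pattern is the one above with the redundant backslashes inside the character classes dropped. It is a sketch under the same assumptions as the regexp itself: double-quoted, lowercase href attributes only.

```python
import re

# Same pattern as above, minus the escapes that are unnecessary inside [...].
# Group 1 captures everything between the opening quote and the first '?'.
QUERY_IN_HREF = re.compile(r'href="([^?"]*?)\?[^"]*"')


def strip_href_queries(html):
    """Drop everything from '?' up to the closing quote in each href."""
    return QUERY_IN_HREF.sub(r'href="\1"', html)
```

Note that this touches every href, so <link href="style.css?v=2"> is rewritten too, and a #fragment that follows the query string is removed along with it. Single-quoted or unquoted attributes are left untouched.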