ansaurus

Question

Remove the Query String from a Url in HTML with a Regular Expression

Answer 1

+4 A:

You can't usefully parse HTML with a regexp. If you know the format of the page in advance, e.g.

links are always in the form <a href="url with no unnecessary character escapes">, or
all links are absolute, and no other non-link strings beginning with http: exist

then you can just about get away with it, but for general [X]HTML a regexp parser is unsuitable.

Depending on what language you're using, you'd need to find either an HTML parser library (e.g. Python's BeautifulSoup) or an HTML tidier combined with a standard XML parser. Then scan the document for <a> elements (and maybe others, e.g. <img>, if you're interested in those) and split the attribute value on ‘?’.
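As a rough illustration of the scan-and-split approach described above, here is a minimal sketch. It swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra packages; the names LinkCollector, URL_ATTRS and strip_queries are made up for this example, and it only collects the cleaned URLs rather than rewriting the document.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect URLs from <a href> and <img src>, minus any query string."""

    # Hypothetical mapping: tag name -> attribute that holds the URL.
    URL_ATTRS = {"a": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value is not None:
                # Keep everything before the first '?'.
                self.urls.append(value.split("?", 1)[0])


def strip_queries(html):
    """Return the query-free URLs found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


sample = (
    '<p><a href="http://example.com/page?id=1">one</a> '
    "<a href='/rel?y'>two</a> <img src=\"pic.png?v=3\"></p>"
)
print(strip_queries(sample))
```

Because the parser hands over attribute values already unquoted and unescaped, the split does not need to care about quoting style, which is the main advantage over a regexp here.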

bobince 2008-11-07 10:57:01

Thank you bobince, I was actually using BeautifulSoup but was looking for a quick and dirty way rather than iterating through all the links.

EoghanM 2008-11-07 12:50:13

Answer 2

+1 A:

Re: Bobince's comment, the HTMLAgilityPack is a good HTML parser for .NET; it's more forgiving of incorrect markup than other parsers.

Using this will let you find all the A tags. You can then get each HREF and simply remove everything from the first '?' onward.

Andrew Bullock 2008-11-07 11:02:29

Answer 3

A:

Find this:

/href="([^\?"]*?)\?[^\"]*"/

Replace with:

href="\1"

You may have to watch out that it doesn't also strip the query strings from <link> tags, since their href attributes match too.
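For reference, the find-and-replace above might look like this in Python. This is only a sketch: the pattern is copied as given, while the names HREF_QUERY and strip_href_queries are invented for the example.

```python
import re

# Pattern as given above: capture the URL up to the first '?',
# then match (and drop) the rest up to the closing double quote.
HREF_QUERY = re.compile(r'href="([^\?"]*?)\?[^\"]*"')


def strip_href_queries(html):
    """Replace each matching href with the captured, query-free URL."""
    return HREF_QUERY.sub(r'href="\1"', html)


print(strip_href_queries('<a href="page.html?id=1">x</a>'))
# The <link> caveat: this href is rewritten as well.
print(strip_href_queries('<link href="style.css?v=2">'))
```

Links without a query string are left untouched, because the pattern requires a literal '?' inside the quoted value.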

nickf 2008-11-07 11:07:59

There are quite a few cases that won't match: href = "foo?bar", href = foo?bar (not valid, but it could still appear), and href='foo?bar'.
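A looser pattern can cover those three cases. The sketch below is one possible way to do it, not something taken from the answer above; LOOSE_HREF and strip_href_queries_loose are hypothetical names. It is still a regexp, so it will happily rewrite href-like text inside comments or scripts.

```python
import re

LOOSE_HREF = re.compile(
    r"""(href\s*=\s*)               # attribute name, optional spaces around the equals sign
        (?:
            "([^"?]*)\?[^"]*"       # double-quoted value with a query string
          | '([^'?]*)\?[^']*'       # single-quoted value with a query string
          | ([^\s"'>?]+)\?[^\s>]*   # unquoted value with a query string
        )""",
    re.IGNORECASE | re.VERBOSE,
)


def _drop_query(match):
    """Rebuild the attribute with its original quoting but no query string."""
    prefix = match.group(1)
    if match.group(2) is not None:
        return f'{prefix}"{match.group(2)}"'
    if match.group(3) is not None:
        return f"{prefix}'{match.group(3)}'"
    return prefix + match.group(4)


def strip_href_queries_loose(html):
    return LOOSE_HREF.sub(_drop_query, html)


for case in ('<a href = "foo?bar">', "<a href='foo?bar'>", "<a href = foo?bar>"):
    print(strip_href_queries_loose(case))
```

The replacement is a function rather than a template so that each match keeps whatever quoting style it originally had.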

Greg 2008-11-07 11:28:27
