So, here's my question:
I have a crawler that downloads web pages and extracts the URLs from them (for future crawling). My crawler operates from a whitelist of URLs which are specified as regular expressions, so they're along the lines of:
(http://www.example.com/subdirectory/)(.*?)
...which would allow URLs that follow the pattern to be crawled in the future. The problem I'm having is that I'd like to exclude URLs containing certain characters, for example addresses such as:
(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)
...in the case above, I'd like to be able to exclude URLs that contain ?, #, or = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right:
(http://www.example.com/)([^=\?#](.*?))
etc. Any help would be really appreciated!
EDIT: sorry, should've mentioned this is written in Python, and I'm normally fairly proficient at regex (although this has me stumped)
EDIT 2: VoDurden's answer (the accepted one below) almost yields the correct result; all it needs is the $ character at the end of the expression, and then it works perfectly. Example:
(http://www.example.com/)([^=\?#]*)$
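For anyone who wants to check this quickly, here is a minimal sketch of how the patterns above behave in Python. The helper name is_allowed and the two test URLs are just illustrations I made up for this post, not part of my actual crawler, and I'm assuming the whitelist is checked with re.match (which anchors at the start of the string only):

```python
import re

# Final pattern from EDIT 2: after the prefix, only characters other than
# =, ? and # are allowed, all the way to the end of the URL.
ANCHORED = re.compile(r"(http://www.example.com/)([^=\?#]*)$")

# Same pattern without the trailing $ (the accepted answer as originally given).
UNANCHORED = re.compile(r"(http://www.example.com/)([^=\?#]*)")

# My earlier attempt: only the first character after the prefix is restricted,
# and the lazy (.*?) is happy to match nothing at all.
ATTEMPT = re.compile(r"(http://www.example.com/)([^=\?#](.*?))")


def is_allowed(url, pattern=ANCHORED):
    """Hypothetical helper: True if the URL matches the pattern from its start."""
    return pattern.match(url) is not None


clean = "http://www.example.com/subdirectory/somepage"
dirty = "http://www.example.com/subdirectory/somepage?param=1&param=5#print"

print(is_allowed(clean))              # True
print(is_allowed(dirty))              # False
print(is_allowed(dirty, UNANCHORED))  # True: matches up to the ? and stops
print(is_allowed(dirty, ATTEMPT))     # True: only the "s" is checked
```

Without the $, re.match is satisfied by a match on the leading part of the URL, which seems to be why the unwanted URLs still got through. I believe re.fullmatch (Python 3.4+) would give the same effect without the explicit $, but I haven't switched the crawler over to it.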