I see two problems to solve.
The first one: there is no central directory of all URLs in the world, and not even every site you already know will expose a sitemap.
One idea would be to check whether a search engine (Google or another) lets you search at the URL level instead of the content level. You could then generate search queries that return lists of pages likely to match your regex, and filter the results against the regex itself.
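As a rough illustration (not a specific search API), Google's `inurl:` operator restricts matches to the URL itself. A naive, hypothetical helper could pull the literal fragments out of your regex and turn them into such queries; the hits would still need to be re-checked against the full regex afterwards:

    import re

    def literal_fragments(pattern: str, min_len: int = 4):
        """Naively extract the plain-text fragments of a regex by splitting
        on common metacharacters; short fragments are dropped as too vague."""
        fragments = re.split(r"[\\^$.|?*+()\[\]{}]+", pattern)
        return [f for f in fragments if len(f) >= min_len]

    def build_queries(pattern: str):
        """Turn each literal fragment into an inurl: query string."""
        return [f"inurl:{fragment}" for fragment in literal_fragments(pattern)]

    if __name__ == "__main__":
        # hypothetical regex: URLs like /invoice/<year>/<number>.pdf
        print(build_queries(r"/invoice/\d{4}/\d+\.pdf"))  # ['inurl:/invoice/']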
The second one: for certain web services that expose functions as resources, the list of URLs matching a regex can be effectively infinite.
You can apply several checks to avoid this, for example limiting crawl depth, capping the number of URLs per host, and normalizing URLs so that query-string variants collapse into one entry.
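A minimal sketch of such checks; the depth and per-host limits are arbitrary values chosen only for illustration:

    from urllib.parse import urlparse, urlunparse

    MAX_DEPTH = 5              # assumption: reasonable cut-off for path depth
    MAX_URLS_PER_HOST = 1000   # assumption: cap per host against endless resources

    seen = set()
    per_host_count = {}

    def normalize(url: str) -> str:
        """Drop query string and fragment so /item?id=1, /item?id=2, ...
        collapse into a single URL instead of an endless family."""
        parts = urlparse(url)
        return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))

    def should_visit(url: str) -> bool:
        """Apply a few cheap checks before queueing a URL for crawling."""
        url = normalize(url)
        parts = urlparse(url)
        if url in seen:
            return False                              # already queued or visited
        if parts.path.count("/") > MAX_DEPTH:
            return False                              # suspiciously deep path
        if per_host_count.get(parts.netloc, 0) >= MAX_URLS_PER_HOST:
            return False                              # host looks unbounded
        seen.add(url)
        per_host_count[parts.netloc] = per_host_count.get(parts.netloc, 0) + 1
        return True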
By the way, you are facing the same problem as every search engine: making an inventory of the whole web. No one has ever fully solved it.
EDIT:
webcrawler
basic algorithm
take a list of seed sites
for each seed:
    fetch and parse the page it returns
    add each link found in the page to the seed list
    apply some algorithm to index the page against keywords in a database
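A minimal Python sketch of that algorithm, assuming `requests` and `BeautifulSoup` are available; instead of indexing keywords into a database, it simply collects the URLs that match your regex:

    import re
    from collections import deque
    from urllib.parse import urljoin
    import requests                      # assumption: requests is installed
    from bs4 import BeautifulSoup        # assumption: beautifulsoup4 is installed

    def crawl(seeds, url_regex, max_pages=100):
        """Breadth-first crawl following the steps above; returns every
        discovered URL that matches url_regex."""
        pattern = re.compile(url_regex)
        queue = deque(seeds)
        visited = set()
        matches = []

        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)

            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue                               # unreachable page, skip it

            # parse the returned page and extract its links
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])    # resolve relative links
                if pattern.search(link):
                    matches.append(link)               # stand-in for the indexing step
                if link not in visited:
                    queue.append(link)                 # add each link to the seed list

        return matches

    if __name__ == "__main__":
        # hypothetical seed and regex, purely for illustration
        print(crawl(["https://example.com/"], r"/invoice/\d{4}/"))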