I need to match input strings (URLs) against a large set (anywhere from 1k-250k) of string rules with simple wildcard support.
Requirements for wildcard support are as follows:
Wildcard (*) can only substitute a "part" of a URL. That is fragments of a domain, path, and parameters. For example, "*.part.part/*/part?part=part&part=*". The only exception to this rule is in the path area where "/*" should match anything after the slash.
Examples:
- *.site.com/* -- should match sub.site.com/home.html, sub2.site.com/path/home.html
- sub.site.*/path/* -- should match sub.site.com/path/home.html, sub.site.net/path/home.html, but not sub.site.com/home.html
Additional requirements:
- Fast lookup (I realize "fast" is a relative term. Given the max 250k rules, still fall within < 1.5s if possible.)
- Work within the scope of a modern desktop (e.g. not a server implementation)
- Ability to return 0:n matches given a input string
- Matches will have rule data attached to them
What is the best system/algorithm for such as task? I will be developing the solution in C++ with the rules themselves stored in a SQLite database.