ansaurus

Question

Storing millions of URLs in a database for fast pattern matching

Answer 1

A:

For fast searches of the data store, I would suggest building an index of the URLs (or any other string-based criteria) on a suffix tree data structure. Locating a match then takes O(k) time, where k is the length of the search pattern, regardless of how many URLs are stored, which is really fast. A good introduction to this kind of tree can be found here.
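To make the idea concrete, here is a minimal sketch in Python. It is a naive suffix trie rather than a compressed suffix tree, so it uses far more memory than a real implementation would (every suffix of every URL is inserted character by character), and the names `SuffixIndex`, `add`, and `search` are made up for this example. It only illustrates why a lookup depends on the pattern length and not on the number of stored URLs.

```python
class _Node:
    """One trie node: child nodes by character, plus ids of URLs passing through."""
    __slots__ = ("children", "url_ids")

    def __init__(self):
        self.children = {}
        self.url_ids = set()


class SuffixIndex:
    """Naive generalized suffix trie over a set of URLs (illustrative only)."""

    def __init__(self):
        self._root = _Node()
        self._urls = []

    def add(self, url):
        url_id = len(self._urls)
        self._urls.append(url)
        # Insert every suffix of the URL, so any substring is a path from the root.
        for start in range(len(url)):
            node = self._root
            for ch in url[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = node.children[ch] = _Node()
                child.url_ids.add(url_id)
                node = child

    def search(self, pattern):
        """Return all stored URLs containing `pattern`, in sorted order."""
        if not pattern:
            return sorted(self._urls)
        node = self._root
        # Walk one node per pattern character: O(k) steps for a pattern of length k.
        for ch in pattern:
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(self._urls[i] for i in node.url_ids)


index = SuffixIndex()
index.add("http://a.com/?campaign=twitter")
index.add("http://b.com/?campaign=email")
index.add("http://c.org/page")
matches = index.search("campaign=twitter")
```

For millions of URLs you would want a compressed suffix tree or suffix array instead of this per-character trie, but the lookup walk is the same in spirit.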

When it comes to logging, try not to store the URLs one by one. I/O operations are quite resource intensive and are in most cases the bottleneck of such systems. Try to write the URLs to your data store in batches. For example, keep the submitted URLs in memory and store them in chunks of 1000 at a time. Just remember to update the suffix tree in a background or scheduled task to keep the data synchronized.
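A rough sketch of that batching, assuming SQLite as a stand-in data store and a hypothetical `url_log` table (the class name `BatchedUrlLogger` is also invented for the example). URLs are buffered in memory and written in a single transaction once the buffer reaches the batch size.

```python
import sqlite3


class BatchedUrlLogger:
    """Buffers URLs in memory and writes them in chunks (illustrative sketch)."""

    def __init__(self, conn, batch_size=1000):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []
        # Hypothetical table; a real schema would likely carry more columns.
        conn.execute("CREATE TABLE IF NOT EXISTS url_log (url TEXT NOT NULL)")

    def log(self, url):
        self.pending.append((url,))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One transaction per batch instead of one per URL.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO url_log (url) VALUES (?)", self.pending
            )
        self.pending.clear()


conn = sqlite3.connect(":memory:")
logger = BatchedUrlLogger(conn, batch_size=3)
for i in range(4):
    logger.log(f"http://example.com/page{i}")
stored_before_flush = conn.execute("SELECT COUNT(*) FROM url_log").fetchone()[0]
logger.flush()  # write whatever is still buffered, e.g. on shutdown or on a timer
stored_after_flush = conn.execute("SELECT COUNT(*) FROM url_log").fetchone()[0]
```

The trade-off is that anything still in the buffer is lost if the process dies, so you would probably also call `flush()` on a timer and at shutdown.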

Karim 2010-06-06 02:23:43

Answer 2

A:

I faced this exact issue in SQL Server. The solution for me was a table storing all my unique URLs/titles, with a unique key on two computed columns containing checksums of URL and TITLE. It took up about a tenth of the space of an equivalent key on the URL/title strings and was 10x faster than a direct index.

I'm using SQL Server, so the computed column definitions were

(checksum([URL],(0)))

and

(checksum([TITLE],(0)))

I found this for MySQL.

Since most of the traffic came from many of the same websites, this let me consolidate URLs/titles without having to search the whole table on each insert to enforce the unique constraint. My procedure just returned the URL/title PK if the row already existed.
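Here is a hedged sketch of that "return the PK if it already exists" procedure, written in Python against SQLite rather than as a SQL Server stored procedure. `zlib.crc32` stands in for `CHECKSUM`, and the table, column, and function names are all invented for the example. One deliberate difference from the setup described above: the checksum index here is not unique, and the lookup also compares the full strings, so two different URLs that happen to share a checksum are not treated as the same row.

```python
import sqlite3
import zlib


def create_schema(conn):
    # Hypothetical schema: full strings plus small integer checksums to index on.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS url (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL,
            title TEXT NOT NULL,
            url_crc INTEGER NOT NULL,
            title_crc INTEGER NOT NULL
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS ix_url_crc ON url (url_crc, title_crc)"
    )


def crc(text):
    # CRC32 of the UTF-8 bytes; a stand-in for SQL Server's CHECKSUM.
    return zlib.crc32(text.encode("utf-8"))


def get_or_create(conn, url, title):
    """Return the id of the (url, title) row, inserting it if missing."""
    u, t = crc(url), crc(title)
    # The narrow integer index finds candidates; the string comparison
    # guards against checksum collisions.
    row = conn.execute(
        "SELECT id FROM url "
        "WHERE url_crc = ? AND title_crc = ? AND url = ? AND title = ?",
        (u, t, url, title),
    ).fetchone()
    if row is not None:
        return row[0]
    with conn:
        cur = conn.execute(
            "INSERT INTO url (url, title, url_crc, title_crc) VALUES (?, ?, ?, ?)",
            (url, title, u, t),
        )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
create_schema(conn)
first = get_or_create(conn, "http://example.com/", "Example")
again = get_or_create(conn, "http://example.com/", "Example")
other = get_or_create(conn, "http://example.org/", "Other")
row_count = conn.execute("SELECT COUNT(*) FROM url").fetchone()[0]
```

The index stays small because it holds two integers per row instead of the full URL and title strings.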

To tie URLs to your users, use a USER_URL table with foreign keys to the primary keys of USER and URL.

Good luck.

Laramie 2010-06-06 02:58:39

Thanks for your suggestion. The checksum strategy might not work for me, though, because I may need to do pattern matching, e.g. find all URLs that contain campaign=twitter.

Paras Chopra 2010-06-06 05:32:05
