I am developing a web-analytics-style system that needs to log the referring URL, landing-page URL and search keywords for every visitor to the website. What I want to do with this collected data is allow the end user to query it, such as "Show me all visitors who came from Bing.com searching for a phrase that contains 'red shoes'" or "Show me all visitors who landed on a URL that contained 'campaign=twitter_ad'", etc.

Because this system will be used on many big websites, the amount of data that needs to be logged will grow really, really fast. So, my questions: a) what would be the best logging strategy so that scaling the system doesn't become a pain; b) how can that architecture support rapid querying of arbitrary requests? Is there a special method of storing URLs that makes querying them faster?

In addition to the MySQL database that I use, I am exploring (and open to) other alternatives better suited for this task.

A: 

For rapid searching of the data store, I would suggest building an index of the URLs (or any other string-based criteria) based on a suffix tree data structure. A search then runs in O(k), where k is the length of the search string, which is really fast. You can find a good introduction to this kind of tree here.
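As a rough illustration of that idea, here is a Python sketch that indexes every suffix of every URL in a sorted list. It is a simplified stand-in for a real suffix tree (lookups cost O(k log n) rather than O(k) and memory use is much higher), and the class and method names are made up for the example:

import bisect

class SuffixIndex:
    # Simplified suffix-array-style index over a set of URLs. A real
    # generalized suffix tree would answer queries in O(k) and use far
    # less memory; this only illustrates the substring-search behaviour.
    def __init__(self, urls):
        self.urls = list(urls)
        # Every suffix of every URL, paired with the id of the URL it came from.
        self._suffixes = sorted(
            (url[i:], url_id)
            for url_id, url in enumerate(self.urls)
            for i in range(len(url))
        )
        self._keys = [suffix for suffix, _ in self._suffixes]

    def search(self, pattern):
        # All suffixes starting with `pattern` form one contiguous range
        # in the sorted list; '\uffff' sorts after any ASCII character,
        # so it closes that range for URL data.
        lo = bisect.bisect_left(self._keys, pattern)
        hi = bisect.bisect_right(self._keys, pattern + "\uffff")
        return {self.urls[url_id] for _, url_id in self._suffixes[lo:hi]}

index = SuffixIndex([
    "http://example.com/?campaign=twitter_ad",
    "http://example.com/shoes?color=red",
])
print(index.search("campaign=twitter"))  # -> only the first URL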

When it comes to logging, try not to store the URLs one by one. I/O operations are quite resource-intensive and are in most cases the bottleneck of such systems. Write the URLs to your data store in batches; for example, keep the submitted URLs in memory and store them in chunks of 1000 at a time. Just remember to update the suffix tree from a background or scheduled task to keep the data synchronized.
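A minimal sketch of that batching idea, assuming a hypothetical flush_to_store callback that performs the actual multi-row insert (for example via executemany on a MySQL connection):

import threading

class BatchedUrlLogger:
    # Buffers logged visits in memory and writes them out in chunks,
    # instead of issuing one I/O operation per visitor.
    def __init__(self, flush_to_store, batch_size=1000):
        self._flush_to_store = flush_to_store  # persists a list of records
        self._batch_size = batch_size
        self._buffer = []
        self._lock = threading.Lock()

    def log(self, referrer, landing_url, keywords):
        with self._lock:
            self._buffer.append((referrer, landing_url, keywords))
            if len(self._buffer) >= self._batch_size:
                self._flush_buffer()

    def flush(self):
        # Also call this from the background/scheduled task that rebuilds
        # the suffix index, so small leftover batches still get persisted.
        with self._lock:
            self._flush_buffer()

    def _flush_buffer(self):
        # Caller must hold the lock.
        if self._buffer:
            self._flush_to_store(self._buffer)
            self._buffer = []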

Karim
A: 

I was faced with this exact issue in SQL Server, and the solution for me was a table storing all my unique URLs/titles, with a unique key on two computed columns containing a checksum of the URL and the TITLE. It took up about a tenth of the space of an equivalent key on the string URL/title and was 10x faster than a direct index.

I'm using SQL Server, so the computed column expressions were

(checksum([URL],(0)))

and

(checksum([TITLE],(0)))

I found this for MySQL.

Since most of the traffic came from the same set of websites, this allowed me to consolidate URLs/titles without having to search the whole table on each insert to enforce the unique constraint. My procedure just returned the URL/title PK if it already existed.

To tie these to your users, use a USER_URL table with foreign keys to the PKs of USER and URL.
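For illustration, here is a rough Python sketch of that flow with the tables replaced by in-memory structures; UrlStore, get_or_insert and link_user are made-up names, and in the real system the checksum computed columns, the unique index and the stored procedure do this work:

import zlib

def crc(text):
    # Cheap 32-bit key, playing the role of the CHECKSUM() computed columns.
    return zlib.crc32(text.encode("utf-8"))

class UrlStore:
    def __init__(self):
        self._by_checksum = {}   # (crc(url), crc(title)) -> list of row PKs
        self._rows = []          # PK -> (url, title)
        self._user_url = []      # (user_pk, url_pk) link rows

    def get_or_insert(self, url, title):
        # Only rows that share the checksum key need a full string
        # comparison, which is why the narrow checksum index is fast.
        key = (crc(url), crc(title))
        for pk in self._by_checksum.get(key, []):
            if self._rows[pk] == (url, title):
                return pk          # already known: just hand back its PK
        pk = len(self._rows)
        self._rows.append((url, title))
        self._by_checksum.setdefault(key, []).append(pk)
        return pk

    def link_user(self, user_pk, url, title):
        # The USER_URL row: the user's PK plus the URL/title row's PK.
        self._user_url.append((user_pk, self.get_or_insert(url, title)))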

Good luck.

Laramie
Thanks for your suggestion. The checksum strategy might not work for me, though, because I may need to do pattern matching, like: search all URLs that contain campaign=twitter
Paras Chopra