I need to analyze 1 TB+ of web access logs, in particular statistics on the requested URLs and on subsets of those URLs (child branches). If possible, I want queries over small slices of the data (e.g. 10 million requests) to be fast.
For example, given an access log with the following URLs being requested:
/ocp/about_us.html
/ocp/security/ed-209/patches/urgent.html
/ocp/security/rc/
/ocp/food/
/weyland-yutani/products/
I want to do queries such as:
- Count the number of requests for everything 'below' /ocp.
- Same as above, but only count requests for child nodes under /ocp/security
- Return the top 5 most frequently requested URLs.
- Same as above, except grouped at an arbitrary depth (see the sketch after this list). For example, at depth 2 the example data would return:
2: /ocp/security/
1: /ocp/
1: /ocp/food/
1: /weyland-yutani/products/
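To pin down the semantics, here is a minimal pure-Python sketch of those four queries over the sample paths above. The file-vs-directory rule in `dir_segments` is my assumption, inferred from the depth-2 output (where /ocp/about_us.html is grouped under /ocp/):

```python
from collections import Counter

# The sample requests from the example above.
requests = [
    "/ocp/about_us.html",
    "/ocp/security/ed-209/patches/urgent.html",
    "/ocp/security/rc/",
    "/ocp/food/",
    "/weyland-yutani/products/",
]

def segments(url):
    """Non-empty path elements of a URL path."""
    return [s for s in url.split("/") if s]

def dir_segments(url):
    """Directory elements only; a trailing element with no '/' after it
    is treated as a file name and dropped (assumption, see above)."""
    parts = segments(url)
    return parts if url.endswith("/") else parts[:-1]

def group_key(url, depth):
    """Truncate a URL to its first `depth` directory elements."""
    return "/" + "/".join(dir_segments(url)[:depth]) + "/"

# 1. Count everything below /ocp.
print(sum(1 for u in requests if segments(u)[:1] == ["ocp"]))              # 4

# 2. Count only child nodes under /ocp/security.
print(sum(1 for u in requests if segments(u)[:2] == ["ocp", "security"]))  # 2

# 3. Top 5 most frequently requested URLs.
print(Counter(requests).most_common(5))

# 4. Grouped at depth 2, matching the output above:
#    [('/ocp/security/', 2), ('/ocp/', 1), ('/ocp/food/', 1),
#     ('/weyland-yutani/products/', 1)]
print(Counter(group_key(u, 2) for u in requests).most_common())
```

Obviously this in-memory version does not scale to 1 TB; it is only meant to make the query semantics concrete.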
I think the ideal approach is probably a column DB with the URLs tokenized so that there is one column per path element. However, I would really like to do this with open-source software if possible. HBase is a possibility, but its query performance seems too slow to be useful for real-time queries (and I don't really want to be in the business of re-implementing SQL).
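To make that layout concrete, here is a sketch of the tokenize-into-columns idea using SQLite via Python's sqlite3, chosen only because it is in the standard library; the `hits` table, its `seg1`..`seg3` columns, and the `tokenize` helper are all hypothetical names, and at 1 TB+ the same schema would sit on a real columnar engine instead. The point is that subtree counts become equality filters (indexable, no LIKE scans) and depth grouping becomes a plain GROUP BY over segment columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE hits (
        url  TEXT,   -- full request path
        seg1 TEXT,   -- first path element, e.g. 'ocp'
        seg2 TEXT,   -- second path element, or NULL
        seg3 TEXT    -- ... one column per depth you care about
    )
""")

def tokenize(url, width=3):
    """Split a path into elements, padded/truncated to `width` columns.
    (File-name vs. directory handling is omitted here for brevity.)"""
    parts = [s for s in url.split("/") if s]
    parts += [None] * (width - len(parts))
    return (url, *parts[:width])

urls = [
    "/ocp/about_us.html",
    "/ocp/security/ed-209/patches/urgent.html",
    "/ocp/security/rc/",
    "/ocp/food/",
    "/weyland-yutani/products/",
]
con.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)",
                (tokenize(u) for u in urls))

# Everything below /ocp: one equality test on the first segment column.
print(con.execute("SELECT COUNT(*) FROM hits WHERE seg1 = 'ocp'").fetchone())

# Only child nodes under /ocp/security.
print(con.execute(
    "SELECT COUNT(*) FROM hits WHERE seg1 = 'ocp' AND seg2 = 'security'"
).fetchone())

# Top 5 most frequently requested URLs.
print(con.execute(
    "SELECT url, COUNT(*) AS n FROM hits GROUP BY url ORDER BY n DESC LIMIT 5"
).fetchall())

# Grouped at depth 2: just group by the first two segment columns.
print(con.execute(
    "SELECT seg1, seg2, COUNT(*) AS n FROM hits GROUP BY seg1, seg2 ORDER BY n DESC"
).fetchall())
```

With a composite index on (seg1, seg2, ...), the subtree queries touch only the matching rows, which is what should keep them fast over small slices of the data.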
I'm aware there are commercial apps for this type of analytics, but for various reasons I want to implement this myself.