views:

40

answers:

2

I want to be able to query a database so that I can find out which pages on a web site users visit before registering.

I was planning on using a graph database - neo4j probably - but I can't decide how the data should be represented.

Each user has an ID and can make multiple visits, each with their own ID. Obviously each page is identified by its URL, and a timestamp allows me to determine the order in which pages are visited.

So, how could I go about designing the database so I can answer the question "What are the most common paths that lead to registration"? I want to know the 5 pages a user visits before ending up on our registration page.

I could create nodes for each url, but then I'd have millions of user IDs, visit IDs and timestamps as properties on the url node.

Alternatively I could create one graph per visit, but then I wouldn't be able to begin at a single node representing the registration page and work back from there. Instead I'd have to iterate through all of the graphs, and only scan them if they contained a registration node.

If I stored URLs and users as nodes, how could I represent paths through the site? I could easily say 'User A visited URL 1', but wouldn't be able to represent that after that they visited URL 2 in an intuitive way.

I'm pretty stumped here. All the data is already in a relational DB, but I think a query like this would bring it down for quite some time since the site has several million visits per month.

How can I approach this? Is neo4j even the right solution?

Thanks

A: 

The Site usage analytics page on the Neo4j wiki describes one possible approach to this problem. The graph structure it uses looks like this: alt text

nawroth
A: 

thanks a lot, i'll check it out

Michael Jones