views:

175

answers:

4

Hi there,

First of all I'm quite a Java beginner, so I'm not sure if this is even possible! Basically I have a huge (3+million) data source of relational data (i.e. A is friends with B+C+D, B is friends with D+G+Z (but not A - i.e. unmutual) etc.) and I want to find every cycle within this (not necessarily connected) directed graph.

I've found the thread Finding all cycles in graph, which has pointed me to Donald Johnson's (elementary) cycle-finding algorithm which, superficially at least, looks like it'll do what I'm after (I'm going to try when I'm back at work on Tuesday - thought it wouldn't hurt to ask in the meanwhile!).

I had a quick scan through the code of the Java implementation of Johnson's algorithm (in that thread) and it looks like a matrix of relations is the first step, so I guess my questions are:

a) Is Java capable of handling a 3+million*3+million matrix? (was planning on representing A-friends-with-B by a binary sparse matrix)

b) Do I need to find every connected subgraph as my first problem, or will cycle-finding algorithms handle disjoint data?

c) Is this actually an appropriate solution for the problem? My understanding of "elementary" cycles is that in the graph below, rather than picking out A-B-C-D-E-F it'll pick out A-B-F, B-C-D etc. but that's not the end of the world given the task.

    E
   / \
  D---F
 / \ / \
C---B---A

d) If necessary, I can simplify the problem by enforcing mutuality in relations - i.e. A-friends-with-B <==> B-friends-with-A, and if really necessary I can maybe cut down the data size, but realistically it is always going to be around the 1mil mark.

z) Is this a P or NP task?! Am I biting off more than I can chew?

Thanks all, any help appreciated! Andy

A: 

c) Is this actually an appropriate solution for the problem? My understanding of "elementary" cycles is that in the graph below, rather than picking out A-B-C-D-E-F it'll pick out A-B-F, B-C-D etc. but that's not the end of the world given the task.

    E
   / \
  D---F
 / \ / \
C---B---A

No. "Elementary" in the sense of Donald Johnson's paper means simple, that is, no node appears twice in the circle. That means the algorithm won't pick AFBCDBA, but will pick ABCDEF.

d) If necessary, I can simplify the problem by enforcing mutuality in relations - i.e. A-friends-with-B <==> B-friends-with-A, and if really necessary I can maybe cut down the data size, but realistically it is always going to be around the 1mil mark.

Undirected graphs have vector space of (non-simple) cycles (which has a nice basis, etc.), but I don't know if it'll help you.

z) Is this a P or NP task?

It is an enumeration problem, which, by itself, cannot be in P nor NP.

jpalecek
A: 

Finding every cycle does not sound reasonable. There will be exponential number of cycles, all overlapping each other.

As for P or NP, the slowest part is actually enumerating each cycle (because there can be so many of them). If there are no cycles, a linear algorithm exists.

Maybe you actually want to divide the graph in biconnected components? See http://en.wikipedia.org/wiki/Biconnected_component

Also it is almost NEVER reasonable to store graphs in matrices. It sounds nice in theory, but does not scale in practice, Use adjacency lists instead ( http://en.wikipedia.org/wiki/Adjacency_list )

extropy
+2  A: 

What you're doing is similar to a very well studied problem in data mining, known as association rule mining or more generally frequent itemset mining. What you can find with frequent itemset mining, is a little bit more specific than what you're doing, but also more useful.

We'll go with closed frequent itemset mining, what this will do is find all groups of friends, where everyone is friends with each other.

I'm going to say this right now, that Java can't do what you want it to do. It can't load that much memory and it's not efficient enough to process that data in any reasonable amount of time, you're going to need to use C/C++. I'd suggest using LCM which is a closed frequent itemset miner, but you're also going to need set the support pretty high, because of the amount of data that you have.

Another thing you might want to consider is reading up on Large Graph Mining, which is also a fairly large area of research, but Java isn't going to cut it. Also you're not going to be able to find all the cycles in your data (unless it is incredibly sparse), there's going to be far too many of them. They're also going to be overlapping and not very meaningful, what you're going to maybe possibly be able to find is several of the largest cycles.

Jacob Schlather
A: 

Based on your comment about fraud detection, perhaps you might find this useful: Cycle Space.

Basically you treat each cycle of the graph as a binary vector of dimension m (where m is the number of edges in the graph), denoting the edges in the cycle by 1 and edges not in the cycle by 0.

We define the sum operation on cycles as the binary vector addition: 1+1 = 0, 1+0 = 1, 0+0 = 0 and so (1,0,1) + (1,1,1) = (0,1,0) etc.

The cycle space consists of each those vectors which are either cycles or disjoint union of cycles. You can find a basis of this vector space by finding the Fundamental Cycles of some spanning tree of the graph.

This basis has m-n+1 (n is number of nodes in the graph) vectors and any member of the cycle space (and hence any cycle) can be written as a 0-1 linear combination of those basis vectors.

Since this is a banking graph, I am guessing (I have no clue, though) m-n+1 will be a small enough number for this to be a feasible approach.

Moron