I have some data like this:

1 2
3 4
5 9
2 6
3 7

and am looking for an output like this (group-id and the members of that group):

1: 1 2 6
2: 3 4 7
3: 5 9

The first row is grouped because 1 is "connected" to 2 and 2 is connected to 6; the second row because 3 is connected to 4 and 3 is connected to 7.

This looked to me like a graph traversal problem, but the final order does not matter, so I was wondering if someone can suggest a simpler solution that I can use on a large dataset (billions of entries).


From the comments:

  • The problem is to find the set of disjoint sub-graphs given a set of edges.
  • The edges are not directed; the line '1 2' means that 1 is connected to 2 and 2 is connected to 1.
  • The '1:' in the sample output could be 'A:' without changing the meaning of the answer.

EDIT 1:

Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.

EDIT 2:

Test Input file:

1 27
1 134
1 137
1 161
1 171
1 275
1 309
1 413
1 464
1 627
1 744
2 135
2 398
2 437
2 548
2 594
2 717
2 738
2 783
2 798
2 912
5 74
5 223
7 53
7 65
7 122
7 237
7 314
7 701
7 730
7 755
7 821
7 875
7 884
7 898
7 900
7 930
8 115
9 207
9 305
9 342
9 364
9 493
9 600
9 676
9 830
9 941
10 164
10 283
10 380
10 423
10 468
10 577
11 72
11 132
11 276
11 306
11 401
11 515
11 599
12 95
12 126
12 294
13 64
13 172
13 528
14 396
15 35
15 66
15 210
15 226
15 360
15 588
17 263
17 415
17 474
17 648
17 986
21 543
21 771
22 47
23 70
23 203
23 427
23 590
24 286
24 565
25 175
26 678
27 137
27 161
27 171
27 275
27 309
27 413
27 464
27 627
27 684
27 744
29 787

Benchmarks:

I tried out everything, and the version posted by TokenMacGuy is the fastest on the sample dataset I tried. The dataset has about 1 million entries, which took about 6 seconds on a dual quad-core 2.4 GHz machine. I haven't had a chance to run it on the entire dataset yet, but I will post the benchmark as soon as it is available.

+1  A: 

Here is a sample Perl solution that works on the original data set:

1 2
3 4
5 9
2 6
3 7

Group 1: 1 2 6
Group 2: 3 4 7
Group 3: 5 9

On the big data set, it produces the output:

Group 1: 1 27 134 137 161 171 275 309 413 464 627 684 744
Group 2: 2 135 398 437 548 594 717 738 783 798 912
Group 3: 5 74 223
Group 4: 7 53 65 122 237 314 701 730 755 821 875 884 898 900 930
Group 5: 8 115
Group 6: 9 207 305 342 364 493 600 676 830 941
Group 7: 10 164 283 380 423 468 577
Group 8: 11 72 132 276 306 401 515 599
Group 9: 12 95 126 294
Group 10: 13 64 172 528
Group 11: 14 396
Group 12: 15 35 66 210 226 360 588
Group 13: 17 263 415 474 648 986
Group 14: 21 543 771
Group 15: 22 47
Group 16: 23 70 203 427 590
Group 17: 24 286 565
Group 18: 25 175
Group 19: 26 678
Group 20: 29 787

Whether it is efficient enough is a separate matter...

use strict;
use warnings;

# Adjacency structure as a hash of hashes: $cache{$x}{$y} exists if
# x and y are connected (both directions are stored).
my %cache = ();
while (<>)
{
    chomp;
    my($x,$y) = split /\s+/;
    $cache{$x}{$y} = 1;
    $cache{$y}{$x} = 1;
}

# Visit the nodes in numeric order; each node still present in %cache
# starts a new group. The recursive search deletes every node it
# visits, so no node is ever reported twice.
my $grp = 1;
foreach my $key (sort { $a <=> $b } keys %cache)
{
    if (defined $cache{$key})
    {
        my %result = ();
        subkey_search(\%result, $key);
        print "Group $grp:";
        $grp++;
        foreach my $val (sort { $a <=> $b } keys %result)
        {
            print " $val";
        }
        print "\n";
    }
}

# Depth-first search: record $key in the result set, remove it from
# %cache, and recurse into each neighbour that has not been visited
# (i.e. is still present in %cache).
sub subkey_search
{
    my($resultref, $key) = @_;
    my %hash = %{$cache{$key}};
    delete $cache{$key};
    $resultref->{$key} = 1;
    foreach my $subkey (sort keys %hash)
    {
        subkey_search($resultref, $subkey) if (defined $cache{$subkey});
    }
}
Jonathan Leffler
@Jonathan: Thank you for your time. I am running some benchmarks. I will get back.
Legend
+1  A: 

OK, so I got something working in parallel with the other solution posted by @Jonathan (first of all, many thanks for your time). My solution looks deceptively simple, but I would love some suggestions on whether it is correct (maybe I'm missing a corner case somewhere?). It seems to produce the output I wanted, but I'll have to parse it in a second pass to group together entries with the same group number, which is trivial. The logic is that every time it finds a number not already in the map, it increments a group-id counter:

My code in PHP:

<?php

//$fp = fopen("./resemblance.1.out", "r");
$fp = fopen("./wrong", "r");

// map of element => group number
$map = array();

// group counter
$group = 1;

while(!feof($fp)) {
        $source = trim(fgets($fp, 4096));
        $source = explode(" ", $source);

        if(array_key_exists($source[0], $map) && !array_key_exists($source[1], $map)) {
                // Only the first element is known: the second joins its group
                $map[$source[1]] = $map[$source[0]];
        } else if(array_key_exists($source[1], $map) && !array_key_exists($source[0], $map)) {
                // Only the second element is known: the first joins its group
                $map[$source[0]] = $map[$source[1]];
        } else if(array_key_exists($source[1], $map) && array_key_exists($source[0], $map) && $map[$source[1]] != $map[$source[0]]) {
                // Both are known but in different groups: merge by moving
                // every member of one group into the other
                $keys = array_keys($map, $map[$source[1]]);
                foreach($keys as $key) {
                        $map[$key] = $map[$source[0]];
                }
        } else {
                // Start a new group (note that this branch is also reached
                // when both elements already share a group)
                $group++;
                $map[$source[0]] = $group;
                $map[$source[1]] = $group;
        }
}

print_r($map);
?>

Output:

Array
(
    [1] => 2
    [2] => 2
    [3] => 3
    [4] => 3
    [5] => 4
    [9] => 4
    [6] => 2
    [7] => 3
    [] => 5
)

EDIT: Fixed the bug that was mentioned in the comments. Just playing around out of curiosity :) Feel free to point out any other bugs. I am currently testing which solution is faster.

Legend
Your algorithm is surely wrong. An example input that makes your result wrong is: 1 4 | 2 3 | 4 2. You should create a graph from your input first; then you can use the depth-first search algorithm to find the disjoint subgraphs.
coolkid
@coolkid: Thanks for your input. That was a very good corner case. :) I think I fixed it in my new code but I will test it more thoroughly. I am a little worried about creating an entire graph because I need to do this on billions of records. In any case, I will do some benchmarks before coming to a conclusion on which approach is better.
Legend
I think you should work out the algorithm by yourself. This is a fundamental problem of computer science. If you cannot solve it, we will help you with pseudo-code, not language-specific code.
coolkid
@coolkid: I didn't realize the language-agnostic tag was there; I removed it. Though I am concerned with arriving at the right algorithm, I am a little more concerned about efficiency. But thanks for your pointers.
Legend
@Legend: the problem you face when dealing with big input is the size of the representation of the graph. There are two kinds of graph representation: the adjacency matrix and the adjacency (linked) list. If you're dealing with big input, I suggest you use the adjacency-list representation, whose size is linear in the number of nodes and edges in the graph. Then you can use depth-first search to solve the problem, as I proposed.
coolkid
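For illustration, here is a minimal sketch (an editorial addition, not coolkid's own code) of the adjacency-list representation described above, using the sample edges from the question. Each node maps to the list of its neighbours, so memory grows linearly with the number of nodes and edges rather than quadratically as with a matrix:

from collections import defaultdict

# adjacency list: node -> list of neighbours; O(V + E) memory,
# versus O(V^2) for an adjacency matrix
adjacency = defaultdict(list)

edges = [(1, 2), (3, 4), (5, 9), (2, 6), (3, 7)]
for a, b in edges:
    adjacency[a].append(b)  # the edges are undirected,
    adjacency[b].append(a)  # so store both directions

print adjacency[2]  # prints [1, 6]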
If you use `isset` instead of `array_key_exists`, the code will be much more efficient. Function calls are slow in PHP, so it is better to use language constructs instead.
nikic
Your code seems to produce the correct output apart from one thing: It puts `27 744` in a new group, but they should be in the first.
nikic
Your new algorithm runs in O(V^2 + E) time, where V is the number of distinct numbers in the input and E is the number of pairs in the list. The problem can be solved in O(V + E) time using DFS, as coolkid writes.
Reid Barton
+1  A: 

Here's a slightly different version in Python, which builds a graph containing the edges specified, then converts that to a list of connected subgraphs.

I might want to use this later, so I wrote it as a general-purpose version that doesn't read input from a file or produce output with print statements; it just converts data structures.

def graph_to_connected_subgraphs(graph):
    trees = []
    for start in graph.keys():
        if start in graph:  # skip nodes already consumed by an earlier traversal
            members = [start]
            append_tree_from(graph, start, members)
            trees.append(members)
    return trees

def append_tree_from(graph, node, members):
    # collect every node reachable from 'node', deleting visited entries
    # from the graph so no subgraph is reported twice
    if node in graph:
        for endpoint in graph[node]:
            if endpoint not in members:  # guard against revisiting on cycles
                members.append(endpoint)
                append_tree_from(graph, endpoint, members)
        del graph[node]
    return members

def add_edge(graph, f, s):
    if s < f: # ensure f < s to handle cyclic graphs
        f, s = s, f
    if f not in graph:
        graph[f] = [s]
    else:
        graph[f].append(s)

graph = {}

add_edge(graph, 1,2)
add_edge(graph, 2,6)
add_edge(graph, 3,4)
add_edge(graph, 5,9)
add_edge(graph, 3,7)

print graph_to_connected_subgraphs(graph)

Output

[[1, 2, 6], [3, 4, 7], [5, 9]]
Lee Reeves
Is it not more Pythonic to do '`s, f = f, s`' to swap the values?
Jonathan Leffler
Yep, thanks. I've been working too much in C++ today.
Lee Reeves
A: 

Here's my stab at an answer. Python.

groups = []

infile = open("so2.txt")

for line in infile.readlines():
  newset = set(line.split())
  excludegroups = []
  for group in groups:
    if len(newset & group):
      # this group shares a member with the new pair, so fold it in
      newset |= group
    else:
      # no overlap; keep the group as-is
      excludegroups.append(group)
  groups = excludegroups
  groups.append(newset)

for i, s in enumerate(groups):
  print "%d: %s" % (i, " ".join(s))

The idea here is that forming graphs is not really the right approach. Each pair of numbers in the input is a set, and the rule is to return only disjoint sets. So I read each line and convert it to a set, then check all of the existing sets for intersections and merge those into the new set. Non-intersecting sets are simply carried over to the new list of sets, and once I'm done I add the new, merged set to that list. This way I can be sure that only disjoint sets make it into the list.

TokenMacGuy
+1  A: 

This is a typical application of the DFS (depth-first search) algorithm performed on graphs. Try reading about DFS. The complexity of this algorithm is O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges.

Ilija
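As a concrete illustration of this approach, here is a minimal iterative DFS sketch (an editorial addition, not part of the original answer): it builds an adjacency list from the edge pairs, then peels off one connected component per unvisited node, touching each vertex and edge once, which gives the O(|V| + |E|) bound mentioned above.

from collections import defaultdict

def connected_components(edges):
    # build an undirected adjacency list: O(|V| + |E|) space
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # iterative DFS with an explicit stack, so very deep graphs
        # cannot overflow the recursion limit
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

edges = [(1, 2), (3, 4), (5, 9), (2, 6), (3, 7)]
for i, comp in enumerate(connected_components(edges), 1):
    print "%d: %s" % (i, " ".join(str(n) for n in sorted(comp)))

With the sample edges this prints the three groups from the question (group order may vary, since dict iteration order is arbitrary).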
A: 
nikic
+3  A: 
TokenMacGuy
@TokenMacGuy: After trying out pretty much all the versions here including mine, your version is the fastest :) Thank you very much for the detailed explanations and thanks to everyone else for their valuable time.
Legend
Thank you very much. I'd love to see the run statistics for the different implementations! If you really need to push the speed envelope, there are some further optimizations we can make, especially if you know the maximum element.
TokenMacGuy
A: 
Olathe
A: 

After not being completely satisfied with my first two attempts at this, and some research, I came across this recipe for disjoint sets in Python, with Raymond Hettinger's blessing and input. (Raymond Hettinger is a long-time, very active Python core developer.)

Here is an adaptation of that recipe that's very close to my first two attempts, but the recipe itself may be a better reference.

Collecting should be about as efficient as possible for very large data sets, since most of the set operations in Python are implemented in C. The input data does not have to be sorted. For printing, I sort the output solely for readability; if this affects performance, the connections can be printed without sorting.

# ~~~~~
# data, setup

input = '''
    1 2
    3 4
    2 3
    ''' # etc.

def lgen():
    for l in input.splitlines():
        l = l.strip()
        if l:
            yield tuple(int(i) for i in l.split())

# ~~~~~
# collect

connections = {} # this is a mapping of values to the connections they are in
                 # each node will map to a shared object instance of the connection it is in
                 # e.g. {1: set([1,2]), 2: set([1,2])}, where the 2 sets are the same object

for x, y in lgen():
    cx = connections.setdefault(x, set([x]))  # if not found, create new connection with this single value

    cy = connections.get(y)                 # defaults to None if not found
    if not cy:                              # if we haven't come across this value yet...
        cx.add(y)                           # ...add it to the current connection...
        connections[y] = cx                 # ...and update the reference
    elif cy is not cx:                      # if the cy connection is not the exact same object as the cx connection...
        if len(cy) > len(cx):               # \
            cx, cy = cy, cx                 #  >... merge the smaller into the larger ...
        cx |= cy                            # /
        for v in cy:                        # ...and repoint every member of the absorbed
            connections[v] = cx             #    connection, so the sets stay shared objects

# ~~~~~
# print

seen = set()
for key in sorted(connections.keys()):
    if key not in seen:
        c = connections[key]
        print sorted(c)
        seen |= c
ma3
A: 

This is actually a very classic application of union-find (disjoint-set union).

If M is the number of edges and N the number of nodes, the time complexity is O(M * α(M)), which is effectively O(M) for all practical M, and the space complexity is O(N).

The algorithm is online and does not need to know all the edges in advance (unlike the graph-traversal solutions), and hence can scale very well. There is also no need to sort the edges; they can be given in any order.

For a graph with billions of nodes you'll need 64-bit integers (long int) and a lot of RAM, but it should handle millions of nodes and billions of edges very well.

The implementation is in C++, but it only uses vector/map, which you can find in almost any language you might want to use.


But since you have a unique id for each element, we need to map these ids to (contiguous) integers.

First version, without the mapping (assuming that all nodes between 1 and N exist):

#include <cstdio>
#include <algorithm>
#include <vector>
using namespace std;

const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];

// Find the representative of a's set, with path compression
int parent(int a) {
  return p[a] == a ? a : p[a] = parent(p[a]);
}
// Merge the sets of a and b; returns true if they were previously distinct
bool join(int a, int b) {
  p[a = parent(a)] = parent(b);
  return p[a] != a;
}

int main()
{
    // First integer in the file : number of nodes in the graph
    int N;
    scanf("%d",&N);
    // Union-find in O(M * alpha(M)) ~= O(M)
    // M = number of lines in the file
    for(int i = 1; i <= N ; i++)
    {
        p[i] = i;
        f[i] = -1;
    }
    int a,b;
    while(scanf("%d%d",&a,&b) != EOF)
        join(a,b);
    // Determine the number of groups : O(N)
    int nG = 0;
    for(int i = 1 ; i <= N ; i++)
    {
        p[i] = parent(p[i]);
        if(f[p[i]] == -1)
            f[p[i]] = nG++;
    }
    // Build groups : O(N)
    vector< vector<int> > Groups(N+1);
    for(int i = 1 ; i <= N ; i++)
        Groups[ f[p[i]] ].push_back(i);
    // Output result
    for(int i = 0 ; i < Groups.size() ; i++)
    {
        if(!Groups[i].size())
            continue;
        printf("%d : ",i);
        for(int j = 0 ; j < Groups[i].size() ; j++)
            printf("%d ",Groups[i][j]);
        printf("\n");
    }

}

Version with mapping: for this version we need to build the mapping. Since I don't know anything about your data, I'm using a classical map to build it in O(M log(N)). If you can put the ids of all nodes at the beginning of the input file, it can be O(N log(N)), or even better, O(N) if you use a hash map, or if you can build the mapping yourself with some knowledge of the graph.

Anyway, here is the code:

#include <cstdio>
#include <algorithm>
#include <vector>
#include <map>
using namespace std;

const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];

// Find the representative of a's set, with path compression
int parent(int a) {
  return p[a] == a ? a : p[a] = parent(p[a]);
}
// Merge the sets of a and b; returns true if they were previously distinct
bool join(int a, int b) {
  p[a = parent(a)] = parent(b);
  return p[a] != a;
}
// Mapping from arbitrary node ids to contiguous indices 1..N
int N = 0;
map<int,int> indMap,invMap;
int IND(int x) {
    if(indMap.find(x) == indMap.end())
    {
        // first time this id is seen: assign it the next free index
        N++;
        p[N] = N;
        f[N] = -1;
        indMap[x] = N;
    }
    invMap[ indMap[x] ] = x;  // remember the reverse mapping for output
    return indMap[x];
}


int main()
{
    // Union-find in O(M * alpha(M)) ~= O(M)
    // M = number of lines in the file
    int a,b;
    while(scanf("%d%d",&a,&b) != EOF)
        join(IND(a),IND(b));
    // Determine the number of groups : O(N)
    int nG = 0;
    for(int i = 1 ; i <= N ; i++)
    {
        p[i] = parent(p[i]);
        if(f[p[i]] == -1)
            f[p[i]] = nG++;
    }
    // Build groups : O(N)
    vector< vector<int> > Groups(N+1);
    for(int i = 1 ; i <= N ; i++)
        Groups[ f[p[i]] ].push_back(i);
    // Output result
    for(int i = 0 ; i < Groups.size() ; i++)
    {
        if(!Groups[i].size())
            continue;
        printf("%d : ",i+1);
        for(int j = 0 ; j < Groups[i].size() ; j++)
            printf("%d ", invMap[ Groups[i][j] ]);
        printf("\n");
    }

}
Loïc Février