I have some data like this:

1 2
3 4
5 9
2 6
3 7

and am looking for an output like this (group-id and the members of that group):

1: 1 2 6
2: 3 4 7
3: 5 9

The first row is grouped because 1 is "connected" to 2 and 2 is connected to 6; the second row because 3 is connected to 4 and 3 is connected to 7.

This looked to me like a graph traversal problem, but the final order does not matter, so I was wondering if someone can suggest a simpler solution that I can use on a large dataset (billions of entries).


From the comments:

  • The problem is to find the set of disjoint sub-graphs given a set of edges.
  • The edges are not directed; the line '1 2' means that 1 is connected to 2 and 2 is connected to 1.
  • The '1:' in the sample output could be 'A:' without changing the meaning of the answer.

EDIT 1:

Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.

EDIT 2:

Test Input file:

1 27
1 134
1 137
1 161
1 171
1 275
1 309
1 413
1 464
1 627
1 744
2 135
2 398
2 437
2 548
2 594
2 717
2 738
2 783
2 798
2 912
5 74
5 223
7 53
7 65
7 122
7 237
7 314
7 701
7 730
7 755
7 821
7 875
7 884
7 898
7 900
7 930
8 115
9 207
9 305
9 342
9 364
9 493
9 600
9 676
9 830
9 941
10 164
10 283
10 380
10 423
10 468
10 577
11 72
11 132
11 276
11 306
11 401
11 515
11 599
12 95
12 126
12 294
13 64
13 172
13 528
14 396
15 35
15 66
15 210
15 226
15 360
15 588
17 263
17 415
17 474
17 648
17 986
21 543
21 771
22 47
23 70
23 203
23 427
23 590
24 286
24 565
25 175
26 678
27 137
27 161
27 171
27 275
27 309
27 413
27 464
27 627
27 684
27 744
29 787

Benchmarks:

I tried out everything, and the version posted by TokenMacGuy is the fastest on the sample dataset I tried. The dataset has about 1 million entries, which took about 6 seconds on a dual quad-core 2.4 GHz machine. I haven't had a chance to run it on the entire dataset yet, but I will post the benchmark as soon as it is available.

+1  A: 

Here is a sample Perl solution that works on the original data set:

1 2
3 4
5 9
2 6
3 7

Group 1: 1 2 6
Group 2: 3 4 7
Group 3: 5 9

On the big data set, it produces the output:

Group 1: 1 27 134 137 161 171 275 309 413 464 627 684 744
Group 2: 2 135 398 437 548 594 717 738 783 798 912
Group 3: 5 74 223
Group 4: 7 53 65 122 237 314 701 730 755 821 875 884 898 900 930
Group 5: 8 115
Group 6: 9 207 305 342 364 493 600 676 830 941
Group 7: 10 164 283 380 423 468 577
Group 8: 11 72 132 276 306 401 515 599
Group 9: 12 95 126 294
Group 10: 13 64 172 528
Group 11: 14 396
Group 12: 15 35 66 210 226 360 588
Group 13: 17 263 415 474 648 986
Group 14: 21 543 771
Group 15: 22 47
Group 16: 23 70 203 427 590
Group 17: 24 286 565
Group 18: 25 175
Group 19: 26 678
Group 20: 29 787

Whether it is efficient enough is a separate matter...

use strict;
use warnings;

# Adjacency structure as a hash of hashes: $cache{$x}{$y} exists if
# x and y are connected (both directions are stored).
my %cache = ();
while (<>)
{
    chomp;
    my($x,$y) = split /\s+/;
    $cache{$x}{$y} = 1;
    $cache{$y}{$x} = 1;
}

# Visit the nodes in numeric order; each node still present in %cache
# starts a new group. The recursive search deletes every node it
# visits, so no node is ever reported twice.
my $grp = 1;
foreach my $key (sort { $a <=> $b } keys %cache)
{
    if (defined $cache{$key})
    {
        my %result = ();
        subkey_search(\%result, $key);
        print "Group $grp:";
        $grp++;
        foreach my $val (sort { $a <=> $b } keys %result)
        {
            print " $val";
        }
        print "\n";
    }
}

# Depth-first search: record $key in the result set, remove it from
# %cache, and recurse into each neighbour that has not been visited
# (i.e. is still present in %cache).
sub subkey_search
{
    my($resultref, $key) = @_;
    my %hash = %{$cache{$key}};
    delete $cache{$key};
    $resultref->{$key} = 1;
    foreach my $subkey (sort keys %hash)
    {
        subkey_search($resultref, $subkey) if (defined $cache{$subkey});
    }
}
Jonathan Leffler
@Jonathan: Thank you for your time. I am running some benchmarks. I will get back.
Legend
+1  A: 

OK, so I got something working in parallel with the other solution posted by @Jonathan (first of all, many thanks for your time). My solution looks deceptively simple, but I would love some suggestions on whether it is correct (maybe I'm missing a corner case somewhere?). It seems to produce the output I wanted, but I'll have to parse it in a second pass to group together entries with the same group number, which is trivial. The logic is that every time it finds a number not already in the map, it increments a group-id counter:

My code in PHP:

<?php

//$fp = fopen("./resemblance.1.out", "r");
$fp = fopen("./wrong", "r");

// map of element => group number
$map = array();

// group counter
$group = 1;

while(!feof($fp)) {
        $source = trim(fgets($fp, 4096));
        $source = explode(" ", $source);

        if(array_key_exists($source[0], $map) && !array_key_exists($source[1], $map)) {
                // Only the first element is known: the second joins its group
                $map[$source[1]] = $map[$source[0]];
        } else if(array_key_exists($source[1], $map) && !array_key_exists($source[0], $map)) {
                // Only the second element is known: the first joins its group
                $map[$source[0]] = $map[$source[1]];
        } else if(array_key_exists($source[1], $map) && array_key_exists($source[0], $map) && $map[$source[1]] != $map[$source[0]]) {
                // Both are known but in different groups: merge by moving
                // every member of one group into the other
                $keys = array_keys($map, $map[$source[1]]);
                foreach($keys as $key) {
                        $map[$key] = $map[$source[0]];
                }
        } else {
                // Start a new group (note that this branch is also reached
                // when both elements already share a group)
                $group++;
                $map[$source[0]] = $group;
                $map[$source[1]] = $group;
        }
}

print_r($map);
?>

Output:

Array
(
    [1] => 2
    [2] => 2
    [3] => 3
    [4] => 3
    [5] => 4
    [9] => 4
    [6] => 2
    [7] => 3
    [] => 5
)

EDIT: Fixed the bug that was mentioned in the comments. Just playing around out of curiosity :) Feel free to point out any other bugs. I am currently testing which solution is faster.

Legend
Your algorithm is surely wrong. An example input that makes your result wrong is: 1 4 | 2 3 | 4 2. You should create a graph from your input first; then you can use the depth-first search algorithm to find the disjoint subgraphs.
coolkid
@coolkid: Thanks for your input. That was a very good corner case. :) I think I fixed it in my new code but I will test it more thoroughly. I am a little worried about creating an entire graph because I need to do this on billions of records. In any case, I will do some benchmarks before coming to a conclusion on which approach is better.
Legend
I think you should work out the algorithm by yourself. This is a fundamental problem of computer science. If you cannot solve it, we will help you with pseudo-code, not language-specific code.
coolkid
@coolkid: I didn't realize the language-agnostic tag was there; I removed it. Though I am concerned with arriving at the right algorithm, I am a little more concerned about efficiency. But thanks for your pointers.
Legend
@Legend: the problem you face when dealing with big input is the size of the representation of the graph. There are two kinds of graph representation: the adjacency matrix and the adjacency (linked) list. If you're dealing with big input, I suggest you use the adjacency-list representation, whose size is linear in the number of nodes and edges in the graph. Then you can use depth-first search to solve the problem, as I proposed.
coolkid
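For illustration, here is a minimal sketch (an editorial addition, not coolkid's own code) of the adjacency-list representation described above, using the sample edges from the question. Each node maps to the list of its neighbours, so memory grows linearly with the number of nodes and edges rather than quadratically as with a matrix:

from collections import defaultdict

# adjacency list: node -> list of neighbours; O(V + E) memory,
# versus O(V^2) for an adjacency matrix
adjacency = defaultdict(list)

edges = [(1, 2), (3, 4), (5, 9), (2, 6), (3, 7)]
for a, b in edges:
    adjacency[a].append(b)  # the edges are undirected,
    adjacency[b].append(a)  # so store both directions

print adjacency[2]  # prints [1, 6]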
If you use `isset` instead of `array_key_exists`, the code will be much more efficient. Function calls are slow in PHP, so it is better to use language constructs instead.
nikic
Your code seems to produce the correct output apart from one thing: It puts `27 744` in a new group, but they should be in the first.
nikic
Your new algorithm runs in O(V^2 + E) time, where V is the number of distinct numbers in the input and E is the number of pairs in the list. The problem can be solved in O(V + E) time using DFS, as coolkid writes.
Reid Barton
+1  A: 

Here's a slightly different version in Python, which builds a graph containing the edges specified, then converts that to a list of connected subgraphs.

I might want to use this later, so I wrote it as a general-purpose version that doesn't read input from a file or produce output with print statements; it just converts data structures.

def graph_to_connected_subgraphs(graph):
    trees = []
    for start in graph.keys():
        if start in graph:  # skip nodes already consumed by an earlier traversal
            members = [start]
            append_tree_from(graph, start, members)
            trees.append(members)
    return trees

def append_tree_from(graph, node, members):
    # collect every node reachable from 'node', deleting visited entries
    # from the graph so no subgraph is reported twice
    if node in graph:
        for endpoint in graph[node]:
            if endpoint not in members:  # guard against revisiting on cycles
                members.append(endpoint)
                append_tree_from(graph, endpoint, members)
        del graph[node]
    return members

def add_edge(graph, f, s):
    if s < f: # ensure f < s to handle cyclic graphs
        f, s = s, f
    if f not in graph:
        graph[f] = [s]
    else:
        graph[f].append(s)

graph = {}

add_edge(graph, 1,2)
add_edge(graph, 2,6)
add_edge(graph, 3,4)
add_edge(graph, 5,9)
add_edge(graph, 3,7)

print graph_to_connected_subgraphs(graph)

Output

[[1, 2, 6], [3, 4, 7], [5, 9]]
Lee Reeves
Is it not more Pythonic to do '`s, f = f, s`' to swap the values?
Jonathan Leffler
Yep, thanks. I've been working too much in C++ today.
Lee Reeves
A: 

Here's my stab at an answer. Python.

groups = []

infile = open("so2.txt")

for line in infile.readlines():
  newset = set(line.split())
  excludegroups = []
  for group in groups:
    if len(newset & group):
      # this group shares a member with the new pair, so fold it in
      newset |= group
    else:
      # no overlap; keep the group as-is
      excludegroups.append(group)
  groups = excludegroups
  groups.append(newset)

for i, s in enumerate(groups):
  print "%d: %s" % (i, " ".join(s))

The idea here is that forming graphs is not really the right approach. Each pair of numbers in the input is a set, and the rule is to return only disjoint sets. So I read each line and convert it to a set, then check all of the existing sets for intersections and merge those into the new set. Non-intersecting sets are simply carried over to the new list of sets, and once I'm done I add the new, merged set to that list. This way I can be sure that only disjoint sets make it into the list.

TokenMacGuy
+1  A: 

This is a typical application of the DFS (depth-first search) algorithm performed on graphs. Try reading about DFS. The complexity of this algorithm is O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges.

Ilija
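As a concrete illustration of this approach, here is a minimal iterative DFS sketch (an editorial addition, not part of the original answer): it builds an adjacency list from the edge pairs, then peels off one connected component per unvisited node, touching each vertex and edge once, which gives the O(|V| + |E|) bound mentioned above.

from collections import defaultdict

def connected_components(edges):
    # build an undirected adjacency list: O(|V| + |E|) space
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # iterative DFS with an explicit stack, so very deep graphs
        # cannot overflow the recursion limit
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

edges = [(1, 2), (3, 4), (5, 9), (2, 6), (3, 7)]
for i, comp in enumerate(connected_components(edges), 1):
    print "%d: %s" % (i, " ".join(str(n) for n in sorted(comp)))

With the sample edges this prints the three groups from the question (group order may vary, since dict iteration order is arbitrary).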
A: 
nikic
+3  A: 
TokenMacGuy
@TokenMacGuy: After trying out pretty much all the versions here including mine, your version is the fastest :) Thank you very much for the detailed explanations and thanks to everyone else for their valuable time.
Legend
Thank you very much. I'd love to see the run statistics for the different implementations! If you really need to push the speed envelope, there are some further optimizations we can make, especially if you know the maximum element.
TokenMacGuy
A: 
Olathe
A: 

After not being completely satisfied with my first two attempts at this, and some research, I came across this recipe for disjoint sets in Python, with Raymond Hettinger's blessing and input. (Raymond Hettinger is a long-time, very active Python core developer.)

Here is an adaptation of that recipe that's very close to my first two attempts, but the recipe itself may be a better reference.

Collecting should be about as efficient as possible for very large data sets, since most of the set operations in Python are implemented in C. The input data does not have to be sorted. For printing, I sort the output solely for readability; if this affects performance, the connections can be printed without sorting.

# ~~~~~
# data, setup

input = '''
    1 2
    3 4
    2 3
    ''' # etc.

def lgen():
    for l in input.splitlines():
        l = l.strip()
        if l:
            yield tuple(int(i) for i in l.split())

# ~~~~~
# collect

connections = {} # this is a mapping of values to the connections they are in
                 # each node will map to a shared object instance of the connection it is in
                 # e.g. {1: set([1,2]), 2: set([1,2])}, where the 2 sets are the same object

for x, y in lgen():
    cx = connections.setdefault(x, set([x]))  # if not found, create new connection with this single value

    cy = connections.get(y)                 # defaults to None if not found
    if not cy:                              # if we haven't come across this value yet...
        cx.add(y)                           # ...add it to the current connection...
        connections[y] = cx                 # ...and update the reference
    elif cy is not cx:                      # if the cy connection is not the exact same object as the cx connection...
        if len(cy) > len(cx):               # \
            cx, cy = cy, cx                 #  >... merge the smaller into the larger ...
        cx |= cy                            # /
        for v in cy:                        # ...and repoint every member of the absorbed
            connections[v] = cx             #    connection, so the sets stay shared objects

# ~~~~~
# print

seen = set()
for key in sorted(connections.keys()):
    if key not in seen:
        c = connections[key]
        print sorted(c)
        seen |= c
ma3
A: 

This is actually a very classic application of union-find (disjoint-set union).

If M is the number of edges and N the number of nodes, the time complexity is O(M * α(M)), which is effectively O(M) for all practical M, and the space complexity is O(N).

The algorithm is online and does not need to know all the edges in advance (unlike the graph-traversal solutions), and hence can scale very well. There is also no need to sort the edges; they can be given in any order.

For a graph with billions of nodes you'll need 64-bit integers (long int) and a lot of RAM, but it should handle millions of nodes and billions of edges very well.

The implementation is in C++, but it only uses vector/map, which you can find in almost any language you might want to use.


But since you have a unique id for each element, we need to map these ids to (contiguous) integers.

First version, without the mapping (assuming that all nodes between 1 and N exist):

#include <cstdio>
#include <algorithm>
#include <vector>
using namespace std;

const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];

// Find the representative of a's set, with path compression
int parent(int a) {
  return p[a] == a ? a : p[a] = parent(p[a]);
}
// Merge the sets of a and b; returns true if they were previously distinct
bool join(int a, int b) {
  p[a = parent(a)] = parent(b);
  return p[a] != a;
}

int main()
{
    // First integer in the file : number of nodes in the graph
    int N;
    scanf("%d",&N);
    // Union-find in O(M * alpha(M)) ~= O(M)
    // M = number of lines in the file
    for(int i = 1; i <= N ; i++)
    {
        p[i] = i;
        f[i] = -1;
    }
    int a,b;
    while(scanf("%d%d",&a,&b) != EOF)
        join(a,b);
    // Determine the number of groups : O(N)
    int nG = 0;
    for(int i = 1 ; i <= N ; i++)
    {
        p[i] = parent(p[i]);
        if(f[p[i]] == -1)
            f[p[i]] = nG++;
    }
    // Build groups : O(N)
    vector< vector<int> > Groups(N+1);
    for(int i = 1 ; i <= N ; i++)
        Groups[ f[p[i]] ].push_back(i);
    // Output result
    for(int i = 0 ; i < Groups.size() ; i++)
    {
        if(!Groups[i].size())
            continue;
        printf("%d : ",i);
        for(int j = 0 ; j < Groups[i].size() ; j++)
            printf("%d ",Groups[i][j]);
        printf("\n");
    }

}

Version with mapping: for this version we need to build the mapping. Since I don't know anything about your data, I'm using a classical map to build it in O(M log(N)). If you can put the ids of all nodes at the beginning of the input file, it can be O(N log(N)), or even better, O(N) if you use a hash map, or if you can build the mapping yourself with some knowledge of the graph.

Anyway, here is the code:

#include <cstdio>
#include <algorithm>
#include <vector>
#include <map>
using namespace std;

const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];

// Find the representative of a's set, with path compression
int parent(int a) {
  return p[a] == a ? a : p[a] = parent(p[a]);
}
// Merge the sets of a and b; returns true if they were previously distinct
bool join(int a, int b) {
  p[a = parent(a)] = parent(b);
  return p[a] != a;
}
// Mapping from arbitrary node ids to contiguous indices 1..N
int N = 0;
map<int,int> indMap,invMap;
int IND(int x) {
    if(indMap.find(x) == indMap.end())
    {
        // first time this id is seen: assign it the next free index
        N++;
        p[N] = N;
        f[N] = -1;
        indMap[x] = N;
    }
    invMap[ indMap[x] ] = x;  // remember the reverse mapping for output
    return indMap[x];
}


int main()
{
    // Union-find in O(M * alpha(M)) ~= O(M)
    // M = number of lines in the file
    int a,b;
    while(scanf("%d%d",&a,&b) != EOF)
        join(IND(a),IND(b));
    // Determine the number of groups : O(N)
    int nG = 0;
    for(int i = 1 ; i <= N ; i++)
    {
        p[i] = parent(p[i]);
        if(f[p[i]] == -1)
            f[p[i]] = nG++;
    }
    // Build groups : O(N)
    vector< vector<int> > Groups(N+1);
    for(int i = 1 ; i <= N ; i++)
        Groups[ f[p[i]] ].push_back(i);
    // Output result
    for(int i = 0 ; i < Groups.size() ; i++)
    {
        if(!Groups[i].size())
            continue;
        printf("%d : ",i+1);
        for(int j = 0 ; j < Groups[i].size() ; j++)
            printf("%d ", invMap[ Groups[i][j] ]);
        printf("\n");
    }

}
Loïc Février