I have some data like this:
1 2
3 4
5 9
2 6
3 7
and am looking for an output like this (group-id and the members of that group):
1: 1 2 6
2: 3 4 7
3: 5 9
First row because 1 is "connected" to 2 and 2 is connected to 6. Second row because 3 is connected to 4 and 3 is connected to 7
This looked to me like a graph traversal but the final order does not matter so I was wondering if someone can suggest a simpler solution that I can use on a large dataset (billions of entries).
From the comments:
- The problem is to find the set of disjoint sub-graphs given a set of edges.
- The edges are not directed; the line '1 2' means that 1 is connected to 2 and 2 is connected to 1.
- The '1:' in the sample output could be 'A:' without changing the meaning of the answer.
EDIT 1:
Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.
EDIT 2:
Test Input file:
1 27
1 134
1 137
1 161
1 171
1 275
1 309
1 413
1 464
1 627
1 744
2 135
2 398
2 437
2 548
2 594
2 717
2 738
2 783
2 798
2 912
5 74
5 223
7 53
7 65
7 122
7 237
7 314
7 701
7 730
7 755
7 821
7 875
7 884
7 898
7 900
7 930
8 115
9 207
9 305
9 342
9 364
9 493
9 600
9 676
9 830
9 941
10 164
10 283
10 380
10 423
10 468
10 577
11 72
11 132
11 276
11 306
11 401
11 515
11 599
12 95
12 126
12 294
13 64
13 172
13 528
14 396
15 35
15 66
15 210
15 226
15 360
15 588
17 263
17 415
17 474
17 648
17 986
21 543
21 771
22 47
23 70
23 203
23 427
23 590
24 286
24 565
25 175
26 678
27 137
27 161
27 171
27 275
27 309
27 413
27 464
27 627
27 684
27 744
29 787
Benchmarks:
I tried out everything and the version posted by TokenMacGuy is the fastest on the sample dataset that I tried. The dataset has about 1 million entries for which it took me about 6 seconds on a Dual Quad-Core 2.4GHz machine. I haven't gotten a chance to run it on the entire dataset yet but I will post the benchmark as soon as it is available.