I have a big csv file which lists connections between nodes in a graph. example:

0001,95784
0001,98743
0002,00082
0002,00091

So this means that node id 0001 is connected to nodes 95784 and 98743, and so on. I need to read this into a sparse matrix in NumPy. How can I do this? I am new to Python, so tutorials on this would also help.

A: 

If you want an adjacency matrix, you can do something like:

import csv
from scipy.sparse import dok_matrix

# Dictionary-of-keys format: efficient for building a sparse matrix entry by entry
S = dok_matrix((10000, 10000), dtype=bool)
with open("your_file_name") as f:
    for line in csv.reader(f):
        S[int(line[0]), int(line[1])] = True
tkerwin
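One practical note on `dok_matrix`: it is convenient for incremental construction, but arithmetic and row slicing are usually faster after converting to CSR (compressed sparse row). A minimal sketch with made-up entries:

```python
from scipy.sparse import dok_matrix

# Build incrementally in DOK format, then convert to CSR for fast operations
S = dok_matrix((5, 5), dtype=bool)
S[0, 1] = True
S[0, 2] = True
S[3, 4] = True

C = S.tocsr()   # CSR: efficient matrix-vector products and row slicing
print(C.nnz)    # number of stored (non-zero) entries -> 3
```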
A: 

You might also be interested in Networkx, a pure python network/graphing package.

From the website:

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

>>> import networkx as nx
>>> G = nx.Graph()
>>> G.add_edge(1, 2)
>>> G.add_node("spam")
>>> print(list(G.nodes()))
[1, 2, 'spam']
>>> print(list(G.edges()))
[(1, 2)]
mavnn
A: 

Example using scipy's lil_matrix (list-of-lists sparse matrix).

Row-based linked list matrix.

This contains a list (self.rows) of rows, each of which is a sorted list of column indices of non-zero elements. It also contains a list (self.data) of lists of these elements.

$ cat 1938894-simplified.csv
0,32
1,21
1,23
1,32
2,23
2,53
2,82
3,82
4,46
5,75
7,86
8,28

Code:

#!/usr/bin/env python

import csv
from scipy import sparse

rows, columns = 10, 100
matrix = sparse.lil_matrix((rows, columns))

with open('1938894-simplified.csv') as f:
    for line in csv.reader(f):
        row, column = map(int, line)
        # Proper indexed assignment keeps .rows and .data consistent
        matrix[row, column] = 1

print(matrix.rows)

Output (`matrix.rows` holds, for each of the 10 rows, the column indices of its non-zero entries; the exact repr varies by NumPy version):

[[32] [21, 23, 32] [23, 53, 82] [82] [46] [75] [] [86] [28] []]
The MYYN
Exactly what I needed. Any good resources for scipy that you can recommend?
Ankur Chauhan
I guess http://docs.scipy.org/doc/ would be a starting point.
The MYYN
One small question: the numbers in the CSV are not indices, they are IDs, i.e. the file starts with

0001001,9304045
0001001,9308122
0001001,9309097
0001001,9311042
0001001,9401139
0001001,9404151
0001001,9407087
0001001,9408099
0001001,9501030
0001001,9503124

So how do I convert these IDs to numerical indices? The IDs serve only to identify nodes, so they may be replaced by equivalent indices as long as those are unique. How do I accomplish this? I know I can just make rows and columns as big as the largest ID, but that seems wasteful, since e.g. the indices 0-1000 would never be used.
Ankur Chauhan
I understand your concern, and I assume there is no single best way to 'compress' your data to the relevant elements; it depends largely on your goal and what you want to do with the data later. E.g., you could use a 'mapping dictionary' which maps the actual IDs to smaller numerical values.
The MYYN
If you do want to 'squeeze' your indices so that they start at 0 and go up in increments of 1 to some maximum, why not (1) sort them, producing `sorted_ixs` (`sorted_ixs = ixs; sorted_ixs.sort()`), (2) `zip(sorted_ixs, range(len(sorted_ixs)))`, producing a list of pairs matching each index with a 'squeezed' index, (3) use that list as a 'translation table' from old to new indices.
Michał Marczyk
Actually this will also sort `ixs`, I think; use `sorted_ixs = ixs[:]` if you want to keep your unsorted `ixs` around.
Michał Marczyk