I have a list of many sentences in Excel, one per row in a column, and three or more such columns. Some sentences appear in more than one column. Is it possible to write a script that builds a Venn diagram and finds the sentences common to all the columns?

Example: these are the sentences in one column; the other columns hold similar lists.

Blood lymphocytes from cancer

Blood lymphocytes from patients

Ovarian tumor_Grade III

Peritoneum tumor_Grade IV

Hormone resistant PCA

Is it possible to write a script in Python?

A: 

Your question is not fully clear, so I might be misunderstanding what you're looking for.

A Venn diagram boils down to a few simple set operations, and Python has these built into its set datatype. Basically, take your groups of items and apply set operations (e.g. use intersection to find the common items).
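As a minimal sketch of those set operations: the two columns below are hypothetical sample data borrowed from the question's sentences, and the variable names are only for illustration.

```python
# Two hypothetical columns of sentences, held as sets
column_a = {
    "Blood lymphocytes from cancer",
    "Ovarian tumor_Grade III",
    "Hormone resistant PCA",
}
column_b = {
    "Blood lymphocytes from patients",
    "Ovarian tumor_Grade III",
    "Hormone resistant PCA",
}

common = column_a & column_b   # in both columns (the overlap of the Venn diagram)
only_a = column_a - column_b   # in column_a only
only_b = column_b - column_a   # in column_b only
either = column_a | column_b   # in at least one column

print(sorted(common))
```

Each region of a Venn diagram corresponds to one of these expressions; `&` and `-` are shorthand for the `intersection` and `difference` methods.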

To read in the data, your best bet is probably to save the file in CSV format and just parse it with the string split method.
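A rough sketch of the split approach, assuming no cell contains a comma or quotation marks (quoted fields would break a plain split, and the csv module is the safer choice there). The sample text and the function name `columns_from_text` are made up for illustration.

```python
# Hypothetical CSV text; in practice it would come from reading the exported file
csv_text = """Blood lymphocytes from cancer,Ovarian tumor_Grade III
Blood lymphocytes from patients,Hormone resistant PCA
Ovarian tumor_Grade III,Blood lymphocytes from cancer
"""

def columns_from_text(text):
    """Return one set of non-empty cells per column."""
    columns = []
    for line in text.splitlines():
        for i, cell in enumerate(line.split(",")):
            # Start a new set the first time a column index is seen
            if i == len(columns):
                columns.append(set())
            cell = cell.strip()
            if cell:
                columns[i].add(cell)
    return columns

col_a, col_b = columns_from_text(csv_text)
print(sorted(col_a & col_b))
```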

Brian
+2  A: 

This is my interpretation of the question...

Given the data file z.csv (export your data from Excel into a CSV file):

"Blood lymphocytes from cancer","Blood lymphocytes from sausages","Ovarian tumor_Grade III"
"Blood lymphocytes from patients","Ovarian tumor_Grade III","Peritoneum tumor_Grade IV"
"Ovarian tumor_Grade III","Peritoneum tumor_Grade IV","Hormone resistant PCA"
"Peritoneum tumor_Grade XV","Hormone resistant PCA","Blood lymphocytes from cancer"
"Hormone resistant PCA",,"Blood lymphocytes from patients"

This program finds the sentences common to all the columns

import csv

# A list of 3 sets of sentences, one set per column
results = [set(), set(), set()]

# Read the csv file into the 3 sets, skipping empty cells
with open("z.csv", newline="") as f:
    for row in csv.reader(f):
        for i, data in enumerate(row):
            if data:
                results[i].add(data)

# Work out the sentences common to all columns
intersection = results[0]
for result in results[1:]:
    intersection = intersection.intersection(result)

# Sort so the output order is the same on every run
print("Common to all columns :-")
for data in sorted(intersection):
    print(data)

And it prints this answer

Common to all columns :-
Hormone resistant PCA
Ovarian tumor_Grade III

Not 100% sure that is what you are looking for but hopefully it gets you started!

It could be generalised easily to as many columns as you like, but I didn't want to make it more complicated.
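One possible way to generalise it, as a sketch rather than a definitive version: grow the list of sets as wider rows appear, then intersect them all at once. The function name `common_sentences` is hypothetical, and the data is the same as z.csv above, held in memory so the example is self-contained.

```python
import csv
import io

def common_sentences(csv_file):
    """Return the set of sentences that appear in every column."""
    columns = []
    for row in csv.reader(csv_file):
        # Grow the list of sets when a row has more columns than seen so far
        while len(columns) < len(row):
            columns.append(set())
        for i, cell in enumerate(row):
            if cell:  # ignore empty cells
                columns[i].add(cell)
    return set.intersection(*columns) if columns else set()

# Same rows as z.csv, as an in-memory file
sample = io.StringIO(
    '"Blood lymphocytes from cancer","Blood lymphocytes from sausages","Ovarian tumor_Grade III"\n'
    '"Blood lymphocytes from patients","Ovarian tumor_Grade III","Peritoneum tumor_Grade IV"\n'
    '"Ovarian tumor_Grade III","Peritoneum tumor_Grade IV","Hormone resistant PCA"\n'
    '"Peritoneum tumor_Grade XV","Hormone resistant PCA","Blood lymphocytes from cancer"\n'
    '"Hormone resistant PCA",,"Blood lymphocytes from patients"\n'
)
print(sorted(common_sentences(sample)))
```

With a real file you would pass an open file object instead, e.g. one opened with open("z.csv", newline="").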

Nick Craig-Wood