Hi,

I have a list with a large number of lines, each taking the subject-verb-object form, e.g.:

Jane likes Fred
Chris dislikes Joe
Nate knows Jill

To plot a network graph that expresses the different relationships between the nodes in directed color-coded edges, I will need to replace the verb with an arrow and place a color code at the end of each line, thus, somewhat simplified:

Jane -> Fred red;
Chris -> Joe blue;
Nate -> Jill black;

There's only a small number of verbs, so replacing them with an arrow is just a matter of a few search and replace commands. Before doing that, however, I will need to put a color code at the end of every line that corresponds to the line's verb. I'd like to do this using Python.

These are my baby steps in programming, so please be explicit and include the code that reads in the text file.

Thanks for your help!

+3  A: 

Simple enough; assuming the list of verbs is fixed and small, this is easy to do with a dictionary and a for loop:

VERBS = {
    "likes": "red"
  , "dislikes": "blue"
  , "knows": "black"
  }

def replace_verb (line):
    for verb, color in VERBS.items():
        if verb in line:
            return "%s %s;" % (
                  line.replace (verb, "->")
                , color
                )
    return line

def main ():
    filename = "my_file.txt"
    with open (filename, "r") as fp:
        for line in fp:
            print replace_verb (line)

# Allow the module to be executed directly on the command line
if __name__ == "__main__":
    main ()
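
The substring test above can misfire: "likes" also occurs inside "dislikes", and a name can contain a verb. A minimal sketch of a whole-word variant follows, assuming verbs are always set off from names by whitespace. The names `VERB_RE` and `replace_verb_whole_word` are illustrative only, and the code is written so that it also runs on Python 3.

```python
import re

VERBS = {"likes": "red", "dislikes": "blue", "knows": "black"}

# \b anchors each alternative at word boundaries, so "likes" cannot match
# inside "dislikes" and a verb cannot match inside a longer name.
VERB_RE = re.compile(r"\b(%s)\b" % "|".join(re.escape(v) for v in VERBS))

def replace_verb_whole_word(line):
    line = line.rstrip()  # drop the trailing newline before appending the color
    match = VERB_RE.search(line)
    if match is None:
        return line  # no known verb: leave the line unchanged
    verb = match.group(1)
    # Replace only the matched occurrence, using its span in the line.
    return "%s->%s %s;" % (line[:match.start()], line[match.end():], VERBS[verb])

print(replace_verb_whole_word("Chris dislikes Joe\n"))
```

Multi-word subjects and objects pass through untouched, because only the matched verb is cut out of the line.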
John Millikin
`def main():` is just silly in my opinion. I like C, but why write C in Python?
Chris Lutz
why do you loop the dict every time? Verb is always 2nd position (sez the OP).
Gregg Lind
@Chris: It's not "C in Python", it's just good style. By keeping behavior in functions, you'll be able to import the module in a REPL.
John Millikin
@Gregg: The dict loop is to check every line for every verb. Position in the line is irrelevant to the algorithm.
John Millikin
If a person's name includes a verb string, this will change the name. For example, walken as a name and walk as a verb
foosion
@foosion: There's not enough information in the question to solve that issue, since there's no guarantee names will be only a single word, or that lines will have only three words.
John Millikin
@john - I (and others) are reading "I have a list with a large number of lines, each taking the subject-verb-object form" and the example as meaning three words each separated by a space.
foosion
Yes, but we've only been given a simple sample input. The actual input could be something like "Jean Paul eats fish".
John Millikin
It would be nice if those with questions were more exact in their specs :-)
foosion
Apologies for the lack of clarity. In the actual input both subjects and objects vary unpredictably between one and two words.
Karasu
`line.rstrip()` is missing.
J.F. Sebastian
FWIW, I agree that most "meaningful" statements should be inside functions (or methods), and `main` is the most typical and meaningful name for the function to execute when `__name__` equals `'__main__'`.
Alex Martelli
+1  A: 

Are you sure this isn't a little homeworky :) If so, it's okay to fess up. Without going into too much detail, think about the tasks you're trying to do:

For each line:

  1. read it
  2. split it into words (on whitespace - .split() )
  3. convert the middle word into a color (based on a mapping; cf. Python's dict())
  4. print the first word, arrow, third word and the color
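
The four steps above can be sketched as follows. This assumes every line has exactly three whitespace-separated words; the color mapping is taken from the example in the question, and `VERB_COLORS` and `convert_line` are illustrative names. A list of strings stands in for the opened file.

```python
VERB_COLORS = {"likes": "red", "dislikes": "blue", "knows": "black"}

def convert_line(line):
    # Step 2: split on whitespace; this also discards the trailing newline.
    subject, verb, obj = line.split()
    # Step 3: look up the color for the middle word.
    color = VERB_COLORS[verb]
    # Step 4: first word, arrow, third word, color.
    return "%s -> %s %s;" % (subject, obj, color)

# Step 1: read the lines; a list stands in for an open file here.
for line in ["Jane likes Fred\n", "Chris dislikes Joe\n", "Nate knows Jill\n"]:
    print(convert_line(line))
```

A line with more or fewer than three words raises a ValueError at the unpacking step, which at least makes malformed input visible.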

Code using NetworkX (networkx.lanl.gov/)

'''
plot relationships in a social network
'''

import networkx
## make a fake file 'ex.txt' in this directory
## then write fake relationships to it.
example_relationships = file('ex.txt','w') 
print >> example_relationships, '''\
Jane Doe likes Fred
Chris dislikes Joe
Nate knows Jill \
'''
example_relationships.close()

rel_colors = {
    'likes':  'blue',
    'dislikes' : 'black',
    'knows'   : 'green',
}

def split_on_verb(sentence):
    ''' we know the verb is the only lower cased word

    >>> split_on_verb("Jane Doe likes Fred")
    ('Jane Doe', 'Fred', 'likes')

    '''
    words = sentence.strip().split()  # take off any outside whitespace, then split
                                       # on whitespace
    if not words:
        return None  # if there aren't any words, just return nothing

    verbs = [x for x in words if x.islower()]
    verb = verbs[0]  # we want the '1st' one (python numbers from 0,1,2...)
    verb_index = words.index(verb) # where is the verb?
    subject = ' '.join(words[:verb_index])
    obj =  ' '.join(words[(verb_index+1):])  # 'object' is already used in python
    return (subject, obj, verb)


def graph_from_relationships(fh,color_dict):
    '''
    fh:  a filehandle, i.e., an opened file, from which we can read lines
        and loop over
    '''
    G = networkx.DiGraph()

    for line in fh:
        if not line.strip():  continue # move on to the next line,
                                         # if our line is empty-ish
        (subj,obj,verb) = split_on_verb(line)
        color = color_dict[verb]
        # cf: python 'string templates', there are other solutions here
        print "'%s' -> '%s' [color='%s'];" % (subj,obj,color)
        # store the color as an edge attribute (keyword form, NetworkX 1.0 and later)
        G.add_edge(subj,obj,color=color)

    return G

G = graph_from_relationships(file('ex.txt'),rel_colors)
print G.edges()
# from here you can use the various networkx plotting tools on G, as you're inclined.
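
Because the end goal is Graphviz input, the edge lines can also be wrapped into a complete DOT file. A minimal sketch in plain Python, assuming the edges are already available as (subject, object, color) tuples and that names contain no double quotes. `to_dot` is a hypothetical helper, not a NetworkX function.

```python
def to_dot(edges):
    # edges: iterable of (subject, object, color) tuples
    lines = ["digraph relationships {"]
    for subj, obj, color in edges:
        # Double quotes let node names contain spaces, e.g. "Jane Doe".
        lines.append('    "%s" -> "%s" [color=%s];' % (subj, obj, color))
    lines.append("}")
    return "\n".join(lines)

print(to_dot([("Jane Doe", "Fred", "blue"), ("Chris", "Joe", "black")]))
```

The resulting text can be saved to a file and passed to the Graphviz `dot` program.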
Gregg Lind
> Are you sure this isn't a little homeworky :) It ain't. Honest, guv'nor. I'm a humanities type with not much of a technical background except for HTML and CSS. As part of a postgrad research project I'm doing some network analysis that will benefit from data visualizations. Graphviz and Matplotlib seemed to answer my needs; I started using Python because of Matplotlib and figured the language could help me condition my Graphviz input -- or, conversely, shaping up Graphviz input seemed a nice problem for learning some Python, whichever way round.
Karasu
Aha, then what you might really want is to look at NetworkX (http://networkx.lanl.gov/), which can make both the plots (using Graphviz) and the graphs. Read other postings on it, it's really easy to use.
Gregg Lind
Thanks for the pointer -- I will need to look into this!
Karasu
A: 

Python 2.5:

import sys
from collections import defaultdict

codes = defaultdict(lambda: ("---", "Missing action!"))
codes["likes"] =    ("-->", "red")
codes["dislikes"] = ("-/>", "green")
codes["loves"] =    ("==>", "blue")

for line in sys.stdin:
    subject, verb, object_ = line.strip().split(" ")
    arrow, color = codes[verb]
    print subject, arrow, object_, color, ";"
Georg
don't forget the semi-colon in the last line (see original spec). The defaultdict is a nice touch, maybe explain that it makes it that if a verb isn't found, it just does an empty string, which may or may not be what the OP wants.
Gregg Lind
You just added the explanation for the defaultdict, so no need for me to do it.
Georg
Shadowing builtin "object" is yucky...
bstpierre
Now you changed it!
Gregg Lind
+5  A: 

It sounds like you will want to research dictionaries and string formatting. In general, if you need help programming, just break down any problem you have into extremely small, discrete chunks, search those chunks independently, and then you should be able to formulate it all into a larger answer. Stack Overflow is a great resource for this type of searching.

Also, if you have any general curiosities about Python, search or browse the official Python documentation. If you find yourself constantly not knowing where to begin, read the Python tutorial or find a book to go through. A week or two investment to get a good foundational knowledge of what you are doing will pay off over and over again as you complete work.

verb_color_map = {
    'likes': 'red',
    'dislikes': 'blue',
    'knows': 'black',
}

with open('infile.txt') as infile: # assuming you've stored your data in 'infile.txt'
    for line in infile:
        # Python uses the name object, so I use object_
        subject, verb, object_ = line.split()
        print "%s -> %s %s;" % (subject, object_, verb_color_map[verb])
leo-the-manic
for those keeping track at home: 'with' doesn't work for those of us in 2.4 land... (the OP has 2.5, so all good there).
Gregg Lind
on 2.5, Gregg, `with` requires `__future__` import, which is of course a very basic thing
SilentGhost
subject and object_ might each be two words, in which case this won't work.
foosion
Subject and object are in fact two words in some cases. Sorry for failing to make this clear.
Karasu
+2  A: 
verbs = {"dislikes":"blue", "knows":"black", "likes":"red"}
for s in open("/tmp/infile"):
  s = s.strip()
  for verb in verbs.keys():
    if (s.count(verb) > 0):
      print s.replace(verb,"->")+" "+verbs[verb]+";"
      break

Edit: Rather use "for s in open"

leonm
Why do you not just do `for s in f` in place of the `while`???
leo-the-manic
Probably because I wrote this before I had coffee ;-)
leonm
-1 for weird flow control. Why don't you rewrite it to use the "for line in open()" idiom?
steveha
Hm -- I can't judge how "weird" it is, but I'm going to accept it as the solution because it's concise, looks straightforward to my uneducated eye, and works even where either object or subject consist of two or even three words, regardless of upper/lower-casing. Thanks, leonm!
Karasu
There is a standard idiom in Python for text processing, `for line in open("filename"):` loops once per input line. This is clean and straightforward. I don't like loops where you initialize a variable before the loop and then again inside the loop; if you ever need to change the loop, you need to change your code in two places. I was hoping leonm would edit his answer to use the Python idiom. Aside from that, this is a good simple solution. My own answers involved splitting the input into words, to make it easier to analyze the words, but your question doesn't require it.
steveha
A: 

In addition to the question, Karasu also said (in a comment on one answer): "In the actual input both subjects and objects vary unpredictably between one and two words."

Okay, here's how I would solve this.

color_map = \
{
    "likes" : "red",
    "dislikes" : "blue",
    "knows" : "black",
}

def is_verb(word):
    return word in color_map

def make_noun(lst):
    if not lst:
        return "--NONE--"
    elif len(lst) == 1:
        return lst[0]
    else:
        return "_".join(lst)


for line in open("filename").readlines():
    words = line.split()
    # subject could be one or two words
    if is_verb(words[1]):
        # subject was one word
        s = words[0]
        v = words[1]
        o = make_noun(words[2:])
    else:
        # subject was two words
        assert is_verb(words[2])
        s = make_noun(words[0:2])
        v = words[2]
        o = make_noun(words[3:])
    color = color_map[v]
    print "%s -> %s %s;" % (s, o, color)

Some notes:

0) We don't really need "with" for this problem, and writing it this way makes the program more portable to older versions of Python. This should work on Python 2.2 and newer, I think (I only tested on Python 2.6).

1) You can change make_noun() to have whatever strategy you deem useful for handling multiple words. I showed just chaining them together with underscores, but you could have a dictionary with adjectives and throw those out, have a dictionary of nouns and choose those, or whatever.

2) You could also use regular expressions for fuzzier matching. Instead of simply using a dictionary for color_map you could have a list of tuples, with a regular expression paired with the replacement color, and then when the regular expression matches, replace the color.

steveha
A: 

Here is an improved version of my previous answer. This one uses regular expression matching to make a fuzzy match on the verb. These all work:

Steve loves Denise
Bears love honey
Maria interested Anders
Maria interests Anders

The regular expression pattern "loves?" matches "love" plus an optional 's'. The pattern "interest.*" matches "interest" plus anything. Patterns with multiple alternatives separated by vertical bars match if any one of the alternatives matches.
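
One detail worth keeping in mind: `re.match` anchors a pattern at the start of the word but not at the end, so "likes?" also accepts longer words that merely begin with "like". A small sketch of that behavior follows; the `$`-anchored variant is a hypothetical tightening for comparison, not something this answer's code uses.

```python
import re

loose = re.compile("likes?")    # anchored at the start only (via match)
strict = re.compile("likes?$")  # hypothetical variant, anchored at both ends

for word in ["like", "likes", "likely", "dislikes"]:
    print("%-9s loose=%-5s strict=%s"
          % (word, bool(loose.match(word)), bool(strict.match(word))))
```

Here "likely" passes the loose pattern but not the strict one, and "dislikes" passes neither, because matching always starts at the first character.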

import re

re_map = \
[
    ("likes?|loves?|interest.*", "red"),
    ("dislikes?|hates?", "blue"),
    ("knows?|tolerates?|ignores?", "black"),
]

# compile the regular expressions one time, then use many times
pat_map = [(re.compile(s), color) for s, color in re_map]

# We don't use is_verb() in this version, but here it is.
# A word is a verb if any of the patterns match.
def is_verb(word):
    return any(pat.match(word) for pat, color in pat_map)

# Return color from matched verb, or None if no match.
# This detects whether a word is a verb, and looks up the color, at the same time.
def color_from_verb(word):
    for pat, color in pat_map:
        if pat.match(word):
            return color
    return None

def make_noun(lst):
    if not lst:
        return "--NONE--"
    elif len(lst) == 1:
        return lst[0]
    else:
        return "_".join(lst)


for line in open("filename"):
    words = line.split()
    # subject could be one or two words
    color = color_from_verb(words[1])
    if color:
        # subject was one word
        s = words[0]
        o = make_noun(words[2:])
    else:
        # subject was two words
        color = color_from_verb(words[2])
        assert color
        s = make_noun(words[0:2])
        o = make_noun(words[3:])
    print "%s -> %s %s;" % (s, o, color)

I hope it is clear how to take this answer and extend it. You can easily add more patterns to match more verbs. You could add logic to detect "is" and "in" and discard them, so that "Anders is interested in Maria" would match. And so on.
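
As one possible sketch of that last idea, filler words can be dropped before the verb search. `FILLER_WORDS` and `strip_fillers` are hypothetical names, and the approach assumes nobody in the data is actually named "is" or "in".

```python
FILLER_WORDS = set(["is", "in"])

def strip_fillers(words):
    # Keep every word that is not a filler, preserving the original order.
    return [w for w in words if w not in FILLER_WORDS]

words = strip_fillers("Anders is interested in Maria".split())
print(words)
```

After filtering, the verb sits at index 1 again, which is where the loop above looks for it when the subject is a single word.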

If you have any questions, I'd be happy to explain this further. Good luck.

steveha