My Friends,

I have spent quite some time on this one but cannot figure out a good way to do it. I am coding in Python, by the way.

So, here is a line of text in a file I am working with, for example:

">ref|ZP_01631227.1| 3-dehydroquinate synthase [Nodularia spumigena CCY9414]..."

How can I extract the two strings "ZP_01631227.1" and "Nodularia spumigena CCY9414" from the line?

The pair of "|" characters and the pair of brackets act as markers: I want the strings in between each pair.

I guess I could loop over all the characters in the line and do it the hard way, but that takes a lot of time. Is there a Python library or some other smart way to do this nicely?

Thanks to all!

+1  A: 
>>> for line in open("file"):
...     if "|" in line:
...         whatiwant_1=line.split("|")[1]
...         if "[" in line:
...             whatiwant_2=line.split("[")[1].split("]")[0]
...
>>> print whatiwant_1 , whatiwant_2
ZP_01631227.1 Nodularia spumigena CCY9414
ghostdog74
This is exactly the solution I needed! Thanks so much!
GoJian
+3  A: 

One concise alternative is a regular expression (for some reason they have a bad rep in the Python community, but they do provide conciseness and power for simple text handling):

import re
s = ">ref|ZP_01631227.1| 3-dehydroquinate synthase [Nodularia spumigena CCY9414]..."
mo = re.search(r'\|(.*?)\|.*\[(.*?)\]', s)
if mo:
    thefirst, thesecond = mo.groups()
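
For readers who find the one-line pattern hard to parse, here is a rough sketch of the same idea spelled out with `re.VERBOSE` and named groups. The names `HEADER_RE`, `parse_header`, `accession` and `organism` are made up for illustration, and the sketch assumes the `|...|` field always comes before the bracketed one on a line.

```python
import re

# Hypothetical names; the pattern is the same idea as above, one piece per line.
HEADER_RE = re.compile(r"""
    \|(?P<accession>.*?)\|   # shortest text between the first pair of pipes
    .*                       # anything in between (greedy, so the last bracket pair wins)
    \[(?P<organism>.*?)\]    # text inside the square brackets
""", re.VERBOSE)


def parse_header(line):
    """Return (accession, organism), or None if the line does not match."""
    mo = HEADER_RE.search(line)
    if mo is None:
        return None
    return mo.group("accession"), mo.group("organism")


s = ">ref|ZP_01631227.1| 3-dehydroquinate synthase [Nodularia spumigena CCY9414]..."
print(parse_header(s))  # ('ZP_01631227.1', 'Nodularia spumigena CCY9414')
```

Returning `None` for non-matching lines lets the caller skip lines that lack the markers instead of raising an exception.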
Alex Martelli
Did you mean to say `mo = re.search(r'\|(.*?)\|.*\[(.*?)\]', s)`?
gnibbler
As for the reason that regexps have a bad reputation in the Python community, I would suggest that the documentation is a little intimidating compared to, say, the Perl documentation (perlrequick). A gentle tutorial filled with examples could usefully be added at the beginning of the current `re` documentation, for instance.
EOL
@gnibbler, yep, I'd dropped the `s`, tx for spotting, editing to fix.
Alex Martelli
@EOL, doc patches always welcome!-)
Alex Martelli
@Alex Your comment is a good motivation!
EOL
A: 

@ghostdog74 I think yours is the easiest solution if the line is always in the form "bla|foo|bar[baz]". However, you could put both of your if statements on the same indentation level.
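
Roughly what I mean, as a sketch only: the helper name `extract_fields` is my own invention, and it assumes a line with a "|" has at least the "bla|foo|" part, so that the second piece of the split is the wanted string.

```python
def extract_fields(line):
    """Return (text_between_pipes, text_in_brackets); a part is None if its marker is missing."""
    first = second = None
    # The two checks are independent, so they can sit at the same level.
    if "|" in line:
        first = line.split("|")[1]
    if "[" in line:
        second = line.split("[")[1].split("]")[0]
    return first, second


line = ">ref|ZP_01631227.1| 3-dehydroquinate synthase [Nodularia spumigena CCY9414]..."
print(extract_fields(line))  # ('ZP_01631227.1', 'Nodularia spumigena CCY9414')
```

Calling this once per line inside the `for line in open(...)` loop also avoids reusing values left over from an earlier line.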

Wieland H.