I have a peculiar problem. Using Python, I need to read from a text file only those substrings that lie within predefined ranges of offsets, say 5-8 and 12-16.

For example, if a line in the file is something like:

abcdefghi akdhflskdhfhglskdjfhghsldk

then I would like to read the two words "efgh" and "kdhfl". In the word "efgh", the offset of the character "e" is 5 and that of "h" is 8. The same goes for the other word, "kdhfl", which spans offsets 12 to 16.

Please note that whitespace also counts toward the offset. In fact, the whitespace in my file does not occur consistently in every line and cannot be depended upon to extract the words of interest, which is why I have to rely on the offsets.

I hope I've been able to make the question clear.

Awaiting answers!

Edit -

Yes, the amount of whitespace in each line can change, and it also counts toward the offsets. For example, consider these two lines:

abcz d 
a bc d

In both cases, I view the offset of the final character "d" as the same. As I said, the white spaces in the file are not consistent and I cannot rely on them. I need to pick up the characters based on their offsets. Does your answer still hold?

+1  A: 

To extract pieces at given offsets, simply read each line into a string and then access each substring with a slice ([from:to]).

It's unclear what you're saying about the inconsistent whitespace. If whitespace adds to the offset, it must be consistent to be meaningful. If the whitespace amount can change but actually accounts for the offsets, you can't reliably extract your data.

In your added example, as long as d's offset stays the same, you can extract it with slicing.

>>> s = 'a bc d'
>>> s[5:6]
'd'
>>> s = 'abc  d'
>>> s[5:6]
'd'
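Applied to the line from the question, this might look like the minimal sketch below. It assumes the offsets in the question are 1-based and inclusive (so 5-8 covers "e" through "h"); the helper name extract_ranges is made up for illustration.

```python
def extract_ranges(line, ranges):
    """Return the substrings of line covered by 1-based, inclusive (start, end) offsets."""
    # Python slices are 0-based and exclude the end index,
    # so offsets start..end map to line[start - 1:end].
    return [line[start - 1:end] for start, end in ranges]


line = "abcdefghi akdhflskdhfhglskdjfhghsldk"
print(extract_ranges(line, [(5, 8), (12, 16)]))  # ['efgh', 'kdhfl']
```

Because spaces are ordinary characters in a string, they count toward the index just like letters do.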
Eli Bendersky
Yes, the whitespace amount can change and accounts for the offsets also. For example, consider these two lines: "abc  d" and "a bc d". In both cases, I view the offset of the final character "d" as the same. As I said, the white spaces in the file are not consistent and I cannot rely on them. I need to pick up the characters based on their offsets. Does your answer still hold?
Gitmo
Sorry, ignore the above comment. It's not clear. I've made an edit to the main question instead.
Gitmo
@Eli Thanks a lot. I'm a newbie to Python. Now I feel the question was quite trivial. Sorry for bothering :)
Gitmo
A: 

What's to stop you from using a regular expression? Besides the whitespace, do the offsets vary?

/.{4}(.{4}).{4}(.{4})/
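In Python, the same idea might look like the sketch below. Note that the pattern as written above captures offsets 5-8 and 13-16; to hit the 5-8 and 12-16 ranges from the question, the second gap shrinks to three characters and the second group grows to five. The adjusted pattern here is an adaptation for illustration, not the one from the answer.

```python
import re

# Skip 4 chars, capture 4 (offsets 5-8), skip 3, capture 5 (offsets 12-16).
pattern = re.compile(r".{4}(.{4}).{3}(.{5})")

line = "abcdefghi akdhflskdhfhglskdjfhghsldk"
match = pattern.match(line)
if match:
    print(match.groups())  # ('efgh', 'kdhfl')
```

The dot matches spaces as well as letters, so whitespace counts toward the offsets here too.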
Epsilon Prime
I've edited my question a bit to make it more clear. I could not understand your solution, but does it still hold?
Gitmo
Regex isn't a tool for everything. For extracting data at constant indexes, simple slicing is much clearer and much faster.
Eli Bendersky
Save the regexen until you've determined that simple slicing or string methods won't be sufficient. Python strings have a number of really nice methods. Instead of building an RE to match "^prefix" and calling re.match, you can just use s.startswith("prefix"); similarly with endswith. In this case, string slicing is *far* preferable to slashes and dots.
Paul McGuire
Regexp should be a last resort.
Arrieta
I implemented it as a regular expression because it was stated that it was unclear what to do with the whitespace. Sure, slice away if you don't care what the whitespace is going to do. But you'll need a regular expression if you're going to do something strange with the whitespace (like treat tabs as 8 spaces or something). That said, it looks from the added example like spaces are being treated as characters, so slicing works just fine.
Epsilon Prime
+5  A: 

Assuming it's a file:

with open("file") as f:
    for line in f:
        # offsets 5-8 and 12-16 (1-based) become slices [4:8] and [11:16]
        print(line[4:8], line[11:16])
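One thing to watch for with this loop: each line still carries its trailing newline, and slicing past the end of a short line silently returns a shorter (or empty) string rather than raising an error. A hedged sketch that strips the newline and skips lines too short to contain the last offset follows; read_fields and the in-memory sample file are illustrative assumptions, not part of the answer.

```python
import io


def read_fields(lines, ranges):
    """Yield a list of substrings per line for 1-based, inclusive (start, end) offsets."""
    last = max(end for _, end in ranges)
    for line in lines:
        line = line.rstrip("\n")  # the newline is not part of the data
        if len(line) < last:
            continue  # too short to hold every requested range
        yield [line[start - 1:end] for start, end in ranges]


# io.StringIO stands in for open("file") so the sketch runs on its own.
sample = io.StringIO("abcdefghi akdhflskdhfhglskdjfhghsldk\nshort\n")
for fields in read_fields(sample, [(5, 8), (12, 16)]):
    print(fields)  # ['efgh', 'kdhfl']
```

Whether short lines should be skipped, padded, or reported is a judgment call that depends on the data.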
ghostdog74