ansaurus

Question

How to strip variable spaces in each line of a text file based on special condition - one-liner in Python?

Answer 1

+2 A:

It still seems tome that there's some format in your files:

>>> regex = r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)'
>>> for line in s.splitlines():
    lst = [i.strip() for j in re.findall(regex, line) for i in j if j]
    print(lst)


[]
[]
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
['Silent Ghost', 'Python Ninja', '788505', '88448', 'MS Comp', '887']

Regex is quite straightforward, the only things you need to pay attention to are the delimiters (\s) and the word breaks (\b) in case of the first delimiter. Note that when the line wouldn't match you get an empty list as lst. That would be a read flag to bring up the user interaction described below. Also you could skip the header lines by doing:

>>> file = open(fname)
>>> [next(file) for _ in range(2)]
>>> for line in file:
    ...  # here empty lst indicates issues with regex

Previous variants:

>>> import re
>>> for line in open(fname):
    lst = re.split(r'\s{2,}', line)
    l = len(lst)
    if l in (2,3):
        lst[l-1:] = lst[l-1].split()
    print(lst)

['Name', 'Degree', 'CLASS', 'CODE', 'EDU', 'Scores']
['--------------------------------------------------------------------------------------']
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']

another thing to do is simply allow user to decide what to do with questionable entries:

if l < 3:
    lst = line.split()
    print(lst)
    iname = input('enter indexes that for elements of name: ')     # use raw_input in py2k
    idegr = input('enter indexes that for elements of degree: ')

Uhm, I was all the time under the impression that the second element might contain spaces, since it's not the case you could just do:

>>> for line in open(fname):
    name, _, rest = line.partition('  ')
    lst = [name] + rest.split()
    print(lst)

SilentGhost 2010-10-06 15:30:15

Wow, awesome! it works for the most part but in instances where the degree is separated by a single space, it is failing. [John G. S Mech 7653 54 MS 65]. Really appreciate the super quick solution!

ThinkCode 2010-10-06 15:47:43

@ThinkCode: *degree is separated by a single space* separated from what? name?

SilentGhost 2010-10-06 15:51:19

@Think: if you have a fixed list of degrees, you might want to check `lst[1]` against it and try manual split on those entries.

SilentGhost 2010-10-06 15:57:45

My bad, I was referring to EDU. If we try to parse [John G. S Mech 7653 54 MS 65], the data is not separated. Cases where the data is separated by a single space (except for name and degree)? Thank you!

ThinkCode 2010-10-06 15:59:20

@Think: formatting is being lost in your comments, do you mind updating your question with the record you have troubles with?

SilentGhost 2010-10-06 16:02:01

Updated the last record. All the columns in this record are separated by one space, in this case the script is treating it as one single record.

ThinkCode 2010-10-06 16:05:44

@Think: see my update and another idea how to deal with idiosyncrasies of such format.

SilentGhost 2010-10-06 16:15:15

Spot on! Thank you so much :) One last request (hope I am not asking for too much!). I added one more record (what if degree is 2 words separated by a space).

ThinkCode 2010-10-06 16:30:56

@Think: but not separated by two spaces from the remaining fields?

SilentGhost 2010-10-06 16:32:13

The remaining fields are all numbers. Numbers are all one unit and never have a space. Strings can be 2 words with a space, I know it is a complicated request, the formatting is totally off in these files :( Your snippet is really really helpful :)

ThinkCode 2010-10-06 16:42:11

OMG! You are awesome! Works beautifully! If you could update your sol. explain regex and the crucial steps in your snippet, that would greatly benefit us newbies! Thank you a million times :)

ThinkCode 2010-10-06 17:03:49

Answer 2

+2 A:

Variation on SilentGhost's answer, this time first splitting the name from the rest (separated by two or more spaces), then just splitting the rest, and finally making one list.

import re

for line in open(fname):
    name, rest = re.split('\s{2,}', line, maxsplit=1)
    print [name] + rest.split()

Steven 2010-10-06 16:19:04

Answer 3

+1 A:

This answer was written after the OP confessed to changing every tab ("\t") in his data to 3 spaces (and not mentioning it in his question).

Looking at the first line, it seems that this is a fixed-column-width report. It is entirely possible that your data contains tabs that if expanded properly might result in a non-crazy result.

Instead of doing line.replace('\t', ' ' * 3) try line.expandtabs().

Docs for expandtabs are here.

If the result looks sensible (columns of data line up), you will need to determine how you can work out the column widths programatically (if that is possible) -- maybe from the heading line.

Are you sure that the second line is all "-", or are there spaces between the columns? The reason for asking is that I once needed to parse many different files from a database query report mechanism which presented the results like this:

RecordType  ID1                  ID2         Description           
----------- -------------------- ----------- ----------------------
1           12345678             123456      Widget                
4           87654321             654321      Gizmoid

and it was possible to write a completely general reader that inspected the second line to determine where to slice the heading line and the data lines. Hint:

sizes = map(len, dash_line.split())

If expandtabs() doesn't work, edit your question to show exactly what you do have i.e. show the result of print repr(line) for the first 5 or so lines (including the heading line). It might also be useful if you could say what software produces these files.

John Machin 2010-10-07 20:43:12

The problem is that the text files I was given to process are of multiple different formats, no fixed width columns - some records have varied tabs and spaces (mixed). It is all over the place. Some don't even have spacing sometimes. The second line sometimes got delimiters to denote the start of the column. Those can be taken care of easily. When there is no delimiter they might or might not have dashes. They will be gone in the cleansing process. Different sources get the same data formatted in a different manner.

ThinkCode 2010-10-08 13:27:54

@ThinkCode: "some records have varied tabs and spaces (mixed)" -- yeah yeah, that description fits the "need for expandtabs" diagnosis. "They will be gone in the cleansing process" -- you have an interesting interpretation of "cleansing". I sympathise with your general problem, but bashing tabs into 3 spaces and not saying so is not a very good idea at all.

John Machin 2010-10-08 20:49:17

ansaurus

tags:

views:

answers:

How to strip variable spaces in each line of a text file based on special condition - one-liner in Python?

related questions