I have some data (text files) that is formatted in the most uneven manner one could think of. I am trying to minimize the amount of manual work on parsing this data.
Sample Data :
Name Degree CLASS CODE EDU Scores
--------------------------------------------------------------------------------------
John Marshall CSC 78659944 89989 BE 900
Think Code DB I10 MSC 87782 1231 MS 878
Mary 200 Jones CIVIL 98993483 32985 BE 898
John G. S Mech 7653 54 MS 65
Silent Ghost Python Ninja 788505 88448 MS Comp 887
Conditions :
- More than one spaces should be compressed to a delimiter (pipe better? End goal is to store these files in the database).
- Except for the first column, the other columns won't have any spaces in them, so all those spaces can be compressed to a pipe.
- Only the first column can have multiple words with spaces (Mary K Jones). The rest of the columns are mostly numbers and some alphabets.
- First and second columns are both strings. They almost always have more than one spaces between them, so that is how we can differentiate between the 2 columns. (If there is a single space, that is a risk I am willing to take given the horrible formatting!).
- The number of columns varies, so we don't have to worry about column names. All we want is to extract each column's data.
Hope I made sense! I have a feeling that this task can be done in a oneliner. I don't want to loop, loop, loop :(
Muchos gracias "Pythonistas" for reading all the way and not quitting before this sentence!