views: 96 · answers: 3
Related to a previous question, I'm trying to do replacements over a number of large CSV files.

The column order (and contents) change between files, but each file has about 10 columns that I want and can identify by their column header names. I also have 1-2 dictionaries for each of those columns. For each column I want to apply only its corresponding dictionaries, and to apply them sequentially.

An example of how I've tried to solve this:

# -*- coding: utf-8 -*-
import re

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]

# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}

# dict to correspond header names with the correct dictionary.
# i would like the dictionaries to be read sequentially in col1.
refD = {u'col1':[D1a,D1b],u'col2':[D2]}

# clunky replace function: compiles a new pattern on every call
def freplace(s, d):
    rc = re.compile('|'.join(re.escape(k) for k in d))
    def trans(m):
        return d[m.group(0)]
    return rc.sub(trans, s)

# get correspondence between dictionary and column
C = []
for i in range(len(Header)):
    if Header[i] in refD:
        C.append([refD[Header[i]], i])

# loop through lines and make replacements
for line in fileLines:
    for i in range(len(line)):
        for j in range(len(C)):
            if C[j][1] == i:
                for d in C[j][0]:
                    line[i] = freplace(line[i], d)

My problem is that this code is quite slow, and I can't figure out how to speed it up. I'm a beginner, and my guess is that my freplace function is largely what is slowing things down, because it has to compile a new pattern for each column in each row. I would like to move the line rc = re.compile('|'.join(re.escape(k) for k in d)) out of that function, but I don't know how to do that while still preserving what the rest of my code does.
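One way to hoist the compile step is to build the compiled pattern for each dictionary once, up front, and pass the pattern in alongside its dictionary. A sketch using the same toy dictionaries from above:

```python
import re

D1a = {u'A': u'a'}
D1b = {u'B': u'b'}
D2 = {u'X': u'x', u'Y': u'y'}

# compile each dictionary's pattern exactly once, before any loop
compiled = [(re.compile('|'.join(re.escape(k) for k in d)), d)
            for d in (D1a, D1b, D2)]

def freplace(s, rc, d):
    # rc is already compiled, so there is no per-call compile cost
    return rc.sub(lambda m: d[m.group(0)], s)

rc, d = compiled[0]
print(freplace(u'A', rc, d))  # prints: a
```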

+1  A: 

You don't need re:

# -*- coding: utf-8 -*-

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]

# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}

# dict to correspond header names with the correct dictionary
refD = {u'col1':[D1a,D1b],u'col2':[D2]}

# now let's have some fun...

for line in fileLines:
    for i, (param, word) in enumerate(zip(Header, line)):
        for minitranslator in refD[param]:
            if word in minitranslator:
                line[i] = minitranslator[word]

returns:

[[u'a', u'x'], [u'b', u'y']]
eumiro
@eumiro: I was under the impression from my previous question that re would be much faster than looping through each item in the dictionary individually? Also, I want to loop through the dictionaries sequentially when there are multiple dictionaries for a given column, which is why I used lists in refD.
rallen
@rallen - if you have large csv files, then it is better to prepare the refD/translator dictionary correctly. If you want to apply several dictionaries (D1a, D1b) one after another, then merge them into one dictionary (translator[param]) - their keys should be unique, shouldn't they? In my final for-loop I am iterating only over the lines and columns of your large csv-file, not over the dictionaries - I am just accessing them through keys, which is **fast**.
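The merge eumiro describes can be sketched like this (assuming, as the comment says, that the keys of D1a and D1b don't overlap):

```python
D1a = {u'A': u'a'}
D1b = {u'B': u'b'}

# fold the per-column dictionaries into a single translator;
# one dictionary lookup then replaces several regex passes
translator = {}
for d in (D1a, D1b):
    translator.update(d)
```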
eumiro
@eumiro: Due to the nature of my data, I want the dictionaries to be preserved. If something gets replaced through D1a it gives me more information than if it goes through D1b (which consists largely of shortened entries from D1a).
rallen
@rallen - see the edited version without `translator`.
eumiro
A: 

So if that's the case, and all 10 columns have the same names each time but appear out of order (I'm not sure this is exactly what you're doing, but here goes): keep one array for the header names and one for each line's data split into elements (ten items per line). Then dispatch with a case/select combo: compare the element of your header array, and inside each case apply the matching regex to the data array at the same offset. Since the column name routes you to the right case, you can reuse the same ten regexes repeatedly instead of recompiling a new "command" each time.

I hope that makes sense. I'm sorry I don't know the syntax to help you out, but I hope my idea is what you're looking for.

EDIT: i.e. initialize all the regexes before starting your loops. Then, after you read a line (and after the header line):

select array[n]
    case "column1"
        regex(data[0])
    case "column2"
        regex(data[1])
    ...
end select

This should call the right regex for the right columns.
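In Python, the case/select dispatch sketched above is usually spelled as a dictionary of precompiled regexes keyed by column name. A hedged sketch (the column names and patterns here are invented for illustration; real code would use dictionary-driven replacements):

```python
import re

# one precompiled regex per known column name, built once before any loop
column_regex = {
    u'col1': re.compile(u'A|B'),
    u'col2': re.compile(u'X|Y'),
}

def process(header, row):
    # dispatch on the column name instead of a case/select chain
    out = []
    for name, value in zip(header, row):
        rc = column_regex.get(name)
        out.append(rc.sub(lambda m: m.group(0).lower(), value) if rc else value)
    return out
```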

onaclov2000
+3  A: 

There are a ton of things you can do to speed this up:

First, use the csv module. It provides efficient and bug-free methods for reading and writing CSV files. The DictReader object in particular is what you're interested in: it will present every row it reads from the file as a dictionary keyed by its column name.
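A minimal DictReader sketch (the in-memory string stands in for a real file):

```python
import csv
import io

# pretend CSV data; in practice you'd open() a real file
data = io.StringIO(u'col1,col2\nA,X\nB,Y\n')

for row in csv.DictReader(data):
    # each row is a dict keyed by the header names,
    # so column order in the file no longer matters
    print(row[u'col1'], row[u'col2'])  # prints: A X, then B Y
```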

Second, compile your regexes once, not every time you use them. Save the compiled regexes in a dictionary keyed by the column that you're going to apply them to.
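Applied to the question's setup, that might look like the following sketch, which builds the compiled pattern for every dictionary once, keyed by column name:

```python
import re

D1a = {u'A': u'a'}
D1b = {u'B': u'b'}
D2 = {u'X': u'x', u'Y': u'y'}
refD = {u'col1': [D1a, D1b], u'col2': [D2]}

# compile once, keyed by column name: each entry keeps the compiled
# pattern next to the dictionary it translates with
compiled = {
    col: [(re.compile('|'.join(re.escape(k) for k in d)), d) for d in dicts]
    for col, dicts in refD.items()
}

def translate(col, value):
    # apply the column's dictionaries sequentially, as the question requires
    for rc, d in compiled.get(col, ()):
        value = rc.sub(lambda m: d[m.group(0)], value)
    return value
```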

Third, consider that if you apply a hundred regexes to a long string, you're going to be scanning the string from start to finish a hundred times. That may not be the best approach to your problem; you might be better off investing some time in an approach that lets you read the string from start to end once.
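One way to get a single pass per column is to merge its dictionaries into one pattern. A sketch, with the caveat that this only reproduces the sequential behaviour when later dictionaries never match the *output* of earlier ones (an assumption worth checking on real data):

```python
import re

D1a = {u'A': u'a'}
D1b = {u'B': u'b'}

# merge with earlier dictionaries taking priority;
# later updates win, so list D1a last
merged = {}
for d in (D1b, D1a):
    merged.update(d)

# one regex, one scan over the string, instead of a scan per dictionary
rc = re.compile('|'.join(re.escape(k) for k in merged))
result = rc.sub(lambda m: merged[m.group(0)], u'AB')
```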

Robert Rossney
+1 for the `csv` module. Regexen are not the right tool here.
Daenyth
@Robert: Thanks. 1.) In my actual file I'm using the csv module. I'll look into DictReader to match up with the columns; previously I didn't use it because I had to translate the first line of my text before matching the columns with their corresponding dictionaries, which I did with something like "for line in lines: if firstline == 1: [translate, match translated cols in line with dictionary, etc.]". 2.) Indeed I was hoping to compile them first but wasn't sure how to reference them afterwards; I'll try what you suggested.
rallen
@Robert: 3.) I only want to use specific dictionaries with specific columns, so I don't see how to implement what you suggested. All told I only have 10 dictionaries, so I hope it isn't too much of a speed loss. And since I am iterating over the columns in each row anyway, I don't see how it matters that I call different regexes, apart from the step that assigns the correct regexes to the correct column, as you said in your second point.
rallen
Every RE that you use searches the target string from start to end. If your target strings are long, and you are using a lot of REs, it could matter a great deal how many REs you're using. It's hard to know without profiling your code to find out what's actually taking so long.
Robert Rossney
@Robert: Why is it faster to use one big RE on an entire row than ten small ones, one per column? I know there is an extra checking step involved here: each column has to be checked to know which RE to use. So far REs have been described to me in a "magical" way: "an RE searches everything at once, compared with iterating over the keys of a dictionary, which searches one by one", which confuses this Python/programming beginner. Do you know of a reference on *how* REs work? I can't find any that makes sense to me.
rallen
When you test a string with an RE, the RE engine starts at the beginning of the string and moves through it character by character until it finds a match. This operation's time is going to be roughly proportional to the product of two numbers: the length of the string and the number of REs. The larger those numbers, the slower it will be. If your strings are short, or the number of REs is small, it's no big deal. If your strings are long and the number of REs is large, you should look to finding a more efficient approach.
Robert Rossney
It's really hard to suggest alternatives. You're asking us to help you optimize a complicated algorithm without clearly describing the overall problem. So I can only suggest things that I think might be big wins, and I can't make my suggestions very clear.
Robert Rossney