views:

677

answers:

6

I want to filter two lists as quickly as possible in a Python script. I have used the built-in filter() function for this, but it is quite slow and takes too much time because my lists are very big: I think more than 5 million items in each list, maybe more. I don't know how to speed it up. If anybody has an idea, or can write a small function for it, please send it to me. I would be very thankful. Thanks.

+1  A: 

I guess filter() is as fast as you can possibly get without coding the filtering function in C (and in that case, you would be better off coding the whole filtering process in C).

Why don't you paste the function you are filtering on? That might lead to easier optimizations.

Read this about optimization in Python, and this about the Python/C API.

Vinko Vrsalovic
+8  A: 

Maybe your lists are too large to fit in memory and you are experiencing thrashing. If the sources are in files, you do not need the whole list in memory all at once. Try using itertools, e.g.:

from itertools import ifilter

def is_important(s):
    # Keep only lines longer than 10 characters.
    return len(s) > 10

# The file object is iterated lazily, line by line.
filtered_list = ifilter(is_important, open('mylist.txt'))

Note that ifilter returns an iterator that is fast and memory efficient.

Generator Tricks is a tutorial by David M. Beazley that teaches some interesting uses for generators.

gimel
+1  A: 

Before doing it in C, you could try numpy. Perhaps you can turn your filtering into number crunching.
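For illustration, here is a minimal sketch of that idea, assuming the data is numeric (the array and the threshold below are made up, not taken from the question):

import numpy

# Hypothetical numeric data; the question never says what the items are.
a = numpy.array([3, 12, 7, 25, 1, 18])

# Boolean-mask indexing runs the whole filter in C inside numpy.
filtered = a[a > 10]   # array([12, 25, 18])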

Cheery
+1  A: 

Filter will create a new list, so if your original is very big, you could end up using up to twice as much memory. If you only need to process the results iteratively, rather than use them as a real random-access list, you are probably better off using ifilter instead, i.e.:

for x in itertools.ifilter(condition_func, my_really_big_list):
    do_something_with(x)

Another speed tip is to use a Python builtin rather than a function you write yourself. There's an itertools.ifilterfalse specifically for the case where you would otherwise need to introduce a lambda to negate your check (e.g. "ifilter(lambda x: not x.isalpha(), l)" should be written "ifilterfalse(str.isalpha, l)").
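As a quick sketch of that rewrite (the sample list l below is made up purely for illustration):

from itertools import ifilter, ifilterfalse

l = ['abc', '123', 'a1b2', 'xyz']  # hypothetical sample data

# Negating inside a lambda costs an extra Python-level call per item:
slow = list(ifilter(lambda x: not x.isalpha(), l))

# ifilterfalse with the builtin method avoids the lambda entirely:
fast = list(ifilterfalse(str.isalpha, l))

assert slow == fast == ['123', 'a1b2']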

Brian
+1  A: 

It may be useful to know that a conditional list comprehension is generally much faster than one that calls an equivalent lambda:

>>> import timeit
>>> timeit.Timer('[x for x in xrange(10) if (x**2 % 4) == 1]').timeit()
2.0544309616088867
>>> timeit.f = lambda x: (x**2 % 4) == 1
>>> timeit.Timer('[x for x in xrange(10) if f(x)]').timeit()
3.4280929565429688

(Not sure why I needed to put f in the timeit namespace there. Haven't really used the module much.)

fivebells
+3  A: 

If you can avoid creating the lists in the first place, you'll be happier.

Rather than

aBigList = someListMakingFunction()
filter(lambda x: x > 10, aBigList)

You might want to look at your function that makes the list.

def someListMakingGenerator():
    for x in some_source:   # placeholder: iterate over whatever used to build the list
        yield x

Then your filter doesn't involve a giant tract of memory:

def myFilter( aGenerator ):
    for x in aGenerator:
        if x > 10: 
            yield x

By using generators, you don't keep much stuff in memory.
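Putting the two pieces together, the whole pipeline stays lazy from end to end (do_something_with below is just a placeholder for whatever processing you actually need):

# Nothing is materialized: items flow one at a time from the
# generator, through the filter, into the loop body.
for x in myFilter(someListMakingGenerator()):
    do_something_with(x)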

S.Lott