views: 106
answers: 4
I read many files from my system. I want to read them faster, maybe like this:

results = []
for file in open("filenames.txt").readlines():
    results.append(open(file.strip(), "r").read())  # strip the trailing newline from each name

I don't want to use threading. Any advice is appreciated.

The reason I don't want to use threads is that it would make my code unreadable; I want some clever way to make it faster, with less code and easier to understand.

Yesterday I tested another solution with multiprocessing, but it performed badly and I don't know why. Here is the code:

def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()
@cost_time
@statistics_db
def batch_xml2db():
    from multiprocessing import Pool,Queue
    p=Pool(5)
    #q=Queue()
    files=glob.glob(g_filter)
    #for file in files:
    #    q.put(file)

    def P():
        while q.qsize()<>0:
            xml2db(q.get())
    p.map(xml2db,files)
    p.join()
A: 
results = [open(f.strip()).read() for f in open("filenames.txt").readlines()]

This may be insignificantly faster, but it's probably less readable (depending on the reader's familiarity with list comprehensions).

Your main problem here is that your bottleneck is disk IO - buying a faster disk will make much more of a difference than modifying your code.
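
As an aside, a variant that also closes each file explicitly (neither snippet above does) costs only a couple more lines; the file names are the same as in the question:

results = []
with open("filenames.txt") as names:
    for name in names:                      # iterating the file object avoids readlines()'s extra list
        with open(name.strip(), "r") as f:  # strip the trailing newline from each name
            results.append(f.read())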

Zooba
It reduced the running time from 805s to 714s, thanks. But I still want to find an even quicker way, and I'd also like to know why this style is quicker than my code.
mlzboy
@mlzboy, you must have a whole lot of little files here if you had that big of a speed-up from this. Is this code inside of a function? That could speed things up a little bit too.
Justin Peel
This is quicker because the loop is handled as part of the list comprehension, which is quite likely in native code rather than interpreted Python. Without threading you are unlikely to get any faster than this (using Python). The disk IO is still your bottleneck.
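
For the curious, a quick way to measure the pure loop-versus-comprehension difference with the file IO taken out (numbers will vary by machine and Python version):

import timeit

setup = "data = list(range(1000))"
loop = """
out = []
for x in data:
    out.append(x)
"""
comp = "out = [x for x in data]"

print("loop:          %.3fs" % timeit.timeit(loop, setup=setup, number=10000))
print("comprehension: %.3fs" % timeit.timeit(comp, setup=setup, number=10000))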
Zooba
A: 

Well, if you want to improve performance then improve the algorithm, right? What are you doing with all this data? Do you really need it all in memory at the same time, potentially causing OOM if filenames.txt lists too many files, or files that are too large?

If you're doing this with lots of files I suspect you are thrashing, hence your 700s+ (roughly twelve-minute) run time. Even my poor little HD can sustain 42 MB/s writes (42 * 714s = 30GB). Take that with a grain of salt, since you must both read and write, but I'm guessing you don't have over 8 GB of RAM available for this application. A related SO question/answer suggested using mmap, and the answer above that suggested an iterative/lazy read like what you get in Haskell for free. These are probably worth considering if you really do have tens of gigabytes to munge.
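
For illustration, a minimal sketch of the iterative/lazy-read idea (mmap would be another option along the same lines); handle() is a placeholder, not a function from the question:

def iter_contents(list_path="filenames.txt"):
    # Yield one file's contents at a time so memory stays bounded,
    # no matter how many names filenames.txt contains.
    with open(list_path) as names:
        for name in names:
            with open(name.strip(), "r") as f:
                yield f.read()

for text in iter_contents():
    handle(text)  # do your real processing here, one file at a time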

TomMD
A: 

Is this a one-off requirement or something that you need to do regularly? If it's something you're going to be doing often, consider using MySQL or another database instead of a file system.

Richard Careaga
I'm extracting data from HTML and persisting it to many XML files, then I load the XML into a MySQL database.
mlzboy
Ah, then I think you will get a greater benefit from going directly from HTML to XML to the database, without the intermediate step of writing to the file system. That saves both a write and a read. Alternatively, put your XML files under your webserver and read them in through a socket, rather than going through the OS filesystem.
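
For illustration, a minimal sketch of that idea, reusing the pq/Product/session/g_fields names from the question; the session.add() call is an assumption, since the original snippet never shows how p is attached to the session:

def xml_string2db(xml_text):
    # Parse the XML while it is still in memory; nothing is written to or
    # read back from the file system.
    s = pq(xml_text)
    p = Product()
    for field in g_fields:
        v = s("field[@name='%s']" % field).text()
        if v and v.strip() and hasattr(p, field):
            setattr(p, field, v)
    session.add(p)  # assumption: not shown in the original snippet
    session.commit()

The HTML-extraction step would then call xml_string2db(generated_xml) instead of writing generated_xml out to disk.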
Richard Careaga
A: 

Not sure if this is still the code you are using. A couple adjustments I would consider making.

Original:

def xml2db(file):
    s=pq(open(file,"r").read())
    dict={}
    for field in g_fields:
        dict[field]=s("field[@name='%s']"%field).text()
    p=Product()
    for k,v in dict.iteritems():
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()

Updated: remove the use of the dict; it is extra object creation, iteration, and collection.

def xml2db(file):
    s=pq(open(file,"r").read())

    p=Product()
    for k in g_fields:
        v=s("field[@name='%s']"%k).text()
        if v is None or v.strip()=="":
            pass
        else:
            if hasattr(p,k):
                setattr(p,k,v)
    session.commit()

You could profile the code using the Python profiler.

This might tell you where the time is being spent. It may be in session.commit(); that call may need to be reduced to once every couple of files.
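
For example, a minimal cProfile run over a single call (sample_file stands for any one of your XML files; profiling one serial call also sidesteps the multiprocessing workers, whose time would not show up in the parent's profile anyway):

import cProfile
import pstats

cProfile.run("xml2db(sample_file)", "xml2db.prof")
pstats.Stats("xml2db.prof").sort_stats("cumulative").print_stats(20)

And if the commits do dominate, a rough sketch of committing once per batch instead of once per file (xml2db_no_commit and the batch size of 50 are placeholders, not names from the question):

for i, f in enumerate(files):
    xml2db_no_commit(f)  # same as xml2db, with the session.commit() removed
    if i % 50 == 49:
        session.commit()  # flush one batch
session.commit()  # flush whatever is left over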

I have no idea what your code does beyond this, so that is really a stab in the dark; you could also try running it without sending or writing any output.

If you can separate your code into Reading, Processing and Writing, you can test the cost of each stage individually:

A) Reading cost: see how long it takes just to read all the files.

B) Processing cost: load a single file into memory and process it enough times to represent the entire job, without any extra reading IO.

C) Output cost: save a batch of sessions representative of your job size.

This should show you what is taking the most time and whether any improvement can be made in any area.
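
As a rough sketch, timing stage A on its own might look like the following, reusing the glob/g_filter names from the question; stages B and C follow the same pattern once parsing and the session.commit() calls are split into separate functions:

import glob
import time

files = glob.glob(g_filter)

start = time.time()
total_bytes = 0
for name in files:
    total_bytes += len(open(name, "r").read())  # read every file, do nothing else with it
print("A) read %d files, %d bytes in %.1fs" % (len(files), total_bytes, time.time() - start))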

kevpie