views:

102

answers:

3

I downloaded many HTML files and stored them on disk. Now I need to read their content, extract the data I need, and persist it to MySQL. I load the files one by one in the traditional way, which is not efficient and takes nearly 8 minutes.

Any advice is welcome.

import glob
from pyquery import PyQuery as pq

# g_xml2csv_file, g_filter, delete() and the cost_time decorator are defined elsewhere in the module.

g_fields = [
 'name',
 'price',
 'productid',
 'site',
 'link',
 'smallImage',
 'bigImage',
 'description',
 'createdOn',
 'modifiedOn',
 'size',
 'weight',
 'wrap',
 'material',
 'packagingCount',
 'stock',
 'location',
 'popularity',
 'inStock',
 'categories',
]

@cost_time
def batch_xml2csv():
    "Batch-convert the downloaded XML files into a single CSV file"
    delete(g_xml2csv_file)
    f = open(g_xml2csv_file, "a")
    import os.path
    import mmap
    for file in glob.glob(g_filter):
        print "reading %s" % file
        ff = open(file, "r+")
        size = os.path.getsize(file)
        data = mmap.mmap(ff.fileno(), size)
        s = pq(data.read(size))
        data.close()
        ff.close()
        #s = pq(open(file, "r").read())
        line = []
        for field in g_fields:
            r = s("field[@name='%s']" % field).text()
            if r is None:
                line.append("\\N")  # MySQL's NULL marker for LOAD DATA INFILE
            else:
                # escape embedded double quotes so the CSV stays parseable
                line.append('"%s"' % r.replace('"', '\\"'))
        f.write(",".join(line) + "\n")
    f.close()
    print "done!"

I tried mmap, but it didn't seem to help much.

+2  A: 

If you've got 25,000 text files on disk, 'you're doing it wrong'. Depending on how you store them on disk, the slowness could literally be seeking on disk to find the files.

If you've got 25,000 of anything, it'll be faster if you put it in a database with an intelligent index -- even if you make the index field the filename, it'll be faster.

If you have multiple directories that descend N levels deep, a database would still be faster.
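
For illustration, a minimal sketch of that idea using SQLite (the database file, table, and column names here are made up; the same pattern applies to MySQL):

import glob
import sqlite3

# One-time import of the downloaded HTML into a table keyed by filename.
conn = sqlite3.connect("pages.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pages (
                    filename TEXT PRIMARY KEY,   -- indexed lookup key
                    html     TEXT
                )""")

for path in glob.glob("*.html"):
    with open(path) as fh:
        conn.execute("INSERT OR REPLACE INTO pages (filename, html) VALUES (?, ?)",
                     (path, fh.read()))
conn.commit()

# Later, fetching one document is a single indexed lookup instead of a directory scan.
row = conn.execute("SELECT html FROM pages WHERE filename = ?",
                   ("page1.html",)).fetchone()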

synthesizerpatel
I store the files in a single directory.
mlzboy
25k files in one directory will take a long time to list no matter how you slice it. To give you an example, I wrote a script that generated N files with between 0 and 65 kbytes of data each. Simply running 'ls -l' took 0.021 seconds at 1,000 files, 0.199 s for 10,000 files, and a whopping 0.487 seconds (half a second!) for 25,000 files. That's the worst-case scenario of course, but randomly picking files out of this list still means having to traverse the btree _and_ compete with other applications that are using the filesystem for reads and writes.
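
A rough way to reproduce that measurement in Python (the file count, file size, and scratch directory name are arbitrary):

import os
import time

# Create N small scratch files, then time a plain directory listing.
N = 25000
scratch = "scratch_dir"
os.mkdir(scratch)
for i in range(N):
    with open(os.path.join(scratch, "file%05d.txt" % i), "w") as fh:
        fh.write("x" * 1024)

start = time.time()
names = os.listdir(scratch)
print "listed %d files in %.3f seconds" % (len(names), time.time() - start)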
synthesizerpatel
Whoops. I understand your problem a bit better now. Whatever is producing these files should be writing directly to a database rather than using an intermediary file _before_ you write it to the database. If you're parsing through HTML, consider writing your spider code in Python so it can do everything at once. Alternatively, use a tiered directory system so you can break the files up into more manageable chunks, e.g. root/a/aa/aardvark.html, root/c/ch/chiapet.html, and so on; a small sketch of that layout follows.
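
The helper name and the two-level first-letter/first-two-letters scheme below are just one possible choice:

import os
import shutil

def tiered_path(root, filename):
    # root/a/aa/aardvark.html style layout: first letter, then first two letters
    name = filename.lower()
    return os.path.join(root, name[0], name[:2], filename)

def move_into_tiers(root, filename):
    dest = tiered_path(root, filename)
    dirname = os.path.dirname(dest)
    if not os.path.isdir(dirname):
        os.makedirs(dirname)
    shutil.move(filename, dest)

# e.g. move_into_tiers("root", "aardvark.html") puts the file at root/a/aa/aardvark.html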
synthesizerpatel
A: 

You can scan the files while downloading them in multiple threads if you use scrapy.
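
A minimal sketch of that approach, assuming a recent scrapy version; the spider name, seed URL, and selectors are made up, and only two of the fields are shown:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["http://example.com/products?page=1"]  # hypothetical seed URL

    def parse(self, response):
        # Extract the fields as soon as each page is downloaded, instead of
        # saving the HTML to disk and re-reading it in a separate pass.
        yield {
            "name": response.css("field[name='name']::text").get(),
            "price": response.css("field[name='price']::text").get(),
        }

Items yielded like this can then be written to MySQL from an item pipeline, so the download, parse, and persist steps still stay logically separate.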

DiggyF
I keep all the steps separate; it keeps the solution clear.
mlzboy
A: 

If the algorithm is correct, using the psyco module can sometimes help quite a lot. It does not, however, work with Python 2.7 or Python 3+.
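
Usage is roughly along these lines (a sketch, assuming a Python version psyco supports; the try/except just makes it optional):

try:
    import psyco
    psyco.full()                     # JIT-compile every function in the process
    # or: psyco.bind(batch_xml2csv)  # compile just the hot function
except ImportError:
    pass                             # psyco not available; run unoptimized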

Tony Veijalainen