My plan is to collect data from websites in batches (lawyer bios from each firm's website; since they are all different, I will use a modified spider for each site) and convert each batch into a csv file; then to json; and then load it to database. So I will need to append each new file to the existing database. Please let me know how to achieve this task the best way. Thanks.

+4  A: 

Just load the database directly. Collect data from the websites in batches and load SQLite3 immediately, using simple batch applications built on the Django ORM. Do not create CSV. Do not create JSON. Do not create intermediate files. Do not do any extra work.


Edit.

from myapp.models import MyModel
import urllib2

def someGenerator( url ):
    # open the URL with urllib2
    # parse the page with BeautifulSoup
    # yield one ( this, that, the_other ) tuple per lawyer bio
    yield this, that, the_other

with open( "sourceListOfURLs.txt", "r" ) as source:
    for aLine in source:
        for this, that, the_other in someGenerator( aLine.strip() ):
            # create() builds and saves the row; no separate save() call needed
            MyModel.objects.create( field1=this, field2=that, field3=the_other )
S.Lott
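For readers who are not actually on Django, the same load-directly idea can be sketched with the standard-library sqlite3 module. The table name, columns, and sample rows below are purely illustrative; in the real task each batch of rows would come from scraping a firm's site:

```python
import sqlite3

# Illustrative rows; in practice each batch comes from a site-specific spider.
rows = [
    ("Alice Example", "Smith & Co", "https://example.com/alice"),
    ("Bob Example", "Jones LLP", "https://example.com/bob"),
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE IF NOT EXISTS lawyer (name TEXT, firm TEXT, url TEXT)"
)

# Each scraped batch is appended directly -- no CSV or JSON step in between.
conn.executemany("INSERT INTO lawyer VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM lawyer").fetchone()[0]
print(count)
```

Running the same insert loop once per batch appends the new records to the existing table, which is all the "append each new file to the database" step amounts to.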
Thank you, this sounds good, but can you give me some more detail to start from? A search for "Django ORM" did not turn up any basic material that I can use.
Zeynel
What do you mean by "load the SQLite3 directly?"
Zeynel
@Zeynel: Your question says you're using Django. Are you actually using Django? If you are, then you already know about the Django ORM. http://stackoverflow.com/questions/1884694/how-to-populate-sqlite3-in-django/1885417#1885417 If you're not actually using Django, please update the question to say what you *are* using.
S.Lott