ansaurus

Question

Storing files for testbin/pastebin in Python

Answer 1

+1 A:

I wrote something similar a while back in Django to test jQuery snippets. See:

http://jquery.nodnod.net/

I have the code available on GitHub at http://github.com/dz/jquerytester/tree/master if you're curious.

If you're using straight Python, there are a couple of ways to approach naming:

1. If storing as files, ask for a name, salt it with the current time, and generate a hash to use as the filename.
2. If using SQLite or some other database, just use a numeric unique ID.

Personally, I'd go for #2. It's easy, ensures uniqueness, and allows you to easily fetch various sets of 'files'.
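
As a rough illustration of both naming approaches, here is a minimal sketch using only the standard library. The function names, the SHA-1 choice, the ".txt" extension, and the two-column table are assumptions made for the example, not code from the project linked above.

```python
import hashlib
import sqlite3
import time


def hashed_filename(name, now=None):
    """Approach 1: salt the user-supplied name with the current time and hash it."""
    if now is None:
        now = time.time()
    salted = f"{name}:{now!r}".encode("utf-8")
    # Hypothetical naming scheme: 40 hex characters plus a ".txt" extension.
    return hashlib.sha1(salted).hexdigest() + ".txt"


def save_paste(conn, code):
    """Approach 2: let the database hand out a unique numeric ID."""
    cur = conn.execute("INSERT INTO pastes (code) VALUES (?)", (code,))
    conn.commit()
    return cur.lastrowid


# In-memory database for demonstration; a real site would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pastes (id INTEGER PRIMARY KEY, code TEXT)")
first_id = save_paste(conn, "print('hello')")
second_id = save_paste(conn, "print('world')")
```

Note that the hashed name is only as unique as the name and timestamp pair, so two identical names submitted within the same clock tick would collide, whereas the database ID cannot.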

thedz 2009-07-26 09:44:57

that's a pretty nice script ya got there, i envisioned something like that at one point.. hehe props for having it on github

meder 2009-07-26 12:36:24

Answer 2

+1 A:

Have you considered trying LodgeIt? It's a free pastebin that you can host yourself. I do not know how hard it is to set up.

Looking at their code, they have gone with a database for storage (SQLite will do). They have structured their paste table like this (in SQLAlchemy table declaration style). The code is just a text field.

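# Imports and the metadata object are added here so the snippet runs on its own.
from sqlalchemy import (Boolean, Column, DateTime, ForeignKey, Integer,
                        MetaData, String, Table, Text)

metadata = MetaData()

pastes = Table('pastes', metadata,
    Column('paste_id', Integer, primary_key=True),
    Column('code', Text),  # the pasted source, stored as plain text
    # Self-referencing key: points at the paste this one was derived from.
    Column('parent_id', Integer, ForeignKey('pastes.paste_id'),
           nullable=True),
    Column('pub_date', DateTime),
    Column('language', String(30)),
    Column('user_hash', String(40), nullable=True),
    Column('handled', Boolean, nullable=False),
    Column('private_id', String(40), unique=True, nullable=True)
)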

They have also made a hierarchy (see the self-referencing parent_id foreign key), which is used for versioning.
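
To make the versioning idea concrete, here is a small sketch of how a parent_id chain could be walked back to the original paste. It uses the standard sqlite3 module and a cut-down version of the table; the function name and sample rows are invented for illustration, and this is not necessarily how LodgeIt itself queries its history.

```python
import sqlite3


def paste_history(conn, paste_id):
    """Return paste IDs from the given paste back to the original, newest first."""
    history = []
    current = paste_id
    while current is not None:
        row = conn.execute(
            "SELECT parent_id FROM pastes WHERE paste_id = ?", (current,)
        ).fetchone()
        if row is None:  # unknown ID: nothing more to follow
            break
        history.append(current)
        current = row[0]  # None once the original paste is reached
    return history


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pastes ("
    " paste_id INTEGER PRIMARY KEY,"
    " code TEXT,"
    " parent_id INTEGER REFERENCES pastes(paste_id))"
)
# Hypothetical sample data: paste 3 revises paste 2, which revises paste 1.
conn.executemany(
    "INSERT INTO pastes (paste_id, code, parent_id) VALUES (?, ?, ?)",
    [(1, "v1", None), (2, "v2", 1), (3, "v3", 2)],
)
```

Calling paste_history(conn, 3) walks 3, then 2, then 1. The sketch assumes the chain contains no cycles.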

David Raznick 2009-07-26 09:47:11

Answer 3

A:

Plain files are definitely more effective. Save your database for more complex queries.

If you need some formatting to be done on files, such as highlighting the code properly, it is better to do it before you save the file with that code. That way you don't need to apply formatting every time the file is shown.

You definitely need to ensure that all file names are unique, but this task is trivial: check whether the file already exists on disk and, if it does, add a number to its name and check again, and so on.
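
A minimal sketch of that check-and-retry loop follows. It swaps the separate "does it exist?" test for an exclusive create (os.O_EXCL), so two simultaneous requests cannot both claim the same name. The function name, the "-N" suffix format, and the ".html" default are assumptions for the example.

```python
import os
import tempfile


def create_unique_file(directory, stem, ext=".html"):
    """Create a new empty file and return its path, adding a numeric suffix on clashes."""
    counter = 0
    while True:
        suffix = "" if counter == 0 else f"-{counter}"
        path = os.path.join(directory, f"{stem}{suffix}{ext}")
        try:
            # O_EXCL makes creation fail if the path already exists, so the
            # existence check and the creation happen as one step.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            counter += 1
            continue
        os.close(fd)
        return path


demo_dir = tempfile.mkdtemp()
first = create_unique_file(demo_dir, "snippet")
second = create_unique_file(demo_dir, "snippet")
```

Here the first call creates snippet.html and the second, finding that name taken, creates snippet-1.html.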

Don't store them all in one directory either, since a filesystem can perform much worse when there are A LOT (~1 million) of files in a single directory. You can structure your storage like this:

FILE_DIR/YEAR/MONTH/FileID.html, and store the "YEAR/MONTH/FileID" part in the database as a unique ID for the file.
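
One way this layout might look in code is sketched below. The helper names and the zero-padded month are assumptions; the sketch also writes the already-rendered HTML once at save time and lists one month's pastes by reading a single directory.

```python
import datetime
import os
import tempfile


def paste_key(file_id, when):
    """Build the "YEAR/MONTH/FileID" key that would be stored in the database."""
    return f"{when.year:04d}/{when.month:02d}/{file_id}"


def paste_path(file_dir, key):
    """Map a stored key to the rendered HTML file on disk."""
    return os.path.join(file_dir, *key.split("/")) + ".html"


def save_rendered(file_dir, file_id, html, when):
    """Write pre-rendered HTML under FILE_DIR/YEAR/MONTH and return its key."""
    key = paste_key(file_id, when)
    path = paste_path(file_dir, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    return key


def keys_for_month(file_dir, year, month):
    """List the keys of every paste saved in one month."""
    month_dir = os.path.join(file_dir, f"{year:04d}", f"{month:02d}")
    if not os.path.isdir(month_dir):
        return []
    return sorted(
        f"{year:04d}/{month:02d}/{os.path.splitext(name)[0]}"
        for name in os.listdir(month_dir)
        if name.endswith(".html")
    )


# Temporary directory standing in for FILE_DIR.
file_dir = tempfile.mkdtemp()
key = save_rendered(file_dir, "42", "<pre>print(1)</pre>", datetime.date(2009, 7, 26))
```

With this layout, fetching a month's pastes is a single directory listing, with no database query needed.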

Of course, if you aren't worried about performance (not many users, for example), you can just store everything in the database, which is much easier to manage.

maksymko 2009-07-26 09:51:26

So how would you pull, say, the files from the current month if you created one for each day?

meder 2009-07-26 09:57:36

If performance is the ultimate concern, storing non-binary data (text, in this case) on the filesystem is almost never the way to go. A proper database allows connection pooling, load balancing, automatic mirroring, master/slave relations and a whole lot more. Not to mention the ability to run complex queries across the dataset more easily and more efficiently.

thedz 2009-07-26 10:51:04

Yeah, unless your data set is really small, flat files don't scale worth beans. The solution to your "million file per directory" problem is to use a database.

Paul McMillan 2009-07-26 11:18:13

@thedz Loading a file from the filesystem is LIGHTNING fast compared to a database query, since all your most-used files will end up in memory eventually. This depends on the usage scenario, however, and yes, it CAN be done in a way that makes files slower.

maksymko 2009-07-26 15:06:13

All your most-used database objects end up cached somewhere in any decent scalable system, so that's not a very compelling reason to use files. There's a reason why you don't see very many flat-file backends when looking at sites that have ridiculously large visitor counts: it's much easier and more practical to scale DB-backed solutions. The tools for scaling that out are already written and proven in production.

thedz 2009-07-26 18:53:04
