tags:

views:

302

answers:

8

What do wikis/stackoverflow/etc. do when it comes to storing text? Is the text broken at newlines? Is it broken into fixed-length chunks? How do you best store arbitrarily long chunks of text?

+9  A: 

nvarchar(max) ftw. because over complicating simple things is bad, mmkay?

Darren Kopp
Cool. I didn't know about nvarchar(max). I'm Still stuck on SQL Server 2000. :-(
Patrick McElhaney
+2  A: 

I guess if you need to offer the ability to store large chunks of text and you don't mind not being able to look into their content too much when querying, you can use CLobs.

fd
+1  A: 

This all depends on the RDBMS that you are using as well as the types of text that you are going to store. If the text is formatted into sizable chunks of data that mean something in and of themselves, like, say header/body, then you might want to break the data up into columns of these types. It may take multiple tables to use this method depending on the content that you are dealing with.

I don't know how other RDBMS's handle it, but I know that that it's not a good idea to have more than one open ended column in each table (text or varchar(max)). So you will want to make sure that only one column has unlimited characters.

Charles Graham
A: 

If you need to worry about keeping things like formatting strings, quotes, and other "cruft" in the text, as code would likely have, then the special characters need to be completely escaped first - otherwise on submission the db, they might end up causing an invalid command to be issued.

Most scripting languages have tools to do this built-in natively.

warren
Well yes, but any sane database interface will do all the escaping for you. Best to let it get on with it and not think about it too much :)
Dan
A: 

I suspect StackOverflow is storing text in markdown format in arbitrarily-sized 'text' column. Maybe as UTF8 (but it might be UTF16 or something. I'm guessing it's SQL Server, which I don't know much about).

As a general rule you want to store stuff in your database in the 'rawest' form possible. That is, do all your decoding, and possibly cleaning, but don't do anything else with it (for example, if it's Mardown, don't encode it to HTML, leave it in its original 'raw' format)

Dan
A: 

I guess it depends on where you want to store the text, if you need things like transactions etc.

Databases like SQL Server have a type that can store long text fields. In SQL Server 2005 this would primarily be nvarchar(max) for long unicode text strings. By using a database you can benefit from transactions and easy backup/restore assuming you are using the database for other things like StackOverflow.com does.

The alternative is to store text in files on disk. This may be fairly simple to implement and can work in environments where a database is not available or overkill.

Regards the format of the text that is stored in a database or file, it is probably very close to the input. If it's HTML then you would just push it through a function that would correctly escape it.

Something to remember is that you probably want to be using unicode or UTF-8 from creation to storage and vice-versa. This will allow you to support additional languages. Any problem with this encoding mechanism will corrupt your text. Historically people may have defaulted to ASCII based on the assumption they were saving disk space etc.

BrianLy
A: 

For SQL Server:

Use a varchar(max) to store. I think the upper limit is 2 GB.

Don't try to escape the text yourself. Pass the text through a parameterizing structure that will do the escapes properly for you. In .Net you'd add a parameter to a SqlCommand, or just use LinqToSQL (which then manages the SqlCommand for you).

David B
+1  A: 

Regarding PostgreSQL - use type TEXT or BYTEA. If you need to read random chunks you may consider large objects.

Milen A. Radev