views:

70

answers:

1

Hello, I like to organize a lot of information from literature reviews in "tables" (information not unlike product comparisons, but for scientific research), but often the information I enter can contain lines or paragraphs of text and becomes unwieldy in a spreadsheet. I've heard SQL relational tables are often used for this purpose; for data analysis I use Python or R to parse data from a flat text file and enter this into SQLite. Should I just create a "marked up" text file and do the same thing? I wonder what interfaces people use to enter and also view such text-heavy tables? Or I wonder if there is another software that might be suited for this purpose. Thanks!

+3  A: 

The way you store and retrieve data would depend on what you plan to do with it.

Text files have manageability problems. You can't realistically manage a directory tree with thousands and thousands of files, and searching through them would be a nightmare. If you're updating concurrently, you'll have to deal with locking and a slew of other problems. They're not really meant for storing large amounts of data that you plan to mine.

Relational databases are fine, but you'll have to parse the information into meaningful bits, break it down into relations and put the resultant data into tables for it to make any sense. Dumping all the text (after some preprocessing) into a single column would not be very useful. The upshot of what I'm saying is that SQL databases store 'structured' data which can be queried using that structure.
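To make that concrete, here's a minimal sketch of the structured approach in Python with SQLite (which you mention already using). Each literature entry becomes a row, and the long text chunks simply live in ordinary TEXT columns; the table and column names are illustrative, not prescribed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""
    CREATE TABLE review (
        id      INTEGER PRIMARY KEY,
        paper   TEXT,   -- citation or identifier
        topic   TEXT,   -- the category you prescribed
        excerpt TEXT    -- the (possibly long) harvested text
    )
""")
conn.execute(
    "INSERT INTO review (paper, topic, excerpt) VALUES (?, ?, ?)",
    ("Smith 2009", "methods", "A long paragraph copied from the paper..."),
)
conn.commit()

# Queries exploit the structure (the columns), not the free text:
for paper, excerpt in conn.execute(
    "SELECT paper, excerpt FROM review WHERE topic = ?", ("methods",)
):
    print(paper, "->", excerpt[:40])
```

The point is that the query filters on the structured `topic` column; SQLite doesn't care how long the `excerpt` text is.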

Another thing you might consider is a document database. There are quite a few out there and, while I don't have personal experience with them, I have listened to a presentation on CouchDB, which stores information as JSON documents. You mine the data using scripts that filter and sort according to some conditions and then get back the matching documents. If you're dealing with a lot of textual data, this would definitely be worth a shot at least. Word on the street is that these engines are much more scalable than their relational counterparts.
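You can get a feel for the document-database idea without installing anything, using only Python's `json` module: each entry is a JSON document with whatever fields it happens to have (documents need not share a schema), and a "query" is just a script that filters them. The field names below are illustrative; a real engine such as CouchDB does the filtering server-side with views:

```python
import json

docs = [
    {"paper": "Smith 2009", "topic": "methods", "notes": "long text..."},
    {"paper": "Jones 2008", "topic": "results", "notes": "other text...",
     "extra": "documents need not share a schema"},
]

# Persist as one JSON document per line (easy to append to and grep):
with open("review.jsonl", "w") as fh:
    for doc in docs:
        fh.write(json.dumps(doc) + "\n")

# "Querying": load the documents back and filter with plain Python.
with open("review.jsonl") as fh:
    loaded = [json.loads(line) for line in fh]
methods = [d for d in loaded if d["topic"] == "methods"]
```

The appeal for text-heavy notes is exactly that schemaless flexibility: an entry with an extra field doesn't force you to alter a table.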

Noufal Ibrahim
+1 Nice answer. A lot of people assume SQL is just a dumping ground (no offence to OP), and I'm glad you've put the 'structured' comment in.
Randolph Potter
Thank you Randolph. :)
Noufal Ibrahim
Thanks! My aim is to actually store structured data - "data" being chunks of text I have harvested from the literature and the structure being one that I have prescribed according to the type of information I am extracting; the only thing is that the entries in each field can be large sometimes...
Stephen
Sounds like large amounts of textual data. How do you plan to process them? Pattern searches? Language analysis?
Noufal Ibrahim
Unfortunately, not pattern searches (though I am proficient in pattern searches), as the information I extract is content-specific. Investing in language analysis algorithms is also not worthwhile because I make many such tables, once every few weeks, each requiring different content each time. There is no flexible algorithm I have found that matches the proficiency of the human mind at the moment. So... copy-paste, or manually entering my assessment.
Stephen
Ah okay. So you basically want to use it as a data dump. If you're simply going to keep large amounts of textual data like this along with an 'analysis', and don't plan to later mine it (e.g. `All content with analysis == 'good'`), you can go ahead with plain old text files with some annotations in them.
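The "text file with annotations" route can be sketched like this: a marker line starts each record and carries small fields such as `analysis`, so you can still filter later without a database. The `@@` marker syntax and field names here are invented purely for illustration:

```python
# Example annotated file content; '@@' lines carry the metadata fields,
# and everything until the next '@@' line is the record's body text.
RAW = """\
@@ paper: Smith 2009 | analysis: good
A long paragraph of harvested text...
@@ paper: Jones 2008 | analysis: weak
Another chunk of text...
"""

def parse(text):
    records, current = [], None
    for line in text.splitlines():
        if line.startswith("@@"):
            # Split "key: value" pairs separated by '|'
            fields = dict(
                part.strip().split(": ", 1)
                for part in line[2:].split("|")
            )
            current = {**fields, "body": []}
            records.append(current)
        elif current is not None:
            current["body"].append(line)
    for r in records:
        r["body"] = "\n".join(r["body"])
    return records

good = [r for r in parse(RAW) if r["analysis"] == "good"]
```

Even a throwaway format like this keeps the metadata machine-readable, which is what makes the later `analysis == 'good'` filter possible.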
Noufal Ibrahim
Thanks - it appears this is probably the way to go for now...
Stephen
Keep your code a little flexible though. You never know when you might need to change your storage mechanism. Good luck! :)
Noufal Ibrahim