I need to analyze tens of thousands of lines of data. The data is imported from a text file. Each line has eight variables. Currently I use a class to define the data structure, and as I read through the text file I store each line object in a generic List&lt;T&gt;.

I am wondering if I should switch to a relational database (SQL), since I will need to analyze the data in each line of text and relate it to definition terms, which I also currently store in generic lists (List&lt;T&gt;).

The goal is to translate a large amount of data using definitions. I want the defined data to be filterable, searchable, etc. Using a database makes more sense the more I think about it, but I would like to confirm with more experienced developers before I make the changes yet again (I was using structs and ArrayLists at first).

The only drawback I can think of is that the data does not need to be retained after it has been translated and viewed by the user. There is no need for permanent storage, so a database might be a little overkill.

+1  A: 

It sounds like what you want is a database. SQLite supports in-memory databases (use ":memory:" as the filename). I suspect other engines have an in-memory mode as well.
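
A rough sketch of what that might look like from C#, assuming the System.Data.SQLite ADO.NET provider (the table and column names below are just placeholders, not your actual schema):

  // Sketch: an in-memory SQLite database that vanishes when the connection closes.
  // Assumes a reference to System.Data.SQLite; table/column names are illustrative.
  using System.Data.SQLite;

  class InMemoryDbSketch
  {
      static void Main()
      {
          using (var conn = new SQLiteConnection("Data Source=:memory:"))
          {
              conn.Open();
              using (var cmd = conn.CreateCommand())
              {
                  // One column per field parsed from a line of the text file.
                  cmd.CommandText = "CREATE TABLE Lines (Field1 TEXT, Field2 TEXT)";
                  cmd.ExecuteNonQuery();

                  cmd.CommandText = "INSERT INTO Lines VALUES (@f1, @f2)";
                  cmd.Parameters.AddWithValue("@f1", "phy");
                  cmd.Parameters.AddWithValue("@f2", "example");
                  cmd.ExecuteNonQuery();
              }
          } // nothing is persisted to disk
      }
  }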

jeffamaphone
This sounds like quite an attractive solution. I am not very familiar with in-memory databases so I will have to do my research, but SQLite sounds like a lightweight system (judging by the name).
Snooze
SQLite, despite the hype, doesn't handle data all that well in the millions of rows. Tens of thousands, I wouldn't know; it depends. Make sure to index it right, and try a large page size. If you suspect it could grow to more data, don't commit to SQLite! I know from experience.
MPelletier
Yeah, it's lite and you get what you pay for. I've used it for small things, and some things in the hundreds of thousands. Also, it's not really thread-safe.
jeffamaphone
A: 

If you do not mind using Access, here is what you can do:

  1. Attach a blank Access db as a resource.
  2. When needed, write the db out to a file.
  3. Run a CREATE TABLE statement that handles the columns of your data.
  4. Import the data into the new table.
  5. Use SQL to run your calculations.
  6. On close, delete that Access db.

You can use a program like Resourcer to load the db into a .resx file.

  ResourceManager res = new ResourceManager( "MyProject.blank_db", this.GetType().Assembly );
  byte[] b = (byte[])res.GetObject( "access.blank" );

The code above pulls the resource out of the project. Take the byte array and save it to a temp location with a temp filename.

"MyProject.blank_db" is the location and name of the resource file "access.blank" is the tab given to the resource to save

JDMX
BTW, the same thing works with SQL Server Compact Edition, which comes with Visual Studio 2008.
John Saunders
I think I would rather use an in-memory SQL solution, but I will have to do my research.
Snooze
+2  A: 

It is not absolutely necessary to go to a database. It depends on the actual size of the data and the processing you need to do. If you are loading the data into a List with a custom class, why not use LINQ to do your querying and filtering? Something like:

var query = from foo in fooList
            where foo.Prop == criteriaVar
            select foo;

The real question is whether the data is so large that it cannot be loaded into memory comfortably. If that is the case, then yes, a database would be much simpler.

Thomas
The files I am importing tend to contain tens of thousands of lines; some span over 100,000 lines. Each line has eight fields that need to be analyzed and translated. For example, one field might contain the value 'phy' and need to be translated to 'Physical Layer' based on a definition file. I am trying to figure out the most efficient way to analyze and translate this data.
Snooze
Are you going to do aggregate analysis on the data, or is it only processed line by line? If the latter, then reading from the text file and processing as you go could be pretty fast compared to trying to get the data into SQL Express or Access. Still, putting it in a database will give you a degree of flexibility, such as the ability to create indexes that you would otherwise have to code yourself.
Thomas
I think I have no other choice than to do aggregate analysis, because some of the definitions depend on prior lines in the text files. Furthermore, with aggregate analysis I can restructure the data to make it more readable/easier to comprehend. Even if I were to process on the fly, I would still need a data structure so that the user can filter/search the data.
Snooze
I suppose it comes down to whether you want to analyze the data from the perspective of sets or iteratively. If you want to analyze it in sets, then a database is the way to go. If you want to analyze it iteratively, then you might still consider doing it in code. However, be wary about the DBMS you use. If you use Access, you won't get the same degree of flexibility in your SQL queries that you would with SQL Server. If you have access to a SQL Server on the destination's local network, the simplest solution is probably to use a db on that and drop your table when you are done.
Thomas
And in case I decide to forgo the database, is storing class-defined objects in a List&lt;T&gt; the most efficient way of doing it iteratively?
Snooze
If you do it iteratively, then you will have to work out the optimal in-memory indexing scheme. While in the end that might involve List&lt;T&gt; or Dictionary&lt;TKey,TValue&gt;, it means you will have to develop your own indexing schemes. Given that you know the data, that might be faster than a DBMS indexing scheme, but only perf tests will tell you for sure.
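As a very rough sketch, the kind of in-memory index I mean for the 'phy' -> 'Physical Layer' lookups could be a Dictionary keyed on the raw code (the "code<TAB>meaning" file format and all names below are made-up assumptions):

  // Sketch: build a Dictionary index from the definition file, then translate fields.
  // The definition-file format and all names here are illustrative assumptions.
  using System;
  using System.Collections.Generic;
  using System.IO;

  class DefinitionIndexSketch
  {
      static void Main()
      {
          var definitions = new Dictionary<string, string>();
          foreach (string line in File.ReadAllLines("definitions.txt"))
          {
              string[] parts = line.Split('\t');
              definitions[parts[0]] = parts[1];   // e.g. "phy" -> "Physical Layer"
          }

          string[] rawFields = { "phy", "mac" };
          foreach (string field in rawFields)
          {
              string meaning;
              // Fall back to the raw value if the code has no definition.
              Console.WriteLine(definitions.TryGetValue(field, out meaning) ? meaning : field);
          }
      }
  }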
Thomas
Storing objects in a List<T> is probably efficient enough. Computers today are shockingly fast. I'd expect it would be far more performant than any database solution. Premature optimization...
Joe H
+1  A: 

I faced the same problem at my previous company. I was looking for a good, concrete solution for handling a lot of barcode-generated files. The barcode reader produces a text file with thousands of records in a single file, and manipulating and presenting the data was difficult for me at first. What I ended up programming, based on the records, was a class that reads the file, loads the data into a DataTable, and saves it to a database. The database I used was SQL Server 2005. After that I could manage the saved data easily and present it any way I liked. The main point is to read the data from the file and save it to a database; once you do that, you have a lot of options for manipulating and presenting it the way you like.

Wonde
+3  A: 

This is not a large amount of data. I don't see any reason to involve a database in your analysis.

There IS a query language built into C# -- LINQ. The original poster currently uses a list of objects, so there is really nothing left to do. It seems to me that a database in this situation would add far more heat than light.
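
For example, the filter/search requirement from the question can be handled directly on the in-memory list with LINQ. A rough sketch (the Line class and its property names are made up for illustration, not the poster's real types):

  // Sketch: filter and search a List<T> of parsed lines with LINQ.
  // The Line class and its properties are illustrative assumptions.
  using System;
  using System.Collections.Generic;
  using System.Linq;

  class Line
  {
      public string Layer { get; set; }
      public string Message { get; set; }
  }

  class LinqFilterSketch
  {
      static void Main()
      {
          var lines = new List<Line>
          {
              new Line { Layer = "Physical Layer", Message = "link up" },
              new Line { Layer = "MAC Layer",      Message = "frame dropped" }
          };

          // Filter on one field, search within another.
          var hits = lines.Where(l => l.Layer == "Physical Layer"
                                      && l.Message.Contains("link"));

          foreach (var l in hits)
              Console.WriteLine(l.Layer + ": " + l.Message);
      }
  }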

Joe H
For the query language, so you don't have to hard-code such things (or invent one yourself)?
jeffamaphone
@jeffamaphone - There IS a query language built into C# -- LINQ. The original poster currently uses a list of objects, so there is really nothing left to do. It seems to me that a database in this situation would add far more heat than light.
Joe H
Good point. You should put that in your answer.
jeffamaphone
A: 

If the only thing you need to do is search and replace, you might consider using sed and awk, and you can do searches using grep. That is, of course, on a Unix platform.

Hei
On Windows, MSYS and Cygwin can give you sed and awk.
ecounysis
A: 

From your description, I think Linux command-line tools can handle your data very well. Using a database may unnecessarily complicate your work. If you are using Windows, these tools are also available in various ways; I would recommend Cygwin. The following tools may cover your task: sort, grep, cut, awk, sed, join, paste.

These Unix/Linux command-line tools may look scary to a Windows person, but there are reasons people love them. These are mine:

  1. They allow your skills to accumulate - knowledge of even part of a tool will be helpful in different future tasks.
  2. They allow your efforts to accumulate - the command lines (or scripts) you used to finish the task can be repeated as many times as needed with different data, without human interaction.
  3. They usually outperform any equivalent tool you could write yourself. If you don't believe that, try beating sort with your own version on terabyte-sized files.
Codism