I have some huge binary driver logs (around 2-5GB each, and probably about 10x larger after converting them to a readable form) and I need to write a tool that lets me sequentially browse, sort, search and filter them efficiently (in order to find and resolve bugs).
Each log entry has a few attributes: a timestamp, a type, a message, and some GUIDs. The entries are homogeneous, there are no relations between them, and there is no need to keep the data after "inspecting" it.
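For context, this is roughly the kind of schema I have in mind if I go the SQLite route (sketched with Python's sqlite3 module; the column names are just illustrative):

    import sqlite3

    # Hypothetical table matching the entry attributes described above.
    conn = sqlite3.connect("driver_log.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS entries (
            id      INTEGER PRIMARY KEY,
            ts      INTEGER,   -- timestamp
            type    TEXT,
            guid    TEXT,
            message TEXT
        )
    """)
    conn.commit()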
I don't really know how to handle this much data. Keeping everything in memory would be foolish, and the same goes for keeping it in a flat file. I thought of using a small DBMS like SQLite, but I'm not sure it will be fast enough, and I don't need most DBMS features, only sorting and searching. I would eagerly trade space for speed in this case, if that's possible. By "trading space for speed" I mean, for example, building indexes on the columns I sort and filter by, as in the sketch below.
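A rough sketch of what I have in mind (continuing from the hypothetical schema above, again using Python's sqlite3 just for illustration):

    import sqlite3

    conn = sqlite3.connect("driver_log.db")

    # Indexes on the sort/filter columns: extra disk space, faster queries.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_ts   ON entries(ts)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_type ON entries(type)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_guid ON entries(guid)")

    # Typical lookup: all entries of one type for a given GUID, newest first.
    rows = conn.execute(
        "SELECT ts, message FROM entries "
        "WHERE type = ? AND guid = ? ORDER BY ts DESC",
        ("ERROR", "some-guid"),
    ).fetchall()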
Is there a library (or maybe a data structure) that would help me handle this amount of data?
"Served" RDBMSs like Postgre, MSSQL, MySQL are out of the question, the tool should be easy to use anywhere without any hassle.
EDIT: Oh, and does anyone know if SQLite's ":memory:" mode has any restrictions on the size of the DB, or will it just fill virtual memory until it runs out completely?
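To be clear, by ":memory:" mode I mean opening the database like this (Python's sqlite3 shown, but the same special filename applies to the C API):

    import sqlite3

    # Purely in-memory database; nothing is written to disk.
    conn = sqlite3.connect(":memory:")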