views:

264

answers:

6

I have a list of 1 million digits. Every time the user submit an input, I would need to do a matching of the input with the list.

As such, the list would have the Write Once Read Many (WORM) characteristics?

What would be the best way to implement storage for this data?

I am thinking of several options:

  1. A SQL Database but is it suitable for WORM (UPDATE: using VARCHAR field type instead of INT)
  2. One file with the list
  3. A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
  4. A bucket system like /12345/67890/

What do you think?

UPDATE: The application would be a web application.

+1  A: 

If you're concerned about speed and don't want to care about file system storage, probably SQL is your best shot. You can optimize your table indexes but also will add another external dependency on your project.

EDIT: Seems MySQL have an ARCHIVE Storage Engine:

MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.

Rubens Farias
Curious what type of index would work for a list of digits? You would still require a table scan, right?
Alan
actually due to the nature, the field type would be varchar instead of int
Niko Gunadi
+2  A: 

To answer this question you'll need to think about two things:

Are you trying to minimize storage space, or are you trying to minimize process time.

Storing the data in memory will give you the fastest processing time, especially if you could optimize the datastructure for your most common operations (in this case a lookup) at the cost of memory space. For persistence, you could store the data to a flat file, and read the data during startup.

SQL Databases are great for storing and reading relational data. For instance storing Names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data. Constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).

Using a directory structure might work, but then your application is no longer portable.

If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).

Alan
Maybe MySQL with Memory Storage would be a good option.An init script for the application would populate this table.
Niko Gunadi
Perhaps, but to be effective, you'd need an effective index. I know for MS SQL even with an index, if the index returns more than some percentage of rows, MS SQL will default to a table scan.
Alan
Actually on a side-note, memcached solution should also be considered and might solve this elegantly.
Niko Gunadi
A: 

If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list on it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy an USB stick which has a "write protect" switch but they are hard to come by and the security isn't as good as with a CD/DVD.

Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With todays computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.

Aaron Digulla
Taking up a server's DVD drive for one application is going to be frowned upon by your sysadmin :) Better to simply set the file permissions properly.
Paolo
It's been a few years since I've seen an admin actually use the DVD drive of a server. Today, they pop the CD into some PC and map the drive.
Aaron Digulla
+1  A: 

Two options I would consider:

  1. Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
  2. SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires DBMS (extra app dependency, design expertise), fast, but not as fast as holding the whole list in memeory
Steve De Caux
+2  A: 

Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.

Emil Vikström
+1 interesting link, ty
Rubens Farias
A: 

Given that 1M numbers is not a huge amount of numbers for todays computers, why not just do pretty much the simplest thing that could work. Just store the numbers in a text file and read them into a hash set on application startup. On my computer reading in 1M numbers from a text file takes under a second and after that I can do about 13M lookups per second.

Ants Aasma
Exactly. So, what I did is to load the numbers to memcached during startup.
Niko Gunadi