Problem description: In my application, I have to present the contents of data packets with a certain format. An example: any packed binary data, say a 4-byte header, a 4-byte type (type codes having pre-defined meanings), then source address, destination address, and so on.

Previously, I made home-cooked implementations that stored the data in a binary file (a fixed record length allowed fast lookup), but over time I realized I was inventing some kind of database. For example, I was implementing my own efficient binary storage format for very large data files, and my own indexing to rapidly run searches on some fields. I think a real DB (even the simple SQLite) can make this stuff transparently simple.

Question #1: are DBs useful for storing such data, and how should it be done? Note that there are no 1-to-many or many-to-many mappings here, or other advanced things; it's just a plain sequence of packets with a certain internal structure that I want to display to the user and let him interact with (e.g. search by a certain field).

Question #2: Now suppose the user himself can specify the format of his packets, e.g. in a configuration file: the length of each field, its type, what its values mean (in the case of an enumeration) and so on. How do I extend a DB-backed implementation for this? Should the user define DB schemas? Should the configuration file be auto-translated into these schemas? An ORM?

Question #3: Even more advanced... Now suppose the data packets can vary in length and contents. I.e., type #2 packets have certain fields, type #3 packets have some other fields, and so on. But I'd still like my app to handle it, displaying everything nicely and also allowing users to specify the formats in config files. How is it done?

Thanks in advance.

A: 

Though I'm not a huge fan of this implementation, we have some software that essentially does this for some calling lists. Here's what they do (a rough sketch of these tables follows the list):

  1. A table with column definitions - call it tblColumnDefs. This table contains columns like "Name", "Type", "Length" and "Description"
  2. An instance master table (tblPacketNames). Essentially, just "PacketTypeID", "PacketName", and "Description" for each packet type you're defining
  3. An instance definition table (for you, this would be tblPacketColumns). This table collects the pre-defined columns together to form the data structure that you're storing. For example, it might hold "PacketTypeID", "ColumnNumber", "ColumnID". In database-normalization-speak, this is a many-to-many table, since it maps the columns to the packets that use them.
  4. In a second database (because of dynamic SQL/injection implications of this step), tables are created dynamically to hold the actual data. For example, if you've defined (in steps 2/3) a packet type called "PING", you might have a table called "PING" in your database to hold that data. You'd use tblPacketColumns, linked to tblColumnDefs, to figure out what field types to create and how big they should be. You end up with a collection of tables that match the packet type definitions from step 3, using the columns from step 1.
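A rough sketch of what tables 1-3 might look like (column names come from the list above; the types and lengths are my guesses, not the actual software's definitions):

create table tblColumnDefs
(
    ColumnID int identity(1, 1) primary key,
    Name varchar(100) not null,
    Type varchar(50) not null,      -- e.g. 'int', 'varchar', 'datetime'
    Length int,                     -- for variable-length types
    Description varchar(500)
)

create table tblPacketNames
(
    PacketTypeID int identity(1, 1) primary key,
    PacketName varchar(100) not null,
    Description varchar(500)
)

create table tblPacketColumns
(
    PacketTypeID int not null references tblPacketNames (PacketTypeID),
    ColumnNumber int not null,      -- position of the column within the packet
    ColumnID int not null references tblColumnDefs (ColumnID),
    primary key (PacketTypeID, ColumnNumber)
)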

NOTE: I don't particularly like the SQL-injection implications of step 4. Creating tables dynamically can lead to nasty consequences if security isn't designed properly and input from user-entered fields in your application isn't cleansed properly, especially if the application has an interface that's available to untrusted callers (i.e., the Internet).

Using this, you can create indexes however you want when the tables are created (maybe you have a column in step 1 where you flag certain columns as "Indexable", and indexes are created on top of those; see the example below).
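For example, if a hypothetical SourceAddress column were flagged as indexable, the table-creation step for the "PING" packet type from step 4 might also emit something like:

create index IX_PING_SourceAddress on PING (SourceAddress)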

rwmnau
+1  A: 

A simple rule is this: If you are going to query the data, then it should be a discrete field within a table within the DB. If not, you can store the BLOB and be done with it.

That said, if you want to derive "meta data" from a BLOB, and index THAT, then you can do that readily as well.

If your data types are congruent with what the database can support (or can be accurately converted), there can be some value in exploding the BLOB into its component parts so that they map nicely into DB columns.
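As a minimal sketch of that idea (table and column names are illustrative), you might keep the original BLOB intact alongside a few exploded fields that are worth indexing:

create table packet
(
    packet_id int identity(1, 1) primary key,
    raw_data varbinary(1500) not null,   -- the packet exactly as received
    packet_type int not null,            -- derived from the BLOB on import
    source_addr varchar(50)              -- likewise derived on import
)

create index ix_packet_type on packet (packet_type)
create index ix_packet_source on packet (source_addr)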

The problem with defining "tables on the fly" (which could be done easily) is not so much the definition of the table, but the potential change of the table. Tables that are being changed (e.g. a column added or dropped) tend to be unusable for the duration of the change. Not an issue for 100 rows. A real problem for millions of rows.

If the data definitions are fairly static, then create a mapping facility that lets your users describe the BLOB; you can then use that definition both to create a compliant table and to convert the BLOBs appropriately during import.

As for the "different rows of different types", you can still stuff that data into a single table (sketched below). Some rows have "unused" columns compared to others, and each row is identified by its type. If you have lots of row definitions, and lots of variance, you waste a lot of space doing this. Then you may want to go to having a table for each row type, and a master table that holds the row types and references to the real rows in the actual tables. You would only need this master table if you care about the relationships of the original data packets to each other (then you can store them in receipt order, say).
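A sketch of the single-table variant (the field columns are invented for illustration): type-specific columns simply stay NULL in rows of other types.

create table packet_combined
(
    packet_id int identity(1, 1) primary key,
    packet_type int not null,   -- says which of the columns below apply
    field_a int,                -- used by type #2 rows, NULL otherwise
    field_b varchar(50)         -- used by type #3 rows, NULL otherwise
)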

Really, it all boils down to how much data you have, how much you expect, how much work you want to do vs how much you already have done, etc.

Will Hartung
+1  A: 

Question #1: are DBs useful for storing such data, and how should it be done?

Certainly a database is useful for this application. You could implement your own special-purpose data store, and perhaps it would be more efficient for your specific application, because you can design it for that specialization. A relational database is more general-purpose, but you can avoid weeks or months of development time by employing a database.

I answered another question earlier today on the subject of how to handle extensible types, where each new sub-type has its own distinct set of attributes.

"product table, many kind of product, each product have many parameters."

For your application, I would choose the Concrete Table Inheritance design.
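A minimal sketch of Concrete Table Inheritance applied to the packet example (the field names are assumptions): each packet type gets its own table, repeating the common fields and adding its type-specific ones.

create table type2_packet
(
    packet_id int identity(1, 1) primary key,
    header int not null,          -- fields common to all packet types
    source_addr varchar(50),
    dest_addr varchar(50),
    field_x int                   -- specific to type #2
)

create table type3_packet
(
    packet_id int identity(1, 1) primary key,
    header int not null,
    source_addr varchar(50),
    dest_addr varchar(50),
    field_y varchar(50)           -- specific to type #3
)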

Question #2: Now suppose the user himself can specify the format of his packets, i.e. in a configuration file: the length of each field, its type, what its values mean (in case of an enumeration) and so on. How do I extend a DB-backed implementation for this?

I assume the number of packet types is relatively small, and that many packets are then inserted with pretty much the same structure. So you should use the database's ability to manage metadata: I would define an additional table for each new packet type.

I would also store the packets "exploded" so each field of the packet is stored in a separate database column. That way you can index each column individually, to support efficient searching.

You can also define constraints so that some fields are mandatory (NOT NULL) or their values constrained by lookup tables. Again, leveraging the database's capabilities to use metadata to enforce consistent structure where it's desirable.
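To make that concrete, here is a hedged sketch for a hypothetical "ping" packet type, with each field exploded into its own column, NOT NULL on mandatory fields, an enumeration constrained by a lookup table, and indexes on the searchable columns:

create table type_code
(
    type_code int primary key,
    meaning varchar(200) not null   -- what each enumeration value means
)

create table ping_packet
(
    packet_id int identity(1, 1) primary key,
    header int not null,                                 -- mandatory field
    type_code int not null references type_code (type_code),
    source_addr varchar(50) not null,
    dest_addr varchar(50) not null
)

create index ix_ping_source on ping_packet (source_addr)
create index ix_ping_dest on ping_packet (dest_addr)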

SQL already supports a standard, declarative language for specifying fields with data types, constraints, etc. Why develop a different language which you then have to translate to SQL?

Question #3: Even more advanced... Now suppose the data packages can be varying in length and contents.

Fields that are optional in a given packet type should permit NULL in the corresponding column.

Bill Karwin
can this perhaps be better solved with an ORM like SQLAlchemy, or is it irrelevant to my needs?
Eli Bendersky
also, could you please clarify what you mean by "ability to manage metadata"? what kind of metadata suits me best here?
Eli Bendersky
By metadata I mean a table, consisting of columns and constraints. An RDBMS uses metadata to enforce data types, NOT NULL, referential integrity, etc.
Bill Karwin
Yes, using an ORM can be helpful. You can write code in your table class to convert a full packet into its component columns as it stores it in the database, and then reassemble data fetched from the database into a packet. This supports encapsulation.
Bill Karwin
thank you for the help
Eli Bendersky
+1  A: 

Another option you may wish to consider is Berkeley DB or one of its clones. BDB is pretty low level; there's no SQL. It's pretty much a really small, really fast file-backed hash table. It's been around forever, and is used in a lot of places where speed and simplicity are paramount. You'd need to add some functionality on top to do what you're trying to accomplish, though.

joev
+1  A: 

Despite the fact that you stated that there are no 1-many relationships, there are :)

I would recommend creating two tables for packet storage. One to store "header" or "scalar" information, which is common to the packet and--while it may define WHAT data is present--isn't the actual data stored in the packet.

Your second table would store the data for each packet, with each field-value combination representing a row in this table. For example, the following two tables:

-- one row per packet: the "header"/scalar information
create table packet
(
    packet_id int identity(1, 1) primary key,
    destination varchar(50),
    sender varchar(50),
    packet_type_id int not null   -- which packet format this instance uses
)

-- one row per field value within a packet
create table packet_field
(
    packet_field_id int identity(1, 1) primary key,
    packet_id int not null references packet (packet_id),
    field_id int not null,        -- the field definition this value belongs to
    data varbinary(500)
)

Obviously these two tables are making assumptions about the type and size of data being stored and aren't exhaustive in what they'll need to store. However, this fundamental structure will allow for dynamically-defined packet formats and is a schema that's easily indexed (for instance, adding an index on packet_id+field_id in packet_field would be a no-brainer).
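That index might look like this, using the same syntax as the tables above:

create index ix_packet_field on packet_field (packet_id, field_id)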

All your application is then responsible for is unpacking the packet and storing it in your DB in this schema, then repacking (if necessary).

Of course, from this point you'll need tables that store the actual format of the packet. Something like...

create table packet_type
(
    packet_type_id int identity(1, 1) primary key,
    name varchar(200) not null
)

create table packet_type_field
(
    field_id int identity(1, 1) primary key,
    packet_type_id int not null references packet_type (packet_type_id),
    field_offset int not null,
    name varchar(200) not null
)

Again, obviously simplified but it shows the basic idea. You would have a single record in your packet_type table for each packet format, and one row in the packet_type_field for each field in a given packet. This should give you most of the information you would need to be able to process an arbitrary chunk of binary data into the aforementioned packet storage schema.
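For example, a query along these lines (the packet_id value is arbitrary) would pull one stored packet back out of this schema, field by field, in its original on-the-wire order:

select ptf.name, ptf.field_offset, pf.data
from packet p
    join packet_field pf on pf.packet_id = p.packet_id
    join packet_type_field ptf on ptf.field_id = pf.field_id
where p.packet_id = 1
order by ptf.field_offset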

Adam Robinson
A: 

Three methods come to mind.

sFlow and IPFlow can transmit a limited set of packet contents, which can be logged directly into several different databases.

Another, more targeted method would be to write a very simple snort rule, matching on source or destination address for example, and then have snort capture the payload of the packets. That way you would only get the actual data you require. For instance, you could grab just certain fields of data inside the packet (e.g. a password).

ngrep can also grab selective data right off the wire.

Of course each of these could require a tap or monitor session on a port if you are not doing the capture on the server/workstation itself.