I am currently planning to create a big database (2+ million rows) with a variety of data from separate sources. I would like to avoid structuring the database around auto_increment ids, to help prevent sync issues with replication, and also because each item inserted will have an alphanumeric product code that is guaranteed to be unique. It seems to me to make more sense to use that instead.

I am looking for a search engine to index this database, and Sphinx looks rather appealing because it is designed around indexing relational databases. However, the tutorials and documentation I have looked at show database designs that depend on an auto_increment field in one form or another, and there is a rather bold statement in the documentation saying that document ids must be 32/64-bit integers only or things break.

Is there a way to have a database indexed by Sphinx without auto_increment fields as the id?

+2  A: 

Sphinx only requires ids to be integer and unique. It doesn't care whether they are auto-incremented or not, so you can roll your own logic. For example, generate integer hashes for your string keys.
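
As a rough illustration of the hashing idea, here is a minimal Python sketch. It is not part of Sphinx: `doc_id_from_code` and `build_id_map` are hypothetical helper names, and truncating SHA-1 to 63 bits is just one assumed way to get an integer id. Collisions are unlikely at this scale but not impossible, so the sketch checks for them explicitly.

```python
import hashlib


def doc_id_from_code(code: str) -> int:
    """Derive a positive 63-bit integer id from a product code (hypothetical helper)."""
    digest = hashlib.sha1(code.encode("utf-8")).digest()
    # Keep the first 8 bytes and clear the top bit so the value also fits a signed BIGINT.
    doc_id = int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF
    if doc_id == 0:
        # Sphinx document ids must be non-zero.
        raise ValueError(f"code {code!r} hashed to 0, pick another scheme for it")
    return doc_id


def build_id_map(codes):
    """Return {doc_id: code}, raising if two different codes share an id."""
    id_to_code = {}
    for code in codes:
        doc_id = doc_id_from_code(code)
        if id_to_code.get(doc_id, code) != code:
            raise ValueError(f"collision: {code!r} and {id_to_code[doc_id]!r}")
        id_to_code[doc_id] = code
    return id_to_code
```

Running `build_id_map` over all product codes before indexing would reveal any collision up front, and the resulting map lets the application translate matched ids back to codes.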

stereofrog
I'm a bit worried about having colliding ids with that approach - or maybe I read you wrong?
squeeks
Yes, the worry is totally justified, because you never know when hashes are going to collide. However, with "only" 2 million rows and 64-bit ids you have enough space to play around, e.g. think about using hash+timestamp or hash+user_id. It really depends on your application.
stereofrog
Would it work to use unixtime + microtime at the time of insert? I could then use that as the insertion time as well as the document id, two birds with one stone.
squeeks
Yes, as a primary key this would be perfect. However, I'd like to warn you against the "two birds" approach: it usually causes more problems than it seems to solve. But that's another story.
stereofrog
By the way, reading your other comment: if your product codes are purely alphanumeric (i.e. only a-z0-9), the simplest option would be to treat them as base-36 integers and simply convert to/from decimal while reading/writing the db.
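
A minimal Python sketch of that conversion follows. `code_to_id` and `id_to_code` are hypothetical names, and two things here are assumptions rather than part of the suggestion above: codes are treated as case-insensitive, and a leading `1` digit is prepended so that codes such as `0A` and `A` do not map to the same number. With that sentinel, codes of up to 12 characters fit in an unsigned 64-bit id; a 13-character code can exceed 2^64 in base 36 and is rejected here.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MAX_LEN = 12  # with the sentinel digit, 2 * 36**12 - 1 is still below 2**64


def code_to_id(code: str) -> int:
    """Encode an alphanumeric code as a base-36 integer (hypothetical helper)."""
    normalized = code.upper()  # assumption: codes are case-insensitive
    if not normalized or any(ch not in ALPHABET for ch in normalized):
        raise ValueError(f"not a purely alphanumeric code: {code!r}")
    if len(normalized) > MAX_LEN:
        raise ValueError(f"code too long for a 64-bit id: {code!r}")
    # The leading "1" preserves leading zeros and the code length.
    return int("1" + normalized, 36)


def id_to_code(doc_id: int) -> str:
    """Decode an id produced by code_to_id back into the (uppercased) code."""
    digits = []
    while doc_id:
        doc_id, rem = divmod(doc_id, 36)
        digits.append(ALPHABET[rem])
    text = "".join(reversed(digits))
    if not text.startswith("1") or len(text) < 2:
        raise ValueError("id was not produced by code_to_id")
    return text[1:]
```

Because the mapping is reversible, the id returned by a search can be turned straight back into the product code without a lookup table.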
stereofrog
I think that would be a good idea worth trying. Cheers.
squeeks
+1  A: 

Sphinx doesn't depend on auto-increment; it just needs unique integer document ids. You could keep a surrogate unique integer id in the tables for Sphinx to work with. Integer lookups are also generally faster than alphanumeric ones. By the way, how long is your alphanumeric product code? Any samples?

Sabeen Malik
They vary in length from 4 to 13 characters.
squeeks
+3  A: 

Sure, that's easy to work around. If you need to make up your own IDs just for Sphinx and you don't want them to collide, you can do something like this in your sphinx.conf (example code for MySQL):

source products {

  # Use a variable to store a throwaway ID value
  sql_query_pre = SELECT @id := 0 

  # Keep incrementing the throwaway ID.
  # "code" is present twice because Sphinx does not full-text index attributes
  sql_query = SELECT @id := @id + 1, code AS code_attr, code, description FROM products

  # Return the code so that your app will know which records were matched
  # this will only work in Sphinx 0.9.10 and higher!
  sql_attr_string = code_attr  
}

The only problem is that you still need a way to know what records were matched by your search. Sphinx will return the id (which is now meaningless) plus any columns that you mark as "attributes".

Sphinx 0.9.10 and above will be able to return your product code to you as part of the search results because it has string attributes support.

0.9.10 is not an official release yet, but it is looking great. It looks like Zawodny is running it over at Craigslist, so I wouldn't be too nervous about relying on this feature.

casey
A: 

I think it's possible to generate an XML stream from your data and feed that to Sphinx. You can then create the ID in software (Ruby, Java, PHP).

Take a look at http://github.com/burke/mongosphinx
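
As a rough sketch of that approach, the Python below writes a document stream in Sphinx's xmlpipe2 format and assigns the ids itself. `write_docset` is a hypothetical function name, the `code` and `description` field names are assumptions borrowed from the question, and sequential numbering is only one possible id scheme.

```python
import io
from xml.sax.saxutils import escape


def write_docset(rows, out):
    """Write (code, description) rows as an xmlpipe2 stream; return {doc_id: code}."""
    id_to_code = {}
    out.write('<?xml version="1.0" encoding="utf-8"?>\n')
    out.write("<sphinx:docset>\n")
    # Declare the full-text fields that each document will carry.
    out.write("<sphinx:schema>\n")
    out.write('<sphinx:field name="code"/>\n')
    out.write('<sphinx:field name="description"/>\n')
    out.write("</sphinx:schema>\n")
    # Ids are generated here, starting at 1 (Sphinx ids must be non-zero).
    for doc_id, (code, description) in enumerate(rows, start=1):
        id_to_code[doc_id] = code
        out.write(f'<sphinx:document id="{doc_id}">\n')
        out.write(f"<code>{escape(code)}</code>\n")
        out.write(f"<description>{escape(description)}</description>\n")
        out.write("</sphinx:document>\n")
    out.write("</sphinx:docset>\n")
    return id_to_code


if __name__ == "__main__":
    buf = io.StringIO()
    mapping = write_docset([("AB12", "Red & blue widget")], buf)
    print(buf.getvalue())
    print(mapping)
```

A script like this would be hooked up through a source with `type = xmlpipe2` and an `xmlpipe_command` that runs it and prints the stream to stdout. The returned mapping (or whatever id scheme replaces it) needs to be stored somewhere so that matched ids can be resolved back to product codes.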

chris