views:

42

answers:

3

What's the best storage mechanism (from the view of the database to be used and system for storing all the records) for a system built to track whois record changes? The program will be run once a day and a track should be kept of what the previous value was and what the new value is.

Suggestions on database and thoughts on how to store the different records/fields so that data is not redundant/duplicated


(Added) My thoughts on one mechanism to store data

Example case showing sale of one domain "sample.com" from personA to personB on 1/1/2010

Table_DomainNames
DomainId | DomainName
1           example.com
2           sample.com

Table_ChangeTrack
DomainId | DateTime | RegistrarId | RegistrantId | (others)
2           1/1/2009           1              1
2           1/1/2010           2              2

Table_Registrars
RegistrarId | RegistrarName
1             GoDaddy
2             1&1

Table_Registrants
RegistrantId | RegistrantName
1              PersonA
2              PersonB

All tables are "append-only". Does this model make sense? Table_ChangeTrack should be "added to" only when there is any change in ANY of the monitored fields.

Is there any way of making this more efficient / tighter from the size point-of-view??

A: 

You could

  • store the checksum of a normalized form of the whois record data fields for comparison.
  • store the original and current version of the data (possibly in compressed form), if required.
  • store diffs of each detected change (possibly in compressed form), if required.

It is much like how incremental backup systems work. Maybe you can get further inspiration from there.

relet
+2  A: 

The primary data is the existence or changes to the whois records. This suggests that your primary table be:

<id, domain, effective_date, detail_id>

where the detail_id points to actual whois data, likely normalized itself:

<detail_id, registrar_id, admin_id, tech_id, ...>

But do note that most registrars consider the information their property (whether it is or not) and have warnings like:

TERMS OF USE: You are not authorized to access or query our Whois database through the use of electronic processes that are high-volume and automated except as reasonably necessary to register domain names or modify existing registrations...

From which you can expect that they'll cut you off if you read their databases too much.

msw
I got permissions for limited automated use of their system... Instead of storing it by "detail_id", wouldnt it be more efficient to store it by one of 'X' values for registrar_id, admin_id, etc... where those numbers are from another table itself. eg: table_registrar (registrar_id, registrar_name)?
DrMHC
I'm not sure I understand your comment. The domain name is the primary key and all information is subordinate (dependent upon) that key. Since it is possible(?) for a domain name to change registrars, the structure as I outlined makes third-normal sense where there will be many references to registrar_id, admin_id will equal tech_id, etc.
msw
Umm... adding info in the main question to explain better
DrMHC
The example is more-or-less what I trying to describe; it looks generally fine.
msw
Any thoughts on making it more efficient, storage wise? Especially how to record the fact that perhaps only ONE or a FEW of the fields changed (for eg the expiration date)...
DrMHC
Make it work and then see if you have storage problems. Disk is so cheap now that you really shouldn't micro-optimize in advance when the cost is greater code complexity.
msw
A: 

you can write vbscript in an excel file to go out and query a webpage (in this case, the particular 'whois' url for a specific site) and then store the results back to a worksheet in excel.

CheeseConQueso