I am using PostgreSQL.

I realize there is an array data type in PostgreSQL:

http://www.postgresql.org/docs/8.1/interactive/arrays.html

Currently, I need to use a database to store the measurement results of a semiconductor factory.

They are producing semiconductor units, and every unit can have a variable number of measurement parameters.

I plan to design the tables in the following way.

SemicondutorComponent
=====================
ID |


Measurement
=================
ID | Name | Value | SemicondutorComponent_ID

Example of data:

SemicondutorComponent
=====================
1 |
2 |

Measurement
=================
1  | Width       | 0.001 | 1
2  | Height      | 0.021 | 1
3  | Thickness   | 0.022 | 1
4  | Pad0_Length | 0.031 | 1
5  | Pad1_Width  | 0.041 | 1
6  | Width       | 0.001 | 2
7  | Height      | 0.021 | 2
8  | Thickness   | 0.022 | 2
9  | Pad0_Length | 0.031 | 2
10 | Pad1_Width  | 0.041 | 2
11 | Pad2_Width  | 0.041 | 2
12 | Lead0_Width | 0.041 | 2

Assume a factory is producing 24 million units in 1 day.

The SemicondutorComponent table will then grow by 24 million rows in 1 day.

Assume one SemicondutorComponent unit has 50 measurement parameters (it can be more or less, depending on the SemicondutorComponent type).

The Measurement table will then grow by 24 × 50 = 1,200 million rows in 1 day.

Is it efficient to design it that way?

I wish to have super fast write speed, and reasonably fast read speed from the database.

Or shall I make use of PostgreSQL's array facility?

SemicondutorComponent
=====================
ID | Array_of_measurement_name | Array_of_measurement_value
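
In DDL terms, the two options might look roughly like this (the column types, and the semicondutorcomponent_arr name for the array variant, are my assumptions; the sketches above only show the columns):

    -- Option 1: normalized tables, as sketched above
    CREATE TABLE semicondutorcomponent (
        id bigint PRIMARY KEY
    );

    CREATE TABLE measurement (
        id    bigint  PRIMARY KEY,
        name  text    NOT NULL,
        value numeric NOT NULL,
        semicondutorcomponent_id bigint
            REFERENCES semicondutorcomponent (id)
    );

    -- Option 2: one row per component, with two parallel arrays
    CREATE TABLE semicondutorcomponent_arr (
        id bigint PRIMARY KEY,
        measurement_names  text[],
        measurement_values numeric[]
    );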
A:

I cannot comment on the write speed of arrays versus regular tables, but as far as I can tell, the design with two arrays will be pretty cumbersome to query.

I don't know about the performance of reading arrays either, but from what I can tell from the documentation, all array access is done through positional references (indexes), so it will be a royal pain in the behind to find a particular measurement - you'd have to walk the name array to find the proper index, and then use that index to fetch the value. I doubt it can be done conveniently in pure SQL, and it will probably require a user-defined function.
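
To illustrate, a lookup of one named measurement from the two-array design might end up looking like this sketch (it uses the hypothetical semicondutorcomponent_arr table from the question, and has to assume an upper bound on the array length, since this version has no built-in function that returns the position of a value in an array):

    -- Walk the name array by position to find 'Width', then use the
    -- matching position to pull the number out of the value array.
    SELECT c.id, c.measurement_values[i] AS width
    FROM semicondutorcomponent_arr c,
         generate_series(1, 100) AS i  -- assumed max array length
    WHERE i <= array_upper(c.measurement_names, 1)
      AND c.measurement_names[i] = 'Width';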

Now, about the design with tables: you seem to be concerned about write speed. 24 million components a day is 1 million rows per hour, which is not that much. Times 50, in the worst case, for the measurements, that makes 51 million rows an hour in total, so still less than 1 million rows per minute. I think this should be doable, although it would be advisable to batch inserts and avoid doing many single-row inserts over many short-lived transactions (better to insert and commit them in batches of, say, 10,000 or 100,000).
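
To sketch what I mean by batching, using the Measurement table from the question: one transaction per batch instead of one per row.

    BEGIN;
    INSERT INTO measurement (id, name, value, semicondutorcomponent_id)
        VALUES (1, 'Width', 0.001, 1);
    INSERT INTO measurement (id, name, value, semicondutorcomponent_id)
        VALUES (2, 'Height', 0.021, 1);
    -- ...thousands more inserts for the same batch...
    COMMIT;  -- one commit (and one WAL flush) for the whole batch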

I do think you would also need to design an archiving and/or aggregation solution, because it doesn't seem very maintainable to keep inserting those volumes indefinitely. I doubt it is useful either, but perhaps that's just me not understanding the purpose of this database. I mean, it seems unlikely to me that you want to be able to pinpoint an individual measurement of one component, say, 1 year after it was manufactured. Whereas it does seem useful to keep stats like average, min, max and stddev of the measurements over time. But perhaps you can explain a bit about this.
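
To give an idea, a daily statistics rollup could be as simple as the following sketch. It assumes a produced_on date column is added to the Measurement table; the schema in the question doesn't have one.

    SELECT produced_on,
           name,
           avg(value)    AS avg_value,
           min(value)    AS min_value,
           max(value)    AS max_value,
           stddev(value) AS stddev_value
    FROM measurement
    GROUP BY produced_on, name;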

Another thing I thought of is that it could help to store the raw measurement data first in a cheap and fast log (plain text files, say in CSV format, would do), and then use multiple readers to read them and insert them into the database. These readers could run in a fairly constant fashion. This would make the database less of a bottleneck, and make for a more robust system (assuming the chances that your log keeps working are higher than those of the database staying up). Of course, this approach is less suitable if you need real-time reporting from your database to monitor the process (although, again, it seems very strange to me that you would need to do this at the individual-component level).
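
The loader side of that could be little more than a COPY per log file. A sketch (the file path and CSV layout are of course hypothetical):

    -- Bulk-load one CSV log file per batch; COPY is much faster
    -- than row-by-row INSERTs.
    COPY measurement (id, name, value, semicondutorcomponent_id)
    FROM '/var/log/factory/measurements_0001.csv' WITH CSV;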

Roland Bouman
Yes, we do need information for individual components. Customers just want it.
Yan Cheng CHEOK
A: 

It depends on how you plan on accessing the data, and secondarily on how you plan on storing it.

If you are going to examine the measurement values for a component as a unit, and you don't intend to search by the values, it is not totally out of the question to use an array. On the other hand, if you later want to see which components have (say) a width larger than some value X, then using an array is going to cause you pain, hair loss, and universe heat.
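
With the normalized Measurement design, by contrast, that kind of search is plain SQL, and an ordinary index (my addition, not part of the schema in the question) keeps it fast:

    CREATE INDEX measurement_name_value_idx
        ON measurement (name, value);

    SELECT semicondutorcomponent_id
    FROM measurement
    WHERE name = 'Width' AND value > 0.002;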

As for storing: if you are going to store all the values at the same time, then using arrays is probably OK. If, instead, you are going to store the width first, then UPDATE the row to set the height, and so on, the performance is going to kill you, because every UPDATE in Postgres leaves behind a dead row version that has to be cleaned up by VACUUM.

I concur with Roland that you probably need some sort of aggregation. You might want to look at partitioning too, so that you can truncate (or drop) old partitions outright, without the overhead of vacuuming the dead rows that deleting old data would leave behind.
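
To sketch what partitioning could look like here, using the inheritance-based scheme available in the 8.x series, and again assuming a produced_on date column as the partitioning key:

    CREATE TABLE measurement_2010_01 (
        CHECK (produced_on >= DATE '2010-01-01'
           AND produced_on <  DATE '2010-02-01')
    ) INHERITS (measurement);

    -- Retiring a month of old data is then a cheap metadata operation:
    DROP TABLE measurement_2009_12;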

alvherre