Hi all,

I'm learning JBoss Drools and I'm playing with the genetics data from the HapMap project (http://hapmap.ncbi.nlm.nih.gov/genotypes/latest/forward/non-redundant/). Each file in this directory is a table with the individuals across the top, the positions on the genome down the left, and the observed mutation for each individual/position pair in the cells.

Here I'd like to use Drools to find potential errors in the file (e.g. a child that doesn't have any mutation from its parents).

1) I want to load these data into Drools. This can be a large amount of data (e.g. genotypes_chr2_YRI_r27_nr.b36_fwd.txt.gz is 20 MB gzipped). Will the data be stored in memory? Or does Drools store it somewhere else? Or should I use a persistence system?

2) About the model:

I was thinking about putting the following classes in a StatefulKnowledgeSession:

class Individual
{
    private String name;
    // constructor, getters, setters etc...
}

class Position
{
    private String name;
    private String chromosome;
    private int position;
    // constructor, getters, setters etc...
}

class ObservedMutation
{
    private String individualName;
    private String positionName;
    private String observed;
    // constructor, getters, setters etc...
}

or should ObservedMutation be:

class ObservedMutation
{
    private Individual individual;
    private Position position;
    private String observed;
    // constructor, getters, setters etc...
}

Thanks for your suggestions,

Pierre

Update: my first test: http://plindenbaum.blogspot.com/2010/07/rules-engine-for-bioinformatics-playing.html

A: 

I think it should be the second one. I'd prefer objects over primitives like String.

duffymo
+1  A: 

Yes, when you insert a large amount of data, Drools will store it in memory. 20 MB is probably not a problem - just try it.
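
To get a feel for how many facts would end up in working memory, one option is to parse the file into plain objects first and count them before inserting anything into a session. The following is only a minimal sketch under stated assumptions: it assumes the HapMap layout of one header line, eleven leading annotation columns (rs#, alleles, chrom, pos, ...) and then one whitespace-separated genotype column per individual. The names HapMapReader and Observation are hypothetical and do not come from Drools or from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: turns HapMap genotype rows into simple fact objects. */
public class HapMapReader {
    /** Assumption: the first 11 columns are annotations, samples start after them. */
    static final int FIRST_SAMPLE_COLUMN = 11;

    /** One genotype call, mirroring the string-based ObservedMutation of the question. */
    public static final class Observation {
        public final String individualName;
        public final String positionName;
        public final String observed;

        public Observation(String individualName, String positionName, String observed) {
            this.individualName = individualName;
            this.positionName = positionName;
            this.observed = observed;
        }
    }

    /** Reads a header line plus data rows and returns one Observation per cell. */
    public static List<Observation> read(Reader in) {
        List<Observation> facts = new ArrayList<>();
        try {
            BufferedReader reader = new BufferedReader(in);
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return facts; // empty input
            }
            String[] header = headerLine.trim().split("\\s+");
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] tokens = line.trim().split("\\s+");
                int n = Math.min(tokens.length, header.length);
                for (int i = FIRST_SAMPLE_COLUMN; i < n; i++) {
                    // tokens[0] is the SNP identifier (rs#), header[i] the individual
                    facts.add(new Observation(header[i], tokens[0], tokens[i]));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return facts;
    }

    /** Convenience wrapper that decompresses a .gz file while reading it. */
    public static List<Observation> readGzip(String path) {
        try (InputStream raw = Files.newInputStream(Paths.get(path));
             Reader in = new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8)) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The number of facts is roughly rows times individuals, so the size of the returned list gives a quick indication of whether the default JVM heap is enough or whether it needs to be raised with -Xmx.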

It should be straightforward to write rules for the model classes you propose - the rules in the hapmap.drl example in your first test look reasonable. The choice between your two ObservedMutation classes is as much a matter of taste as anything else; the main practical difference is that they result in different DRL rule syntax. I would start with the second version and see how you get on. One perhaps non-obvious point: if you have object properties (as in the second version of ObservedMutation), you might need to use the keyword this to refer to a bound object, e.g. $p in:

when
    ObservedMutation($p : position)
    Position(this == $p)
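
The check that such a rule would eventually express can be prototyped in plain Java before writing it in DRL. The sketch below rests on an assumption about what counts as an error: a child's genotype is flagged unless one of its alleles can come from the father and the other from the mother. It also assumes two-letter genotypes such as AG and treats any genotype containing N as missing data. MendelCheck and isConsistent are hypothetical names, and the family relationships themselves (which the genotype files do not contain) are not covered here.

```java
/** Hypothetical helper: tests whether a child's genotype fits its parents' genotypes. */
public class MendelCheck {

    /** Assumption: anything that is not two letters, or contains 'N', is missing data. */
    static boolean isMissing(String genotype) {
        return genotype == null || genotype.length() != 2 || genotype.indexOf('N') >= 0;
    }

    /** True if the genotype carries the given allele on either chromosome. */
    static boolean has(String genotype, char allele) {
        return genotype.indexOf(allele) >= 0;
    }

    /**
     * Returns true when the child's two alleles can be explained by taking
     * one allele from the father and the other from the mother.
     * Missing data is never reported as an inconsistency.
     */
    public static boolean isConsistent(String child, String father, String mother) {
        if (isMissing(child) || isMissing(father) || isMissing(mother)) {
            return true;
        }
        char a = child.charAt(0);
        char b = child.charAt(1);
        return (has(father, a) && has(mother, b))
            || (has(father, b) && has(mother, a));
    }
}
```

For example, isConsistent("AG", "AA", "GG") is true, while isConsistent("AA", "GG", "GG") is false because neither parent carries an A.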
Peter Hilton