Hi,

First off, this may be the wrong Forum for this question, as it's pretty darn R+Bioconductor specific. Here's what I have:

library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[fData(cd4T)$Gene.symbol != "", ]

Now cd4T is an ExpressionSet object wrapping a matrix with 19794 rows (probesets) and 15 columns (samples). The final line drops every probeset that has no corresponding gene symbol. The trouble is that many genes in this set are assigned to more than one probeset. You can see this by running

gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897

So 6897 of my 19794 probesets are redundant: each of them maps to a gene symbol that another probeset already covers. I'd like to combine the expression levels of all probesets associated with each gene. I don't care much about the actual probe ID for each probe. I'd very much like to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.

I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?

A: 

I'm not an expert, but from what I've seen over the years everyone has their own favorite way of combining probesets. The two methods I've seen used most on a large scale are (1) keeping only the probeset with the largest variance across the expression matrix and (2) taking the mean of the probesets and creating a meta-probeset out of it. For smaller blocks of probesets I've seen people use more intensive methods that involve looking at per-probeset plots to get a feel for what's going on ... generally what happens is that one probeset turns out to be the 'good' one and the rest aren't very good.
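To make those two approaches concrete, here is a minimal base R sketch. It is only an illustration, not an established implementation: the toy matrix and symbol vector are made-up stand-ins for exprs(cd4T) and as.character(fData(cd4T)$Gene.symbol), and the function names collapse_mean and collapse_max_var are hypothetical.

```r
# Toy stand-ins (made-up values) for exprs(cd4T) and the gene symbol column.
expr <- matrix(
  c(1,  2,  3,
    3,  5,  7,
    5,  5,  5,
    0, 10, 20,
    7,  8,  9),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3))
)
# Assumed to be a character vector, one symbol per row of expr.
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C")

# Method 1: average all probesets of a gene into one meta-probeset.
# rowsum() adds rows per group; dividing by group size turns sums into means.
collapse_mean <- function(expr, symbols) {
  sums <- rowsum(expr, group = symbols)
  counts <- as.vector(table(symbols)[rownames(sums)])
  sums / counts
}

# Method 2: keep, for each gene, the probeset with the largest variance
# across samples (ties go to the first probeset encountered).
collapse_max_var <- function(expr, symbols) {
  v <- apply(expr, 1, var)
  groups <- split(seq_len(nrow(expr)), symbols)
  keep <- vapply(groups, function(i) i[which.max(v[i])], integer(1))
  out <- expr[keep, , drop = FALSE]
  rownames(out) <- names(keep)
  out
}

by_mean <- collapse_mean(expr, symbols)
by_var  <- collapse_max_var(expr, symbols)
```

Both functions return a matrix with one row per gene symbol and the original sample columns. Since the column names are unchanged, it should then be possible to wrap the result back into an ExpressionSet together with the original phenoData, for example with ExpressionSet(assayData = by_mean, phenoData = phenoData(cd4T)), though I have not tested that on this dataset.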

I haven't seen generalized code to do this - as an example we recently realized in my lab that a few of us have our own private functions to do this same thing.

geoffjentry
A: 

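The function you are looking for is 'nsFilter' in the R genefilter package. It does two main things here. First, it keeps only probesets that have an Entrez Gene ID and filters out the rest. Second, when an Entrez ID has multiple probesets, it keeps the one with the largest value of the variability statistic (IQR by default, set through var.func) and removes the others. Note that by default it also applies a variance filter that drops the least variable half of the features, so pass var.filter=FALSE if you only want the deduplication. The result is a list whose eset element is an ExpressionSet with one probeset per Entrez Gene ID. Hope this helps.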