Sorry about the odd title.

I am using eSearch & eSummary to go from

Accession Number --> gID --> TaxID

Assume that 'accessions' is a list of 20 accession numbers (I do 20 at a time because that's the maximum that NCBI will allow).

I do:

from Bio import Entrez

handle = Entrez.esearch(db="nucleotide", rettype="xml", term=accessions)
record = Entrez.read(handle)
gids = ",".join(record[u'IdList'])

This gives me the 20 GIDs corresponding to those 20 accession numbers.

Followed by:

handle = Entrez.esummary(db="nucleotide", id=gids)
record = Entrez.read(handle)

Which gives me this error because one of the GIDs in gids has been removed from NCBI:

File ".../biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 191, in endElement
    value = IntegerElement(value)
ValueError: invalid literal for int() with base 10: ''

I could wrap the call in try/except, but that would also skip the other 19 GIDs, which are fine.

My question is:

How do I read 20 records at a time with Entrez.read and skip the ones that are missing without sacrificing the other 19? I could query one at a time, but that would be incredibly slow: I have 300,000 accession numbers, and NCBI only allows 3 queries per second (in practice it's more like 1 query per second).

A: 

I'd have a look at Parser.py and see what is being parsed. It looks like you are getting a result from the NCBI ok, but the format of one record is tripping up the parser.

It may be possible to subclass/monkeypatch the parser to get it past the exception.
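If patching the parser turns out to be awkward, another option is to leave it alone and bisect any batch that fails: retry each half separately until the bad GID is isolated, and keep everything else. The sketch below is only an illustration under some assumptions: the helper names fetch_skipping_bad and esummary_batch are made up here, the failure is assumed to surface as a ValueError as in the traceback above, and the result of Entrez.read on an eSummary response is assumed to be a list of per-record summaries.

```python
def fetch_skipping_bad(ids, fetch_batch):
    """Return (records, bad_ids) for a list of IDs.

    fetch_batch is any callable that takes a list of IDs and returns a
    list of records, raising ValueError if the batch cannot be parsed.
    A failing batch is split in half and each half is retried, so one
    bad ID does not cost the rest of the batch.
    """
    if not ids:
        return [], []
    try:
        return list(fetch_batch(ids)), []
    except ValueError:
        if len(ids) == 1:
            # A single ID that still fails is the culprit.
            return [], list(ids)
        mid = len(ids) // 2
        left_records, left_bad = fetch_skipping_bad(ids[:mid], fetch_batch)
        right_records, right_bad = fetch_skipping_bad(ids[mid:], fetch_batch)
        return left_records + right_records, left_bad + right_bad


def esummary_batch(gids):
    """Hypothetical wrapper around the eSummary call from the question."""
    # Imported here so the bisecting helper can be tried without Biopython.
    from Bio import Entrez

    handle = Entrez.esummary(db="nucleotide", id=",".join(gids))
    try:
        return Entrez.read(handle)
    finally:
        handle.close()
```

It would be called as records, bad = fetch_skipping_bad(record[u'IdList'], esummary_batch). With one bad GID in a batch of 20 this costs roughly ten requests instead of one, which is still fewer than querying all 20 individually, and clean batches cost nothing extra. Each retry is a separate request, so NCBI's rate limit still applies.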

gnibbler
Any suggestions as to how to do that? I tried putting: if value == "": return before the problematic line, but it gives me the same error.
Austin
Can you put a try/except around the problem line and print the value when the exception happens? Otherwise I'd use the debugger.
gnibbler
+1  A: 

Hey,

I don't know the answer (sorry) but you are probably more likely to get some help from the biopython mailing list:

http://lists.open-bio.org/mailman/listinfo/biopython/

david w
+2  A: 

I sent a message to the Biopython mailing list. Apparently it's a bug and they're working on it.

Austin