Hi,

I have a large set of molecules from the ZINC database (http://zinc.docking.org/), in mol2 (http://tripos.com/index.php?family=modules,SimplePage,,,&page=sup_mol2&s=0) format. I would like to split this database into an arbitrary number N of smaller databases. What is the best scripting approach for this in Python, bash, or Perl? I read about Open Babel, but it can only generate sets of individual molecules.

If there is no good way to do this with mol2, I can also convert the file to another, more convenient format.

Thanks

+2  A: 

csplit can separate the file into individual molecules:

csplit ~/Download/zinc.mol2 '/@<TRIPOS>MOLECULE/' '{*}'

If you want something more clever, you can read each molecule into a list or array as a string and then write as many molecules as you like to each output file.
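A minimal Python sketch of that idea, assuming every record begins with a line starting with `@<TRIPOS>MOLECULE` and runs until the next such line. The names `read_molecules`, `split_mol2`, and the `part_000.mol2` output naming are made up for illustration. Instead of holding every molecule in a list, this variant streams the file and hands out molecules round-robin, which is easier on memory for a large database:

```python
from pathlib import Path

MARKER = "@<TRIPOS>MOLECULE"


def read_molecules(lines):
    """Yield each molecule record as one string, starting at its MOLECULE tag.

    Anything before the first tag (e.g. leading comments) is dropped.
    """
    record = None  # stays None until the first MOLECULE tag is seen
    for line in lines:
        if line.startswith(MARKER):
            if record is not None:
                yield "".join(record)
            record = []
        if record is not None:
            record.append(line)
    if record is not None:
        yield "".join(record)


def split_mol2(source, n_parts, prefix="part"):
    """Write molecules from `source` round-robin into n_parts files.

    Output files are named <prefix>_000.mol2, <prefix>_001.mol2, ...
    (a hypothetical naming scheme). Returns the molecule count per file.
    """
    paths = [Path(f"{prefix}_{i:03d}.mol2") for i in range(n_parts)]
    counts = [0] * n_parts
    outputs = [path.open("w") for path in paths]
    try:
        with open(source) as handle:
            for index, molecule in enumerate(read_molecules(handle)):
                slot = index % n_parts
                outputs[slot].write(molecule)
                counts[slot] += 1
    finally:
        for out in outputs:
            out.close()
    return counts
```

Calling `split_mol2("zinc.mol2", 10)` would then produce ten files with roughly equal numbers of molecules. Trailing blank lines stay attached to the molecule they follow, so each output file keeps the original record layout.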

Ignacio Vazquez-Abrams
Thanks. What is the point of '{*}'? I get an error. Each mol2 molecule starts with the TRIPOS header, as you said, but ends with 2 blank lines.
asdf
The {*} tells csplit to repeat the previous pattern until the end of the file. Without it, csplit creates only 2 files: one containing everything before the first match, and the other containing everything from the first match to the end of the file.
Ignacio Vazquez-Abrams
Thanks for the help.
asdf