I am trying to slim down the .bib files I get from my reference manager, because it leaves extra fields that end up getting mangled when I put them into LaTeX.

A characteristic entry that I want to clean up is:

@Article{Kholmurodov:2001p113,
author = {K Kholmurodov and I Puzynin and W Smith and K Yasuoka and T Ebisuzaki}, 
journal = {Computer Physics Communications},
title = {MD simulation of cluster-surface impacts for metallic phases: soft landing, droplet spreading and implantation},
abstract = {Lots of text here.  Even more text.},
affiliation = {RIKEN, Inst Phys {\&} Chem Res, Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},
number = {1},
pages = {1--16},
volume = {141},
year = {2001},
month = {Dec},
language = {English},
keywords = {Ethane, molecular dynamics, Clusters, Dl_Poly Code, solid surface, metal, Hydrocarbon Thin-Films, Adsorption, impact, Impact Processes, solid surface, Molecular Dynamics Simulation, Large Systems, DL_POLY, Beam Deposition, Package, Collision-Induced Desorption, Diamond Films, Vapor-Deposition, Transition-Metals, Molecular-Dynamics Simulation}, 
date-added = {2008-06-27 08:58:25 -0500},
date-modified = {2009-03-24 15:40:27 -0500},
pmid = {000172275000001},
local-url = {file://localhost/User/user/Papers/2001/Kholmurodov/Kholmurodov-MD%20simulation%20of%20cluster-surface%20impacts-2001.pdf},
uri = {papers://B08E511A-2FA9-45A0-8612-FA821DF82090/Paper/p113},
read = {Yes},
rating = {0}
}

I would like to eliminate fields like month, abstract, keywords, etc., some of which take a single line and some of which span multiple lines.

I have given it a try in Python, like this:

import re

# f holds the path to the .bib file
with open(f, 'r') as fOpen:
    start_text = fOpen.read()

# regex: one substitution per field to remove
out_text = re.sub(r'^(month).*,\n','',start_text)
out_text = re.sub(r'^(annote)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(note)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(abstract)((.|\n)*?)\},\n','',out_text)

# overwrite the original file with the cleaned text
with open(f, 'w') as fNew:
    fNew.write(out_text)

I ran these regexes in TextMate to check them before trying them in Python, and they appear to be OK.

Any suggestions?

Thanks.

A: 
Tomalak
Yes - thanks. I think that does the job. And thanks for the warning. Fortunately in this case I don't think I should run into any instances that cause the regex you suggest to fail.
dtlussier
Oh, just a quick note for anyone seeing this later. To use the MULTILINE and DOTALL flags as suggested here, you need to compile the regex first. So: `text_out = re.sub(re.compile(<regex>, re.DOTALL | re.MULTILINE), <replacement-txt>, original)`. Note that to use more than one flag, you combine them with the `|` (bitwise OR) operator.
dtlussier
@dtlussier: Compiling the regex will also give a speed-up when it is reused, in a loop for example. BTW, seeing a first-time user who instantly *gets it* with respect to question and comment formatting is a delight. :-)
Tomalak
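For anyone who wants to see the pieces from the comments put together, here is a minimal sketch. It is not the regex from the answer, just one possible pattern built on the same idea: compile once with `re.MULTILINE | re.DOTALL`, then reuse the compiled patterns. The names `FIELDS_TO_DROP`, `make_field_pattern`, `strip_fields` and `SAMPLE` are made up for illustration, and the pattern assumes each field value is wrapped in braces and that its closing `}` is the first one sitting at the end of a line. A multi-line value with an inner line that ends in `}` would be cut short, so check the output on your own files.

```python
import re

# Hypothetical list of fields to remove; adjust to your own needs.
FIELDS_TO_DROP = ["month", "abstract", "keywords", "annote", "note"]


def make_field_pattern(field):
    # MULTILINE lets ^ match at the start of every line, not only at the
    # start of the string. DOTALL lets .*? run across newlines, so values
    # that span several lines are matched too. The lazy .*? stops at the
    # first "}" that is followed only by an optional comma and a newline.
    return re.compile(
        r"^[ \t]*" + re.escape(field) + r"[ \t]*=[ \t]*\{.*?\}[ \t]*,?[ \t]*\n",
        re.MULTILINE | re.DOTALL | re.IGNORECASE,
    )


# Compile once, reuse for every entry or file.
PATTERNS = [make_field_pattern(name) for name in FIELDS_TO_DROP]


def strip_fields(bibtex_text):
    """Return bibtex_text with every field in FIELDS_TO_DROP removed."""
    for pattern in PATTERNS:
        bibtex_text = pattern.sub("", bibtex_text)
    return bibtex_text


# Shortened, made-up variant of the entry from the question.
SAMPLE = """@Article{Kholmurodov:2001p113,
author = {K Kholmurodov and I Puzynin},
abstract = {Lots of text here.
Even more text.},
year = {2001},
month = {Dec},
annote = {An annotation},
rating = {0}
}
"""

cleaned = strip_fields(SAMPLE)
print(cleaned)
```

In Python 2.7 and 3.1 or later, `re.sub` also accepts a `flags` keyword argument, so `re.sub(<regex>, <replacement-txt>, original, flags=re.DOTALL | re.MULTILINE)` works without compiling first. Compiling is still handy when the same pattern is applied to many files.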