Hi! I'm trying to write a method for an app that takes a chemical formula like "CH3COOH" and returns some sort of collection containing its element symbols.

CH3COOH would return [C,H,H,H,C,O,O,H]

I already have something that is kinda working, but it's very complicated and uses a lot of code with a lot of nested if-else structures and loops.

Is there a way I can do this with some kind of regular expression and String.split, or maybe with some other brilliantly simple code?

+14  A: 

Assuming it's correctly capitalised, each symbol in the equation matches this regular expression:

[A-Z][a-z]*\d*

(For the chemically challenged, an element's symbol is always a capital letter, optionally followed by one or possibly two lower-case letters, e.g. Hg for mercury.)

You can capture the element symbol and the number in groups like so:

([A-Z][a-z]*)(\d*)

So yes, in theory this would be something regular expressions could help with. If you're dealing with formulae like C6H2(NO2)3(CH3)3 then your job is of course a bit harder...

David M
+1, same answer as I came up with.
Adam Robinson
Yes exactly, I had a very hard time dealing with those parentheses. But I could probably separate them and run the same method on the inside.
Christian Kjær
Yes, you'd have to effectively "multiply out the brackets" somehow and work recursively.
David M
This is such a powerful tool wow :O
Christian Kjær
+5  A: 

The solution with regular expressions is the best approach if you only need to handle simple cases. Otherwise you need to build something like an abstract syntax tree and evaluate it, or use Polish notation.

For example, TNT formula C6H2(NO2)3CH3 should be presented like:

(+ (* C 6) (* H 2) (* (+ N (* O 2)) 3) C (* H 3))
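A minimal sketch of that idea, written in Python to match the other code in this thread rather than Java. It is not a definitive implementation: the names TOKEN, parse, parse_group and evaluate are made up for illustration, and it assumes a correctly capitalised formula with one- or two-letter symbols, round brackets only, and no leading coefficients or charges. parse builds a nested tuple in the prefix form shown above, and evaluate walks that tree to count atoms.

```python
import re
from collections import Counter

# Hypothetical token pattern: an element symbol, a run of digits, or a parenthesis.
TOKEN = re.compile(r"[A-Z][a-z]?|\d+|[()]")

def parse(formula):
    """Return a nested tuple tree such as ('+', ('*', 'C', 6), ...)."""
    tokens = TOKEN.findall(formula)
    if "".join(tokens) != formula:
        raise ValueError("unexpected characters in %r" % formula)
    tree, pos = parse_group(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unbalanced ')' in %r" % formula)
    return tree

def parse_group(tokens, pos):
    """Parse terms until ')' or end of input; return (tree, next position)."""
    terms = []
    while pos < len(tokens) and tokens[pos] != ")":
        tok = tokens[pos]
        if tok == "(":
            node, pos = parse_group(tokens, pos + 1)
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # skip the closing parenthesis
        elif tok.isdigit():
            raise ValueError("count without an element or group")
        else:
            node, pos = tok, pos + 1
        # An optional count multiplies the element or group just before it.
        if pos < len(tokens) and tokens[pos].isdigit():
            node = ("*", node, int(tokens[pos]))
            pos += 1
        terms.append(node)
    return ("+",) + tuple(terms), pos

def evaluate(node):
    """Walk the tree and return a Counter mapping symbol to atom count."""
    if isinstance(node, str):
        return Counter({node: 1})
    if node[0] == "*":
        inner = evaluate(node[1])
        return Counter({el: n * node[2] for el, n in inner.items()})
    total = Counter()
    for child in node[1:]:
        total.update(evaluate(child))
    return total
```

For example, evaluate(parse("C6H2(NO2)3CH3")) gives 7 C, 5 H, 3 N and 6 O, which is the composition of TNT.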
Roman
This solution is dynamite
Freiheit
+1 for spotting my TNT formula; sadly I can't then give you the +1 you also deserve for the solution.
David M
+2  A: 

Have you looked into expressing your chemical formulas in Chemical Markup Language? It is very versatile and there are a lot of tools/viewers out there that can render these chemical formulas or compounds in 2D or 3D.

CodeToGlory
Wikipedia mentions JUMBO, a Java library that can process CML. It sounds like that would be better than growing your own little-chemical-expressions mini-language that nobody else uses.
Warren P
+1 for not re-inventing the round thing.
Carl
CML is way overkill for this. For one, CML doesn't use molecular formulas but molecular graphs, so the original poster would need a molecular formula parser just to generate the CML in the first place.
Andrew Dalke
A: 

Hello Christian,

I am working on a program that requires molar mass calculations of chemical formulas, so I have created a solution that works for a variety of formulas.

For example, "(CH3)16(Tc(H2O)3CO(BrFe3(ReCl)3(SO4)2)2)2MnO4" will result in " 16C 48H 2Tc 12H 6O 2C 2O 4Br 12Fe 12Re 12Cl 8S 32O Mn 4O" (this compound is made up, but hey, it works!)

This code is written in C#, which is why I haven't posted it. If you're interested I can post it for you. I actually wrote out a full answer before noticing the Java tag.

Anyway, it works by recursively grouping the blocks of atoms enclosed in matching parentheses. It does not handle coefficients such as 2Pb (but (Pb)2 or Pb2 does work) or charged compounds such as OH-.

In no way is it simple or elegant. I just wanted a working solution, and I know there are better ways (I never even tried regular expressions!). But it works with the formulas I need, so maybe it suits yours as well.

Here are some test cases I run it on. Take a look at them and let me know if the C# code would still be useful to you. The format is (input, expected output)

        ("Pb ", " Pb"); 
        ("H ", " H"); 
        ("Pb2 ", " 2Pb"); 
        ("H2 ", " 2H");             
        ("3Pb2 ", " 6Pb");
        ("Pb2SO4", " 2Pb S 4O");                                     
        ("PbH2 ", " Pb 2H");            
        ("(PbH2)2 ", " 2Pb 4H");
        ("(CCC)2 ", " 2C 2C 2C");
        ("Pb(H2)2 ", " Pb 4H");            
        ("(Pb(H2)2)2 ", " 2Pb 8H"); 
        ("(Pb(H2)2)2NO3 ", " 2Pb 8H N 3O"); 
        ("(Ag(Pb(H2)2)2)2SO4 ", " 2Ag 4Pb 16H S 4O");             
        ("Pb(CH3(CH2)2CH3)2", " Pb 2C 6H 4C 8H 2C 6H"); 
        ("Na2(CH3(CH2)2CH3)2", " 2Na 2C 6H 4C 8H 2C 6H");
        ("Tc(H2O)3Fe3(SO4)2", " Tc 6H 3O 3Fe 2S 8O");
        ("Tc(H2O)3(Fe3(SO4)2)2", " Tc 6H 3O 6Fe 4S 16O");
        ("(Tc(H2O)3(Fe3(SO4)2)2)2", " 2Tc 12H 6O 12Fe 8S 32O");
        ("(Tc(H2O)3CO(Fe3(SO4)2)2)2", " 2Tc 12H 6O 2C 2O 12Fe 8S 32O");
        ("(Tc(H2O)3CO(BrFe3(ReCl)3(SO4)2)2)2MnO4", " 2Tc 12H 6O 2C 2O 4Br 12Fe 12Re 12Cl 8S 32O Mn 4O");
        ("(CH3)16(Tc(H2O)3CO(BrFe3(ReCl)3(SO4)2)2)2MnO4", " 16C 48H 2Tc 12H 6O 2C 2O 4Br 12Fe 12Re 12Cl 8S 32O Mn 4O"); 
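The C# code itself is not shown here, so the following is only a rough Python sketch of the same grouping idea, not a port of it. It uses an explicit stack instead of recursion: each "(" opens a new list of terms, each ")" folds that list into its parent, and a count multiplies whatever element or group came just before it. The names TOKEN, expand and render are hypothetical. Like the description above, it makes no attempt to handle charges, and it rejects a leading coefficient such as 3Pb2 outright.

```python
import re

# Hypothetical tokenizer: element symbol, run of digits, or a parenthesis.
TOKEN = re.compile(r"[A-Z][a-z]?|\d+|[()]")

def expand(formula):
    """Return the atoms of formula as an ordered list of [symbol, count]."""
    text = formula.strip()
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text:
        raise ValueError("unexpected characters in %r" % formula)
    stack = [[]]   # one list of terms per open parenthesis level
    last = None    # the terms a following count applies to
    for tok in tokens:
        if tok == "(":
            stack.append([])
            last = None
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            last = stack.pop()
            stack[-1].extend(last)   # same term objects, so later counts still apply
        elif tok.isdigit():
            if last is None:
                raise ValueError("count without an element or group")
            for term in last:
                term[1] *= int(tok)
            last = None
        else:
            term = [tok, 1]
            stack[-1].append(term)
            last = [term]
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]

def render(terms):
    """Format terms like the test cases above: ' 2Pb 4H', omitting a count of 1."""
    return "".join(" %s%s" % (n if n > 1 else "", sym) for sym, n in terms)
```

With this sketch, render(expand("(Ag(Pb(H2)2)2)2SO4 ")) returns " 2Ag 4Pb 16H S 4O", in the same format as the expected outputs listed above.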
A: 

I have developed a couple of series of articles on how to parse molecular formulas, including more complex formulas like C6H2(NO2)3CH3.

The most recent is my presentation "PLY and PyParsing" at PyCon2010 where I compare those two Python parsing systems using a molecular formula evaluator as my sample problem. There's even a video of my presentation.

The presentation was based on a three-part series of articles I did developing a molecular formula parser using ANTLR. In part 3 I compare the ANTLR solution to a hand-written regular expression parser and solutions in PLY and PyParsing.

The regexp and PLY solutions were first developed in a two-part series on two ways of writing parsers in Python.

The regexp solution and the base ANTLR/PLY/PyParsing solutions all use a regular expression like [A-Z][a-z]?\d* to match terms in the formula. This is what @David M suggested.

Here it is worked out in Python:

import re

# element_name is: capital letter followed by optional lower-case
# count is: empty string (so the count is 1), or a set of digits
element_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

all_elements = []
for (element_name, count) in element_pat.findall("CH3COOH"):
    if count == "":
        count = 1
    else:
        count = int(count)
    all_elements.extend([element_name] * count)

print(all_elements)

When I run this (it's hard-coded to use acetic acid, CH3COOH) I get

['C', 'H', 'H', 'H', 'C', 'O', 'O', 'H']

Do note that this short bit of code assumes the molecular formula is correct. If you give it something like "##$%^O2#$$#" then it will ignore the fields it doesn't know about and give ['O', 'O']. If you don't want that then you'll have to make it a bit more robust.
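One way to make it stricter, sketched here under the assumption that any character not belonging to an element term should be an error: match the pattern at successive positions instead of using findall, and fail as soon as nothing matches. The function name parse_flat is made up for this example, and it still does not check that a symbol is a real element or handle parentheses.

```python
import re

# One term: element symbol plus an optional count (same pattern as above).
element_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

def parse_flat(formula):
    """Expand a parenthesis-free formula, rejecting anything unrecognised."""
    all_elements = []
    pos = 0
    while pos < len(formula):
        # Anchor the match at pos so stray characters cannot be skipped.
        m = element_pat.match(formula, pos)
        if m is None:
            raise ValueError("unexpected %r at position %d" % (formula[pos], pos))
        element_name, count = m.groups()
        all_elements.extend([element_name] * (int(count) if count else 1))
        pos = m.end()
    return all_elements
```

Here parse_flat("CH3COOH") returns the same list as before, while parse_flat("##$%^O2#$$#") raises a ValueError at the first "#".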

If you want to support more complicated formulas, like C6H2(NO2)3CH3, then you'll need to know a bit about tree data structures, specifically (as @Roman points out), abstract syntax trees (most often called ASTs). That's too complicated to get into here, so see my talk and essays for more details.

Andrew Dalke