I'm trying to write a method for an app that takes a chemical formula like "CH3COOH" and returns some sort of collection containing its element symbols, one entry per atom.
CH3COOH would return [C,H,H,H,C,O,O,H]
I already have something that is kinda working, but it's very complicated and uses a lot of code with a lot of nested if-else structures and loops.
Is there a way I can do this by using some kind of regular expression with String.split or maybe in some other brilliant simple code?
I have written a couple of series of articles on how to parse molecular formulas, including more complex formulas like C6H2(NO2)3CH3.
The most recent is my presentation "PLY and PyParsing" at PyCon2010 where I compare those two Python parsing systems using a molecular formula evaluator as my sample problem. There's even a video of my presentation.
The presentation was based on a three-part series of articles I did developing a molecular formula parser using ANTLR. In part 3 I compare the ANTLR solution to a hand-written regular expression parser and solutions in PLY and PyParsing.
The regexp and PLY solutions were first developed in a two-part series on two ways of writing parsers in Python.
The regexp solution and the base ANTLR/PLY/PyParsing solutions use a regular expression like [A-Z][a-z]?\d* to match terms in the formula. This is what @David M suggested.
Here it is worked out in Python:
import re

# element_name is: capital letter followed by optional lower-case
# count is: empty string (so the count is 1), or a set of digits
element_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

all_elements = []
for (element_name, count) in element_pat.findall("CH3COOH"):
    if count == "":
        count = 1
    else:
        count = int(count)
    all_elements.extend([element_name] * count)

print(all_elements)
When I run this (it's hard-coded to use acetic acid, CH3COOH) I get:
['C', 'H', 'H', 'H', 'C', 'O', 'O', 'H']
Do note that this short bit of code assumes the molecular formula is correct. If you give it something like "##$%^O2#$$#" then it will silently skip the characters it doesn't recognize and give ['O', 'O']. If you don't want that then you'll have to make it a bit more robust.
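One way to make it more robust is to match terms one after another from the start of the string and complain as soon as something doesn't fit. This is only a minimal sketch: the function name parse_formula_strict is made up for illustration, and it still doesn't check that a symbol like "Xx" is a real element.

```python
import re

# One term: element symbol (capital + optional lower-case), then optional count.
term_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

def parse_formula_strict(formula):
    """Expand a flat formula into a list of symbols.

    Hypothetical helper: raises ValueError instead of skipping
    characters that are not part of a valid term.
    """
    elements = []
    pos = 0
    while pos < len(formula):
        # match() with a start position only succeeds if a term begins exactly at pos
        m = term_pat.match(formula, pos)
        if m is None:
            raise ValueError(
                "unexpected character %r at position %d" % (formula[pos], pos))
        name, count = m.groups()
        elements.extend([name] * (int(count) if count else 1))
        pos = m.end()
    return elements
```

With this version "CH3COOH" expands as before, while "##$%^O2#$$#" raises a ValueError at the first "#" instead of quietly returning ['O', 'O'].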
If you want to support more complicated formulas, like C6H2(NO2)3CH3, then you'll need to know a bit about tree data structures, specifically (as @Roman points out) abstract syntax trees (most often called ASTs). That's too complicated to get into here, so see my talk and essays for more details.
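To give a flavor of what parentheses involve, here is a rough sketch that sidesteps a full AST by keeping a stack of partially built lists, one per open parenthesis. It is not the approach from the talk or essays; the names token_pat and expand_formula are invented for this example, and it only handles element symbols, parentheses, and integer counts.

```python
import re

# A token is an element symbol, "(", ")", or a run of digits.
token_pat = re.compile(r"([A-Z][a-z]?)|(\()|(\))|(\d+)")

def expand_formula(formula):
    """Expand a formula with nested parentheses into a flat list of symbols."""
    stack = [[]]  # stack[-1] is the group currently being filled
    last = None   # symbols added by the previous element or ")" (a count repeats these)
    pos = 0
    while pos < len(formula):
        m = token_pat.match(formula, pos)
        if m is None:
            raise ValueError("unexpected character %r at position %d"
                             % (formula[pos], pos))
        element, open_paren, close_paren, count = m.groups()
        if element:
            last = [element]
            stack[-1].extend(last)
        elif open_paren:
            stack.append([])
            last = None
        elif close_paren:
            if len(stack) == 1:
                raise ValueError("unbalanced ')' at position %d" % pos)
            last = stack.pop()
            stack[-1].extend(last)
        else:
            if last is None:
                raise ValueError("count with nothing to repeat at position %d" % pos)
            group = stack[-1]
            # Replace the single copy already added with `count` copies.
            del group[len(group) - len(last):]
            group.extend(last * int(count))
            last = None
        pos = m.end()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]
```

For C6H2(NO2)3CH3 this yields a list with 7 C, 5 H, 3 N, and 6 O entries. A real AST becomes worthwhile once you want more than a flat list, for example to keep the grouping or to report better errors.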