This problem actually hit me recently.

I was tasked with putting people's bios up on the web (I asked for opinions in a different question). I went with XML and created elements based on which sections were going to be displayed.

Some people had formulas in their bios, and when I copied and pasted them the formatting didn't carry over.

My question: is there an easy way to parse out the formulas and format them accordingly?
One idea I had was to just subscript the numbers, but I would have to implement BBCode-style tags to do this, as there are numbers all over the place. Alternatively, I could detect whether a number is directly to the right of a letter and subscript that number.

Some of the formulas look like CoO3.

I used PHP to parse the XML.

What are your opinions?

A: 

I would lean toward using a regular expression to parse your chemical notation.

Maybe this helps? http://www.pmichaud.com/pipermail/pmwiki-users/2008-October/052692.html
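
As a rough illustration of the regex idea, here is a minimal sketch of the "number to the right of a letter" rule from the question. The function name `subscriptAfterLetter` is made up for this example, and the rule is deliberately blunt: it knows nothing about chemistry.

```php
<?php
// Sketch only: subscript any run of digits that directly follows a letter.
// subscriptAfterLetter is a hypothetical name, not something from the thread.
function subscriptAfterLetter(string $text): string
{
    // (?<=[A-Za-z]) is a lookbehind: the digits must be preceded by a letter,
    // but the letter itself is not part of the match.
    return preg_replace('/(?<=[A-Za-z])\d+/', '<sub>$0</sub>', $text);
}

echo subscriptAfterLetter('CoO3 and H2SO4, founded in 2009'), "\n";
// prints: CoO<sub>3</sub> and H<sub>2</sub>SO<sub>4</sub>, founded in 2009
```

Standalone numbers such as 2009 are left alone, but anything like "Room B12" would be subscripted too, so this probably only suits text where letter-digit pairs are always formulas.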

BrianAdkins
+1  A: 

Maybe something like this?

<?php
function formatFormulas($html)
{
    $regex  = '/(\\s*(Ac|Ag|Al|Am|Ar|As|At|Au|Ba|Be|Bh|Bi|Bk|Br|B|Ca|Cd|Ce|Cf|Cl|Cm|Co|Cr|Cs|Cu|C|';
    $regex .= 'Db|Ds|Dy|Er|Es|Eu|Fe|Fm|Fr|F|Ga|Gd|Ge|He|Hf|Hg|Ho|Hs|H|In|Ir|I|Kr|K|La|Li|Lr|Lu|Md|';
    $regex .= 'Mg|Mn|Mo|Mt|Na|Nb|Nd|Ne|Ni|No|Np|N|Os|O|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|P|Ra|Rb|Re|Rf|Rg|Rh|';
    $regex .= 'Rn|Ru|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|S|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|Uub|Uuh|Uuo|Uup|Uuq|Uus|Uut|';
    $regex .= 'U|V|W|Xe|Yb|Y|Zn|Zr)\\s*(<[^>]+>)*\\s*\\d*\\s*(<[^>]+>)*\\s*)+/';
    if ( preg_match_all($regex, $html, $m) ) {
        // $m[0] holds each full match: a run of element symbols with optional counts, tags, and whitespace.
        for ($i = 0; $i < count($m[0]); $i++) {
            // Clean up a copy of the match, then swap it back into the HTML.
            $replace = preg_replace('/\\s+/', "", $m[0][$i]);                     // strip all whitespace
            $replace = preg_replace('/<[^>]+>/', "", $replace);                   // strip HTML tags
            $replace = preg_replace('/\\d+/', '<sub>$0</sub>', $replace);         // subscript each run of digits
            $leading = preg_replace('/^(\\s*)[\\S\\s]*/', '$1', $m[0][$i]);       // whitespace before the formula
            $trailing = preg_replace('/^[\\S\\s]*?(\\s*)$/', '$1', $m[0][$i]);    // whitespace after the formula
            $replace = $leading . $replace . $trailing;
            $html = str_replace($m[0][$i], $replace, $html);                      // replaces every occurrence of this match

        }

    }

    return $html;
}
?>
SoaperGEM
Thanks! I'll test it out today and mark it as the answer if it works :).
Nathan Adams
What it's doing is looking for any of the elements from the periodic table (case-sensitive at the moment; this would be easy to change if desired), followed by optional whitespace, optional HTML tags, optional whitespace, optional number(s), optional whitespace, optional HTML tags, optional whitespace--and then any repetitions of that (i.e. a series of those). Then it strips out the inner whitespace and places the numbers in <sub> tags.
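
For anyone who wants to see those steps in isolation, here is a cut-down sketch of the same idea. It is not the code from the answer: it assumes a shortened element list, uses `preg_replace_callback` instead of `preg_match_all` plus `str_replace`, and the name `formatFormulasSketch` is made up for the example.

```php
<?php
// Sketch under assumptions: a shortened element list (the answer's regex
// covers the whole periodic table) and a hypothetical function name.
function formatFormulasSketch(string $html): string
{
    // Two-letter symbols come before their one-letter prefixes (Co before C).
    $symbols = 'Co|Cl|Ca|C|Fe|H|Na|N|O|S';
    // One unit: a symbol, then optional whitespace and tags around an optional count.
    $unit    = '(?:' . $symbols . ')\s*(?:<[^>]+>)*\s*\d*\s*(?:<[^>]+>)*';
    // A formula is one or more units, possibly separated by whitespace.
    $regex   = '/(?:' . $unit . '\s*)+/';

    return preg_replace_callback($regex, function (array $m): string {
        $match = $m[0];
        // Remember trailing whitespace so the next word stays separated.
        preg_match('/\s*\z/', $match, $t);
        $formula = preg_replace('/<[^>]+>/', '', $match);              // strip HTML tags
        $formula = preg_replace('/\s+/', '', $formula);                // strip whitespace
        $formula = preg_replace('/\d+/', '<sub>$0</sub>', $formula);   // subscript the counts
        return $formula . $t[0];
    }, $html);
}

echo formatFormulasSketch('The sample was Co O<b>3</b> and H2O.'), "\n";
// prints: The sample was CoO<sub>3</sub> and H<sub>2</sub>O.
```

Like the full version, this can misfire on ordinary text: capital letters that happen to be element symbols are treated as formulas, so "NO SOCKS" would lose its space. It is probably best run only on the fields that actually contain formulas.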
SoaperGEM
Forgot to mention--it strips out the HTML tags too. Also, if you wanted to wrap the whole thing in some sort of special tag that you could format with CSS, you could change the 18th line (the one that builds `$replace` from `$leading` and `$trailing`) to this: `$replace = $leading . '<span class="formula">' . $replace . '</span>' . $trailing;`
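
The same wrapping step can be tried on its own. In the sketch below, `wrapFormula` is a hypothetical helper and the default class name just mirrors the one in the line above; the point is that the leading and trailing whitespace stay outside the span, so CSS such as a background or border would only cover the formula itself.

```php
<?php
// Hypothetical helper (not part of the answer's code): wraps one already
// formatted formula in a span, keeping surrounding whitespace outside the tag.
function wrapFormula(string $leading, string $formula, string $trailing, string $class = 'formula'): string
{
    return $leading
        . '<span class="' . htmlspecialchars($class, ENT_QUOTES) . '">'
        . $formula
        . '</span>'
        . $trailing;
}

echo wrapFormula(' ', 'CoO<sub>3</sub>', ' '), "\n";
```

A rule like `span.formula { font-family: serif; }` in the stylesheet would then apply to every formula at once.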
SoaperGEM
I plugged it into our framework today, and it worked great as far as I could tell. Wherever you work, you deserve a raise :). A system like this was bouncing around in my brain, but I am no regular expression expert, so I couldn't have come up with something as elegant as this.
Nathan Adams