I need to improve a regular expression I'm using. Here is the current pattern:
^[a-zA-Z\s/-]+
I'm using it to pull out medication names from a variety of formulation strings, for example:
- SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
The resulting matches on these examples are:
- SULFAMETHOXAZOLE-TRIMETHOPRIM
- AMOX TR/POTASSIUM CLAVULANATE
- AMOXICILLIN TRIHYDRATE
- AMOX TR/POTASSIUM CLAVULANATE
- Amoxicillin
The first four are what I want, but on the fifth, I really need "Amoxicillin / Clavulanate".
How would I pull out patterns like "Amoxicillin / Clavulanate" (in the fifth row) while skipping patterns like "MG/5ML" (in the first row)?
Update
Thanks for the help, everyone. Here's a longer list of examples with more nuances of the data:
- Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
- Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
- Amoxicillin 10 MG/ML Oral Suspension
- Amoxil 10 MG/ML Oral Suspension
- AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
- AMOXAPINE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- CARBATROL 200 MG PO CP12
- CARBATROL 200 MG PO CP12
- CARBATROL
- CARBAMAZEPINE 100 MG PO CHEW
- CEFDINIR 250 MG/5ML PO SUSR
- AMOXICILLIN 400 MG/5ML PO SUSR
- SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP
- DIAZEPAM 2 MG PO TABS
- DIAZEPAM
- PREDNISONE 20 MG PO TABS
- AUGMENTIN 250-62.5 MG/5ML PO SUSR
- ACETAMINOPHEN 325 MG/10.15ML PO SUSP
What I've done for now is this:
// requires: using System.Text.RegularExpressions;
private static string GetMedNameFromIncomingConceptString(string conceptAsString)
{
    // look for a match at the beginning of the string
    Match firstRegMatch = new Regex(@"^[a-zA-Z\s/-]+").Match(conceptAsString);
    if (firstRegMatch.Success)
    {
        // grab the matched text (the pattern is anchored, so it starts at index 0)
        string firstPart = firstRegMatch.Value;
        // look for an additional match following a slash (like Amox 1000 / Clav 50),
        // starting the search where the first match ended
        Match secondRegMatch = new Regex(@"/\s[a-zA-Z\s/-]+").Match(conceptAsString, firstRegMatch.Length);
        if (secondRegMatch.Success)
            return firstPart + secondRegMatch.Value;
        else
            return firstPart;
    }
    else
    {
        // no leading name found: return the input unchanged
        return conceptAsString;
    }
}
It's pretty ugly, and I imagine it may fail when I run a lot more data through it, but it works for the larger set of cases I listed above.
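For comparison, the same two-step idea can probably be folded into a single pattern with named groups. The following is only a sketch under some assumptions: the class name `MedNameExtractor`, the method name `GetMedName`, and the group names `first` and `second` are made up for illustration, it handles at most one extra "/ Name" component, and unlike the method above it trims whitespace and always joins the two parts with " / ". It relies on the same observation as before: a slash that introduces a second drug name is followed by whitespace, while a slash inside a unit like MG/5ML is not.

```csharp
using System.Text.RegularExpressions;

public static class MedNameExtractor
{
    // first:  the leading run of letters, whitespace, slashes and hyphens
    // second: optional; skips ahead to a slash that is followed by whitespace
    //         and captures the letters after it (hypothetical group names)
    private static readonly Regex NamePattern = new Regex(
        @"^(?<first>[a-zA-Z\s/-]+)(?:[^/]*/\s+(?<second>[a-zA-Z][a-zA-Z\s-]*))?",
        RegexOptions.Compiled);

    public static string GetMedName(string conceptAsString)
    {
        Match m = NamePattern.Match(conceptAsString);
        if (!m.Success)
            return conceptAsString; // no leading name: return the input unchanged

        string first = m.Groups["first"].Value.Trim();
        Group second = m.Groups["second"];

        // Join the two names only when a "/ Name" component was actually found.
        return second.Success ? first + " / " + second.Value.Trim() : first;
    }
}
```

On the examples listed above this should give "Amoxicillin / Clavulanate" for the extended release tablet rows and leave names such as "SULFAMETHOXAZOLE-TRIMETHOPRIM" and "AMOX TR/POTASSIUM CLAVULANATE" intact, since "MG/5ML" has no whitespace after the slash. It has not been run against a larger data set, so the same caveat about unseen formats applies.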