I am trying to parse some text files into a database, and there is a string that includes two pieces of information. There are a few options for what the string can look like. It can be a single word, like `Word`, or it can have that first word, followed by a dash, followed by any number of other words, like `Word - Second`. The key, though, is that if the string ends in a number, like `Word - Second 4`, or in two numbers separated by a slash, like `Word - Second 2/3`, then those numbers need to be put into a different variable.

I do NOT know enough about regex to do this one. Help? (with explanations?)

+2  A: 

I think you might be looking for something like this:

^([a-zA-Z]+(?: *- *[a-zA-Z]+(?: +[a-zA-Z]+)*)?)(?: +(\d+(?:\/\d+)?))?$

Explanation:

^               Start of line
(               First capturing group (for the words)
  [a-zA-Z]+     A word
  (?:...)?      (Omitted for clarity)
)               Close first group
(?:             Start non-capturing group
  \s+           Some whitespace
  (             Second capturing group (for the numbers)
    \d+         A number
    (?:\/\d+)?  Optionally a slash followed by another number
  )             Close capturing group
)?              Close optional non-capturing group
$               End of line

I omitted an explanation of this part above: (?: *- *[a-zA-Z]+(?: +[a-zA-Z]+)*)?. It optionally matches a dash (with optional spaces on either side) followed by one or more space-separated words. I also wrote \s in the explanation instead of a literal space because the space is invisible. Note that \s matches any whitespace, including newlines, so you may prefer to match only spaces, as the actual pattern does.
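
To make the two capturing groups concrete, here is a minimal Ruby sketch of how the pattern could be used to separate the words from the trailing numbers. The constant `PATTERN` and the helper `split_label` are names made up for this illustration; they are not part of the original answer. Also keep in mind that in Ruby `^` and `$` anchor at line boundaries rather than string boundaries, so this assumes one entry per string.

```ruby
# The pattern from the answer as a Ruby regex literal.
# The forward slash has to be escaped inside /.../.
PATTERN = /^([a-zA-Z]+(?: *- *[a-zA-Z]+(?: +[a-zA-Z]+)*)?)(?: +(\d+(?:\/\d+)?))?$/

# Hypothetical helper: returns [words, numbers] for a matching string,
# where numbers is nil if the string has no trailing number.
# Returns nil when the string does not match at all.
def split_label(text)
  m = PATTERN.match(text)
  return nil unless m
  [m[1], m[2]]
end

p split_label("Word")                             # => ["Word", nil]
p split_label("Word - Second")                    # => ["Word - Second", nil]
p split_label("Word - Second 4")                  # => ["Word - Second", "4"]
p split_label("Word - Second 2/3")                # => ["Word - Second", "2/3"]
p split_label("Word - Second Third Seventh 3/3")  # => ["Word - Second Third Seventh", "3/3"]
```

Group 1 holds the words and group 2 holds the number part, so the two pieces can go into separate variables or database columns.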

Rubular

Mark Byers
You should escape your forward slashes.
Stefan Kendall
Umm...yeah...that makes no sense at all to me. Here's hoping it works!
Pselus
Had to escape the forward slash as Stefan said, which made me realize I didn't provide enough information. I don't need to find the strings that fit those criteria, I need to find the numbers inside those strings and pull them out.
Pselus
@Pselus: Click the rubular link I provided and look on the right hand side of the page: all the words and numbers are "pulled out". Did you notice that? Is that what you want?
Mark Byers
@Mark Byers: I think that is what I want. But when I use it with the gsub function in Ruby, I just get the strings that fit the criteria and nothing else. I think the problem now is that I don't know enough Ruby. :) Thanks!
Pselus
There is another problem with it as well. The word after the dash can be any number of words. So it could look like `Word - Second Third Seventh 3/3`. This Regex is only grabbing lines that have a single word after the dash.
Pselus
@Pselus: But only one dash?
Mark Byers
Yes, only one dash, but any number of words separated by a space each after the dash.
Pselus
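
On the gsub question raised in the comments: gsub replaces text, so it is not the natural tool for pulling captures out. A possible alternative, sketched below, is to call `match` on each line and read the groups from the result. The names `LABEL` and `parse_lines` are hypothetical, and this version swaps `^`/`$` for `\A`/`\z` (whole-string anchors) as an assumption, since each line is handled on its own after `chomp`. Splitting the number part on the slash into integers is also just one guess at what "a different variable" might mean.

```ruby
# Same pattern as the answer, but anchored to the whole string with \A and \z.
LABEL = /\A([a-zA-Z]+(?: *- *[a-zA-Z]+(?: +[a-zA-Z]+)*)?)(?: +(\d+(?:\/\d+)?))?\z/

# Hypothetical helper: parses each line of a text and returns a list of
# hashes with the words and the trailing numbers (an empty array if none).
# Lines that do not match the pattern are skipped.
def parse_lines(text)
  text.each_line.map do |line|
    m = LABEL.match(line.chomp)
    next nil unless m
    numbers = m[2] ? m[2].split("/").map(&:to_i) : []
    { words: m[1], numbers: numbers }
  end.compact
end

sample = "Word\nWord - Second 4\nWord - Second Third Seventh 3/3\n"
parse_lines(sample).each { |row| p row }
```

This also shows the multi-word case from the comments: everything after the single dash, up to the optional trailing numbers, stays in the first group.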