NOTE: The python code here is not correct! It is just a rough pseudo-code of how it might look.
Regular Expressions are good at finding and extracting data from text in a fixed format (e.g. a DD/MM/YYYY date).
A lexer/parser pair is good at processing data in a structured, but somewhat variable format. Lexers split text into tokens. These tokens are units of information of a given type (number, string, etc.). Parsers take this series of tokens and does something depending on the order of the tokens.
Looking at the data, you have a basic (subject, verb, object) structure in different combinations for the relation (person, 'birthday', date):
I would handle 29/9/10 and 24-9-2010 as a single token using a regex, returning it as a date type. You could probably do the same for the other dates, with a map to convert September and sep to 9.
You could then return the everything else as strings (separated by whitespace).
You then have:
- date ',' string 'birthday'
- string 'birthday' ',' date
- date 'birthday' 'of' string string
- date ':' string string 'birthday'
- string string 'birthday' date
NOTE: 'birthday', ',', ':' and 'of' here are keywords, so:
class Lexer:
DATE = 1
STRING = 2
COMMA = 3
COLON = 4
BIRTHDAY = 5
OF = 6
keywords = { 'birthday': BIRTHDAY, 'of': OF, ',': COMMA, ':', COLON }
def next_token():
if have_saved_token:
have_saved_token = False
return saved_type, saved_value
if date_re.match(): return DATE, date
str = read_word()
if str in keywords.keys(): return keywords[str], str
return STRING, str
def keep(type, value):
have_saved_token = True
saved_type = type
saved_value = value
All except 3 use the possessive form of the person ('s
if the last character is a consonant, s
if it is a vowel). This can be tricky, as 'Alexis' could be the plural form of 'Alexi', but since you are restricting where plural forms can be, it is easy to detect:
def parseNameInPluralForm():
name = parseName()
if name.ends_with("'s"): name.remove_from_end("'s")
elif name.ends_with("s"): name.remove_from_end("s")
return name
Now, name can either be first-name
or first-name last-name
(yes, I know Japan swaps these around, but from a processing perspective, the above problem does not need to differentiate first and last names). The following will handle these two forms:
def parseName():
type, firstName = Lexer.next_token()
if type != Lexer.STRING: raise ParseError()
type, lastName = Lexer.next_token()
if type == Lexer.STRING: # first-name last-name
return firstName + ' ' + lastName
else:
Lexer.keep(type, lastName)
return firstName
Finally, you can process forms 1-5 using something like this:
def parseBirthday():
type, data = Lexer.next_token()
if type == Lexer.DATE: # 1, 3 & 4
date = data
type, data = Lexer.next_token()
if type == Lexer.COLON or type == Lexer.COMMA: # 1 & 4
person = parsePersonInPluralForm()
type, data = Lexer.next_token()
if type != Lexer.BIRTHDAY: raise ParseError()
elif type == Lexer.BIRTHDAY: # 3
type, data = Lexer.next_token()
if type != Lexer.OF: raise ParseError()
person = parsePerson()
elif type == Lexer.STRING: # 2 & 5
Lexer.keep(type, data)
person = parsePersonInPluralForm()
type, data = Lexer.next_token()
if type != Lexer.BIRTHDAY: raise ParseError()
type, data = Lexer.next_token()
if type == Lexer.COMMA: # 2
type, data = Lexer.next_token()
if type != Lexer.DATE: raise ParseError()
date = data
else:
raise ParseError()
return person, date