ansaurus

Question

lexical analyse or series of regular expressions to parse unstructured text into structured form

Answer 1

+1 A:

NOTE: The python code here is not correct! It is just a rough pseudo-code of how it might look.

Regular Expressions are good at finding and extracting data from text in a fixed format (e.g. a DD/MM/YYYY date).

A lexer/parser pair is good at processing data in a structured, but somewhat variable format. Lexers split text into tokens. These tokens are units of information of a given type (number, string, etc.). Parsers take this series of tokens and does something depending on the order of the tokens.

Looking at the data, you have a basic (subject, verb, object) structure in different combinations for the relation (person, 'birthday', date):

I would handle 29/9/10 and 24-9-2010 as a single token using a regex, returning it as a date type. You could probably do the same for the other dates, with a map to convert September and sep to 9.

You could then return the everything else as strings (separated by whitespace).

You then have:

date ',' string 'birthday'
string 'birthday' ',' date
date 'birthday' 'of' string string
date ':' string string 'birthday'
string string 'birthday' date

NOTE: 'birthday', ',', ':' and 'of' here are keywords, so:

class Lexer:
    DATE = 1
    STRING = 2
    COMMA = 3
    COLON = 4
    BIRTHDAY = 5
    OF = 6

    keywords = { 'birthday': BIRTHDAY, 'of': OF, ',': COMMA, ':', COLON }

    def next_token():
        if have_saved_token:
            have_saved_token = False
            return saved_type, saved_value
        if date_re.match(): return DATE, date
        str = read_word()
        if str in keywords.keys(): return keywords[str], str
        return STRING, str

    def keep(type, value):
        have_saved_token = True
        saved_type = type
        saved_value = value

All except 3 use the possessive form of the person ('s if the last character is a consonant, s if it is a vowel). This can be tricky, as 'Alexis' could be the plural form of 'Alexi', but since you are restricting where plural forms can be, it is easy to detect:

def parseNameInPluralForm():
    name = parseName()
    if name.ends_with("'s"): name.remove_from_end("'s")
    elif name.ends_with("s"): name.remove_from_end("s")
    return name

Now, name can either be first-name or first-name last-name (yes, I know Japan swaps these around, but from a processing perspective, the above problem does not need to differentiate first and last names). The following will handle these two forms:

def parseName():
    type, firstName = Lexer.next_token()
    if type != Lexer.STRING: raise ParseError()
    type, lastName = Lexer.next_token()
    if type == Lexer.STRING: # first-name last-name
        return firstName + ' ' + lastName
    else:
        Lexer.keep(type, lastName)
        return firstName

Finally, you can process forms 1-5 using something like this:

def parseBirthday():
    type, data = Lexer.next_token()
    if type == Lexer.DATE: # 1, 3 & 4
        date = data
        type, data = Lexer.next_token()
        if type == Lexer.COLON or type == Lexer.COMMA: # 1 & 4
            person = parsePersonInPluralForm()
            type, data = Lexer.next_token()
            if type != Lexer.BIRTHDAY: raise ParseError()
        elif type == Lexer.BIRTHDAY: # 3
            type, data = Lexer.next_token()
            if type != Lexer.OF: raise ParseError()
            person = parsePerson()
    elif type == Lexer.STRING: # 2 & 5
        Lexer.keep(type, data)
        person = parsePersonInPluralForm()
        type, data = Lexer.next_token()
        if type != Lexer.BIRTHDAY: raise ParseError()
        type, data = Lexer.next_token()
        if type == Lexer.COMMA: # 2
            type, data = Lexer.next_token()
        if type != Lexer.DATE: raise ParseError()
        date = data
    else:
        raise ParseError()
    return person, date

reece 2010-06-30 23:59:49

ansaurus

tags:

views:

answers:

lexical analyse or series of regular expressions to parse unstructured text into structured form

related questions