ansaurus

Question

Answer 1

A:

I suspect you may be on a hiding to nothing with this, but the following two separate expressions work for your sample data. Each puts the plaintiff or defendant names in its first capturing group.

([ A-Z]+[A-Z])[^A-Z]+Plaintiff

([.,\sA-Za-z]+[.,A-Za-z])[^,A-Z]+Defendants

NB: For the defendants, your sample data includes dots, commas and lowercase characters in the names, so I had to allow these as valid characters.
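As a minimal sketch of how the two expressions could be applied from Python: the caption text and the `extract_parties` helper below are hypothetical (the asker's real sample data is not reproduced here). The captured groups are trimmed because the patterns can pick up leading whitespace and a trailing comma.

```python
import re

# The two patterns from the answer, unchanged.
PLAINTIFF_RE = re.compile(r'([ A-Z]+[A-Z])[^A-Z]+Plaintiff')
DEFENDANTS_RE = re.compile(r'([.,\sA-Za-z]+[.,A-Za-z])[^,A-Z]+Defendants')

# Hypothetical caption with line numbers; not the asker's real data.
SAMPLE = (
    "1 JOHN SMITH,\n"
    "2         Plaintiff,\n"
    "3 vs.\n"
    "4 Acme Widgets, Inc.,\n"
    "5         Defendants.\n"
)

def extract_parties(text):
    """Return (plaintiff, defendants); either is None if not found."""
    def first_group(pattern):
        match = pattern.search(text)
        if match is None:
            return None
        # Trim surrounding whitespace and a trailing comma from the capture.
        return match.group(1).strip().rstrip(',')
    return first_group(PLAINTIFF_RE), first_group(DEFENDANTS_RE)

plaintiff, defendants = extract_parties(SAMPLE)
print(plaintiff)
print(defendants)
```

If neither keyword appears in the input, both values come back as `None`.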

Martin Smith 2010-05-02 17:48:50

Answer 2

+2 A:

I like Martin's answer.
Here's perhaps a more general approach using Python:

import re

# load file into memory 
# (if large files, provide some limit to how much of the file gets loaded)
with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens

# match all sequences of one or more alphanumeric (or underscore) characters 
# when followed by the word `Plaintiff`; this is intentionally general
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren, 
    re.DOTALL|re.MULTILINE)

# join the list separating by whitespace
str_of_matches = ' '.join(list_of_matches)

# split string by runs of digits (line numbers); \d+ avoids empty
# tokens when a line number has more than one digit
tokens = re.split(r'\d+', str_of_matches)

# plaintiffs will be in 2nd-to-last group
plaintiff = tokens[-2].strip()

Tests:

with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d+', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH and JILL SMITH'

with open('no_paren.txt','r') as f:
  no_paren = f.read() # example doc with no parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',no_paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d+', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH'
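The steps above can also be wrapped in a function that works on a string, which makes them easy to try without the text files. This is only a sketch: `find_plaintiff` and the sample caption are hypothetical, and it deviates from the code above in one respect, taking the last non-blank token instead of `tokens[-2]`, on the assumption that a numbered but otherwise empty line may sit between the names and the word `Plaintiff`.

```python
import re

def find_plaintiff(text):
    """Return the text of the last numbered line(s) before 'Plaintiff', or None."""
    # Every word that has 'Plaintiff' somewhere after it in the document.
    words = re.findall(r'(\w+)(?=.*Plaintiff)', text, re.DOTALL)
    # Rejoin the words, then split on runs of digits (the line numbers).
    tokens = re.split(r'\d+', ' '.join(words))
    # Drop tokens that are empty or whitespace only.
    candidates = [t.strip() for t in tokens if t.strip()]
    return candidates[-1] if candidates else None

# Hypothetical caption with two-digit line numbers and a blank numbered line.
sample = (
    "9 SUPERIOR COURT\n"
    "10 JOHN SMITH and JILL SMITH, )\n"
    "11                            )\n"
    "12           Plaintiff,       )\n"
    "13 vs.                        )\n"
)

print(find_plaintiff(sample))
```

Because the lookahead rescans the rest of the text for each word, this is best kept to short documents or to a truncated prefix of a long one.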

Adam Bernier 2010-05-02 18:30:03

Parsing two-dimensional text
