views:

104

answers:

2

I need to parse text files where relevant information is often spread across multiple lines in a nonlinear way. An example:

1234
 1         IN THE SUPERIOR COURT OF THE STATE OF SOME STATE           
 2              IN AND FOR THE COUNTY OF SOME COUNTY                
 3                      UNLIMITED JURISDICTION                        
 4                            --o0o--                                 
 5                                                                    
 6   JOHN SMITH AND JILL SMITH,         )                             
                                        )                             
 7                  Plaintiffs,         )                             
                                        )                             
 8        vs.                           )     No. 12345
                                        )                             
 9   ACME CO, et al.,                   )                             
                                        )                             
10                  Defendants.         )                             
     ___________________________________)                             

I need to pull out Plaintiff and Defendant identities.

These transcripts have a very wide variety of formattings, so I can't always count on those nice parentheses being there, or the plaintiff and defendant information being neatly boxed off, e.g.:

 1        SUPREME COURT OF THE STATE OF SOME OTHER STATE
                      COUNTY OF COUNTYVILLE
 2                  First Judicial District
                     Important Litigation
 3  --------------------------------------------------X
    THIS DOCUMENT APPLIES TO:
 4
    JOHN SMITH,
 5                            Plaintiff,          Index No.
                                                  2000-123
 6
                                            DEPOSITION
 7                  - against -             UNDER ORAL
                                            EXAMINATION
 8                                              OF
                                            JOHN SMITH,
 9                                           Volume I

10  ACME CO,
    et al,
11                            Defendants.

12  --------------------------------------------------X

The two constants are:

  1. "Plaintiff" will occur after the name of the plaintiff(s), but not necessarily on the same line.
  2. Plaintiffs and defendants' names will be in upper case.

Any ideas?

A: 

I suspect that you may well be on a bit of a hiding to nothing with this but the following 2 separate expressions work for your sample data (putting the plaintiff/defendants in the first capturing group).

([ A-Z]+[A-Z])[^A-Z]+Plaintiff

([.,\sA-Za-z]+[.,A-Za-z])[^,A-Z]+Defendants

NB: For the Defendants your sample data includes dots, commas and lower case characters in the names so I had to include these as valid characters

Martin Smith
+2  A: 

I like Martin's answer.
Here's perhaps a more general approach using Python:

import re

# load file into memory 
# (if large files, provide some limit to how much of the file gets loaded)
with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens

# match all sequences of one or more alphanumeric (or underscore) characters 
# when followed by the word `Plaintiff`; this is intentionally general
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren, 
    re.DOTALL|re.MULTILINE)

# join the list separating by whitespace
str_of_matches = ' '.join(list_of_matches)

# split string by digits (line numbers)
tokens = re.split(r'\d',str_of_matches)

# plaintiffs will be in 2nd-to-last group
plaintiff = tokens[-2].strip()

Tests:

with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)>>> tokens = re.split(r'\d', str_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH and JILL SMITH'

with open('no_paren.txt','r') as f:
  no_paren = f.read() # example doc with no parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',no_paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH'
Adam Bernier