The "re" approach runs out of steam very quickly. Named entity recognition is a very complicated topic, way beyond the scope of an SO answer. If you think you have a good approach to this problem, please point it at Flann O'Brien a.k.a. Myles na cGopaleen, Sukarno, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Köfering und Schönberg.
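To make that challenge concrete, here is a minimal sketch of a test harness. The `NAIVE` pattern, the `HARD_NAMES` list and the `missed` helper are all hypothetical names introduced for illustration, and the list is only a subset of the names above.

```python
import re

# A subset of the awkward names listed above (hypothetical test data).
HARD_NAMES = [
    "Flann O'Brien",
    "Myles na cGopaleen",
    "Sukarno",
    "Harry S. Truman",
    "J. Edgar Hoover",
    "J. K. Rowling",
    "L'Hopital",
    "Joe di Maggio",
    "Algernon Douglas-Montagu-Scott",
]

# Assumed naive pattern: capitalised words separated by single spaces.
NAIVE = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def missed(pattern, names):
    """Return the names that the pattern does not match in full."""
    # Wrap in a group and anchor at the end so partial matches do not count.
    whole = re.compile(r"(?:%s)\Z" % pattern)
    return [n for n in names if not whole.match(n)]


for name in missed(NAIVE, HARD_NAMES):
    print("missed: " + name)
```

With this naive pattern, every name in the list except Sukarno is reported as missed: apostrophes, initials, lower-case particles and hyphens each defeat it.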
Update: The following is an "re"-based approach that finds a lot more valid cases. I still don't think that this is a good approach, though. N.B. I've asciified the Bavarian count's name in my text sample. If anyone really wants to use something like this, they should work in Unicode, and normalise whitespace at some stage (either on input or on output).
import re
text1 = """Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882."""
text2 = """Flann O'Brien a.k.a. Myles na cGopaleen, I Zingari, Sukarno and Suharto, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg."""
pattern1 = r"(?:[A-Z][a-z]+[. ]+)+(?:[A-Z][a-z]+)?"
joiners = r"' - de la du von und zu auf van der na di il el bin binte abu etcetera".split()
pattern2 = r"""(?x)
    (?:
        (?:[ .]|\b%s\b)*
        (?:\b[a-z]*[A-Z][a-z]*\b)?
    )+
    """ % r'\b|\b'.join(joiners)
def get_names(pattern, text):
    for m in re.finditer(pattern, text):
        s = m.group(0).strip(" .'-")
        if s:
            yield s
for t in (text1, text2):
    print "*** text: ", t[:20], "..."
    print "=== Ned B"
    for s in re.finditer(pattern1, t):
        print repr(s.group(0))
    print "=== John M =="
    for name in get_names(pattern2, t):
        print repr(name)
Output:
C:\junk\so>\python26\python extract_names.py
*** text: Conan Doyle said tha ...
=== Ned B
'Conan Doyle '
'Holmes '
'Dr. Joseph Bell'
'Doyle '
'Edinburgh Royal Infirmary. Like Holmes'
'Bell '
'Michael Harrison '
'Ellery Queen'
'Mystery Magazine '
'Wendell Scherer'
'England '
=== John M ==
'Conan Doyle'
'Holmes'
'Dr. Joseph Bell'
'Doyle'
'Edinburgh Royal Infirmary. Like Holmes'
'Bell'
'Michael Harrison'
'Ellery Queen'
'Mystery Magazine'
'Wendell Scherer'
'England'
*** text: Flann O'Brien a.k.a. ...
=== Ned B
'Flann '
'Brien '
'Myles '
'Sukarno '
'Harry '
'Edgar Hoover'
'Joe '
'Algernon Douglas'
'Hugo Max Graf '
'Lerchenfeld '
'Koefering '
'Schoenberg.'
=== John M ==
"Flann O'Brien"
'Myles na cGopaleen'
'I Zingari'
'Sukarno'
'Suharto'
'Harry S. Truman'
'J. Edgar Hoover'
'J. K. Rowling'
"L'Hopital"
'Joe di Maggio'
'Algernon Douglas-Montagu-Scott'
'Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg'
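The earlier remark about working in Unicode and normalising whitespace could be sketched roughly as follows. The `normalise` helper is a hypothetical name, and the choice of NFC as the normal form is an assumption; the input is assumed to be a Unicode string (`str` on Python 3, `unicode` on Python 2).

```python
import re
import unicodedata

# re.UNICODE makes \s cover non-ASCII whitespace such as U+00A0 (no-break space).
_WHITESPACE = re.compile(r"\s+", re.UNICODE)


def normalise(text):
    """Return text in NFC form with each run of whitespace collapsed to one space."""
    # NFC composes base letter + combining mark into a single code point,
    # so 'o' followed by U+0308 becomes U+00F6.
    text = unicodedata.normalize("NFC", text)
    return _WHITESPACE.sub(" ", text).strip()


sample = u"Hugo  Max\n Graf\u00a0von  Ko\u0308fering "
print(repr(normalise(sample)))
```

Note that the `[A-Z]` and `[a-z]` classes in the patterns above are ASCII-only, so they would also need widening before they could match a name such as Köfering.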