What would be the best definition of an English word?

What cases should a definition of an English word cover beyond just \w+? Some would include \w+-\w+ or \w+'\w+; some would exclude cases like \b[0-9]+\b. But I haven't seen any general consensus on these cases. Do we have a formal definition of such? Can any of you clarify?

(Edit: broadened the question so it doesn't depend on regexps only.)
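For concreteness, here is a small Python sketch comparing the candidate patterns above (the sample sentence and the exact patterns are illustrative choices, not a standard):

    import re

    text = "The 49th state's well-known exports, worth $1,023, aren't listed."

    # Three candidate definitions of a "word", from the question above.
    patterns = [
        (r"\w+", "plain word characters"),
        (r"\w+(?:-\w+)*", "allow internal hyphens"),
        (r"\w+(?:['-]\w+)*", "allow hyphens and apostrophes"),
    ]

    for pat, desc in patterns:
        print(desc, "->", re.findall(pat, text))

The hyphen-aware patterns keep well-known together, only the last also keeps state's and aren't intact, and all three happily match the digits of 1,023 (as 1 and 023) unless numbers are filtered out separately.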

+6  A: 

I really don't think a regex is going to help you here. The problem with English (or any language, for that matter) text is context: without it you can't be sure whether what's between the word boundaries is a word, a number, a random collection of characters, etc. For NLP I think you are going to be selecting a subset of the language and looking for specific words rather than trying to extract all 'words' from a string.

Lazarus
The problem is that you do not know what the subset of the English language would look like, as new terms are coming in every second. Therefore you have to resort to a regexp-like solution, which would work fine if the corpus is relatively clean, like, say, English Wikipedia.
OTZ
If you don't know the subset you are actually going to process then how are you going to act on what you've processed? Perhaps you need to indicate what the objective of your NLP is.
Lazarus
All this fuss over regexes and NLP - does it really matter? Either way, the corpus is going to have to be broken down into words. Of course, you'd keep track of the punctuation if you want to stand a chance of processing it.
Will A
@Will A: I would say that regular expressions do matter, that's what the OP asked about. Yes, I completely agree the body of text will need breaking down into words for processing. This question is about how to do that reliably. Given the multitude of punctuation symbols and how they affect the context of the words it's vital that context is, somehow, maintained. I want to say that a state machine would be a better parser for the text but I feel even that has limitations and would probably only form one pass of the text, preparing it for a subsequent NLP engine.
Lazarus
A: 

A true English word will almost never contain accents or foreign characters - so \w+ might capture more than you're after, although there are a number of words used in English that we've borrowed from other languages - most of us probably don't have the time or inclination to bother accenting them, tho'. I was even too lazy to write 'though' out in full there - \w+'\w+ wouldn't capture that. In general, so long as your \w+ is capturing your words correctly, I can't think of any other punctuation on top of - and ' that might be encountered mid-word.
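For what it's worth, a hedged sketch of that idea in Python (the trailing-apostrophe handling for cases like "tho'" is my own assumption, not an established rule):

    import re

    # Letters, optionally joined by internal hyphens/apostrophes,
    # plus an optional trailing apostrophe as in "tho'".
    WORD = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*'?")

    print(WORD.findall("I was too lazy to write 'though' out in full, tho'."))

One caveat this immediately surfaces: a closing quote after "though" is indistinguishable from a trailing apostrophe, so the match comes back as "though'". That is the context problem from the first answer all over again.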

Will A
I think the only addition to that list is ampersand, used in some company names and where people mistakenly want to ensure that it's not confused with an adjacent 'and' in a list. Good summation.
Lazarus
A naïve regex approach will do then? :)
Hightechrider
Quite a lot of true English words contain accents and non-English characters, which are omitted due to ignorant education and, more and more, due to the native character set of the English keyboard. Even on the increasingly common Mac OS, where "special" characters are easy to access, many people are unaware of how to use them, both linguistically and technologically.
eyelidlessness
@eyelidlessness: Really? I suppose we (I'm a Brit) have adopted a lot of foreign words, and so they are true English at the end of the day anyway. I can only think of façade as having a foreign character in it - I'd be interested in any others you can think of (a little off-topic I know!). Note I'm considering accented characters as different to 'foreign characters' - which is of course not strictly correct. :)
Will A
I think we have a "no true Scotsman" fallacy here. There is no such thing as a "true English word" -- almost all words have been borrowed from other languages at some point, or influenced by them. Even the very core words come from older languages, like Old English, Gaelic or Latin -- languages don't appear from a vacuum. Sure, the spelling of words changes as they are used more and more: accents are dropped, word compounds are joined with a dash and later without it, pronunciation becomes more like the other words. So you could perhaps say that different words have different levels of "truth".
Radomir Dopieralski
@Will, résumé comes to mind. Accented characters are absolutely "foreign" to the English character set, even if they're included in the "Latin" character set (of which English is in some ways a member).
eyelidlessness
@Radomir, that is fair, but the only reason that the English "résumé" and the English (and inferred by context) "resume" are the same word is because of poor education and the widespread use of typing accoutrements which make typing the word "correctly" difficult. The distinction is worthwhile (and undermines the "fallacy") because the form without accents is easily mistakable for the verb, "resume", and causes cognitive dissonance in nearly all readers.
eyelidlessness
For additional anecdotal evidence, we have a local bar in Seattle called Barça, and quite a lot of their printed material says Barca. It's probably pure snobbery that the name is spelled the way it is, given the context of a culture which has rarely if ever used that character, but the distinction arises as most of the people I know pronounce the name "Barkah".
eyelidlessness
ç should turn it into "Barsah" so it is probably just a visual affectation.
neil
+1  A: 

Let's be concrete and try to solidify the ground with examples.

Is 'word' an English word?  YES

49th?  YES

NYSE?  YES

Résumé?  YES

Haight-Ashbury? YES/NO?

good-looking?  YES/NO?

P&G?  YES/NO?

1023?  YES/NO?

304-392-9999?  YES/NO?

3.14?  YES/NO?
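A quick way to see where pure pattern matching lands on this list (both patterns below are illustrative choices, not authoritative):

    import re

    candidates = ["word", "49th", "NYSE", "Résumé", "Haight-Ashbury",
                  "good-looking", "P&G", "1023", "304-392-9999", "3.14"]

    plain    = re.compile(r"^\w+$")               # the bare \w+ definition
    extended = re.compile(r"^\w+(?:[-&']\w+)*$")  # allow -, &, ' inside

    for c in candidates:
        print(c, bool(plain.match(c)), bool(extended.match(c)))

The extended pattern accepts Haight-Ashbury, good-looking and P&G, but it also accepts 304-392-9999, which is exactly why the YES/NO calls above cannot be made by a pattern alone.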
OTZ
Numbers are NOT words. Resume is NOT written like that. An abbreviation is NOT a word. I think that only 'word' is a word in your list.
Tomas
@Tomas: [Résumé](http://en.wikipedia.org/wiki/Resume).
Paul Ruane
Edited: 'résumé' -> 'Résumé'. This is a community wiki. You guys can edit it too.
OTZ
+3  A: 

The best way to check whether a word is English is to look it up in a dictionary: if it's in a dictionary of English words, then it is an English word. It is possible for a word to be in both an English dictionary and a French dictionary; for example, 'me' is both a French and an English word.

I'm sure you can find lots of downloadable dictionaries online. You can also make your own: for example, you could download the English version of Wikipedia and assume that all words found there are English words. You may or may not want to filter out numbers.
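A minimal sketch of the lookup approach (the word-list path is an assumption; it's common on Unix systems, but any downloaded list works the same way):

    # Load a newline-delimited word list into a set for O(1) lookups.
    with open("/usr/share/dict/words") as f:
        english = {line.strip().lower() for line in f}

    def is_english_word(token):
        return token.lower() in english

    print(is_english_word("me"))      # True (and also a French word)
    print(is_english_word("xyvfg"))   # False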

A regular expression will not tell you whether a word is English. For instance, xyvfg matches your pattern \w+ but is certainly not an English word.

Edit: In theory, using English phonology, it could be possible to tell whether a phonetic transcription of a word is pronounceable by an English speaker. There are lots of words pronounceable by English speakers which are not actually English words, so this could take into account words that may appear in the English language in the future. However, translating between a phonetic transcription and text is quite a challenging problem, as there can be many different spellings of the same phonetic transcription. I don't know if anyone has done anything like this. It could be an interesting theoretical exercise, but I'm not sure it would be very useful in real-world NLP.
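To make the idea tangible, here is a toy filter in that spirit; the vowel set and the cluster threshold are arbitrary assumptions, nothing like real English phonology (it wrongly rejects "rhythm", for instance):

    import re

    def plausibly_pronounceable(word):
        # Crude stand-in for phonotactics: require a vowel and
        # forbid runs of four or more consonants.
        w = word.lower()
        if not re.search(r"[aeiou]", w):
            return False              # no vowel nucleus at all
        return not re.search(r"[^aeiou]{4,}", w)

    for w in ("word", "string", "xyvfg"):
        print(w, plausibly_pronounceable(w))

Even this crude check rejects xyvfg while letting ordinary words through, but it says nothing about whether a pronounceable string actually is a word.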

Jay Askren
There are new English words being created every second. Your dictionary method does not capture them.
OTZ
A regular expression cannot tell you that xyvfg is not an English word.
Jay Askren
There is no possible way to predict what words will enter the English language tomorrow or next year. The only thing you can do is update your dictionary as words enter the English language. One way to do that is to constantly generate a new dictionary from a live corpus (Wikipedia is one example).
Jay Askren
+1  A: 

http://www.sussex.ac.uk/linguistics/documents/essay_-_what_is_a_word.pdf

Tommy Herbert
Can you elaborate on the essay? Who wrote it? Is it authoritative?
OTZ
Yes, it's a reliable source. It was written by Larry Trask, who's a professor at Sussex University.
Tommy Herbert
A: 

Your problem is called word tokenization. Take a look here:
http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

Stanford is a very famous NLP laboratory; they produce one of the most efficient parsers for English. The page outlines some common tokenization problems, such as:

  • Unusual domain-specific tokens: M*A*S*H, C++, IP addresses ...
  • Hyphenation: co-education, Hewlett-Packard
  • Collocations: San Francisco, Los Angeles
  • Specific syntax ...
    • Advertisements for air fares: "San Francisco-Los Angeles"
    • Omitted spaces, etc.

The Penn Treebank Project also provides a simple sed script for word tokenization "that does a decent enough job on most corpora" here.
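If you want to try that style of tokenization without the sed script, NLTK ships a re-implementation of the Penn Treebank tokenizer (shown here as one readily available option, not as the script itself):

    from nltk.tokenize import TreebankWordTokenizer  # pip install nltk

    tokenizer = TreebankWordTokenizer()
    print(tokenizer.tokenize(
        "Hewlett-Packard doesn't sell co-education plans in San Francisco."))

Hyphenated tokens such as Hewlett-Packard and co-education stay intact, while contractions split Treebank-style into "does" plus "n't"; collocations like San Francisco still come out as two tokens, which is where the Stanford-style NLP machinery takes over.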

Ugo