I have a .NET application where, given a noun, I want it to correctly prefix that word with "a" or "an". How would I do that?

Before you think the answer is to simply check if the first letter is a vowel, consider phrases like:

  • an honest mistake
  • a used car
+9  A: 

You have to implement it manually and add the exceptions you want, for example when the first letter is 'H' followed by an 'O', as in honest, hour ... and also the opposite cases like Europe, university, used ...

Ahmad Farid
But words like "hopper" will break this rule...
Corey Ross
yeah true man. I guess I was mistaken in that. It has no rule at all
Ahmad Farid
+11  A: 

You need to use a list of exceptions. I don't think all of the exceptions are well defined, because it sometimes depends on the accent of the person saying the word.

One stupid way is to ask Google for the two possibilities (using one of the search APIs) and use the more popular:

Or:

Therefore "a europe" and "an honest" are the correct versions.

rjmunro
Is that actually permitted use or is this asking to be banned? Regular such use is certainly frowned upon IIRC.
Eamon Nerbonne
@Eamon: Interesting point. What if the application kept a record of all the words it has previously googled, so it only has to google once for each new word it encounters? Would that still be a questionable usage of Google?
gnovice
Aside from the obvious technical difficulties (the use of a search engine output in an automated manner like this is not allowed and will be blocked rather quickly), this does not solve the problem in a correct way - at worst it will duplicate common misuse of syntax.
Guss
At worst? There's a pretty strong argument to be made that duplicating "common misuse" is exactly what a natural-language system should strive for. See David Foster Wallace's essay "Authority and American Usage", in _Consider the Lobster_. There are better corpora to use than Google, but that's a different issue.
Robert Rossney
Could you ask googlefight, rather than google?
Andrew Grimm
This doesn't work for numerous examples - "a hotel" and "an hotel" says "a hotel" has more hits, when use of an is correct. Same for "a heroine" and "an heroine" - a has more hits but an is correct.
Callum Rogers
"a hotel" and "a heroine" both seem right to me. I guess you are coming from a slightly cockney accent perspective. Different accents mean that there is no right answer to some of these words.
rjmunro
"an hotel" is a peculiar usage - at one stage very high status RP users (and a few still do this) did not pronounce aitch at the beginning of certain words of foreign origin: you can still hear this in "homage" which is pronounced both ways. The use of "an" in English has always been to do with the following sound, so "an homage"/"a homage" is correct depending on what you say. But there grew up a fashion to use (and even - God forbid - to say) "an hotel" and "an historic" and a small set of such words, though the set is not well defined. Stick to "a hotel".
Francis Davey
A: 

I would use a rule-based algorithm to cover as many as I could, then use a list of exceptions. If you wanted to get fancy, you could try to determine some new "rules" from your exception list.

A. L. Flanagan
A: 

It just looks like a set of heuristics. It needs to be a bit more complicated and answer some things for which I never got a good answer, for example how do you treat abbreviations ("a RPM" or "an RPM"? I always thought the latter makes more sense).
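One way to make the abbreviation case concrete is to decide by the spoken name of the first letter. This is only a sketch in Python: it assumes the abbreviation is read out letter by letter (so not word-like acronyms such as "NASA"), that "H" is pronounced "aitch", and the function name is made up for illustration.

```python
# Letters whose spoken names begin with a vowel sound ("ef", "aitch", "em", ...).
# "U" is deliberately absent: its name starts with a "y" sound.
# Assumption: "H" is pronounced "aitch"; speakers who say "haitch" would drop it.
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")


def article_for_initialism(abbrev: str) -> str:
    """Pick "a" or "an" for an abbreviation that is read letter by letter."""
    first = abbrev.strip()[:1].upper()
    return "an" if first in VOWEL_SOUND_LETTERS else "a"


for abbrev in ("RPM", "UGC", "FBI", "CPU"):
    print(article_for_initialism(abbrev), abbrev)
```

Under those assumptions this gives "an RPM" and "a UGC", which matches how the letters are spoken rather than how they are spelled.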

A quick search yielded no linguistic libraries that talk about how to handle the English singular prefix, but you can probably find something if you dig deep enough. And if not - you can always write your own inflection library and gain world fame :-) .

Guss
Abbreviations like RPM are not a problem. As you say they can be treated either way. Hence the solution is obvious: ignore them.
Andrew J. Brehm
I would not agree because that causes inconsistent prefixing. Just ignoring it would cause "a RPM" and "an UGC" which is clearly wrong.
Guss
A: 

I don't suppose you can just fill in some boilerplate like 'a/an' as a one-step cover-all. Otherwise you will end up with assumption errors, such as every word with 'h' followed by 'o' getting 'an' instead of 'a', like 'home' - (an home?). Basically, you will end up either including the logic of the English language or occasionally hitting rare cases that will make you look foolish.

A: 

Check whether a word starts with a vowel or a consonant. A "u" is generally a consonant and a vowel ("yu"), hence belongs in the consonant group for your purposes.

The letter "h" stands for a glottal stop (a consonant) in French and in French words used in English. You can make a list of those (in fact, including "honor", "honour", and "hour" might be sufficient) and count them as starting with vowels (since English doesn't recognise a glottal stop).

Also count "eu" as a consonant etc.

It's not too difficult.

Andrew J. Brehm
+3  A: 

Since the choice between "a" and "an" is determined by phonetic rules and not spelling conventions, I would probably do it like this:

  1. If the first letter of the word is a consonant -> 'a'
  2. If the first letter of the word is a vowel -> 'an'
  3. Keep a list of exceptions (hour, x-ray, honest) as rjmunro says.
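
The three steps above can be sketched in a few lines of Python. The exception sets here are hypothetical and deliberately tiny, and the function name is made up. The exceptions are checked first so that they can override the letter rule.

```python
# Hypothetical, deliberately tiny exception lists; a real one would be longer.
AN_EXCEPTIONS = {"hour", "honest", "honor", "honour", "heir", "x-ray"}
A_EXCEPTIONS = {"used", "user", "university", "european", "one"}

VOWEL_LETTERS = tuple("aeiou")


def indefinite_article(word: str) -> str:
    """Return "a" or "an": exception lists first, then the first-letter rule."""
    w = word.strip().lower()
    if w in AN_EXCEPTIONS:
        return "an"
    if w in A_EXCEPTIONS:
        return "a"
    # Steps 1 and 2: decide on the first letter alone.
    return "an" if w.startswith(VOWEL_LETTERS) else "a"


for noun in ("honest", "used", "apple", "car"):
    print(indefinite_article(noun), noun)
```

How far this gets you depends entirely on how complete the exception sets are.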
Patrik
+1  A: 

Note that there are differences between American and British dialects, as Grammar Girl pointed out in her episode A Versus An.

One complication is when words are pronounced differently in British and American English. For example, the word for a certain kind of plant is pronounced “erb” in American English and “herb” in British English. In the rare cases where this is a problem, use the form that will be expected in your country or by the majority of your readers.

Jan Aagaard
A: 

just put 'a' and maybe over time the silly language will change. :)

kenny
+11  A: 

If you could find a source of word spellings to word pronunciations, like:

"honest":"on-ist"
"horrible":"hawr-uh-buhl, hor-"

You could base your decision on the first character of the spelled pronunciation string. For performance, perhaps you could use such a lookup to pre-generate exception sets and use those smaller lookup sets during execution instead.

Edited to add:

!!! - I think you could use this to generate your exceptions: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Not everything will be in the dictionary, of course - meaning not every possible exception would wind up in your exceptions sets - but in that case, you could just default to an for vowels/ a for consonants or use some other heuristic with better odds.

(Looking through the CMU dictionary, I was pleased to see it includes proper nouns for countries and some other places - so it will handle examples like "a Ukrainian", "a USA Today paper", "a Urals-inspired painting".)

Editing once more to add: The CMU dictionary does not contain common acronyms, and you have to worry about those starting with f, h, l, m, n, r, s, u, and x. But there are plenty of acronym lists out there, like in Wikipedia, which you could use to add to the exceptions.
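
A rough Python sketch of the pronunciation lookup, assuming the dictionary's plain-text layout of one entry per line (word, then ARPAbet phonemes with stress digits, `;;;` for comments, `(1)` for alternate pronunciations). The sample lines are written in that style for illustration only, and the function names are made up.

```python
import re

# ARPAbet vowel phonemes, without stress digits.
ARPABET_VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
                  "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

# Illustrative sample in the assumed dictionary layout, not the real file.
SAMPLE = """\
;;; sample entries for illustration
HONEST  AA1 N AH0 S T
HORRIBLE  HH AO1 R AH0 B AH0 L
UKRAINIAN  Y UW0 K R EY1 N IY0 AH0 N
HOUR  AW1 ER0
HOUR(1)  AW1 R
"""


def load_first_phonemes(text: str) -> dict:
    """Map each lower-cased word to the first phoneme of its first entry."""
    first = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        head, *phones = line.split()
        if not phones:
            continue
        word = re.sub(r"\(\d+\)$", "", head).lower()  # drop "(1)" variant marker
        first.setdefault(word, phones[0].rstrip("012"))  # strip the stress digit
    return first


def article(word: str, first_phonemes: dict) -> str:
    """Use the pronunciation when known, else fall back to the first letter."""
    w = word.lower()
    phone = first_phonemes.get(w)
    if phone is None:
        return "an" if w.startswith(tuple("aeiou")) else "a"
    return "an" if phone in ARPABET_VOWELS else "a"


phonemes = load_first_phonemes(SAMPLE)
for noun in ("honest", "horrible", "Ukrainian", "hour"):
    print(article(noun, phonemes), noun)
```

As suggested above, the same lookup could be run once offline to pre-generate the exception sets, keeping only the words where the pronunciation disagrees with the first letter.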

Anon
A: 

The choice of "an" or "a" depends on the way the word is pronounced. By looking at the word you can't necessarily tell its correct pronunciation, e.g. for jargon or an abbreviation. One way is to have a dictionary with support for phonemes and use the phoneme information associated with the word to determine whether "a" or "an" should be used.

Rohin
+4  A: 

Can you take a different approach? Any way you do this, it looks like you're going to be adding exceptions to it for years. What about a word like "one" instead of a/an?

Bob Kaufman
+32  A: 
  1. Download Wikipedia
  2. Unzip it and write a quick filter program that spits out only article text (the download is generally in XML format and includes non-article metadata too).
  3. Find all instances of a(n).... and make an index on the following word and all of its prefixes (you can use a simple trie for this). This should be case sensitive, and you'll need a maximum word-length - 15 letters?
  4. (optional) Discard all those prefixes which occur less than 5 times or where "a" vs. "an" achieves less than 2/3 majority (or some other thresholds - tweak here). Preferably keep the empty prefix to avoid corner-cases.
  5. You can optimize your prefix database by discarding all those prefixes whose parent shares the same "a" or "an" annotation.
  6. When determining whether to use "A" or "AN" find the longest matching prefix, and follow its lead. If you didn't discard the empty prefix in step 4, then there will always be a matching prefix (namely the empty prefix), otherwise you may need a special case for a completely-non matching string (such input should be very rare).

You probably can't get much better than this - and it'll certainly beat most rule-based systems.
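
Steps 3, 4 and 6 can be sketched in Python roughly as follows. This is a simplification under stated assumptions: a plain dict keyed by prefix stands in for the trie, the step 5 pruning is left out, the (article, word) pairs are a hand-made toy list rather than anything extracted from Wikipedia, and the minimum count is lowered to 1 only because the toy list is so small.

```python
from collections import defaultdict

MAX_LEN = 15  # maximum prefix length, as suggested in step 3


def build_prefix_table(pairs, min_count=5, min_majority=2 / 3):
    """Steps 3 and 4: count "a"/"an" per prefix, keep only clear majorities."""
    counts = defaultdict(lambda: [0, 0])  # prefix -> [a_count, an_count]
    for art, word in pairs:
        idx = 0 if art.lower() == "a" else 1
        for n in range(min(len(word), MAX_LEN) + 1):  # n == 0 is the empty prefix
            counts[word[:n]][idx] += 1
    table = {}
    for prefix, (a, an) in counts.items():
        total = a + an
        # The empty prefix is always kept so that every lookup finds a match.
        if prefix and (total < min_count or max(a, an) / total < min_majority):
            continue
        table[prefix] = "an" if an > a else "a"
    return table


def lookup(word, table):
    """Step 6: follow the longest prefix of the word that is in the table."""
    for n in range(min(len(word), MAX_LEN), -1, -1):
        art = table.get(word[:n])
        if art is not None:
            return art
    return "a"  # only reached if the table is empty


# Toy observations, made up for illustration only.
toy_pairs = [
    ("an", "honest"), ("an", "honour"), ("an", "hour"),
    ("a", "house"), ("a", "horse"), ("a", "hotel"),
    ("a", "used"), ("a", "user"), ("an", "umbrella"), ("an", "uncle"),
    ("an", "apple"), ("a", "car"), ("a", "cat"),
]
table = build_prefix_table(toy_pairs, min_count=1)
for noun in ("honesty", "useful", "unusual", "zebra"):
    print(lookup(noun, table), noun)
```

None of the four test words appears in the toy list; each is resolved by a shorter prefix ("honest", "use", "un" and the empty prefix respectively), which is the generalisation the longest-prefix lookup is meant to give.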

Eamon Nerbonne
And if a noun is missing from this output, you can certainly fall back to the simple rule engine.
John Fisher
Could use that as a great corpus for a Bayesian approach.
sixlettervariables
The Corpus of Contemporary American English (http://www.americancorpus.org/) is probably a better choice than Wikipedia for individual tests, though it's not in a form you can download.
Robert Rossney
Given that the Wikipedia download decompresses to (currently) 2.8 Terabytes, it would be great if anyone who uses this method would post the resulting data publicly so the process doesn't have to be repeated much.
Nathan Long
Bwahaha! Good one.
mcandre
You don't need a rule-based engine as a fallback - you get that for free with the "longest-matching prefix" approach.
Eamon Nerbonne
This answer wasn't entirely serious, but I have done something like this, and Wikipedia's .xml file with raw wikimarkup is just on the order of 40GB (the newest one is always a bit bigger), not 2.8TB - all in one file. Don't download the expanded .html version or any images; maybe that's the version that's 2.8TB? In any case, it's actually quite feasible to parse, as long as you're not too picky about the markup.
Eamon Nerbonne
A: 

Work in Polish, then you need not worry about articles at all.

chris
A: 

I can't be certain that it has the appropriate information in it to differentiate "a" and "an", but Princeton's WordNet database exists precisely for the purpose of similar sorts of tasks, so I think it's likely that the data is in there. It has some tens of thousands of words and hundreds of thousands of relationships between said words (IIRC; I can't find the current statistics on the site). Give it a look. It's freely downloadable.

rmeador
+5  A: 

this is what you need
http://www.cogs.susx.ac.uk/users/darrenp/software/ana/

adi92
+2  A: 

Take a look at Perl's Lingua::EN::Inflect. See sub _indef_article in the source code.

Sinan Ünür
The exceptions are located in inflections.t. It seems to me that the list is rather incomplete.
Jan Aagaard
A: 

You use "a" whenever the next word doesn't start with a vowel? And you use "an" whenever it does?

With that said, couldn't you just match a regular expression like "a\s[aeiou].*" and then replace the "a" with "an"?
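
For reference, a minimal Python sketch of that regex replacement (the pattern and function name are illustrative, not taken from any library). The second call shows the limitation raised in the question: the pattern only sees letters, not sounds.

```python
import re

# The word "a" (either case), then whitespace, then a vowel letter.
# The lookahead keeps the following word out of the match.
A_BEFORE_VOWEL = re.compile(r"\b([Aa])(\s+)(?=[aeiouAEIOU])")


def fix_articles(text: str) -> str:
    """Turn "a" into "an" wherever the next word starts with a vowel letter."""
    return A_BEFORE_VOWEL.sub(lambda m: m.group(1) + "n" + m.group(2), text)


print(fix_articles("a apple and a pear"))            # an apple and a pear
print(fix_articles("a honest mistake, a used car"))  # a honest mistake, an used car
```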

Daniel
A: 

How? How about when? Get the noun with article attached. Ask for it in a specific form.

Ask for the noun with the article. Many a MUD codebase stores items as information consisting of:

  • one or more keywords
  • a short form
  • a long form

The keyword form might be "short sword rusty". The short form will be "a sword". The long form will be "a rusty short sword".
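
As a rough illustration of that layout in Python (the class and field names are made up, not taken from any particular MUD codebase), the article is simply stored as part of the text, so nothing has to be computed at display time:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical item record: the author supplies the article up front."""
    keywords: Tuple[str, ...]  # words a player can use to refer to the item
    short: str                 # short form, article included
    long: str                  # long form, article included


def describe(item: Item, verbose: bool = False) -> str:
    """Return the stored description; no a/an logic is needed here."""
    return item.long if verbose else item.short


sword = Item(
    keywords=("short", "sword", "rusty"),
    short="a sword",
    long="a rusty short sword",
)
print(f"You pick up {describe(sword)}.")
print(f"You see {describe(sword, verbose=True)} here.")
```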

Are you writing an "a vs. an" Web service? Take a step back and look at if you can attack this leak further upstream. You can build a dam, but unless you stop it from flowing, it will spill over eventually.

Determine how critical this is, and as others have suggested, go for "quick but crude", or "expensive but sturdy".

maxwellb
+2  A: 

@Nathan Long: Downloading Wikipedia is actually not a bad idea. The images, videos and other media are not needed.

I wrote a (crappy) program in PHP and JavaScript(!) to read the entire Swedish Wikipedia (or at least all articles that could be reached from the article about math, which was the starting point for my spider).

I collected all words and internal links in a database, and also kept track of the frequency of every word. I now use that as a word database for various tasks:

  • Finding all words that can be created from a given set of letters (including wildcards)
  • Creating a simple syntax file for Swedish (all words not in the database are considered incorrect)

Oh, and downloading the entire wiki took about one week, using my laptop running most of the time, with 10Mbit connection.

While you're at it, log all occurrences that are inconsistent with the English language and see if some of them are mistakes. Go fix 'em and give something back to the community.

Paxinum
A: 

The rule is very simple. If the next word starts with a vowel sound then use 'an', if it starts with a consonant then use 'a'. The hard thing is that our school classification of vowels and consonants doesn't work. The 'h' in 'honour' is a vowel, but the 'h' in 'hospital' is a consonant.

Even worse, some words like 'honest' start with a vowel or a consonant depending on who is saying them. Even worse, some words change depending on the words around them for some speakers.

The problem is bounded only by how much time and effort you want to put into it. You can write something using 'aeiou' as vowels in a couple of minutes, or you can spend months doing linguistic analysis of your target audience. Between them are a huge number of heuristics which will be right for some speakers and wrong for others -- but because different speakers have different determinations for the same word it simply isn't possible to be right all of the time no matter how you do it.

KayEss
A: 

You need to look at the grammatical rules for indefinite articles (there are only two indefinite articles in English grammar - "a" and "an"). You may not agree these sound correct, but the rules of English grammar are very clear:

"The words a and an are indefinite articles. We use the indefinite article an before words that begin with a vowel sound (a, e, i, o, u) and the indefinite article a before words that begin with a consonant sound (all other letters)."

Note this means a vowel sound, and not a vowel letter. For instance, words beginning with a silent "h", such as "honour" or "heir", are treated as starting with a vowel and so are preceded by "an" - for example, "It is an honour to meet you". Words beginning with a consonant sound are prefixed with "a" - which is why you say "a used car" rather than "an used car" - because "used" has a "yoose" sound rather than an "uhh" sound.

So, as a programmer, these are the rules to follow. You just need to work out a way of determining what sound a word begins with, rather than what letter. I've seen examples of this, such as this one in PHP by Jaimie Sirovich :

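function aOrAn($next_word) 
{ 
    // Stems that take "an" despite starting with a consonant letter.
    $_an = array('hour', 'honest', 'heir', 'heirloom'); 
    // Stems that take "a" despite starting with a vowel letter.
    $_a = array('use', 'useless', 'user'); 
    $_vowels = array('a','e','i','o','u'); 

    // Optional suffixes, so that e.g. "honestly" matches the stem "honest".
    $_endings = array('ly', 'ness', 'less', 'lessly', 'ing', 'ally', 'ially'); 
    $_endings_regex = implode('|', $_endings); 

    // Only look at the first word, up to a hyphen, space or end of string.
    preg_match('#(.*?)(-| |$)#', $next_word, $captures); 
    $the_word = trim($captures[1]); 
    if ($the_word === '') { 
        return 'a'; 
    } 

    // Anchored, with the ending optional, so the bare stem ("hour") matches too.
    $_an_regex = implode('|', $_an); 
    if (preg_match("#^($_an_regex)($_endings_regex)?$#i", $the_word)) { 
        return 'an'; 
    } 

    $_a_regex = implode('|', $_a); 
    if (preg_match("#^($_a_regex)($_endings_regex)?$#i", $the_word)) { 
        return 'a'; 
    } 

    // Fall back to the spelling rule ($the_word{0} no longer parses in PHP 8).
    if (in_array(strtolower($the_word[0]), $_vowels)) { 
        return 'an';     
    } 

    return 'a'; 
}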

It's probably easiest to create the rule and then create a list of exceptions and use that. I don't imagine there will be that many.

Dan Diplo
A: 

Could you get an English dictionary that stores the words written in our regular alphabet and in the International Phonetic Alphabet?

Then use the phonetics to figure out the beginning sound of the word, and thus whether “a” or “an” is appropriate?

Not sure if that would actually be easier than (or as much fun as) the statistical Wikipedia approach.

Paul D. Waite