I'm preparing some table names for an ORM, and I want to turn plural table names into single entity names. My only problem is finding an algorithm that does it reliably. Here's what I'm doing right now:

  1. If a word ends with -ies, I replace the ending with -y
  2. If a word ends with -es, I remove this ending. This doesn't always work, however: for example, it turns Types into Typ
  3. Otherwise, I just remove the trailing -s
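For reference, the three rules above can be sketched in a few lines of Python. This is only an illustration of the approach described in the question; the function name `naive_singularize` is made up for this example.

```python
def naive_singularize(word: str) -> str:
    """Apply the three suffix rules from the question, in order."""
    lower = word.lower()
    if lower.endswith("ies"):
        # Rule 1: -ies becomes -y
        return word[:-3] + "y"
    if lower.endswith("es"):
        # Rule 2: drop -es (this is the rule that breaks "Types")
        return word[:-2]
    if lower.endswith("s"):
        # Rule 3: drop the trailing -s
        return word[:-1]
    return word


if __name__ == "__main__":
    for name in ["Categories", "Boxes", "Users", "Types"]:
        print(name, "->", naive_singularize(name))
```

The last case shows the failure mentioned in rule 2: Types comes out as Typ.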

Does anyone know of a better algorithm?

A: 

I think you will need a list of special cases to translate plural into singular for some words (in your example, Types -> Type).

I think you could have a look at the source code of CakePHP (you might start your search here). They use such an algorithm on their table names and field names to automagically join tables.


[Edit:] Here is some scientific work to read on "Plural inflection in English".

Peter
+7  A: 

The problem is that this is based on the general rules, but English has (figuratively) a billion exceptions... What do you do with words like "fish", or "geese"?

Also, the rules are for how to turn singular nouns to plurals. The reverse mapping isn't necessarily possible (consider "freebies").

Tal Pressman
I don't think you realize how big a billion actually is :-) Or were you being figurative? [That's actually a bug-bear of mine, the people that say "literally a billion" when they really mean figuratively].
paxdiablo
Well, I didn't say "literally", now did I? :p Still, if it bothers you that much...
Tal Pressman
freebie? (15 chars)
Charlie Somerville
That would be the correct singular of "freebies", but going by the original rules in the question you would get "freeby", which is wrong.
Tal Pressman
A: 

I'm sure you can google to find plenty of libs that do this.

But if you feel like coding, you could try the reverse process: start with the singular words from a dictionary (download a free one, such as those used by aspell or whatever), apply a pluralization rule, collect the mappings, and switch the direction. For "type" you would pluralize to "types", and the reverse mapping would work as expected. While there are exceptions here too, it is slightly easier to pluralize things reliably. I did this a while back (in the mid 90s... :-) ) for an online game (a MUD), where descriptions for multiple identical items were concatenated and automatic pluralization was needed.
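A minimal sketch of that reverse-mapping idea in Python. Everything here is an assumption for illustration: the tiny `pluralize` function, its irregular list, and the `WORDS` sample stand in for a real pluralizer and a real dictionary file.

```python
def pluralize(word: str) -> str:
    """Very small pluralizer: irregular words first, then common suffix rules."""
    irregular = {"goose": "geese", "fish": "fish", "sheep": "sheep", "radius": "radii"}
    if word in irregular:
        return irregular[word]
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        # consonant + y: category -> categories
        return word[:-1] + "ies"
    return word + "s"


def build_singular_map(singulars):
    """Pluralize every known singular and invert the mapping."""
    # If two singulars share a plural, the later one in the list wins.
    return {pluralize(w): w for w in singulars}


# Stand-in for a real word list loaded from a dictionary file.
WORDS = ["type", "bus", "box", "category", "freebie", "key", "goose", "sheep"]
SINGULAR_OF = build_singular_map(WORDS)

if __name__ == "__main__":
    for plural in ["types", "freebies", "categories", "geese"]:
        print(plural, "->", SINGULAR_OF[plural])
```

Because the lookup only ever returns words that were in the singular list, "types" and "freebies" map back correctly without any suffix guessing.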

Also: given that it's a finite number of tables, you could just use the simplest algorithm, get the raw output, eyeball it, and fix the error cases manually. :-)

StaxMan
+2  A: 

Andrew Peters has a class called Inflector.NET which provides plural-to-singular and singular-to-plural methods. As Tal has pointed out, no algorithm is infallible, but this covers a decent number of irregular English nouns.

itowlson
I've used this and it's great... I've extended it a little. There are many examples on the net of uncommon pluralization to add to the basic version you can get online.
BenAlabaster
+8  A: 

Those are all general rules (and good ones) but English is not a language for the faint of heart :-).

My own preference would be to have a transformation engine along with a set of transformations (surprisingly enough) for doing the actual work.

You would run through the transformations (from specific to general) and, when a match was found, apply the transformation to the word.

Regular expressions would be an ideal approach to this due to their expressiveness. An example rule set:

 1. If the word is "fish", return "fish".
 2. If the word is "sheep", return "sheep".
 3. If the word is "radii", return "radius".
 4. If the word is "types", return "type".
 5. If the word ends in "ii", replace that "ii" with "us" (octopii, virii).
    : : : : :
97. If a word ends with -ies, I replace the ending with -y.
98. If a word ends with -es, I remove this ending.
99. Otherwise, I just remove the trailing -s.

Note that, when we found the problem with "types" at 98, we created a higher-priority transformation at 4. You'll basically need to keep this transformation table updated as you find all those wondrous exceptions that English has spawned.
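A rough Python sketch of such a transformation table, using regular expressions and trying the rules from specific to general. It covers only a handful of the rules listed above, and the names `RULES` and `singularize` are invented for this example.

```python
import re

# Ordered from most specific to most general; the first match wins.
RULES = [
    (re.compile(r"^(fish|sheep)$", re.IGNORECASE), r"\1"),  # unchanged words
    (re.compile(r"^radii$", re.IGNORECASE), "radius"),      # one-off exception
    (re.compile(r"^(typ)es$", re.IGNORECASE), r"\1e"),      # added after "types" broke
    (re.compile(r"ies$", re.IGNORECASE), "y"),              # general rules follow
    (re.compile(r"es$", re.IGNORECASE), ""),
    (re.compile(r"s$", re.IGNORECASE), ""),
]


def singularize(word: str) -> str:
    """Apply the first rule whose pattern matches the word."""
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word


if __name__ == "__main__":
    for name in ["Types", "radii", "fish", "companies", "boxes", "users"]:
        print(name, "->", singularize(name))
```

New exceptions are handled by inserting a more specific pattern ahead of the general suffix rules, so the general rules themselves never need to change.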

The other possibility is to not waste your time with a general rule. Since the names of the tables will be relatively limited, just create another table (or some sort of data structure) called singulars which maps all the relevant plural table names (employees) to singular object names (employee).

Then every time a table is added, add an entry to the singulars "table" so you can singularize it.

paxdiablo
good practical example.
thomasrutter
ooh, I love this. it so downprioritizes (is that a new word?) my cases that i feel a bit embarrassed. okay. point taken. will work with exceptions rather than rules.
Dmitri Nesteruk
Virii is not the plural of virus, and octopii is not the plural of octopus. If you're going to provide examples, provide correct examples.
Adam Jaskiewicz
Regular expressions only really take you part of the way there. You need to create a class that lets you define basic rules, exceptions, uncountables, uncommon variations and a host of other variants. Some words use Latin for pluralization, some use Greek; it's a complex subject.
BenAlabaster
Wow, @Adam, I guess you've shown the whole world how clever you are. How you did that while missing the point of the answer is amazing. I am truly in awe :-)
paxdiablo
+1  A: 

Maybe take a look at the source code of something like the Rails Inflector.

rkj
+1  A: 

See also this answer, which recommends using Morpha (or studying the algorithm behind it).

If you know that the words you want to lemmatize are plural nouns, then you can tag them with NNS to get more accurate output.

Input example:

$ cat test.txt 
Types_NNS
Pies_NNS
Trees_NNS
Buses_NNS
Radii_NNS
Communities_NNS
Sheep_NNS
Fish_NNS

Output example:

$ cat test.txt | ./morpha -c
Type
Pie
Tree
Bus
Radius
Community
Sheep
Fish

Kaarel
+1  A: 

As an improvement, you could use rules that generate multiple possibilities and then look up the results in a dictionary to weed out impossible options.

For example, replace -ies with both -y and -ie. Pies becomes Py and Pie. Only one of those is in the dictionary, so choose that one.

Perhaps you can even find a dictionary with frequency information and select the most common word you generate.

If you combine this with an ordered list of rules that covers a few exceptions, you might get pretty good accuracy.
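Here is a small Python sketch of the generate-and-check idea. The `DICTIONARY` set is a stand-in for a real word list, and `SUFFIX_RULES`, `candidates` and `singularize` are names made up for this example. Frequency information is not modeled; the first candidate found in the dictionary is returned.

```python
from typing import List, Optional

# Stand-in for a real dictionary of known singular words.
DICTIONARY = {"pie", "type", "bus", "community", "tree", "fish"}

# Each plural suffix maps to several possible singular endings.
SUFFIX_RULES = [
    ("ies", ["y", "ie"]),
    ("es", ["", "e"]),
    ("s", [""]),
]


def candidates(word: str) -> List[str]:
    """Generate every singular form the suffix rules allow."""
    lower = word.lower()
    result = []
    for suffix, replacements in SUFFIX_RULES:
        if lower.endswith(suffix):
            stem = lower[: -len(suffix)]
            result.extend(stem + r for r in replacements)
    return result


def singularize(word: str, dictionary=DICTIONARY) -> Optional[str]:
    """Return the first candidate found in the dictionary, else None."""
    for candidate in candidates(word):
        if candidate in dictionary:
            return candidate
    lower = word.lower()
    # Words such as "fish" are their own singular.
    return lower if lower in dictionary else None


if __name__ == "__main__":
    for name in ["Pies", "Types", "Buses", "Communities", "Fish"]:
        print(name, "->", singularize(name))
```

Returning None for unknown words makes it easy to flag the cases that still need a hand-written exception.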

Adrian McCarthy
A: 

There's a nice implementation of an inflector in the uNnAddIns project that even includes an experimental Spanish inflector. The idea is taken from the Rails Inflector module.

It can be used for other things as well, such as converting CamelCase to normal text and generating browser-friendly URLs from titles.

Marc Climent
A: 

I'm going to try MorphAdorner: http://morphadorner.northwestern.edu/morphadorner/download/ (Java). It's a collection of different NLP processing tools, and you can test them through online examples. For your problem (which is also my problem) there's the Pluralizer tool: http://morphadorner.northwestern.edu/morphadorner/pluralizer/example/

Marco