I am looking for a library or database that can provide guesses about whether a person is male or female based on his or her name or nickname. Something like

john => "M",
mary => "F",
alex => "A", #ambiguous

I am looking for something that supports names other than English names (such as Japanese, Indian, etc.).

Before I get another answer along the lines of "you are going to offend people by assuming their sex/gender", let me be clear: my application does not interact with anyone. It does not send emails or contact anyone in any way. There are no users to ask. In many cases, the person in question is dead, and the only information I have is name, birth date, and date of death. The reason I want to know the sex of the individual is to make the grammar of the output nicer and to aid in possible searches that may come later.

+37  A: 

"I tell ya, life ain't easy for a boy named 'Sue.'"

...So, why make it any harder? If you need to know the sex, just ask... Otherwise, don't worry about it.

Shog9
That works when you are in communication with the person, but we make assumptions about sex all the time, and it has an effect on grammar, so it would be nice if I could make even an 80% stab at getting the pronouns right. Right now I am right fiftyish percent of the time using "he", but that seems sexist to me, and I want to make it better.
Chas. Owens
@Chas: even if you manage to get it right 80% of the time, that's still gonna irritate two out of every ten people using your software. Not so good... I understand the motivation, but you're really better off re-wording your messages such that they're gender-neutral.
Shog9
I wonder if anyone else remembers "Pat" from SNL.
erickson
Re-reading this, I'm coming off awful preachy... Don't mean to put down your idea, I think it's an interesting one... But also very risky. I used to work as a telemarketer, and no matter how many names you know, no matter how good you get at recognizing voices even, you still end up guessing wrong sometimes... and it's never fun. And so, it's something I would avoid, unless there's a big, big payoff for managing to guess it *right*.
Shog9
I would take Shog9's advice. Unless your application is going to be in languages other than English, it might be easier to just use gender-neutral pronouns--e.g. they, one, s/he, etc.
Calvin
I can totally see the benefit of this. In my home country, a lot of marketing companies get address lists from the high schools but without gender info, so they used to try and figure out genders from first names. I had a male friend with a very female name that regularly got samples of female hygiene products.
Uri
My girlfriend's name is Kevan. She gets enough people wrongly guessing her gender from her name, she doesn't need computers also doing it. The thing to beware of is creating a database entry that states someone's gender based on a computer's guess; people who look at that might assume it's provided by the person themselves, and get really confused when their assumptions turn out to be wrong (as opposed to only somewhat confused based on the name alone).
Brian Campbell
re: "Just ask". C'mon, what's more practical, spam 100000 people with "Please tell us your gender" email or use a mapping database on data you already have? Especially when all you need is a nice piechart for marketing.
Constantin
@Constantin: sure - if you're not going to try to change the experience of any user based on the data, then use whatever means of creating that data you feel like. Heck, if accuracy doesn't really matter, then just make it up...
Shog9
Maybe he's trying to de-prioritize the obvious ones so that he could set the ambiguous names to a higher priority ... like people doing data entry. There are valid reasons for doing something like this.
Chad Grant
If you do it in a subtle way this would boost usability too. for example: put the name inputs first, hit the server with an ajax request and set the gender field (which is near the end) to the "best guess"
Jiaaro
I'm implementing it to choose the voice of an IVR to be the opposite sex of the user. If you guess wrong, who cares, but if you guess right, you could get a boost to retention and reduction of call volume to live agents.
Chris McCall
@Chris: heh... That's actually kinda cool. And a little creepy. But cool!
Shog9
+2  A: 

Just ask people, and if they are nice they will give you their 'M's or 'F's, and if they are not then give 'em an 'A'.

Azder
I am not in communication with the people whose names I want to map.
Chas. Owens
+47  A: 

The gender of a name is something that cannot be inferred programmatically in the general case. You need a name database. Here is a free name database from the US Census Bureau.
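
A rough sketch of how such a lookup could work in Python. It assumes each row of the Census files looks like the line quoted in the comments ("HAI 0.004 90.022 1214") and that the columns are name, frequency in percent, cumulative frequency, and rank. The sample rows, the function names, and the 3-to-1 `ratio` cutoff are all made up for illustration; real use would read the male and female files instead of the in-memory samples.

```python
from io import StringIO

def load_frequencies(lines):
    """Parse 'NAME freq cumulative rank' rows into {NAME: freq}."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        table[parts[0].upper()] = float(parts[1])
    return table

def guess_sex(name, male, female, ratio=3.0):
    """Return 'M', 'F', or 'A' (ambiguous or unknown)."""
    key = name.strip().upper()
    m = male.get(key, 0.0)
    f = female.get(key, 0.0)
    if m == 0.0 and f == 0.0:
        return "A"  # name appears in neither list
    if m >= ratio * f:
        return "M"
    if f >= ratio * m:
        return "F"
    return "A"  # frequencies too close to call

# Made-up sample rows in the assumed layout (not real Census figures).
MALE_SAMPLE = StringIO("JOHN 3.271 3.271 1\nALEX 0.115 3.386 2\nMARY 0.009 3.395 3\n")
FEMALE_SAMPLE = StringIO("MARY 2.629 2.629 1\nALEX 0.050 2.679 2\n")

male = load_frequencies(MALE_SAMPLE)
female = load_frequencies(FEMALE_SAMPLE)

for name in ("john", "mary", "alex"):
    print(name, "=>", guess_sex(name, male, female))
```

Note that each file's frequency is a share within one sex, so comparing the two numbers directly is only an approximation; the cutoff would need tuning against names whose sex is already known.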

Ayman Hourieh
Saw this in first.dist.males "HAI 0.004 90.022 1214"
Unknown
This is great. Seems to be just what the asker needs. Ambiguous names could have the gender "guessed" based on frequency of Male versus frequency of Female.
stalepretzel
Unknown, I'm not sure what you mean. A quick Google search shows that Hai is a boy name of Hebrew origin.
Ayman Hourieh
it's not a name that translates well to English.. not Hai but more like Chai, or if you go by the Alphabet, Jai.
Amir Arad
I built a library for this using the Census data and it works great! Potential applications: analytics, IVR voice gender choosing.
Chris McCall
A: 

I think it will be hard to find libraries or databases that cover more than one (or a limited few) cultures. For instance, here in Sweden the name "Maria" is strongly identified as a female name, while (to the best of my understanding) that connection is not at all as strong in Spanish-speaking countries.

Fredrik Mörk
Maria is a female name in Spain
knoopx
Google for José Maria and you will find plenty of men carrying the name Maria. Perhaps not as their primary name, but still.
Fredrik Mörk
Multiple databases/libraries are fine. My assumption is that such a search would be O(1), so even a hundred searches would be trivial.
Chas. Owens
I'm Hispanic and Maria is STRONGLY associated with females. Not to say there aren't Jose Marias out there, but it's not common at all.
Paolo Bergantino
Italy also has some males with Maria as a middle name. English speaking countries have some females with Jean as a first or second name (pronounced differently from the French name Jean). English has Karen as a female name. English vs. English on Leslie. Databases are going to have a lot of A's.
Windows programmer
A: 

Don't bother attempting to determine a person's sex from an entered name. When you get it wrong you're going to really piss someone off. Letting users enter their sex is simple, so you may as well let them enter it.

mP
That would work if I were designing a form to interact with people, but that is not what is going on.
Chas. Owens
If you don't know, don't guess - it's extremely rude to assume. If you really need to know, ask; there are always ways to get a definitive answer.
mP
@mP Really? There is a good way to ask a dead person with no relatives? I am glad you know all of the facts of what I am doing so much better than I do. So, do you hold a seance or what?
Chas. Owens
@Chas Well, why bother with a database at all? Why not randomly generate data whenever a query is made for a person? Your boss will be happy with the savings in DB licenses and no support/admin....
mP
+1  A: 

Name-gender maps can work, but in multicultural countries it's more like guessing. I can give you one example: Marian in Polish is a typical masculine name, whereas the same name in Great Britain is a female name. In the era of people immigrating all over the world, I'm not sure such a database would be very accurate. Good luck!

Michal Rogozinski
No, but so long as it is better than 50% it beats treating names as always masculine.
Chas. Owens
@Chas, so why cling to that false dichotomy? You have the option of gender-neutrality.
bignose
We even have 2 famous politicians who have a second name 'Maria' ('Mary'), which would be classified in your database as feminine. Just for the lols.
zalew
@JZ I am speaking of the first name, not the last name (or vice versa for the cultures that do the reverse).
Chas. Owens
@bignose, the gender-neutral language looks weird and is convoluted; I would rather produce something that looks nicer when I can. This is not communicated back to the individuals (if they even exist), so there is no chance of offense. I don't know why people are spending so much time arguing this instead of just providing links to databases if they know of a good one.
Chas. Owens
@Chas: second name != last name. like JFK :)
zalew
+37  A: 

gender.c is an open source C program that does a good job. It comes with data for 44568 first names from all around the world. There is good documentation and a description of the file format (basically plain text), so it should not be too difficult to read it from your own application.

Here is what the author says:

A few words on quality of data

The dictionary of first names has been prepared with utmost care. For example, the Turkish, Indian and Korean names in this dictionary have all been independently classified by several native speakers. I also took special care to list only those names which can currently be found.

The lesson from this?

Any modifications should be done very cautiously (and they must also adhere to the sorting required by the search algorithm). For example, knowing that "Sascha" is a boy's name in Germany, the author never assumed the English "Sasha" to be a girl's name. Knowing that "Jan" is a boy's name in Germany, I never assumed it to be also a English short form of "Janet". Another case in point is the name "Esra". This is a boy's name in Germany, but a girl's name in Turkey.

The program calculates a probability for the name being male or female. It can do so with the name as input alone or with the name and country of origin, which gives significantly better results.
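
To show why the country helps, here is a small Python sketch of a country-aware lookup. This is not gender.c's code or its file format: the `NAMES` table, the country codes, and the `classify` function are hypothetical, and the entries only encode the examples from the author's notes above (Sascha, Jan, and Esra in Germany, Esra in Turkey).

```python
# Hypothetical per-country classifications; NOT the gender.c data format.
NAMES = {
    "esra": {"DE": "M", "TR": "F"},
    "jan": {"DE": "M"},
    "sascha": {"DE": "M"},
}

def classify(name, country=None):
    """Return 'M', 'F', or 'A' for a first name, optionally per country."""
    by_country = NAMES.get(name.strip().lower())
    if not by_country:
        return "A"  # unknown name
    if country is not None:
        # Stay cautious when the name is not recorded for that country.
        return by_country.get(country.upper(), "A")
    verdicts = set(by_country.values())
    # Without a country, answer only if every recorded country agrees.
    return verdicts.pop() if len(verdicts) == 1 else "A"

print(classify("Esra"))        # countries disagree, so ambiguous
print(classify("Esra", "TR"))  # country resolves it
```

Returning "A" for a country that is not in the table follows the caution in the quote: a name that is male in Germany is not assumed to be male elsewhere.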

You can download it from the website of the German computer magazine c't, from the article "40 000 Namen". The article is in German but don't worry, all documentation is in English. Here is the direct FTP link, 0717-182.zip, if you are not interested in the article. The zip file contains the source code, a Windows executable, the database and the documentation.

Ludwig Weinzierl
+6  A: 

The only thing you'll get from trying to automate it is a bunch of unhappy users. From that census data:

JAMES, JOHN, ROBERT, MICHAEL, WILLIAM, DAVID, RICHARD, CHARLES, JOSEPH, THOMAS, CHRISTOPHER, DANIEL, PAUL, MARK, DONALD, GEORGE, KENNETH, STEVEN, EDWARD, BRIAN, RONALD, ANTHONY, KEVIN, JASON, MATTHEW, GARY, TIMOTHY, JOSE, LARRY, JEFFREY, FRANK, SCOTT, ERIC, STEPHEN, ANDREW, RAYMOND, GREGORY, JOSHUA, JERRY, DENNIS, WALTER, PATRICK, PETER, HAROLD, HENRY, CARL, ARTHUR, RYAN, JOE, JUAN, JACK, ALBERT, JUSTIN, TERRY, GERALD, KEITH, SAMUEL, WILLIE, LAWRENCE, ROY, BRANDON, ADAM, FRED, BILLY, LOUIS, JEREMY, AARON, RANDY, EUGENE, CARLOS, RUSSELL, BOBBY, VICTOR, MARTIN, JESSE, SHAWN, CLARENCE, SEAN, CHRIS, JOHNNY, JIMMY, ANTONIO, TONY, LUIS, MIKE, DALE, CURTIS, NORMAN, ALLEN, GLENN, TRAVIS, LEE, MELVIN, KYLE, FRANCIS, JESUS, RAY, JOEL, EDDIE, TROY, ALEXANDER, MARIO, FRANCISCO, MICHEAL, OSCAR, JAY, ALEX, JON, RONNIE, TOMMY, LEON, LEO, WESLEY, DEAN, DAN, LEWIS, COREY, MAURICE, VERNON, ROBERTO, CLYDE, SHANE, SAM, LESTER, CHARLIE, TYLER, GENE, BRETT, ANGEL, LESLIE, CECIL, ANDRE, ELMER, GABRIEL, MITCHELL, ADRIAN, KARL, CORY, CLAUDE, JAMIE, JESSIE, CHRISTIAN, LONNIE, CODY, JULIO, KELLY, JIMMIE, JORDAN, JAIME, CASEY, JOHNNIE, SIDNEY, JULIAN, DARYL, VIRGIL, MARSHALL, PERRY, MARION, TRACY, RENE, FREDDIE, AUSTIN, JACKIE, JOEY, EVAN, DANA, DONNIE, SHANNON, ANGELO, SHAUN, LYNN, CAMERON, BLAKE, KERRY, JEAN, IRA, RUDY, BENNIE, ROBIN, LOREN, NOEL, DEVIN, KIM, GUADALUPE, CARROLL, SAMMY, MARTY, TAYLOR, ELLIS, DALLAS, LAURENCE, DREW, JODY, FRANKIE, PAT, MERLE, TERRELL, DARNELL, TOMMIE, TOBY, VAN, COURTNEY, JAN, CARY, SANTOS, AUBREY, MORGAN, LOUIE, STACY, MICAH, BILLIE, LOGAN, DEMETRIUS, ROBBIE, KENDALL, ROYCE, MICKEY, DEVON, ASHLEY, CAREY, SON, MARLIN, ALI, SAMMIE, MICHEL, RORY, KRIS, AVERY, ALEXIS, GERRY, STACEY, CARMEN, SHELBY, RICKIE, BOBBIE, OLLIE, DENNY, DION, ODELL, MARY, COLBY, HOLLIS, KIRBY, CRUZ, MERRILL, LANE, CLEO, BLAIR, NUMBERS, CLAIR, BERNIE, JOAN, DOMINIQUE, TRISTAN, JAME, GALE, LAVERNE, ALVA, STEVIE, ERIN, AUGUSTINE, YOUNG, JOHNIE, ARIEL, DUSTY, LINDSEY, TRACEY, SCOTTIE, SANDY, 
SYDNEY, GAIL, DORIAN, LAVERN, REFUGIO, IVORY, ANDREA, SANG, DEON, CAROL, YONG, BERRY, TRINIDAD, SHIRLEY, MARIA, CHANG, ROSARIO, DANNIE, FRANCES, THANH, CONNIE, TORY, LUPE, DEE, SUNG, CHI, QUINN, MINH, THEO, LOU, CHUNG, VALENTINE, JAMEY, WHITNEY, SOL, CHONG, PARIS, OTHA, LACY, DONG, ANTONIA, KELLEY, CARROL, SHAYNE, VAL, JUDE, BRITT, HONG, LEIGH, GAYLE, JAE, NICKY, LESLEY, MAN, KASEY, JEWELL, PATRICIA, LAUREN, ELISHA, MICHAL, LINDSAY, and JEWEL

are all names that work for both males and females. If a girl's name is Robert and everyone, including your software, keeps on calling her a man, she'd be rather pissed.

nitromaster101
Let's assume that there exists a girl called Mark (feel free to point one out). If I was her I'd be pissed off at my parents and not at Chas' software...
Darko Z
What if the software never calls her a man, but presents the "masculine" version of the UI? Or she's lumped in with men in an aggregate over a dataset used to develop marketing collateral? She might not even notice.
Chris McCall
+5  A: 

Given your stated constraints, your best option is to re-phrase whatever it is you're writing to be gender-neutral unless you know what gender they want to be called in each instance.

If writing in English, remember that singular “they” is grammatically fine as a gender-neutral third-person singular pronoun.

A good example is the title of this question. As it currently reads:

    … mapping a person's name to his or her sex?

That would be less awkward if written:

    … mapping a person's name to their sex?
bignose
It's not quite "perfectly" grammatical. Even the Wikipedia article admits that it has been used, particularly in the modern context, as a result of some writers' discomfort with the generic "he". I don't have a big problem with writers that do this (although if gender-neutrality is really important, I prefer to reword the construct so I can use pronouns like "one"), but let's call it what it is.
Ben Collins
I'd argue we're both right. All grammar, especially English grammar, has significant problems; but I'd say any definition of “perfect grammar” that actually applies to anything in English applies here too. Either the singular “they” is perfectly grammatical, or nothing in English is :-)
bignose
Of note, Grammar Girl (author Mignon Fogarty) has been leaning towards acceptance of the singular "they" for a while now. http://grammar.quickanddirtytips.com/he-they-generic-personal-pronoun.aspx
Karen Lopez
A: 

This is not really a programming problem - it comes down to getting a probability table.

AFAIK there are no public databases in distilled forms. You could either build this from census data, or buy the data from someone.

For example, this is someone who sells the probability table for Canada.

Uri
+3  A: 

Some cultures have unisex names - like mine. What do you do then? I think the answer is plain and simple - don't assume - you could cause offence. Just ask if it's needed; otherwise, go with gender neutrality.

Preet Sangha
The question already answered your answer: alex => "A", #ambiguous. Whether or not the question has an answer, your answer isn't it.
Windows programmer
I disagree - My point is that all names are potentially ambiguous.
Preet Sangha
If the names are unisex then they would all be classified as A and I would go for gender neutrality, but if a name is predominately masculine or feminine I can use much more natural language.
Chas. Owens
I see what your reasoning is but I refer you to the latter comment.
Preet Sangha
I agree that all names are potentially ambiguous. The question already answered your answer. The question said to use "A" for ambiguous. Your answer asked a question that the question already answered.
Windows programmer
@Windows programmer If all names are potentially ambiguous, then the answer is f(x) = "A", which is the answer that Preet gave.
Brian Campbell
@Brian Campbell Yeah, but not all names are ambiguous. Many have a high correlation with the person's sex.
Chas. Owens
But if you see a "Preet" on StackOverflow, it's probably a male.
Nosredna
+2  A: 

It's also poor practice to assume that users must be male or female. There are a small but significant number of "intersex" people, most of whom are heartily sick of not having a box to tick.
bignose: interesting on the "singular they". I didn't realize it had such a long history.

Karl
+2  A: 

Although databases are probably the most practical solution, if you want to have some fun maybe you could try writing a neural net (or using a neural net library) that takes in the name and outputs one of those 3 options (F,M,A).

You could train it using the datasets that exist in the databases suggested by other answers, as well as with any other data you have.

This solution would allow you to handle names not specifically categorised previously, and also handle different languages. You might want to pass the language (if you know it) as an input to the neural net as well.
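
For fun, here is about the smallest thing that could be called a neural net: a single sigmoid unit (which is just logistic regression) trained in plain Python on the last one and two letters of each name. Everything in it is an assumption for illustration: the twelve-name training set is made up, as are the feature choice, the learning rate, and the `margin` that decides when to answer "A". A real attempt would train on one of the name databases from the other answers and would need a held-out set to see whether it generalises at all.

```python
import math

# Tiny made-up training set (1.0 = female, 0.0 = male), for illustration only.
TRAIN = [
    ("maria", 1.0), ("john", 0.0), ("sandra", 1.0), ("robert", 0.0),
    ("ivana", 1.0), ("david", 0.0), ("petra", 1.0), ("mark", 0.0),
    ("sara", 1.0), ("ivan", 0.0), ("ana", 1.0), ("peter", 0.0),
]

def features(name):
    """Bias term plus the last one and two letters of the name."""
    n = name.strip().lower()
    return ["bias", "1:" + n[-1:], "2:" + n[-2:]]

def predict(weights, name):
    """Probability (0..1) that the name is female under the given weights."""
    score = sum(weights.get(f, 0.0) for f in features(name))
    return 1.0 / (1.0 + math.exp(-score))

def train(data, epochs=200, rate=0.5):
    """Fit the single sigmoid unit with plain stochastic gradient descent."""
    weights = {}
    for _ in range(epochs):
        for name, label in data:
            error = label - predict(weights, name)
            for f in features(name):
                weights[f] = weights.get(f, 0.0) + rate * error
    return weights

def classify(weights, name, margin=0.15):
    """Map the probability to 'F' or 'M', or 'A' when it is too close to 0.5."""
    p = predict(weights, name)
    if p >= 0.5 + margin:
        return "F"
    if p <= 0.5 - margin:
        return "M"
    return "A"

WEIGHTS = train(TRAIN)
print(classify(WEIGHTS, "sandra"), classify(WEIGHTS, "robert"))
```

The language input suggested above could be added as one more feature string (for example a hypothetical "lang:sv"), which would let the same ending carry different weight in different languages.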

I don't know that I can say neural nets (or any other machine learning) would do a good job of categorising though.

chees
+1  A: 

IMHO, it is generally a bad idea to determine sex from an individual's name. A lot of names are intersexual (good grief, is this even a word ?? :-), and also they may be one sex in one culture and another in another.

A few stupid examples, just a few that came to mind (from my part of the world, CE)

Vanja - female, in eastern countries from here, mostly male
Alex - intersex (short for Sandra, female, and Sandro, male)
Robin - in western cultures, can be both

In some parts of the world, a person's sex can be determined by looking at how the name ends. For example, Marija, Sandra, Ivana, Petra, Sara, Lucija, Ana - you can see that most of these female names end in "ja" or "ra". There are other examples as well.

Still, I think it's better just to ask the user for sex.

ldigas
"Still, I think it's better just to ask the user for sex." -- I agree, that would be far better than posting comments on Stack Overflow.
Windows programmer
Ups. Okeey, that didn't come out right :-)
ldigas
It was better before editing :-)
Windows programmer
Hehehe ... yeah :-)
ldigas
+1  A: 

Well, not anymore. IBM patented that idea a while ago.

So if you're looking for any level of flexibility (something other than a list of names), you'll either have to (gasp!) ask the user, or simply pay IBM for the rights :)

In any case, such autodetection is annoying for many people who have gender-ambiguous names, or even just mean parents. Let's not make this any harder for them.

lfaraone
It looks like IBM patented choosing an avatar based on name. Luckily that is not one of the applications I intend to use this for, so I am not violating their patent. As for asking the user, that assumes I have users to ask as opposed to a list of names. I have said repeatedly that there are no users, no interaction, and no messages going to the people who the names belong to.
Chas. Owens
+1  A: 

It's culture/region dependent: take Andrea, which in Italy is only masculine, while in Sweden it is a female name and Andreas is the male form; Shawn is ambiguous in English. If a language has declension, like Latin or Russian, the final letters will change according to grammatical rules.

Another source of ambiguity is family names that are identical to personal names.

In my opinion it's impossible to solve in general.

Giulio Vian
A: 

I agree that there are two issues with this question:

  1. The assumption that better than 50% is good. I'd say ask your average female Chris and male Su whether or not they enjoy being mis-addressed 100% of the time.
  2. That there are only two genders in the world. Sure, we've been taught that, but it really doesn't reflect reality.
Karen Lopez
Note that I am specifically looking for sex, not gender. This is one of the reasons I have resisted the changing of the question to refer to gender instead of sex. Also, for someone concerned about assumptions, you are making one. If you read the comments you will see that nothing is being sent to the people whose names are being examined, so there is no possibility of offense.
Chas. Owens
Then I'm baffled by what you are trying to accomplish. I'm not aware of any standard, tool, dataset, algorithm, or SWAG that can get you from a given name to that. "Note that I am specifically looking for sex" Okie dokie. On that note, I'll leave this up to you to sort out - ON YOUR OWN.
Karen Lopez
@Karen: Please... Don't waste our time posting ignorant answers like this. If you're baffled or don't like the question... MOVE ALONG. If you read the question, he's trying to make nicer output, by attaching a "sex" (him/her, male/female, penis/vagina) to a name, so that anything written about an individual reads nicer without complicated "him or her" or gender neutral phrases. The name owners do not see this and therefore cannot be offended. In other words, it only has to be good enough so as to not be obviously wrong, like referring to "Dick", "Bob", or "Tom" as "she" or "her".
Triynko
+1  A: 

It's not free, but this is a nice library that I have used before:

NetGender for .NET allows you to quickly and easily build Name Verification, Parsing and Gender Determination into your custom applications. Accurately verify whether a particular field contains a valid individual or company. NetGender uses a 100,000+, ethnically diverse, Name Dictionary in combination with an 8,000+ Company Name Dictionary to ensure precise gender determination.

http://www.softwarecompany.com/dotnet/netgender.htm

Richard West
+11  A: 

Here are two oddball approaches that may not even work, and likely wouldn't work en masse without violating the terms of a license:

  1. Use the Facebook API (which I know virtually nothing about, it may not even be possible) to perform two searches: one for FB male users with that first name, and one for female. Use the two numbers to decide the probability of gender.

  2. Much looser but more scalable, use the Google API and search for the name plus the gender-specific pronouns, and compare the numbers. For instance, there are 592,000,000 results for searching for "Richard his" (not as a phrase), but only 179,000,000 for "Richard her".
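
The comparison step of either idea could look something like this in Python. It deliberately makes no API calls: the hit counts are passed in as plain numbers, and the function names and the 0.7 `threshold` are assumptions for illustration. The counts in the usage line are the "Richard his" / "Richard her" figures quoted above.

```python
def male_probability(male_hits, female_hits):
    """Share of hits that co-occur with the male pronoun, or None with no data."""
    total = male_hits + female_hits
    if total == 0:
        return None
    return male_hits / total

def guess_from_hits(male_hits, female_hits, threshold=0.7):
    """Return 'M', 'F', or 'A' depending on how lopsided the counts are."""
    p = male_probability(male_hits, female_hits)
    if p is None:
        return "A"
    if p >= threshold:
        return "M"
    if p <= 1 - threshold:
        return "F"
    return "A"

# "Richard his" vs. "Richard her" result counts from the answer above.
print(guess_from_hits(592_000_000, 179_000_000))
```

If the search index itself leans towards one pronoun, the threshold (or a baseline ratio taken from names of known sex) would have to be adjusted to compensate.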

richardtallent
Apart from the general consensus on having software trying to guess things like sex from a first name, this is really cool algorithmic answer to the original question. Well done.
peSHIr
Great idea. You can probably throw in some words in a given country's language as well to localize it.
Nosredna
Good point, Nosredna... of course, Google also allows you to filter search results by language code already. You can even steal the user's preferred language from the HTTP request. Their browser language setting may or may not match up with the ethnicity of their name, but this is a fuzzy technique anyway.
richardtallent
I found that Google is chauvinist: more results come back for men than women because more men are in Google. The Facebook API is probably a lot more representative.
Chris McCall
A: 

Got this from hacker news discussion about this

Surya
A: 

Interesting. Don't think anyone has tried to do something like this. And like it has been suggested in previous answers - it must be country specific too.

gnlogic
Some good answer.
Ronnie Overby
+1  A: 

I haven't used it, but IBM has a Global Name Analytics library (for a price!) that seems pretty comprehensive.

altan
+1  A: 

It's interesting that you say you have birth date. That could help. I've seen databases of histories of name popularity.

In the film Splash (1984), it was funny that Daryl Hannah's character chooses the name "Madison" from a Madison Avenue street sign, because obviously "Madison" is not a girl's name.

24 years later, Madison is the 4th most popular name for girl babies!


Name history from the gov't. (Check out Mary's sad decline through the last 100 years.)
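
A sketch of how the birth date could feed into the guess, in Python. The `COUNTS` table is invented purely for illustration (the numbers are not real statistics); a real version would be filled from a name-popularity history like the one mentioned above. The function name and the 0.8 `min_share` cutoff are likewise assumptions.

```python
# Made-up counts of babies per (name, year) and sex; NOT real statistics.
COUNTS = {
    ("madison", 1980): {"M": 40, "F": 2},
    ("madison", 2008): {"M": 60, "F": 17000},
}

def guess_by_birth_year(name, birth_year, min_share=0.8):
    """Use the nearest recorded year for the name; return 'M', 'F', or 'A'."""
    key = name.strip().lower()
    years = [y for (n, y) in COUNTS if n == key]
    if not years:
        return "A"  # no history for this name
    nearest = min(years, key=lambda y: abs(y - birth_year))
    counts = COUNTS[(key, nearest)]
    total = counts["M"] + counts["F"]
    if total == 0:
        return "A"
    if counts["M"] / total >= min_share:
        return "M"
    if counts["F"] / total >= min_share:
        return "F"
    return "A"

print(guess_by_birth_year("Madison", 1975), guess_by_birth_year("Madison", 2005))
```

With these invented numbers the same name flips from one answer to the other depending on the birth year, which is the effect described above.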


When I wrote to the White House as a child, Richard Nixon (or, perhaps a secretary) responded to me with some photos of the historic place, addressed to "Miss Rhett Anderson." "Miss Rhett?" It doesn't even make sense! Can we REALLY not tell the difference between Clark Gable's Rhett (with a mustache, in Gone With The Wind!) and Vivien Leigh's Scarlett? I shall never forgive him, despite Neil Young's assurance that "even Richard Nixon has got soul."

Nosredna
Good point, date definitely comes into play here.
Chas. Owens