I can't find anything other than closed-source web applications. Are there any active projects? I'd be interested in using the software in something I'm developing and getting involved.

A: 

You're going to run into a problem: the guesses will be just that -- guesses. There's no even remotely accurate way to tell the gender of an author strictly from their writing; the most you'll get is a rough estimate.

Jeff Hubbard
That's fine. I understand that you cannot be entirely accurate, and such a feature could only ever be for entertainment.
rmh
A: 

Hey, this could probably be done. You would need to take a bunch of books from male and female authors, pull out sentences, mix them up and feed them to some sort of neural network for training. To be honest, I'd be interested to see if someone pulls it off. Oh, and I am just curious why one would need such a program :)
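The data preparation described above (collect texts, pull out sentences, mix them up) might look roughly like the sketch below. It is only an illustration: `build_dataset`, the toy `corpus`, and the regex-based sentence splitter are all assumptions of this example, and a real project would want a proper sentence tokenizer and far more text.

```python
import random
import re

def build_dataset(texts_by_label, test_fraction=0.2, seed=0):
    """Split each text into sentences, label them, shuffle, and split."""
    examples = []
    for label, texts in texts_by_label.items():
        for text in texts:
            # Crude splitter (assumption): break after ., ! or ? plus whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
                if sentence:
                    examples.append((sentence, label))
    # Fixed seed so the shuffle is reproducible.
    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_fraction)
    return examples[n_test:], examples[:n_test]  # (train, test)

# Toy placeholder texts; real input would be whole books or articles.
corpus = {
    "male": ["First sentence. Second sentence! Third one?"],
    "female": ["Another text here. And one more sentence."],
}
train, test = build_dataset(corpus, test_fraction=0.2)
```

Holding back a test split is what lets you check whether the trained model does any better than chance.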

Dmitri Nesteruk
One reason: analyzing the blogs for marketing purposes.
stalepretzel
Another reason: guessing the demographics of your users. You could probably guess, with a good program, the gender, age, and geographic region of a user, only by looking at writing samples.
stalepretzel
If you want the demographics of your users, just ask! If they care enough to write content (posts, comments, etc.) for which they have to be logged in, just get that info during registration.
Gregg Lind
A: 

There are applications like "The Gender Genie" that operate with a reasonable degree of success, particularly on longer texts: http://bookblog.net/gender/genie.php

It doesn't need to be entirely successful. I would have huge amounts of data to deal with, and it's mostly just for fun.

If anyone knows of anything, please do share.

Richard

rmh
hmm, gender genie seems to consistently classify texts written by me as female :-/
Wim Coenen
A: 

There's a section about this in Stephen Baker's book, The Numerati. There are companies out there devoted to computationally analyzing the blogosphere for marketing purposes, and part of their algorithms deals with deciding whether the author is male or female. I suggest reading it.

I don't believe any work like this is open source, but you may be able to construct a scaled-down version yourself. However, short of analyzing a LOT of data to build it, I don't think it will be very accurate.

stalepretzel
A: 

There are some open source implementations of latent semantic indexing / analysis. If you have a good training set of male and female writing relevant to your application it might be able to classify accurately enough to be useful.

Jason Watkins
+1  A: 

Since you're assuming two categories, almost any classifier will probably do OK. Some suggestions:

  • Naive Bayes
  • Support vector machines

As an earlier commenter said, start from a known sample of text (there should be plenty; newspaper corpora might be good), then train and classify on some reasonable attributes (maybe the presence or absence of words or word pairs).

This one should be (comparatively) easy.

If you're using Python, even something as simple as the Natural Language Toolkit (cf. nltk.org) and its accompanying book should get you a lot of the way there.
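To make the naive Bayes suggestion concrete, here is a minimal standard-library sketch of a Bernoulli-style naive Bayes classifier over presence/absence of words and word pairs. It deliberately does not use NLTK; the `features` function, the `NaiveBayes` class, and the toy training sentences with placeholder labels "A" and "B" are all made up for illustration, not taken from any existing project.

```python
import math
from collections import Counter, defaultdict

def features(text):
    """Presence/absence features: lowercase words plus adjacent word pairs."""
    words = text.lower().split()
    return set(words) | {f"{a} {b}" for a, b in zip(words, words[1:])}

class NaiveBayes:
    def fit(self, examples):
        """examples: iterable of (text, label) pairs."""
        self.doc_counts = Counter()                 # documents per label
        self.feature_counts = defaultdict(Counter)  # label -> feature -> doc count
        self.vocab = set()
        for text, label in examples:
            self.doc_counts[label] += 1
            for f in features(text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, text):
        total = sum(self.doc_counts.values())
        present = features(text) & self.vocab
        scores = {}
        for label, n in self.doc_counts.items():
            score = math.log(n / total)  # log prior
            for f in self.vocab:
                # Laplace-smoothed P(feature present | label)
                p = (self.feature_counts[label][f] + 1) / (n + 2)
                score += math.log(p if f in present else 1 - p)
            scores[label] = score
        return max(scores, key=scores.get)

# Toy data; "A" and "B" stand in for the two author categories.
training = [
    ("the cat sat", "A"), ("the cat ran", "A"),
    ("a dog barked", "B"), ("a dog slept", "B"),
]
model = NaiveBayes().fit(training)
guess = model.predict("the cat")
```

With a real corpus you would train on thousands of labeled samples and measure accuracy on held-out text before trusting any guess.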

Gregg Lind
+2  A: 

Here's another web site that claims to do this: GenderAnalyzer. However, it relies on another website, uClassify.com, which is down as I write this. They have a contact link at the bottom for questions.

It sounds like an academic outfit: "In our lab it seems to works pretty well".

Steve Steiner
Anyone can claim a "lab". All that means is a computer to test on.
Tim
@Tim: Sounds academic though. I might try contacting them.
rmh
Tried them. They said my page was probably written by a male, which is correct. They had buttons to click for right or wrong guess, and the results were about at chance level. Either they don't do well or people click dishonestly (or both).
David Thornley
+2  A: 

There is a whole set of two-class analyzers that could be adapted here: spam-blocking and identification software. It still requires the user to get male-written text (treated as spam) and female-written text (treated as ham, or the reverse), but many should work.

Gregg Lind
A: 

Hi, you can try a gender classifier on text strings here: http://uclassify.com/browse/uClassify/gender_v3

A: 

nlpers blogged about this some years ago; see the comments there for some suggestions...

unhammer