I can't find anything other than closed-source web applications. Are there any active projects? I'd be interested in using the software in something I'm developing and getting involved.
views: 421
answers: 10
Is there open source software available that analyses a string and guesses the gender of the author?
You're going to run into a problem: the guesses will be just that -- guesses. There's no even remotely accurate way to tell the gender of an author strictly from their writing; the most you'll get is a rough estimate.
Hey, this could probably be done. You would need to take a bunch of books from male and female authors, pull out sentences, mix them up and feed them to some sort of neural network for training. To be honest, I'd be interested to see if someone pulls it off. Oh, and I am just curious why one would need such a program :)
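For what it's worth, the data-preparation half of that idea (pull out sentences, mix them up, hold some back for checking) could look roughly like the sketch below. Everything in it is an assumption for illustration: the function names `split_sentences` and `build_dataset` are made up, the sentence splitter is deliberately naive, and the two short texts are invented placeholders rather than real book excerpts. It stops short of the training step itself.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_dataset(texts_by_label, test_fraction=0.2, seed=0):
    """Return (train, test) lists of (sentence, label) pairs, shuffled."""
    examples = [
        (sentence, label)
        for label, texts in texts_by_label.items()
        for text in texts
        for sentence in split_sentences(text)
    ]
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(examples)
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]


# Placeholder texts, invented for illustration only.
corpus = {
    "female": ["First sentence here. Second one follows! A third?"],
    "male": ["Another text begins. It continues here."],
}
train, test = build_dataset(corpus)
```

With real books you would want a proper sentence tokenizer and far more text per label; the held-out `test` list is what lets you see how bad (or good) the guesses actually are.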
There are applications like "The Gender Genie" which operate with a reasonable degree of success, particularly with longer texts: http://bookblog.net/gender/genie.php
It doesn't need to be entirely successful. I would have huge amounts of data to deal with, and it's mostly just for fun.
If anyone knows of anything, please do share.
Richard
There's a section about this in the book by Stephen Baker, The Numerati. There are companies out there devoted to computationally analyzing the blogosphere for marketing purposes, and part of their algorithms deals with deciding if the author is male or female. I suggest reading it.
I don't believe any work like this is open source, but you may be able to construct a simplified version yourself. However, short of analyzing a LOT of data in order to program this, I don't think it will be very accurate.
There are some open source implementations of latent semantic indexing / analysis. If you have a good training set of male and female writing relevant to your application, it might be able to classify accurately enough to be useful.
Since you're assuming two categories, almost any classifier will probably do OK. Some suggestions:
- Naive Bayes
- Support vector machines
As an earlier commenter said, start from a known sample of text (and there should be plenty... newspaper corpora might be good), then train and classify on some reasonable attributes (maybe presence / absence of words or word pairs).
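To make the "presence / absence of words or word pairs" idea concrete, here is a minimal pure-Python sketch of a naive Bayes classifier over exactly those features. It is only an illustration under assumptions: the class name `PresenceNaiveBayes` and the helper `features` are hypothetical, the smoothing is plain add-one, and the four training strings are invented toy data with placeholder labels "A" and "B" standing in for the two author groups.

```python
import math
import re
from collections import Counter


def features(text):
    """Set of lowercase words and adjacent word pairs present in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return set(words) | set(pairs)


class PresenceNaiveBayes:
    """Naive Bayes over presence/absence features with add-one smoothing."""

    def fit(self, examples):
        self.doc_counts = Counter()   # documents seen per label
        self.feature_counts = {}      # label -> documents containing each feature
        for text, label in examples:
            self.doc_counts[label] += 1
            self.feature_counts.setdefault(label, Counter()).update(features(text))
        self.vocab = set().union(*self.feature_counts.values())
        return self

    def predict(self, text):
        present = features(text) & self.vocab
        total = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total)  # class prior
            for feat in self.vocab:
                # Smoothed probability that a document of this class has the feature.
                p = (self.feature_counts[label][feat] + 1) / (n_docs + 2)
                score += math.log(p if feat in present else 1 - p)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy data; "A" and "B" are placeholders for the two author classes.
toy = [
    ("the red apple fell", "A"),
    ("red apple pie", "A"),
    ("blue sky above", "B"),
    ("the blue sky cleared", "B"),
]
model = PresenceNaiveBayes().fit(toy)
guess = model.predict("a red apple")
```

On real data you would swap the toy list for labeled text from your corpus and check the guesses against a held-out sample before trusting any of them.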
This one should be (comparatively) easy.
If you're using Python, even something as simple as the Natural Language Toolkit (cf: nltk.org) and their book should get you a lot of the way there.
Here's another web site that claims to do this: GenderAnalyzer. However, it relies on another website called uClassify.com, which is down as I write this. They have a contact link at the bottom for questions.
It sounds like an academic outfit: "In our lab it seems to works pretty well".
There is a whole set of two-class analyzers that could be adapted here... spam-blocking and identification software. It still requires the user to get male-written text (treated as spam) and female text (treated as ham, or the reverse), but many should work.
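As a rough illustration of that adaptation, the sketch below is loosely modeled on the per-token probability combining that Bayesian spam filters use. It is a simplified stand-in written for this thread, not the code of any particular filter: the class `TwoClassTokenFilter`, its methods, the clipping values and the `top_n` cutoff are all assumptions, and the training strings are invented toy data with placeholder labels "A" and "B".

```python
import re
from collections import Counter


class TwoClassTokenFilter:
    """Spam-filter-style scorer; 'spam' and 'ham' are just two labels."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # docs containing token
        self.docs = {True: 0, False: 0}                     # docs seen per class

    @staticmethod
    def tokens(text):
        return set(re.findall(r"[a-z']+", text.lower()))

    def train(self, text, is_spam):
        self.docs[is_spam] += 1
        self.counts[is_spam].update(self.tokens(text))

    def token_prob(self, token):
        # Share of 'spam' documents containing the token vs. 'ham' documents.
        spam = self.counts[True][token] / max(1, self.docs[True])
        ham = self.counts[False][token] / max(1, self.docs[False])
        if spam + ham == 0:
            return 0.5  # unseen token: no evidence either way
        return min(0.99, max(0.01, spam / (spam + ham)))

    def spam_probability(self, text, top_n=15):
        # Keep only the tokens whose probability is farthest from 0.5.
        probs = sorted(
            (self.token_prob(t) for t in self.tokens(text)),
            key=lambda p: abs(p - 0.5),
            reverse=True,
        )[:top_n]
        p_all, q_all = 1.0, 1.0
        for p in probs:
            p_all *= p
            q_all *= 1 - p
        return p_all / (p_all + q_all)


# Arbitrary mapping: class "A" plays the role of spam, class "B" of ham.
spam_filter = TwoClassTokenFilter()
for text, label in [
    ("red apple pie", "A"),
    ("the red apple fell", "A"),
    ("blue sky above", "B"),
    ("the blue sky cleared", "B"),
]:
    spam_filter.train(text, is_spam=(label == "A"))

score = spam_filter.spam_probability("a red apple")
guess = "A" if score > 0.5 else "B"
```

Which group you call "spam" makes no difference to the result; a score near 0.5 just means the filter has no real opinion about the text.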
Hi, you can try a gender classifier on text strings here: http://uclassify.com/browse/uClassify/gender_v3