Suppose you have a repository of 10,000 function names, and possibly their frequency of use, drawn from a corpus of code in C, C#, or C++ (each of which usually prescribes different naming conventions).

Some samples might be:

DoPaint
OnPaint
CloseWindow
DeleteGraphOnClose
FreeConnection
ConnectInternat (a small typo, but part of the code)
FreeSoH

Now, given a function name, how can we predict whether it follows the conventions of a human-generated name?

Note:

  1. Obviously, all candidate names will be valid identifiers.
  2. Generated names can contain arbitrary characters and should be treated as bad.
  3. Letter cases can get garbled.

Some candidates:

Z090292 - not likely
onDelete - likely
CloseWindow - likely
iGetIndex - unlikely

Any pointers on techniques and software are welcome.

+1  A: 

Split the identifiers into individual words (based on capitalization), and put the words through a spell checker (such as ispell). Consider all words with spelling errors as non-human-generated, along with the identifiers in which they occur.
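A minimal sketch of that approach in Python (assumptions: a plain word list at /usr/share/dict/words stands in for ispell, and a simple regex does the splitting):

    import re

    # Assumption: a system word list stands in for ispell/aspell here.
    with open("/usr/share/dict/words") as f:
        DICTIONARY = {line.strip().lower() for line in f}

    def split_identifier(name):
        """Split a camel-case identifier into its component words."""
        return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", name)

    def looks_human(name):
        """Flag the identifier as human-generated only if every word spell-checks."""
        words = split_identifier(name)
        return bool(words) and all(w.lower() in DICTIONARY for w in words)

    print(looks_human("CloseWindow"))      # True
    print(looks_human("ConnectInternat"))  # False - "Internat" fails the check
    print(looks_human("Z090292"))          # False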

Martin v. Löwis
Especially for technical nouns like StartTeredoTunnelling, most words could get flagged as bad.
DotDot
A: 

Predicting if it's human-generated is a very tricky question. Analyzing the code base to find the function names is easier - you might look at tools such as NDepend.

TrueWill
A: 

You can probably detect camel case. Also, you could possibly do a regex search for typical words like do, get, set, in, etc. before the next capitalized word.
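A rough illustration of that heuristic (the prefix list and regex below are just assumptions, not an exhaustive set):

    import re

    # Hypothetical list of words humans tend to start function names with.
    COMMON_PREFIXES = ("do", "get", "set", "in", "on", "is", "close", "free", "delete")

    # Matches lowerCamelCase or UpperCamelCase identifiers.
    CAMEL_CASE = re.compile(r"^[a-z]+(?:[A-Z][a-z]+)+$|^(?:[A-Z][a-z]+)+$")

    def looks_conventional(name):
        """Heuristic: camel case plus a recognised prefix suggests a human author."""
        return bool(CAMEL_CASE.match(name)) and name.lower().startswith(COMMON_PREFIXES)

    print(looks_conventional("onDelete"))  # True
    print(looks_conventional("Z090292"))   # False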

Jeff
+1  A: 

A friend of mine might help. He is doing a PhD on this very subject, as far as I can tell.

Home page

Peder Skou
+2  A: 

You could try conducting some Bayesian analysis on the text:

  1. Load the list of names (and their frequencies) into your program. It might be worth tokenising the names at this point, so that e.g. CloseWindow becomes Close and Window, with the frequency of both incremented. At this point it would also be useful to load in some non-human function names to train the program on negatives as well.
  2. Take a function name and, using the data you have just gathered, find the probability of each of its tokens coming up:

    P(Human Generated | Seeing the Token) = P(Seeing the Token | Human Generated) * P(Human Generated) / P(Seeing the Token)

In this case the prior probability of something being human- or computer-generated would be decided based on known knowledge, i.e. what percentage of function names are thought to be human-generated.

The probability of seeing the token, P(Seeing the Token), would have to evolve gradually. It would consist of the number of times the token is seen in human-written functions and the number of times it is seen in computer-generated functions. This solution is based on the premise that the program learns over time (and thus needs to be trained).

The result, P(Human Generated | Seeing the Token), will give you the probability of the function name having been generated by a human.

NB: This is only a rough outline; many details are missing. If you are interested in this line of investigation, I would suggest reading up on probability theory and in particular Bayesian analysis.
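A compact sketch of that idea in Python (the tiny training lists and the 0.5 prior are placeholders; a real version would be trained on the 10,000-name repository plus a set of known generated names):

    import re
    from collections import Counter

    def tokenize(name):
        """Split an identifier into lower-cased tokens on capitalisation boundaries."""
        return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)]

    human_counts, machine_counts = Counter(), Counter()

    def train(names, counts):
        for name in names:
            counts.update(tokenize(name))

    # Placeholder training data.
    train(["DoPaint", "OnPaint", "CloseWindow", "FreeConnection"], human_counts)
    train(["Z090292", "tmp8f3a", "q9x2"], machine_counts)

    def p_human(name, prior=0.5):
        """Naive-Bayes estimate of P(Human Generated | tokens), with add-one smoothing."""
        p_h, p_m = prior, 1.0 - prior
        vocab = len(set(human_counts) | set(machine_counts)) + 1
        for token in tokenize(name):
            p_h *= (human_counts[token] + 1) / (sum(human_counts.values()) + vocab)
            p_m *= (machine_counts[token] + 1) / (sum(machine_counts.values()) + vocab)
        return p_h / (p_h + p_m)   # denominator plays the role of P(Seeing the Token)

    print(p_human("OnClose"))   # leans towards human
    print(p_human("Z090292"))   # leans towards machine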

Jamie Lewis
I see what you suggest; it's similar to spam detection. Great idea.
DotDot
A: 

Using a dictionary as Martin v. Löwis suggested is a good idea, but you also have to remember to account for the following common forms of variable names (a rough sketch handling a couple of these cases follows the list):

  1. Single-letter variable names.
  2. Variable names that use underscores instead of camel case.
  3. Metasyntactic variables.
  4. Hungarian notation.
  5. Keywords/types with a character attached (e.g. $return or list_).
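A minimal sketch of a tokenizer covering underscore names and Hungarian notation (the prefix list below is only an assumption; real code bases will vary):

    import re

    # Example Hungarian-notation prefixes - adjust for your code base.
    HUNGARIAN_PREFIXES = ("sz", "lp", "str", "m_", "i", "b", "p")

    def split_variable(name):
        """Split on underscores first, then on capitalisation within each piece."""
        parts = []
        for chunk in name.split("_"):
            parts += re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", chunk)
        return parts

    def strip_hungarian(name):
        """Drop a leading Hungarian-style prefix, e.g. iGetIndex -> GetIndex."""
        for prefix in HUNGARIAN_PREFIXES:
            rest = name[len(prefix):]
            if name.startswith(prefix) and rest[:1].isupper():
                return rest
        return name

    print(split_variable("free_connection"))  # ['free', 'connection']
    print(strip_hungarian("iGetIndex"))       # 'GetIndex'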
Imagist