ansaurus

Question

How to Predict if Function Name Follows Convention

Answer 1

+1 A:

Split the identifiers into individual words (based on capitalization), and put the words into a spell checker (such as ispell). Consider all words with spelling errors as non-human-generated, along with the identifiers in which they occur.
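
A rough sketch of that idea in Python, for illustration only. The tiny KNOWN_WORDS set is a hypothetical stand-in for a real spell checker such as ispell, and the function names are made up for this example.

```python
import re

# Hypothetical stand-in for a real spell checker such as ispell:
# a tiny hard-coded vocabulary, for illustration only.
KNOWN_WORDS = {"close", "window", "start", "get", "set", "file", "name", "tunnelling"}


def split_identifier(name):
    """Split an identifier on underscores and capitalization boundaries."""
    words = []
    for part in name.split("_"):
        # Runs of capitals (acronyms), capitalized or lowercase words, digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return words


def unknown_words(name, vocabulary=KNOWN_WORDS):
    """Return the words of an identifier that are not in the vocabulary."""
    return [w for w in split_identifier(name) if w.lower() not in vocabulary]


def looks_human(name, vocabulary=KNOWN_WORDS):
    """Treat a name as human-written only if every one of its words is known."""
    words = split_identifier(name)
    return bool(words) and not unknown_words(name, vocabulary)


if __name__ == "__main__":
    for candidate in ("CloseWindow", "get_file_name", "StartTeredoTunnelling"):
        print(candidate, looks_human(candidate), unknown_words(candidate))
```

With a vocabulary this small, a technical word such as Teredo is reported as unknown, so the quality of the word list matters a lot.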

Martin v. Löwis 2009-08-29 21:43:52

Especially for technical nouns like StartTeredoTunnelling most words could get flagged as bad.

DotDot 2009-08-29 22:50:45

Answer 2

A:

Predicting if it's human-generated is a very tricky question. Analyzing the code base to find the function names is easier - you might look at tools such as NDepend.

TrueWill 2009-08-29 21:45:18

Answer 3

A:

You can probably detect camel case. Also, you could possibly do a regex search for typical words like do, get, set, in, etc. before the next capitalized word.
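
Something along these lines might work as a starting point. It is only a sketch: the prefix list contains just the four words mentioned above, and the patterns and function names are assumptions made for this example.

```python
import re

# Only the words named in the answer; extend as needed.
COMMON_PREFIXES = ("do", "get", "set", "in")

# Each prefix in lowercase or capitalized form, followed by a capital letter
# that starts the next word (checked with a lookahead, not consumed).
_alternatives = "|".join(p + "|" + p.capitalize() for p in COMMON_PREFIXES)
PREFIX_PATTERN = re.compile(r"^(?:%s)(?=[A-Z])" % _alternatives)

# A first word followed by at least one capitalized word, e.g. getName or CloseWindow.
# This simple pattern rejects acronyms (getHTTP) and all-caps names.
CAMEL_CASE = re.compile(r"^[A-Za-z][a-z0-9]+(?:[A-Z][a-z0-9]+)+$")


def is_camel_case(name):
    """True if the name looks like camelCase or PascalCase."""
    return CAMEL_CASE.match(name) is not None


def starts_with_common_word(name):
    """True if the name starts with a typical word right before a capitalized word."""
    return PREFIX_PATTERN.match(name) is not None


if __name__ == "__main__":
    for candidate in ("getName", "SetValue", "settings", "insertRow"):
        print(candidate, is_camel_case(candidate), starts_with_common_word(candidate))
```

The lookahead is what keeps settings and insertRow from being counted as set and in prefixes.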

Jeff 2009-08-29 21:46:21

Answer 4

+1 A:

A friend of mine might help. He is doing a PhD on this very subject, as far as I can tell.

Home page

Peder Skou 2009-08-29 21:48:43

Answer 5

+2 A:

You could try conducting some Bayesian analysis on the text:

Load the list of names (and their frequencies) into your program. It might be worth tokenising the names at this point, so that e.g. CloseWindow becomes Close and Window, with the frequency of both incremented. It would also be useful to load in some non-human-generated function names, so that the program is trained on negatives as well.
Take a function name and, using the data you have just gathered, find the probability of each part coming up:

P(HumanGenerated | Seeing the Token) = P(Seeing the Token | HumanGenerated) * P(HumanGenerated) / P(Seeing the Token)

In this case the prior, i.e. the probability of something being human or computer generated, would be decided based on what is already known, such as what percentage of function names are thought to be human generated.

The probability of seeing the token, P(Seeing the Token), would have to evolve gradually. It would be built from the number of times the token is seen in human functions and the number of times it is seen in computer functions. This solution is based on the premise that the program learns over time (and thus needs to be trained).

The result, P(HumanGenerated | Seeing the Token), will give you the probability of the function name being generated by a human.

NB: This is only a rough outline; many details are missing. If you are interested in this line of investigation, then I would suggest reading up on probability theory, and in particular Bayesian analysis.
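
To make the outline a bit more concrete, here is a minimal naive Bayes sketch in Python. Several things in it are assumptions that are not part of the outline: the class and method names, the add-one smoothing for unseen tokens, the default prior of 0.5, and the way the per-token probabilities are combined (by treating tokens as independent). The training names are invented for the example.

```python
import re
from collections import Counter


def tokenize(name):
    """Split a name such as CloseWindow into lowercase tokens (close, window)."""
    return [t.lower() for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)]


class NameClassifier:
    def __init__(self, prior_human=0.5):
        # P(HumanGenerated): an assumed prior, to be set from what you know.
        self.prior_human = prior_human
        self.human = Counter()    # token counts from human-written names
        self.machine = Counter()  # token counts from generated names

    def train(self, name, is_human):
        (self.human if is_human else self.machine).update(tokenize(name))

    def _likelihood(self, token, counts):
        # P(Token | class) with add-one smoothing (an assumption of this
        # sketch) so that unseen tokens do not produce zero probabilities.
        vocabulary = len(set(self.human) | set(self.machine))
        return (counts[token] + 1) / (sum(counts.values()) + vocabulary + 1)

    def p_human_given_token(self, token):
        """Bayes' rule for a single token."""
        p_token_human = self._likelihood(token, self.human)
        p_token_machine = self._likelihood(token, self.machine)
        # P(Token), expanded over both classes.
        p_token = (p_token_human * self.prior_human
                   + p_token_machine * (1 - self.prior_human))
        return p_token_human * self.prior_human / p_token

    def p_human(self, name):
        """Combine all tokens of a name, treating them as independent."""
        human_score = self.prior_human
        machine_score = 1 - self.prior_human
        for token in tokenize(name):
            human_score *= self._likelihood(token, self.human)
            machine_score *= self._likelihood(token, self.machine)
        return human_score / (human_score + machine_score)


if __name__ == "__main__":
    clf = NameClassifier()
    for name in ("CloseWindow", "OpenFile", "GetWindowTitle"):
        clf.train(name, is_human=True)
    for name in ("sub_401A3F", "Xq7Zk", "fn_0042"):
        clf.train(name, is_human=False)
    print(clf.p_human("CloseFile"), clf.p_human("sub_0042"))
```

For long names the product of many small probabilities can underflow, so a real implementation would probably sum logarithms instead.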

Jamie Lewis 2009-08-29 21:57:59

I see what you suggest, similar to spam detection, great idea

DotDot 2009-08-30 15:29:19

Answer 6

A:

Using a dictionary as Martin v. Löwis suggested is a good idea, but you also have to remember to account for the following common forms of variable names:

Single-letter variable names.
Variable names that use underscores instead of camel case.
Metasyntactic variables.
Hungarian notation.
Keywords/types with a character attached (e.g. $return or list_).
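
As a rough illustration, the forms listed above could be recognised before the dictionary check with something like the following Python sketch. The word lists, the Hungarian prefixes, and the category labels are all assumptions chosen for the example and are far from complete.

```python
import keyword
import re

# Illustrative lists only; none of them is exhaustive.
METASYNTACTIC = {"foo", "bar", "baz", "qux", "quux"}
RESERVED = set(keyword.kwlist) | {"list", "dict", "int", "str", "type"}

# A few Hungarian-style prefixes followed by a capitalized word, e.g. szName.
# This will also catch ordinary names such as iPhone.
HUNGARIAN = re.compile(r"^(?:sz|lp|dw|str|n|b|p|i)[A-Z]\w*$")
SNAKE_CASE = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$")


def classify_name(name):
    """Return a rough label for the naming form of an identifier."""
    if len(name) == 1 and name.isalpha():
        return "single-letter"
    if name.lower() in METASYNTACTIC:
        return "metasyntactic"
    # A keyword or type name with $ or _ attached, e.g. $return or list_.
    stripped = name.strip("$_")
    if stripped != name and stripped in RESERVED:
        return "decorated-keyword"
    if HUNGARIAN.match(name):
        return "hungarian"
    if SNAKE_CASE.match(name):
        return "snake_case"
    return "other"


if __name__ == "__main__":
    for candidate in ("i", "foo", "$return", "list_", "szName", "get_file_name"):
        print(candidate, classify_name(candidate))
```

Names labelled "other" would then go on to the word-splitting and dictionary step.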

Imagist 2009-08-29 22:24:37
