What would be the best way to detect what programming language is used in a snippet of code?
It would depend on what type of snippet you have, but I would run it through a series of tokenizers and see which language's BNF it came up as valid against.
First, I would try to find the specific keyworks of a language e.g.
"package, class, implements "=> JAVA
"<?php " => PHP
"include main fopen strcmp stdout "=>C
"cout"=> C++
etc...
I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.
I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:
print "Hello"
Let me find the code.
I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:
def foo
puts "hi"
end
is Python code (although it really is Ruby). This is because Python has a def
keyword too. So if it has seen 1000x def
in Python and 100x def
in Ruby then it may still say Python even though puts
and end
is Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).
I hope it helps you:
class Classifier
def initialize
@data = {}
@totals = Hash.new(1.0)
end
def words(code)
code.split(/[^a-z]/).reject{|w| w.empty?}
end
def train(code,lang)
@data[lang] ||= Hash.new(1)
for word in words(code)
@data[lang][word] += 1
@totals[word] += 1
end
end
def prob(words,lang)
# We really want to multiply here but I use logs
# to avoid floating point underflow
# (adding logs is equivalent to multiplication)
words.inject(0.0){|c,w| c + Math.log(@data[lang][w]/@totals[w])}
end
def classify(code)
ws = words(code)
@data.keys.sort_by{|lang| prob(ws,lang)}.last
end
end
# Example usage
c = Classifier.new
# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)
# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
Nice puzzle.
I think it is imposible to detect all languages. But you could trigger on key tokens. (certain reserved words and often used character combinations).
Ben there are a lot of languages with similar syntax. So it depends on the size of the snippet.
I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages/classes of languages (e.g. curly brackets for C-style language, the Dim and Sub keywords for BASIC languages, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.
It's very hard and sometimes impossible. Which language is this short snippet from?
int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
j = j + 1000 / i;
k = k + i * j;
}
(Hint: It could be any one out of several.)
You can try to analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text it's likely that the language is Java etc. But I don't think you will get anything that is completely fool proof, as you could name for example a variable in C the same name as a keyword in Java, and the frequency analysis will be fooled.
If you take it up a notch in complexity you could look for structures, if a certain keyword always comes after another one, that will get you more clues. But it will also be much harder to design and implement.
You might find some useful material here: http://alexgorbatchev.com/wiki/SyntaxHighlighter. Alex has spent a lot of time figuring out how to parse a large number of different languages, and what the key syntax elements are.