views:

957

answers:

7

What would be the best way to detect what programming language is used in a snippet of code?

+1  A: 

It would depend on what type of snippet you have, but I would run it through a series of tokenizers and see which language's BNF it came up as valid against.

Jekke
All languages can't even be described by a BNF. If you're allowed to redefine keywords and create macros it gets much harder. Alså as we're talking about a snippet you would have to do partial match against a BNF, which is harder and more error prone.
DeletedAccount
+1  A: 

First, I would try to find the specific keyworks of a language e.g.

"package, class, implements "=> JAVA
"<?php " => PHP
"include main fopen strcmp stdout "=>C
"cout"=> C++
etc...
Pierre
+29  A: 

I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.

I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:

print "Hello"

Let me find the code.

I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:

def foo
   puts "hi"
end

is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen 1000x def in Python and 100x def in Ruby then it may still say Python even though puts and end is Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).

I hope it helps you:

class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1.0)
  end

  def words(code)
    code.split(/[^a-z]/).reject{|w| w.empty?}
  end

  def train(code,lang)
    @data[lang] ||= Hash.new(1)
    for word in words(code)
      @data[lang][word] += 1
      @totals[word] += 1
    end
  end

  def prob(words,lang)
    # We really want to multiply here but I use logs 
    # to avoid floating point underflow
    # (adding logs is equivalent to multiplication)
    words.inject(0.0){|c,w| c + Math.log(@data[lang][w]/@totals[w])}
  end

  def classify(code)
    ws = words(code)
    @data.keys.sort_by{|lang| prob(ws,lang)}.last
  end
end

# Example usage

c = Classifier.new

# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)

# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
Jules
I also need to use it in forum software. Thanks for the tip about the Bayesian filtering.
triton
huh, you learn something new everyday around here; +1
jcollum
Thanks for the example code.
triton
+1  A: 

Nice puzzle.

I think it is imposible to detect all languages. But you could trigger on key tokens. (certain reserved words and often used character combinations).

Ben there are a lot of languages with similar syntax. So it depends on the size of the snippet.

Gamecat
A: 

I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages/classes of languages (e.g. curly brackets for C-style language, the Dim and Sub keywords for BASIC languages, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.

Noldorin
+1  A: 

It's very hard and sometimes impossible. Which language is this short snippet from?

int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
    j = j + 1000 / i;
    k = k + i * j;
}

(Hint: It could be any one out of several.)

You can try to analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text it's likely that the language is Java etc. But I don't think you will get anything that is completely fool proof, as you could name for example a variable in C the same name as a keyword in Java, and the frequency analysis will be fooled.

If you take it up a notch in complexity you could look for structures, if a certain keyword always comes after another one, that will get you more clues. But it will also be much harder to design and implement.

DeletedAccount
A: 

You might find some useful material here: http://alexgorbatchev.com/wiki/SyntaxHighlighter. Alex has spent a lot of time figuring out how to parse a large number of different languages, and what the key syntax elements are.

Steve