Hi all,
i'm supposed to write code which when given a text file (source code) as input will output which programming language is it. This is the most basic definition of the problem. More constraints follow:
- I must write this in C++.
- A wide variety of languages should be recognized - html, php, perl, ruby, C, C++, Java, C#...
- Amount of false positives (wrong recognition) should be low - better to output "unknown" than a wrong result. (it will be in the list of probabilities for example as unknown: 100%, see below)
- The output should be a list of probabilities for each language the code knows, so if it knows C, Java and Perl, the output should be for example: C: 70%, Java: 50%, Perl: 30% (note there is no need to have the probabilities sum up to 100%)
- It should have a good ratio of accuracy/speed (speed is a bit more favored)
It would be very nice if the code could be written in a way that adding new languages for recognition will be fairly easy and involve just adding "settings/data" for that particular language. I can use anything available - a heuristic, a neural network, black magic. Anything. I'am even allowed to use existing solutions, but: the solution must be free, opensource and allow commercial usage. It must come in form of easily integrable source code or as a static library - no DLL. However i prefer writing my own code or just using fragments of another solution, i'm fed up with integrating code of others. Last note: maybe some of you will suggest FANN (fast artificial neural network library) - this is the only thing i cannot use, since this is the thing we use ALREADY and we want to replace that.
Now the question is: how would you handle such a task, what would you do? Any suggestions how to implement this or what to use?
EDIT: based on the comments and answers i must emphasize some things i forgot: speed is very crucial, since this will get thousands of files and is supposed to answer fast, so looking at a thousand files should produce answers for all of them in a few seconds at most (the size of files will be small of course, a few kB each one). So trying to compile each one is out of question. The thing is, that i really want probabilities for each language - so i rather want to know that the file is likely to be C or C++ but that the chance it is a bash script is very low. Due to code obfuscation, comments etc. i think that looking for a 100% accurate code is a bad idea and in fact is not the goal of this.