Hello,
I'm building an application that receives source code as input and analyzes several aspects of it. It accepts code in many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (though many languages are unsupported, e.g. Ada, COBOL, Fortran). Once the language is known, my application knows what to do (I have a different handler for each language).
Currently I ask the user to select the programming language the code is written in, and this is error-prone: although users know which language they are using, a small percentage of them occasionally click the wrong option out of carelessness, and that breaks the system (i.e. my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
- I'm receiving pure text and not file names, so I can't use the extension as a hint.
- The user is not required to input complete source files, and can also input code snippets (i.e. the include/import section may be missing).
- It's clear to me that no algorithm I choose will be 100% foolproof, certainly not for very short inputs (e.g. a snippet that is valid in both Python and Ruby). In those cases I will still need the user's assistance, but I would like to keep user involvement to a minimum in order to reduce mistakes.
Examples:
- If the text contains "x->y()", I may know for sure it's C++ (?)
- If the text contains "public static void main", I may know for sure it's Java (?)
- If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
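To make the idea concrete, here is a rough sketch of the kind of token scoring I have in mind. Everything in it is an assumption for illustration: the `SIGNATURES` table, the weights, the thresholds and the `guess_language` name are made up by hand and are not a tested classifier. It returns `None` when the evidence is weak or two languages score too closely, which is the case where I would still ask the user.

```python
import re

# Hypothetical signature patterns with hand-picked weights (illustrative only).
# A higher weight means the pattern is assumed to be more specific to the language.
SIGNATURES = {
    "Java": [
        (r"\bpublic\s+static\s+void\s+main\b", 5),
        (r"\bSystem\.out\.println\b", 3),
    ],
    "C++": [
        (r"\bstd::", 4),
        (r"#include\s*<\w+>", 2),
        (r"\w+->\w+\(", 1),  # weak hint: similar syntax exists in other languages
    ],
    "Pascal": [
        # Pascal is case-insensitive, hence the inline (?i) flag.
        (r"(?i)\bfor\s+\w+\s*:=\s*.+?\s+to\s+.+?\s+do\b", 5),
        (r"(?i)\bbegin\b", 1),
    ],
    "Python": [
        (r"^\s*def\s+\w+\(.*\)\s*:", 4),
        (r"^\s*import\s+\w+", 1),
    ],
}


def score_languages(text):
    """Sum the weights of all patterns that occur at least once in the text."""
    scores = {}
    for lang, patterns in SIGNATURES.items():
        scores[lang] = sum(
            weight
            for pattern, weight in patterns
            if re.search(pattern, text, re.MULTILINE)
        )
    return scores


def guess_language(text, min_score=3, min_margin=2):
    """Return the best-scoring language, or None if the result is ambiguous."""
    ranked = sorted(score_languages(text).items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best_score = ranked[0]
    runner_up_score = ranked[1][1]
    if best_score < min_score or best_score - runner_up_score < min_margin:
        return None  # not enough evidence: fall back to asking the user
    return best_lang


print(guess_language("public static void main(String[] args) { }"))
print(guess_language("for x := y to z do begin"))
print(guess_language("x->y()"))
```

With these made-up weights, the first two snippets are classified as Java and Pascal, while "x->y()" alone scores too low and yields `None`, so I would not treat it as proof of C++.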
My questions:
- Are you familiar with any standard library or method for automatically detecting the language of a piece of source code?
- What are the unique code "tokens" that would let me reliably distinguish one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks