Writing a python script and it needs to find out what language a block of code is written in. I could easily write this myself, but I'd like to know if a solution already exists.
Pygments is insufficient and unreliable.
Writing a python script and it needs to find out what language a block of code is written in. I could easily write this myself, but I'd like to know if a solution already exists.
Pygments is insufficient and unreliable.
what language a block of code is written in
What are your alternatives, among what languages? There is no way to determine this universally. But if you narrow your focus there is probably a tool somewhere
Vim uses a bunch of interesting tests and regular expressions to look for certain file formats. You can look at the vim instruction file at vim/vim71/filetype.vim
, or here online.
I guess you should try what this very site uses: google-code-prettify (from this question)
[EDIT]J.F. Sebastian pointed me to Pygments (see this answer)
This can be a little difficult to do reliably. For example, what language is the following:
print("blah");
The most reliable way (aside from having the user select the correct language, of course) is to check if the first line is starts with #!
("hashbang") - whatever is after this is the intepreter for the scripting language.
That will work reliably for a lot of scripting languages (including python, shell scripting, perl, ruby etc etc..), but not for compiled languages..
You could look for unique syntax stylings, or specific keywords and weight each one towards a specific language. For example $#somevar
is probably Perl. somevar.each do |another| ..... end
is probably ruby.. but this would end up being a lot of work, and will not always work (especially with short code blocks)
The other obvious way is to use the file-extension. If it's *.pl
it's probably Perl code..
What are you trying to achieve? If you want to syntax highlight, look at what google-code-prettify does - basically a reasonably intelligent, generic syntax highlighter..
In the above above ambiguous example, print
is probably a statement or function name, "blah"
is probably a string. If you highlight those two differently, you've successfully highlighted a lot of different languages, without having to detect what one it actually is.. but that may not always work, depending on the task..
With human languages, you can use something like this Language Recognition Chart with a remarkable degree of accuracy, even where you have to rely on small distinctions. (Discussed on SO here.) A similar chart for the languages you are targeting could probably also find a small number of unique characteristics; say for python double underscores, library names, etc.
You can check highlight.js which automatically highlights the code block, they say they are using some kind of heuristic methods to accomplish this http://softwaremaniacs.org/soft/highlight/en/
Ohcount has been developed for this exactly: http://labs.ohloh.net/ohcount
They are using it at www.ohloh.net to count the contribution of people in languages.
The bad news is that it is coded in ruby, but I am sure that you can integrate it one way or the other in python.
Pygments can guess too. Here is an example from the documentation:
>>> from pygments.lexers import guess_lexer, guess_lexer_for_filename
>>> guess_lexer('#!/usr/bin/python\nprint "Hello World!"')
<pygments.lexers.PythonLexer>
>>> guess_lexer_for_filename('test.py', 'print "Hello World!"')
<pygments.lexers.PythonLexer>