I'm writing a Python script that needs to work out what language a block of code is written in. I could write this myself, but I'd like to know whether a solution already exists.

Pygments is insufficient and unreliable.

+1  A: 

what language a block of code is written in

Which languages do you need to choose between? There is no way to determine this universally, but if you narrow your focus to a known set of languages, there is probably a tool for it somewhere.

Robert Gould
+2  A: 

Vim uses a set of tests and regular expressions to detect file types. You can look at its detection script at vim/vim71/filetype.vim, or here online.

Gaurav
+5  A: 

I guess you should try what this very site uses: google-code-prettify (from this question)

[EDIT] J.F. Sebastian pointed me to Pygments (see this answer).

Aaron Digulla
May I mention that this site is terrible at doing this? (At least in my opinion.)
Claudiu
A little is better than nothing. Markdown is OSS, so he can fix anything he needs.
Aaron Digulla
Markdown does no syntax colouring at all; it just turns *blah* into <em>blah</em> tags and other such formatting. The syntax highlighting is done by google-code-prettify (in JavaScript). I added it to your answer, hope you don't mind.
dbr
So "all" he has to do is convert the library from JavaScript to Python (what he uses)...
bart
What better solution is there when there is no existing Python library?
Aaron Digulla
@Aaron: http://stackoverflow.com/questions/325165/is-there-a-library-that-will-detect-the-source-code-language-of-a-block-of-code#325521
J.F. Sebastian
+2  A: 

This can be a little difficult to do reliably. For example, what language is the following:

print("blah");

The most reliable way (aside from having the user select the correct language, of course) is to check whether the first line starts with #! ("hashbang"). Whatever follows it is the interpreter for the scripting language.

That will work reliably for a lot of scripting languages (including Python, shell, Perl, Ruby, etc.), but not for compiled languages, and only when the script actually includes the line.
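As a rough illustration of the hashbang check, here is a minimal sketch. The function name `interpreter_from_shebang` is hypothetical, and the handling of `/usr/bin/env` is an assumption about how such lines are commonly written rather than a complete parser.

```python
import os


def interpreter_from_shebang(source):
    """Return the interpreter named on a "#!" first line, or None."""
    first_line = source.split("\n", 1)[0].strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    name = os.path.basename(parts[0])
    # "#!/usr/bin/env python3" names the real interpreter as an argument,
    # so skip env's own options and VAR=value assignments.
    if name == "env":
        args = [p for p in parts[1:] if not p.startswith("-") and "=" not in p]
        if not args:
            return None
        name = os.path.basename(args[0])
    return name
```

The result is an interpreter name such as `python` or `ruby`, which still has to be mapped to whatever language labels you use.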

You could look for unique syntax or specific keywords, and weight each one towards a specific language. For example, $#somevar is probably Perl, and somevar.each do |another| ... end is probably Ruby. But this would end up being a lot of work, and it will not always work (especially with short code blocks).
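A minimal sketch of that weighting idea follows. The patterns and weights in `HINTS` are hypothetical, hand-picked examples for three languages; a usable table would need far more entries and some tuning.

```python
import re

# Hypothetical (pattern, weight) hints per language, for illustration only.
HINTS = {
    "perl": [(r"\$#\w+", 3), (r"\bmy\s+[$@%]\w+", 2), (r"\buse\s+strict\b", 3)],
    "ruby": [(r"\.each\s+do\s*\|\w+\|", 3), (r"^\s*end\s*$", 1), (r"\bputs\b", 1)],
    "python": [(r"^\s*def\s+\w+\(.*\)\s*:", 3), (r"\b__\w+__\b", 2),
               (r"^\s*import\s+\w+", 1)],
}


def score_languages(source):
    """Sum weight * match count for every hint, per language."""
    scores = {}
    for language, hints in HINTS.items():
        total = 0
        for pattern, weight in hints:
            total += weight * len(re.findall(pattern, source, re.MULTILINE))
        scores[language] = total
    return scores


def guess_language(source):
    """Return the best-scoring language, or None if no hint matched."""
    scores = score_languages(source)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A snippet that matches no hint at all, such as the ambiguous `print("blah");` line above, yields `None` rather than a guess.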

The other obvious way is to use the file extension. If it's *.pl, it's probably Perl code.
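The extension check can be sketched as a simple lookup. The `EXTENSION_TO_LANGUAGE` table below is a hypothetical, deliberately tiny mapping, not a complete list.

```python
import os

# Hypothetical, deliberately small table for illustration.
EXTENSION_TO_LANGUAGE = {
    ".pl": "perl",
    ".py": "python",
    ".rb": "ruby",
    ".sh": "shell",
    ".c": "c",
    ".js": "javascript",
}


def language_from_filename(filename):
    """Return a language for a known extension, or None."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_LANGUAGE.get(extension.lower())
```

This is only ever a "probably": `.pl`, for instance, is also used for Prolog source files.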

What are you trying to achieve? If you want syntax highlighting, look at what google-code-prettify does: it is basically a reasonably intelligent, generic syntax highlighter.

In the ambiguous example above, print is probably a statement or function name, and "blah" is probably a string. If you highlight those two differently, you've successfully highlighted a lot of different languages without having to detect which one it actually is. That may not always work, though, depending on the task.
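To illustrate highlighting without detection, here is a minimal sketch of a language-agnostic tokenizer. It is not how google-code-prettify is implemented; the token categories and the `classify_tokens` name are assumptions made for this example.

```python
import re

# Generic token classes that look similar across many languages.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\b\d+(?:\.\d+)?\b)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<punctuation>[^\s\w])
    """,
    re.VERBOSE,
)


def classify_tokens(source):
    """Return (category, text) pairs without knowing the language."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(source)]
```

For `print("blah");` this produces a name, an opening parenthesis, a string, a closing parenthesis and a semicolon, which is enough to colour the line sensibly whichever language it came from.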

dbr
A: 

With human languages, you can use something like this Language Recognition Chart with a remarkable degree of accuracy, even where you have to rely on small distinctions. (Discussed on SO here.) A similar chart for the languages you are targeting could probably also be built from a small number of unique characteristics; for Python, say, double underscores, library names, etc.

bvmou
+1  A: 

You can check highlight.js, which highlights code blocks automatically. They say they use heuristics to detect the language: http://softwaremaniacs.org/soft/highlight/en/

M. Utku ALTINKAYA
+2  A: 

Ohcount has been developed for this exactly: http://labs.ohloh.net/ohcount

It is used at www.ohloh.net to count people's contributions per language.

The bad news is that it is written in Ruby, but I am sure you can integrate it into Python one way or another.

Bluebird75
+4  A: 

Pygments can guess too. Here is an example from the documentation:

>>> from pygments.lexers import guess_lexer, guess_lexer_for_filename

>>> guess_lexer('#!/usr/bin/python\nprint "Hello World!"')
<pygments.lexers.PythonLexer>

>>> guess_lexer_for_filename('test.py', 'print "Hello World!"')
<pygments.lexers.PythonLexer>
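
Outside the interactive prompt, the same call could be wrapped as in the sketch below. The `guess_language_name` helper is hypothetical; `guess_lexer` raises `pygments.util.ClassNotFound` when nothing matches, and the import guard is there only because Pygments is a third-party package that may not be installed.

```python
def guess_language_name(source):
    """Return the name of Pygments' best-guess lexer, or None."""
    try:
        from pygments.lexers import guess_lexer
        from pygments.util import ClassNotFound
    except ImportError:
        return None  # Pygments is not installed
    try:
        return guess_lexer(source).name
    except ClassNotFound:
        return None  # no lexer matched the text
```

The guess is heuristic, so short or ambiguous snippets may come back with an unexpected lexer.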
Ali A
A: 

As others have said, Pygments will be your best bet.

Alex Gaynor