ansaurus

Question

Is there a library that will detect the source code language of a block of code?

Answer 1

+1 A:

what language a block of code is written in

What are your alternatives, and among which languages? There is no way to determine this universally, but if you narrow your focus there is probably a tool somewhere.

Robert Gould 2008-11-28 06:49:36

Answer 2

+2 A:

Vim uses a bunch of interesting tests and regular expressions to detect file formats. You can look at Vim's filetype detection script at vim/vim71/filetype.vim, or here online.

Gaurav 2008-11-28 07:32:59

Answer 3

+5 A:

I guess you should try what this very site uses: google-code-prettify (from this question)

[EDIT] J.F. Sebastian pointed me to Pygments (see this answer)

Aaron Digulla 2008-11-28 08:01:46

May I mention that this site is terrible at doing this? (At least in my opinion.)

Claudiu 2008-11-28 08:06:52

A little is better than nothing. MarkDown is OSS, so he can fix anything he needs.

Aaron Digulla 2008-11-28 08:22:02

Markdown does no syntax colouring at all; it just turns *blah* into <b></b> tags and other such formatting. The syntax highlighting is done by google-code-prettify (in JavaScript). I added it to your answer, hope you don't mind.

dbr 2008-11-28 08:28:51

So "all" he has to do is convert the Javascript (the library) to Python (what he uses)...

bart 2008-11-28 09:29:14

What better solution is there when there is no existing Python library?

Aaron Digulla 2008-11-28 09:55:37

@Aaron: http://stackoverflow.com/questions/325165/is-there-a-library-that-will-detect-the-source-code-language-of-a-block-of-code#325521

J.F. Sebastian 2008-11-28 12:36:44

Answer 4

+2 A:

This can be a little difficult to do reliably. For example, what language is the following:

print("blah");

The most reliable way (aside from having the user select the correct language, of course) is to check if the first line starts with #! ("hashbang"): whatever comes after this is the interpreter for the scripting language.

That will work reliably for a lot of scripting languages (including Python, shell scripts, Perl, Ruby, etc.), but not for compiled languages.
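A minimal sketch of the hashbang check in Python, for illustration only. The INTERPRETERS table and the function name language_from_shebang are hypothetical and cover just a handful of interpreters; it also assumes that "#!/usr/bin/env NAME" should be treated as NAME.

```python
import re
import shlex

# Hypothetical mapping from interpreter names to language labels;
# extend it to cover the interpreters you care about.
INTERPRETERS = {
    "python": "Python",
    "perl": "Perl",
    "ruby": "Ruby",
    "sh": "Shell",
    "bash": "Shell",
    "zsh": "Shell",
}


def language_from_shebang(source):
    """Return a language label based on the #! line, or None if unknown."""
    lines = source.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line.startswith("#!"):
        return None
    try:
        parts = shlex.split(first_line[2:])
    except ValueError:
        # Unbalanced quotes or similar: treat the line as unrecognised.
        return None
    if not parts:
        return None
    program = parts[0].rsplit("/", 1)[-1]
    # "#!/usr/bin/env python" names the real interpreter as an argument.
    if program == "env":
        args = [p for p in parts[1:] if not p.startswith("-") and "=" not in p]
        if not args:
            return None
        program = args[0].rsplit("/", 1)[-1]
    # Strip a trailing version suffix, e.g. "python3.11" becomes "python".
    program = re.sub(r"[\d.]+$", "", program)
    return INTERPRETERS.get(program)
```

Returning None for anything without a recognised hashbang lets the caller fall back to another heuristic instead of getting a wrong answer.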

You could look for unique syntax stylings, or specific keywords, and weight each one towards a specific language. For example, $#somevar is probably Perl, and somevar.each do |another| ... end is probably Ruby. But this would end up being a lot of work, and will not always work (especially with short code blocks).
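The weighting idea could be sketched roughly as follows. The patterns and weights in HINTS are made-up illustrations (only the Perl $#var and Ruby .each do |x| cases come from the text above), and the names score_languages and guess_language are hypothetical; a real table would need many more rules and some tuning.

```python
import re

# Illustrative (pattern, weight) pairs per language. These are assumptions
# for the sake of the example, not a tuned or complete rule set.
HINTS = {
    "Perl": [
        (r"\$#\w+", 3),
        (r"\bmy\s+[$@%]\w+", 2),
        (r"\buse\s+strict\b", 3),
    ],
    "Ruby": [
        (r"\.each\s+do\s*\|\w+\|", 3),
        (r"^\s*end\s*$", 1),
        (r"\bputs\b", 1),
    ],
    "Python": [
        (r"^\s*def\s+\w+\(.*\):\s*$", 2),
        (r"__\w+__", 2),
        (r"^\s*import\s+\w+", 1),
    ],
}


def score_languages(source):
    """Sum weight * number of matches for every language's patterns."""
    scores = {}
    for language, patterns in HINTS.items():
        total = 0
        for pattern, weight in patterns:
            matches = re.findall(pattern, source, flags=re.MULTILINE)
            total += weight * len(matches)
        scores[language] = total
    return scores


def guess_language(source, min_score=2):
    """Return the best-scoring language, or None if the evidence is weak or tied."""
    scores = score_languages(source)
    best = max(scores, key=scores.get)
    if scores[best] < min_score:
        return None
    if list(scores.values()).count(scores[best]) > 1:
        return None
    return best
```

The min_score threshold is there because short snippets such as print("blah"); match nothing distinctive, and refusing to guess is usually better than guessing wrong.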

The other obvious way is to use the file extension. If it's *.pl, it's probably Perl code.
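As a rough sketch, the extension lookup is just a dictionary. The EXTENSIONS table and the function name language_from_filename are hypothetical, and the table is deliberately tiny.

```python
import os

# Hypothetical extension table; real projects would need a much longer one.
EXTENSIONS = {
    ".pl": "Perl",
    ".pm": "Perl",
    ".py": "Python",
    ".rb": "Ruby",
    ".sh": "Shell",
    ".c": "C",
    ".js": "JavaScript",
}


def language_from_filename(filename):
    """Return a language label for the file's extension, or None if unknown."""
    _, extension = os.path.splitext(filename)
    return EXTENSIONS.get(extension.lower())
```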

What are you trying to achieve? If you want syntax highlighting, look at what google-code-prettify does: it is basically a reasonably intelligent, generic syntax highlighter.

In the ambiguous example above, print is probably a statement or function name, and "blah" is probably a string. If you highlight those two differently, you've successfully highlighted a lot of different languages without having to detect which one it actually is. That may not always work, though, depending on the task.
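To illustrate the language-agnostic approach, here is a small sketch of a generic tokenizer. It is not how google-code-prettify is implemented; the token categories, TOKEN_PATTERN, and generic_tokens are assumptions chosen for the example.

```python
import re

# A deliberately generic token grammar: quoted strings, numbers,
# identifier-like names, and single punctuation characters.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\b\d+(?:\.\d+)?\b)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<punctuation>[^\sA-Za-z_0-9])
    """,
    re.VERBOSE,
)


def generic_tokens(source):
    """Split source into (category, text) pairs without knowing the language."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(source)]


if __name__ == "__main__":
    for category, text in generic_tokens('print("blah");'):
        print(category, text)
```

A highlighter built on this would simply style each category differently, which works the same way whether the snippet is Python, Perl, PHP, or C.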

dbr 2008-11-28 08:40:37

Answer 5

A:

With human languages, you can use something like this Language Recognition Chart with a remarkable degree of accuracy, even where you have to rely on small distinctions. (Discussed on SO here.) A similar chart for the languages you are targeting could probably also find a small number of unique characteristics; for Python, say, double underscores, library names, etc.

bvmou 2008-11-28 09:12:12

Answer 6

+1 A:

You can check highlight.js, which automatically highlights a code block. They say they use some kind of heuristic methods to accomplish this: http://softwaremaniacs.org/soft/highlight/en/

M. Utku ALTINKAYA 2008-11-28 09:53:55

Answer 7

+2 A:

Ohcount has been developed for this exactly: http://labs.ohloh.net/ohcount

They are using it at www.ohloh.net to count the contribution of people in languages.

The bad news is that it is coded in Ruby, but I am sure that you can integrate it one way or another in Python.

Bluebird75 2008-11-28 10:02:36

Answer 8

+4 A:

Pygments can guess too. Here is an example from the documentation:

>>> from pygments.lexers import guess_lexer, guess_lexer_for_filename

>>> guess_lexer('#!/usr/bin/python\nprint "Hello World!"')
<pygments.lexers.PythonLexer>

>>> guess_lexer_for_filename('test.py', 'print "Hello World!"')
<pygments.lexers.PythonLexer>

Ali A 2008-11-28 11:16:17

Answer 9

A:

As others have said, Pygments will be your best bet.

Alex Gaynor 2008-12-06 21:43:24
