How can I determine the mime-type of a file (in OCaml)?

I am trying to set the language for a GtkSourceView control, but to do that, I need to first determine the language. The only way I can see of doing this is using the mime-type - there is a function that will return the correct language as follows:

GSourceView.source_languages_manager#get_language_from_mime_type : string -> source_language option

I really don't want to hard code the language into my source. If it isn't possible to determine the mime-type in OCaml (and I haven't yet found a way, after searching through the documentation), is there perhaps another way I can determine the source language?

+3  A: 

Most languages lack this, so I would be very surprised to find it in OCaml. Apache does it with a mime.types file - you can look there for hints. This is the most usual way - a huge table which maps extensions into mimetypes. You can implement it in OCaml easily:

let mimetype_of_extension = function
    | "txt" | "log" -> "text/plain"
    | "html" | "htm" -> "text/html"
    | "zip" -> "application/zip"
...

Another way is to look at the file contents, but then you basically need to know about the various file formats.
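As a rough illustration of the content-based approach, here is a minimal sketch that checks the first bytes of a file against two well-known signatures (ZIP archives start with "PK\003\004", PDF files with "%PDF-"). The function name sniff_mime_type is hypothetical, and a real detector would need many more signatures:

```ocaml
(* Hypothetical helper: guess a MIME type from the leading bytes of a file.
   Only two signatures are checked; this is an illustration, not a full table. *)
let sniff_mime_type header =
  let starts_with prefix =
    String.length header >= String.length prefix
    && String.sub header 0 (String.length prefix) = prefix
  in
  (* OCaml's "\ddd" escapes are decimal, so "\003\004" is bytes 3 and 4. *)
  if starts_with "PK\003\004" then Some "application/zip"
  else if starts_with "%PDF-" then Some "application/pdf"
  else None
```

The caller would read the first few bytes of the file and pass them in as a string; anything unrecognised yields None.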

That said, it does not help you much, since source files of all languages are normally treated as text/plain. They are not distinguishable by mimetype; and thus I really have no idea what your get_language_from_mime_type function does.

However, filename extensions of various source files are more-or-less standardised, so if you know the extension, you will know the language. Getting the extension is as simple as ripping whatever follows the last period from the filename.

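let extension_of_filename filename =
    (* String.rindex raises Not_found if the filename contains no period. *)
    let pos = (String.rindex filename '.') + 1 in
    let len = String.length filename in
    (* Strings are immutable in current OCaml, so build the result in a Bytes buffer. *)
    let ext = Bytes.create (len - pos) in
    Bytes.blit_string filename pos ext 0 (len - pos);
    Bytes.to_string ext;;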

Well, okay, simple in any language except Brainfuck and OCaml, at least. After that, it's easy - "c" is a C program, as is "h"; "ml" is OCaml; etc.
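A minimal sketch of that last step, assuming a reasonably recent standard library (Filename.extension, which returns the extension including the leading period, or "" if there is none). The function name language_of_filename and the returned language names are hypothetical; ".mli" is added here as an extra OCaml extension:

```ocaml
(* Hypothetical helper: map a filename to a language name via its extension. *)
let language_of_filename filename =
  match String.lowercase_ascii (Filename.extension filename) with
  | ".c" | ".h" -> Some "C"
  | ".ml" | ".mli" -> Some "OCaml"
  | _ -> None  (* unknown or missing extension *)
```

Returning an option keeps the "no extension" and "unknown extension" cases explicit instead of raising Not_found.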

Amadan
@Amadan The OP already has a dependency on GtkSourceView, so he probably wants a function that returns a type listed in .../share/mime/types, which is installed by GtkSourceView or one of its dependencies. That file lists "text/x-erlang", "text/x-eiffel", etc (just going through the "e"s) :) There are no canonical extensions for these types listed in this file though.
Pascal Cuoq
I think get_language_from_mime_type is for ultimately getting a syntax description (highlighting, ...) from one of the configuration files eiffel.lang, erlang.lang, ... in .../share/gtksourceview-2.0/language-specs/
Pascal Cuoq
@PascalCuoq - You're right, those are the mime types that I want to look up - does this mean that I will have to create a big lookup table myself, and return the mime type based on the file extension?
a_m0d
Sorry, I wasn't familiar with GtkSourceView. If you already have a file such as @Pascal describes, you can make a routine to parse it instead of creating a lookup table yourself.
Amadan
+3  A: 

After studying the source code of gedit, which includes this functionality, I have discovered a method in glib which will do this for me. This answer provides an example use of the g_file_info_get_content_type() method. There is also the g_content_type_get_mime_type() method, which is also available in glib.

Unfortunately, there is no wrapping available for these functions yet, which means I may have to generate my own wrapping for them.

a_m0d
+1  A: 

In GTK, you can wrap the functions you have already found.

It is also not hard to parse /etc/mime.types - it's a simple whitespace-separated file. I believe both Ocsigen and Ocamlnet contain code to do this, but I don't know off-hand if they make it easy to access (e.g. a function exposed by the Ocamlnet netstring library).
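A minimal sketch of such a parser, assuming the usual layout of one MIME type per line followed by zero or more extensions, with '#' starting a comment. The names parse_mime_line and load_mime_types are hypothetical, and this is not the code used by Ocsigen or Ocamlnet:

```ocaml
(* Parse one line of a mime.types-style file into (extension, mime type) pairs.
   Assumed format: "type/subtype ext1 ext2 ...", with '#' starting a comment. *)
let parse_mime_line line =
  (* Drop anything after a '#'. *)
  let line =
    match String.index_opt line '#' with
    | Some i -> String.sub line 0 i
    | None -> line
  in
  (* Treat tabs as spaces, split, and discard empty fields. *)
  let fields =
    String.map (fun c -> if c = '\t' then ' ' else c) line
    |> String.split_on_char ' '
    |> List.filter (fun s -> s <> "")
  in
  match fields with
  | mime :: exts -> List.map (fun ext -> (ext, mime)) exts
  | [] -> []

(* Load a whole file into a table from extension to MIME type. *)
let load_mime_types path =
  let table = Hashtbl.create 512 in
  let ic = open_in path in
  (try
     while true do
       List.iter
         (fun (ext, mime) -> Hashtbl.replace table ext mime)
         (parse_mime_line (input_line ic))
     done
   with End_of_file -> ());
  close_in ic;
  table
```

With this, Hashtbl.find_opt (load_mime_types "/etc/mime.types") "html" would look up the type for an extension, provided the file exists on the system and lists it.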

Michael E
+1  A: 

This is probably not the best method for determining the type of source code (using /etc/mime.types is best for that IMO), but there are also OCaml bindings for libmagic that you could use.

tsuyoshi