Is there a list of language codes in YAML or JSON somewhere out there?

Another format is fine; I can convert it if necessary.

+3  A: 

It is available in HTML via the link you posted in your question :) Seriously, if that list on Wikipedia is complete, then it is easy to grab it using lxml.html (in Python) or any similar library in your favorite language.

Andrey Vlasovskikh
Hey, this approach does work. Propose a better one before voting down.
Andrey Vlasovskikh
For those who don't like Python, use something else. This question has Ruby as a tag; that language certainly is sufficient to extract what is needed from the Wikipedia list. Maybe the OP should post a different question: "How do I extract values from an HTML table in Ruby (or other language of choice)?" ;) (And I wouldn't be surprised if that can be found on Stack Overflow already.)
John Y
+2  A: 

Check out the source code of the Wikipedia entry.

It's a very simple format - table cells are separated by ||. That's much easier to parse than HTML.
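
A rough sketch of that idea in Python, assuming the wiki source uses the usual MediaWiki table layout (data rows start with `|`, cells are separated by `||`, rows are separated by `|-`). The sample text and its column order are made up for illustration and are not copied from the actual article; `parse_wikitable_rows` is a hypothetical helper name.

```python
# Illustrative wikitext only; the real article's columns may differ.
SAMPLE_WIKITEXT = """\
{| class="wikitable"
! Language name !! 639-1 !! 639-2
|-
| Abkhaz || ab || abk
|-
| Afar || aa || aar
|}
"""

def parse_wikitable_rows(wikitext):
    """Return the data rows of a wiki table as lists of cell strings."""
    rows = []
    for line in wikitext.splitlines():
        line = line.strip()
        # Keep only data rows: skip the table open/close markers,
        # row separators (|-), captions (|+), and header lines (!).
        if not line.startswith("|") or line.startswith(("|-", "|}", "|+")):
            continue
        # Drop the leading "|" and split the rest on the "||" cell separator.
        cells = [cell.strip() for cell in line[1:].split("||")]
        rows.append(cells)
    return rows

rows = parse_wikitable_rows(SAMPLE_WIKITEXT)
for row in rows:
    print(row)
```

This ignores cell attributes, links, and multi-line cells, so real wiki markup would probably need some extra cleanup.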

Vojto
+1 Nice observation!
Andrey Vlasovskikh
+4  A: 

I think the United Nations or the ISO actually publish that list in CSV format. That would be the ultimate source.

However, I'm not sure if they publish it for free.

EDIT: Actually, the link is in the Wikipedia article you linked to. The US Library of Congress has been designated the official registration authority by the ISO, and they publish the entire, official, up-to-date list as a trivial-to-parse text file for free.

The format looks like this:

ara||ar|Arabic|arabe
arc|||Official Aramaic (700-300 BCE); Imperial Aramaic (700-300 BCE)|araméen d'empire (700-300 BCE)
arg||an|Aragonese|aragonais
arm|hye|hy|Armenian|arménien
arn|||Mapudungun; Mapuche|mapudungun; mapuche; mapuce
arp|||Arapaho|arapaho
art|||Artificial languages|artificielles, langues
arw|||Arawak|arawak
asm||as|Assamese|assamais
ast|||Asturian; Bable; Leonese; Asturleonese|asturien; bable; léonais; asturoléonais
ath|||Athapascan languages|athapascanes, langues

That's 5 fields separated by vertical bars:

  1. ISO 639-2 Alpha-3 bibliographic code
  2. ISO 639-2 Alpha-3 terminology code
  3. ISO 639-1 Alpha-2 code
  4. English language name(s)
  5. French language name(s)

So, this is actually in CSV format, if you interpret that as character-separated values instead of comma-separated values, which most CSV parsers let you do.
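
For example, here is a minimal sketch using Python's standard `csv` module with `|` as the delimiter, then dumping the result as JSON (which is what the question asked for). The rows are copied from the sample above; the key names in `FIELDS` and the function name `parse_language_codes` are my own choices, not anything official, and in practice you would read the downloaded file instead of an inline string.

```python
import csv
import io
import json

# A few rows copied from the sample shown above.
SAMPLE = """\
ara||ar|Arabic|arabe
arm|hye|hy|Armenian|arménien
arp|||Arapaho|arapaho
"""

# Hypothetical key names for the five fields, in file order.
FIELDS = ["alpha3_b", "alpha3_t", "alpha2", "english", "french"]

def parse_language_codes(text):
    """Parse pipe-separated language-code rows into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=FIELDS,
        delimiter="|",
        quoting=csv.QUOTE_NONE,  # treat quote characters in names as plain text
    )
    return [dict(row) for row in reader]

records = parse_language_codes(SAMPLE)
# ensure_ascii=False keeps accented characters readable in the output.
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Missing codes come through as empty strings, so you may want to turn those into null values before exporting.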

Jörg W Mittag