Sooner or later, coders will feel the need to have access to "open data" in one of their projects, from knowing a city's zip code to more obscure information such as the axial tilt of Pluto.

I know data.un.org which offers access to the UN's extensive array of databases that deal with human development and other socio-economic issues. The other usual suspects are NASA and the USGS for planetary data. There's an article at readwriteweb with more links. infochimps.org seems to stand out.

Personally, I need to find historic commodity prices, stock values and other financial data. However, all of these data sets seem to cost money.

Clarification

I'm interested in all kinds of open data, because sooner or later, I know I will be in a situation where I might need it. I will try to edit this answer and include the suggestions in a structured manner.

A link for financial data was hidden in that readwriteweb article, doh! It's called opentick.com. Looks good so far!

Update

I stumbled over semantic data in another question of mine on here. There is OpenCyc ("the world's largest and most complete general knowledge base and commonsense reasoning engine"). A project called UMBEL provides a lightweight, distilled version of OpenCyc. UMBEL has semantic data in RDF/OWL/SKOS N3 syntax.

The World Bank also released a very nice API. It offers data from the last 50 years for about 200 countries.
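For anyone who wants to try the World Bank data from code, here is a minimal Python sketch. It assumes the current public v2 endpoint at api.worldbank.org and its JSON layout (a two-element list of paging metadata followed by observation records), which may differ from the API as it was when this was posted. The function names and the indicator code SP.POP.TOTL (total population) are my own choices for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the current public World Bank API (v2).
BASE_URL = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, start_year, end_year):
    """Build the request URL for one indicator and one country."""
    query = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}", "per_page": 1000},
        safe=":",  # keep the year range readable instead of encoding it as %3A
    )
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_series(payload):
    """Turn a decoded response into {year: value}, skipping missing values.

    The response is assumed to be a two-element list: paging metadata
    followed by a list of observation records.
    """
    if len(payload) < 2 or not payload[1]:
        return {}
    return {
        int(record["date"]): record["value"]
        for record in payload[1]
        if record["value"] is not None
    }


def fetch_series(country, indicator, start_year, end_year):
    """Download and parse one series (performs a network request)."""
    with urlopen(indicator_url(country, indicator, start_year, end_year)) as resp:
        return parse_series(json.load(resp))
```

Calling fetch_series("BR", "SP.POP.TOTL", 1960, 2010) should then return a dictionary keyed by year, provided the endpoint behaves as assumed above.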

+1  A: 

I'm not sure about the sort of "deep" datasets you're interested in; those tend to be pretty highly prized by private institutions. That said, I've found ProgrammableWeb to have a very interesting selection of open APIs that can be mined.

On a quasi-related note, anyone who hasn't seen Hans Rosling's TED talk on open data is really missing out.

bouvard
+1  A: 

I don't know about commodity prices, but census.gov has a lot of downloadable data (population figures, naturally, but also things like county perimeter long/lat coordinates).

James Curran
+1  A: 

Not sure which language you're working in, but there's a neat article on Luca Bolognese's blog about HTML scraping of stock prices (from Yahoo Finance, I believe) that could be useful:

Downloading stock prices in F#

Hope that helps!

Zachary Yates
+2  A: 

The CIA World Factbook is a great reference (https://www.cia.gov/library/publications/the-world-factbook). I don't know if it's accessible via an API, however. (https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html#Technical seems to indicate 'no' - but I don't know.)

Other sources that tend to have data available are various weather sites and Google News (accessible via RSS and other means).

addition - http://www.ancestry.com has buckets of data on genealogical information. I do not know how 'available' it is, but they have a lot.

addition - from a friend, http://www.cs.brown.edu/~pavlo/stocks/ has a ton of historical stock data for free. Also, http://www.cs.brown.edu/~pavlo/fortune1000/ has Fortune 1000 contact information as of last year :)

warren
+1  A: 

Apart from the generic resource I mentioned in my previous answer, I've found the open API for We Feel Fine to be a particularly enjoyable mining target. Probably not useful for too much beyond high-level/vague sorts of correlations, but still fascinating.

bouvard
+1  A: 

Geonames database has listings of geographic names from all over the world, along with their latitude and longitude, among other information.

http://www.geonames.org
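As a rough illustration of how the GeoNames dump can be consumed, here is a minimal Python sketch. It assumes the tab-separated layout described in the readme that accompanies the downloads (name in the second column, latitude and longitude in the fifth and sixth, country code in the ninth); check those positions against the readme before relying on them. The function name is hypothetical.

```python
# Column positions in the tab-separated GeoNames dump (assumed from the
# layout documented alongside the downloads; verify before relying on it).
NAME, LATITUDE, LONGITUDE, COUNTRY_CODE = 1, 4, 5, 8


def read_places(lines):
    """Yield (name, latitude, longitude, country_code) for each dump row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= COUNTRY_CODE:
            continue  # skip blank or truncated rows
        yield (
            fields[NAME],
            float(fields[LATITUDE]),
            float(fields[LONGITUDE]),
            fields[COUNTRY_CODE],
        )
```

Any iterable of text lines works as input, so an open file handle on the unpacked dump (opened with encoding="utf-8") can be passed in directly without loading the whole file into memory.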

Wordnet is a free lexical relation database.

http://wordnet.princeton.edu/

Eric Normand
+3  A: 

Run immediately to http://www.freebase.com/

Andy Lester
FYI, Freebase was mentioned in the readwriteweb article linked in the question.
bouvard
+1  A: 

A UK-centric list is available at www.showusabetterway.co.uk

Paul Dixon
A: 

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
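To give an idea of how that structured information can be queried, here is a minimal Python sketch. It assumes the public SPARQL endpoint at dbpedia.org/sparql and that it accepts a "format" parameter for choosing JSON output; both are assumptions on my part, as are the function names. The result parsing follows the standard SPARQL JSON results layout (head/vars and results/bindings).

```python
from urllib.parse import urlencode

# Assumed address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Example query: a handful of resources typed as films.
FILM_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?film WHERE { ?film a dbo:Film } LIMIT 5
"""


def sparql_url(query):
    """Build a GET URL that asks the endpoint for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)


def bindings_to_rows(result):
    """Flatten a decoded SPARQL JSON result into a list of tuples.

    Variables that are unbound in a given solution become None.
    """
    variables = result["head"]["vars"]
    return [
        tuple(b[v]["value"] if v in b else None for v in variables)
        for b in result["results"]["bindings"]
    ]
```

Fetching sparql_url(FILM_QUERY) with any HTTP client and passing the decoded JSON to bindings_to_rows should give one tuple per matching resource, if the endpoint behaves as assumed.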

Lagenar
A: 

Consider processing the Bach Chorales, images of handwritten characters or other data from the Machine Learning Repository.

+6  A: 

Amazon Web Services has public data sets available for free for users of their EC2 cloud services. I do not know anything more than that.

They have fantastic sounding stuff. Even if you do not want to use EC2 you can still use the list as a guide to sites and organizations that make the data available.

(taken directly from their page)

BIOLOGY

Annotated Human Genome Data provided by ENSEMBL

An annotated form of the Human Genome, perfect for biological research, which was released as of December 10, 2008. The first snapshot, called the main Ensembl data, includes human and approximately 40 other species (see www.ensembl.org for a list) as well as comparative genomics data (approximately 550GB). The second snapshot, called the Ensembl Biomart, is a denormalized, query-optimized database that facilitates complex queries of one or more datasets (approximately 172GB).

Main Ensembl (Linux/UNIX): snap-c78360ae
Ensembl BioMart (Linux/UNIX): snap-c48360ad

GenBank provided by the National Center for Biotechnology Information

An annotated collection of all publicly available DNA sequences including more than 85.7B bases and 82.8M sequence records (approximately 250GB)

Linux/UNIX: snap-b04ba2d9 (updated 02/15/2009)

UniGene provided by the National Center for Biotechnology Information

A set of transcript sequences of well-characterized genes and hundreds of thousands of expressed sequence tags (EST), last updated as of December 9, 2008. (approximately 10 GB)

Linux/UNIX: snap-5ad83b33
Windows: snap-60d83b09

CHEMISTRY

A 3D Version of the PubChem Library provided by Rajarshi Guha at Indiana University

A 3D (single conformer) version of Pubchem, a public database of chemical structures in SD Format (approximately 70 GB)

Linux/UNIX: snap-a8dd3dc1
Windows: snap-40dd3d29

UGI Virtual Conformer Library provided by Rajarshi Guha at Indiana University

80GB of data in SD format on conformers for 500,000 molecules that can be used for virtual screening (approximately 85 GB)

Linux/UNIX: snap-59d33330
Windows: snap-48ce2r21

PubChem Library provided by the National Center for Biotechnology Information

A data set of information on the biological activities of small molecules (approximately 230 GB)

Linux/UNIX: snap-e6df3c8f
Windows: snap-63d83b0a

ECONOMICS

Various US Census Databases provided by The US Census Bureau

United States demographic data from the 1980 (approximately 2 GB), 1990 (approximately 50 GB), and 2000 US Censuses (approximately 200GB), summary information about Business and Industry (approximately 15 GB), and 2003-2006 Economic Household Profile Data (approximately 220 GB)

2000 US Census (Linux/UNIX): snap-92d333fb
2000 US Census (Windows): snap-36ce2e5f
1990 US Census (Linux/UNIX): snap-33f8185a
1990 US Census (Windows): snap-8cf818e5
1980 US Census (Linux/UNIX): snap-9df717f4
1980 US Census (Windows): snap-b6f818df
2003-2006 Economic Data (Linux/UNIX): snap-0bdf3f62
2003-2006 Economic Data (Windows): snap-4edd3d27
Business and Industry Summary Data (Linux/UNIX): snap-5cf81835
Business and Industry Summary Data (Windows): snap-8af818e3

Various Labor Statistics Databases provided by The Bureau of Labor Statistics

Statistics on Inflation & Prices, Employment, Unemployment, Pay & Benefits, Spending & Time Use, Productivity, Workplace Injuries, International Comparisons, Employment Projections, and Regional Resources (approximately 15 GB)

Linux/UNIX: snap-30f81859
Windows: snap-8df818e4

Various Transportation Databases provided by The Bureau of Transportation Statistics

Data and statistics from the US Department of Transportation on Aviation, Maritime, Highway, Transit, Rail, Pipeline, Bike/Pedestrian and other modes of transportation (approximately 15 GB)

Linux/UNIX: snap-e1608d88
Windows: snap-37668b5e

ENCYCLOPEDIC

DBpedia Knowledge Base provided by DBpedia.

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, 20,000 companies. The knowledge base consists of 274 million pieces of information (RDF triples). It features labels and short abstracts for these things in 30 different languages; 609,000 links to images and 3,150,000 links to external web pages; 4,878,100 external links into other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories (approximately 67GB).

Semantic extraction by DBpedia with contributions from the DBpedia Community, using data from Wikipedia.org. Snapshots prepared by the infochimps.org team using community curated metadata. Released under the GNU Free Documentation License.

Linux/UNIX: snap-37b75e5e
Windows: snap-09b75e60

Freebase Data Dump provided by Freebase.com.

A data dump of all the current facts and assertions in the Freebase system. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients. For more answers check the Freebase FAQ (approximately 26GB).

Data aggregated, processed and reconciled by freebase.com using data from Wikipedia.org, the freebase community, and many other open data sets. Snapshots prepared by the infochimps.org team using community curated metadata. Released under Creative Commons Attribution (CC-BY) license and the Freebase Terms of Service and Licensing Policy.

Linux/UNIX: snap-a8957cc1
Windows: snap-ab957cc2

Wikipedia Extraction (WEX) provided by Freebase.com.

The Freebase Wikipedia Extraction (WEX) is a processed dump of the English language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. Freebase WEX is provided as a set of database tables in TSV format for PostgreSQL, along with tables providing mappings between Wikipedia articles and Freebase topics, and corresponding Freebase Types. (approximately 66GB)

Semantic extraction by freebase.com, using data from Wikipedia.org. Snapshots prepared by the infochimps.org team using community curated metadata. Released under the GNU Free Documentation License.

Linux/UNIX: snap-a0957cc9
Windows: snap-a6957ccf

Allen
A: 

Google APIs

They have APIs to access news, web pages, satellite maps, finance data, images and so on ;-)

FerranB
+1  A: 

Astronomical Data sets

Most astronomical data sets are freely available after a short proprietary period (to allow the original observers who did the work a window of exclusive time to publish results). These are scattered all over the web. The Astronomical Data Center provides convenient links to most of the big catalogues and astronomical data providers out there, and also contains local copies of data accessible via FTP and on CD-ROM.

ire_and_curses
+1  A: 

Most space science data is freely available, but it's spread all over the place. Either visit one of the appropriate VxOs (discipline specific virtual observatories), or the VSPO, a sort of directory of the sets. For earth science, there's the GCMD.

You can find other US government data at science.gov, but like the GCMD, they search more than just data.

For library catalogs, see the OpenLibrary, or get it from the Internet Archive. There's other data stored in IA, but I'm not sure what the best way is to find it. Likewise, Talis offers storage of open data.

... there was at some point a directory of open data sets, but I've tried looking through Code4Lib's archive, and I'm not having luck finding it again. It seems like the Open Data Commons is now part of the Open Knowledge Foundation, and you might be able to mine that for links.

Joe
+2  A: 

This question was asked before the website started, but stats.gov has extremely high-quality, government-collected data sets. This type of data ranges from farming output to how people spend time in the US National Parks.

Also, if you could figure out a way to get into ICPSR, you'd be set too. This type of data is generally sociology, political science, or the like.

Adam
A: 

There are a lot of free data sets available at Data.gov, a new site designed for collecting and disseminating various data produced by the federal government in machine-readable format.

Brian Campbell
+1  A: 

Consider OpenStreetMap for free map data.

Shaji
A: 

Economics - The World Bank - Open Data Initiative

The World Bank decided last week to open up a lot of its previously non-free datasets and published them online on its revised homepage. The new internet appearance looks pretty nice as well.

mropa
A: 

Find Business, World, Point of Interest & General Open datasets at Edigitalz Data

Sergei loves Data