views: 694

answers: 10

What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?

A few ideas germane to this discussion:

  • Knowing SQL and the use of a DB such as MySQL or PostgreSQL was great until the advent of NoSQL and non-relational databases. MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
  • Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and others to the list.
  • Data now comes in the form of text, URLs, and multimedia, to name a few, and there are different paradigms associated with their manipulation.
  • What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
  • OLS Regression now has Artificial Neural Networks, Random Forests, and other relatively exotic machine learning/data mining algorithms for company.

Thoughts?

+12  A: 

Just to throw in some ideas for others to expound upon:

At some ridiculously high level of abstraction all data work involves the following steps:

  • Data Collection
  • Data Storage/Retrieval
  • Data Manipulation/Synthesis/Modeling
  • Result Reporting
  • Story Telling

At a minimum, a data scientist should have some skills in each of these areas. But depending on specialty, one might spend a lot more time in a limited range.

JD Long
+1 for Story Telling. Any monkey with a calculator can crunch the numbers. You distinguish yourself by communicating what the numbers mean.
Kennet Belenky
Kennet, as an economist I often describe my job as ex post story telling. I'm like Tom T Hall but without the guitar... or the talent ;)
JD Long
I would add to the third bullet "data cleaning"
Tal Galili
I would also add "Understanding how to ask a question that is solvable by data"
kpierce8
+4  A: 

I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
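
To make "get data out of storage and into your analytic tool" concrete, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for a commercial system such as Oracle or SQL Server. The `trades` table, its columns, and the `notional_by_ticker` helper are hypothetical names invented for illustration only.

```python
import sqlite3

# Hypothetical in-memory table standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (ticker TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAA", 10, 2.5), ("BBB", 4, 10.0), ("AAA", 6, 3.0)],
)


def notional_by_ticker(connection):
    """Aggregate in SQL, then return plain Python rows for further analysis."""
    cursor = connection.execute(
        "SELECT ticker, SUM(qty * price) AS notional "
        "FROM trades GROUP BY ticker ORDER BY ticker"
    )
    return cursor.fetchall()


rows = notional_by_ticker(conn)
print(rows)
```

The idea is to let the database do the grouping and summing, so that only the small result set travels to the analysis side.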

In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.

Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.

Phil Rack
+1  A: 

Matrix algebra is my top pick

el chief
Interesting - why would you say that?
Tal Galili
It's pretty important for understanding several concepts in data analysis, such as regression, optimization (e.g. linear programming), image/video processing, or even how Google's PageRank works.
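
As a rough illustration of the PageRank point, the sketch below runs the textbook power iteration, which is just repeated multiplication of a rank vector by a link matrix. It is a simplified version under stated assumptions, not Google's actual implementation: the `pagerank` function, the damping factor of 0.85, the fixed iteration count, and the three-page `graph` are all illustrative choices, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power iteration; `links` maps each page to the pages it links to.

    Assumes every link target also appears as a key in `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the "random jump" share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

Each pass of the outer loop is one matrix-vector product written out by hand, which is why matrix algebra helps in reading this kind of algorithm.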
el chief
+5  A: 

JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.

The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.

Byron Ellis
+5  A: 

JD's points are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:

  1. Skill #1: Statistics (Studying)
  2. Skill #2: Data Munging (Suffering)
  3. Skill #3: Visualization (Story telling)
DrewConway
Mike's Sexy Skills post very much influenced my thinking in this area. I'm glad you linked to it.
JD Long
It's amazing how much mileage that story got by calling data geeks sexy. Well, I know I am.
kpierce8
Yeah, I totally was drawn to the sexy... http://www.cerebralmastication.com/2009/02/hal-varian-google%E2%80%99s-chief-economist-thinks-i-am-sexy/
JD Long
+4  A: 
  • The ability to collaborate.

Great science, in almost any discipline, is rarely done by individuals these days.

wkmor1
Great point. Certain professions tend to attract individuals who have been the "smartest guy in the room" for so long that they have a lot of trouble taking input from others. This is a huge handicap.
JD Long
+1  A: 

There are several computer science topics that are useful for data scientists, many of which have already been mentioned: distributed computing, operating systems, and databases.

Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you require.
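
A back-of-the-envelope sketch of the "how much RAM" question, in Python. It assumes values are stored as 8-byte floats in dense arrays and ignores any overhead from the language or library. The function names and the example sizes (one million rows, 50 columns) are hypothetical.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def dense_matrix_bytes(rows, cols, bytes_per_value=8):
    """Space for a dense rows x cols table: grows linearly in rows."""
    return rows * cols * bytes_per_value


def pairwise_distance_bytes(rows, bytes_per_value=8):
    """Space for a full rows x rows distance matrix: grows quadratically."""
    return rows * rows * bytes_per_value


n = 1_000_000  # hypothetical number of observations
data_gib = dense_matrix_bytes(n, 50) / GIB
dist_gib = pairwise_distance_bytes(n) / GIB
print(f"data table:      {data_gib:.2f} GiB")
print(f"distance matrix: {dist_gib:.2f} GiB")
```

Under these assumptions the raw table fits comfortably on a laptop, while any method that materializes all pairwise distances needs terabytes, so the choice of algorithm matters more than the size of the data.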

mattrepl
A: 

Patience - both for getting results out in a reasonable fashion and then for going back and changing them to match what was 'actually' required.

Paddy
+7  A: 

To quote from the intro to Hadley's PhD thesis:

First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future

Step 1 almost certainly involves data munging, and may involve database accessing or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)

Step 2 means visualisation/plotting skills.

Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.

The final step is mostly about soft skills like introspection and management-type skills.

Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.

Richie Cotton
Software Carpentry is wonderful! I wish it had an R equivalent.
Tal Galili
@Tal: Your wish may be granted. The course materials are currently being updated, and Greg stated in a recent blog post that "after Python and MATLAB, R and Perl are our next target languages" http://software-carpentry.org/blog/2010/05/setting-up-a-new-windows-machine/
Richie Cotton
+1  A: 

At dataist the question is addressed in a general way with a nice Venn diagram:

[Image: Venn diagram]

mropa