Wednesday, December 7, 2011

A prerequisite for being a Data Scientist

So what should be in the toolkit of people who call themselves data scientists?

A fundamental skill is the ability to manipulate data. A data scientist should be familiar and comfortable with a number of platforms and scripting tools to get the job done. What is difficult in Excel might be trivial in R. And when R struggles, you can switch to Unix tools (or use a programming language such as Python) to get that portion of the data munging done. Along the way, you pick up a lot of tips and tricks. For example: how do you read a big data file in R?
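As an illustration, here is a minimal sketch of one common trick: read a small sample first to learn the column types, then pass those types to read.csv so R does not have to guess them while reading the whole file. The file name bigfile.csv is just a stand-in.

    # Peek at a small sample to work out the column classes
    sample_rows <- read.csv("bigfile.csv", nrows = 100)   # hypothetical file name
    col_classes <- sapply(sample_rows, class)

    # Read the full file, supplying the classes up front; this avoids repeated
    # type-guessing and can speed up the read considerably
    big_data <- read.csv("bigfile.csv",
                         colClasses   = col_classes,
                         comment.char = "")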

The goal is to get the job done. Familiarity with a wide variety of tools, and expertise in a few of them, is the hallmark of any good would-be data scientist.

Friday, December 2, 2011

O'Reilly's Data Science Kit - Books

It is not as if I don't have enough books (and material on the web) to read. But this list compiled by the O'Reilly team should make any data analyst salivate.

http://shop.oreilly.com/category/deals/data-science-kit.do

The Books and Video included in the set are:

  1. Data Analysis with Open Source Tools
  2. Designing Data Visualizations
  3. An Introduction to Machine Learning with Web Data (Video)
  4. Beautiful Data
  5. Think Stats
  6. R Cookbook
  7. R in a Nutshell
  8. Programming Collective Intelligence

Wednesday, November 30, 2011

Tips for getting started on Kaggle (data mining)

Ever since I heard about Kaggle.com at this year's Bay Area Data Mining Camp, I've wanted to participate. But I was feeling somewhat intimidated.
Jeremy Howard's "Intro to Kaggle" talk at yesterday's MeetUp (DataMining for a Cause) was exactly what I needed.
He had a number of tips for beginners, and his was exactly the talk I had been looking for, though I didn't know it beforehand. I am sharing some of his tips here, in case they help others as well.

Jeremy Howard's Tips for Getting Started on Data Mining competitions at Kaggle

* Visit the Kaggle site and spend at least 30 minutes there every day. Read the forums, the competition pages, and the Kaggle blog.
* It is much better to start participating in competitions that are just starting up, rather than in ones where hundreds of entries and teams are already well on their way.
* Aim to make at least one submission each and every day.
* Jeremy himself participates in competitions to see where he stands, and to learn and get better.
* He would start out making trivial submissions (all zeros, alternating zeros, or every entry set to the average) until his algorithm got better.
* A lot of people who compete use R (along with SAS, Excel, or Python).
* Nearly 50% of the winning entries use Random Forest techniques (see the sketch after this list).
* If you place in the top 3, that is great. But personal improvement and learning should be the goal.
* As you get better, you might get invited to "private competitions."
* Every day, strive to do a little better and improve your submission's performance, score, and ranking.
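As a concrete starting point, here is a minimal sketch of both of those ideas: a trivial all-zeros baseline followed by a Random Forest model. The file names train.csv and test.csv, the id column, the target column, and the submission format are all hypothetical stand-ins for whatever a particular competition actually specifies.

    library(randomForest)   # install.packages("randomForest") if needed

    # Hypothetical competition files; every contest names its columns differently
    train <- read.csv("train.csv")
    test  <- read.csv("test.csv")

    # 1. A trivial baseline: predict zero for everything, just to get on the leaderboard
    baseline <- data.frame(id = test$id, prediction = 0)
    write.csv(baseline, "submission_baseline.csv", row.names = FALSE)

    # 2. A Random Forest using every other column of train as a predictor of target
    fit <- randomForest(target ~ ., data = train, ntree = 500)
    rf_submission <- data.frame(id = test$id, prediction = predict(fit, newdata = test))
    write.csv(rf_submission, "submission_rf.csv", row.names = FALSE)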

Tuesday, October 18, 2011

Fusion Tables by Google

Google's Fusion Tables looks impressive for anyone who wants to try geo-visualizations of their data. You don't need much programming experience to be able to use it.

For those who want to try it out, here's a nice intro that Kathryn Hurley presented at the recent SVCC (Silicon Valley Code Camp). When combined with ShpEscape (note the spelling), it becomes very powerful.

The Guardian (UK), the Texas Tribune, and WNYC are among the organizations already taking advantage of it.
I'll post a couple of their examples soon. If you have a Google account, it's easy to test out Fusion Tables.

Related Link: Journalist’s guide to mapping data by county, district using ShpEscape

Monday, October 17, 2011

Get the Basics right - Suggestion for R Beginners

I am always looking for suggestions on how to get better at R, especially for beginners. So when I see someone who has become adept at it, I ask them how they got there.

This weekend, at the Bay Area ACM Data Mining Camp, one person gave me what seemed like a good suggestion. Just wanted to share it here, for anyone else who's just getting started.

He told me that there are tons and tons of libraries, and if you start going down that path, you might know how to use a library or two, but you may not learn the basics of data manipulation, which is one of R's main strengths.

His suggestion:
Get some data and learn to manipulate it: understand the differences between vectors, data frames, arrays, and matrices. Only once you have this down should you start exploring the different libraries. Don't rush in to try every new library that someone praises.
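If you want something concrete to type in, here is a minimal sketch of those four structures; str() is a handy way to see how they differ.

    # A vector holds values of a single type
    v <- c(1.5, 2.0, 3.5)

    # A matrix is a two-dimensional collection, still of a single type
    m <- matrix(1:6, nrow = 2, ncol = 3)

    # An array generalizes a matrix to more than two dimensions
    a <- array(1:24, dim = c(2, 3, 4))

    # A data frame is the workhorse: columns can hold different types
    df <- data.frame(name = c("a", "b", "c"), score = c(1.5, 2.0, 3.5))

    str(v); str(m); str(a); str(df)   # compare the structures side by side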

Sunday, October 16, 2011

Geo-doodlers - Paul Butler and FlowingData

I found this great R visualization example via an R-Bloggers post by xingmowang. (One more good reason why it is important to read lots of field-related blogs!)

Here's the image (Paul Butler's world map of Facebook friendships):

If this were merely eye candy, I would have enjoyed it, but not included it here. But this was done in R, which means the rest of us can learn from it!

When Paul Butler writes about how he created it, he shares with us how he had to tweak it, and how the results surprised him. That is true data-doodling. You guide things along, but then the data surprises (or delights) you.

I also like this small bit of musing that he includes:
What really struck me, though, was knowing that the lines didn't represent coasts or rivers or political borders, but real human relationships. Each line might represent a friendship made while travelling, a family member abroad, or an old college friend pulled away by the various forces of life.
For those of us who are new to R, this example has a few things to try. Take any dataset with Lat/Long values in it, and plot it over a world map. Once you can do that successfully, try this.
(Also pointed out courtesy of Xingmowang.)
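If you want to try that first step, here is a minimal sketch using the maps package; the three cities below are just sample points standing in for your own Lat/Long data.

    library(maps)    # install.packages("maps") if needed

    # A few sample points with latitude/longitude values
    cities <- data.frame(name = c("San Francisco", "London", "Mumbai"),
                         lat  = c(37.77, 51.51, 19.08),
                         long = c(-122.42, -0.13, 72.88))

    # Draw a world map, then overlay the points (longitude is x, latitude is y)
    map("world", col = "grey80", fill = TRUE, bg = "white")
    points(cities$long, cities$lat, col = "red", pch = 19)
    text(cities$long, cities$lat, labels = cities$name, pos = 3, cex = 0.7)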

We may not all create infographics that are great, but these examples will point us in the right direction.

Wednesday, October 12, 2011

A true data-doodler - Christophe Ladroue (R ddply and plyr on Triathlon Results)

To me, this post by Christophe Ladroue exemplifies what data doodlers do.

They take a dataset that is of interest to them (in his case, his triathlon results) and then manipulate the numbers to see what insights can be drawn. Most bloggers show only their final results, which look great, but for our purposes (as wannabe data doodlers) it is much more fun to see the process. It is often messy, but that's the way we learn.

In Christophe's example, you will see some data cleanup; then he plots averages and medians across categories, and starts to try to squeeze insights out of the data. That is true data analysis.
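The general pattern he uses, splitting the results by category and then summarising each group, looks roughly like the sketch below. The data frame here is a made-up stand-in for his triathlon results, not his actual data or code.

    library(plyr)   # install.packages("plyr") if needed

    # Made-up race results: one row per athlete, with a category and a finish time
    results <- data.frame(category = c("M30-34", "M30-34", "F25-29", "F25-29", "F25-29"),
                          minutes  = c(152, 160, 148, 171, 155))

    # Mean and median finish time for each category
    ddply(results, .(category), summarise,
          mean_time   = mean(minutes),
          median_time = median(minutes))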

Check out his full post.

What does it mean to be a Data Scientist?

Check out this talk by John Rauser of Amazon at the 2011 Strata Conference. It is an excellent intro to the field.

Sunday, October 9, 2011

The Skills of a Data Miner

Data mining is not only statistics, even if statistics is the most recognized academic component of it. It also includes data cleaning, machine learning and data visualization.
The scarce factor is the ability to understand that data and extract value from it.
Hal Varian, Google

The full article by Luca Sbardella, published on QuantMind, is well worth a read.