Monday, November 18, 2013

Statistics is the least important part of data science

“Statistics is the least important part of data science”, by Andrew Gelman

Quote in full:
This came up already but I’m afraid the point got lost in the middle of our long discussion of Rachel and Cathy’s book. So I’ll say it again:
There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .
The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.
To put it another way: you can do tech without statistics but you can’t do it without coding and databases.
This came up because I was at a meeting the other day (more comments on that in a later post) where people were discussing how statistics fits into data science. Statistics is important—don’t get me wrong—statistics helps us correct biases from nonrandom samples (and helps us reduce the bias at the sampling stage), statistics helps us estimate causal effects from observational data (and helps us collect data so that causal inference can be performed more directly), statistics helps us regularize so that we’re not overwhelmed by noise (that’s one of my favorite topics!), statistics helps us fit models, statistics helps us visualize data and models and patterns. Statistics can do all sorts of things. I love statistics! But it’s not the most important part of data science, or even close.
