Thoughts about Data Science

More free advice, and worth what you paid for it. This is one of a collection of advice pages, all of them with the same theme: get skills, especially quant skills, because they're fun and they pay off. They include advice for undergrads, advice for MBAs (soon!), and this one. None of these thoughts are the official view of the Department, the School, or the University, but we (meaning my colleagues and I) believe them to be accurate, maybe even useful. Comments, of course, are welcome.

These thoughts are aimed at both undergrad and MBA students. The main difference between them is time: undergrads have four years to take courses; MBAs have only two. Even so, we think a few courses along these lines will prove useful to you in your life and career.

[Work in progress, please send comments for improvement]

Q1. What is Data Science?

The modern world generates data at an incredible rate, which has opened up lots of opportunities for people who have the skills to work with data effectively. Historically, the field of statistics -- and its mathematical basis, probability theory -- has led the way. But increasingly we're finding that data comes in unusual forms (text, images, etc.) and that some of the best ideas are coming from computer science. Right now, the best people seem to combine insights from both fields, and more. The newly emerging field is commonly referred to as Data Science. In the ongoing battle to find marketable labels, you'll also hear about Business Analytics and Big Data. More here if you want a sense of the intellectual history.

Q2. How do I get started?

We like the idea of leaping into the deep end of the pool to see if you know how to swim, but you will find it easier if you have some background:

  • First, programming experience is really helpful, but if you don't have it, you can teach yourself. The languages of choice in the business world right now are Python, R, Matlab, and C++. They serve different purposes, but we find that once you know one, learning another isn't that hard. If you're looking for a place to start, we recommend Python. More in Q4 below.
  • Second, it's helpful to know some basic math. Calculus and linear algebra are incredibly useful, and you would benefit from familiarity with each.

Even without this background, there's a lot you can do, but you can do more if you have at least the first of these.

Q3. Yeah, fine, but what about Data Science?

If you're not ready to swim yet, return to Q2. How will you know? Look at the prereqs of the courses you want to take. Or give them a try and see if they cause you more stress than you're ready to deal with.

If you're ready to go, you can put these skills to work in a number of ways. Data Science is a portfolio of skills: you can get them one at a time or focus on those that interest you most. A relatively standard set of courses would include some or all of the following:

  • Data science: an overview. Here's an example with lots of hands-on practice with data and (mostly) Python programming.
  • Probability theory: mathematical models of randomness.
  • Multivariate statistics or econometrics: estimating linear models -- picture a scatterplot with a straight line drawn through it.
  • Data mining: a variety of methods for finding patterns in large datasets.
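
To make the "straight line through a scatterplot" picture concrete, here is a minimal sketch of fitting a one-variable linear model by least squares. It uses only plain Python, the data are made up for illustration, and the names (fit_line, hours, score) are ours, not from any course. In practice you would use a statistics package rather than writing this yourself.

```python
# Least-squares fit of a straight line, using only plain Python.
# The data and all names here are made up for illustration.

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = (co-movement of x and y) / (spread of x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

hours = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical study hours
score = [52.0, 55.0, 61.0, 64.0, 68.0]  # hypothetical exam scores

intercept, slope = fit_line(hours, score)
print(f"score = {intercept:.2f} + {slope:.2f} * hours")
```

The slope says how much the fitted score changes per extra hour, and the intercept is the fitted score at zero hours.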

In most cases, there are applied versions of these courses at Stern and more theoretical versions at Courant. You can find descriptions of courses and programs at NYU's Center for Data Science, at Stern's IOMS group, and at the Courant Institute of Mathematical Sciences. This is way too much information, but you might want to page through it anyway. For current purposes, you should probably ignore the programs and focus on the courses and their content. If there's a course at NYU for one program, you can probably count it toward another program.

One thing we'd add: get some practice with graphics, which now go by the name "visualization." Graphs give you a good first cut at data and can be an effective way to describe it to others.

Q4. Can you tell me more about Python?

We thought you'd never ask! We think Python is the obvious entry point if you want to find out what programming is about. It's a flexible, general-purpose, high-level language. High-level means that you don't need to spell out every detail yourself: the language handles most of them for you. It's quickly building a large community of users, including many at investment and consulting firms. It has packages that allow you to do scientific programming, graphics, and data analysis. It's the basis of the Google App Engine. It's a skill employers ask for.

If you want a taste, you can teach yourself Python as a summer or winter break project. It'll take some discipline, but Professor Okun did it, and says if he can, then you can, too.

[What follows is a work in progress, but stop by or email if you have questions.]

Distributions. If you do this, you will need the program (the "code distribution"), a user interface (a "GUI" or "IDE"), and a collection of packages that do whatever specialized tasks you're interested in.

Packages. There are lots of packages, but these are essential. NumPy does vector arithmetic, making it a good substitute for Matlab. SciPy does math and statistics. Pandas does data analysis. Matplotlib does graphics.
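
To give a flavor of two of these packages, here is a small sketch, assuming NumPy and Pandas are installed (they come with the Anaconda distribution). The numbers and column names are made up for illustration. We leave Matplotlib out so the example runs without a display.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic on whole vectors at once, with no explicit loop.
prices = np.array([100.0, 102.0, 101.0, 105.0])   # made-up prices
returns = prices[1:] / prices[:-1] - 1.0          # period-over-period growth rates

# Pandas: a labeled table (a DataFrame) built from columns.
df = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],                 # made-up firm labels
    "sales": [10.0, 12.0, 7.0, 9.0],              # made-up sales figures
})
mean_sales = df.groupby("firm")["sales"].mean()   # average sales for each firm

print(returns)
print(mean_sales)
```

The point is the style: you describe an operation on a whole vector or table, and the package takes care of the looping.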

Installing Python. There are lots of ways to do this -- unfortunately. Here are two we like a lot. Wakari lets you edit and run IPython notebooks online. The beauty of this is that you get a controlled operating environment with no setup. Anaconda gives you a standard set of quantitative packages and comes with Spyder, a user interface that makes it easy to edit and run programs. In both cases, I prefer to use recent versions of Python (some version of Python 3). That leads to occasional conflicts with older versions, but I think you're better off looking to the future.

Getting started. There are lots of tutorials out there, but we think the two best are Learn Python the Hard Way and Codecademy. The latter is particularly user-friendly because you do all your coding online. Both give you the basics, but only the basics: you won't see Pandas or Matplotlib, the standard tools for data management and graphics, respectively. You can get a quick sense of what they do from Sargent and Stachurski's Quantitative Economics. It's very good, but a little on the terse side. We're working on our own Data Bootcamp, which will cover that material a little more systematically, but it's not ready for prime time yet. Here's a link to a very rough draft. When we fix it up, we'll try to update the link.

Comments welcome on all of this.

Q5. Does economics help here?

Our goal here isn't to sell you on economics, although we think economics is helpful in giving you a framework for thinking about data. See the comments by Susan Athey, the chief economist at Microsoft, and Michael Bailey, who works at Facebook.

More advice along the same lines: Undergrad advice | Graduate programs | MBA advice (soon!)

(c) NYU Stern School of Business | Address comments to Dave Backus.