Thoughts about Data Science

More free advice, and worth what you paid for it. This is one of a collection of advice pages, all of them with the same theme: get skills, especially quant skills, because they're fun and they pay off. They include advice for undergrads, advice for MBAs (soon!), and this one. None of these thoughts are the official view of the Department, the School, or the University, but we (meaning my colleagues and I) believe them to be accurate, maybe even useful. Comments, of course, are welcome.

These thoughts are aimed at both undergrad and MBA students. The main difference between them is time: undergrads have four years to take courses, MBAs only have two. Even so, we think a few courses along these lines will prove useful to you in your life and career.

[Work in progress, please send comments for improvement]

Q1. What is Data Science?

The modern world generates data at an incredible rate, which has opened up lots of opportunities for people who have the skills to work with data effectively. Historically, the field of statistics -- and its mathematical basis, probability theory -- has led the way. But increasingly we're finding that data comes in unusual forms (text, images, etc.) and that some of the best ideas are coming from computer science. Right now, the best people seem to combine insights from both fields, and more. The newly emerging field is commonly referred to as Data Science. In the ongoing battle to find marketable labels, you'll also hear about Business Analytics and Big Data. More here if you want a sense of the intellectual history.

Q2. How do I get started?

We like the idea of leaping into the deep end of the pool to see if you know how to swim, but you will find it easier if you have some background:

  • First, programming experience is really helpful, but if you don't have it, you can teach yourself. See Q4 below. The languages of choice in the business world right now are Python, R, Matlab, and C++. They serve different purposes, but we find that once you know one, learning another isn't that hard. If you're looking for a place to start, we recommend Python. More below.
  • Second, it's helpful to know some basic math. Calculus and linear algebra are incredibly useful, and you would benefit from familiarity with each.

Even without this background, there's a lot you can do, but you can do more if you have at least the first of these.

Q3. Yeah, fine, but what about Data Science?

If you're not ready to swim yet, return to Q2. How will you know? Look at the prereqs of the courses you want to take. Or give them a try and see if they cause you more stress than you're ready to deal with.

If you're ready to go, you can put these skills to work in a number of ways. Data Science is a portfolio of skills: you can pick them up one at a time or focus on those that interest you most. A relatively standard set of courses would include some or all of the following:

  • Data science: an overview. Here's an example with lots of hands-on practice with data and (mostly) Python programming.
  • Probability theory: mathematical models of randomness.
  • Multivariate statistics or econometrics: estimating linear models -- picture a scatterplot with a straight line drawn through it.
  • Data mining: a variety of methods for finding patterns in large datasets.
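
To make the "straight line through a scatterplot" picture concrete, here is a minimal sketch of a least-squares fit in plain Python. The five data points and the function name fit_line are made up for illustration; in a real course you would more likely use a statistics package, but the arithmetic is the same.

```python
# Illustrative data only: five made-up (x, y) points, not from any real dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-movement of x and y divided by the spread of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

With these made-up points the fitted line comes out close to y = 0.05 + 1.99 * x, which is what you would draw through the scatterplot by eye.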

In most cases, there are applied versions of these courses at Stern and more theoretical versions at Courant. You can find descriptions of courses and programs at NYU's Center for Data Science, at Stern's IOMS group, and at the Courant Institute of Mathematical Sciences. This is way too much information, but you might want to page through it anyway. For current purposes, you should probably ignore the programs and focus on the courses and their content. If there's a course at NYU for one program, you can probably count it toward another program.

One thing we'd add: get some practice with graphics, which now goes by the name "visualization." Graphs give you a good first cut at data and can be an effective way to describe it to others.

Q4. Can you tell me more about Python?

We thought you'd never ask! We think Python is the obvious entry point if you want to find out what programming is about. It's a flexible, general-purpose, high-level language. High-level means that you don't need to spell out every detail yourself: the language takes care of most of it for you. It's quickly building a large community of users, including many at investment and consulting firms. It has packages that allow you to do scientific programming, graphics, and data analysis. It was the first language supported by Google App Engine. It's a skill employers ask for.
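
If you're wondering what "high-level" looks like in practice, here is a small sketch that counts the words in a sentence. The sentence is made up, and this is only one of many ways to do it, but notice how little you have to tell the computer.

```python
from collections import Counter

# Made-up sample text, purely for illustration.
text = "data science is fun and data skills pay off"

# Split the text into words and count how often each one appears.
# No memory management or type declarations are needed.
counts = Counter(text.split())

# How many times does "data" appear?
print(counts["data"])
```

The same task in a lower-level language such as C++ would take noticeably more code.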

If you want a taste, you can teach yourself Python as a summer or winter break project. It'll take some discipline, but Professor Okun did it, and says if he can, then you can, too.

[What follows is a work in progress, but stop by or email if you have questions.]

Distributions. If you do this, you will need the program (the "code distribution"), a user interface (a "GUI" or "IDE"), and a collection of packages that do whatever specialized tasks you're interested in.

Packages. There are lots of packages, but these are essential. NumPy does vector arithmetic, making it a good substitute for Matlab. SciPy does math and statistics. Pandas does data analysis. Matplotlib does graphics.
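
Here is a short sketch of two of these packages working together, assuming both are installed (they come with Anaconda). The prices are made up for illustration. SciPy and Matplotlib are left out to keep it short.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only.
prices = np.array([100.0, 102.0, 101.0, 105.0])

# NumPy: arithmetic applies to whole vectors at once, much as in Matlab.
returns = prices[1:] / prices[:-1] - 1.0

# pandas: put the results in a labeled table and summarize them.
df = pd.DataFrame({"price": prices[1:], "return": returns})
print(df)
print("Mean return:", df["return"].mean())
```

The point is that you never write a loop: NumPy computes all the returns in one line, and pandas gives you a labeled table you can summarize, sort, or merge.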

Getting started. There are lots of ways to do this -- unfortunately. Here are two I like, basically because they're easy. Wakari lets you edit and run IPython notebooks online. The beauty of this is that you get a controlled operating environment with no setup. Anaconda gives you a standard set of quantitative packages and comes with Spyder, a user interface that makes it easy to edit and run programs. In both cases, I prefer to use recent versions of Python (some version of Python 3). That leads to occasional conflicts with code written for older versions (Python 2), but I think you're better off looking to the future.
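
As one example of the kind of conflict we have in mind, here is a small sketch you can run to check your version and see a well-known difference in how division works. It assumes you are running Python 3.

```python
import sys

# Confirm which major version of Python is running this script.
print(sys.version_info.major)

# In Python 3, / is true division and // is floor division.
# In Python 2, 7 / 2 between two integers gave 3.
half = 7 / 2
floor_half = 7 // 2
print(half, floor_half)
```

Code written for Python 2 that relies on the old behavior of / will quietly give different answers in Python 3, which is why it pays to know which version you have.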

Once you're set up, you can start programming. But how? If you're new to programming, try Mark Lutz's Learning Python. Professor Okun says: "It is a very good book, easy to understand and complete. A Kindle version is 65% off the cover price at Amazon. This book is so large that it should come with a warning to anyone who would attempt to carry it." (At page 900 of 1600, he admits it's been a slog, but enjoys the feeling of power he now has over the world.) This is a book you can use to teach yourself.

Here are some other resources that look promising. Professors Attenberg and Provost have a nice overview with links. Some of the better online courses are Google's, Coursera's, and Learn Python the Hard Way. Codecademy has interactive tutorials for Python and other languages. ACM, Codeforces, and Project Euler have problem sets and sample programs. For economics and finance, you might try Sargent and Stachurski's Quantitative Economics and BYU's Python Lab. If you try any of them, let us know what you think.

Comments welcome on all of this.

Q5. Does economics help here?

Our goal here isn't to sell you on economics, although we think it's helpful in giving you a framework for thinking about data. See the comments by Susan Athey, the chief economist at Microsoft, and Michael Bailey, who works at Facebook.

More advice along the same lines: Undergrad advice | Graduate programs | MBA advice (soon!)


(c) NYU Stern School of Business | Address comments to Dave Backus.