Thoughts about Data Science

More free advice, and worth what you paid for it. This is one of a collection of advice pages, all of them with the same theme: get skills, especially quant skills, they pay off. They include advice for undergrads, advice for MBAs (soon!), and this one. None of these thoughts are the official view of the Department, the School, or the University, but we (meaning my colleagues and I) believe them to be accurate, maybe even useful. Comments, of course, are welcome.

These thoughts are aimed at both undergrad and MBA students. The main difference between them is time: undergrads have four years to take courses, MBAs only have two. Even so, we think a few courses along these lines will prove useful to you in your life and career.

[Work in progress, please send comments for improvement]

Q1. What is Data Science?

The modern world generates data at an incredible rate, which has opened up lots of opportunities for people who have the skills to work with data effectively. Historically, the field of statistics -- and its mathematical basis, probability theory -- has led the way. But increasingly we're finding that data comes in unusual forms (text, images, etc.) and that some of the best ideas are coming from computer science. Right now, the best people seem to combine insights from both fields, and more. The newly emerging field is commonly referred to as Data Science. In the ongoing battle to find marketable labels, you'll also hear about Business Analytics and Big Data.

Q2. How do I get started?

We like the idea of leaping into the deep end of the pool to see if you know how to swim, but you will find it easier if you have some background:

  • First, programming experience is really helpful, but if you don't have it, you can teach yourself. See Q4 below. The languages of choice in the business world right now are C++, Python, Matlab, and R. They serve different purposes, but we find that once you know one, learning another isn't that hard.
  • Second, it's helpful to know some basic math. Calculus and linear algebra are incredibly useful, and you would benefit from familiarity with each.

Even without this background, there's a lot you can do, but you can do more if you have at least the first of these.

Q3. Yeah, fine, but what about Data Science?

If you're not ready to swim yet, return to Q2. How will you know? Look at the prereqs of the courses you want to take. Or give them a try and see if they cause you more stress than you're ready to deal with.

If you're ready to go, you can put these skills to work in a number of ways. Data Science is a portfolio of skills: you can get them one at a time or focus on those that interest you most. A relatively standard set of courses would include some or all of the following:

  • Data science: an overview. Here's an example with lots of hands-on practice with data and (mostly) Python programming.
  • Probability theory: mathematical models of randomness.
  • Multivariate statistics or econometrics: estimating linear models -- picture a scatterplot with a straight line drawn through it.
  • Data mining: finding patterns in large datasets.
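
If the "scatterplot with a straight line drawn through it" sounds abstract, here's a minimal sketch of what estimating a linear model amounts to, using nothing but plain Python. The data (hours studied and exam scores) are made up for illustration, and the function name fit_line is ours, not something from a course or library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

intercept, slope = fit_line(hours, scores)
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
```

A statistics or econometrics course explains when a line like this is a sensible summary of the data and how much to trust the estimated slope.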

In most cases, there are applied versions of these courses at Stern and more theoretical versions at Courant. You can find descriptions of courses and programs at NYU's Center for Data Science, at Stern's Department of Information, Operations, and Management Science, and at the Courant Institute of Mathematical Sciences. This is way too much information, but you might want to page through it anyway. For current purposes, you should probably ignore the programs and focus on the courses and their content. If there's a course at NYU for one program, you can probably count it toward another program.

One thing we'd add: get some practice with graphics, which now go by the name "visualization." Graphs give you a good first cut at data and can be an effective way to describe it to others.

Q4. Can you tell me more about Python?

We thought you'd never ask! We think Python is the obvious entry point if you want to find out what programming is about. It's a flexible high-level language; "high-level" means that you don't need to spell out every step yourself, because the language does most of the work for you. It's quickly building a large community of users, including many at investment and consulting firms. It has packages that allow you to do scientific programming, graphics, and data analysis. It's one of the languages supported by Google App Engine. It's a skill employers ask for.
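
To give a feel for what "high-level" means in practice, here's a small sketch that counts the words in a sentence. The sentence is made up for illustration. Counter comes with Python's standard library, so there's nothing extra to install.

```python
from collections import Counter

# A made-up sentence to analyze
text = "the best people combine insights from statistics and computer science and more"

# Split the sentence into words and count how often each one appears
counts = Counter(text.split())

# Pull out the single most frequent word and its count
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)
```

In a lower-level language you would probably write the counting loop and the bookkeeping yourself. Here the language handles both.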

If you want a taste, you can teach yourself Python as a summer or winter break project. It'll take some discipline, but Glenn Okun is teaching himself now, and says if he can do it, then you can, too.

Distributions. If you do this, you will need the program (the "code distribution"), a user interface (a "GUI" or "IDE"), and possibly a collection of packages. One option is to download a version that includes all of the above. Anaconda, for example, is a free download for Mac, Windows, and Linux. It comes bundled with the Spyder and Qt interfaces and some of the more useful packages (more on them below). Another option is Python(x,y), "a free scientific and engineering development software for numerical computations, data analysis and data visualization." Like Anaconda, it comes with an interface and packages ready to go. A third option is the official Python site, which links to distributions for all platforms, a beginner's guide, and lots of other stuff. But you'll need to install an interface and packages yourself.

Okun has tried most of these on both Macs and Windows machines. He recommends Python(x,y) with Spyder. And use the Windows version even if you have a Mac.

Update. We've found two alternatives that we like better. The first is the Enthought academic distribution. The second is Wakari, a browser-based version. Both are complete -- they come with packages and interfaces -- and seem to us to have a lot of promise.

Packages. There are lots of packages, but these are essential. NumPy does vector arithmetic, making it a good substitute for Matlab. SciPy does math and statistics. pandas does data analysis. Matplotlib does graphics. All of them come pre-packaged with Anaconda and Python(x,y).
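
Here's a minimal sketch of the kind of vector arithmetic NumPy does, assuming you have NumPy installed (it comes with the distributions above). The prices are made-up numbers, not real data.

```python
import numpy as np

# Made-up closing prices for four consecutive days
prices = np.array([100.0, 102.0, 99.0, 105.0])

# Daily returns, computed on whole arrays at once with no explicit loop:
# each price from day 2 onward divided by the previous day's price, minus 1
returns = prices[1:] / prices[:-1] - 1

print(returns.round(4))
print(returns.mean())
```

The point is that the arithmetic applies to every element of the array in one line, which is what makes NumPy feel like Matlab.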

Getting started. Once you have the program, start programming. If you're new to programming, the best book we've found is Mark Lutz's Learning Python. Okun says: "It is a very good book, easy to understand and complete. A Kindle version is 65% off the cover price at Amazon. This book is so large that it should come with a warning to anyone who would attempt to carry it." (At page 900 of 1600, he admits it's been a slog, but enjoys the feeling of power he now has over the world.) This is a book you can use to teach yourself. You'll learn a lot if you work your way through it. And get a good job.

Here are some other resources that look promising. Josh Attenberg and Foster Provost have a nice overview with links. Some of the better online courses are Google's, Coursera's, and Learn Python the Hard Way (recommended by Josh and Foster). Codecademy has interactive tutorials for Python and other languages. MIT's course looks fine, but it's a computer science course, not a programming course. ACM, Codeforces, and Project Euler have problem sets and sample programs. For economics, and maybe finance, you might try John Stachurski's Intro for Economists and BYU's Python Lab. If you try any of them, let us know what you think.

Comments welcome on all of this.

Q5. Does economics help here?

Our goal here isn't to sell you on economics, although we think it's helpful in giving you a framework for thinking about data. For more on the subject, see the comments by Susan Athey, the chief economist at Microsoft, and Michael Bailey, who works at Facebook.

More advice along the same lines: Undergrad advice | Graduate programs | MBA advice (soon!)


(c) NYU Stern School of Business | Address comments to Dave Backus.