Data science is an interdisciplinary field that combines (advanced) maths, statistics, programming, and specific domain knowledge. Besides knowledge in these areas, successful data scientists should also possess the ability to think with data — asking the right questions, framing problems properly, breaking down complex problems into manageable analyses, and more generally, having good data intuition.
How might we help students master critical skills for data scientists?
The Rise of Active Learning
Let’s face it. Traditional lectures aren’t always effective.
Most of us took a statistics class in college, but can you still explain important concepts like 95% confidence intervals, let alone use them in your daily job? (And by the way, you should!) Many aspiring data scientists have completed a number of online machine learning courses taught by world-renowned professors, yet they struggle to produce useful models from real data that provide valuable insights, address important business questions, or create promising data products.
Research shows that students learn better by doing (aka Active Learning).
“[…] students must do more than just listen: They must read, write, discuss, or be engaged in solving problems. Most important, to be actively involved, students must engage in such higher-order thinking tasks as analysis, synthesis, and evaluation”
It’s been almost two years since I quit my job as a data scientist at Facebook and started teaching data science in Thailand. I have been experimenting with my instructional strategies in search of approaches that best meet my students’ learning needs.
In this blog post, I’d like to discuss some of the active learning activities I have tried or created for my classes and workshops.
1. Codelabs
As a Google Developer Expert, I use Google Codelabs quite a lot in my workshops.
Google Developers Codelabs provide a guided, tutorial, hands-on coding experience. Most codelabs will step you through the process of building a small application, or adding a new feature to an existing application.
To create my own poor man’s codelabs, I usually use Jupyter Notebook, which lets me add Markdown text to describe syntax or provide instructions. Jupyter Notebook is already considered the gold standard in the data science community. It supports both Python and R, and you can find tons of tutorials in Jupyter Notebook format.
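For readers who want to script this, here is a minimal sketch of how a codelab-style notebook could be assembled from (instructions, starter code) pairs. It uses only Python's standard `json` module and the version 4 notebook file layout. The helper names (`markdown_cell`, `code_cell`, `build_codelab`, `save_notebook`) and the example steps are hypothetical, made up for illustration; they are not part of Jupyter or of my actual workshop material.

```python
import json


def markdown_cell(text):
    """A Markdown cell holding the instructions for one step."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """An unexecuted code cell holding starter code for one step."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_codelab(steps):
    """Turn (instructions, starter_code) pairs into a notebook dictionary."""
    cells = []
    for instructions, starter_code in steps:
        cells.append(markdown_cell(instructions))
        cells.append(code_cell(starter_code))
    # Notebook format 4.4; kernel metadata is left empty in this sketch.
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def save_notebook(notebook, path):
    """Write the notebook dictionary to an .ipynb file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)


if __name__ == "__main__":
    # Hypothetical steps: short instructions, starter code with a TODO.
    steps = [
        ("## Step 1\nCreate a list of five numbers.", "numbers = [3, 1, 4, 1, 5]"),
        ("## Step 2\nCompute the mean of the list.", "# TODO: your code here"),
    ]
    notebook = build_codelab(steps)
    print(len(notebook["cells"]), "cells created")
```

Calling `save_notebook(notebook, "codelab.ipynb")` would write a file that Jupyter can open. Keeping each step as one short instruction cell followed by one small code cell makes it easier to keep a codelab brief.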
These codelabs are great when they are short and not too difficult. Students can work at their own pace and will feel accomplished as they finish something all by themselves.
However, students need to be highly motivated and dedicated. Because codelabs require a lot of reading, they can get boring very quickly. If the solutions are provided in the notebooks, students will start running all the code to see the final output without following the instructions or trying to understand how the code works.
Lastly, if the code is somewhat complicated, make sure you have TAs or facilitators walking around to assist with any issues that may come up. These could be anything from unclear instructions to an environment that is not properly set up (which can be really complicated to resolve, especially for Windows users).
Pro tip: Google recently launched Google Colaboratory, a Jupyter notebook environment that requires no setup to use and runs entirely in the cloud. It even lets you run TensorFlow computation on GPU! Most commonly used packages are readily available, and I no longer have to worry about my students not having a proper environment set up before class.
2. Live Coding Lessons
Live coding lessons are great for learning new commands or syntax. Students can play around and get real-time feedback on whether they got the right answers or what they might have missed. Through those hints, they can learn from their mistakes. Students usually find these live coding lessons very enjoyable.
There are many online platforms providing live coding lessons, but DataCamp stands out when it comes to data-related courses.
3. Interactive Data Visualizations
A picture is worth a thousand words, but it’s not easy to draw and animate things in a PowerPoint or Keynote presentation. (Well, it’s not too hard I’d say, but teachers simply don’t have the time!) There are many interactive visualizations available on the Internet that explain complex concepts in statistics and machine learning. I provide two examples here. (Let me know your favorites in the comments!)
Students usually don’t read instructions carefully and may miss some key learning steps, so make sure you have a debrief session or give them some quizzes to complete while playing with the visualizations.
4. Interactive Playgrounds
Long text and equations generally discourage students from learning, and one way to help students establish intuition is to let them explore the problem themselves. Interactive playgrounds are great for this. You give your students a mission to complete and let them go through a process of trial and error.
One of the best examples is Google’s TensorFlow playground, which lets you tinker with a neural network right in your web browser. You can find a list of guided playground exercises from Google’s Machine Learning Crash Course here.
For my class, I built a simple command line tool that encourages students to think about feature engineering.
5. Group Activities
Group activities are fun! They help students internalize the concepts through hands-on experience. Students also benefit from discussions with their teammates along the way.
To teach students without a statistics background about sampling and confidence intervals, I asked them to estimate the proportion of red M&M’s from samples of various sizes and calculate the 95% confidence intervals. At the end, they would find that roughly 95% of all the confidence intervals generated contain the true parameter and (hopefully) remember this correct interpretation of a 95% confidence interval.
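The M&M’s exercise can also be replayed in code. Below is a minimal simulation sketch, not the exact procedure from my class: it assumes a true proportion of red candies of 0.2 (a made-up number, not a real M&M’s statistic) and uses the textbook normal-approximation (Wald) interval for a proportion. The function names `wald_interval` and `coverage` are hypothetical.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def coverage(true_p, sample_size, n_groups, seed=0):
    """Fraction of simulated groups whose interval contains the true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_groups):
        # Each group draws `sample_size` candies; each one is red with probability true_p.
        reds = sum(rng.random() < true_p for _ in range(sample_size))
        low, high = wald_interval(reds, sample_size)
        if low <= true_p <= high:
            hits += 1
    return hits / n_groups


if __name__ == "__main__":
    # true_p = 0.2 is an assumed value chosen only for illustration.
    for size in (20, 50, 200):
        rate = coverage(true_p=0.2, sample_size=size, n_groups=2000)
        print(f"sample size {size}: {rate:.1%} of intervals contain the true proportion")
```

With small samples the normal approximation is rough, so the share of intervals that contain the true value can fall somewhat short of 95%. It should move closer to 95% as the sample size grows, which is a useful discussion point in itself.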
In another class, to emphasize the importance of “good data” in machine learning, students were asked to collect data (as video clips), prepare data (as image files), label images, and use transfer learning to build an image classifier. Most found that their classifiers didn’t work well the first time and that they needed to collect more images from the missing angles to further improve the performance.
The only drawback of group activities is that they can be quite time-consuming.
Besides the active learning activities presented above, working on actual projects is definitely a must. It’s the only way for students to connect all the dots and gain expertise. Just make sure you are a good “coach” helping students think through their problems, as opposed to simply telling them what to do.
If you teach data science, let me know what you think or share any fun activities you use in your classes.
Bonwell, C. & Eison, J. (1991). Active Learning: Creating Excitement in the Classroom. ERIC Clearinghouse on Higher Education, Washington, D.C. http://ericae.net/db/edo/ED340272.htm