Getting Started with Python Pandas

I finally decided to get myself familiar with pandas while working on a recent side-project related to recommender systems. When I got started with it, I was still stubborn that I could achieve most things I needed to do in relation to data pre-processing with Python modules like tools like glob, json, numpy and scipy. True as that may be, I found myself spending way too much time writing routines to process the data itself and not getting anywhere close to working on the actual project. This was very reminiscent of the time a few years ago when I got immersed in writing code to manually compute gradients for various neural network architectures while getting nowhere in developing a music prediction model before finally deciding to make my life easier with theano! And so, this seemed like the perfect time to get started with learning pandas.

In the past I’ve found that, especially when it comes to learning useful features of new modules in Python, a hands-on and practical approach is much better than reviewing documentation and learning various features of a module without much of an application context, so I started looking around for such tutorial introductions to pandas. In the process I came across two invaluable resources that I thought I’d highlight here in this blog post. These really aren’t much, but gave me a surprisingly thorough (and quick) start to employ pandas in my own project.

Kaggle Learn

Kaggle Learn has a bunch of very well-organised and basic introductory Micro-courses on various Data Science topics from Machine Learning, to Data IO and Visualisation. I get started with the Pandas Micro-course which proved to be the ideal starting point for someone like me that had never used the module previously. This can be followed up with some of the other micro-courses, such as the one on data visualisation or embeddings which help one understand various concepts better through application. In fact, it’s what I’m planning to do as well!

Pandas Exercises on GitHub

So the Pandas Micro-course was a great starting point, but still left me wanting more practice on the topic as I still didn’t feel totally fluent. It was then that I stumbled upon a fantastic compilation of Pandas exercises on GitHub by Guilherme Samora. So I cloned the repository, loaded these exercises up on Jupyter Notebook and got down to solving them one after another! This really did help with getting more fluent with the rich set of tools that Pandas has to offer.

By the time I was done with Guilherme’s exercises (only a couple of days after starting with the Kaggle micro-course), I felt ready to apply my newly acquired pandas skills to my own project, and to discover more about the module through it. There certainly were plenty more resources that a quick Google search returned, but none appealed as much to me at a first glance, as the two I finally went with.

I’m sure I have only scratched the surface when it comes to useful pandas learning resources, and I’m very curious to hear about those that others have found useful, and why, so that I can look them up as well! So do feel free to share them in the comments below.