How to Increase Speed of Pandas Code by 4X

Muhammad Saleh
2 min read · Jan 23, 2021
Photo by Marc-Olivier Jodoin on Unsplash

Pandas is the main library for processing data in Python. It’s easy to use and quite flexible when it comes to handling different sizes and types of data. It has hundreds of different functions that make working with data very easy.

The main issue with Pandas is that it is slow on large datasets. One way to cope with this is the modin.pandas library. Pandas is slow because it uses only one CPU core, while modin.pandas spreads the workload across all available cores. Let’s see how to use it.
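To make the idea of spreading work across cores concrete, here is a minimal pure-Python sketch. It is not Modin's actual implementation: the `split_into_chunks` helper is a hypothetical name invented for this illustration, and the chunks are processed one after another here, whereas a parallel library would hand each chunk to a separate core.

```python
import os

def split_into_chunks(rows, n_chunks):
    """Split a list into at most n_chunks contiguous, nearly equal parts."""
    n_chunks = max(1, min(n_chunks, len(rows)))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional row.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

# Number of cores on this machine (os.cpu_count() can return None).
n_cores = os.cpu_count() or 1

rows = list(range(10))
chunks = split_into_chunks(rows, n_cores)

# Each partial result could be computed on a different core,
# then the partial results are combined into the final answer.
partial_sums = [sum(chunk) for chunk in chunks]
total = sum(partial_sums)
print(f"{n_cores} core(s), {len(chunks)} chunk(s), total = {total}")
```

The final result is the same no matter how many chunks are used, which is what makes this kind of split safe to parallelize.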

First, install the Modin library. This is how to install it in a Jupyter notebook environment.

!pip install "modin[ray]"
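After installing, it can be worth confirming that the packages are visible to the notebook's Python interpreter. The following is a small sketch using only the standard library; the `is_installed` helper is a hypothetical name used for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None

# The install command above should provide both of these packages.
for name in ("modin", "ray"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package shows as missing, the install probably went to a different Python environment than the one the notebook kernel uses.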

Let’s see how much it improves on the performance of Pandas. For this demonstration I am using this Kaggle dataset. First, let’s import regular pandas.

import pandas as pd

Now, let’s check the data loading speed of normal pandas.

%%time
df = pd.read_csv('data.csv')
### output
Wall time: 7.36 s
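The %%time magic only works inside Jupyter/IPython. If you run the comparison from a plain Python script, a rough equivalent can be sketched with the standard library. The `wall_time` helper below is a hypothetical name, and the workload is a stand-in so the snippet runs without any data file.

```python
import time
from contextlib import contextmanager

@contextmanager
def wall_time(label):
    """Print the wall-clock time taken by the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")

# Stand-in workload; in practice this would be df = pd.read_csv('data.csv')
with wall_time("sum of squares"):
    total = sum(i * i for i in range(100_000))
```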

Now, do the same with modin.pandas. First, import it.

import modin.pandas as pd
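Optionally, Modin can be told which engine and how many cores to use through environment variables, which need to be set before modin.pandas is imported for the first time. This is a sketch under assumptions: the value 4 for MODIN_CPUS is an arbitrary example, and by default Modin should use all available cores.

```python
import os

# Set these before the first `import modin.pandas`.
os.environ["MODIN_ENGINE"] = "ray"  # matches the modin[ray] install
os.environ["MODIN_CPUS"] = "4"      # example value only; omit to use all cores

# In an environment where Modin is installed, the import would follow here:
# import modin.pandas as pd
```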

Now, repeat the same operation of loading data and check the time taken.

%%time
df = pd.read_csv('data.csv')
### output
Wall time: 2.8 s

You can see a significant improvement in loading speed (about 2.6 times faster here), and the difference grows as you process larger datasets. One thing I noticed about modin.pandas is that it significantly speeds up data reading and writing operations but helps much less with statistical operations. Let’s see it in practice.

%%time
# using normal pandas (%%time must be the first line of the cell)
df.groupby('county').count()
### output
Wall time: 1.76 s

Now, use modin.pandas and check its performance.

%%time
df.groupby('county').count()
### output
Wall time: 1.52 s
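To put the four measurements side by side, the speedup factor is simply the pandas time divided by the modin.pandas time. The sketch below uses only the wall times reported in this article; results on other machines and datasets will differ.

```python
# Wall times reported in this article, in seconds.
timings = {
    "read_csv": {"pandas": 7.36, "modin": 2.8},
    "groupby count": {"pandas": 1.76, "modin": 1.52},
}

def speedup(pandas_seconds, modin_seconds):
    """How many times faster modin.pandas was than pandas."""
    return pandas_seconds / modin_seconds

speedups = {
    op: round(speedup(t["pandas"], t["modin"]), 2)
    for op, t in timings.items()
}

for op, factor in speedups.items():
    print(f"{op}: {factor}x faster with modin.pandas")
# read_csv: 2.63x faster with modin.pandas
# groupby count: 1.16x faster with modin.pandas
```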

As of this writing, 73% of the pandas API is available in modin.pandas.

So, this is a very useful library, especially when you are dealing with large datasets. There is much more you can do with it, and I encourage you to practice and experiment as much as you can using the extensive information available online. Best of luck!
