A practical guide
I have been using Pandas for quite a long time but I feel like I’m still exploring it. The way Pandas handles complex operations and expedites the data analysis and manipulation process has always amazed me.
I do not claim that Pandas is the best tool out there for data analysis. I’m also aware that what Pandas is capable of can also be achieved using other packages or libraries. However, Pandas always offers at least one way to accomplish whatever task I need to do.
In this article, we will go over 3 examples that solve relatively complex operations. For each one, we will first frame the task and then demonstrate how simply it can be solved with Pandas.
We will use a small sample from the Melbourne housing dataset available on Kaggle for the examples. We first read the csv file using the read_csv function.
import numpy as np
import pandas as pd

melb = pd.read_csv(
    "/content/melb_data.csv",
    usecols=['Type', 'Price', 'Distance', 'Date', 'Landsize', 'BuildingArea'],
    parse_dates=['Date']
)
melb.head()
The building area column has some missing values. We want to fill them in based on the land size and house type. If the type is house (‘h’), the building area is equal to 70% of the land size. Otherwise, it is equal to 80% of the land size.
Here is how Pandas solves this task with a little help from NumPy.
melb.loc[melb.BuildingArea.isna(), 'BuildingArea'] = np.where(
    melb[melb.BuildingArea.isna()]['Type'] == 'h',
    melb[melb.BuildingArea.isna()]['Landsize'] * 0.7,
    melb[melb.BuildingArea.isna()]['Landsize'] * 0.8
)
We first select the missing values in the building area column. The where function of NumPy takes a condition which is the house type in our case. The next line determines the action if the condition is true. Obviously, the last line determines what to do if the condition is false.
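The condition-then-two-branches behavior of the where function can be seen in isolation. Here is a minimal sketch using toy data (the values are hypothetical, not from the Melbourne dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (hypothetical values).
df = pd.DataFrame({
    'Type': ['h', 'u', 'h'],
    'Landsize': [100.0, 200.0, 300.0],
})

# np.where(condition, value_if_true, value_if_false) evaluates element-wise:
# houses ('h') get 70% of the land size, everything else gets 80%.
area = np.where(df['Type'] == 'h', df['Landsize'] * 0.7, df['Landsize'] * 0.8)
print(area)  # [ 70. 160. 210.]
```

The result is a NumPy array aligned with the rows of the dataframe, which is why it can be assigned directly to the selected slice of the building area column.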
This example involves the groupby function. We want to calculate the average distance and the number of unique prices for each house type. We also want to count the total number of houses for each type.
When different functions are applied to different columns, we use the agg function. By default, it assigns the name of the original column to the aggregated values. However, this may cause confusion because we cannot tell what kind of aggregation was performed.
We can overcome this challenge by explicitly declaring the names of the aggregated columns.
melb.groupby('Type', as_index=False).agg(
    unique_price_count=('Price', 'nunique'),
    count=('Type', 'count'),
    avg_distance=('Distance', 'mean')
)
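To make the naming issue concrete, here is a minimal sketch with toy data (hypothetical values) that contrasts the default column names with named aggregation:

```python
import pandas as pd

# Toy data (hypothetical) to contrast default and named aggregations.
df = pd.DataFrame({
    'Type': ['h', 'h', 'u'],
    'Price': [500000, 500000, 300000],
    'Distance': [10.0, 20.0, 5.0],
})

# Default: the output column is still called 'Price', so the aggregation
# ('nunique') is invisible in the result.
default = df.groupby('Type').agg({'Price': 'nunique'})
print(default.columns.tolist())  # ['Price']

# Named aggregation: keyword = (column, function) makes the intent explicit.
named = df.groupby('Type', as_index=False).agg(
    unique_price_count=('Price', 'nunique'),
    avg_distance=('Distance', 'mean'),
)
print(named.columns.tolist())  # ['Type', 'unique_price_count', 'avg_distance']
```

The named version documents itself: anyone reading the output knows unique_price_count is a count of distinct prices, not a price.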
Let’s say we are generating new features for a machine learning model. We want to have quarters in a year as binary variables. Thus, there will be 4 new features which are Q1, Q2, Q3, Q4. If the date is in Q1, it takes the value of 1 and the other quarters become 0.
There are many ways to accomplish this task. The following is a quite straightforward way to do it.
melb = pd.concat(
    [melb, pd.get_dummies(melb.Date.dt.quarter, prefix='Q', prefix_sep='')],
    axis=1
)
We first extract the quarter from the date using the dt accessor. The get_dummies function is a very simple way to do one-hot encoding. By default, the names of the columns are the values they represent. We are able to modify them using the prefix, prefix_sep, or columns parameters.
The encoded quarter columns and the original dataframe are then combined using the concat function. It concatenates two dataframes either side-by-side or on top of each other depending on the axis parameter.
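The effect of the axis parameter is easy to see on a toy example (hypothetical frames, not the housing data):

```python
import pandas as pd

# Two toy frames (hypothetical) sharing the same index.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# axis=1 places the frames side by side (columns are combined),
# which is what we need when adding encoded columns to a dataframe.
wide = pd.concat([a, b], axis=1)
print(wide.shape)  # (2, 2)

# axis=0 (the default) stacks the frames on top of each other (rows are
# combined); columns missing from one frame are filled with NaN.
tall = pd.concat([a, b], axis=0)
print(tall.shape)  # (4, 2)
```

Since the dummy columns produced by get_dummies share the original dataframe's index, axis=1 lines them up row by row.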
Pandas is a highly popular data analysis and manipulation library, and I think it totally deserves the reputation. What I like the most about Pandas is that there are almost always multiple ways to accomplish a task. It allows for tackling complex problems easily.
What we have covered in this article demonstrates a very small part of what Pandas is capable of. As you practice more, you will discover the rich selection of powerful and versatile functions.
Thank you for reading. Please let me know if you have any feedback.