6 Pandas Techniques that Saved My Life

Derek Meegan
8 min read · Jun 21, 2024


If you work with data in your job or side projects, then you likely spend the vast majority of your time cleaning, manipulating, and transforming your data before you even get to work with it. In fact, it’s a popular trope among data scientists that 80% of their time is spent wrangling data.

Given this reality, you’ve probably worked with pandas, the essential Python library for data manipulation. I’ve used pandas extensively for everything from ad-hoc analyses to building production-level data pipelines. Through my experience, I’ve collected six cardinal techniques that have significantly streamlined my workflow and improved the quality of my code.

In this article, I’ll explore the techniques and demonstrate how to apply them effectively using the iconic Titanic Dataset.

Chaining

Chaining is at the center of all the techniques described below. It’s a method that allows you to apply multiple operations to your data in a single, continuous statement, creating a streamlined pipeline that looks more like a cohesive recipe than a series of separate steps. This style of pandas code was popularized by Matt Harrison, a Python and data science trainer and author of Effective Pandas (highly recommended).

Chaining does not resemble traditional programming, which aims to make operations independent and functions small and simple. Instead, chaining represents data-oriented workflows more realistically, where data is gradually cleansed or augmented through a series of operations. This approach can be intimidating for beginners due to the potential length and complexity of workflows. However, once you understand a few basic principles and break workflows down step by step, chaining quickly becomes an easy and intuitive way to build robust data pipelines.

Below I have created a sample workflow for performing a common set of operations on the dataset. I used variable reassignment in the first snippet and chaining in the second. Which one looks cleaner to you?

Without chaining
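
A minimal sketch of the reassignment style, assuming the dataset sits in a local titanic.csv with the standard Kaggle column names (SibSp, Parch, Fare, Embarked); the exact operations are illustrative:

```python
import pandas as pd

# Read the data, then reassign df after every single step
df = pd.read_csv("titanic.csv")
df = df.dropna(subset=["Embarked"])
df = df.rename(columns=str.lower)
df = df.assign(family_size=df["sibsp"] + df["parch"] + 1)
df = df.loc[df["fare"] > 0]
```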
With chaining
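
And the same pipeline as one continuous chain, with no intermediate variables:

```python
import pandas as pd

(
    pd.read_csv("titanic.csv")
    .dropna(subset=["Embarked"])
    .rename(columns=str.lower)
    # Each lambda receives the DataFrame as it exists at this point in the chain
    .assign(family_size=lambda df_: df_["sibsp"] + df_["parch"] + 1)
    .loc[lambda df_: df_["fare"] > 0]
)
```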

Notice how we did not have to create variables in the chain version! If you were executing this pipeline in a Jupyter Notebook, you could directly run the cell to inspect the results. In production, you would need to save the results to a variable, but you eliminate four other variable instantiations, resulting in cleaner and more maintainable code.

To maintain visual clarity when chaining, arrange your pipeline code hierarchically, where each additional indent corresponds to a deeper level in the pipeline. This becomes increasingly crucial as the workflow becomes more complex. Additionally, note the use of lambda functions in the .assign method. While they may seem daunting at first, these lambdas simply receive the DataFrame as it exists at that point in the chain, which is particularly useful when applying transformations to grouped or filtered data or when creating multiple interdependent columns.

Now that we understand the importance and advantages of chaining, let’s explore some techniques that leverage this approach to tackle complex tasks easily.

Inspecting Duplicated Rows

Pandas’ built-in drop_duplicates function is useful for removing duplicate rows, but it doesn't show us the duplicated rows themselves. To identify duplicates, we use the .duplicated method, which returns a boolean series where rows are True if they are duplicates of a preceding row. However, this often leads to the creation of clunky code:
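
A first pass might look like this (same assumed titanic.csv; since PassengerId makes every full row unique, I check duplicates on an illustrative subset of columns):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")

# Two throwaway variables just to look at the duplicates
duplicated_mask = df.duplicated(subset=["Pclass", "Sex", "Embarked"])
duplicated_rows = df[duplicated_mask]
duplicated_rows
```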

While this achieves our goal, we can improve our approach. First, we can replace vanilla boolean indexing with .loc, allowing us to create dynamic filtering expressions.
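
With a callable inside .loc, the same filter collapses into the chain (the column subset is again illustrative):

```python
import pandas as pd

(
    pd.read_csv("titanic.csv")
    # "row" here receives the whole DataFrame, not a single row
    .loc[lambda row: row.duplicated(subset=["Pclass", "Sex", "Embarked"])]
)
```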

And voila! We filtered for the duplicated rows without variable reassignment. While you can use .drop_duplicates like we mentioned above to drop the duplicated rows, you can also filter them out using the same expression above with a ~ operator prepended to the row.duplicated filter.
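
Prepending ~ to the same expression keeps only the non-duplicated rows:

```python
import pandas as pd

(
    pd.read_csv("titanic.csv")
    # ~ inverts the boolean mask, keeping rows that are NOT duplicates
    .loc[lambda row: ~row.duplicated(subset=["Pclass", "Sex", "Embarked"])]
)
```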

The ~ operator reverses the boolean conditions of the filter statement, such that only rows that were not duplicated would be returned. This is handy when you want to inspect your data and quickly peek at the duplicated and non-duplicated rows.

Value selection using .loc

One aspect of pandas I have found tricky is accessing a specific value in an individual cell. While we often use pandas to operate on rows or columns, sometimes we need to extract individual values. The general technique for doing this typically looks like the following:
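
Something like this, filtering down to one row and then reaching into it positionally (the passenger name is the first row of the standard Titanic CSV):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")

# Filter to the matching row, select the column, then grab position 0
age = df[df["Name"] == "Braund, Mr. Owen Harris"]["Age"].iloc[0]
```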

While this does work, the .iloc property can feel a bit hacky. Fortunately, with a bit of preparation before using the .loc method, we can access values directly. The key is to set the column you are filtering on as the index of the DataFrame. This way, you can use the row’s label as the first accessor in the .loc statement and then specify the desired column as the second accessor.
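
With the index set, the lookup becomes a single .loc call:

```python
import pandas as pd

age = (
    pd.read_csv("titanic.csv")
    .set_index("Name")                      # the lookup column becomes the index
    .loc["Braund, Mr. Owen Harris", "Age"]  # row label first, then column
)
```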

Note that if the column you set as the index contains duplicate values, you may receive multiple rows as a result of your .loc statement. Therefore, ensure that duplicates are either handled or not present in the column you set as the index.

The following techniques all revolve around .pipe, one of the handiest methods in pandas. It allows you to chain any custom function, whether predefined or written as a lambda, onto a DataFrame or Series. However, be cautious: the result of a DataFrame passed through a pipe function is not always a DataFrame itself. Depending on what the function returns, you may not be able to chain additional pandas transformations. Let’s take a look at how it works.

Pipe Ternary

Pandas is excellent at performing row, column, and table-oriented transformations on data, but ironically has poor support for conditional operations. One workaround I have leveraged is using ternary functions within lambdas in the .pipe method. Let’s say you are creating a pipeline for the Titanic dataset and it expects a column called “Cabin”. However, you cannot guarantee that the column will always be present. You may do something like this:
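
For example, an if statement guarding the column (pd.NA as the placeholder value is my assumption):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")

# The conditional check forces us out of the chain and into imperative code
if "Cabin" not in df.columns:
    df["Cabin"] = pd.NA

# ...and the rest of the pipeline has to resume from the variable
```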

This works; however, it breaks your chain! For a one-off operation, this might be acceptable, but if multiple such instances occur in a larger pipeline, it quickly becomes unreadable. Instead, we can define a simple lambda function within the pipe to perform the ternary operation:
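
A sketch of the pipe ternary, with the same assumed placeholder value:

```python
import pandas as pd

(
    pd.read_csv("titanic.csv")
    # Return the DataFrame untouched if "Cabin" exists; otherwise add it
    .pipe(lambda df_: df_ if "Cabin" in df_.columns else df_.assign(Cabin=pd.NA))
)
```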

Using .pipe, we create a custom ternary function that checks if “Cabin” already exists in the DataFrame's columns. If it does, the function directly returns the DataFrame; if not, it assigns the column. Now, let's take a look at something a bit fancier…

Apply Transformation to a Subset of Columns

This is by far the most useful technique in my day-to-day work, as I often deal with datasets comprising multiple data types, with similar columns requiring the same transformations. Take the Titanic dataset, for example. There are null values across various columns, but let’s say we only want to forward-fill the “Age” and “Sex” columns. How would we do that? One way is by using the .assign method and lambda functions to overwrite the original columns:
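
A sketch with .assign, where each column needs its own nearly identical lambda:

```python
import pandas as pd

(
    pd.read_csv("titanic.csv")
    .assign(
        Age=lambda df_: df_["Age"].ffill(),  # forward-fill nulls in Age
        Sex=lambda df_: df_["Sex"].ffill(),  # identical logic, typed again
    )
)
```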

But what if, instead of transforming two columns, we want to transform ten? Typing the same transformation repeatedly becomes tedious, and if you need to change the transformation, you’ll have to update it ten times. Luckily, there’s a better way using .pipe, lambda, and a little Python unpacking magic:
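
Here is one way to write that pattern (the cols list and the df_ naming are mine); the breakdown follows below:

```python
import pandas as pd

cols = ["Age", "Sex"]  # grow this list to ten columns without touching the logic

(
    pd.read_csv("titanic.csv")
    # Select only the target columns, ffill them once, then unpack the
    # result back onto the original DataFrame as keyword arguments
    .pipe(lambda df_: df_.assign(**df_[cols].ffill()))
)
```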

This may look confusing at first, but like any chained flow, let’s break it down step by step. First, the lambda function in the .pipe method passes the DataFrame into a new flow (yes, a flow within a flow). In this flow, we select a smaller DataFrame (comprising only the columns we want to transform) from the original DataFrame. Using this technique, we can apply .ffill to the smaller DataFrame, targeting only the desired columns, and then unpack these columns directly back onto the original DataFrame. This approach is not only syntactically clearer but also offers a performance benefit, as .ffill is called only once!

Next, we will explore how .pipe can help us flow our prepared data directly into visualization libraries.

Piping Transformed Data Directly into Plotly

While I prefer Plotly for its intuitiveness and visual appeal, this strategy also works with Matplotlib, Seaborn, and other graphing libraries that interface directly with pandas DataFrames and Series. Typically, we perform our data operations first and then create a secondary function for visualizing the results. However, when working with data, speed of iteration is important. Ideally, we could append visualizations directly to our chained flow to better inspect the results of our analysis. Using the .pipe method, we can do just that. For example, let's say I want to take a glance at the distribution of fare prices aboard the Titanic:
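
A sketch with plotly.express; the histogram arguments are illustrative:

```python
import pandas as pd
import plotly.express as px

(
    pd.read_csv("titanic.csv")
    .pipe(px.histogram, x="Fare", nbins=50)  # extra kwargs are forwarded to px.histogram
    .update_layout(title="Distribution of Titanic Fares")
    .show()
)
```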

In this flow, I do not use a lambda function but instead pass the desired function directly as the first argument to the .pipe method. The subsequent arguments belong to the px.histogram function, and pandas passes them through as additional keyword arguments. The result of the .pipe method is a Plotly chart, allowing me to append Plotly methods directly to the chain, seamlessly merging my pipeline and visualization.

And there you have it — six life-saving techniques for writing better and more efficient pandas code. In each example, I started the flow with the original dataset. This is intentional, as it’s generally better to read the data directly from the source for each analysis, especially in notebook environments. This approach prevents previous data transformations from altering the results of your pipeline and ensures your pipelines remain reproducible, even if the notebook outputs are refreshed or removed.

If you’d like to experiment with the code examples mentioned above, check out the notebook version of this article on my GitHub. For more about me, my projects, and other work, visit my website.
