Data Visualization with pandas & Dash


Anytime I have a chance to make something visual out of my work, I jump at the opportunity. I recently had some free time on my hands, and lo and behold, my interest was also piqued by this recent announcement from Plotly!

This post will be the first of two in a short series giving a quick walkthrough of how we can use pandas and Dash to make a simple Line & Scatter plot. In this first post, we will focus only on the data wrangling portion of things. The code is available on my GitHub at:


Prepping Data w/ pandas

As I mentioned, we are going to be using pandas and Dash (and a few other packages) for this lesson.

#!/usr/bin/env python
import dash                          #the Dash app framework
import dash_core_components as dcc   #Dash's interactive components (graphs, sliders, dropdowns)
import dash_html_components as html  #Dash wrappers around HTML elements
import pandas as pd                  #data wrangling
import plotly.graph_objs as go       #graph objects used to build the figures
import datetime                      #standard library dates

I hope by now that many (if not all) of y’all have already had your ears talked off by the wealth of blog posts, PyCon tutorials, etc., lauding pandas for data wrangling and data analysis. When I first started working in Python, I was reluctant to rely on pandas and preferred to roll my own functions for all my data wrangling needs. I really wish I had been more open to using it earlier; it has done nothing but increase my productivity when it comes to working with CSV/Excel data.

First, we are going to read the data file into a DataFrame object (one of the two core data structures, alongside Series, that pandas brings to Python). For brevity, I am only going to show a portion of the 213 rows that are in this xlsx file.

#read in excel data
data_df = pd.read_excel("data/aug_dec_2016.xlsx")

[screenshot of the data_df dataframe]

For those of you wondering about the “NaT” values in some of the cells: NaT (“Not a Time”) is pandas’ missing-value marker for datetime columns, so we can read those cells as having no value entered in them. They are essentially NULL.
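
If you want to see just how many of those missing dates there are, a quick count works. Here is a minimal sketch (column names taken from the dataframe above):

#count the missing (NaT) values in each date column
print(data_df[["dateordered", "datereturned"]].isna().sum())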

Next, I am going to add two new columns, one named “period” and one named “rperiod” (the r is for returns). To extract the YYYY-MM portion from the two date columns, we can use a lambda function that converts each date to a string and slices off the seven leftmost characters. Under the second comment in the code snippet below, you can see that I apply that lambda function to the “dateordered” and “datereturned” columns of our dataframe.

#lambda function to get monthly periods from date fields
to_string = lambda x: str(x)[:7]

#add monthly period columns as per directions
data_df['period'] = data_df.dateordered.apply(to_string)
data_df['rperiod'] = data_df.datereturned.apply(to_string)
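
As an aside, you can get the same result without a lambda by letting pandas do the string conversion in a vectorized way. This sketch should be equivalent (missing dates still come out as the string "NaT"):

#vectorized alternative to the lambda above
data_df['period'] = data_df.dateordered.astype(str).str[:7]
data_df['rperiod'] = data_df.datereturned.astype(str).str[:7]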

I just recently discovered the pandas where statement for adding a touch of conditional flair when manipulating data. Let’s look at the code first and then we’ll step through the details.

#add returns column to separate true orders from true returns
data_df['returns'] = data_df.orders.where(data_df.orderstatus == 'returned',0)
data_df['orders'] = data_df.orders.where(data_df.orderstatus == 'complete',0)

Let’s use the first line from directly above as our example. In plain English, we are adding a new column named “returns” to our dataframe, using the existing “orders” column as the source that populates it. The where statement lets us set a conditional rule for how the new column gets filled: for a given row, set the “returns” value equal to whatever is currently in that row’s “orders” column IF the row’s “orderstatus” is equal to “returned”; otherwise, set it to 0. The second line turns right around and does the same thing for the “orders” column itself, zeroing out any row whose “orderstatus” is not “complete”. I am left with a dataframe that is much easier to work with (in my opinion).
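
If the plain-English version is hard to follow, a toy example (illustration only, not part of the pipeline) makes the behavior of where pretty clear: values where the condition is True are kept, everything else is replaced by the second argument (0 here).

#toy illustration of Series.where
toy = pd.DataFrame({"orders": [5, 3, 2],
                    "orderstatus": ["complete", "returned", "complete"]})
print(toy.orders.where(toy.orderstatus == "returned", 0))
#0    0
#1    3
#2    0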

[screenshot of data_df with the new “returns” column and the updated “orders” column]

Great! Our dataframe object now has clearly labeled columns for orders AND returns instead of having everything lumped into the “orders” column. Now let us make two new dataframes by selecting only the columns we need from the original.

#make new dataframes for returns & orders
returns = data_df[["rperiod","returns"]].copy()
orders = data_df[["period", "orders"]].copy()

To actually do this with pandas, we simply pass a list of the columns we want from the original dataframe. You will also notice that I am appending a .copy(). When you take a slice of a dataframe, pandas cannot always tell whether the result is a view of the original or an independent copy, so assigning to that slice later can raise a SettingWithCopyWarning and may (or may not) modify the original. The .copy() makes these new dataframes explicitly independent of the original.
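
Here is a small sketch of the difference (illustration only, not meant to be run as part of this pipeline; the variable names are mine):

#without .copy(), pandas can't always tell whether the slice is a view or a copy,
#so assigning to it may raise a SettingWithCopyWarning and may or may not touch data_df
maybe_view = data_df[["period", "orders"]]
maybe_view["orders"] = 0

#with .copy(), the new dataframe is explicitly independent of data_df
independent = data_df[["period", "orders"]].copy()
independent["orders"] = 0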

#rename column
returns.rename(columns={"rperiod": "period"}, inplace=True)

#group by periods and sum
returns = returns[returns.period != "NaT"].groupby("period")["returns"].sum().to_frame().reset_index()
orders = orders[orders.period != "NaT"].groupby("period")["orders"].sum().to_frame().reset_index()

We are nearing the home stretch! The two new dataframes that we created in the previous step will eventually be merged into a single dataframe, and we want to do that based on a column they have in common. As it stands, returns and orders share no column names. This is easily addressed by renaming one: I chose to rename the “rperiod” column in the returns dataframe to just “period”.

We could combine the two dataframes right now, but before we do I want the total of returns and of orders for each period. That is what calling .groupby("period") on each dataframe accomplishes. groupby() refers to the split-apply-combine process: split a dataset into groups, apply some function (typically an aggregation), and then combine the groups back together. Once the rows are grouped by period, we apply .sum() to the returns and orders columns. The trailing .to_frame().reset_index() is a bit of housekeeping: it says we want the result back as a dataframe (not a Series) and that “period” should be a regular column rather than the index. Note also the filter on != "NaT", which drops the rows that never had a return date. What we get is a pair of nice, neat dataframes of the summed returns and orders.
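
A toy example (illustration only) shows the split-apply-combine idea in miniature:

#group rows by period, sum the counts, then bring "period" back as a regular column
toy = pd.DataFrame({"period": ["2016-08", "2016-08", "2016-09"],
                    "orders": [1, 2, 5]})
print(toy.groupby("period")["orders"].sum().to_frame().reset_index())
#    period  orders
#0  2016-08       3
#1  2016-09       5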

[screenshot of the summed returns and orders dataframes]

#merge the two new dataframes
month_df = pd.merge(returns, orders, how="outer", on="period").fillna(0)

Now we can combine our returns and orders dataframes. In pandas this is called merging, and it is similar to a join in SQL: on="period" is the join key, how="outer" keeps every period that appears in either dataframe, and .fillna(0) fills in a 0 wherever one side had no rows for that period.
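
To make the how="outer" and .fillna(0) parts concrete, here is a tiny sketch (illustration only): a period that shows up in only one of the two frames is still kept, and its missing count becomes 0.

#toy outer merge: 2016-09 has orders but no returns
r = pd.DataFrame({"period": ["2016-08"], "returns": [2]})
o = pd.DataFrame({"period": ["2016-08", "2016-09"], "orders": [10, 7]})
print(pd.merge(r, o, how="outer", on="period").fillna(0))
#    period  returns  orders
#0  2016-08      2.0      10
#1  2016-09      0.0       7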

[screenshot of the merged month_df dataframe]

#add return rate computed columns
month_df["return_rate"] = month_df.returns/month_df.orders
month_df["return_rate_percentage"] = month_df.returns/month_df.orders * 100
month_df["return_rate_percentage"] = month_df.return_rate_percentage.round(2)
month_df["return_rate_%_delta"] = ((month_df.return_rate_percentage - month_df.return_rate_percentage.shift(1))/month_df.return_rate_percentage.shift(1))
month_df["return_rate_%_delta"] = month_df["return_rate_%_delta"].round(2)

While we are at it, let us add some computed columns to see how the rate of returns varies for each monthly period: the raw return rate (returns divided by orders), that same rate as a rounded percentage, and a month-over-month delta of the percentage calculated with .shift(1).
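
The delta column leans on .shift(1), which lines each row up against the previous row’s value. A quick sketch of just that piece (illustration only):

#(this month - last month) / last month, with nothing to compare against in the first row
toy = pd.Series([10.0, 12.5, 10.0])
print((toy - toy.shift(1)) / toy.shift(1))
#0     NaN
#1    0.25
#2   -0.20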

#output to excel
month_df.to_excel("data/results.xlsx")

Lastly, we output the merged dataframe to an Excel file.
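
One small option worth knowing about: by default, to_excel also writes the dataframe’s integer index out as its own column. If you’d rather not have it, pass index=False (this assumes an Excel writer engine such as openpyxl or xlsxwriter is installed):

#same output file, but without the index column
month_df.to_excel("data/results.xlsx", index=False)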


OK, in the next post we’ll look at how to create reactive data visualizations on top of the dataframes we have just built, ending up with something similar to what you see below!

[gif of the finished interactive visualization]