Data Visualization with pandas & Dash
Anytime I have a chance to make something visual out of my work I jump at the opportunity. I recently had some free time on my hands, and lo and behold I also had my interest piqued by this recent announcement by Plotly!
This post will be the first of two in a short series giving a quick walkthrough of how we can use pandas and Dash to make a simple Line & Scatter plot. In this first post, we will focus only on the data wrangling portion of things. The code is available on my GitHub at:
Prepping Data w/ pandas
As I mentioned, we are going to be using pandas and Dash (and a few other packages) for this lesson.
#!/usr/bin/env python
import dash
import dash_core_components as dcc
import dash_html_components as html
import pandas as pd
import plotly.graph_objs as go
import datetime
I hope by now that many (if not all) of y’all have already had your ears talked off with the wealth of blog posts, PyCon tutorials, etc., lauding pandas for data wrangling and data analysis. When I first started working in Python, I was reluctant to rely on pandas and preferred to roll my own functions for all my data wrangling needs. I really wish that I had been more open to using it earlier. It has done nothing but increase my productivity when it comes to working with CSV/Excel data.
First, we are going to want to read in the data file to model the data in a DataFrame object (one of the two core data structures that pandas brings to Python). For brevity, I am only going to be showing a portion of the 213 rows that are in this xlsx file.
#read in excel data
data_df = pd.read_excel("data/aug_dec_2016.xlsx")
For those of you wondering about the “NaT” value in some of the cells, we can read those cells as having no value entered in them. They are essentially NULL.
Next, I am going to add two new columns, one named “period” and one named “rperiod” (the r is for returns). To extract the YYYY-MM from the two date columns we can use a lambda function to convert the dates to strings and take a slice of the 7 left most characters in the date column. Under the second comment in the code snippet below, we can see that I am applying that lambda function to the “dateordered” and “datereturned” columns in our dataframe.
#lambda function to get monthly periods from date fields
to_string = lambda x: str(x)[:7]
#add monthly period columns as per directions
data_df['period'] = data_df.dateordered.apply(to_string)
data_df['rperiod'] = data_df.datereturned.apply(to_string)
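To see exactly what that slice produces, here is a minimal sketch using made-up dates (not the post's dataset). Note what happens to the missing dates mentioned earlier:

```python
import pandas as pd

# same lambda as above: stringify the date, keep the 7 leftmost characters
to_string = lambda x: str(x)[:7]

sample = pd.Series([pd.Timestamp("2016-08-15"), pd.NaT])
periods = sample.apply(to_string)
# str(Timestamp("2016-08-15")) is "2016-08-15 00:00:00", so [:7] keeps "2016-08";
# str(pd.NaT) is "NaT", so empty cells become the literal string "NaT"
```

This is why we will later filter out rows whose period equals the string "NaT" before grouping.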
I just recently discovered the pandas where statement for adding a touch of conditional flair to manipulating our data. Let’s look at the code first and then we’ll step through the details.
#add returns column to separate true orders from true returns
data_df['returns'] = data_df.orders.where(data_df.orderstatus == 'returned',0)
data_df['orders'] = data_df.orders.where(data_df.orderstatus == 'complete',0)
Let’s use the first line from directly above as our example. In plain English, we are adding a new column named “returns” to our dataframe object, using the already established “orders” column as the source that populates it. The where statement allows us to set conditional rules for how the new column is filled. The rule in this example states that for a given row, the “returns” column gets the value currently in that row’s “orders” column IF that row’s “orderstatus” column is equal to “returned”. If it is not, the new “returns” column is set to 0.

I hope that wasn’t too confusing, because we are turning right around and doing it once more — except this time I set anything in “orders” that did not have an “orderstatus” equal to “complete” to 0. I am left with a dataframe that is much easier to work with (in my opinion).
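The two lines above can be sketched on a few hypothetical rows (made-up values, not the real order data) to make the rule concrete:

```python
import pandas as pd

# hypothetical rows standing in for the order data
df = pd.DataFrame({
    "orders": [5, 3, 2],
    "orderstatus": ["complete", "returned", "complete"],
})

# keep the order count only where the condition holds; use 0 elsewhere
df["returns"] = df.orders.where(df.orderstatus == "returned", 0)
df["orders"] = df.orders.where(df.orderstatus == "complete", 0)
```

The middle row was a return, so its count moves into “returns” and its “orders” value becomes 0.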
Great! Our dataframe object now has clearly labeled columns for orders AND returns instead of having everything lumped into the “orders” column. Now let us make two new dataframes by selecting only the columns we need from our original.
#make new dataframes for returns & orders
returns = data_df[["rperiod","returns"]].copy()
orders = data_df[["period", "orders"]].copy()
To actually do this with pandas, we simply pass a list of the columns we require from the original dataframe. You will also notice that I am appending a .copy() as well. In pandas, indexing a dataframe can return a view of the initial dataframe, so changing the subset may change the initial dataframe. This .copy() makes these new dataframes independent of the original.
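A quick sketch of that independence, on a toy frame (not the real data):

```python
import pandas as pd

df = pd.DataFrame({"period": ["2016-08"], "orders": [5]})

# take an explicit copy of the column subset, then modify the copy
subset = df[["orders"]].copy()
subset.loc[0, "orders"] = 0

# the original dataframe is untouched
```

Without the .copy(), pandas may emit a SettingWithCopyWarning on an assignment like this, since it cannot always tell whether the subset shares data with the original.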
#rename column
returns.rename(columns={"rperiod": "period"}, inplace=True)
#group by periods and sum
returns = returns[returns.period != "NaT"].groupby("period")["returns"].sum().to_frame().reset_index()
orders = orders[orders.period != "NaT"].groupby("period")["orders"].sum().to_frame().reset_index()
We are nearing the home stretch! The two new dataframes that we created in the previous step are eventually going to be merged together into a single dataframe. We are going to want to do this based on a column they both have in common. As it stands now, returns and orders do not have any columns that are the same. This is easily addressed by renaming a column: I chose to rename the “rperiod” column in the returns dataframe to just “period”.

We could now combine these two dataframes, but before we actually do that I want to calculate the sum of returns or orders that take place in each period. This is accomplished by calling .groupby("period") on each of the two dataframes. groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function (typically an aggregation), and then combine the groups back together. Once all of our periods are grouped together, we apply the .sum() function to the returns and orders columns. The .to_frame().reset_index() performs a little bit of housekeeping: with these two calls we are stating that we want the returned object to be a dataframe (not a series) and that we do not want the period column to be the index of the dataframe. What we get is a pair of nice, neat dataframes of the sums of returns and orders.
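The whole split-apply-combine chain can be sketched on a toy returns dataframe (invented values, mirroring the shape used above):

```python
import pandas as pd

# toy returns data: two rows for August, one for September
returns = pd.DataFrame({
    "period": ["2016-08", "2016-08", "2016-09"],
    "returns": [1, 2, 3],
})

monthly = (
    returns.groupby("period")["returns"]
    .sum()            # one total per period (a Series indexed by period)
    .to_frame()       # turn the Series back into a DataFrame
    .reset_index()    # make "period" an ordinary column again
)
```

The two August rows collapse into a single row with their summed total, and “period” ends up as a regular column rather than the index.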
#merge the two new dataframes
month_df = pd.merge(returns, orders, how="outer", on="period").fillna(0)
Now we can combine our returns and orders dataframes together. In pandas this is called merging. This is similar to joining in SQL.
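A small sketch of what the outer merge plus fillna(0) does, using toy frames with deliberately non-overlapping periods:

```python
import pandas as pd

returns = pd.DataFrame({"period": ["2016-08", "2016-09"], "returns": [3, 1]})
orders = pd.DataFrame({"period": ["2016-09", "2016-10"], "orders": [10, 7]})

# how="outer" keeps periods present in EITHER frame; fillna(0) replaces the
# NaNs produced for periods missing from one side
month_df = pd.merge(returns, orders, how="outer", on="period").fillna(0)
```

An inner join would have kept only 2016-09 here; the outer join keeps all three periods, with zeros filled in where one side had no data.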
#add return rate computed columns
month_df["return_rate"] = month_df.returns/month_df.orders
month_df["return_rate_percentage"] = month_df.returns/month_df.orders * 100
month_df["return_rate_percentage"] = month_df.return_rate_percentage.round(2)
month_df["return_rate_%_delta"] = ((month_df.return_rate_percentage - month_df.return_rate_percentage.shift(1))/month_df.return_rate_percentage.shift(1))
month_df["return_rate_%_delta"] = month_df["return_rate_%_delta"].round(2)
While we are at it, let us add in some additional columns to see how the rate of returns varies for each monthly period.
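The trickiest of these is the delta column, which leans on .shift(1). A quick sketch on a toy percentage series (made-up numbers) shows what it computes:

```python
import pandas as pd

# toy monthly return-rate percentages
rates = pd.Series([10.0, 12.0, 9.0])

# change relative to the previous month: (current - previous) / previous
delta = (rates - rates.shift(1)) / rates.shift(1)
# .shift(1) pushes every value down one row, so each row sees last month's
# value; the first row has no previous month and comes out as NaN
```

So a rise from 10% to 12% shows up as 0.2, and a drop from 12% to 9% as -0.25, with NaN for the very first period.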
#output to excel
month_df.to_excel("data/results.xlsx")
Lastly, output the merged dataframe to an excel file.
OK, in the next post we’ll get to how we can create reactive data visualizations for the dataframes we have just worked with to end up with something similar to what you see below!