Find out how to quickly visualize data with this popular Python tool
Pandas is one of the most popular Python libraries for data science. It features an array of tools for data handling and analysis in Python. Pandas also has visualization functionality, which leverages the matplotlib library in conjunction with its core data structure, the data frame.
Although the visualizations are fairly basic and don’t produce the most beautiful plots, the plotting functionality, especially when combined with other pandas methods such as group-by and pivot tables, allows you to easily create visualizations to quickly analyze a data-set. I use it almost daily to get quick information about the data I am working with, so I wanted to create this brief guide to the functionality I use most often.
In this post, I will be using the Boston house prices data-set, which is available as part of the scikit-learn library (note that load_boston was removed in scikit-learn 1.2, so the code below needs an earlier version). It can also be downloaded from various other sources across the internet, including Kaggle. In the code below, I import the data-set and create a data frame so that it can be used for data analysis with pandas.
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['TARGET'] = pd.Series(boston.target)
We can run boston.DESCR to view an explanation of each feature.
Plotting in pandas utilizes the matplotlib API, so in order to create visualizations you will also need to import this library alongside pandas. If you are working in a Jupyter Notebook, you will also have to add the %matplotlib inline command to display the plots inline in the notebook.
import matplotlib.pyplot as plt
The plot method creates a basic line chart from a data frame or series. In the code below, I have used this method to visualize the AGE column.
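A minimal sketch of this call follows. It uses a small synthetic stand-in for the AGE column so that it runs on its own; the name demo_df and its random values are assumptions for illustration, not the real data-set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for boston_df['AGE'] (made-up values, not the real data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"AGE": rng.uniform(0, 100, size=50)})

# Series.plot() draws a line chart: row index on the x axis, values on the y axis
ax = demo_df["AGE"].plot()
```

Because the x axis is just the row index, a line chart like this says little about how the values are distributed.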
A more useful representation of this data would be a histogram. Simply adding .hist to this command produces this type of plot.
You can add a title to the plot by adding the title argument.
boston_df['AGE'].plot.hist(title='Proportion of owner-occupied units built prior to 1940')
As pandas uses the matplotlib API, you can use all the functionality of this library to further customize the visualization. In the code below, I have customized the colormap and added custom labels to the x and y axes.
boston_df['AGE'].plot.hist(title='Proportion of owner-occupied units built prior to 1940', colormap='jet')
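One way to set the axis labels is through the matplotlib Axes object that the pandas plotting call returns. The sketch below assumes a synthetic stand-in for the AGE column (demo_df and its values are made up), and the title and label strings are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for boston_df['AGE'] (made-up values, not the real data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"AGE": rng.uniform(0, 100, size=50)})

# plot.hist() returns a matplotlib Axes, so the usual Axes methods apply
ax = demo_df["AGE"].plot.hist(title="Distribution of AGE (synthetic data)")

# Hypothetical label text; replace with whatever describes your data
ax.set_xlabel("AGE (hypothetical label)")
ax.set_ylabel("Number of rows (hypothetical label)")
```

Calling plt.xlabel and plt.ylabel after the plot should have the same effect, since they act on the current Axes.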
For a full list of available chart types and optional arguments, see the pandas documentation for DataFrame.plot.
The DataFrame.corr method can be used to quickly inspect the correlations between the variables in a data frame. By default, pandas uses the Pearson method and outputs a data frame containing the pairwise correlation coefficients between the variables.
In the code below, I use this method to determine how each feature correlates with the target variable. The output is shown below the code.
correlations = boston_df.corr()
correlations = correlations['TARGET']
We can see that the feature RM (average number of rooms) correlates quite strongly with the target. Let’s use pandas to visualize this correlation further.
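A scatter plot is one natural way to look at the relationship between two numeric columns. The sketch below is an assumption about how such a plot could be drawn. It uses synthetic stand-in columns named RM and TARGET, with a positive relationship deliberately built into the made-up values, so it illustrates the call and not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: TARGET is constructed to rise with RM, plus noise
rng = np.random.default_rng(0)
rooms = rng.normal(6.0, 0.7, size=100)
demo_df = pd.DataFrame({
    "RM": rooms,
    "TARGET": 9.0 * rooms - 32.0 + rng.normal(0, 3.0, size=100),
})

# One point per row: RM on the x axis, TARGET on the y axis
ax = demo_df.plot.scatter(x="RM", y="TARGET")
```

With the real data frame, the same call would be boston_df.plot.scatter(x='RM', y='TARGET').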
The above code produces the following visualization. We can see that, in general, the median house price increases with the number of rooms.
Pandas visualizations become very powerful for quickly analyzing multiple data points with few lines of code when you combine plots with the groupby function.
Let’s use this functionality to view the distribution of all features in a box plot grouped by the CHAS variable. This feature contains two values: 1 if the property tract bounds the river, and 0 if it doesn’t.
The code below groups the data frame by this column and creates a box plot for each feature, so we can quickly visualize any differences between houses that are close to the river and those that are not.
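As a minimal sketch of this pattern, the groupby object has its own boxplot method. The example below assumes a small synthetic frame with columns named CHAS, RM and AGE (all values made up), and the rot and figsize settings are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 70 rows with CHAS=0 and 30 rows with CHAS=1
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "CHAS": np.repeat([0, 1], [70, 30]),
    "RM": rng.normal(6.0, 0.7, size=100),
    "AGE": rng.uniform(0, 100, size=100),
})

# One panel per CHAS value, each containing a box for every column
axes = demo_df.groupby("CHAS").boxplot(rot=90, figsize=(10, 5))
```

Called this way, pandas draws one subplot per group. If you would prefer one subplot per feature with the groups side by side, DataFrame.boxplot with the by argument is an alternative worth trying.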
Pandas pivot tables, very similar to those found in spreadsheet tools such as Excel, can be useful for quickly aggregating data. You can combine pivot tables with the visualization functionality in pandas to create plots for these aggregations.
We can see from the box plot above that there is a difference in house price between properties near the river and those that are not. In the code below, we create a pivot table to calculate the mean house price for the two groups.
import numpy as np
pivot = pd.pivot_table(boston_df, values='TARGET', index=['CHAS'], aggfunc=np.mean)
This creates the following output.
To visualize this as a bar plot, we can simply run pivot.plot(kind='bar'), which produces the visualization shown below. We can quickly see that house prices are generally higher for houses that are close to the river.
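The two steps can be put together in a self-contained sketch. It assumes a synthetic frame with made-up CHAS and TARGET values, so the bar heights say nothing about the real data-set. It also passes aggfunc as the string 'mean', which should be equivalent to np.mean here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (made-up values, not the real data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "CHAS": np.repeat([0, 1], [70, 30]),
    "TARGET": rng.normal(22.0, 8.0, size=100),
})

# Mean TARGET per CHAS value: one row per group
pivot = pd.pivot_table(demo_df, values="TARGET", index="CHAS", aggfunc="mean")

# One bar per row of the pivot table
ax = pivot.plot(kind="bar", legend=False)
ax.set_ylabel("Mean TARGET")
```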
This post is meant as a quick introduction to plotting with pandas. There are many more options for using visualizations and combining them with the pivot table and group-by methods. The pandas user guide contains a more extensive list of possibilities.
Thanks for reading!
Written by Rebecca Vickery