A boxplot is a graphical representation of the distribution of a dataset based on a five-number summary: the minimum, the first quartile (25th percentile), the median (50th percentile), the third quartile (75th percentile) and the maximum. It was introduced by John Tukey in his classic book “Exploratory Data Analysis” some 50 years ago.
A boxplot is appropriate for visualizing data distributions from multiple groups. It captures the summary of the data efficiently with a simple box and whiskers and allows you to compare easily across groups.
The advantage of comparing quartiles is that they are far less sensitive to outliers than measures such as the mean. Figure 1 below shows a boxplot and its parts. In this article, you are going to learn how to draw a boxplot in Python using a library called “pandas”.
Figure 1: Picture of a boxplot
If you are interested in learning more about the history and evolution of boxplots, check out Hadley Wickham’s 2011 paper “40 years of boxplots”.
Drawing a Boxplot
Boxplots can be drawn using different platforms such as Excel, SPSS, STATA, Python, etc. You may be wondering why we are using pandas and Python. There are two reasons. First, Python offers more flexibility because it is a general-purpose programming language. Second, pandas can handle much larger datasets than spreadsheet software such as MS Excel, which limits the number of rows in a sheet. For these reasons, many data scientists prefer Python.
Drawing a Boxplot with Pandas
What is “pandas”?
In computer programming, “pandas” is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
We are going to use Anaconda, a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment.
You need to download and install Anaconda on your machine if you have not yet done so (it includes the libraries required for this exercise, which are pandas and matplotlib). The procedure is simple, but if you have any difficulty, feel free to ask questions in the comments.
After installing Anaconda, you might need to update the version of pandas that ships with it by running the following command in the Anaconda Prompt:
conda update pandas
To draw a boxplot, you need 5 things:
Minimum
First Quartile or 25% (Q1)
Median (Second Quartile) or 50%
Third Quartile or 75% (Q3)
Maximum
Rest assured, you are not expected to calculate those values yourself. Pandas will do that for you!
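If you are curious about what happens behind the scenes, the five values can be sketched with the quantile method of a pandas series. The scores below are placeholder numbers invented for illustration, not the Consumer Reports data, and the variable names are only assumptions.

```python
import pandas as pd

# Placeholder scores invented for illustration (not the Consumer Reports data)
scores = pd.Series([62, 65, 66, 68, 69, 70, 71, 72, 74, 75])

# quantile() accepts a list of probabilities:
# 0 and 1 return the minimum and the maximum of the series
summary = scores.quantile([0, 0.25, 0.5, 0.75, 1])
summary.index = ["minimum", "Q1", "median", "Q3", "maximum"]

print(summary)
```

Keep in mind that, by default, the whiskers drawn by matplotlib stop at the furthest data points lying within 1.5 times the interquartile range of the box, and any points beyond that are drawn individually. The whisker ends therefore match the minimum and maximum only when no value falls outside that range.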
A Consumer Report providing overall customer satisfaction scores for four cellphone services (AT&T, Sprint, T-Mobile, and Verizon) in 10 metropolitan areas throughout the USA, rated from 0 to 100, is shown in the table below (Consumer Reports, January 2009).
For the coding, you can use any text editor (Notepad++, Sublime Text, etc.). In this tutorial, we are going to use Jupyter Notebook, an interactive notebook environment installed with Anaconda. If you have any issue with the Jupyter version installed by default with Anaconda, just uninstall and reinstall it.
Step 1: Import the relevant libraries (pandas and matplotlib) by writing the following code:
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Load the data into a data frame
Get the path to the file as shown in the figure below.
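On Windows, a file path contains backslashes (\), which Python treats as the start of an escape sequence inside an ordinary string. When you write the path as a string, you should therefore add the letter “r” in front of the opening quote to make it a raw string. You can also solve the problem by replacing all the backslashes (\) in the string with forward slashes (/).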
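Here is a minimal sketch of both ways of writing the path. The folder shown is purely hypothetical, so substitute the location of the file on your own machine.

```python
# Hypothetical Windows location, used only for illustration.
# Without the r prefix, "\U" would be read as the start of a Unicode escape
# and Python would refuse to run the line.
raw_path = r"C:\Users\me\Documents\customer_report.csv"

# The same location written with forward slashes needs no prefix
slash_path = "C:/Users/me/Documents/customer_report.csv"

# Both strings describe the same file; either one can be passed to pd.read_csv
print(raw_path)
print(slash_path)
print(raw_path.replace("\\", "/") == slash_path)
```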
If your data file is in the same folder as the file containing the code (recommended), you just have to write the name of the data file:
dfr = pd.read_csv('customer_report.csv')
At the end of step 2, your Jupyter notebook should look like the following screenshot, depending on the path to your data file:
Step 3: Draw the Boxplot
Boxplots can be drawn using series or data frames. Data frames are more flexible than series.
As a reminder, our dataset has one column containing the names of the 10 metropolitan areas and four columns named “AT&T”, “Sprint”, “T-Mobile” and “Verizon”, each containing the consumer satisfaction scores for that service.
Note: You can always get the names of the columns of your dataset with the following code (dfr being the name given to our data frame):
dfr.columns
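As a rough sketch of what this inspection looks like, the snippet below reads a tiny stand-in for the data file from a string. The header “Metropolitan Area”, the area names and all the scores are placeholders made up for illustration. They are not the Consumer Reports table.

```python
import io

import pandas as pd

# Stand-in for customer_report.csv: hypothetical header name and
# placeholder values, NOT the actual Consumer Reports data
csv_text = """Metropolitan Area,AT&T,Sprint,T-Mobile,Verizon
Area 1,70,66,71,79
Area 2,66,64,70,76
Area 3,71,65,73,78
"""

# read_csv accepts a file path or any file-like object
dfr = pd.read_csv(io.StringIO(csv_text))

# List the column names and check that the score columns were read as numbers
print(dfr.columns.tolist())
print(dfr.dtypes)
```

Checking the data types at this point is a cheap way to catch a score column that was read as text, since such a column cannot be plotted as a boxplot.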
Drawing a Single Boxplot
Let’s begin by drawing a single boxplot for “AT&T”.
The syntax is:
dfr.boxplot(column = "AT&T")
Click the Run button in the Jupyter notebook or press SHIFT + ENTER to run the code. If everything is OK, you should get an output like the one in the figure below.
There are many arguments that can be used to modify the presentation of the boxplot (orientation, colors, width, removing the grid, etc.). Refer to the pandas documentation for more details.
For example, the default orientation of a boxplot is vertical. If you want it to be horizontal instead, add the argument vert = False to the call. The code will then be as follows:
dfr.boxplot(column = "AT&T", vert = False)
You should have the following output:
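As a further sketch of these presentation arguments, the snippet below combines the horizontal orientation with the grid parameter of boxplot and then labels the returned matplotlib axes. The scores are placeholders invented for illustration, and the label and title texts are only suggestions.

```python
import pandas as pd

# Placeholder scores invented for illustration (not the Consumer Reports data)
dfr = pd.DataFrame({"AT&T": [70, 66, 71, 67, 75, 68, 72, 69, 74, 73]})

# grid=False hides the background grid; vert=False draws the box horizontally.
# boxplot returns the matplotlib Axes, which can be customised further.
ax = dfr.boxplot(column="AT&T", vert=False, grid=False)
ax.set_xlabel("Satisfaction score (0 to 100)")
ax.set_title("Customer satisfaction: AT&T")
```

In a Jupyter notebook the figure is displayed under the cell. In a plain Python script you would typically import matplotlib.pyplot as plt and call plt.show() afterwards.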
Drawing Multiple Boxplots
The column parameter of boxplot can also take a list of column names, in which case one box is drawn for each chosen column on the same axes. The syntax for our example is as follows:
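A minimal sketch of such a call, assuming the four score columns are named as described earlier, could look like this. The data frame is rebuilt here with placeholder scores so that the snippet runs on its own. With the real file you would keep the dfr loaded in step 2.

```python
import pandas as pd

carriers = ["AT&T", "Sprint", "T-Mobile", "Verizon"]

# Placeholder scores invented for illustration (not the Consumer Reports data)
dfr = pd.DataFrame({
    "AT&T": [70, 66, 71, 67, 75],
    "Sprint": [66, 64, 65, 65, 65],
    "T-Mobile": [71, 70, 73, 71, 72],
    "Verizon": [79, 76, 78, 80, 77],
})

# Passing a list to column draws one box per listed column on shared axes
ax = dfr.boxplot(column=carriers)
ax.set_ylabel("Satisfaction score (0 to 100)")
```

Because all boxes share the same vertical scale, the medians and spreads of the four services can be compared at a glance.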
After running the code, you should have the following output:
The intention of this article was to introduce you to drawing boxplots with the pandas library. You may have noticed that this library is somewhat limited if you want to improve the aesthetics of your graph. There are other libraries for that, and we shall study them in other articles.