Histograms #

Histograms are a very common way to visualize distributions. The basic idea behind a histogram is to sort the data into bins and report the number of objects in each bin.

Histograms in Plotly #

Plotly does have a built-in means to make a histogram, but it is not quite as flexible as one might like. Thus, I think it is worth considering alternative ways to construct them. I really like binning the data myself, and then plotting the binned data as a bar chart with the bars touching. This last point is critical, as histograms where the bars do not touch are not well-formed histograms. This is because any space between bars would indicate there is no data in that space, which is probably not the case.

Preparing data #

In this tutorial, we will be working with the tuition costs of public universities in 2023. I have a collection of just a bit more than 1600 such universities in the USA. The data looks like this:

The data obviously continues for a very long time, but the gist is this: we have a list of universities, and the tuition charged by these universities.

Binning data #

For a histogram, we don’t need to know the names of the universities, only their tuition. Thus, we are only interested in the second column (column 1, by Python counting), and we also want to skip the 1st row, which is the header. We can import the data as follows:

import numpy as np
public_tuition = np.genfromtxt(<path to data>,
              unpack = True, 
              usecols = [1],
              skip_header = 1,
              delimiter = ",")
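
To see what these keyword arguments do without needing the real file, here is a minimal sketch that reads a tiny in-memory CSV instead. The university names and tuition values are made up for illustration, and I am assuming the real file has the same layout (a header row, then name and tuition separated by a comma).

```python
import io
import numpy as np

# Hypothetical stand-in for the real file: a header row, then name,tuition rows.
sample_csv = io.StringIO(
    "university,tuition\n"
    "Example State University,9500\n"
    "Sample Tech,12250\n"
    "Placeholder College,3100\n"
)

sample_tuition = np.genfromtxt(sample_csv,
              usecols = [1],      # keep only the tuition column
              skip_header = 1,    # skip the header row
              delimiter = ",")

print(sample_tuition)        # a 1-D float array, one entry per university
print(sample_tuition.shape)
```

The names never reach the array, since only column 1 is read.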

The genfromtxt call leaves public_tuition as a single array holding the tuition of all the public universities. We next need to bin this. Fortunately, NumPy has a histogram function that can bin data for us. It accepts the data as the first argument and returns two arrays: the count in each bin, and the bin edges. Note that there is one more edge than there are bins, since every bin needs both a left and a right edge. By default, the data is sorted into 10 equal-width bins. We use this as follows:

binned_tuition  = np.histogram(public_tuition)

This will give us the data we need to make a histogram in Plotly, using bar charts!
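
It is worth checking what np.histogram actually hands back before plotting it. The following is a small sketch on made-up tuition values (not the real data set) that unpacks the two arrays, compares their lengths, and computes the midpoint of each bin, which is a natural position for a bar.

```python
import numpy as np

# Hypothetical tuition values, used only to show the shapes involved.
sample_tuition = np.array([1000, 2500, 3000, 3200, 4800, 7000, 9000, 10000])

# With no bins argument, np.histogram uses 10 equal-width bins.
counts, edges = np.histogram(sample_tuition)

# There is one more edge than there are bins.
print(len(counts), len(edges))

# Midpoint of each bin: the average of its left and right edges.
centers = (edges[:-1] + edges[1:]) / 2

# Width of each bin: the difference between neighboring edges.
widths = np.diff(edges)
print(centers)
print(widths)
```

Unpacking into counts and edges is just a naming choice; indexing the returned tuple with [0] and [1] gives the same two arrays.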

Plotting data #

With the data given above, we can make a first bar chart:

from plotly.subplots import make_subplots

tuitionHist = make_subplots()
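# np.histogram returns bin edges (one more than the counts), so place each bar at its bin center
bin_centers = (binned_tuition[1][:-1] + binned_tuition[1][1:]) / 2
tuitionHist.add_bar(x = bin_centers, y = binned_tuition[0])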
tuitionHist.show('png')

Running this produces the bar chart.

This is not yet a well-formed histogram! The bars do not touch, and the gap between the 1st and 2nd bars, for instance, implies that there is no data there. This is not true, so we need to make the bars touch.

The way to do this is to use the bargap keyword argument within the update_layout() method of the Plotly figure. While we are there, we might as well apply a better template. This looks like this:

tuitionHist.update_layout(template = "simple_white", bargap = 0)

Here, the bargap = 0 means that we have no gap between the bars, and we obtain:

We now have a well-formed histogram, though a poorly designed one!

Refining the plot #

The refinements required here are pretty simple:

  • We should change the $x$-axis tick labels to include a “$” in front of the numbers.
  • We should add a title to the $y$-axis.
  • We should add an overall title to the plot.

Doing all this, we arrive at the complete code:

import numpy as np
from plotly.subplots import make_subplots

public_tuition = np.genfromtxt(<path to data>,
              unpack = True, 
              usecols = [1],
              skip_header = 1,
              delimiter = ",")

binned_tuition  = np.histogram(public_tuition)

tuitionHist = make_subplots()
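# np.histogram returns bin edges (one more than the counts), so place each bar at its bin center
bin_centers = (binned_tuition[1][:-1] + binned_tuition[1][1:]) / 2
tuitionHist.add_bar(x = bin_centers, y = binned_tuition[0])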

tuitionHist.update_xaxes(tickprefix = "$")
tuitionHist.update_yaxes(title = "number of universities")

tuitionHist.update_layout(title = "the most common public university tuition is ~$3k", template = "simple_white", bargap = 0)

tuitionHist.show('png')

Which outputs: