Doing a Sanity Check of the Data with Pivot Tables

Go to the index page for the census-surname lessons:

Learning spreadsheets with Census Surname data

In the previous lesson, we created a compiled sheet of the top 5000 most common surnames from the 2000 and 2010 Census – i.e. a dataset of roughly 10,000 records. However, the compilation process was done by hand – clicking-dragging to select data from the sheets, manually copy-pasting datasets together. Rows/columns may have been left out, accidentally altered/erased, or accidentally included.

We could manually spot-check the compiled sheet, to verify that it contains 10,000 rows, that all of the surnames have a rank between 1 and 5000. But that’s even more error-prone hand work. We can use the Pivot Table functionality of our spreadsheet to quickly and more accurately do this kind of counting.

About the data and pivot tables

Teaching pivot tables is difficult, I imagine learning them even moreso. I first learned about pivot tables just to teach pivot tables – I had thought my programming skills covered most of pivot tables’ benefits. I now find them to be one of the best and most accessible data tools for every journalist (or analyst or researcher or investigator), but I think the learning curve can be steep if you aren’t used to thinking of how to explore data via aggregation, sorting, and filtering.

It doesn’t help that the Pivot Table interface for both Google Sheets and Microsoft Excel can be incredibly opaque and bewildering. So I’ve found that it’s easier just to experience the confusing steps with a hands-on process.

Data anticipation

I’ve created a spreadsheet for this lesson, which includes the compiled sheet from the previous lesson:

Pivoting on the top 5000 surnames

To get the most of out trying out pivot tables, go into it already knowing the answers to find. In this situation, we expect that the compiled surnames dataset has:

  • A total of 10,000 records (give or take a few, since some surnames tied for popularity)
  • Exactly and only two different year values: 2000 and 2010
  • The number of records per year to be split roughly 50/50
  • The total count of Americans to be significantly less than the actual number, since the compiled dataset includes only Americans with the 5,000 most common surnames
  • Roughly 5,000 total unique surnames, probably a few more assuming that some surnames dropped/increased in popularity between 2000 and 2010.

Pivot table lessons

Some previous (but likely out-of-context) pivot table lessons:

One of the most popular videos about how to successfully use Excel is Joel Spolsky’s: You Suck at Excel. Given the technically-proficient audience that Spolsky attracts, I was surprised to see him spend a good chunk of the time teaching pivot tables. Even though I was late to learn them myself, I had thought they were so fundamental to spreadsheet and data analysis as to be a “basic” skill. If you like learning via video, Spolsky’s talk is worth checking out: https://www.youtube.com/watch?v=0nbkaYsR94c

Creating a pivot table

Like charts, pivot tables can be based on a selection of data. For most of our purposes, we want the pivot table to consider the entire sheet. The easiest way to ensure this is to click into any single cell. But not multiple cells, as the Pivot Table tool will interpret that as a range of data to pivot upon.

To create a pivot table, click the Data menu, then the Pivot table… action:

../../_images/data-menu-pivot-table.png

This will create a new, seemingly-empty sheet, which will be named something like Pivot Table 1:

../../_images/new-pivot-table-tab-hilight.png

What to pivot on

In our current situation, we just want to confirm that there are about 10,000 rows in the combined sheet. And roughly equal number of rows (i.e. 5,000) for each of the two possible year values: 2000 and 2010.

The Report Editor

Turn your attention to the Report Editor on the right side. Sometimes clicking on a random spot in the Pivot Table sheet will cause the Report Editor to hide. To bring it back into view, click the A1 cell:

../../_images/pivot-table-report-editor-rightside.png

Under the Report Editor menubar, you can see something that seems to vaguely describe the data that the pivot table is ostensibly “pivoting on”:

compiled!A1:K10004

The compiled sheet, cells A1 through K10004 (I think)

Adding rows to the Report Editor

Usually, the first step in creating a Pivot Table is to add a field to the pivot table’s rows:

../../_images/pivot-table-report-editor-rows-add-fields-hilight.png

Click the Add field link that’s next to the Rows subhead. You should see dropdown list of familiar terms – all the column headers of the Census surname data. Since we wanted to get a count of records by year, select the year option:

../../_images/pivot-table-report-editor-add-rows-year.png

The pivot report, previously empty, should now have 2 new rows – one for each year value, and an additional Grand Total row:

../../_images/pivot-table-report-editor-added-year-rows.png

If there were any other year values besides 2010 and 2000 – including blank values – we would see them as extra rows in the pivot table. That we only have 2 rows is nice assurance.

Adding values to the Report Editor

The next Report Editor function-to-try may appear to be Columns, but we’re skipping that for now. Instead, go to the Values section and click Add field

Again, you’ll be presented with a pop-up menu of column headers for surname data. Choose the count field, which, to refresh your memory, is the number of Americans estimated to have a given surname:

../../_images/pivot-table-report-editor-values-menu-count-highlight.png

No new cells will be added to the pivot table, but the existing empty cells will be filled. Note the display item that popped up in the Values menu. By selecting the count column, you seem to have created an aggregation of the count values – a SUM of them:

../../_images/pivot-table-report-editor-values-count-summed.png

Look at the summed values for 2000 and 2010: 162,168,768 and 178,309,478, respectively. That seems very low for the count of Americans with surnames, untl you remember that our dataset doesn’t include all of the Census data, just the top 5000 common surnames for each year.

Aggregations besides SUM

Let’s add another value field: rank. By default, the aggregation performed on value fields is SUM, which doesn’t make much sense:

../../_images/pivot-table-report-editor-values-rank-summed.png

Go back to the Report Editor’s Values section. Click the dropdown icon next to SUM label for rank to see a menu of all the possible aggregation operations. For rank, let’s pick one that we can predict the answer to: MAX:

../../_images/pivot-table-report-editor-values-rank-agg-options.png

While we’re here, add a new Value field for rank, and aggregate by MIN. Each new Value field adds a new column to the Pivot Table. The results for rank should feel boring – we would expect that the smallest value in rank would be 1, and the biggest 5000:

../../_images/pivot-table-report-editor-values-rank-max-min.png

Counting names per year

Originally we wanted to check and see if there were roughly 5,000 rows/surnames per year. Let’s do that now by going to the Report Editor’s Values section and add another field: name. Note that it’s possible to click-and-drag to rearrange the order of the Values columns. I’ve put the name field up top, and you can see it reflected in the pivot table:

../../_images/pivot-table-report-editor-values-name-sum.png

As previously mentioned, the default aggregation on a Values field is to SUM all the values. But the values of the name field aren’t numbers, they are text strings, like LEE and JOHNSON. Adding them up as numbers is nonsensical; I’m not even sure why the pivot table shows a result of 0.

In any case, we want to use a different aggregation, one that counts the different name values.

COUNT, COUNTA, COUNTUNIQUE

I’ve reworked my pivot table to have only Values fields for COUNT, COUNTA, and COUNTUNIQUE aggregations of the name field. Check out the results:

../../_images/pivot-table-report-editor-values-name-counts.png

Here is the Google documentation for the 3 functions:

Apparently, the COUNT function only counts the number of numeric values. That the COUNT value column is just 0 makes sense, then. As to why COUNTA and COUNTUNIQUE are both equal – this means that in no given year did any of the top 5000 names repeat themselves, e.g. there weren’t two separate listings for NGUYEN.

Which is a good thing!

Why pivot. Why aggregate

There’s more to pivot tables than using them to check the meta-stats of your data, e.g. does my dataset contain all the expected values?. But the core power of the pivot table is how quickly it allows us to explore data through summarization and aggregation.

Getting a dataset with just a dozen rows is practically worthless – what interesting insight could an analysis of 12 data points possibly discover that couldn’t already be contained and explored in the average person’s brain? But getting a dataset with thousands/millions of data points can be just as useless if you don’t know what the big picture is.

Usually, you’ll know what the big picture is. It’ll be something like: for each city/agency/school, I want to find the city/agency/school with the most/worst/oldest/newest [whatever]. Because if a certain group is an outlier, there’s a story for why they’re different than everyone else.

The problem is whether you can make those comparisons and calculations when given a bunch of raw data. Being able to group data by categories and to view it at different summarization levels can greatly help in finding stories and meaning in the data.

Aggregation/visualization as simplification

I don’t think it’s an insult to say that data visualization represents a simplification of the raw source data. But raw data can’t often just be plugged into a visualization tool. It takes your personal effort and insight to know how the data can and should be simplified before trying to visualize it.

Going back to the compiled sheet of Census surname data with 10,000 rows, what happens if you try to visualize it? I wouldn’t even know where to start – what is the visualization program supposed to do with 10,000 rows of names and demographic data? What is anybody supposed to do with it?

If you can’t figure out your mess, neither can the computer

Let’s try feeding Google Sheet’s visualization wizard with all 10,000 rows of the name and count column:

../../_images/charting-compiled-surname-data-select-name-count.png

Passing these 2 columns to the Insert chart action, Google’s default is to create a column chart (5000 columns!). It’s not nothing, but it’s not really anything either:

../../_images/charting-compiled-surname-data-select-name-count-resulting-chart.png

Telling the computer what to group by

Here’s another attempt: passing in the year and count column. Google’s default response is not good:

../../_images/charting-compiled-surname-data-select-year-count-default-chart-mess.png

It’s worth taking a look at the Chart Editor options. The “Column chart” type seems fine. But look at the other options:

  • Switch rows / columns
  • Use row 1 as headers
  • Use column A as labels
  • Aggregate column A
../../_images/charting-compiled-surname-data-select-year-count-chart-options.png

Let’s try these options one-by-one. Except for “Switch rows/columns”, as that doesn’t make any sense.

Use row 1 as headers

Apparently, the way I selected data doesn’t automatically imply that row 1 is meant to be used as the headers row. Selecting that option doesn’t seem to change anything, other than to add labels to the legend (which itself doesn’t make any sense):

../../_images/charting-compiled-surname-data-select-year-count-chart-options-row-1-headers.png

Use column A as labels

Column A of the surname data is the year column. In my mind, this chart would at the very least show the count of names by year. Which means that the year values of 2010 and 2000 would show up somewhere on the chart, such as the x-axis:

../../_images/charting-compiled-surname-data-select-year-count-chart-options-column-a-labels.png

Now that looks like an actual chart. There are two columns that appear to correspond to the values of 2000 and 2010. However, looking at the actual values for count, it appears that the two y-values correspond to the most common surname of each year. I guess it’s not nothing. But it seems strange for the chart to show only 2 data points out of the 10,000 given to it.

Aggregate column A

This is the option that turns our sloppy chart attempt into an actual chart that uses the data.

../../_images/charting-compiled-surname-data-select-year-count-chart-options-agg-column-a.png

First, the aggregation option appears to supersede the “Use column A as labels” option. By aggregating column A, i.e. the year column, we get two labels on the x-axis, 2000 and 2010.

The y-values appear to be the sum of the count values for each years. This makes sense as we know that in 2010, more people were among the top 5000 surnames than compared to 2000. Which is to say, we don’t know much at all. It could be that there are just correspondingly more people in America in 2010, while the common last name distribution remains the same.

In any case, this chart shows something that we can at least explain, even if it’s not much of a story. Note how the Google Wizard has no qualms about setting the y-axis to something non-zero, making it appear that 2010 had 10 times the aggregate count compared to 2000.

Using a pivot table to sum count by year

Before we end this lesson, let’s replicate the data table ostensibly behind the chart showing aggregate count of surnames by year.

From the compiled table, create a new Pivot Table (the previous one will remain as is).

In the Report Editor:

  • Add the year field to the Rows section
  • Add the SUM of the count field to the Values section
  • Remove the Grand Total row by unchecking the Show totals option.

The results and process should feel similar to our first pivot table, as the steps are pretty much the same:

../../_images/charting-compiled-surname-data-pivot-year-to-count-sum.png

Now to create a chart from the pivot table. For many intents and purposes, a pivot table is like any other data table, with the main exception that you aren’t supposed to manually alter the table’s structures and values beyond using the Report Editor. But we can pass it directly to the Chart Wizard by highlighting the table and using the Insert chart… action:

../../_images/charting-compiled-surname-data-pivot-year-to-count-sum-charted.png

Randomness is suffering

OK, using a pivot table to create a basic column chart may not feel like much of a victory. Because it literally is the same chart as the one created in the previous section.

However, the important difference, one that is admittedly hard to appreciate right away, is that the previous attempt at charting the raw compiled data involved a desperate, confused process of trial-and-error. Only by tinkering with the Chart editor options – and already spending a lot of time thinking about what our data contains – were we able to come up with something remotely usable. And that was after the Google visualization app burned a few extra of our laptops’ lifecycles just to process the data dump we gave it.

In contrast, using the pivot table to aggregate the compiled data should have felt more rational and purposeful. Yes, the Pivot Table interface is a bit strange and wonky. But once you get used to how it can be used to aggregate and explore raw data, it’s a matter of memorizing the conventions to get to those insights.

The visualization at the end of this lesson was unsatisfactory because we didn’t have any ideas of what insights are in the Census data. In subsequent lessons, we’ll be less preoccupied with the tedious work of data wrangling and learning pivot tables.