Logging a Catalog of Public Data Catalogs [2017-10-09]
- California governments have to publish a list of the data systems/databases they use. Let’s make a list of the most interesting items in this informal crowdsourcing exercise
- 2017-10-09 11:59PM
padjo-2017 homework sb-272-research your_sunet_id
This assignment involves contributing to a communal Google Sheet.
Just email me when your 10 entries about California public datasets have been entered.
Believe it or not, public agencies aren’t always thrilled to fulfill your public records requests as fast as possible. Sometimes they’ll outright refuse you, blaming it on technological problems (the old mainframe is on its last vacuum tubes) or privacy concerns. Often, it’s a combination of both, such as not having the software or workflow to efficiently redact documents. Or, they’ll offer to fulfill the request, but charge you for the hundreds or thousands of programmer hours they say the database query will take.
Some of these stalemates end up being year-long legal fights. But sometimes, agencies refuse out of reflexive habit. Or, they honestly don’t know what the situation is, and all it takes is a little inside knowledge on your part for them to see that something is doable.
So if there is a technical/legal/monetary reason for why the actual data/records can’t be released, ask for the data structure.
When we learn databases, I will refer to this as the “data schema”. For now, it’s fine to think of it as asking for the column headers in a spreadsheet, so that you can at least predict what the spreadsheet contains, i.e. what exactly the government is tracking.
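To make the "column headers" idea concrete, here is a minimal Python sketch that reads only the header row of a CSV file. The sample text and its column names are made up for illustration; they are not taken from any real agency's data.

```python
import csv
import io

# Hypothetical CSV text standing in for a file an agency might release.
SAMPLE_CSV = """stop_date,stop_time,location,violation,outcome
2016-01-04,08:15,EL CAMINO REAL,SPEEDING,CITATION
2016-01-04,09:40,WILLOW RD,STOP SIGN,WARNING
"""


def read_headers(csv_file):
    """Return the first row of a CSV file-like object as a list of column names."""
    reader = csv.reader(csv_file)
    return next(reader)


headers = read_headers(io.StringIO(SAMPLE_CSV))
print(headers)
```

Even without a single row of actual records, that list of names tells you what is being tracked (and what isn't).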
For instance, Menlo Park Police have started putting data online. But so far it’s been slim. And my initial suspicion is that they are leaning towards the side of redaction instead of transparency.
For example, look at MPPD’s 2015-2016 Traffic Stops dataset:
Compare the Menlo Park data structure/schema/column headers with what the state of Connecticut produces:
This is a whole topic unto itself (one in which, incidentally, Stanford is doing ground-breaking research). For now, it’s enough to say that every detail of data collection is political. MPPD’s released traffic stop data has very little information about the people stopped, such as race. Is that something that is just left out of the public release? Or is it something not collected at all, because it’s not seen as a priority or of particular relevance to Menlo Park citizens?
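Comparing two sets of column headers is also easy to do in code. The sketch below is only an illustration: both header lists are invented, so substitute the real headers from whichever two datasets you are comparing.

```python
# Both lists are invented for illustration; they are NOT the real
# Menlo Park or Connecticut column headers.
sparse_headers = ["stop_date", "stop_time", "location", "violation", "outcome"]
detailed_headers = [
    "stop_date", "stop_time", "location", "violation", "outcome",
    "driver_race", "driver_age", "search_conducted",
]


def missing_columns(headers, reference):
    """Return the columns in `reference` that do not appear in `headers`, sorted."""
    return sorted(set(reference) - set(headers))


gaps = missing_columns(sparse_headers, detailed_headers)
print(gaps)
```

The resulting list of gaps is a reasonable starting point for a follow-up question to the agency: are these fields redacted, or never collected?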
I could request a more complete form of the MPPD traffic stop data, but I didn’t want to run into an immediate roadblock. So I asked for something that is much easier for them to release, and that theoretically will answer my question:
(I never followed up, so you can decide for yourself.)
Sure, ideally, raw data would flow like water. But when it doesn’t, information about the data can not only help you make more informed, harder-to-reject records requests, but it could lead to other ideas that you hadn’t realized. Many obvious, important stories aren’t being told simply because journalists are plain unaware of what the government actually collects.
We can take this meta-thinking one level further: asking the government for a list of all the datasets in its custody. And then requesting each of those datasets. Knowing that data exists is 60 to 80% of getting the data.
In 2016, California state law S.B. 272 went into effect.
The EFF has a great write-up here:
For open data advocates, the new law—S.B. 272—represents an important step forward to releasing government datasets, since these catalogs also serve as a sort of menu of records that may be requested under the California Public Records Act, depending on the sensitivity of the data. From a privacy perspective, these catalogs also reveal the types of information that local governments are collecting on their systems, including potentially surveillance equipment and software.
To celebrate S.B. 272, EFF held a hackathon/crowdathon to create a catalog of catalogs of California public data systems. We can reap the benefits of their data organization by scrolling to the table at the bottom of the post:
The EFF’s table of California data catalogs provides for each city/county/locality the URL where you can find either a webpage or other document that lists the city’s data inventory.
This exercise is part data-entry, mostly exploratory research. Pick a jurisdiction at random, such as the City/County of San Francisco, and first, just peruse the variety of data systems they have.
Then, think about which items in the data catalog would be interesting from a journalism perspective. Do a little research. Fill out a communal spreadsheet. Repeat 10 times (i.e. look for 10 interesting municipal data systems).
So the first part of this assignment is to visit this Google spreadsheet:
But don’t make a copy. Just fill it in as if it were the only spreadsheet. The entire class, theoretically, should be able to fill in the sheet at the same time. That’s a nice feature when it comes to general crowdsourcing.
You don’t need to alter the structure of this spreadsheet. In fact, just don’t. The first columns are boilerplate, such as your SUNet ID, and copies of the original data: they mimic columns that are found in the catalog as mandated by SB 272, such as the entity (San Francisco) and the type of entity (city/county). Start with those.
Now the harder part is the set of custom fields that I want you to fill out:
- interest_level: On a numerical scale of 1 to 10, 10 being “really interesting to me”, rate this dataset in terms of how interesting it is. Consider factors like how complicated it might be. I know that’s all very vague, but just try to make a number.
- Why is this interesting to journalists?: Write a paragraph or two about why this data, even in its official (and clunky) description, is of journalistic interest.
- Examples of other stories/records requests that refer to or rely on a similar database/dataset: If you were able to make a case that this data is of journalistic interest, that means someone else has probably written something similar. Use Google to find a couple of examples.
- Direct URL to official data/homepage (if exists): Many of these SB 272 datasets will require a records request because the agency hasn’t gotten around to uploading it. But see if the data is already online, and if it is, link to it.
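If it helps to think of one row of the sheet as a data structure, here is a rough Python sketch. The field names are my own shorthand for the columns, not the sheet's actual headers, and the checks are just one plausible reading of the instructions above.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    # Field names are hypothetical stand-ins for the sheet's columns.
    sunet_id: str
    entity: str
    entity_type: str
    system_name: str
    interest_level: int
    why_interesting: str
    example_stories: str
    data_url: str = ""  # left blank when the data is not already online

    def problems(self):
        """Return a list of human-readable problems with this entry."""
        found = []
        if not 1 <= self.interest_level <= 10:
            found.append("interest_level must be between 1 and 10")
        if not self.why_interesting.strip():
            found.append("why_interesting is empty")
        if not self.example_stories.strip():
            found.append("example_stories is empty")
        return found


# An incomplete example entry; the interest level is an arbitrary pick.
entry = CatalogEntry(
    sunet_id="your_sunet_id",
    entity="San Francisco",
    entity_type="city/county",
    system_name="enCampment",
    interest_level=8,
    why_interesting="Tracks homeless encampment locations and timeframes.",
    example_stories="",
)
print(entry.problems())
```

The URL field is the only one allowed to stay empty, since many of these datasets simply aren't online yet.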
EPA’s catalog is here:
Not a lot of entries (compared to SF), and many of the entries seem relevant to web production/site management/systems operations, not necessarily data or records worth requesting, e.g.
- Barracuda Backup
- Microsoft Exchange
- VMware ESXi
One thing does catch my eye: a “Rent Stabilization Program Database”. I have no idea what’s actually in it, but rent control is a hot topic. If it’s the data system used to track which properties are considered rent-controlled, we might be able to calculate statistics on how that’s changed over time.
You can look at the first row of this spreadsheet to see how I filled things out:
(This is outdated because SF has updated their data catalog. Not sure if this database has been renamed, or removed, but in any case…)
SF data catalog can be found here:
(it might be easier to just download it as a CSV and open in a spreadsheet)
A lot of the entries don’t seem to interest me. But one does stick out: the “enCampment” data system maintained by the “GSA-Public Works”, with a purpose of “Tracking homeless encampment location and timeframes.”
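If you do download a catalog as a CSV, a keyword search can be quicker than scrolling. This is a minimal sketch: the column names (`department`, `system_name`, `purpose`) are assumptions, and only the enCampment row reflects the actual catalog; the other two rows and their purposes are filler I made up.

```python
import csv
import io

# Hypothetical excerpt of a catalog CSV; the column names are assumptions,
# and the purposes of the last two rows are invented filler.
CATALOG_CSV = """department,system_name,purpose
GSA-Public Works,enCampment,Tracking homeless encampment location and timeframes.
Technology,Microsoft Exchange,Email
Technology,VMware ESXi,Server virtualization
"""


def search_catalog(csv_file, keyword):
    """Return rows whose system name or purpose mentions `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    matches = []
    for row in csv.DictReader(csv_file):
        text = f"{row['system_name']} {row['purpose']}".lower()
        if keyword in text:
            matches.append(row)
    return matches


hits = search_catalog(io.StringIO(CATALOG_CSV), "homeless")
for row in hits:
    print(row["department"], "|", row["system_name"], "|", row["purpose"])
```

With a real file, you would pass `open("catalog.csv", newline="")` instead of the `io.StringIO` stand-in and adjust the column names to match what the jurisdiction actually published.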
I didn’t complete the entry on our sheet, but here’s a public records request that refers to that dataset:
I’ve talked about how great spreadsheets are for journalism and data keeping. So why not use spreadsheets for all kinds of text input? One problem you might run into is trying to make a new line with Enter but ending up in the next cell.
If you want to make an empty line/new line within a spreadsheet cell, you have to hold down CTRL before hitting Enter.
Some of the columns require narrative text. I post this tip here just in case you’re like me and want to break the text up with some whitespace.
- Some catalogs list computer systems that are, well, data, but not database-data. Which is so arbitrary, I know. The easiest way to explain this is: don’t ask for things like their Microsoft Outlook mail system, because that’s something that every jurisdiction has.
- The additional fields require you to ask yourself why something is interesting. That is to prevent you from picking random items when you have no F-ing clue what they are.
- There is one spreadsheet for the entire class. Avoid picking data sets already listed.
- I know most of you only care about nearby governments. But part of this exercise is to show how similar all cities/counties are. So please do your research for data sets across at least 4 different jurisdictions. They don’t all have to be places you’ve heard of.
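The counting rules above (10 entries, at least 4 jurisdictions, nothing already listed) can be written as a quick self-check. This is only a sketch of my reading of those rules; the function name and the tuple layout are made up for illustration and have nothing to do with how the sheet is actually graded.

```python
def check_submission(my_entries, already_listed, min_entries=10, min_jurisdictions=4):
    """Check a list of (jurisdiction, system_name) tuples against the rules.

    `already_listed` is a set of the same kind of tuples entered by others.
    Returns a list of problems; an empty list means the rules are met.
    """
    problems = []
    if len(my_entries) < min_entries:
        problems.append(f"only {len(my_entries)} of {min_entries} entries")
    jurisdictions = {jurisdiction for jurisdiction, _ in my_entries}
    if len(jurisdictions) < min_jurisdictions:
        problems.append(f"only {len(jurisdictions)} of {min_jurisdictions} jurisdictions")
    for jurisdiction, system in sorted(set(my_entries) & set(already_listed)):
        problems.append(f"already listed: {system} ({jurisdiction})")
    return problems


# A deliberately incomplete example submission.
my_entries = [
    ("San Francisco", "enCampment"),
    ("EPA", "Rent Stabilization Program Database"),
]
already_listed = {("San Francisco", "enCampment")}
problems = check_submission(my_entries, already_listed)
print(problems)
```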
Again, the class spreadsheet is here:
EFF’s landing page is here:
(All of the information in this section is just info about where to find civic data in general. None of it directly applies to the actual assignment.)
Google will still be the most straightforward way to look for data and any info hosted on the web. But data is a big world, with information better suited to specialized sites and scrapers. It’s worth knowing of a few data portals, which let you see how common (or not) data is across the United States and elsewhere.
My go-to site for local (city/county/state) data is opendatanetwork.com. It doesn’t include every data portal (just the Socrata ones), but I like using it for ideas and general awareness of what data is out there.
Example search for “police stops”
There are other great data portals, such as public.enigma.io and data.gov.uk, but OpenDataNetwork will be the most relevant for local and regional (i.e. non-federal) investigative topics.
Remember that cities and counties (especially across different states) can have incredible disparity in what data and records they offer. That’s how jurisdictions work: San Francisco doesn’t have to follow NYC’s example of posting taxi data, and SF may not even have the same kind of taxi regulatory agency. Policies will differ wildly even among cities in the same state, or cities 10 minutes apart like Palo Alto and Menlo Park.
While it’s helpful to realize that data and data management can be similar across bureaucracies (which encourages you to ask for data in one city because you know other cities release it without problem), sometimes the interesting story is that a particular city or agency is the odd one out in not having commonly available data.
It’s worth visiting the U.S. City Open Data Census website, a project of the Open Knowledge Foundation, Code for America and the Sunlight Foundation. Besides being a cool visualization that doubles as a statusboard to show which cities are following the data trend, just seeing how they’ve categorized the possible datasets may give you new ideas: