Harmonized Historical Census Data

Overview

The U.S. Census is a rich source of population-level demographic, socioeconomic, and housing data. Although the census has been conducted since 1790, the data collected have changed over time, with topics removed, added, and redefined from decade to decade, which makes long-term longitudinal analyses difficult. Here, we compile and harmonize variables from the 1970 census through the 2010 census at the census tract level in the following categories:

population
income
poverty
assisted income
unemployment
educational attainment
racial composition
housing occupancy
housing value

Getting the Data

Downloading the CSV file

The data is contained in a CSV file called harmonized_historical_census_data.csv, a table with one row per census tract, identified by its FIPS ID and the corresponding tract vintage.
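Because FIPS-based tract identifiers can begin with a zero, it may be worth reading the identifier columns as text. The following is a minimal sketch in base R, not part of the original instructions. It writes a tiny illustrative file to a temporary path so that it runs on its own, and it assumes the key columns are named census_tract_id and census_tract_vintage, as in the join example in this document. In practice, replace csv_path with the path to the downloaded CSV.

```r
# Illustrative two-row file standing in for the downloaded CSV
# (the real file has many more rows and columns)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "census_tract_id,census_tract_vintage",
    "01001020100,2010",
    "39061000100,2000"
  ),
  csv_path
)

# Read both key columns as character so leading zeros survive
census_csv <- read.csv(
  csv_path,
  colClasses = c(
    census_tract_id = "character",
    census_tract_vintage = "character"
  )
)

# Default parsing would turn "01001020100" into the number 1001020100
census_csv$census_tract_id
```

Keeping both keys as character also avoids type mismatches when joining to address data whose tract identifiers are stored as text.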

Importing directly into `R`

Use the following code to download the harmonized historical census data directly into R:

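library(dplyr)  # provides %>% and as_tibble()

# Use the /raw/ URL: the /blob/ URL returns an HTML page, not the .rds file
harmonized_census_data <- 'https://github.com/geomarker-io/harmonized_historical_census_data/raw/main/harmonized_historical_census_data.rds' %>%
    url() %>%
    gzcon() %>%
    readRDS() %>%
    as_tibble()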

Usage

Matching longitudinal address data to census tracts

First, geocode longitudinal address data using the DeGAUSS geocoder.

Then, use the DeGAUSS Spatiotemporal Census Tract container to assign census tract identifiers for the appropriate decade. This container requires a date range (start_date and end_date columns). Rows with date ranges that span multiple decades will be split into one row per decade.

For example,

would become

where a 2010 tract identifier is assigned to the first row, and the second row is split into one row with a 2000 tract identifier and one row with a 2010 tract identifier.
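To make the splitting rule concrete, here is a minimal sketch in base R. It is not the DeGAUSS implementation: split_by_decade is a hypothetical helper, the dates are made up, and the sketch assumes that decade boundaries fall on January 1 of years ending in 0 and that each decade maps to the vintage named after its first year. The container may treat boundaries and out-of-range dates differently.

```r
# Hypothetical helper, not the DeGAUSS implementation: split one date range at
# decade boundaries, assumed here to fall on January 1 of years ending in 0.
split_by_decade <- function(start_date, end_date) {
  start_date <- as.Date(start_date)
  end_date <- as.Date(end_date)
  stopifnot(start_date <= end_date)

  # First year of the decade containing each endpoint, e.g. 2005 -> 2000
  first_decade <- (as.integer(format(start_date, "%Y")) %/% 10L) * 10L
  last_decade <- (as.integer(format(end_date, "%Y")) %/% 10L) * 10L
  decades <- seq(first_decade, last_decade, by = 10L)

  decade_start <- as.Date(paste0(decades, "-01-01"))
  decade_end <- as.Date(paste0(decades + 9L, "-12-31"))

  # Clip each decade to the original range, one output row per decade
  data.frame(
    census_tract_vintage = as.character(decades),
    start_date = pmax(decade_start, rep(start_date, length(decades))),
    end_date = pmin(decade_end, rep(end_date, length(decades))),
    stringsAsFactors = FALSE
  )
}

# A made-up range that starts in the 2000s and ends in the 2010s
split_by_decade("2005-03-01", "2012-06-30")
```

Under these assumptions the call returns two rows: a 2000 vintage row covering 2005-03-01 through 2009-12-31 and a 2010 vintage row covering 2010-01-01 through 2012-06-30. A range that lies within a single decade comes back as one row.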

Joining to harmonized_historical_census_data

Once census_tract_vintage and census_tract_id are assigned for each row, the harmonized_census_data can be joined to your longitudinal address data using both the census_tract_vintage and census_tract_id columns as keys.

For example, in R this could look like

dplyr::left_join(
  address_data,
  harmonized_census_data,
  by = c("census_tract_vintage", "census_tract_id")
)
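After joining, it can be useful to check which address rows found no matching tract. The sketch below uses small made-up tables and base R's merge() so that it runs without extra packages. The id column, the example_variable column, and all values are hypothetical, and storing both keys as character is an assumption about how the real tables are typed.

```r
# Hypothetical stand-ins for the real tables; all values are made up
address_data <- data.frame(
  id = c("a", "b", "c"),
  census_tract_vintage = c("2000", "2010", "2010"),
  census_tract_id = c("39061000100", "39061000100", "99999999999"),
  stringsAsFactors = FALSE
)
harmonized_census_data <- data.frame(
  census_tract_vintage = c("2000", "2010"),
  census_tract_id = c("39061000100", "39061000100"),
  example_variable = c(0.10, 0.20),  # hypothetical column name
  stringsAsFactors = FALSE
)

# Base R counterpart of a left join on both keys (all.x keeps every address row)
joined <- merge(
  address_data,
  harmonized_census_data,
  by = c("census_tract_vintage", "census_tract_id"),
  all.x = TRUE
)

# Address rows without a matching tract have NA in the joined columns
unmatched <- joined[is.na(joined$example_variable), ]
unmatched$id
```

Note that merge() sorts the result by the key columns, whereas dplyr::left_join() keeps the row order of the left table. Joining on the tract identifier alone would be a mistake here, because the same identifier can appear under more than one vintage.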

Details

The table below includes specific variable definitions for each census decade.

Reproducibility

All census data was obtained from nhgis.org. Each harmonized_historical_census_data variable was derived using the formulas listed under the derivation columns, and the census variables used in each formula can be found in the NHGIS dataset listed in the corresponding NHGIS dataset column.