The U.S. Census is a rich source for population-level demographic, socioeconomic, and housing data. Although the Census Bureau has been collecting data since the 1700s, the data collected has changed over time, with topics being removed, added, and redefined decade-to-decade, making long-term longitudinal analyses difficult. Here, we sought to compile and harmonize variables from the 1970 census through the 2010 census at the census tract level in the following categories:
The data is contained in a CSV file called harmonized_historical_census_data.csv, a table of all census tracts listed by their FIPS ID and corresponding tract vintage.
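If you work from the CSV rather than the R object, it may help to read the tract identifier as text, since FIPS identifiers can begin with a zero. The following is a minimal sketch, assuming the file has been saved locally and contains a census_tract_id column; the function name read_harmonized_census and the default path are hypothetical.

```r
# Read the harmonized census CSV, keeping tract identifiers as text.
# `path` is wherever a local copy of the CSV was saved (an assumption).
read_harmonized_census <- function(path = "harmonized_historical_census_data.csv") {
  read.csv(
    path,
    # FIPS tract identifiers can begin with "0"; reading them as numbers
    # would silently drop that leading zero.
    colClasses = c(census_tract_id = "character"),
    stringsAsFactors = FALSE
  )
}
```

Only census_tract_id is forced to character here; other columns keep the types that read.csv infers.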
R
Use the following code to download the harmonized historical census data directly into R:
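library(dplyr) # provides %>% and as_tibble()

# use the "raw" URL; the "blob" URL returns an HTML page, not the .rds file
census_data <- 'https://github.com/geomarker-io/harmonized_historical_census_data/raw/main/harmonized_historical_census_data.rds' %>%
  url() %>%
  gzcon() %>%
  readRDS() %>%
  as_tibble()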
First, geocode longitudinal address data using the DeGAUSS geocoder.
Then, use the DeGAUSS Spatiotemporal Census Tract container to assign census tract identifiers for the appropriate decade. This container requires a date range (start_date and end_date columns). Rows with date ranges that span multiple decades will be split into one row per decade.
For example,
would become
where a 2010 tract identifier is assigned to the first row, and the second row is split into one row with a 2000 tract identifier and one row with a 2010 tract identifier.
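To illustrate the splitting logic, here is a minimal sketch in base R. It is not the DeGAUSS container's implementation: the function name split_by_decade is hypothetical, and it assumes each vintage covers the ten calendar years starting at the decade year (for example, 2000-01-01 through 2009-12-31 for the 2000 vintage).

```r
# Split one date range into one row per census decade (illustrative only).
split_by_decade <- function(start_date, end_date) {
  start_date <- as.Date(start_date)
  end_date <- as.Date(end_date)
  stopifnot(start_date <= end_date)

  # First year of the decade containing a date, e.g. 2007 -> 2000
  decade_of <- function(d) (as.integer(format(d, "%Y")) %/% 10L) * 10L
  vintages <- seq(decade_of(start_date), decade_of(end_date), by = 10L)

  # Assumed decade limits: Jan 1 of the vintage year to Dec 31 nine years later
  decade_start <- as.Date(paste0(vintages, "-01-01"))
  decade_end <- as.Date(paste0(vintages + 9L, "-12-31"))

  data.frame(
    start_date = pmax(decade_start, start_date), # clip to the original range
    end_date = pmin(decade_end, end_date),
    census_tract_vintage = vintages
  )
}

# A hypothetical range that starts in the 2000s and ends in the 2010s
split_by_decade("2008-06-01", "2012-03-15")
```

Under these assumptions the call returns two rows, one per decade, and a range that falls within a single decade is returned unchanged as one row.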
Once census_tract_vintage and census_tract_id are assigned for each row, the harmonized_census_data can be joined to your longitudinal address data using both the census_tract_vintage and census_tract_id columns as keys.
For example, in R this could look like:
dplyr::left_join(
address_data,
harmonized_census_data,
by = c("census_tract_vintage", "census_tract_id")
)
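After joining, it can be useful to check which address rows found no match in the census table, for example because of a mistyped identifier or a key column stored with a different type. The sketch below uses dplyr::anti_join() on toy data; the tract identifiers and the example_variable column are made up for illustration and are not part of the real dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for the real tables
address_data <- tibble(
  id = c("a", "b"),
  census_tract_vintage = c("2000", "2010"),
  census_tract_id = c("39061000100", "39061999999")
)
harmonized_census_data <- tibble(
  census_tract_vintage = "2000",
  census_tract_id = "39061000100",
  example_variable = 0.25 # made-up column
)

# Rows of address_data with no matching vintage and tract in the census table
unmatched <- anti_join(
  address_data,
  harmonized_census_data,
  by = c("census_tract_vintage", "census_tract_id")
)
```

Here unmatched holds only the row with id "b". Both key columns need the same type in the two tables for the join to work as intended.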
The table below includes specific variable definitions for each census decade.
All census data was obtained from nhgis.org. Each harmonized_historical_census_data variable was derived using the formulas listed under the derivation columns, and the census variables included in each formula can be found in the NHGIS dataset listed in the corresponding NHGIS dataset column.