CoDEC tabular-data-resource specifications

Introduction

The CoDEC tabular-data-resource (TDR) specifications provide a set of patterns designed to make sharing tabular community-level data easier. Examples of this type of data include the pediatric hospitalization rate per month per census tract, the total number of gunshots per season per neighborhood, and the housing code enforcement density per year per ZIP code.

The specifications are built on top of the Frictionless Tabular Data Resource and define a CoDEC TDR as a table of data with accompanying metadata. See the fr package for tools to curate, read, and write frictionless TDRs in R. Within the {codec} R package, use check_codec_tdr() to validate an existing TDR against the CoDEC specifications (version 1.0.1).

Data

Data is specified as average values or total counts for census tract geographies during a specific year (or year and month). The required census tract identifier and year (or year and month) columns in a CoDEC TDR contain the spatiotemporal information that can be used to link other data.

Census Tract Identifier Column

A CoDEC TDR must include a census tract column named census_tract_id_{year}, where {year} is replaced with the decennial vintage of the census tract geographies used to create the dataset (i.e., census_tract_id_2000, census_tract_id_2010, or census_tract_id_2020).

The census tract identifier column MUST contain 11-digit GEOID identifiers for all census tracts in Hamilton County (GEOID: 39061). Lists of the required census tract identifiers for 2000, 2010, and 2020 are available in the {cincy} R package (e.g., cincy::tract_tigris_2010).
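The naming and format rules above can be checked mechanically. The following is a minimal Python sketch, not the {codec} implementation: the function names are hypothetical, it assumes a TDR carries exactly one census tract column, and it checks only the format of each identifier (11 digits beginning with 39061), not whether every required tract from {cincy} is present.

```python
import re

# census_tract_id_ followed by one of the supported decennial vintages
TRACT_COLUMN = re.compile(r"census_tract_id_(2000|2010|2020)")
# Hamilton County prefix (39061) followed by a 6-digit tract code
HAMILTON_GEOID = re.compile(r"39061\d{6}")


def tract_column_name(header):
    """Return the census tract column name found in `header`.

    Assumption: exactly one such column is expected per TDR.
    """
    matches = [name for name in header if TRACT_COLUMN.fullmatch(name)]
    if len(matches) != 1:
        raise ValueError(f"expected one census tract column, found {matches}")
    return matches[0]


def invalid_tract_ids(values):
    """Return the values that are not 11-digit Hamilton County GEOIDs."""
    return [v for v in values if not HAMILTON_GEOID.fullmatch(str(v))]
```

Identifiers are compared as strings so that they are never coerced to numbers on the way in.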

A CoDEC TDR that was not created at a census tract level should link to a URL (using the homepage property) that contains code and a descriptive README file about how the data was harmonized (e.g., areal interpolation) with census tract geographies.

Year (and Month) Temporal Column(s)

Year (and month) temporal variables in a CoDEC TDR must be in a “tidy format” so that each row represents one observation in time. This allows for cumulatively updating data resources without changing field-specific metadata.

A CoDEC TDR must include a column called year that contains only integers representing the year during which the data was collected (e.g., 2018, 2023).

It may also contain a month column, in which case the unique combination of the year and month columns represents the calendar month during which the data was collected (e.g., “2023” and “11” together represent November of 2023).
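As an illustration of these temporal rules, the sketch below validates the year (and optional month) of a single row. It is an assumption-laden example rather than part of the specification: `check_temporal` is a hypothetical helper, rows are assumed to arrive as dictionaries of strings, and months are assumed to be numbered 1 through 12.

```python
def check_temporal(row):
    """Return (year, month) for one row; month is None if there is no month column.

    `row` is assumed to be a dict mapping column names to string values.
    int() raises ValueError for non-integer text such as "2023.5".
    """
    year = int(row["year"])
    if "month" not in row:
        return (year, None)
    month = int(row["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    return (year, month)
```

For example, `check_temporal({"year": "2023", "month": "11"})` yields the pair identifying November of 2023.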

File Structure

Like all tabular-data-resources, a CoDEC TDR consists of a directory that must contain exactly one data (.csv) file and one metadata file named tabular-data-resource.yaml.

The name of the directory and the name of the CSV file containing the data MUST be identical to the name property.

For example,

mydata
├── mydata.csv
└── tabular-data-resource.yaml

Both files must be encoded using UTF-8 with newlines encoded as either \n or \r\n.
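The layout and encoding requirements can be sketched as a small check using only the Python standard library. This is a hypothetical helper, not the behavior of check_codec_tdr(): it compares the CSV file name with the directory name, and it does not read the name property from the metadata file, which would require a YAML parser.

```python
from pathlib import Path


def check_tdr_layout(tdr_dir):
    """Return a list of layout problems for a TDR directory (empty if none found)."""
    tdr_dir = Path(tdr_dir)
    name = tdr_dir.name
    problems = []

    # Exactly one CSV file, named after the directory
    csv_files = sorted(p.name for p in tdr_dir.glob("*.csv"))
    if csv_files != [f"{name}.csv"]:
        problems.append(f"expected only {name}.csv, found {csv_files}")

    # The metadata file has a fixed name
    if not (tdr_dir / "tabular-data-resource.yaml").is_file():
        problems.append("missing tabular-data-resource.yaml")

    # Both files must decode as UTF-8
    for filename in (f"{name}.csv", "tabular-data-resource.yaml"):
        path = tdr_dir / filename
        if path.is_file():
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                problems.append(f"{filename} is not valid UTF-8")
    return problems
```

Returning a list of problems instead of raising on the first one lets a curator see every issue in a single pass.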

The data file must follow the RFC 4180 standard for CSV files. In addition:

the filename must end with .csv
the first row must be a header row, containing the unique name of each field
if a value is missing, it must be represented by either NA or an empty string
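A reader that honors these CSV rules might look like the following sketch, which uses Python's standard csv module. The function name `read_tdr_rows` is hypothetical, and the sketch keeps every non-missing value as a string instead of applying the field types from the metadata.

```python
import csv
import io

# The two representations of a missing value allowed by the rules above
MISSING = {"NA", ""}


def read_tdr_rows(text):
    """Parse CSV text into a list of dicts, mapping missing values to None."""
    # newline="" leaves line endings to the csv module, so \n and \r\n both work
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    if len(set(header)) != len(header):
        raise ValueError("field names in the header row must be unique")
    return [
        {name: (None if value in MISSING else value)
         for name, value in zip(header, record)}
        for record in reader
    ]
```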

The metadata file MUST be a YAML file named tabular-data-resource.yaml adhering to the metadata specifications.

Metadata

The metadata of a CoDEC TDR is represented as a hierarchical list in a specific format. On disk, this metadata is stored separately from the data as a tabular-data-resource.yaml file.

This hierarchical list is made up of properties:

Property

Each property (or “metadata property”) is a named value used to describe the data resource.

name	value
profile	`tabular-data-resource`
name	identifier composed of lower case alphanumeric characters, `_`, `-`, or `.`
path	relative file path or URL of data file
version	semantic version of the data resource
title	human-friendly title of the resource
homepage	homepage on the web related to the data; ideally a code repository used to create the data
description	additional notes about the resource
schema	a list object containing items in schema
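Two of these properties have formats that are easy to check with regular expressions. The sketch below is illustrative only: both function names are hypothetical, and the version pattern is a simplification that accepts plain MAJOR.MINOR.PATCH values without the pre-release or build suffixes that semantic versioning also allows.

```python
import re

# Lower case alphanumeric characters, "_", "-", or "."
NAME_PATTERN = re.compile(r"[a-z0-9_.-]+")
# Simplified semantic version: MAJOR.MINOR.PATCH without leading zeros
VERSION_PATTERN = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_valid_name(name):
    """True if `name` uses only the characters allowed for the name property."""
    return NAME_PATTERN.fullmatch(name) is not None


def is_valid_version(version):
    """True if `version` looks like a plain MAJOR.MINOR.PATCH version."""
    return VERSION_PATTERN.fullmatch(version) is not None
```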

Schema

The schema (or “table schema”) is a special property that is a list of information about the fields (or columns) in a tabular-data-resource.

name	value
fields	a list with one entry per field, each containing the items in fields
missingValues	the string values that should be considered missing observations
primaryKey	a field or set of fields that uniquely identifies each row
foreignKeys	a field or set of fields that connect to a separate table
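To make the primaryKey property concrete, the following sketch finds key values that occur in more than one row. It assumes, for illustration only, that a CoDEC TDR would use its census tract identifier and year (and month, if present) columns as the primary key; the specification text here does not state that, and `duplicated_keys` is a hypothetical helper.

```python
from collections import Counter


def duplicated_keys(rows, primary_key):
    """Return the primary key tuples that appear in more than one row.

    `rows` is a list of dicts; `primary_key` is a list of field names.
    """
    counts = Counter(tuple(row[field] for field in primary_key) for row in rows)
    return [key for key, n in counts.items() if n > 1]
```

An empty result means the chosen fields uniquely identify each row.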

Fields

fields (or “field descriptors”) is a special schema property that is a list of each of the fields in a tabular-data-resource, each with field-specific properties.

name	value
name	machine-friendly name of the field
title	human-friendly name of the field
description	any additional notes about the field
type	Frictionless type of the field
constraints	Frictionless constraints, including `enum`, an array of possible values or factor levels
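One plausible consistency check is to compare the field descriptors with the header row of the CSV file. The sketch below assumes the metadata has already been loaded into a Python dictionary and that fields is a mapping keyed by field name, as in the example later in this document; `compare_fields` is a hypothetical helper, and the specification text does not itself mandate this check.

```python
def compare_fields(header, fields):
    """Compare CSV column names with the field descriptors in the schema.

    Returns (undescribed, absent): columns in the data with no descriptor,
    and described fields that do not appear in the data.
    """
    described = {descriptor["name"] for descriptor in fields.values()}
    undescribed = [name for name in header if name not in described]
    absent = sorted(described - set(header))
    return undescribed, absent
```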

Example

The metadata for an example CoDEC TDR looks like:

name: tract_poverty
path: tract_poverty.csv
title: Fraction of Census Tract Households in Poverty
version: 1.2.1
description: |
  Measures derived from the 5-year American Community Survey.
  Downloaded from [IPUMS NHGIS](https://nhgis.org) and
  converted to match 2020 census tract boundaries.
schema:
  fields:
    census_tract_id_2020:
      name: census_tract_id_2020
      title: Census Tract Identifier
      description: 2020 vintage census tract identifier
      type: string
    year:
      name: year
      title: Year
      type: integer
    fraction_poverty:
      name: fraction_poverty
      title: Fraction of Households in Poverty
      type: number

The CSV data file for this example CoDEC TDR would contain values for all vintage 2020 census tracts in Hamilton County, but only the first and last five are shown here:

year, census_tract_id_2020, fraction_poverty
2020, 39061021508, 0.057
2020, 39061021421, 0.031
2020, 39061023300, 0.030
2020, 39061002000, 0.098
2020, 39061002500, 0.442
 ...      ...       ... 
2020, 39061021604, 0.259
2020, 39061024700, 0.062
2020, 39061026102, 0.154
2020, 39061023501, 0.046
2020, 39061009800, 0.391