
The CoDEC tabular-data-resource (or tdr) specifications provide a set of patterns useful for sharing tabular community-level data. Examples of this data include the pediatric hospitalization rate per month per census tract, the total number of gunshots per season per neighborhood, and the housing code enforcement density per year per ZIP code.

Each CoDEC tdr consists of (1) a single table of data and (2) its metadata. CoDEC uses metadata specifications for community-level data in an effort to make them more interoperable and reusable.

Data

Data is specified as average values or total counts for census tract geographies during a specific year (or year and month). The required census tract identifier and year (or year and month) columns in a CoDEC tdr contain the spatiotemporal information that can be used to link it with other data, including spatiotemporal health information.

Census Tract Identifier Column

A CoDEC tdr MUST include a census tract column named census_tract_id_{year}, where {year} is replaced with the decennial vintage of the census tract geographies used to create the dataset (i.e., census_tract_id_2000, census_tract_id_2010, or census_tract_id_2020).

The census tract identifier column MUST contain 11-digit GEOID identifiers for all census tracts in Hamilton County (GEOID: 39061). Lists of the required census tract identifiers for 2000, 2010, and 2020 are available in the {cincy} R package (e.g., cincy::tract_tigris_2010).
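The naming and format rules above can be checked mechanically. The following is a minimal sketch in Python, not part of the CoDEC tooling: the function names are hypothetical, and requiring exactly one tract column is an assumption (the text only says a tdr MUST include one). It does not check that every Hamilton County tract is present, because that needs the reference lists from {cincy}.

```python
import re

# Column name pattern from the text: census_tract_id_{year},
# where {year} is one of the three decennial vintages.
TRACT_COLUMN = re.compile(r"census_tract_id_(2000|2010|2020)")

# An 11-digit GEOID that starts with the Hamilton County prefix 39061.
HAMILTON_GEOID = re.compile(r"39061[0-9]{6}")


def find_tract_column(header):
    """Return the census tract column name found in a list of column names.

    Assumption: exactly one such column is expected.
    """
    matches = [name for name in header if TRACT_COLUMN.fullmatch(name)]
    if len(matches) != 1:
        raise ValueError(f"expected one census tract column, found {matches}")
    return matches[0]


def invalid_geoids(values):
    """Return the values that are not 11-digit Hamilton County GEOIDs."""
    return [v for v in values if not HAMILTON_GEOID.fullmatch(str(v))]
```

Reading identifiers as strings rather than numbers keeps them comparable with other sources that store GEOIDs as text.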

A CoDEC tdr that was not created at a census tract level SHOULD link to a URL (using the homepage property) that contains code and a descriptive README file about how the data was harmonized (e.g., areal interpolation) with census tract geographies.

Year (and Month) Temporal Column(s)

Year (and month) temporal variables in a CoDEC tdr MUST be in a “tidy” format so that each row represents one time point of observations. This allows for updating datasets with newer data without changing field metadata. A CoDEC tdr MUST include a column called year that contains only integers representing the year during which the data was collected (e.g., 2018, 2023). It also MAY contain a month column, in which case the unique combination of the year and month columns represents the calendar month during which the data was collected (e.g., “2023” and “11” together represent November of 2023).
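As an illustration of the temporal rules, the sketch below checks rows that have been read from a CSV file as dictionaries of strings. It is a hypothetical helper rather than anything provided by CoDEC, and it makes two assumptions not stated in the text: years are written with four digits, and months are the integers 1 through 12.

```python
import re

YEAR = re.compile(r"[0-9]{4}")      # assumption: four-digit years
MONTH = re.compile(r"[0-9]{1,2}")   # assumption: months written as 1-12


def check_temporal(rows):
    """Return a list of problems with the year and optional month values.

    `rows` is a list of dicts mapping column names to string values.
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        year = row.get("year")
        if year is None or not YEAR.fullmatch(year):
            problems.append(f"row {i}: year {year!r} is not a four-digit integer")
        if "month" in row:
            month = row["month"]
            if not (MONTH.fullmatch(month) and 1 <= int(month) <= 12):
                problems.append(f"row {i}: month {month!r} is not between 1 and 12")
    return problems
```

Because each row carries its own year (and month), appending a newer time period only adds rows and leaves the set of columns unchanged.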

File Structure

A CoDEC tdr consists of a directory that MUST contain exactly one data (.csv) file and exactly one metadata file (tabular-data-resource.yaml).

The name of the directory and the name of the CSV file containing the data MUST be identical to the name property.

For example,

mydata
├── mydata.csv
└── tabular-data-resource.yaml
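A layout like the one above could be verified with a short script. This is a sketch under stated assumptions: `check_layout` is a hypothetical name, and it only compares the directory name with the CSV file name; comparing both with the name property would additionally require parsing the YAML file.

```python
from pathlib import Path

METADATA_FILE = "tabular-data-resource.yaml"


def check_layout(directory):
    """Return a list of problems with the layout of a tdr directory."""
    directory = Path(directory).resolve()
    problems = []

    # Exactly one .csv file, named after the directory.
    csv_files = sorted(p.name for p in directory.glob("*.csv"))
    if len(csv_files) != 1:
        problems.append(f"expected exactly one .csv file, found {len(csv_files)}")
    elif csv_files[0] != f"{directory.name}.csv":
        problems.append(f"{csv_files[0]} does not match directory name {directory.name}")

    # Exactly one metadata file with a fixed name.
    if not (directory / METADATA_FILE).is_file():
        problems.append(f"missing {METADATA_FILE}")
    return problems
```

For the example directory, `check_layout("mydata")` would return an empty list when both files are present and correctly named.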

Both files MUST be encoded using UTF-8 with newlines encoded as either \n or \r\n.

The data file MUST follow the RFC 4180 standard for CSV files. In addition:

  • the filename MUST end with .csv
  • the first row MUST be a header row, containing the unique name of each field
  • if a value is missing, it MUST be represented by either NA or an empty string (``)
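The header and missing-value rules in this list can be sketched with Python's standard csv module. `read_codec_csv` is a hypothetical helper, not a CoDEC function; it treats both NA and the empty string as missing and rejects records whose length differs from the header.

```python
import csv
import io

# The two permitted representations of a missing value.
MISSING = {"NA", ""}


def read_codec_csv(text):
    """Parse CSV text and return (header, rows).

    Rows are dicts keyed by field name; missing values become None.
    """
    # newline="" leaves \n and \r\n untouched so the csv module handles both.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    if len(set(header)) != len(header):
        raise ValueError("header row must contain unique field names")

    rows = []
    for number, record in enumerate(reader, start=1):
        if len(record) != len(header):
            raise ValueError(
                f"record {number}: expected {len(header)} fields, found {len(record)}"
            )
        rows.append(
            {name: (None if value in MISSING else value)
             for name, value in zip(header, record)}
        )
    return header, rows
```

Values are returned as strings; converting them to the types declared in the metadata is left to the caller.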

The metadata file MUST be a YAML file named tabular-data-resource.yaml adhering to the metadata specifications.

Metadata

Metadata is information about data, but does not contain the data itself. For example, a CSV file cannot tell R (or other software) anything about itself including general information like its name, title, description, or homepage, as well as details on each column, including names, titles, types, formats, and constraints.

The metadata of a CoDEC tdr is represented as a hierarchical list in a specific format. On disk, this metadata is stored separately from the data in a tabular-data-resource.yaml file; in R, it is stored in the attributes of a data.frame.

The CoDEC specifications are based on the Frictionless Data Resource, Table Schema, and Tabular Data Resource standards.

The metadata of a CoDEC tdr is hierarchically composed of different descriptors:

  • properties (or “metadata properties”) are named values used to describe the data resource. The value of most properties is a single character string (e.g., name = "my_data"), but some are lists.

  • schema (or “table schema”) is a special property that is a list of information about the fields (or columns) in a tabular-data-resource. schema includes a list of fields, as well as the value used to denote missingness and which fields are primary or foreign keys.

  • fields (or “field descriptors”) is a special schema descriptor list of each of the fields in a tabular-data-resource, each with different descriptors containing field-specific information.

A CoDEC tdr MUST contain name and path descriptors. Any other property, schema descriptor, or field descriptor MAY be present, but MUST be one of the following:

property

| name | value |
|------|-------|
| profile | profile of this descriptor (always set to tabular-data-resource here) |
| name | an identifier string composed of lower case alphanumeric characters, _, -, and . |
| path | location of data associated with resource as a POSIX path relative to the tabular-data-resource.yaml file or a fully qualified URL |
| version | semantic version of the data resource |
| title | human-friendly title of the resource |
| homepage | homepage on the web related to the data; ideally a code repository used to create the data |
| description | additional notes about the resource |
| schema | a list object containing items in schema |

schema

| name | value |
|------|-------|
| fields | a list object as long as the number of fields, each containing the items in fields |
| primaryKey | a field or set of fields that uniquely identifies each row |
| foreignKey | a field or set of fields that connect to a separate table |

fields

| name | value |
|------|-------|
| name | machine-friendly name of field/column; must be identical to name of column in data CSV file |
| title | human-friendly name of field/column |
| description | any additional notes about the field/column |
| type | Frictionless type of the field/column (e.g., string, number, boolean) |
| constraints | Frictionless constraints, including enum, an array of possible values or factor levels |

All fields in the CSV data file MUST be described in the metadata and vice-versa.
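The descriptor rules in this section can be expressed as a small check on metadata that has already been loaded into a Python dictionary (parsing the YAML file itself needs a third-party library and is not shown). This is a sketch with hypothetical names; it assumes fields is a mapping keyed by field name, as in the example in this article.

```python
import re

ALLOWED_PROPERTIES = {"profile", "name", "path", "version", "title",
                      "homepage", "description", "schema"}
ALLOWED_SCHEMA = {"fields", "primaryKey", "foreignKey"}
ALLOWED_FIELD = {"name", "title", "description", "type", "constraints"}

# Lower case alphanumeric characters, _, -, and .
NAME_PATTERN = re.compile(r"[a-z0-9_.-]+")


def check_metadata(metadata, csv_header):
    """Return a list of problems found in a metadata dict."""
    problems = []
    for required in ("name", "path"):
        if required not in metadata:
            problems.append(f"missing required property: {required}")

    name = metadata.get("name")
    if name is not None and not NAME_PATTERN.fullmatch(str(name)):
        problems.append(f"invalid name: {name!r}")

    for key in sorted(set(metadata) - ALLOWED_PROPERTIES):
        problems.append(f"unknown property: {key}")

    schema = metadata.get("schema", {})
    for key in sorted(set(schema) - ALLOWED_SCHEMA):
        problems.append(f"unknown schema descriptor: {key}")

    fields = schema.get("fields", {})
    for field_name, descriptor in fields.items():
        for key in sorted(set(descriptor) - ALLOWED_FIELD):
            problems.append(f"unknown descriptor in field {field_name}: {key}")

    # Every CSV column must be described, and every described field must exist.
    for column in csv_header:
        if column not in fields:
            problems.append(f"column not described in metadata: {column}")
    for field_name in fields:
        if field_name not in csv_header:
            problems.append(f"field not present in CSV: {field_name}")
    return problems
```

An empty list means the metadata uses only the listed descriptors and agrees with the CSV header.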

See Curating metadata for tabular data in R using attributes for details on how to save the tabular-data-resource to disk.

Example

An example CoDEC tdr looks like:

name: tract_poverty
path: tract_poverty.csv
title: Fraction of Census Tract Households in Poverty
version: 1.2.1
description: measures derived from the 5-year American Community Survey
schema:
  fields:
    census_tract_id_2020:
      name: census_tract_id_2020
      title: Census Tract
      description: 2020 vintage census tract identifier
      type: string
    year:
      name: year
      title: Year
      type: integer
    fraction_poverty:
      name: fraction_poverty
      title: Fraction of Households in Poverty
      type: number

The CSV data file for this example CoDEC tdr would contain values for all vintage 2020 census tracts in Hamilton County, but only the first and last five are shown here:

year, census_tract_id_2020, fraction_poverty
2020, 39061021508, 0.057
2020, 39061021421, 0.031
2020, 39061023300, 0.030
2020, 39061002000, 0.098
2020, 39061002500, 0.442
 ...      ...       ... 
2020, 39061021604, 0.259
2020, 39061024700, 0.062
2020, 39061026102, 0.154
2020, 39061023501, 0.046
2020, 39061009800, 0.391

CoDEC specifications are versioned with {CoDEC}; this article describes version 0.7.2.