Using R to Build a Community Data Explorer for Cincinnati (CoDEC)

CCHMC R Users Group

Cole Brokamp, Erika Manning, Andrew Vancil

5/10/23

👋 Welcome

Join the RUG Outlook group for updates and events.

📣 BUG & RUG present:

Bilingual Data Science Meeting

July 13, 2023

In-person at S1.203

About BUG: The Biomedical Informatics Users Group (BUG) is a community of bioinformatics researchers and data scientists looking to share knowledge and insights and build community. We organize a series of user-led, informal talks and discussions by and for data researchers at CCHMC and UC. Please contact Krishna Roskin at krishna.roskin@cchmc.org for more details or to sign up to present at a future BUG meeting.

Using R to Build a Community Data Explorer for Cincinnati (CoDEC)

  1. Introduction to CoDEC
  2. Sharing CoDEC Data
  3. Exploring CoDEC Data

Background

The White House’s Equitable Data Working Group:

  • Equitable data are “those that allow for rigorous assessment of the extent to which government programs and policies yield consistently fair, just, and impartial treatment of all individuals.”
  • Equitable data should “illuminate opportunities for targeted actions that will result in demonstrably improved outcomes for underserved communities.”
  • Make disaggregated data the norm while being “… intentional about when data are collected and shared, as well as how data are protected so as not to exacerbate the vulnerability of members of underserved communities, many of whom face the heightened risk of harm if their privacy is not protected.”

Disaggregation

  • Open data can fall short of driving action if it is not equitable.

  • Disaggregating data by sensitive attributes, like race and ethnicity, can elucidate inequities that would otherwise remain hidden.

Open data is necessary and not sufficient to drive the type of action that we need to create a more equitable society.

- The U.S. Chief Data Scientist, Denice Ross

Privacy

  • Data are people
  • Privacy is a spectrum of the tradeoffs between risks and benefits to individuals and populations
  • Data collected at the individual level by one organization often cannot be shared with another organization due to legal restrictions or organization-specific data governance policies
  • Community-level (e.g. neighborhood, census tract, ZIP code) data disaggregated by gender, race, or other sensitive attributes
  • Achieving data harmonization upstream of storage allows for contribution of disaggregated, community-level data without disclosing individual-level data when sharing across organizations

The TRUST principles for digital repositories

Creating and maintaining an open community-level data resource equips the entire community for data-powered decision making and boosts organizational trustworthiness. Demonstrating reliability and capability of appropriately managing shared data helps earn the trust of organizations and communities intended to be served:

  • 🤲 transparent: make specific repository services and data holdings verifiable by publicly accessible evidence
  • 📃 responsible: ensure authenticity and integrity of data holdings
  • 👥 user-focused: meet data management norms and expectations of target user communities
  • ⏳ sustainable: preserve services and data holdings for the long term
  • ⚙️ technological: provide infrastructure and capabilities supporting secure, persistent, and reliable services

FAIR

  • 🔎 findable: use a unique and persistent identifier, add rich metadata (using existing standards)
  • 🔓 accessible: store in a data repository (⚠️ personal/classified information, but metadata still accessible)
  • ⚙️ interoperable: use an open file format with controlled vocabularies, reference relevant datasets
  • ♻️ reusable: well documented, including a description (README with data sources, background, and how to reproduce the data), a data dictionary (field descriptions, units, titles, missingness), and usage licenses (for code or for data/presentations/papers)

Community Data Explorer for Cincinnati (CoDEC)

A data repository composed of equitable, community-level data for Cincinnati.

CoDEC Aims

  1. Define a common data specification for community-level data that considers FAIR, TRUST, privacy, and equitable disaggregation
  2. Create and disseminate methods and tools for harmonizing and sharing community-level data, including spatiotemporal interpolation, data validation, API for accessing data at scale and on demand
  3. Serve a portable interactive data catalog derived on demand from metadata and an open API to link to or include with other data catalogs (e.g., C2D2)
  4. Create an interactive web application to explore community-level distributions across Cincinnati and explore simple relationships between community-level measures

CoDEC Overview

%%{init: { "fontFamily": "arial" } }%%

flowchart TD

classDef I fill:#E49865,stroke:#333,stroke-width:0px;
classDef II fill:#EACEC5,stroke:#333,stroke-width:0px;
classDef III fill:#CBD6D5,stroke:#333,stroke-width:0px;
classDef IIII fill:#8CB4C3,stroke:#333,stroke-width:0px;
classDef V fill:#396175,color:#F6EAD8,stroke:#333,stroke-width:0px;

subgraph source-box [data sources]
    org(community \norganization):::I
    jfs(government \n organization):::I
    cchmc("healthcare \n organization"):::I
    acs("built, natural, and \n social environment"):::I
end
class source-box II

stage(collection of community-\nlevel data):::I

org --> |"data \n support"| stage
jfs --> |decentralized \n geocoding| stage
cchmc --> |spatiotemporal \n aggregation| stage
acs --> |automatic \n interpolation| stage
stage --> codec-box

subgraph codec-box ["Community Data Explorer for Cincinnati (CoDEC)"]
    ingest("(meta)data harmonization"):::IIII
    data(community-level \n tabular data resource):::IIII
    data-catalog("interactive data catalog\n geomarker.io/codec"):::IIII
    ingest --> data
    data --> data-catalog
    data --> api(data API):::IIII
    api --> bindings(R code \n for accessing data):::IIII
    data-catalog --> download(explore, map, download):::V
end

class codec-box III

bindings --> dashboard("dashboards and reports"):::V
bindings --> qr(QI & research):::V
api ---> anywhere(public access):::V

Data Harmonization

CoDEC encodes data streams about the communities in which we live into a common format (census tract and month) so that it can be decoded into different community-level geographies and different time frames.
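As a rough illustration of decoding to a different time frame, the sketch below rolls toy tract-by-month counts up to tract-by-year totals using only base R. The data frame `d_month` and its `n_events` column are invented for this example; this is not how CoDEC itself is implemented.

```r
# Toy tract-by-month data. The tract identifiers are borrowed from the example
# output in these slides, but the counts are made up for illustration.
d_month <- data.frame(
  census_tract_id_2020 = rep(c("39061000200", "39061000700"), each = 4),
  year = rep(c(2021L, 2021L, 2022L, 2022L), times = 2),
  month = rep(c(11L, 12L, 1L, 2L), times = 2),
  n_events = c(3, 5, 2, 4, 10, 0, 7, 1),
  stringsAsFactors = FALSE
)

# "Decode" the monthly series to a yearly time frame by summing within
# each tract and year.
d_year <- aggregate(
  n_events ~ census_tract_id_2020 + year,
  data = d_month,
  FUN = sum
)
d_year <- d_year[order(d_year$census_tract_id_2020, d_year$year), ]
print(d_year)
```

Summing is only appropriate for counts; a rate (such as violations per household) would instead need a weighted average with a suitable denominator.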

CoDEC Integrated Data Cores

 

 

How to Read Data in R Using CoDEC

codec::codec_data("hamilton_property_code_enforcement")
# A tibble: 226 × 3
   census_tract_id_2020 violations_per_household  year
   <chr>                                   <dbl> <int>
 1 39061000200                             0.328  2022
 2 39061000700                             0.647  2022
 3 39061000900                             2.65   2022
 4 39061001000                             1.38   2022
 5 39061001100                             2.01   2022
 6 39061001600                             5.06   2022
 7 39061001700                             5.43   2022
 8 39061001800                             3.09   2022
 9 39061001900                             1.32   2022
10 39061002000                             0.957  2022
# ℹ 216 more rows

How to Read Metadata in R Using CoDEC

codec::codec_data("hamilton_property_code_enforcement") |>
  codec::glimpse_tdr()
$attributes
# A tibble: 7 × 2
  name        value                                                             
  <chr>       <chr>                                                             
1 profile     tabular-data-resource                                             
2 name        hamilton_property_code_enforcement                                
3 path        hamilton_property_code_enforcement.csv                            
4 version     0.1.2                                                             
5 title       Hamilton County Property Code Enforcement                         
6 homepage    https://geomarker.io/hamilton_property_code_enforcement           
7 description Number of property code enforcements per household by census tract

$schema
# A tibble: 3 × 4
  name                     description                               type  title
  <chr>                    <chr>                                     <chr> <chr>
1 census_tract_id_2020     census tract identifier                   stri… <NA> 
2 violations_per_household number of property code enforcements per… numb… <NA> 
3 year                     data year                                 inte… Year 

Sharing CoDEC Data

Frictionless Standards

Developed by the Open Knowledge Foundation, the Frictionless standards are a set of patterns for describing data, including datasets (Data Package), files (Data Resource), and tables (Table Schema). A Data Package is a simple container format used to describe and package a collection of data and metadata, including schemas. The metadata live in a descriptor file (separate from the data file), usually written in JSON or YAML, whose contents depend on the Frictionless standard being used:

  • Table Schema: describes a tabular file by providing its dimension, field data types, relations, and constraints
  • Data Resource: describes an exact tabular file providing a path to the file and details like title, description, and others
  • Tabular Data Resource = Data Resource + Table Schema
  • CSV dialect: describes the formatting specific to the various dialects of CSV files
  • Data Package & Tabular Data Package: describes a collection of tabular files providing data resource information from above along with general information about the package itself, a license, authors, and other metadata
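To make the descriptor layout concrete, here is a minimal sketch of a Tabular Data Resource written as a nested R list, using the same properties that appear in the metadata output earlier in these slides. The resource itself (`example_resource`) and all of its values are hypothetical, and in practice the descriptor would be serialized to JSON or YAML rather than kept as an R list.

```r
# A hypothetical tabular-data-resource descriptor as a nested list:
# resource-level properties plus a schema holding one descriptor per field.
tdr <- list(
  profile = "tabular-data-resource",
  name = "example_resource",
  path = "example_resource.csv",
  version = "0.1.0",
  title = "Example Resource",
  description = "A made-up resource used only to show the descriptor layout",
  schema = list(
    fields = list(
      list(name = "census_tract_id_2020", type = "string",
           description = "census tract identifier"),
      list(name = "year", type = "integer", title = "Year",
           description = "data year")
    ),
    primaryKey = c("census_tract_id_2020", "year")
  )
)

# Flatten the field descriptors into a small table with one row per field;
# optional properties that are absent become NA.
field_table <- do.call(rbind, lapply(tdr$schema$fields, function(f) {
  data.frame(
    name = f$name,
    type = f$type,
    title = if (is.null(f$title)) NA_character_ else f$title,
    stringsAsFactors = FALSE
  )
}))
print(field_table)
```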

CoDEC Specifications

%%{init: { "fontFamily": "Arial" } }%%

flowchart TB

classDef I fill:#E49865,stroke:#333,stroke-width:2px;
classDef II fill:#EACEC5,stroke:#333,stroke-width:2px;
classDef III fill:#CBD6D5,stroke:#333,stroke-width:2px;
classDef IIII fill:#8CB4C3,stroke:#333,stroke-width:2px;

tdr([tabular-data-resource]):::I

name(name):::II
path(path):::II
version(version):::II   
schema([schema]):::II
title(title):::II
homepage(homepage):::II
description(description):::II

tdr --- name
tdr --- path
tdr --- version   
tdr --- title
tdr --- description
tdr --- homepage
tdr --- schema

schema --- fields([fields]):::III
schema --- primaryKey(primaryKey):::III
schema --- foreignKey(foreignKey):::III

fields --- field_name_1(field_1:\nname \n title \n description \n type):::IIII
fields --- field_name_2(field_2:\nname \n title \n type \n constraints):::IIII
fields --- field_name_3(field_3:\nname \n title \n description \n type \n constraints):::IIII

{cincy}

CoDEC relies on the {cincy} R package to define Cincinnati-area geographies and interpolate area-level data between census tracts, neighborhoods, and ZIP codes in different years.
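The idea behind interpolating between geographies can be sketched with a weighted crosswalk. The example below uses only base R and invented tracts, neighborhoods, and weights; it does not use or reproduce the {cincy} functions, which are not shown in these slides.

```r
# Hypothetical crosswalk: the share of each tract's households that falls in
# each neighborhood (weights within a tract sum to 1). All values are made up.
crosswalk <- data.frame(
  tract = c("A", "A", "B", "C"),
  neighborhood = c("North", "South", "South", "North"),
  weight = c(0.25, 0.75, 1, 1),
  stringsAsFactors = FALSE
)

# Tract-level counts to be moved to neighborhoods.
tract_data <- data.frame(
  tract = c("A", "B", "C"),
  n_households = c(400, 100, 200),
  stringsAsFactors = FALSE
)

# Split each tract's count across neighborhoods, then sum by neighborhood.
d <- merge(crosswalk, tract_data, by = "tract")
d$n_allocated <- d$n_households * d$weight
neighborhood_data <- aggregate(n_allocated ~ neighborhood, data = d, FUN = sum)
print(neighborhood_data)
```

Because the weights sum to 1 within each tract, the neighborhood totals add up to the original tract total.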

The codec R package supports CoDEC data infrastructure through:

  • Curating metadata for tabular data in R using attributes
  • Reading and writing tabular data resources
  • Tools for checking against CoDEC specifications
  • Serving core tabular data resources through the data catalog
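As a rough sketch of what curating metadata with attributes can look like, the base R example below attaches resource-level and field-level metadata to a data frame and then runs a simple check. The resource name, the naming rule, and the `check_names()` helper are all hypothetical; they are not functions or rules from the codec package.

```r
d <- data.frame(
  census_tract_id_2020 = c("39061000200", "39061000700"),
  year = c(2022L, 2022L),
  stringsAsFactors = FALSE
)

# Attach resource-level metadata to the data frame itself.
attr(d, "name") <- "example_resource"
attr(d, "title") <- "Example Resource"

# Attach field-level metadata to an individual column.
attr(d$year, "title") <- "Year"
attr(d$year, "description") <- "data year"

# Hypothetical rule for illustration: names may contain only lowercase
# letters, digits, and underscores.
is_valid_name <- function(x) grepl("^[a-z0-9_]+$", x)

# Hypothetical checker: stop with an error listing any invalid names,
# otherwise return the data invisibly.
check_names <- function(d) {
  nms <- c(attr(d, "name", exact = TRUE), names(d))
  bad <- nms[!is_valid_name(nms)]
  if (length(bad) > 0) stop("invalid names: ", paste(bad, collapse = ", "))
  invisible(d)
}

check_names(d)
```

Keeping metadata in attributes means it travels with the object in memory, though some R operations can drop attributes, so checks like this are usually run right before writing.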

Exploring CoDEC

Screenshot

Shiny

  • {shiny} is a powerful tool for developing reactive, interactive web applications

  • Here, we use Shiny to create a data explorer for CoDEC

  • With Shiny the user has the power to explore the data of their choosing and visually see the link across data displays

Leveraging data standards for Shiny

  • Rather than reading a static data source (such as a .rds file), each initialization of the app performs a fresh “pull” from CoDEC

  • As CoDEC updates, the explorer updates

d_drive <- codec::codec_data("hamilton_drivetime") 

drive_meta <- codec::glimpse_schema(d_drive)

#...

Inset panel and scatterplot

  • The core pieces of the app are a {leaflet} interactive map and an inset scatterplot made using {ggiraph} and {cowplot}

  • Inspired by and leveraging the {biscale} package, the explorer visualizes bivariate relationships using a blended color scale

    • The scatterplot background is divided into the 9 classes formed by crossing 3 levels of each variable and doubles as a color legend
    • Both axes contain histograms displaying the univariate distribution of each selected metric
    • When hovering over a point on the plot, the corresponding histogram indicators are highlighted
  • By using the function shiny::absolutePanel(), we can overlay a plotting panel on top of the map and escape the gridded layout that many Shiny apps default to
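The 9-class bivariate scheme can be sketched in base R by cutting each variable into tertiles and pasting the two results together. This is a hand-rolled stand-in written for illustration, similar in spirit to what {biscale} provides but not its implementation; the data and the `tertile()` helper are invented.

```r
# Toy community-level measures; values are invented for illustration.
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  y = c(9, 1, 5, 2, 8, 4, 3, 7, 6)
)

# Cut a numeric vector into tertiles coded 1 (low) to 3 (high).
# Assumes the tertile breakpoints are distinct (no heavy ties).
tertile <- function(v) {
  breaks <- quantile(v, probs = c(0, 1 / 3, 2 / 3, 1))
  as.integer(cut(v, breaks = breaks, include.lowest = TRUE))
}

# Combine the two tertiles into one of 9 classes,
# e.g. "1-3" means low x and high y.
d$bi_class <- paste(tertile(d$x), tertile(d$y), sep = "-")
print(table(d$bi_class))
```

Each class can then be mapped to one color of a 3 × 3 palette, which is what lets the scatterplot background double as the map legend.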

{bslib} layout

  • The explorer is built using the {bslib} package, which fully implements the latest Bootstrap 5 components
    • “fillable” pages/containers that naturally fill the user’s window
    • Modern cards and sidebars
  • Using only a couple lines of code, we can easily match the theme of the app to the CoDEC website
ui <- bslib::page_fillable(
  theme = bslib::bs_theme(version = 5,
                   "bg" = "#FFFFFF",
                   "fg" = "#396175",
                   "primary" = "#C28273",
                   "grid-gutter-width" = "0.0rem",
                   "border-radius" = "0.5rem",
                   "btn-border-radius" = "0.25rem")
  
  #...
)

Connecting data displays and metadata

  • In order to connect the selected inputs to the plotting portions, we implemented a series of reactive “crosstalk” connections
    • We take advantage of the CoDEC metadata to display user-friendly variable titles while linking to R-friendly variable names
    • CoDEC metadata is also crucial for connecting individual metrics to the data core that they are a part of
  • The explorer also takes advantage of Shiny reactivity by allowing the user to select a tract on the map and highlight it on the scatterplot, and vice versa
    • Both {leaflet} and {ggiraph} export user-selected objects that can be shared across the app
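One way to link user-friendly titles to R-friendly names is to build a named vector from the schema metadata. The sketch below uses a small stand-in for the schema table shown earlier in these slides (only its `name` and `title` columns); the fallback to the field name when a title is missing is an assumption made for this example, not necessarily what the explorer does.

```r
# Stand-in for a schema table; titles may be missing (NA).
schema <- data.frame(
  name = c("census_tract_id_2020", "violations_per_household", "year"),
  title = c(NA, NA, "Year"),
  stringsAsFactors = FALSE
)

# Use the title as the display label, falling back to the field name.
labels <- ifelse(is.na(schema$title), schema$name, schema$title)

# Named vector: names are what the user sees, values are the column names.
choices <- stats::setNames(schema$name, labels)

# Drop the identifier column, which is not a selectable metric.
choices <- choices[schema$name != "census_tract_id_2020"]
print(choices)
```

A named vector like `choices` can be passed to the `choices` argument of `shiny::selectInput()`, which displays the names and returns the corresponding values.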

Interactive Demo

Thank You

🌐 https://geomarker.io/codec

💻 github.com/geomarker-io