Curating metadata for tabular data in R using attributes
Source: vignettes/curating-metadata.Rmd
Inside R, metadata lives in the attributes of the data.frame and its columns. We can add and change these with several helper functions used in the example below: add_attrs(), add_col_attrs(), and add_type_attrs(). Setting attributes with these functions means the metadata is created reproducibly, and changes to it are tracked alongside the R script that creates the data. This prevents a disconnect between data and metadata, and it also makes it possible to compute on the metadata to create richer documentation.
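The examples on this page operate on an object named d that is not defined here. A minimal sketch of a suitable data frame follows, assuming column names and classes that match the schema table shown later on this page; the values themselves are invented purely for illustration.

```r
# Hypothetical example data: column names and classes follow the schema
# table on this page, but every value is made up for illustration.
d <- data.frame(
  id = c("A01", "A02", "A03"),
  date = as.Date(c("2022-01-01", "2022-01-02", "2022-01-03")),
  measure = c(12.8, 13.9, 15.6),
  rating = factor(
    c("good", "best", "better"),
    levels = c("good", "better", "best")
  ),
  ranking = c(14L, 17L, 19L),
  impt = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Inspect the first class of each column
col_classes <- vapply(d, function(x) class(x)[1], character(1))
col_classes
```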
Adding metadata properties
When creating a tabular dataset in R, data-specific metadata (i.e., “properties”) can be stored in the attributes of the R object (e.g., a data.frame or tibble).
d <- d |>
  add_attrs(
    name = "mydata",
    title = "My Data",
    version = "0.1.0",
    homepage = "https://geomarker.io/codec"
  )
Note that this doesn’t change any of the data values. In R, an object’s attributes are stored with it as a list. Some attributes (see ?attributes) are treated specially by R (e.g., class, names, row.names, comment) and usually shouldn’t be modified. Although all attributes (including the ones we added above) are available as a list (?attributes), we can use glimpse_attr() to extract only the attributes that represent metadata descriptors as a tibble.
glimpse_attr(d) |>
  knitr::kable()
name | value |
---|---|
name | mydata |
version | 0.1.0 |
title | My Data |
homepage | https://geomarker.io/codec |
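To make the role of attributes concrete, here is a minimal base R sketch of the same idea. It is not how add_attrs() or glimpse_attr() are implemented; it only uses attr() and attributes() on a small stand-in object x (hypothetical values) to show that custom attributes sit next to the special ones and leave the data untouched.

```r
# A small stand-in object (hypothetical values)
x <- data.frame(id = c("a", "b"), measure = c(1.5, 2.5), stringsAsFactors = FALSE)
before <- x

# Attach custom attributes with base R
attr(x, "title") <- "My Data"
attr(x, "version") <- "0.1.0"

# The data values are unchanged; only the attribute list grew
values_unchanged <- identical(x$id, before$id) &&
  identical(x$measure, before$measure)
values_unchanged

# All attributes, including those R treats specially
names(attributes(x))

# Keep only the custom ones by dropping the attributes listed as special
# in ?attributes
special <- c("names", "row.names", "class", "comment", "dim", "dimnames", "tsp")
custom <- attributes(x)[setdiff(names(attributes(x)), special)]
str(custom)
```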
Adding column-specific metadata properties
Similarly, we can add column-specific attributes (i.e., “schema”). These metadata functions follow tidy design principles, so metadata can be added expressively and concisely using pipes:
d <-
  d |>
  add_col_attrs(id, title = "Identifier", description = "unique identifier") |>
  add_col_attrs(date, title = "Date", description = "date of observation") |>
  add_col_attrs(measure, title = "Measure", description = "measured quantity") |>
  add_col_attrs(rating, title = "Rating", description = "ordered ranking of observation") |>
  add_col_attrs(ranking, title = "Ranking", description = "rank of the observation") |>
  add_col_attrs(impt, title = "Important", description = "true if this observation is important")
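Column-level metadata is ordinary R attributes attached to the individual column vectors. The base R sketch below illustrates that mechanism on a stand-in object x (hypothetical values); it is an assumption about the general approach, not the implementation of add_col_attrs().

```r
# Stand-in data (hypothetical values)
x <- data.frame(id = c("a", "b"), measure = c(1.5, 2.5), stringsAsFactors = FALSE)

# Attributes can be attached to a single column
attr(x$id, "title") <- "Identifier"
attr(x$id, "description") <- "unique identifier"
attr(x$measure, "title") <- "Measure"

# Collect one attribute across all columns; columns without it yield NA
col_titles <- vapply(
  x,
  function(col) {
    title <- attr(col, "title", exact = TRUE)
    if (is.null(title)) NA_character_ else title
  },
  character(1)
)
col_titles
```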
Adding column-specific metadata properties based on R classes
Automatically add name, type, and enum schema to each column in the data based on its class:
d <- add_type_attrs(d)
As with descriptors, a helper function, glimpse_schema(), retrieves the schema as a tibble:
options(knitr.kable.NA = "")
glimpse_schema(d) |>
  knitr::kable()
name | title | description | type | constraints |
---|---|---|---|---|
id | Identifier | unique identifier | string | |
date | Date | date of observation | date | |
measure | Measure | measured quantity | number | |
rating | Rating | ordered ranking of observation | string | good, better, best |
ranking | Ranking | rank of the observation | integer | |
impt | Important | true if this observation is important | boolean | |
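The table above suggests a mapping from R classes to schema types (character and factor to string, Date to date, numeric to number, integer to integer, logical to boolean), with factor levels listed as the allowed values. The sketch below spells out that mapping with two hypothetical helpers, infer_type() and infer_enum(); it is inferred from the table only and is not the code behind add_type_attrs().

```r
# Hypothetical helper: map a column's class to a schema type name.
# Factors and Dates are checked first because they are stored as
# integer and double vectors underneath.
infer_type <- function(x) {
  if (is.factor(x)) return("string")
  if (inherits(x, "Date")) return("date")
  if (is.logical(x)) return("boolean")
  if (is.integer(x)) return("integer")
  if (is.numeric(x)) return("number")
  if (is.character(x)) return("string")
  NA_character_
}

# Hypothetical helper: allowed values, defined only for factors
infer_enum <- function(x) if (is.factor(x)) levels(x) else NULL

# Stand-in data (hypothetical values)
x <- data.frame(
  id = c("a", "b"),
  date = as.Date(c("2022-01-01", "2022-01-02")),
  measure = c(1.5, 2.5),
  rating = factor(c("good", "best"), levels = c("good", "better", "best")),
  ranking = c(1L, 2L),
  impt = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

types <- vapply(x, infer_type, character(1))
enums <- lapply(x, infer_enum)
types
enums$rating
```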
See Reading and Writing Tabular Data Resources for details on how to save the tabular-data-resource to disk.