Skip to contents

Fuzzy match strings in x to strings in y using optimized string alignment (OSA) distance and ignoring capitalization.

Usage

fuzzy_match(x, y, osa_max_dist = 1, prefilter = c("none", "psk"))

Arguments

x

character vector to match

y

character vector to match to

osa_max_dist

maximum OSA distance to consider a match; Inf is a special case that avoids computing string distance by returning all of y instead of just the best match or matches in y.

prefilter

method used to prefilter y before computing OSA distances; "none" does nothing, and "psk" removes values in y that do not share a phonetic_street_key() with any value in x.

Value

a list of integer vectors representing the position of the best matching string(s) in y for each string in x

Details

If multiple strings in y are tied for the minimum OSA distance from a string in x, all of their indices are included in the return value.

Examples

my_names <-
  c("Pinye", "Pine", "Oalck", "Sunset", "Riverbend", "Greenfild")
the_names <-
  c("Piney", "Pine", "Oak", "Cheshire", "Greenfield", "Maple", "Elm")
matches <- fuzzy_match(my_names, the_names, osa_max_dist = 1)
matches
#> [[1]]
#> [1] 1 2
#> 
#> [[2]]
#> [1] 2
#> 
#> [[3]]
#> integer(0)
#> 
#> [[4]]
#> integer(0)
#> 
#> [[5]]
#> integer(0)
#> 
#> [[6]]
#> [1] 5
#> 

lapply(matches, \(i) the_names[i])
#> [[1]]
#> [1] "Piney" "Pine" 
#> 
#> [[2]]
#> [1] "Pine"
#> 
#> [[3]]
#> character(0)
#> 
#> [[4]]
#> character(0)
#> 
#> [[5]]
#> character(0)
#> 
#> [[6]]
#> [1] "Greenfield"
#> 

x <- as_addr(voter_addresses()[1:100])@street@name
y <- unique(nad_example_data()$nad_addr@street@name)
system.time(fuzzy_match(x, y))
#>    user  system elapsed 
#>   0.482   0.004   0.165 
# larger vectors see a speedup when using
# phonetic_street_key as a prefilter
# but may miss potential matches that are within
# osa_max_dist of each other, but did not have
# identical phonetic codes (e.g., "woolper" and "woopler")
system.time(fuzzy_match(x, y, prefilter = "psk"))
#>    user  system elapsed 
#>   0.296   0.001   0.135