Standardising character strings

Previous steps

If you would like to return to information from the previous section, please click here.

Context

When formatting data and combining tables, the format of character strings is often inconsistent. Common examples include a mixture of spaces " " and underscores _ in site or taxonomic names, and a mixture of upper case and lower case characters.

The stringr package provides a suite of functions for manipulating character strings, handling whitespace, and pattern matching.
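As a minimal sketch of the kinds of tools meant here, the example below applies three stringr functions to a small made-up vector. The object `sites_raw` and its values are hypothetical and are not taken from the Costa Rica data.

```r
library(stringr)

# hypothetical site names with inconsistent case, whitespace and separators
sites_raw <- c("  Isla Ballena", "isla_ballena ", "ISLA  BALLENA")

# trim leading/trailing whitespace and collapse repeated inner whitespace
sites_clean <- str_squish(sites_raw)
# use one case throughout
sites_clean <- str_to_lower(sites_clean)
# use one separator throughout
sites_clean <- str_replace_all(sites_clean, " ", "_")

unique(sites_clean)
# [1] "isla_ballena"
```

Three differently typed entries collapse to a single value once whitespace, case and separators are standardised.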

Standardising character strings

Taking our example from the benthic cover data from Costa Rica, we can see that the taxa columns Code, Category and Grouping contain a mixture of upper case, lower case, title case (i.e. each word starting with a capital letter), and sentence case (i.e. only the first word starting with a capital letter):

  # get list of categories
    percent_cover_acosa %>%
      dplyr::select(Code,
                    Category,
                    Grouping) %>%
      distinct()
# # A tibble: 81 x 3
   # Code  Category               Grouping
   # <chr> <chr>                  <chr>
 # 1 ARENA Arena                  arena
 # 2 TURF  Turf                   turf
 # 3 Esp   Esponja                esponja
 # 4 Acc   Alga Calcarea Costrosa Alga calcarea costrosa
 # 5 Hal   Halimeda               macroalga
 # 6 Brio  Briozoo                otro
 # 7 Hid   Hidrozoo               otro
 # 8 lep   Leptogorgia            otro
 # 9 Amp   Amphiroa               macroalga
# 10 gel   Gelidial               turf
# # … with 71 more rows

To set the Code column to title case:

  # set code to title
    percent_cover_acosa %>%
      mutate(Code = Code %>% str_to_title()) %>%
      dplyr::select(Code,
                    Category,
                    Grouping) %>%
      distinct()
# # A tibble: 50 x 3
   # Code  Category               Grouping
   # <chr> <chr>                  <chr>
 # 1 Arena Arena                  arena
 # 2 Turf  Turf                   turf
 # 3 Esp   Esponja                esponja
 # 4 Acc   Alga Calcarea Costrosa Alga calcarea costrosa
 # 5 Hal   Halimeda               macroalga
 # 6 Brio  Briozoo                otro
 # 7 Hid   Hidrozoo               otro
 # 8 Lep   Leptogorgia            otro
 # 9 Amp   Amphiroa               macroalga
# 10 Gel   Gelidial               turf

A single call to str_to_title() standardised the upper case codes (ARENA, TURF) as well as the lower case codes (lep, gel).
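Standardising the case also merges codes that differed only in capitalisation, which is why distinct() returns fewer rows in the output above. A minimal sketch of that effect, using a hypothetical tibble `codes` rather than the real data:

```r
library(dplyr)
library(stringr)

# hypothetical codes: the same taxon entered with different capitalisation
codes <- tibble(Code = c("ARENA", "Arena", "lep", "Lep", "LEP", "Hal"))

# number of distinct codes before standardising
n_before <- codes %>%
  distinct() %>%
  nrow()

# number of distinct codes after converting to title case
n_after <- codes %>%
  mutate(Code = Code %>% str_to_title()) %>%
  distinct() %>%
  nrow()

c(before = n_before, after = n_after)
# before  after
#      6      3
```

Comparing the two counts is a quick check that the standardisation merged the variants you expected and nothing else.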

To set the Category column to sentence case, matching the format used in the Grouping column:

  # set category to match grouping
    percent_cover_acosa %>%
      mutate(Category = Category %>% str_to_sentence()) %>%
      dplyr::select(Code,
                    Category,
                    Grouping) %>%
      distinct()
# # A tibble: 81 x 3
   # Code  Category               Grouping
   # <chr> <chr>                  <chr>
 # 1 ARENA Arena                  arena
 # 2 TURF  Turf                   turf
 # 3 Esp   Esponja                esponja
 # 4 Acc   Alga calcarea costrosa Alga calcarea costrosa
 # 5 Hal   Halimeda               macroalga
 # 6 Brio  Briozoo                otro
 # 7 Hid   Hidrozoo               otro
 # 8 lep   Leptogorgia            otro
 # 9 Amp   Amphiroa               macroalga
# 10 gel   Gelidial               turf
# # … with 71 more rows
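For comparison, the sketch below applies the four stringr case functions to one Category value from the table above, so the difference between title case and sentence case is easy to see. The object name `category` is only for illustration.

```r
library(stringr)

# one Category value from the table above
category <- "Alga Calcarea Costrosa"

str_to_title(category)      # every word capitalised
# [1] "Alga Calcarea Costrosa"
str_to_sentence(category)   # only the first word capitalised
# [1] "Alga calcarea costrosa"
str_to_upper(category)
# [1] "ALGA CALCAREA COSTROSA"
str_to_lower(category)
# [1] "alga calcarea costrosa"
```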

To make replacements within a character string, for example to match a site concordance table that uses lower case and underscores _ instead of spaces:

  # get list of sites
    percent_cover_acosa$Site %>% unique() %>% sort()
 # [1] "Bajo Mauren"       "Burbujas"          "Cueva del Tiburón" "El Arbolito"
 # [5] "El Jardín"         "Isla Ballena"      "Islotes"           "La Catarata"
 # [9] "La Viuda"          "Mogos"             "Nicuesa"           "Punta Gallardo"
# [13] "San Jocesito"      "San Pedrillo"      "Sándalo 1"         "Sándalo 2"
# [17] "Tómbolo noreste 1" "Tómbolo noreste 2" "Tómbolo sur"       "Tres Hermanas"

  # set to lower with "_" (str_replace_all() replaces every space, not only the first)
    percent_cover_acosa %>%
      mutate(Site = Site %>% str_replace_all(" ", "_"),
             Site = Site %>% str_to_lower()) %>%
      pull(Site) %>% unique() %>% sort()
 # [1] "bajo_mauren"       "burbujas"          "cueva_del_tiburón" "el_arbolito"
 # [5] "el_jardín"         "isla_ballena"      "islotes"           "la_catarata"
 # [9] "la_viuda"          "mogos"             "nicuesa"           "punta_gallardo"
# [13] "san_jocesito"      "san_pedrillo"      "sándalo_1"         "sándalo_2"
# [17] "tómbolo_noreste_1" "tómbolo_noreste_2" "tómbolo_sur"       "tres_hermanas"
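Note that str_replace() changes only the first match in each string, whereas str_replace_all() changes every match. This matters for names with more than one space. A minimal sketch using one site name from the list above (the object names are only for illustration):

```r
library(stringr)

# a site name containing two spaces
site <- "Cueva del Tiburón"

# replaces the first space only
first_only <- str_replace(site, " ", "_")
first_only
# [1] "Cueva_del Tiburón"

# replaces every space
all_spaces <- str_replace_all(site, " ", "_")
all_spaces
# [1] "Cueva_del_Tiburón"
```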

Next Steps

There are many different ways to use the stringr package to standardise and format data. Users are encouraged to review Hadley Wickham’s R for Data Science (r4ds) and the package vignettes to become familiar with handling special characters, groupings, et cetera.
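As one small illustration of why special characters need care: stringr patterns are interpreted as regular expressions by default, so a character such as "." has a special meaning unless it is escaped or wrapped in fixed(). The label `taxon` below is hypothetical.

```r
library(stringr)

# hypothetical taxon label containing a full stop
taxon <- "sp. 1"

# "." is a regex wildcard, so every character is replaced
wildcard <- str_replace_all(taxon, ".", "_")
wildcard
# [1] "_____"

# fixed() matches the literal character
literal <- str_replace_all(taxon, fixed("."), "_")
literal
# [1] "sp_ 1"

# escaping the full stop gives the same result
escaped <- str_replace_all(taxon, "\\.", "_")
escaped
# [1] "sp_ 1"
```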

Once you are comfortable manipulating character strings and your data are clean, the next step is to learn how to store them as intermediate_data.

CORDIO East Africa & GCRMN Data Standards Working Group

11 August 2023
