When formatting data and combining tables, there can be inconsistencies in the format of character strings. These include a combination of spaces " " and underscores _ in site or taxonomic names, a mixture of upper-case and lower-case characters, et cetera.
A suite of functions in the stringr package provides very useful tools for manipulating characters, dealing with whitespace, and matching patterns.
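As a quick illustration of the whitespace and pattern matching tools, here is a minimal sketch. The `sites` vector is made up for this example and is not part of the benthic cover data.

```r
library(stringr)

# Hypothetical site names with inconsistent whitespace (illustrative only)
sites <- c("  Isla Ballena", "San  Pedrillo ", "Tres_Hermanas")

# str_trim() removes leading and trailing whitespace only
trimmed <- str_trim(sites)

# str_squish() also collapses repeated internal whitespace to a single space
squished <- str_squish(sites)

# str_detect() flags which strings contain a pattern, here an underscore
has_underscore <- str_detect(sites, "_")
```

Checks like `str_detect()` are handy for finding which values need attention before changing anything.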
Taking our example of the benthic cover data from Costa Rica, we can see that the taxa columns Code, Category and Grouping have a mixture of upper case, lower case, title case (i.e. each word starting with a capital letter), and sentence case (i.e. only the first word starting with a capital letter) formatting:
# get list of categories
percent_cover_acosa %>%
  dplyr::select(Code,
                Category,
                Grouping) %>%
  distinct()
# # A tibble: 81 x 3
#    Code  Category               Grouping
#    <chr> <chr>                  <chr>
#  1 ARENA Arena                  arena
#  2 TURF  Turf                   turf
#  3 Esp   Esponja                esponja
#  4 Acc   Alga Calcarea Costrosa Alga calcarea costrosa
#  5 Hal   Halimeda               macroalga
#  6 Brio  Briozoo                otro
#  7 Hid   Hidrozoo               otro
#  8 lep   Leptogorgia            otro
#  9 Amp   Amphiroa               macroalga
# 10 gel   Gelidial               turf
# # … with 71 more rows
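To make the difference between these case formats concrete, the sketch below applies the four stringr case functions to a single string. The mixed-case string `x` is made up for illustration, loosely based on one of the categories above.

```r
library(stringr)

# Hypothetical mixed-case label (illustrative only)
x <- "ALGA calcarea COSTROSA"

upper_case    <- str_to_upper(x)     # every letter capitalised
lower_case    <- str_to_lower(x)     # no capitals
title_case    <- str_to_title(x)     # each word starts with a capital
sentence_case <- str_to_sentence(x)  # only the first word starts with a capital
```

Here `title_case` is "Alga Calcarea Costrosa" and `sentence_case` is "Alga calcarea costrosa".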
To set the Code column to title format:
# set code to title
percent_cover_acosa %>%
  mutate(Code = Code %>% str_to_title()) %>%
  dplyr::select(Code,
                Category,
                Grouping) %>%
  distinct()
# # A tibble: 50 x 3
#    Code  Category               Grouping
#    <chr> <chr>                  <chr>
#  1 Arena Arena                  arena
#  2 Turf  Turf                   turf
#  3 Esp   Esponja                esponja
#  4 Acc   Alga Calcarea Costrosa Alga calcarea costrosa
#  5 Hal   Halimeda               macroalga
#  6 Brio  Briozoo                otro
#  7 Hid   Hidrozoo               otro
#  8 Lep   Leptogorgia            otro
#  9 Amp   Amphiroa               macroalga
# 10 Gel   Gelidial               turf
Using this function changed the upper-case ARENA and TURF codes as well as the lower-case lep and gel codes in a single line of code!
And to standardise the Category column to sentence format, to match the Grouping column:
# set category to match grouping
percent_cover_acosa %>%
  mutate(Category = Category %>% str_to_sentence()) %>%
  dplyr::select(Code,
                Category,
                Grouping) %>%
  distinct()
# # A tibble: 81 x 3
#    Code  Category               Grouping
#    <chr> <chr>                  <chr>
#  1 ARENA Arena                  arena
#  2 TURF  Turf                   turf
#  3 Esp   Esponja                esponja
#  4 Acc   Alga calcarea costrosa Alga calcarea costrosa
#  5 Hal   Halimeda               macroalga
#  6 Brio  Briozoo                otro
#  7 Hid   Hidrozoo               otro
#  8 lep   Leptogorgia            otro
#  9 Amp   Amphiroa               macroalga
# 10 gel   Gelidial               turf
# # … with 71 more rows
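If several columns need the same treatment, the conversions can be combined in one `mutate()` call. The sketch below is one possible approach using `dplyr::across()`. It runs on a small made-up table, `toy_cover`, which only mimics the column names above and is not the real data.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical two-row table mimicking the columns above (illustrative only)
toy_cover <- tibble(
  Code     = c("ARENA", "lep"),
  Category = c("Alga Calcarea Costrosa", "Leptogorgia"),
  Grouping = c("Alga calcarea costrosa", "otro")
)

toy_clean <- toy_cover %>%
  mutate(Code = str_to_title(Code),
         # apply the same function to both columns in one step
         across(c(Category, Grouping), str_to_sentence))
```

Note that sentence case capitalises the first letter of every value, so a lower-case grouping such as otro becomes Otro. Whether that is wanted depends on the format of the table you are matching against.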
To make replacements within a character string, for example to standardise site names to match a site concordance table that uses lower case and underscores _ instead of spaces:
# get list of sites
percent_cover_acosa$Site %>% unique() %>% sort()
# [1] "Bajo Mauren" "Burbujas" "Cueva del Tiburón" "El Arbolito"
# [5] "El Jardín" "Isla Ballena" "Islotes" "La Catarata"
# [9] "La Viuda" "Mogos" "Nicuesa" "Punta Gallardo"
# [13] "San Jocesito" "San Pedrillo" "Sándalo 1" "Sándalo 2"
# [17] "Tómbolo noreste 1" "Tómbolo noreste 2" "Tómbolo sur" "Tres Hermanas"
# set to lower with "_"
# str_replace_all() replaces every space; str_replace() would only replace the first one
percent_cover_acosa %>%
  mutate(Site = Site %>% str_replace_all(" ", "_"),
         Site = Site %>% str_to_lower()) %>%
  pull(Site) %>% unique() %>% sort()
# [1] "bajo_mauren" "burbujas" "cueva_del_tiburón" "el_arbolito"
# [5] "el_jardín" "isla_ballena" "islotes" "la_catarata"
# [9] "la_viuda" "mogos" "nicuesa" "punta_gallardo"
# [13] "san_jocesito" "san_pedrillo" "sándalo_1" "sándalo_2"
# [17] "tómbolo_noreste_1" "tómbolo_noreste_2" "tómbolo_sur" "tres_hermanas"
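Site names with more than one space deserve a closer look, because `str_replace()` substitutes only the first match in each string, whereas `str_replace_all()` substitutes every match. A minimal sketch using one of the site names above:

```r
library(stringr)

site <- "Cueva del Tiburón"

# only the first space is replaced
first_only <- str_replace(site, " ", "_")

# every space is replaced
all_spaces <- str_replace_all(site, " ", "_")
```

Here `first_only` still contains a space ("Cueva_del Tiburón"), while `all_spaces` does not ("Cueva_del_Tiburón").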
There are many different ways to use the stringr package to standardise and format data. Users are encouraged to review Hadley Wickham’s R for Data Science (r4ds) and the package vignettes to become more familiar with how to deal with special characters, groupings, et cetera.
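As a small taste of what those resources cover, the sketch below shows one way to match a special character literally with `fixed()` and to reuse captured groups in a replacement. The `labels` vector is made up for this example, and the patterns are only one possible approach.

```r
library(stringr)

# Hypothetical labels (illustrative only)
labels <- c("Sandalo 1", "Sandalo 2", "Tombolo sur (N.E.)")

# In a regular expression "." matches any character;
# fixed() makes it match a literal full stop only
n_full_stops <- str_count(labels, fixed("."))

# Parentheses capture groups; \\1 and \\2 reuse them in the replacement,
# so only the space before a trailing number is replaced
numbered <- str_replace(labels, "^(\\w+) (\\d+)$", "\\1_\\2")
```

Strings that do not match the pattern, such as the third label, are returned unchanged.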
Once you have sharpened your skills in manipulating character strings and have clean data, the next step is to learn how to store them as intermediate_data for the steps that follow.