To sharpen skills for Data Formatting & Standardisation, the Homework for this module includes additional exercises for importing data, linking tables, setting data in “long” format and ordering factors.
Once data are clean and in a standard format, data objects are saved in the data_intermediate folder as an *.rda file.
Course participants should be able to identify the location of the data (i.e. data_locale and data_file) and use read_csv() to import the data:
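## -- import csv data -- ##
# load packages (if not already loaded in the session)
library(tidyverse)
library(magrittr)  # provides the %<>% assignment pipe used below
# point to data locale
data_locale <- "data_raw/examples/formatting/"
# point to data file
data_file <- "kenya.csv"
# import data
percent_cover_kenya <-
  paste0(data_locale, data_file) %>%
  read_csv()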
# # A tibble: 1,221 x 27
# X1 `#` DatasetID `data source/ow… Country Year `Date (YYYY-MM-… Sector Site
# <dbl> <lgl> <lgl> <lgl> <chr> <dbl> <lgl> <lgl> <lgl>
# 1 1 NA NA NA Kenya 2008 NA NA NA
# 2 2 NA NA NA Kenya 2008 NA NA NA
# 3 3 NA NA NA Kenya 2008 NA NA NA
# 4 4 NA NA NA Kenya 2008 NA NA NA
# 5 5 NA NA NA Kenya 2008 NA NA NA
# 6 6 NA NA NA Kenya 2008 NA NA NA
# 7 7 NA NA NA Kenya 2008 NA NA NA
# 8 8 NA NA NA Kenya 2008 NA NA NA
# 9 9 NA NA NA Kenya 2008 NA NA NA
# 10 10 NA NA NA Kenya 2008 NA NA NA
# # … with 1,211 more rows, and 18 more variables: Station <chr>, Zone <lgl>,
# # Depth (m) <lgl>, Latitude <lgl>, Longitude <lgl>, Transect length (m) <lgl>,
# # Distance (cm) <dbl>, Method <lgl>, Observer <lgl>, Benthic category <lgl>,
# # Benthic code <chr>, 1 <dbl>, 2 <dbl>, 3 <dbl>, 4 <dbl>, 5 <dbl>, 6 <dbl>,
# # 7 <dbl>
# Warning message:
# Missing column names filled in: 'X1' [1]
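The warning above arises because the first column of the file has an empty header, so read_csv() fills in a name for it. A minimal sketch of this behaviour, assuming a small hypothetical file (toy_kenya.csv, written to a temporary folder) rather than the course data; the filled-in name depends on the readr version, so the column is dropped by position:

```r
library(readr)

# hypothetical toy file standing in for kenya.csv (not the course data)
data_locale <- tempdir()
data_file   <- "toy_kenya.csv"

# the first header cell is empty, as in a file exported with row numbers
writeLines(c(",Country,Year",
             "1,Kenya,2008",
             "2,Kenya,2008"),
           file.path(data_locale, data_file))

# import data; col_types = cols() keeps the column-type guessing quiet
toy <- read_csv(file.path(data_locale, data_file), col_types = cols())

# drop the auto-named first column by position
toy <- toy[, -1]
names(toy)
```

Here file.path() is used instead of paste0() so that the path separator is added automatically; either approach works when data_locale already ends in a slash.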
Note that a number of columns contain only NA values. To quickly exclude these columns, one can use:
## -- using dplyr::select_if() function -- ##
# exclude empty columns
percent_cover_kenya %>% dplyr::select_if(~any(!is.na(.)))
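In recent dplyr releases (1.0.0 and later), select_if() is superseded by select() combined with where(). The following is a minimal sketch of the same idea on a hypothetical toy tibble, not the course data:

```r
library(dplyr)

# hypothetical toy data: `Site` is entirely NA, `Year` is only partly NA
toy <- tibble(
  Country = c("Kenya", "Kenya"),
  Site    = c(NA, NA),
  Year    = c(2008, NA)
)

# keep every column that holds at least one non-NA value
toy_kept <- toy %>%
  select(where(~ any(!is.na(.x))))

names(toy_kept)
```

Only the column that is entirely NA is dropped; a column with some missing values, such as Year here, is kept.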
Because select_if() chooses columns automatically (keeping any column with at least one non-NA value), the result is not documented in the script. Another strategy is to select individual columns by name using dplyr::select(). This provides clear documentation of which columns were selected:
# select the columns of interest
percent_cover_kenya %<>%
  dplyr::select(Country,
                Year,
                Station,
                `Distance (cm)`,
                # `Transect length (m)`,
                `Benthic code`,
                `1`,
                `2`,
                `3`,
                `4`,
                `5`,
                `6`,
                `7`)
# # A tibble: 1,221 x 13
# Country Year Station `Distance (cm)` `Transect lengt… `Benthic code` `1` `2`
# <chr> <dbl> <chr> <dbl> <lgl> <chr> <dbl> <dbl>
# 1 Kenya 2008 Coral G… 32 NA AT 0 32
# 2 Kenya 2008 Coral G… 52 NA FA 0 0
# 3 Kenya 2008 Coral G… 60 NA HC 60 0
# 4 Kenya 2008 Coral G… 96 NA AT 0 0
# 5 Kenya 2008 Coral G… 100 NA FA 0 68
# 6 Kenya 2008 Coral G… 107 NA FA 0 0
# 7 Kenya 2008 Coral G… 154 NA AT 0 0
# 8 Kenya 2008 Coral G… 170 NA FA 0 0
# 9 Kenya 2008 Coral G… 170 NA HC 0 70
# 10 Kenya 2008 Coral G… 178 NA HC 0 0
# # … with 1,211 more rows, and 5 more variables: 3 <dbl>, 4 <dbl>, 5 <dbl>, 6 <dbl>,
# # 7 <dbl>
As the data were imported in a “wide” format with quadrat numbers in the columns, we will set them into a “long” format using gather() to facilitate the summary and standardisation of these data:
## -- set to long -- ##
# stack the quadrats
percent_cover_kenya %<>%
  gather(Quadrate, Value,
         -Country,
         -Year,
         -Station,
         -`Distance (cm)`,
         # -`Transect length (m)`,
         -`Benthic code`)
# # A tibble: 9,768 x 7
# Country Year Station `Transect length (m)` `Benthic code` Quadrate Value
# <chr> <dbl> <chr> <lgl> <chr> <chr> <dbl>
# 1 Kenya 2008 Coral Garden NA AT Distance (cm) 32
# 2 Kenya 2008 Coral Garden NA FA Distance (cm) 52
# 3 Kenya 2008 Coral Garden NA HC Distance (cm) 60
# 4 Kenya 2008 Coral Garden NA AT Distance (cm) 96
# 5 Kenya 2008 Coral Garden NA FA Distance (cm) 100
# 6 Kenya 2008 Coral Garden NA FA Distance (cm) 107
# 7 Kenya 2008 Coral Garden NA AT Distance (cm) 154
# 8 Kenya 2008 Coral Garden NA FA Distance (cm) 170
# 9 Kenya 2008 Coral Garden NA HC Distance (cm) 170
# 10 Kenya 2008 Coral Garden NA HC Distance (cm) 178
# # … with 9,758 more rows
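In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A minimal sketch of the same wide-to-long step, assuming a hypothetical toy tibble with two quadrat columns rather than the course data:

```r
library(dplyr)
library(tidyr)

# hypothetical toy data in "wide" format: one column per quadrat
toy_wide <- tibble(
  Station        = c("A", "A"),
  `Benthic code` = c("HC", "FA"),
  `1`            = c(60, 0),
  `2`            = c(0, 68)
)

# stack the quadrat columns into Quadrate / Value pairs
toy_long <- toy_wide %>%
  pivot_longer(cols      = c(`1`, `2`),
               names_to  = "Quadrate",
               values_to = "Value")

toy_long
```

The result has one row per combination of original row and quadrat. The row order differs from that of gather(), which stacks one whole column after another, but the content is the same.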
To standardise the benthic taxa codes, the forcats package provides a way to recode groupings:
## -- next is to create a `Grouping` column to classify the `Benthic Taxa` -- ##
# set grouping using `Benthic code`
percent_cover_kenya %<>%
  mutate(Grouping = `Benthic code` %>% factor() %>%
           fct_recode(`Non-living`             = "RB",
                      `Live coral`             = "HC",
                      Macroalgae               = "AT",
                      `Non-living`             = "S",
                      `Sessile invertebrates`  = "SG",
                      Macroalgae               = "FA",
                      `Sessile invertebrates`  = "SC",
                      Macroalgae               = "HAL",
                      `Live coral`             = "FAVIA",
                      `Crustose algae`         = "CA",
                      `Live coral`             = "PLATY",
                      `Live coral`             = "POM",
                      `Live coral`             = "FAVITES",
                      `Sessile invertebrates`  = "GAL",
                      `Live coral`             = "ACRO",
                      `Live coral`             = "MONTI",
                      `Sessile invertebrates`  = "SP",
                      `Sessile invertebrates`  = "ZO",
                      `Bleached or dead coral` = "POB",
                      `Live coral`             = "ST",
                      `Sessile invertebrates`  = "SPO",
                      `Sessile invertebrates`  = "HYD",
                      `Sessile invertebrates`  = "SV"))
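When many codes map to the same group, forcats::fct_collapse() offers a more compact way to write the same recoding, with one line per group. A minimal sketch on a hypothetical vector of codes (the code "XX" is invented here to show what happens to a code that is not listed):

```r
library(forcats)

# hypothetical codes; "XX" is not assigned to any group below
codes <- factor(c("HC", "ACRO", "AT", "FA", "S", "XX"))

# one entry per group, each listing the codes it absorbs
grouping <- fct_collapse(codes,
                         `Live coral` = c("HC", "ACRO"),
                         Macroalgae   = c("AT", "FA"),
                         `Non-living` = "S")

# codes that were not listed keep their original level
levels(grouping)
```

Inspecting the levels after recoding is a quick way to spot codes that were missed: any raw code still present among the group names has not been classified.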
These classifications can be linked with the WIO regional benthic categories:
# link to wio regional taxa
percent_cover_kenya %<>%
  left_join(wio_benthic_taxa %>%
              rename(`Benthic code` = benthic_code))
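Before relying on the joined table, it can help to name the join key explicitly and to list any codes that found no match. A minimal sketch, assuming hypothetical toy tables (toy_cover, toy_taxa) and a hypothetical benthic_name column; the real wio_benthic_taxa object may have a different structure:

```r
library(dplyr)

# hypothetical toy data standing in for the percent cover and taxa tables
toy_cover <- tibble(`Benthic code` = c("HC", "FA", "XX"),
                    Value          = c(60, 68, 5))
toy_taxa  <- tibble(benthic_code = c("HC", "FA"),
                    benthic_name = c("name for HC", "name for FA"))

# join with an explicit key instead of renaming first
toy_joined <- toy_cover %>%
  left_join(toy_taxa, by = c("Benthic code" = "benthic_code"))

# list the rows whose code has no match in the taxa table
unmatched <- toy_cover %>%
  anti_join(toy_taxa, by = c("Benthic code" = "benthic_code"))

unmatched
```

left_join() keeps every row of the percent cover data and fills the taxa columns with NA where no match exists, so the anti_join() check shows which codes still need attention.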
After checking the percent cover data and ensuring that the benthic codes match up, one should save the data object to the data_intermediate folder as an *.rda file.
This is done by pointing to the save location (which could be in your personal participants_code folder):
# point to save locale
save_locale <- "data_intermediate/examples/formatting/"
# save percent cover data
save(percent_cover_kenya,
     file = paste0(save_locale, "percent_cover_kenya.rda"))
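An *.rda file restores its objects under their original names when read back with load(). A minimal sketch of the round trip, assuming a hypothetical toy object saved to a temporary folder rather than to data_intermediate:

```r
# hypothetical toy object standing in for percent_cover_kenya
toy_object <- data.frame(Country = "Kenya", Year = 2008)

# point to a temporary save locale (an assumption for this sketch)
save_locale <- tempdir()
save_file   <- file.path(save_locale, "toy_object.rda")

# save, remove from the session, then restore
save(toy_object, file = save_file)
rm(toy_object)
load(save_file)

# the object is back under its original name
exists("toy_object")
```

Because load() recreates the object with the name it was saved under, no assignment is needed when reading the file in a later module.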
In future modules, we will see the utility of separating Data Formatting & Standardisation from the Visualisation, Mapping and Reporting modules.
The next challenge is visualising data.