Consolidation and discussion for DFaS Module

Previous steps

If you would like to review information from the previous section, please click here.

Context

To sharpen skills for Data Formatting & Standardisation, the Homework for this module includes additional exercises for importing data, linking tables, setting data in “long” format and ordering factors.

Once data are clean and in a standard format, data objects are saved in the data_intermediate folder as a *.rda file.

Importing data

Course participants should be able to identify the location of the data (i.e. data_locale and data_file) and use read_csv() to import the data:

 ## -- import csv data -- ##
  # load packages (tidyverse for read_csv() and %>%; magrittr for %<>%)
    library(tidyverse)
    library(magrittr)

  # point to data locale
    data_locale <- "data_raw/examples/formatting/"

  # point to data file
    data_file <- "kenya.csv"

  # import data
    percent_cover_kenya <-
      paste0(data_locale, data_file) %>%
      read_csv()
# # A tibble: 1,221 x 27
      # X1 `#`   DatasetID `data source/ow… Country  Year `Date (YYYY-MM-… Sector Site
   # <dbl> <lgl> <lgl>     <lgl>            <chr>   <dbl> <lgl>            <lgl>  <lgl>
 # 1     1 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 2     2 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 3     3 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 4     4 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 5     5 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 6     6 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 7     7 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 8     8 NA    NA        NA               Kenya    2008 NA               NA     NA
 # 9     9 NA    NA        NA               Kenya    2008 NA               NA     NA
# 10    10 NA    NA        NA               Kenya    2008 NA               NA     NA
# # … with 1,211 more rows, and 18 more variables: Station <chr>, Zone <lgl>,
# #   Depth (m) <lgl>, Latitude <lgl>, Longitude <lgl>, Transect length (m) <lgl>,
# #   Distance (cm) <dbl>, Method <lgl>, Observer <lgl>, Benthic category <lgl>,
# #   Benthic code <chr>, 1 <dbl>, 2 <dbl>, 3 <dbl>, 4 <dbl>, 5 <dbl>, 6 <dbl>,
# #   7 <dbl>
# Warning message:
# Missing column names filled in: 'X1' [1]
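
Several columns in the printout above were guessed as <lgl>, which is what read_csv() does when a column appears to hold only NA values. As a minimal sketch of how one might control this, the example below writes a small, made-up CSV file to a temporary folder and imports it with explicit column types. The toy file, its values and the names toy_lines and toy_cover are assumptions for illustration and are not part of the course data.

```r
library(readr)

# hypothetical toy data written to a temporary folder (not the course data)
data_locale <- paste0(tempdir(), "/")
data_file   <- "toy_cover.csv"

toy_lines <- c("Country,Year,Benthic code,1,2",
               "Kenya,2008,HC,60,0",
               "Kenya,2008,FA,0,68")
writeLines(toy_lines, paste0(data_locale, data_file))

# import, declaring column types explicitly instead of relying on guessing
toy_cover <- read_csv(paste0(data_locale, data_file),
                      col_types = cols(Country        = col_character(),
                                       Year           = col_double(),
                                       `Benthic code` = col_character(),
                                       .default       = col_double()))
```

Declaring the types up front means that a quadrat column is read as numeric even if its first rows happen to be empty.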

Excluding empty columns

One will note that a number of columns contain only NA values. To quickly exclude these empty columns, one can use:

 ## -- using dplyr::select_if() function -- ##
  # exclude empty columns
    percent_cover_kenya %>% dplyr::select_if(~any(!is.na(.)))

Because select_if() keeps every column that holds at least one non-NA value, the choice of columns is made automatically and is not visible in the code. Another strategy is to select individual columns by name using dplyr::select(), which provides clear documentation of which columns were selected:

  # exclude empty columns
    percent_cover_kenya %<>%
      dplyr::select(Country,
                    Year,
                    Station,
                    `Distance (cm)`,
                    # `Transect length (m)`,
                    `Benthic code`,
                    `1`,
                    `2`,
                    `3`,
                    `4`,
                    `5`,
                    `6`,
                    `7`)
# # A tibble: 1,221 x 12
   # Country  Year Station  `Distance (cm)` `Benthic code`   `1`   `2`
   # <chr>   <dbl> <chr>              <dbl> <chr>          <dbl> <dbl>
 # 1 Kenya    2008 Coral G…              32 AT                 0    32
 # 2 Kenya    2008 Coral G…              52 FA                 0     0
 # 3 Kenya    2008 Coral G…              60 HC                60     0
 # 4 Kenya    2008 Coral G…              96 AT                 0     0
 # 5 Kenya    2008 Coral G…             100 FA                 0    68
 # 6 Kenya    2008 Coral G…             107 FA                 0     0
 # 7 Kenya    2008 Coral G…             154 AT                 0     0
 # 8 Kenya    2008 Coral G…             170 FA                 0     0
 # 9 Kenya    2008 Coral G…             170 HC                 0    70
# 10 Kenya    2008 Coral G…             178 HC                 0     0
# # … with 1,211 more rows, and 5 more variables: 3 <dbl>, 4 <dbl>, 5 <dbl>, 6 <dbl>,
# #   7 <dbl>
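
The two column-selection strategies described above can be compared on a small table. This is only a sketch: the toy tibble and its values are made up, and the where() form is shown as an alternative spelling of the same rule that assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# hypothetical toy table: `Site` is entirely NA, `Station` is only partly NA
toy <- tibble(Country = c("Kenya", "Kenya", "Kenya"),
              Site    = c(NA, NA, NA),
              Station = c("Coral Garden", NA, "Coral Garden"),
              `1`     = c(0, 60, 0))

# automatic: keep columns holding at least one non-NA value
toy_auto <- toy %>% select_if(~ any(!is.na(.)))

# the same rule written with where() (assumes dplyr 1.0.0 or later)
toy_where <- toy %>% select(where(~ any(!is.na(.x))))

# explicit: name the columns to keep
toy_explicit <- toy %>% select(Country, Station, `1`)
```

Only the fully empty `Site` column is dropped by the automatic rule; `Station` survives because it has at least one value. The explicit version gives the same result here, but it also records the decision in the script.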

Getting data into “long” format

The data were imported in a “wide” format, with quadrat numbers as column names. To facilitate summarising and standardising these data, we reshape them into a “long” format using gather():

 ## -- set to long -- ##
  # stack the quadrats
    percent_cover_kenya %<>%
      gather(Quadrate, Value,
             -Country,
             -Year,
             -Station,
             -`Distance (cm)`,
             # -`Transect length (m)`,
             -`Benthic code`)
# # A tibble: 8,547 x 7
   # Country  Year Station      `Distance (cm)` `Benthic code` Quadrate Value
   # <chr>   <dbl> <chr>                  <dbl> <chr>          <chr>    <dbl>
 # 1 Kenya    2008 Coral Garden              32 AT             1            0
 # 2 Kenya    2008 Coral Garden              52 FA             1            0
 # 3 Kenya    2008 Coral Garden              60 HC             1           60
 # 4 Kenya    2008 Coral Garden              96 AT             1            0
 # 5 Kenya    2008 Coral Garden             100 FA             1            0
 # 6 Kenya    2008 Coral Garden             107 FA             1            0
 # 7 Kenya    2008 Coral Garden             154 AT             1            0
 # 8 Kenya    2008 Coral Garden             170 FA             1            0
 # 9 Kenya    2008 Coral Garden             170 HC             1            0
# 10 Kenya    2008 Coral Garden             178 HC             1            0
# # … with 8,537 more rows
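
To see what the reshaping step does, the sketch below applies gather() to a two-row toy table, and repeats the operation with pivot_longer(), the newer tidyr function for the same task (assuming tidyr 1.0.0 or later). The toy table and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# hypothetical wide table: one column per quadrat
toy_wide <- tibble(Station        = c("Coral Garden", "Coral Garden"),
                   `Benthic code` = c("HC", "FA"),
                   `1`            = c(60, 0),
                   `2`            = c(0, 68))

# stack the quadrat columns; columns prefixed with "-" are kept as identifiers
toy_long <- toy_wide %>%
  gather(Quadrate, Value, -Station, -`Benthic code`)

# equivalent reshaping with pivot_longer()
toy_long_pivot <- toy_wide %>%
  pivot_longer(cols      = c(`1`, `2`),
               names_to  = "Quadrate",
               values_to = "Value")
```

Both results have one row per station, benthic code and quadrat (2 rows x 2 quadrats = 4 rows). The row order differs between the two functions, so it is safer to sort before comparing them.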

Standardising taxonomic groups

To standardise the benthic taxa codes, one can use fct_recode() from the forcats package to recode them into broader groupings:

 ## -- create a `Grouping` column to classify each `Benthic code` -- ##
  # set grouping using `Benthic code`
    percent_cover_kenya %<>%
      mutate(Grouping = `Benthic code` %>% factor() %>%
                          fct_recode(`Non-living`             =      "RB",
                                     `Live coral`             =      "HC",
                                     Macroalgae               =      "AT",
                                     `Non-living`             =       "S",
                                     `Sessile invertebrates`  =      "SG",
                                     Macroalgae               =      "FA",
                                     `Sessile invertebrates`  =      "SC",
                                     Macroalgae               =     "HAL",
                                     `Live coral`             =   "FAVIA",
                                     `Crustose algae`         =      "CA",
                                     `Live coral`             =   "PLATY",
                                     `Live coral`             =     "POM",
                                     `Live coral`             = "FAVITES",
                                     `Sessile invertebrates`  =     "GAL",
                                     `Live coral`             =    "ACRO",
                                     `Live coral`             =   "MONTI",
                                     `Sessile invertebrates`  =      "SP",
                                     `Sessile invertebrates`  =      "ZO",
                                     `Bleached or dead coral` =     "POB",
                                     `Live coral`             =      "ST",
                                     `Sessile invertebrates`  =     "SPO",
                                     `Sessile invertebrates`  =     "HYD",
                                     `Sessile invertebrates`  =      "SV"))
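
As a minimal sketch of how the recoding behaves, the example below applies fct_recode() to a handful of codes and then tabulates codes against groupings. The toy data are made up, and "XX" is a hypothetical code that is deliberately left out of the recoding.

```r
library(dplyr)
library(forcats)

# hypothetical toy codes; "XX" has no grouping assigned below
toy_codes <- tibble(`Benthic code` = c("HC", "AT", "FA", "HC", "XX"))

toy_codes <- toy_codes %>%
  mutate(Grouping = `Benthic code` %>% factor() %>%
                      fct_recode(`Live coral` = "HC",
                                 Macroalgae   = "AT",
                                 Macroalgae   = "FA"))

# tabulate codes against groupings to spot anything left unrecoded
code_check <- toy_codes %>% count(`Benthic code`, Grouping)
```

Levels that are not named in fct_recode() keep their original label, so an unrecoded code such as "XX" shows up in the table with a Grouping equal to the code itself. This is one quick way to check that every benthic code has been assigned.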

Linking tables

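These classifications can then be linked to the WIO regional benthic categories with left_join(), which keeps every row of percent_cover_kenya and adds the matching columns from wio_benthic_taxa: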

  # link to wio regional taxa
    percent_cover_kenya %<>%
      left_join(wio_benthic_taxa %>%
                  rename(`Benthic code` = benthic_code))
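
The join can be illustrated on toy tables. In this sketch the lookup table toy_taxa and its benthic_name column are hypothetical stand-ins (the real columns of wio_benthic_taxa may differ), the join key is stated explicitly with by, and anti_join() is used to list codes that find no match.

```r
library(dplyr)

# hypothetical percent cover data; "XX" is absent from the lookup table
toy_cover <- tibble(`Benthic code` = c("HC", "FA", "XX"),
                    Value          = c(60, 68, 5))

# hypothetical lookup table; the real columns of wio_benthic_taxa may differ
toy_taxa <- tibble(benthic_code = c("HC", "FA"),
                   benthic_name = c("Hard coral", "Fleshy algae"))

# link tables, naming the key column explicitly
toy_linked <- toy_cover %>%
  left_join(toy_taxa %>% rename(`Benthic code` = benthic_code),
            by = "Benthic code")

# list rows whose code has no match in the lookup table
toy_unmatched <- toy_cover %>%
  anti_join(toy_taxa %>% rename(`Benthic code` = benthic_code),
            by = "Benthic code")
```

Unmatched codes are kept by left_join() with NA in the added columns, so they are easy to overlook. Listing them with anti_join() makes the mismatch explicit.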

Managing intermediate data objects

After checking the percent cover data and ensuring that the benthic codes match up, one should save the data object to the data_intermediate folder as an *.rda file. This is done by pointing to the save location (which could be in your participants_code personal folder):

  # point to save locale
    save_locale <- "data_intermediate/examples/formatting/"

  # save percent cover data
    save(percent_cover_kenya,
      file = paste0(save_locale, "percent_cover_kenya.rda"))
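
A saved *.rda file is read back with load(), which restores the object under the name it had when it was saved. The sketch below shows the round trip using a made-up data frame and a temporary folder standing in for data_intermediate; both are assumptions for illustration.

```r
# hypothetical toy object and a temporary folder standing in for data_intermediate
toy_cover   <- data.frame(Station = c("Coral Garden", "Coral Garden"),
                          Value   = c(60, 68))
save_locale <- paste0(tempdir(), "/")

# save the object
save(toy_cover, file = paste0(save_locale, "toy_cover.rda"))

# keep a copy for comparison, then remove the original from the workspace
toy_cover_saved <- toy_cover
rm(toy_cover)

# load() recreates `toy_cover` and returns the names of the restored objects
loaded_names <- load(paste0(save_locale, "toy_cover.rda"))
```

Because load() brings the object back under its original name, no assignment is needed, but any existing object with that name in the workspace is overwritten.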

Next steps

In future modules, we will see the utility of separating the Data Formatting & Standardisation from the Visualisation, Mapping and Reporting modules.

The next challenge is visualising data.