If you would like to return to information from the previous session, please click here.
When transcribing information from field monitoring into a spreadsheet, it is common to repeat information such as site coordinates, temperature, metadata related to sample techniques, observer name, et cetera. Additionally, information related to taxonomic groups, such as higher-order taxonomy (e.g. Family), functional groups, and other information can also be included as separate columns to go alongside the primary percent cover or abundance data.
As a data set grows in long form (as opposed to
wide form), these additional columns can create a unnecessarily
large data set and can be prone to introduce errors (e.g. in repeated
column values). For example, as a monitoring data set grows as a time
series and sites, the number of rows in a spreadsheet could be in the
1,000s (or in R: 1e3) of rows!
In R, this is best handled as a number of separate tables, that can be joined when the additional information is required (e.g. when aggregating taxa information to Family or GCRMN Tier 1 “Live hard coral”) or by site categories (e.g. “reef crest”, “lagoon”, et cetera).
As a continuation of the DFaS Module, at the end of this lesson, you should be able to:
left_join()right_join())
[The code for this example can be found here:
creation_code/exercises/formatting/create_percent_cover_acosa.R]
From the Data imports and standardisation exercise, we imported a separate table of species characteristics, which basically links the taxa codes with the full name of the taxa and benthic category.
To remind us what this taxa table looks like:
taxa_descriptions
# A tibble: 81 x 3
   Code  Category               Grouping
   <chr> <chr>                  <chr>
 1 ARENA Arena                  arena
 2 TURF  Turf                   turf
 3 Esp   Esponja                esponja
 4 Acc   Alga Calcarea Costrosa Alga calcarea costrosa
 5 Hal   Halimeda               macroalga
 6 Brio  Briozoo                otro
 7 Hid   Hidrozoo               otro
 8 lep   Leptogorgia            otro
 9 Amp   Amphiroa               macroalga
10 gel   Gelidial               turf
# … with 71 more rowsThere are a number of benefits for using codes for the routine data collection and entry, including:
Poc as a code for Pocillopora)The process for joining tables is fairly simple, just:
  # join species data
    sessiles_dat.acosa %<>%
      left_join(taxa_descriptions)
Joining, by = "Codigo"The resultant object now looks something like this:
  percent_cover_acosa %>% quickview()
  X..sitio Conservation.Area  Locality        Site                       Diver Transect
1        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1
2        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1
3        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1
        Date                  Keypuncher Keypunch.date Depth Depth.category  Code
1 2017-02-14 Carolina Sheridan Rodríguez    2017-02-17   6.5         Somero ARENA
2 2017-02-14 Carolina Sheridan Rodríguez    2017-02-17   6.5         Somero  TURF
3 2017-02-14 Carolina Sheridan Rodríguez    2017-02-17   6.5         Somero   Esp
  Average Percentage Category Grouping Quadrat Value
1  45.550     45.550    Arena    arena       1  55.5
2  44.875     44.875     Turf     turf       1  40.0
3   1.850      1.850  Esponja  esponja       1   0.5In the suite of {dplyr} joins, one can specify the
columns to use in linking the tables. If these are not specified, a
message will be printed to the R console “Joining by =”Codigo”. These
messages can be helpful to keep track of how the tables are being
joined.
The joining columns can be specified. And, in cases where the joining
columns are specified, these tools can specify column names, for
example: left_join(by = c("Codigo" = "Codigo")).
There are occasions when only part of a joining table are needed.
Using pipes and some {dplyr} tools, we can do this as part
of the join command:
  # join species data
    percent_cover_acosa %<>%
      left_join(taxa_descriptions %>%
                  dplyr::select(Code,
                                Grouping))
Joining, by = "Code"In some cases, we may want to use other join functions
(e.g. full_join or right_join) to have more
control of how data are matched across tables. We will work on some
examples of this as part of the Homework for this
module.
Some useful examples can be found by typing in the R console:
?left_join
There are also a number of useful joins as part of the
{fuzzyjoin} package. Among them, distance_join
or geo_join, are useful when trying to link tables of
spatial coordinates. This might be useful if there is GPS coordinates
for a coral reef monitoring site that change slightly in time (e.g. due
to available satellites).
Now that we have the taxonomic and benthic category information
linked to our percent cover data, we will now get the data into a “long”
format using gather().