If you would like to return to information from the previous session, please click here.
When transcribing information from field monitoring into a spreadsheet, it is common to repeat information such as site coordinates, temperature, metadata related to sample techniques, observer name, et cetera. Additionally, information related to taxonomic groups, such as higher-order taxonomy (e.g. Family), functional groups, and other information can also be included as separate columns to go alongside the primary percent cover or abundance data.
As a data set grows in long form (as opposed to
wide form), these additional columns can create a unnecessarily
large data set and can be prone to introduce errors (e.g. in repeated
column values). For example, as a monitoring data set grows as a time
series and sites, the number of rows in a spreadsheet could be in the
1,000s (or in R: 1e3
) of rows!
In R, this is best handled as a number of separate tables, that can be joined when the additional information is required (e.g. when aggregating taxa information to Family or GCRMN Tier 1 “Live hard coral”) or by site categories (e.g. “reef crest”, “lagoon”, et cetera).
As a continuation of the DFaS Module, at the end of this lesson, you should be able to:
left_join()
right_join()
)
[The code for this example can be found here:
creation_code/exercises/formatting/create_percent_cover_acosa.R
]
From the Data imports and standardisation exercise, we imported a separate table of species characteristics, which basically links the taxa codes with the full name of the taxa and benthic category.
To remind us what this taxa table looks like:
taxa_descriptions
# A tibble: 81 x 3
Code Category Grouping
<chr> <chr> <chr>
1 ARENA Arena arena
2 TURF Turf turf
3 Esp Esponja esponja
4 Acc Alga Calcarea Costrosa Alga calcarea costrosa
5 Hal Halimeda macroalga
6 Brio Briozoo otro
7 Hid Hidrozoo otro
8 lep Leptogorgia otro
9 Amp Amphiroa macroalga
10 gel Gelidial turf
# … with 71 more rows
There are a number of benefits for using codes for the routine data collection and entry, including:
Poc
as a code for Pocillopora)The process for joining tables is fairly simple, just:
# join species data
sessiles_dat.acosa %<>%
left_join(taxa_descriptions)
Joining, by = "Codigo"
The resultant object now looks something like this:
percent_cover_acosa %>% quickview()
X..sitio Conservation.Area Locality Site Diver Transect
1 1 ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez 1
2 1 ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez 1
3 1 ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez 1
Date Keypuncher Keypunch.date Depth Depth.category Code
1 2017-02-14 Carolina Sheridan Rodríguez 2017-02-17 6.5 Somero ARENA
2 2017-02-14 Carolina Sheridan Rodríguez 2017-02-17 6.5 Somero TURF
3 2017-02-14 Carolina Sheridan Rodríguez 2017-02-17 6.5 Somero Esp
Average Percentage Category Grouping Quadrat Value
1 45.550 45.550 Arena arena 1 55.5
2 44.875 44.875 Turf turf 1 40.0
3 1.850 1.850 Esponja esponja 1 0.5
In the suite of {dplyr}
joins, one can specify the
columns to use in linking the tables. If these are not specified, a
message will be printed to the R console “Joining by =”Codigo”. These
messages can be helpful to keep track of how the tables are being
joined.
The joining columns can be specified. And, in cases where the joining
columns are specified, these tools can specify column names, for
example: left_join(by = c("Codigo" = "Codigo"))
.
There are occasions when only part of a joining table are needed.
Using pipes and some {dplyr}
tools, we can do this as part
of the join command:
# join species data
percent_cover_acosa %<>%
left_join(taxa_descriptions %>%
dplyr::select(Code,
Grouping))
Joining, by = "Code"
In some cases, we may want to use other join functions
(e.g. full_join
or right_join
) to have more
control of how data are matched across tables. We will work on some
examples of this as part of the Homework for this
module.
Some useful examples can be found by typing in the R console:
?left_join
There are also a number of useful joins as part of the
{fuzzyjoin}
package. Among them, distance_join
or geo_join
, are useful when trying to link tables of
spatial coordinates. This might be useful if there is GPS coordinates
for a coral reef monitoring site that change slightly in time (e.g. due
to available satellites).
Now that we have the taxonomic and benthic category information
linked to our percent cover data, we will now get the data into a “long”
format using gather()
.