Linking tables with left

Previous steps

If you would like to return to information from the previous session, please click here.

Context

When transcribing information from field monitoring into a spreadsheet, it is common to repeat information such as site coordinates, temperature, metadata related to sample techniques, observer name, et cetera. Additionally, information related to taxonomic groups, such as higher-order taxonomy (e.g. Family), functional groups, and other information can also be included as separate columns to go alongside the primary percent cover or abundance data.

As a data set grows in long form (as opposed to wide form), these additional columns can create a unnecessarily large data set and can be prone to introduce errors (e.g. in repeated column values). For example, as a monitoring data set grows as a time series and sites, the number of rows in a spreadsheet could be in the 1,000s (or in R: 1e3) of rows!

In R, this is best handled as a number of separate tables, that can be joined when the additional information is required (e.g. when aggregating taxa information to Family or GCRMN Tier 1 “Live hard coral”) or by site categories (e.g. “reef crest”, “lagoon”, et cetera).

As a continuation of the DFaS Module, at the end of this lesson, you should be able to:

Link tables using left_join()
Have a basic understanding of other table joins (e.g. right_join())

[The code for this example can be found here: creation_code/exercises/formatting/create_percent_cover_acosa.R]

Joining tables

From the Data imports and standardisation exercise, we imported a separate table of species characteristics, which basically links the taxa codes with the full name of the taxa and benthic category.

To remind us what this taxa table looks like:

taxa_descriptions
# A tibble: 81 x 3
   Code  Category               Grouping
   <chr> <chr>                  <chr>
 1 ARENA Arena                  arena
 2 TURF  Turf                   turf
 3 Esp   Esponja                esponja
 4 Acc   Alga Calcarea Costrosa Alga calcarea costrosa
 5 Hal   Halimeda               macroalga
 6 Brio  Briozoo                otro
 7 Hid   Hidrozoo               otro
 8 lep   Leptogorgia            otro
 9 Amp   Amphiroa               macroalga
10 gel   Gelidial               turf
# … with 71 more rows

There are a number of benefits for using codes for the routine data collection and entry, including:

Saving time during the data capture process (e.g. using Poc as a code for Pocillopora)
Facilitating the data entry (by focusing on the species code and percent cover or abundance data)
Allows for the development of more elaborate taxonomic and trait information for taxa and site characteristics that can be separate from the data entry process.

The process for joining tables is fairly simple, just:

  # join species data
    sessiles_dat.acosa %<>%
      left_join(taxa_descriptions)
Joining, by = "Codigo"

The resultant object now looks something like this:

  percent_cover_acosa %>% quickview()
  X..sitio Conservation.Area  Locality        Site                       Diver Transect
1        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1
2        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1
3        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1
        Date                  Keypuncher Keypunch.date Depth Depth.category  Code
1 2017-02-14 Carolina Sheridan Rodríguez    2017-02-17   6.5         Somero ARENA
2 2017-02-14 Carolina Sheridan Rodríguez    2017-02-17   6.5         Somero  TURF
3 2017-02-14 Carolina Sheridan Rodríguez    2017-02-17   6.5         Somero   Esp
  Average Percentage Category Grouping Quadrat Value
1  45.550     45.550    Arena    arena       1  55.5
2  44.875     44.875     Turf     turf       1  40.0
3   1.850      1.850  Esponja  esponja       1   0.5

In the suite of {dplyr} joins, one can specify the columns to use in linking the tables. If these are not specified, a message will be printed to the R console “Joining by =”Codigo”. These messages can be helpful to keep track of how the tables are being joined.

The joining columns can be specified. And, in cases where the joining columns are specified, these tools can specify column names, for example: left_join(by = c("Codigo" = "Codigo")).

There are occasions when only part of a joining table are needed. Using pipes and some {dplyr} tools, we can do this as part of the join command:

  # join species data
    percent_cover_acosa %<>%
      left_join(taxa_descriptions %>%
                  dplyr::select(Code,
                                Grouping))
Joining, by = "Code"

Other types of joins

In some cases, we may want to use other join functions (e.g. full_join or right_join) to have more control of how data are matched across tables. We will work on some examples of this as part of the Homework for this module.

Some useful examples can be found by typing in the R console: ?left_join

There are also a number of useful joins as part of the {fuzzyjoin} package. Among them, distance_join or geo_join, are useful when trying to link tables of spatial coordinates. This might be useful if there is GPS coordinates for a coral reef monitoring site that change slightly in time (e.g. due to available satellites).

Next steps

Now that we have the taxonomic and benthic category information linked to our percent cover data, we will now get the data into a “long” format using gather().

Linking tables with left_join

CORDIO East Africa & GCRMN Data Standards Working Group

11 August 2023

Previous steps

Context

Joining tables

Other types of joins

Next steps