Getting data into long format with gather

Context

To make data easier to summarise and visualise, it helps to organise them in a “long” format rather than a “wide” format. In this exercise, we illustrate the key function gather() and show how it supports data summaries. Later modules will show how the long format also helps with visualisation and mapping of data.

At the end of this lesson, you should be able to:

Use gather() to convert “wide” data into “long” format
Use “long” data to create data summaries
Have a basic understanding of how to reverse from “long” to “wide” using spread()

[The code for this example can be found here: creation_code/exercises/formatting/create_percent_cover_acosa.R]

Getting data into “long” format

The object holding percent cover by taxonomic group looks like this:

> percent_cover_acosa
# A tibble: 28,710 x 18
   `# sitio` `Conservation Ar… Locality  Site  Diver    Transect Date       Keypuncher
       <dbl> <chr>             <chr>     <chr> <chr>       <dbl> <date>     <chr>
 1         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
 2         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
 3         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
 4         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
 5         1 ACOSA             Dominical El A… Carolin…        2 2017-02-14 Carolina Sh…
 6         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
 7         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
 8         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
 9         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
10         1 ACOSA             Dominical El A… Carolin…        1 2017-02-14 Carolina Sh…
# … with 28,700 more rows, and 10 more variables: Keypunch date <date>, Depth <dbl>,
#   Depth category <chr>, Code <chr>, Average <dbl>, Percentage <dbl>, Category <chr>,
#   Grouping <chr>, Quadrat <chr>, Value <dbl>

After the metadata columns (e.g. Site, Diver, Date, et cetera), the replicate quadrats are held in columns (i.e. numbers 1 to 10), alongside some additional summary columns (e.g. Average, Percentage). The data are currently in a “wide” format.

To make these data easier to summarise, we will modify the data object by “stacking” the replicate quadrat data. This will result in an object in a “long” format:

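  # load packages: magrittr provides %<>%, tidyr provides gather()
    library(magrittr)
    library(tidyr)

  # set core data into long format
  # (%<>% pipes the object in and assigns the result back to it)
    percent_cover_acosa %<>%
      gather(Quadrat, Value,
             -`# sitio`,
             -`Conservation Area`,
             -Locality,
             -Site,
             -Diver,
             -Transect,
             -Date,
             -Keypuncher,
             -`Keypunch date`,
             -Depth,
             -`Depth category`,
             -Code,
             -Category,
             -Average,
             -Percentage,
             -Grouping)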

The first parameter in the function is the “key” (i.e. Quadrat). This names the new column that keeps track of which original column each “stacked” value came from. The next parameter is the “value”, which names the new column holding the stacked values. We are calling it Value, but a more specific descriptor such as Percent cover would be less ambiguous.
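The roles of the key and value can be seen on a very small table. The following is a minimal sketch: toy_wide and its numbers are hypothetical and are not taken from percent_cover_acosa. It stacks three quadrat columns and then uses spread() with the same key and value names to return to the wide layout.

```r
library(tibble)
library(tidyr)

# Hypothetical toy table (assumption, not the real data):
# one row per taxon code, three replicate quadrats held in columns
toy_wide <- tibble(
  Site = c("El Arbolito", "El Arbolito"),
  Code = c("ARENA", "TURF"),
  `1`  = c(55.5, 44.5),
  `2`  = c(60, 40),
  `3`  = c(50, 50)
)

# Stack the quadrat columns: their names go into the key column (Quadrat)
# and their cell contents go into the value column (Value)
toy_long <- gather(toy_wide, Quadrat, Value, `1`:`3`)
print(toy_long)

# spread() takes the same key and value names and rebuilds the wide table
toy_back <- spread(toy_long, Quadrat, Value)
print(toy_back)
```

Each of the two original rows contributes one row per quadrat, so the long table has six rows. Note that the current tidyr documentation marks gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions still work.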

For the metadata columns, we prefix each name with a - sign to tell gather() to leave that column out of the “stacking”.
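Excluding columns with - is one of two ways to tell gather() what to stack; the other is to name the columns to stack directly. A minimal sketch on a hypothetical toy table (toy_wide and its values are assumptions for illustration) suggests that both routes give the same result:

```r
library(tibble)
library(tidyr)

# Hypothetical toy table (assumption, not the real data)
toy_wide <- tibble(
  Site = c("El Arbolito", "El Arbolito"),
  Code = c("ARENA", "TURF"),
  `1`  = c(55.5, 44.5),
  `2`  = c(60, 40)
)

# Route 1: exclude the metadata columns; everything else is stacked
by_exclusion <- gather(toy_wide, Quadrat, Value, -Site, -Code)

# Route 2: name the quadrat columns to stack
by_selection <- gather(toy_wide, Quadrat, Value, `1`, `2`)

# Compare the two results
all.equal(by_exclusion, by_selection)
```

Exclusion is convenient when there are many columns to stack and few to keep. Its risk is that any metadata column left off the exclusion list is stacked along with the quadrats.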

To get an idea of what this looks like (after selecting a subset of columns):

> percent_cover_acosa %>% dplyr::select(Locality, Site, Transect, Date, Code, Quadrat, Value)
# A tibble: 35,486 x 7
   Locality Site         Transect  Date       Code   Quadrat Value
   <chr>     <chr>           <dbl> <date>     <chr>  <chr>   <dbl>
 1 Dominical El Arbolito         1 2017-02-14 ARENA  1        55.5
 2 Dominical El Arbolito         1 2017-02-14 TURF   1        40
 3 Dominical El Arbolito         1 2017-02-14 Esp    1         0.5
 4 Dominical El Arbolito         1 2017-02-14 Acc    1         4
 5 Dominical El Arbolito         2 2017-02-14 Hal    1         0
 6 Dominical El Arbolito         1 2017-02-14 Brio   1         0
 7 Dominical El Arbolito         1 2017-02-14 Hid    1         0
 8 Dominical El Arbolito         1 2017-02-14 lep    1         0
 9 Dominical El Arbolito         1 2017-02-14 Amp    1         0
10 Dominical El Arbolito         1 2017-02-14 gel    1         0
# … with 35,476 more rows

As one can see, this takes the quadrat data that were spread across columns in the “wide” table and stacks them vertically in a “long” format. This greatly increases the number of rows (about 35,000 here). That may sound like a lot, but R handles it easily when summarising data:

   # summarise data
     percent_cover_acosa %>%
       group_by(Locality,
                Site,
                Transect,
                Date,
                Quadrat) %>%
       summarise(`Total cover` = Value %>% sum(na.rm = TRUE))
`summarise()` has grouped output by 'Locality', 'Site', 'Transect', 'Date'. You can override using the `.groups` argument.
# A tibble: 869 x 6
# Groups:   Locality, Site, Transect, Date [79]
   Locality  Site       Transect   Date      Quadrat `Total cover`
   <chr>     <chr>           <dbl> <date>     <chr>           <dbl>
 1 Dominical El Arbolito         1 2017-02-14 1                 100
 2 Dominical El Arbolito         1 2017-02-14 10                108
 3 Dominical El Arbolito         1 2017-02-14 2                 100
 4 Dominical El Arbolito         1 2017-02-14 3                 100
 5 Dominical El Arbolito         1 2017-02-14 4                 100
 6 Dominical El Arbolito         1 2017-02-14 5                 100
 7 Dominical El Arbolito         1 2017-02-14 6                 100
 8 Dominical El Arbolito         1 2017-02-14 7                 100
 9 Dominical El Arbolito         1 2017-02-14 8                 100
10 Dominical El Arbolito         1 2017-02-14 9                 105
# … with 859 more rows

This gives us an indication of whether more (or less) than 100% cover was recorded across all taxa in a quadrat. In a few lines of code, one can conduct a quick Quality Assurance check on the data. We will work on aspects of this as part of the Homework for this module.
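One way to turn this summary into a Quality Assurance check is to keep only the quadrats whose total differs from 100. The following is a minimal sketch under stated assumptions: toy_long and its values are hypothetical, and the 0.01 tolerance is an arbitrary choice rather than a project standard.

```r
library(tibble)
library(dplyr)

# Hypothetical long-format data (assumption, not the real data):
# two quadrats, percent cover recorded by taxon code
toy_long <- tibble(
  Site    = "El Arbolito",
  Quadrat = c("1", "1", "2", "2"),
  Code    = c("ARENA", "TURF", "ARENA", "TURF"),
  Value   = c(55.5, 44.5, 60, 48)
)

flagged <- toy_long %>%
  group_by(Site, Quadrat) %>%
  # .groups = "drop" returns an ungrouped table and silences the grouping message
  summarise(`Total cover` = sum(Value, na.rm = TRUE), .groups = "drop") %>%
  # keep quadrats whose total is more than the (assumed) tolerance away from 100
  filter(abs(`Total cover` - 100) > 0.01)

print(flagged)
```

In this toy case quadrat 1 sums to 100 and is dropped, while quadrat 2 sums to 108 and is kept for review.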

CORDIO East Africa & GCRMN Data Standards Working Group

11 August 2023