To facilitate summaries and visualisation of data, it is
helpful to organise data in a “long” format, as opposed to a “wide”
format. In this exercise, we illustrate a key function, gather(),
and show how it can be used for data summaries. Later modules will
make it more apparent how this is useful in visualising and mapping data.
At the end of this lesson, you should be able to use:
gather() for setting “wide” data into “long”
spread()
[The code for this example can be found here: creation_code/exercises/formatting/create_percent_cover_acosa.R]
The object that combines percent cover and taxonomic groups looks like this:
> percent_cover_acosa
# A tibble: 28,710 x 18
`# sitio` `Conservation Ar… Locality Site Diver Transect Date Keypuncher
<dbl> <chr> <chr> <chr> <chr> <dbl> <date> <chr>
1 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
2 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
3 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
4 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
5 1 ACOSA Dominical El A… Carolin… 2 2017-02-14 Carolina Sh…
6 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
7 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
8 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
9 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
10 1 ACOSA Dominical El A… Carolin… 1 2017-02-14 Carolina Sh…
# … with 28,700 more rows, and 10 more variables: Keypunch date <date>, Depth <dbl>,
# Depth category <chr>, Code <chr>, Average <dbl>, Percentage <dbl>, Category <chr>,
# Grouping <chr>, Quadrat <chr>, Value <dbl>
After the metadata columns (e.g. Site, Diver, Date, et cetera), the
replicate quadrats are in columns (i.e. numbers 1 to 10), followed by
some additional summary columns (e.g. Total, Average). The data are
currently in a “wide” format.
To make these data easier to summarise, we will modify the data object by “stacking” the replicate quadrat data. This will result in an object in a “long” format:
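# load tidyr for gather() and magrittr for the %<>% assignment pipe
library(tidyr)
library(magrittr)

# set core data into long format
# (columns prefixed with - are kept as metadata and not stacked)
percent_cover_acosa %<>%
  gather(Quadrat, Value,
         -`# sitio`,
         -`Conservation Area`,
         -Locality,
         -Site,
         -Diver,
         -Transect,
         -Date,
         -Keypuncher,
         -`Keypunch date`,
         -Depth,
         -`Depth category`,
         -Code,
         -Category,
         -Average,
         -Percentage,
         -Grouping)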
The first parameter after the data is the “key” (i.e. Quadrat). This
new column keeps track of what is “stacked” in the long data object,
namely the names of the columns that were gathered. The next parameter
is the “value”, which we are calling Value, but we could use another
descriptor like Percent cover or something less ambiguous.
For the metadata columns, we use a - sign to tell gather() to omit
them from the “stacking”.
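To make the roles of the key, the value and the - sign concrete, here is a minimal sketch on a small made-up table. The object names (toy_wide, toy_long, toy_back) and all of the numbers are hypothetical and are not taken from the ACOSA data. The last step uses spread(), the counterpart of gather(), to put the stacked data back into a “wide” layout.

```r
library(tidyr)
library(tibble)

# Hypothetical "wide" data: two taxa, with one column per quadrat (1 to 3)
toy_wide <- tibble(
  Site = c("A", "A"),
  Code = c("TURF", "ARENA"),
  `1`  = c(40, 60),
  `2`  = c(30, 70),
  `3`  = c(55, 45)
)

# Stack the quadrat columns: the column names go into Quadrat (the key),
# the cell contents go into Value; Site and Code are excluded with -
toy_long <- gather(toy_wide, Quadrat, Value, -Site, -Code)
print(toy_long)

# spread() reverses the operation: one column per Quadrat again
toy_back <- spread(toy_long, Quadrat, Value)
print(toy_back)
```

The two rows and three quadrat columns become six rows in toy_long, one per taxon and quadrat. Note that recent tidyr releases mark gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions still work and are the ones used in this lesson.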
To get an idea of what this looks like (after selecting a subset of columns):
> percent_cover_acosa %>% dplyr::select(Locality, Site, Transect, Date, Code, Quadrat, Value)
# A tibble: 35,486 x 7
Locality Site Transect Date Code Quadrat Value
<chr> <chr> <dbl> <date> <chr> <chr> <dbl>
1 Dominical El Arbolito 1 2017-02-14 ARENA 1 55.5
2 Dominical El Arbolito 1 2017-02-14 TURF 1 40
3 Dominical El Arbolito 1 2017-02-14 Esp 1 0.5
4 Dominical El Arbolito 1 2017-02-14 Acc 1 4
5 Dominical El Arbolito 2 2017-02-14 Hal 1 0
6 Dominical El Arbolito 1 2017-02-14 Brio 1 0
7 Dominical El Arbolito 1 2017-02-14 Hid 1 0
8 Dominical El Arbolito 1 2017-02-14 lep 1 0
9 Dominical El Arbolito 1 2017-02-14 Amp 1 0
10 Dominical El Arbolito 1 2017-02-14 gel 1 0
# … with 35,476 more rows
As one can see, this takes the data that sat row-wise in the “wide”
table and places them in a vertical “long” format. This significantly
expands the number of rows (to about 35,000 rows!). This sounds like a
lot, but R handles it easily when summarising data:
# summarise data: total cover per quadrat (requires dplyr)
library(dplyr)

percent_cover_acosa %>%
  group_by(Locality,
           Site,
           Transect,
           Date,
           Quadrat) %>%
  summarise(`Total cover` = sum(Value, na.rm = TRUE))
`summarise()` has grouped output by 'Locality', 'Site', 'Transect', 'Date'. You can override using the `.groups` argument.
# A tibble: 869 x 6
# Groups: Locality, Site, Transect, Date [79]
Locality Site Transect Date Quadrat `Total cover`
<chr> <chr> <dbl> <date> <chr> <dbl>
1 Dominical El Arbolito 1 2017-02-14 1 100
2 Dominical El Arbolito 1 2017-02-14 10 108
3 Dominical El Arbolito 1 2017-02-14 2 100
4 Dominical El Arbolito 1 2017-02-14 3 100
5 Dominical El Arbolito 1 2017-02-14 4 100
6 Dominical El Arbolito 1 2017-02-14 5 100
7 Dominical El Arbolito 1 2017-02-14 6 100
8 Dominical El Arbolito 1 2017-02-14 7 100
9 Dominical El Arbolito 1 2017-02-14 8 100
10 Dominical El Arbolito 1 2017-02-14 9 105
# … with 859 more rows
This gives us an indication of whether more (or less) than 100% cover was observed across all taxa in a quadrat. In a few lines of code, one can conduct a quick Quality Assurance check on the data. We will work on aspects of this as part of the Homework for this module.
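As a rough sketch of such a check, the totals can be filtered to keep only the quadrats whose cover does not add up to 100%. The data below are made up, and the names toy_long, totals, flagged and tolerance are hypothetical; the tolerance of 0.01 is an assumed value chosen only to avoid flagging tiny rounding differences.

```r
library(dplyr)
library(tibble)

# Hypothetical long-format data: two taxa in each of three quadrats
toy_long <- tibble(
  Site    = "A",
  Quadrat = c("1", "1", "2", "2", "3", "3"),
  Code    = rep(c("TURF", "ARENA"), 3),
  Value   = c(40, 60, 30, 75, 55, NA)
)

# Total cover per quadrat, ignoring missing values
totals <- toy_long %>%
  group_by(Site, Quadrat) %>%
  summarise(`Total cover` = sum(Value, na.rm = TRUE), .groups = "drop")

# Assumed tolerance for rounding differences
tolerance <- 0.01

# Keep only quadrats whose total differs from 100%
flagged <- totals %>%
  filter(abs(`Total cover` - 100) > tolerance)

print(flagged)
```

Here quadrat 1 sums to exactly 100 and is dropped, while quadrats 2 and 3 are kept for review. Setting .groups = "drop" returns an ungrouped table and suppresses the grouping message shown above.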
Now that we have the linked percent cover and taxonomic information in a
long format, we will explore how to modify factors using a fantastic
package called {forcats}.