Creating data aggregations and summaries

Context

Being able to visualise status & trends of coral reefs is indispensable for a monitoring programme. In the first instance, visualising data serves as an important tool for identifying potential errors in the data (e.g. if values in a scatter plot are beyond expected ranges). Secondly, converting raw data into a visual summary is often necessary to communicate monitoring results to managers, decision-makers and the broader community.

For this module, we will start by building skills for aggregating and summarising data (e.g. grouping replicates at the site level, calculating variation around mean values) and then introduce {ggplot2}, an R package built on the “grammar of graphics”.

The code for the following examples can be found here: analysis_code/examples/visualising/plot_sessiles_dat.acosa.R

Getting to know your data

To begin, we will load the percent cover data object from our previous example.

##
## 1. Set up
##
 ## -- load percent cover data -- ##
  # point to data locale
    data_locale <- "data_intermediate/examples/formatting/"

  # call to data
    load(paste0(data_locale,  "sessiles_dat.acosa.rda"))

Printing the object shows that the tibble form of a data table only displays a subset of the columns:

   # have a look
  percent_cover_acosa
# A tibble: 28,710 x 19
   `# sitio` `Conservation Ar… Locality  Site  Diver Transect Date       Keypuncher `Keypunch date`
       <dbl> <chr>             <chr>     <chr> <chr>    <dbl> <date>     <chr>      <date>
 1         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 2         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 3         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 4         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 5         1 ACOSA             Dominical El A… Caro…        2 2017-02-14 Carolina … 2017-02-17
 6         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 7         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 8         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 9         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
10         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
# … with 28,700 more rows, and 10 more variables: Depth <dbl>, Depth category <fct>, Code <chr>,
#   Average <dbl>, Percentage <dbl>, Category <chr>, Grouping <fct>, Quadrat <chr>, Value <dbl>,
#   Dataset_id <chr>

We can use the convenience function quickview() from integrate.R to inspect the first 3 rows of data across all columns:

   # have a quick look
     percent_cover_acosa %>% quickview()
  X..sitio Conservation.Area  Locality        Site                       Diver Transect       Date
1        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1 2017-02-14
2        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1 2017-02-14
3        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1 2017-02-14
                   Keypuncher Keypunch.date Depth Depth.category  Code Average Percentage Category
1 Carolina Sheridan Rodríguez    2017-02-17   6.5        Shallow ARENA  45.550     45.550    Arena
2 Carolina Sheridan Rodríguez    2017-02-17   6.5        Shallow  TURF  44.875     44.875     Turf
3 Carolina Sheridan Rodríguez    2017-02-17   6.5        Shallow   Esp   1.850      1.850  Esponja
               Grouping Quadrat Value            Dataset_id
1            Non-living       1  55.5 gulfo_dulce_2006-2014
2            Macroalgae       1  40.0 gulfo_dulce_2006-2014
3 Sessile invertebrates       1   0.5 gulfo_dulce_2006-2014

Note that this printed view loses some of the special characters (e.g. # and spaces) in the column names, which the tibble format preserves. As we are only printing to the screen and not modifying the object itself, percent_cover_acosa retains its original column names.
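
The renaming seen in the quickview() output matches what base R's make.names() produces when names are made syntactically valid. The internals of quickview() are not shown in this module, so the following is only an illustrative sketch of that renaming, using three of the column names from the tibble above.

```r
# column names as they are stored in the tibble
original_names <- c("# sitio", "Conservation Area", "Keypunch date")

# make.names() converts them to syntactically valid R names:
# invalid characters become "." and an "X" is prepended where needed
clean_names <- make.names(original_names)
clean_names
```

Comparing clean_names with the quickview() header shows the same pattern, while the names stored in the tibble itself stay untouched.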

In this data set, there are a number of different levels at which we might want to aggregate and/or summarise. For example, in the section on managing factors, we saw that there are a number of different benthic categories that we might want to aggregate (e.g. combine all living coral or macroalgae into a single group) before summarising (e.g. taking the average per site).

We can get familiar with the different types of levels by inspecting individual columns:

   # get list of sites
     percent_cover_acosa$Site %>% unique() %>% sort()
 [1] "Bajo Mauren"       "Burbujas"          "Cueva del Tiburón" "El Arbolito"       "El Jardín"
 [6] "Isla Ballena"      "Islotes"           "La Catarata"       "La Viuda"          "Mogos"
[11] "Nicuesa"           "Punta Gallardo"    "San Jocesito"      "San Pedrillo"      "Sándalo 1"
[16] "Sándalo 2"         "Tómbolo noreste 1" "Tómbolo noreste 2" "Tómbolo sur"       "Tres Hermanas"

There are also a number of “Localities”, which we can inspect in the same way:

   # get list of localities
     percent_cover_acosa$Locality %>% unique() %>% sort()
[1] "Dominical"        "Golfo Dulce"      "Isla del Caño"    "Peninsula de Osa" "PNMarino Ballena"

Note that in these examples, we are getting the unique() values of a column and then sorting them by piping the values to sort().

Another useful function is table(), which we can use to cross-tabulate a number of different levels in a data object:

   # get levels of replication
     percent_cover_acosa %>% with(., table(Site, Transect))
                   Transect
Site                  1   2   3
  Bajo Mauren       390 400 440
  Burbujas          390   0   0
  Cueva del Tiburón 380 360 390
  El Arbolito       140 160 150
  El Jardín         320 380 360
  Isla Ballena      170 170 380
  Islotes           280 230 240
  La Catarata       230 270 250
  La Viuda          380 360 360
  Mogos             210 220 250
  Nicuesa           200 170 210
  Punta Gallardo    240 180 180
  San Jocesito      280 300 280
  San Pedrillo      380 380 380
  Sándalo 1         150 120 150
  Sándalo 2         230  60 110
  Tómbolo noreste 1 270 170 170
  Tómbolo noreste 2 330   0 240
  Tómbolo sur       130 130 570
  Tres Hermanas     390 390 370

This table() shows the number of rows for each Transect and Site combination. This can be a quick way to check whether replication is balanced and whether we have the number of cases we would expect (e.g. in point intercept data, we should see the same number of points at each level of replication). Here, for instance, Burbujas only has rows for Transect 1 and Tómbolo noreste 2 has none for Transect 2.
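
As a minimal sketch of this kind of check, the example below builds a small toy data frame (the sites, transects and row counts are made up for illustration, not taken from the ACOSA data) and lists the Site and Transect combinations that have no rows.

```r
# toy data (hypothetical): one row per recorded point
toy_points <- data.frame(
  Site     = c("A", "A", "A", "A", "B", "B", "B"),
  Transect = c(1, 1, 2, 2, 1, 1, 1)
)

# count rows for each Site x Transect combination
replication <- with(toy_points, table(Site, Transect))
replication

# convert the counts to long format and keep the empty combinations
replication_long <- as.data.frame(replication)
missing_combos   <- subset(replication_long, Freq == 0)
missing_combos
```

The same approach could be applied to percent_cover_acosa to list the combinations that appear as zeros in the table above.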

Aggregating & summarising data

Once we have a better idea of what we might want to aggregate and summarise in our data, we can begin the process. For this example, we are going to:

summarise the detailed species codes (i.e. Code) to a coarser grouping (i.e. Grouping)
calculate the sum() of individual quadrats
calculate the mean() and standard deviation sd() for each transect
summarise by site and locality

This sounds like a lot, but in R it is a fairly straightforward operation, as we will see in a moment.

The equivalent of this type of data summary would be done in a pivot table in Excel, where we would simply select the data and “drag & drop” the column names we want to summarise and select the summary function for the different levels.

To start, we first need to sum() the Values from each Quadrat for the coarse taxonomic groupings:

  # summarise by coarser taxa grouping
    percent_cover_acosa %>%
      group_by(Locality,
               Site,
               Transect,
               Quadrat,
               Grouping) %>%
      summarise(Value = Value %>% sum(na.rm = TRUE))
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect', 'Quadrat' (override with `.groups` argument)
# # A tibble: 2,900 x 6
# # Groups:   Locality, Site, Transect, Quadrat [580]
   # Locality  Site        Transect Quadrat Grouping              Value
   # <chr>     <chr>          <dbl> <chr>   <fct>                 <dbl>
 # 1 Dominical El Arbolito        1 1       Crustose algae          4
 # 2 Dominical El Arbolito        1 1       Non-living             55.5
 # 3 Dominical El Arbolito        1 1       Macroalgae             40
 # 4 Dominical El Arbolito        1 1       Sessile invertebrates   0.5
 # 5 Dominical El Arbolito        1 10      Crustose algae          1
 # 6 Dominical El Arbolito        1 10      Non-living             10
 # 7 Dominical El Arbolito        1 10      Macroalgae             86.5
 # 8 Dominical El Arbolito        1 10      Sessile invertebrates   2.5
 # 9 Dominical El Arbolito        1 2       Crustose algae          3
# 10 Dominical El Arbolito        1 2       Non-living             56

Note that the group_by() statement allows us to set the different levels for the data summary. We use the pipe %>% to send the grouped data to the summarise() command, where we create the column Value holding the sum() of the values within each group. This is the step that a pivot table would perform in Excel.
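
To make the mechanics easier to follow, here is a minimal sketch of the same group_by() and summarise() pattern on a four-row toy tibble. The object names, codes and values are hypothetical and chosen only so that the sums are easy to verify by eye; the example assumes {dplyr} version 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# toy data (hypothetical): two detailed codes fall within each coarse grouping
toy_cover <- tibble(
  Site     = c("A", "A", "A", "A"),
  Quadrat  = c("1", "1", "1", "1"),
  Code     = c("code_1", "code_2", "code_3", "code_4"),
  Grouping = c("Macroalgae", "Macroalgae", "Non-living", "Non-living"),
  Value    = c(30, 10, 45, NA)
)

# sum the values within each Site x Quadrat x Grouping combination;
# na.rm = TRUE ignores the missing value instead of returning NA
toy_sums <- toy_cover %>%
  group_by(Site, Quadrat, Grouping) %>%
  summarise(Value = sum(Value, na.rm = TRUE), .groups = "drop")

toy_sums
```

The four detailed rows collapse to one row per grouping, and setting .groups = "drop" returns an ungrouped tibble without the regrouping message.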

A few things to note in this output:

Quadrat is a character value, which is why the sequence of quadrats is sorted as 1, 10… Keeping it as character is generally good practice: if there are true character values (e.g. “A1”) and we convert the column with as.numeric(), the function will replace those values with NA.
There are <NA> values in the taxonomic groupings (i.e. Grouping), which means that we have imperfect matches in our concordance file. We will learn how to clean this up as part of the Homework.
At this point, we have not changed the original data object; we are just printing the output to the console.
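
The first point can be illustrated with a short base R sketch. The labels below are made up for illustration, with “A1” standing in for a label that is truly non-numeric.

```r
# quadrat labels stored as character (hypothetical values)
quadrat_labels <- c("1", "2", "10")

# character values sort alphabetically, so "10" comes before "2"
sorted_labels <- sort(quadrat_labels)
sorted_labels

# converting a non-numeric label yields NA; R normally also issues a
# warning here, which is suppressed to keep the example output clean
converted <- suppressWarnings(as.numeric(c(quadrat_labels, "A1")))
converted
```

If the labels are known to be purely numeric, converting them before sorting gives the natural order 1, 2, 10.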

Now that we have the total percent cover values for individual quadrats within transects, sites and localities, the next step is to summarise at the Transect level, like this:

  # summarise by transect
    percent_cover_acosa %>%
      group_by(Locality,
               Site,
               Transect,
               Quadrat,
               Grouping) %>%
      summarise(Value = Value %>% sum(na.rm = TRUE)) %>%
      group_by(Locality,
               Site,
               Transect,
               Grouping) %>%
      summarise(Value_mean = Value %>% mean(na.rm = TRUE),
                Value_sd   = Value %>%   sd(na.rm = TRUE))
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect', 'Quadrat' (override with `.groups` argument)
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect' (override with `.groups` argument)
# # A tibble: 290 x 6
# # Groups:   Locality, Site, Transect [58]
   # Locality  Site        Transect Grouping              Value_mean Value_sd
   # <chr>     <chr>          <dbl> <fct>                      <dbl>    <dbl>
 # 1 Dominical El Arbolito        1 Crustose algae              5        3.97
 # 2 Dominical El Arbolito        1 Non-living                 45.6     16.7
 # 3 Dominical El Arbolito        1 Macroalgae                 46.8     16.8
 # 4 Dominical El Arbolito        1 Sessile invertebrates       2.68     2.10
 # 5 Dominical El Arbolito        2 Crustose algae              8.3      6.78
 # 6 Dominical El Arbolito        2 Non-living                 10        0
 # 7 Dominical El Arbolito        2 Macroalgae                 79.0      6.34
 # 8 Dominical El Arbolito        2 Sessile invertebrates       2.75     2.27
 # 9 Dominical El Arbolito        3 Crustose algae             24.2     29.8
# 10 Dominical El Arbolito        3 Non-living                  8.6     11.1
# # … with 280 more rows

We can see that we now have Transect means and standard deviations for each high-level taxonomic group. A few things to note in the steps of this calculation:

We are piping (i.e. %>%) the quadrat summaries on to a second summary by transect. Tibbles keep a “memory” of the previous grouping: after summarise(), the output is still grouped by all but the last grouping variable (here Locality, Site, Transect and Quadrat), so a later summary will use that grouping unless we change it. You can check this by commenting out the second group_by() statement. Because the second group_by() replaces the existing grouping with the higher-level one (i.e. Locality, Site, Transect), an explicit ungroup() is not necessary in this case.
This calculation could be split up by first creating an object that summarises the Quadrats and then piping that object to the next level of aggregation. As we (potentially) have two other levels to aggregate, we may want to separate these out into individual steps. As the next part of this module will use the higher levels to visualise the data, we can instead create a single summary object and use it to build additional aggregations “on-the-fly”.
Depending on the sampling design, we could omit the Transect level and obtain a mean() and sd() at the site level (assuming that the quadrats are independent). As Transect can sometimes be associated with reef position, depth, or other nested structure, it makes sense to retain this level of variation. However, it can make it more difficult to visualise the different levels of variation (e.g. the next level of aggregation at the site will be a mean() of means with its own standard deviation!).
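
The last two points can be sketched on toy data. The example below is an illustration under stated assumptions rather than part of the module's analysis code: the object names and values are hypothetical, the design is deliberately unbalanced (two quadrats on one transect, four on the other), and {dplyr} version 1.0.0 or later is assumed for the .groups argument.

```r
library(dplyr)

# toy quadrat totals (hypothetical): one site, two transects, unbalanced
toy_quadrats <- tibble(
  Site     = "A",
  Transect = c(1, 1, 2, 2, 2, 2),
  Quadrat  = c("1", "2", "1", "2", "3", "4"),
  Value    = c(10, 20, 40, 40, 40, 40)
)

# step 1: store the transect-level summary as its own object
toy_transects <- toy_quadrats %>%
  group_by(Site, Transect) %>%
  summarise(Value_mean = mean(Value),
            Value_sd   = sd(Value),
            .groups    = "drop")

# step 2a: site summary built from transect means (a mean of means)
site_from_transects <- toy_transects %>%
  group_by(Site) %>%
  summarise(Site_mean = mean(Value_mean),
            Site_sd   = sd(Value_mean),
            .groups   = "drop")

# step 2b: site summary that skips the Transect level entirely
site_from_quadrats <- toy_quadrats %>%
  group_by(Site) %>%
  summarise(Site_mean = mean(Value),
            Site_sd   = sd(Value),
            .groups   = "drop")

site_from_transects
site_from_quadrats
```

With unbalanced replication the two site means differ (27.5 from the transect means against roughly 31.7 from the pooled quadrats), which is one reason to decide deliberately which levels to retain.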

Next steps

Now that we have a good handle on aggregating and summarising data, it is a good time to put these into a visual form.