Being able to visualise the status and trends of coral reefs is indispensable for a monitoring programme. In the first instance, visualising data serves as an important tool for identifying potential errors in the data (e.g. if values in a scatter plot are beyond expected ranges). Secondly, converting raw data into a visual summary is often necessary to communicate monitoring results to managers, decision-makers and the broader community.
For this module, we will start by building skills for aggregating and
summarising data (e.g. grouping replicates at the site level,
calculating variation around mean values) and then introduce
{ggplot2}, the “grammar of graphics”.
The code for the following examples can be found here:
analysis_code/examples/visualising/plot_sessiles_dat.acosa.R
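The examples below rely on the pipe (%>%), the {dplyr} verbs and, later on, {ggplot2}. A minimal sketch of the set-up, assuming a plain R session (within the course, sourcing integrate.R may already handle this):
 ## -- load packages -- ##
  # load the tidyverse (dplyr, tidyr, ggplot2, ...)
    library(tidyverse)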
To begin, we will load the percent cover data object from our previous example.
##
## 1. Set up
##
 ## -- load percent cover data -- ##
  # point to data locale
    data_locale <- "data_intermediate/examples/formatting/"
  # call to data
    load(paste0(data_locale, "sessiles_dat.acosa.rda"))
Note that the tibble form of a data table only prints a subset of the columns:
   # have a look
  percent_cover_acosa
# A tibble: 28,710 x 19
   `# sitio` `Conservation Ar… Locality  Site  Diver Transect Date       Keypuncher `Keypunch date`
       <dbl> <chr>             <chr>     <chr> <chr>    <dbl> <date>     <chr>      <date>
 1         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 2         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 3         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 4         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 5         1 ACOSA             Dominical El A… Caro…        2 2017-02-14 Carolina … 2017-02-17
 6         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 7         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 8         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
 9         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
10         1 ACOSA             Dominical El A… Caro…        1 2017-02-14 Carolina … 2017-02-17
# … with 28,700 more rows, and 10 more variables: Depth <dbl>, Depth category <fct>, Code <chr>,
#   Average <dbl>, Percentage <dbl>, Category <chr>, Grouping <fct>, Quadrat <chr>, Value <dbl>,
#   Dataset_id <chr>
We can use the convenience function quickview() from integrate.R to inspect the first 3 rows of data across all columns:
   # have a quick look
     percent_cover_acosa %>% quickview()
  X..sitio Conservation.Area  Locality        Site                       Diver Transect       Date
1        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1 2017-02-14
2        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1 2017-02-14
3        1             ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez        1 2017-02-14
                   Keypuncher Keypunch.date Depth Depth.category  Code Average Percentage Category
1 Carolina Sheridan Rodríguez    2017-02-17   6.5        Shallow ARENA  45.550     45.550    Arena
2 Carolina Sheridan Rodríguez    2017-02-17   6.5        Shallow  TURF  44.875     44.875     Turf
3 Carolina Sheridan Rodríguez    2017-02-17   6.5        Shallow   Esp   1.850      1.850  Esponja
               Grouping Quadrat Value            Dataset_id
1            Non-living       1  55.5 gulfo_dulce_2006-2014
2            Macroalgae       1  40.0 gulfo_dulce_2006-2014
3 Sessile invertebrates       1   0.5 gulfo_dulce_2006-2014
Note that in doing this, we lose some of the special characters (e.g. # and spaces) in the column names that are preserved in the tibble() format. As we are just printing to the screen and not modifying the object itself, the original names are retained in percent_cover_acosa.
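Two tidyverse alternatives for inspecting all columns without leaving the tibble format are glimpse() and printing with width = Inf. A quick sketch, using the same percent_cover_acosa object:
   # overview of every column, with types and first values
     percent_cover_acosa %>% glimpse()

   # print the first rows without truncating any columns
     percent_cover_acosa %>% print(n = 5, width = Inf)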
In this data set, there are a number of different levels that we might want to aggregate and/or summarise. For example, when we managed factors we saw that there are a number of different benthic categories that we might want to aggregate (e.g. combine all living coral or all macroalgae into a single group) before summarising (e.g. taking the average per site).
We can get familiar with the different types of levels by inspecting individual columns:
   # get list of sites
     percent_cover_acosa$Site %>% unique() %>% sort()
 [1] "Bajo Mauren"       "Burbujas"          "Cueva del Tiburón" "El Arbolito"       "El Jardín"
 [6] "Isla Ballena"      "Islotes"           "La Catarata"       "La Viuda"          "Mogos"
[11] "Nicuesa"           "Punta Gallardo"    "San Jocesito"      "San Pedrillo"      "Sándalo 1"
[16] "Sándalo 2"         "Tómbolo noreste 1" "Tómbolo noreste 2" "Tómbolo sur"       "Tres Hermanas"We can also see that there are a number of “Localities”, which we can also inspect:
   # get list of localities
     percent_cover_acosa$Locality %>% unique() %>% sort()
[1] "Dominical"        "Golfo Dulce"      "Isla del Caño"    "Peninsula de Osa" "PNMarino Ballena"Note in these examples, we are getting the unique()
values of a column and then sorting them by piping the values
to sort()
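To see how sites nest within localities, one option is dplyr's distinct(); a short sketch (not part of the original script):
   # unique locality-site combinations
     percent_cover_acosa %>%
       distinct(Locality, Site) %>%
       arrange(Locality, Site)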
Another useful function is table(), with which we can look at the number of cases across different levels in a data object:
   # get levels of replication
     percent_cover_acosa %>% with(., table(Site, Transect))
                   Transect
Site                  1   2   3
  Bajo Mauren       390 400 440
  Burbujas          390   0   0
  Cueva del Tiburón 380 360 390
  El Arbolito       140 160 150
  El Jardín         320 380 360
  Isla Ballena      170 170 380
  Islotes           280 230 240
  La Catarata       230 270 250
  La Viuda          380 360 360
  Mogos             210 220 250
  Nicuesa           200 170 210
  Punta Gallardo    240 180 180
  San Jocesito      280 300 280
  San Pedrillo      380 380 380
  Sándalo 1         150 120 150
  Sándalo 2         230  60 110
  Tómbolo noreste 1 270 170 170
  Tómbolo noreste 2 330   0 240
  Tómbolo sur       130 130 570
  Tres Hermanas     390 390 370
This table() is showing the number of rows for each Transect and Site combination. It can be a quick way to check whether we have balanced replication or the number of cases we would expect (e.g. in point intercept data, we should be seeing the same number of points at each level of replication).
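For a finer-grained check, we can count rows at the Quadrat level; a sketch using dplyr's count() (in point intercept data, each quadrat should contribute the same number of points):
   # number of rows per quadrat, largest first
     percent_cover_acosa %>%
       count(Site, Transect, Quadrat) %>%
       arrange(desc(n))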
Once we have a better idea of what we might want to aggregate and summarise in our data, we can begin the process. For this example, we are going to:
- aggregate the detailed benthic codes (i.e. Code) to a coarser grouping (i.e. Grouping)
- take the sum() of the individual quadrats
- calculate the mean() and standard deviation sd() for each transect
This sounds like a lot, but in R it is a fairly straightforward operation, as we will see in a moment.
The equivalent of this type of data summary would be done in a pivot table in Excel, where we would simply select the data and “drag & drop” the column names we want to summarise and select the summary function for the different levels.
To start, we first need to sum() the Values
from each Quadrat for the coarse taxonomic groupings:
  # summarise by coarser taxa grouping
    percent_cover_acosa %>%
      group_by(Locality,
               Site,
               Transect,
               Quadrat,
               Grouping) %>%
      summarise(Value = Value %>% sum(na.rm = TRUE))
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect', 'Quadrat' (override with `.groups` argument)
# # A tibble: 2,900 x 6
# # Groups:   Locality, Site, Transect, Quadrat [580]
   # Locality  Site        Transect Quadrat Grouping              Value
   # <chr>     <chr>          <dbl> <chr>   <fct>                 <dbl>
 # 1 Dominical El Arbolito        1 1       Crustose algae          4
 # 2 Dominical El Arbolito        1 1       Non-living             55.5
 # 3 Dominical El Arbolito        1 1       Macroalgae             40
 # 4 Dominical El Arbolito        1 1       Sessile invertebrates   0.5
 # 5 Dominical El Arbolito        1 10      Crustose algae          1
 # 6 Dominical El Arbolito        1 10      Non-living             10
 # 7 Dominical El Arbolito        1 10      Macroalgae             86.5
 # 8 Dominical El Arbolito        1 10      Sessile invertebrates   2.5
 # 9 Dominical El Arbolito        1 2       Crustose algae          3
# 10 Dominical El Arbolito        1 2       Non-living             56
Note that the group_by() statement allows us to set the different levels for the data summary. We use the pipe %>% to send the grouped data to the summarise() command, where we create the column Value with the sum() of the values. In Excel, this is like a pivot table function.
A few things to note in this output:
- Quadrat is a character() value, which is why the sequence of quadrats is 1, 10, … This is generally good practice: if there are true character values (e.g. “A1”) and we coerce the column with as.numeric(), the function will replace those character values with NA.
- There are <NA> values in the taxonomic groupings (i.e. Grouping), which means that we have imperfect matches in our concordance file. We will learn how to clean this up as part of the Homework (a quick check is sketched after this list).
- We have not saved the result to a new object; we are just %>%ing the output to the console.
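As flagged above, a quick way to see which benthic codes did not match the concordance file is to filter for the missing groupings and count the codes involved; a minimal sketch (not part of the original script):
   # which codes have no coarse grouping?
     percent_cover_acosa %>%
       filter(is.na(Grouping)) %>%
       count(Code, sort = TRUE)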
Now that we have the total percent cover values for individual quadrats at transects, sites and localities, the next step is to summarise at the Transect level, like this:
  # summarise by transect
    percent_cover_acosa %>%
      group_by(Locality,
               Site,
               Transect,
               Quadrat,
               Grouping) %>%
      summarise(Value = Value %>% sum(na.rm = TRUE)) %>%
      group_by(Locality,
               Site,
               Transect,
               Grouping) %>%
      summarise(Value_mean = Value %>% mean(na.rm = TRUE),
                Value_sd   = Value %>%   sd(na.rm = TRUE))
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect', 'Quadrat' (override with `.groups` argument)
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect' (override with `.groups` argument)
# # A tibble: 290 x 6
# # Groups:   Locality, Site, Transect [58]
   # Locality  Site        Transect Grouping              Value_mean Value_sd
   # <chr>     <chr>          <dbl> <fct>                      <dbl>    <dbl>
 # 1 Dominical El Arbolito        1 Crustose algae              5        3.97
 # 2 Dominical El Arbolito        1 Non-living                 45.6     16.7
 # 3 Dominical El Arbolito        1 Macroalgae                 46.8     16.8
 # 4 Dominical El Arbolito        1 Sessile invertebrates       2.68     2.10
 # 5 Dominical El Arbolito        2 Crustose algae              8.3      6.78
 # 6 Dominical El Arbolito        2 Non-living                 10        0
 # 7 Dominical El Arbolito        2 Macroalgae                 79.0      6.34
 # 8 Dominical El Arbolito        2 Sessile invertebrates       2.75     2.27
 # 9 Dominical El Arbolito        3 Crustose algae             24.2     29.8
# 10 Dominical El Arbolito        3 Non-living                  8.6     11.1
# # … with 280 more rows
We can see that we now have Transect means and standard deviations for each high-level taxonomic group. A few things to note in the steps of this calculation:
- We pipe (%>%) the quadrat summaries onward to then summarise by transect. As tibbles have a “memory” of the previous grouping, sometimes we must ungroup() or the tibble will try to regroup by the previous group_by() statement. You can check this by commenting out the second group_by() statement. As we are conforming to higher-level groupings for the next summary (i.e. Locality, Site, Transect), this is not necessary in this case.
- We first summarise at the level of Quadrats and then pipe that result to the next level of aggregation. As we have two other levels (potentially) to aggregate, we may want to separate these out as distinct steps. As the next part of this module will use the higher levels to visualise the data, we can create a single summarised object which we can use to build additional aggregations “on-the-fly” (see the sketch below).
- We could also skip the Transect level and obtain a mean() and sd() directly at the site level (assuming that the quadrats are independent). As Transect can sometimes be associated with reef position, depth, or other nested structure, it makes sense to retain this level of variation. However, it can make it more difficult to visualise the different levels of variation (e.g. the next level of aggregation at the site will be a mean() of means with its own standard deviation!).
Now that we have a good handle on aggregating and summarising data, it is a good time to put these into a visual form.
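As a bridge to the plotting examples, here is a minimal sketch (assuming the pipeline above) that saves the transect-level summary to an object and then aggregates it “on-the-fly” to the site level; the object name cover_transect is just an illustrative choice:
  # save the transect-level summary to an object
    cover_transect <-
      percent_cover_acosa %>%
        group_by(Locality, Site, Transect, Quadrat, Grouping) %>%
        summarise(Value = Value %>% sum(na.rm = TRUE)) %>%
        group_by(Locality, Site, Transect, Grouping) %>%
        summarise(Value_mean = Value %>% mean(na.rm = TRUE),
                  Value_sd   = Value %>%   sd(na.rm = TRUE)) %>%
        ungroup()

  # aggregate to the site level: a mean of the transect means
    cover_transect %>%
      group_by(Locality, Site, Grouping) %>%
      summarise(Site_mean = Value_mean %>% mean(na.rm = TRUE),
                Site_sd   = Value_mean %>%   sd(na.rm = TRUE))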