Being able to visualise status & trends of coral reefs is indispensable for a monitoring programme. In the first instance, visualising data serves as an important tool for identifying potential errors in the data (e.g. if values in a scatter plot are beyond expected ranges). Secondly, converting raw data into a visual summary is often necessary to communicate monitoring results to managers, decision-makers and the broader community.
For this module, we will start by building skills for aggregating and summarising data (e.g. grouping replicates at the site level, calculating variation around mean values) and then introduce {ggplot2}, the "grammar of graphics".
The code for the following examples can be found here:
analysis_code/examples/visualising/plot_sessiles_dat.acosa.R
To begin, we will load the percent cover data object from our previous example.
##
## 1. Set up
##
## -- load percent cover data -- ##
# point to data locale
data_locale <- "data_intermediate/examples/formatting/"
# call to data
load(paste0(data_locale, "sessiles_dat.acosa.rda"))
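If you want to confirm exactly what the .rda file added to your workspace, one option (not part of the original script, just plain base R) is to capture the value that load() returns invisibly:
# capture the names of the objects the .rda file restores
loaded_objects <- load(paste0(data_locale, "sessiles_dat.acosa.rda"))

# print those names
loaded_objects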
Note that the tibble form of a data table only prints a subset of the columns:
# have a look
percent_cover_acosa
# A tibble: 28,710 x 19
`# sitio` `Conservation Ar… Locality Site Diver Transect Date Keypuncher `Keypunch date`
<dbl> <chr> <chr> <chr> <chr> <dbl> <date> <chr> <date>
1 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
2 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
3 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
4 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
5 1 ACOSA Dominical El A… Caro… 2 2017-02-14 Carolina … 2017-02-17
6 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
7 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
8 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
9 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
10 1 ACOSA Dominical El A… Caro… 1 2017-02-14 Carolina … 2017-02-17
# … with 28,700 more rows, and 10 more variables: Depth <dbl>, Depth category <fct>, Code <chr>,
# Average <dbl>, Percentage <dbl>, Category <chr>, Grouping <fct>, Quadrat <chr>, Value <dbl>,
# Dataset_id <chr>
We can use the convenience function quickview() from integrate.R to inspect the first 3 rows of data across all columns:
# have a quick look
percent_cover_acosa %>% quickview()
X..sitio Conservation.Area Locality Site Diver Transect Date
1 1 ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez 1 2017-02-14
2 1 ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez 1 2017-02-14
3 1 ACOSA Dominical El Arbolito Carolina Sheridan Rodríguez 1 2017-02-14
Keypuncher Keypunch.date Depth Depth.category Code Average Percentage Category
1 Carolina Sheridan Rodríguez 2017-02-17 6.5 Shallow ARENA 45.550 45.550 Arena
2 Carolina Sheridan Rodríguez 2017-02-17 6.5 Shallow TURF 44.875 44.875 Turf
3 Carolina Sheridan Rodríguez 2017-02-17 6.5 Shallow Esp 1.850 1.850 Esponja
Grouping Quadrat Value Dataset_id
1 Non-living 1 55.5 gulfo_dulce_2006-2014
2 Macroalgae 1 40.0 gulfo_dulce_2006-2014
3 Sessile invertebrates 1 0.5 gulfo_dulce_2006-2014
Note that in doing this, we are losing some of the special characters (e.g. # and spaces) in column names that are preserved in the tibble() format. As we are just printing to the screen and not modifying the object itself, those features are retained in the underlying data.
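As a small aside (these calls are not part of the example script), base R can show the exact column names, and columns whose names contain spaces or symbols can be referenced with backticks:
# exact column names, special characters included
percent_cover_acosa %>% names()

# columns with spaces or symbols are referenced with backticks
percent_cover_acosa$`# sitio` %>% unique()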
In this data set, there are a number of different levels that we might want to aggregate and/or summarise. For example, when managing factors we saw that there are a number of different benthic categories we might want to aggregate (e.g. combine all living coral or macroalgae into a single group) before summarising (e.g. taking the average per site).
We can get familiar with the different types of levels by inspecting individual columns:
# get list of sites
percent_cover_acosa$Site %>% unique() %>% sort()
[1] "Bajo Mauren" "Burbujas" "Cueva del Tiburón" "El Arbolito" "El Jardín"
[6] "Isla Ballena" "Islotes" "La Catarata" "La Viuda" "Mogos"
[11] "Nicuesa" "Punta Gallardo" "San Jocesito" "San Pedrillo" "Sándalo 1"
[16] "Sándalo 2" "Tómbolo noreste 1" "Tómbolo noreste 2" "Tómbolo sur" "Tres Hermanas"
We can also see that there are a number of "Localities", which we can inspect in the same way:
# get list of localities
percent_cover_acosa$Locality %>% unique() %>% sort()
[1] "Dominical" "Golfo Dulce" "Isla del Caño" "Peninsula de Osa" "PNMarino Ballena"
Note that in these examples, we are getting the unique() values of a column and then sorting them by piping the values to sort().
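Along the same lines, a quick illustrative way (assuming {dplyr} is loaded; not part of the example script) to see which sites belong to which locality is to combine distinct() and arrange():
# unique locality-site combinations, sorted
percent_cover_acosa %>%
  distinct(Locality, Site) %>%
  arrange(Locality, Site)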
Another useful function is table(), which lets us look at a number of different levels in a data object at once:
# get levels of replication
percent_cover_acosa %>% with(., table(Site, Transect))
Transect
Site 1 2 3
Bajo Mauren 390 400 440
Burbujas 390 0 0
Cueva del Tiburón 380 360 390
El Arbolito 140 160 150
El Jardín 320 380 360
Isla Ballena 170 170 380
Islotes 280 230 240
La Catarata 230 270 250
La Viuda 380 360 360
Mogos 210 220 250
Nicuesa 200 170 210
Punta Gallardo 240 180 180
San Jocesito 280 300 280
San Pedrillo 380 380 380
Sándalo 1 150 120 150
Sándalo 2 230 60 110
Tómbolo noreste 1 270 170 170
Tómbolo noreste 2 330 0 240
Tómbolo sur 130 130 570
Tres Hermanas 390 390 370
This table() is showing the number of rows for each Transect and Site combination. It can be a quick way to inspect whether there is balanced replication or the number of cases we would expect (e.g. in point intercept data, we should be seeing the same number of points at each level of replication).
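For point intercept data, one illustrative way (not part of the original script) to check that every quadrat carries the expected number of points is to count rows per quadrat and then tabulate those counts:
# rows per site x transect x quadrat combination
percent_cover_acosa %>%
  count(Site, Transect, Quadrat, name = "points") %>%
  count(points, name = "n_quadrats")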
Once we have a better idea of what we might want to aggregate and summarise in our data, we can begin the process. For this example, we are going to:
- aggregate the detailed benthic codes (i.e. Code) to a coarser grouping (i.e. Grouping)
- take the sum() of the individual quadrats
- calculate the mean() and standard deviation sd() for each transect
This sounds like a lot, but in R it is a fairly straightforward operation, which we will see in a moment.
The equivalent of this type of data summary would be done in a pivot table in Excel, where we would simply select the data and “drag & drop” the column names we want to summarise and select the summary function for the different levels.
To start, we first need to sum() the Value entries from each Quadrat for the coarse taxonomic groupings:
# summarise by coarser taxa grouping
percent_cover_acosa %>%
group_by(Locality,
Site,
Transect,
Quadrat,
Grouping) %>%
summarise(Value = Value %>% sum(na.rm = TRUE))
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect', 'Quadrat' (override with `.groups` argument)
# # A tibble: 2,900 x 6
# # Groups: Locality, Site, Transect, Quadrat [580]
# Locality Site Transect Quadrat Grouping Value
# <chr> <chr> <dbl> <chr> <fct> <dbl>
# 1 Dominical El Arbolito 1 1 Crustose algae 4
# 2 Dominical El Arbolito 1 1 Non-living 55.5
# 3 Dominical El Arbolito 1 1 Macroalgae 40
# 4 Dominical El Arbolito 1 1 Sessile invertebrates 0.5
# 5 Dominical El Arbolito 1 10 Crustose algae 1
# 6 Dominical El Arbolito 1 10 Non-living 10
# 7 Dominical El Arbolito 1 10 Macroalgae 86.5
# 8 Dominical El Arbolito 1 10 Sessile invertebrates 2.5
# 9 Dominical El Arbolito 1 2 Crustose algae 3
# 10 Dominical El Arbolito 1 2 Non-living 56
Note that the group_by() statement allows us to set the different levels for the data summary. We use the pipe %>% to send the grouped data to the summarise() command, where we are creating the column Value with the sum() of the values. In Excel, this is like a pivot table function.
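If the "regrouping output" message becomes distracting, newer versions of {dplyr} (1.0.0 or later) let you state what should happen to the grouping via the .groups argument of summarise(); for example:
# same quadrat summary, explicitly dropping the last grouping level
percent_cover_acosa %>%
  group_by(Locality,
           Site,
           Transect,
           Quadrat,
           Grouping) %>%
  summarise(Value = Value %>% sum(na.rm = TRUE),
            .groups = "drop_last")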
A few things to note in the output above:
- Quadrat is a character() value, which is why the sequence of quadrats runs 1, 10, … This is generally good practice: if there are true character values (e.g. "A1") and we set them to as.numeric(), the function will replace those character values with NA (a tiny demonstration follows this list).
- There are <NA> values in the taxonomic groupings (i.e. Grouping), which means that we have imperfect matches in our concordance file. We will learn how to clean this up as part of the Homework.
- We have not created a new object; we are just %>%ing the output to the console.
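To see the coercion behaviour mentioned in the first point, here is a tiny demonstration (not part of the example script):
# non-numeric strings become NA, with a coercion warning
as.numeric(c("1", "2", "10", "A1"))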
Now that we have the total percent cover values for individual quadrats at transects, sites and localities, the next step is to summarise at the Transect level, like this:
# summarise by transect
percent_cover_acosa %>%
group_by(Locality,
Site,
Transect,
Quadrat,
Grouping) %>%
summarise(Value = Value %>% sum(na.rm = TRUE)) %>%
group_by(Locality,
Site,
Transect,
Grouping) %>%
summarise(Value_mean = Value %>% mean(na.rm = TRUE),
Value_sd = Value %>% sd(na.rm = TRUE))
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect', 'Quadrat' (override with `.groups` argument)
# `summarise()` regrouping output by 'Locality', 'Site', 'Transect' (override with `.groups` argument)
# # A tibble: 290 x 6
# # Groups: Locality, Site, Transect [58]
# Locality Site Transect Grouping Value_mean Value_sd
# <chr> <chr> <dbl> <fct> <dbl> <dbl>
# 1 Dominical El Arbolito 1 Crustose algae 5 3.97
# 2 Dominical El Arbolito 1 Non-living 45.6 16.7
# 3 Dominical El Arbolito 1 Macroalgae 46.8 16.8
# 4 Dominical El Arbolito 1 Sessile invertebrates 2.68 2.10
# 5 Dominical El Arbolito 2 Crustose algae 8.3 6.78
# 6 Dominical El Arbolito 2 Non-living 10 0
# 7 Dominical El Arbolito 2 Macroalgae 79.0 6.34
# 8 Dominical El Arbolito 2 Sessile invertebrates 2.75 2.27
# 9 Dominical El Arbolito 3 Crustose algae 24.2 29.8
# 10 Dominical El Arbolito 3 Non-living 8.6 11.1
# # … with 280 more rows
We can see that we now have Transect means and standard deviations for each high-level taxonomic group. A few things to note in the steps of this calculation:
- We pipe (%>%) the quadrat summaries onward to then summarise by transect. As tibbles have a "memory" of the previous grouping, sometimes we must ungroup() or the tibble will try to regroup by the previous group_by() statement. You can check this by commenting out the second group_by() statement. As we are conforming to higher-level groupings for the next summary (i.e. Locality, Site, Transect), this is not necessary in this case.
- We sum within Quadrats and then pipe that result into the next level of aggregation. As we have two other levels (potentially) to aggregate, we may want to separate these out as individual steps. As the next part of this module will use the higher levels to visualise the data, we can summarise a single object which we can use to create additional aggregations "on-the-fly".
- We could have skipped the Transect level and obtained a mean() and sd() directly at the site level (assuming that the quadrats are independent). As Transect can sometimes be associated with reef position, depth, or other nested structure, it makes sense to retain this level of variation. However, it can make it more difficult to visualise the different levels of variation (e.g. the next level of aggregation, at the site, will be a mean() of means with its own standard deviation!); a sketch of that step follows this list.
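To make that last point concrete, here is a minimal sketch (not part of the original workflow; the object name transect_summary is ours) of saving the transect summary and then taking the mean of the transect means at the site level:
## -- site level: a mean of means (illustrative sketch) -- ##
# save the transect-level summary to an object
transect_summary <-
  percent_cover_acosa %>%
    group_by(Locality,
             Site,
             Transect,
             Quadrat,
             Grouping) %>%
    summarise(Value = Value %>% sum(na.rm = TRUE)) %>%
    group_by(Locality,
             Site,
             Transect,
             Grouping) %>%
    summarise(Value_mean = Value %>% mean(na.rm = TRUE),
              Value_sd   = Value %>% sd(na.rm = TRUE))

# aggregate to the site level: the mean of the transect means,
# with its own standard deviation across transects
transect_summary %>%
  group_by(Locality,
           Site,
           Grouping) %>%
  summarise(Site_mean = Value_mean %>% mean(na.rm = TRUE),
            Site_sd   = Value_mean %>% sd(na.rm = TRUE))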
Now that we have a good handle on aggregating and summarising data, it is a good time to put these into a visual form.