Getting familiar with R and git takes practice. For the DSaRR Module Homework, we set out some routine tasks for using git along with some basic R exercises.
This wiki page provides some additional comments and results of the DSaRR Module Homework.
git
The intent of the first set of exercises was to get course participants used to making changes to the repository, adding them to the staging area in git, committing, and pushing & pulling to synchronise with Github.
To complete the Homework exercises, we need to make a copy of the homework document:
## -- create local copy of homework script -- ##
# Instructions:
# * 1.1. Copy homework script to your `participants_code` folder:
# copy `exercise_code/homework_data_standards_reproducible_research.R` to
# the `exercise_code` folder in `participants_code/`
# * 1.2. In Gitbash or Git interface with RStudio:
# git add -A
# git status ## -- this verifies local changes in staging area -- ##
# git commit -m 'adding homework to exercise code'
# git pull ## -- this ensures your local copy is up-to-date -- ##
# git push ## -- this uploads your changes to github -- ##
Next, it helps to set setwd() so that it points to your personal copy. As course participants will have made their copy of the repository in different locations, it is useful to keep a record of that location. This means that when opening R, you can point to the project directory automatically:
## -- modify `integrate.R` -- ##
# Instructions:
# * 2.1. Modify line #59 to align to local copy of repository
# from Gitbash, navigate to the project repository using the commands
# `pwd` ## -- this identifies the 'present working directory' -- ##
# `cd` ## -- this 'changes directory'; users will need to type the path -- ##
Adding and committing this change documents when you set the working directory:
# * 2.2. In Gitbash or Git interface with RStudio:
# git add -A
# git status ## -- this verifies local changes in staging area -- ##
# git commit -m 'modifying working directory'
# git pull ## -- this ensures your local copy is up-to-date -- ##
# git push ## -- this uploads your changes to github -- ##
# Participants should copy the output from Gitbash to this script here and
# "comment" the text. This is done by selecting the text and selecting
# 'comment out' from the Edit menu (or using the command + ' keys)
# (Code should have a `#` symbol in front of the text)
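As a minimal sketch of step 2.1, the snippet below checks that a project folder exists before calling setwd(). The name project_dir and its path are hypothetical: a temporary folder stands in for your local copy of the repository, so replace it with the location reported by `pwd`.

```r
# Hypothetical location of the local repository copy; a temporary folder
# stands in for it here so that the example runs on any machine
project_dir <- file.path(tempdir(), "my_repository_copy")
dir.create(project_dir, showWarnings = FALSE)

# Only change directory if the folder exists; otherwise stop with a clear message
if (dir.exists(project_dir)) {
  setwd(project_dir)
} else {
  stop("Project directory not found: ", project_dir)
}

# Confirm where R is now pointing
getwd()
```

Checking with dir.exists() first gives a readable error when the path has been typed incorrectly, rather than a failure further down the script.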
Lastly, we need our own copy of the create_reef_data.R script to modify for the exercises:
## -- create local copy of reef data creation code -- ##
# Instructions:
# * 3.1. Copy the `create_reef_data.R` script to your `participants_code` folder:
# copy `creation_code/examples/standards/create_reef_data.R` to the
# `creation_code` folder in your participants_code folder
# * 3.2. In Gitbash or Git interface with RStudio:
# git add -A
# git status ## -- this verifies local changes in staging area -- ##
# git commit -m 'adding local copy of reef data creation code'
# git pull ## -- this ensures your local copy is up-to-date -- ##
# git push ## -- this uploads your changes to github -- ##
At this point, you should have in your participants_code folder:
* integrate.R, which includes the location of your local copy of the data repository in setwd()
* the homework_data_standards_reproducible_research.R script
* the create_reef_data.R script
All of these changes should be represented in the change history of the repository. To check how well you (and everyone else!) have done, just type:
gitk &
in Gitbash or a Terminal window.
Further instructions on accessing history from RStudio to come shortly
R Exercises
Similar to the git exercises above, the exercises below set out some examples of how to work with data objects, filter, summarise, and break up a complex coding task in R. The criteria for evaluating these exercises included the amount of documentation and the use of the pipe operator %>% for clarifying the flow of information in multi-step data transformation & summary steps.
One strategy for simplifying the construction of complex data objects, and for working with them, is to build them from separate, individual objects:
## -- re-create `reef_data` from individual list objects --##
# Instructions:
# * 4.1. Copy lines #33-63 from `create_reef_data.R` and paste it below line #65
# We will modify this code for this exercise
# * 4.2. Create 3 separate objects:
# `sites` that contains the list of sites repeated for the number of quadrates
# `quadrate` that contains the repeated quadrate numbers
# `percent_cover` which contains the relative percent covers per site
# * 4.3. Create data frame from individual objects
# Re-create the `reef_data` using the 3 data objects
# (this should look similar to the `reef_data` object)
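The general pattern behind exercises 4.2 and 4.3 can be sketched on a small toy example. The two site names, the number of quadrates, and the uniform random cover values are illustrative assumptions, not the values used in `create_reef_data.R`; the object is named toy_reef_data to keep it separate from the course data.

```r
# Illustrative inputs (assumptions, not the course values)
site_list <- c("site a", "site b")
n_quads <- 3

# Fix the seed so the random values are reproducible
set.seed(1)

# Build each column as its own object
sites <- rep(site_list, each = n_quads)                      # a a a b b b
quadrate <- rep(seq_len(n_quads), times = length(site_list)) # 1 2 3 1 2 3
percent_cover <- runif(length(sites), min = 0.2, max = 0.9)

# Check the lengths match before combining
stopifnot(length(sites) == length(quadrate),
          length(sites) == length(percent_cover))

# Combine the individual objects into one data frame
toy_reef_data <- data.frame(sites, quadrate, percent_cover)
toy_reef_data
```

Keeping the columns as separate objects makes it easy to inspect or change one of them (for example, the number of quadrates) without touching the data.frame() call.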
Indexing can be a powerful tool for programming with data objects. For data cleaning and visualisation of coral reef monitoring data, there are often multiple, repeated procedures that can be simplified by using indexing.
These exercises provide practice in filtering different sites and replicates within data objects, for example:
# * 5.1. Use bracket `[]` indexing to select the quadrate data from "coral garden"
# Copy your code & output from the R Console below:
# > reef_data[reef_data$sites == "coral garden" , ]
# sites quadrate percent_cover
# 1 coral garden 1 0.4076133
# 2 coral garden 2 0.5414949
# 3 coral garden 3 0.6517576
# 4 coral garden 4 0.3695736
# 5 coral garden 5 0.6391566
For selecting multiple entries within a data object, one needs to use the %in% operator. For example, for selecting the first three quadrates from each site:
# * 5.2. Subset quadrates 1-3 from each site
# (Hint: to select using multiple entries one must use `%in%` instead of `==`)
# Copy your code & output from the R Console below:
# > reef_data[reef_data$quadrate %in% 1:3, ]
# sites quadrate percent_cover
# 1 coral garden 1 0.4076133
# 2 coral garden 2 0.5414949
# 3 coral garden 3 0.6517576
# 6 kanamai 1 0.6537655
# 7 kanamai 2 0.6606772
# 8 kanamai 3 0.7895763
# 11 kasa 1 0.3452651
# 12 kasa 2 0.2909224
# 13 kasa 3 0.3492621
# 16 likoni 1 0.7447100
# 17 likoni 2 0.6706833
# 18 likoni 3 0.7056427
# 21 nyali 1 0.6294540
# 22 nyali 2 0.5662192
# 23 nyali 3 0.6945901
# 26 ras iwatine 1 0.5654706
# 27 ras iwatine 2 0.8593679
# 28 ras iwatine 3 0.8364104
# 31 shark point 1 0.9269451
# 32 shark point 2 0.9178892
# 33 shark point 3 0.8948935
# 36 shelly 1 0.6557776
# 37 shelly 2 0.7196681
# 38 shelly 3 0.5841085
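To see why the hint in exercise 5.2 recommends `%in%` over `==`, the sketch below compares the two on a toy vector of quadrate numbers. The object names (quadrate, wanted, by_equals, by_in) are illustrative only.

```r
# Quadrate numbers for two hypothetical sites with five quadrates each
quadrate <- rep(1:5, times = 2)
wanted <- c(1, 3)

# `==` recycles `wanted` element by element (1, 3, 1, 3, ...), so a value
# only matches when it happens to line up with the recycled position
by_equals <- quadrate == wanted

# `%in%` asks, for every element, whether it appears anywhere in `wanted`
by_in <- quadrate %in% wanted

sum(by_equals)  # 2 matches: an incomplete selection
sum(by_in)      # 4 matches: quadrates 1 and 3 from both sites
```

Note that `==` raises no warning here, because the length of quadrate is a multiple of the length of wanted, so the incomplete selection can easily go unnoticed.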
A useful way of filtering multiple entries is to create a list of the items of interest. Separating this list from the data operation (in this case, filtering quadrates) means that one can modify the list without having to dig too deeply into the code. This approach will become more useful as we tackle more difficult data wrangling problems, and it gives us better control of data queries and data transformations.
For example, after selecting the odd-numbered quadrates with the qs_of_interest list, we can quickly change the list to c(2, 4, 5) without having to modify the indexing operation:
# * 5.3. Create a similar subset using a list object, e.g.:
# qs_of_interest <- c(1, 3, 5)
# Copy your code & output from the R Console below:
# > reef_data[reef_data$quadrate %in% qs_of_interest, ]
# sites quadrate percent_cover
# 1 coral garden 1 0.4076133
# 3 coral garden 3 0.6517576
# 5 coral garden 5 0.6391566
# 6 kanamai 1 0.6537655
# 8 kanamai 3 0.7895763
# 10 kanamai 5 0.8084211
# 11 kasa 1 0.3452651
# 13 kasa 3 0.3492621
# 15 kasa 5 0.4713814
# 16 likoni 1 0.7447100
# 18 likoni 3 0.7056427
# 20 likoni 5 0.8029196
# 21 nyali 1 0.6294540
# 23 nyali 3 0.6945901
# 25 nyali 5 0.6457971
# 26 ras iwatine 1 0.5654706
# 28 ras iwatine 3 0.8364104
# 30 ras iwatine 5 0.5043155
# 31 shark point 1 0.9269451
# 33 shark point 3 0.8948935
# 35 shark point 5 0.6947272
# 36 shelly 1 0.6557776
# 38 shelly 3 0.5841085
# 40 shelly 5 0.6652558
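One way to make the separation between the list and the operation explicit is to wrap the indexing in a small helper, so that only the list changes between calls. The helper filter_quadrates() and the toy data below are hypothetical and are not part of the course scripts.

```r
# Toy data: two hypothetical sites with five quadrates each
toy_reef_data <- data.frame(
  sites = rep(c("site a", "site b"), each = 5),
  quadrate = rep(1:5, times = 2)
)

# Hypothetical helper: the indexing operation is written once
filter_quadrates <- function(data, qs) {
  data[data$quadrate %in% qs, ]
}

# Only the list of interest changes between the two queries
qs_of_interest <- c(1, 3, 5)
odd_quadrates <- filter_quadrates(toy_reef_data, qs_of_interest)

qs_of_interest <- c(2, 4)
even_quadrates <- filter_quadrates(toy_reef_data, qs_of_interest)

nrow(odd_quadrates)   # 3 quadrates x 2 sites
nrow(even_quadrates)  # 2 quadrates x 2 sites
```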
Using pipes can take some getting used to, but it is well worth the effort. Pipes (%>%) make the flow of information much more intuitive than nested base R operations.
For example, the logical sequence of this nested set of commands can be a bit difficult to follow (particularly because the argument to round() that sets the number of decimal places is far removed from the function name itself):
round(mean(sqrt(abs(rnorm(18)))), 3)
A more logical representation of this using pipes (%>%) would look like this:
rnorm(18) %>% abs() %>% sqrt() %>% mean() %>% round(3)
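A quick way to convince yourself that the two forms are equivalent is to run both with the same random seed and compare the results. This sketch swaps in R's native pipe `|>` (available from R 4.1) in place of `%>%` so that it runs without loading any package; for a simple chain like this one the two pipes behave the same way, but the substitution is an assumption of the example, not the course convention.

```r
# Nested form: read from the inside out
set.seed(42)
nested <- round(mean(sqrt(abs(rnorm(18)))), 3)

# Piped form: read from left to right, using the native pipe
set.seed(42)
piped <- rnorm(18) |> abs() |> sqrt() |> mean() |> round(3)

# Both forms apply the same steps to the same random numbers
identical(nested, piped)
```

Resetting the seed before each version matters: without it, rnorm() would draw different numbers and the comparison would be meaningless.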
For these exercises, we can practise combining our indexing skills with pipes:
## -- using pipes -- ##
# * 6.1. Using the `[]` indexing for "coral garden", get the mean percent cover
# Copy your code & output from the R Console below:
# > reef_data[reef_data$sites == "coral garden" , ]$percent_cover %>% mean()
# [1] 0.5219192
# * 6.2. Use the function `round()` to round the percent cover values to 3 digits
# (Hint: the `%>%` operator can be used in the creation of the `percent_cover`
# object in exercise 4.2 above)
# Copy your code & output from the R Console below:
# > reef_data[reef_data$sites == "coral garden" , ]$percent_cover %>% mean() %>% round(3)
# [1] 0.522
Pipes can also be useful for quick visualisations and other data summaries. For example, a simple boxplot() can provide a useful visualisation of the distribution of coral cover across multiple sites:
## -- base visualisation -- ##
# * 7.1. Use the function `boxplot()` to examine the variation of `percent_cover`
# for each site
# (Hint: to obtain help on the use of `boxplot` type `?boxplot` in the R Console)
# Copy your code below to examine the graphical output:
# # plot percent cover
# reef_data %>% boxplot(percent_cover ~ sites, data = .)
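To look at the numbers behind a boxplot, boxplot() can be called with plot = FALSE, which returns the summary statistics instead of drawing the figure. The toy data below (two hypothetical sites, five quadrates each, normally distributed cover values) are an assumption for illustration and are not the course `reef_data`.

```r
# Toy data: two hypothetical sites with five quadrates each
set.seed(10)
toy_reef_data <- data.frame(
  sites = rep(c("site a", "site b"), each = 5),
  percent_cover = c(rnorm(5, mean = 0.6, sd = 0.1),
                    rnorm(5, mean = 0.4, sd = 0.1))
)

# Same formula interface as in exercise 7.1, but without drawing the plot
bp <- boxplot(percent_cover ~ sites, data = toy_reef_data, plot = FALSE)

bp$names  # group labels, one per site
bp$n      # number of observations per site
bp$stats  # per site: lower whisker, lower hinge, median, upper hinge, upper whisker
```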
For this next exercise, we will combine our skills of creating data objects from lists, indexing, pipes and basic visuals to include information on coral genera in our analysis. Course participants will find that breaking down this task into separate steps is helpful.
Course participants should also be aware that in R there are often a number of different ways to arrive at the same result. So, in this sense, there is no single correct answer for these exercises. It is more the coding logic, documentation and versatility of the code that we are evaluating.
For example, we first create a list of genera and the relative percent cover as individual data objects:
##
## 3. Example `reef_data` for Coral Genera
##
## -- create percent cover for multiple genera -- ##
# * 8.1. Copy lines #33-63 of `create_reef_data.R` below and modify it to include
# multiple genera. Use the relative percent cover values as a basis for
# Pocillopora = 50% of relative percent cover values for each site
# Pavona = 30% of percent cover values
# Acropora = 20% of percent cover values
# The general approach for this exercise is:
# i. create individual `relative_cover` values for each genus
# ii. add an additional column to the `data.frame()` called `genus`
# iii. adjust the `rep()` values to include the number of genera
# (Hint: check the `length()` of individual objects to make sure they match)
# Copy your code below or keep in your copy of `create_reef_data.R`:
# set relative cover per site
cvr_cor <- 0.60
cvr_kan <- 0.65
cvr_kas <- 0.45
cvr_lik <- 0.78
cvr_nya <- 0.73
cvr_ras <- 0.68
cvr_sha <- 0.76
cvr_she <- 0.58
# set list of genera
genera <-
c("Pocillopora",
"Pavona",
"Acropora")
# set share of percent cover for each genus
poc <- 0.5
pav <- 0.3
acr <- 0.2
This means that if we want to adjust the relative percent cover values or change the list of genera, we can easily do that outside of the core creation code for the object.
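As a rough check on these settings, the expected mean cover for each genus at a site is the site's relative cover multiplied by the genus share. The sketch below tabulates this for two of the sites above using outer(); the named vectors site_cover and genus_share and the matrix expected_cover are illustrative names, not objects from the course scripts.

```r
# Relative cover for two of the sites above (kasa and likoni)
site_cover <- c(kasa = 0.45, likoni = 0.78)

# Share of each genus, as set above
genus_share <- c(Pocillopora = 0.5, Pavona = 0.3, Acropora = 0.2)

# The shares should add up to 1 (compared with a tolerance, not `==`)
stopifnot(isTRUE(all.equal(sum(genus_share), 1)))

# Expected mean cover: one row per site, one column per genus
expected_cover <- outer(site_cover, genus_share)
expected_cover

# Each row adds back up to the site's relative cover
rowSums(expected_cover)
```

Because the shares sum to 1, the per-genus means add back up to the site total, which is a useful sanity check after changing either the shares or the list of genera.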
We next set the random number seed to ensure we create the same numbers each time the code is run. Creating individual objects allows flexibility in subsetting individual percent cover values for the sites in our site_list:
# set seed for reproducibility
set.seed(3)
# create individual percent cover objects
percent_cover_poc <-
c(rnorm(5, cvr_cor * poc, (cvr_cor * poc / 3.0)),
rnorm(5, cvr_kan * poc, (cvr_kan * poc / 5.2)),
rnorm(5, cvr_kas * poc, (cvr_kas * poc / 3.2)),
rnorm(5, cvr_lik * poc, (cvr_lik * poc / 6.8)),
rnorm(5, cvr_nya * poc, (cvr_nya * poc / 4.2)),
rnorm(5, cvr_ras * poc, (cvr_ras * poc / 4.4)),
rnorm(5, cvr_sha * poc, (cvr_sha * poc / 4.1)),
rnorm(5, cvr_she * poc, (cvr_she * poc / 5.4)))
percent_cover_pov <-
c(rnorm(5, cvr_cor * pav, (cvr_cor * pav / 3.0)),
rnorm(5, cvr_kan * pav, (cvr_kan * pav / 5.2)),
rnorm(5, cvr_kas * pav, (cvr_kas * pav / 3.2)),
rnorm(5, cvr_lik * pav, (cvr_lik * pav / 6.8)),
rnorm(5, cvr_nya * pav, (cvr_nya * pav / 4.2)),
rnorm(5, cvr_ras * pav, (cvr_ras * pav / 4.4)),
rnorm(5, cvr_sha * pav, (cvr_sha * pav / 4.1)),
rnorm(5, cvr_she * pav, (cvr_she * pav / 5.4)))
percent_cover_acr <-
c(rnorm(5, cvr_cor * acr, (cvr_cor * acr / 3.0)),
rnorm(5, cvr_kan * acr, (cvr_kan * acr / 5.2)),
rnorm(5, cvr_kas * acr, (cvr_kas * acr / 3.2)),
rnorm(5, cvr_lik * acr, (cvr_lik * acr / 6.8)),
rnorm(5, cvr_nya * acr, (cvr_nya * acr / 4.2)),
rnorm(5, cvr_ras * acr, (cvr_ras * acr / 4.4)),
rnorm(5, cvr_sha * acr, (cvr_sha * acr / 4.1)),
rnorm(5, cvr_she * acr, (cvr_she * acr / 5.4)))
We now fit everything together, repeating the quadrate numbers (n_quads per site) by the number of genera and the number of sites in our site_list to ensure we have the correct labelling for sites, quadrates, coral genera and percent cover values:
# generate data
reef_data <-
data.frame(
sites = rep(site_list, each = n_quads * length(genera)),
quadrate = rep(seq_len(n_quads), times = length(site_list) * length(genera)),
genus = rep(genera, each = n_quads) %>% rep(times = length(site_list)),
percent_cover = c(percent_cover_poc[ c(1:5) ], ## -- 1:5 "coral garden " -- ##
percent_cover_pov[ c(1:5) ],
percent_cover_acr[ c(1:5) ],
percent_cover_poc[ c(6:10) ], ## -- 6:10 "kanamai" -- ##
percent_cover_pov[ c(6:10) ],
percent_cover_acr[ c(6:10) ],
percent_cover_poc[ c(11:15) ], ## -- 11:15 "kasa" -- ##
percent_cover_pov[ c(11:15) ],
percent_cover_acr[ c(11:15) ],
percent_cover_poc[ c(16:20) ], ## -- 16:20 "likoni" -- ##
percent_cover_pov[ c(16:20) ],
percent_cover_acr[ c(16:20) ],
percent_cover_poc[ c(21:25) ], ## -- 21:25 "nyali" -- ##
percent_cover_pov[ c(21:25) ],
percent_cover_acr[ c(21:25) ],
percent_cover_poc[ c(26:30) ], ## -- 26:30 "ras iwatine" -- ##
percent_cover_pov[ c(26:30) ],
percent_cover_acr[ c(26:30) ],
percent_cover_poc[ c(31:35) ], ## -- 31:35 "shark point" -- ##
percent_cover_pov[ c(31:35) ],
percent_cover_acr[ c(31:35) ],
percent_cover_poc[ c(36:40) ], ## -- 36:40 "shelly" -- ##
percent_cover_pov[ c(36:40) ],
percent_cover_acr[ c(36:40) ])
)
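The way the rep() calls line up the labels can be checked on a smaller example. The sketch below uses two hypothetical sites and two quadrates per site (assumed values, chosen to keep the output short) and confirms that every site and genus combination ends up with exactly n_quads rows.

```r
# Small illustrative inputs (assumptions, not the course values)
site_list <- c("site a", "site b")
genera <- c("Pocillopora", "Pavona", "Acropora")
n_quads <- 2

# Same labelling pattern as above: sites vary slowest, quadrates fastest
labels <- data.frame(
  sites = rep(site_list, each = n_quads * length(genera)),
  quadrate = rep(seq_len(n_quads), times = length(site_list) * length(genera)),
  genus = rep(rep(genera, each = n_quads), times = length(site_list))
)
labels

# Every site x genus combination should appear once per quadrate
combos <- table(labels$sites, labels$genus)
combos
all(combos == n_quads)
```

A table like this is a quick way to catch a misplaced `each` or `times` argument before the percent cover values are attached.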
As mentioned above, indexing can provide a useful way of conducting repeated data tasks. In this example, instead of copying and pasting the boxplot() code 8 times (i.e. once per site), we simply cycle through each site in a for() loop.
We need to adjust the graphics window parameters (i.e. par()) to include 2 rows and 4 columns (i.e. mfrow = c(2, 4)) and include a special colour palette to fill the individual coral genera “boxes”:
# * 8.2 Visualise the percent cover by genera for each site
# Use `[]` indexing to select sites and `boxplot()` as in exercise 7.1
# set colour palette
c_palette <-
wesanderson::wes_palette(8,
name = "Cavalcanti1",
type = "continuous")
# set multiple figures in output
par(mfrow = c(2, 4))
# loop to generate figures
for (i in seq_along(site_list)) {
# plot individual sites
boxplot(percent_cover ~ genus,
data = reef_data,
subset = sites == site_list[i],
col = c_palette[c(1, 2, 8)],
main = paste0(site_list[i]),
xlab = "",
ylab = "Percent cover",
ylim = c(0, 0.5),
las = 2,
yaxs = "i")
}
In order to document our results and save them to the data repository, we must go through our git routine and push to Github:
## -- submit homework for evaluation -- ##
# * 9.1. In Gitbash or Git interface with RStudio:
# git add -A
# git status ## -- this verifies local changes in staging area -- ##
# git commit -m 'submitting homework'
# git pull ## -- this ensures your local copy is up-to-date -- ##
# git push ## -- this uploads your changes to github -- ##
Now that we have some additional practice in Data Standardisation and Reproducible Research, we can move on to our next module, Data Formatting & Standardisation.