Consolidation and discussion for DSaRR Module

Previous steps

If you would like to return to the previous section, please click here.

Context

Getting familiar with R and git takes practice. For the Homework for the DSaRR Module, we set out some routine git tasks and basic R exercises.

This wiki page provides some additional comments and results of the DSaRR Module Homework.

Version control: Using `git`

The intent of the first set of exercises was to get course participants used to making changes to the repository, adding them to the staging area in git, committing, and pushing and pulling to synchronise with GitHub.

To complete the Homework exercises, we need to make a copy of the homework document:

 ## -- create local copy of homework script -- ##
  # Instructions:
  #  * 1.1. Copy homework script to your `participants_code` folder:
  #         copy `exercise_code/homework_data_standards_reproducible_research.R` to
  #           the `exercise_code` folder in `participants_code/`

  #  * 1.2. In Gitbash or Git interface with RStudio:
  #           git add -A
  #           git status  ## -- this verifies local changes in staging area -- ##
  #           git commit -m 'adding homework to exercise code'
  #           git pull    ## -- this ensures your local copy is up-to-date -- ##
  #           git push    ## -- this uploads your changes to github -- ##

Next, it helps to set the path in setwd() so that it points to your personal copy of the repository. As course participants will have copied the repository to different locations, it is useful to keep a record of that location. This means that when opening R, you can point to the project directory automatically:

 ## -- modify `integrate.R` -- ##
  # Instructions:
  #  * 2.1. Modify line #59 to align to local copy of repository
  #           from Gitbash, navigate to the project repository using the commands
  #             `pwd`  ## -- this identifies the 'present working directory'            -- ##
  #             `cd`   ## -- this 'changes directory'; users will need to type the path -- ##
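
As a minimal sketch of what the modified line in `integrate.R` does, the snippet below sets the working directory and then confirms it. The path `project_dir` is a hypothetical stand-in (a temporary folder, created only so that the sketch runs anywhere); in practice it would be the path that `pwd` reports for your own copy of the repository.

```r
# hypothetical stand-in for your local copy of the repository;
# replace with the path that `pwd` reports in Gitbash
project_dir <- file.path(tempdir(), "my_local_repository")
dir.create(project_dir, showWarnings = FALSE)

# point R to the project directory, keeping a note of the previous location
previous_dir <- setwd(project_dir)

# confirm where R is now working
print(getwd())

# return to the previous directory (only needed for this sketch)
setwd(previous_dir)
```

Because `setwd()` returns the previous working directory, it is easy to switch back if the path turns out to be wrong.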

Adding and committing this change documents when you set the working directory:

  #  * 2.2. In Gitbash or Git interface with RStudio:
  #           git add -A
  #           git status  ## -- this verifies local changes in staging area -- ##
  #           git commit -m 'modifying working directory'
  #           git pull    ## -- this ensures your local copy is up-to-date -- ##
  #           git push    ## -- this uploads your changes to github -- ##

  #         Participants should copy the output from Gitbash to this script here and
  #           "comment" the text. This is done by selecting the text and selecting
  #           'comment out' from the Edit menu (or using the command + ' keys)
  #         (Code should have a `#` symbol in front of the text)

Lastly, we need our own copy of the create_reef_data.R to modify for the exercises:

 ## -- create local copy of reef data creation code -- ##
  # Instructions:
  #  * 3.1. Copy the `create_reef_data.R` script to your `participants_code` folder:
  #           copy `creation_code/examples/standards/create_reef_data.R` to the
  #           `creation_code` folder in your participants_code folder

  #  * 3.2. In Gitbash or Git interface with RStudio:
  #           git add -A
  #           git status  ## -- this verifies local changes in staging area -- ##
  #           git commit -m 'adding local copy of reef data creation code'
  #           git pull    ## -- this ensures your local copy is up-to-date -- ##
  #           git push    ## -- this uploads your changes to github -- ##

At this point, you should have in your participants_code folder:

A copy of integrate.R that includes the location of your local copy of the data repository in setwd()
A copy of the homework_data_standards_reproducible_research.R script
A copy of the create_reef_data.R script

and all of these changes should be represented in the change history of the repository. To check how well you (and everyone else!) have done, just type:

gitk &

in Gitbash or a Terminal window.

Further instructions on accessing history from RStudio to come shortly

`R` Exercises

Similar to the git exercises above, the exercises below set out some examples of how to work with data objects, how to filter and summarise them, and strategies for breaking up a complex coding task in R. One of the criteria for evaluating these exercises was the amount of documentation and the use of the pipe operator %>% to clarify the flow of information in multi-step data transformation & summary steps.

One strategy for simplifying the construction of, and work with, complex data objects is to build them from separate, individual objects:

 ## -- re-create `reef_data` from individual list objects --##
  # Instructions:
  #  * 4.1. Copy lines #33-63 from `create_reef_data.R` and paste it below line #65
  #           We will modify this code for this exercise

  #  * 4.2. Create 3 separate objects:
  #           `sites`    that contains the list of sites repeated for the number of quadrates
  #           `quadrate` that contains the repeated quadrate numbers
  #           `percent_cover` which contains the relative percent covers per site

  #  * 4.3. Create data frame from individual objects
  #           Re-create the `reef_data` using the 3 data objects
  #           (this should look similar to the `reef_data` object)
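
One possible way to approach exercises 4.2 and 4.3 is sketched below. The site names are taken from the output shown on this page; `site_list`, `n_quads` and the simulated cover values are assumptions made for illustration, so the numbers will not match those produced by `create_reef_data.R`.

```r
# assumed inputs: the 8 sites seen in the output on this page, 5 quadrates each
site_list <- c("coral garden", "kanamai", "kasa", "likoni",
               "nyali", "ras iwatine", "shark point", "shelly")
n_quads <- 5

# set seed so the illustrative values are reproducible
set.seed(1)

# create the 3 separate objects
sites         <- rep(site_list, each = n_quads)
quadrate      <- rep(seq_len(n_quads), times = length(site_list))
percent_cover <- runif(length(sites), min = 0.3, max = 0.9)  # placeholder values

# check that all objects have the same length before combining
stopifnot(length(sites) == length(quadrate),
          length(sites) == length(percent_cover))

# create data frame from individual objects
reef_data <- data.frame(sites, quadrate, percent_cover)
head(reef_data)
```

Checking the lengths first gives a clear error at the point where objects stop matching, rather than a less obvious message from `data.frame()`.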

Indexing can be a powerful tool for programming with data objects. For data cleaning and visualisation of coral reef monitoring data, there are often multiple, repeated procedures that can be simplified by using indexing.

These exercises provide practice in filtering different sites and replicates within data objects, for example:

  #  * 5.1. Use bracket `[]` indexing to select the quadrate data from "coral garden"
  #           Copy your code & output from the R Console below:

# >     reef_data[reef_data$sites == "coral garden" , ]
         # sites quadrate percent_cover
# 1 coral garden        1     0.4076133
# 2 coral garden        2     0.5414949
# 3 coral garden        3     0.6517576
# 4 coral garden        4     0.3695736
# 5 coral garden        5     0.6391566

To select multiple entries within a data object, one needs to use the %in% operator. For example, to select the first three quadrates from each site:

  #  * 5.2. Subset quadrates 1-3 from each site
  #           (Hint: to select using multiple entries one must use `%in%` instead of `==`)
  #           Copy your code & output from the R Console below:

# >     reef_data[reef_data$quadrate %in% 1:3, ]
          # sites quadrate percent_cover
# 1  coral garden        1     0.4076133
# 2  coral garden        2     0.5414949
# 3  coral garden        3     0.6517576
# 6       kanamai        1     0.6537655
# 7       kanamai        2     0.6606772
# 8       kanamai        3     0.7895763
# 11         kasa        1     0.3452651
# 12         kasa        2     0.2909224
# 13         kasa        3     0.3492621
# 16       likoni        1     0.7447100
# 17       likoni        2     0.6706833
# 18       likoni        3     0.7056427
# 21        nyali        1     0.6294540
# 22        nyali        2     0.5662192
# 23        nyali        3     0.6945901
# 26  ras iwatine        1     0.5654706
# 27  ras iwatine        2     0.8593679
# 28  ras iwatine        3     0.8364104
# 31  shark point        1     0.9269451
# 32  shark point        2     0.9178892
# 33  shark point        3     0.8948935
# 36       shelly        1     0.6557776
# 37       shelly        2     0.7196681
# 38       shelly        3     0.5841085
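
To illustrate why `%in%` is needed here, the sketch below compares the two operators on a small, made-up data frame (`reef_toy` is a hypothetical object, not part of the homework). With `==`, R recycles the values `1:3` along the column and compares position by position, so most of the matching rows are missed.

```r
# small made-up example: 2 sites with 5 quadrates each
reef_toy <- data.frame(sites    = rep(c("kasa", "nyali"), each = 5),
                       quadrate = rep(1:5, times = 2))

# `==` compares element by element, recycling 1:3 along the column
# (R also warns that the lengths do not match)
rows_eq <- suppressWarnings(reef_toy$quadrate == 1:3)

# `%in%` asks, for each row, whether the quadrate is any of 1, 2 or 3
rows_in <- reef_toy$quadrate %in% 1:3

# compare the number of rows selected by each approach
sum(rows_eq)
sum(rows_in)

reef_toy[rows_in, ]
```

In this example `==` selects only three rows, whereas `%in%` returns the six rows that were intended.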

A useful way of filtering multiple entries is to create a list of the items of interest. Separating this list from the data operation (in this case, filtering quadrates) means that one can modify the list without having to dig too deeply into the code. This approach will become more useful as we move on to more difficult data wrangling problems, and it gives us better control of data queries and data transformations.

For example, after selecting the odd-numbered quadrates with the qs_of_interest list, we can quickly change the list to c(2, 4, 5) without having to modify the indexing operation:

  #  * 5.3. Create a similar subset using a list object, e.g.:
  #           qs_of_interest <- c(1, 3, 5)
  #           Copy your code & output from the R Console below:

# >     reef_data[reef_data$quadrate %in% qs_of_interest, ]
          # sites quadrate percent_cover
# 1  coral garden        1     0.4076133
# 3  coral garden        3     0.6517576
# 5  coral garden        5     0.6391566
# 6       kanamai        1     0.6537655
# 8       kanamai        3     0.7895763
# 10      kanamai        5     0.8084211
# 11         kasa        1     0.3452651
# 13         kasa        3     0.3492621
# 15         kasa        5     0.4713814
# 16       likoni        1     0.7447100
# 18       likoni        3     0.7056427
# 20       likoni        5     0.8029196
# 21        nyali        1     0.6294540
# 23        nyali        3     0.6945901
# 25        nyali        5     0.6457971
# 26  ras iwatine        1     0.5654706
# 28  ras iwatine        3     0.8364104
# 30  ras iwatine        5     0.5043155
# 31  shark point        1     0.9269451
# 33  shark point        3     0.8948935
# 35  shark point        5     0.6947272
# 36       shelly        1     0.6557776
# 38       shelly        3     0.5841085
# 40       shelly        5     0.6652558
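
The benefit of keeping the list separate can be sketched with a small, made-up data frame (`reef_toy` and its values are hypothetical). Only the `qs_of_interest` object changes between the two selections; the indexing line is identical.

```r
# small made-up example: 2 sites with 5 quadrates each
reef_toy <- data.frame(sites         = rep(c("kasa", "nyali"), each = 5),
                       quadrate      = rep(1:5, times = 2),
                       percent_cover = seq(0.1, 1.0, by = 0.1))

# first selection: odd-numbered quadrates
qs_of_interest <- c(1, 3, 5)
first_subset <- reef_toy[reef_toy$quadrate %in% qs_of_interest, ]

# second selection: only the list changes, not the indexing code
qs_of_interest <- c(2, 4, 5)
second_subset <- reef_toy[reef_toy$quadrate %in% qs_of_interest, ]

first_subset
second_subset
```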

Using pipes can take some getting used to, but is well worth the effort. The pipe operator %>% makes the flow of information much more intuitive than nested base R function calls.

For example, the logical sequence of this nested set of commands can be a bit difficult to follow (particularly as the argument to round() that sets the number of decimal places is far removed from the function name itself):

round(mean(sqrt(abs(rnorm(18)))), 3)

A more logical representation of this using pipes %>% would look like this:

rnorm(18) %>% abs() %>% sqrt() %>% mean() %>% round(3)
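
A quick way to convince yourself that the two forms are equivalent is to run both with the same random number seed. The sketch below assumes the `magrittr` package (which provides `%>%`) is installed; the seed value is arbitrary.

```r
# `%>%` is provided by the magrittr package (also loaded by dplyr / tidyverse)
library(magrittr)

# nested version
set.seed(42)
nested_result <- round(mean(sqrt(abs(rnorm(18)))), 3)

# piped version, using the same seed so the same random numbers are drawn
set.seed(42)
piped_result <-
  rnorm(18) %>%
  abs()     %>%
  sqrt()    %>%
  mean()    %>%
  round(3)

# both approaches give the same value
identical(nested_result, piped_result)
```

Writing one step per line, as in the piped version, also leaves room for a comment beside each step.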

For these exercises, we can practise combining our indexing skills with pipes:

 ## -- using pipes -- ##
  #  * 6.1. Using the `[]` indexing for "coral garden", get the mean percent cover
  #           Copy your code & output from the R Console below:

# >     reef_data[reef_data$sites == "coral garden" , ]$percent_cover %>% mean()
# [1] 0.5219192

  #  * 6.2. Use the function `round()` to round the percent cover values to 3 digits
  #           (Hint: the `%>%` operator can be used in the creation of the `percent_cover`
  #           object in exercise 4.2 above)
  #           Copy your code & output from the R Console below:

# >     reef_data[reef_data$sites == "coral garden" , ]$percent_cover %>% mean() %>% round(3)
# [1] 0.522

Pipes can also be useful for quick visualisations and other data summaries. For example, a simple boxplot() can provide a useful visualisation of the data distribution of coral cover across multiple sites:

 ## -- base visualisation -- ##
  #  * 7.1. Use the function `boxplot()` to examine the variation of `percent_cover`
  #           for each site
  #           (Hint: to obtain help on the use of `boxplot` type `?boxplot` in the R Console)
  #           Copy your code below to examine the graphical output:

  # # plot percent cover
    # reef_data %>% boxplot(percent_cover ~ sites, data  = .)
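
The `data = .` in the boxplot code uses the `magrittr` placeholder: the `.` stands for the object on the left of the pipe, so it can be passed to an argument other than the first. A minimal sketch with made-up values follows (`reef_toy` is hypothetical); `plot = FALSE` is used here only so that the summary statistics are returned instead of a figure.

```r
library(magrittr)

# small made-up example: 2 sites with 5 quadrates each
reef_toy <- data.frame(sites         = rep(c("kasa", "nyali"), each = 5),
                       percent_cover = c(0.1, 0.2, 0.3, 0.4, 0.5,
                                         0.5, 0.6, 0.7, 0.8, 0.9))

# the `.` passes `reef_toy` to the `data` argument rather than the first argument
box_stats <-
  reef_toy %>%
  boxplot(percent_cover ~ sites, data = ., plot = FALSE)

# one column of summary statistics per site
box_stats$names
box_stats$stats
```

Removing `plot = FALSE` draws the figure, as in exercise 7.1.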

For this next exercise, we will combine our skills of creating data objects from lists, indexing, pipes and basic visuals to include information on coral genera in our analysis. Course participants will find that breaking down this task into separate steps is helpful.

Course participants should also be aware that in R there are often a number of different ways to arrive at the same result. So, in this sense, there is no single correct answer for these exercises. It is more an examination of the coding logic, documentation and versatility of code that we are evaluating.

For example, we first create a list of genera and the relative percent cover as individual data objects:

##
## 3. Example `reef_data` for Coral Genera
##
 ## -- create percent cover for multiple genera -- ##
  #  * 8.1. Copy lines #33-63 of `create_reef_data.R` below and modify it to include
  #           multiple genera. Use the relative percent cover values as a basis for
  #             Pocillopora =  50% of relative percent cover values for each site
  #             Pavona      =  30% of percent cover values
  #             Acropora    =  20% of percent cover values
  #         The general approach for this exercise is:
  #           i.   create individual `relative_cover` values for each genus
  #           ii.  add an additional column to the `data.frame()` called `genus`
  #           iii. adjust the `rep()` values to include the number of genera
  #           (Hint: check the `length()` of individual objects to make sure they match)
  #           Copy your code below or keep in your copy of `create_reef_data.R`:

  # set relative cover per site
    cvr_cor <- 0.60
    cvr_kan <- 0.65
    cvr_kas <- 0.45
    cvr_lik <- 0.78
    cvr_nya <- 0.73
    cvr_ras <- 0.68
    cvr_sha <- 0.76
    cvr_she <- 0.58

  # set list of genera
    genera <-
      c("Pocillopora",
        "Pavona",
        "Acropora")

  # set percentage cover values for each genus
    poc <- 0.5
    pav <- 0.3
    acr <- 0.2

This means that if we want to adjust the relative percent cover values or change the list of genera, we can easily do that outside of the core creation code for the object.
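
As a small illustration of how these separate settings combine, the sketch below derives the expected mean cover per genus for one site. The named vector `genus_share` is an assumed alternative to the separate `poc`, `pav` and `acr` objects, used here only to keep the example short.

```r
# relative cover for one site (value for "coral garden" from the code above)
cvr_cor <- 0.60

# share of the relative cover assigned to each genus
genus_share <- c(Pocillopora = 0.5,
                 Pavona      = 0.3,
                 Acropora    = 0.2)

# the shares should account for all of the relative cover
stopifnot(isTRUE(all.equal(sum(genus_share), 1)))

# expected mean cover per genus at this site
mean_cover_cor <- cvr_cor * genus_share
print(mean_cover_cor)
```

Changing a share or adding a genus then only requires editing `genus_share`, and the check on the sum flags shares that no longer add up.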

We next set the random number seed to ensure we create the same numbers each time the code is run. Creating individual objects allows flexibility in subsetting individual percent cover values for our site_list:

  # set seed for reproducibility
    set.seed(3)

  # create individual percent cover objects
    percent_cover_poc <-
      c(rnorm(5, cvr_cor * poc, (cvr_cor * poc / 3.0)),
        rnorm(5, cvr_kan * poc, (cvr_kan * poc / 5.2)),
        rnorm(5, cvr_kas * poc, (cvr_kas * poc / 3.2)),
        rnorm(5, cvr_lik * poc, (cvr_lik * poc / 6.8)),
        rnorm(5, cvr_nya * poc, (cvr_nya * poc / 4.2)),
        rnorm(5, cvr_ras * poc, (cvr_ras * poc / 4.4)),
        rnorm(5, cvr_sha * poc, (cvr_sha * poc / 4.1)),
        rnorm(5, cvr_she * poc, (cvr_she * poc / 5.4)))

    percent_cover_pov <-
      c(rnorm(5, cvr_cor * pav, (cvr_cor * pav / 3.0)),
        rnorm(5, cvr_kan * pav, (cvr_kan * pav / 5.2)),
        rnorm(5, cvr_kas * pav, (cvr_kas * pav / 3.2)),
        rnorm(5, cvr_lik * pav, (cvr_lik * pav / 6.8)),
        rnorm(5, cvr_nya * pav, (cvr_nya * pav / 4.2)),
        rnorm(5, cvr_ras * pav, (cvr_ras * pav / 4.4)),
        rnorm(5, cvr_sha * pav, (cvr_sha * pav / 4.1)),
        rnorm(5, cvr_she * pav, (cvr_she * pav / 5.4)))

    percent_cover_acr <-
      c(rnorm(5, cvr_cor * acr, (cvr_cor * acr / 3.0)),
        rnorm(5, cvr_kan * acr, (cvr_kan * acr / 5.2)),
        rnorm(5, cvr_kas * acr, (cvr_kas * acr / 3.2)),
        rnorm(5, cvr_lik * acr, (cvr_lik * acr / 6.8)),
        rnorm(5, cvr_nya * acr, (cvr_nya * acr / 4.2)),
        rnorm(5, cvr_ras * acr, (cvr_ras * acr / 4.4)),
        rnorm(5, cvr_sha * acr, (cvr_sha * acr / 4.1)),
        rnorm(5, cvr_she * acr, (cvr_she * acr / 5.4)))
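
The three blocks above differ only in the genus share, so the repetition could be reduced with a small helper function. The sketch below is one possible alternative, not the approach used in the homework: `sim_cover()`, `site_cover` and `sd_divisor` are hypothetical names, and `n_quads <- 5` is assumed from the `rnorm(5, ...)` calls. Because `rnorm()` is vectorised over `mean` and `sd`, it should draw the same values as the explicit version when the same seed is used.

```r
# relative cover per site and the divisors used for the standard deviations,
# in the same site order as the code above
site_cover <- c(0.60, 0.65, 0.45, 0.78, 0.73, 0.68, 0.76, 0.58)
sd_divisor <- c(3.0, 5.2, 3.2, 6.8, 4.2, 4.4, 4.1, 5.4)
n_quads    <- 5  # assumed number of quadrates per site

# hypothetical helper: simulate cover for one genus across all sites
sim_cover <- function(share) {
  site_mean <- site_cover * share
  rnorm(n_quads * length(site_cover),
        mean = rep(site_mean, each = n_quads),
        sd   = rep(site_mean / sd_divisor, each = n_quads))
}

# set seed for reproducibility
set.seed(3)

# create individual percent cover objects
percent_cover_poc <- sim_cover(0.5)
percent_cover_pov <- sim_cover(0.3)
percent_cover_acr <- sim_cover(0.2)

# check lengths match: 8 sites x 5 quadrates
length(percent_cover_poc)
```

The explicit version has the advantage that each site's settings are visible at a glance; the helper is easier to extend to more genera.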

We now fit everything together, repeating the number of quadrates n_quads by the number of genera and the number of sites in our site_list to ensure we have the correct labelling for sites, quadrates, coral genera and percent cover values:

  # generate data
    reef_data <-
      data.frame(
        sites         = rep(site_list, each = n_quads * length(genera)),
        quadrate      = rep(seq(1:5), times = length(site_list) * length(genera)),
        genus         = rep(genera, each = n_quads) %>% rep(times = length(site_list)),
        percent_cover = c(percent_cover_poc[ c(1:5) ],   ## -- 1:5 "coral garden " -- ##
                          percent_cover_pov[ c(1:5) ],
                          percent_cover_acr[ c(1:5) ],
                          percent_cover_poc[ c(6:10) ],  ## -- 6:10 "kanamai"      -- ##
                          percent_cover_pov[ c(6:10) ],
                          percent_cover_acr[ c(6:10) ],
                          percent_cover_poc[ c(11:15) ], ## -- 11:15 "kasa"        -- ##
                          percent_cover_pov[ c(11:15) ],
                          percent_cover_acr[ c(11:15) ],
                          percent_cover_poc[ c(16:20) ], ## -- 16:20 "likoni"      -- ##
                          percent_cover_pov[ c(16:20) ],
                          percent_cover_acr[ c(16:20) ],
                          percent_cover_poc[ c(21:25) ], ## -- 21:25 "nyali"       -- ##
                          percent_cover_pov[ c(21:25) ],
                          percent_cover_acr[ c(21:25) ],
                          percent_cover_poc[ c(26:30) ], ## -- 26:30 "ras iwatine" -- ##
                          percent_cover_pov[ c(26:30) ],
                          percent_cover_acr[ c(26:30) ],
                          percent_cover_poc[ c(31:35) ], ## -- 31:35 "shark point" -- ##
                          percent_cover_pov[ c(31:35) ],
                          percent_cover_acr[ c(31:35) ],
                          percent_cover_poc[ c(36:40) ], ## -- 36:40 "shelly"      -- ##
                          percent_cover_pov[ c(36:40) ],
                          percent_cover_acr[ c(36:40) ])
        )
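
The way the `rep()` calls line up can be easier to see on a reduced example. The sketch below uses made-up inputs (2 sites, 2 genera and 2 quadrates) with the same `each` and `times` pattern, and then checks that every combination of site, genus and quadrate appears exactly once.

```r
# reduced, made-up inputs
site_list <- c("kasa", "nyali")
genera    <- c("Pocillopora", "Pavona")
n_quads   <- 2

# same labelling pattern as the full `reef_data` object
labels <-
  data.frame(
    sites    = rep(site_list, each = n_quads * length(genera)),
    quadrate = rep(seq_len(n_quads), times = length(site_list) * length(genera)),
    genus    = rep(rep(genera, each = n_quads), times = length(site_list)))

labels

# each site x genus combination should contain `n_quads` rows
table(labels$sites, labels$genus)

# no combination of site, genus and quadrate should be repeated
stopifnot(nrow(unique(labels)) == nrow(labels))
```

The rows are ordered by site, then genus, then quadrate, which is the same order in which the percent cover values are concatenated above.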

As mentioned above, using indexing can provide a useful way of conducting repeated data tasks. In this example, instead of copying and pasting the boxplot() code 8 times (i.e. by the number of sites), we simply cycle through each site in a for() loop.

We need to adjust the graphics window parameters (i.e. par()) to give 2 rows and 4 columns (i.e. mfrow = c(2, 4)) and include a special colour palette to fill the individual coral genera “boxes”:

  #  * 8.2 Visualise the percent cover by genera for each site
  #          Use `[]` indexing to select sites and `boxplot()` as in exercise 7.1

  # set colour palette
    c_palette <-
      wesanderson::wes_palette(8,
                               name = "Cavalcanti1",
                               type = "continuous")

  # set multiple figures in output
    par(mfrow = c(2, 4))

  # loop to generate figures
    for(i in 1:length(site_list)){

    # plot individual sites
      boxplot(percent_cover ~ genus,
              data = reef_data,
              subset = sites == site_list[i],
              col = c_palette[c(1, 2, 8)],
              main = paste0(site_list[i]),
              xlab = "",
              ylab = "Percent cover",
              ylim = c(0, 0.5),
              las = 2,
              yaxs = "i")

    }
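
The same looping pattern works for numerical summaries as well as figures. The sketch below uses a small, made-up data frame (`reef_toy` is hypothetical) and stores the median cover for each site. It uses `seq_along()` as an alternative to `1:length()`, since `1:length(x)` counts `1, 0` when `x` is empty, whereas `seq_along(x)` simply skips the loop.

```r
# small made-up example: 2 sites with 3 quadrates each
site_list <- c("kasa", "nyali")
reef_toy  <- data.frame(sites         = rep(site_list, each = 3),
                        percent_cover = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7))

# create empty object to hold the results
site_median <- numeric(length(site_list))
names(site_median) <- site_list

# loop to summarise each site
for (i in seq_along(site_list)) {

  # select rows for the individual site
  site_rows <- reef_toy$sites == site_list[i]

  # store the median percent cover
  site_median[i] <- median(reef_toy$percent_cover[site_rows])

}

site_median
```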

To document your results and save them to the data repository, we must go through our git routine and push to GitHub:

 ## -- submit homework for evaluation -- ##
  #  * 9.1. In Gitbash or Git interface with RStudio:
  #           git add -A
  #           git status  ## -- this verifies local changes in staging area -- ##
  #           git commit -m 'submitting homework'
  #           git pull    ## -- this ensures your local copy is up-to-date -- ##
  #           git push    ## -- this uploads your changes to github -- ##

Next steps

Now that we have some additional practice in Data Standardisation and Reproducible Research, we can move on to our next module for Data Formatting & Standardisation here.

CORDIO East Africa & GCRMN Data Standards Working Group

11 August 2023
