Previous steps

To return to information from the previous session, please click here.

Context

As part of the development of a standardised project repository for GCRMN reef monitoring data, with folders for creating data objects, visualisation, reporting code, and so on, there is also a file that outlines the basic steps of the data standardisation and analysis.

The idea behind this file (called integrate.R) is that it documents each step needed to reproduce the results of the data standardisation, visualisation, reporting, and so on. The file also handles the set-up of the project: it loads the packages needed for the project (e.g. tools for data manipulation, importing, visualisation, and mapping), sets the working directory, and creates special functions and parameters.

The main body of integrate.R sets out the sequence of data cleaning and analysis steps by listing individual scripts. Using the command source(), one can run hundreds of lines of code with a single line of code!
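To illustrate how source() works, here is a minimal, self-contained sketch. The script name and the object it creates (cleaned_n) are hypothetical; we write a tiny script to a temporary file only so the example runs anywhere, whereas in a real project the scripts live in the repository folders:

```r
  # source() runs every line of another script as if typed at the console
  # (hypothetical example: write a two-line script to a temp file, then run it)
    script_path <- tempfile(fileext = ".R")
    writeLines("cleaned_n <- sum(1:10)", script_path)

  # a single line of code runs the whole script in the current workspace
    source(script_path)
    cleaned_n
```

In integrate.R, each source() call points at a script in the repository instead of a temporary file, so the one-line call stands in for the full data cleaning or analysis routine inside that script.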

This wiki page provides an overview of a general workflow and integrate.R script for a coral reef monitoring data project, with the idea that using a similar project structure will facilitate collaboration and transferability of analyses and code.

Example Workflow

Once you have cloned the data repository for this course, you will notice a high-level folder structure, with folders for data, code, and outputs, and a single *.R file (i.e. integrate.R). The individual folders are nested to separate the different coding, analysis, and visualisation routines:

A repository for coral reef monitoring data might have folders for different localities, methods, dates or data types. This helps in organising the data, code and outputs for a “living” data project.

This folder structure not only keeps our data and code organised, it also reflects the workflow for the different steps in the project. For example, the data importation & cleaning process (i.e. creation_code) creates *.rda objects saved to the data_intermediate folder. These data objects are then loaded by individual scripts in analysis_code to produce outputs (e.g. *.png) saved to the figures folder. Finally, a script in rmarkdown imports the *.png figures and other binary outputs to create a report of monitoring results:
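The hand-off between creation_code and analysis_code relies on save() and load() for *.rda files. Below is a minimal sketch of that round trip; the object name reef_cover and the values are made up, and tempdir() stands in for the project's data_intermediate folder so the example is self-contained:

```r
  # hypothetical creation_code step: save a cleaned data object as *.rda
    reef_cover <- data.frame(site = c("a", "b"), cover = c(32.5, 41.0))  # made-up data
    rda_path   <- file.path(tempdir(), "reef_cover.rda")  # stands in for data_intermediate/
    save(reef_cover, file = rda_path)

  # hypothetical analysis_code step: a fresh script restores the object by name
    rm(reef_cover)
    load(rda_path)
    reef_cover$cover
```

Because load() restores the object under its original name, the analysis scripts can stay short: they load a ready-made *.rda and go straight to producing figures.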

Of course, there are alternatives for setting up projects in R. The philosophy of this approach is that individual, modular scripts of fewer than 80-100 lines are easier to proofread and troubleshoot (if there is a problem), and they separate the different tasks of data cleaning, analysis, visualisation and reporting.

The advantages of this particular approach will become more apparent as we progress in the training course.

Example integrate.R

In the project repository, we have set up an example integrate.R file to use for this training course.

The script sets out basic information on the name of the project, its purpose or objective, the approach, the main authors, the date, and other metadata required for the project. This is a good place to put warnings about confidentiality and responsible use of monitoring data.

##
##  Project Name:  Building the WIO Global Coral Reef Monitoring Network
##                 to make coral reef data secure & accessible
##
##  Objective:     Provide course structure and modules for data
##                 systematisation and visualisation training course
##
##  Approach:
##
##  Authors:       Franz Smith, Mishal Gudka, David Obura, and others
##                 CORDIO East Africa
##                 Universidad San Francisco de Quito
##
##
##  Date:          2021-04-30
##

##  Notes:         1. This file is intended to provide a guide to the basic
##                    workflow of the project, attempting to 'integrate' the
##                    different steps necessary to conduct the analyses &
##                    create visual outputs

As mentioned, the set-up for working on the project begins with cleaning the workspace with rm(list = ls()) and loading the necessary packages for the project. Sometimes it is helpful to group the packages by their broad functionality (e.g. data manipulation, visualisation):

##
##  1. Set up the core functionality
##
  # clean up
    rm(list=ls())

  # call to core packages for data manipulation
    library(dplyr)
    library(tidyr)
    library(magrittr)
    library(purrr)
    library(lubridate)
    library(hms)
    library(stringr)
    library(forcats)

  # for importing different formats
    library(readr)
    library(readxl)

  # call to visualisation & output generation
    library(ggplot2)
    library(GGally)
    library(Cairo)
    library(extrafont)
    library(RColorBrewer)
    library(viridis)

  # functionality for spatial analyses
    library(raster)
    library(rgdal)
    library(sf)
    library(rgeos)

It is also convenient to use integrate.R to set the working directory for the project, fonts & themes, and other settings needed to have comparable results across collaborators and institutions. This is also where special functions (for example, quickview(), a convenience function for viewing the top rows of a tibble) and other settings (e.g. projection details for mapping) are defined.

Without going into too much detail, this is just an example of how you can use integrate.R to manage routine settings for an individual project:

  # point to working directory        ## -- will need to adjust for local copy -- ##
    setwd("research/gcrmn_wio_data_course")

  # set font for graphical outputs
    theme_set(theme_bw(base_family = "Helvetica"))
    CairoFonts(  # slight mod to example in ?CairoFonts page
               regular    = "Helvetica:style = Regular",
               bold       = "Helvetica:style = Bold",
               italic     = "Helvetica:style = Oblique",
               bolditalic = "Helvetica:style = BoldOblique"
               )

  # call to map theme
    source("R/theme_nothing.R")

  # create helper function for reviewing data
    quickview <- function(x, n = 3L){ head(data.frame(x), n = n) }

  # set utm details
    utm_details <-
      paste0("+proj=utm +zone=15 +south +datum=WGS84 +units=m",
             " +no_defs +ellps=WGS84") %>% CRS()
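Once defined, a helper like quickview() is available in every script that integrate.R subsequently sources. A quick usage example with the built-in iris data set:

```r
  # helper defined in integrate.R (repeated here so the example is self-contained)
    quickview <- function(x, n = 3L){ head(data.frame(x), n = n) }

  # show the first three rows as a plain data.frame
    out <- quickview(iris)
    out
```

The data.frame() call means the same helper works whether x is a tibble or a plain data frame, which is handy when scripts mix tidyverse and base R objects.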

After the set-up, the remainder of integrate.R goes through the individual steps for creating data objects, cleaning, standardisation and analysis. As a general strategy, we set out the location of the scripts separately, as this saves repeated typing. In addition, if the project structure changes (e.g. we might want to add a folder level in formatting to have separate benthic and fish folders for those examples), only the top line of code needs to change and the rest of the steps in that sequence should still run.

##
## 2. Generate core data objects
##
  # point to creation locale
    creation_locale <- "creation_code/examples/formatting/"

  # create percent cover data object for costa rica
    source(paste0(creation_locale, "create_sessiles_dat.acosa.R"))
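For orientation, a creation script of this kind typically imports raw data, standardises it, and saves an *.rda object. The sketch below is a hypothetical stand-in, not the contents of create_sessiles_dat.acosa.R: it uses a temporary CSV and base-R read.csv() so it is self-contained, where a real script would read the project's raw files (e.g. with readr or readxl) and save into data_intermediate:

```r
  # hypothetical raw input (a real script would point at the project's raw data)
    raw_path <- tempfile(fileext = ".csv")
    write.csv(data.frame(Site = "palmar", PercentCover = 18),
              raw_path, row.names = FALSE)

  # import & apply a simple standardisation step (lower-case column names)
    sessiles_raw <- read.csv(raw_path)
    names(sessiles_raw) <- tolower(names(sessiles_raw))

  # save the cleaned object for downstream analysis scripts
    save(sessiles_raw, file = file.path(tempdir(), "sessiles_dat.rda"))
```

Each creation script ends with a save() like this, so the analysis scripts listed later in integrate.R never touch the raw files directly.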

Next steps

Now that we have covered how the project repository is set up and aspects of the workflow, we can get into some code!