Context

Continuing to give WIO data course participants more “hands on” experience with R and with data formatting and standardisation, we have provided exercises for this week’s Homework.

These exercises build on skills obtained in the previous modules, including the use of git for documentation and version control, the use of pipes %>%, indexing, and the suite of tidyverse packages.

CORDIO East Africa staff will be available for support on Saturday 26th of June if participants require further assistance.

In addition, course participants are encouraged to use the project’s issues page to log an issue that we can address outside of the Saturday morning support session.

Getting set up

Similar to last week’s homework, we will start by copying the master scripts from the exercise_code and creation_code/examples/standards folders into your own personal folder.

For example, the user dobura has copied the various master creation_code and homework scripts to his personal folder outlined in orange:

This means that participants can modify these scripts for completing their homework, making comments, and testing data structures.

For the first task of this homework, participants should:

 ## -- create local copy of homework script -- ##
  # Instructions:
  #  * 1.1. Copy homework script to your `participants_code` folder:
  #         copy `exercise_code/homework_data_formatting_standardisation.R` to
  #           the `exercise_code` folder in `participants_code/`

  #  * 1.2. In Gitbash or Git interface with RStudio:
  #           git add -A
  #           git status  ## -- this verifies local changes in staging area -- ##
  #           git commit -m 'adding data standardisation homework to exercise code'
  #           git pull    ## -- this ensures your local copy is up-to-date -- ##
  #           git push    ## -- this uploads your changes to github -- ##

Using the git commands to add, commit, pull & push after making this change registers it in the project repository and provides a copy of a “fresh” script before you start altering it. Users can refer back to this version (and other changes) by using the gitk & command.

Next we need a copy of the create_percent_cover_kenya.R script, which contains much of the base code for this week’s Homework:

 ## -- create local copy of reef data creation code -- ##
  # Instructions:
  #  * 1.3. Copy `creation_code/examples/standards/create_percent_cover_kenya.R`
  #           to the `creation_code` folder in your participants_code folder.

  #  * 1.4. In Gitbash or Git interface with RStudio:
  #           git add -A
  #           git status  ## -- this verifies local changes in staging area -- ##
  #           git commit -m 'adding local copy of reef data creation code'
  #           git pull    ## -- this ensures your local copy is up-to-date -- ##
  #           git push    ## -- this uploads your changes to github -- ##

Similar to above, using the git commands to add, commit, pull & push registers this change and provides a “clean” copy of the script in the project repository for future reference.

Importing raw data

We will start by practising our skills for importing data of different formats, including *.csv and *.xlsx. Note that for this exercise, we will follow the coding standard of separating out the data_locale <- and the file_name <- assignments. This means that if there is a change to the folder structure or file name, we can change it at the top of the script and the rest should still run:

 ## -- import data -- ##
  # Instructions:
  #  * 2.1. Using your personal copy of `create_percent_cover_kenya.R`
  #           Import `kenya.csv` from `data_raw` folder
  #           Print the header of the object to screen and copy it here:

  #  * 2.2. Exclude empty columns from kenya data
  #           Determine which columns are empty. Hint: the `anyNA()` function scans columns for NA values
  #           Use the `dplyr::select()` to exclude columns with NA values
  #           Print the header of the object to screen and copy it here:

As we import the data, we notice there are a number of columns that only contain NA values, so we will exclude these in the data cleansing process.
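
As a rough guide, a minimal sketch of what tasks 2.1 and 2.2 could look like is shown below. The folder path, file name and object name are assumptions based on the instructions above, so adjust them to match your repository; the tidyverse is assumed to have been loaded already (e.g. via integrate.R):

 ## -- example sketch: import raw data & exclude empty columns -- ##
  # point to the data location & file (assumed paths -- adjust as needed)
  data_locale <- "data_raw/"
  data_file   <- "kenya.csv"

  # import the raw kenya data
  percent_cover_kenya <-
    paste0(data_locale, data_file) %>%
    read_csv()

  # identify columns that contain only NA values
  empty_columns <-
    percent_cover_kenya %>%
    purrr::map_lgl(~ all(is.na(.x))) %>%
    which() %>%
    names()

  # exclude the empty columns
  percent_cover_kenya <-
    percent_cover_kenya %>%
    dplyr::select(-all_of(empty_columns))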

Please make sure to copy your code and the output printed to the screen into the homework script, so we can see what your results looked like.

Next we will import the benthic taxa concordances that we will use later on in the data cleaning and “quality assurance” process:

  #  * 2.3. Import benthic data concordances
  #           In the ## 1. Set up section of the `create_percent_cover_kenya.R` script
  #             include code to import the `wio_regional_benthic_taxa.xlsx` spreadsheet
  #             Hint:  use `data_locale <-` and `data_file <-` to identify the
  #               place & file to import
  #           Copy the section of the code and copy it here:
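
A possible sketch of this import is shown below; the folder path and the benthic_taxa object name are assumptions, and the readxl package is assumed to be installed:

 ## -- example sketch: import benthic taxa concordances -- ##
  # point to the concordance location & file (assumed path -- adjust as needed)
  data_locale <- "data_raw/"
  data_file   <- "wio_regional_benthic_taxa.xlsx"

  # import the regional benthic taxa concordance
  benthic_taxa <-
    paste0(data_locale, data_file) %>%
    readxl::read_excel()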

Standardising columns, dates & taxa

For the next task in the Homework, we will work on getting the data into shape: practising formatting columns, standardising taxa, and deriving percent cover measures for our percent_cover_kenya data object.

Participants may have noticed that the import of the kenya.csv file is in a “wide” format, meaning that the individual quadrats are in columns, denoted by the quadrat number:

> percent_cover_kenya
# A tibble: 1,221 x 27
      X1 `#`   DatasetID `data source/ow… Country  Year `Date (YYYY-MM-… Sector Site
   <dbl> <lgl> <lgl>     <lgl>            <chr>   <dbl> <lgl>            <lgl>  <lgl>
 1     1 NA    NA        NA               Kenya    2008 NA               NA     NA
 2     2 NA    NA        NA               Kenya    2008 NA               NA     NA
 3     3 NA    NA        NA               Kenya    2008 NA               NA     NA
 4     4 NA    NA        NA               Kenya    2008 NA               NA     NA
 5     5 NA    NA        NA               Kenya    2008 NA               NA     NA
 6     6 NA    NA        NA               Kenya    2008 NA               NA     NA
 7     7 NA    NA        NA               Kenya    2008 NA               NA     NA
 8     8 NA    NA        NA               Kenya    2008 NA               NA     NA
 9     9 NA    NA        NA               Kenya    2008 NA               NA     NA
10    10 NA    NA        NA               Kenya    2008 NA               NA     NA
# … with 1,211 more rows, and 18 more variables: Station <chr>, Zone <lgl>,
#   Depth (m) <lgl>, Latitude <lgl>, Longitude <lgl>, Transect length (m) <lgl>,
#   Distance (cm) <dbl>, Method <lgl>, Observer <lgl>, Benthic category <lgl>,
#   Benthic code <chr>, 1 <dbl>, 2 <dbl>, 3 <dbl>, 4 <dbl>, 5 <dbl>, 6 <dbl>,
#   7 <dbl>

Recall that the tibble format optimises the output in the console, showing the first few columns and the rest as a list. This is extremely useful when dealing with large objects (e.g. the percent_cover_kenya object is 1,221 rows by 27 columns!).

Use the convenience function quickview() that is created in the integrate.R script to have a look at the first 3 rows of the object. Another useful function for this is glimpse():

> percent_cover_kenya %>% glimpse()
Rows: 1,221
Columns: 27
$ X1                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ `#`                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DatasetID             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `data source/owner`   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Country               <chr> "Kenya", "Kenya", "Kenya", "Kenya", "Kenya", "Kenya",…
$ Year                  <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008,…
$ `Date (YYYY-MM-DD)`   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Sector                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Site                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Station               <chr> "Coral Garden", "Coral Garden", "Coral Garden", "Cora…
$ Zone                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Depth (m)`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Latitude              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Longitude             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Transect length (m)` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Distance (cm)`       <dbl> 32, 52, 60, 96, 100, 107, 154, 170, 170, 178, 196, 20…
$ Method                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Observer              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Benthic category`    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Benthic code`        <chr> "AT", "FA", "HC", "AT", "FA", "FA", "AT", "FA", "HC",…
$ `1`                   <dbl> 0, 0, 60, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
$ `2`                   <dbl> 32, 0, 0, 0, 68, 0, 0, 0, 70, 0, 26, 0, 0, 70, 13, 0,…
$ `3`                   <dbl> 0, 52, 0, 44, 0, 11, 47, 16, 0, 8, 0, 28, 16, 0, 0, 7…
$ `4`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `5`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `6`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `7`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

So, to get the data into the preferred “long” format, we will use the function gather():

 ## -- Get data into "long" format -- ##
  #  * 3.1. As the imported data from `kenya.csv` are in a "wide" format (i.e.
  #           data from individual quadrats are in columns), we need to put
  #           the data into a "long" format (i.e. individual quadrats should
  #           be stacked on one another).
  #         Hint:  Use the `gather()` function to name the "stacked" column and
  #            identify the column as data.  Get additional help by typing
  #            `?gather` in the Console
  #           Copy the section of the code, header output and copy it here:
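
A minimal sketch of this reshaping step is given below; the Quadrat and Distance column names are assumptions, so rename them as you see fit:

 ## -- example sketch: reshape quadrat columns into "long" format -- ##
  # stack the quadrat columns `1` to `7` into a key column (`Quadrat`) and
  #   a value column (`Distance`)
  percent_cover_kenya <-
    percent_cover_kenya %>%
    gather(key = "Quadrat", value = "Distance", `1`:`7`)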

Please don’t forget to copy your code and header output into the Homework script so we can see your result!

The next thing we will look at for formatting the kenya.csv data is the date. Keen observers would have noticed that in this data there is only information on Year, which is not ideal. For GCRMN coral monitoring data, information should include the Date of collection as part of the standard metadata. As we all know, significant events can occur during a year (e.g. cyclones, bleaching events), so it is important to record this information.

For this exercise, we will use today’s month & day (or whenever you are working on the Homework) to create a date using paste0() and as.Date().

 ## -- Create date estimation for kenya data--##
  # Instructions:
  #  * 3.2. As the data from kenya only has the `Year`, create a new column
  #           called `Date` and use the year and today's month and day to
  #           create a date column
  #             Hint:  Use `paste0()` to paste the elements of the date together
  #               and pipe `%>%` to the `as.Date()` function
  #           Copy the section of the code, header output and copy it here:

Participants who wish to explore different formats for dates should have a look at ?strptime, which provides abbreviations for the different formats of days, months, et cetera. These can then be used, for example, as.Date(x, format = "%Y-%b-%d").
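
A minimal sketch of the date construction, assuming the Year column from the raw data and today’s month & day, could look like this:

 ## -- example sketch: build a Date from Year plus today's month & day -- ##
  # the month & day come from whenever the code is run, as the raw data only
  #   record the Year
  percent_cover_kenya <-
    percent_cover_kenya %>%
    mutate(Date = paste0(Year, "-", format(Sys.Date(), "%m-%d")) %>%
                    as.Date())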

Standardising Taxa

The next task for the DFaS Homework is to convert & standardise the Benthic codes. Short codes can be extremely useful when capturing field data and they streamline the data input process. However, it can be difficult to translate these codes across different monitoring programmes and observers, and to ensure consistency through time.

Similar to the Date discussion above, it is important that GCRMN data includes translations for these codes. In this exercise, we will explore a manual translation using the forcats package and a separate concordance file that we will examine below:

  #  * 3.3. Standardising taxa:
  #           Use the function `unique()` to get a list of benthic_codes and
  #             create a column called `Taxa name` to translate the codes to
  #             taxa names
  #               Hint: use `mutate()` and the `fct_recode()` function to
  #                 document your code translations
  #           Copy the section of the code and copy it here:
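
The sketch below shows the general pattern; the code-to-name translations are illustrative only, so check each code against the WIO concordance before using them:

 ## -- example sketch: translate benthic codes to taxa names -- ##
  # list the codes present in the data
  percent_cover_kenya %>%
    pull(`Benthic code`) %>%
    unique()

  # translate codes to names with `fct_recode()` (translations shown are
  #   illustrative -- verify against the WIO concordance)
  percent_cover_kenya <-
    percent_cover_kenya %>%
    mutate(`Taxa name` = `Benthic code` %>%
                           fct_recode("Hard coral"   = "HC",
                                      "Fleshy algae" = "FA",
                                      "Algal turf"   = "AT"))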

The raw kenya.csv data did not include a percent cover measure, so we will calculate this as part of the data standardisation & formatting:

  # Instructions:
  #  * 3.4. Calculate percent cover
  #           As the `kenya.csv` data provide the `Transect length` and the
  #             `Distance` occupied by each `Benthic code`, we need to transform
  #             these data into a measure of % cover.
  #           Hint: use `mutate()` to create a new column `percent_cover`
  #           Copy the section of the code, header output and copy it here:
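
One way the calculation could be sketched, assuming Distance is in centimetres and Transect length (m) is in metres (column names and units are assumptions; check them against your own object):

 ## -- example sketch: derive percent cover -- ##
  # convert transect length to centimetres, then express the distance occupied
  #   by each benthic code as a percentage of the transect
  percent_cover_kenya <-
    percent_cover_kenya %>%
    mutate(percent_cover = Distance / (`Transect length (m)` * 100) * 100)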

One benefit of doing this calculation in R is that it provides clear documentation of how it was calculated, as opposed to calculating it in Excel and then exporting the data to a *.csv (for example).

Linking tables

We will practise our skills in linking data tables by using the regional taxa code concordance file wio_regional_benthic_taxa.xlsx. For this, we will need to identify the columns on which we wish to link the two data tables. As they have different names in the two tables, we will use the rename() function to change one of the names so they match.

The rest is a standard left_join() operation:

  # Instructions:
  #  * 4.1. Verify that your `Taxa names` match the standard codes for WIO
  #           Use `left_join()` to join the object created from your import of
  #             `wio_regional_benthic_taxa.xlsx` to link the tables.
  #           Hint:  Make sure the column names you want to join have the
  #             same name.  The `rename()` function can help in standardising names.
  #           Copy the section of the code and copy it here:

  #  * 4.2. Use `dplyr::select()` to select the original `Benthic code`, your
  #           translation, and the standard WIO codes and list them below:
  #           Hint: use the `distinct()` function to include only the unique
  #             combinations.
  #           Copy the section of the code and copy it here:
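
A sketch of the linking step is shown below. The benthic_taxa object comes from the earlier import, and the taxa_name and wio_taxa_name column names are hypothetical placeholders; substitute whatever columns the concordance spreadsheet actually contains:

 ## -- example sketch: link the data to the WIO regional concordance -- ##
  # standardise the joining column name, then join the tables
  percent_cover_kenya <-
    percent_cover_kenya %>%
    left_join(benthic_taxa %>%
                rename(`Taxa name` = taxa_name),
              by = "Taxa name")

  # review the original codes, your translations & the standard WIO names
  percent_cover_kenya %>%
    dplyr::select(`Benthic code`, `Taxa name`, wio_taxa_name) %>%
    distinct()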

Now things get interesting: we will check the Benthic names that we created above against the WIO regional taxa codes. Use dplyr::select() and distinct() to review the original codes, our translations, the “official” taxa names, and the Grouping that we created.

Participants may want to re-classify their Groupings to make sure they are correct before the next task:

Organising factors

 # Instructions:
  #  * 5.1. Use `fct_recode()` to standardise the `Benthic code` categories
  #           from #4. above into the following groupings:
#     grouping_order <-
#       c("Live coral",
#         "Bleached or dead coral",
#         "Macroalgae",
#         "Crustose algae",
#         "Sessile invertebrates",
#         "Non-living")
  #           Hint:  Use `mutate()` to create a new column with this classification.
  #         Copy the section of the code and copy it here:

  #  * 5.2. Using the `group_by()` function, group the data by `Station`, `Date`,
  #           and the taxonomic grouping classification from above and
  #           calculate the average percent cover
  #         Copy the section of the code, header output and copy it here:
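
The two steps above might be sketched roughly as follows; the taxa names being recoded and the Grouping column name are assumptions, so adapt them to the categories present in your data:

 ## -- example sketch: classify groupings & summarise percent cover -- ##
  # 5.1. classify taxa into the standard groupings (recodings are illustrative;
  #        map every taxon in your data to one of the groupings above)
  percent_cover_kenya <-
    percent_cover_kenya %>%
    mutate(Grouping = `Taxa name` %>%
                        fct_recode("Live coral" = "Hard coral",
                                   "Macroalgae" = "Fleshy algae"))

  # 5.2. average percent cover by station, date & grouping
  percent_cover_kenya %>%
    group_by(Station, Date, Grouping) %>%
    summarise(mean_percent_cover = mean(percent_cover, na.rm = TRUE)) %>%
    ungroup()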

Please don’t forget to copy your code and header output into this script so we can see how your results turned out!

Managing intermediate objects

Our last task is to save the clean, formatted percent_cover_kenya object to an intermediate locale for visualisation, further analysis and reporting:

  # Instructions:
  #  * 6.1. Add a data set identifier to your `percent_cover_kenya` data object
  #           Hint: create a character string object called `dataset_id` and
  #             use `mutate()` to create a new column with the Dataset id

  #  * 6.2. Save the `percent_cover_kenya` as an intermediate `*.rda` object to
  #           use in future exercises.
  #         Hint:  point to a `save_locale <-` in the `intermediate_data` folder and
  #           provide a file name for the object using `paste0()`
  #           Copy the section of the code and copy it here:
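
A sketch of these final steps is below; the dataset_id value and the save_locale path are assumptions, so set them to whatever identifier and folder your project uses:

 ## -- example sketch: tag the data set & save an intermediate object -- ##
  # create a data set identifier (value shown is an assumption)
  dataset_id <- "kenya_percent_cover"

  # add the identifier as a column
  percent_cover_kenya <-
    percent_cover_kenya %>%
    mutate(DatasetID = dataset_id)

  # point to the save location & save as an *.rda object
  save_locale <- "intermediate_data/"
  save(percent_cover_kenya,
       file = paste0(save_locale, "percent_cover_kenya.rda"))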

Submit homework for evaluation

We then need to use the git commands to add, commit, pull & push to submit our homework for evaluation:

  # Instructions:
  #  * 8.1. In Gitbash or Git interface with RStudio:
  #           git add -A  ## -- this adds your work to the staging area -- ##
  #           git status  ## -- this verifies local changes in staging area -- ##
  #           git commit -m 'submitting homework for data formatting & standardisation'
  #           git pull    ## -- this ensures your local copy is up-to-date -- ##
  #           git push    ## -- this uploads your changes to github -- ##

A greater mastery of these skills will be necessary for next week, as we will start creating our own individual scripts and working with more complicated data sets and more sophisticated data wrangling tasks!