This section is a work in progress. We hope to add a series of tutorials for how to access and analyze SAEON datasets and other useful resources, mostly using the R programming language. We’ve also included a section on R basics to help you get familiar with the language. Note that most tutorials will include an overview of the data and details for accessing them, so we hope they will still be useful even if you don’t use R.
This set of tutorials includes:
If you’ve opened this post, you’re clearly interested in learning R and probably don’t need much convincing, but I’ll provide a few quick reminders just to keep your morale up. If your morale is pretty good, feel free to jump to the sections where I explain How best to learn R (including installation), and Where to find good resources.
As an aside, you may be interested to know that Lai et al. 2019 report that “The number of […ecological…] studies reported using R as their primary tool in data analysis increased linearly from 11.4% in 2008 to 58.0% in 2017.”
While there are many, many, many good reasons to learn R, here are my Top 5:
R is rapidly becoming the software of choice for data analysis in academia and data science because it has many packages/libraries containing functions that cover just about anything you want to do – databasing, stats, GIS, visualization, etc. – all in one environment! There’s no need to go through the pain of moving data between different software packages anymore! Do it all in one seamless workflow in R!!!
There is also a huge user community and online resources providing tutorials, help fora, code, etc – see Where below for a start.
Because you write scripts when doing analyses in R, if you need to rerun your analysis, you need only rerun your script (as opposed to remembering what you clicked in which dropdown menu in what version of what programme etc.)
Scripts are also a MAJOR help if you need to run the same kind of analysis multiple times – you can just borrow from your old code! Similarly, if you want to perform the same analysis on a large number of datasets or data objects, you can set the code to loop through them automatically – as opposed to “click here, click there, click here again, and click again… repeat X 1000…”
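For example, here’s a minimal sketch of such a loop (the .csv file pattern is hypothetical – adapt it to your own files):

for (f in list.files(pattern = "\\.csv$")) { #loop over all .csv files in the working directory
  dat <- read.csv(f) #read in the next dataset
  #...run the same analysis on each dataset here...
}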
Your script also provides a useful “recipe” that helps you remember and explain the methods of the analysis you did! (I decided to italicize this because it should really be a separate point).
I should warn that there are a few extra steps you have to take to make sure your analysis stays repeatable (e.g. because of updates to R functions), and can be run on any computer, but this is easy to learn. Either way, you should learn it, because all science should be reproducible, firstly to clearly show your method and evidence of the correctness of your results, and secondly to enable others to make use of them.
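A good first step (one of several options – packages like renv take this much further) is simply recording the versions of everything you used:

sessionInfo() #prints your R version, operating system and the versions of all loaded packages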
Lastly, reproducibility is likely to be the way of the future… One of the big criticisms of peer-reviewed science is that it is too slow and not responsive to societal needs. This can change. If the question is clearly defined and straightforward, the method you apply is peer-reviewed and accepted, and the results do not need in-depth interpretation (e.g. the answer is simply “Yes/No”), the turn-around time to publication can be greatly reduced.
(and I thank Adam Wilson for these points – although he may have someone else to thank)
a) Learning a programming language can help you learn how to think logically.
A [wo]man who does not know a foreign language is ignorant of his [their] own. – Johann Wolfgang von Goethe (1749 – 1832)
b) Programming gives you access to more computer power.
The computer is incredibly fast, accurate, and stupid. [Wo]Man is unbelievably slow, inaccurate, and brilliant. The marriage of the two is a force beyond calculation. – Leo Cherne
c) I’d also highlight that once you understand one programming language, picking up another is much, much easier should you wish/need to do so.
(i.e. open source)
SPSS costs $99.00 USD per user per month. ArcGIS???!!! – Don’t even go there…
R is also platform independent (i.e. operates on Windows, Mac, Linux, …)
The initial learning curve for any programming language can be incredibly steep. I have heard that R is less steep than many, but that’s not likely to make you feel any better when you’ve been banging your head against a wall for three days and still haven’t managed to get your analysis to run… (Yes, this WILL happen. Many, many times unfortunately…).
The only emotion as strong as the frustration you’ll be feeling after three days of banging your head is the sheer ecstasy of getting your code to work and seeing the outputs of your analysis! All of a sudden it will all be worth it!
The story I relate when people ask me if learning R is worth it is the tale of my first paper (published in 2006, so not THAT long ago…). All analyses in the paper are based on a dated molecular phylogeny of the beautiful and charismatic sedge genus Tetraria. Of course, one of the journal’s reviewers recommended we use a different method to date the phylogeny, meaning we had to rerun every single analysis in the manuscript. While we knew that this would make no difference to the results whatsoever (and were proven to be correct of course), you cannot just ignore reviewers’ comments if you want to get your science published. Despite knowing exactly what needed to happen step-by-step, it took me 3 weeks to repeat all analyses in the paper – using 9 different software packages across Windows, Mac and Linux environments… A few years later, and two years after I started to learn R (NOTE: I’m still learning R, you can never say “I’ve learnt R”), I spent 3 hours writing an R function that does the entire analysis in under 5 minutes!!!
The point of the story is that while you may know how to do the analysis you want to do in Excel/SPSS/ArcGIS/Statistica/JMP/(insert just about any software here) in a few hours, and it may take you a day or three to work out how to in R, sooner or later YOU WILL ALWAYS HAVE TO REPEAT YOUR ANALYSIS – usually quite a few times too…
THE FIRST LAW OF LEARNING R is accepting that the first time you try to do anything it may take you a little longer to get it to work in R than whatever you’re used to. The beauty is that you can do this safe in the knowledge that you’re saving yourself a lot of time further down the line.
THE SECOND LAW OF LEARNING R is having a dataset and a tricky problem to solve or analysis to perform. This is usually where the benefits of R’s flexibility (or existing code libraries) are clear, and one is motivated to spend the time pounding away at it.
THE THIRD LAW OF LEARNING R is community!!! Fortunately, there are a large number of R addicts out there who can provide ALL of the answers you need – BUT you need to work out how to ask Google or the online forum the right question (always check the forum rules before you add a post!). You should also find any useR groups in your neighbourhood (or start one!). Sometimes it is very difficult to clarify your question, and having a human to help work it out is a huge help!
THE FOURTH LAW OF LEARNING R is to read about how R works. If pounding away at the code isn’t working, it’s time to get a better idea of how R (or the package or function you’re trying to use) works and how to tell it to do what you want it to do. R will always do what you tell it to do – making sure that what you tell it to do is what you actually want is the difficult bit. (NOTE: This also highlights the need to “reality check” that the output you get from R makes sense.)
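The quickest places to read are R’s built-in help pages, e.g.:

?sum #help page for a function
help(package = "dplyr") #overview of everything in a package
vignette("dplyr") #long-form worked examples, where a package provides them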
THE FIFTH LAW OF LEARNING R is to never give up!!!
Here’s a quick list of resources to help get you started learning R.
I provide a quick guide to Installing R in my primer on Handling Spatial Data in R, which I’ll come back to lower down.
R for Data Science is a fantastic resource for learning how to program in R, including visualization and communication, with a focus on a group of relatively new and very powerful R packages collectively known as the tidyverse. The only drawback for scientists is that it doesn’t delve into statistics and modelling very much, although it does point you to these great resources.
A very thorough R resource, and great tool for learning statistics at the same time, is The R Book by Michael J. Crawley. Google it to find a copy. While it is a bit more old school and doesn’t include the latest and greatest tidyverse packages, in my opinion the first couple of chapters are a must read, giving you a great intro to how R works with objects and different data formats etc that will stand you in good stead in the long run. Do yourself a favour and take the time to work through them!
There’s also a large number of other books to help get you started in R from introductory material through to methods in specific fields of research here, here and here.
If books are too much for you, there are cheat sheets to help get you started with various packages. Googling will turn up more too.
There are also online R courses run by groups like swirl, DataCamp or Coursera, but I feel you’re not likely to remember much if you don’t have your own data/problem to solve.
And various different sources all over the internet:
I’ll reiterate. Community is a big word in R.
While I wouldn’t recommend starting with trying to use R as a geographical information system (GIS), once you have a bit of experience under your belt there are a growing number of great resources for this. Have a look at my primers on Handling Spatial Data in R and the resources therein to get you started.
This tutorial relies on data provided as supplementary material for the paper Slingsby et al. 2017. Intensifying postfire weather and biological invasion drive species loss in a Mediterranean-type biodiversity hotspot. PNAS. http://dx.doi.org/10.1073/pnas.1619014114. The data can be downloaded here. It’s a ~13MB .xlsx file.
The study presents the results of 44 years of observation of permanent vegetation plots in the Cape of Good Hope Section of Table Mountain National Park, South Africa. The plots are dubbed “The Taylor Plots” in honour of Hugh Taylor, who established them in 1966. We’ll describe the details of the data as we work through the different tutorials, but for this first one we will only use the rainfall and temperature data for the reserve.
R can do basic calculations
1 + 1
## [1] 2
But to do more complex calculations, it’s easier to work with objects, which you assign, e.g.
x <- 1
x
## [1] 1
You can assign objects from other objects – usually once you’ve performed an operation on them
y <- x + 1
y
## [1] 2
You can perform operations with multiple objects
y + x
## [1] 3
Combine objects
z <- c(x, y)
z
## [1] 1 2
Or apply functions to objects
sum(z)
## [1] 3
You can also access elements within an object using indexing with square brackets
z[1] #the first element in vector z
## [1] 1
z[2] #the second element in vector z
## [1] 2
Or evaluate them using logical expressions
z == 1
## [1] TRUE FALSE
z < 1
## [1] FALSE FALSE
z > 1
## [1] FALSE TRUE
Note that you can reassign (or overwrite) objects, so you need to use unique object names if you want to keep them
x <- sum(z)
x
## [1] 3
R keeps these objects in memory
ls() #function to call the list of objects in memory - and "#" lets you put comments like this in your code...
## [1] "x" "y" "z"
This is useful, but if the objects are large, it will slow down the speed of calculations, so it is good practice to discard (large) objects once you’re done with them (or only read them in when you need to)
rm(x, y, z)
ls()
## character(0)
Now let’s explore real data!
First we’ll call the packages/libraries that contain the functions we need
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
Now we need some data.
When reading data from a file into R you need to know the system path to the file on your computer, e.g. "/home/jasper/GIT/academic-kickstart/static/datasets".
Typing the whole thing out (or even copying and pasting) can get tedious, so it’s often better to set your working directory so R knows where to look for files, like so.
setwd("/home/jasper/GIT/academic-kickstart/static/datasets")
If you are trying to run this code on your own, you’ll get an error when trying to set your working directory to a folder on my computer 🙂 – you need to change the line of code above to the folder where you put the data. If you’re on Windows, you’ll need to specify the root drive, e.g. C:. You also need to make sure to use single forwardslashes / or double backslashes \\, rather than the single backslash \ that is the default on Windows.
So my file path on windows would look like:
"C:/jasper/GIT/academic-kickstart/static/datasets"
or
"C:\\jasper\\GIT\\academic-kickstart\\static\\datasets"
Or on Mac it would be:
"/Users/jasper/GIT/academic-kickstart/static/datasets"
Lastly, if you’ve tried all that and it’s still not working, try adding (or removing) slash(es) at the end.
We can check if we got it right using
getwd()
## [1] "/home/jasper/GIT/academic-kickstart/static/datasets"
Ok, once all that’s done, we can use R to have a look at what files are in that directory
list.files()
## [1] "pnas.1619014114.sd01.xlsx" "SAEON_field instrument list.xlsx"
And in this case we want to read data from pnas.1619014114.sd01.xlsx, but given that this is an Excel workbook, it may have multiple separate worksheets of data. Let’s have a quick glimpse the lazy way
excel_sheets("pnas.1619014114.sd01.xlsx")
## [1] "METADATA" "weather" "fires"
## [4] "postfireweather" "enviroment" "excluded_spp"
## [7] "veg1966" "veg1996" "veg2010"
## [10] "traits" "speciesclimatedata"
There are 11 different sheets. Let’s start with the weather
weather <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "weather")
## New names:
## * `` -> ...1
#Let's have a look at the first few lines
head(weather)
## # A tibble: 6 x 6
## ...1 Date Temp Rain Month Year
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 1 1961-01-02 00:00:00 21 0 1 1961
## 2 2 1961-01-03 00:00:00 20.6 0 1 1961
## 3 3 1961-01-04 00:00:00 15 3.5 1 1961
## 4 4 1961-01-05 00:00:00 21 0 1 1961
## 5 5 1961-01-06 00:00:00 19.8 0.8 1 1961
## 6 6 1961-01-07 00:00:00 22 0 1 1961
or last few lines
tail(weather)
## # A tibble: 6 x 6
## ...1 Date Temp Rain Month Year
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 17474 2009-11-25 00:00:00 15 0 11 2009
## 2 17475 2009-11-26 00:00:00 18.9 0 11 2009
## 3 17476 2009-11-27 00:00:00 16.9 0 11 2009
## 4 17477 2009-11-28 00:00:00 22.2 0 11 2009
## 5 17478 2009-11-29 00:00:00 19.2 0 11 2009
## 6 17479 2009-11-30 00:00:00 17.3 0 11 2009
There are 17479 rows of data (you can check with nrow(weather)), so it’s a bit tricky to inspect it all! Fortunately we can call a summary like so
summary(weather)
## ...1 Date Temp
## Min. : 1 Min. :1961-01-02 00:00:00 Min. : 6.30
## 1st Qu.: 4371 1st Qu.:1973-12-19 18:00:00 1st Qu.:15.00
## Median : 8740 Median :1985-12-09 12:00:00 Median :17.50
## Mean : 8740 Mean :1985-12-05 08:31:36 Mean :17.62
## 3rd Qu.:13109 3rd Qu.:1997-12-07 06:00:00 3rd Qu.:20.00
## Max. :17479 Max. :2009-11-30 00:00:00 Max. :35.30
## Rain Month Year
## Min. : 0.0000 Min. : 1.000 Min. :1961
## 1st Qu.: 0.0000 1st Qu.: 4.000 1st Qu.:1973
## Median : 0.0000 Median : 7.000 Median :1985
## Mean : 0.9914 Mean : 6.514 Mean :1985
## 3rd Qu.: 0.2000 3rd Qu.:10.000 3rd Qu.:1997
## Max. :97.0000 Max. :12.000 Max. :2009
Nice, but this doesn’t help us see trends etc. It’s difficult to eyeball daily rainfall, so let’s calculate the annual totals.
Here we use the group_by() and summarise() functions provided by library(dplyr) to calculate the annual rainfall totals using sum().
annualrain <- weather %>% group_by(Year) %>% summarise(Rainfall = sum(Rain))
head(annualrain) #to see what we get
## # A tibble: 6 x 2
## Year Rainfall
## <dbl> <dbl>
## 1 1961 382
## 2 1963 304.
## 3 1964 418.
## 4 1965 403.
## 5 1966 389
## 6 1967 371.
If you were to translate this line of code into English it would say: “Take the weather dataframe, and for each Year, sum the total rainfall across all observations (days in this case).”
library(dplyr) and other tidyverse packages contain a series of functions that can be thought of as verbs, providing the grammar for you to develop code statements (see the example below this list), e.g.:
arrange()
filter()
mutate()
slice()
summarise()
group_by()
(Pro tip: Do yourself a favour and apply the fourth law here!)
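For example, here’s a quick illustrative “sentence” built from a few of these verbs, counting the number of rain days per year since 2000 (don’t worry about the %>% symbol for now – it’s explained just below):

weather %>%
  filter(Year > 2000) %>% #keep only the recent years
  mutate(Wet = Rain > 0) %>% #add a logical column flagging days with rain
  group_by(Year) %>%
  summarise(RainDays = sum(Wet)) %>% #count the rain days in each year
  arrange(desc(RainDays)) #sort the wettest year to the top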
Binding the verbs together into code sentences is made possible by the pipe %>% function, which essentially allows you to pass the output of one function (or verb) to the next without making a new object in memory each time. I.e. without %>% this code would have to be written
step1 <- group_by(weather, Year)
step2 <- summarise(step1, Rainfall = sum(Rain))
head(step2)
## # A tibble: 6 x 2
## Year Rainfall
## <dbl> <dbl>
## 1 1961 382
## 2 1963 304.
## 3 1964 418.
## 4 1965 403.
## 5 1966 389
## 6 1967 371.
Which creates unwanted objects like step1 and step2 that just waste memory, so let’s get rid of them
rm(step1, step2)
So annual totals are easier to interpret than daily data, but still not as good as a graph…
library(ggplot2)
To plot the data using the functions in library(ggplot2) we first describe the data, telling ggplot what data object we want to use and which variables should be our x and y axes, like so.
r <- ggplot(annualrain, aes(x = Year, y = Rainfall))
Then we can play with different plot types
r + geom_line() #A line graph - as easy as that!!!
I find time-series are often easier to look at if you shade one side, like so
r + geom_area()
Which is obviously easier to interpret than a scatterplot of points
r + geom_point()
Unless you add a trend line, like a loess smoother
r + geom_point() + geom_smooth(method = 'loess')
## `geom_smooth()` using formula 'y ~ x'
So there have been spells of wetter and drier years, but no clear trend. What about if we wanted to see whether there have been any changes in seasonality, i.e. do particular months show patterns of getting wetter or drier?
Summarising the data by month is very easy – all we need to do is add Month to the group_by() statement in the code we used to get the annual totals.
monthlyrain <- weather %>% group_by(Year, Month) %>% summarise(Rainfall = sum(Rain))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
head(monthlyrain)
## # A tibble: 6 x 3
## # Groups: Year [1]
## Year Month Rainfall
## <dbl> <dbl> <dbl>
## 1 1961 1 44.7
## 2 1961 2 5.7
## 3 1961 3 11.9
## 4 1961 4 9.4
## 5 1961 5 45.2
## 6 1961 6 86.4
The only issue now is that it is difficult to interpret the table directly, and we have 3 variables, so it’s not as simple to plot.
Fortunately, library(ggplot2) has the functions facet_grid() and facet_wrap() that make it very easy to make multi-panel plots. Here we use the same code as the line plot above, but apply it to the object monthlyrain and add facet_grid(), indicating that the rows of the multi-panel plot should be the months.
r <- ggplot(monthlyrain, aes(x = Year, y = Rainfall))
r + geom_line() + facet_grid(rows = vars(Month))
Nice! We can clearly see that months 4 to 8 (April to August) are the wetter months in this region, but it’s still difficult to tell which years were wetter or drier relative to the long-term average. To do this we need to calculate the anomalies (i.e. the differences from the mean).
Getting the mean for each month is very easy to do by adapting our code above and applying it to the monthlyrain dataframe we have already calculated. We simply change the function passed to summarise() from sum() to mean(), and remove Year from the group_by() statement.
I.e. we’re saying: “Take the monthlyrain dataframe, and for each Month, calculate the average rainfall across all years.” Like so
monthlymean <- monthlyrain %>% group_by(Month) %>% summarise(MeanMonthlyRainfall = mean(Rainfall))
head(monthlymean)
## # A tibble: 6 x 2
## Month MeanMonthlyRainfall
## <dbl> <dbl>
## 1 1 11.4
## 2 2 13.7
## 3 3 13.4
## 4 4 32.6
## 5 5 47.6
## 6 6 58.7
Ok, but now we need to match the MeanMonthlyRainfall with each month in monthlyrain. This is very easy with the function merge().
monthlyrain <- merge(monthlyrain, monthlymean)
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall
## 1 1 1961 44.7 11.425
## 2 1 1999 5.2 11.425
## 3 1 1989 6.4 11.425
## 4 1 1979 14.8 11.425
## 5 1 2006 0.7 11.425
## 6 1 1996 2.3 11.425
Ok, now we want to calculate the positive and negative differences from the mean. For plotting purposes we’ll do this as separate Positive and Negative columns. R makes this easy by allowing simple mathematical operations on columns, and by letting us access or assign new columns using the $ operator, like so.
monthlyrain$Positive <- monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall Positive
## 1 1 1961 44.7 11.425 33.275
## 2 1 1999 5.2 11.425 -6.225
## 3 1 1989 6.4 11.425 -5.025
## 4 1 1979 14.8 11.425 3.375
## 5 1 2006 0.7 11.425 -10.725
## 6 1 1996 2.3 11.425 -9.125
Now we have negative values in our Positive column, but that’s ok, because we can set them to zero using a logical operator that detects negative values, like so:
monthlyrain$Positive[which(monthlyrain$Positive < 0)] <- 0
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall Positive
## 1 1 1961 44.7 11.425 33.275
## 2 1 1999 5.2 11.425 0.000
## 3 1 1989 6.4 11.425 0.000
## 4 1 1979 14.8 11.425 3.375
## 5 1 2006 0.7 11.425 0.000
## 6 1 1996 2.3 11.425 0.000
Ok, there were a few steps that went into that. Let’s break it down:
- the logical operator < returns TRUE or FALSE for whether the values in monthlyrain$Positive are less than zero or not;
- the which() function translates all the values that are TRUE to their position in the vector/column monthlyrain$Positive;
- the square brackets [ ] index those positions in monthlyrain$Positive; and
- <- 0 assigns zero to those positions.
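As an aside, base R’s pmax() (elementwise maximum) would get you the same result in a single step:

monthlyrain$Positive <- pmax(monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall, 0) #negative anomalies become 0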
Now we do the same for the Negative values:
monthlyrain$Negative <- monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall
monthlyrain$Negative[which(monthlyrain$Negative > 0)] <- 0
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall Positive Negative
## 1 1 1961 44.7 11.425 33.275 0.000
## 2 1 1999 5.2 11.425 0.000 -6.225
## 3 1 1989 6.4 11.425 0.000 -5.025
## 4 1 1979 14.8 11.425 3.375 0.000
## 5 1 2006 0.7 11.425 0.000 -10.725
## 6 1 1996 2.3 11.425 0.000 -9.125
And we’re ready to start plotting. I’ll build this up piece by piece. First we’ll set up the plot and add the positive anomaly as a “ribbon” plot
g <- ggplot(monthlyrain, aes(x = Year))
g <- g + geom_ribbon(aes(ymin = 0, ymax = Positive, fill="Positive"))
g
Now we’ll add the negative anomaly ribbon
g <- g + geom_ribbon(aes(ymin = Negative, ymax = 0 ,fill="Negative"))
g
Next we can set colours to make sure they make sense – e.g. positive (wetter) anomalies should be blue and negative (drier) should be red.
g <- g + scale_fill_manual(name="",values=c("#D6604D", "#4393C3"))
g
Now we can split it into a multipanel by month using facet_grid()
g <- g + facet_grid(cols = vars(Month))
g
And add a title, meaningful y-axis label and flip the axes so we can read the graph properly
g <- g + ggtitle("Monthly Rainfall Anomaly (1961-2010)") + coord_flip() + ylab("Rainfall (mm)")
g
Not much to observe other than the positive anomalies are typically bigger than the negative anomalies. This is not surprising, because the negative anomalies are bounded by zero (i.e. no rain is the driest it can get), while there are no constraints on the positive anomaly. For example, the wettest day on record had 97 mm.
Now see if you can adapt the code above to develop the same figure using the Temp data in the object weather. Note that you’ll need to calculate monthly means rather than sums, and it’s better to swap the colours so that red is the positive anomaly (hotter) and blue the negative (cooler).
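If you get stuck, here’s one possible solution – a sketch that simply reuses the rainfall code above (the object names monthlytemp and meantemp are my own):

monthlytemp <- weather %>% group_by(Year, Month) %>% summarise(Temperature = mean(Temp))
meantemp <- monthlytemp %>% group_by(Month) %>% summarise(MeanMonthlyTemp = mean(Temperature))
monthlytemp <- merge(monthlytemp, meantemp)
monthlytemp$Positive <- monthlytemp$Temperature - monthlytemp$MeanMonthlyTemp
monthlytemp$Positive[which(monthlytemp$Positive < 0)] <- 0
monthlytemp$Negative <- monthlytemp$Temperature - monthlytemp$MeanMonthlyTemp
monthlytemp$Negative[which(monthlytemp$Negative > 0)] <- 0

t <- ggplot(monthlytemp, aes(x = Year)) +
  geom_ribbon(aes(ymin = 0, ymax = Positive, fill = "Positive")) +
  geom_ribbon(aes(ymin = Negative, ymax = 0, fill = "Negative")) +
  scale_fill_manual(name = "", values = c("#4393C3", "#D6604D")) + #blue for Negative (cooler), red for Positive (hotter)
  facet_grid(cols = vars(Month)) +
  ggtitle("Monthly Temperature Anomaly (1961-2010)") + coord_flip() + ylab("Temperature (°C)")
t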
It should look like this:
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
This looks a bit more interesting, with fewer cool anomalies and more hot anomalies as we head towards the present. It turns out the temperature has risen by ~1.2 degrees over the observation period.
The previous tutorial covered a bunch of the basics with some pretty straightforward data, and was more focused on how R works. Now we’ll be looking at slightly more complicated data (species data), which will require learning some fancier data handling.
R has a ridiculous number of libraries and functions for analyzing environmental and biodiversity data – check out the Environmetrics Task View for a primer on a small sample. The issue is that many of them have their own specific data format requirements, requiring an inordinate amount of data wrangling. My aim for this session is to walk you through an example dataset, sharing a few tidyverse and base R data wrangling tricks, and highlighting some common perils and pitfalls when analysing patterns and change in biotic community survey data.
First, let’s get the R packages we need and set our working directory
library(tidyverse)
library(readxl)
library(vegan) #comment out with "#" if you don't have this installed - or run install.packages("vegan")
setwd("/home/jasper/GIT/academic-kickstart/static/datasets/")
#Change to your local path to the folder where you put the dataset
We’re going to use the same dataset from Slingsby et al. 2017 that we used in the previous tutorial. The data can be downloaded here. It’s a ~10MB .xlsx file.
This is a vegetation survey of 81 plots, each 5 by 10 m, from across the Cape of Good Hope section of Table Mountain National Park. For more information check out the worksheet labeled METADATA and/or read the paper.
Let’s look at the sheets in our Excel workbook.
excel_sheets("pnas.1619014114.sd01.xlsx")
## [1] "METADATA" "weather" "fires"
## [4] "postfireweather" "enviroment" "excluded_spp"
## [7] "veg1966" "veg1996" "veg2010"
## [10] "traits" "speciesclimatedata"
There are quite a few sheets in the workbook, but the datasets most commonly used in analyses of biotic communities are three separate data objects (usually matrices, but see below) that include:
- a community matrix of sites by species (abundances, counts or presence/absence);
- a matrix of sites by environmental variables; and
- a matrix of species by traits.
We have enviroment and traits, which are the second and third matrices, and in this case we have three community matrices (veg1966, veg1996 and veg2010), because the sites were first surveyed in 1966 and then resurveyed in 1996 and 2010.
Note that if you ever want to see what data are in a sheet, but don’t want to open the file in Excel, you can easily have a glimpse at them by just selecting a small range in read_excel(), like so
read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg1966", range = "A1:K5")
## New names:
## * `` -> ...1
## # A tibble: 4 x 11
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## # … with 6 more variables: `Agathosma hookeri` <dbl>, `Agathosma
## # imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>
read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg2010", range = "A1:K5")
## New names:
## * `` -> ...1
## # A tibble: 4 x 11
## ...1 `Acacia saligna` `Adenandra unif… `Adenandra vill… `Agapanthus wal…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 0 3 0
## 2 CP_10 0 0 0 0
## 3 CP_12 0 0 0 0
## 4 CP_13 0 0 0 0
## # … with 6 more variables: `Agathosma ciliaris` <dbl>, `Agathosma
## # hookeri` <dbl>, `Agathosma imbricata` <dbl>, `Agathosma serpyllacea` <dbl>,
## # `Albuca flaccida` <dbl>, `Albuca juncifolia` <dbl>
In this case the veg1966 and veg1996 data were Braun-Blanquet cover-abundance classes, which I have converted to abundance using the midpoint of each class. THIS IS NOT NECESSARILY DEFENSIBLE!!! but it’s fine for the purposes of this tutorial. The veg2010 dataset is actual counts of individuals.
Let’s read them in and have a closer look.
species66 <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg1966")
## New names:
## * `` -> ...1
head(species66)
## # A tibble: 6 x 428
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## 5 CP_13 0 0 0 0
## 6 CP_14 0 0 0 0
## # … with 423 more variables: `Agathosma hookeri` <dbl>, `Agathosma
## # imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus laxiflorus` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotis aspera` <dbl>, `Argyrolobium filiforme` <dbl>,
## # `Aristea africana` <dbl>, `Aristea capitata` <dbl>, `Aristea glauca` <dbl>,
## # `Aspalathus abietina` <dbl>, `Aspalathus argyrella` <dbl>, `Aspalathus
## # callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus carnosa` <dbl>,
## # `Aspalathus chenopoda` <dbl>, `Aspalathus divaricata` <dbl>, `Aspalathus
## # ericifolia` <dbl>, `Aspalathus hispida` <dbl>, `Aspalathus
## # laricifolia` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # sericea` <dbl>, `Aspalathus serpens` <dbl>, `Asparagus aethiopicus` <dbl>,
## # `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>, `Asparagus
## # rubicundus` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia filiformis` <dbl>, `Bobartia
## # gladiata` <dbl>, `Bobartia indica` <dbl>, `Bolusafra bituminosa` <dbl>,
## # `Bulbine abyssinica` <dbl>, `Caesia contorta` <dbl>, `Capelio
## # tabularis` <dbl>, `Capeobolus brevicaulis` <dbl>, `Capeochloa
## # cincta` <dbl>, `Carpacoce spermacocea` <dbl>, `Carpacoce
## # vaginellata` <dbl>, `Carpobrotus acinaciformis` <dbl>, `Cassine
## # peragua` <dbl>, `Cassytha ciliolata` <dbl>, `Centella affinis` <dbl>,
## # `Centella glabrata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chironia baccifera` <dbl>, `Chironia decumbens` <dbl>,
## # `Chironia linoides` <dbl>, `Chrysocoma coma-aurea` <dbl>, `Cineraria
## # geifolia` <dbl>, `Cliffortia atrata` <dbl>, `Cliffortia falcata` <dbl>,
## # `Cliffortia ferruginea` <dbl>, `Cliffortia filifolia` <dbl>, `Cliffortia
## # glauca` <dbl>, `Cliffortia obcordata` <dbl>, `Cliffortia
## # polygonifolia` <dbl>, `Cliffortia ruscifolia` <dbl>, `Cliffortia
## # stricta` <dbl>, `Cliffortia subsetacea` <dbl>, `Clutia alaternoides` <dbl>,
## # `Clutia ericoides` <dbl>, `Clutia polygonoides` <dbl>, `Coleonema
## # album` <dbl>, `Conyza pinnatifida` <dbl>, `Corymbium africanum` <dbl>,
## # `Corymbium glabrum` <dbl>, `Crassula capensis` <dbl>, `Crassula
## # coccinea` <dbl>, `Crassula cymosa` <dbl>, `Crassula fascicularis` <dbl>,
## # `Crassula flava` <dbl>, `Crassula nudicaulis` <dbl>, `Crassula
## # subulata` <dbl>, `Cullumia squarrosa` <dbl>, `Cussonia thyrsiflora` <dbl>,
## # `Cymbopogon marginatus` <dbl>, `Cynanchum africanum` <dbl>, `Cynanchum
## # obtusifolium` <dbl>, `Cynodon dactylon` <dbl>, `Cyperus thunbergii` <dbl>,
## # `Diastella divaricata` <dbl>, `Dilatris corymbosa` <dbl>, `Dilatris
## # pillansii` <dbl>, `Dimorphotheca nudicaulis` <dbl>, `Diosma hirsuta` <dbl>,
## # `Diosma oppositifolia` <dbl>, …
species10 <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg2010")
## New names:
## * `` -> ...1
head(species10)
## # A tibble: 6 x 339
## ...1 `Acacia saligna` `Adenandra unif… `Adenandra vill… `Agapanthus wal…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 0 3 0
## 2 CP_10 0 0 0 0
## 3 CP_12 0 0 0 0
## 4 CP_13 0 0 0 0
## 5 CP_14 0 0 0 0
## 6 CP_15 0 0 0 0
## # … with 334 more variables: `Agathosma ciliaris` <dbl>, `Agathosma
## # hookeri` <dbl>, `Agathosma imbricata` <dbl>, `Agathosma serpyllacea` <dbl>,
## # `Albuca flaccida` <dbl>, `Albuca juncifolia` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus capensis` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotheca calendula` <dbl>, `Aristea africana` <dbl>,
## # `Aspalathus callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus
## # carnosa` <dbl>, `Aspalathus chenopoda` <dbl>, `Aspalathus ciliaris` <dbl>,
## # `Aspalathus hispida` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # serpens` <dbl>, `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>,
## # `Asparagus rubicundus` <dbl>, `Babiana ambigua` <dbl>, `Babiana
## # villosula` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia gladiata` <dbl>, `Bobartia
## # indica` <dbl>, `Bolusafra bituminosa` <dbl>, `Bulbine abyssinica` <dbl>,
## # `Bulbine alooides` <dbl>, `Bulbine lagopus` <dbl>, `Caesia contorta` <dbl>,
## # `Capelio tabularis` <dbl>, `Capeochloa cincta` <dbl>, `Carpobrotus
## # acinaciformis` <dbl>, `Carpobrotus edulis` <dbl>, `Cassine peragua` <dbl>,
## # `Cassytha ciliolata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chenolea diffusa` <dbl>, `Chironia baccifera` <dbl>,
## # `Chironia decumbens` <dbl>, `Chironia linoides` <dbl>, `Cliffortia
## # atrata` <dbl>, `Cliffortia falcata` <dbl>, `Cliffortia ferruginea` <dbl>,
## # `Cliffortia obcordata` <dbl>, `Cliffortia stricta` <dbl>, `Cliffortia
## # subsetacea` <dbl>, `Clutia alaternoides` <dbl>, `Clutia ericoides` <dbl>,
## # `Coleonema album` <dbl>, `Colpoon compressum` <dbl>, `Corymbium
## # africanum` <dbl>, `Corymbium glabrum` <dbl>, `Cotula coronopifolia` <dbl>,
## # `Cotula turbinata` <dbl>, `Crassula fascicularis` <dbl>, `Crassula
## # nudicaulis` <dbl>, `Cymbopogon marginatus` <dbl>, `Cynodon dactylon` <dbl>,
## # `Dasispermum hispidum` <dbl>, `Diastella divaricata` <dbl>, `Dilatris
## # pillansii` <dbl>, `Diosma hirsuta` <dbl>, `Diosma oppositifolia` <dbl>,
## # `Diospyros glabra` <dbl>, `Disa bracteata` <dbl>, `Disa obliqua` <dbl>,
## # `Disa purpurascens` <dbl>, `Disparago tortilis` <dbl>, `Disperis
## # capensis` <dbl>, `Drimia capensis` <dbl>, `Drosanthemum candens` <dbl>,
## # `Drosera aliciae` <dbl>, `Drosera trinervia` <dbl>, `Edmondia
## # sesamoides` <dbl>, `Ehrharta calycina` <dbl>, `Ehrharta ramosa` <dbl>,
## # `Elegia cuspidata` <dbl>, `Elegia filacea` <dbl>, `Elegia juncea` <dbl>,
## # `Elegia microcarpa` <dbl>, `Elegia nuda` <dbl>, `Elegia persistens` <dbl>,
## # `Elegia stipularis` <dbl>, `Elegia vaginulata` <dbl>, `Elytropappus
## # scaber` <dbl>, `Eragrostis capensis` <dbl>, `Erepsia anceps` <dbl>, …
Yup, that looks about right.
We’re going to start playing with the 1966 data in a second, but first, what if your data aren’t in a species by site matrix?
Often biodiversity data are stored in “long” or “list” format with three columns like Site, Species and Abundance or similar. This is also the tidy data format preferred by the various packages that adopt a tidyverse approach, so it is worth taking a second to explore a few tricks that help convert between formats.
In our case we have the data in matrix format, so first let’s convert it to long format.
hmm <- species66 %>% pivot_longer(cols = -1,
names_to = "Species",
values_to = "Abundance")
hmm
## # A tibble: 34,587 x 3
## ...1 Species Abundance
## <chr> <chr> <dbl>
## 1 CP_1 Adenandra uniflora 0
## 2 CP_1 Adenandra villosa 7
## 3 CP_1 Adenocline pauciflora 0
## 4 CP_1 Agathosma ciliaris 0
## 5 CP_1 Agathosma hookeri 0
## 6 CP_1 Agathosma imbricata 0
## 7 CP_1 Agathosma lanceolata 0
## 8 CP_1 Agathosma serpyllacea 0
## 9 CP_1 Aizoon paniculatum 0
## 10 CP_1 Amphithalea ericifolia 2
## # … with 34,577 more rows
Note that the names_to argument sets a name for the column where the column names from the matrix are stored, while values_to names the value column. You also have to select the columns you want to gather, and in this case we wanted to exclude the first column (the sites) as it is not abundance data.
Now let’s convert the long format data back to a matrix.
hmm <- hmm %>% pivot_wider(id_cols = 1,
names_from = Species,
values_from = Abundance,
values_fill = 0)
hmm
## # A tibble: 81 x 428
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## 5 CP_13 0 0 0 0
## 6 CP_14 0 0 0 0
## 7 CP_15 0 7 0 0
## 8 CP_16 0 0 0 0
## 9 CP_17 0 0 0 0
## 10 CP_18 0 0 0 0
## # … with 71 more rows, and 423 more variables: `Agathosma hookeri` <dbl>,
## # `Agathosma imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus laxiflorus` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotis aspera` <dbl>, `Argyrolobium filiforme` <dbl>,
## # `Aristea africana` <dbl>, `Aristea capitata` <dbl>, `Aristea glauca` <dbl>,
## # `Aspalathus abietina` <dbl>, `Aspalathus argyrella` <dbl>, `Aspalathus
## # callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus carnosa` <dbl>,
## # `Aspalathus chenopoda` <dbl>, `Aspalathus divaricata` <dbl>, `Aspalathus
## # ericifolia` <dbl>, `Aspalathus hispida` <dbl>, `Aspalathus
## # laricifolia` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # sericea` <dbl>, `Aspalathus serpens` <dbl>, `Asparagus aethiopicus` <dbl>,
## # `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>, `Asparagus
## # rubicundus` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia filiformis` <dbl>, `Bobartia
## # gladiata` <dbl>, `Bobartia indica` <dbl>, `Bolusafra bituminosa` <dbl>,
## # `Bulbine abyssinica` <dbl>, `Caesia contorta` <dbl>, `Capelio
## # tabularis` <dbl>, `Capeobolus brevicaulis` <dbl>, `Capeochloa
## # cincta` <dbl>, `Carpacoce spermacocea` <dbl>, `Carpacoce
## # vaginellata` <dbl>, `Carpobrotus acinaciformis` <dbl>, `Cassine
## # peragua` <dbl>, `Cassytha ciliolata` <dbl>, `Centella affinis` <dbl>,
## # `Centella glabrata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chironia baccifera` <dbl>, `Chironia decumbens` <dbl>,
## # `Chironia linoides` <dbl>, `Chrysocoma coma-aurea` <dbl>, `Cineraria
## # geifolia` <dbl>, `Cliffortia atrata` <dbl>, `Cliffortia falcata` <dbl>,
## # `Cliffortia ferruginea` <dbl>, `Cliffortia filifolia` <dbl>, `Cliffortia
## # glauca` <dbl>, `Cliffortia obcordata` <dbl>, `Cliffortia
## # polygonifolia` <dbl>, `Cliffortia ruscifolia` <dbl>, `Cliffortia
## # stricta` <dbl>, `Cliffortia subsetacea` <dbl>, `Clutia alaternoides` <dbl>,
## # `Clutia ericoides` <dbl>, `Clutia polygonoides` <dbl>, `Coleonema
## # album` <dbl>, `Conyza pinnatifida` <dbl>, `Corymbium africanum` <dbl>,
## # `Corymbium glabrum` <dbl>, `Crassula capensis` <dbl>, `Crassula
## # coccinea` <dbl>, `Crassula cymosa` <dbl>, `Crassula fascicularis` <dbl>,
## # `Crassula flava` <dbl>, `Crassula nudicaulis` <dbl>, `Crassula
## # subulata` <dbl>, `Cullumia squarrosa` <dbl>, `Cussonia thyrsiflora` <dbl>,
## # `Cymbopogon marginatus` <dbl>, `Cynanchum africanum` <dbl>, `Cynanchum
## # obtusifolium` <dbl>, `Cynodon dactylon` <dbl>, `Cyperus thunbergii` <dbl>,
## # `Diastella divaricata` <dbl>, `Dilatris corymbosa` <dbl>, `Dilatris
## # pillansii` <dbl>, `Dimorphotheca nudicaulis` <dbl>, `Diosma hirsuta` <dbl>,
## # `Diosma oppositifolia` <dbl>, …
Note that I added the values_fill = 0 argument, because a major advantage of long format biodiversity data is that you don’t usually retain records of species with zero abundance at a site (i.e. absences), as these are usually implicit. They do need to be present in a matrix, so this argument simply tells the function to fill absence records with 0.
Is the output the same as our original data?
identical(hmm, species66)
## [1] TRUE
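To see why values_fill matters, here’s a quick sketch: drop the absences first (as a long-format dataset would normally store it), then widen again and let values_fill restore the zeros. Note that the column order may differ from the original, because columns appear in order of first occurrence.

hmm2 <- species66 %>%
  pivot_longer(cols = -1, names_to = "Species", values_to = "Abundance") %>%
  filter(Abundance > 0) #drop the absences, as long-format data usually does
hmm2 %>% pivot_wider(id_cols = 1, names_from = Species, values_from = Abundance, values_fill = 0) #the zeros come back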
Ok, so what do we want to know?
Total abundance by plot? We can do this by summing the abundance of all species in each site (i.e. row in the matrix).
rowSums(species66)
## Error in rowSums(species66): 'x' must be numeric
A classic R error message. It seems unhelpful, but as you learn R the error messages make more sense. In this case it is telling us that not all the values in species66 are numbers (i.e. of class “numeric”). Some may be of class “character”, “factor”, etc.
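One quick way to check the classes of all the columns at once (one option among many):

table(sapply(species66, class)) #should show one "character" column (the plot names) and 427 "numeric" columns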
If we look at the data more closely…
head(species66)
## # A tibble: 6 x 428
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## 5 CP_13 0 0 0 0
## 6 CP_14 0 0 0 0
## # … with 423 more variables: `Agathosma hookeri` <dbl>, `Agathosma
## # imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus laxiflorus` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotis aspera` <dbl>, `Argyrolobium filiforme` <dbl>,
## # `Aristea africana` <dbl>, `Aristea capitata` <dbl>, `Aristea glauca` <dbl>,
## # `Aspalathus abietina` <dbl>, `Aspalathus argyrella` <dbl>, `Aspalathus
## # callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus carnosa` <dbl>,
## # `Aspalathus chenopoda` <dbl>, `Aspalathus divaricata` <dbl>, `Aspalathus
## # ericifolia` <dbl>, `Aspalathus hispida` <dbl>, `Aspalathus
## # laricifolia` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # sericea` <dbl>, `Aspalathus serpens` <dbl>, `Asparagus aethiopicus` <dbl>,
## # `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>, `Asparagus
## # rubicundus` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia filiformis` <dbl>, `Bobartia
## # gladiata` <dbl>, `Bobartia indica` <dbl>, `Bolusafra bituminosa` <dbl>,
## # `Bulbine abyssinica` <dbl>, `Caesia contorta` <dbl>, `Capelio
## # tabularis` <dbl>, `Capeobolus brevicaulis` <dbl>, `Capeochloa
## # cincta` <dbl>, `Carpacoce spermacocea` <dbl>, `Carpacoce
## # vaginellata` <dbl>, `Carpobrotus acinaciformis` <dbl>, `Cassine
## # peragua` <dbl>, `Cassytha ciliolata` <dbl>, `Centella affinis` <dbl>,
## # `Centella glabrata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chironia baccifera` <dbl>, `Chironia decumbens` <dbl>,
## # `Chironia linoides` <dbl>, `Chrysocoma coma-aurea` <dbl>, `Cineraria
## # geifolia` <dbl>, `Cliffortia atrata` <dbl>, `Cliffortia falcata` <dbl>,
## # `Cliffortia ferruginea` <dbl>, `Cliffortia filifolia` <dbl>, `Cliffortia
## # glauca` <dbl>, `Cliffortia obcordata` <dbl>, `Cliffortia
## # polygonifolia` <dbl>, `Cliffortia ruscifolia` <dbl>, `Cliffortia
## # stricta` <dbl>, `Cliffortia subsetacea` <dbl>, `Clutia alaternoides` <dbl>,
## # `Clutia ericoides` <dbl>, `Clutia polygonoides` <dbl>, `Coleonema
## # album` <dbl>, `Conyza pinnatifida` <dbl>, `Corymbium africanum` <dbl>,
## # `Corymbium glabrum` <dbl>, `Crassula capensis` <dbl>, `Crassula
## # coccinea` <dbl>, `Crassula cymosa` <dbl>, `Crassula fascicularis` <dbl>,
## # `Crassula flava` <dbl>, `Crassula nudicaulis` <dbl>, `Crassula
## # subulata` <dbl>, `Cullumia squarrosa` <dbl>, `Cussonia thyrsiflora` <dbl>,
## # `Cymbopogon marginatus` <dbl>, `Cynanchum africanum` <dbl>, `Cynanchum
## # obtusifolium` <dbl>, `Cynodon dactylon` <dbl>, `Cyperus thunbergii` <dbl>,
## # `Diastella divaricata` <dbl>, `Dilatris corymbosa` <dbl>, `Dilatris
## # pillansii` <dbl>, `Dimorphotheca nudicaulis` <dbl>, `Diosma hirsuta` <dbl>,
## # `Diosma oppositifolia` <dbl>, …
…we see that the first column is actually a vector of plot names (e.g. CP_1), which is of class “character”. These will be useful later, but for this analysis we want to drop them. This is easy using indexing with square brackets “[,]”.
rowSums(species66[,2:ncol(species66)])
## [1] 522.0 876.0 331.0 678.0 1029.5 523.0 880.5 1054.5 1306.0 1196.5
## [11] 396.0 531.0 473.0 625.5 197.0 837.0 178.0 377.0 275.0 144.0
## [21] 723.0 624.5 391.0 577.0 441.0 442.0 347.5 419.0 181.0 307.0
## [31] 1341.0 788.5 616.5 452.5 59.0 389.5 244.0 370.0 460.0 837.0
## [41] 214.0 251.0 184.0 448.0 129.0 859.0 299.0 353.0 424.0 467.5
## [51] 461.0 625.0 536.0 1377.5 554.5 311.0 509.5 332.0 1574.0 2345.0
## [61] 562.5 574.0 1584.0 257.0 860.5 1901.0 688.0 622.0 985.0 2045.5
## [71] 485.5 282.0 1335.0 184.0 355.5 270.0 481.0 341.0 50.0 409.0
## [81] 731.0
In this case we selected all columns from the 2nd column to the total number of columns as returned by ncol() (i.e. all but the first column). You could also do it like this…
rowSums(species66[,-1])
## [1] 522.0 876.0 331.0 678.0 1029.5 523.0 880.5 1054.5 1306.0 1196.5
## [11] 396.0 531.0 473.0 625.5 197.0 837.0 178.0 377.0 275.0 144.0
## [21] 723.0 624.5 391.0 577.0 441.0 442.0 347.5 419.0 181.0 307.0
## [31] 1341.0 788.5 616.5 452.5 59.0 389.5 244.0 370.0 460.0 837.0
## [41] 214.0 251.0 184.0 448.0 129.0 859.0 299.0 353.0 424.0 467.5
## [51] 461.0 625.0 536.0 1377.5 554.5 311.0 509.5 332.0 1574.0 2345.0
## [61] 562.5 574.0 1584.0 257.0 860.5 1901.0 688.0 622.0 985.0 2045.5
## [71] 485.5 282.0 1335.0 184.0 355.5 270.0 481.0 341.0 50.0 409.0
## [81] 731.0
This just highlights that R almost always provides a number of ways to do the same thing. For example, if you wanted to calculate the row sums using the tidyverse approach, you could do this
species66[,-1] %>% rowwise() %>% pmap_dbl(.,sum)
## [1] 522.0 876.0 331.0 678.0 1029.5 523.0 880.5 1054.5 1306.0 1196.5
## [11] 396.0 531.0 473.0 625.5 197.0 837.0 178.0 377.0 275.0 144.0
## [21] 723.0 624.5 391.0 577.0 441.0 442.0 347.5 419.0 181.0 307.0
## [31] 1341.0 788.5 616.5 452.5 59.0 389.5 244.0 370.0 460.0 837.0
## [41] 214.0 251.0 184.0 448.0 129.0 859.0 299.0 353.0 424.0 467.5
## [51] 461.0 625.0 536.0 1377.5 554.5 311.0 509.5 332.0 1574.0 2345.0
## [61] 562.5 574.0 1584.0 257.0 860.5 1901.0 688.0 622.0 985.0 2045.5
## [71] 485.5 282.0 1335.0 184.0 355.5 270.0 481.0 341.0 50.0 409.0
## [81] 731.0
Either way, a vector of numbers is not very informative… Let’s plot it in a histogram to look at the frequency of plots with different abundances.
ggplot(data.frame(abundance = rowSums(species66[,-1])),
aes(abundance)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Wow! It ranges from 50 to 2345! The reasons will become clear later on.
What about species number by plot? This is easy to achieve by just counting the number of positive entries in each row, which we get using the logical expression >0, like so
rowSums(species66[,-1]>0)
## [1] 51 50 52 64 44 44 66 53 31 19 27 43 37 34 26 46 14 51 37 15 67 52 51 45 40
## [26] 34 20 51 28 35 56 61 51 28 6 20 21 22 36 33 17 32 18 42 15 47 25 36 36 12
## [51] 39 49 25 49 35 36 39 39 74 71 24 58 79 25 44 74 14 33 51 73 24 24 72 20 11
## [76] 39 27 27 4 46 38
ggplot(data.frame(species_number = rowSums(species66[,-1]>0)),
aes(species_number)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What about the relationship between the two?
ggplot(data.frame(species_number = rowSums(species66[,-1]>0),
abundance = rowSums(species66[,-1])),
aes(abundance, species_number)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ log(x))
A strong positive relationship. One would usually expect a close relationship between abundance and species number in a vegetation survey, because having more plants in your plot means greater potential to sample more species. E.g. if you only have 10 individuals, the maximum number of species you can sample is 10… In Fynbos, the number of individuals you sample in a set area depends on the vegetation age since fire, with greater numbers in younger plots, because the individuals are smaller.
SIDE NOTE ON SPECIES DIVERSITY MEASURES: This is why I have been referring to species number and not species richness. Species richness, when one looks across all taxonomic groups that use the term, is typically defined as the number of species encountered for a given number of individuals sampled. This accounts for any differences in the amount of effort invested in each sampling event (e.g. trawl netting, mist netting, aerial counts, etc).
It is only when working with sedentary organisms like plants that we typically sample per unit area, in which case we are actually sampling species density and NOT species richness. While useful, species density has different properties to species richness that we should always bear in mind – e.g. species density typically declines with the size of the organism sampled, which depends on the kind of organism (tree vs grass) or the age of the organism (seedling vs adult). One can estimate species richness from vegetation plot data, but that requires having counts of the number of individuals and estimating the average number of species one samples for a given number of individuals. I highly recommend you read Gotelli and Colwell 2001. Ecology Letters if you are ever working with diversity data.
So let’s have a look at the change in diversity within sites between surveys. We can use the same code as above, just plotting the species counts from the one year against the other.
ggplot(data.frame(species_number66 = rowSums(species66[,-1]>0),
species_number10 = rowSums(species10[,-1]>0)),
aes(species_number66, species_number10)) +
geom_point() +
geom_smooth(method = "lm")
## Error in data.frame(species_number66 = rowSums(species66[, -1] > 0), species_number10 = rowSums(species10[, : arguments imply differing number of rows: 81, 63
Oh dear! The number of plots surveyed in each year was different for some reason. This is where the plot numbers come in. We need to make sure we’re comparing the same plots. First, we need to see what plot numbers are common between years.
(plots <- intersect(species66$...1, species10$...1))
## [1] "CP_1" "CP_10" "CP_12" "CP_13" "CP_14" "CP_15" "CP_16" "CP_17" "CP_18"
## [10] "CP_19" "CP_2" "CP_21" "CP_22" "CP_27" "CP_28" "CP_29" "CP_3" "CP_31"
## [19] "CP_34" "CP_36" "CP_37" "CP_38" "CP_39" "CP_4" "CP_40" "CP_42" "CP_44"
## [28] "CP_45" "CP_46" "CP_47" "CP_48" "CP_49" "CP_50" "CP_55" "CP_56" "CP_57"
## [37] "CP_58" "CP_59" "CP_61" "CP_62" "CP_63" "CP_64" "CP_65" "CP_66" "CP_67"
## [46] "CP_70" "CP_71" "CP_72" "CP_73" "CP_75" "CP_76" "CP_78" "CP_79" "CP_8"
## [55] "CP_80" "CP_82" "CP_83" "CP_88" "CP_89" "CP_9" "CP_92" "CP_95" "CP_99"
Note that if you get an error, it may be because you don’t have a column called ‘...1’. This was an automatic name R gave the column. On Mac it may be ‘X__1’. If you get an error, run the code names(species66)[1] to see what R called it on your computer and change the column name in the lines of code above and below. Sorry about that.
Now we need to extract these “intersect()ing” plots from each survey’s community data matrix, like so
species66c <- species66[which(species66$...1 %in% plots), -1]
species10c <- species10[which(species10$...1 %in% plots), -1]
Now let’s try to look at the change in diversity within sites between surveys again.
ggplot(data.frame(species_number66 = rowSums(species66c>0),
species_number10 = rowSums(species10c>0)),
aes(species_number66, species_number10)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Ok, a lot of difference between the two surveys. One thing we haven’t accounted for is the fact that the surveys record all species, but in Fynbos we usually exclude the “seasonally apparent” species like geophytes or annuals and only analyze the data for the “permanently apparent” species. This is because some species are only visible for a few weeks or less, but we only survey the plots once – and not every week for a full year. In this analysis we also want to exclude alien species.
You’ll note that one of the spreadsheets in our dataset is called “excluded_spp”
excel_sheets("pnas.1619014114.sd01.xlsx")
## [1] "METADATA" "weather" "fires"
## [4] "postfireweather" "enviroment" "excluded_spp"
## [7] "veg1966" "veg1996" "veg2010"
## [10] "traits" "speciesclimatedata"
We need to read these in…
excludes <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "excluded_spp")
## New names:
## * `` -> ...1
head(excludes)
## # A tibble: 6 x 2
## ...1 x
## <dbl> <chr>
## 1 1 Acacia saligna
## 2 2 Adenocline pauciflora
## 3 3 Agapanthus walshii
## 4 4 Albuca flaccida
## 5 5 Albuca juncifolia
## 6 6 Annesorhiza altiscapa
…and cut them out of our community data matrices, and plot again
species66c <- species66c[, -which(colnames(species66c) %in% excludes$x)]
species10c <- species10c[, -which(colnames(species10c) %in% excludes$x)]
ggplot(data.frame(species_number66 = rowSums(species66c>0),
species_number10 = rowSums(species10c>0)),
aes(species_number66, species_number10)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
…very similar. So why the difference among surveys?
Well it could be the differences in post-fire vegetation age and the impacts on the numbers of individuals, and thus species sampled. Let’s look at the relationship between the change in numbers of species and change in vegetation age.
Vegetation age for each plot at the time of each survey is in the enviroment spreadsheet. NOTE that I spelled it incorrectly in the data I submitted to PNAS… Nobody’s perfect…
env <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "enviroment", na = "NA")
## New names:
## * `` -> ...1
head(env)
## # A tibble: 6 x 10
## ...1 Plot Moisture_class Age1996 Age2010 Age1966 Aliens_max firecount66_10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 5 10 10 8 3 3
## 2 2 2 5 10 10 5 1 3
## 3 3 3 5 10 10 7 1 3
## 4 4 4 4 10 24 18 3 NA
## 5 5 8 5 10 3 3 3 2
## 6 6 9 5 10 24 3 2 1
## # … with 2 more variables: firecount66_96 <dbl>, firecount96_10 <dbl>
Ok, first we need to match the plots in the env object to those in the community data matrices. This is a little tricky, because the Plot column in the env object only has numbers, whereas in the community data matrices they were prefixed with “CP_”. We’ll have to paste() the prefix on, like so
(env$...1 <- paste("CP_", env$Plot, sep = ""))
## [1] "CP_1" "CP_2" "CP_3" "CP_4" "CP_8" "CP_9" "CP_10" "CP_12"
## [9] "CP_13" "CP_14" "CP_15" "CP_16" "CP_17" "CP_18" "CP_19" "CP_21"
## [17] "CP_22" "CP_23" "CP_24" "CP_25" "CP_27" "CP_28" "CP_29" "CP_30"
## [25] "CP_31" "CP_34" "CP_36" "CP_37" "CP_38" "CP_39" "CP_40" "CP_42"
## [33] "CP_44" "CP_45" "CP_46" "CP_47" "CP_48" "CP_49" "CP_50" "CP_54"
## [41] "CP_55" "CP_56" "CP_57" "CP_58" "CP_59" "CP_60" "CP_61" "CP_62"
## [49] "CP_63" "CP_64" "CP_65" "CP_66" "CP_67" "CP_68" "CP_70" "CP_71"
## [57] "CP_72" "CP_73" "CP_74" "CP_75" "CP_76" "CP_78" "CP_79" "CP_80"
## [65] "CP_81" "CP_82" "CP_83" "CP_84" "CP_86" "CP_87" "CP_88" "CP_89"
## [73] "CP_90" "CP_91" "CP_92" "CP_94" "CP_95" "CP_97" "CP_98" "CP_99"
## [81] "CP_100"
And then join env to the plot names in species10 (its first column). Note that left_join() keeps all the rows of the first object supplied, in the same order, so env is trimmed and sorted to match the plots in species10.
Note that you may need to change the ‘...1’ above to whatever the first column of your species10 dataframe is named, as mentioned earlier. Otherwise the next line of code returns an empty object and the plotting fails.
env <- left_join(species10[,1], env)
## Joining, by = "...1"
Ok, now we’re ready to plot the change in numbers of species with the change in vegetation age.
ggplot(data.frame(
change_in_species_number = rowSums(species66c>0) - rowSums(species10c>0),
change_in_age = env$Age1966 - env$Age2010),
aes(change_in_species_number, change_in_age)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).
Ignore the warnings. They appear because some of the plots don’t have a known vegetation age, as they have never burnt on record.
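You can count how many plots are affected like so (the number should match the rows removed in the warnings):

sum(is.na(env$Age1966 - env$Age2010)) #plots where either vegetation age is unknown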
So vegetation age clearly is an issue, with numbers of species declining as the vegetation ages (or increasing after a plot burns).
One approach to deal with this is to “rarefy” the samples. This involves estimating each plot’s expected species richness by randomly sampling a set number of individuals from each community. Again, I highly recommend you read Gotelli and Colwell 2001. Ecology Letters.
In this case we’ll sample 50 individuals, because that is the minimum abundance recorded in our plots.
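If you'd like to verify that, a quick check (my addition, not in the original analysis) lets you confirm that the smallest total count per plot is at least 50:
# Minimum total (estimated) abundance across plots in each survey
min(rowSums(ceiling(species66c)))
min(rowSums(ceiling(species10c)))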
ggplot(data.frame(species_richness66 = rarefy(x = ceiling(species66c), sample = 50),
species_richness10 = rarefy(x = ceiling(species10c), sample = 50)),
aes(species_richness66, species_richness10)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Much more similar… but still some differences. It looks like there may be fewer species in 2010, but this should be revealed by the slope of a regression line. We can get the slope by looking at the summary() of the linear model (lm()):
sr66 <- rarefy(x = ceiling(species66c), sample = 50)
sr10 <- rarefy(x = ceiling(species10c), sample = 50)
summary(lm(sr10 ~ sr66 - 1))
##
## Call:
## lm(formula = sr10 ~ sr66 - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0426 -1.9667 -0.1091 2.9906 8.0104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sr66 0.79423 0.02309 34.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.51 on 62 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9494
## F-statistic: 1183 on 1 and 62 DF, p-value: < 2.2e-16
Note that I forced the intercept to be 0 by adding "-1" to the formula. The interesting thing is that the slope is 0.7942285, well below 1, which suggests lower species numbers in 2010…
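To check whether the slope really is significantly below 1 (an extra step I'm adding here, not part of the original analysis), we can inspect the confidence interval for the coefficient:
# 95% confidence interval for the slope; if it excludes 1, richness differs between surveys
confint(lm(sr10 ~ sr66 - 1))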
We can also test this with a paired t-test:
t.test(sr66, sr10, paired = T)
##
## Paired t-test
##
## data: sr66 and sr10
## t = 6.9289, df = 62, p-value = 2.854e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.471771 4.476248
## sample estimates:
## mean of the differences
## 3.474009
Yup, there was a significant drop in species numbers between 1966 and 2010, even after accounting for differences in vegetation age/numbers of individuals. Bear in mind that these results are not necessarily defensible (and do not appear in the paper), because the 1966 "abundance" data are only estimates made by taking the mid-points of the categorical abundance classes recorded in the original survey. This may well bias the rarefaction results.
I'm not going to delve into this much, but since one of the biggest topics community ecologists are interested in is the relationship between species' traits and environmental variables, I feel it bears mentioning.
Why is this called "the fourth corner"? I mentioned that analyses of biotic community data typically involve 3 data matrices: sites by species (the community data), sites by environmental variables, and species by traits. The "fourth corner" is the matrix we have to estimate from the other matrices, because we can't observe it directly – that of traits by environment.
There are a multitude of methods out there for looking at the fourth corner, and covering this is beyond the scope of this tutorial, but I thought I’d demonstrate one of the simplest approaches – that of comparing the relative dominance of different functional types across different environments.
First we need to read in the trait data, which is the worksheet called traits.
traits <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "traits")
## New names:
## * `` -> ...1
head(traits)
## # A tibble: 6 x 9
## ...1 species family_GM resprout_postfi… herb geophyte graminoid low_shrub
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 17 Adenan… RUTACEAE 0 0 0 0 1
## 2 18 Adenan… RUTACEAE 0 0 0 0 1
## 3 29 Agatho… RUTACEAE 1 0 0 0 1
## 4 31 Agatho… RUTACEAE 1 0 0 0 1
## 5 33 Agatho… RUTACEAE 0 0 0 0 1
## 6 37 Agatho… RUTACEAE 1 0 0 0 1
## # … with 1 more variable: tall_shrub <dbl>
You'll see we have a column called resprout_postfire, which contains information on the postfire regeneration strategy for all species, indicating whether they are killed by fire and have to recruit from seed (0 = obligate seeder) or can persist through fire and resprout from remaining material (1 = sprouter).
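A quick tally (my addition) shows how many species fall into each class:
# Count species per regeneration strategy (0 = seeder, 1 = sprouter)
table(traits$resprout_postfire, useNA = "ifany")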
First we pull out vectors of the species names matching each regeneration type. I’m sure there’s an easier way, but this at least is relatively transparent.
sprouters <- unlist(na.omit(traits[traits$resprout_postfire == 1,]$species))
seeders <- unlist(na.omit(traits[traits$resprout_postfire == 0,]$species))
sprouters[1:10]
## [1] "Agathosma bifida" "Agathosma capensis" "Agathosma hookeri"
## [4] "Agathosma imbricata" "Amphithalea ericifolia" "Aristea africana"
## [7] "Aristea capitata" "Aristea glauca" "Aspalathus linguiloba"
## [10] "Aspalathus microphylla"
Now we use the sets of species names to extract the species counts and abundances for each regeneration type for each plot and bind them into a new dataframe that includes the soil moisture classes. The classes are labelled 1 to 5, representing:
1 = Permanently Wet
2 = Seasonally Wet
3 = Seasonally Moist (typically bordering a seasonal wetland with indicators of some influence such as the presence of a few wetland species)
4 = Well-drained (no evidence of topographic influence on soil moisture and likely reliant on mist/fog and rain only)
5 = Dry (essentially class 4 combined with other factors like N aspect and shallow soil that would affect plant water stress)
regen <- data.frame(
sprouter_spp = as_tibble(species10c>0) %>% select(any_of(sprouters)) %>% rowwise() %>% pmap_dbl(.,sum),
seeder_spp = as_tibble(species10c>0) %>% select(any_of(seeders)) %>% rowwise() %>% pmap_dbl(.,sum),
sprouter_abund = as_tibble(species10c) %>% select(any_of(sprouters)) %>% rowwise() %>% pmap_dbl(.,sum),
seeder_abund = as_tibble(species10c) %>% select(any_of(seeders)) %>% rowwise() %>% pmap_dbl(.,sum),
soil_moisture = as.factor(env$Moisture_class))
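As an aside, here's a simpler base-R way to get the same kind of per-plot sums (my alternative sketch, equivalent in intent), shown for the sprouters only:
# Species counts and total abundances per plot for sprouters
sprouter_spp_alt <- rowSums(species10c[, colnames(species10c) %in% sprouters] > 0)
sprouter_abund_alt <- rowSums(species10c[, colnames(species10c) %in% sprouters])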
Turn it into tidy (i.e. long format) data for plotting.
regen <- regen %>% pivot_longer(
cols = c("sprouter_spp", "seeder_spp", "sprouter_abund", "seeder_abund"),
names_to = c("Regeneration mode", "Variable"),
names_pattern = "(.*)_(.*)",
values_to = "Number")
regen
## # A tibble: 252 x 4
## soil_moisture `Regeneration mode` Variable Number
## <fct> <chr> <chr> <dbl>
## 1 5 sprouter spp 19
## 2 5 seeder spp 31
## 3 5 sprouter abund 526
## 4 5 seeder abund 1292
## 5 5 sprouter spp 19
## 6 5 seeder spp 27
## 7 5 sprouter abund 371
## 8 5 seeder abund 2132
## 9 4 sprouter spp 19
## 10 4 seeder spp 29
## # … with 242 more rows
Note the regular expression passed to names_pattern to convert the column names ("sprouter_spp", "seeder_spp", "sprouter_abund", "seeder_abund") into two columns ("Regeneration mode", "Variable") filled with "seeder"/"sprouter" and "spp"/"abund" respectively.
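To see how the pattern splits a single name (just an illustration), run it through stringr::str_match():
# Returns the full match followed by the two capture groups
stringr::str_match("sprouter_abund", "(.*)_(.*)")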
Now we can plot them
pal <- c("#56B4E9", "#CC79A7")#, "#009E73", "#0072B2", "#E69F00") #colourblind palette
ggplot(data = regen, aes(y = `Number`, x = soil_moisture, fill = `Regeneration mode`)) +
geom_boxplot() +
facet_wrap(~Variable, scales = "free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_fill_manual(values=rev(pal))
Interestingly, the abundance of seeders is much higher in wetter sites and decreases towards drier sites, while the sprouters show little difference. In contrast, species counts for both regeneration types increase towards the drier sites. What have I not done (or accounted for) here, and how might it confound these results?
This set of tutorials is aimed at helping you access and analyze SAEON datasets and other useful resources provided by others. Most tutorials use the R statistical computing language, but they should still contain useful info on the datasets and how to access them even if you don't use R. If you'd like to learn to use R, we recommend you start with the R Intro material.
One of the biggest challenges of working on global change issues in South Africa is not having easy access to the country’s long-term weather records. One can occasionally access data for research purposes, especially if there’s a student involved, but otherwise our weather service charges a hefty fee for data.
Fortunately, we can access a lot of South Africa’s data through the National Oceanic and Atmospheric Administration of the USA (NOAA) for free here!!!
NOAA serves a massive range of datasets, but the most useful in terms of station data appear to be the Integrated Surface Database (ISD) and the Global Historical Climatology Network Daily (GHCND) dataset, both of which we explore below.
BUT WAIT, THERE'S MORE! If you call now, you'll get the dedicated rnoaa R package written by the amazing fellows at rOpenSci absolutely free!!!
Here's a quick demo of library(rnoaa) below, largely using repurposed and updated code from Scott Chamberlain's blog posts here and here, and from the package vignette.
DISCLAIMER: We haven't read the documentation for the datasets and make no guarantee of their accuracy or of the validity of our assumptions when summarizing and presenting the data. This applies to data quality, gaps, units, etc.
Let’s get set up…
library("rnoaa")
library("dplyr")
library("reshape2")
library("lawn")
library("leaflet")
library("lubridate")
library("ggplot2")
library("htmltools")
The Integrated Surface Database (ISD) "is composed of worldwide surface weather observations from over 35,000 stations", nominally at hourly resolution. See the link above for more details.
Let’s see what stations we can find for South Africa and surrounds?
###Make a bounding box
bbox <- c(16, -35, 34, -22)
###Search the bounding box for stations
out <- isd_stations_search(bbox = bbox)
## using cached file: /home/jasper/.cache/R/noaa_isd/isd_stations.rds
## date created (size, mb): 2020-08-02 21:27:15 (0.683)
###Have a look at the output
head(out)
## # A tibble: 6 x 11
## usaf wban station_name ctry state icao lat lon elev_m begin end
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 673080 99999 CHICUALACUALA MZ "" "" -22.1 31.7 453 1.97e7 2.02e7
## 2 673090 99999 DINDIZA-GAZA MZ "" "" -22.1 33.4 102 2.01e7 2.02e7
## 3 673310 99999 MAPULANGUENE-… MZ "" "" -24.5 32.1 151 2.01e7 2.02e7
## 4 673350 99999 XAI XAI MZ "" "FQX… -25.0 33.6 5 1.97e7 2.02e7
## 5 673390 99999 MAPUTO MZ "" "" -26.0 32.6 59 1.95e7 2.02e7
## 6 673410 99999 MAPUTO MZ "" "FQM… -25.9 32.6 44.2 1.97e7 2.02e7
It's difficult to read a table, so let's put the stations on a map.
###Make the popup labels
out$uw <- paste("usaf/wban = ", out$usaf, "/", out$wban, sep = "")
out$period <- paste("years = ", substr(out$begin,1,4), " to ", substr(out$end,1,4), sep = "")
name <- paste(sep = "<br/>", out$station_name, out$period, out$uw)
###Plot the map
leaflet(data = out) %>%
addTiles() %>%
addCircleMarkers(label = lapply(name, HTML), radius = 3)
There’s also a function for searching for all stations within a specified radius of a point of interest.
out <- isd_stations_search(lat = -34.1, lon = 18.45, radius = 40)
head(out)
## # A tibble: 6 x 12
## usaf wban station_name ctry state icao lat lon elev_m begin end
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6891… 99999 SIMONSTOWN SF "" "" -34.2 18.4 293 1.95e7 1.95e7
## 2 6891… 99999 SLANGKOP SF "" "" -34.2 18.3 8 2.00e7 2.02e7
## 3 6881… 99999 MOLTENO RES… SF "" "" -33.9 18.4 97 2.01e7 2.02e7
## 4 6899… 99999 CT-AWS SF "" "" -34.0 18.6 46 2.00e7 2.02e7
## 5 6881… 99999 CAPE TOWN I… SF "" "FAC… -34.0 18.6 46 1.95e7 2.02e7
## 6 6881… 99999 CAPE TOWN -… SF "" "" -33.9 18.4 2 2.00e7 2.02e7
## # … with 1 more variable: distance <dbl>
And if we visualize the area searched…
pt <- lawn::lawn_point(c(18.45, -34.1))
lawn::lawn_buffer(pt, dist = 40) %>% view
Ok, now let's get a smattering of stations from across the Cape Peninsula that have data for 2019, combine them into one object and plot them.
###Pull data from online database to local file and then read in
res1 <- isd(usaf=out$usaf[2], wban="99999", year=2019, force = T)
res2 <- isd(usaf=out$usaf[3], wban="99999", year=2019, force = T)
res3 <- isd(usaf=out$usaf[5], wban="99999", year=2019, force = T)
res4 <- isd(usaf=out$usaf[9], wban="99999", year=2019, force = T)
###Combine data into one dataframe
res_all <- bind_rows(res1, res2, res3, res4) #Note that not all stations have all variables and bind_rows() fills empty columns with NAs
###Create a combined date + time column
res_all$date_time <- ymd_hm(
sprintf("%s %s", as.character(res_all$date), res_all$time)
)
###Convert "temperature" column from codes to numeric and divide by 10 (the data come in tenths of a degree)
res_all$temperature <- as.numeric(substr(res_all$temperature, 2, 5))/10
###Remove missing values, which NOAA code as "+9999" (i.e. 999.9 after conversion)
res_all$temperature[which(res_all$temperature > 900)] <- NA
###Visualize the data
ggplot(res_all, aes(date_time, temperature)) +
geom_line() +
facet_wrap(~usaf_station, scales = "free_x")
Recall that the ISD dataset is hourly (or mostly 3-hourly in this case), so perhaps better to look at daily max or min temperatures and precipitation?
###Some dodgy cleaning of the rainfall data - CHECK MY ASSUMPTION THAT NA = 0mm RAINFALL
res_all$AA1_depth[is.na(res_all$AA1_depth)] <- 0
res_all$AA1_depth[which(res_all$AA1_depth == 9999)] <- NA
res_all$AA1_depth <- as.numeric(res_all$AA1_depth)
res_all$date <- as.Date(res_all$date, format = "%Y%m%d")
###Summarize Tmin, Tmax and Precipitation
dayres <- res_all %>% group_by(usaf_station, date) %>% summarise(tmax = max(temperature, na.rm = T), tmin = min(temperature, na.rm = T), prec = max(AA1_depth, na.rm = T)/10) #note I take max precip and divide by ten, because it seems to be an hourly cumulative sum in tenths of a mm (check this!)
###Melt and plot
dayres %>% melt(id = c("usaf_station", "date")) %>% ggplot(aes(date, value, colour = usaf_station)) +
geom_line() +
facet_wrap(~variable, scales = "free_y")
Or monthly…
###Fix dates
res_all$month <- month(res_all$date_time) #note this is all in 2019
###Summarize Tmin, Tmax and Precipitation
monthres <- res_all %>% group_by(usaf_station, month) %>% summarise(tmax = max(temperature, na.rm = T), tmin = min(temperature, na.rm = T), prec = sum(AA1_depth, na.rm = T)/30)
###Melt and plot
monthres %>% melt(id = c("usaf_station", "month")) %>% ggplot(aes(month, value, colour = usaf_station)) +
geom_line() +
facet_wrap(~variable, scales = "free_y")
Note that the rnoaa functions save files in a local cache. If you repeat a search, they usually just read in that cache rather than redoing the search and download. Some functions allow you to specify force = TRUE to force a new download, but others don't, in which case you will need to clear the cache manually. You can find the cache on your machine by calling rappdirs::user_cache_dir("rnoaa/ghcnd").
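For example, a minimal sketch (cache paths may differ on your machine) to inspect and clear that cache:
###See where rnoaa caches GHCND files and what's in there
cache_dir <- rappdirs::user_cache_dir("rnoaa/ghcnd")
list.files(cache_dir)
###Uncomment to delete the cached files and force fresh downloads
# unlink(file.path(cache_dir, list.files(cache_dir)))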
What if we wanted to look at a longer record? The isd() function only allows us to pull one year at a time, so we need to automate the downloads using a loop or similar. We may also want to select stations by name rather than by usaf/wban codes. You can easily pick your own station names by exploring the interactive map we made above. Then we'll loop through a few years for each station to get a multi-year record.
Here’s the daily Tmin and Tmax for the period 2013 to 2020 for a few stations around Cape Town.
###Select stations by name and filter
stations <- c("SLANGKOP", "MOLTENO RESERVIOR", "CAPE TOWN INTL", "CAPE POINT")
sout <- filter(out, station_name %in% stations)
###Set date range
yr <- 2013:2020
###Ok, now I'll use 2 nested "for" loops to extract data for all years for each station. "for" loops can often be slower than other approaches like writing functions, but in this case it's not a problem because the API call and data download are the slow bit. A "for" loop also makes it easier to follow what I'm doing.
drtdat <- list() #Make list object to capture outputs (elements will be stations)
#For each station
for (i in 1:nrow(sout)) {
res <- list() #Make list object to capture interim outputs stations by station (elements will be years)
#For each year (j), extract the data for station (i)
for (j in 1:length(yr)) {
res[[j]] <- tryCatch({
isd(usaf=sout$usaf[i], wban=sout$wban[i], year=yr[j]) }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
res <- bind_rows(res) #Convert the interim list object into a dataframe
drtdat[[i]] <- res #Add the dataframe for station i to the output list
}
###Convert output to a dataframe and add the station names
names(drtdat) <- stations
dat <- bind_rows(drtdat)
dat <- right_join(sout[,c("station_name", "usaf")], dat, by = c("usaf"="usaf_station"))
###Create a combined date + time column and year/month/day columns
dat$date_time <- ymd_hm(
sprintf("%s %s", as.character(dat$date), dat$time)
)
dat$year <- year(dat$date_time) #note this is NOT all in 2019 this time
dat$month <- month(dat$date_time)
dat$day <- day(dat$date_time)
dat$date <- as.Date(dat$date, format = "%Y%m%d")
###Convert "temperature" column from codes to numeric
dat$temperature <- as.numeric(substr(dat$temperature, 2, 5))/10
###Remove missing values, which NOAA code as "+9999" (i.e. 999.9 after conversion)
dat$temperature[which(dat$temperature > 900)] <- NA
###Some dodgy cleaning of the rainfall data - CHECK MY ASSUMPTION THAT NA = 0mm RAINFALL
dat$AA1_depth[is.na(dat$AA1_depth)] <- 0
dat$AA1_depth[which(dat$AA1_depth == 9999)] <- NA
dat$AA1_depth <- as.numeric(dat$AA1_depth)
###Daily summary of Tmin and Tmax
daydat <- dat %>% group_by(station_name, date) %>% summarise(tmax = max(temperature, na.rm = T), tmin = min(temperature, na.rm = T))
###Melt and plot
daydat %>% melt(id = c("station_name", "date")) %>% ggplot(aes(date, value, colour = variable)) +
geom_line() +
facet_wrap(~station_name)
Or precipitation by month?
###Summarize Tmin, Tmax and Precipitation
monthdat <- dat %>% group_by(station_name, year, month) %>% summarise(prec = sum(AA1_depth, na.rm = T)/30)
monthdat$date <- as.Date(paste(monthdat$year, monthdat$month, "01", sep = "/"), format = "%Y/%m/%d")
###Melt and plot
monthdat %>% select(station_name, date, prec) %>% ggplot(aes(date, prec)) +
geom_line() +
facet_wrap(~station_name, scales = "free_y")
Ok, what about looking at even longer records? Most of South Africa has experienced pretty extreme drought in the past 5 years or so, so let’s pull out a few major cities and towns that give us good coverage of the country.
Note that the Integrated Surface Database (ISD) gives us hourly data, which is overkill for looking at longer records as the dataset gets very large. For this it is better to query the Global Historical Climatology Network Daily (GHCND) dataset, as described by Menne et al. 2012. CAVEAT: We take no responsibility for the data. Have a look at the paper for details on error corrections, biases, etc.
Let’s have a look at Cape Town International Airport
ct <- meteo_tidy_ghcnd(stationid = "SFM00068816", var = c("TMAX", "TMIN", "PRCP"))
head(ct)
## # A tibble: 6 x 5
## id date prcp tmax tmin
## <chr> <date> <dbl> <dbl> <dbl>
## 1 SFM00068816 1973-06-01 NA NA NA
## 2 SFM00068816 1973-06-02 NA NA NA
## 3 SFM00068816 1973-06-03 NA NA NA
## 4 SFM00068816 1973-06-04 NA NA NA
## 5 SFM00068816 1973-06-05 NA NA NA
## 6 SFM00068816 1973-06-06 NA NA NA
Note that we had to know the station ID "SFM00068816". For some reason you can't query the database of stations using the same method as for the ISD; you have to download the entire list of all ~630 000 global weather stations and query it locally to get the station ID.
station_data <- ghcnd_stations() # Takes a while to run
head(station_data)
## # A tibble: 6 x 11
## id latitude longitude elevation state name gsn_flag wmo_id element
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" TMAX
## 2 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" TMIN
## 3 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" PRCP
## 4 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" SNOW
## 5 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" SNWD
## 6 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" PGTM
## # … with 2 more variables: first_year <int>, last_year <int>
You can now query this database using the meteo_nearby_stations() function, which allows you to find all stations within a certain radius of a coordinate, or the closest x stations. Here are the closest stations to Cape Point lighthouse.
lat_lon_df <- data.frame(id = "somewhere",
latitude = -34.35,
longitude = 18.48)
nearby_stations <- meteo_nearby_stations(lat_lon_df = lat_lon_df,
station_data = station_data, radius = 100)
nearby_stations$somewhere
## # A tibble: 23 x 5
## id name latitude longitude distance
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 SF000047220 GROOT CONSTANTIA -34.0 18.4 36.0
## 2 SF000213300 EERSTERIVIER (BOS) -34.0 18.7 41.0
## 3 SF000206890 TABLE.MTN WOODHEAD TUNNEL -34.0 18.4 42.2
## 4 SFM00068816 CAPE TOWN INTL -34.0 18.6 44.3
## 5 SF000207470 TABLE.MTN PLATTEKLIP -34.0 18.4 44.8
## 6 SF000208660 ROYAL OBSERVATORY -33.9 18.5 46.7
## 7 SF000207760 CAPE TOWN FIRE STATION -33.9 18.4 46.9
## 8 SF000207460 TABL.MTN MOLTENO RSV -33.9 18.4 47.0
## 9 SF000210550 CAPE TOWN MAITLAND (CEM) -33.9 18.5 48.0
## 10 SF000216550 STELLENBOSCH (TNK) -33.9 18.9 58.9
## # … with 13 more rows
Interesting… Cape Point lighthouse isn't on the list, even though it's a Global Atmosphere Watch (GAW) station and is in the ISD data. No idea why…
Note that you can also just do a good old text search for the names of a few stations you know, as sketched below. This also allows you to filter by multiple criteria at once, e.g. we can filter for element = PRCP to only get rainfall data. You can use the map we made earlier to get your own list of names, but note that, like "CAPE POINT", they're not all in the GHCND database.
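A text search could look something like this (a quick sketch using the station_data object from above):
###Find all stations with "CAPE" in the name that record rainfall
station_data %>% filter(element == "PRCP", grepl("CAPE", name)) %>% select(id, name)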
Since most of South Africa has experienced extreme drought in the past few years, perhaps we should pull out and compare the rainfall for a few stations from around the country? In this case we’ve selected a few of the larger towns and cities.
###Select stations by name and filter
stations <- c("CAPE TOWN INTL",
"SPRINGBOK",
"BLOEMFONTEIN AIRPOR",
"PORT ELIZABETH INTL",
"DURBAN INTL",
"JOHANNESBURG INTL")
stas <- station_data %>% filter(element == "PRCP") %>%
filter(name %in% stations)
stas
## # A tibble: 6 x 11
## id latitude longitude elevation state name gsn_flag wmo_id element
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 SF00… -29.7 17.9 1007 "" SPRI… "GSN" 68512 PRCP
## 2 SF00… -29.1 26.3 1354 "" BLOE… "GSN" 68442 PRCP
## 3 SFM0… -26.1 28.2 1694. "" JOHA… "" 68368 PRCP
## 4 SFM0… -30.0 31.0 10.1 "" DURB… "" 68588 PRCP
## 5 SFM0… -34.0 18.6 46 "" CAPE… "GSN" 68816 PRCP
## 6 SFM0… -34.0 25.6 68.9 "" PORT… "GSN" 68842 PRCP
## # … with 2 more variables: first_year <int>, last_year <int>
Looks like we should be able to compare the period 1980 to 2019 for all stations. Let’s download the data.
dat <- meteo_pull_monitors(stas$id, var = c("TMAX", "TMIN", "PRCP"), date_min = "1980-01-01", date_max = "2019-12-31")
dat[1:5,]
## # A tibble: 5 x 5
## id date prcp tmax tmin
## <chr> <date> <dbl> <dbl> <dbl>
## 1 SF002146700 1980-01-01 0 NA NA
## 2 SF002146700 1980-01-02 0 NA NA
## 3 SF002146700 1980-01-03 0 NA NA
## 4 SF002146700 1980-01-04 0 NA NA
## 5 SF002146700 1980-01-05 0 NA NA
Ok, now let’s add our station names, summarize by year, and plot.
stas$shortname <- c("SBK", "BFN", "JHB", "DBN", "CPT", "PE")
dat <- left_join(dat, stas[,c("id", "name", "shortname")])
dat$year <- substr(dat$date,1,4)
mdat <- dat %>% group_by(shortname, year) %>% summarise(rain = sum(prcp, na.rm = T)/10) #Get annual totals (GHCND reports prcp in tenths of a mm, hence dividing by 10)
mdat$year <- as.numeric(mdat$year)
mdat$rain <- as.numeric(mdat$rain)
###Visualize the data
ggplot(mdat, aes(x = year, y = rain)) +
geom_line() +
facet_wrap(~shortname)
Or something prettier?
###calculate average rainfall and yearly anomalies
ymean <- mdat %>% group_by(shortname) %>% summarise(mean = mean(rain, na.rm = T))
mdat <- left_join(mdat, ymean, by = "shortname")
mdat$positive <- mdat$rain - mdat$mean
mdat$positive[which(mdat$positive<0)] <- 0
mdat$negative <- mdat$rain - mdat$mean
mdat$negative[which(mdat$negative>0)] <- 0
###Plot
ggplot(mdat, aes(x = year)) +
geom_ribbon(aes(ymin = 0, ymax = positive, fill="positive")) + geom_ribbon(aes(ymin = negative, ymax = 0 ,fill="negative")) +
scale_fill_manual(name="",values=c("#D6604D", "#4393C3")) +
facet_grid(shortname ~ ., scales = "free_y") +
ylab("rainfall anomaly in mm")
We haven't dealt with data gaps, data quality verification, etc., but this quick and dirty analysis suggests it's been a pretty dismal decade for South Africa all round…