This section is a work in progress. We hope to add a series of tutorials for how to access and analyze SAEON datasets and other useful resources, mostly using the R programming language. We’ve also included a section on R basics to help you get familiar with the language. Note that most tutorials will include an overview of the data and details for accessing them, so we hope they will still be useful even if you don’t use R.
This set of tutorials includes:
If you’ve opened this post, you’re clearly interested in learning R and probably don’t need much convincing, but I’ll provide a few quick reminders just to keep your morale up. If your morale is pretty good, feel free to jump to the sections where I explain How best to learn R (including installation), and Where to find good resources.
As an aside, you may be interested to know that Lai et al. 2019 report that “The number of […ecological…] studies reported using R as their primary tool in data analysis increased linearly from 11.4% in 2008 to 58.0% in 2017.”
While there are many, many, many good reasons to learn R, here are my Top 5:
R is rapidly becoming the software of choice for data analysis in academia and data science because it has many packages/libraries containing functions that cover just about anything you want to do – databasing, stats, GIS, visualization, etc. – all in one environment! There’s no need to go through the pain of moving data between different software packages anymore! Do it all in one seamless workflow in R!!!
There is also a huge user community and online resources providing tutorials, help fora, code, etc – see Where below for a start.
Because you write scripts when doing analyses in R, if you need to rerun your analysis, you need only rerun your script (as opposed to remembering what you clicked in which dropdown menu in what version of what programme etc.)
Scripts are also a MAJOR help if you need to run the same kind of analysis multiple times – you can just borrow from your old code! Similarly, if you want to perform the same analysis on a large number of datasets or data objects, you can set the code to loop through them automatically – as opposed to “click here, click there, click here again, and click again… repeat X 1000…”
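For example, here’s a minimal sketch of such a loop (the .csv file pattern is hypothetical – adapt it to your own files):

for (f in list.files(pattern = "\\.csv$")) { #loop over all .csv files in the working directory
  dat <- read.csv(f) #read in the next dataset
  #...run the same analysis on each dataset here...
}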
Your script also provides a useful “recipe” that helps you remember and explain the methods of the analysis you did! (I decided to italicize this because it should really be a separate point).
I should warn that there are a few extra steps you have to take to make sure your analysis stays repeatable (e.g. because of updates to R functions), and can be run on any computer, but this is easy to learn. Either way, you should learn it, because all science should be reproducible, firstly to clearly show your method and evidence of the correctness of your results, and secondly to enable others to make use of them.
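A good first step (one of several options – packages like renv take this much further) is simply recording the versions of everything you used:

sessionInfo() #prints your R version, operating system and the versions of all loaded packages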
Lastly, reproducibility is likely to be the way of the future… One of the big criticisms of peer-reviewed science is that it is too slow and not responsive to societal needs. This can change. If the question is clearly defined and straightforward, the method you apply is peer-reviewed and accepted, and the results do not need in-depth interpretation (e.g. the answer is simply “Yes/No”), the turn-around time to publication can be greatly reduced.
(and I thank Adam Wilson for these points – although he may have someone else to thank)
a) Learning a programming language can help you learn how to think logically.
A [wo]man who does not know a foreign language is ignorant of his [their] own. – Johann Wolfgang von Goethe (1749 – 1832)
b) Programming gives you access to more computer power.
The computer is incredibly fast, accurate, and stupid. [Wo]Man is unbelievably slow, inaccurate, and brilliant. The marriage of the two is a force beyond calculation. – Leo Cherne
c) I’d also highlight that once you understand one programming language, picking up another is much, much easier should you wish/need to do so.
(i.e. open source)
SPSS costs $99.00 USD per user per month. ArcGIS???!!! – Don’t even go there…
R is also platform independent (i.e. operates on Windows, Mac, Linux, …)
The initial learning curve for any programming language can be incredibly steep. I have heard that R is less steep than many, but that’s not likely to make you feel any better when you’ve been banging your head against a wall for three days and still haven’t managed to get your analysis to run… (Yes, this WILL happen. Many, many times unfortunately…).
The only emotion as strong as the frustration you’ll be feeling after three days of banging your head is the sheer ecstasy of getting your code to work and seeing the outputs of your analysis! All of a sudden it will all be worth it!
The story I relate when people ask me if learning R is worth it is the tale of my first paper (published in 2006, so not THAT long ago…). All analyses in the paper are based on a dated molecular phylogeny of the beautiful and charismatic sedge genus Tetraria. Of course, one of the journal’s reviewers recommended we use a different method to date the phylogeny, meaning we had to rerun every single analysis in the manuscript. While we knew that this would make no difference to the results whatsoever (and were proven to be correct of course), you cannot just ignore reviewers’ comments if you want to get your science published. Despite knowing exactly what needed to happen step-by-step, it took me 3 weeks to repeat all analyses in the paper – using 9 different software packages across Windows, Mac and Linux environments… A few years later, and two years after I started to learn R (NOTE: I’m still learning R, you can never say “I’ve learnt R”), I spent 3 hours writing an R function that does the entire analysis in under 5 minutes!!!
The point of the story is that while you may know how to do the analysis you want to do in Excel/SPSS/ArcGIS/Statistica/JMP/(insert just about any software here) in a few hours, and it may take you a day or three to work out how to in R, sooner or later YOU WILL ALWAYS HAVE TO REPEAT YOUR ANALYSIS – usually quite a few times too…
THE FIRST LAW OF LEARNING R is accepting that the first time you try to do anything it may take you a little longer to get it to work in R than whatever you’re used to. The beauty is that you can do this safe in the knowledge that you’re saving yourself a lot of time further down the line.
THE SECOND LAW OF LEARNING R is having a dataset and a tricky problem to solve or analysis to perform. This is usually where the benefits of R’s flexibility (or existing code libraries) are clear, and one is motivated to spend the time pounding away at it.
THE THIRD LAW OF LEARNING R is community!!! Fortunately, there are a large number of R addicts out there who can provide ALL of the answers you need – BUT you need to work out how to ask Google or the online forum the right question (always check the forum rules before you add a post!). You should also find any useR groups in your neighbourhood (or start one!). Sometimes it is very difficult to clarify your question, and having a human to help work it out is a huge help!
THE FOURTH LAW OF LEARNING R is to read about how R works. If pounding away at the code isn’t working, it’s time to get a better idea of how R (or the package or function you’re trying to use) works and how to tell it to do what you want it to do. R will always do what you tell it to do – making sure that what you tell it to do is what you actually want is the difficult bit. (NOTE: This also highlights the need to “reality check” that the output you get from R makes sense.)
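The quickest places to read are R’s built-in help pages, e.g.:

?sum #help page for a function
help(package = "dplyr") #overview of everything in a package
vignette("dplyr") #long-form worked examples, where a package provides them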
THE FIFTH LAW OF LEARNING R is to never give up!!!
Here’s a quick list of resources to help get you started learning R.
I provide a quick guide to Installing R in my primer on Handling Spatial Data in R, which I’ll come back to lower down.
R for Data Science is a fantastic resource for learning how to program in R, including visualization and communication, with a focus on a group of relatively new and very powerful R packages collectively known as the tidyverse. The only drawback for scientists is that it doesn’t delve into statistics and modelling very much, although it does point you to these great resources.
A very thorough R resource, and great tool for learning statistics at the same time, is The R Book by Michael J. Crawley. Google it to find a copy. While it is a bit more old school and doesn’t include the latest and greatest tidyverse packages, in my opinion the first couple of chapters are a must read, giving you a great intro to how R works with objects and different data formats etc that will stand you in good stead in the long run. Do yourself a favour and take the time to work through them!
There’s also a large number of other books to help get you started in R from introductory material through to methods in specific fields of research here, here and here.
If books are too much for you, there are cheat sheets to help get you started with various packages. Googling will turn up more too.
There are also online R courses run by groups like swirl, DataCamp or Coursera, but I feel you’re not likely to remember much if you don’t have your own data/problem to solve.
And various different sources all over the internet:
I’ll reiterate. Community is a big word in R.
While I wouldn’t recommend starting with trying to use R as a geographical information system (GIS), once you have a bit of experience under your belt there are a growing number of great resources for this. Have a look at my primers on Handling Spatial Data in R and the resources therein to get you started.
This tutorial relies on data provided as supplementary material for the paper Slingsby et al. 2017. Intensifying postfire weather and biological invasion drive species loss in a Mediterranean-type biodiversity hotspot. PNAS. http://dx.doi.org/10.1073/pnas.1619014114. The data can be downloaded here. It’s a ~13MB .xlsx file.
The study presents the results of 44 years of observation of permanent vegetation plots in the Cape of Good Hope Section of Table Mountain National Park, South Africa. The plots are dubbed “The Taylor Plots” in honour of Hugh Taylor, who established them in 1966. We’ll describe the details of the data as we work through the different tutorials, but for this first one we will only use the rainfall and temperature data for the reserve.
R can do basic calculations
1 + 1
## [1] 2
But to do more complex calculations, it’s easier to work with objects, which you assign, e.g.
x <- 1
x
## [1] 1
You can assign objects from other objects – usually once you’ve performed an operation on them
y <- x + 1
y
## [1] 2
You can perform operations with multiple objects
y + x
## [1] 3
Combine objects
z <- c(x, y)
z
## [1] 1 2
Or apply functions to objects
sum(z)
## [1] 3
You can also access elements within an object using indexing with square brackets
z[1] #the first element in vector z
## [1] 1
z[2] #the second element in vector z
## [1] 2
Or evaluate them using logical expressions
z == 1
## [1] TRUE FALSE
z < 1
## [1] FALSE FALSE
z > 1
## [1] FALSE TRUE
Note that you can reassign (or overwrite) objects, so you need to use unique object names if you want to keep them
x <- sum(z)
x
## [1] 3
R keeps these objects in memory
ls() #function to call the list of objects in memory - and "#" lets you put comments like this in your code...
## [1] "x" "y" "z"
This is useful, but if the objects are large, it will slow down the speed of calculations, so it is good practice to discard (large) objects once you’re done with them (or only read them in when you need to)
rm(x, y, z)
ls()
## character(0)
Now let’s explore real data!
First we’ll call the packages/libraries that contain the functions we need
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
Now we need some data.
When reading data from a file into R you need to know the system path to the file on your computer, e.g. "/home/jasper/GIT/academic-kickstart/static/datasets".
Typing the whole thing out (or even copying and pasting) can get tedious, so it’s often better to set your working directory so R knows where to look for files, like so.
setwd("/home/jasper/GIT/academic-kickstart/static/datasets")
If you are trying to run this code on your own, you’ll get an error when trying to set your working directory to a folder on my computer 🙂 – you need to change the line of code above to the folder where you put the data. If you’re on Windows, you’ll need to specify the root drive, e.g. C:. You also need to make sure to use single forwardslashes / or double backslashes \\, rather than the single backslash \ that is the default on Windows.
So my file path on windows would look like:
"C:/jasper/GIT/academic-kickstart/static/datasets"
or
"C:\\jasper\\GIT\\academic-kickstart\\static\\datasets"
Or on Mac it would be:
"/Users/jasper/GIT/academic-kickstart/static/datasets"
Lastly, if you’ve tried all that and it’s still not working, try adding (or removing) slash(es) at the end.
We can check if we got it right using
getwd()
## [1] "/home/jasper/GIT/academic-kickstart/static/datasets"
Ok, once all that’s done, we can use R to have a look at what files are in that directory
list.files()
## [1] "pnas.1619014114.sd01.xlsx" "SAEON_field instrument list.xlsx"
And in this case we want to read data from pnas.1619014114.sd01.xlsx, but given that this is an Excel workbook, it may have multiple separate worksheets of data. Let’s have a quick glimpse the lazy way
excel_sheets("pnas.1619014114.sd01.xlsx")
## [1] "METADATA" "weather" "fires"
## [4] "postfireweather" "enviroment" "excluded_spp"
## [7] "veg1966" "veg1996" "veg2010"
## [10] "traits" "speciesclimatedata"
There are 11 different sheets. Let’s start with the weather
weather <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "weather")
## New names:
## * `` -> ...1
#Let's have a look at the first few lines
head(weather)
## # A tibble: 6 x 6
## ...1 Date Temp Rain Month Year
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 1 1961-01-02 00:00:00 21 0 1 1961
## 2 2 1961-01-03 00:00:00 20.6 0 1 1961
## 3 3 1961-01-04 00:00:00 15 3.5 1 1961
## 4 4 1961-01-05 00:00:00 21 0 1 1961
## 5 5 1961-01-06 00:00:00 19.8 0.8 1 1961
## 6 6 1961-01-07 00:00:00 22 0 1 1961
or last few lines
tail(weather)
## # A tibble: 6 x 6
## ...1 Date Temp Rain Month Year
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 17474 2009-11-25 00:00:00 15 0 11 2009
## 2 17475 2009-11-26 00:00:00 18.9 0 11 2009
## 3 17476 2009-11-27 00:00:00 16.9 0 11 2009
## 4 17477 2009-11-28 00:00:00 22.2 0 11 2009
## 5 17478 2009-11-29 00:00:00 19.2 0 11 2009
## 6 17479 2009-11-30 00:00:00 17.3 0 11 2009
There are 17479 rows of data (you can check with nrow(weather)), so it’s a bit tricky to inspect it all! Fortunately we can call a summary like so
summary(weather)
## ...1 Date Temp
## Min. : 1 Min. :1961-01-02 00:00:00 Min. : 6.30
## 1st Qu.: 4371 1st Qu.:1973-12-19 18:00:00 1st Qu.:15.00
## Median : 8740 Median :1985-12-09 12:00:00 Median :17.50
## Mean : 8740 Mean :1985-12-05 08:31:36 Mean :17.62
## 3rd Qu.:13109 3rd Qu.:1997-12-07 06:00:00 3rd Qu.:20.00
## Max. :17479 Max. :2009-11-30 00:00:00 Max. :35.30
## Rain Month Year
## Min. : 0.0000 Min. : 1.000 Min. :1961
## 1st Qu.: 0.0000 1st Qu.: 4.000 1st Qu.:1973
## Median : 0.0000 Median : 7.000 Median :1985
## Mean : 0.9914 Mean : 6.514 Mean :1985
## 3rd Qu.: 0.2000 3rd Qu.:10.000 3rd Qu.:1997
## Max. :97.0000 Max. :12.000 Max. :2009
Nice, but this doesn’t help us see trends etc. It’s difficult to eyeball daily rainfall, so let’s calculate the annual totals.
Here we use the group_by() and summarise() functions provided by library(dplyr) to calculate the annual rainfall totals using sum().
annualrain <- weather %>% group_by(Year) %>% summarise(Rainfall = sum(Rain))
head(annualrain) #to see what we get
## # A tibble: 6 x 2
## Year Rainfall
## <dbl> <dbl>
## 1 1961 382
## 2 1963 304.
## 3 1964 418.
## 4 1965 403.
## 5 1966 389
## 6 1967 371.
If you were to translate this line of code into English it would say: “Take the weather dataframe, and for each Year, sum the total rainfall across all observations (days in this case).”
library(dplyr) and other tidyverse packages contain a series of functions that can be thought of as verbs, providing the grammar for you to develop code statements (see the example below this list), e.g.:
arrange()
filter()
mutate()
slice()
summarise()
group_by()
(Pro tip: Do yourself a favour and apply the fourth law here!)
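For example, here’s a quick illustrative “sentence” built from a few of these verbs, counting the number of rain days per year since 2000 (don’t worry about the %>% symbol for now – it’s explained just below):

weather %>%
  filter(Year > 2000) %>% #keep only the recent years
  mutate(Wet = Rain > 0) %>% #add a logical column flagging days with rain
  group_by(Year) %>%
  summarise(RainDays = sum(Wet)) %>% #count the rain days in each year
  arrange(desc(RainDays)) #sort the wettest year to the top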
Binding the verbs together into code sentences is made possible by the pipe %>% function, which essentially allows you to pass the output of one function (or verb) to the next without making a new object in memory each time. I.e. without %>% this code would have to be written
step1 <- group_by(weather, Year)
step2 <- summarise(step1, Rainfall = sum(Rain))
head(step2)
## # A tibble: 6 x 2
## Year Rainfall
## <dbl> <dbl>
## 1 1961 382
## 2 1963 304.
## 3 1964 418.
## 4 1965 403.
## 5 1966 389
## 6 1967 371.
Which creates unwanted objects like step1 and step2 that just waste memory, so let’s get rid of them
rm(step1, step2)
So annual totals are easier to interpret than daily data, but still not as good as a graph…
library(ggplot2)
To plot the data using the functions in library(ggplot2) we first describe the data, telling ggplot what data object we want to use and which variables should be our x and y axes, like so.
r <- ggplot(annualrain, aes(x = Year, y = Rainfall))
Then we can play with different plot types
r + geom_line() #A line graph - as easy as that!!!
I find time-series are often easier to look at if you shade one side, like so
r + geom_area()
Which is obviously easier to interpret than a scatterplot of points
r + geom_point()
Unless you add a trend line, like a loess smoother
r + geom_point() + geom_smooth(method = 'loess')
## `geom_smooth()` using formula 'y ~ x'
So there have been spells of wetter and drier years, but no clear trend. What about if we wanted to see whether there have been any changes in seasonality, i.e. do particular months show patterns of getting wetter or drier?
Summarising the data by month is very easy – all we need to do is add Month to the group_by() statement in the code we used to get the annual totals.
monthlyrain <- weather %>% group_by(Year, Month) %>% summarise(Rainfall = sum(Rain))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
head(monthlyrain)
## # A tibble: 6 x 3
## # Groups: Year [1]
## Year Month Rainfall
## <dbl> <dbl> <dbl>
## 1 1961 1 44.7
## 2 1961 2 5.7
## 3 1961 3 11.9
## 4 1961 4 9.4
## 5 1961 5 45.2
## 6 1961 6 86.4
The only issue now is that it is difficult to interpret the table directly, and we have 3 variables, so it’s not as simple to plot.
Fortunately, library(ggplot2) has the functions facet_grid() and facet_wrap() that make it very easy to make multi-panel plots. Here we use the same code as the line plot above, but apply it to the object monthlyrain and add facet_grid(), indicating that the rows of the multi-panel plot should be the months.
r <- ggplot(monthlyrain, aes(x = Year, y = Rainfall))
r + geom_line() + facet_grid(rows = vars(Month))
Nice! We can clearly see that months 4 to 8 (April to August) are the wetter months in this region, but it’s still difficult to tell which years were wetter or drier relative to the long-term average. To do this we need to calculate the anomalies (i.e. the differences from the mean).
Getting the mean for each month is very easy to do by adapting our code above and applying it to the monthlyrain dataframe we have already calculated. We simply change the function passed to summarise() from sum() to mean(), and remove Year from the group_by() statement.
I.e. we’re saying: “Take the monthlyrain dataframe, and for each Month, calculate the average rainfall across all years.” Like so
monthlymean <- monthlyrain %>% group_by(Month) %>% summarise(MeanMonthlyRainfall = mean(Rainfall))
head(monthlymean)
## # A tibble: 6 x 2
## Month MeanMonthlyRainfall
## <dbl> <dbl>
## 1 1 11.4
## 2 2 13.7
## 3 3 13.4
## 4 4 32.6
## 5 5 47.6
## 6 6 58.7
Ok, but now we need to match the MeanMonthlyRainfall with each month in monthlyrain. This is very easy with the function merge().
monthlyrain <- merge(monthlyrain, monthlymean)
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall
## 1 1 1961 44.7 11.425
## 2 1 1999 5.2 11.425
## 3 1 1989 6.4 11.425
## 4 1 1979 14.8 11.425
## 5 1 2006 0.7 11.425
## 6 1 1996 2.3 11.425
Ok, now we want to calculate the positive and negative differences from the mean. For plotting purposes we’ll do this as separate Positive and Negative columns. R makes this easy by allowing simple mathematical operations on columns, and by letting us access or assign new columns using the $ operator, like so.
monthlyrain$Positive <- monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall Positive
## 1 1 1961 44.7 11.425 33.275
## 2 1 1999 5.2 11.425 -6.225
## 3 1 1989 6.4 11.425 -5.025
## 4 1 1979 14.8 11.425 3.375
## 5 1 2006 0.7 11.425 -10.725
## 6 1 1996 2.3 11.425 -9.125
Now we have negative values in our Positive column, but that’s ok, because we can set them to zero using a logical operator that detects negative values, like so:
monthlyrain$Positive[which(monthlyrain$Positive < 0)] <- 0
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall Positive
## 1 1 1961 44.7 11.425 33.275
## 2 1 1999 5.2 11.425 0.000
## 3 1 1989 6.4 11.425 0.000
## 4 1 1979 14.8 11.425 3.375
## 5 1 2006 0.7 11.425 0.000
## 6 1 1996 2.3 11.425 0.000
Ok, there were a few steps that went into that. Let’s break it down:
- the logical operator < returns TRUE or FALSE for whether the values in monthlyrain$Positive are less than zero or not;
- the which() function translates all the values that are TRUE to their position in the vector/column monthlyrain$Positive;
- the square brackets [ ] index those positions in monthlyrain$Positive; and
- <- 0 assigns zero to those positions.
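As an aside, base R’s pmax() (elementwise maximum) would get you the same result in a single step:

monthlyrain$Positive <- pmax(monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall, 0) #negative anomalies become 0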
Now we do the same for the Negative values:
monthlyrain$Negative <- monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall
monthlyrain$Negative[which(monthlyrain$Negative > 0)] <- 0
head(monthlyrain)
## Month Year Rainfall MeanMonthlyRainfall Positive Negative
## 1 1 1961 44.7 11.425 33.275 0.000
## 2 1 1999 5.2 11.425 0.000 -6.225
## 3 1 1989 6.4 11.425 0.000 -5.025
## 4 1 1979 14.8 11.425 3.375 0.000
## 5 1 2006 0.7 11.425 0.000 -10.725
## 6 1 1996 2.3 11.425 0.000 -9.125
And we’re ready to start plotting. I’ll build this up piece by piece. First we’ll set up the plot and add the positive anomaly as a “ribbon” plot
g <- ggplot(monthlyrain, aes(x = Year))
g <- g + geom_ribbon(aes(ymin = 0, ymax = Positive, fill="Positive"))
g
Now we’ll add the negative anomaly ribbon
g <- g + geom_ribbon(aes(ymin = Negative, ymax = 0 ,fill="Negative"))
g
Next we can set colours to make sure they make sense – e.g. positive (wetter) anomalies should be blue and negative (drier) should be red.
g <- g + scale_fill_manual(name="",values=c("#D6604D", "#4393C3"))
g
Now we can split it into a multipanel by month using facet_grid()
g <- g + facet_grid(cols = vars(Month))
g
And add a title, meaningful y-axis label and flip the axes so we can read the graph properly
g <- g + ggtitle("Monthly Rainfall Anomaly (1961-2010)") + coord_flip() + ylab("Rainfall (mm)")
g
Not much to observe other than the positive anomalies are typically bigger than the negative anomalies. This is not surprising, because the negative anomalies are bounded by zero (i.e. no rain is the driest it can get), while there are no constraints on the positive anomaly. For example, the wettest day on record had 97 mm.
Now see if you can adapt the code above to develop the same figure using the Temp data in the object weather. Note that you’ll need to calculate monthly means rather than sums, and it’s better to swap the colours so that red is the positive anomaly (hotter) and blue the negative (cooler).
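If you get stuck, here’s one possible solution – a sketch that simply reuses the rainfall code above (the object names monthlytemp and meantemp are my own):

monthlytemp <- weather %>% group_by(Year, Month) %>% summarise(Temperature = mean(Temp))
meantemp <- monthlytemp %>% group_by(Month) %>% summarise(MeanMonthlyTemp = mean(Temperature))
monthlytemp <- merge(monthlytemp, meantemp)
monthlytemp$Positive <- monthlytemp$Temperature - monthlytemp$MeanMonthlyTemp
monthlytemp$Positive[which(monthlytemp$Positive < 0)] <- 0
monthlytemp$Negative <- monthlytemp$Temperature - monthlytemp$MeanMonthlyTemp
monthlytemp$Negative[which(monthlytemp$Negative > 0)] <- 0

t <- ggplot(monthlytemp, aes(x = Year)) +
  geom_ribbon(aes(ymin = 0, ymax = Positive, fill = "Positive")) +
  geom_ribbon(aes(ymin = Negative, ymax = 0, fill = "Negative")) +
  scale_fill_manual(name = "", values = c("#4393C3", "#D6604D")) + #blue for Negative (cooler), red for Positive (hotter)
  facet_grid(cols = vars(Month)) +
  ggtitle("Monthly Temperature Anomaly (1961-2010)") + coord_flip() + ylab("Temperature (°C)")
t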
It should look like this:
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
This looks a bit more interesting, with fewer cool anomalies and more hot anomalies as we head towards the present. It turns out the temperature has risen by ~1.2 degrees over the observation period.
The previous tutorial covered a bunch of the basics with some pretty straightforward data, and was more focused on how R works. Now we’ll be looking at slightly more complicated data (species data), which will require learning some fancier data handling.
R has a ridiculous number of libraries and functions for analyzing environmental and biodiversity data – check out the Environmetrics Task View for a primer on a small sample. The issue is that many of them have their own specific data format requirements, requiring an inordinate amount of data wrangling. My aim for this session is to walk you through an example dataset, sharing a few tidyverse and base R data wrangling tricks, and highlighting some common perils and pitfalls when analysing patterns and change in biotic community survey data.
First, let’s get the R packages we need and set our working directory
library(tidyverse)
library(readxl)
library(vegan) #comment out with "#" if you don't have this installed - or run install.packages("vegan")
setwd("/home/jasper/GIT/academic-kickstart/static/datasets/")
#Change to your local path to the folder where you put the dataset
We’re going to use the same dataset from Slingsby et al. 2017 that we used in the previous tutorial. The data can be downloaded here. It’s a ~10MB .xlsx file.
This is a vegetation survey of 81 plots, each 5 by 10 m, from across the Cape of Good Hope section of Table Mountain National Park. For more information check out the worksheet labeled METADATA and/or read the paper.
Let’s look at the sheets in our Excel workbook.
excel_sheets("pnas.1619014114.sd01.xlsx")
## [1] "METADATA" "weather" "fires"
## [4] "postfireweather" "enviroment" "excluded_spp"
## [7] "veg1966" "veg1996" "veg2010"
## [10] "traits" "speciesclimatedata"
There are quite a few sheets in the workbook, but the datasets most commonly used in analyses of biotic communities are three separate data objects (usually matrices, but see below) that include:
- a community matrix of sites by species (abundances, counts or presence/absence);
- a matrix of sites by environmental variables; and
- a matrix of species by traits.
We have enviroment and traits, which are the second and third matrices, and in this case we have three community matrices (veg1966, veg1996 and veg2010), because the sites were first surveyed in 1966 and then resurveyed in 1996 and 2010.
Note that if you ever want to see what data are in a sheet, but don’t want to open the file in Excel, you can easily have a glimpse at them by just selecting a small range in read_excel(), like so
read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg1966", range = "A1:K5")
## New names:
## * `` -> ...1
## # A tibble: 4 x 11
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## # … with 6 more variables: `Agathosma hookeri` <dbl>, `Agathosma
## # imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>
read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg2010", range = "A1:K5")
## New names:
## * `` -> ...1
## # A tibble: 4 x 11
## ...1 `Acacia saligna` `Adenandra unif… `Adenandra vill… `Agapanthus wal…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 0 3 0
## 2 CP_10 0 0 0 0
## 3 CP_12 0 0 0 0
## 4 CP_13 0 0 0 0
## # … with 6 more variables: `Agathosma ciliaris` <dbl>, `Agathosma
## # hookeri` <dbl>, `Agathosma imbricata` <dbl>, `Agathosma serpyllacea` <dbl>,
## # `Albuca flaccida` <dbl>, `Albuca juncifolia` <dbl>
In this case the veg1966 and veg1996 data were Braun-Blanquet cover-abundance classes, which I have converted to abundance using the midpoint of each class. THIS IS NOT NECESSARILY DEFENSIBLE!!! but it’s fine for the purposes of this tutorial. The veg2010 dataset is actual counts of individuals.
Let’s read them in and have a closer look.
species66 <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg1966")
## New names:
## * `` -> ...1
head(species66)
## # A tibble: 6 x 428
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## 5 CP_13 0 0 0 0
## 6 CP_14 0 0 0 0
## # … with 423 more variables: `Agathosma hookeri` <dbl>, `Agathosma
## # imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus laxiflorus` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotis aspera` <dbl>, `Argyrolobium filiforme` <dbl>,
## # `Aristea africana` <dbl>, `Aristea capitata` <dbl>, `Aristea glauca` <dbl>,
## # `Aspalathus abietina` <dbl>, `Aspalathus argyrella` <dbl>, `Aspalathus
## # callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus carnosa` <dbl>,
## # `Aspalathus chenopoda` <dbl>, `Aspalathus divaricata` <dbl>, `Aspalathus
## # ericifolia` <dbl>, `Aspalathus hispida` <dbl>, `Aspalathus
## # laricifolia` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # sericea` <dbl>, `Aspalathus serpens` <dbl>, `Asparagus aethiopicus` <dbl>,
## # `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>, `Asparagus
## # rubicundus` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia filiformis` <dbl>, `Bobartia
## # gladiata` <dbl>, `Bobartia indica` <dbl>, `Bolusafra bituminosa` <dbl>,
## # `Bulbine abyssinica` <dbl>, `Caesia contorta` <dbl>, `Capelio
## # tabularis` <dbl>, `Capeobolus brevicaulis` <dbl>, `Capeochloa
## # cincta` <dbl>, `Carpacoce spermacocea` <dbl>, `Carpacoce
## # vaginellata` <dbl>, `Carpobrotus acinaciformis` <dbl>, `Cassine
## # peragua` <dbl>, `Cassytha ciliolata` <dbl>, `Centella affinis` <dbl>,
## # `Centella glabrata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chironia baccifera` <dbl>, `Chironia decumbens` <dbl>,
## # `Chironia linoides` <dbl>, `Chrysocoma coma-aurea` <dbl>, `Cineraria
## # geifolia` <dbl>, `Cliffortia atrata` <dbl>, `Cliffortia falcata` <dbl>,
## # `Cliffortia ferruginea` <dbl>, `Cliffortia filifolia` <dbl>, `Cliffortia
## # glauca` <dbl>, `Cliffortia obcordata` <dbl>, `Cliffortia
## # polygonifolia` <dbl>, `Cliffortia ruscifolia` <dbl>, `Cliffortia
## # stricta` <dbl>, `Cliffortia subsetacea` <dbl>, `Clutia alaternoides` <dbl>,
## # `Clutia ericoides` <dbl>, `Clutia polygonoides` <dbl>, `Coleonema
## # album` <dbl>, `Conyza pinnatifida` <dbl>, `Corymbium africanum` <dbl>,
## # `Corymbium glabrum` <dbl>, `Crassula capensis` <dbl>, `Crassula
## # coccinea` <dbl>, `Crassula cymosa` <dbl>, `Crassula fascicularis` <dbl>,
## # `Crassula flava` <dbl>, `Crassula nudicaulis` <dbl>, `Crassula
## # subulata` <dbl>, `Cullumia squarrosa` <dbl>, `Cussonia thyrsiflora` <dbl>,
## # `Cymbopogon marginatus` <dbl>, `Cynanchum africanum` <dbl>, `Cynanchum
## # obtusifolium` <dbl>, `Cynodon dactylon` <dbl>, `Cyperus thunbergii` <dbl>,
## # `Diastella divaricata` <dbl>, `Dilatris corymbosa` <dbl>, `Dilatris
## # pillansii` <dbl>, `Dimorphotheca nudicaulis` <dbl>, `Diosma hirsuta` <dbl>,
## # `Diosma oppositifolia` <dbl>, …
species10 <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "veg2010")
## New names:
## * `` -> ...1
head(species10)
## # A tibble: 6 x 339
## ...1 `Acacia saligna` `Adenandra unif… `Adenandra vill… `Agapanthus wal…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 0 3 0
## 2 CP_10 0 0 0 0
## 3 CP_12 0 0 0 0
## 4 CP_13 0 0 0 0
## 5 CP_14 0 0 0 0
## 6 CP_15 0 0 0 0
## # … with 334 more variables: `Agathosma ciliaris` <dbl>, `Agathosma
## # hookeri` <dbl>, `Agathosma imbricata` <dbl>, `Agathosma serpyllacea` <dbl>,
## # `Albuca flaccida` <dbl>, `Albuca juncifolia` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus capensis` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotheca calendula` <dbl>, `Aristea africana` <dbl>,
## # `Aspalathus callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus
## # carnosa` <dbl>, `Aspalathus chenopoda` <dbl>, `Aspalathus ciliaris` <dbl>,
## # `Aspalathus hispida` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # serpens` <dbl>, `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>,
## # `Asparagus rubicundus` <dbl>, `Babiana ambigua` <dbl>, `Babiana
## # villosula` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia gladiata` <dbl>, `Bobartia
## # indica` <dbl>, `Bolusafra bituminosa` <dbl>, `Bulbine abyssinica` <dbl>,
## # `Bulbine alooides` <dbl>, `Bulbine lagopus` <dbl>, `Caesia contorta` <dbl>,
## # `Capelio tabularis` <dbl>, `Capeochloa cincta` <dbl>, `Carpobrotus
## # acinaciformis` <dbl>, `Carpobrotus edulis` <dbl>, `Cassine peragua` <dbl>,
## # `Cassytha ciliolata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chenolea diffusa` <dbl>, `Chironia baccifera` <dbl>,
## # `Chironia decumbens` <dbl>, `Chironia linoides` <dbl>, `Cliffortia
## # atrata` <dbl>, `Cliffortia falcata` <dbl>, `Cliffortia ferruginea` <dbl>,
## # `Cliffortia obcordata` <dbl>, `Cliffortia stricta` <dbl>, `Cliffortia
## # subsetacea` <dbl>, `Clutia alaternoides` <dbl>, `Clutia ericoides` <dbl>,
## # `Coleonema album` <dbl>, `Colpoon compressum` <dbl>, `Corymbium
## # africanum` <dbl>, `Corymbium glabrum` <dbl>, `Cotula coronopifolia` <dbl>,
## # `Cotula turbinata` <dbl>, `Crassula fascicularis` <dbl>, `Crassula
## # nudicaulis` <dbl>, `Cymbopogon marginatus` <dbl>, `Cynodon dactylon` <dbl>,
## # `Dasispermum hispidum` <dbl>, `Diastella divaricata` <dbl>, `Dilatris
## # pillansii` <dbl>, `Diosma hirsuta` <dbl>, `Diosma oppositifolia` <dbl>,
## # `Diospyros glabra` <dbl>, `Disa bracteata` <dbl>, `Disa obliqua` <dbl>,
## # `Disa purpurascens` <dbl>, `Disparago tortilis` <dbl>, `Disperis
## # capensis` <dbl>, `Drimia capensis` <dbl>, `Drosanthemum candens` <dbl>,
## # `Drosera aliciae` <dbl>, `Drosera trinervia` <dbl>, `Edmondia
## # sesamoides` <dbl>, `Ehrharta calycina` <dbl>, `Ehrharta ramosa` <dbl>,
## # `Elegia cuspidata` <dbl>, `Elegia filacea` <dbl>, `Elegia juncea` <dbl>,
## # `Elegia microcarpa` <dbl>, `Elegia nuda` <dbl>, `Elegia persistens` <dbl>,
## # `Elegia stipularis` <dbl>, `Elegia vaginulata` <dbl>, `Elytropappus
## # scaber` <dbl>, `Eragrostis capensis` <dbl>, `Erepsia anceps` <dbl>, …
Yup, that looks about right.
We’re going to start playing with the 1966 data in a second, but first, what if your data aren’t in a species by site matrix?
Often biodiversity data are stored in “long” or “list” format with three columns like Site, Species and Abundance or similar. This is also the tidy data format preferred by the various packages that adopt a tidyverse approach, so it is worth taking a second to explore a few tricks that help convert between formats.
In our case we have the data in matrix format, so first let’s convert it to long format.
hmm <- species66 %>% pivot_longer(cols = -1,
names_to = "Species",
values_to = "Abundance")
hmm
## # A tibble: 34,587 x 3
## ...1 Species Abundance
## <chr> <chr> <dbl>
## 1 CP_1 Adenandra uniflora 0
## 2 CP_1 Adenandra villosa 7
## 3 CP_1 Adenocline pauciflora 0
## 4 CP_1 Agathosma ciliaris 0
## 5 CP_1 Agathosma hookeri 0
## 6 CP_1 Agathosma imbricata 0
## 7 CP_1 Agathosma lanceolata 0
## 8 CP_1 Agathosma serpyllacea 0
## 9 CP_1 Aizoon paniculatum 0
## 10 CP_1 Amphithalea ericifolia 2
## # … with 34,577 more rows
Note that the names_to argument sets a name for the column where the column names from the matrix are stored, while values_to names the value column. You also have to select the columns you want to gather, and in this case we wanted to exclude the first column (the sites) as it is not abundance data.
Now let’s convert the long format data back to a matrix.
hmm <- hmm %>% pivot_wider(id_cols = 1,
names_from = Species,
values_from = Abundance,
values_fill = 0)
hmm
## # A tibble: 81 x 428
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## 5 CP_13 0 0 0 0
## 6 CP_14 0 0 0 0
## 7 CP_15 0 7 0 0
## 8 CP_16 0 0 0 0
## 9 CP_17 0 0 0 0
## 10 CP_18 0 0 0 0
## # … with 71 more rows, and 423 more variables: `Agathosma hookeri` <dbl>,
## # `Agathosma imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus laxiflorus` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotis aspera` <dbl>, `Argyrolobium filiforme` <dbl>,
## # `Aristea africana` <dbl>, `Aristea capitata` <dbl>, `Aristea glauca` <dbl>,
## # `Aspalathus abietina` <dbl>, `Aspalathus argyrella` <dbl>, `Aspalathus
## # callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus carnosa` <dbl>,
## # `Aspalathus chenopoda` <dbl>, `Aspalathus divaricata` <dbl>, `Aspalathus
## # ericifolia` <dbl>, `Aspalathus hispida` <dbl>, `Aspalathus
## # laricifolia` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # sericea` <dbl>, `Aspalathus serpens` <dbl>, `Asparagus aethiopicus` <dbl>,
## # `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>, `Asparagus
## # rubicundus` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia filiformis` <dbl>, `Bobartia
## # gladiata` <dbl>, `Bobartia indica` <dbl>, `Bolusafra bituminosa` <dbl>,
## # `Bulbine abyssinica` <dbl>, `Caesia contorta` <dbl>, `Capelio
## # tabularis` <dbl>, `Capeobolus brevicaulis` <dbl>, `Capeochloa
## # cincta` <dbl>, `Carpacoce spermacocea` <dbl>, `Carpacoce
## # vaginellata` <dbl>, `Carpobrotus acinaciformis` <dbl>, `Cassine
## # peragua` <dbl>, `Cassytha ciliolata` <dbl>, `Centella affinis` <dbl>,
## # `Centella glabrata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chironia baccifera` <dbl>, `Chironia decumbens` <dbl>,
## # `Chironia linoides` <dbl>, `Chrysocoma coma-aurea` <dbl>, `Cineraria
## # geifolia` <dbl>, `Cliffortia atrata` <dbl>, `Cliffortia falcata` <dbl>,
## # `Cliffortia ferruginea` <dbl>, `Cliffortia filifolia` <dbl>, `Cliffortia
## # glauca` <dbl>, `Cliffortia obcordata` <dbl>, `Cliffortia
## # polygonifolia` <dbl>, `Cliffortia ruscifolia` <dbl>, `Cliffortia
## # stricta` <dbl>, `Cliffortia subsetacea` <dbl>, `Clutia alaternoides` <dbl>,
## # `Clutia ericoides` <dbl>, `Clutia polygonoides` <dbl>, `Coleonema
## # album` <dbl>, `Conyza pinnatifida` <dbl>, `Corymbium africanum` <dbl>,
## # `Corymbium glabrum` <dbl>, `Crassula capensis` <dbl>, `Crassula
## # coccinea` <dbl>, `Crassula cymosa` <dbl>, `Crassula fascicularis` <dbl>,
## # `Crassula flava` <dbl>, `Crassula nudicaulis` <dbl>, `Crassula
## # subulata` <dbl>, `Cullumia squarrosa` <dbl>, `Cussonia thyrsiflora` <dbl>,
## # `Cymbopogon marginatus` <dbl>, `Cynanchum africanum` <dbl>, `Cynanchum
## # obtusifolium` <dbl>, `Cynodon dactylon` <dbl>, `Cyperus thunbergii` <dbl>,
## # `Diastella divaricata` <dbl>, `Dilatris corymbosa` <dbl>, `Dilatris
## # pillansii` <dbl>, `Dimorphotheca nudicaulis` <dbl>, `Diosma hirsuta` <dbl>,
## # `Diosma oppositifolia` <dbl>, …
Note that I added the values_fill = 0 argument, because a major advantage of long format biodiversity data is that you don’t usually retain records of species with zero abundance at a site (i.e. absences), as these are usually implicit. They do need to be present in a matrix, so this argument simply tells the function to fill absence records with 0.
Is the output the same as our original data?
identical(hmm, species66)
## [1] TRUE
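To see why values_fill matters, here’s a quick sketch: drop the absences first (as a long-format dataset would normally store it), then widen again and let values_fill restore the zeros. Note that the column order may differ from the original, because columns appear in order of first occurrence.

hmm2 <- species66 %>%
  pivot_longer(cols = -1, names_to = "Species", values_to = "Abundance") %>%
  filter(Abundance > 0) #drop the absences, as long-format data usually does
hmm2 %>% pivot_wider(id_cols = 1, names_from = Species, values_from = Abundance, values_fill = 0) #the zeros come back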
Ok, so what do we want to know?
Total abundance by plot? We can do this by summing the abundance of all species in each site (i.e. row in the matrix).
rowSums(species66)
## Error in rowSums(species66): 'x' must be numeric
A classic R error message. It seems unhelpful, but as you learn R the error messages make more sense. In this case it is telling us that not all the values in species66 are numbers (i.e. of class “numeric”). Some may be of class “character”, “factor”, etc.
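One quick way to check the classes of all the columns at once (one option among many):

table(sapply(species66, class)) #should show one "character" column (the plot names) and 427 "numeric" columns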
If we look at the data more closely…
head(species66)
## # A tibble: 6 x 428
## ...1 `Adenandra unif… `Adenandra vill… `Adenocline pau… `Agathosma cili…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CP_1 0 7 0 0
## 2 CP_10 0 0 0 0
## 3 CP_1… 0 7 0 0
## 4 CP_12 0 0 0 0
## 5 CP_13 0 0 0 0
## 6 CP_14 0 0 0 0
## # … with 423 more variables: `Agathosma hookeri` <dbl>, `Agathosma
## # imbricata` <dbl>, `Agathosma lanceolata` <dbl>, `Agathosma
## # serpyllacea` <dbl>, `Aizoon paniculatum` <dbl>, `Amphithalea
## # ericifolia` <dbl>, `Anaxeton laeve` <dbl>, `Anemone knowltonia` <dbl>,
## # `Anthochortus laxiflorus` <dbl>, `Anthospermum aethiopicum` <dbl>,
## # `Anthospermum bergianum` <dbl>, `Anthospermum galioides` <dbl>, `Apium
## # decumbens` <dbl>, `Arctotis aspera` <dbl>, `Argyrolobium filiforme` <dbl>,
## # `Aristea africana` <dbl>, `Aristea capitata` <dbl>, `Aristea glauca` <dbl>,
## # `Aspalathus abietina` <dbl>, `Aspalathus argyrella` <dbl>, `Aspalathus
## # callosa` <dbl>, `Aspalathus capensis` <dbl>, `Aspalathus carnosa` <dbl>,
## # `Aspalathus chenopoda` <dbl>, `Aspalathus divaricata` <dbl>, `Aspalathus
## # ericifolia` <dbl>, `Aspalathus hispida` <dbl>, `Aspalathus
## # laricifolia` <dbl>, `Aspalathus linguiloba` <dbl>, `Aspalathus
## # microphylla` <dbl>, `Aspalathus retroflexa` <dbl>, `Aspalathus
## # sericea` <dbl>, `Aspalathus serpens` <dbl>, `Asparagus aethiopicus` <dbl>,
## # `Asparagus capensis` <dbl>, `Asparagus lignosus` <dbl>, `Asparagus
## # rubicundus` <dbl>, `Berkheya barbata` <dbl>, `Berzelia abrotanoides` <dbl>,
## # `Berzelia lanuginosa` <dbl>, `Bobartia filiformis` <dbl>, `Bobartia
## # gladiata` <dbl>, `Bobartia indica` <dbl>, `Bolusafra bituminosa` <dbl>,
## # `Bulbine abyssinica` <dbl>, `Caesia contorta` <dbl>, `Capelio
## # tabularis` <dbl>, `Capeobolus brevicaulis` <dbl>, `Capeochloa
## # cincta` <dbl>, `Carpacoce spermacocea` <dbl>, `Carpacoce
## # vaginellata` <dbl>, `Carpobrotus acinaciformis` <dbl>, `Cassine
## # peragua` <dbl>, `Cassytha ciliolata` <dbl>, `Centella affinis` <dbl>,
## # `Centella glabrata` <dbl>, `Centella macrocarpa` <dbl>, `Centella
## # tridentata` <dbl>, `Chironia baccifera` <dbl>, `Chironia decumbens` <dbl>,
## # `Chironia linoides` <dbl>, `Chrysocoma coma-aurea` <dbl>, `Cineraria
## # geifolia` <dbl>, `Cliffortia atrata` <dbl>, `Cliffortia falcata` <dbl>,
## # `Cliffortia ferruginea` <dbl>, `Cliffortia filifolia` <dbl>, `Cliffortia
## # glauca` <dbl>, `Cliffortia obcordata` <dbl>, `Cliffortia
## # polygonifolia` <dbl>, `Cliffortia ruscifolia` <dbl>, `Cliffortia
## # stricta` <dbl>, `Cliffortia subsetacea` <dbl>, `Clutia alaternoides` <dbl>,
## # `Clutia ericoides` <dbl>, `Clutia polygonoides` <dbl>, `Coleonema
## # album` <dbl>, `Conyza pinnatifida` <dbl>, `Corymbium africanum` <dbl>,
## # `Corymbium glabrum` <dbl>, `Crassula capensis` <dbl>, `Crassula
## # coccinea` <dbl>, `Crassula cymosa` <dbl>, `Crassula fascicularis` <dbl>,
## # `Crassula flava` <dbl>, `Crassula nudicaulis` <dbl>, `Crassula
## # subulata` <dbl>, `Cullumia squarrosa` <dbl>, `Cussonia thyrsiflora` <dbl>,
## # `Cymbopogon marginatus` <dbl>, `Cynanchum africanum` <dbl>, `Cynanchum
## # obtusifolium` <dbl>, `Cynodon dactylon` <dbl>, `Cyperus thunbergii` <dbl>,
## # `Diastella divaricata` <dbl>, `Dilatris corymbosa` <dbl>, `Dilatris
## # pillansii` <dbl>, `Dimorphotheca nudicaulis` <dbl>, `Diosma hirsuta` <dbl>,
## # `Diosma oppositifolia` <dbl>, …
…we see that the first column is actually a vector of plot names (e.g. CP_1), which is of class “character”. These will be useful later, but for this analysis we want to drop them. This is easy using indexing with square brackets “[,]”.
rowSums(species66[,2:ncol(species66)])
## [1] 522.0 876.0 331.0 678.0 1029.5 523.0 880.5 1054.5 1306.0 1196.5
## [11] 396.0 531.0 473.0 625.5 197.0 837.0 178.0 377.0 275.0 144.0
## [21] 723.0 624.5 391.0 577.0 441.0 442.0 347.5 419.0 181.0 307.0
## [31] 1341.0 788.5 616.5 452.5 59.0 389.5 244.0 370.0 460.0 837.0
## [41] 214.0 251.0 184.0 448.0 129.0 859.0 299.0 353.0 424.0 467.5
## [51] 461.0 625.0 536.0 1377.5 554.5 311.0 509.5 332.0 1574.0 2345.0
## [61] 562.5 574.0 1584.0 257.0 860.5 1901.0 688.0 622.0 985.0 2045.5
## [71] 485.5 282.0 1335.0 184.0 355.5 270.0 481.0 341.0 50.0 409.0
## [81] 731.0
In this case we selected all columns from the 2nd column to the total number of columns as returned by ncol() (i.e. all but the first column). You could also do it like this…
rowSums(species66[,-1])
## [1] 522.0 876.0 331.0 678.0 1029.5 523.0 880.5 1054.5 1306.0 1196.5
## [11] 396.0 531.0 473.0 625.5 197.0 837.0 178.0 377.0 275.0 144.0
## [21] 723.0 624.5 391.0 577.0 441.0 442.0 347.5 419.0 181.0 307.0
## [31] 1341.0 788.5 616.5 452.5 59.0 389.5 244.0 370.0 460.0 837.0
## [41] 214.0 251.0 184.0 448.0 129.0 859.0 299.0 353.0 424.0 467.5
## [51] 461.0 625.0 536.0 1377.5 554.5 311.0 509.5 332.0 1574.0 2345.0
## [61] 562.5 574.0 1584.0 257.0 860.5 1901.0 688.0 622.0 985.0 2045.5
## [71] 485.5 282.0 1335.0 184.0 355.5 270.0 481.0 341.0 50.0 409.0
## [81] 731.0
This just highlights that R almost always provides a number of ways to do the same thing. For example, if you wanted to calculate the row sums using the tidyverse approach, you could do this
species66[,-1] %>% rowwise() %>% pmap_dbl(.,sum)
## [1] 522.0 876.0 331.0 678.0 1029.5 523.0 880.5 1054.5 1306.0 1196.5
## [11] 396.0 531.0 473.0 625.5 197.0 837.0 178.0 377.0 275.0 144.0
## [21] 723.0 624.5 391.0 577.0 441.0 442.0 347.5 419.0 181.0 307.0
## [31] 1341.0 788.5 616.5 452.5 59.0 389.5 244.0 370.0 460.0 837.0
## [41] 214.0 251.0 184.0 448.0 129.0 859.0 299.0 353.0 424.0 467.5
## [51] 461.0 625.0 536.0 1377.5 554.5 311.0 509.5 332.0 1574.0 2345.0
## [61] 562.5 574.0 1584.0 257.0 860.5 1901.0 688.0 622.0 985.0 2045.5
## [71] 485.5 282.0 1335.0 184.0 355.5 270.0 481.0 341.0 50.0 409.0
## [81] 731.0
Either way, a vector of numbers is not very informative… Let’s plot it in a histogram to look at the frequency of plots with different abundances.
ggplot(data.frame(abundance = rowSums(species66[,-1])),
aes(abundance)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Wow! It ranges from 50 to 2345! The reasons will become clear later on.
What about species number by plot? This is easy to achieve by just counting the number of positive entries in each row, which we get using the logical expression >0, like so
rowSums(species66[,-1]>0)
## [1] 51 50 52 64 44 44 66 53 31 19 27 43 37 34 26 46 14 51 37 15 67 52 51 45 40
## [26] 34 20 51 28 35 56 61 51 28 6 20 21 22 36 33 17 32 18 42 15 47 25 36 36 12
## [51] 39 49 25 49 35 36 39 39 74 71 24 58 79 25 44 74 14 33 51 73 24 24 72 20 11
## [76] 39 27 27 4 46 38
ggplot(data.frame(species_number = rowSums(species66[,-1]>0)),
aes(species_number)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What about the relationship between the two?
ggplot(data.frame(species_number = rowSums(species66[,-1]>0),
abundance = rowSums(species66[,-1])),
aes(abundance, species_number)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ log(x))
A strong positive relationship. One would usually expect a close relationship between abundance and species number in a vegetation survey, because having more plants in your plot means greater potential to sample more species. E.g. if you only have 10 individuals, the maximum number of species you can sample is 10… In Fynbos, the number of individuals you sample in a set area depends on the vegetation age since fire, with greater numbers in younger plots, because the individuals are smaller.
SIDE NOTE ON SPECIES DIVERSITY MEASURES: This is why I have been referring to species number and not species richness. Species richness, when one looks across all taxonomic groups that use the term, is typically defined as the number of species encountered for a given number of individuals sampled. This accounts for any differences in the amount of effort invested in each sampling event (e.g. trawl netting, mist netting, aerial counts, etc).
It is only when working with sedentary organisms like plants that we typically sample per unit area, in which case we are actually sampling species density and NOT species richness. While useful, species density has different properties to species richness that we should always bear in mind – e.g. species density typically declines with the size of the organism sampled, which depends on the kind of organism (tree vs grass) or the age of the organism (seedling vs adult). One can estimate species richness from vegetation plot data, but that requires having counts of the number of individuals and estimating the average number of species one samples for a given number of individuals. I highly recommend you read Gotelli and Colwell 2001. Ecology Letters if you are ever working with diversity data.
So let’s have a look at the change in diversity within sites between surveys. We can use the same code as above, just plotting the species counts from the one year against the other.
ggplot(data.frame(species_number66 = rowSums(species66[,-1]>0),
species_number10 = rowSums(species10[,-1]>0)),
aes(species_number66, species_number10)) +
geom_point() +
geom_smooth(method = "lm")
## Error in data.frame(species_number66 = rowSums(species66[, -1] > 0), species_number10 = rowSums(species10[, : arguments imply differing number of rows: 81, 63
Oh dear! The number of plots surveyed in each year was different for some reason. This is where the plot numbers come in. We need to make sure we’re comparing the same plots. First, we need to see what plot numbers are common between years.
(plots <- intersect(species66$...1, species10$...1))
## [1] "CP_1" "CP_10" "CP_12" "CP_13" "CP_14" "CP_15" "CP_16" "CP_17" "CP_18"
## [10] "CP_19" "CP_2" "CP_21" "CP_22" "CP_27" "CP_28" "CP_29" "CP_3" "CP_31"
## [19] "CP_34" "CP_36" "CP_37" "CP_38" "CP_39" "CP_4" "CP_40" "CP_42" "CP_44"
## [28] "CP_45" "CP_46" "CP_47" "CP_48" "CP_49" "CP_50" "CP_55" "CP_56" "CP_57"
## [37] "CP_58" "CP_59" "CP_61" "CP_62" "CP_63" "CP_64" "CP_65" "CP_66" "CP_67"
## [46] "CP_70" "CP_71" "CP_72" "CP_73" "CP_75" "CP_76" "CP_78" "CP_79" "CP_8"
## [55] "CP_80" "CP_82" "CP_83" "CP_88" "CP_89" "CP_9" "CP_92" "CP_95" "CP_99"
Note that if you get an error, it may be because you don’t have a column called ‘...1’. This was an automatic name R gave the column. On Mac it may be ‘X__1’. If you get an error, run the code names(species66)[1] to see what R called it on your computer and change the column name in the lines of code above and below. Sorry about that.
Now we need to extract these “intersect()ing” plots from each survey’s community data matrix, like so
species66c <- species66[which(species66$...1 %in% plots), -1]
species10c <- species10[which(species10$...1 %in% plots), -1]
Now let’s try to look at the change in diversity within sites between surveys again.
ggplot(data.frame(species_number66 = rowSums(species66c>0),
species_number10 = rowSums(species10c>0)),
aes(species_number66, species_number10)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Ok, a lot of difference between the two surveys. One thing we haven’t accounted for is the fact that the surveys record all species, but in Fynbos we usually exclude the “seasonally apparent” species like geophytes or annuals and only analyze the data for the “permanently apparent” species. This is because some species are only visible for a few weeks or less, but we only survey the plots once – and not every week for a full year. In this analysis we also want to exclude alien species.
You’ll note that one of the spreadsheets in our dataset is called “excluded_spp”
excel_sheets("pnas.1619014114.sd01.xlsx")
## [1] "METADATA" "weather" "fires"
## [4] "postfireweather" "enviroment" "excluded_spp"
## [7] "veg1966" "veg1996" "veg2010"
## [10] "traits" "speciesclimatedata"
We need to read these in…
excludes <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "excluded_spp")
## New names:
## * `` -> ...1
head(excludes)
## # A tibble: 6 x 2
## ...1 x
## <dbl> <chr>
## 1 1 Acacia saligna
## 2 2 Adenocline pauciflora
## 3 3 Agapanthus walshii
## 4 4 Albuca flaccida
## 5 5 Albuca juncifolia
## 6 6 Annesorhiza altiscapa
…and cut them out of our community data matrices, and plot again
species66c <- species66c[, -which(colnames(species66c) %in% excludes$x)]
species10c <- species10c[, -which(colnames(species10c) %in% excludes$x)]
ggplot(data.frame(species_number66 = rowSums(species66c>0),
species_number10 = rowSums(species10c>0)),
aes(species_number66, species_number10)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
…very similar. So why the difference among surveys?
Well it could be the differences in post-fire vegetation age and the impacts on the numbers of individuals, and thus species sampled. Let’s look at the relationship between the change in numbers of species and change in vegetation age.
Vegetation age for each plot at the time of each survey is in the enviroment spreadsheet. NOTE that I spelled it incorrectly in the data I submitted to PNAS… Nobody’s perfect…
env <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "enviroment", na = "NA")
## New names:
## * `` -> ...1
head(env)
## # A tibble: 6 x 10
## ...1 Plot Moisture_class Age1996 Age2010 Age1966 Aliens_max firecount66_10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 5 10 10 8 3 3
## 2 2 2 5 10 10 5 1 3
## 3 3 3 5 10 10 7 1 3
## 4 4 4 4 10 24 18 3 NA
## 5 5 8 5 10 3 3 3 2
## 6 6 9 5 10 24 3 2 1
## # … with 2 more variables: firecount66_96 <dbl>, firecount96_10 <dbl>
Ok, first we need to match the plots in the env object to those in the community data matrices. This is a little tricky, because the Plot column in the env object only has numbers, whereas in the community data matrices they were prefixed with “CP_”. We’ll have to paste() the prefix on, like so
(env$...1 <- paste("CP_", env$Plot, sep = ""))
## [1] "CP_1" "CP_2" "CP_3" "CP_4" "CP_8" "CP_9" "CP_10" "CP_12"
## [9] "CP_13" "CP_14" "CP_15" "CP_16" "CP_17" "CP_18" "CP_19" "CP_21"
## [17] "CP_22" "CP_23" "CP_24" "CP_25" "CP_27" "CP_28" "CP_29" "CP_30"
## [25] "CP_31" "CP_34" "CP_36" "CP_37" "CP_38" "CP_39" "CP_40" "CP_42"
## [33] "CP_44" "CP_45" "CP_46" "CP_47" "CP_48" "CP_49" "CP_50" "CP_54"
## [41] "CP_55" "CP_56" "CP_57" "CP_58" "CP_59" "CP_60" "CP_61" "CP_62"
## [49] "CP_63" "CP_64" "CP_65" "CP_66" "CP_67" "CP_68" "CP_70" "CP_71"
## [57] "CP_72" "CP_73" "CP_74" "CP_75" "CP_76" "CP_78" "CP_79" "CP_80"
## [65] "CP_81" "CP_82" "CP_83" "CP_84" "CP_86" "CP_87" "CP_88" "CP_89"
## [73] "CP_90" "CP_91" "CP_92" "CP_94" "CP_95" "CP_97" "CP_98" "CP_99"
## [81] "CP_100"
And then join env to the plot names in species10 (its first column). Note that left_join() keeps all the rows of the first object supplied, in the same order, so env is trimmed and sorted to match the plots in species10.
Note that you may need to change the ‘...1’ above to whatever the first column of your species10 dataframe is named, as mentioned earlier. Otherwise the next line of code returns an empty object and the plotting fails.
env <- left_join(species10[,1], env)
## Joining, by = "...1"
Ok, now we’re ready to plot the change in numbers of species with the change in vegetation age.
ggplot(data.frame(
change_in_species_number = rowSums(species66c>0) - rowSums(species10c>0),
change_in_age = env$Age1966 - env$Age2010),
aes(change_in_species_number, change_in_age)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).
Ignore the warnings. They appear because some of the plots don’t have a known vegetation age, as they have never burnt on record.
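You can count how many plots are affected like so (the number should match the rows removed in the warnings):

sum(is.na(env$Age1966 - env$Age2010)) #plots where either vegetation age is unknown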
So vegetation age clearly is an issue, with numbers of species declining as the vegetation ages (or increasing after a plot burns).
One approach to deal with this is to “rarefy” the samples. This involves estimating each plot’s expected species richness by randomly sampling a set number of individuals from each community. Again, I highly recommend you read Gotelli and Colwell 2001. Ecology Letters.
In this case we’ll sample 50 individuals, because that is the minimum abundance recorded in our plots.
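If you'd like to verify that, a quick check (my addition, not in the original analysis) lets you confirm that the smallest total count per plot is at least 50:
# Minimum total (estimated) abundance across plots in each survey
min(rowSums(ceiling(species66c)))
min(rowSums(ceiling(species10c)))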
ggplot(data.frame(species_richness66 = rarefy(x = ceiling(species66c), sample = 50),
species_richness10 = rarefy(x = ceiling(species10c), sample = 50)),
aes(species_richness66, species_richness10)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Much more similar… but still some differences. It looks like there may be fewer species in 2010, but this should be revealed by the slope of a regression line. We can get the slope by looking at the summary() of the linear model (lm()):
sr66 <- rarefy(x = ceiling(species66c), sample = 50)
sr10 <- rarefy(x = ceiling(species10c), sample = 50)
summary(lm(sr10 ~ sr66 - 1))
##
## Call:
## lm(formula = sr10 ~ sr66 - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0426 -1.9667 -0.1091 2.9906 8.0104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sr66 0.79423 0.02309 34.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.51 on 62 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9494
## F-statistic: 1183 on 1 and 62 DF, p-value: < 2.2e-16
Note that I forced the intercept to be 0 by adding "-1" to the formula. The interesting thing is that the slope is 0.7942285, well below 1, which suggests lower species numbers in 2010…
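To check whether the slope really is significantly below 1 (an extra step I'm adding here, not part of the original analysis), we can inspect the confidence interval for the coefficient:
# 95% confidence interval for the slope; if it excludes 1, richness differs between surveys
confint(lm(sr10 ~ sr66 - 1))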
We can also test this with a paired t-test:
t.test(sr66, sr10, paired = T)
##
## Paired t-test
##
## data: sr66 and sr10
## t = 6.9289, df = 62, p-value = 2.854e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.471771 4.476248
## sample estimates:
## mean of the differences
## 3.474009
Yup, there was a significant drop in species numbers between 1966 and 2010, even after accounting for differences in vegetation age/numbers of individuals. Bear in mind that these results are not necessarily defensible (and do not appear in the paper), because the 1966 "abundance" data are only estimates made by taking the mid-points of the categorical abundance classes recorded in the original survey. This may well bias the rarefaction results.
I'm not going to delve into this much, but since one of the biggest topics community ecologists are interested in is the relationship between species' traits and environmental variables, I feel it bears mentioning.
Why is this called "the fourth corner"? I mentioned that analyses of biotic community data typically involve 3 data matrices: sites by species (the community data), sites by environmental variables, and species by traits. The "fourth corner" is the matrix we have to estimate from the other matrices, because we can't observe it directly – that of traits by environment.
There are a multitude of methods out there for looking at the fourth corner, and covering this is beyond the scope of this tutorial, but I thought I’d demonstrate one of the simplest approaches – that of comparing the relative dominance of different functional types across different environments.
First we need to read in the trait data, which is the worksheet called traits.
traits <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "traits")
## New names:
## * `` -> ...1
head(traits)
## # A tibble: 6 x 9
## ...1 species family_GM resprout_postfi… herb geophyte graminoid low_shrub
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 17 Adenan… RUTACEAE 0 0 0 0 1
## 2 18 Adenan… RUTACEAE 0 0 0 0 1
## 3 29 Agatho… RUTACEAE 1 0 0 0 1
## 4 31 Agatho… RUTACEAE 1 0 0 0 1
## 5 33 Agatho… RUTACEAE 0 0 0 0 1
## 6 37 Agatho… RUTACEAE 1 0 0 0 1
## # … with 1 more variable: tall_shrub <dbl>
You'll see we have a column called resprout_postfire, which contains information on the postfire regeneration strategy for all species, indicating whether they are killed by fire and have to recruit from seed (0 = obligate seeder) or can persist through fire and resprout from remaining material (1 = sprouter).
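A quick tally (my addition) shows how many species fall into each class:
# Count species per regeneration strategy (0 = seeder, 1 = sprouter)
table(traits$resprout_postfire, useNA = "ifany")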
First we pull out vectors of the species names matching each regeneration type. I’m sure there’s an easier way, but this at least is relatively transparent.
sprouters <- unlist(na.omit(traits[traits$resprout_postfire == 1,]$species))
seeders <- unlist(na.omit(traits[traits$resprout_postfire == 0,]$species))
sprouters[1:10]
## [1] "Agathosma bifida" "Agathosma capensis" "Agathosma hookeri"
## [4] "Agathosma imbricata" "Amphithalea ericifolia" "Aristea africana"
## [7] "Aristea capitata" "Aristea glauca" "Aspalathus linguiloba"
## [10] "Aspalathus microphylla"
Now we use the sets of species names to extract the species counts and abundances for each regeneration type for each plot and bind them into a new dataframe that includes the soil moisture classes. The classes are labelled 1 to 5, representing:
1 = Permanently Wet
2 = Seasonally Wet
3 = Seasonally Moist (typically bordering a seasonal wetland with indicators of some influence such as the presence of a few wetland species)
4 = Well-drained (no evidence of topographic influence on soil moisture and likely reliant on mist/fog and rain only)
5 = Dry (essentially class 4 combined with other factors like N aspect and shallow soil that would affect plant water stress)
regen <- data.frame(
sprouter_spp = as_tibble(species10c>0) %>% select(any_of(sprouters)) %>% rowwise() %>% pmap_dbl(.,sum),
seeder_spp = as_tibble(species10c>0) %>% select(any_of(seeders)) %>% rowwise() %>% pmap_dbl(.,sum),
sprouter_abund = as_tibble(species10c) %>% select(any_of(sprouters)) %>% rowwise() %>% pmap_dbl(.,sum),
seeder_abund = as_tibble(species10c) %>% select(any_of(seeders)) %>% rowwise() %>% pmap_dbl(.,sum),
soil_moisture = as.factor(env$Moisture_class))
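As an aside, here's a simpler base-R way to get the same kind of per-plot sums (my alternative sketch, equivalent in intent), shown for the sprouters only:
# Species counts and total abundances per plot for sprouters
sprouter_spp_alt <- rowSums(species10c[, colnames(species10c) %in% sprouters] > 0)
sprouter_abund_alt <- rowSums(species10c[, colnames(species10c) %in% sprouters])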
Turn it into tidy (i.e. long format) data for plotting.
regen <- regen %>% pivot_longer(
cols = c("sprouter_spp", "seeder_spp", "sprouter_abund", "seeder_abund"),
names_to = c("Regeneration mode", "Variable"),
names_pattern = "(.*)_(.*)",
values_to = "Number")
regen
## # A tibble: 252 x 4
## soil_moisture `Regeneration mode` Variable Number
## <fct> <chr> <chr> <dbl>
## 1 5 sprouter spp 19
## 2 5 seeder spp 31
## 3 5 sprouter abund 526
## 4 5 seeder abund 1292
## 5 5 sprouter spp 19
## 6 5 seeder spp 27
## 7 5 sprouter abund 371
## 8 5 seeder abund 2132
## 9 4 sprouter spp 19
## 10 4 seeder spp 29
## # … with 242 more rows
Note the regular expression passed to names_pattern to convert the column names ("sprouter_spp", "seeder_spp", "sprouter_abund", "seeder_abund") into two columns ("Regeneration mode", "Variable") filled with "seeder"/"sprouter" and "spp"/"abund" respectively.
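To see how the pattern splits a single name (just an illustration), run it through stringr::str_match():
# Returns the full match followed by the two capture groups
stringr::str_match("sprouter_abund", "(.*)_(.*)")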
Now we can plot them
pal <- c("#56B4E9", "#CC79A7")#, "#009E73", "#0072B2", "#E69F00") #colourblind palette
ggplot(data = regen, aes(y = `Number`, x = soil_moisture, fill = `Regeneration mode`)) +
geom_boxplot() +
facet_wrap(~Variable, scales = "free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_fill_manual(values=rev(pal))
Interestingly, the abundance of seeders is much higher in wetter sites and decreases towards drier sites, while the sprouters show little difference. In contrast, species counts for both regeneration types increase towards the drier sites. What have I not done (or accounted for) here, and how might it confound these results?
This set of tutorials is aimed at helping you access and analyze SAEON datasets and other useful resources provided by others. Most tutorials use the R statistical computing language, but they should still contain useful info on the datasets and how to access them even if you don't use R. If you'd like to learn to use R, we recommend you start with the R Intro material.
One of the biggest challenges of working on global change issues in South Africa is not having easy access to the country’s long-term weather records. One can occasionally access data for research purposes, especially if there’s a student involved, but otherwise our weather service charges a hefty fee for data.
Fortunately, we can access a lot of South Africa’s data through the National Oceanic and Atmospheric Administration of the USA (NOAA) for free here!!!
NOAA serves a massive range of datasets, but the most useful in terms of station data appear to be the Integrated Surface Database (ISD) and the Global Historical Climatology Network Daily (GHCND) dataset, both of which we explore below.
BUT WAIT, THERE'S MORE! If you call now, you'll get the dedicated rnoaa R package written by the amazing fellows at rOpenSci absolutely free!!!
Here's a quick demo of library(rnoaa) below, largely using repurposed and updated code from Scott Chamberlain's blog posts here and here, and from the package vignette.
DISCLAIMER: We haven't read the documentation for the datasets and make no guarantee of their accuracy or of the validity of our assumptions when summarizing and presenting the data. This applies to data quality, gaps, units, etc.
Let’s get set up…
library("rnoaa")
library("dplyr")
library("reshape2")
library("lawn")
library("leaflet")
library("lubridate")
library("ggplot2")
library("htmltools")
The Integrated Surface Database (ISD) "is composed of worldwide surface weather observations from over 35,000 stations", nominally at hourly resolution. See the link above for more details.
Let’s see what stations we can find for South Africa and surrounds?
###Make a bounding box
bbox <- c(16, -35, 34, -22)
###Search the bounding box for stations
out <- isd_stations_search(bbox = bbox)
## using cached file: /home/jasper/.cache/R/noaa_isd/isd_stations.rds
## date created (size, mb): 2020-08-02 21:27:15 (0.683)
###Have a look at the output
head(out)
## # A tibble: 6 x 11
## usaf wban station_name ctry state icao lat lon elev_m begin end
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 673080 99999 CHICUALACUALA MZ "" "" -22.1 31.7 453 1.97e7 2.02e7
## 2 673090 99999 DINDIZA-GAZA MZ "" "" -22.1 33.4 102 2.01e7 2.02e7
## 3 673310 99999 MAPULANGUENE-… MZ "" "" -24.5 32.1 151 2.01e7 2.02e7
## 4 673350 99999 XAI XAI MZ "" "FQX… -25.0 33.6 5 1.97e7 2.02e7
## 5 673390 99999 MAPUTO MZ "" "" -26.0 32.6 59 1.95e7 2.02e7
## 6 673410 99999 MAPUTO MZ "" "FQM… -25.9 32.6 44.2 1.97e7 2.02e7
It's difficult to read a table, so let's put the stations on a map.
###Make the popup labels
out$uw <- paste("usaf/wban = ", out$usaf, "/", out$wban, sep = "")
out$period <- paste("years = ", substr(out$begin,1,4), " to ", substr(out$end,1,4), sep = "")
name <- paste(sep = "<br/>", out$station_name, out$period, out$uw)
###Plot the map
leaflet(data = out) %>%
addTiles() %>%
addCircleMarkers(label = lapply(name, HTML), radius = 3)
There’s also a function for searching for all stations within a specified radius of a point of interest.
out <- isd_stations_search(lat = -34.1, lon = 18.45, radius = 40)
head(out)
## # A tibble: 6 x 12
## usaf wban station_name ctry state icao lat lon elev_m begin end
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6891… 99999 SIMONSTOWN SF "" "" -34.2 18.4 293 1.95e7 1.95e7
## 2 6891… 99999 SLANGKOP SF "" "" -34.2 18.3 8 2.00e7 2.02e7
## 3 6881… 99999 MOLTENO RES… SF "" "" -33.9 18.4 97 2.01e7 2.02e7
## 4 6899… 99999 CT-AWS SF "" "" -34.0 18.6 46 2.00e7 2.02e7
## 5 6881… 99999 CAPE TOWN I… SF "" "FAC… -34.0 18.6 46 1.95e7 2.02e7
## 6 6881… 99999 CAPE TOWN -… SF "" "" -33.9 18.4 2 2.00e7 2.02e7
## # … with 1 more variable: distance <dbl>
And if we visualize the area searched…
pt <- lawn::lawn_point(c(18.45, -34.1))
lawn::lawn_buffer(pt, dist = 40) %>% view
Ok, now let's get a smattering of stations from across the Cape Peninsula that have data for 2019, combine them into one object and plot them.
###Pull data from online database to local file and then read in
res1 <- isd(usaf=out$usaf[2], wban="99999", year=2019, force = T)
res2 <- isd(usaf=out$usaf[3], wban="99999", year=2019, force = T)
res3 <- isd(usaf=out$usaf[5], wban="99999", year=2019, force = T)
res4 <- isd(usaf=out$usaf[9], wban="99999", year=2019, force = T)
###Combine data into one dataframe
res_all <- bind_rows(res1, res2, res3, res4) #Note that not all stations have all variables and bind_rows() fills empty columns with NAs
###Create a combined date + time column
res_all$date_time <- ymd_hm(
sprintf("%s %s", as.character(res_all$date), res_all$time)
)
###Convert "temperature" column from codes to numeric and divide by 10 (the data come in tenths of a degree)
res_all$temperature <- as.numeric(substr(res_all$temperature, 2, 5))/10
###Remove missing values, which NOAA code as "+9999" (i.e. 999.9 after conversion)
res_all$temperature[which(res_all$temperature > 900)] <- NA
###Visualize the data
ggplot(res_all, aes(date_time, temperature)) +
geom_line() +
facet_wrap(~usaf_station, scales = "free_x")
Recall that the ISD dataset is hourly (or mostly 3-hourly in this case), so perhaps better to look at daily max or min temperatures and precipitation?
###Some dodgy cleaning of the rainfall data - CHECK MY ASSUMPTION THAT NA = 0mm RAINFALL
res_all$AA1_depth[is.na(res_all$AA1_depth)] <- 0
res_all$AA1_depth[which(res_all$AA1_depth == 9999)] <- NA
res_all$AA1_depth <- as.numeric(res_all$AA1_depth)
res_all$date <- as.Date(res_all$date, format = "%Y%m%d")
###Summarize Tmin, Tmax and Precipitation
dayres <- res_all %>% group_by(usaf_station, date) %>% summarise(tmax = max(temperature, na.rm = T), tmin = min(temperature, na.rm = T), prec = max(AA1_depth, na.rm = T)/10) #note I take max precip and divide by ten, because it seems to be an hourly cumulative sum in tenths of a mm (check this!)
###Melt and plot
dayres %>% melt(id = c("usaf_station", "date")) %>% ggplot(aes(date, value, colour = usaf_station)) +
geom_line() +
facet_wrap(~variable, scales = "free_y")
Or monthly…
###Fix dates
res_all$month <- month(res_all$date_time) #note this is all in 2019
###Summarize Tmin, Tmax and Precipitation
monthres <- res_all %>% group_by(usaf_station, month) %>% summarise(tmax = max(temperature, na.rm = T), tmin = min(temperature, na.rm = T), prec = sum(AA1_depth, na.rm = T)/30)
###Melt and plot
monthres %>% melt(id = c("usaf_station", "month")) %>% ggplot(aes(month, value, colour = usaf_station)) +
geom_line() +
facet_wrap(~variable, scales = "free_y")
Note that the rnoaa functions save files in a local cache. If you repeat a search, they usually just read in that cache rather than redoing the search and download. Some functions allow you to specify force = TRUE to force a new download, but others don't, in which case you will need to clear the cache manually. You can find the cache on your machine by calling rappdirs::user_cache_dir("rnoaa/ghcnd").
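For example, a minimal sketch (cache paths may differ on your machine) to inspect and clear that cache:
###See where rnoaa caches GHCND files and what's in there
cache_dir <- rappdirs::user_cache_dir("rnoaa/ghcnd")
list.files(cache_dir)
###Uncomment to delete the cached files and force fresh downloads
# unlink(file.path(cache_dir, list.files(cache_dir)))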
What if we wanted to look at a longer record? The isd() function only allows us to pull one year at a time, so we need to automate the downloads using a loop or similar. We may also want to select stations by name rather than by usaf/wban codes. You can easily pick your own station names by exploring the interactive map we made above. Then we'll loop through a few years for each station to get a multi-year record.
Here’s the daily Tmin and Tmax for the period 2013 to 2020 for a few stations around Cape Town.
###Select stations by name and filter
stations <- c("SLANGKOP", "MOLTENO RESERVIOR", "CAPE TOWN INTL", "CAPE POINT")
sout <- filter(out, station_name %in% stations)
###Set date range
yr <- 2013:2020
###Ok, now I'll use 2 nested "for" loops to extract data for all years for each station. "for" loops can often be slower than other approaches like writing functions, but in this case it's not a problem because the API call and data download are the slow bit. A "for" loop also makes it easier to follow what I'm doing.
drtdat <- list() #Make list object to capture outputs (elements will be stations)
#For each station
for (i in 1:nrow(sout)) {
res <- list() #Make list object to capture interim outputs stations by station (elements will be years)
#For each year (j), extract the data for station (i)
for (j in 1:length(yr)) {
res[[j]] <- tryCatch({
isd(usaf=sout$usaf[i], wban=sout$wban[i], year=yr[j]) }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
res <- bind_rows(res) #Convert the interim list object into a dataframe
drtdat[[i]] <- res #Add the dataframe for station i to the output list
}
###Convert output to a dataframe and add the station names
names(drtdat) <- stations
dat <- bind_rows(drtdat)
dat <- right_join(sout[,c("station_name", "usaf")], dat, by = c("usaf"="usaf_station"))
###Create a combined date + time column and year/month/day columns
dat$date_time <- ymd_hm(
sprintf("%s %s", as.character(dat$date), dat$time)
)
dat$year <- year(dat$date_time) #note this is NOT all in 2019 this time
dat$month <- month(dat$date_time)
dat$day <- day(dat$date_time)
dat$date <- as.Date(dat$date, format = "%Y%m%d")
###Convert "temperature" column from codes to numeric
dat$temperature <- as.numeric(substr(dat$temperature, 2, 5))/10
###Remove missing values, which NOAA code as "+9999" (i.e. 999.9 after conversion)
dat$temperature[which(dat$temperature > 900)] <- NA
###Some dodgy cleaning of the rainfall data - CHECK MY ASSUMPTION THAT NA = 0mm RAINFALL
dat$AA1_depth[is.na(dat$AA1_depth)] <- 0
dat$AA1_depth[which(dat$AA1_depth == 9999)] <- NA
dat$AA1_depth <- as.numeric(dat$AA1_depth)
###Daily summary of Tmin and Tmax
daydat <- dat %>% group_by(station_name, date) %>% summarise(tmax = max(temperature, na.rm = T), tmin = min(temperature, na.rm = T))
###Melt and plot
daydat %>% melt(id = c("station_name", "date")) %>% ggplot(aes(date, value, colour = variable)) +
geom_line() +
facet_wrap(~station_name)
Or precipitation by month?
###Summarize Tmin, Tmax and Precipitation
monthdat <- dat %>% group_by(station_name, year, month) %>% summarise(prec = sum(AA1_depth, na.rm = T)/30)
monthdat$date <- as.Date(paste(monthdat$year, monthdat$month, "01", sep = "/"), format = "%Y/%m/%d")
###Melt and plot
monthdat %>% select(station_name, date, prec) %>% ggplot(aes(date, prec)) +
geom_line() +
facet_wrap(~station_name, scales = "free_y")
Ok, what about looking at even longer records? Most of South Africa has experienced pretty extreme drought in the past 5 years or so, so let’s pull out a few major cities and towns that give us good coverage of the country.
Note that the Integrated Surface Database (ISD) gives us hourly data, which is overkill for looking at longer records as the dataset gets very large. For this it is better to query the Global Historical Climatology Network Daily (GHCND) dataset, as described by Menne et al. 2012. CAVEAT: We take no responsibility for the data. Have a look at the paper for details on error corrections, biases, etc.
Let’s have a look at Cape Town International Airport
ct <- meteo_tidy_ghcnd(stationid = "SFM00068816", var = c("TMAX", "TMIN", "PRCP"))
head(ct)
## # A tibble: 6 x 5
## id date prcp tmax tmin
## <chr> <date> <dbl> <dbl> <dbl>
## 1 SFM00068816 1973-06-01 NA NA NA
## 2 SFM00068816 1973-06-02 NA NA NA
## 3 SFM00068816 1973-06-03 NA NA NA
## 4 SFM00068816 1973-06-04 NA NA NA
## 5 SFM00068816 1973-06-05 NA NA NA
## 6 SFM00068816 1973-06-06 NA NA NA
Note that we had to know the station ID "SFM00068816". For some reason you can't query the database of stations using the same method as for the ISD; you have to download the entire list of all ~630 000 global weather stations and query it locally to get the station ID.
station_data <- ghcnd_stations() # Takes a while to run
head(station_data)
## # A tibble: 6 x 11
## id latitude longitude elevation state name gsn_flag wmo_id element
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" TMAX
## 2 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" TMIN
## 3 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" PRCP
## 4 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" SNOW
## 5 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" SNWD
## 6 ACW0… 17.1 -61.8 10.1 "" ST J… "" "" PGTM
## # … with 2 more variables: first_year <int>, last_year <int>
You can now query this database using the meteo_nearby_stations() function, which allows you to find all stations within a certain radius of a coordinate, or the closest x stations. Here are the closest stations to Cape Point lighthouse.
lat_lon_df <- data.frame(id = "somewhere",
latitude = -34.35,
longitude = 18.48)
nearby_stations <- meteo_nearby_stations(lat_lon_df = lat_lon_df,
station_data = station_data, radius = 100)
nearby_stations$somewhere
## # A tibble: 23 x 5
## id name latitude longitude distance
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 SF000047220 GROOT CONSTANTIA -34.0 18.4 36.0
## 2 SF000213300 EERSTERIVIER (BOS) -34.0 18.7 41.0
## 3 SF000206890 TABLE.MTN WOODHEAD TUNNEL -34.0 18.4 42.2
## 4 SFM00068816 CAPE TOWN INTL -34.0 18.6 44.3
## 5 SF000207470 TABLE.MTN PLATTEKLIP -34.0 18.4 44.8
## 6 SF000208660 ROYAL OBSERVATORY -33.9 18.5 46.7
## 7 SF000207760 CAPE TOWN FIRE STATION -33.9 18.4 46.9
## 8 SF000207460 TABL.MTN MOLTENO RSV -33.9 18.4 47.0
## 9 SF000210550 CAPE TOWN MAITLAND (CEM) -33.9 18.5 48.0
## 10 SF000216550 STELLENBOSCH (TNK) -33.9 18.9 58.9
## # … with 13 more rows
Interesting… Cape Point lighthouse isn't on the list, even though it's a Global Atmosphere Watch (GAW) station and is in the ISD data. No idea why…
Note that you can also just do a good old text search for the names of a few stations you know, as sketched below. This also allows you to filter by multiple criteria at once, e.g. we can filter for element = PRCP to only get rainfall data. You can use the map we made earlier to get your own list of names, but note that, like "CAPE POINT", they're not all in the GHCND database.
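A text search could look something like this (a quick sketch using the station_data object from above):
###Find all stations with "CAPE" in the name that record rainfall
station_data %>% filter(element == "PRCP", grepl("CAPE", name)) %>% select(id, name)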
Since most of South Africa has experienced extreme drought in the past few years, perhaps we should pull out and compare the rainfall for a few stations from around the country? In this case we’ve selected a few of the larger towns and cities.
###Select stations by name and filter
stations <- c("CAPE TOWN INTL",
"SPRINGBOK",
"BLOEMFONTEIN AIRPOR",
"PORT ELIZABETH INTL",
"DURBAN INTL",
"JOHANNESBURG INTL")
stas <- station_data %>% filter(element == "PRCP") %>%
filter(name %in% stations)
stas
## # A tibble: 6 x 11
## id latitude longitude elevation state name gsn_flag wmo_id element
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 SF00… -29.7 17.9 1007 "" SPRI… "GSN" 68512 PRCP
## 2 SF00… -29.1 26.3 1354 "" BLOE… "GSN" 68442 PRCP
## 3 SFM0… -26.1 28.2 1694. "" JOHA… "" 68368 PRCP
## 4 SFM0… -30.0 31.0 10.1 "" DURB… "" 68588 PRCP
## 5 SFM0… -34.0 18.6 46 "" CAPE… "GSN" 68816 PRCP
## 6 SFM0… -34.0 25.6 68.9 "" PORT… "GSN" 68842 PRCP
## # … with 2 more variables: first_year <int>, last_year <int>
Looks like we should be able to compare the period 1980 to 2019 for all stations. Let’s download the data.
dat <- meteo_pull_monitors(stas$id, var = c("TMAX", "TMIN", "PRCP"), date_min = "1980-01-01", date_max = "2019-12-31")
dat[1:5,]
## # A tibble: 5 x 5
## id date prcp tmax tmin
## <chr> <date> <dbl> <dbl> <dbl>
## 1 SF002146700 1980-01-01 0 NA NA
## 2 SF002146700 1980-01-02 0 NA NA
## 3 SF002146700 1980-01-03 0 NA NA
## 4 SF002146700 1980-01-04 0 NA NA
## 5 SF002146700 1980-01-05 0 NA NA
Ok, now let’s add our station names, summarize by year, and plot.
stas$shortname <- c("SBK", "BFN", "JHB", "DBN", "CPT", "PE")
dat <- left_join(dat, stas[,c("id", "name", "shortname")])
dat$year <- substr(dat$date,1,4)
mdat <- dat %>% group_by(shortname, year) %>% summarise(rain = sum(prcp, na.rm = T)/10) #Get annual totals (GHCND reports prcp in tenths of a mm, hence dividing by 10)
mdat$year <- as.numeric(mdat$year)
mdat$rain <- as.numeric(mdat$rain)
###Visualize the data
ggplot(mdat, aes(x = year, y = rain)) +
geom_line() +
facet_wrap(~shortname)
Or something prettier?
###calculate average rainfall and yearly anomalies
ymean <- mdat %>% group_by(shortname) %>% summarise(mean = mean(rain, na.rm = T))
mdat <- left_join(mdat, ymean, by = "shortname")
mdat$positive <- mdat$rain - mdat$mean
mdat$positive[which(mdat$positive<0)] <- 0
mdat$negative <- mdat$rain - mdat$mean
mdat$negative[which(mdat$negative>0)] <- 0
###Plot
ggplot(mdat, aes(x = year)) +
geom_ribbon(aes(ymin = 0, ymax = positive, fill="positive")) + geom_ribbon(aes(ymin = negative, ymax = 0 ,fill="negative")) +
scale_fill_manual(name="",values=c("#D6604D", "#4393C3")) +
facet_grid(shortname ~ ., scales = "free_y") +
ylab("rainfall anomaly in mm")
We haven't dealt with data gaps, data quality verification, etc., but this quick and dirty analysis suggests it's been a pretty dismal decade for South Africa all round…