So cold in Dexter
A couple of weeks ago, my wife conned our 15-year-old into rototilling a part of our pig pen so she could have a garden. She’s slowly starting to plant stuff in it. We also have about 20 apple trees that have started to blossom. I think we’re past the point of worrying about hard frosts that could kill germinating plants or apple blossoms. But how sure should I be? I thought this was a great question for a Code Club. If you google the name of your town and “frost dates” you’ll find a bunch of gardening websites that give guidance on when to plant in the spring and when to expect your harvesting to be done in the fall. Of course we could find one of those websites, but where’s the fun in that?
Today’s Code Club will answer this and other questions using functions we’ve already seen, including filter, group_by, and summarize. We’ll focus on how we can use the mutate function to create new columns in our data frames. To spice things up a bit we’ll also talk about how we can use arithmetic functions with logical variables. By the end of today’s Code Club we’ll know when I can stop worrying about my wife’s garden and my apple trees.
Don’t watch the video straight through without firing up RStudio and trying the code and exercises yourself! Please be sure to see the setup instructions before you get going.
Prompt
Like we did last week, I used the NOAA website to track down daily high and low temperatures going back to 1891 for Ann Arbor, MI. You can find similar data for your favorite place by using the Climate Data Online Search to find “Daily Summaries” datasets. At the bottom of your location’s page, be sure to click on the specific station that has the longest history.
I have Ann Arbor’s data stored on GitHub and will use read_csv to read it in, then do some small touch-ups to extract the year, month, and day from the date. We’ll get some warning messages, which we can ignore for this Code Club.
library(tidyverse)
library(lubridate)

# Read the daily summaries, parsing TOBS as a number
aa_weather <- read_csv("https://github.com/riffomonas/generalR_data/blob/master/noaa/USC00200230.csv?raw=true",
                       col_types = cols(TOBS = col_double())) %>%
  select(DATE, TMAX, TMIN, TOBS) %>%
  # Pull the year, month, and day out of the date
  mutate(year = year(DATE),
         month = month(DATE),
         day = day(DATE))
Breaking down our approach
My question boils down to this: what is the probability of having a temperature below freezing on each day of the year? Alternatively, we could ask, across the last 130 years, how many times has there been a low temperature below freezing on each day of the year? To do this, we need to do a few things…
- Determine whether a low temperature was below freezing for every day in our dataset
- Aggregate our data by month and day
- Determine the fraction of years in which each month-day combination had a low temperature below freezing
- Find the day in May after which the risk of having another day with below-freezing temperatures drops below 5 or 10%.
If you’ve been following along with past Code Clubs, you already know how to do most of these steps. Now to convert these steps to R code!
The mutate function
That code block has a line that tells R to create three new columns: year, month, and day. The function that does this is mutate. We can use mutate to write over existing columns or to create new columns. In this example, I used mutate along with the year, month, and day functions from the lubridate package to extract information from the DATE column of our data frame. Let’s step back and see how we can use mutate to add a column.
The mutate function takes as its arguments the name of the new column (e.g. year), an equal sign (i.e. =), and a formula telling R how to calculate the values for the new column. We want the output of the formula to be a single value for each row. The ?mutate page calls these “Name-value pairs”. We can have multiple name-value pairs that we set apart with commas as I did in the code chunk above. There are other mutate functions including mutate_if, mutate_at, mutate_all, transmute, transmute_if, transmute_at, and transmute_all. You can learn more about them at their help pages, but honestly, I don’t think I’ve ever used them.
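To see the name-value pairs in isolation, here is a minimal sketch on a made-up three-row data frame. The toy tibble and its values are assumptions for illustration, not the Ann Arbor data. It creates the year, month, and day columns and also writes over the existing TMIN column.

```r
library(dplyr)
library(lubridate)

# toy is a hypothetical stand-in for aa_weather
toy <- tibble(DATE = as.Date(c("1891-05-01", "1950-05-10", "2020-10-05")),
              TMIN = c(-1.1, 3.3, 0.6))

toy_dates <- toy %>%
  mutate(year = year(DATE),    # new column
         month = month(DATE),  # new column
         day = day(DATE),      # new column
         TMIN = round(TMIN))   # same name as an existing column, so it is overwritten

toy_dates
```

Each name-value pair is separated by a comma, and each formula returns one value per row.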
To determine whether a given day had a low temperature below freezing, we will use the mutate function to create a new column, below_freezing, that will have a logical value of TRUE or FALSE, or NA if the low temperature was not reported that day. The low temperature is recorded in degrees Celsius in the TMIN column, so asking TMIN < 0 will tell us whether the temperature was below freezing.
aa_weather %>%
  mutate(below_freezing = TMIN < 0)
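Here is a small sketch of what that comparison returns, using a made-up TMIN column (the four values are assumptions chosen to cover each case, including a missing reading).

```r
library(dplyr)

# toy is hypothetical: lows in degrees Celsius, with one missing value
toy <- tibble(TMIN = c(-5.6, 0, 2.2, NA))

toy_frozen <- toy %>%
  mutate(below_freezing = TMIN < 0)

toy_frozen$below_freezing
```

Note that a low of exactly 0 gives FALSE because we asked for strictly less than zero, and the missing reading stays NA rather than becoming FALSE.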
Arithmetic with logicals
Having below_freezing be a logical allows us to do some slick things. We can treat logicals as numbers and use them in functions like sum and mean. The key is to remember that TRUE is 1 and FALSE is 0. Probably the best way to remember this is that zero and FALSE are both nothing. If you forget, you can always use as.numeric…
my_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
as.numeric(my_vector)
If we use my_vector with an arithmetic function, R will convert each logical value to a zero or one for us before evaluating the function.
sum(my_vector)
mean(my_vector)
The sum function tells us that there are 3 TRUE values and the mean function tells us that 0.6 of the values were TRUE. Although mean(my_vector) is more concise, we could have also done
sum(my_vector)/length(my_vector)
Getting a fraction with summarize
Do you see where we’re going with this? We can use mean(below_freezing) in summarize to get the fraction of days that had a low temperature below freezing.
aa_weather %>%
  mutate(below_freezing = TMIN < 0) %>%
  summarize(frac_below_freezing = mean(below_freezing))
That gives an NA value because some of the days did not have a TMIN value. To ignore those values, we can use the na.rm=TRUE argument in mean.
aa_weather %>%
  mutate(below_freezing = TMIN < 0) %>%
  summarize(frac_below_freezing = mean(below_freezing, na.rm=TRUE))
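To see what na.rm=TRUE changes, here is a minimal sketch with a made-up logical vector that contains one missing value (the vector is an assumption for illustration).

```r
# with_missing is hypothetical: three known values and one missing value
with_missing <- c(TRUE, FALSE, NA, TRUE)

mean(with_missing)                # a single NA makes the whole result NA
mean(with_missing, na.rm = TRUE)  # drops the NA, then averages the 3 remaining values

# the same fraction by hand: TRUE count divided by the count of non-missing values
sum(with_missing, na.rm = TRUE) / sum(!is.na(with_missing))
```

The thing to notice is that the denominator shrinks along with the numerator: the fraction is taken over the days that were actually reported, not over all days.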
Getting our daily fraction using group_by and summarize
Cool, 35% of the days since 1891 have had a low temperature below freezing. But we wanted to aggregate the month-day combinations and then get the fraction across all years. We’ve seen this before with the group_by/summarize combination.
aa_weather %>%
  mutate(below_freezing = TMIN < 0) %>%
  group_by(month, day) %>%
  summarize(frac_below_freezing = mean(below_freezing, na.rm=TRUE))
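The grouping step is easier to check by hand on a tiny example. This sketch uses a made-up data frame with three years of readings for two days in May. The toy values and the extra n_years column are assumptions for illustration. The .groups = "drop" argument only tells summarize to return an ungrouped data frame.

```r
library(dplyr)

# toy is hypothetical: May 1 and May 2 in three different years
toy <- tibble(year = c(2000, 2001, 2002, 2000, 2001, 2002),
              month = 5,
              day = c(1, 1, 1, 2, 2, 2),
              TMIN = c(-1, 2, NA, 3, 4, -2))

toy_frac <- toy %>%
  mutate(below_freezing = TMIN < 0) %>%
  group_by(month, day) %>%
  summarize(frac_below_freezing = mean(below_freezing, na.rm = TRUE),
            n_years = sum(!is.na(below_freezing)),  # years with a reported low
            .groups = "drop")

toy_frac
```

May 1 has one freeze out of two reported years, and May 2 has one freeze out of three, so each row of the output is a fraction across years for a single month-day combination.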
Using filter to focus in on May
Let’s focus in on the dates in May using the filter function.
aa_weather %>%
  mutate(below_freezing = TMIN < 0) %>%
  group_by(month, day) %>%
  summarize(frac_below_freezing = mean(below_freezing, na.rm=TRUE)) %>%
  filter(month == 5) %>%
  print(n=31)
Based on historic weather data, the garden and apple trees are probably safe after the 10th of May.
It has only been below freezing 11 times after the 10th of May in the past 130 years (8.5%). If we wanted to be more conservative, we could wait until the 13th, since there have been freezing temperatures on or after the 13th of May only 6 times (4.6%). Sure enough, if we look at the Gardening in the Mitten website, they tell us that Ann Arbor’s spring frost should end by May 10th and the first fall frost will be October 5th.
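Counts like these could be tallied in more than one way. The sketch below shows one assumed reading: the number of years with at least one freezing low after a cutoff day in May. The toy data, the cutoff variable, and the column names are all hypothetical, and this is not necessarily how the numbers above were calculated.

```r
library(dplyr)

# toy is hypothetical: three May lows (degrees Celsius) in each of three years
toy <- tibble(year = rep(c(2000, 2001, 2002), each = 3),
              month = 5,
              day = rep(c(9, 11, 14), times = 3),
              TMIN = c(-1, -2, 1,      # 2000: freeze on the 11th
                       -3, 2, 3,       # 2001: no freeze after the 10th
                       1, 0.5, -0.5))  # 2002: freeze on the 14th

cutoff <- 10  # assumed cutoff day in May

late_freezes <- toy %>%
  filter(month == 5, day > cutoff) %>%
  group_by(year) %>%
  summarize(any_freeze = any(TMIN < 0, na.rm = TRUE)) %>%  # one row per year
  summarize(n_years = sum(any_freeze),      # years with a late freeze
            frac_years = mean(any_freeze))  # fraction of years with a late freeze

late_freezes
```

One caveat with this approach is that a year with no readings at all after the cutoff drops out of the denominator, so it is worth checking how many years remain before trusting the fraction.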
Hopefully, you feel more comfortable using mutate, filter, group_by, and summarize now that you’ve seen them used in a new context. Are you ready to answer some questions using these strengthened skills?
Assignment
1. When do you think the vegetables in our garden will be done growing?
2. Add a column to aa_weather that represents the high temperature for the day in Fahrenheit. Using this new column, determine the average number of days per year where the temperature is over 90 F.
3. I’ve heard a rule of thumb that if the sum of a day’s low and high temperatures is over 100 F, then grass will grow. When can I expect the grass to start growing reliably in May?
Title credit: The Cranberries, So cold in Ireland