Like the weather
Last Saturday (May 9th) was pretty cold here in southeastern Michigan. It was 26F (-3C) when I woke up. Although we’ve had some warm days, this spring has felt pretty cold. Is that normal? If only we had the skills to answer such a question… In fact, we do! Today’s Code Club will answer this and other questions using functions we’ve already seen, including summarize. By the end of today’s Code Club we’ll know how normal Saturday’s temperatures were and we’ll be more familiar with these functions. In addition, we’ll learn how to use functions such as top_n to find out which previous year had the lowest temperature.
Don’t watch the video straight through without firing up RStudio and trying the code and exercises yourself! Please be sure to see the setup instructions before you get going.
I used the NOAA website to track down daily high and low temperatures going back to 1891 for Ann Arbor, MI. You can find similar data for your favorite place by using the Climate Data Online Search to find “Daily Summaries” datasets. At the bottom of your location’s page, be sure to click on the specific station that has the longest history.
I have Ann Arbor’s data stored on GitHub and will use read_csv to read it in, then do some small touch-ups to keep the columns we need and extract the year, month, and day from the date. We’ll get some warning messages, which we can ignore for this Code Club.
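library(tidyverse)
library(lubridate)

# Read the daily summaries, forcing TOBS to be parsed as a number
aa_weather <- read_csv("https://github.com/riffomonas/generalR_data/blob/master/noaa/USC00200230.csv?raw=true",
                       col_types = cols(TOBS = col_double())) %>%
  # Keep the date and the three temperature columns
  select(DATE, TMAX, TMIN, TOBS) %>%
  # Split the date into its parts so we can filter and group on them
  mutate(year = year(DATE), month = month(DATE), day = day(DATE))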
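To see what the mutate step is doing without downloading anything, here is a minimal sketch on a made-up tibble. The name toy_dates and its three dates are hypothetical, and the sketch assumes DATE has already been parsed as a date, which is what read_csv typically does for ISO-formatted values.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the DATE column of aa_weather
toy_dates <- tibble(DATE = as.Date(c("1947-05-09", "2020-05-09", "2020-06-20"))) %>%
  # Same extraction step as in the real pipeline
  mutate(year = year(DATE), month = month(DATE), day = day(DATE))

toy_dates
```

Each date now carries its own year, month, and day columns, which is what makes the later filter and group_by steps possible.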
summarize functions to find normal for one day
My first question is how unusual Saturday’s temperatures were. Recall that we can use the filter function to retrieve rows from a data frame that match a set of specifications. Here we want the rows that correspond to the 5th month (i.e. May) and the 9th day.
may_nineth <- aa_weather %>% filter(month == 5 & day == 9)
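This post writes the May 9th condition two ways: with & here and with a comma later on. As a quick sketch on made-up data (toy_weather and its values are hypothetical), both forms keep the same rows, because filter combines comma-separated conditions with a logical AND.

```r
library(dplyr)

# Hypothetical toy data standing in for aa_weather
toy_weather <- tibble(
  month = c(5, 5, 6, 5),
  day   = c(9, 10, 9, 9),
  TMAX  = c(18.3, 21.1, 25.0, 12.2)
)

# Two ways of asking for rows where both conditions hold
with_ampersand <- toy_weather %>% filter(month == 5 & day == 9)
with_comma     <- toy_weather %>% filter(month == 5, day == 9)

# TRUE if the two results match
isTRUE(all.equal(with_ampersand, with_comma))
```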
We’ll see that may_nineth is a data frame with 127 rows and 7 columns, including each day’s high (TMAX), low (TMIN), and observed (TOBS) temperature. It’s hard to look through may_nineth to see how typical our temperatures were. Instead, we can use the summarize function. Previously, we used summarize with the group_by function. We can also use it on non-grouped data. Let’s find the average high, low, and observed temperatures. Remember to use na.rm=TRUE to ignore those temperatures where the value is NA.
may_nineth %>%
  summarize(ave_high = mean(TMAX, na.rm = TRUE),
            ave_low = mean(TMIN, na.rm = TRUE),
            ave_obs = mean(TOBS, na.rm = TRUE),
            N = n())
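Here is a small sketch of why na.rm matters, using made-up values (toy_day is hypothetical). It also shows that n() counts rows, including rows where the temperature is missing. The extra n_obs column is my own addition for illustration and counts only the non-missing values.

```r
library(dplyr)

# Hypothetical data: four years, one missing high temperature
toy_day <- tibble(year = 2001:2004, TMAX = c(15, NA, 20, 25))

toy_summary <- toy_day %>%
  summarize(with_na    = mean(TMAX),                # NA, because one value is missing
            without_na = mean(TMAX, na.rm = TRUE),  # mean of 15, 20, and 25
            N          = n(),                       # counts all rows
            n_obs      = sum(!is.na(TMAX)))         # counts non-missing values only

toy_summary
```

If the number of years with an actual measurement matters for your question, a count like n_obs may be closer to what you want than N.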
That shows us the average temperatures and the number of years of data we have. We can use quantile to define a range of values. To get the 95% confidence interval, we need to set the probability for the lower bound to 0.025 (i.e. 2.5%) and the upper bound to 0.975 (i.e. 97.5%), so the range between them covers 95% of the values. For the high temperature…
may_nineth %>%
  summarize(ave_high = mean(TMAX, na.rm = TRUE),
            lci_high = quantile(TMAX, probs = 0.025, na.rm = TRUE),
            uci_high = quantile(TMAX, probs = 0.975, na.rm = TRUE),
            N = n())
And to get the same information for the lows we can do…
may_nineth %>%
  summarize(ave_high = mean(TMAX, na.rm = TRUE),
            lci_high = quantile(TMAX, probs = 0.025, na.rm = TRUE),
            uci_high = quantile(TMAX, probs = 0.975, na.rm = TRUE),
            ave_low = mean(TMIN, na.rm = TRUE),
            lci_low = quantile(TMIN, probs = 0.025, na.rm = TRUE),
            uci_low = quantile(TMIN, probs = 0.975, na.rm = TRUE),
            N = n())
We get one row back because we already filtered the data down to each year’s observations for May 9th. We can see that Saturday’s temperatures were lower than we’d expect based on the 95% confidence interval.
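To make the quantile bounds more concrete, here is a sketch on a made-up vector where the answer is easy to check by eye. The names toy_temps and is_unusual are hypothetical, and the values are the integers 1 to 101 rather than real temperatures.

```r
# Hypothetical "temperatures": the integers 1 through 101
toy_temps <- 1:101

# Lower and upper bounds of the middle 95% of the values
bounds <- quantile(toy_temps, probs = c(0.025, 0.975))
lower <- unname(bounds[1])  # about 3.5
upper <- unname(bounds[2])  # about 98.5

# Flag a value as unusual if it falls outside those bounds
is_unusual <- function(x) x < lower | x > upper

is_unusual(c(2, 50, 100))  # TRUE FALSE TRUE
```

Comparing a new observation to the bounds in this way is the same check we just did by eye for May 9th.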
Finding the historic highs and lows
What was the lowest temperature recorded on May 9th? As with anything in R, there are many ways to do this. We could use summarize:
may_nineth %>% summarize(min_low = min(TMIN, na.rm = TRUE))
may_nineth %>% summarize(max_high = max(TMAX, na.rm = TRUE))
Using summarize with min or max returns a single number but doesn’t tell us the year. We could also use arrange to sort:
may_nineth %>% arrange(TMIN)
The previous low was -2.2C in 1947. If we want a descending sort, we need to use the desc function:
may_nineth %>% arrange(desc(TMAX))
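One detail worth knowing when sorting temperature data with gaps: arrange puts missing values at the end whether the sort is ascending or descending. A minimal sketch on made-up data (toy_day and its values are hypothetical):

```r
library(dplyr)

# Hypothetical data with one missing low and one missing high
toy_day <- tibble(
  year = 2001:2004,
  TMIN = c(4.4, -1.1, NA, 7.8),
  TMAX = c(15.0, 9.4, 21.7, NA)
)

# Ascending sort: the coldest low comes first, the NA row comes last
coldest_first <- toy_day %>% arrange(TMIN)

# Descending sort: the hottest high comes first, the NA row still comes last
hottest_first <- toy_day %>% arrange(desc(TMAX))

coldest_first$year  # 2002 2001 2004 2003
hottest_first$year  # 2003 2001 2002 2004
```

So the first row of either sort is a real measurement, not a missing one.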
The historic high for the day is 30C and was reached in 1926, 1965, 2014, and 2015. A third approach we’ll discuss is the top_n function. We give top_n the column we want to rank rows by, along with the number of rows we want. We can also give the function a negative number of rows to get the smallest values:
may_nineth %>% top_n(n = 3, wt = TMAX)
may_nineth %>% top_n(n = -3, wt = TMIN)
We see that the output isn’t sorted by TMIN and that it may return more rows than we asked for if there are ties.
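Both behaviors are easy to reproduce on made-up data (toy_day and its values are hypothetical). The sketch below asks for the top three highs when four years are tied, then adds arrange to sort the output. The last call uses slice_max, which assumes dplyr 1.0.0 or later, where the slice_max and slice_min functions were introduced and top_n was marked as superseded.

```r
library(dplyr)

# Hypothetical data: four years tied for the highest TMAX
toy_day <- tibble(
  year = 2001:2006,
  TMAX = c(30, 25, 30, 28, 30, 30),
  TMIN = c(5, 1, 6, 2, 4, 3)
)

# Asking for 3 rows returns 4 because of the four-way tie at 30
top_three <- toy_day %>% top_n(n = 3, wt = TMAX)

# A negative n gives the smallest values; arrange sorts the result
bottom_two <- toy_day %>% top_n(n = -2, wt = TMIN) %>% arrange(TMIN)

# Assumes dplyr >= 1.0.0: with_ties = FALSE returns exactly 3 rows
top_three_strict <- toy_day %>% slice_max(TMAX, n = 3, with_ties = FALSE)

nrow(top_three)         # 4
bottom_two$year         # 2002 2004
nrow(top_three_strict)  # 3
```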
group_by to get the highs and lows for any day of the year
Aside from being cold this year, there’s nothing special about May 9th. We might want similar information for any day of the year. If we put our pipeline together, this is what we had to get the ranges for May 9th over the past 127 years:
aa_weather %>%
  filter(month == 5 & day == 9) %>%
  summarize(ave_high = mean(TMAX, na.rm = TRUE),
            lci_high = quantile(TMAX, probs = 0.025, na.rm = TRUE),
            uci_high = quantile(TMAX, probs = 0.975, na.rm = TRUE),
            ave_low = mean(TMIN, na.rm = TRUE),
            lci_low = quantile(TMIN, probs = 0.025, na.rm = TRUE),
            uci_low = quantile(TMIN, probs = 0.975, na.rm = TRUE),
            N = n())
If we want to change this to get the same data for any day of the year, we can replace our filter function with a group_by on the month and day columns:
daily_t_summary <- aa_weather %>%
  group_by(month, day) %>%
  summarize(ave_high = mean(TMAX, na.rm = TRUE),
            lci_high = quantile(TMAX, probs = 0.025, na.rm = TRUE),
            uci_high = quantile(TMAX, probs = 0.975, na.rm = TRUE),
            ave_low = mean(TMIN, na.rm = TRUE),
            lci_low = quantile(TMIN, probs = 0.025, na.rm = TRUE),
            uci_low = quantile(TMIN, probs = 0.975, na.rm = TRUE),
            N = n())
To check that we get what we had before for May 9th, we can add a filter function call.
daily_t_summary %>% filter(month == 5, day == 9)
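The same check can be sketched on made-up data where the averages are easy to work out by hand (toy_weather and its values are hypothetical). Filtering first and then summarizing should give the same May 9th numbers as grouping by month and day and filtering afterwards. The ungroup call at the end is my own addition: after summarizing over two grouping columns, the result is usually still grouped by the first one, and dropping that grouping avoids surprises in later steps.

```r
library(dplyr)

# Hypothetical data: two calendar days observed in three years
toy_weather <- tibble(
  year  = rep(2001:2003, each = 2),
  month = rep(c(5, 6), times = 3),
  day   = rep(c(9, 20), times = 3),
  TMAX  = c(10, 24, 14, 26, 18, 28)
)

# Approach 1: filter to one day, then summarize
filtered <- toy_weather %>%
  filter(month == 5, day == 9) %>%
  summarize(ave_high = mean(TMAX, na.rm = TRUE), N = n())

# Approach 2: summarize every day at once, then pick out one day
grouped <- toy_weather %>%
  group_by(month, day) %>%
  summarize(ave_high = mean(TMAX, na.rm = TRUE), N = n()) %>%
  ungroup()

grouped_may_ninth <- grouped %>% filter(month == 5, day == 9)

filtered$ave_high           # 14
grouped_may_ninth$ave_high  # 14
nrow(grouped)               # 2, one row per calendar day
```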
Here are the temperatures for my birthday…
daily_t_summary %>% filter(month == 6, day == 20)
Hopefully, you feel more comfortable using functions like summarize now that you’ve seen them used in a new context. Are you ready to answer some questions using these strengthened skills?
1. Which year was the hottest on your birthday? Which was the hottest since you were born?
2. What were the hottest and coldest temperatures recorded the year you were born?
3. Calculate the average high temperature for each year between 1892 and 2019 (the years we have complete data for). What was the average temperature the year you were born? Which are the coldest and hottest years we have data for (based on the annual average high temperature)?
4. What is the average high temperature and 95% confidence interval for each month of the year? What is it for your birth year?
5. Get the weather data for your favorite place using the approach Pat outlines in the video. What will the temperature be there on your birthday?
Title credit: 10,000 Maniacs, Like the Weather