# Smoothing lamb prices

In recent Code Club episodes we have been looking at how to plot lamb prices from the United Producers livestock auction that I have been recording in a Google workbook. The data are pretty noisy. One way of smoothing the data has been to use `geom_smooth`

. By adjusting the value of the `span`

argument we have been able to make the smoothed line more (small values) or less squiggly (large values). With the ongoing COVID-19 pandemic there has been a lot in the news of 7-day rolling averages. Rolling averages are also called moving averages. The Ann Arbor News has been plotting daily case counts along with a rolling average for the state of Michigan. I noticed that this representation of the data puts the rolling average at the 4th day of the 7 day series. I haven’t been able to find the raw data that were used to build the plot in an easy to access format. Naturally, this got me to thinking about how to calculate a rolling average for my noisy sheep price data.

For today’s episode of Code Club, we’ll build off of the code and concepts that we have have been working on for the past few weeks. Don’t worry if you missed those episodes. We’ll start with the block of code that we used last week and see how we can add the `lag`

function from the `dplyr`

package to help us calculate the rolling average. Along the way, we’ll use the `mutate`

and `pivot_longer`

functions again, but in a new context. As always, don’t watch the video straight through without firing up RStudio and trying the code and exercises yourself! Please be sure to see the setup instructions before you get going.

## Prompt

Over the past few episodes we’ve discussed `read_sheet`

, `separate`

, and `pivot_longer`

. Here is the code chunk that we developed to “tidy” the data in my Google workbook that contains my sheep price data for my local livestock auction. Be sure to take your time to go through this code chunk to review the different topics we’ve discussed.

```
library(tidyverse)
library(googlesheets4)
auction_data <- read_sheet("https://docs.google.com/spreadsheets/d/1_quMjJRBHDLQSmWQouzzyi1DOejAtCZnAeesdVyRWiQ/edit#gid=1467293328",
sheet="numbers_and_prices",
range="A:J",
col_type = "Ddcccccccc",
na="NA") %>%
rename_all(tolower) %>%
rename("aged_sheep" = "aged sheep",
"feeder_lambs" = "feeder lambs",
"hair_lambs" = "hair lambs",
"new_crop" = "new crop",
"small" = "40-85",
"medium" = "85-105",
"large" = "106-130",
"extra_large" = ">131")
tidy_auction_data <- auction_data %>%
pivot_longer(cols=-c("date", "total"), names_to="classes", values_to = "price_range") %>%
separate(price_range, into=c("min", "max"), sep="-", convert=TRUE) %>%
mutate(midpoint = (min+max)/2)
```

### Rolling averages

A rolling average is the average value of a set of observations that are typically collected over a period of time. They’re used because they help to smooth noisy data. We’ve been seeing them in the reporting of COVID-19 data, but they’re also commonly used to smooth noisy stock market data. A rolling average can be a lagging average where we average across observations before a point of interest, they can be leading where we average across observations before a point of interest, and they can also be centered where we average across observations where the point of interest is in the middle of the observations. We’re interested in the lagging average.

To calculate a rolling average we need the value of the observation on the day of interest and the values that preceded it. Then we need to average across those values. To generate the values, we are going to use the `mutate`

function to add the preceding days’ values as separate columns to a data frame.

`lag`

We can find values for the the preceding days’ observations using the `lag`

function. When you run `library(tidyverse)`

you’ll notice a warning that says, `dplyr::lag() masks stats::lag()`

. This means that the packages `dplyr`

and `stats`

both have `lag`

functions. If you use `?stats::lag`

and `?dplyr::lag`

you can see the differences. The upshot is that the `dplyr`

version is easier to use and implement for our purposes.

As a simple example, consider the numbers 1 through 10

```
x <- 1:10
lag(x) # the default lag is 1
lag(x, n=1)
lag(x, n=2)
lag(x, n=3)
```

You’ll notice that the output vector is the same length as the input, but that the first `n`

values are bumped to the right and replaced with `NA`

values. If we have a series of dates and use `n=2`

, then we are looking at the values starting from two days ago. One other argument to notice is `order_by`

, which allows us to make sure that our values are in the correct order. For instance, we could chose to lag the `midpoint`

column of our data frame by a week and we would want to order the observations by the `date`

column. We’ll see how to do this next

### Adding lagged columns to `tidy_auction_data`

With this in mind, we can create new columns for this week that we are interested in as well as the three previous weeks using `lag`

and `mutate`

. To keep things simple, let’s only focus on the large class of lambs. To create a column with no lag (the observed value for the week) and column with the data for the previous week we can do this

```
tidy_auction_data %>%
filter(classes=="large") %>%
mutate(lag0 = midpoint,
lag1 = lag(midpoint, 1, order_by=date))
```

Since our data are weekly, let’s expand this to create columns to calculate a 4 week rolling average.

```
tidy_auction_data %>%
filter(classes=="large") %>%
mutate(lag0 = midpoint,
lag1 = lag(midpoint, 1, order_by=date),
lag2 = lag(midpoint, 2, order_by=date),
lag3 = lag(midpoint, 3, order_by=date))
```

This gives us the data that we need to calculate our rolling average. We need to average across our four columns. My instinct is to do the following…

```
tidy_auction_data %>%
filter(classes=="large") %>%
mutate(lag0 = midpoint,
lag1 = lag(midpoint, 1, order_by=date),
lag2 = lag(midpoint, 2, order_by=date),
lag3 = lag(midpoint, 3, order_by=date)) %>%
mutate(rolling_average = mean(c(lag0, lag1, lag2, lag3)))
```

This gives us a column of `NA`

values because it is taking all of the rows across the four columns, calculating the mean, and then returning a single value that it repeats across all rows. We get the `NA`

rather than a number because we’re using the default `na.rm=FALSE`

in `mean`

. We want the mean for each row. You could get this to work if you used `group_by`

and `summarize`

```
tidy_auction_data %>%
filter(classes=="large") %>%
mutate(lag0 = midpoint,
lag1 = lag(midpoint, 1, order_by=date),
lag2 = lag(midpoint, 2, order_by=date),
lag3 = lag(midpoint, 3, order_by=date)) %>%
group_by(date) %>%
summarize(midpoint = first(midpoint), rolling_average = mean(c(lag0, lag1, lag2, lag3)))
```

With that approach, I also needed to use the `first`

function to get the observed `midpoint`

value for each `rolling_average`

value. The following is a bit simpler. It works because it returns a vector the same length as the original vectors that is used to build a new column.

```
large <- tidy_auction_data %>%
filter(classes=="large") %>%
mutate(lag0 = midpoint,
lag1 = lag(midpoint, 1, order_by=date),
lag2 = lag(midpoint, 2, order_by=date),
lag3 = lag(midpoint, 3, order_by=date)) %>%
mutate(rolling_average = (lag0 + lag1 + lag2 + lag3)/4) %>%
select(date, midpoint, rolling_average)
```

That works pretty nicely and gives the same values as we saw using the `group_by`

and `summarize`

approach. I went ahead and got rid of all of the columns except for `date`

, `midpoint`

, and `rolling_average`

to make it easier to see the data we’re interested in and saved it as the variable `large`

.

### Plotting the observed and averaged data

Let’s see what our data look like with and without the 4-week rolling average. We currently have `midpoint`

and `rolling_average`

as separate columns. To plot the data together we need to tidy our data frame to get these values together.

```
large %>%
pivot_longer(-date, names_to="method", values_to="price")
```

Now we’re ready to add on our `ggplot`

functions to plot `date`

on the x-axis, `price`

on the y-axis, and `method`

as the color

```
large %>%
pivot_longer(-date, names_to="method", values_to="price") %>%
ggplot(aes(x=date, y=price, color=method)) +
geom_line()
```

Great! I’m going to clean up the plot a bit as we have done in previous weeks

```
large %>%
pivot_longer(-date, names_to="method", values_to="price") %>%
ggplot(aes(x=date, y=price, color=method)) +
geom_line() +
theme_light() +
labs(x="Date",
y="Price ($/100 lbs)",
title="A rolling average smooths the noisiness of the large lamb prices",
subtitle="Lagging four week rolling average of the midpoint prices",
caption="Prices as reported from United Producers in Manchester, MI")
```

The colors are a bit trippy. I’d like to change the colors of the lines so that the `midpoint`

data are `gray`

and the `rolling_average`

data are `dodgerblue`

. We can do this with `scale_color_manual`

```
large %>%
pivot_longer(-date, names_to="method", values_to="price") %>%
ggplot(aes(x=date, y=price, color=method)) +
geom_line() +
theme_light() +
labs(x="Date",
y="Price ($/100 lbs)",
title="A rolling average smooths the noisiness of the large lamb prices",
subtitle="Lagging four week rolling average of the midpoint prices",
caption="Prices as reported from United Producers in Manchester, MI") +
scale_color_manual(values=c("gray", "dodgerblue"))
```

That makes it nicer to look at. We can compare this last plot to

```
large %>%
ggplot(aes(x=date, y=midpoint)) +
geom_line(color="gray") +
geom_smooth(size=0.5, span=0.1, se=FALSE)
```

I kind of prefer the `geom_smooth`

approach since it does a better job of handling the missing data that seemed to occur frequently in 2019. Which do you prefer?

## Exercises

1. How would you modify the code we generated to calculate a 3 week rolling average? 5 week?

2. We covered calculating the rolling average for the large class of lambs. How would we calculate the rolling averages for each class of lambs? Build a plot showing the prices for the small, medium, large, and extra large classes of lambs.

3. We’ve seen how we can repackage other functions we’ve already learned with the `lag`

function to generate a rolling average. As we saw in the first exercise, pivoting between different lag lengths can be a bit tedious. Also, if we wanted to do something like an average over 50 values, that would be horrible. Thankfully, there’s a package called `zoo`

that has a function, `rollmean`

. Install `zoo`

and look at the `?rollmean`

documentation and see whether you can figure out how to replace our `mutate`

functions with a single `mutate`

function that calls `rollmean`