# Session 8

## Topics

- Creating columns
- Keeping our code DRY
- Writing our own functions

We’ve spent the last two lessons “munging” our data to get it to look and behave nicely. If this is the level of detail we need to clean up data from NOAA, can you imagine how much effort might be needed to clean up the data from the last postdoc in your lab?! We generally spend a lot of time cleaning up data. Let’s start doing something “useful” with these data! If you did all of the exercises at the end of the last lesson, you should have come up with something like the code chunk below.

``````
library(tidyverse)

select(DATE, starts_with("T"), PRCP, SNOW, SNWD) %>%
  rename("date" = "DATE",
         "t_max_c" = "TMAX",
         "t_min_c" = "TMIN",
         "t_obs_c" = "TOBS",
         "total_precip_mm" = "PRCP",
         "snow_fall_mm" = "SNOW",
         "snow_depth_mm" = "SNWD") %>%
  mutate(t_obs_c = ifelse(t_obs_c > 40 | t_obs_c < -40, NA, t_obs_c),
         total_precip_mm = ifelse(total_precip_mm > 250, NA, total_precip_mm),
         snow_fall_mm = ifelse(snow_fall_mm > 500, NA, snow_fall_mm),
         snow_depth_mm = ifelse(snow_depth_mm > 500, NA, snow_depth_mm)
  )
``````

## Creating columns

Can you remember what command we use to either change a column or create a new one in a data frame? We’ve seen it a few times already, and now we’re going to do a deeper dive into using it to help analyze our data. As I showed in the previous lesson, I am a good American: I have little intuition for the metric system. I’d like to make columns that have the temperatures in Fahrenheit and the precipitation data in inches. We’ll use the `mutate` function to generate these new columns. In addition, we’ll see how to write our own functions that we can use with `mutate` and in many other settings.

In the previous lesson, I demonstrated the `mutate` function by showing that we could create a `t_max_f` column using the data in the `t_max_c` column. To illustrate a problem, I’m going to use the rule of thumb I learned from too many degrees in engineering: we can double the temperature in Celsius and add 30 to get the temperature in Fahrenheit.

``````
aa_weather %>%
  mutate(t_max_f = 2 * t_max_c + 30) %>%
  select(date, t_max_c, t_max_f)
``````
``````
## # A tibble: 46,650 x 3
##    date       t_max_c t_max_f
##    <date>       <dbl>   <dbl>
##  1 1891-10-01    20.6    71.2
##  2 1891-10-02    26.7    83.4
##  3 1891-10-03    26.1    82.2
##  4 1891-10-04    22.8    75.6
##  5 1891-10-05    13.9    57.8
##  6 1891-10-06    14.4    58.8
##  7 1891-10-07    10.6    51.2
##  8 1891-10-08    13.9    57.8
##  9 1891-10-09    15      60
## 10 1891-10-10    16.7    63.4
## # … with 46,640 more rows
``````

We can do the same for the two other temperature columns:

``````
aa_weather %>%
  mutate(t_max_f = 2 * t_max_c + 30,
         t_min_f = 2 * t_min_c + 30,
         t_obs_f = 2 * t_obs_c + 30) %>%
  select(date, starts_with("t_"))
``````
``````
## # A tibble: 46,650 x 7
##    date       t_max_c t_min_c t_obs_c t_max_f t_min_f t_obs_f
##    <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1891-10-01    20.6     7.8      NA    71.2    45.6      NA
##  2 1891-10-02    26.7    13.9      NA    83.4    57.8      NA
##  3 1891-10-03    26.1    16.1      NA    82.2    62.2      NA
##  4 1891-10-04    22.8    11.1      NA    75.6    52.2      NA
##  5 1891-10-05    13.9     6.7      NA    57.8    43.4      NA
##  6 1891-10-06    14.4     5        NA    58.8    40        NA
##  7 1891-10-07    10.6     5.6      NA    51.2    41.2      NA
##  8 1891-10-08    13.9     3.3      NA    57.8    36.6      NA
##  9 1891-10-09    15       3.9      NA    60      37.8      NA
## 10 1891-10-10    16.7     5        NA    63.4    40        NA
## # … with 46,640 more rows
``````

## Keeping our code DRY

Excellent! There’s one problem with this code chunk. We repeat the equation to convert between Celsius and Fahrenheit three times. I could imagine getting data from another station that had other temperature data - soil, dew point, etc. We would repeat this same equation over and over if we wanted to convert to Fahrenheit. Given my proclivity for typos, I can imagine that at least one of these equations would have a typo. That would be a headache. Also, imagine this project takes off and I want to publish it or make it accessible to the public. I would probably want to replace my rule of thumb with the actual formula (F = 9/5 * C + 32). I would need to replace every instance of the rule of thumb with the actual conversion. Again, more headaches.
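To get a feel for how far the rule of thumb drifts from the actual formula, here is a small sketch using base R only. The temperatures in `celsius` are made-up test values, not data from `aa_weather`, and the variable names are chosen just for this illustration.

```r
# Made-up Celsius values for illustration (not taken from aa_weather)
celsius <- c(-40, 0, 10, 20, 30, 100)

# Rule of thumb: double the temperature and add 30
rule_of_thumb <- 2 * celsius + 30

# Actual conversion: F = 9/5 * C + 32
exact <- 9/5 * celsius + 32

# Positive values mean the rule of thumb overestimates
comparison <- data.frame(celsius, rule_of_thumb, exact,
                         error = rule_of_thumb - exact)
print(comparison)
```

The two formulas agree at 10 °C, and the gap grows by 0.2 °F for every degree Celsius away from that point.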

There is a principle we like to follow called DRY - don’t repeat yourself. We can overcome this repetition by creating a function.

## Writing our own functions

We’ve seen a lot of functions already. The last code chunk has at least seven! `%>%`, `=`, `*`, `+`, `mutate`, `select`, and `starts_with`. To create a function we need a special syntax:

``````
my_function <- function(argument1, argument2) {

  # special sauce: this subtraction is only a placeholder calculation
  result <- argument1 - argument2

  return(result)
}
``````

There are other ways to write a function that allow you to set default values or simplify the syntax. What we have will work great for our purposes for a long time. The key points are that the function is a variable that gets its “value” from the `function` function. You give `function` the arguments that are then used in the body of the function, which is enclosed in curly braces (i.e. `{`, `}`). The last thing in the body of our function is a call to the `return` function, which makes the value the function returns explicit. When the flow hits `return`, the function is done. If you put any code after `return`, it will not be run.
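As a minimal sketch of the points above, the hypothetical function below (`c_to_k`, which is not part of the lesson’s code) sets a default value for one argument and places a line after `return` to show that it is never run. The one-line version at the end is an assumed example of the simplified syntax, where the value of the last expression is returned without calling `return`.

```r
# Hypothetical example: convert Celsius to Kelvin
c_to_k <- function(celsius, offset = 273.15) {

  kelvin <- celsius + offset
  return(kelvin)

  # Never reached: the function has already returned
  print("You will not see this message")
}

c_to_k(0)                # uses the default offset
c_to_k(0, offset = 273)  # overrides the default

# Simplified syntax: the last expression is the return value
c_to_k_short <- function(celsius) celsius + 273.15
c_to_k_short(0)
```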

You could then call the function like so…

``````
my_function(argument1=4, argument2=23)
``````

This will return the value that you passed to `return` when you defined your function. This is a good time to point out that we’ve been a bit loose with how we’ve called functions so far. For example, previously we ran

``````
tolower(Admin1Name)
``````

That `tolower` function call does not have any argument names listed, only the argument value. If we enter `?tolower`, under the “Usage” and “Arguments” sections we’ll see…

``````
Usage:

chartr(old, new, x)
tolower(x)
toupper(x)
casefold(x, upper = FALSE)

Arguments:

x: a character vector, or an object that can be coerced to
character by ‘as.character’.

old: a character string specifying the characters to be
translated.  If a character vector of length 2 or more is
supplied, the first element is used with a warning.

new: a character string specifying the translations. If a
character vector of length 2 or more is supplied, the first
element is used with a warning.

upper: logical: translate to upper or lower case?.
``````

This tells us that `tolower` takes one argument - `x`. We could have written:

``````
tolower(x=Admin1Name)
``````

But we’re lazy, and after you’ve used R a bit, that will seem a bit obvious. Like using `<-` for assignment, whether to include argument names is a matter of R idiom. This can be confusing for beginners. It can also be a pain for beginners to remember all of the argument names! If you choose to leave out the argument names, then the values need to be in the order that the function expects them. For example, the following works:

``````
dna <- "ATGCCTTG"
chartr("ATGC", "TACG", dna)
``````
``````
## [1] "TACGGAAC"
``````

But this does not (I regularly make this mistake!):

``````
dna <- "ATGCCTTG"
chartr(dna, "ATGC", "TACG")
``````
``````
## Error in chartr(dna, "ATGC", "TACG"): 'old' is longer than 'new'
``````

If you are unsure of the syntax or if your arguments get complicated, you are perfectly justified in including the argument names.

``````
dna <- "ATGCCTTG"
chartr(x=dna, old="ATGC", new="TACG")
``````
``````
## [1] "TACGGAAC"
``````

Back to our example: the following function calls will all give the same result.

``````
my_function(4, 23)
my_function(argument1=4, argument2=23)
my_function(argument2=23, argument1=4)
``````

But this will give you either an incorrect result or an error:

``````
my_function(23, 4)
``````
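To see the effect of argument order with concrete numbers, here is a sketch with a hypothetical function, `raise_to`, whose two arguments are not interchangeable. It is not part of the lesson’s code; it only illustrates how positional and named arguments are matched.

```r
# Hypothetical function for illustration: base raised to an exponent
raise_to <- function(base, exponent) {
  result <- base ^ exponent
  return(result)
}

raise_to(2, 3)                    # positional: base = 2, exponent = 3
raise_to(base = 2, exponent = 3)  # named, same order
raise_to(exponent = 3, base = 2)  # named, different order, same result
raise_to(3, 2)                    # positional, swapped: a different calculation
```

The first three calls return 8, while the last returns 9. No error is raised, so a swapped order can easily go unnoticed.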

Enough with abstract functions! Let’s work on our function to convert Celsius to Fahrenheit.

``````
c_to_f <- function(celsius) {

  fahrenheit <- 2 * celsius + 30
  return(fahrenheit)

}
``````

We can test this with a few temperatures:

``````
c_to_f(0)
``````
``````
## [1] 30
``````
``````
c_to_f(20)
``````
``````
## [1] 70
``````
``````
c_to_f(30)
``````
``````
## [1] 90
``````
``````
c_to_f(100)
``````
``````
## [1] 230
``````

We can also make our original `mutate` code DRY:

``````
aa_weather %>%
  mutate(t_max_f = c_to_f(t_max_c),
         t_min_f = c_to_f(t_min_c),
         t_obs_f = c_to_f(t_obs_c)) %>%
  select(date, starts_with("t_"))
``````
``````
## # A tibble: 46,650 x 7
##    date       t_max_c t_min_c t_obs_c t_max_f t_min_f t_obs_f
##    <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1891-10-01    20.6     7.8      NA    71.2    45.6      NA
##  2 1891-10-02    26.7    13.9      NA    83.4    57.8      NA
##  3 1891-10-03    26.1    16.1      NA    82.2    62.2      NA
##  4 1891-10-04    22.8    11.1      NA    75.6    52.2      NA
##  5 1891-10-05    13.9     6.7      NA    57.8    43.4      NA
##  6 1891-10-06    14.4     5        NA    58.8    40        NA
##  7 1891-10-07    10.6     5.6      NA    51.2    41.2      NA
##  8 1891-10-08    13.9     3.3      NA    57.8    36.6      NA
##  9 1891-10-09    15       3.9      NA    60      37.8      NA
## 10 1891-10-10    16.7     5        NA    63.4    40        NA
## # … with 46,640 more rows
``````

This should give the same result we had before without using the function (or perhaps you found a typo in your previous non-DRY code and now the result is better!). Excellent - the code is now DRY. But as we feared, our analysis has attracted attention, and we feel bad about relying on a rule of thumb instead of the actual conversion. To update all three lines in the `mutate` call, we only need to modify our `c_to_f` function:

``````
c_to_f <- function(celsius) {

  fahrenheit <- 9/5 * celsius + 32
  return(fahrenheit)

}
``````

We need to re-run this function definition and then re-run our pipeline:

``````
aa_weather %>%
  mutate(t_max_f = c_to_f(t_max_c),
         t_min_f = c_to_f(t_min_c),
         t_obs_f = c_to_f(t_obs_c)) %>%
  select(date, starts_with("t_"))
``````
``````
## # A tibble: 46,650 x 7
##    date       t_max_c t_min_c t_obs_c t_max_f t_min_f t_obs_f
##    <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1891-10-01    20.6     7.8      NA    69.1    46.0      NA
##  2 1891-10-02    26.7    13.9      NA    80.1    57.0      NA
##  3 1891-10-03    26.1    16.1      NA    79.0    61.0      NA
##  4 1891-10-04    22.8    11.1      NA    73.0    52.0      NA
##  5 1891-10-05    13.9     6.7      NA    57.0    44.1      NA
##  6 1891-10-06    14.4     5        NA    57.9    41        NA
##  7 1891-10-07    10.6     5.6      NA    51.1    42.1      NA
##  8 1891-10-08    13.9     3.3      NA    57.0    37.9      NA
##  9 1891-10-09    15       3.9      NA    59      39.0      NA
## 10 1891-10-10    16.7     5        NA    62.1    41        NA
## # … with 46,640 more rows
``````
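One way to gain confidence in the updated function is to check it against temperatures whose Fahrenheit equivalents are well known. The sketch below redefines `c_to_f` so that it runs on its own; using `stopifnot` with `all.equal` is an assumption made for this illustration rather than something the lesson relies on.

```r
c_to_f <- function(celsius) {
  fahrenheit <- 9/5 * celsius + 32
  return(fahrenheit)
}

# Reference points: freezing, boiling, and where the two scales cross
known_c <- c(0, 100, -40)
known_f <- c(32, 212, -40)

# all.equal() tolerates tiny floating-point differences;
# stopifnot() raises an error if the check fails and is silent otherwise
stopifnot(isTRUE(all.equal(c_to_f(known_c), known_f)))
```

If the rule-of-thumb version of `c_to_f` were still in place, this check would stop with an error, which is a quick way to notice that the function definition was not re-run.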

Again, the strength of DRY code is that it reduces the chance of introducing typos and lets us change our code in one place and have the change take effect everywhere that code is used. Although we should always try to be aware of where our code is not DRY, our primary goal should be to get code that works. Once it works, we should go back and DRY it out. This will help ensure that it works correctly and that it’s easier to maintain.

## Exercises

1. Create a column in `aa_weather` that contains the difference between the day’s maximum and minimum temperatures. Be sure to give the column a good name! Include a `select` function call so that you can more easily see the relevant columns.

2. Create a function called `mm_to_inches` that converts millimeters to inches. There are 10 millimeters in a centimeter and 2.54 centimeters in an inch. Test it out with a few values like 0, 25.4, 50.8, 101.6, and 40.

3. Create columns in `aa_weather` that convert the three precipitation-related columns to inches. You can use `tail()` to get the last rows of a data frame.