# Session 7

## Topics

• Detecting problems in data
• Fixing data problems
• Rinse, repeat

## Detecting problems in data

We’ve read in the weather data as `aa_weather` and now we’d like to get a better sense of the values. We’ll take two approaches - we’ll visualize the data and look at summary statistics of the data.

``````library(tidyverse)

select(DATE, starts_with("T"), PRCP, SNOW, SNWD) %>%
rename("date" = "DATE",
"t_max_c" = "TMAX",
"t_min_c" = "TMIN",
"t_obs_c" = "TOBS",
"total_precip_mm" = "PRCP",
"snow_fall_mm" = "SNOW",
"snow_depth_mm" = "SNWD")
``````

To visualize the data we can make a histogram for each column of numerical data. The syntax will look like this

``````aa_weather %>% ggplot(aes(x=t_min_c)) + geom_histogram()
aa_weather %>% ggplot(aes(x=t_max_c)) + geom_histogram()
aa_weather %>% ggplot(aes(x=t_obs_c)) + geom_histogram()
``````   Two things to comment on here. First, you’ll get the following feedback from running the command:

```````stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
Removed 163 rows containing non-finite values (stat_bin).
``````

Starting with the warning message, it tells us that a number of the days in the dataset did not have values. We know that coverage for `t_min_c`/`TMIN` was not 100% so we should have expected that. Second, it tells us that it is using `bins=30` to build the histogram. It is telling us that we can manually set the number of bins or the width of the bins to draw the histogram.

``````aa_weather %>% ggplot(aes(x=t_min_c)) + geom_histogram(bins=60)
`````` ``````aa_weather %>% ggplot(aes(x=t_min_c)) + geom_histogram(binwidth=1)
`````` Unless we’re interested in generating publication a quality figure, the default is probably good enough for our purposes.

Looking at the three plots that were generated, you’ll notice that the histogram for `t_obs_c` has values greater than 60 and less than -60. Even for Ann Arbor, those temperatures are pretty extreme. I suspect the thermometer was having bad days when those observations were made. Something else that tells us the data don’t make sense is that the `t_min_c` and `t_max_c` values approach those levels, which they should if `t_obs_c` was a single temperature reading. We can use `filter` to identify the days with these odd temperatures

``````aa_weather %>% filter(t_obs_c > 40)
aa_weather %>% filter(t_obs_c < -40)
``````
``````## # A tibble: 1 x 7
##   date       t_max_c t_min_c t_obs_c total_precip_mm snow_fall_mm snow_depth_mm
##   <date>       <dbl>   <dbl>   <dbl>           <dbl>        <dbl>         <dbl>
## 1 1934-05-12    13.3    -0.6      65               0            0             0
## # A tibble: 1 x 7
##   date       t_max_c t_min_c t_obs_c total_precip_mm snow_fall_mm snow_depth_mm
##   <date>       <dbl>   <dbl>   <dbl>           <dbl>        <dbl>         <dbl>
## 1 1944-06-18    33.3    19.4   -66.1               0            0             0
``````

Those pipelines return the data for the two days with the suspect temperatures. I feel good throwing out these two temperatures because it was supposedly -66.1C (-87F) in June and 65C in May when the low temperature for the day was below freezing. Ann Arbor weather can be weird, but those temperatures don’t make sense.

## Fixing data problems

Let’s convert those weird temperatures to `NA` values, which will indicate we don’t know the true value. We’ll do this with two functions, one we’ve already seen, `mutate`, and another that builds upon the logical operators we’ve seen with `filter` called `ifelse`. `mutate` allows us to modify an existing column (i.e. `t_obs_c`) or to create a new column. For example, let’s create a `t_max_f` column that has the maximum temperature for the day in degrees Fahrenheit. We’ll add a `select` function call to make sure we can see the relevant columns.

``````aa_weather %>%
mutate(t_max_f = 9 * t_max_c / 5 + 32) %>%
select(date, t_max_c, t_max_f)
``````
``````## # A tibble: 46,650 x 3
##    date       t_max_c t_max_f
##    <date>       <dbl>   <dbl>
##  1 1891-10-01    20.6    69.1
##  2 1891-10-02    26.7    80.1
##  3 1891-10-03    26.1    79.0
##  4 1891-10-04    22.8    73.0
##  5 1891-10-05    13.9    57.0
##  6 1891-10-06    14.4    57.9
##  7 1891-10-07    10.6    51.1
##  8 1891-10-08    13.9    57.0
##  9 1891-10-09    15      59
## 10 1891-10-10    16.7    62.1
## # … with 46,640 more rows
``````

The other function we need is `ifelse`. This allows us to ask a logical question of the data. If the value is true, then we return one value. If it is false, we return another.

``````my_numbers <- c(-5, -3, -1, 0, 1, 3, 5)
ifelse(my_numbers > 0, "positive", "negative")
``````
``````##  "negative" "negative" "negative" "negative" "positive" "positive" "positive"
``````

From this syntax we can see that `ifelse` has three “slots” - the logical question (i.e. `my_numbers > 0`), the value if the answer is `TRUE` (i.e. “positive”), the value if the answer is `FALSE` (i.e. “negative”). Let’s see how we could put this together to recast the values of `t_obs_c` to remove those extreme temperatures

``````aa_weather %>%
mutate(t_obs_c = ifelse(t_obs_c > 40, NA, t_obs_c),
t_obs_c = ifelse(t_obs_c < -40, NA, t_obs_c)) %>%
ggplot(aes(x=t_obs_c)) +
geom_histogram()
`````` Slick, eh? You can see that we can mutate multiple columns or the same column multiple times by giving mutate two or more arguments separated by commas.

## Rinse, repeat

The histograms for temperature data made it pretty easy to pick off which numbers looked weird. We also probably have a good sense of what temperatures are extreme for an area. After living in the northern US for 25 years, I still don’t know what an extreme, but acceptable snow or rain fall is.

``````aa_weather %>% ggplot(aes(x=total_precip_mm)) + geom_histogram()
aa_weather %>% ggplot(aes(x=snow_fall_mm)) + geom_histogram()
aa_weather %>% ggplot(aes(x=snow_depth_mm)) + geom_histogram()
``````   Let’s think about this for a second - 1000 mm of precipitation is 100 cm or about 40 in. Yeah, that’s a lot. We need to dig deeper into these data. I feel like 150 mm (~10 in) precipitation event is weird, but not outside of what we might expect for Ann Arbor. We can look at the data a different way - perhaps by plotting the data against the date like we did with the Project Tycho data

``````aa_weather %>%
ggplot(aes(x=date, y=total_precip_mm)) +
geom_line()
`````` From this view, there appear to be 2 points that are above 250 mm that are outliers. We can use 250 mm to create a `filter` function call to find those rows with `total_precip_mm` values over 250

``````aa_weather %>%
filter(total_precip_mm > 250)
``````
``````## # A tibble: 2 x 7
##   date       t_max_c t_min_c t_obs_c total_precip_mm snow_fall_mm snow_depth_mm
##   <date>       <dbl>   <dbl>   <dbl>           <dbl>        <dbl>         <dbl>
## 1 1934-12-08    -3.9    -8.3   -17.8            509.            5            NA
## 2 1953-09-05    23.9    12.8    18.3           1286.            0             0
``````

A google search for “Ann Arbor 1953 rain” doesn’t turn up anything and neither does using “1934” in place of “1953”. I’d expect something like that amount of precipitation to be newsworthy! Taking our previous `mutate` function call, let’s add this threshold for `total_precip_mm` to set values above this number as `NA`.

``````aa_weather %>%
mutate(t_obs_c = ifelse(t_obs_c > 40, NA, t_obs_c),
t_obs_c = ifelse(t_obs_c < -40, NA, t_obs_c),
total_precip_mm = ifelse(total_precip_mm > 250, NA, total_precip_mm)) %>%
ggplot(aes(total_precip_mm)) + geom_histogram()
`````` You’ll notice that if we now do

``````aa_weather %>%
ggplot(aes(total_precip_mm)) + geom_histogram()
`````` We again get the very long x-axis. Why is this? In all of our pipelines we are doing operations on the `aa_weather` data frame, but we haven’t “written” anything to a new variable! We could do something like this to create a new data frame

``````clean_aa_weather <- aa_weather %>%
mutate(t_obs_c = ifelse(t_obs_c > 40, NA, t_obs_c),
t_obs_c = ifelse(t_obs_c < -40, NA, t_obs_c),
total_precip_mm = ifelse(total_precip_mm > 250, NA, total_precip_mm))

clean_aa_weather %>%
ggplot(aes(total_precip_mm)) + geom_histogram()
`````` or

``````aa_weather <- aa_weather %>%
mutate(t_obs_c = ifelse(t_obs_c > 40, NA, t_obs_c),
t_obs_c = ifelse(t_obs_c < -40, NA, t_obs_c),
total_precip_mm = ifelse(total_precip_mm > 250, NA, total_precip_mm))

aa_weather %>%
ggplot(aes(total_precip_mm)) + geom_histogram()
`````` I’d prefer to make one pipeline for cleaning up my data. So starting from the top

``````aa_weather <- read_csv('noaa/USC00200230.csv', col_types="ccDdddddddddd") %>%
select(DATE, starts_with("T"), PRCP, SNOW, SNWD) %>%
rename("date" = "DATE",
"t_max_c" = "TMAX",
"t_min_c" = "TMIN",
"t_obs_c" = "TOBS",
"total_precip_mm" = "PRCP",
"snow_fall_mm" = "SNOW",
"snow_depth_mm" = "SNWD") %>%
mutate(t_obs_c = ifelse(t_obs_c > 40, NA, t_obs_c),
t_obs_c = ifelse(t_obs_c < -40, NA, t_obs_c),
total_precip_mm = ifelse(total_precip_mm > 250, NA, total_precip_mm))
``````

Now we have a data frame, `aa_weather`, that has our cleaned up data. What’s better, is that we can throw this code in a new script file and as more data is added to the station, we can rerun the code to clean up the data by the same methods!

## Exercises

1. Perhaps you remember from earlier that we can combine logical questions to make a single filter command. Can you transform our two `filter` functions for finding weird `t_obs_c` values into one?

2. Can you make the `mutate` function call above only modify the `t_obs_c` column once?

3. What’s wrong with this code chunk?

``````aa_weather %>%
mutate(t_obs_c = ifelse(t_obs_c > 40 | t_obs_c <-40, NA, t_obs_c)) %>%
ggplot(aes(x=t_obs_c)) +
geom_histogram()
``````

4. Can you make a column called `is_freezing`, which contains logical data indicating whether the minimum temperature for the day was below freezing?

5. Look at the `snow_fall_mm` and `snow_depth_mm` data and see whether you can spot any other weird data.