Find the candy that I like...
For our next Code Club, we will continue our exploration of the data collected by Five Thirty Eight on how people feel about different types of candy. In the first Code Club, we played with the data. Last week, we began to use the filter function to generate logical (or boolean) queries that identify rows from the data that meet our criteria. This time we’ll focus on using filter with quantitative data and we’ll see how to use it with the group_by and summarize functions.
If you would like to revisit Pat’s introduction and the approach taken by several participants, you can watch this video. If you were on the call, you’ll notice a different introduction because Pat forgot to press record the first time around (again).
In our inaugural Code Club we loaded data that described people’s preferences for each of 85 different types of candy. The folks at Five Thirty Eight characterized each of the candies by their attributes, including things like whether the candy has chocolate, nougat, caramel, and so forth. They also created indices to rate the relative price and amount of sugar in each candy. In that Code Club, I had participants generate a figure that showed whether chocolate candies are more expensive as a bar or as bite-sized pieces. I’d like to use that question to build upon the discussion we had last week using filter and to expand the discussion to calculate statistics for different groups of data using group_by and summarize. Let’s load the data and get going!
library(tidyverse)
candy_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv", col_types="clllllllllddd")
You’ll recall that filter identifies rows in a data frame that satisfy some logical question. If the answer to the question is TRUE, then that row is returned. These candy data have a number of variables that are already logical. If you print candy_data and look under the “chocolate” heading, you’ll see <lgl>, which marks a logical column, so we can pass that column to filter directly:
candy_data %>% filter(chocolate)
While you can do
candy_data %>% filter(chocolate == TRUE)
It’s a bit redundant and not necessary. Regardless of the approach, we see that 37 of the candies have chocolate in them. Of course, we could have figured that out with the count function that we discussed earlier
candy_data %>% count(chocolate)
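To make the comparison concrete, here is a minimal sketch on a made-up four-row data frame. The name toy_candy and its values are hypothetical and are not the Five Thirty Eight data. It loads dplyr, which is part of the tidyverse, and shows that the two ways of writing the filter return the same rows, while count returns a tally rather than the rows themselves.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_candy <- tibble(
  competitorname = c("Candy A", "Candy B", "Candy C", "Candy D"),
  chocolate = c(TRUE, FALSE, TRUE, FALSE)
)

# A logical column can serve as the filter condition by itself
with_chocolate <- toy_candy %>% filter(chocolate)

# The explicit comparison returns the same rows
with_chocolate_explicit <- toy_candy %>% filter(chocolate == TRUE)

# count() returns one row per value of chocolate, with the tally in column n
chocolate_counts <- toy_candy %>% count(chocolate)
```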
What the filter command gets us is a new data frame that only contains candies with chocolate. Before moving on to discussing group_by and summarize, I want to show you a few other things we can do with filter.
So far we’ve seen that we can search for specific values using the == and != operators and that we can combine queries with the & and | operators. The == and != operators work well when we’re trying to select for specific values. Sometimes, we want a range of values. Let’s say we want a new data frame that contains all of the candies that have a price index greater than 0.5. We could do the following
candy_data %>% filter(pricepercent > 0.5)
Similarly, if we wanted all of the low sugar candies we could do
candy_data %>% filter(sugarpercent < 0.33)
And building upon what we did last week, we can combine the queries with a & to get the expensive low sugar candies or with a | to get the candies that are expensive or low in sugar.
candy_data %>% filter(pricepercent > 0.5 & sugarpercent < 0.33) #12 rows
candy_data %>% filter(pricepercent > 0.5 | sugarpercent < 0.33) #64 rows
In addition to > and <, you can also use >= and <= to ask if things are “greater than or equal to” or “less than or equal to” a specific value.
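The comparison operators are easiest to tell apart on data small enough to check by eye. The following is a sketch on made-up values (toy_candy is hypothetical, not the real candy data) that contrasts > with >=, and & with |, and shows how two comparisons joined by & select a range.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_candy <- tibble(
  competitorname = c("Candy A", "Candy B", "Candy C", "Candy D", "Candy E"),
  pricepercent = c(0.10, 0.50, 0.51, 0.90, 0.75),
  sugarpercent = c(0.20, 0.33, 0.10, 0.80, 0.30)
)

# > excludes the boundary value, >= includes it
strictly_above <- toy_candy %>% filter(pricepercent > 0.5)
at_or_above <- toy_candy %>% filter(pricepercent >= 0.5)

# & keeps rows where both conditions are TRUE
expensive_and_low_sugar <- toy_candy %>%
  filter(pricepercent > 0.5 & sugarpercent < 0.33)

# | keeps rows where at least one condition is TRUE
expensive_or_low_sugar <- toy_candy %>%
  filter(pricepercent > 0.5 | sugarpercent < 0.33)

# Two comparisons joined with & select a range of values
mid_priced <- toy_candy %>%
  filter(pricepercent >= 0.25 & pricepercent <= 0.75)
```

In this toy data, Candy B sits exactly at 0.50, so it appears in at_or_above but not in strictly_above.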
The group_by and summarize functions generally go together. They allow us to group our data by some variable and then to perform some summary operation for each group. Returning to our question of whether candy with chocolate is more expensive as a bar or in bite-sized form, we can do the following
candy_data %>%
  filter(chocolate == TRUE) %>%
  group_by(bar) %>%
  summarize(mean_price = mean(pricepercent))
We could also report back the standard deviation and number of candies in each group by adding them to the list of arguments in summarize, separated by commas
candy_data %>%
  filter(chocolate == TRUE) %>%
  group_by(bar) %>%
  summarize(mean_price = mean(pricepercent), sd_price = sd(pricepercent), n=n())
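Although we should apply a statistical test, it does appear that candy with chocolate is more expensive in bar form. Other functions that work well with summarize include median, min, max, and any other function that returns a single value. Note that range returns two values (the minimum and the maximum), so min and max are usually the better fit inside summarize.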
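As a sketch of swapping in other single-value summary functions, the chunk below uses a made-up data frame (toy_chocolate and its prices are hypothetical) and reports the median, minimum, maximum, and count for each group.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_chocolate <- tibble(
  bar = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  pricepercent = c(0.8, 0.6, 0.2, 0.4, 0.3)
)

# Each function inside summarize() returns one value per group
price_summary <- toy_chocolate %>%
  group_by(bar) %>%
  summarize(
    median_price = median(pricepercent),
    min_price = min(pricepercent),
    max_price = max(pricepercent),
    n = n()
  )
```

The result has one row per value of bar and one column per summary, so adding another statistic only means adding another named argument.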
Perhaps we’d like to partition our data frame by two variables. Say we want to look at all candy to determine whether non-chocolate candy is also more expensive as a bar than as bite-sized. We could modify our earlier code chunk
candy_data %>%
  filter(!chocolate) %>% # the ! turns TRUE to FALSE and FALSE to TRUE
  group_by(bar) %>%
  summarize(mean_price = mean(pricepercent))
If we want both chocolaty and non-chocolate candy data in the same data frame, we will need to drop the filter function and modify the group_by function. We can partition the data by additional variables by listing them in the group_by function, separated by commas
candy_data %>% group_by(chocolate, bar) %>% summarize(mean_price = mean(pricepercent))
Sure enough, candy as a bar is more expensive regardless of whether it has chocolate in it!
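Grouping by two variables gives one row per combination of their values. The sketch below shows this on made-up data (toy_candy is hypothetical). It also passes .groups = "drop" to summarize, which assumes dplyr 1.0.0 or later; without it, summarize keeps the result grouped by the first variable and prints a message saying so.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_candy <- tibble(
  chocolate = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  bar = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  pricepercent = c(0.9, 0.5, 0.7, 0.6, 0.2, 0.4)
)

# One output row for each chocolate/bar combination present in the data
price_by_group <- toy_candy %>%
  group_by(chocolate, bar) %>%
  summarize(mean_price = mean(pricepercent), n = n(), .groups = "drop")
```

Adding n = n() alongside the mean is a cheap way to see how many candies sit behind each average before reading too much into a difference.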
One final trick to show you is that we can add a logical question to our group_by function
candy_data %>% group_by(sugarpercent > 0.50) %>% summarize(mean_price = mean(pricepercent))
Who knew? Candy with more sugar is more expensive.
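When the grouping variable is a logical expression, the output column is named after the expression itself, which can be awkward to refer to later. One option, sketched here on made-up data (toy_candy and the name high_sugar are hypothetical), is to name the expression inside group_by.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_candy <- tibble(
  sugarpercent = c(0.10, 0.40, 0.50, 0.60, 0.90),
  pricepercent = c(0.2, 0.4, 0.3, 0.7, 0.9)
)

# Naming the expression gives the grouping column a readable name
price_by_sugar <- toy_candy %>%
  group_by(high_sugar = sugarpercent > 0.50) %>%
  summarize(mean_price = mean(pricepercent), n = n())
```

Because the comparison is a strict >, the candy with a sugar index of exactly 0.50 lands in the FALSE group.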
1. How many of the candies that won more than 75% of their matchups had chocolate?
2. Do fruity candies have a different average price than non-fruity candies?
3. How do the prices of the more favored candies compare to those that are less favored?
4. Come up with your own question to answer with the functions we’ve discussed today