The data is...
For our next Code Club, we will continue our exploration of data collected by FiveThirtyEight that looked at how people feel about the Oxford comma and whether the word “data” is plural. Last week, we began to look at these data by renaming the column titles using the rename function and made the values in those columns easier to work with using the recode function. This week we will work with the count and filter functions to identify rows in the data frame that satisfy certain requirements. We will see how we can use these two functions to subset our github data frame from last week and characterize the survey responses for different groups of people. You will be able to join the conversation on Zoom using this link. The session should last an hour. Please be sure to see the setup instructions and code of conduct before we get going.
If you would like to revisit Pat’s introduction and the approach taken by several participants, you can watch this video. If you were on the call, you’ll notice a different Introduction. Pat forgot to press record the first time around:
Here’s where we ended last week. Go ahead and copy this into a new R script file in RStudio
library(tidyverse)

# Read the survey, shorten the column names, and recode the two sentence-choice columns
github <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/comma-survey/comma-survey.csv") %>%
  rename(respondent=RespondentID,
    oxford_or_not="In your opinion, which sentence is more gramatically correct?",
    heard_of_oxford = "Prior to reading about it above, had you heard of the serial (or Oxford) comma?",
    care_about_oxford = "How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?",
    singular_or_plural = "How would you write the following sentence?",
    think_about_data = "When faced with using the word \"data\", have you ever spent time considering if the word was a singular or plural noun?",
    care_about_data = "How much, if at all, do you care about the debate over the use of the word \"data\" as a singluar or plural noun?",
    importance_of_grammar = "In your opinion, how important or unimportant is proper use of grammar?",
    gender = "Gender",
    age = "Age",
    household_income = "Household Income",
    education = "Education",
    location = "Location (Census Region)"
  ) %>%
  mutate(oxford_or_not=recode(oxford_or_not,
    "It's important for a person to be honest, kind and loyal."="non_oxford",
    "It's important for a person to be honest, kind, and loyal."="oxford")) %>%
  mutate(singular_or_plural=recode(singular_or_plural,
    "Some experts say it's important to drink milk, but the data is inconclusive."="singular",
    "Some experts say it's important to drink milk, but the data are inconclusive."="plural"))
Last week we saw that the count function could be used to count the number of times each value of a categorical variable appears. For example,
github %>% count(oxford_or_not)
This generates a table telling us how many people chose the sentence with or without an Oxford comma. We can make the counts conditional on other variables by including those variables, separated by commas
github %>% count(oxford_or_not, singular_or_plural)
This information is useful, but I also find count useful for helping me remember the values of a categorical variable. For example, I’d like to know what levels of education were included in the study.
github %>% count(education)
A final point that is important, but easy to forget: the output of count is a new data frame that will have columns for the variables you are counting along with the number of cases. The other data in github will not be carried into the new data frame.
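To make that last point concrete, here is a minimal sketch using a small made-up data frame. The toy tibble and its values are hypothetical stand-ins for the survey, not the real responses.

```r
library(dplyr)

# Hypothetical toy data standing in for the survey; not the real responses
toy <- tibble(
  respondent = 1:5,
  oxford_or_not = c("oxford", "oxford", "non_oxford", "oxford", "non_oxford"),
  education = c("Bachelor degree", "Graduate degree", "Bachelor degree",
                "High school degree", "Graduate degree")
)

counts <- toy %>% count(oxford_or_not)

# Only the counted variable and n survive; respondent and education are gone
colnames(counts)
counts
```

If you need the other columns later, go back to the original data frame rather than the output of count.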
Today we’re going to cover the filter function. filter allows us to select rows from a data frame for further downstream processing. For example, we could do the following to get those respondents that use the Oxford comma
github %>% filter(oxford_or_not == "oxford")
You’ll see that the original github variable had 1129 rows and this filtered version has 641. You’ll also notice that the oxford_or_not column only has oxford in the first 10 rows. We can double check this using the count function from last week
github %>% filter(oxford_or_not == "oxford") %>% count(oxford_or_not)
Let’s step back and explain what’s going on here. If we look at the argument for filter, we see oxford_or_not == "oxford". This “asks” whether the value in the oxford_or_not variable for each row is oxford. The == is a logical equal, which asserts, “The left side is equal to the right side.” If that statement is true, then the expression is TRUE; if not, the expression is FALSE. filter evaluates this expression for every row and when the expression is TRUE, the row is retained. You may recall that in the Candy Crush code club we had a line that used the answer variable. The answer variable was a logical variable where every value was a TRUE or FALSE. It’s the same idea as our oxford_or_not == "oxford", except answer already had the, um, answer.
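If it helps to see the logical values themselves, here is a small sketch. The toy tibble is made up for illustration and is not the survey data.

```r
library(dplyr)

# Hypothetical toy data; not the real survey responses
toy <- tibble(
  respondent = 1:4,
  oxford_or_not = c("oxford", "non_oxford", "oxford", "non_oxford")
)

# The expression that filter evaluates, one TRUE/FALSE value per row
keep <- toy$oxford_or_not == "oxford"
keep

# filter retains only the rows where the expression is TRUE
kept <- toy %>% filter(oxford_or_not == "oxford")
kept
```

The keep vector lines up with the rows of toy, and kept contains exactly the rows where keep is TRUE.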
A number of functions can be used to generate a logical (i.e. TRUE or FALSE) value. Not all of them are a fit for this dataset, so we’ll use == and !=. We’ve seen ==, but what about !=? The ! in logical operations means “not”. So, != means “not equal to”. Let’s try it out…
github %>% filter(oxford_or_not != "oxford") %>% count(oxford_or_not)
That’s effectively the same as
github %>% filter(oxford_or_not == "non_oxford") %>% count(oxford_or_not)
Whether you use == or != depends on how you think of your question.
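One detail worth checking is what happens to missing values. Here is a sketch with a made-up data frame (hypothetical values, not the survey) where one respondent skipped the question. A comparison with NA gives NA rather than TRUE, and filter drops those rows, which is why the two approaches give the same rows when the column only has two values.

```r
library(dplyr)

# Hypothetical toy data with one missing response
toy <- tibble(
  respondent = 1:5,
  oxford_or_not = c("oxford", "non_oxford", NA, "non_oxford", "oxford")
)

# Comparing NA to anything gives NA, which filter treats as "do not keep"
toy$oxford_or_not != "oxford"

not_oxford    <- toy %>% filter(oxford_or_not != "oxford")
is_non_oxford <- toy %>% filter(oxford_or_not == "non_oxford")

# Both approaches keep the same respondents
identical(not_oxford$respondent, is_non_oxford$respondent)
```

If the column had a third value, the two filters would no longer agree, so it is worth running count on the column first.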
Perhaps we want to know about people that use the Oxford comma and use “data” as a plural word. We have a few approaches to this question. Using what we already know, you could use two filter function calls
github %>% filter(oxford_or_not == "oxford") %>% filter(singular_or_plural == "plural") %>% count(oxford_or_not, singular_or_plural)
The pipeline first filters based on whether the respondent used the Oxford comma and then on whether they use data as a plural noun. We can simplify this to make a single filter function call
github %>% filter(oxford_or_not == "oxford", singular_or_plural == "plural") %>% count(oxford_or_not, singular_or_plural)
That works! The comma tells filter that both expressions have to evaluate as TRUE. I prefer to use & rather than the comma
github %>% filter(oxford_or_not == "oxford" & singular_or_plural == "plural") %>% count(oxford_or_not, singular_or_plural)
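As a quick check that separating conditions with a comma and joining them with & behave the same way, here is a sketch on a made-up data frame with one row per combination. The toy values are hypothetical, not the survey data.

```r
library(dplyr)

# Hypothetical toy data: one row for each combination of the two answers
toy <- tibble(
  respondent         = 1:4,
  oxford_or_not      = c("oxford", "oxford", "non_oxford", "non_oxford"),
  singular_or_plural = c("plural", "singular", "plural", "singular")
)

# Conditions separated by a comma
with_comma <- toy %>%
  filter(oxford_or_not == "oxford", singular_or_plural == "plural")

# The same conditions joined with the AND operator
with_and <- toy %>%
  filter(oxford_or_not == "oxford" & singular_or_plural == "plural")

# Both keep the same single respondent
identical(with_comma$respondent, with_and$respondent)
```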
The reason I prefer to use the & (i.e. the AND operator) is because it allows us to generate more complex queries and it is a good partner to its opposite, |, the OR operator. Let’s say I want a table of people who use the Oxford comma or who treat data as a plural noun
github %>% filter(oxford_or_not == "oxford" | singular_or_plural == "plural") %>% count(oxford_or_not, singular_or_plural)
For the AND operator, both the left and right side of the & must be TRUE for the expression to be TRUE. For the OR operator, either the left or right side of the | must be TRUE for the expression to be TRUE. That’s why the output from the last expression doesn’t have a row for respondents who chose both non_oxford and singular.
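The AND and OR rules can be sketched with plain logical vectors before applying them to a data frame. The vectors a and b and the toy tibble below are made up for illustration.

```r
library(dplyr)

# Element-wise truth tables for AND and OR
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)
and_result <- a & b   # TRUE only where both sides are TRUE
or_result  <- a | b   # TRUE where at least one side is TRUE

# Hypothetical toy data with one row per combination of answers
toy <- tibble(
  oxford_or_not      = c("oxford", "oxford", "non_oxford", "non_oxford"),
  singular_or_plural = c("plural", "singular", "plural", "singular")
)

# OR keeps every combination except the one where both sides are FALSE
either <- toy %>%
  filter(oxford_or_not == "oxford" | singular_or_plural == "plural") %>%
  count(oxford_or_not, singular_or_plural)
either
```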
Let’s ask another slightly more complicated question. Let’s generate a data frame of respondents that care about the Oxford comma “some” or “a lot” and use the Oxford comma. For these types of questions, it is helpful to use parentheses to organize our query. Remember that the stuff in parentheses is done first.
github %>% filter(oxford_or_not == "oxford" & (care_about_oxford == "Some" | care_about_oxford == "A lot")) %>% count(oxford_or_not, care_about_oxford)
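To see why the parentheses matter, here is a sketch on made-up data (hypothetical values, not the survey). In R, & is evaluated before |, so leaving out the parentheses changes the question being asked. The %in% version at the end is an alternative I am adding for comparison; it is not part of the query above.

```r
library(dplyr)

# Hypothetical toy data; not the real survey responses
toy <- tibble(
  respondent        = 1:5,
  oxford_or_not     = c("oxford", "oxford", "non_oxford", "non_oxford", "oxford"),
  care_about_oxford = c("Some", "A lot", "A lot", "Some", "Not much")
)

# Oxford users who care "Some" or "A lot"
with_parens <- toy %>%
  filter(oxford_or_not == "oxford" &
           (care_about_oxford == "Some" | care_about_oxford == "A lot"))

# Without parentheses this reads as (oxford AND Some) OR (A lot),
# so a non_oxford respondent who cares "A lot" slips through
without_parens <- toy %>%
  filter(oxford_or_not == "oxford" & care_about_oxford == "Some" |
           care_about_oxford == "A lot")

# An alternative way to write the parenthesized OR
with_in <- toy %>%
  filter(oxford_or_not == "oxford" & care_about_oxford %in% c("Some", "A lot"))

with_parens$respondent
without_parens$respondent
```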
Finally, remember that we can put different columns in the count function than those that we filtered on. Among people that have heard of the Oxford comma, how much do people care about it?
github %>% filter(heard_of_oxford == "Yes") %>% count(care_about_oxford)
1. Which geographic region was the best represented in the survey?
2. How many respondents cared about grammar?
3. Among those respondents that cared about grammar, did they have a preference for the Oxford comma or using “data” as a plural noun?
4. Come up with your own question and answer it using count and filter.