The data is...
For our next Code Club, we will continue our exploration of data collected by Five Thirty Eight that looked at how people feel about the oxford comma and whether the word “data” is plural. Last week, we began to look at these data by renaming the column titles using the rename
function and made the values in those columns easier to work with using the recode
function. This week we will work with the count
and filter
functions to identify rows in the data frame that satisfy certain requirements. We will see how we can use these two functions to subset our github
data frame from last week and characterize the survey responses for different groups of people. You will be able to join the conversation using this link to use Zoom. The session should last an hour. Please be sure to see the setup instructions and code of conduct before we get going.
If you would like to revisit Pat’s introduction and the approach taken by several participants, you can watch this video. If you were on the call, you’ll notice a different Introduction. Pat forgot to press record the first time around:
Prompt
Here’s where we ended last week. Go ahead and copy this into a new R script file in RStudio
library(tidyverse)
github <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/comma-survey/comma-survey.csv") %>%
rename(respondent=RespondentID,
oxford_or_not="In your opinion, which sentence is more gramatically correct?",
heard_of_oxford = "Prior to reading about it above, had you heard of the serial (or Oxford) comma?",
care_about_oxford = "How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?",
singular_or_plural = "How would you write the following sentence?",
think_about_data = "When faced with using the word \"data\", have you ever spent time considering if the word was a singular or plural noun?",
care_about_data = "How much, if at all, do you care about the debate over the use of the word \"data\" as a singluar or plural noun?",
importance_of_grammar = "In your opinion, how important or unimportant is proper use of grammar?",
gender = "Gender",
age = "Age",
household_income = "Household Income",
education = "Education",
location = "Location (Census Region)"
) %>%
mutate(oxford_or_not=recode(oxford_or_not, "It's important for a person to be honest, kind and loyal."="non_oxford",
"It's important for a person to be honest, kind, and loyal."="oxford")) %>%
mutate(singular_or_plural=recode(singular_or_plural, "Some experts say it's important to drink milk, but the data is inconclusive."="singular",
"Some experts say it's important to drink milk, but the data are inconclusive."="plural"))
count
Last week we saw the count
function could be used to count the number of times categorical variable values appear. For example,
github %>%
count(oxford_or_not)
Generates a table telling us how many people chose the sentence with or without an Oxford comma. We can make the counts conditional on other variables, by including other variables separated by commas
github %>%
count(oxford_or_not, singular_or_plural)
This information is useful, but I find count
to be useful for helping me to remember the values for a categorical variable. For example, I’d like to know what levels of education were included in the study.
github %>%
count(education)
A final point that is important, but easy to forget - the output of count
is a new data frame that will have columns for the variables you are counting along with the number of cases. The other data in github
will not be carried into the new data frame.
filter
Today we’re going to cover the filter
function. filter
allows us to select rows from a data frame for further downstream processing. For example, we could do the following to get those respondents that use the Oxford comma
github %>%
filter(oxford_or_not == "oxford")
You’ll see that the original github
variable had 1129 rows and this filtered version has 641. You’ll also notice that the oxford_or_not
column only has oxford
in the first 10 rows. We can double check this using the count
function from last week
github %>%
filter(oxford_or_not == "oxford") %>%
count(oxford_or_not)
Let’s step back and explain what’s going on here. If we look at the argument for filter
, we see oxford_or_not == "oxford"
. This “asks” whether the value in the oxford_or_not
variable for each row is "oxford"
. The ==
is a logical equal, which asserts, “The left side is equal to the right side.” If that statement is true, then the expression is TRUE
if not, the expression is FALSE
. filter
evaluates this expression for every row and when the expression is TRUE
, the row is retained. You may recall that in the Candy Crush code club we the line, filter(answer)
. The answer
variable was a logical variable where every value was a TRUE
or FALSE
. It’s the same idea as our oxford_or_not == "oxford"
, except answer
already had the, um, answer.
A number of functions can be used to generate a logical (i.e. TRUE
or FALSE
) value. Not all of them are a fit for this dataset, we’ll use ==
and !=
. We’ve seen ==
, but what about !=
? The !
in logical operations means “not”. So, !=
means “not equal to”. Let’s try it out…
github %>%
filter(oxford_or_not != "oxford") %>%
count(oxford_or_not)
That’s effectively the same as
github %>%
filter(oxford_or_not == "non_oxford") %>%
count(oxford_or_not)
Whether you use ==
or !=
depends on how you think of your question.
Perhaps we want to know about people that use the Oxford comma and use “data” as a plural word. We have a few approaches to this question. Using what we already know, you could use two filter
function calls
github %>%
filter(oxford_or_not == "oxford") %>%
filter(singular_or_plural == "plural") %>%
count(oxford_or_not, singular_or_plural)
The pipeline firsts filters based on whether the respondent used the Oxford comma and then on those who use data as a plural noun. We can simplify this to make a single filter
function call
github %>%
filter(oxford_or_not == "oxford", singular_or_plural == "plural") %>%
count(oxford_or_not, singular_or_plural)
That works! The ,
tells filter
that both expressions have to evaluate as TRUE
. I prefer to use &
rather than the ,
github %>%
filter(oxford_or_not == "oxford" & singular_or_plural == "plural") %>%
count(oxford_or_not, singular_or_plural)
The reason I prefer to use the &
(i.e. the AND operator) is because it allows us to generate more complex queries and it is a good partner to its opposite, |
, the OR operator. Let’s say I want a table of people who use the Oxford comma or who treat data as a plural noun
github %>%
filter(oxford_or_not == "oxford" | singular_or_plural == "plural") %>%
count(oxford_or_not, singular_or_plural)
For the AND operator, both the left and right side of the &
must be TRUE
for the expression to be TRUE
. For the OR operator, either the left or right side of the |
must be TRUE
for the expression to be TRUE
. That’s why the output from the last expression doesn’t have a row for non_oxford
/ singular
.
Let’s ask another slightly more complicated question. Let’s generate a data frame of respondents that care about the Oxford comma “some” or “a lot” and use the Oxford comma. For these types of questions, it is helpful to use parentheses to organize our query. Remember that the stuff in parentheses is done first.
github %>%
filter(oxford_or_not == "oxford" & (care_about_oxford == "Some" | care_about_oxford == "A lot")) %>%
count(oxford_or_not, care_about_oxford)
Finally, remember that we can put different columns in the count
function than those that we filtered on. Among people that have heard of the Oxford comma, how much do people care about it?
github %>%
filter(heard_of_oxford == "Yes") %>%
count(care_about_oxford)
Exercises
1. Which geographic region was the best represented in the survey?
2. How many respondents cared about grammar?
3. Among those respondents that cared about grammar, did they have a preference for the Oxford comma or using “data” as a plural noun?
4. Come up with your own question and answer it using filter
and count