In the previous lesson we saw how we can merge two data frames together and how we can filter
for rows and/or select
columns from a single data frame. We ended by discussing how the pipe operator (i.e. %>%
) can be used to connect functions to direct the flow of data through a pipeline. We now want to take the next step in analyzing our data. Let’s think of some questions we might have about the data in our data frame
In this lesson we will see how we can add functions from the dplyr
package to the pipeline we’ve been developing to answer these and more questions. Let’s recall how we can create a data frame called meta_alpha
.
library(tidyverse)
library(readxl)
metadata <- read_excel(path="raw_data/baxter.metadata.xlsx",
col_types=c(sample = "text", fit_result = "numeric", Site = "text", Dx_Bin = "text",
dx = "text", Hx_Prev = "logical", Hx_of_Polyps = "logical", Age = "numeric",
Gender = "text", Smoke = "logical", Diabetic = "logical", Hx_Fam_CRC = "logical",
Height = "numeric", Weight = "numeric", NSAID = "logical", Diabetes_Med = "logical",
stage = "text")
)
metadata <- mutate(metadata, Height = na_if(Height, 0))
metadata <- mutate(metadata, Weight = na_if(Weight, 0))
metadata <- mutate(metadata, Site = recode(.x=Site, "U of Michigan"="U Michigan"))
metadata <- mutate(metadata, Dx_Bin = recode(.x=Dx_Bin, "Cancer."="Cancer"))
metadata <- mutate(metadata, Gender = recode(.x=Gender, "f"="female", "m"="male"))
metadata <- rename_all(.tbl=metadata, .funs=tolower)
metadata <- rename(.data=metadata,
previous_history=hx_prev,
history_of_polyps=hx_of_polyps,
family_history_of_crc=hx_fam_crc,
diagnosis_bin=dx_bin,
diagnosis=dx,
sex=gender)
metadata <- mutate(metadata, diagnosis = factor(diagnosis, levels=c("normal", "adenoma", "cancer")))
alpha <- read_tsv(file="raw_data/baxter.groups.ave-std.summary",
col_types=cols(group = col_character())) %>%
filter(method=='ave') %>%
select(group, sobs, shannon, invsimpson, coverage)
meta_alpha <- inner_join(metadata, alpha, by=c('sample'='group'))
Let’s take on the first question I posed above, “What is the mean diversity across the three diagnosis groups?” We’ll break the problem down by steps. An outline of what we want to get R to do is to group our data frame by a categorical variable (e.g. diagnosis
) and then within each value of diagnosis
we want it to calculate a summary statistic on another column (e.g. shannon
). The first step will use a new verb - group_by
- to group our data frame by a column. The output of this function will only be slightly different to us. See if you can spot the added wrinkle to the output that is created by using group_by
group_by(meta_alpha, diagnosis)
## # A tibble: 490 x 21
## # Groups: diagnosis [3]
## sample fit_result site diagnosis_bin diagnosis previous_history
## <chr> <dbl> <chr> <chr> <fct> <lgl>
## 1 20036… 0 U Mi… High Risk No… normal FALSE
## 2 20056… 0 U Mi… High Risk No… normal FALSE
## 3 20076… 26 U Mi… High Risk No… normal FALSE
## 4 20096… 10 Toro… Adenoma adenoma FALSE
## 5 20136… 0 U Mi… Normal normal FALSE
## 6 20156… 0 Dana… High Risk No… normal FALSE
## 7 20176… 7 Dana… Cancer cancer TRUE
## 8 20196… 19 U Mi… Normal normal FALSE
## 9 20236… 0 Dana… High Risk No… normal TRUE
## 10 20256… 1509 U Mi… Cancer cancer TRUE
## # … with 480 more rows, and 15 more variables: history_of_polyps <lgl>,
## # age <dbl>, sex <chr>, smoke <lgl>, diabetic <lgl>,
## # family_history_of_crc <lgl>, height <dbl>, weight <dbl>, nsaid <lgl>,
## # diabetes_med <lgl>, stage <chr>, sobs <dbl>, shannon <dbl>,
## # invsimpson <dbl>, coverage <dbl>
You should notice that above the column headings it says Groups: diagnosis[3]
. This indicates that we now have three groups in our data. We can actually write this two different ways and get the same result. The second way and the way that I prefer is
meta_alpha %>%
group_by(diagnosis)
## # A tibble: 490 x 21
## # Groups: diagnosis [3]
## sample fit_result site diagnosis_bin diagnosis previous_history
## <chr> <dbl> <chr> <chr> <fct> <lgl>
## 1 20036… 0 U Mi… High Risk No… normal FALSE
## 2 20056… 0 U Mi… High Risk No… normal FALSE
## 3 20076… 26 U Mi… High Risk No… normal FALSE
## 4 20096… 10 Toro… Adenoma adenoma FALSE
## 5 20136… 0 U Mi… Normal normal FALSE
## 6 20156… 0 Dana… High Risk No… normal FALSE
## 7 20176… 7 Dana… Cancer cancer TRUE
## 8 20196… 19 U Mi… Normal normal FALSE
## 9 20236… 0 Dana… High Risk No… normal TRUE
## 10 20256… 1509 U Mi… Cancer cancer TRUE
## # … with 480 more rows, and 15 more variables: history_of_polyps <lgl>,
## # age <dbl>, sex <chr>, smoke <lgl>, diabetic <lgl>,
## # family_history_of_crc <lgl>, height <dbl>, weight <dbl>, nsaid <lgl>,
## # diabetes_med <lgl>, stage <chr>, sobs <dbl>, shannon <dbl>,
## # invsimpson <dbl>, coverage <dbl>
I prefer this second approach because it breaks up the commands a bit more and makes it more clear that the source of the data is the meta_alpha
data frame. Again, the result is the same. If I had a huge data frame and didn’t want to pipe every column through the pipeline, I could easily add a select line before the group_by
line.
meta_alpha %>%
select(diagnosis, shannon) %>%
group_by(diagnosis)
## # A tibble: 490 x 2
## # Groups: diagnosis [3]
## diagnosis shannon
## <fct> <dbl>
## 1 normal 4.02
## 2 normal 3.98
## 3 normal 3.91
## 4 adenoma 4.16
## 5 normal 3.33
## 6 normal 3.74
## 7 cancer 3.98
## 8 normal 3.69
## 9 normal 4.01
## 10 cancer 3.39
## # … with 480 more rows
Adding the same select
function to the first approach would require more typing and moving things around. It isn’t as flexible.
select(meta_alpha, diagnosis) %>%
group_by(diagnosis)
## # A tibble: 490 x 1
## # Groups: diagnosis [3]
## diagnosis
## <fct>
## 1 normal
## 2 normal
## 3 normal
## 4 adenoma
## 5 normal
## 6 normal
## 7 cancer
## 8 normal
## 9 normal
## 10 cancer
## # … with 480 more rows
For the final step, we’d like to summarize the diversity values within the groups by presenting the mean value. We can do this using the summarize
function from the dplyr
package
meta_alpha %>%
group_by(diagnosis) %>%
summarize(mean_shannon = mean(shannon))
## # A tibble: 3 x 2
## diagnosis mean_shannon
## <fct> <dbl>
## 1 normal 3.58
## 2 adenoma 3.58
## 3 cancer 3.52
The output is a new data frame that has three rows, one for each diagnosis category, and two columns, which contain the diagnosis categories and the mean of the Shannon diversity values. This is a very powerful set of tools!
Let’s say that we realized our Shannon diversity indices are not normally distributed. This revelation makes us realize that instead of the mean, we should instead present the median. Change the previous code to present the median Shannon diversity for each group. Make sure you pick a better column name!
We can use summarize
to present various summaries of the groups. Let’s create a new table with the mean Shannon diversity index and standard deviation for each group.
meta_alpha %>%
group_by(diagnosis) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon))
## # A tibble: 3 x 3
## diagnosis mean_shannon sd_shannon
## <fct> <dbl> <dbl>
## 1 normal 3.58 0.467
## 2 adenoma 3.58 0.473
## 3 cancer 3.52 0.413
Slick, eh? We can continue to add summary statistics by adding columns to the arguments in the summarize
function. Other built-in functions that you can use with the summarize
function include n
, sum
, first
, last
, nth
, quantile
, min
, max
, IQR
, and var
. Note that n
is unique in that it does not take any arguments. If you want to get a count of the number of rows in that group you would use it as n()
rather than as n(shannon)
.
Add the number of individuals in each category to our summary data frame
Now let’s ask a slightly more complicated question, “what is the mean diversity among males and females across the three diagnosis categories?” In other words, we would like the mean diversity, standard deviation, and number of observations for each sex and diagnosis combination. Can you see what we will need to change from the code in the previous activity? First, we’ll need to get the “sex” column from meta_alpha
and we’ll also want to group by each subject’s diagnosis and sex.
meta_alpha %>%
group_by(diagnosis, sex) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n())
## # A tibble: 6 x 5
## # Groups: diagnosis [3]
## diagnosis sex mean_shannon sd_shannon N
## <fct> <chr> <dbl> <dbl> <int>
## 1 normal female 3.53 0.464 111
## 2 normal male 3.66 0.466 61
## 3 adenoma female 3.57 0.492 80
## 4 adenoma male 3.58 0.462 118
## 5 cancer female 3.56 0.429 52
## 6 cancer male 3.50 0.401 68
Now we have a new column to indicate the sex and we now have six rows instead of three because we have the three diagnosis categories for both of the sexes.
What happens if you use group_by(sex, diagnosis)
instead of group_by(diagnosis, sex)
?
What is the richness among people with and without a history of polyps?
What is the mean age and standard deviation among males and females?
Looking at these data frames can make us go a bit cross eyed if we want to find the diagnosis/sex combination that has the highest and lowest diversity. It would be even worse if we had more categories. Let’s see which site had the highest mean diversity. We’ll start by adapting our earlier code, which helped us calculate the mean diversity for each diagnosis group.
meta_alpha %>%
group_by(site) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n())
## # A tibble: 4 x 4
## site mean_shannon sd_shannon N
## <chr> <dbl> <dbl> <int>
## 1 Dana Farber 3.56 0.457 120
## 2 MD Anderson 3.47 0.445 95
## 3 Toronto 3.64 0.431 168
## 4 U Michigan 3.53 0.490 107
This gets us a new data frame. Because it only has four rows, it’s easiest to see that the individuals sampled in Toronto had the highest mean diversity while those at MD Anderson had the lowest. How do we get R to do this heavy lifting for us? We can use the arrange
function from the dplyr
package
meta_alpha %>%
group_by(site) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n()) %>%
arrange(mean_shannon)
## # A tibble: 4 x 4
## site mean_shannon sd_shannon N
## <chr> <dbl> <dbl> <int>
## 1 MD Anderson 3.47 0.445 95
## 2 U Michigan 3.53 0.490 107
## 3 Dana Farber 3.56 0.457 120
## 4 Toronto 3.64 0.431 168
Note that we gave the arrange
function the column that we wanted to sort on. You should also be able to see that the sorted table starts at the lowest mean diversity and goes to the highest. If we want sort to go in decreasing order, we need to use the desc
function within arrange
meta_alpha %>%
group_by(site) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n()) %>%
arrange(desc(mean_shannon))
## # A tibble: 4 x 4
## site mean_shannon sd_shannon N
## <chr> <dbl> <dbl> <int>
## 1 Toronto 3.64 0.431 168
## 2 Dana Farber 3.56 0.457 120
## 3 U Michigan 3.53 0.490 107
## 4 MD Anderson 3.47 0.445 95
To get the first row we would add the head
function.
meta_alpha %>%
group_by(site) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n()) %>%
arrange(desc(mean_shannon)) %>%
head(n=1)
## # A tibble: 1 x 4
## site mean_shannon sd_shannon N
## <chr> <dbl> <dbl> <int>
## 1 Toronto 3.64 0.431 168
The nice thing about the pipes is that it enables us to add functions to our pipeline as our question evolves. We could refactor the last two steps in the pipeline with a single line using the top_n
function from the dplyr
package
meta_alpha %>%
group_by(site) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n()) %>%
top_n(n=1, mean_shannon)
## # A tibble: 1 x 4
## site mean_shannon sd_shannon N
## <chr> <dbl> <dbl> <int>
## 1 Toronto 3.64 0.431 168
Of course we could look at the output from R to see the name of the site, but if we wanted a simpler output, we could use the pull
function from dplyr
meta_alpha %>%
group_by(site) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n()) %>%
top_n(n=1, mean_shannon) %>%
pull(site)
## [1] "Toronto"
Write the code that tells us the site that had the lowest mean diversity? You may need to consult the help documentation and examples for the top_n
function
Let’s say we want to compare the diversity of people in the different diagnosis groups that have high and low fit results. Recall that the “fit result” column contains continuous values. Clinically, people with fit results over 100 will be referred for a colonoscopy. To get started, we’d like first create a new column that indicates whether someone has a low or high fit result. To achieve this we will make use of the mutate
function in the dplyr
package. You’ll recall we used this function to help us clean up our metadata.
meta_alpha %>%
mutate(high_fit=fit_result>=100)
## # A tibble: 490 x 22
## sample fit_result site diagnosis_bin diagnosis previous_history
## <chr> <dbl> <chr> <chr> <fct> <lgl>
## 1 20036… 0 U Mi… High Risk No… normal FALSE
## 2 20056… 0 U Mi… High Risk No… normal FALSE
## 3 20076… 26 U Mi… High Risk No… normal FALSE
## 4 20096… 10 Toro… Adenoma adenoma FALSE
## 5 20136… 0 U Mi… Normal normal FALSE
## 6 20156… 0 Dana… High Risk No… normal FALSE
## 7 20176… 7 Dana… Cancer cancer TRUE
## 8 20196… 19 U Mi… Normal normal FALSE
## 9 20236… 0 Dana… High Risk No… normal TRUE
## 10 20256… 1509 U Mi… Cancer cancer TRUE
## # … with 480 more rows, and 16 more variables: history_of_polyps <lgl>,
## # age <dbl>, sex <chr>, smoke <lgl>, diabetic <lgl>,
## # family_history_of_crc <lgl>, height <dbl>, weight <dbl>, nsaid <lgl>,
## # diabetes_med <lgl>, stage <chr>, sobs <dbl>, shannon <dbl>,
## # invsimpson <dbl>, coverage <dbl>, high_fit <lgl>
Nice! Instead of getting back a boolean, I’d rather have it output as “high fit” or “low fit”. We can get this using the if_else
function from the dplyr
package. This function asks a logical question (e.g. fit_result>=100
) and if it is true then will return the second argument and if it is false, will return the third argument. We might do something like if_else(fit_result>=100, "high fit", "low fit")
meta_alpha %>%
mutate(fit_category=if_else(fit_result>=100, "high fit", "low fit"))
## # A tibble: 490 x 22
## sample fit_result site diagnosis_bin diagnosis previous_history
## <chr> <dbl> <chr> <chr> <fct> <lgl>
## 1 20036… 0 U Mi… High Risk No… normal FALSE
## 2 20056… 0 U Mi… High Risk No… normal FALSE
## 3 20076… 26 U Mi… High Risk No… normal FALSE
## 4 20096… 10 Toro… Adenoma adenoma FALSE
## 5 20136… 0 U Mi… Normal normal FALSE
## 6 20156… 0 Dana… High Risk No… normal FALSE
## 7 20176… 7 Dana… Cancer cancer TRUE
## 8 20196… 19 U Mi… Normal normal FALSE
## 9 20236… 0 Dana… High Risk No… normal TRUE
## 10 20256… 1509 U Mi… Cancer cancer TRUE
## # … with 480 more rows, and 16 more variables: history_of_polyps <lgl>,
## # age <dbl>, sex <chr>, smoke <lgl>, diabetic <lgl>,
## # family_history_of_crc <lgl>, height <dbl>, weight <dbl>, nsaid <lgl>,
## # diabetes_med <lgl>, stage <chr>, sobs <dbl>, shannon <dbl>,
## # invsimpson <dbl>, coverage <dbl>, fit_category <chr>
If you have more than two cases (e.g. TRUE and FALSE) then you might want to use the case_when
function from the dplyr
package. We might define three levels of fit categories. The case_when
function takes on a different syntax. The arguements to case_when
consist of a logical comparison followed by a ~
, followed by what to do if the logical comparison is true. The arguments are tested in order, so the ordering matters. It’s also good to add one argument where the logical comparison is TRUE
so that there is a default value if something goes wrong.
meta_alpha %>%
mutate(fit_category=case_when(fit_result>=100 ~ "high fit",
fit_result >= 50 ~ "moderate fit",
TRUE ~ "low fit")
) %>%
count(fit_category)
## # A tibble: 3 x 2
## fit_category n
## <chr> <int>
## 1 high fit 126
## 2 low fit 344
## 3 moderate fit 20
I also added a call to the count
function from the dplyr
package to count the number of observations in each fit category. Now we want to return to getting the mean diversity, standard deviation, and number of observations in each category. We will tack on to the end of our pipeline (minus the count
function call) the group_by
and summarize
commands like we did before. I’ll go ahead and sort the resulting data frame so that we can see which categories had the lowest and highest diversity values.
meta_alpha %>%
mutate(fit_category=case_when(fit_result>=100 ~ "high fit",
fit_result >= 50 ~ "moderate fit",
TRUE ~ "low fit")
) %>%
group_by(fit_category, diagnosis) %>%
summarize(mean_shannon = mean(shannon), sd_shannon = sd(shannon), N=n()) %>%
arrange(mean_shannon)
## # A tibble: 9 x 5
## # Groups: fit_category [3]
## fit_category diagnosis mean_shannon sd_shannon N
## <chr> <fct> <dbl> <dbl> <int>
## 1 high fit adenoma 3.43 0.579 31
## 2 high fit cancer 3.49 0.418 90
## 3 moderate fit adenoma 3.50 0.447 14
## 4 moderate fit normal 3.54 NA 1
## 5 low fit normal 3.57 0.471 166
## 6 moderate fit cancer 3.59 0.294 5
## 7 low fit adenoma 3.61 0.447 153
## 8 high fit normal 3.62 0.397 5
## 9 low fit cancer 3.64 0.402 25
In this example you’ll notice that the “fit_result” column does not show up in our final data frame because we were grouping by the “fit_category” and “diagnosis” columns to summarize the data in the “shannon” column. If we had stopped the pipeline after the mutate, then our data frame would have contained all of our previous columns as well as the “fit_category” column. Perhaps we didn’t want to include the “fit_result” column in an output table since that information was captured in “fit_category”. We could use select(-fit_result)
.
Create a summary table that gives the median age and intraquartile range of smokers in our cohort. Instead of having TRUE and FALSE as the values in the “smoke” column, convert those to “smoker” and “non-smoker”
Create a summary table that gives the mean BMI and standard deviation for people in each of the diagnosis groups. You can calculate BMI as the person’s weight in kilograms divided by their height in meters (or centimeters/100) squared. Look at the output of ?mean
and ?sd
to see if you can address the NA
values in the output.
In the last activity you added a column that required calculating the BMI. This is a relatively straightforward calculation. I could imagine perhaps adding some “special sauce” such as checking to see whether the height is reasonable for meters or if it’s more likely to be in centimeters. Alternatively, I might really want the BMI category rather than the actual BMI. I might also need to calculate the BMI of people in different cohorts or multiple times across a study. In each of these cases, I would be likely to encapsulate the code in a function.
The syntax you use to create a function looks like this.
my_killer_function <- function(argument1, argument2){
... special sauce ...
return(value)
}
The return
function tells R to return value
from my_killer_function
. Alternatively, R will return the output from the last function called in the function. Again, thinking about our BMI calculations we could write it like this…
get_bmi <- function(weight_kg, height_cm){
return(weight_kg / (height_cm/100) ^ 2)
}
Once we’ve got the function defined, we can then add it to our pipeline…
meta_alpha %>%
mutate(bmi = get_bmi(weight_kg = weight, height_cm = height)) %>%
group_by(diagnosis) %>%
summarize(median_bmi = mean(bmi, na.rm=T), sd_bmi = sd(bmi, na.rm=T))
## # A tibble: 3 x 3
## diagnosis median_bmi sd_bmi
## <fct> <dbl> <dbl>
## 1 normal 27.0 5.33
## 2 adenoma 26.4 4.35
## 3 cancer 29.1 6.76
The length of the new line of code is a bit longer than the original, but this is much easier to maintain. Again, say I have this function call at three or four places in my analysis. If I found that I forgot to divide the height in centimeters to meters, then I need to find and correct the bug in each case. This is a great way to cause problems. Functions allow you to practice the DRY principle: Don’t Repeat Yourself. As an example, I could update my get_bmi
function to make sure my heights and weights aren’t zero and if they are I’d return an NA
(note that if_else
requires the true and false values to be the same type, so we need to use a special value of NA
[see ?if_else
])
get_bmi <- function(weight_kg, height_cm){
bmi <- if_else(weight_kg == 0 | height_cm == 0, NA_real_, weight_kg / (height_cm/100) ^ 2)
return(bmi)
}
I could then run the same code as before
meta_alpha %>%
mutate(bmi = get_bmi(weight_kg = weight, height_cm = height)) %>%
group_by(diagnosis) %>%
summarize(median_bmi = mean(bmi, na.rm=T), sd_bmi = sd(bmi, na.rm=T))
## # A tibble: 3 x 3
## diagnosis median_bmi sd_bmi
## <fct> <dbl> <dbl>
## 1 normal 27.0 5.33
## 2 adenoma 26.4 4.35
## 3 cancer 29.1 6.76
Alternatively, we might rather have a BMI category column. Instead of doing two mutates or having an even more complicated function, we can create a get_bmi_category
function that calls get_bmi
for us. This will be much easier than putting all of the code as an argument for mutate
. Notice that we can assign an NA
character to any non-sensical values.
get_bmi_category <- function(weight_kg, height_cm){
bmi <- get_bmi(weight_kg, height_cm)
bmi_cat <- case_when(bmi >= 30 ~ "obese",
bmi >= 25 ~ "overweight",
bmi >= 18.5 ~ "normal",
bmi > 0 ~ "underweight"
TRUE ~ NA_character_,
)
return(bmi_cat)
}
## Error: <text>:8:25: unexpected numeric constant
## 7: bmi > 0 ~ "underweight"
## 8: TRUE
## ^
Then I could do…
meta_alpha %>%
mutate(bmi_category = get_bmi_category(weight_kg = weight, height_cm = height)) %>%
group_by(bmi_category) %>%
summarize(mean_age = mean(age, na.rm=T), sd_age = sd(age, na.rm=T), N=n())
## Error: Problem with `mutate()` input `bmi_category`.
## ✖ could not find function "get_bmi_category"
## ℹ Input `bmi_category` is `get_bmi_category(weight_kg = weight, height_cm = height)`.
Generate a function is_obese
that takes in a person’s height and weight and returns a TRUE
or FALSE
value indicating whether the person is obese. Then return the mean and standard deviation for the shannon diversity and the number of people in each category.
Modify the code we used previously to generate a scatter chart with BMI on the x-axis and Shannon diversity on the y-axis. Color the points by diagnosis.