Cleaning and manipulating data with the tidyverse: dplyr, readr, and stringr in action (CC121)

June 30, 2021 • PD Schloss • 4 min read • •

Code

This is the R script that I generated in this episode

library(tidyverse)

shared <- read_tsv("raw_data/baxter.subsample.shared",
         col_types = cols(Group = col_character(),
                          .default = col_double())) %>%
  rename_all(tolower) %>%
  select(group, starts_with("otu")) %>%
  pivot_longer(-group, names_to="otu", values_to="count")

taxonomy <- read_tsv("raw_data/baxter.cons.taxonomy") %>%
  rename_all(tolower) %>%
  select(otu, taxonomy) %>%
  mutate(otu = tolower(otu),
         taxonomy = str_replace_all(taxonomy, "\\(\\d+\\)", ""),
         taxonomy = str_replace(taxonomy, ";unclassified", "_unclassified"),
         taxonomy = str_replace_all(taxonomy, ";unclassified", ""),
         taxonomy = str_replace_all(taxonomy, ";$", ""),
         taxonomy = str_replace_all(taxonomy, ".*;", "")
         )


metadata <- read_tsv("raw_data/baxter.metadata.tsv",
         col_types=cols(sample = col_character())) %>%
  rename_all(tolower) %>%
  rename(group = sample) %>%
  mutate(srn = dx_bin == "Adv Adenoma" | dx_bin == "Cancer",
         lesion = dx_bin == "Adv Adenoma" | dx_bin == "Cancer" | dx_bin == "Adenoma")

composite <- inner_join(shared, taxonomy, by="otu") %>%
  group_by(group, taxonomy) %>%
  summarize(count = sum(count), .groups="drop") %>%
  group_by(group) %>%
  mutate(rel_abund = count / sum(count)) %>%
  ungroup() %>%
  select(-count) %>%
  inner_join(., metadata, by="group")

Installations

If you haven’t been following along, you can get caught up by doing the following:

(windows) Install the Ubuntu Linux BASH shell for Windows 10

(mac) Install homebrew and git

  /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  brew install git

To get to where we are at the beginning of this episode (you won’t have the same issue numbers at Pat)…
- Set up a GitHub account
- Create a new GitHub repository
  - Call it “mikropml_demo”
  - Make it Public
  - Don’t check the box next to “Initialize this repository with a README”
  - Click the green “Create repository” button
- Go to your command line and enter the following replacing <your_github_id> with your GitHub user id
```
git clone git@github.com:SchlossLab/mikropml_demo.git
cd mikropml_demo
git reset --hard 793d7973aca94e22aed4e1133094971d8e99f2f2
git remote set-url origin git@github.com:<your_github_id>/mikropml_demo.git
git push -u origin master
```
- Return to GitHub and refresh your browser.
- Go to the mikropml_demo directory on your computer and double click on the mikropml_demo.Rproj icon. This will launch RStudio and you’ll be good to go.