Fun with regular expressions in sed and grep
Since starting this project 10 episodes ago, we have yet to really leave the command line interface. You’ve learned a lot of bash syntax -
mv, and probably a few others. If you have all those down, then you’re in great shape and are likely seeing the value of using these bash commands to automate a reproducible workflow. But, if these still seem a bit challenging to you, don’t fret! We’re going to spend a few more episodes in bash to strengthen our familiarity with these commands and my general workflow.
In today’s episode, we’ll see many of those commands and some new ones to help solve a problem I found in our analysis. In the last two episodes, we used special patterns called “regular expressions” with sed to extract information from our file names and paths. If you did the exercises in the sed episode, you saw how you can run sed on the contents of a file rather than its name. But sed isn’t the only place that we can use regular expressions in bash. There’s another, probably more popular tool called grep where we can use regular expressions. Heck, the name grep is short for “globally search for a regular expression and print matching lines”.
After the last episode, I was looking back through our files and noticed that mothur had changed our sequence names because the names had spaces in them. Have I mentioned how horrible spaces are for bioinformatics work?! I also noticed that although most of our sequences start and end at the coordinates that we trimmed them to, there are a few for each region that don’t. In those cases, mothur starts the sequence with a series of periods to indicate missing data. Later on, we might decide to toss those sequences because they’re weird. For now, I’d prefer to have those periods be hyphens, which represent gap characters. Instead of opening these files in a text editor and replacing all the spaces with underscores or replacing the periods with hyphens, we can fix the information using sed. Along the way we’ll learn a few extra commands to keep things interesting. These are the commands that I often use to diagnose problems or do simple analyses of data in my files.
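To make the two fixes described above concrete, here is a minimal sketch that runs sed on a tiny made-up alignment file. The file name toy.align, the sequence names, and the sequences are all assumptions for illustration and are not the project's data. The first expression replaces spaces with underscores on header lines only, and the second replaces periods with hyphens on every other line.

```shell
# Build a tiny made-up aligned fasta file (hypothetical names and sequences)
tmp_dir=$(mktemp -d)
printf '%s\n' \
  '>seq 1|Escherichia coli' \
  '..ACGT-ACGT..' \
  '>seq 2|Bacillus subtilis' \
  'ACGTACGT-ACGT' > "$tmp_dir/toy.align"

# /^>/   selects header lines; there we replace every space with an underscore
# /^>/!  selects all other lines; there we replace every literal period with a hyphen
fixed=$(sed -e '/^>/s/ /_/g' -e '/^>/!s/\./-/g' "$tmp_dir/toy.align")
echo "$fixed"
```

Restricting each substitution to header or sequence lines is a design choice in this sketch: it keeps any periods that happen to appear in a sequence name from being turned into hyphens.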
Even if you’re only watching this video to learn more about bash commands and don’t know what a 16S rRNA gene is, I’m sure you’ll get a lot out of today’s video. Please take the time to follow along on your own computer and attempt the exercises. Don’t worry if you aren’t sure how to solve the exercises; I will provide solutions at the end of the video. If you haven’t been following along but would like to, please check out the notes below where you’ll find instructions on catching up, reference notes, and links to supplemental material. You can find my version of the project on GitHub.
Important things to remember
For many commands, using the command name followed by -h or --help will bring you to a help page (e.g. wget --help). For others, you might need to use man or info followed by the command name (e.g. man grep or info grep). Many commands will get you help by either approach, with the info output being more complete. In the info interface you can use the arrow keys to work through the document line by line or the space bar to scroll a page at a time.
Resources on regular expressions in bash
- Bash Guide for Beginners (Chapter 4)
^ and $- anchors that match at the beginning and end of the line, respectively
*- matches the preceding character zero or more times
.- matches any character
\.- matches the . character (a literal period)
\-- matches the - character (a literal hyphen)
[AC]- matches either an “A” or “C” in the line
-c: count the number of lines that match
-v: return lines that don’t match
-E: extended regular expressions
- default output: number of lines, words, and characters for the specified file
-c: count the number of bytes (the same as the number of characters for plain ASCII text)
-w: count the number of words
-l: count the number of lines
-n: number of lines to output. The default is 10
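Here is a small sketch that exercises the patterns and options listed above with grep, wc, and head. It runs on a made-up file, so the file name toy.fasta and its contents are assumptions for illustration only.

```shell
# Build a tiny made-up fasta file (hypothetical names and sequences)
tmp_dir=$(mktemp -d)
printf '%s\n' \
  '>seqA' \
  'ACGTTGCA' \
  '>seqB' \
  '..GTNGCA' \
  '>seqC' \
  'CCGT-GCA' > "$tmp_dir/toy.fasta"

# ^ anchors the match to the start of the line; -c counts the matching lines
n_headers=$(grep -c '^>' "$tmp_dir/toy.fasta")

# -v returns the lines that don't match, i.e. the sequence lines
sequences=$(grep -v '^>' "$tmp_dir/toy.fasta")

# \. matches a literal period, whereas an unescaped . would match any character
n_period=$(echo "$sequences" | grep -c '\.')

# [AC] matches either an A or a C; here it must be the first character
n_ac_start=$(echo "$sequences" | grep -c '^[AC]')

# -E turns on extended regular expressions, where | means "or"
n_n_or_gap=$(echo "$sequences" | grep -E -c 'N|-')

# wc -l counts lines (tr strips the padding some versions of wc add)
n_lines=$(wc -l < "$tmp_dir/toy.fasta" | tr -d ' ')

# head -n 2 outputs only the first two lines
first_two=$(head -n 2 "$tmp_dir/toy.fasta")

echo "$n_headers headers, $n_lines lines in total"
echo "$n_period with periods, $n_ac_start starting with A or C, $n_n_or_gap with an N or a gap"
```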
If you haven’t been following along, you can get caught up by doing the following:
- (windows) Install the Ubuntu Linux BASH shell for Windows 10
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install wget
- To get to where we are at the beginning of this episode (you won’t have the same issue numbers as Pat)…
- Set up a GitHub account
- Create a new GitHub repository
- Call it “Schloss_rrnAnalysis_XXXX_2020” (feel free to use your own last name)
- Make it Public
- Don’t check the box next to “Initialize this repository with a README”
- Click the green “Create repository” button
Go to your command line and enter the following, replacing <your_github_id> with your GitHub user id
git clone git@github.com:SchlossLab/Schloss_rrnAnalysis_mSphere_2020.git Schloss_rrnAnalysis_XXXX_2020
cd Schloss_rrnAnalysis_XXXX_2020
git reset --hard 56ae068
git remote set-url origin git@github.com:<your_github_id>/Schloss_rrnAnalysis_XXXX_2020.git
git push -u origin master
- Return to GitHub and refresh your browser
make data/v19/rrnDB.align
make data/v4/rrnDB.align
- Then you should now be good to go
1. How many of the sequences in data/v19/rrnDB.align have ambiguous bases (i.e. Ns) in them?
2. How many of the full-length sequences in data/v19/rrnDB.align contain the standard forward primer to amplify the V3 region (CCTACGGGAGGCAGCAG) or the V4 region (GTGCCAGCMGCCGCGGTAA)? Feel free to use . to represent the degenerate bases. Remember that the * represents the previous character occurring zero or more times. As a bonus, see if you can figure out how to modify your regular expression to represent degenerate bases.
3. The fasta sequence headers contain five fields separated by pipe characters (i.e. |). Can you generate a file that contains the five fields separated by commas (i.e. ,)? Be sure to remove the >. To stretch yourself, figure out how to give the five fields names without using a text editor.
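If you get stuck, here is one possible way to approach the exercises, sketched on a toy file rather than the real data. The file names, header fields, sequences, and column names below are all made up for illustration, and the solutions in the video may well differ. To run this against the project, you would point the commands at data/v19/rrnDB.align instead.

```shell
# Build a tiny made-up alignment with five pipe-separated header fields
tmp_dir=$(mktemp -d)
printf '%s\n' \
  '>id1|genomeA|16S|1|Escherichia_coli' \
  'ACGTCCTACGGGAGGCAGCAGTTNACGT' \
  '>id2|genomeB|16S|2|Bacillus_subtilis' \
  'ACGTGTGCCAGCAGCCGCGGTAAACGT' > "$tmp_dir/toy.align"

# Exercise 1: drop the header lines, then count the lines containing an N
n_with_n=$(grep -v '^>' "$tmp_dir/toy.align" | grep -c 'N')

# Exercise 2 (bonus form): the degenerate base M means A or C, so use [AC]
n_v4=$(grep -c 'GTGCCAGC[AC]GCCGCGGTAA' "$tmp_dir/toy.align")

# Exercise 3: keep the header lines, remove the leading >, and turn pipes
# into commas; the echo adds a line of (hypothetical) column names on top
{
  echo 'id,genome,gene,copy,species'
  grep '^>' "$tmp_dir/toy.align" | sed -e 's/^>//' -e 's/|/,/g'
} > "$tmp_dir/headers.csv"

echo "$n_with_n sequences with an N, $n_v4 with the V4 primer"
cat "$tmp_dir/headers.csv"
```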