Introduction to bioinformatics

What is bioinformatics?

Bioinformatics is simply using programming and statistics to address questions hidden in biological data! We’ve been doing bioinformatics for a long time, from tracking bird migrations to studying the spread of human disease. These days, however, the term “bioinformatics” usually refers to the analysis of sequencing data from DNA, RNA, proteins, and other biological compounds using programming tools.

Technological advances such as the rise of DNA sequencing have allowed us to massively increase the amount of biological data that we can collect. A single DNA sequencing experiment may yield a terabyte of data. That’s more storage than most MacBook Pros have! How can you analyze this raw data? Certainly not easily on your personal computer or by scrolling through lines and lines of a spreadsheet! In order to analyze these data, we need to use more advanced tools such as supercomputers and programming languages that allow us to automate and parallelize the analysis of vast amounts of data.

How do I learn about bioinformatics?

Learning bioinformatics is kind of like how you learned about math. First, you learned the basics: what are numbers? How do you add, subtract, multiply, and divide? You also learned about what you can do with those skills: how money works, how to calculate how to cut a cake so that everyone gets a slice. If you got really interested in, for example, shooting that rubber band so that it hit your brother every time, you might learn some physics to make that happen.

Bioinformatics is kind of like this. You need to learn some basic skills, but what you do with those skills (and what skills you learn afterwards) depends on the kinds of questions you’re interested in addressing.

In this post, I’m mainly going to focus on the basic skills to develop, but as you learn these skills, it’s also important to be in an environment where you can develop questions that you’re interested in addressing. If you’re anything like me, just learning how to program with no application or purpose is pretty dry. In fact, I hated it. I hated learning math when it was just abstract numbers on a page for the purposes of getting a good grade. But when I learned that I could apply those numbers and theorems to understand the world around me, that piqued my interest.

I’d highly suggest working in a lab while you learn bioinformatics. That way, you’ll automatically be around or integrated into a project with a question that you are interested in. Plus, you’ll be around folks who likely have more experience than you and who will be (relatively) happy to answer all of your little questions. ’Cause buddy, I promise you’ll have lots.

Okay, you’re working in a lab but it’s unrealistic that your mentor will have the time to teach you everything from scratch. That’s where this blog post comes in. I’ll be explaining what skills you need to know, why, and how to get them. That way, you can take care of the skills part and your mentors and collaborators can help you with the question part.

A note on vocabulary:

When you start learning bioinformatics (or programming generally), there’s a lot of specialized vocabulary that you will learn. Don’t worry, when you first started learning biology, there was a lot of vocab to learn too, right? In this post, I’ve tried to bold all of the important vocab to learn and added a glossary at the end written in common language. :) There are also lots of things you’re probably already familiar with that have another name (such as a personal computer being called a local machine). I’ve tried to highlight this language as well.

What skills are in a bioinformatician's toolkit?

Since bioinformatics is generally dealing with “big data” (think: data that’s too big to run on a personal computer), most bioinformatics tools are accessed and used on supercomputers or computing clusters. These are rows and rows of computers (think: the part of a desktop that’s not the monitor/screen) connected together to harness massive computational power! MWAHAHAHA!

Anyway, these supercomputers are pretty awesome, but know what supercomputers generally don’t have? A screen and mouse. That’s right, remember those sci-fi/hacker movies where you see a person typing away on a black screen with tons of tiny green lines in a dark room? That’s called the command line (called Terminal on a Mac) and that’s how we mainly interface with supercomputers.

This is what the command line looks like:

Say “goodbye” to nice point-and-click buttons and beautiful graphical user interfaces (known colloquially as GUIs). We’re going old-school to the command line.

Microsoft Word is an example of a graphical user interface, or GUI:

Here's a model of how your local machine might interact with a computing cluster:

I’ll have more information about supercomputers later, but for now, this is all you need to know.

Skill #1: The command line

As described above, the command line is the basic way of interfacing with any computer. You can’t click (you just move around using the arrow keys) and there’s just one place for you to type your input. The command line has its own language, called a shell language (we usually use a variant called the Bourne Again Shell, or bash), that you have to know to interface with it.

Through the command line, you can edit and move around files, send them through special programs that modify them and produce some output, and automate these things too (or do them in parallel).

Since this is the primary way of working on a computing cluster, it’s important that you become comfortable using the command line.
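To make that a bit more concrete, here is a minimal sketch of moving around and handling files from the command line. The file names (notes.txt, fruit.txt, and so on) are made up for illustration, and everything happens inside a throwaway temporary folder so that nothing else on your computer is touched.

```shell
# Make a throwaway folder to practice in and move into it
scratch=$(mktemp -d)
cd "$scratch"

# pwd = "print working directory": shows which folder you are in right now
pwd

# Create a small text file with three lines (made-up contents)
printf 'apple\nbanana\ncherry\n' > notes.txt

# cp copies a file; mv renames (or moves) a file
cp notes.txt notes_backup.txt
mv notes.txt fruit.txt

# ls lists what is in the current folder
ls

# The pipe (|) sends the output of one command into the next one:
# cat prints the file, wc -l counts its lines, and tr -d ' ' strips
# the padding spaces that some systems add around the number
line_count=$(cat fruit.txt | wc -l | tr -d ' ')
echo "fruit.txt has $line_count lines"
```

You can type these lines into your terminal one at a time to see what each one does. The courses listed below go through the same commands in much more detail.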

You can learn more about command line programming here:

Codecademy Interactive Command Line course
UNIX shell crash course basics from Happy Belly Bioinformatics
Software Carpentry Workshop: The Unix Shell
Data Carpentry Workshop: Introduction to the Command Line for Genomics

An important note: differences between Macs and PCs

Some things differ on the command line based on your operating system (computers running Windows compared to macOS or Linux, which are Unix-like systems). I’m going to proceed as if you’re using a Unix-like system, although I learned on a PC. Generally, people like using Unix-like systems to learn bioinformatics because most computing clusters run Linux. That’s why you’ll see a lot of bioinformaticians using a Mac.

Specific skills that are useful to learn:

Identifying your current folder location (directory) and moving around (examples of commands: pwd, ls, etc.)
Editing, renaming, and moving existing files (stuff in directories) (examples of commands: cp, mv, piping (|), etc.)
Opening, writing in, and running stuff in other programming languages from the command line (e.g., Python, R, Perl)
Regular expressions and scripting in bash
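As a small taste of the last item, here is a sketch that uses regular expressions with grep on a tiny, made-up sequence file. It assumes the FASTA format, where each sequence has a header line starting with ">" followed by the sequence itself. The file name and the sequences are invented for illustration.

```shell
# Work in a throwaway folder so nothing else is touched
scratch=$(mktemp -d)
cd "$scratch"

# A tiny FASTA-style file with three made-up sequences
printf '>seq1\nATGCGT\n>seq2\nGGATCC\n>seq3\nATGAAA\n' > example.fasta

# '^>' is a regular expression: "^" means "start of the line",
# so grep -c counts the lines that begin with ">" (one per sequence)
n_sequences=$(grep -c '^>' example.fasta)
echo "Number of sequences: $n_sequences"

# The same idea, counting the lines that begin with ATG
n_atg=$(grep -c '^ATG' example.fasta)
echo "Sequences starting with ATG: $n_atg"
```

If you save lines like these in a file (say count_seqs.sh, a made-up name) and run it with bash count_seqs.sh, you have written your first bash script.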

Resources for learning this:

Codecademy Introduction to the Command Line
Stanford’s Practical Unix Class (CS1U). Note about CS1U: This class is not for total beginners. I’d recommend taking a real intro to the command line unit to familiarize yourself with how the command line works and basics such as ls, pwd, cp, etc. before starting this.

This is a perfect tutorial for the absolute basics from the beginning.

Skill #2: Command line programs:

Often, we have raw data (such as lots of tiny sequenced pieces of DNA) and want to do complex things to it to make sense of it (such as patching these tiny pieces of DNA together to make a single long piece of a chromosome). Bioinformaticians have often written code that will do these various functions automatically for you and that you can use for your analysis, instead of writing things from scratch. These can be called a variety of different things, such as programs, modules, or packages.

The idea is that even though you COULD program a calculator yourself on the command line, you’d rather use an app that someone else has developed to just do it for you. That’s easier!

Of course, if you are interested in developing these methods and programs yourself, please consider joining a lab that works on this. However, the vast majority of labs just use the programs that other labs that specialize in this have developed to answer their own questions.

These programs can be written in a lot of different programming languages, such as C, C++, Python, Perl, or R. Usually, this doesn’t matter too much because all you have to know is:

What it does (what are the inputs and outputs?)
How to install it
How to run it

In this section, I’m going to go over how to do each of these things (at least at a high level).

1. How to pick a program and learn what it does:

Usually, scientists will read papers by other scientists who have addressed their research question or a similar question before and see what kinds of bioinformatics tools those other groups used.

You can see in papers when they cite a program because it is usually in a different font and has a version listed. Here’s an example with the program fastSTRUCTURE, which was developed at Stanford!

Oftentimes, programs will get developed but then get outdated or go out of style. It’s generally best to refer to recent papers when deciding which program to use, as well as read papers that describe how the method works “under the hood” and critiques/reviews of that method before deciding to go with it.

2. How to install it:

Installation is different for your local machine (personal computer) and the cluster.

On your local machine: I generally like to use Homebrew to install things on my Mac. For programs written in Python, you can also use pip install, which works on a PC too. Most bioinformatics programs have an installation website, so you can usually follow the directions there to install things on your local machine.
On the cluster: This can be quite tricky, so I would recommend contacting an IT person or system administrator for the cluster, or asking someone who has experience installing things on the cluster to help with this.

Installing something is often as difficult as (or more difficult than!) actually using it!
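One quick sanity check after installing something is to ask the shell whether it can actually find the program. Here is a minimal sketch using command -v, which is built into the shell. The helper name is_installed and the program name my_hypothetical_program are made up for illustration.

```shell
# Report whether the shell can find a program.
# "command -v" prints the program's location and succeeds if it is found.
is_installed() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1 is installed at: $(command -v "$1")"
    else
        echo "$1 was not found; it may not be installed, or it may not be on your PATH"
        return 1
    fi
}

# grep ships with essentially every Unix-like system, so this should succeed
is_installed grep

# A made-up program name, used only to show the "not found" case
is_installed my_hypothetical_program || true
```

If a program you just installed comes back as not found, that is a good moment to re-read the installation directions or ask someone for help.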

3. How to run it:

Nearly all programs have manuals that go through how to enter commands and read the input/output. I would also strongly recommend trying to replicate the example in the manual (they usually include sample data) yourself before jumping into your actual data, to make sure that you’re not inadvertently introducing user error.

Many commonly used programs have really nice tutorials online.

How to learn this:

This kind of analysis is often kind of specialized, so I would look for classes or workshops for the specific analysis you are interested in. At this level, it’s often useful to take an actual bioinformatics class.

Courses at Stanford: (high level)

Genetics 211: Genomics: familiar with Python
CS262: Computational Genomics: familiar with Python
BIOMEDIN 273B: Deep learning in genomics and biomedicine: familiar with Python
BIOS 221: Modern Statistics for Modern Biology: familiar with R

Courses away from Stanford for high-level bioinformatics: (focus on environmental bioinformatics)

Skill #3: Developing bioinformatics pipelines and scripting:

Okay, so you have a lot of data files and you have a series of programs you want to send them through. But it seems like a pain to manually send each one through each program. This is where you want to develop a bioinformatics pipeline. This is a bunch of programs that you “pipe” your data through. Often, you want it to be automated to limit your work and also the amount of potential user error. This is where scripting comes in.

Scripting is where you write a little program that automatically inputs your data into a program and sometimes, takes the output and automatically puts it into another program.

Generally, bioinformatics scripts are written in the programming languages bash or Python. Formerly, many bioinformatics scripts were written in Perl, although it has become much less popular.
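Here is a minimal sketch of what such a script can look like in bash. It is not a real analysis: everyday tools (grep and tr) stand in for the bioinformatics programs you would actually use, and the folder names, file names, and sequences are all made up. The point is the shape: loop over every input file, pipe it through a couple of steps, and collect the results.

```shell
# Work in a throwaway folder so nothing else is touched
scratch=$(mktemp -d)
cd "$scratch"
mkdir data results

# Make three tiny, made-up input files in FASTA format
printf '>a\nATGC\n>b\nGGCC\n' > data/sample1.fasta
printf '>c\nTTAA\n' > data/sample2.fasta
printf '>d\nATAT\n>e\nCGCG\n>f\nAAAA\n' > data/sample3.fasta

# Loop over every input file so that no file is handled by hand
for input_file in data/*.fasta; do
    # Strip the folder and the extension to get the sample name
    sample=$(basename "$input_file" .fasta)

    # Step 1: keep only the sequence lines (drop headers starting with ">")
    # Step 2: convert the sequences to lowercase
    # The output of step 1 is piped straight into step 2
    grep -v '^>' "$input_file" | tr 'ACGT' 'acgt' > "results/$sample.lower.txt"

    # Step 3: record how many sequences each sample had
    n_sequences=$(grep -c '^>' "$input_file")
    echo "$sample $n_sequences" >> results/summary.txt
done

# Show the combined summary (one line per sample)
cat results/summary.txt
```

Adding a fourth sample is just a matter of dropping another file into the data folder. The script itself does not change, which is exactly why scripting cuts down on user error.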

There are also a variety of tools that are used to improve the development of bioinformatics pipelines, such as workflow managers like Snakemake and Nextflow.

How to learn this:

You can check out my post on Python here: https://gitlab.com/teamnectarmicrobe/undergrads/-/wikis/17.-Coding-in-Python

Best practices for developing bioinformatics pipelines [this resource is still being developed]

Skill #4: Using a supercomputing cluster (supercomputer)

As mentioned earlier in this post, most of the analyses that you’ll be doing (especially those with many files, as described above) will be too resource-intensive to do on your personal computer. Generally, you’ll want to conduct these analyses on a supercomputing cluster. Stanford has several different clusters, but most people use one called Sherlock.

Check out this reference to Sherlock on the popular show Silicon Valley: https://www.sherlock.stanford.edu/img/richard.png

How to get started on a cluster:

Make an account – this usually requires requesting an account from an IT person or system administrator
Organize your personal directory on the cluster
Join your group/lab workspace (if you have one)
Learn how to submit jobs
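For the "organize your personal directory" step, one common approach is a folder per project with separate places for data, scripts, and results. Here is a minimal sketch, with a throwaway folder standing in for your directory on the cluster. The project and folder names are made up, so use whatever layout your lab prefers.

```shell
# Stand-in for your personal directory on the cluster (a throwaway folder here)
home_dir=$(mktemp -d)
cd "$home_dir"

# One folder per project, with separate places for data, scripts, and results.
# -p tells mkdir to create parent folders as needed and not to complain
# if a folder already exists.
mkdir -p my_project/data my_project/scripts my_project/results

# Check what was created
ls my_project
```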

You can read this nice post about how to get started: https://www.sherlock.stanford.edu/docs/overview/introduction/

Check out my post about HPCs for more information: https://www.calliechappell.com/blog/2022/7/21/introduction-to-cluster-computingsuper-computers

How do I submit a job? (And what does that mean anyway?)

A supercomputing cluster is an expensive thing to maintain, so it doesn’t make sense for everyone to have their own personal supercomputer! All clusters are shared, and the amount of time and space you’re allowed to use at a given time depends on how many other people want to use the cluster and how intensive their requests are. For large and data-intensive things, you can’t just hit “enter” on the command line on a cluster and assume that it’ll just GO, like on your personal computer.

The cluster (or really, the people who operate the cluster) has algorithms that decide how much time and space you get to use on the cluster. That means you have to submit a request for time and space (called a job) to the cluster, which then sets aside that time and space and runs your code.

You can read about how to submit jobs on Sherlock (using SLURM) here: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/
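To give a feel for what a job looks like, here is a minimal sketch of a SLURM job script. The job name and the amounts of time, memory, and CPUs are placeholders that I made up. The right values depend on your cluster and your analysis, so treat the documentation linked above as the authority.

```shell
#!/bin/bash
# Lines starting with "#SBATCH" are requests to the scheduler (SLURM).
# To bash itself they are just comments, so they must come before any commands.

# A name to recognize the job by (placeholder)
#SBATCH --job-name=my_first_job
# Time requested, as hours:minutes:seconds (placeholder)
#SBATCH --time=00:10:00
# Memory requested (placeholder)
#SBATCH --mem=1G
# Number of CPU cores requested (placeholder)
#SBATCH --cpus-per-task=1
# File that printed output is written to; %j is replaced by the job ID
#SBATCH --output=my_first_job_%j.out

# Everything below is the code that runs once the cluster grants the request.
# SLURM sets SLURM_JOB_ID for running jobs; outside of SLURM we use a placeholder.
job_id=${SLURM_JOB_ID:-not-run-by-slurm}
message="Hello from job $job_id"
echo "$message"
```

Saved as my_first_job.sh (a made-up name), a script like this would typically be submitted with sbatch my_first_job.sh, and squeue -u $USER lists your waiting and running jobs.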

Additional resources:

Here is a great post overviewing a lot of the different skills bioinformaticians develop
Bioinformatics Workbook has lots of posts on bioinformatics basics. Great site!

Advanced topics:

Data Carpentry workshop: Project organization and management for genomics
Data Carpentry workshop: Data wrangling and processing for genomics

Subject-specific resources:

Population Genetics (popgen):

Popgen Notes

Glossary:

Bioinformatics: Using programming and statistics to address questions hidden in biological data! Recently, the term “bioinformatics” usually refers to the analysis of sequencing data of DNA, RNA, proteins, and other biological compounds using programming tools.
Local machine: your personal computer
Supercomputer/computing cluster: Lots of computers connected together that you can interface with from your personal computer (local machine) through the command line
Directory: folder
File: something that is in a folder (such as a .doc, .txt, or .csv file)
Bioinformatics pipeline: A series of programs with successive inputs and outputs until you get what you’re looking for!
Scripting: writing a small program that “pipes” inputs and outputs between various programs
Submitting a job: Submitting a request for the cluster to run some code, with information about how much time and space that code will require
