Introduction to cluster computing/supercomputers

Introduction

I wrote this blog post back in 2020 specifically for Stanford undergraduates. Even though some parts are Stanford-specific, I hope this information is general enough to be helpful to everyone.

What is cluster computing/high performance computing?

Oftentimes, the data that we are analyzing are too large to fit on a single computer. Additionally, the computing power needed to analyze these data might take a single laptop weeks, if not months. To overcome this, we "offload" the analysis to a computing cluster.

A computing cluster is many computers connected together, giving you much more power than a single laptop. Stanford's computing cluster is called Sherlock β›‘ (that's a COVID-19 Sherlock).

Since there are many different clusters at different institutions (or even within the same institution!), I'll be denoting Sherlock-specific information with this emoji: β›‘ Notice the Sherlock logo... πŸ˜‰ Also, I just learned that you can use emoji in Markdown, which is what I'm using to code up these wiki pages.

Why use a cluster? Check out this quick module from Stanford: https://srcc.stanford.edu/private/why-use-cluster

Here is some more specific, high-level information about Sherlock: https://www.sherlock.stanford.edu/docs/overview/concepts/

Here is a link to a Sherlock onboarding session offered in May 2020: https://srcc.stanford.edu/events/sherlock-boarding-session

Learning the basics:

In order to effectively use the cluster, you'll need to learn the basics about cluster computing and how to use it 😊.

Here's a very introductory course from Yale: https://research.computing.yale.edu/training/introduction-hpc

Here is a link to the slides from this course: https://docs.google.com/presentation/d/1H9FgA8ER-VTJMyRDm0FRy_SH-xGDmBwcRVzY2Gk66fA/edit#slide=id.g33dc120929_0_131

This is a pretty high-level lecture, so I would encourage you to take this Data Carpentry workshop as well πŸ‘: https://psteinb.github.io/hpc-in-a-day/01-01-logging-in/

Quick note:

Everything here is useful, except that some of the details about how to log in, submit jobs, etc. are different for Sherlock than what is presented here. Don't worry too much about the details; just focus on the general ideas.

Now that you know the basics of cluster computing, you can get started on Sherlock β›‘.

Getting started

For this post, I'll assume that you have a passing understanding of the command line. Sherlock (and almost all HPC clusters) runs Linux, so these will be Unix-based commands.

1. Make an account on Sherlock β›‘ and get access to our `fukamit` group space. You can request an account [here](https://www.sherlock.stanford.edu/).

2. Log into Sherlock β›‘: Instructions can be found [here](https://www.sherlock.stanford.edu/docs/overview/introduction/) (scroll to 'quick start').

$ ssh <sunetid>@login.sherlock.stanford.edu

You can also log into Sherlock "OnDemand," a browser-based web interface, which is pretty cool!

To connect to Sherlock OnDemand, simply point your browser to https://login.sherlock.stanford.edu

Make sure to enter your Stanford credentials.
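A quick tip: if you get tired of typing the full login command, you can add an entry to your SSH config file. This is a standard SSH feature (not Sherlock-specific); `your_sunetid` below is a placeholder for your actual SUNet ID:

```
# ~/.ssh/config -- saves typing on future logins
Host sherlock
    HostName login.sherlock.stanford.edu
    User your_sunetid
```

After that, `$ ssh sherlock` is all you need.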

3. β›‘ Orient yourself: Orient yourself using `ls`, `pwd`, and `cd` within your working directory. Try to find and enter the group directory (our lab's is `fukamit`); there's a short sketch of this below the summary.

Here is more information about logging in and orienting yourself on Sherlock.

To summarize:

(from the Sherlock info site)

  • $HOME is your home folder, the entrypoint at login. It’s a high speed NFS filesystem (Isilon) and is backed up for you. You can edit the .bash_profile here to add programs and other environment variables to your path, or have modules for software you use often loaded automatically (we will write another post on modules). You should be very careful about putting large content (whether data or programs) in your $HOME, because the space can fill up quickly.

  • $SCRATCH is a larger storage space where you are encouraged to put data (and larger) files. This could include caching of container images (e.g., Singularity) or other intermediate data files.

  • $PI_SCRATCH is a scratch space allocated to a PI, which usually is best used for shared data in a lab.

4. Learn to move files onto and off of the cluster: Here is a link to a Stanford site on how to do this with Sherlock. Instead of scp, I sometimes use a graphical interface like Cyberduck, although it doesn't handle large numbers of files well.
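For example, here's a sketch of copying files with `scp` and `rsync`, run from your laptop's terminal. The filenames are just illustrative, and the `/scratch/users/<sunetid>` destination follows Sherlock's usual scratch-path convention (replace `sunetid` with your own):

```bash
# copy a single file from your laptop to your Sherlock scratch space
$ scp data.fastq sunetid@login.sherlock.stanford.edu:/scratch/users/sunetid/

# copy results back down from Sherlock to your current directory
$ scp sunetid@login.sherlock.stanford.edu:/scratch/users/sunetid/results.txt .

# rsync handles whole directories (and resuming interrupted transfers) better than scp
$ rsync -avP raw_data/ sunetid@login.sherlock.stanford.edu:/scratch/users/sunetid/raw_data/
```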

5. Learn how to access pre-installed software: Unlike your computer, where you can just type `python` or `R` and start using those languages, on the cluster you have to load a program before you can use it.

On Sherlock β›‘, you have to type in `$ module load [group] [program]`.

Here is a list of programs that Sherlock β›‘ has installed: https://www.sherlock.stanford.edu/docs/software/list/

For example, if I wanted to use [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), a common quality control program for .fastq files, I would type:

$ module load genomics fastqc
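A few other module commands come in handy. These are standard Lmod commands (the module system Sherlock uses), so they should also work on most other clusters:

```bash
$ module avail           # list modules available to load
$ module spider fastqc   # search all categories for a program
$ module list            # show what's currently loaded
$ fastqc --version       # loaded programs are now on your PATH
$ module purge           # unload everything and start fresh
```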

Here is a tutorial for accessing software on Sherlock β›‘: https://srcc.stanford.edu/private/accessing-software

6. Learn how to submit jobs on Sherlock β›‘:

On your laptop, when you hit "enter" on the terminal, it runs the line. This is true on Sherlock as well, but the commands you type run on your **login node**, a tiny slice of the cluster shared with everyone else who is logged in. The real power of Sherlock is the ability to run analyses on much larger **compute nodes**. Of course, everyone else on Sherlock wants to use those too! How does Sherlock "decide" who gets to use its amazing powers, for how long, and when? Sherlock uses a **scheduler**: it takes all of the **jobs** (commands you want to run), considers how many resources each will need (you define this in your job), and decides who gets to do what, when.

From the Sherlock β›‘ [site](https://vsoch.github.io/lessons/sherlock/), you can run two different kinds of jobs:

  • Interactive jobs means jumping yourself onto a worker node, and issuing commands interactively. If you want to test or otherwise run commands manually, you should ask for an interactive node. You should not be running things on the login node because it hinders performance for other users. Your processes will be killed if they take up too much memory.

  • Batch jobs means handing a submission script to the SLURM batch scheduler to give instruction for running one or more jobs.

Since we'll be running non-interactive, highly repetitive jobs, we'll be submitting batch jobs.

These jobs require you to fill out a special form, which takes the shape of a shell script. The scheduler uses the SLURM system; you can learn more about SLURM [here](https://slurm.schedmd.com/tutorials.html). Therefore, you have to write jobs in the format that SLURM expects. Not every cluster uses SLURM, but many (not just Sherlock β›‘) do.
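Here's a minimal sketch of what such a SLURM submission script might look like, reusing the fastqc example from earlier. The resource requests and filenames are placeholders; tune them to your actual job:

```bash
#!/bin/bash
#SBATCH --job-name=fastqc_test     # a name for your job
#SBATCH --time=01:00:00            # wall-clock time limit (hh:mm:ss)
#SBATCH --mem=4G                   # memory request
#SBATCH --cpus-per-task=1          # number of CPU cores
#SBATCH --output=fastqc_%j.out     # log file; %j becomes the job ID

# load the software the job needs, then run it
module load genomics fastqc
fastqc sample.fastq
```

Save this as, say, `fastqc_job.sh` and hand it to the scheduler with `$ sbatch fastqc_job.sh`.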

Here is a tutorial from Sherlock β›‘ on submitting jobs: https://srcc.stanford.edu/scheduling-jobs-0

Here is a high-level summary of Linux SLURM commands: https://vsoch.github.io/lessons/slurm/

Here's a SLURM quick-start tutorial (but it's pretty technical): https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html
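Once you've submitted jobs, a few core SLURM commands let you keep tabs on them (these work on any SLURM cluster, not just Sherlock; the job ID below is made up):

```bash
$ squeue -u $USER        # list your pending and running jobs
$ scancel 123456         # cancel a job by its job ID
$ sacct -j 123456        # accounting info (e.g., memory used) for a finished job
$ sinfo                  # overview of partitions and node availability
```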

7. Learn to be a good citizen on Sherlock:

Check out these best practices for data management and resource use on Sherlock. Remember, it's a shared resource!

[Part 1](https://srcc.stanford.edu/using-resources-effectively-part-1)

[Part 2](https://srcc.stanford.edu/using-resources-effectively-part-2)

Additional Resources:

Tutorials:

* [RNAseq analysis from a cluster](https://youtu.be/M3RVfv6lUtc)

* [Test your knowledge with this tutorial that includes practice quiz questions!](https://www.melbournebioinformatics.org.au/tutorials/tutorials/hpc/hpc/)

* [Sherlock-specific workshop](https://onedrive.live.com/?cid=c9227ab0bd9490df&id=C9227AB0BD9490DF%217973&authkey=%21AFJ7CHbSYX8X09A) from the Stanford Research Computing Center (SRCC)

Advanced topics:

* [How HPCs work: contextualized to scientific computing](https://www.youtube.com/watch?v=fzVT7Y3EBIM)

Workshops:

* [Bioinformatics Cloud Computing Workshop](https://med.stanford.edu/genecamp/SBCCW.html) @ Stanford (for high school students and undergraduates)