Processing Single-Cell data from Mouse

Single cell RNA sequencing (scRNA-Seq) is a fascinating way of getting insights into the molecular processes guiding an individual cell. While RNA won’t provide the full picture on the inner workings of a cell, proteins, hormones and nutrients will have a say in that too, it certainly is a part of the puzzle.

For quite some time, researchers have had the ability to not only look at one cell, but hundreds or thousands of cells. Projects such as the Human Cell Atlas show what is possible.

Today I am taking a step back and looked for a paper that published a single cell atlas of a size that I can process it in a week at home.

The choice fell on the 2015 Science publication by Zeisel et al: “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq”

In that publication scRNA-Seq was applied to cells from mouse brain tissues. I will be downloading the datasets and try to recreate figure 1B panels 2-10. Figure 1B panel 1 shows the clustering algorithm implemented by that team, which I might attempt later but have shelved for now.

Getting the Data

Lucky for me the sequencing data was deposited and can be found in ENA under accession PRJNA258094. In total over 3000 fastq files are grouped under this accession, each sequencing file seems to be a single cell, meaning the sequencing data has already been conveniently split for later use.

Looking at one of these fastq files I notice that reads are relatively short (52 bp). According to the paper each read starts with a 6bp UMI followed by 3-5 guanines, followed by the 5’ end of the transcript. Here are 3 reads from one of the datasets to get an idea of what I am talking about:

@SRR1544698.1 Run0199_AC237YACXX_L2_T1101_C32 length=52
CTAGACGGGGGGAATTCTGGACATTAATTAGGGCTGAAAGCCCTAACTTAAT
+SRR1544698.1 Run0199_AC237YACXX_L2_T1101_C32 length=52
@@@FFFFFHHHDDDDDDEEDDDDDDDDEEEDDDDDDDDDDDDDDDDDDEDDE
@SRR1544698.2 Run0199_AC237YACXX_L2_T1101_C64 length=52
ACACATGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCAT
+SRR1544698.2 Run0199_AC237YACXX_L2_T1101_C64 length=52
BCCFFFFFHHHHHIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIII

The first read then can be split into these sections:

  UMI    Gs            Transcript
CTAGAC GGGGG GAATTCTGGACATTAATTAGGGCTGAAAGCCCTAACTTAAT

Aligning transcripts directly against a genome might cause issues, as we only have exons here and this might lead to lower quality alignments than expected. As such, to process these reads I will be using software designed to handle both UMIs and gapped alignments.

After a bit of looking around and testing I decided to use the Subread package together with umi_tools. I tried using the STAR aligner but it needed to much memory for creation of the index and other software, such as Salmon is too advanced to allow me to go step by step.

Overall I will be leaning on information provided by the UMI-Tools package on how to process single-cell data with their software, This information can be found here: https://umi-tools.readthedocs.io/en/latest/Single_cell_tutorial.html

Python Environment

I am using mamba to get a python environment setup quickly. Note that I have my channels configured globally, so I am using bioconda implicitly here.

mamba create -n reproduce_single_cell python==3.11 uv==0.3.3 samtools subread umi_tools
mamba activate reproduce_single_cell
uv pip install plotnine==0.13.6 pandas==2.2.2 pysam==0.22.1 scipy==1.14.1 numpy==1.25.2 biopython==1.82 pydantic==2.8.2 dask==2024.8.2

Processing scRNA-Seq data

I already know, that I will need to process over 3000 fastq files to get count data. Only then can I move on to the analysis section. As it is quite a commitment to process 3000 files on a single desktop machine, I first set out to process a single file. Once that works, I will scale it up.

Getting the References

Besides the sequencing data I will be needing a mouse reference genome and a corresponding map of all the genes on that genome.

Both of these are easy to find on ENSEMBL.

wget -P ./data/02/ -c https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget -P ./data/02/ -c https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz

# Decompress the data
gunzip ./data/02/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
gunzip ./data/02/Mus_musculus.GRCm39.112.gtf.gz

Before using an aligner the first step is to create an index, this is also true for Subread:

subread-buildindex -o ./data/02/genome_index ./data/02/Mus_musculus.GRCm39.dna.primary_assembly.fa

Processing the Data

So now we have to index, we can start processing files.

After some testing I cam up with this script for processing each fastq file from this dataset. I saved this under ./data/02/align.sh and will be using it to process the read datasets.

#!/bin/bash

# Function to check for required dependencies
check_dependencies() {
    for cmd in umi_tools subread-align samtools featureCounts; do
        if ! command -v $cmd &> /dev/null; then
            echo "Error: $cmd is not installed. Please install it and try again."
            exit 1
        fi
    done
}

# Function to display usage
usage() {
    echo "Usage: $0 <reads_file.fastq.gz> <threads> <output_directory> <geneset_file> <genome_index>"
    exit 1
}

# Function to clean up temporary files
cleanup() {
    rm -rf "$TEMP_DIR"
}

# Check for correct number of arguments
if [ "$#" -ne 5 ]; then
    usage
fi

# Set variables
READS_FILE="$1"
THREADS="$2"
OUTPUT_DIR="$3"
GENESET_FILE="$4"
GENOME_INDEX="$5"
TEMP_DIR=$(mktemp -d)
COUNTS_OUTPUT="$OUTPUT_DIR/$(basename "$READS_FILE" .fastq.gz)_counts.tsv.gz"

# Check for dependencies
check_dependencies

# Trap to clean up temporary files on exit
trap cleanup EXIT

# Main processing
{
    UMI_READS="$TEMP_DIR/reads_umi.fastq.gz"
    GROUPED_BAM="$TEMP_DIR/grouped.bam"
    DEDUPLICATED_BAM="$TEMP_DIR/deduplicated.bam"
    ALIGNED_BAM="$TEMP_DIR/aligned.bam"
    SORTED_BAM="$TEMP_DIR/sorted.bam"
    ASSIGNED_SORTED_BAM="$TEMP_DIR/assigned_sorted.bam"
    FEATURECOUNTS_OUTPUT="$TEMP_DIR/featureCounts_output.txt"

    # Extract UMI
    umi_tools extract --bc-pattern=NNNNNN -S "$UMI_READS" -I "$READS_FILE" || { echo "Error in UMI extraction"; exit 1; }

    # Align reads
    subread-align -t 0 -i "$GENOME_INDEX" -r "$UMI_READS" -o "$ALIGNED_BAM" || { echo "Error in alignment"; exit 1; }

    # Sort aligned BAM
    samtools sort "$ALIGNED_BAM" -o "$SORTED_BAM" || { echo "Error in sorting BAM"; exit 1; }
    samtools index "$SORTED_BAM" || { echo "Error in indexing sorted BAM"; exit 1; }
   
    # Count features
    featureCounts -T "$THREADS" -a "$GENESET_FILE" -o "$FEATURECOUNTS_OUTPUT" -R BAM "$SORTED_BAM" || { echo "Error in feature counting"; exit 1; }

    # Sort assigned BAM (using the correct output from featureCounts)
    samtools sort "${SORTED_BAM}.featureCounts.bam" -o "$ASSIGNED_SORTED_BAM" || { echo "Error in sorting assigned BAM"; exit 1; }
    samtools index "$ASSIGNED_SORTED_BAM" || { echo "Error in indexing assigned sorted BAM"; exit 1; }

     # Group UMI reads
    umi_tools group --output-bam -I "$ASSIGNED_SORTED_BAM"  -S "$GROUPED_BAM" || { echo "Error in grouping UMI reads"; exit 1; }
    samtools index "$GROUPED_BAM" || { echo "Error in indexing grouped BAM"; exit 1; }

    # Deduplicate reads
    umi_tools dedup -I "$GROUPED_BAM" -S "$DEDUPLICATED_BAM" || { echo "Error in deduplication"; exit 1; }
    samtools index "$DEDUPLICATED_BAM" || { echo "Error in indexing deduplicated BAM"; exit 1; }


    # Count UMI reads
    umi_tools count --per-gene --gene-tag=XT --assigned-status-tag=XS -I "$DEDUPLICATED_BAM" -S "$COUNTS_OUTPUT" || { echo "Error in UMI counting"; exit 1; }

    # Check if counts file exists
    if [ -f "$COUNTS_OUTPUT" ]; then
        echo "Counts file saved to $COUNTS_OUTPUT"
    else
        echo "Error: Counts file not generated."
        exit 1
    fi
} || {
    echo "An error occurred during processing."
    exit 1
}

Here are the steps this script goes trough:

Extract the UMI from the first 6 bases
Align the reads to the reference using subread
Sort and index the resulting bam file
Assign each read to a transcript
Group all reads assigned to the same transcript
Deduplicate within each transcript based on the UMI
Report number of unique reads per transcript

Now lets move on and run this for all 3492 datasets.

I first get the table of all datasets, I am simply downloading that from ENA using curl:

curl -o ./data/02/dataset.tsv https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNA258094&result=read_run&fields=study_accession,sample_accession,experiment_accession,run_accession,library_strategy,read_count,experiment_title,run_alias,fastq_ftp,sra_ftp,sample_alias,sample_title&format=tsv&download=true&limit=0

Now I have a TSV file with all the metadata I need. ENA really is amazing.

To download, align and count the datasets one by one, I will be using a simple bash script. It ensures I am not cluttering my disk and am running as many jobs in parallel as my computer can stomach. It also ensures if I have to stop it, that it can continue from where it left of, instead of recomputing results I already have.

Here it is:

%%bash
#!/bin/bash

# Set variables
OUTPUT_DIR="./data/02/counts"
THREADS=3
GNUTHREADS=4
GTF="data/02/Mus_musculus.GRCm39.112.gtf"
INDEX="data/02/genome_index"
DATASET="./data/02/dataset.tsv"
ALIGNSH="./data/02/align.sh"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Get the column number of fastq_ftp
HEADER=$(head -n 1 "$DATASET")
FASTQ_COL=$(echo "$HEADER" | tr '\t' '\n' | nl -w1 -s' ' | grep -w 'fastq_ftp' | awk '{print $1}')

if [ -z "$FASTQ_COL" ]; then
    echo "Error: fastq_ftp column not found in the dataset."
    exit 1
fi

# Function to process each line
process_line() {
    line="$1"
    fastqcol=$2

    FASTQFILE=$(echo "$line" | cut -f"$fastqcol")
    FASTQFILE_NAME=$(basename "$FASTQFILE")
    COUNTS_OUTPUT="$OUTPUT_DIR/${FASTQFILE_NAME%.fastq.gz}_counts.tsv.gz"

    # Check if counts file already exists
    if [ -f "$COUNTS_OUTPUT" ]; then
        echo "Counts file already exists for: $FASTQFILE_NAME, skipping."
        return
    fi

    # Download the FASTQ file
    echo "Downloading FASTQ file from: $FASTQFILE"
    wget -c -q "$FASTQFILE" -O "data/$FASTQFILE_NAME" || { echo "Error downloading $FASTQFILE"; exit 1; }

    # Run the align.sh script
    bash "$ALIGNSH" "data/$FASTQFILE_NAME" "$THREADS" "$OUTPUT_DIR" "$GTF" "$INDEX" > /dev/null 2> /dev/null

    # Delete the FASTQ file after processing
    rm "data/$FASTQFILE_NAME"
}

export -f process_line
export OUTPUT_DIR THREADS GTF INDEX ALIGNSH

# Loop through each line in the dataset, skipping the header and use parallel
tail -n +2 "$DATASET"  | parallel -j "$GNUTHREADS" process_line {} $FASTQ_COL

To run this, took me a few days. But in the end I now have 3492 tsv files with counts for each cell.

Having a Look

All of the files together will form the count matrix. With genes in rows and cells in columns.

I am loading this with dask, as it is faster than relying on pandas.

import os
from glob import glob

import dask.dataframe as dd
import pandas as pd
from tqdm import tqdm

dataset_cache_location = "./data/02/count_table.csv.gz"
if os.path.exists(dataset_cache_location):
    df = pd.read_csv(dataset_cache_location, index_col=0)
else:
    # Get all file paths
    files = glob("data/02/counts/*")

    # Initialize an empty list to hold Dask DataFrames
    dataframes = []

    # Loop through each file
    for file in tqdm(files, desc="Loading dataframes"):
        df = dd.read_csv(file, sep="\t", blocksize=None)
        filename = os.path.basename(file)
        new_column_name = filename.split("_")[0]
        df["cell"] = new_column_name
        dataframes.append(df)

    # Now I concatenate the dataframes and construct a full matrix
    combined_df = dd.concat(dataframes)
    combined_df = combined_df.compute()
    df = combined_df.pivot_table(
        index="gene", columns="cell", values="count", fill_value=0
    )
    del combined_df, dataframes
    df.to_csv(dataset_cache_location, index=True)

df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

The count table I obtained has 49589 rows (genes) and 3492 columns (cells). Using that count table a first good QC plot is to show the number of unique reads per cell:

import plotnine as p9

df_colsums = df.sum().reset_index()
df_colsums.columns = ["cell", "gene_count"]
(
    p9.ggplot(df_colsums, p9.aes(x="gene_count"))
    + p9.geom_histogram(bins=100, fill="blue", color="black")
    + p9.labs(
        title="Number of Reads per Cell",
        x="Reads per Cell",
        y="Frequency",
        subtitle="Minimum of 2500 shown by red line",
    )
    + p9.geom_vline(xintercept=2500, color="red", linetype="dashed")
    + p9.scale_y_log10()
    + p9.theme_light()
)

/home/paul/.local/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log10
/home/paul/.local/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log10

png

Now I have a table with ENSEMBLE ids but to compare to the authors count able I need gene names, so I will use the GTF I have to quickly get a mapping from ENSEMBLE ID to gene name. This yields a mapping of 57.186 ENSEMBLE ids to their gene names in seconds, and is much faster than blasting the REST API for hours. Also as this is the same GTF file I used during processing earlier, it is guaranteed that all IDs I need are present.

ensemble_mapping = {}

gtf_file = "data/02/Mus_musculus.GRCm39.112.gtf"
gtf_df = pd.read_csv(
    gtf_file, sep="\t", comment="#", header=None, low_memory=False
)
gtf_df.columns = [
    "seqname",
    "source",
    "feature",
    "start",
    "end",
    "score",
    "strand",
    "frame",
    "attribute",
]
gtf_df = gtf_df[gtf_df.feature == "gene"]


def obtain_attributes(attribute_str: str) -> dict:
    attributes = {}
    for e in attribute_str.split(";"):
        e = e.strip()
        if e:
            try:
                a, b = e.split(" ", 1)
                attributes[a] = b.strip('"')
            except ValueError:
                print(f"Warning: Could not parse attribute '{e}'")
    return attributes


for attribute_str in gtf_df.attribute.values:
    attributes = obtain_attributes(attribute_str)
    ensemble_mapping[attributes.get("gene_id")] = attributes.get("gene_name")

all_ids_present = len(
    set(df.index.to_list()).intersection(set(ensemble_mapping.keys()))
) == len(set(df.index.to_list()))

print(
    f"Created mapping for {len(ensemble_mapping)} IDs, which contains all IDs we need: {all_ids_present}"
)

Created mapping for 57186 IDs, which contains all IDs we need: True

This will be helpfull later when I want to have the gene name instead of the ENSEMBL id.

Reproducing Figure 1B

Figure 1B, shows a t-SNE of the counts. I can quickly plot such a t-SNE using scipy.

import pandas as pd
import plotnine as p9
from sklearn.manifold import TSNE


def compute_tsne(df):
    tsne = TSNE(n_components=2, random_state=12345, perplexity=30)
    tsne_results = tsne.fit_transform(df.drop(columns="cell"))

    # Create a DataFrame for the t-SNE results
    tsne_df = pd.DataFrame(tsne_results, columns=["TSNE1", "TSNE2"])
    tsne_df["cell"] = df["cell"]
    return tsne_df


def plot_tsne(df):
    tsne_plot = (
        p9.ggplot(df, p9.aes(x="TSNE1", y="TSNE2"))
        + p9.geom_point(size=3, stroke=0, fill="#4287f5")
        + p9.geom_point(data=df, size=3, fill="none", stroke=0.1)
        + p9.labs(
            title=f"t-SNE Showing {len(df)} Cells",
            subtitle="Each Cell is a Point",
            x="t-SNE Component 1",
            y="t-SNE Component 2",
        )
        + p9.theme_light()
    )

    return tsne_plot

df_transposed = df.T.copy().reset_index()
df_transposed.rename(columns={"index": "cell"}, inplace=True)
df_transposed = df_transposed.rename_axis(None, axis=1).reset_index(drop=True)
tsne_df = compute_tsne(df_transposed)

# Make the plot
plot_tsne(tsne_df)

png

That worked well. But of course I am missing the colouring by intensity. I can add that. For now I will plot the data for the gene Thy1:

def plot_tsne_intensity(
    df: pd.DataFrame, count_matrix: pd.DataFrame, gene: str, name_mapping: dict
):
    df["intensity"] = count_matrix[name_mapping[gene]]
    quantiles = df["intensity"].quantile([0.01, 0.99]).values.tolist()
    plot_data = df[df["intensity"] > 0]

    tsne_plot = (
        p9.ggplot(plot_data, p9.aes(x="TSNE1", y="TSNE2", fill="intensity"))
        + p9.geom_point(size=3, stroke=0)
        + p9.geom_point(
            data=df[df["intensity"] == 0], fill="#4a52ed", size=3, stroke=0
        )
        + p9.geom_point(data=df, size=3, fill="none", stroke=0.1)
        + p9.labs(
            title=f"t-SNE Showing {len(df)} Cells",
            subtitle=f"Each Cell is a Point colored by the count of gene {gene}".title(),
            x="t-SNE Component 1",
            y="t-SNE Component 2",
        )
        + p9.theme_light()
        + p9.scale_fill_gradient2(
            low="#4a52ed",
            mid="white",
            high="red",
            midpoint=quantiles[0],
            limits=(0, quantiles[1]),
            na_value="red",
            name="Count",
        )
    )

    return tsne_plot


name_mapping = {v: k for k, v in ensemble_mapping.items()}
plot_tsne_intensity(
    tsne_df, df_transposed, gene="Thy1", name_mapping=name_mapping
)

png

This plot shows that the Thy1 gene, which I choose as it is one of the genes highlighted in Figure 1B, has a very distinct distribution across the cells. Now that the plotting function is working I can make a first draft of the 10 panel figure needed to reproduce figure 1B. I adjust the plotting function to remove all of the labels and legends, so that the figure appears as close to the original as possible:

import os
import tempfile

import matplotlib.image as mpimg
import matplotlib.pyplot as plt


def plot_tsne_clean(df):
    tsne_plot = (
        p9.ggplot(df, p9.aes(x="TSNE1", y="TSNE2"))
        + p9.geom_point(size=3, stroke=0, fill="#4287f5")
        + p9.geom_point(data=df, size=3, fill="none", stroke=0.1)
        + p9.labs(
            title=f"{len(df)} Cells",
            x="t-SNE Component 1",
            y="t-SNE Component 2",
        )
        + p9.theme_light()
        + p9.theme(
            plot_title=p9.element_text(size=50),
            axis_title=p9.element_blank(),
            axis_text=p9.element_blank(),
            legend_position="none",
            panel_grid=p9.element_blank(),
            panel_background=p9.element_blank(),
            panel_border=p9.element_blank(),
        )
    )

    return tsne_plot


def plot_tsne_intensity_clean(
    df: pd.DataFrame, count_matrix: pd.DataFrame, gene: str, name_mapping: dict
):
    df["intensity"] = count_matrix[name_mapping[gene]]
    quantiles = df["intensity"].quantile([0.01, 0.99]).values.tolist()
    plot_data = df[df["intensity"] > 0]

    tsne_plot = (
        p9.ggplot(plot_data, p9.aes(x="TSNE1", y="TSNE2", fill="intensity"))
        + p9.geom_point(size=3, stroke=0)
        + p9.geom_point(
            data=df[df["intensity"] == 0], fill="#4a52ed", size=3, stroke=0
        )
        + p9.labs(title=gene)
        + p9.theme_light()
        + p9.scale_fill_gradient2(
            low="#4a52ed",
            mid="white",
            high="red",
            midpoint=quantiles[0],
            limits=(0, quantiles[1]),
            na_value="red",
            name="Count",
        )
        + p9.theme(
            plot_title=p9.element_text(size=50),
            axis_title=p9.element_blank(),
            axis_text=p9.element_blank(),
            legend_position="none",
            panel_grid=p9.element_blank(),
            panel_background=p9.element_blank(),
            panel_border=p9.element_blank(),
        )
    )

    return tsne_plot


def make_10_panel_figure(data: pd.DataFrame):
    df_transposed = df.T.copy().reset_index()
    df_transposed.rename(columns={"index": "cell"}, inplace=True)
    df_transposed = df_transposed.rename_axis(None, axis=1).reset_index(
        drop=True
    )

    tsne_df = compute_tsne(df_transposed)

    name_mapping = {v: k for k, v in ensemble_mapping.items()}
    gene_list_ordered = [
        "Thy1",
        "Gad1",
        "Tbr1",
        "Spink8",
        "Mbp",
        "Aldoc",
        "Aif1",
        "Cldn5",
        "Acta2",
    ]
    image_files = []
    with tempfile.TemporaryDirectory() as folder:
        # first plot is just the cells:
        filename = os.path.join(folder, f"plot_tnse.png")
        plot_tsne_clean(tsne_df).save(
            filename=filename,
            width=6,
            height=6,
            verbose=False,
            dpi=300,
        )
        image_files.append(filename)

        for i, gene in enumerate(gene_list_ordered):
            filename = os.path.join(folder, f"plot_{i}_{gene}.png")
            p = plot_tsne_intensity_clean(
                df=tsne_df,
                count_matrix=df_transposed,
                gene=gene,
                name_mapping=name_mapping,
            )
            p9.ggsave(
                p,
                filename=filename,
                width=6,
                height=6,
                verbose=False,
                dpi=300,
            )
            image_files.append(filename)

        # Create a figure with 2 rows and 4 columns
        fig, axs = plt.subplots(2, 5, figsize=(15, 7))

        # Loop through the first 8 images and display them
        for i in range(2):
            for j in range(5):
                img_index = i * 5 + j
                img_path = image_files[img_index]
                img = mpimg.imread(img_path)
                axs[i, j].imshow(img)
                axs[i, j].axis("off")  # Hide the axes

        # Adjust layout
        plt.tight_layout()

        return plt


plot = make_10_panel_figure(df)
plot.show()

png

And there it is. Figure 1B without the clustering. There are some differences though I want to go over.

Firstly: The authors describe that before creating the clusters they filtered the data for certain criteria. I am not doing that here, as for the sake of plotting a t-SNE of the expression values this filtering would change the plots only slightly.

Secondly: If you compare my plot directly with the authors version you will see it looks nothing like it. Why am I claiming it is the same result? t-SNE has stochastic in the name, meaning without knowing the exact state of the algorithm used in the paper I will likely never produce the same plot. Also the software they used to create the t-SNE is not the same that I use now. As such the position of the points will be different.

Nevertheless if we look at the signal between different genes, it is clear to see that the overarching patterns are the same.

Thats all I have today. I hope this was interesting to you.

Getting the Data#

Python Environment#

Processing scRNA-Seq data#

Getting the References#

Processing the Data#

Having a Look#

Reproducing Figure 1B#