Guide Calling for Dual Guides

Run through the 10x Cellranger pipeline and velocyto for single cell RNAseq quatification and using (2) guides quantifiction. all found in the cellranger files folder bash
Guide Calling for dual guide. Use repogle method to take molecule.h5 generated by cellranger and py to run through repogle version of guide calling or use cellranger_guidecalling.ipynb for Direct Capture Perturb-Seq dual guide. Formed guide-specific lists of cells.
Pseudobulk analysis. A. Seperation of guide-specific fastq files. bash B. Whippet pseudobulk for transcript specific analysis, post UMI deduplication. bash C. Transcript quality control. R D. Whippet result visualisation.
Normalisation of adata object and E-distance of KD
Check gene and neighboring gene expression
Create individual umaps per gene of interest
1. UMAPs
2. Rand Index score
Cell phase assignment model from FUCCI-matched single cell paper (GSE146773_)
Differential Expression analysis.
1. Find the shared P1 and P2 genes.
2. Check the shared P1 and P2 across different protospacers with the same A/B and C/D.
CNV Score & Numbat to quantify and Velocity quantification with loom file
ESR1-specific analysis from proliferation analysis to rt-qpcr
Spectra analysis and visualisation for pathway enrichment

This specific stage handles “Guide Calling” and filtering—essentially identifying which CRISPR guides are in which individual cell and ensuring only the “ideal” cells (those with the correct combination of guides) are used for downstream analysis. The process begins by reading the output from Cell Ranger, which is the standard tool for processing 10x Genomics data.

What is happening: The code loads protospacer_calls_per_cell.csv. This file contains raw data on which “features” (guides) were detected in each cell barcode and how many UMIs (Unique Molecular Identifiers) were associated with each.

Normalization: Since one cell barcode can have multiple guides, the code “stacks” and “resets” the index to create a clean list where every row is a unique cell_barcode + guide_identity pair.

[8]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

[12]:

#umis vs coverage
#so unique molecular identifiers mean that you have a molecule that is present
#coverage is the number of reads from each umi

protospacer_calls=pd.read_csv("../../cellranger_output/crispr_analysis/protospacer_calls_per_cell.csv")
protospacer_calls.head()

[12]:

	cell_barcode	num_features	feature_call	num_umis
0	AAACGGGTCTTAGAGC-1	2	Non-Targeting\|sgNegCtrl3b	3694\|2106
1	AAAGTAGCAGCTGGCT-1	2	Non-Targeting\|sgPKM_MPD	1335\|19
2	AACCATGCACCAGTTA-1	14	Non-Targeting\|non-targeting_02847\|sgDHX30_MPA\|...	27\|45\|11\|14\|71\|20\|28\|76\|36\|55\|17\|35\|6\|19
3	AACTCAGCAAGCCATT-1	2	Non-Targeting\|sgNegCtrl3b	1813\|550
4	AACTCAGGTATCACCA-1	2	Non-Targeting\|sgNegCtrl3b	1861\|518

[13]:

feature_call=protospacer_calls.set_index(['cell_barcode',"num_features"])['feature_call'].str.split('|', expand=True).stack().reset_index()
feature_call.reset_index(inplace=True)
feature_call=feature_call.drop("level_2", axis=1)


# form a dataframe of cellb arcode, feature call and num_umis
num_umi=protospacer_calls.set_index(['cell_barcode',"num_features"])['num_umis'].str.split('|', expand=True).stack().reset_index()
num_umi.reset_index(inplace=True)
num_umi=num_umi.drop("level_2", axis=1)
feature_call_stacked=num_umi.merge(feature_call,on=["cell_barcode", "index","num_features"])

feature_call_stacked.columns=['index', 'cell_barcode', 'num_features', 'UMI_count', 'guide_identity']
print(feature_call_stacked[feature_call_stacked["cell_barcode"]=="CGAATGTGTCTAGTCA-1"])

     index        cell_barcode  num_features UMI_count guide_identity
944    944  CGAATGTGTCTAGTCA-1             1      1315  Non-Targeting

[ ]:

read_info=pd.read_csv("../../cellranger_output/molinfo_data_crispr_only.csv",names=["cell","umi","gem_group","gene","reads","library"],skiprows=1)
gene_name=pd.read_csv("../../cellranger_output/molinfo_genes.csv")

read_info["cell_barcode"]=read_info["cell"]+"-1"
read_info=read_info.merge(gene_name,right_index=True, left_on="gene", how="inner")
read_info["guide_identity"]=read_info["0"].str.decode('utf-8')
read_info=read_info[["cell_barcode","guide_identity","reads"]]
read_umi_check=read_info.groupby(["cell_barcode","guide_identity"]).sum().reset_index()

read_umi=pd.read_csv("../../cellranger_output/crispr_analysis/read_umi_identity.csv", names=["cell_barcode","guide_identity","read_count"], skiprows=1)
read_umi.head()

	cell_barcode	guide_identity	read_count
6	AAACCCAAGAAACCCG-1	non-targeting_00373	1.0
7	AAACCCAAGAAACCCG-1	sgCALR_MPA	2.0
8	AAACCCAAGAAACCCG-1	sgCHD8_MPC	1.0
9	AAACCCAAGAAACCCG-1	sgSRSF5_APC	1.0
10	AAACCCAAGAAACCCG-1	sgGTF2F1_APA	1.0

[ ]:

read_umi_check=read_umi.groupby(["cell_barcode","guide_identity"]).sum().reset_index()
print(read_umi_check.head())

         cell_barcode       guide_identity  read_count
0  AAACCCAAGAAACCCG-1  non-targeting_00373         1.0
1  AAACCCAAGAAACCCG-1           sgCALR_MPA         2.0
2  AAACCCAAGAAACCCG-1           sgCHD8_MPC         1.0
3  AAACCCAAGAAACCCG-1         sgGTF2F1_APA         1.0
4  AAACCCAAGAAACCCG-1         sgNFE2L2_MPD         1.0

[ ]:

feature_call_stacked=read_umi_check.merge(feature_call_stacked,on=["cell_barcode","guide_identity"]).drop_duplicates()
feature_call_stacked['UMI_count']=feature_call_stacked['UMI_count'].astype('float')
feature_call_stacked['coverage'] = feature_call_stacked['read_count']/feature_call_stacked['UMI_count']
feature_call_stacked["coverage"].value_counts()

[ ]:

##### cell_barcode,guide_identity,read_count,UMI_count,coverage,gemgroup,good_coverage,number_of_cells
##output table
output_dir="../../cellranger_output/"
feature_call_stacked['good_coverage']=True
feature_call_stacked['number_of_cells']=1.0
feature_call_stacked["gemgroup"]=1

tst=feature_call_stacked[['cell_barcode','guide_identity','read_count','UMI_count','coverage','gemgroup','good_coverage','number_of_cells']]
tst.to_csv(output_dir+'cell_identities.csv',index=False)