Guide Calling for Dual Guides
Run through the 10x Cellranger pipeline and velocyto for single cell RNAseq quatification and using (2) guides quantifiction. all found in the cellranger files folder bash
Guide Calling for dual guide. Use repogle method to take molecule.h5 generated by cellranger and py to run through repogle version of guide calling or use cellranger_guidecalling.ipynb for Direct Capture Perturb-Seq dual guide. Formed guide-specific lists of cells.
Pseudobulk analysis. A. Seperation of guide-specific fastq files. bash B. Whippet pseudobulk for transcript specific analysis, post UMI deduplication. bash C. Transcript quality control. R D. Whippet result visualisation.
Normalisation of adata object and E-distance of KD
Check gene and neighboring gene expression
Create individual umaps per gene of interest
UMAPs
Rand Index score
Cell phase assignment model from FUCCI-matched single cell paper (GSE146773_)
Differential Expression analysis.
Find the shared P1 and P2 genes.
Check the shared P1 and P2 across different protospacers with the same A/B and C/D.
CNV Score & Numbat to quantify and Velocity quantification with loom file
ESR1-specific analysis from proliferation analysis to rt-qpcr
Spectra analysis and visualisation for pathway enrichment
This specific stage handles “Guide Calling” and filtering—essentially identifying which CRISPR guides are in which individual cell and ensuring only the “ideal” cells (those with the correct combination of guides) are used for downstream analysis. The process begins by reading the output from Cell Ranger, which is the standard tool for processing 10x Genomics data.
What is happening: The code loads protospacer_calls_per_cell.csv. This file contains raw data on which “features” (guides) were detected in each cell barcode and how many UMIs (Unique Molecular Identifiers) were associated with each.
Normalization: Since one cell barcode can have multiple guides, the code “stacks” and “resets” the index to create a clean list where every row is a unique cell_barcode + guide_identity pair.
[8]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
[12]:
#umis vs coverage
#so unique molecular identifiers mean that you have a molecule that is present
#coverage is the number of reads from each umi
protospacer_calls=pd.read_csv("../../cellranger_output/crispr_analysis/protospacer_calls_per_cell.csv")
protospacer_calls.head()
[12]:
| cell_barcode | num_features | feature_call | num_umis | |
|---|---|---|---|---|
| 0 | AAACGGGTCTTAGAGC-1 | 2 | Non-Targeting|sgNegCtrl3b | 3694|2106 |
| 1 | AAAGTAGCAGCTGGCT-1 | 2 | Non-Targeting|sgPKM_MPD | 1335|19 |
| 2 | AACCATGCACCAGTTA-1 | 14 | Non-Targeting|non-targeting_02847|sgDHX30_MPA|... | 27|45|11|14|71|20|28|76|36|55|17|35|6|19 |
| 3 | AACTCAGCAAGCCATT-1 | 2 | Non-Targeting|sgNegCtrl3b | 1813|550 |
| 4 | AACTCAGGTATCACCA-1 | 2 | Non-Targeting|sgNegCtrl3b | 1861|518 |
[13]:
feature_call=protospacer_calls.set_index(['cell_barcode',"num_features"])['feature_call'].str.split('|', expand=True).stack().reset_index()
feature_call.reset_index(inplace=True)
feature_call=feature_call.drop("level_2", axis=1)
# form a dataframe of cellb arcode, feature call and num_umis
num_umi=protospacer_calls.set_index(['cell_barcode',"num_features"])['num_umis'].str.split('|', expand=True).stack().reset_index()
num_umi.reset_index(inplace=True)
num_umi=num_umi.drop("level_2", axis=1)
feature_call_stacked=num_umi.merge(feature_call,on=["cell_barcode", "index","num_features"])
feature_call_stacked.columns=['index', 'cell_barcode', 'num_features', 'UMI_count', 'guide_identity']
print(feature_call_stacked[feature_call_stacked["cell_barcode"]=="CGAATGTGTCTAGTCA-1"])
index cell_barcode num_features UMI_count guide_identity
944 944 CGAATGTGTCTAGTCA-1 1 1315 Non-Targeting
[ ]:
read_info=pd.read_csv("../../cellranger_output/molinfo_data_crispr_only.csv",names=["cell","umi","gem_group","gene","reads","library"],skiprows=1)
gene_name=pd.read_csv("../../cellranger_output/molinfo_genes.csv")
read_info["cell_barcode"]=read_info["cell"]+"-1"
read_info=read_info.merge(gene_name,right_index=True, left_on="gene", how="inner")
read_info["guide_identity"]=read_info["0"].str.decode('utf-8')
read_info=read_info[["cell_barcode","guide_identity","reads"]]
read_umi_check=read_info.groupby(["cell_barcode","guide_identity"]).sum().reset_index()
read_umi=pd.read_csv("../../cellranger_output/crispr_analysis/read_umi_identity.csv", names=["cell_barcode","guide_identity","read_count"], skiprows=1)
read_umi.head()
| cell_barcode | guide_identity | read_count | |
|---|---|---|---|
| 6 | AAACCCAAGAAACCCG-1 | non-targeting_00373 | 1.0 |
| 7 | AAACCCAAGAAACCCG-1 | sgCALR_MPA | 2.0 |
| 8 | AAACCCAAGAAACCCG-1 | sgCHD8_MPC | 1.0 |
| 9 | AAACCCAAGAAACCCG-1 | sgSRSF5_APC | 1.0 |
| 10 | AAACCCAAGAAACCCG-1 | sgGTF2F1_APA | 1.0 |
[ ]:
read_umi_check=read_umi.groupby(["cell_barcode","guide_identity"]).sum().reset_index()
print(read_umi_check.head())
cell_barcode guide_identity read_count
0 AAACCCAAGAAACCCG-1 non-targeting_00373 1.0
1 AAACCCAAGAAACCCG-1 sgCALR_MPA 2.0
2 AAACCCAAGAAACCCG-1 sgCHD8_MPC 1.0
3 AAACCCAAGAAACCCG-1 sgGTF2F1_APA 1.0
4 AAACCCAAGAAACCCG-1 sgNFE2L2_MPD 1.0
[ ]:
feature_call_stacked=read_umi_check.merge(feature_call_stacked,on=["cell_barcode","guide_identity"]).drop_duplicates()
feature_call_stacked['UMI_count']=feature_call_stacked['UMI_count'].astype('float')
feature_call_stacked['coverage'] = feature_call_stacked['read_count']/feature_call_stacked['UMI_count']
feature_call_stacked["coverage"].value_counts()
[ ]:
##### cell_barcode,guide_identity,read_count,UMI_count,coverage,gemgroup,good_coverage,number_of_cells
##output table
output_dir="../../cellranger_output/"
feature_call_stacked['good_coverage']=True
feature_call_stacked['number_of_cells']=1.0
feature_call_stacked["gemgroup"]=1
tst=feature_call_stacked[['cell_barcode','guide_identity','read_count','UMI_count','coverage','gemgroup','good_coverage','number_of_cells']]
tst.to_csv(output_dir+'cell_identities.csv',index=False)