{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Guide Calling for Dual Guides \n",
    "\n",
    "1. Run through the 10x Cellranger pipeline and velocyto for single cell RNAseq quatification and using (2) guides quantifiction. all found in the cellranger files folder **bash**\n",
    "2.  **Guide Calling for dual guide. Use repogle method to take molecule.h5 generated by cellranger and py to run through repogle version of guide calling or use cellranger_guidecalling.ipynb for Direct Capture Perturb-Seq dual guide. Formed guide-specific lists of cells.**\n",
    "3. Pseudobulk analysis.\n",
    "    A. Seperation of guide-specific fastq files. **bash**\n",
    "    B. Whippet pseudobulk for transcript specific analysis, post UMI deduplication.  **bash**\n",
    "    C. Transcript quality control. **R**\n",
    "    D. Whippet result visualisation.\n",
    "\n",
    "4. Normalisation of adata object and E-distance of KD\n",
    "5. Check gene and neighboring gene expression\n",
    "6. Create individual umaps per gene of interest \n",
    "\n",
    "    A. UMAPs \n",
    "    \n",
    "    B. Rand Index score\n",
    "7. Cell phase assignment model from FUCCI-matched single cell paper (GSE146773_)\n",
    "8. Differential Expression analysis.\n",
    "\n",
    "    A. Find the shared P1 and P2 genes. \n",
    "\n",
    "    B. Check the shared P1 and P2 across different protospacers with the same A/B and C/D.\n",
    "    \n",
    "9.  CNV Score & Numbat to quantify and Velocity quantification with loom file\n",
    "10. ESR1-specific analysis from proliferation analysis to rt-qpcr\n",
    "11.  Spectra analysis and visualisation for pathway enrichment\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This specific stage handles \"Guide Calling\" and filtering—essentially identifying which CRISPR guides are in which individual cell and ensuring only the \"ideal\" cells (those with the correct combination of guides) are used for downstream analysis. The process begins by reading the output from Cell Ranger, which is the standard tool for processing 10x Genomics data.\n",
    "\n",
    "What is happening: The code loads protospacer_calls_per_cell.csv. This file contains raw data on which \"features\" (guides) were detected in each cell barcode and how many UMIs (Unique Molecular Identifiers) were associated with each.\n",
    "\n",
    "Normalization: Since one cell barcode can have multiple guides, the code \"stacks\" and \"resets\" the index to create a clean list where every row is a unique cell_barcode + guide_identity pair.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cell_barcode</th>\n",
       "      <th>num_features</th>\n",
       "      <th>feature_call</th>\n",
       "      <th>num_umis</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAACGGGTCTTAGAGC-1</td>\n",
       "      <td>2</td>\n",
       "      <td>Non-Targeting|sgNegCtrl3b</td>\n",
       "      <td>3694|2106</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AAAGTAGCAGCTGGCT-1</td>\n",
       "      <td>2</td>\n",
       "      <td>Non-Targeting|sgPKM_MPD</td>\n",
       "      <td>1335|19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AACCATGCACCAGTTA-1</td>\n",
       "      <td>14</td>\n",
       "      <td>Non-Targeting|non-targeting_02847|sgDHX30_MPA|...</td>\n",
       "      <td>27|45|11|14|71|20|28|76|36|55|17|35|6|19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AACTCAGCAAGCCATT-1</td>\n",
       "      <td>2</td>\n",
       "      <td>Non-Targeting|sgNegCtrl3b</td>\n",
       "      <td>1813|550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AACTCAGGTATCACCA-1</td>\n",
       "      <td>2</td>\n",
       "      <td>Non-Targeting|sgNegCtrl3b</td>\n",
       "      <td>1861|518</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         cell_barcode  num_features  \\\n",
       "0  AAACGGGTCTTAGAGC-1             2   \n",
       "1  AAAGTAGCAGCTGGCT-1             2   \n",
       "2  AACCATGCACCAGTTA-1            14   \n",
       "3  AACTCAGCAAGCCATT-1             2   \n",
       "4  AACTCAGGTATCACCA-1             2   \n",
       "\n",
       "                                        feature_call  \\\n",
       "0                          Non-Targeting|sgNegCtrl3b   \n",
       "1                            Non-Targeting|sgPKM_MPD   \n",
       "2  Non-Targeting|non-targeting_02847|sgDHX30_MPA|...   \n",
       "3                          Non-Targeting|sgNegCtrl3b   \n",
       "4                          Non-Targeting|sgNegCtrl3b   \n",
       "\n",
       "                                   num_umis  \n",
       "0                                 3694|2106  \n",
       "1                                   1335|19  \n",
       "2  27|45|11|14|71|20|28|76|36|55|17|35|6|19  \n",
       "3                                  1813|550  \n",
       "4                                  1861|518  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#umis vs coverage\n",
    "#so unique molecular identifiers mean that you have a molecule that is present\n",
    "#coverage is the number of reads from each umi \n",
    "\n",
    "protospacer_calls=pd.read_csv(\"../../cellranger_output/crispr_analysis/protospacer_calls_per_cell.csv\")\n",
    "protospacer_calls.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     index        cell_barcode  num_features UMI_count guide_identity\n",
      "944    944  CGAATGTGTCTAGTCA-1             1      1315  Non-Targeting\n"
     ]
    }
   ],
   "source": [
    "feature_call=protospacer_calls.set_index(['cell_barcode',\"num_features\"])['feature_call'].str.split('|', expand=True).stack().reset_index()\n",
    "feature_call.reset_index(inplace=True)\n",
    "feature_call=feature_call.drop(\"level_2\", axis=1)\n",
    "\n",
    "\n",
    "# form a dataframe of cellb arcode, feature call and num_umis\n",
    "num_umi=protospacer_calls.set_index(['cell_barcode',\"num_features\"])['num_umis'].str.split('|', expand=True).stack().reset_index()\n",
    "num_umi.reset_index(inplace=True)\n",
    "num_umi=num_umi.drop(\"level_2\", axis=1)\n",
    "feature_call_stacked=num_umi.merge(feature_call,on=[\"cell_barcode\", \"index\",\"num_features\"])\n",
    "\n",
    "feature_call_stacked.columns=['index', 'cell_barcode', 'num_features', 'UMI_count', 'guide_identity']\n",
    "print(feature_call_stacked[feature_call_stacked[\"cell_barcode\"]==\"CGAATGTGTCTAGTCA-1\"])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cell_barcode</th>\n",
       "      <th>guide_identity</th>\n",
       "      <th>read_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>AAACCCAAGAAACCCG-1</td>\n",
       "      <td>non-targeting_00373</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>AAACCCAAGAAACCCG-1</td>\n",
       "      <td>sgCALR_MPA</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>AAACCCAAGAAACCCG-1</td>\n",
       "      <td>sgCHD8_MPC</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>AAACCCAAGAAACCCG-1</td>\n",
       "      <td>sgSRSF5_APC</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>AAACCCAAGAAACCCG-1</td>\n",
       "      <td>sgGTF2F1_APA</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          cell_barcode       guide_identity  read_count\n",
       "6   AAACCCAAGAAACCCG-1  non-targeting_00373         1.0\n",
       "7   AAACCCAAGAAACCCG-1           sgCALR_MPA         2.0\n",
       "8   AAACCCAAGAAACCCG-1           sgCHD8_MPC         1.0\n",
       "9   AAACCCAAGAAACCCG-1          sgSRSF5_APC         1.0\n",
       "10  AAACCCAAGAAACCCG-1         sgGTF2F1_APA         1.0"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "read_info=pd.read_csv(\"../../cellranger_output/molinfo_data_crispr_only.csv\",names=[\"cell\",\"umi\",\"gem_group\",\"gene\",\"reads\",\"library\"],skiprows=1)\n",
    "gene_name=pd.read_csv(\"../../cellranger_output/molinfo_genes.csv\")\n",
    "\n",
    "read_info[\"cell_barcode\"]=read_info[\"cell\"]+\"-1\"\n",
    "read_info=read_info.merge(gene_name,right_index=True, left_on=\"gene\", how=\"inner\")\n",
    "read_info[\"guide_identity\"]=read_info[\"0\"].str.decode('utf-8') \n",
    "read_info=read_info[[\"cell_barcode\",\"guide_identity\",\"reads\"]]\n",
    "read_umi_check=read_info.groupby([\"cell_barcode\",\"guide_identity\"]).sum().reset_index()\n",
    "\n",
    "read_umi=pd.read_csv(\"../../cellranger_output/crispr_analysis/read_umi_identity.csv\", names=[\"cell_barcode\",\"guide_identity\",\"read_count\"], skiprows=1)\n",
    "read_umi.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "         cell_barcode       guide_identity  read_count\n",
      "0  AAACCCAAGAAACCCG-1  non-targeting_00373         1.0\n",
      "1  AAACCCAAGAAACCCG-1           sgCALR_MPA         2.0\n",
      "2  AAACCCAAGAAACCCG-1           sgCHD8_MPC         1.0\n",
      "3  AAACCCAAGAAACCCG-1         sgGTF2F1_APA         1.0\n",
      "4  AAACCCAAGAAACCCG-1         sgNFE2L2_MPD         1.0\n"
     ]
    }
   ],
   "source": [
    "read_umi_check=read_umi.groupby([\"cell_barcode\",\"guide_identity\"]).sum().reset_index()\n",
    "print(read_umi_check.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_call_stacked=read_umi_check.merge(feature_call_stacked,on=[\"cell_barcode\",\"guide_identity\"]).drop_duplicates()\n",
    "feature_call_stacked['UMI_count']=feature_call_stacked['UMI_count'].astype('float')\n",
    "feature_call_stacked['coverage'] = feature_call_stacked['read_count']/feature_call_stacked['UMI_count']\n",
    "feature_call_stacked[\"coverage\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "##### cell_barcode,guide_identity,read_count,UMI_count,coverage,gemgroup,good_coverage,number_of_cells\n",
    "##output table\n",
    "output_dir=\"../../cellranger_output/\"\n",
    "feature_call_stacked['good_coverage']=True\n",
    "feature_call_stacked['number_of_cells']=1.0\n",
    "feature_call_stacked[\"gemgroup\"]=1\n",
    "\n",
    "tst=feature_call_stacked[['cell_barcode','guide_identity','read_count','UMI_count','coverage','gemgroup','good_coverage','number_of_cells']]\n",
    "tst.to_csv(output_dir+'cell_identities.csv',index=False)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}