Identify Target Region for Dual Guides

Pipeline for identifying optimal genomic regions for dual guide targeting of alternative promoters.

integration_pacbiotss_proactiv_published_only

Integration of PacBio, TSS and Proactiv Data¶

We are identifying alternative promoter candidates to be used within the CRISPRi screen. After many weeks of trying every type of sequencing technology that could identify TSS/promoters for MCF-7 cell line.

Step 1) Identify the alternative TSS usage in 3 technologies- CAGE-seq, proActiv and PacBio.

Step 2) Filter the TSS based on proximity to each other, number and if the major promoter was upstream of the minor promoter.

Step 3) Prioritising the genes of interest.

Step 1) The three technologies used are as following (as well as a short description about how a TSS is called for each technology): a. PacBio Long-Read RNA-seq data From MCF-7 Cell-line. Hoen et al. PMID: 29598823 PacBio transcript sequences are actually a consensus sequence generated from a combination of full-length reads with isoform clustering algorithm (ICE) and then with partial reads. After benchmarking PacBio, we found it was as accurate as CAGE-seq at calling TSS but identified more TP TSS. This is because CAGE-seq is very conservative in calling TSS.

First, the TSS locus is defined as the first strand-specific nucleotide of each read. Then, counting the TSS called with 10 nucleotides of others to find the consensus locus and depth of coverage. Finally these TSS were overlapped with the refTSS database (PMID: 31075273). Note that PacBio has particularly low coverage relative to short read sequencing.

b. proActiv In-House RNA-seq from MCF-7 Cell Line proActiv is a purpose-built package that identifies alternative promoters and their activity. proActiv uses STAR junction files, and an annotation gtf file of the GENCODE v36 genome. The output- is the most annoying SummarisedExperiment dataframe.

Identified promoters from annotations. Created weighted splicing graphs splicing graphs that capture all the splice variants of one gene in one data structure. Calculate promoter activity estimates for uniquely identified promoters. We are not filtering out single exon promoters and promoters that uses an internal intron. Read counts are quantified for absolute and relative promoter activities using either split reads ration method or junction reads. They divided promoters into 3 different categories major, minor and inactive promoters. The highest activity for each gene across the sample cohort is the major promoters. Promoters with average activity less than 0.25 = inactive, and the remaining mean other promoters. The pipeline can be found on proActiv.R.

The mean of the absolute promoter activity and relative promoter activity were calculated across all three repeats of the MCF-7 cell line.

c. CAGE-seq - IDR of ENCODE MCF-7 Cell Line from the Carnicini Lab CAGE stands for the Cap analysis of gene expression. Understanding of CAGE-seq is that it measures RNA expression and maps TSS in promoters to a single-nucleotide resolution. However, it only works on total mature RNA and detection is biased toward TSS of long-lived transcripts. We are using CAGE-seq TSS found across two repeats using IDR. The average peak was found across both repeats. Then these consensus TSS were overlapped with refTSS database (PMID: 31075273).

Step 2) With all three sequencing technologies we have filtered by: a) Genes with between 2 and 20 TSSs more than 300 between them were retained as candidates.

The gene of interest were identified if found in two out of the three technologies.

Step 3) Now that we have identified these genes of interest, they are then given True/False if present in three lists.

First, a list of nuclear proteins. Specifically, 106 transcription factors were identified (PMID: 29425488) and 24 chromatin remodellers.

Secondly, a manually curated list of genes that have AP usage.

Finally, differentially expressed genes found to switch exon usage from non-cancerous stem cells to cancerous stem cells in HMLER and HCC38. The DEGs for the cancerous stem cells were actually identified by differential exon usage with proActiv.

These are all summarised in ascending order of number of hits pivot_simple_TF_CHROM_HMM_HCC.txt

In [1]:

import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from empiricaldist import Pmf
import pickle
from hmtl_scripts.alternative_promoter_func import func
from hmtl_scripts.alternative_promoter_func import downstream_func
from hmtl_scripts.alternative_promoter_func import prep

In [12]:

###Create the list of Manually Curated 
published_results=prep.ensembl_genename_search(df_location="./annotations/TF_CHROM_CURATED/Grazi_Published_Articles.csv", genecolumn="Gene",type="csv")
manually_curated=list(published_results["ensembl.gene"])

###add tp53
manually_curated.append("ENSG00000141510")
manually_curated.append("ENSG00000091831")
with open("./output_files/manually_curated.txt", "wb") as fp:
    pickle.dump(manually_curated, fp)

querying 1-39...done.
Finished.
39 input query terms found dup hits:
	[('ADAR', 10), ('PTHLH', 10), ('SHC1', 10), ('LEF1', 10), ('NR3C1', 10), ('IKBKG', 10), ('NAT1', 10)
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

In [3]:

###Both of the reference files used are in bed format and cover both the EPD and refTSS reference promoters with 200bp. 
##The bed format is <chr:start:end:
##from the overlap paper Hien et al. 
LONGREAD=prep.longread_prep(filename="./annotations/LONGREAD/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.curated.gff",cluster=False,reference="EPD_refTSS",distancefromref=100, annotation_bed="./annotations/EPD/Hs_EPDnew_006_hg38_tab.ensembl.gene.bed", second_reference_file="./annotations/refTSS/refTSS_v3.1_human_coordinate.hg38.8.sorted.bed")

LONGREAD.to_csv("./annotations/LONGREAD/MCF7_2015.ANGEL_FINAL_nominus_noduplicate_EPD.csv",index=None )

##PacBio due to the lack of read depth has not got a consistent way of calling Major/Minor promoters therefore ignore the Major/minor calling
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./annotations/LONGREAD/MCF7_2015.ANGEL_FINAL_nominus_noduplicate_EPD.csv" --save_csvfilename "./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv" --PacBio True --checkTP53 True --endcoord_TSS_column "TSSend" --makeBED True --save_bedfilename "./visualisation_files/PacBio_refTSS_candidatelist_filtered_2_distance.bed" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0 --uniqueid_TSScolumn "ref_TSSID" --uniqueid_genecolumn "Ensembl_ID"

translate bed file to TSS
convert hg19. to hg38
       chr1     29338     29339  transcript  - -.1  1 chr1.1     29346  \
0      chr1    484071    484072  transcript  +   -  1   chr1    778605   
1      chr1    727972    727973  transcript  -   -  1   chr1    778605   
2      chr1    728308    728309  transcript  -   -  1   chr1    778605   
3      chr1    728472    728473  transcript  -   -  1   chr1    778605   
4      chr1    827503    827504  transcript  -   -  1   chr1    827505   
...     ...       ...       ...         ... ..  .. ..    ...       ...   
74839  chrY  18149738  18149739  transcript  -   -  1   chrY  17880158   
74840  chrY  18992792  18992793  transcript  -   -  0   chrY  18992808   
74841  chrY  18992792  18992793  transcript  -   -  0   chrY  18992808   
74842  chrY  18992794  18992795  transcript  -   -  0   chrY  18992808   
74843  chrY  18992796  18992797  transcript  -   -  0   chrY  18992808   

          29366  hg_234684.1  ENSG00000227232 -.2     WASH7P       8  
0        778674    hg_9665.1              NaN   -        NaN  294534  
1        778674    hg_9665.1              NaN   -        NaN   50633  
2        778674    hg_9665.1              NaN   -        NaN   50297  
3        778674    hg_9665.1              NaN   -        NaN   50133  
4        827525    hg_9666.1  ENSG00000225880   -  LINC00115       2  
...         ...          ...              ...  ..        ...     ...  
74839  17880159  hg_244124.1  ENSG00000129873   -      CDY2B  269580  
74840  18992868       CD24_1  ENSG00000272398   -       CD24      16  
74841  18992868       CD24_1  ENSG00000272398   -       CD24      16  
74842  18992868       CD24_1  ENSG00000272398   -       CD24      14  
74843  18992868       CD24_1  ENSG00000272398   -       CD24      12  

[74844 rows x 15 columns]
find depth
       index  OverlappingTSS    chr  TSSstart   TSSend strand ref_TSSchr  \
22328  30111               1  chr17   7675237  7675238      -      chr17   
22329  30112               1  chr17   7687542  7687543      -      chr17   

       ref_TSSstart  ref_TSSend    ref_TSSID       Ensembl_ID Gene_symbol  \
22328       7675230     7675245  hg_164276.1  ENSG00000141510        TP53   
22329       7687476     7687536       TP53_1  ENSG00000141510        TP53   

       distance  
22328         0  
22329         7  
Element exists in Dataframe
If you read this line it means that you have provided all the parameters
Genes with between  2  and  1000  TSSs more than  0  between them were retained as candidates.
       index  OverlappingTSS    chr  TSSstart   TSSend strand ref_TSSchr  \
22328  30111               1  chr17   7675237  7675238      -      chr17   
22329  30112               1  chr17   7687542  7687543      -      chr17   

       ref_TSSstart  ref_TSSend    ref_TSSID       Ensembl_ID Gene_symbol  \
22328       7675230     7675245  hg_164276.1  ENSG00000141510        TP53   
22329       7687476     7687536       TP53_1  ENSG00000141510        TP53   

       distance  
22328         0  
22329         7  
Number of Genes After Filtering For Promoter Number: 6658
Number of Genes After Filtering For Distance between TSS and AltTSS: 4977
PacBio Data so not deep enough coverage to confidendentally call Major/Minor promoters
Number of Genes After Filtering  
 for Strand Specific Upstream Major Promoter: 4977
            Ensembl_ID   SD  index  OverlappingTSS    chr  TSSstart   TSSend  \
23357  ENSG00000141510  0.0  30111               1  chr17   7675237  7675238   
23358  ENSG00000141510  0.0  30112               1  chr17   7687542  7687543   

      strand ref_TSSchr  ref_TSSstart  ref_TSSend    ref_TSSID Gene_symbol  \
23357      -      chr17       7675230     7675245  hg_164276.1        TP53   
23358      -      chr17       7687476     7687536       TP53_1        TP53   

       distance  range MAJOR_MINOR     diff  
23357         0  12305       Major      0.0  
23358         7  12305       Major  12305.0

In [4]:

##ProActiv
PROACTIV = prep.pro_prep(filename="./annotations/PROACTIV/proactivMCF7.csv",distancebetweenref=100,reference="EPD_refTSS")
PROACTIV.to_csv("./annotations/PROACTIV/proactivMCF7_input.csv",index=None)

%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./annotations/PROACTIV/proactivMCF7_input.csv" --save_csvfilename "./annotations/PROACTIV/proactiv_refTSS_candidatelist_filtered_2_distance.csv"  --uniqueid_genecolumn "geneId_NoV" --TSSQuant_colummn "MCF7.mean" --uniqueid_TSScolumn "promoterId" --TSScoord_column "start" --upplim_promoter_number 10000 --makeBED True --save_bedfilename "./visualisation_files/proactiv_refTSS_candidatelist_filtered_2_distance.bed" --chromosome_col "seqnames" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0 --checkTP53 True


os.system("rm -f /output_files/proactiv_input.txt ./output_files/proactiv_bed.filteredts1_tsl2.sorted.bed ./output_files/proactiv_bed.sorted.bed ./output_files/proactiv_bed.bed")

(99380, 21)
closestBed -s -d -mdb all -b ./annotations/refTSS/refTSS_v3.1_human_coordinate.hg38.8.sorted.bed ./annotations/EPD/Hs_EPDnew_006_hg38_tab.ensembl.gene.bed -a ./annotations/PROACTIV/proactivMCF7.TSS.sorted.bed > ./annotations/PROACTIV/proactivMCF7EPD_refTSS.TSS.sorted.bed
      seqnames      start              geneId  promoterId strand    ref_TSSID  \
0         chr1     360057   ENSG00000236601.2       89982      +       hg_1.1   
1         chr1     827598  ENSG00000228794.10       82866      +      hg_16.1   
2         chr1     851348  ENSG00000228794.10       82868      +      hg_18.1   
3         chr1     923928  ENSG00000187634.12       64499      +      hg_24.1   
4         chr1     930312  ENSG00000187634.12       64502      +      hg_28.1   
...        ...        ...                 ...         ...    ...          ...   
37124     chrX  155612952  ENSG00000185973.12       63412      -      TMLHE_1   
37125     chrX  155881345  ENSG00000124333.16       24904      +      VAMP7_1   
37126     chrX  155881345  ENSG00000124333.16       24904      +  hg_198225.1   
37127     chrX  156021328  ENSG00000182484.15       60803      +  hg_243873.1   
37128     chrY   11215026   ENSG00000278212.2      116488      -  hg_201254.1   

      Gene_symbol  distance  
0             NaN    269134  
1       LINC01128        55  
2       LINC01128     19753  
3          SAMD11         0  
4          SAMD11      4556  
...           ...       ...  
37124       TMLHE         0  
37125       VAMP7         0  
37126       VAMP7         0  
37127         NaN      2717  
37128         NaN   1206542  

[37129 rows x 8 columns]
(37129, 8)
(18896, 8)
If you read this line it means that you have provided all the parameters
Genes with between  2  and  10000  TSSs more than  0  between them were retained as candidates.
     seqnames    start              geneId  promoterId strand    ref_TSSID  \
7591    chr17  7676594  ENSG00000141510.18       35747      -  hg_221839.1   
7592    chr17  7687538  ENSG00000141510.18       35750      -       TP53_1   

     Gene_symbol  distance  internalPromoter  promoterPosition  ...  \
7591        TP53        28              True                 4  ...   
7592        TP53         2             False                 1  ...   

      012SJ_normpromCount 000SJ_absPromAct  001SJ_absPromAct  \
7591            98.509902         7.270845          7.104026   
7592            84.210077         6.862031          6.866094   

      012SJ_absPromAct  000SJ_relProm  001SJ_relProm  012SJ_relProm  \
7591          6.636768         0.3397       0.340212       0.332686   
7592          6.412952         0.3206       0.328818       0.321467   

      absPromMean  relPromMean       geneId_NoV  
7591     7.003880     0.337533  ENSG00000141510  
7592     6.713692     0.323628  ENSG00000141510  

[2 rows x 27 columns]
Number of Genes After Filtering For Promoter Number: 2284
Number of Genes After Filtering For Distance between TSS and AltTSS: 2284
           geneId_NoV        SD seqnames    start              geneId  \
3590  ENSG00000141510  0.145094    chr17  7676594  ENSG00000141510.18   
3591  ENSG00000141510  0.145094    chr17  7687538  ENSG00000141510.18   

      promoterId strand    ref_TSSID Gene_symbol  distance  ...  \
3590       35747      -  hg_221839.1        TP53        28  ...   
3591       35750      -       TP53_1        TP53         2  ...   

      012SJ_absPromAct  000SJ_relProm  001SJ_relProm 012SJ_relProm  \
3590          6.636768         0.3397       0.340212      0.332686   
3591          6.412952         0.3206       0.328818      0.321467   

      absPromMean  relPromMean  range  MAJOR_MINOR     diff  Minor_Downstream  
3590     7.003880     0.337533  10944        Major      0.0             False  
3591     6.713692     0.323628  10944        Minor  10944.0             False  

[2 rows x 32 columns]

Out[4]:

CAGGEEEE For the MCF-7 combined peaks: https://www.encodeproject.org/files/ENCFF917XEM/@@download/ENCFF917XEM.bed.gz

Using Human Genes GRCh38.p13 hg38_genes using Ensembl web browser w/ attributes Chromosome/scaffold name Gene start (bp) Gene end (bp) Gene stable ID version Karyotype band Strand Source (gene) Gene type Transcription start site (TSS)

In [5]:

CAGE=prep.cage_prep("./annotations/CAGE/ENCFF917XEM.bed",distancefromref=100,reference="EPD_refTSS",annotation_bed="./annotations/EPD/Hs_EPDnew_006_hg38_tab.ensembl.bed")
CAGE.to_csv("./annotations/CAGE/ENCFF917XEM_EPD.csv")

%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./annotations/CAGE/ENCFF917XEM_EPD.csv" --save_csvfilename "./output_files/CAGE_candidatelist_filtered_2_distance.csv"  --TSScoord_column 'start' --TSSQuant_colummn 'avgpeak' --upplim_promoter_number 10000 --Manually_Curated "./output_files/manually_curated.txt" --makeBED True --endcoord_TSS_column "end" --save_bedfilename "./visualisation_files/CAGE_candidatelist_filtered_2_distance.bed" --distance_between_TSS 0  --uniqueid_TSScolumn "ref_TSSID" --uniqueid_genecolumn "ensembl_id"

os.system("rm -f ./output_files/CAGErefTSS.txt ./output_files/ENCFF917XEM_refTSS.bed")

./annotations/CAGE/ENCFF917XEM.sorted.bed ./annotations/CAGE/ENCFF917XEM.ref.sorted.bed
        chr      start        end strand  pVal  qVal   peak1   peak2  \
0      chr1     634031     634069      -  2.54  3.11   175.0   132.0   
1      chr1     778582     778688      -  1.94  2.61   107.0    60.0   
2      chr1     827496     827770      -  1.16  2.01    49.0    35.0   
3      chr1    1013470    1013549      +  5.00  5.00  3872.0  3043.0   
4      chr1    1013470    1013549      +  5.00  5.00  3872.0  3043.0   
...     ...        ...        ...    ...   ...   ...     ...     ...   
10938  chrX  154762851  154762946      +  2.90  3.42   310.0   177.0   
10939  chrX  154762851  154762946      +  2.90  3.42   310.0   177.0   
10940  chrX  154762851  154762946      +  2.90  3.42   310.0   177.0   
10941  chrX  155216444  155216541      +  0.84  1.76    75.0    26.0   
10942  chrX  155216444  155216541      +  0.84  1.76    75.0    26.0   

      ref_TSSchr  ref_TSSstart  ref_TSSend    ref_TSSID       ensembl_id  \
0           chr1        778605      778674    hg_9665.1              NaN   
1           chr1        778605      778674    hg_9665.1              NaN   
2           chr1        827505      827525    hg_9666.1  ENSG00000225880   
3           chr1       1013447     1013507      ISG15_1  ENSG00000187608   
4           chr1       1013486     1013505      hg_47.1  ENSG00000187608   
...          ...           ...         ...          ...              ...   
10938       chrX     154762813   154762893  hg_198210.1  ENSG00000130826   
10939       chrX     154762825   154762885       DKC1_1  ENSG00000130826   
10940       chrX     154762900   154762951  hg_198211.1  ENSG00000130826   
10941       chrX     155216410   155216470       VBP1_1  ENSG00000155959   
10942       chrX     155216439   155216484  hg_198223.1  ENSG00000155959   

      Gene_symbol   distance  
0             NaN     144537  
1             NaN          0  
2       LINC00115          0  
3         1013496    1013507  
4           ISG15          0  
...           ...        ...  
10938        DKC1          0  
10939   154762874  154762885  
10940        DKC1          0  
10941   155216459  155216470  
10942        VBP1          0  

[10943 rows x 15 columns]
If you read this line it means that you have provided all the parameters
Genes with between  2  and  10000  TSSs more than  0  between them were retained as candidates.
Empty DataFrame
Columns: [Unnamed: 0, chr, start, end, strand, pVal, qVal, peak1, peak2, ref_TSSchr, ref_TSSstart, ref_TSSend, ref_TSSID, ensembl_id, Gene_symbol, distance, avgpeak]
Index: []
Number of Genes After Filtering For Promoter Number: 1424
Number of Genes After Filtering For Distance between TSS and AltTSS: 347

Out[5]:

In [6]:

func.merge_three_df_venn(dataframe1="./output_files/CAGE_candidatelist_filtered_2_distance.csv",dataframe2="./output_files/proactiv_refTSS_candidatelist_filtered_2_distance.csv",dataframe3="./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv",symbolcolumn3="Ensembl_ID",symbolcolumn2="geneId_NoV",symbolcolumn1="ensembl_id",output_folder="./merged_tables/")

        ensembl_id          SD  Unnamed: 0    chr     start       end strand  \
0  ENSG00000003056   28.284271        2036  chr12   8949393   8949426      -   
1  ENSG00000003056   28.284271        2037  chr12   8949583   8949640      -   
2  ENSG00000003056   28.284271        2038  chr12   8949583   8949640      -   
3  ENSG00000008083  156.691566        8749   chr6  15245446  15245711      +   
4  ENSG00000008083  156.691566        8751   chr6  15245446  15245711      +   

   pVal  qVal  peak1  ...  ref_TSSstart ref_TSSend    ref_TSSID  Gene_symbol  \
0 -0.00  0.53   17.0  ...       8949402    8949417  hg_129378.1         M6PR   
1  1.69  2.42  107.0  ...       8949597    8949615  hg_129379.1         M6PR   
2  1.69  2.42  107.0  ...       8949628    8949674  hg_129380.1         M6PR   
3  1.12  1.97   47.0  ...      15245446   15245492   hg_66039.1       JARID2   
4  1.12  1.97   47.0  ...      15245496   15245515   hg_66040.1       JARID2   

  distance avgpeak  range  MAJOR_MINOR   diff Minor_Downstream  
0        0    16.0    190        Minor    0.0             True  
1        0    76.0    190        Major  190.0             True  
2        0    76.0    190        Major    0.0             True  
3        0    47.0   3448        Minor    0.0             True  
4        0    47.0   3448        Minor    0.0             True  

[5 rows x 22 columns]

./merged_tables/ CAGE proactiv .csv
./merged_tables/ proactiv PacBio .csv
./merged_tables/ CAGE PacBio .csv
./merged_tables/ CAGE proactiv PacBio .csv

In [7]:

##Write pivot tables for all
###Proactiv

pro_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['promoterId','Gene_symbol_y', 'seqnames','start_y', 'internalPromoter',"MAJOR_MINOR_y","geneId_NoV","strand_y"], values_list1=["absPromMean", "relPromMean"], dataframe2="./merged_tables/proactivPacBio.csv", index_list2=['promoterId','Gene_symbol_x','seqnames','start','internalPromoter',"MAJOR_MINOR_x","Ensembl_ID","strand_x"], values_list2=["absPromMean", "relPromMean"], name="proActiv", columnnames=['proActiv_promoterId', 'Gene_symbol', 'proActiv_seqnames', 'proActiv_start', 'proActiv_internalPromoter','proActiv_MAJOR_MINOR',"Ensembl_ID","strand",'proActiv_absPromMean', 'proActiv_relPromMean'], reorderedcolumns=['proActiv_promoterId', 'Gene_symbol', 'Ensembl_ID' , 'proActiv_seqnames', 'proActiv_start','proActiv_end', "strand",'proActiv_internalPromoter',  'proActiv_MAJOR_MINOR', 'proActiv_absPromMean', 'proActiv_relPromMean','proActiv'], noendcoord=True , save_txt="./candidate_AP_output/propivot_table.txt" )

 ###CAGE
cage_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['ref_TSSID_x','Gene_symbol_x', 'chr',"start_x",'end',"MAJOR_MINOR_x","ensembl_id","strand_x"], values_list1=["avgpeak"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['ref_TSSID_y','Gene_symbol_y','chr_y','start','end',"MAJOR_MINOR_y","ensembl_id","strand_y"], values_list2=["avgpeak"], name="CAGE", columnnames=['CAGE_TSSID', 'Gene_symbol', 'CAGE_chr', 'CAGE_start','CAGE_end',  'CAGE_MAJOR_MINOR', 'Ensembl_ID',"strand",'CAGE_avgpeak'], reorderedcolumns=['CAGE_TSSID', 'Gene_symbol', 'Ensembl_ID','CAGE_chr', 'CAGE_start','CAGE_end',"strand", 'CAGE_MAJOR_MINOR','CAGE_avgpeak',"CAGE"], noendcoord=False, save_txt="./candidate_AP_output/cagepivot_table.txt" )


###Pacbio
pac_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/proactivPacBio.csv", index_list1=['ref_TSSID_y','Gene_symbol_y', 'chr','TSSstart','TSSend',"MAJOR_MINOR_y","Ensembl_ID","strand_y"], values_list1=["OverlappingTSS"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['ref_TSSID_x','Gene_symbol_x','chr_x','TSSstart','TSSend',"MAJOR_MINOR_x","ensembl_id","strand_x"], values_list2=["OverlappingTSS"], name="PacBio", columnnames=['PacBio_TSSID', 'Gene_symbol', 'PacBio_chr', 'PacBio_start','PacBio_end', 'PacBio_MAJOR_MINOR', "Ensembl_ID","strand",'PacBio_OverlappingTSS'], reorderedcolumns=['PacBio_TSSID', 'Gene_symbol','Ensembl_ID', 'PacBio_chr', 'PacBio_start','PacBio_end', "strand", 'PacBio_MAJOR_MINOR','PacBio_OverlappingTSS','PacBio'], noendcoord=False, save_txt="./candidate_AP_output/pacpivot_table.txt" )




all_pivot=pro_pivot.merge(cage_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer").merge(pac_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer")

###This is the simple pivot table of genes ONLY
all_pivot_simple=all_pivot[["Ensembl_ID",'Gene_symbol' ,'proActiv', 'CAGE',  'PacBio']].drop_duplicates()
all_pivot_simple[all_pivot_simple.isna()] = False

pro_pivot.drop(columns= ["proActiv_relPromMean","proActiv","proActiv_internalPromoter"],axis=0,inplace=True)
pro_pivot["Source"]="proActiv"

cage_pivot.drop(columns= ["CAGE"],axis=0,inplace=True)
cage_pivot["Source"]="CAGE"

pac_pivot.drop(columns= ["PacBio"],axis=0,inplace=True)
pac_pivot["Source"]="PacBio"

##This is the pivot table that will be used further downstream as it has every single TSS called across platforms
compiled_pivot= pd.DataFrame( np.concatenate((pro_pivot.values,cage_pivot.values,pac_pivot.values) , axis=0 )).drop_duplicates()
compiled_pivot.columns=['TSS_ID','Gene_symbol','Ensembl_ID' , 'chr', 'TSSstart','TSSend' ,"strand",'MAJOR_MINOR' ,'Depth',"Source"]

###Make a file where the promoters are clustered together
compiled_pivot_bed=compiled_pivot[["chr","TSSstart","TSSstart","Source","TSS_ID","strand","Ensembl_ID"]]
compiled_pivot_bed.to_csv("./output_files/compiled_pivot.bed", header=None, index=None, sep="\t")

In [8]:

###START W CANONICAL ATG ANALYSIS WITH ROBS ATG START CODONS
# I'd figure out where the "canonical" ATG is. If promoters upstream of this ATG, it is a fair assumption that both will use that ATG. If minor one is downstream, it probably has a different N-terminal. That would be most basic approach (edited)

nterminus_change_merge=downstream_func.find_nterminuschange()

Empty DataFrame
Columns: [chr, TSS_start, TSS_end, NumberTSS, Gene_symbol, TSS_ID, strand, Ensembl_ID]
Index: []
    chr  TSS_start  TSS_end  NumberTSS      Gene_symbol  \
0  chr1    1308554  1308789          5  PacBio,proActiv   
1  chr1    6385812  6385827          2  PacBio,proActiv   
2  chr1    6393406  6393406          1           PacBio   
3  chr1    6393767  6393767          1         proActiv   
4  chr1    7961662  7961895          6  PacBio,proActiv   

                                          TSS_ID strand       Ensembl_ID  
0            PUSL1_1,53265,hg_81.1,hg_83.1,53266      +  ENSG00000169972  
1                                  10696,ACOT7_3      -  ENSG00000097021  
2                                     hg_10030.1      -  ENSG00000097021  
3                                          10697      -  ENSG00000097021  
4  21343,PARK7_1,hg_453.1,hg_453.1,PARK7_1,21344      +  ENSG00000116288  
(2858, 8)
Index(['chr', 'TSS_start', 'TSS_end', 'NumberTSS', 'Gene_symbol', 'TSS_ID',
       'strand', 'Ensembl_ID', 'Range_ClusteredTSS', 'start_SC', 'end_SC'],
      dtype='object')
The percentage of canonical ATG downstream of promoters: 57.6 %

In [9]:

####Create a dataframe for import into flashfry
###Get the dataframe that shows the entire list of 
##Change the bed file so it is just the TSS 
all_pivot_simple_nterm=all_pivot_simple.merge(nterminus_change_merge, left_on="Ensembl_ID", right_on="Ensembl_ID", right_index=True)
all_pivot_simple_nterm.to_csv("./candidate_AP_output/all_pivot_simple_nterm.txt",sep="\t",index=None)

In [26]:

compiled_pivot_flashfry=compiled_pivot[['chr', 'TSSstart', 'TSSend', "TSS_ID", 'Gene_symbol','strand','Ensembl_ID','MAJOR_MINOR','Source']]

compiled_pivot_flashfry=compiled_pivot_flashfry[compiled_pivot_flashfry["Ensembl_ID"].isin(all_pivot_simple["Ensembl_ID"])]

#ensure that only one copy of a TSS is inclued 
SINGLE=compiled_pivot["Gene_symbol"].value_counts().index[compiled_pivot["Gene_symbol"].value_counts().values<2]

compiled_pivot_flashfry=compiled_pivot_flashfry[~compiled_pivot_flashfry["Gene_symbol"].isin(SINGLE)]

##create unique ID 
compiled_pivot_flashfry["ID"]=compiled_pivot_flashfry["TSS_ID"].astype(str)+"_"+(compiled_pivot_flashfry.index).astype(str)

compiled_pivot_flashfry.to_csv("./candidate_AP_output/compiled_pivot_all.bed",index=None)

# all_pivot_simple_nterm.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/all_pivot_simple_nterm.txt",sep="\t",index=None)
# compiled_pivot_flashfry.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/compiled_pivot_all.bed", sep="\t",index=None)

compiled_pivot_flashfry=func.find_confidencevalue(ALLTSS_bed_loc="./candidate_AP_output/compiled_pivot_all.bed")
compiled_pivot_flashfry.to_csv("./candidate_AP_output/compiled_pivot_all.bed",index=None,sep="\t")

compiled_pivot_flashfry_vis=compiled_pivot_flashfry[['chr', 'TSSstart', 'TSSend', 'Source', 'Gene_symbol','strand']]
compiled_pivot_flashfry_vis.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/visualisations/compiled_pivot_all_vis.bed",header=None, sep="\t",index=None)

In [14]:

compiled_pivot_flashfry.columns

Out[14]:

Index(['chr', 'TSSstart', 'TSSend', 'TSS_ID', 'Gene_symbol', 'strand',
       'Ensembl_ID', 'MAJOR_MINOR', 'Source', 'confidence_value', 'rank'],
      dtype='object')

In [13]:

os.system("bedtools intersect -loj -a ./candidate_AP_output/compiled_pivot_all.bed -b ./annotations/ENCFF268RXB.bed > ./candidate_AP_output/compiled_pivot_all.H3K4me3.bed")

Out[13]:

In [ ]:

Download Original Report: 1B Target Region Report (HTML)