Integration of PacBio, TSS and proActiv Data

Multi-omics integration pipeline for identifying alternative promoters using long-read sequencing, CAGE-seq, and RNA-seq data.

integration_pacbiotss_proactiv_published_only

Integration of PacBio, TSS and Proactiv Data¶

We are identifying alternative promoter candidates to be used within the CRISPRi screen. After many weeks of trying every type of sequencing technology that could identify TSS/promoters for MCF-7 cell line.

Step 1) Identify the alternative TSS usage in 3 technologies- CAGE-seq, proActiv and PacBio.

Step 2) Filter the TSS based on proximity to each other, number and if the major promoter was upstream of the minor promoter.

Step 3) Prioritising the genes of interest.

Step 1) The three technologies used are as following (as well as a short description about how a TSS is called for each technology): a. PacBio Long-Read RNA-seq data From MCF-7 Cell-line. Hoen et al. PMID: 29598823 PacBio transcript sequences are actually a consensus sequence generated from a combination of full-length reads with isoform clustering algorithm (ICE) and then with partial reads. After benchmarking PacBio, we found it was as accurate as CAGE-seq at calling TSS but identified more TP TSS. This is because CAGE-seq is very conservative in calling TSS.

First, the TSS locus is defined as the first strand-specific nucleotide of each read. Then, counting the TSS called with 10 nucleotides of others to find the consensus locus and depth of coverage. Finally these TSS were overlapped with the refTSS database (PMID: 31075273). Note that PacBio has particularly low coverage relative to short read sequencing.

b. proActiv In-House RNA-seq from MCF-7 Cell Line proActiv is a purpose-built package that identifies alternative promoters and their activity. proActiv uses STAR junction files, and an annotation gtf file of the GENCODE v36 genome. The output- is the most annoying SummarisedExperiment dataframe.

Identified promoters from annotations. Created weighted splicing graphs splicing graphs that capture all the splice variants of one gene in one data structure. Calculate promoter activity estimates for uniquely identified promoters. We are not filtering out single exon promoters and promoters that uses an internal intron. Read counts are quantified for absolute and relative promoter activities using either split reads ration method or junction reads. They divided promoters into 3 different categories major, minor and inactive promoters. The highest activity for each gene across the sample cohort is the major promoters. Promoters with average activity less than 0.25 = inactive, and the remaining mean other promoters. The pipeline can be found on proActiv.R.

The mean of the absolute promoter activity and relative promoter activity were calculated across all three repeats of the MCF-7 cell line.

c. CAGE-seq - IDR of ENCODE MCF-7 Cell Line from the Carnicini Lab CAGE stands for the Cap analysis of gene expression. Understanding of CAGE-seq is that it measures RNA expression and maps TSS in promoters to a single-nucleotide resolution. However, it only works on total mature RNA and detection is biased toward TSS of long-lived transcripts. We are using CAGE-seq TSS found across two repeats using IDR. The average peak was found across both repeats. Then these consensus TSS were overlapped with refTSS database (PMID: 31075273).

Step 2) With all three sequencing technologies we have filtered by: a) Genes with between 2 and 20 TSSs more than 300 between them were retained as candidates.

The gene of interest were identified if found in two out of the three technologies.

Step 3) Now that we have identified these genes of interest, they are then given True/False if present in three lists.

First, a list of nuclear proteins. Specifically, 106 transcription factors were identified (PMID: 29425488) and 24 chromatin remodellers.

Secondly, a manually curated list of genes that have AP usage.

Finally, differentially expressed genes found to switch exon usage from non-cancerous stem cells to cancerous stem cells in HMLER and HCC38. The DEGs for the cancerous stem cells were actually identified by differential exon usage with proActiv.

These are all summarised in ascending order of number of hits pivot_simple_TF_CHROM_HMM_HCC.txt

In [1]:

import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from empiricaldist import Pmf
import pickle
from hmtl_scripts.alternative_promoter_func import func
from hmtl_scripts.alternative_promoter_func import downstream_func
from hmtl_scripts.alternative_promoter_func import prep

In [2]:

###Create the list of Manually Curated 
published_results=func.ensembl_genename_search(df_location="./annotations/Grazi_Published_Articles.csv", genecolumn="Gene",type="csv")
manually_curated=list(published_results["ensembl.gene"])

###add tp53
manually_curated.append("ENSG00000141510")
with open("./output_files/manually_curated.txt", "wb") as fp:
    pickle.dump(manually_curated, fp)

querying 1-39...done.
Finished.
39 input query terms found dup hits:
	[('ADAR', 10), ('PTHLH', 10), ('SHC1', 10), ('LEF1', 10), ('NR3C1', 10), ('IKBKG', 10), ('NAT1', 10)
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

In [3]:

##This is the integration of PacBio TSS and Proactiv TSS and CAGE TSS
refTSSGENCODE=func.ensembl_genename_search(df_location="./annotations/refTSS_v3.1_human_annotation.txt", genecolumn="Gene_symbol",type="table")

# ##refTSS human coordinates used for overlaplongread_TSS_selected.bed
refTSS=pd.read_table("./annotations/refTSS_v3.1_human_coordinate.hg38.bed",header=None)

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-23551...done.
Finished.
20256 input query terms found dup hits:
	[('nan', 2), ('MTND1P23', 2), ('MTND2P28', 2), ('MTCO1P12', 2), ('MTCO2P12', 2), ('MTATP8P1', 2), ('
977 input query terms found no hit:
	['LINC00982', 'LOC102725193', 'CENPS CENPS-CORT', 'RNU1-1 RNU1-2 RNU1-3 RNU1-4 RNVU1-18', 'LOC400743
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

In [30]:

###First PacBio
##from the overlap paper Hien et al. 
LONGREAD=pd.read_table("./annotations/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.curated.gff", delimiter="\t", header=None)

LONGREAD=prep.longread_prep(filename="./annotations/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.curated.gff")

##Merge the GeneCode Annotated refTSSID with the refTSS overlapped promoters in PacBio
refTSSLONGREAD=LONGREAD.merge(refTSSGENCODE, right_on="#refTSSID", left_on="refTSSID")
refTSSLONGREAD.to_csv("./output_files/refTSSLONGREAD.txt")

##PacBio due to the lack of read depth has not got a consistent way of calling Major/Minor promoters therefore ignored
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./output_files/refTSSLONGREAD.txt" --save_csvfilename "./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv"  --upplim_promoter_number 10000 --PacBio True --checkTP53 True --endcoord_TSS_column "TSSend" --makeBED True --save_bedfilename "./visualisation_files/PacBio_refTSS_candidatelist_filtered_2_distance.bed" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0

os.system("rm -f ./unlifted.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS_hg38.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS_hg38_merged10.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS_hg38_sorted.bed ./output_files/refTSSLONGREAD.txt ./output_files/longread_TSScurated_refTSS.bed")

If you read this line it means that you have provided all the parameters
Genes with between  2  and  10000  TSSs more than  0  between them were retained as candidates.
      Unnamed: 0    chr  TSSstart   TSSend  OverlappingTSS strand refTSSchr  \
6456        6456  chr17   7675237  7675238               1      -     chr17   

      refTSSstart  refTSSend     refTSSID  ... Distance GeneID  HGNC/MGI_ID  \
6456      7675230    7675245  hg_164276.1  ...    112.0   7157   HGNC:11998   

     UniProt_ID          Gene_name Gene_symbol Gene_synonyms Gene_source  \
6456     Q19RW5  tumor protein p53        TP53      LFS1 p53  HGNC:11998   

         ensembl.gene    _score  
6456  ENSG00000141510  88.91277  

[1 rows x 22 columns]
Number of Genes After Filtering For Promoter Number: 2098
25
Number of Genes After Filtering For Distance between TSS and AltTSS: 2095
PacBio Data so not deep enough coverage to confidendentally call Major/Minor promoters
Number of Genes After Filtering  
 for Strand Specific Upstream Major Promoter: 2095
         ensembl.gene   SD  Unnamed: 0    chr  TSSstart   TSSend  \
3421  ENSG00000141510  0.0        6456  chr17   7675237  7675238   

      OverlappingTSS strand refTSSchr  refTSSstart  ...  HGNC/MGI_ID  \
3421               1      -     chr17      7675230  ...   HGNC:11998   

     UniProt_ID          Gene_name Gene_symbol  Gene_synonyms Gene_source  \
3421     Q19RW5  tumor protein p53        TP53       LFS1 p53  HGNC:11998   

        _score range MAJOR_MINOR diff  
3421  88.91277     0       Major  0.0  

[1 rows x 26 columns]

Out[30]:

In [28]:

##ProActiv
proactiv_input = prep.pro_prep(filename="./output_files/proactivMCF7.csv")
proactiv_input.to_csv("./output_files/proactiv_input.txt")

%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./output_files/proactiv_input.txt" --save_csvfilename "./output_files/proactiv_refTSS_candidatelist_filtered_2_distance.csv"  --uniqueid_genecolumn "geneId_NoV" --TSSQuant_colummn "MCF7.mean" --uniqueid_TSScolumn "promoterId" --TSScoord_column "start" --Get_Ensembl_Gene_Names True --upplim_promoter_number 10000 --checkTP53 True --makeBED True --save_bedfilename "./visualisation_files/proactiv_refTSS_candidatelist_filtered_2_distance.bed" --chromosome_col "seqnames" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0

os.system("rm -f /output_files/proactiv_input.txt ./output_files/proactiv_bed.filteredts1_tsl2.sorted.bed ./output_files/proactiv_bed.sorted.bed ./output_files/proactiv_bed.bed")

(22783, 6)
(22783, 24)
If you read this line it means that you have provided all the parameters
Genes with between  2  and  10000  TSSs more than  0  between them were retained as candidates.
      Unnamed: 0 seqnames    start strand              geneId  promoterId  \
9167        9167    chr17  7675493      -  ENSG00000141510.18       35746   
9168        9168    chr17  7675493      -  ENSG00000141510.18       35746   
9169        9169    chr17  7675493      -  ENSG00000141510.18       35746   
9170        9170    chr17  7675493      -  ENSG00000141510.18       35746   
9171        9171    chr17  7675493      -  ENSG00000141510.18       35746   
9172        9172    chr17  7675493      -  ENSG00000141510.18       35746   
9173        9173    chr17  7676594      -  ENSG00000141510.18       35747   
9174        9174    chr17  7676594      -  ENSG00000141510.18       35747   
9176        9176    chr17  7687538      -  ENSG00000141510.18       35750   

      Unnamed: 0.1  internalPromoter  promoterPosition  MCF7.mean  ...  \
9167         35431              True                 5   7.027062  ...   
9168         35431              True                 5   7.027062  ...   
9169         35431              True                 5   7.027062  ...   
9170         35431              True                 5   7.027062  ...   
9171         35431              True                 5   7.027062  ...   
9172         35431              True                 5   7.027062  ...   
9173         35432              True                 4   7.003880  ...   
9174         35432              True                 4   7.003880  ...   
9176         35434             False                 1   6.713692  ...   

     012SJ_normpromCount  000SJ_absPromAct  001SJ_absPromAct  \
9167          118.370769          7.270845          6.911036   
9168          118.370769          7.270845          6.911036   
9169          118.370769          7.270845          6.911036   
9170          118.370769          7.270845          6.911036   
9171          118.370769          7.270845          6.911036   
9172          118.370769          7.270845          6.911036   
9173           98.509902          7.270845          7.104026   
9174           98.509902          7.270845          7.104026   
9176           84.210077          6.862031          6.866094   

      012SJ_absPromAct  000SJ_relProm  001SJ_relProm  012SJ_relProm  \
9167          6.899306         0.3397       0.330970       0.345847   
9168          6.899306         0.3397       0.330970       0.345847   
9169          6.899306         0.3397       0.330970       0.345847   
9170          6.899306         0.3397       0.330970       0.345847   
9171          6.899306         0.3397       0.330970       0.345847   
9172          6.899306         0.3397       0.330970       0.345847   
9173          6.636768         0.3397       0.340212       0.332686   
9174          6.636768         0.3397       0.340212       0.332686   
9176          6.412952         0.3206       0.328818       0.321467   

      absPromMean  relPromMean       geneId_NoV  
9167     7.027062     0.338839  ENSG00000141510  
9168     7.027062     0.338839  ENSG00000141510  
9169     7.027062     0.338839  ENSG00000141510  
9170     7.027062     0.338839  ENSG00000141510  
9171     7.027062     0.338839  ENSG00000141510  
9172     7.027062     0.338839  ENSG00000141510  
9173     7.003880     0.337533  ENSG00000141510  
9174     7.003880     0.337533  ENSG00000141510  
9176     6.713692     0.323628  ENSG00000141510  

[9 rows x 26 columns]
Number of Genes After Filtering For Promoter Number: 5637
34
Number of Genes After Filtering For Distance between TSS and AltTSS: 5637
tst
querying 1-1000.../Users/helenking/Desktop/Weatheritt_Lab/alternative-promoter-identification/hmtl_scripts/unique_close_AP.py:170: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  res["Minor_Downstream"][res[uniqueid_genecolumn].isin(select_genes_minor_downstream)] =True
done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-7412...done.
                   name symbol          Ensembl  \
3982  tumor protein p53   TP53  ENSG00000141510   
3983  tumor protein p53   TP53  ENSG00000141510   
3984  tumor protein p53   TP53  ENSG00000141510   
3985  tumor protein p53   TP53  ENSG00000141510   
3986  tumor protein p53   TP53  ENSG00000141510   
3987  tumor protein p53   TP53  ENSG00000141510   
3988  tumor protein p53   TP53  ENSG00000141510   
3989  tumor protein p53   TP53  ENSG00000141510   
3990  tumor protein p53   TP53  ENSG00000141510   

                                     ensembl.transcript       geneId_NoV  \
3982  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3983  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3984  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3985  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3986  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3987  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3988  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3989  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   
3990  ENST00000269305 ENST00000359597 ENST0000041346...  ENSG00000141510   

            SD  Unnamed: 0 seqnames    start strand  ... 000SJ_relProm  \
3982  0.097123        9167    chr17  7675493      -  ...        0.3397   
3983  0.097123        9168    chr17  7675493      -  ...        0.3397   
3984  0.097123        9169    chr17  7675493      -  ...        0.3397   
3985  0.097123        9170    chr17  7675493      -  ...        0.3397   
3986  0.097123        9171    chr17  7675493      -  ...        0.3397   
3987  0.097123        9172    chr17  7675493      -  ...        0.3397   
3988  0.097123        9173    chr17  7676594      -  ...        0.3397   
3989  0.097123        9174    chr17  7676594      -  ...        0.3397   
3990  0.097123        9176    chr17  7687538      -  ...        0.3206   

      001SJ_relProm  012SJ_relProm  absPromMean  relPromMean  range  \
3982       0.330970       0.345847     7.027062     0.338839  12045   
3983       0.330970       0.345847     7.027062     0.338839  12045   
3984       0.330970       0.345847     7.027062     0.338839  12045   
3985       0.330970       0.345847     7.027062     0.338839  12045   
3986       0.330970       0.345847     7.027062     0.338839  12045   
3987       0.330970       0.345847     7.027062     0.338839  12045   
3988       0.340212       0.332686     7.003880     0.337533  12045   
3989       0.340212       0.332686     7.003880     0.337533  12045   
3990       0.328818       0.321467     6.713692     0.323628  12045   

     MAJOR_MINOR     diff  Minor_Downstream        geneID_wo  
3982       Major      0.0              True  ENSG00000141510  
3983       Major      0.0              True  ENSG00000141510  
3984       Major      0.0              True  ENSG00000141510  
3985       Major      0.0              True  ENSG00000141510  
3986       Major      0.0              True  ENSG00000141510  
3987       Major      0.0              True  ENSG00000141510  
3988       Minor      0.0              True  ENSG00000141510  
3989       Minor   1101.0              True  ENSG00000141510  
3990       Minor  10944.0              True  ENSG00000141510  

[9 rows x 36 columns]

Out[28]:

CAGGEEEE For the MCF-7 combined peaks: https://www.encodeproject.org/files/ENCFF917XEM/@@download/ENCFF917XEM.bed.gz

Using Human Genes GRCh38.p13 hg38_genes using Ensembl web browser w/ attributes Chromosome/scaffold name Gene start (bp) Gene end (bp) Gene stable ID version Karyotype band Strand Source (gene) Gene type Transcription start site (TSS)

In [27]:

os.system("bedtools intersect -wa -wb -s -b ./annotations/refTSS_v3.1_human_coordinate.hg38.sorted.bed -a ./annotations/ENCFF917XEM.bed> ./output_files/ENCFF917XEM_refTSS.bed")

CAGE=prep.cage_prep(filename="./output_files/ENCFF917XEM_refTSS.bed")

CAGErefTSS=CAGE.merge(refTSSGENCODE, right_on="#refTSSID", left_on="refTSSID")

CAGErefTSS.to_csv("./output_files/CAGErefTSS.txt")

%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./output_files/CAGErefTSS.txt" --save_csvfilename "./output_files/CAGE_candidatelist_filtered_2_distance.csv"  --TSScoord_column 'start' --TSSQuant_colummn 'avgpeak' --upplim_promoter_number 10000 --Manually_Curated "./output_files/manually_curated.txt" --checkTP53 True --makeBED True --endcoord_TSS_column "end" --save_bedfilename "./visualisation_files/CAGE_candidatelist_filtered_2_distance.bed" --PacBio True --distance_between_TSS 0

os.system("rm -f ./output_files/CAGErefTSS.txt ./output_files/ENCFF917XEM_refTSS.bed")

If you read this line it means that you have provided all the parameters
Genes with between  2  and  10000  TSSs more than  0  between them were retained as candidates.
Empty DataFrame
Columns: [Unnamed: 0, chr, start, end, strand, pVal, qVal, peak1, peak2, refTSSchr, refTSSstart, refTSSend, refTSSID, avgpeak, #refTSSID, Transcript_name, Distance, GeneID, HGNC/MGI_ID, UniProt_ID, Gene_name, Gene_symbol, Gene_synonyms, Gene_source, ensembl.gene, _score]
Index: []

[0 rows x 26 columns]
Number of Genes After Filtering For Promoter Number: 1399
11
Number of Genes After Filtering For Distance between TSS and AltTSS: 311
PacBio Data so not deep enough coverage to confidendentally call Major/Minor promoters
Number of Genes After Filtering  
 for Strand Specific Upstream Major Promoter: 311
Empty DataFrame
Columns: [ensembl.gene, SD, Unnamed: 0, chr, start, end, strand, pVal, qVal, peak1, peak2, refTSSchr, refTSSstart, refTSSend, refTSSID, avgpeak, #refTSSID, Transcript_name, Distance, GeneID, HGNC/MGI_ID, UniProt_ID, Gene_name, Gene_symbol, Gene_synonyms, Gene_source, _score, range, MAJOR_MINOR, diff]
Index: []

[0 rows x 30 columns]

Out[27]:

In [24]:

func.merge_three_df_venn(dataframe1="./output_files/CAGE_candidatelist_filtered_2_distance.csv",dataframe2="./output_files/proactiv_refTSS_candidatelist_filtered_2_distance.csv",dataframe3="./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv",symbolcolumn2="Ensembl",output_folder="./merged_tables/")

In [17]:

##Write pivot tables for all
###Proactiv

pro_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['promoterId','symbol', 'seqnames','start_y', 'internalPromoter',"MAJOR_MINOR_y","Ensembl","strand_y"], values_list1=["absPromMean", "relPromMean"], dataframe2="./merged_tables/proactivPacBio.csv", index_list2=['promoterId','symbol','seqnames','start','internalPromoter',"MAJOR_MINOR_x","Ensembl","strand_x"], values_list2=["absPromMean", "relPromMean"], name="proActiv", columnnames=['proActiv_promoterId', 'Gene_symbol', 'proActiv_seqnames', 'proActiv_start', 'proActiv_internalPromoter','proActiv_MAJOR_MINOR',"Ensembl_ID","strand",'proActiv_absPromMean', 'proActiv_relPromMean'], reorderedcolumns=['proActiv_promoterId', 'Gene_symbol', 'Ensembl_ID' , 'proActiv_seqnames', 'proActiv_start','proActiv_end', "strand",'proActiv_internalPromoter',  'proActiv_MAJOR_MINOR', 'proActiv_absPromMean', 'proActiv_relPromMean','proActiv'], noendcoord=True , save_txt="./candidate_AP_output/propivot_table.txt" )


 ###CAGE
cage_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['refTSSID','Gene_symbol', 'chr',"start_x",'end',"MAJOR_MINOR_x","ensembl.gene","strand_x"], values_list1=["avgpeak"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['refTSSID_y','Gene_symbol_y','chr_y','start','end',"MAJOR_MINOR_y","ensembl.gene","strand_y"], values_list2=["avgpeak"], name="CAGE", columnnames=['CAGE_refTSSID', 'Gene_symbol', 'CAGE_chr', 'CAGE_start','CAGE_end',  'CAGE_MAJOR_MINOR', 'Ensembl_ID',"strand",'CAGE_avgpeak'], reorderedcolumns=['CAGE_refTSSID', 'Gene_symbol', 'Ensembl_ID','CAGE_chr', 'CAGE_start','CAGE_end',"strand", 'CAGE_MAJOR_MINOR','CAGE_avgpeak',"CAGE"], noendcoord=False, save_txt="./candidate_AP_output/cagepivot_table.txt" )


###Pacbio
pac_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/proactivPacBio.csv", index_list1=['refTSSID','Gene_symbol', 'chr','TSSstart','TSSend',"MAJOR_MINOR_y","ensembl.gene","strand_y"], values_list1=["OverlappingTSS"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['refTSSID_x','Gene_symbol_x','chr_x','TSSstart','TSSend',"MAJOR_MINOR_x","ensembl.gene","strand_x"], values_list2=["OverlappingTSS"], name="PacBio", columnnames=['PacBio_refTSSID', 'Gene_symbol', 'PacBio_chr', 'PacBio_start','PacBio_end', 'PacBio_MAJOR_MINOR', "Ensembl_ID","strand",'PacBio_OverlappingTSS'], reorderedcolumns=['PacBio_refTSSID', 'Gene_symbol','Ensembl_ID', 'PacBio_chr', 'PacBio_start','PacBio_end', "strand", 'PacBio_MAJOR_MINOR','PacBio_OverlappingTSS','PacBio'], noendcoord=False, save_txt="./candidate_AP_output/pacpivot_table.txt" )


all_pivot=pro_pivot.merge(cage_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer").merge(pac_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer")

###This is the simple pivot table of genes ONLY
all_pivot_simple=all_pivot[["Ensembl_ID",'Gene_symbol' ,'proActiv', 'CAGE',  'PacBio']].drop_duplicates()
all_pivot_simple[all_pivot_simple.isna()] = False

pro_pivot.drop(columns= ["proActiv_relPromMean","proActiv","proActiv_internalPromoter"],axis=0,inplace=True)
pro_pivot["Source"]="proActiv"

cage_pivot.drop(columns= ["CAGE"],axis=0,inplace=True)
cage_pivot["Source"]="CAGE"

pac_pivot.drop(columns= ["PacBio"],axis=0,inplace=True)
pac_pivot["Source"]="PacBio"

##This is the pivot table that will be used further downstream as it has every single TSS called across platforms
compiled_pivot= pd.DataFrame( np.concatenate((pro_pivot.values,cage_pivot.values,pac_pivot.values) , axis=0 ))  
compiled_pivot.columns=['TSS_ID','Gene_symbol','Ensembl_ID' , 'chr', 'TSSstart','TSSend' ,"strand",'MAJOR_MINOR' ,'Depth',"Source"]

###Make a file where the promoters are clustered together
compiled_pivot_bed=compiled_pivot[["chr","TSSstart","TSSstart","Gene_symbol","TSS_ID","strand","Ensembl_ID"]]
compiled_pivot_bed.to_csv("./output_files/compiled_pivot.bed", header=None, index=None, sep="\t")

In [18]:

###START W CANONICAL ATG ANALYSIS WITH ROBS ATG START CODONS
# I'd figure out where the "canonical" ATG is. If promoters upstream of this ATG, it is a fair assumption that both will use that ATG. If minor one is downstream, it probably has a different N-terminal. That would be most basic approach (edited)

nterminus_change_merge=downstream_func.find_nterminuschange()

Empty DataFrame
Columns: [chr, TSS_start, TSS_end, NumberTSS, Gene_symbol, TSS_ID, strand, Ensembl_ID]
Index: []
    chr  TSS_start  TSS_end  NumberTSS Gene_symbol  \
0  chr1    1312700  1312700          1      INTS11   
1  chr1    1315697  1315697          1      INTS11   
2  chr1    1324618  1324688          4      INTS11   
3  chr1    1754896  1754896          1        NADK   
4  chr1    1756327  1756327          1        NADK   

                                TSS_ID strand       Ensembl_ID  
0                                26185      -  ENSG00000127054  
1                                26186      -  ENSG00000127054  
2  hg_9754.1,hg_9755.1,hg_9755.1,26190      -  ENSG00000127054  
3                                 1044      -  ENSG00000008130  
4                                 1045      -  ENSG00000008130  
(2081, 8)
Index(['chr', 'TSS_start', 'TSS_end', 'NumberTSS', 'Gene_symbol', 'TSS_ID',
       'strand', 'Ensembl_ID', 'Range_ClusteredTSS', 'start_SC', 'end_SC'],
      dtype='object')
The percentage of canonical ATG downstream of promoters: 53.7 %
/Users/helenking/Desktop/Weatheritt_Lab/alternative-promoter-identification/hmtl_scripts/alternative_promoter_func/downstream_func.py:58: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compiled_pivot_wstrand_ATG["Canonical_ATG_Downsteam"][((compiled_pivot_wstrand_ATG["strand"]=="+") & ( (compiled_pivot_wstrand_ATG["TSS_start"] -100 ) < compiled_pivot_wstrand_ATG["start_SC"] )) | ((compiled_pivot_wstrand_ATG["strand"]=="-") & ( (compiled_pivot_wstrand_ATG["TSS_start"]+100) > compiled_pivot_wstrand_ATG["start_SC"] ))] = "True"
/Users/helenking/Desktop/Weatheritt_Lab/alternative-promoter-identification/hmtl_scripts/alternative_promoter_func/downstream_func.py:89: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nterminus_change["Nterminus_Change"][nterminus_change["False"]==0] = "False"

In [32]:

####Create a dataframe for import into flashfry
###Get the dataframe that shows the entire list of 
##Change the bed file so it is just the TSS 
all_pivot_simple_nterm=all_pivot_simple.merge(nterminus_change_merge, left_on="Ensembl_ID", right_on="Ensembl_ID", right_index=True)

compiled_pivot_flashfry=compiled_pivot[['chr', 'TSSstart', 'TSSend', "TSS_ID", 'Gene_symbol','strand','Ensembl_ID','MAJOR_MINOR','Source']]
compiled_pivot_flashfry=compiled_pivot_flashfry[compiled_pivot_flashfry["Ensembl_ID"].isin(all_pivot_simple["Ensembl_ID"])]

compiled_pivot_flashfry.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/compiled_pivot_all.bed",header=None, sep="\t",index=None)
all_pivot_simple_nterm.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/all_pivot_simple_nterm.txt",sep="\t",index=None)

compiled_pivot_flashfry_vis=compiled_pivot_flashfry[['chr', 'TSSstart', 'TSSend', 'Source', 'Gene_symbol','strand']]
compiled_pivot_flashfry_vis.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/compiled_pivot_all_vis.bed",header=None, sep="\t",index=None)

Download Original Report: 1A Integration Report (HTML)