Integration of PacBio, TSS and proActiv Data
Multi-omics integration pipeline for identifying alternative promoters using long-read sequencing, CAGE-seq, and RNA-seq data.
Integration of PacBio, TSS and Proactiv Data¶
We are identifying alternative promoter candidates to be used within the CRISPRi screen. After many weeks of trying every type of sequencing technology that could identify TSS/promoters for MCF-7 cell line.
Step 1) Identify the alternative TSS usage in 3 technologies- CAGE-seq, proActiv and PacBio.
Step 2) Filter the TSS based on proximity to each other, number and if the major promoter was upstream of the minor promoter.
Step 3) Prioritising the genes of interest.
Step 1) The three technologies used are as following (as well as a short description about how a TSS is called for each technology): a. PacBio Long-Read RNA-seq data From MCF-7 Cell-line. Hoen et al. PMID: 29598823 PacBio transcript sequences are actually a consensus sequence generated from a combination of full-length reads with isoform clustering algorithm (ICE) and then with partial reads. After benchmarking PacBio, we found it was as accurate as CAGE-seq at calling TSS but identified more TP TSS. This is because CAGE-seq is very conservative in calling TSS.
First, the TSS locus is defined as the first strand-specific nucleotide of each read. Then, counting the TSS called with 10 nucleotides of others to find the consensus locus and depth of coverage. Finally these TSS were overlapped with the refTSS database (PMID: 31075273). Note that PacBio has particularly low coverage relative to short read sequencing.
b. proActiv In-House RNA-seq from MCF-7 Cell Line proActiv is a purpose-built package that identifies alternative promoters and their activity. proActiv uses STAR junction files, and an annotation gtf file of the GENCODE v36 genome. The output- is the most annoying SummarisedExperiment dataframe.
Identified promoters from annotations. Created weighted splicing graphs splicing graphs that capture all the splice variants of one gene in one data structure. Calculate promoter activity estimates for uniquely identified promoters. We are not filtering out single exon promoters and promoters that uses an internal intron. Read counts are quantified for absolute and relative promoter activities using either split reads ration method or junction reads. They divided promoters into 3 different categories major, minor and inactive promoters. The highest activity for each gene across the sample cohort is the major promoters. Promoters with average activity less than 0.25 = inactive, and the remaining mean other promoters. The pipeline can be found on proActiv.R.
The mean of the absolute promoter activity and relative promoter activity were calculated across all three repeats of the MCF-7 cell line.
c. CAGE-seq - IDR of ENCODE MCF-7 Cell Line from the Carnicini Lab CAGE stands for the Cap analysis of gene expression. Understanding of CAGE-seq is that it measures RNA expression and maps TSS in promoters to a single-nucleotide resolution. However, it only works on total mature RNA and detection is biased toward TSS of long-lived transcripts. We are using CAGE-seq TSS found across two repeats using IDR. The average peak was found across both repeats. Then these consensus TSS were overlapped with refTSS database (PMID: 31075273).
Step 2) With all three sequencing technologies we have filtered by: a) Genes with between 2 and 20 TSSs more than 300 between them were retained as candidates.
The gene of interest were identified if found in two out of the three technologies.

Step 3) Now that we have identified these genes of interest, they are then given True/False if present in three lists.
First, a list of nuclear proteins. Specifically, 106 transcription factors were identified (PMID: 29425488) and 24 chromatin remodellers.
Secondly, a manually curated list of genes that have AP usage.
Finally, differentially expressed genes found to switch exon usage from non-cancerous stem cells to cancerous stem cells in HMLER and HCC38. The DEGs for the cancerous stem cells were actually identified by differential exon usage with proActiv.
These are all summarised in ascending order of number of hits pivot_simple_TF_CHROM_HMM_HCC.txt
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from empiricaldist import Pmf
import pickle
from hmtl_scripts.alternative_promoter_func import func
from hmtl_scripts.alternative_promoter_func import downstream_func
from hmtl_scripts.alternative_promoter_func import prep
###Create the list of Manually Curated
published_results=func.ensembl_genename_search(df_location="./annotations/Grazi_Published_Articles.csv", genecolumn="Gene",type="csv")
manually_curated=list(published_results["ensembl.gene"])
###add tp53
manually_curated.append("ENSG00000141510")
with open("./output_files/manually_curated.txt", "wb") as fp:
pickle.dump(manually_curated, fp)
querying 1-39...done.
Finished.
39 input query terms found dup hits:
[('ADAR', 10), ('PTHLH', 10), ('SHC1', 10), ('LEF1', 10), ('NR3C1', 10), ('IKBKG', 10), ('NAT1', 10)
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
##This is the integration of PacBio TSS and Proactiv TSS and CAGE TSS
refTSSGENCODE=func.ensembl_genename_search(df_location="./annotations/refTSS_v3.1_human_annotation.txt", genecolumn="Gene_symbol",type="table")
# ##refTSS human coordinates used for overlaplongread_TSS_selected.bed
refTSS=pd.read_table("./annotations/refTSS_v3.1_human_coordinate.hg38.bed",header=None)
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-23551...done.
Finished.
20256 input query terms found dup hits:
[('nan', 2), ('MTND1P23', 2), ('MTND2P28', 2), ('MTCO1P12', 2), ('MTCO2P12', 2), ('MTATP8P1', 2), ('
977 input query terms found no hit:
['LINC00982', 'LOC102725193', 'CENPS CENPS-CORT', 'RNU1-1 RNU1-2 RNU1-3 RNU1-4 RNVU1-18', 'LOC400743
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
###First PacBio
##from the overlap paper Hien et al.
LONGREAD=pd.read_table("./annotations/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.curated.gff", delimiter="\t", header=None)
LONGREAD=prep.longread_prep(filename="./annotations/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.curated.gff")
##Merge the GeneCode Annotated refTSSID with the refTSS overlapped promoters in PacBio
refTSSLONGREAD=LONGREAD.merge(refTSSGENCODE, right_on="#refTSSID", left_on="refTSSID")
refTSSLONGREAD.to_csv("./output_files/refTSSLONGREAD.txt")
##PacBio due to the lack of read depth has not got a consistent way of calling Major/Minor promoters therefore ignored
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./output_files/refTSSLONGREAD.txt" --save_csvfilename "./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv" --upplim_promoter_number 10000 --PacBio True --checkTP53 True --endcoord_TSS_column "TSSend" --makeBED True --save_bedfilename "./visualisation_files/PacBio_refTSS_candidatelist_filtered_2_distance.bed" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0
os.system("rm -f ./unlifted.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS_hg38.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS_hg38_merged10.bed ./output_files/ANGEL_FINAL_nominus_noduplicate.curated.TSS_hg38_sorted.bed ./output_files/refTSSLONGREAD.txt ./output_files/longread_TSScurated_refTSS.bed")
If you read this line it means that you have provided all the parameters
Genes with between 2 and 10000 TSSs more than 0 between them were retained as candidates.
Unnamed: 0 chr TSSstart TSSend OverlappingTSS strand refTSSchr \
6456 6456 chr17 7675237 7675238 1 - chr17
refTSSstart refTSSend refTSSID ... Distance GeneID HGNC/MGI_ID \
6456 7675230 7675245 hg_164276.1 ... 112.0 7157 HGNC:11998
UniProt_ID Gene_name Gene_symbol Gene_synonyms Gene_source \
6456 Q19RW5 tumor protein p53 TP53 LFS1 p53 HGNC:11998
ensembl.gene _score
6456 ENSG00000141510 88.91277
[1 rows x 22 columns]
Number of Genes After Filtering For Promoter Number: 2098
25
Number of Genes After Filtering For Distance between TSS and AltTSS: 2095
PacBio Data so not deep enough coverage to confidendentally call Major/Minor promoters
Number of Genes After Filtering
for Strand Specific Upstream Major Promoter: 2095
ensembl.gene SD Unnamed: 0 chr TSSstart TSSend \
3421 ENSG00000141510 0.0 6456 chr17 7675237 7675238
OverlappingTSS strand refTSSchr refTSSstart ... HGNC/MGI_ID \
3421 1 - chr17 7675230 ... HGNC:11998
UniProt_ID Gene_name Gene_symbol Gene_synonyms Gene_source \
3421 Q19RW5 tumor protein p53 TP53 LFS1 p53 HGNC:11998
_score range MAJOR_MINOR diff
3421 88.91277 0 Major 0.0
[1 rows x 26 columns]
0
##ProActiv
proactiv_input = prep.pro_prep(filename="./output_files/proactivMCF7.csv")
proactiv_input.to_csv("./output_files/proactiv_input.txt")
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./output_files/proactiv_input.txt" --save_csvfilename "./output_files/proactiv_refTSS_candidatelist_filtered_2_distance.csv" --uniqueid_genecolumn "geneId_NoV" --TSSQuant_colummn "MCF7.mean" --uniqueid_TSScolumn "promoterId" --TSScoord_column "start" --Get_Ensembl_Gene_Names True --upplim_promoter_number 10000 --checkTP53 True --makeBED True --save_bedfilename "./visualisation_files/proactiv_refTSS_candidatelist_filtered_2_distance.bed" --chromosome_col "seqnames" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0
os.system("rm -f /output_files/proactiv_input.txt ./output_files/proactiv_bed.filteredts1_tsl2.sorted.bed ./output_files/proactiv_bed.sorted.bed ./output_files/proactiv_bed.bed")
(22783, 6)
(22783, 24)
If you read this line it means that you have provided all the parameters
Genes with between 2 and 10000 TSSs more than 0 between them were retained as candidates.
Unnamed: 0 seqnames start strand geneId promoterId \
9167 9167 chr17 7675493 - ENSG00000141510.18 35746
9168 9168 chr17 7675493 - ENSG00000141510.18 35746
9169 9169 chr17 7675493 - ENSG00000141510.18 35746
9170 9170 chr17 7675493 - ENSG00000141510.18 35746
9171 9171 chr17 7675493 - ENSG00000141510.18 35746
9172 9172 chr17 7675493 - ENSG00000141510.18 35746
9173 9173 chr17 7676594 - ENSG00000141510.18 35747
9174 9174 chr17 7676594 - ENSG00000141510.18 35747
9176 9176 chr17 7687538 - ENSG00000141510.18 35750
Unnamed: 0.1 internalPromoter promoterPosition MCF7.mean ... \
9167 35431 True 5 7.027062 ...
9168 35431 True 5 7.027062 ...
9169 35431 True 5 7.027062 ...
9170 35431 True 5 7.027062 ...
9171 35431 True 5 7.027062 ...
9172 35431 True 5 7.027062 ...
9173 35432 True 4 7.003880 ...
9174 35432 True 4 7.003880 ...
9176 35434 False 1 6.713692 ...
012SJ_normpromCount 000SJ_absPromAct 001SJ_absPromAct \
9167 118.370769 7.270845 6.911036
9168 118.370769 7.270845 6.911036
9169 118.370769 7.270845 6.911036
9170 118.370769 7.270845 6.911036
9171 118.370769 7.270845 6.911036
9172 118.370769 7.270845 6.911036
9173 98.509902 7.270845 7.104026
9174 98.509902 7.270845 7.104026
9176 84.210077 6.862031 6.866094
012SJ_absPromAct 000SJ_relProm 001SJ_relProm 012SJ_relProm \
9167 6.899306 0.3397 0.330970 0.345847
9168 6.899306 0.3397 0.330970 0.345847
9169 6.899306 0.3397 0.330970 0.345847
9170 6.899306 0.3397 0.330970 0.345847
9171 6.899306 0.3397 0.330970 0.345847
9172 6.899306 0.3397 0.330970 0.345847
9173 6.636768 0.3397 0.340212 0.332686
9174 6.636768 0.3397 0.340212 0.332686
9176 6.412952 0.3206 0.328818 0.321467
absPromMean relPromMean geneId_NoV
9167 7.027062 0.338839 ENSG00000141510
9168 7.027062 0.338839 ENSG00000141510
9169 7.027062 0.338839 ENSG00000141510
9170 7.027062 0.338839 ENSG00000141510
9171 7.027062 0.338839 ENSG00000141510
9172 7.027062 0.338839 ENSG00000141510
9173 7.003880 0.337533 ENSG00000141510
9174 7.003880 0.337533 ENSG00000141510
9176 6.713692 0.323628 ENSG00000141510
[9 rows x 26 columns]
Number of Genes After Filtering For Promoter Number: 5637
34
Number of Genes After Filtering For Distance between TSS and AltTSS: 5637
tst
querying 1-1000.../Users/helenking/Desktop/Weatheritt_Lab/alternative-promoter-identification/hmtl_scripts/unique_close_AP.py:170: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
res["Minor_Downstream"][res[uniqueid_genecolumn].isin(select_genes_minor_downstream)] =True
done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-7412...done.
name symbol Ensembl \
3982 tumor protein p53 TP53 ENSG00000141510
3983 tumor protein p53 TP53 ENSG00000141510
3984 tumor protein p53 TP53 ENSG00000141510
3985 tumor protein p53 TP53 ENSG00000141510
3986 tumor protein p53 TP53 ENSG00000141510
3987 tumor protein p53 TP53 ENSG00000141510
3988 tumor protein p53 TP53 ENSG00000141510
3989 tumor protein p53 TP53 ENSG00000141510
3990 tumor protein p53 TP53 ENSG00000141510
ensembl.transcript geneId_NoV \
3982 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3983 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3984 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3985 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3986 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3987 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3988 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3989 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
3990 ENST00000269305 ENST00000359597 ENST0000041346... ENSG00000141510
SD Unnamed: 0 seqnames start strand ... 000SJ_relProm \
3982 0.097123 9167 chr17 7675493 - ... 0.3397
3983 0.097123 9168 chr17 7675493 - ... 0.3397
3984 0.097123 9169 chr17 7675493 - ... 0.3397
3985 0.097123 9170 chr17 7675493 - ... 0.3397
3986 0.097123 9171 chr17 7675493 - ... 0.3397
3987 0.097123 9172 chr17 7675493 - ... 0.3397
3988 0.097123 9173 chr17 7676594 - ... 0.3397
3989 0.097123 9174 chr17 7676594 - ... 0.3397
3990 0.097123 9176 chr17 7687538 - ... 0.3206
001SJ_relProm 012SJ_relProm absPromMean relPromMean range \
3982 0.330970 0.345847 7.027062 0.338839 12045
3983 0.330970 0.345847 7.027062 0.338839 12045
3984 0.330970 0.345847 7.027062 0.338839 12045
3985 0.330970 0.345847 7.027062 0.338839 12045
3986 0.330970 0.345847 7.027062 0.338839 12045
3987 0.330970 0.345847 7.027062 0.338839 12045
3988 0.340212 0.332686 7.003880 0.337533 12045
3989 0.340212 0.332686 7.003880 0.337533 12045
3990 0.328818 0.321467 6.713692 0.323628 12045
MAJOR_MINOR diff Minor_Downstream geneID_wo
3982 Major 0.0 True ENSG00000141510
3983 Major 0.0 True ENSG00000141510
3984 Major 0.0 True ENSG00000141510
3985 Major 0.0 True ENSG00000141510
3986 Major 0.0 True ENSG00000141510
3987 Major 0.0 True ENSG00000141510
3988 Minor 0.0 True ENSG00000141510
3989 Minor 1101.0 True ENSG00000141510
3990 Minor 10944.0 True ENSG00000141510
[9 rows x 36 columns]
0
CAGGEEEE For the MCF-7 combined peaks: https://www.encodeproject.org/files/ENCFF917XEM/@@download/ENCFF917XEM.bed.gz
Using Human Genes GRCh38.p13 hg38_genes using Ensembl web browser w/ attributes Chromosome/scaffold name Gene start (bp) Gene end (bp) Gene stable ID version Karyotype band Strand Source (gene) Gene type Transcription start site (TSS)
os.system("bedtools intersect -wa -wb -s -b ./annotations/refTSS_v3.1_human_coordinate.hg38.sorted.bed -a ./annotations/ENCFF917XEM.bed> ./output_files/ENCFF917XEM_refTSS.bed")
CAGE=prep.cage_prep(filename="./output_files/ENCFF917XEM_refTSS.bed")
CAGErefTSS=CAGE.merge(refTSSGENCODE, right_on="#refTSSID", left_on="refTSSID")
CAGErefTSS.to_csv("./output_files/CAGErefTSS.txt")
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./output_files/CAGErefTSS.txt" --save_csvfilename "./output_files/CAGE_candidatelist_filtered_2_distance.csv" --TSScoord_column 'start' --TSSQuant_colummn 'avgpeak' --upplim_promoter_number 10000 --Manually_Curated "./output_files/manually_curated.txt" --checkTP53 True --makeBED True --endcoord_TSS_column "end" --save_bedfilename "./visualisation_files/CAGE_candidatelist_filtered_2_distance.bed" --PacBio True --distance_between_TSS 0
os.system("rm -f ./output_files/CAGErefTSS.txt ./output_files/ENCFF917XEM_refTSS.bed")
If you read this line it means that you have provided all the parameters Genes with between 2 and 10000 TSSs more than 0 between them were retained as candidates. Empty DataFrame Columns: [Unnamed: 0, chr, start, end, strand, pVal, qVal, peak1, peak2, refTSSchr, refTSSstart, refTSSend, refTSSID, avgpeak, #refTSSID, Transcript_name, Distance, GeneID, HGNC/MGI_ID, UniProt_ID, Gene_name, Gene_symbol, Gene_synonyms, Gene_source, ensembl.gene, _score] Index: [] [0 rows x 26 columns] Number of Genes After Filtering For Promoter Number: 1399 11 Number of Genes After Filtering For Distance between TSS and AltTSS: 311 PacBio Data so not deep enough coverage to confidendentally call Major/Minor promoters Number of Genes After Filtering for Strand Specific Upstream Major Promoter: 311 Empty DataFrame Columns: [ensembl.gene, SD, Unnamed: 0, chr, start, end, strand, pVal, qVal, peak1, peak2, refTSSchr, refTSSstart, refTSSend, refTSSID, avgpeak, #refTSSID, Transcript_name, Distance, GeneID, HGNC/MGI_ID, UniProt_ID, Gene_name, Gene_symbol, Gene_synonyms, Gene_source, _score, range, MAJOR_MINOR, diff] Index: [] [0 rows x 30 columns]
0
func.merge_three_df_venn(dataframe1="./output_files/CAGE_candidatelist_filtered_2_distance.csv",dataframe2="./output_files/proactiv_refTSS_candidatelist_filtered_2_distance.csv",dataframe3="./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv",symbolcolumn2="Ensembl",output_folder="./merged_tables/")
##Write pivot tables for all
###Proactiv
pro_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['promoterId','symbol', 'seqnames','start_y', 'internalPromoter',"MAJOR_MINOR_y","Ensembl","strand_y"], values_list1=["absPromMean", "relPromMean"], dataframe2="./merged_tables/proactivPacBio.csv", index_list2=['promoterId','symbol','seqnames','start','internalPromoter',"MAJOR_MINOR_x","Ensembl","strand_x"], values_list2=["absPromMean", "relPromMean"], name="proActiv", columnnames=['proActiv_promoterId', 'Gene_symbol', 'proActiv_seqnames', 'proActiv_start', 'proActiv_internalPromoter','proActiv_MAJOR_MINOR',"Ensembl_ID","strand",'proActiv_absPromMean', 'proActiv_relPromMean'], reorderedcolumns=['proActiv_promoterId', 'Gene_symbol', 'Ensembl_ID' , 'proActiv_seqnames', 'proActiv_start','proActiv_end', "strand",'proActiv_internalPromoter', 'proActiv_MAJOR_MINOR', 'proActiv_absPromMean', 'proActiv_relPromMean','proActiv'], noendcoord=True , save_txt="./candidate_AP_output/propivot_table.txt" )
###CAGE
cage_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['refTSSID','Gene_symbol', 'chr',"start_x",'end',"MAJOR_MINOR_x","ensembl.gene","strand_x"], values_list1=["avgpeak"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['refTSSID_y','Gene_symbol_y','chr_y','start','end',"MAJOR_MINOR_y","ensembl.gene","strand_y"], values_list2=["avgpeak"], name="CAGE", columnnames=['CAGE_refTSSID', 'Gene_symbol', 'CAGE_chr', 'CAGE_start','CAGE_end', 'CAGE_MAJOR_MINOR', 'Ensembl_ID',"strand",'CAGE_avgpeak'], reorderedcolumns=['CAGE_refTSSID', 'Gene_symbol', 'Ensembl_ID','CAGE_chr', 'CAGE_start','CAGE_end',"strand", 'CAGE_MAJOR_MINOR','CAGE_avgpeak',"CAGE"], noendcoord=False, save_txt="./candidate_AP_output/cagepivot_table.txt" )
###Pacbio
pac_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/proactivPacBio.csv", index_list1=['refTSSID','Gene_symbol', 'chr','TSSstart','TSSend',"MAJOR_MINOR_y","ensembl.gene","strand_y"], values_list1=["OverlappingTSS"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['refTSSID_x','Gene_symbol_x','chr_x','TSSstart','TSSend',"MAJOR_MINOR_x","ensembl.gene","strand_x"], values_list2=["OverlappingTSS"], name="PacBio", columnnames=['PacBio_refTSSID', 'Gene_symbol', 'PacBio_chr', 'PacBio_start','PacBio_end', 'PacBio_MAJOR_MINOR', "Ensembl_ID","strand",'PacBio_OverlappingTSS'], reorderedcolumns=['PacBio_refTSSID', 'Gene_symbol','Ensembl_ID', 'PacBio_chr', 'PacBio_start','PacBio_end', "strand", 'PacBio_MAJOR_MINOR','PacBio_OverlappingTSS','PacBio'], noendcoord=False, save_txt="./candidate_AP_output/pacpivot_table.txt" )
all_pivot=pro_pivot.merge(cage_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer").merge(pac_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer")
###This is the simple pivot table of genes ONLY
all_pivot_simple=all_pivot[["Ensembl_ID",'Gene_symbol' ,'proActiv', 'CAGE', 'PacBio']].drop_duplicates()
all_pivot_simple[all_pivot_simple.isna()] = False
pro_pivot.drop(columns= ["proActiv_relPromMean","proActiv","proActiv_internalPromoter"],axis=0,inplace=True)
pro_pivot["Source"]="proActiv"
cage_pivot.drop(columns= ["CAGE"],axis=0,inplace=True)
cage_pivot["Source"]="CAGE"
pac_pivot.drop(columns= ["PacBio"],axis=0,inplace=True)
pac_pivot["Source"]="PacBio"
##This is the pivot table that will be used further downstream as it has every single TSS called across platforms
compiled_pivot= pd.DataFrame( np.concatenate((pro_pivot.values,cage_pivot.values,pac_pivot.values) , axis=0 ))
compiled_pivot.columns=['TSS_ID','Gene_symbol','Ensembl_ID' , 'chr', 'TSSstart','TSSend' ,"strand",'MAJOR_MINOR' ,'Depth',"Source"]
###Make a file where the promoters are clustered together
compiled_pivot_bed=compiled_pivot[["chr","TSSstart","TSSstart","Gene_symbol","TSS_ID","strand","Ensembl_ID"]]
compiled_pivot_bed.to_csv("./output_files/compiled_pivot.bed", header=None, index=None, sep="\t")
###START W CANONICAL ATG ANALYSIS WITH ROBS ATG START CODONS
# I'd figure out where the "canonical" ATG is. If promoters upstream of this ATG, it is a fair assumption that both will use that ATG. If minor one is downstream, it probably has a different N-terminal. That would be most basic approach (edited)
nterminus_change_merge=downstream_func.find_nterminuschange()
Empty DataFrame
Columns: [chr, TSS_start, TSS_end, NumberTSS, Gene_symbol, TSS_ID, strand, Ensembl_ID]
Index: []
chr TSS_start TSS_end NumberTSS Gene_symbol \
0 chr1 1312700 1312700 1 INTS11
1 chr1 1315697 1315697 1 INTS11
2 chr1 1324618 1324688 4 INTS11
3 chr1 1754896 1754896 1 NADK
4 chr1 1756327 1756327 1 NADK
TSS_ID strand Ensembl_ID
0 26185 - ENSG00000127054
1 26186 - ENSG00000127054
2 hg_9754.1,hg_9755.1,hg_9755.1,26190 - ENSG00000127054
3 1044 - ENSG00000008130
4 1045 - ENSG00000008130
(2081, 8)
Index(['chr', 'TSS_start', 'TSS_end', 'NumberTSS', 'Gene_symbol', 'TSS_ID',
'strand', 'Ensembl_ID', 'Range_ClusteredTSS', 'start_SC', 'end_SC'],
dtype='object')
The percentage of canonical ATG downstream of promoters: 53.7 %
/Users/helenking/Desktop/Weatheritt_Lab/alternative-promoter-identification/hmtl_scripts/alternative_promoter_func/downstream_func.py:58: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
compiled_pivot_wstrand_ATG["Canonical_ATG_Downsteam"][((compiled_pivot_wstrand_ATG["strand"]=="+") & ( (compiled_pivot_wstrand_ATG["TSS_start"] -100 ) < compiled_pivot_wstrand_ATG["start_SC"] )) | ((compiled_pivot_wstrand_ATG["strand"]=="-") & ( (compiled_pivot_wstrand_ATG["TSS_start"]+100) > compiled_pivot_wstrand_ATG["start_SC"] ))] = "True"
/Users/helenking/Desktop/Weatheritt_Lab/alternative-promoter-identification/hmtl_scripts/alternative_promoter_func/downstream_func.py:89: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
nterminus_change["Nterminus_Change"][nterminus_change["False"]==0] = "False"
####Create a dataframe for import into flashfry
###Get the dataframe that shows the entire list of
##Change the bed file so it is just the TSS
all_pivot_simple_nterm=all_pivot_simple.merge(nterminus_change_merge, left_on="Ensembl_ID", right_on="Ensembl_ID", right_index=True)
compiled_pivot_flashfry=compiled_pivot[['chr', 'TSSstart', 'TSSend', "TSS_ID", 'Gene_symbol','strand','Ensembl_ID','MAJOR_MINOR','Source']]
compiled_pivot_flashfry=compiled_pivot_flashfry[compiled_pivot_flashfry["Ensembl_ID"].isin(all_pivot_simple["Ensembl_ID"])]
compiled_pivot_flashfry.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/compiled_pivot_all.bed",header=None, sep="\t",index=None)
all_pivot_simple_nterm.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/all_pivot_simple_nterm.txt",sep="\t",index=None)
compiled_pivot_flashfry_vis=compiled_pivot_flashfry[['chr', 'TSSstart', 'TSSend', 'Source', 'Gene_symbol','strand']]
compiled_pivot_flashfry_vis.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/compiled_pivot_all_vis.bed",header=None, sep="\t",index=None)
Download Original Report: 1A Integration Report (HTML)