Identify Target Region for Dual Guides
Pipeline for identifying optimal genomic regions for dual guide targeting of alternative promoters.
Integration of PacBio, TSS and Proactiv Data¶
We are identifying alternative promoter candidates to be used within the CRISPRi screen. After many weeks of trying every type of sequencing technology that could identify TSS/promoters for MCF-7 cell line.
Step 1) Identify the alternative TSS usage in 3 technologies- CAGE-seq, proActiv and PacBio.
Step 2) Filter the TSS based on proximity to each other, number and if the major promoter was upstream of the minor promoter.
Step 3) Prioritising the genes of interest.
Step 1) The three technologies used are as following (as well as a short description about how a TSS is called for each technology): a. PacBio Long-Read RNA-seq data From MCF-7 Cell-line. Hoen et al. PMID: 29598823 PacBio transcript sequences are actually a consensus sequence generated from a combination of full-length reads with isoform clustering algorithm (ICE) and then with partial reads. After benchmarking PacBio, we found it was as accurate as CAGE-seq at calling TSS but identified more TP TSS. This is because CAGE-seq is very conservative in calling TSS.
First, the TSS locus is defined as the first strand-specific nucleotide of each read. Then, counting the TSS called with 10 nucleotides of others to find the consensus locus and depth of coverage. Finally these TSS were overlapped with the refTSS database (PMID: 31075273). Note that PacBio has particularly low coverage relative to short read sequencing.
b. proActiv In-House RNA-seq from MCF-7 Cell Line proActiv is a purpose-built package that identifies alternative promoters and their activity. proActiv uses STAR junction files, and an annotation gtf file of the GENCODE v36 genome. The output- is the most annoying SummarisedExperiment dataframe.
Identified promoters from annotations. Created weighted splicing graphs splicing graphs that capture all the splice variants of one gene in one data structure. Calculate promoter activity estimates for uniquely identified promoters. We are not filtering out single exon promoters and promoters that uses an internal intron. Read counts are quantified for absolute and relative promoter activities using either split reads ration method or junction reads. They divided promoters into 3 different categories major, minor and inactive promoters. The highest activity for each gene across the sample cohort is the major promoters. Promoters with average activity less than 0.25 = inactive, and the remaining mean other promoters. The pipeline can be found on proActiv.R.
The mean of the absolute promoter activity and relative promoter activity were calculated across all three repeats of the MCF-7 cell line.
c. CAGE-seq - IDR of ENCODE MCF-7 Cell Line from the Carnicini Lab CAGE stands for the Cap analysis of gene expression. Understanding of CAGE-seq is that it measures RNA expression and maps TSS in promoters to a single-nucleotide resolution. However, it only works on total mature RNA and detection is biased toward TSS of long-lived transcripts. We are using CAGE-seq TSS found across two repeats using IDR. The average peak was found across both repeats. Then these consensus TSS were overlapped with refTSS database (PMID: 31075273).
Step 2) With all three sequencing technologies we have filtered by: a) Genes with between 2 and 20 TSSs more than 300 between them were retained as candidates.
The gene of interest were identified if found in two out of the three technologies.

Step 3) Now that we have identified these genes of interest, they are then given True/False if present in three lists.
First, a list of nuclear proteins. Specifically, 106 transcription factors were identified (PMID: 29425488) and 24 chromatin remodellers.
Secondly, a manually curated list of genes that have AP usage.
Finally, differentially expressed genes found to switch exon usage from non-cancerous stem cells to cancerous stem cells in HMLER and HCC38. The DEGs for the cancerous stem cells were actually identified by differential exon usage with proActiv.
These are all summarised in ascending order of number of hits pivot_simple_TF_CHROM_HMM_HCC.txt
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from empiricaldist import Pmf
import pickle
from hmtl_scripts.alternative_promoter_func import func
from hmtl_scripts.alternative_promoter_func import downstream_func
from hmtl_scripts.alternative_promoter_func import prep
###Create the list of Manually Curated
published_results=prep.ensembl_genename_search(df_location="./annotations/TF_CHROM_CURATED/Grazi_Published_Articles.csv", genecolumn="Gene",type="csv")
manually_curated=list(published_results["ensembl.gene"])
###add tp53
manually_curated.append("ENSG00000141510")
manually_curated.append("ENSG00000091831")
with open("./output_files/manually_curated.txt", "wb") as fp:
pickle.dump(manually_curated, fp)
querying 1-39...done.
Finished.
39 input query terms found dup hits:
[('ADAR', 10), ('PTHLH', 10), ('SHC1', 10), ('LEF1', 10), ('NR3C1', 10), ('IKBKG', 10), ('NAT1', 10)
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
###Both of the reference files used are in bed format and cover both the EPD and refTSS reference promoters with 200bp.
##The bed format is <chr:start:end:
##from the overlap paper Hien et al.
LONGREAD=prep.longread_prep(filename="./annotations/LONGREAD/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.curated.gff",cluster=False,reference="EPD_refTSS",distancefromref=100, annotation_bed="./annotations/EPD/Hs_EPDnew_006_hg38_tab.ensembl.gene.bed", second_reference_file="./annotations/refTSS/refTSS_v3.1_human_coordinate.hg38.8.sorted.bed")
LONGREAD.to_csv("./annotations/LONGREAD/MCF7_2015.ANGEL_FINAL_nominus_noduplicate_EPD.csv",index=None )
##PacBio due to the lack of read depth has not got a consistent way of calling Major/Minor promoters therefore ignore the Major/minor calling
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./annotations/LONGREAD/MCF7_2015.ANGEL_FINAL_nominus_noduplicate_EPD.csv" --save_csvfilename "./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv" --PacBio True --checkTP53 True --endcoord_TSS_column "TSSend" --makeBED True --save_bedfilename "./visualisation_files/PacBio_refTSS_candidatelist_filtered_2_distance.bed" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0 --uniqueid_TSScolumn "ref_TSSID" --uniqueid_genecolumn "Ensembl_ID"
translate bed file to TSS
convert hg19. to hg38
chr1 29338 29339 transcript - -.1 1 chr1.1 29346 \
0 chr1 484071 484072 transcript + - 1 chr1 778605
1 chr1 727972 727973 transcript - - 1 chr1 778605
2 chr1 728308 728309 transcript - - 1 chr1 778605
3 chr1 728472 728473 transcript - - 1 chr1 778605
4 chr1 827503 827504 transcript - - 1 chr1 827505
... ... ... ... ... .. .. .. ... ...
74839 chrY 18149738 18149739 transcript - - 1 chrY 17880158
74840 chrY 18992792 18992793 transcript - - 0 chrY 18992808
74841 chrY 18992792 18992793 transcript - - 0 chrY 18992808
74842 chrY 18992794 18992795 transcript - - 0 chrY 18992808
74843 chrY 18992796 18992797 transcript - - 0 chrY 18992808
29366 hg_234684.1 ENSG00000227232 -.2 WASH7P 8
0 778674 hg_9665.1 NaN - NaN 294534
1 778674 hg_9665.1 NaN - NaN 50633
2 778674 hg_9665.1 NaN - NaN 50297
3 778674 hg_9665.1 NaN - NaN 50133
4 827525 hg_9666.1 ENSG00000225880 - LINC00115 2
... ... ... ... .. ... ...
74839 17880159 hg_244124.1 ENSG00000129873 - CDY2B 269580
74840 18992868 CD24_1 ENSG00000272398 - CD24 16
74841 18992868 CD24_1 ENSG00000272398 - CD24 16
74842 18992868 CD24_1 ENSG00000272398 - CD24 14
74843 18992868 CD24_1 ENSG00000272398 - CD24 12
[74844 rows x 15 columns]
find depth
index OverlappingTSS chr TSSstart TSSend strand ref_TSSchr \
22328 30111 1 chr17 7675237 7675238 - chr17
22329 30112 1 chr17 7687542 7687543 - chr17
ref_TSSstart ref_TSSend ref_TSSID Ensembl_ID Gene_symbol \
22328 7675230 7675245 hg_164276.1 ENSG00000141510 TP53
22329 7687476 7687536 TP53_1 ENSG00000141510 TP53
distance
22328 0
22329 7
Element exists in Dataframe
If you read this line it means that you have provided all the parameters
Genes with between 2 and 1000 TSSs more than 0 between them were retained as candidates.
index OverlappingTSS chr TSSstart TSSend strand ref_TSSchr \
22328 30111 1 chr17 7675237 7675238 - chr17
22329 30112 1 chr17 7687542 7687543 - chr17
ref_TSSstart ref_TSSend ref_TSSID Ensembl_ID Gene_symbol \
22328 7675230 7675245 hg_164276.1 ENSG00000141510 TP53
22329 7687476 7687536 TP53_1 ENSG00000141510 TP53
distance
22328 0
22329 7
Number of Genes After Filtering For Promoter Number: 6658
Number of Genes After Filtering For Distance between TSS and AltTSS: 4977
PacBio Data so not deep enough coverage to confidendentally call Major/Minor promoters
Number of Genes After Filtering
for Strand Specific Upstream Major Promoter: 4977
Ensembl_ID SD index OverlappingTSS chr TSSstart TSSend \
23357 ENSG00000141510 0.0 30111 1 chr17 7675237 7675238
23358 ENSG00000141510 0.0 30112 1 chr17 7687542 7687543
strand ref_TSSchr ref_TSSstart ref_TSSend ref_TSSID Gene_symbol \
23357 - chr17 7675230 7675245 hg_164276.1 TP53
23358 - chr17 7687476 7687536 TP53_1 TP53
distance range MAJOR_MINOR diff
23357 0 12305 Major 0.0
23358 7 12305 Major 12305.0
##ProActiv
PROACTIV = prep.pro_prep(filename="./annotations/PROACTIV/proactivMCF7.csv",distancebetweenref=100,reference="EPD_refTSS")
PROACTIV.to_csv("./annotations/PROACTIV/proactivMCF7_input.csv",index=None)
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./annotations/PROACTIV/proactivMCF7_input.csv" --save_csvfilename "./annotations/PROACTIV/proactiv_refTSS_candidatelist_filtered_2_distance.csv" --uniqueid_genecolumn "geneId_NoV" --TSSQuant_colummn "MCF7.mean" --uniqueid_TSScolumn "promoterId" --TSScoord_column "start" --upplim_promoter_number 10000 --makeBED True --save_bedfilename "./visualisation_files/proactiv_refTSS_candidatelist_filtered_2_distance.bed" --chromosome_col "seqnames" --Manually_Curated "./output_files/manually_curated.txt" --distance_between_TSS 0 --checkTP53 True
os.system("rm -f /output_files/proactiv_input.txt ./output_files/proactiv_bed.filteredts1_tsl2.sorted.bed ./output_files/proactiv_bed.sorted.bed ./output_files/proactiv_bed.bed")
(99380, 21)
closestBed -s -d -mdb all -b ./annotations/refTSS/refTSS_v3.1_human_coordinate.hg38.8.sorted.bed ./annotations/EPD/Hs_EPDnew_006_hg38_tab.ensembl.gene.bed -a ./annotations/PROACTIV/proactivMCF7.TSS.sorted.bed > ./annotations/PROACTIV/proactivMCF7EPD_refTSS.TSS.sorted.bed
seqnames start geneId promoterId strand ref_TSSID \
0 chr1 360057 ENSG00000236601.2 89982 + hg_1.1
1 chr1 827598 ENSG00000228794.10 82866 + hg_16.1
2 chr1 851348 ENSG00000228794.10 82868 + hg_18.1
3 chr1 923928 ENSG00000187634.12 64499 + hg_24.1
4 chr1 930312 ENSG00000187634.12 64502 + hg_28.1
... ... ... ... ... ... ...
37124 chrX 155612952 ENSG00000185973.12 63412 - TMLHE_1
37125 chrX 155881345 ENSG00000124333.16 24904 + VAMP7_1
37126 chrX 155881345 ENSG00000124333.16 24904 + hg_198225.1
37127 chrX 156021328 ENSG00000182484.15 60803 + hg_243873.1
37128 chrY 11215026 ENSG00000278212.2 116488 - hg_201254.1
Gene_symbol distance
0 NaN 269134
1 LINC01128 55
2 LINC01128 19753
3 SAMD11 0
4 SAMD11 4556
... ... ...
37124 TMLHE 0
37125 VAMP7 0
37126 VAMP7 0
37127 NaN 2717
37128 NaN 1206542
[37129 rows x 8 columns]
(37129, 8)
(18896, 8)
If you read this line it means that you have provided all the parameters
Genes with between 2 and 10000 TSSs more than 0 between them were retained as candidates.
seqnames start geneId promoterId strand ref_TSSID \
7591 chr17 7676594 ENSG00000141510.18 35747 - hg_221839.1
7592 chr17 7687538 ENSG00000141510.18 35750 - TP53_1
Gene_symbol distance internalPromoter promoterPosition ... \
7591 TP53 28 True 4 ...
7592 TP53 2 False 1 ...
012SJ_normpromCount 000SJ_absPromAct 001SJ_absPromAct \
7591 98.509902 7.270845 7.104026
7592 84.210077 6.862031 6.866094
012SJ_absPromAct 000SJ_relProm 001SJ_relProm 012SJ_relProm \
7591 6.636768 0.3397 0.340212 0.332686
7592 6.412952 0.3206 0.328818 0.321467
absPromMean relPromMean geneId_NoV
7591 7.003880 0.337533 ENSG00000141510
7592 6.713692 0.323628 ENSG00000141510
[2 rows x 27 columns]
Number of Genes After Filtering For Promoter Number: 2284
Number of Genes After Filtering For Distance between TSS and AltTSS: 2284
geneId_NoV SD seqnames start geneId \
3590 ENSG00000141510 0.145094 chr17 7676594 ENSG00000141510.18
3591 ENSG00000141510 0.145094 chr17 7687538 ENSG00000141510.18
promoterId strand ref_TSSID Gene_symbol distance ... \
3590 35747 - hg_221839.1 TP53 28 ...
3591 35750 - TP53_1 TP53 2 ...
012SJ_absPromAct 000SJ_relProm 001SJ_relProm 012SJ_relProm \
3590 6.636768 0.3397 0.340212 0.332686
3591 6.412952 0.3206 0.328818 0.321467
absPromMean relPromMean range MAJOR_MINOR diff Minor_Downstream
3590 7.003880 0.337533 10944 Major 0.0 False
3591 6.713692 0.323628 10944 Minor 10944.0 False
[2 rows x 32 columns]
0
CAGGEEEE For the MCF-7 combined peaks: https://www.encodeproject.org/files/ENCFF917XEM/@@download/ENCFF917XEM.bed.gz
Using Human Genes GRCh38.p13 hg38_genes using Ensembl web browser w/ attributes Chromosome/scaffold name Gene start (bp) Gene end (bp) Gene stable ID version Karyotype band Strand Source (gene) Gene type Transcription start site (TSS)
CAGE=prep.cage_prep("./annotations/CAGE/ENCFF917XEM.bed",distancefromref=100,reference="EPD_refTSS",annotation_bed="./annotations/EPD/Hs_EPDnew_006_hg38_tab.ensembl.bed")
CAGE.to_csv("./annotations/CAGE/ENCFF917XEM_EPD.csv")
%run ./hmtl_scripts/unique_close_AP.py --dataframe_input "./annotations/CAGE/ENCFF917XEM_EPD.csv" --save_csvfilename "./output_files/CAGE_candidatelist_filtered_2_distance.csv" --TSScoord_column 'start' --TSSQuant_colummn 'avgpeak' --upplim_promoter_number 10000 --Manually_Curated "./output_files/manually_curated.txt" --makeBED True --endcoord_TSS_column "end" --save_bedfilename "./visualisation_files/CAGE_candidatelist_filtered_2_distance.bed" --distance_between_TSS 0 --uniqueid_TSScolumn "ref_TSSID" --uniqueid_genecolumn "ensembl_id"
os.system("rm -f ./output_files/CAGErefTSS.txt ./output_files/ENCFF917XEM_refTSS.bed")
./annotations/CAGE/ENCFF917XEM.sorted.bed ./annotations/CAGE/ENCFF917XEM.ref.sorted.bed
chr start end strand pVal qVal peak1 peak2 \
0 chr1 634031 634069 - 2.54 3.11 175.0 132.0
1 chr1 778582 778688 - 1.94 2.61 107.0 60.0
2 chr1 827496 827770 - 1.16 2.01 49.0 35.0
3 chr1 1013470 1013549 + 5.00 5.00 3872.0 3043.0
4 chr1 1013470 1013549 + 5.00 5.00 3872.0 3043.0
... ... ... ... ... ... ... ... ...
10938 chrX 154762851 154762946 + 2.90 3.42 310.0 177.0
10939 chrX 154762851 154762946 + 2.90 3.42 310.0 177.0
10940 chrX 154762851 154762946 + 2.90 3.42 310.0 177.0
10941 chrX 155216444 155216541 + 0.84 1.76 75.0 26.0
10942 chrX 155216444 155216541 + 0.84 1.76 75.0 26.0
ref_TSSchr ref_TSSstart ref_TSSend ref_TSSID ensembl_id \
0 chr1 778605 778674 hg_9665.1 NaN
1 chr1 778605 778674 hg_9665.1 NaN
2 chr1 827505 827525 hg_9666.1 ENSG00000225880
3 chr1 1013447 1013507 ISG15_1 ENSG00000187608
4 chr1 1013486 1013505 hg_47.1 ENSG00000187608
... ... ... ... ... ...
10938 chrX 154762813 154762893 hg_198210.1 ENSG00000130826
10939 chrX 154762825 154762885 DKC1_1 ENSG00000130826
10940 chrX 154762900 154762951 hg_198211.1 ENSG00000130826
10941 chrX 155216410 155216470 VBP1_1 ENSG00000155959
10942 chrX 155216439 155216484 hg_198223.1 ENSG00000155959
Gene_symbol distance
0 NaN 144537
1 NaN 0
2 LINC00115 0
3 1013496 1013507
4 ISG15 0
... ... ...
10938 DKC1 0
10939 154762874 154762885
10940 DKC1 0
10941 155216459 155216470
10942 VBP1 0
[10943 rows x 15 columns]
If you read this line it means that you have provided all the parameters
Genes with between 2 and 10000 TSSs more than 0 between them were retained as candidates.
Empty DataFrame
Columns: [Unnamed: 0, chr, start, end, strand, pVal, qVal, peak1, peak2, ref_TSSchr, ref_TSSstart, ref_TSSend, ref_TSSID, ensembl_id, Gene_symbol, distance, avgpeak]
Index: []
Number of Genes After Filtering For Promoter Number: 1424
Number of Genes After Filtering For Distance between TSS and AltTSS: 347
0
func.merge_three_df_venn(dataframe1="./output_files/CAGE_candidatelist_filtered_2_distance.csv",dataframe2="./output_files/proactiv_refTSS_candidatelist_filtered_2_distance.csv",dataframe3="./output_files/PacBio_refTSS_candidatelist_filtered_2_distance.csv",symbolcolumn3="Ensembl_ID",symbolcolumn2="geneId_NoV",symbolcolumn1="ensembl_id",output_folder="./merged_tables/")
ensembl_id SD Unnamed: 0 chr start end strand \ 0 ENSG00000003056 28.284271 2036 chr12 8949393 8949426 - 1 ENSG00000003056 28.284271 2037 chr12 8949583 8949640 - 2 ENSG00000003056 28.284271 2038 chr12 8949583 8949640 - 3 ENSG00000008083 156.691566 8749 chr6 15245446 15245711 + 4 ENSG00000008083 156.691566 8751 chr6 15245446 15245711 + pVal qVal peak1 ... ref_TSSstart ref_TSSend ref_TSSID Gene_symbol \ 0 -0.00 0.53 17.0 ... 8949402 8949417 hg_129378.1 M6PR 1 1.69 2.42 107.0 ... 8949597 8949615 hg_129379.1 M6PR 2 1.69 2.42 107.0 ... 8949628 8949674 hg_129380.1 M6PR 3 1.12 1.97 47.0 ... 15245446 15245492 hg_66039.1 JARID2 4 1.12 1.97 47.0 ... 15245496 15245515 hg_66040.1 JARID2 distance avgpeak range MAJOR_MINOR diff Minor_Downstream 0 0 16.0 190 Minor 0.0 True 1 0 76.0 190 Major 190.0 True 2 0 76.0 190 Major 0.0 True 3 0 47.0 3448 Minor 0.0 True 4 0 47.0 3448 Minor 0.0 True [5 rows x 22 columns]
./merged_tables/ CAGE proactiv .csv ./merged_tables/ proactiv PacBio .csv ./merged_tables/ CAGE PacBio .csv ./merged_tables/ CAGE proactiv PacBio .csv
##Write pivot tables for all
###Proactiv
pro_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['promoterId','Gene_symbol_y', 'seqnames','start_y', 'internalPromoter',"MAJOR_MINOR_y","geneId_NoV","strand_y"], values_list1=["absPromMean", "relPromMean"], dataframe2="./merged_tables/proactivPacBio.csv", index_list2=['promoterId','Gene_symbol_x','seqnames','start','internalPromoter',"MAJOR_MINOR_x","Ensembl_ID","strand_x"], values_list2=["absPromMean", "relPromMean"], name="proActiv", columnnames=['proActiv_promoterId', 'Gene_symbol', 'proActiv_seqnames', 'proActiv_start', 'proActiv_internalPromoter','proActiv_MAJOR_MINOR',"Ensembl_ID","strand",'proActiv_absPromMean', 'proActiv_relPromMean'], reorderedcolumns=['proActiv_promoterId', 'Gene_symbol', 'Ensembl_ID' , 'proActiv_seqnames', 'proActiv_start','proActiv_end', "strand",'proActiv_internalPromoter', 'proActiv_MAJOR_MINOR', 'proActiv_absPromMean', 'proActiv_relPromMean','proActiv'], noendcoord=True , save_txt="./candidate_AP_output/propivot_table.txt" )
###CAGE
cage_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/CAGEproactiv.csv", index_list1=['ref_TSSID_x','Gene_symbol_x', 'chr',"start_x",'end',"MAJOR_MINOR_x","ensembl_id","strand_x"], values_list1=["avgpeak"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['ref_TSSID_y','Gene_symbol_y','chr_y','start','end',"MAJOR_MINOR_y","ensembl_id","strand_y"], values_list2=["avgpeak"], name="CAGE", columnnames=['CAGE_TSSID', 'Gene_symbol', 'CAGE_chr', 'CAGE_start','CAGE_end', 'CAGE_MAJOR_MINOR', 'Ensembl_ID',"strand",'CAGE_avgpeak'], reorderedcolumns=['CAGE_TSSID', 'Gene_symbol', 'Ensembl_ID','CAGE_chr', 'CAGE_start','CAGE_end',"strand", 'CAGE_MAJOR_MINOR','CAGE_avgpeak',"CAGE"], noendcoord=False, save_txt="./candidate_AP_output/cagepivot_table.txt" )
###Pacbio
pac_pivot= downstream_func.pivot_each(dataframe1="./merged_tables/proactivPacBio.csv", index_list1=['ref_TSSID_y','Gene_symbol_y', 'chr','TSSstart','TSSend',"MAJOR_MINOR_y","Ensembl_ID","strand_y"], values_list1=["OverlappingTSS"], dataframe2="./merged_tables/CAGEPacBio.csv", index_list2=['ref_TSSID_x','Gene_symbol_x','chr_x','TSSstart','TSSend',"MAJOR_MINOR_x","ensembl_id","strand_x"], values_list2=["OverlappingTSS"], name="PacBio", columnnames=['PacBio_TSSID', 'Gene_symbol', 'PacBio_chr', 'PacBio_start','PacBio_end', 'PacBio_MAJOR_MINOR', "Ensembl_ID","strand",'PacBio_OverlappingTSS'], reorderedcolumns=['PacBio_TSSID', 'Gene_symbol','Ensembl_ID', 'PacBio_chr', 'PacBio_start','PacBio_end', "strand", 'PacBio_MAJOR_MINOR','PacBio_OverlappingTSS','PacBio'], noendcoord=False, save_txt="./candidate_AP_output/pacpivot_table.txt" )
all_pivot=pro_pivot.merge(cage_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer").merge(pac_pivot, on=["Gene_symbol","Ensembl_ID"], how="outer")
###This is the simple pivot table of genes ONLY
all_pivot_simple=all_pivot[["Ensembl_ID",'Gene_symbol' ,'proActiv', 'CAGE', 'PacBio']].drop_duplicates()
all_pivot_simple[all_pivot_simple.isna()] = False
pro_pivot.drop(columns= ["proActiv_relPromMean","proActiv","proActiv_internalPromoter"],axis=0,inplace=True)
pro_pivot["Source"]="proActiv"
cage_pivot.drop(columns= ["CAGE"],axis=0,inplace=True)
cage_pivot["Source"]="CAGE"
pac_pivot.drop(columns= ["PacBio"],axis=0,inplace=True)
pac_pivot["Source"]="PacBio"
##This is the pivot table that will be used further downstream as it has every single TSS called across platforms
compiled_pivot= pd.DataFrame( np.concatenate((pro_pivot.values,cage_pivot.values,pac_pivot.values) , axis=0 )).drop_duplicates()
compiled_pivot.columns=['TSS_ID','Gene_symbol','Ensembl_ID' , 'chr', 'TSSstart','TSSend' ,"strand",'MAJOR_MINOR' ,'Depth',"Source"]
###Make a file where the promoters are clustered together
compiled_pivot_bed=compiled_pivot[["chr","TSSstart","TSSstart","Source","TSS_ID","strand","Ensembl_ID"]]
compiled_pivot_bed.to_csv("./output_files/compiled_pivot.bed", header=None, index=None, sep="\t")
###START W CANONICAL ATG ANALYSIS WITH ROBS ATG START CODONS
# I'd figure out where the "canonical" ATG is. If promoters upstream of this ATG, it is a fair assumption that both will use that ATG. If minor one is downstream, it probably has a different N-terminal. That would be most basic approach (edited)
nterminus_change_merge=downstream_func.find_nterminuschange()
Empty DataFrame
Columns: [chr, TSS_start, TSS_end, NumberTSS, Gene_symbol, TSS_ID, strand, Ensembl_ID]
Index: []
chr TSS_start TSS_end NumberTSS Gene_symbol \
0 chr1 1308554 1308789 5 PacBio,proActiv
1 chr1 6385812 6385827 2 PacBio,proActiv
2 chr1 6393406 6393406 1 PacBio
3 chr1 6393767 6393767 1 proActiv
4 chr1 7961662 7961895 6 PacBio,proActiv
TSS_ID strand Ensembl_ID
0 PUSL1_1,53265,hg_81.1,hg_83.1,53266 + ENSG00000169972
1 10696,ACOT7_3 - ENSG00000097021
2 hg_10030.1 - ENSG00000097021
3 10697 - ENSG00000097021
4 21343,PARK7_1,hg_453.1,hg_453.1,PARK7_1,21344 + ENSG00000116288
(2858, 8)
Index(['chr', 'TSS_start', 'TSS_end', 'NumberTSS', 'Gene_symbol', 'TSS_ID',
'strand', 'Ensembl_ID', 'Range_ClusteredTSS', 'start_SC', 'end_SC'],
dtype='object')
The percentage of canonical ATG downstream of promoters: 57.6 %
####Create a dataframe for import into flashfry
###Get the dataframe that shows the entire list of
##Change the bed file so it is just the TSS
all_pivot_simple_nterm=all_pivot_simple.merge(nterminus_change_merge, left_on="Ensembl_ID", right_on="Ensembl_ID", right_index=True)
all_pivot_simple_nterm.to_csv("./candidate_AP_output/all_pivot_simple_nterm.txt",sep="\t",index=None)
compiled_pivot_flashfry=compiled_pivot[['chr', 'TSSstart', 'TSSend', "TSS_ID", 'Gene_symbol','strand','Ensembl_ID','MAJOR_MINOR','Source']]
compiled_pivot_flashfry=compiled_pivot_flashfry[compiled_pivot_flashfry["Ensembl_ID"].isin(all_pivot_simple["Ensembl_ID"])]
#ensure that only one copy of a TSS is inclued
SINGLE=compiled_pivot["Gene_symbol"].value_counts().index[compiled_pivot["Gene_symbol"].value_counts().values<2]
compiled_pivot_flashfry=compiled_pivot_flashfry[~compiled_pivot_flashfry["Gene_symbol"].isin(SINGLE)]
##create unique ID
compiled_pivot_flashfry["ID"]=compiled_pivot_flashfry["TSS_ID"].astype(str)+"_"+(compiled_pivot_flashfry.index).astype(str)
compiled_pivot_flashfry.to_csv("./candidate_AP_output/compiled_pivot_all.bed",index=None)
# all_pivot_simple_nterm.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/all_pivot_simple_nterm.txt",sep="\t",index=None)
# compiled_pivot_flashfry.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/pacbio_cage_proactiv_TSS_files/compiled_pivot_all.bed", sep="\t",index=None)
compiled_pivot_flashfry=func.find_confidencevalue(ALLTSS_bed_loc="./candidate_AP_output/compiled_pivot_all.bed")
compiled_pivot_flashfry.to_csv("./candidate_AP_output/compiled_pivot_all.bed",index=None,sep="\t")
compiled_pivot_flashfry_vis=compiled_pivot_flashfry[['chr', 'TSSstart', 'TSSend', 'Source', 'Gene_symbol','strand']]
compiled_pivot_flashfry_vis.to_csv("/Users/helenking/Desktop/Weatheritt_Lab/flashfry-for-crispr-targets/visualisations/compiled_pivot_all_vis.bed",header=None, sep="\t",index=None)
compiled_pivot_flashfry.columns
Index(['chr', 'TSSstart', 'TSSend', 'TSS_ID', 'Gene_symbol', 'strand',
'Ensembl_ID', 'MAJOR_MINOR', 'Source', 'confidence_value', 'rank'],
dtype='object')
os.system("bedtools intersect -loj -a ./candidate_AP_output/compiled_pivot_all.bed -b ./annotations/ENCFF268RXB.bed > ./candidate_AP_output/compiled_pivot_all.H3K4me3.bed")
0
Download Original Report: 1B Target Region Report (HTML)