Integration of PacBio, TSS and proActiv Data

We are identifying alternative promoter candidates to be used within the CRISPRi screen. This follows many weeks of trying every type of sequencing technology that could identify TSSs/promoters in the MCF-7 cell line.

Step 1) Identify alternative TSS usage with three technologies: CAGE-seq, proActiv and PacBio.

Step 2) Filter the TSSs based on their proximity to each other, their number per gene, and whether the major promoter is upstream of the minor promoter.

Step 3) Prioritise the genes of interest.

Step 1) The three technologies used are as follows, each with a short description of how a TSS is called.

a. PacBio Long-Read RNA-seq data from the MCF-7 cell line. Hoen et al. PMID: 29598823

PacBio transcript sequences are consensus sequences, generated from a combination of full-length reads with the isoform clustering algorithm (ICE) and then with partial reads. After benchmarking PacBio, we found it was as accurate as CAGE-seq at calling TSSs but identified more true-positive (TP) TSSs. This is because CAGE-seq is very conservative in calling TSSs.

First, the TSS locus is defined as the first strand-specific nucleotide of each read. Then, TSSs called within 10 nucleotides of one another are counted together to find the consensus locus and the depth of coverage. Finally, these TSSs were overlapped with the refTSS database (PMID: 31075273). Note that PacBio has particularly low coverage relative to short-read sequencing.
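The grouping step above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function names, the input layout (one tuple per read holding chromosome, strand and 5' end position) and the choice of the most frequent position as the consensus locus are all assumptions.

```python
from collections import Counter, defaultdict

def _summarise(chrom, strand, cluster):
    # Assumption: the consensus locus is the most frequent position in the
    # cluster; ties are broken by taking the smallest coordinate.
    counts = Counter(cluster)
    best = min(counts, key=lambda p: (-counts[p], p))
    return (chrom, strand, best, len(cluster))

def call_consensus_tss(read_starts, window=10):
    """Group read 5' ends that lie within `window` nt of the previous one.

    read_starts: iterable of (chrom, strand, position) tuples, where
    position is already the strand-specific first nucleotide of the read.
    Returns a list of (chrom, strand, consensus_position, depth).
    """
    by_strand = defaultdict(list)
    for chrom, strand, pos in read_starts:
        by_strand[(chrom, strand)].append(pos)

    consensus = []
    for (chrom, strand), positions in sorted(by_strand.items()):
        positions.sort()
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= window:
                cluster.append(pos)          # still within the window
            else:
                consensus.append(_summarise(chrom, strand, cluster))
                cluster = [pos]              # start a new cluster
        consensus.append(_summarise(chrom, strand, cluster))
    return consensus

# Hypothetical reads, for illustration only
reads = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "+", 104),
         ("chr1", "+", 250), ("chr1", "-", 500)]
print(call_consensus_tss(reads))
```

Because each read is compared with its nearest neighbour, a long run of closely spaced starts is merged into one cluster even if its two ends are more than 10 nt apart. A fixed-width window would behave differently.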

b. proActiv In-House RNA-seq from the MCF-7 cell line

proActiv is a purpose-built package that identifies alternative promoters and estimates their activity. proActiv uses STAR junction files and a GENCODE v36 annotation GTF file. The output is a SummarizedExperiment object, which is awkward to work with.

Promoters are identified from the annotation. Weighted splicing graphs are created that capture all the splice variants of one gene in one data structure. Promoter activity estimates are calculated for uniquely identifiable promoters. We are not filtering out single-exon promoters or promoters that use an internal intron. Read counts are quantified as absolute and relative promoter activities using either the split-read ratio method or junction reads. Promoters are divided into three categories: major, minor and inactive. For each gene, the promoter with the highest activity across the sample cohort is the major promoter. Promoters with an average activity below 0.25 are inactive, and the remaining promoters are minor. The pipeline can be found in proActiv.R.

The means of the absolute and relative promoter activities were calculated across all three repeats of the MCF-7 cell line.
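The major/minor/inactive rule can be illustrated with a small Python sketch. It is not proActiv's own code (proActiv is an R package) and the function name and input layout are assumptions. It only applies the stated rule to per-replicate absolute activities: average across replicates, label anything below 0.25 as inactive, label the most active promoter of each gene as major, and label the rest as minor.

```python
from collections import defaultdict
from statistics import mean

INACTIVE_THRESHOLD = 0.25  # threshold quoted in the text above

def classify_promoters(activity):
    """activity: {(gene_id, promoter_id): [absolute activity per replicate]}.

    Returns {(gene_id, promoter_id): "major" | "minor" | "inactive"}.
    """
    # Mean absolute activity across replicates
    mean_activity = {key: mean(values) for key, values in activity.items()}

    by_gene = defaultdict(list)
    for (gene, promoter), value in mean_activity.items():
        by_gene[gene].append((value, promoter))

    labels = {}
    for gene, promoters in by_gene.items():
        _, top_promoter = max(promoters)  # most active promoter of the gene
        for value, promoter in promoters:
            if value < INACTIVE_THRESHOLD:
                labels[(gene, promoter)] = "inactive"
            elif promoter == top_promoter:
                labels[(gene, promoter)] = "major"
            else:
                labels[(gene, promoter)] = "minor"
    return labels

# Hypothetical activities for three replicates, for illustration only
example = {
    ("G1", "p1"): [4.0, 5.0, 6.0],
    ("G1", "p2"): [1.0, 1.0, 1.0],
    ("G1", "p3"): [0.1, 0.2, 0.0],
    ("G2", "p1"): [0.1, 0.1, 0.1],
}
print(classify_promoters(example))
```

In this sketch a gene whose most active promoter is still below the threshold has no major promoter at all. That is one reading of the rule.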

c. CAGE-seq - IDR of ENCODE MCF-7 cell line data from the Carninci Lab

CAGE stands for cap analysis of gene expression. Our understanding is that CAGE-seq measures RNA expression and maps TSSs in promoters at single-nucleotide resolution. However, it only works on total mature RNA, and detection is biased toward the TSSs of long-lived transcripts. We are using the CAGE-seq TSSs found across two repeats using IDR. The average peak was found across both repeats. These consensus TSSs were then overlapped with the refTSS database (PMID: 31075273).
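The overlap with refTSS can be pictured as a point-in-interval lookup. The sketch below is a minimal illustration and not the project's code. It assumes the reference regions for one chromosome and strand are given as non-overlapping (start, end) pairs with inclusive ends. The function name is hypothetical, and real BED files use 0-based half-open coordinates, which would need converting first.

```python
from bisect import bisect_right

def overlap_with_reference(tss_positions, ref_intervals):
    """Return the TSS positions that fall inside any reference interval.

    tss_positions: iterable of integer positions (one chromosome, one strand).
    ref_intervals: list of (start, end) pairs, inclusive, non-overlapping.
    """
    intervals = sorted(ref_intervals)
    starts = [start for start, _ in intervals]
    supported = []
    for pos in tss_positions:
        # Index of the last interval starting at or before `pos`
        i = bisect_right(starts, pos) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            supported.append(pos)
    return supported

# Hypothetical coordinates, for illustration only
print(overlap_with_reference([95, 100, 130, 500], [(90, 110), (480, 520)]))
```

For genome-scale data an interval tool such as bedtools would normally do this job; the sketch only shows the logic.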

Step 2) Across all three sequencing technologies, we filtered as follows: a) Genes with between 2 and 20 TSSs that are more than 300 bp apart were retained as candidates.
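One possible reading of this filter is sketched below in Python. The interpretation is an assumption: TSSs are walked in coordinate order, a TSS is kept only if it lies more than 300 positions from the previously kept one, and the gene is retained if between 2 and 20 TSSs survive. The function and parameter names are hypothetical.

```python
def filter_candidate_genes(tss_by_gene, min_gap=300, min_tss=2, max_tss=20):
    """tss_by_gene: {gene_id: [TSS positions]}.

    Returns {gene_id: [retained TSS positions]} for genes passing the filter.
    """
    candidates = {}
    for gene, positions in tss_by_gene.items():
        kept = []
        for pos in sorted(set(positions)):
            # Keep a TSS only if it is far enough from the last kept one
            if not kept or pos - kept[-1] > min_gap:
                kept.append(pos)
        if min_tss <= len(kept) <= max_tss:
            candidates[gene] = kept
    return candidates

# Hypothetical positions, for illustration only
example = {"A": [1000, 1100, 2000], "B": [500, 600], "C": [100, 1000, 5000]}
print(filter_candidate_genes(example))
```

A stricter reading, in which every pair of neighbouring TSSs must already be more than 300 apart, would reject gene "A" in this example instead of collapsing its two close TSSs.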

A gene was considered a gene of interest if it was found in at least two of the three technologies.
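The two-out-of-three rule amounts to counting how many technologies support each gene. A minimal Python sketch follows, with hypothetical gene names and a hypothetical function name. It assumes that a gene found in all three technologies also qualifies.

```python
from collections import Counter

def genes_supported_by(gene_sets, min_support=2):
    """gene_sets: {technology_name: set of candidate gene IDs}.

    Returns the sorted genes found in at least `min_support` technologies.
    """
    support = Counter(gene for genes in gene_sets.values() for gene in set(genes))
    return sorted(gene for gene, n in support.items() if n >= min_support)

# Hypothetical candidate lists, for illustration only
example = {
    "PacBio": {"A", "B"},
    "proActiv": {"B", "C"},
    "CAGE-seq": {"A", "B", "D"},
}
print(genes_supported_by(example))
```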

Step 3) Now that we have identified these genes of interest, each one is flagged True/False for its presence in each of three lists.

First, a list of nuclear proteins. Specifically, 106 transcription factors (PMID: 29425488) and 24 chromatin remodellers were identified.

Secondly, a manually curated list of genes that show alternative promoter (AP) usage.

Finally, differentially expressed genes (DEGs) found to switch exon usage from non-cancerous stem cells to cancerous stem cells in HMLER and HCC38. The DEGs for the cancerous stem cells were identified by differential exon usage with proActiv.

These are all summarised, in ascending order of number of hits, in pivot_simple_TF_CHROM_HMM_HCC.txt
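The flagging and ordering can be sketched as follows. This is an illustration under assumptions, not the script that produced the summary file. The list names, gene names and function name are hypothetical, and a "hit" is taken to mean one list in which the gene is present.

```python
def summarise_hits(genes, gene_lists):
    """genes: iterable of gene IDs of interest.
    gene_lists: {list_name: set of gene IDs}.

    Returns one row per gene with a True/False flag per list and a hit
    count, sorted in ascending order of hits (ties by gene name).
    """
    rows = []
    for gene in genes:
        flags = {name: gene in members for name, members in gene_lists.items()}
        rows.append({"gene": gene, **flags, "hits": sum(flags.values())})
    return sorted(rows, key=lambda row: (row["hits"], row["gene"]))

# Hypothetical lists, for illustration only
lists = {
    "nuclear_protein": {"A", "B"},
    "curated_AP": {"B"},
    "stem_cell_switch": {"B", "C"},
}
for row in summarise_hits(["A", "B", "C"], lists):
    print(row)
```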

CAGE-seq: the MCF-7 combined peaks are available at: https://www.encodeproject.org/files/ENCFF917XEM/@@download/ENCFF917XEM.bed.gz

Gene annotation: Human Genes GRCh38.p13 (hg38_genes), obtained using the Ensembl web browser with the following attributes: Chromosome/scaffold name, Gene start (bp), Gene end (bp), Gene stable ID version, Karyotype band, Strand, Source (gene), Gene type, Transcription start site (TSS).