---
title: "ORFik Overview"
author: "Haakon Tjeldnes & Kornel Labun"
date: "`r BiocStyle::doc_date()`"
package: "`r pkg_ver('ORFik')`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{ORFik Overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Welcome to the `ORFik` package. 
`ORFik` is an R package for analysis of transcript and translation features through manipulation of sequence data,
RiboSeq, RNASeq, TCPseq and CAGE, focusing on 5' UTRs (leaders), it is generalized in the 
sense that any transcript region can be analysed. 
This vignette will walk you through our detailed package usage with examples.


`ORFik` currently supports:  

1. Finding Open Reading Frames (very fast) in the genome of interest or on the 
set of transcripts/sequences.  
2. Automatic estimations of RiboSeq footprint shift.  
3. Utilities for metaplots of RiboSeq coverage over gene START and STOP codons 
allowing to spot the shift.  
4. Shifting functions for the RiboSeq data.  
5. Finding new Transcription Start Sites with the use of CageSeq data.  
6. Various measurements of gene identity e.g. FLOSS, coverage, ORFscore, 
entropy that are recreated based on many scientific publications.  
7. Utility functions to extend GenomicRanges for faster grouping, splitting, 
tiling etc. 
8. Several standardized plots for coverage and metacoverage of NGS data, 
including smart grouping functions for easier prototyping.

# Finding Open Reading Frames 

In molecular genetics, an Open Reading Frame (ORF) is the part of a reading 
frame that has the ability to be translated. Although not every ORF has the potential to be translated or to be functional, to find novel genes we must first be able to identify potential ORFs.

To find all Open Reading Frames (ORFs) and possibly map them to genomic 
coordinates,`ORFik` gives you three main functions:

* `findORFs` - find ORFs in sequences of interest,
* `findMapORFs` - find ORFs and map them to their respective genomic coordinates 
* `findORFsFasta` - find ORFs in Fasta file or `BSGenome` (supports circular genomes!)

## Example of finding ORFs in on 5' UTR of hg19

Load libraries we need for examples
```{r eval = TRUE, echo = TRUE, message = FALSE}
library(ORFik)                        # This package
library(GenomicFeatures)              # For basic transcript operations
library(data.table)                   # For fast table operations
library(BSgenome.Hsapiens.UCSC.hg19)  # Human genome
```

After loading libraries, load human genome from `GenomicFeatures`. 
```{r eval = TRUE, echo = TRUE}
txdbFile <- system.file("extdata", "hg19_knownGene_sample.sqlite", 
                        package = "GenomicFeatures")
```

We load gtf file as txdb (transcript database).
We will then extract the 5' leaders to find all upstream open reading 
frames.
```{r eval = TRUE, echo = TRUE}
txdb <- loadTxdb(txdbFile)
fiveUTRs <- loadRegion(txdb, "leaders")
fiveUTRs[1]
```
As we can see we have extracted 5' UTRs for hg19 annotations. Now we can load
`BSgenome` version of human genome (hg19).

Either import fasta or BSgenome file to get sequences.
```{r eval = TRUE, echo = TRUE, message = FALSE}
# Extract sequences of fiveUTRs.
tx_seqs <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19::Hsapiens, 
                                 fiveUTRs) 
tx_seqs[1]
```

Find all ORFs on those transcripts and get their genomic coordinates.
```{r eval = TRUE, echo = TRUE, message = FALSE}
fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs, groupByTx = FALSE)
fiveUTR_ORFs[1:2]
```

In the example above, you can see that fiveUTR_ORFs are grouped by ORF. 
That means each group in the GRangesList is 1 ORF, that can have multiple exons.
To get the transcript the ORF came from, do this:
```{r eval = TRUE, echo = TRUE, message = FALSE}
txNames(fiveUTR_ORFs[1:2]) # <- Which transcript
```
You see that both ORFs are from transcript "uc010ogz.1"

Meta-column names contains name 
of the transcript and identifier of the ORF separated by "_". When a ORF is 
separated into two exons you see it twice, as the first ORF with name 
"uc010ogz.1_1". The first ORF will always be the one most upstream for + 
strand, and the least upstream for - strand.
```{r eval = TRUE, echo = TRUE, message = FALSE}
names(fiveUTR_ORFs[1:2]) # <- Which ORF
```
## Saving ORFs to disc

We recommend two options for storing ORF ranges:

1. If you want to reuse only in R:
Save as R object
```{r eval = FALSE, echo = TRUE, message = FALSE}
saveRDS(fiveUTR_ORFs[1:2], "save/path/uorfs.rds")
```

2. If you want to use in IGV, UCSC genome browser etc:
Save as bed12 format, that is a bed format with 1 row per ORF, that contains
splicing information and even possible color coding for visualizing groups of ORFs:
```{r eval = FALSE, echo = TRUE, message = FALSE}
export.bed12(fiveUTR_ORFs[1:2], "save/path/uorfs.rds")
```

## Getting fasta sequences of ORFs
Now lets see how easy it is to get fasta sequences from the ranges.
```{r eval = TRUE, echo = TRUE, message = FALSE}
orf_seqs <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19::Hsapiens,
                                  fiveUTR_ORFs[1:3])
orf_seqs
```
You can see ORF 1 named (uc010ogz.1_1) has a CTG start codon, a TAG stop codon and 159/3 = 53 codons. 

To save as .fasta do
```{r eval = FALSE, echo = TRUE, message = FALSE}
writeXStringSet(orf_seqs, filepath = "uorfs.fasta")
```

We will now look on ORFik functions to get startcodons and stopcodon etc.
# New GRanges and GRangesList utilities for ORFs
`ORFik` contains functions that can be utilized to speed up your coding.
Check documentation for these functions: `sortPerGroup`, `unlistGrl`, 
`strandBool`, `tile1`.

## Grouping ORFs

There are 2 main ways of grouping ORFs.
- Group by ORF
- Group by transcript

To do this more easily, you can use the function `groupGRangesBy`.

1. Grouped by transcript:
We use the names() to group, 
```{r eval = TRUE, echo = TRUE}
unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
test_ranges <- groupGRangesBy(unlisted_ranges) # <- defualt is tx grouping by names()
test_ranges[1]
```
All orfs within a transcript grouped together as one group, the names column seperates the orfs.

2. Grouped by ORF:
we use the orfs meta column called ($names) to group, it is made by ORFik.
```{r eval = TRUE, echo = TRUE}
unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
test_ranges <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
test_ranges[1:2]
```
Here you see each group is one ORF only.

## Filtering example

Lets say you found some ORFs, and you want to filter out some of them.
ORFik provides several functions for filtering. A problem with the
original GenomicRanges container is that filtering on GRanges objects
are much easier than on a GRangesList object, ORFik tries to fix this.

In this example we will filter out all ORFs as following:

1. First group GRangesList by ORFs
2. width < 60
3. number of exons < 2
4. strand is negative

1. Group by ORFs, if ORFs are grouped by transcripts it would make no sense.
lets use the fiveUTR_ORFs
```{r eval = TRUE, echo = TRUE}
  unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
  ORFs <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
  length(ORFs) # length means how many ORFs left in set
```
2. Remove widths < 60
```{r eval = TRUE, echo = TRUE}
  ORFs <- ORFs[widthPerGroup(ORFs) >= 60]
  length(ORFs)
```
3. Keep only ORFs with at least 2 exons
```{r eval = TRUE, echo = TRUE}
  ORFs <- ORFs[numExonsPerGroup(ORFs) > 1]
  length(ORFs)
```
4. Keep only positive ORFs
```{r eval = TRUE, echo = TRUE}
  
  ORFs <- ORFs[strandPerGroup(ORFs) == "+"]
  # all remaining ORFs where on positive strand, so no change
  length(ORFs)
```

## ORF interest regions

Specific part of the ORF are usually of interest, as start and stop codons.
Here we run an example to show what ORFik can do for you.

1. Find the start and stop sites as GRanges
```{r eval = TRUE, echo = TRUE}
startSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)
stopSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)
```
2. Lets find the start and stop codons.
This takes care of potential 1 base exons etc. 
```{r eval = TRUE, echo = TRUE}
starts <- startCodons(fiveUTR_ORFs, is.sorted = TRUE)
stops <- stopCodons(fiveUTR_ORFs, is.sorted = TRUE)
starts[1:2]
```
3. Lets get the bases of the start and stop codons from the fasta file
It's very important to check that ORFs are sorted here, 
```{r eval = TRUE, echo = TRUE}
txSeqsFromFa(starts, BSgenome.Hsapiens.UCSC.hg19::Hsapiens, is.sorted = TRUE)
"Stop codons"
txSeqsFromFa(stops, BSgenome.Hsapiens.UCSC.hg19::Hsapiens, is.sorted = TRUE)
```
Many more operations are also supported for manipulation of ORFs

# When to use which ORFfinding function

ORFik supports multiple ORF finding functions. Here we describe their specific
use cases:

Which function you will use depend on which organism the data is from, and
specific parameters, like circular or non circular genomes, will you use
a transcriptome etc. 

There are 4 standard ways of finding ORFs:

1. You have some fasta file of the genome only. (For prokaryotes/circular  genomes)
2. You have some fasta file of the genome and a spliced transcriptome annotation.  (For eukaryotes with splicing)
3. You have a fasta file of transcripts (eukaryotes or prokaryotes)
4. You have a vector of transcripts preloaded in R.

Let's start with the simplest case; a vector of preloaded transcripts. 

Lets say you have some transcripts and want to find all ORFs on them.
findORFs() will give you only 5' to 3' direction, so if you want both directions,
you can do (for strands in both direction):
```{r eval = TRUE, echo = TRUE}
  library(Biostrings)
  # strand with ORFs in both directions
  seqs <- DNAStringSet("ATGAAATGAAGTAAATCAAAACAT")
  ######################>######################< (< > is direction of ORF)
  
  # positive strands
  pos <- findORFs(seqs, startCodon = "ATG", minimumLength = 0)
  # negative strands
  neg <- findORFs(reverseComplement(seqs),
                  startCodon = "ATG", minimumLength = 0)
```
Merge into a GRanges object, since we want strand information
```{r eval = TRUE, echo = TRUE}
  pos <- GRanges(pos, strand = "+")
  neg <- GRanges(neg, strand = "-")
  # as GRanges
  res <- c(pos, neg)
  # or merge together and make GRangesList
  res <- split(res, seq.int(1, length(pos) + length(neg)))
  res[1:2]
```  

Remember that these results are in transcript coordinates, sometimes you need
to convert them to Genomic coordinates.

## Finding ORFs in spliced transcripts
If you have a genome and a spliced transcriptome annotation, you must use findMapORFs(). 
It takes care of the potential problem from the last example, 
that we really want our result in genomic coordinates in the end.

## Prokaryote/Circular Genomes and fasta transcriptomes.
Use findORFsFasta(is.circular = TRUE).
Note that findORFsFasta automaticly finds (-) strand too, because that is 
normally used for genomes.

## Filter on strand
If you have fasta transcriptomes, you dont want the (-) strand. Since 
all transcripts are in the direction in the fasta file.
If you get both (+/-) strand and only want (+) ORFs, do:
```{r eval = TRUE, echo = TRUE}
  res[strandBool(res)] # Keep only + stranded ORFs
```

See individual functions for more examples.

# CageSeq data for 5' UTR re-annotation

In the previous example we used the reference annotation of the 5' UTRs
from Hg19. 

Here we will use CageSeq data to set new Transcription Start Sites (TSS) and re-annotate 5' UTRs. This is useful to improve tissue specific transcripts. Since most eukaryotes usually have variance in TSS definitions. 

```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE} 
# path to example CageSeq data from hg19 heart sample
cageData <- system.file("extdata", "cage-seq-heart.bed.bgz", 
                        package = "ORFik")
# get new Transcription Start Sites using CageSeq dataset
newFiveUTRs <- reassignTSSbyCage(fiveUTRs, cageData)
newFiveUTRs
```

You will now see that most of the transcription start sites have changed. 
Depending on the species, regular annotations might be incomplete or not 
specific enough for your purposes. 

NOTE: IF you want to edit the whole txdb / gtf file, use reassignTxDbByCage.
And save this to get the new gtf with reannotated leaders by CAGE.

  
# RiboSeq footprints automatic shift detection and shifting

In RiboSeq data ribosomal footprints are restricted to their p-site positions 
and shifted with respect to the shifts visible over the start and stop 
codons. `ORFik` has multiple functions for processing of RiboSeq data. We will
go through an example processing of RiboSeq data below.

Example raw RiboSeq footprints (unshifted):
```{r eval = TRUE, echo = TRUE} 
bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
footprints <- readBam(bam_file)
```

What footprint lengths are present in our data:
```{r eval = TRUE, echo = TRUE} 
table(readWidths(footprints))
```

Lets look at how the reads distribute around the CDS per read length:
For that we need to prepare the transcriptome annotation.
```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE} 
gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik")
txdb <- loadTxdb(gtf_file)
tx <- exonsBy(txdb, by = "tx", use.names = TRUE)
cds <- cdsBy(txdb, by = "tx", use.names = TRUE)
trailers <- threeUTRsByTranscript(txdb, use.names = TRUE)
cds[1]
```

Note in ORFik you can load all transcript annotation in one line (this is same as above):
```{r eval = FALSE, echo = TRUE, warning = FALSE, message = FALSE} 
loadRegions(gtf_file, parts = c("tx", "cds", "trailers"))
```
A note here is that "tx" are all transcripts, if you write "mrna" you will only 
get subset of tx that has a defined cds. 

Restrict footprints to their 5' starts (after shifting it will be a p-site).
```{r eval = TRUE, echo = TRUE} 
footprintsGR <- convertToOneBasedRanges(footprints, addSizeColumn = TRUE)
footprintsGR
```
The function convertToOneBasedRanges gives you a size column, that contains
read length information. You can also choose to use the score column 
for read information. But size has priority over score for deciding what 
column defines read lengths.

In the figure below we see why we need to p-shift, see that per length the 
start of the read are in different positions relative to the CDS start site.
The reads create a ladder going downwards, left to right. (see the blue steps)
```{r eval = TRUE, echo = TRUE} 
  hitMap <- windowPerReadLength(cds, tx,  footprintsGR, pShifted = FALSE)
  coverageHeatMap(hitMap, scoring = "transcriptNormalized")
```

If you only want to know how to run the function and no details, 
skip down to after the 2 coming bar plots.

For the sake of this example we will focus only on most abundant length of 29.
```{r eval = TRUE, echo = TRUE} 
footprints <- footprints[readWidths(footprints) == 29]
footprintsGR <- footprintsGR[readWidths(footprintsGR) == 29]
footprints
```

Filter the cds annotation to only those that have some minimum trailer and 
leader lengths, as well as cds. 
Then get start and stop codons with extra window of 30bp around 
them.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
txNames <- filterTranscripts(txdb) # <- get only transcripts that pass filter
tx <- tx[txNames]; cds <- cds[txNames]; trailers <- trailers[txNames];
windowsStart <- startRegion(cds, tx, TRUE, upstream = 30, 29)
windowsStop <- startRegion(trailers, tx, TRUE, upstream = 30, 29)
windowsStart
```

Calculate meta-coverage over start and stop windowed regions.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
hitMapStart <- metaWindow(footprintsGR, windowsStart, withFrames = TRUE)
hitMapStop <- metaWindow(footprintsGR, windowsStop, withFrames = TRUE)
```

Plot start/stop windows for length 29.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
  pSitePlot(hitMapStart)
```

```{r eval = TRUE, echo = TRUE, warning = FALSE} 
  pSitePlot(hitMapStop, region = "stop")
```
From these shifts ORFik uses a fourier transform to detect signal change needed 
to scale all read lengths of Ribo-seq to the start of the meta-cds. 

We can also use automatic detection of RiboSeq shifts using the code below. As
we can see reasonable conclusion from the plots would be to shift length 29 by
12, it is in agreement with the automatic detection of the offsets.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
shifts <- detectRibosomeShifts(footprints, txdb, stop = TRUE)
shifts
```

Fortunately `ORFik` has a function that can be used to shift footprints using 
desired shifts. See documentation for more details.
```{r eval = TRUE, echo = TRUE, warning = FALSE} 
shiftedFootprints <- shiftFootprints(footprints, shifts)
shiftedFootprints
```

# Gene identity functions for ORFs or genes

`ORFik` contains functions of gene identity that can be used to predict 
which ORFs are potentially coding and functional.

There are 2 main categories:

* Sequence features (kozak, gc-content, etc.)
* Read features (reads as: Ribo-seq, RNA-seq, TCP-seq, shape-seq etc)

Some important read features are:

- FLOSS `floss`   
- coverage `coverage`  
- ORFscore `orfScore`  
- entropy `entropy`  
- translational effiency `translationalEff`  
- inside outside score `insideOutsideScore`  
- distance between orfs and cds' `distToCds`  
- other  

All of the features are implemented based on scientific article published in
peer reviewed journal. `ORFik` supports seamless calculation of all available 
features. See example below.

Find ORFs:
```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE}
fiveUTRs <- fiveUTRs[1:10]
faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
tx_seqs <- extractTranscriptSeqs(faFile, fiveUTRs)

fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs, groupByTx = FALSE)
```

Make some toy ribo seq and rna seq data:
```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE}
starts <- unlist(ORFik:::firstExonPerGroup(fiveUTR_ORFs), use.names = FALSE)
RFP <- promoters(starts, upstream = 0, downstream = 1)
score(RFP) <- rep(29, length(RFP)) # the original read widths
# set RNA-seq seq to duplicate transcripts
RNA <- unlist(exonsBy(txdb, by = "tx", use.names = TRUE), use.names = TRUE)
```

Find features of sequence and library data
```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE}
# transcript database
txdb <- loadTxdb(txdbFile)
dt <- computeFeatures(fiveUTR_ORFs[1:4], RFP, RNA, txdb, faFile, 
                      sequenceFeatures = TRUE)
dt
```
You will now get a data.table with one column per score, the columns are named after
the different scores, you can now go further with prediction, or making plots.

# Calculating Kozak sequence score for ORFs

Instead of getting all features, we can also extract single features.

To understand how strong the binding affinitity of an ORF promoter region might be, we can use kozak sequence score. The kozak functions supports
several species. In the first example we use human kozak sequence,
then we make a self defined kozak sequence.

In this example we will find kozak score of cds' 
```{r eval = TRUE, echo = TRUE}
cds <- cdsBy(txdb, by = "tx", use.names = TRUE)[1:10]
tx <- exonsBy(txdb, by = "tx", use.names = TRUE)[names(cds)]
faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens

kozakSequenceScore(cds, tx, faFile, species = "human")


```
A few species are pre supported, if not, make your own input pfm.

Here is an example where the human pfm is sent in again, even though it is already supported:
```{r eval = TRUE, echo = TRUE}
pfm <- t(matrix(as.integer(c(29,26,28,26,22,35,62,39,28,24,27,17,
                             21,26,24,16,28,32,5,23,35,12,42,21,
                             25,24,22,33,22,19,28,17,27,47,16,34,
                             25,24,26,25,28,14,5,21,10,17,15,28)),
                ncol = 4))

kozakSequenceScore(cds, tx, faFile, species = pfm)
```

As an example of the many plots you can make with ORFik, let's make a scoring of Ribo-seq by kozak sequence.
```{r eval = TRUE, echo = TRUE}
seqs <- startRegionString(cds, tx, faFile, upstream = 5, downstream = 5)
rate <- fpkm(cds, RFP)
ORFik:::kozakHeatmap(seqs, rate)

```
It will be a black boundary box around the strongest nucleotide per position 
(what base at what position gives highest ribo-seq fpkm for the cds).
See at the start codon (position +1 to +3) you have A, T, G.
As known from the literature many C's before start codon and a G after 
the start codon. In a real example most of the nucleotides will be used for all positions. 

# Using ORFik in your package or scripts

The focus of ORFik for development is to be a Swiss army knife for 
transcriptomics. If you need functions for splicing, getting windows of exons
per transcript, periodic windows of exons, speicific parts of exons etc, 
ORFik can help you with this. 

Let's do an example where ORFik shines.
Objective: We have three transcripts, we also have a library of Ribo-seq.
This library was treated with cyclohexamide, so we know Ribo-seq reads can
stack up close to the stop codon of the CDS. Lets say we only want to keep 
transcripts, where the cds stop region (defined as last 9 bases of cds), has
maximum 33% of the reads. To only keep transcripts with a good spread of reads 
over the CDS. How would you make this filter ?

```{r eval = TRUE, echo = TRUE}
  # First make some toy example
  cds <- GRanges("chr1", IRanges(c(1, 10, 20, 30, 40, 50, 60, 70, 80),
                                 c(5, 15, 25, 35, 45, 55, 65, 75, 85)),
                 "+")
  names(cds) <- c(rep("tx1", 3), rep("tx2", 3), rep("tx3", 3))
  cds <- groupGRangesBy(cds)
  ribo <- GRanges("chr1", c(1, rep.int(23, 4), 30, 34, 34, 43, 60, 64, 71, 74),
                  "+")
  # We could do a simplification and use the ORFik entropy function
  entropy(cds, ribo) # <- spread of reads
```
We see that ORF 1, has a low (bad) entropy, but we do not know where the reads 
are stacked up.
So lets make a new filter by using more ORFiks utility functions:
```{r eval = TRUE, echo = TRUE}
tile <- tile1(cds, FALSE, FALSE) # tile them to 1 based positions
tails <- tails(tile, 9) # get 9 last bases per cds
stopOverlap <- countOverlaps(tails, ribo)
allOverlap <- countOverlaps(cds, ribo)
fractions <- (stopOverlap + 1) / (allOverlap + 1) # pseudocount 1
cdsToRemove <- fractions > 1 / 2 # filter with pseudocounts (1+1)/(3+1) 
cdsToRemove
```
We now easily made a stop codon filter for our coding sequences.


# Coverage plots made easy with ORFik

In investigation of ORFs or other interest regions, ORFik can help you make
some coverage plots from reads of Ribo-seq, RNA-seq, CAGE-seq, TCP-seq etc.

Lets make 3 plots of Ribo-seq focused on CDS regions. 

Load data as shown before and pshift the Ribo-seq:
```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE}
# Get the annotation
txdb <- loadTxdb(gtf_file)
# Ribo-seq
bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
reads <- readGAlignments(bam_file)
shiftedReads <- shiftFootprints(reads, detectRibosomeShifts(reads, txdb))  
```

Make meta windows of leaders, cds' and trailers
```{r eval = TRUE, echo = TRUE, message = FALSE}
# Lets take all valid transcripts, with size restrictions:
# leader > 100 bases, cds > 100 bases, trailer > 100 bases
txNames <- filterTranscripts(txdb, 100, 100, 100) # valid transcripts
loadRegions(txdb, parts = c("leaders", "cds", "trailers", "tx"), 
            names.keep = txNames)

# Create meta coverage per part of transcript
leaderCov <- metaWindow(shiftedReads, leaders, scoring = NULL, 
                        feature = "leaders")

cdsCov <- metaWindow(shiftedReads, cds, scoring = NULL, 
                     feature = "cds")

trailerCov <- metaWindow(shiftedReads, trailers, scoring = NULL, 
                         feature = "trailers")
```

Bind together and plot:
```{r eval = TRUE, echo = TRUE, message = FALSE}
dt <- rbindlist(list(leaderCov, cdsCov, trailerCov))
dt[, `:=` (fraction = "Ribo-seq")] # Set info column
# zscore gives shape, a good starting plot
windowCoveragePlot(dt, scoring = "zscore", title = "Ribo-seq metaplot") 
```
Z-score is good at showing overall shape. You see from the windows each region;
leader, cds and trailer is scaled to 100. NOTE: we can use the function windowPerTranscript 
to do all of this in one call. 


Lets use a median scoring to find median counts per meta window per positions. 
```{r eval = TRUE, echo = TRUE}
windowCoveragePlot(dt, scoring = "median", title = "Ribo-seq metaplot") 
```

We see a big spike close to start of CDS, called the TIS.
The median counts by transcript is close to 50 here.
Lets look at the TIS region using the pshifting plot, seperated into 
the 3 frames.
```{r eval = TRUE, echo = TRUE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # size 100 window: 50 upstream, 49 downstream of TIS
  windowsStart <- startRegion(cds, tx, TRUE, upstream = 50, 49)
  hitMapStart <- metaWindow(shiftedReads, windowsStart, withFrames = TRUE)
  pSitePlot(hitMapStart, length = "meta coverage")
}
```

Since these reads are p-shifted, the maximum 
number of reads are on the 0 position. We also see a clear pattern in the 
Ribo-seq.

To see how the different read lengths distribute over the region, we make 
a heatmap. Where the colors represent the zscore of counts per position. 

```{r eval = TRUE, echo = TRUE, message = FALSE, fig.height=8}
hitMap <- windowPerReadLength(cds, tx,  shiftedReads)
coverageHeatMap(hitMap, addFracPlot = TRUE)
```
In the heatmap you can see that read length 30 has the strongest peak on the 
TIS, while read length 28 has some reads in the leaders (the minus positions).
  
## Multiple data sets in one plot

Often you have multiple data sets you want to compare (like ribo-seq).

ORFik has an extensive syntax for automatic grouping of data sets in plots.

The protocol is:
1. Load all data sets
2. Create a merged coverage data.table
3. Pass it into the plot you want.

Here is an easy example:

```{r eval = TRUE, echo = TRUE, message=FALSE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # Load more files like above (Here I make sampled data from earlier Ribo-seq)
  dt2 <- copy(dt)
  dt2[, `:=` (fraction = "Ribo-seq2")]
  dt2$score <- dt2$score + sample(seq(-40, 40), nrow(dt2), replace = TRUE)
  
  dtl <- rbindlist(list(dt, dt2))
  windowCoveragePlot(dtl, scoring = "median", title = "Ribo-seq metaplots") 
}
```
You see that the fraction column is what seperates the rows. You can have unlimited datasets joined in this way.

# Conclusion
Our hope is that by using ORFik, we can simplify your analysis when 
you focus on ORFs / transcript features and especially in combination with
sequence libraries like RNA-seq and Ribo-seq.  

Happy coding!