Output format

Poly(A) clusters

The output directory of LAPA (lapa --output_dir your_output_dir) looks like:

your_output_dir/
├── polyA_clusters.bed
├── raw_polyA_clusters.bed
├── counts
│   ├── all_polyA_counts_neg.bw
│   ├── all_polyA_counts_pos.bw
│   ├── {sample}_polyA_counts_neg.bw
│   ├── {sample}_polyA_counts_pos.bw
├── coverage
│   ├── all_polyA_coverage_neg.bw
│   ├── all_polyA_coverage_pos.bw
│   ├── {sample}_polyA_coverage_neg.bw
│   ├── {sample}_polyA_coverage_pos.bw
├── ratio
│   ├── all_polyA_ratio_neg.bw
│   ├── all_polyA_ratio_pos.bw
│   ├── {sample}_polyA_ratio_neg.bw
│   ├── {sample}_polyA_ratio_pos.bw
├── dataset
│   └── {dataset}.bed
├── raw_sample
│   └── {sample}.bed
├── sample
│   └── {sample}.bed
└── logs
    ├── final_stats.log
    ├── progress.log
    └── warnings.log

polyA_clusters.bed: the main output of LAPA. It contains the poly(A) clusters filtered for replication, so this set of clusters is the high-confidence set of poly(A) clusters. Poly(A) cluster .bed files have the following format:

Chromosome	Start	End	polyA_site	count	Feature	gene_id	tpm	gene_count	usage	fracA	signal	annotated_site
chr17	2681887	2681912	2681907	249	three_prime_utr	ENSG00000007168.14	7978.72	780	0.319231	4	2681885@AATAAA	2685608
chr17	2684498	2684510	2684505	58	three_prime_utr	ENSG00000007168.14	1858.5	780	0.074359	4	2684477@AATAAA	2685608
chr17	2685607	2685616	2685614	473	three_prime_utr	ENSG00000007168.14	15156.4	780	0.60641	3	2685562@ATTAAA	2685608
chr17	3661532	3661541	3661536	110	three_prime_utr	ENSG00000040531.16	3524.74	110	1	1	3661514@ATTAAA	3663103
chr17	2059842	2059845	2059843	145	three_prime_utr	ENSG00000070366.14	4646.24	145	1	2	2059861@AATAAA	2059842

…

Poly(A) clusters can be read as a dataframe with the following code:

from lapa import read_polyA_cluster

df = read_polyA_cluster('your_output_dir/polyA_clusters.bed')
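
For a quick look at the columns without LAPA installed, the example rows above can also be parsed with the Python standard library. The following is a minimal sketch, not part of LAPA: it assumes the file is tab-separated with the header line shown above, and that the signal column has the form position@motif, as the example rows suggest. The names example, rows and dominant are only illustrative.

```python
import csv
import io

# Example rows copied from the table above, typed with spaces for readability.
example = """\
Chromosome Start End polyA_site count Feature gene_id tpm gene_count usage fracA signal annotated_site
chr17 2681887 2681912 2681907 249 three_prime_utr ENSG00000007168.14 7978.72 780 0.319231 4 2681885@AATAAA 2685608
chr17 2684498 2684510 2684505 58 three_prime_utr ENSG00000007168.14 1858.5 780 0.074359 4 2684477@AATAAA 2685608
chr17 2685607 2685616 2685614 473 three_prime_utr ENSG00000007168.14 15156.4 780 0.60641 3 2685562@ATTAAA 2685608
chr17 3661532 3661541 3661536 110 three_prime_utr ENSG00000040531.16 3524.74 110 1 1 3661514@ATTAAA 3663103
chr17 2059842 2059845 2059843 145 three_prime_utr ENSG00000070366.14 4646.24 145 1 2 2059861@AATAAA 2059842
"""

# Assumption: the real file is tab-separated, so convert the example to tabs.
bed_text = "\n".join("\t".join(line.split()) for line in example.splitlines())

rows = list(csv.DictReader(io.StringIO(bed_text), delimiter="\t"))

for row in rows:
    # Assumption: `signal` is "<position>@<motif>".
    position, motif = row["signal"].split("@")
    row["signal_pos"] = int(position)
    row["signal_motif"] = motif

# Keep, for each gene, the cluster with the highest usage.
dominant = {}
for row in rows:
    best = dominant.get(row["gene_id"])
    if best is None or float(row["usage"]) > float(best["usage"]):
        dominant[row["gene_id"]] = row

for gene_id, row in dominant.items():
    print(gene_id, row["polyA_site"], row["signal_motif"])
```

All values are read as strings by csv.DictReader, so numeric columns such as usage are converted explicitly before comparison.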

raw_polyA_clusters.bed: contains all the poly(A) clusters detected by LAPA, without filtering for replication.

counts: directory containing read-end bigwig files. Each bigwig file contains the number of read ends per position, indicating possible poly(A) sites. The files are not filtered, so they represent the raw data. There is a pair of bigwig files per sample, one for each strand. The files with the all prefix contain the counts from all the samples aggregated into one bigwig file per strand.

coverage: directory containing bigwig files for coverage. Each file gives the coverage at the positions that have non-zero values in the counts file. The file format is therefore sparse: it contains values only for positions where at least one read ends, and the remaining positions are zero even though the actual coverage there can be non-zero. The sparse format is used to limit file size and computation. The files with the all prefix contain the coverage from all the samples aggregated into one bigwig file per strand.

ratio: directory containing the ratio of counts to coverage ($count / coverage$). This ratio is the fraction of the reads covering a position that end at that position. If the ratio is close to 1 and the coverage is high, the position is a definitive poly(A) site. If the ratio is close to 0, the reads could be ending at the position by chance. With the default parameters (cluster_ratio_cutoff), LAPA only initializes a cluster at a position if the ratio is > 5%. The files with the all prefix contain the values from all the samples aggregated into one bigwig file per strand.
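
The relation between the three bigwig directories can be illustrated with a few made-up positions. This is only a sketch of the ratio and the 5% threshold described above, using hypothetical values; it is not LAPA's implementation and it covers only the ratio threshold, not the clustering itself.

```python
# Hypothetical per-position values for one strand (not real LAPA output).
# counts: number of read ends per position.
counts = {100: 1, 250: 12, 251: 30, 400: 2}
# coverage: stored only at positions where counts is non-zero (sparse).
coverage = {100: 80, 250: 60, 251: 60, 400: 10}

RATIO_CUTOFF = 0.05  # the default 5% threshold described above

# ratio = count / coverage at each position with at least one read end.
ratio = {pos: counts[pos] / coverage[pos] for pos in counts}

# Positions whose ratio exceeds the cutoff could initialize a cluster.
candidates = [pos for pos in sorted(ratio) if ratio[pos] > RATIO_CUTOFF]
print(candidates)
```

Position 100 has one read end out of 80 covering reads (ratio 0.0125), so it falls below the cutoff, while the other three positions pass.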

dataset: directory containing poly(A) cluster .bed files per dataset. These files are filtered for replication rate using the samples in the dataset; the replicated clusters from all of those samples are then aggregated into the .bed file for the dataset.

raw_sample: directory containing poly(A) cluster .bed files per sample, not filtered for replication.

sample: directory containing poly(A) cluster .bed files per sample, filtered for replication.

logs: directory containing the logs of LAPA. final_stats.log contains statistics about the poly(A) clusters after the program has finished. progress.log provides insight into the progress of the program run. warnings.log contains any warnings encountered during the run.

TSS clusters

The output directory of TSS LAPA (lapa_tss --output_dir your_output_dir) looks like:

your_output_dir/
├── tss_clusters.bed
├── raw_tss_clusters.bed
├── counts
│   ├── all_tss_counts_neg.bw
│   ├── all_tss_counts_pos.bw
│   ├── {sample}_tss_counts_neg.bw
│   ├── {sample}_tss_counts_pos.bw
├── coverage
│   ├── all_tss_coverage_neg.bw
│   ├── all_tss_coverage_pos.bw
│   ├── {sample}_tss_coverage_neg.bw
│   ├── {sample}_tss_coverage_pos.bw
├── ratio
│   ├── all_tss_ratio_neg.bw
│   ├── all_tss_ratio_pos.bw
│   ├── {sample}_tss_ratio_neg.bw
│   ├── {sample}_tss_ratio_pos.bw
├── dataset
│   └── {dataset}.bed
├── raw_sample
│   └── {sample}.bed
├── sample
│   └── {sample}.bed
└── logs
    ├── final_stats.log
    ├── progress.log
    └── warnings.log

Here tss_clusters.bed is the main output of LAPA. It contains the TSS clusters filtered for replication, so this set of clusters is the high-confidence set of TSS clusters. TSS cluster .bed files have the following format:

Chromosome	Start	End	tss_site	count	Feature	gene_id	tpm	gene_count	usage	annotated_site
chr17	38870035	38870063	38870060	201	five_prime_utr	ENSG00000002834.18	2220.6	252	0.797619	38869858
chr17	38890455	38890456	38890456	24	exon	ENSG00000002834.18	265.15	252	0.0952381	-1
chr17	38918997	38918998	38918998	27	exon	ENSG00000002834.18	298.29	252	0.107143	-1
chr17	48107566	48107599	48107576	13	five_prime_utr	ENSG00000002919.15	143.62	23	0.565217	48107548
chr17	48107764	48107785	48107785	10	five_prime_utr	ENSG00000002919.15	110.48	23	0.434783	48107548

…

TSS clusters can be read as a dataframe with the following code:

from lapa import read_tss_cluster

df = read_tss_cluster('your_output_dir/tss_clusters.bed')

This cluster .bed file and all other cluster .bed files have the following columns:

Chromosome: Chromosome of the TSS cluster.
Start: Start position of the TSS cluster.
End: End position of the TSS cluster.
tss_site: Exact TSS site (peak) of the cluster.
count: Number of reads supporting the cluster (reads ending in the cluster).
Strand: Strand of the TSS cluster.
Feature: Genomic feature overlapping the cluster (obtained from the GTF file).
gene_id: The gene containing the TSS cluster.
tpm: TPM of the cluster, calculated as $count / \sum count \times 10^6$.
gene_count: Total number of reads in all the clusters of this gene, calculated as $\sum_{i \in gene} count_i$.
usage: Fractional usage of the specific TSS cluster within its gene, calculated as $count / gene\_count$.
annotated_site: End position of the 5' UTR based on the GTF, if the TSS cluster is located in a 5' UTR.
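
The gene_count, usage and tpm definitions above can be checked against the example TSS rows. The sketch below applies the stated formulas to the count and gene_id values of those five rows only; it is an illustration, not LAPA's code. Because the sum of counts is taken over just these five rows rather than the whole file, the tpm values computed here differ from the ones in the table, while gene_count and usage can be compared directly.

```python
from collections import defaultdict

# (gene_id, count) pairs taken from the example TSS rows above.
clusters = [
    ("ENSG00000002834.18", 201),
    ("ENSG00000002834.18", 24),
    ("ENSG00000002834.18", 27),
    ("ENSG00000002919.15", 13),
    ("ENSG00000002919.15", 10),
]

# gene_count: sum of counts over all clusters of a gene.
gene_count = defaultdict(int)
for gene_id, count in clusters:
    gene_count[gene_id] += count

# Sum of counts over the rows in this example only (not the whole file).
total_count = sum(count for _, count in clusters)

results = []
for gene_id, count in clusters:
    usage = count / gene_count[gene_id]    # count / gene_count
    tpm = count / total_count * 1_000_000  # count / sum(count) * 10^6
    results.append((gene_id, count, gene_count[gene_id], usage, tpm))

for gene_id, count, total, usage, tpm in results:
    print(gene_id, count, total, round(usage, 6), round(tpm, 2))
```

The gene_count values (252 and 23) and the usage values rounded to six decimals agree with the corresponding columns of the example rows.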

raw_tss_clusters.bed contains all the TSS clusters detected by LAPA, without filtering for replication.

For details of the other files, see [the documentation of Poly(A) clusters](#poly(A)-clusters). The file structure and the content of the files are the same as for poly(A) clusters.