lapa

Package Contents

Functions

lapa(alignment: str, fasta: str, annotation: str, chrom_sizes: str, output_dir: str, method='end', min_tail_len=10, min_percent_a=0.9, mapq=10, cluster_extent_cutoff=3, cluster_window=25, cluster_ratio_cutoff=0.05, min_replication_rate=0.95, replication_rolling_size=1000, replication_num_sample=2, replication_min_count=1, non_replicates_read_threhold=10)

LAPA high-level API for poly(A) cluster calling.

lapa_tss(alignment: str, fasta: str, annotation: str, chrom_sizes: str, output_dir: str, method='start', mapq=10, cluster_extent_cutoff=3, cluster_window=25, cluster_ratio_cutoff=0.05, min_replication_rate=0.95, replication_rolling_size=1000, replication_num_sample=2, replication_min_count=1, non_replicates_read_threhold=10)

LAPA TSS high-level API for TSS cluster calling.

link_tss_to_tes(alignment, lapa_dir, lapa_tss_dir, distance=50, mapq=10, min_read_length=100, dataset='all')

Link transcript start sites to transcript end sites using long reads from the alignment file.

read_polyA_cluster(path: str)

Read poly(A) cluster file generated by LAPA.

read_tss_cluster(path: str)

Read TSS cluster file generated by LAPA.

lapa.lapa(alignment: str, fasta: str, annotation: str, chrom_sizes: str, output_dir: str, method='end', min_tail_len=10, min_percent_a=0.9, mapq=10, cluster_extent_cutoff=3, cluster_window=25, cluster_ratio_cutoff=0.05, min_replication_rate=0.95, replication_rolling_size=1000, replication_num_sample=2, replication_min_count=1, non_replicates_read_threhold=10)

LAPA high-level API for poly(A) cluster calling.

Parameters
  • alignment – Single or multiple BAM file paths separated by commas. Alternatively, a CSV file with the columns sample, dataset, and path, where sample is the name of the sample, dataset is the group of samples that are replicates of each other, and path is the path to the BAM file.

  • fasta – Genome reference (GENCODE or ENSEMBL fasta)

  • annotation – Standard genome annotation (GENCODE or ENSEMBL GTF). GENCODE GTF files do not contain annotations for five_prime_utr and three_prime_utr, so they need to be corrected with gencode_utr_fix.

  • chrom_sizes – Chrom sizes files (can be generated with

  • output_dir – Output directory (see lapa.readthedocs.io/en/latest/output.html for the details of the directory structure and file format).

  • method – Counting method, either end or tail. tail counting only counts reads with a poly(A) tail of at least the length defined by the min_tail_len parameter. end counting still detects tails if they exist but uses the end location of all reads in counting, regardless of tail length.

  • min_tail_len – Minimum tail length for the tail counting strategy. This parameter is ignored for end counting.

  • min_percent_a – Minimum percentage of A bases in the soft-trimmed segment for the segment to be considered a tail. This parameter is ignored for end counting.

  • mapq – Minimum read mapping quality required for TES calling.

  • cluster_extent_cutoff – Minimum number of reads required to initialize a cluster. A cluster is terminated if the read count stays below this cutoff for a certain number of base pairs (see cluster_window).

  • cluster_ratio_cutoff – Percentage of coverage change required to initialize a cluster. At least x% of the reads covering a base pair need to end at that position to initialize the cluster. The filter assumes that fewer than x% of the reads covering a position could stop there by chance, so such positions are filtered out as noise.

  • cluster_window – Patience threshold for terminating a cluster. If the read count stays below cluster_extent_cutoff for x bp, the cluster is terminated; otherwise, the cluster is extended.

  • min_replication_rate – Minimum replication rate required to include a cluster among the replicated clusters. The recommended cutoff is 0.95 for experimental replication and 0.75 for biological replication.

  • replication_rolling_size – Replication rolling size to calculate replication rate.

  • replication_num_sample – Number of samples in which a region needs to be observed to count as replicated.

  • replication_min_count – Minimum count needed to recognize a region as expressed.

  • non_replicates_read_threhold – Minimum read count needed for samples without replicates. If a sample has no replicate samples, this default cutoff is applied.
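
The interplay of method, min_tail_len, and min_percent_a described above can be illustrated with a small sketch. It is not LAPA's implementation: the Read record and the helpers is_polya_tail and count_ends are hypothetical names, reads are assumed to lie on the forward strand, and the soft-clipped bases are assumed to be available as a plain string.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified read record (not a LAPA type)."""
    end: int        # position where the aligned part of the read ends
    soft_clip: str  # soft-clipped bases after the aligned part ('' if none)


def is_polya_tail(segment: str, min_tail_len: int = 10,
                  min_percent_a: float = 0.9) -> bool:
    """True if the segment is long enough and A-rich enough to count as a tail."""
    if not segment or len(segment) < min_tail_len:
        return False
    return segment.upper().count("A") / len(segment) >= min_percent_a


def count_ends(reads: Iterable[Read], method: str = "end",
               min_tail_len: int = 10, min_percent_a: float = 0.9) -> Counter:
    """Count read ends per position under the 'end' or 'tail' strategy."""
    if method not in ("end", "tail"):
        raise ValueError("method must be 'end' or 'tail'")
    counts = Counter()
    for read in reads:
        # Tail counting skips reads without a qualifying poly(A) tail;
        # end counting uses every read regardless of its tail.
        if method == "tail" and not is_polya_tail(
                read.soft_clip, min_tail_len, min_percent_a):
            continue
        counts[read.end] += 1
    return counts


reads = [
    Read(end=100, soft_clip="A" * 12),        # clear poly(A) tail
    Read(end=100, soft_clip="AAAAAGAAAAAA"),  # 11 of 12 bases are A
    Read(end=100, soft_clip="AAAA"),          # shorter than min_tail_len
    Read(end=250, soft_clip=""),              # no soft-clipped segment
]

end_counts = count_ends(reads, method="end")
tail_counts = count_ends(reads, method="tail")
```

In this toy input, end counting uses all four reads, while tail counting keeps only the two reads whose soft-clipped segment passes both thresholds.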

lapa.lapa_tss(alignment: str, fasta: str, annotation: str, chrom_sizes: str, output_dir: str, method='start', mapq=10, cluster_extent_cutoff=3, cluster_window=25, cluster_ratio_cutoff=0.05, min_replication_rate=0.95, replication_rolling_size=1000, replication_num_sample=2, replication_min_count=1, non_replicates_read_threhold=10)

LAPA TSS high-level API for TSS cluster calling.

Parameters
  • alignment – Single or multiple BAM file paths separated by commas. Alternatively, a CSV file with the columns sample, dataset, and path, where sample is the name of the sample, dataset is the group of samples that are replicates of each other, and path is the path to the BAM file.

  • fasta – Genome reference (GENCODE or ENSEMBL fasta)

  • annotation – Standard genome annotation (GENCODE or ENSEMBL GTF). GENCODE GTF files do not contain annotations for five_prime_utr and three_prime_utr, so they need to be corrected with gencode_utr_fix.

  • chrom_sizes – Chrom sizes files (can be generated with

  • method – Counting method

  • output_dir – Output directory (see lapa.readthedocs.io/en/latest/output.html for the details of the directory structure and file format).

  • min_tail_len – Minimum tail length for the tail counting strategy. This parameter is ignored for end counting.

  • min_percent_a – Minimum percentage of A bases in the soft-trimmed segment for the segment to be considered a tail. This parameter is ignored for end counting.

  • mapq – Minimum read mapping quality required for TSS calling.

  • cluster_extent_cutoff – Minimum number of reads required to initialize a cluster. A cluster is terminated if the read count stays below this cutoff for a certain number of base pairs (see cluster_window).

  • cluster_ratio_cutoff – Percentage of coverage change required to initialize a cluster. At least x% of the reads covering a base pair need to end at that position to initialize the cluster. The filter assumes that fewer than x% of the reads covering a position could stop there by chance, so such positions are filtered out as noise.

  • cluster_window – Patience threshold for terminating a cluster. If the read count stays below cluster_extent_cutoff for x bp, the cluster is terminated; otherwise, the cluster is extended.

  • min_replication_rate – Minimum replication rate required to include a cluster among the replicated clusters. The recommended cutoff is 0.95 for experimental replication and 0.75 for biological replication.

  • replication_rolling_size – Replication rolling size to calculate replication rate.

  • replication_num_sample – Number of samples in which a region needs to be observed to count as replicated.

  • replication_min_count – Minimum count needed to recognize a region as expressed.

  • non_replicates_read_threhold – Minimum read count needed for samples without replicates. If a sample has no replicate samples, this default cutoff is applied.
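
The three cluster_* parameters above can be read as a simple scan over per-base counts. The following toy sketch is one interpretation of those descriptions, not LAPA's actual algorithm: the function call_clusters and its inputs site_counts (reads ending or starting at each position) and coverage (reads covering each position) are hypothetical, and applying the ratio filter only at cluster initialization is an assumption.

```python
from typing import List, Sequence, Tuple


def call_clusters(site_counts: Sequence[int], coverage: Sequence[int],
                  cluster_extent_cutoff: int = 3, cluster_window: int = 25,
                  cluster_ratio_cutoff: float = 0.05) -> List[Tuple[int, int]]:
    """Return half-open (start, end) clusters over a toy per-bp count track."""
    clusters = []
    start = None     # start of the currently open cluster, or None
    last_hit = None  # last position whose count reached the extent cutoff
    for pos, (count, cov) in enumerate(zip(site_counts, coverage)):
        above = count >= cluster_extent_cutoff
        if start is None:
            # Initialization needs enough reads and a large enough
            # share of the reads covering this position.
            if above and cov > 0 and count / cov >= cluster_ratio_cutoff:
                start = last_hit = pos
        elif above:
            last_hit = pos  # enough reads again: extend the cluster
        elif pos - last_hit >= cluster_window:
            # Counts stayed below the cutoff for cluster_window bp: terminate.
            clusters.append((start, last_hit + 1))
            start = last_hit = None
    if start is not None:
        clusters.append((start, last_hit + 1))
    return clusters


site_counts = [0] * 60
coverage = [40] * 60
site_counts[2], coverage[2] = 3, 100  # 3% of covering reads: treated as noise
site_counts[5] = 4                    # 10% of covering reads: opens a cluster
site_counts[8] = 3
site_counts[20] = 5
site_counts[50] = 6                   # more than 25 bp later: new cluster

clusters = call_clusters(site_counts, coverage)
```

With the default parameters, the positions 5, 8, and 20 fall into one cluster because the gaps between them are shorter than cluster_window, while position 50 opens a second cluster.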

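lapa.link_tss_to_tes(alignment, lapa_dir, lapa_tss_dir, distance=50, mapq=10, min_read_length=100, dataset='all')

Link transcript start sites to transcript end sites using long reads from the alignment file.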

Parameters
  • alignment (str) – Path to a BAM file or a TALON read_annot file.

  • lapa_dir (str) – Path to the LAPA output directory generated with the lapa command.

  • lapa_tss_dir (str) – Path to the LAPA TSS output directory generated with the lapa_tss command.

lapa.read_polyA_cluster(path: str)

Read poly(A) cluster file generated by LAPA.

Parameters

  • path – Path to the LAPA poly(A) cluster BED file.

lapa.read_tss_cluster(path: str)

Read TSS cluster file generated by LAPA.

Parameters

  • path – Path to the LAPA TSS cluster BED file.