lapa.count

Module Contents

Classes

BaseCounter

Base class to count features from an alignment file.

ThreePrimeCounter

Counts 3' ends of reads (transcript end sites) per position

FivePrimeCounter

Counts 5' ends of reads (transcript start sites) per position

PolyaTailCounter

Counts 3' end of reads with polyA-tail (transcript end sites) per position

BaseMultiCounter

Base class to count reads from multiple alignment files.

TesMultiCounter

Counts transcript end sites from multiple alignment files.

TssMultiCounter

Counts transcript start sites from multiple alignment files.

Functions

_tqdm_counting(iterable)

Adapter that integrates tqdm progress output with logging

save_count_bw(df, output_dir, chrom_sizes, prefix)

Saves counts generated with an instance of BaseCounter as a bigwig file for each strand

save_tss_count_bw(df, chrom_sizes, output_dir, prefix)

Saves counts of transcript start sites (TSS) as bigwig file

save_tes_count_bw(df, chrom_sizes, output_dir, prefix)

Saves counts of transcript end sites (TES) as bigwig file

lapa.count._tqdm_counting(iterable)

Adapter that integrates tqdm progress output with logging

class lapa.count.BaseCounter(bam_file, mapq=10, progress=True)

Base class to count features from an alignment file.

Parameters
  • bam_file – Path to bam file or pysam.AlignmentFile object.

  • mapq (int, optional) – Minimum mapping quality a read needs to be used in counting.

  • progress (bool, optional) – Show progress in counting.

bam

alignment file.

Type

pysam.AlignmentFile

Examples

Count mid location of the read.

>>> import os
>>> class MidCounter(BaseCounter):
...     def count_read(self, read):
...         # Integer midpoint of the aligned read.
...         return (read.reference_start + read.reference_end) // 2
>>> counter = MidCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'mid')
>>> os.listdir(output_dir)
['mid_pos.bw', 'mid_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
property bam(self)

Alignment file used in counting.

iter_reads(self, chrom=None, strand=None)
filter_read(self, read)

Returns True if the read should be counted; reads for which this returns False are excluded from counting.
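As a rough illustration of a read filter, the sketch below keeps mapped reads whose mapping quality reaches a threshold. The Read namedtuple is a hypothetical stand-in for pysam.AlignedSegment, and the exact conditions lapa applies (for example whether the comparison is inclusive) are assumptions, not taken from the source.

```python
from collections import namedtuple

# Hypothetical stand-in exposing two pysam.AlignedSegment attributes.
Read = namedtuple("Read", ["mapping_quality", "is_unmapped"])


def passes_mapq(read, mapq=10):
    """Return True if the read is mapped and meets the minimum quality.

    Assumption: the threshold is inclusive (quality >= mapq).
    """
    return (not read.is_unmapped) and read.mapping_quality >= mapq


reads = [Read(60, False), Read(3, False), Read(0, True)]
kept = [r for r in reads if passes_mapq(r)]
print(len(kept))
```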

count(self)

Counts reads per position as defined by the count_read method

Returns

Dictionary with (chromosome, position, strand) as keys and counts as values.

Return type

Dict[(chrom, pos, strand), int]
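The shape of this return value can be sketched with the standard library. The Read namedtuple and the count_positions helper below are hypothetical names used only for illustration; they are not part of lapa, and the real method reads from a bam file rather than a list.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for an aligned read, for illustration only.
Read = namedtuple("Read", ["chrom", "start", "end", "strand"])


def count_positions(reads, count_read):
    """Tally reads per (chrom, pos, strand) key.

    count_read maps a read to the position it contributes to,
    mirroring the role of the abstract count_read method.
    """
    counts = Counter()
    for read in reads:
        counts[(read.chrom, count_read(read), read.strand)] += 1
    return dict(counts)


reads = [
    Read("chr1", 100, 200, "+"),
    Read("chr1", 150, 200, "+"),
    Read("chr1", 300, 400, "-"),
]
# Count by the end coordinate of each read.
counts = count_positions(reads, lambda r: r.end)
print(counts)
```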

abstract count_read(self, read: pysam.AlignedSegment)
to_gr(self)
Counts as a dataframe with columns ['Chromosome', 'Start', 'End', 'Strand', 'count']

to_df(self)
static _to_bigwig(gr, chrom_sizes, output_dir, prefix)
to_bigwig(self, chrom_sizes, output_dir, prefix='lapa_counts')

Saves counts as bigwig file for each strand

Parameters
  • chrom_sizes (str) – Chrom sizes file; it can be generated from a fasta file with: faidx fasta -i chromsizes > chrom_sizes

  • output_dir – Output directory to save bigwig files

  • prefix (str) – File prefix used for the bigwig files
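A chrom sizes file is a two-column, tab-separated table of sequence name and length. If the faidx command is not available, an equivalent table can be derived from a fasta file with a few lines of standard-library Python. The helper name chrom_sizes_from_fasta is hypothetical, and this is a minimal sketch rather than a replacement for faidx.

```python
import io


def chrom_sizes_from_fasta(handle):
    """Return {sequence name: length} from an open fasta file handle."""
    sizes = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first word of the header line.
            name = line[1:].split()[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


# In-memory fasta used as example input.
fasta = io.StringIO(">chr1 test\nACGT\nAC\n>chr2\nGGG\n")
sizes = chrom_sizes_from_fasta(fasta)
for chrom, length in sizes.items():
    print(f"{chrom}\t{length}")
```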

lapa.count.save_count_bw(df, output_dir, chrom_sizes, prefix)
Saves counts generated with an instance of BaseCounter as a bigwig file for each strand.

Parameters
  • chrom_sizes (str) – Chrom sizes file; it can be generated from a fasta file with: faidx fasta -i chromsizes > chrom_sizes

  • output_dir – Output directory to save bigwig files

  • prefix (str) – File prefix used for the bigwig files

lapa.count.save_tss_count_bw(df, chrom_sizes, output_dir, prefix)
Saves counts of transcript start sites (TSS) as a bigwig file for each strand, generated with TSS counters.

Parameters
  • chrom_sizes (str) – Chrom sizes file; it can be generated from a fasta file with: faidx fasta -i chromsizes > chrom_sizes

  • output_dir – Output directory to save bigwig files

  • prefix (str) – File prefix used for the bigwig files

lapa.count.save_tes_count_bw(df, chrom_sizes, output_dir, prefix)
Saves counts of transcript end sites (TES) as a bigwig file for each strand, generated with TES counters.

Parameters
  • chrom_sizes (str) – Chrom sizes file; it can be generated from a fasta file with: faidx fasta -i chromsizes > chrom_sizes

  • output_dir – Output directory to save bigwig files

  • prefix (str) – File prefix used for the bigwig files

class lapa.count.ThreePrimeCounter(bam_file, mapq=10, progress=True)

Bases: BaseCounter

Counts 3’ ends of reads (transcript end sites) per position from an alignment file.

Parameters
  • bam_file – Path to bam file or pysam.AlignmentFile object.

  • mapq (int, optional) – Minimum mapping quality a read needs to be used in counting.

  • progress (bool, optional) – Show progress in counting.

bam

alignment file.

Type

pysam.AlignmentFile

Examples

Count 3’ ends of the read per position.

>>> counter = ThreePrimeCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'lapa_count')
>>> os.listdir(output_dir)
['lapa_count_pos.bw', 'lapa_count_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
count_read(self, read: pysam.AlignedSegment)

Returns 3’ end of the read
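Which end of the alignment is the 3’ end depends on the strand: for a forward-strand read it is the rightmost aligned base, for a reverse-strand read the leftmost. The sketch below shows this using the pysam.AlignedSegment attributes reference_start (0-based, inclusive), reference_end (0-based, exclusive) and is_reverse. The Read namedtuple is a hypothetical stand-in, and reporting a 0-based position is an assumption; the coordinate convention lapa uses is not stated here.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the pysam.AlignedSegment attributes used below.
Read = namedtuple("Read", ["reference_start", "reference_end", "is_reverse"])


def three_prime_end(read):
    """Return the 0-based position of the 3'-most aligned base.

    Assumption: positions are reported 0-based; lapa may use a
    different offset.
    """
    if read.is_reverse:
        # Reverse strand: the 3' end is the leftmost aligned base.
        return read.reference_start
    # Forward strand: reference_end is exclusive, so step back by one.
    return read.reference_end - 1


forward = Read(reference_start=100, reference_end=200, is_reverse=False)
reverse = Read(reference_start=100, reference_end=200, is_reverse=True)
print(three_prime_end(forward), three_prime_end(reverse))
```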

static _calculate_tail_seq(tail_seq, tail_base)

Calculate tail seq

class lapa.count.FivePrimeCounter(bam_file, mapq=10, progress=True)

Bases: BaseCounter

Counts 5’ ends of reads (transcript start sites) per position from an alignment file.

Parameters
  • bam_file – Path to bam file or pysam.AlignmentFile object.

  • mapq (int, optional) – Minimum mapping quality a read needs to be used in counting.

  • progress (bool, optional) – Show progress in counting.

bam

alignment file.

Type

pysam.AlignmentFile

Examples

Count 5’ ends of the read per position.

>>> counter = FivePrimeCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'lapa_count')
>>> os.listdir(output_dir)
['lapa_count_pos.bw', 'lapa_count_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
count_read(self, read: pysam.AlignedSegment)

Returns 5’ end of the read

class lapa.count.PolyaTailCounter(bam_file, mapq=10, progress=True, min_tail_len=10, min_percent_a=0.9, count_aligned=False)

Bases: ThreePrimeCounter

Counts 3’ ends of reads with polyA-tail (transcript end sites) per position from an alignment file.

Parameters
  • bam_file – Path to bam file or pysam.AlignmentFile object.

  • mapq (int, optional) – Minimum mapping quality a read needs to be used in counting.

  • progress (bool, optional) – Show progress in counting.

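  • min_tail_len (int, optional) – Minimum tail length for a read to be treated as polyA-tailed.

  • min_percent_a (float, optional) – Minimum proportion of A bases in the tail for a read to be treated as polyA-tailed.

  • count_aligned (bool, optional) – Count aligned base pairs (likely internal priming) as well in tail length.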

bam

alignment file.

Type

pysam.AlignmentFile

Examples

Count 3’ ends of polyA-tailed reads per position.

>>> counter = PolyaTailCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'lapa_count')
>>> os.listdir(output_dir)
['lapa_count_pos.bw', 'lapa_count_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
static detect_polyA_tail(read: pysam.AlignedSegment, count_aligned=False)

Detect polyA tails from a read

Parameters
  • read – Aligned read.

  • count_aligned – Count aligned base pairs (likely internal priming) as well in tail length.

Returns

Tuple of polyA site, tail length, and percent of A bases in the tail.
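The tail length and A content can be pictured with a small standalone sketch. It works on a candidate tail sequence given as a plain string and applies the default thresholds from the class signature (min_tail_len=10, min_percent_a=0.9). The function names tail_stats and is_tailed are hypothetical, treating the A content as a fraction between 0 and 1 is an assumption based on the 0.9 default, and how lapa extracts the tail from the read is not shown.

```python
def tail_stats(tail_seq, tail_base="A"):
    """Return (tail length, fraction of tail_base) for a candidate tail."""
    tail_len = len(tail_seq)
    if tail_len == 0:
        return 0, 0.0
    return tail_len, tail_seq.count(tail_base) / tail_len


def is_tailed(tail_len, percent_a, min_tail_len=10, min_percent_a=0.9):
    """Apply length and A-content thresholds (assumed to be inclusive)."""
    return tail_len >= min_tail_len and percent_a >= min_percent_a


# Twelve bases, eleven of them A.
tail_len, percent_a = tail_stats("A" * 10 + "GA")
print(tail_len, round(percent_a, 3), is_tailed(tail_len, percent_a))
```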

_read_is_tailed(self, tail_len, percent_a)
iter_tailed_reads(self)

Iterates over polyA-tailed reads and their polyA sites, keeping those that pass the polyA filters.

save_tailed_reads(self, output_bam)

Saves tailed reads to a bam file

Parameters

output_bam – Path to bam file or pysam.AlignmentFile object.

tail_len_dist(self)

Returns tail length distribution of reads based on the filters.

plot_tail_len_dist(self)

Plots pdf and cdf of tail length distribution

filter_read(self, read)

Filters reads by polyA tail and mapping quality

class lapa.count.BaseMultiCounter(df_alignment: pandas.DataFrame, method: str, mapq=10, is_read_annot=False)

Base class to count reads from multiple alignment files.

Parameters
  • df_alignment – DataFrame with columns [‘sample’, ‘dataset’, ‘path’], where sample is the sample name, dataset is the name of the group (replicates) the sample belongs to, and path is the path to the bam file.

  • method – Counting method implemented by child class.

  • mapq – minimum mapping quality

  • is_read_annot – A TALON read annotation file can be provided as the df_alignment argument; in that case this argument must be True.

abstract build_counter(self, bam)
abstract _count_read_annot(self)
static _to_bigwig(df_all, tes, chrom_sizes, output_dir, prefix='polyA')
to_df(self)

Export counts as dataframe.

Returns

Counts as a tuple: the first element is a dataframe of all the counts, and the second element is a dictionary whose keys are sample names and whose values are dataframes of counts for each sample.

Return type

(pd.DataFrame, Dict[str, pd.DataFrame])
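The (all counts, per-sample counts) shape of this return value can be illustrated with plain dictionaries in place of DataFrames. Summing per-sample counts to obtain the combined counts is an assumption made for this sketch; the source does not state how lapa combines them, and the sample names and positions are example values.

```python
from collections import Counter

# Per-sample counts keyed by (chrom, pos, strand); example values only.
samples = {
    "s1": {("chr1", 887771, "+"): 3, ("chr1", 994684, "-"): 8},
    "s2": {("chr1", 887771, "+"): 2},
}

# Assumption: the combined counts are the sum over samples.
total = Counter()
for counts in samples.values():
    total.update(counts)

# Mirror the tuple returned by to_df: (all counts, counts per sample).
df_all, per_sample = dict(total), samples
print(df_all)
```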

class lapa.count.TesMultiCounter(alignment, method='end', mapq=10, min_tail_len=10, min_percent_a=0.9, is_read_annot=False)

Bases: BaseMultiCounter

Counts transcript end sites from multiple alignment files.

Parameters
  • df_alignment – DataFrame with columns [‘sample’, ‘dataset’, ‘path’], where sample is the sample name, dataset is the name of the group (replicates) the sample belongs to, and path is the path to the bam file.

  • method – Either end or tail; see ThreePrimeCounter and PolyaTailCounter for the counting behavior.

  • mapq – minimum mapping quality

  • is_read_annot – A TALON read annotation file can be provided as the df_alignment argument; in that case this argument must be True.

Examples

Counts transcript end sites for two samples with two replicates

>>> df_alignment = pd.DataFrame({
...     'sample': ['s1', 's2', 's3', 's4'],
...     'dataset': ['d1', 'd2', 'd3', 'd4'],
...     'path': ['s1.bam', 's2.bam', 's3.bam', 's4.bam']
... })
>>> counter = TesMultiCounter(df_alignment)
>>> counter.to_bigwig(chrom_sizes, output_dir) # export counts as bw
>>> df_all, samples = counter.to_df() # or export as df
>>> df_all
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> samples['s1']
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
build_counter(self, bam)
_count_read_annot(self)
class lapa.count.TssMultiCounter(alignment, method='start', mapq=10, is_read_annot=False)

Bases: BaseMultiCounter

Counts transcript start sites from multiple alignment files.

Parameters
  • df_alignment – DataFrame with columns [‘sample’, ‘dataset’, ‘path’], where sample is the sample name, dataset is the name of the group (replicates) the sample belongs to, and path is the path to the bam file.

  • method – Counting method, start by default; see FivePrimeCounter for the counting behavior.

  • mapq – minimum mapping quality

  • is_read_annot – A TALON read annotation file can be provided as the df_alignment argument; in that case this argument must be True.

Examples

Counts transcript start sites for two samples with two replicates

>>> df_alignment = pd.DataFrame({
...     'sample': ['s1', 's2', 's3', 's4'],
...     'dataset': ['d1', 'd2', 'd3', 'd4'],
...     'path': ['s1.bam', 's2.bam', 's3.bam', 's4.bam']
... })
>>> counter = TssMultiCounter(df_alignment)
>>> counter.to_bigwig(chrom_sizes, output_dir) # export counts as bw
>>> df_all, samples = counter.to_df() # or export as df
>>> df_all
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> samples['s1']
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
build_counter(self, bam)
_count_read_annot(self)