`lapa.count`

Module Contents

Classes

`BaseCounter`	Base class to count features from alignment file.
`ThreePrimeCounter`	Counts 3' ends of reads (transcript end sites) per position
`FivePrimeCounter`	Counts 5' ends of reads (transcript start sites) per position
`PolyaTailCounter`	Counts 3' end of reads with polyA-tail (transcript end sites) per position
`BaseMultiCounter`	Base class to count reads from multiple alignment files.
`TesMultiCounter`	Counts transcript end sites from multiple alignment files.
`TssMultiCounter`	Counts transcript start sites from multiple alignment files.

Functions

`_tqdm_counting`(iterable)	Adaptor for tqdm to integrate with logging
`save_count_bw`(df, output_dir, chrom_sizes, prefix)	Saves counts generated with an instance of BaseCounter as bigwig file for each strand
`save_tss_count_bw`(df, chrom_sizes, output_dir, prefix)	Saves counts of transcript start sites (TSS) as bigwig file
`save_tes_count_bw`(df, chrom_sizes, output_dir, prefix)	Saves counts of transcript end sites (TES) as bigwig file

lapa.count._tqdm_counting(iterable): Adaptor for tqdm to integrate with logging

class lapa.count.BaseCounter(bam_file, mapq=10, progress=True)

Base class to count features from alignment file.

Parameters

bam_file – Path to bam file or pysam.AlignmentFile object.
mapq (int, optional) – minimum mapping quality a read needs to be used in counting.
progress (bool, optional) – Show progress in counting.

bam

alignment file.

Type: pysam.AlignmentFile

Examples

Count mid location of the read.

>>> class MidCounter(BaseCounter):
...     def count_read(self, read):
...         # integer division keeps the position an integer
...         return (read.reference_start + read.reference_end) // 2
>>> counter = MidCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'mid')
>>> os.listdir(output_dir)
['mid_pos.bw', 'mid_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...

property bam(self): Alignment file used in counting.

iter_reads(self, chrom=None, strand=None)

filter_read(self, read): Returns true if the read should be counted; reads for which it returns false are excluded from counting.
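
As an illustration of the filtering contract, here is a minimal sketch that assumes the base filter only compares the mapping quality of a read against `mapq`. `FakeRead` and `keep_read` are hypothetical stand-ins for `pysam.AlignedSegment` and `filter_read`; the actual method may apply further checks.

```python
from collections import namedtuple

# Hypothetical stand-in for pysam.AlignedSegment; only the fields used here.
FakeRead = namedtuple("FakeRead", ["query_name", "mapping_quality"])


def keep_read(read, mapq=10):
    """Return True if the read should be counted (assumed rule: MAPQ >= mapq)."""
    return read.mapping_quality >= mapq


reads = [FakeRead("r1", 60), FakeRead("r2", 3), FakeRead("r3", 10)]

# Only reads that pass the filter take part in counting.
kept = [read.query_name for read in reads if keep_read(read)]
print(kept)
```

Subclasses such as PolyaTailCounter override the filter to add their own conditions on top of the quality check.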

count(self)

Counts reads per position, where the position of each read is defined by the count_read function.

Returns

Dictionary with (chromosome, position, strand) as keys and counts as values.

Return type

Dict[(chrom, pos, strand), int]
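
The shape of the returned dictionary can be sketched as follows. This is not the library's implementation: `FakeRead`, `count_positions` and `midpoint` are hypothetical names, and the reads are plain tuples rather than `pysam.AlignedSegment` objects.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for an aligned read.
FakeRead = namedtuple("FakeRead", ["chrom", "start", "end", "is_reverse"])


def count_positions(reads, count_read):
    """Tally reads per (chrom, pos, strand), with pos given by count_read."""
    counts = Counter()
    for read in reads:
        strand = "-" if read.is_reverse else "+"
        counts[(read.chrom, count_read(read), strand)] += 1
    return dict(counts)


def midpoint(read):
    """Example position function: the middle of the alignment."""
    return (read.start + read.end) // 2


reads = [
    FakeRead("chr1", 100, 200, False),
    FakeRead("chr1", 110, 190, False),
    FakeRead("chr1", 300, 400, True),
]
counts = count_positions(reads, midpoint)
print(counts)
```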

abstract count_read(self, read: pysam.AlignedSegment)

to_gr(self)

Counts as dataframe with columns of: [‘Chromosome’, ‘Start’, ‘End’, ‘Strand’, ‘count’]

to_df(self)

static _to_bigwig(gr, chrom_sizes, output_dir, prefix)

to_bigwig(self, chrom_sizes, output_dir, prefix='lapa_counts')

Saves counts as bigwig file for each strand

Parameters

chrom_sizes (str) – Chrom sizes file (can be generated from a fasta file with faidx fasta -i chromsizes > chrom_sizes)
output_dir – Output directory to save bigwig files
prefix (str) – File prefix to use for the bigwig files
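
If a fasta index (`.fai`) is already available, the chrom sizes file can also be derived from it, since the first two tab-separated columns of a `.fai` file are the sequence name and its length. The sketch below assumes that layout; `fai_to_chrom_sizes` and `write_chrom_sizes` are hypothetical helpers and the index content is toy data.

```python
import io


def fai_to_chrom_sizes(fai_lines):
    """Extract (name, length) pairs from the lines of a .fai index."""
    sizes = []
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, length = line.split("\t")[:2]
        sizes.append((name, int(length)))
    return sizes


def write_chrom_sizes(sizes, handle):
    """Write one 'name<TAB>length' line per chromosome."""
    for name, length in sizes:
        handle.write(f"{name}\t{length}\n")


# Toy index with the five .fai columns: name, length, offset, linebases, linewidth.
fai_text = "chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n"
sizes = fai_to_chrom_sizes(fai_text.splitlines())

buffer = io.StringIO()
write_chrom_sizes(sizes, buffer)
print(buffer.getvalue(), end="")
```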

lapa.count.save_count_bw(df, output_dir, chrom_sizes, prefix)

Saves counts generated with an instance of BaseCounter as a bigwig file for each strand.

Parameters

chrom_sizes (str) – Chrom sizes file (can be generated from a fasta file with faidx fasta -i chromsizes > chrom_sizes)
output_dir – Output directory to save bigwig files
prefix (str) – File prefix to use for the bigwig files

lapa.count.save_tss_count_bw(df, chrom_sizes, output_dir, prefix)

Saves counts of transcript start sites (TSS), generated with TSS counters, as a bigwig file for each strand.

Parameters

chrom_sizes (str) – Chrom sizes file (can be generated from a fasta file with faidx fasta -i chromsizes > chrom_sizes)
output_dir – Output directory to save bigwig files
prefix (str) – File prefix to use for the bigwig files

lapa.count.save_tes_count_bw(df, chrom_sizes, output_dir, prefix)

Saves counts of transcript end sites (TES), generated with TES counters, as a bigwig file for each strand.

Parameters

chrom_sizes (str) – Chrom sizes file (can be generated from a fasta file with faidx fasta -i chromsizes > chrom_sizes)
output_dir – Output directory to save bigwig files
prefix (str) – File prefix to use for the bigwig files

class lapa.count.ThreePrimeCounter(bam_file, mapq=10, progress=True)

Bases: BaseCounter

Counts 3’ ends of reads (transcript end sites) per position from an alignment file.

Parameters

bam_file – Path to bam file or pysam.AlignmentFile object.
mapq (int, optional) – minimum mapping quality a read needs to be used in counting.
progress (bool, optional) – Show progress in counting.

bam

alignment file.

Type: pysam.AlignmentFile

Examples

Count 3’ ends of the read per position.

>>> counter = ThreePrimeCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'lapa_count')
>>> os.listdir(output_dir)
['lapa_count_pos.bw', 'lapa_count_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...

count_read(self, read: pysam.AlignedSegment): Returns 3’ end of the read

static _calculate_tail_seq(tail_seq, tail_base): Calculate tail seq

class lapa.count.FivePrimeCounter(bam_file, mapq=10, progress=True)

Bases: BaseCounter

Counts 5’ ends of reads (transcript start sites) per position from an alignment file.

Parameters

bam_file – Path to bam file or pysam.AlignmentFile object.
mapq (int, optional) – minimum mapping quality a read needs to be used in counting.
progress (bool, optional) – Show progress in counting.

bam

alignment file.

Type: pysam.AlignmentFile

Examples

Count 5’ ends of the read per position.

>>> counter = FivePrimeCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'lapa_count')
>>> os.listdir(output_dir)
['lapa_count_pos.bw', 'lapa_count_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...

count_read(self, read: pysam.AlignedSegment): Returns 5’ end of the read
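
Which alignment coordinate counts as the 3’ or 5’ end depends on the strand. The sketch below illustrates this under stated assumptions: it uses the pysam convention that `reference_start` is 0-based and `reference_end` points one past the last aligned base, and it reports 0-based positions. `FakeRead`, `three_prime_end` and `five_prime_end` are hypothetical names, and the library may use a different coordinate convention.

```python
from collections import namedtuple

# Hypothetical stand-in for pysam.AlignedSegment.
FakeRead = namedtuple("FakeRead", ["reference_start", "reference_end", "is_reverse"])


def three_prime_end(read):
    """0-based 3' end: last aligned base on '+', first aligned base on '-'."""
    return read.reference_start if read.is_reverse else read.reference_end - 1


def five_prime_end(read):
    """0-based 5' end: first aligned base on '+', last aligned base on '-'."""
    return read.reference_end - 1 if read.is_reverse else read.reference_start


forward = FakeRead(100, 200, False)
reverse = FakeRead(100, 200, True)

print(three_prime_end(forward), five_prime_end(forward))
print(three_prime_end(reverse), five_prime_end(reverse))
```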

class lapa.count.PolyaTailCounter(bam_file, mapq=10, progress=True, min_tail_len=10, min_percent_a=0.9, count_aligned=False)

Bases: ThreePrimeCounter

Counts 3’ ends of reads with polyA-tail (transcript end sites) per position from an alignment file.

Parameters

bam_file – Path to bam file or pysam.AlignmentFile object.
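mapq (int, optional) – minimum mapping quality a read needs to be used in counting.
progress (bool, optional) – Show progress in counting.
min_tail_len – minimum length of the polyA-tail required to count a read.
min_percent_a – minimum fraction of A bases required in the tail.
count_aligned – Count aligned base pairs (likely internal priming) as well in tail length.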

bam

alignment file.

Type: pysam.AlignmentFile

Examples

Count 3’ ends of the read per position.

>>> counter = PolyaTailCounter(bam_file)
>>> counter.to_bigwig(chrom_sizes, output_dir, 'lapa_count')
>>> os.listdir(output_dir)
['lapa_count_pos.bw', 'lapa_count_neg.bw']
>>> counter.to_df()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...

static detect_polyA_tail(read: pysam.AlignedSegment, count_aligned=False)

Detect polyA tails from a read

Parameters

read – aligned read
count_aligned – Count aligned base pairs (likely internal priming) as well in tail length.

Returns

Tuple of polyA_site, length of tail, percent of A base in tails.
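
To make the tail length and A-fraction values concrete, here is a rough sketch of how they could be computed from an unaligned tail sequence and then compared with `min_tail_len` and `min_percent_a`. `tail_stats` and `is_tailed` are hypothetical helpers, and the use of inclusive thresholds is an assumption rather than documented behavior.

```python
def tail_stats(tail_seq, tail_base="A"):
    """Return (tail length, fraction of tail_base) for a tail sequence."""
    tail_seq = tail_seq.upper()
    tail_len = len(tail_seq)
    if tail_len == 0:
        return 0, 0.0
    return tail_len, tail_seq.count(tail_base) / tail_len


def is_tailed(tail_len, percent_a, min_tail_len=10, min_percent_a=0.9):
    """Assumed rule: both thresholds must be met (inclusive)."""
    return tail_len >= min_tail_len and percent_a >= min_percent_a


# A 12-base tail with a single non-A base passes the default thresholds.
long_len, long_pct = tail_stats("AAAAAAAAAAAG")
# A 4-base tail is too short, even though it consists only of A bases.
short_len, short_pct = tail_stats("AAAA")

print(long_len, round(long_pct, 3), is_tailed(long_len, long_pct))
print(short_len, short_pct, is_tailed(short_len, short_pct))
```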

_read_is_tailed(self, tail_len, percent_a)

iter_tailed_reads(self): Iterates polyA reads and polyA_site based on polyA filters.

save_tailed_reads(self, output_bam)

Save tailed reads as bam files

Parameters: output_bam – Path to bam file or pysam.AlignmentFile object.

tail_len_dist(self): Returns tail length distribution of reads based on the filters.

plot_tail_len_dist(self): Plots pdf and cdf of tail length distribution
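
For readers unfamiliar with the two curves, the pdf and cdf of a tail length distribution can be derived from a list of tail lengths as sketched below. `tail_len_pdf_cdf` is a hypothetical helper working on toy values; it does not reproduce the return type of `tail_len_dist`.

```python
from collections import Counter
from itertools import accumulate


def tail_len_pdf_cdf(tail_lengths):
    """Return sorted lengths with their relative and cumulative frequencies."""
    hist = Counter(tail_lengths)
    total = sum(hist.values())
    lengths = sorted(hist)
    pdf = [hist[length] / total for length in lengths]
    cdf = list(accumulate(pdf))
    return lengths, pdf, cdf


lengths, pdf, cdf = tail_len_pdf_cdf([10, 10, 12, 15])
print(lengths)
print(pdf)
print(cdf)
```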

filter_read(self, read): Filters reads by polyA-tail and mapping quality

class lapa.count.BaseMultiCounter(df_alignment: pandas.DataFrame, method: str, mapq=10, is_read_annot=False)

Base class to count reads from multiple alignment files.

Parameters

df_alignment – DataFrame with columns [‘sample’, ‘dataset’, ‘path’], where sample is the sample name, dataset is the name of the group (replicates) the sample belongs to, and path is the path to the bam file.
method – Counting method implemented by child class.
mapq – minimum mapping quality
is_read_annot – Set to True when a TALON read annotation file is passed as df_alignment instead of alignment files.
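
The relationship between the three columns can be illustrated with plain Python. In this sketch the rows are dictionaries instead of a DataFrame, `group_by_dataset` is a hypothetical helper, and the file names are placeholders.

```python
from collections import defaultdict

REQUIRED_COLUMNS = ("sample", "dataset", "path")

# Two datasets (groups) with two replicate samples each.
rows = [
    {"sample": "s1", "dataset": "d1", "path": "s1.bam"},
    {"sample": "s2", "dataset": "d1", "path": "s2.bam"},
    {"sample": "s3", "dataset": "d2", "path": "s3.bam"},
    {"sample": "s4", "dataset": "d2", "path": "s4.bam"},
]


def group_by_dataset(rows):
    """Map each dataset name to the samples (replicates) that belong to it."""
    groups = defaultdict(list)
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        groups[row["dataset"]].append(row["sample"])
    return dict(groups)


groups = group_by_dataset(rows)
print(groups)
```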

abstract build_counter(self, bam)

abstract _count_read_annot(self)

static _to_bigwig(df_all, tes, chrom_sizes, output_dir, prefix='polyA')

to_df(self)

Export counts as dataframe.

Returns: Counts as a tuple. The first element is a dataframe of all the counts; the second element is a dictionary that maps each sample name to the dataframe of counts for that sample.
Return type: (pd.DataFrame, Dict[str, pd.DataFrame])
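
The structure of the return value can be sketched with dictionaries in place of dataframes. The example assumes that the combined table is the sum of the per-sample counts, which the documentation above does not state explicitly; `merge_counts` is a hypothetical helper.

```python
from collections import Counter


def merge_counts(per_sample):
    """Return (combined counts, per-sample counts), summing over samples."""
    total = Counter()
    for counts in per_sample.values():
        total.update(counts)
    return dict(total), per_sample


per_sample = {
    "s1": {("chr1", 887771, "+"): 3, ("chr1", 994684, "-"): 8},
    "s2": {("chr1", 887771, "+"): 2},
}
all_counts, samples = merge_counts(per_sample)
print(all_counts)
print(samples["s1"])
```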

class lapa.count.TesMultiCounter(alignment, method='end', mapq=10, min_tail_len=10, min_percent_a=0.9, is_read_annot=False)

Bases: BaseMultiCounter

Counts transcript end sites from multiple alignment files.

Parameters

df_alignment – DataFrame with columns [‘sample’, ‘dataset’, ‘path’], where sample is the sample name, dataset is the name of the group (replicates) the sample belongs to, and path is the path to the bam file.
method – either end or tail; see ThreePrimeCounter and PolyaTailCounter for counting behavior.
mapq – minimum mapping quality
is_read_annot – Set to True when a TALON read annotation file is passed as df_alignment instead of alignment files.

Examples

Counts transcript end sites for four alignment files

>>> df_alignment = pd.DataFrame({
...     'sample': ['s1', 's2', 's3', 's4'],
...     'dataset': ['d1', 'd2', 'd3', 'd4'],
...     'path': ['s1.bam', 's2.bam', 's3.bam', 's4.bam']
... })
>>> counter = TesMultiCounter(df_alignment)
>>> counter.to_bigwig(chrom_sizes, output_dir) # export counts as bw
>>> df_all, samples = counter.to_df() # or export as df
>>> df_all
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> samples['s1']
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...

build_counter(self, bam)

_count_read_annot(self)

class lapa.count.TssMultiCounter(alignment, method='start', mapq=10, is_read_annot=False)

Bases: BaseMultiCounter

Counts transcript start sites from multiple alignment files.

Parameters

df_alignment – DataFrame with columns [‘sample’, ‘dataset’, ‘path’], where sample is the sample name, dataset is the name of the group (replicates) the sample belongs to, and path is the path to the bam file.
method – counting method, start by default; see FivePrimeCounter for counting behavior.
mapq – minimum mapping quality
is_read_annot – Set to True when a TALON read annotation file is passed as df_alignment instead of alignment files.

Examples

Counts transcript start sites for four alignment files

>>> df_alignment = pd.DataFrame({
...     'sample': ['s1', 's2', 's3', 's4'],
...     'dataset': ['d1', 'd2', 'd3', 'd4'],
...     'path': ['s1.bam', 's2.bam', 's3.bam', 's4.bam']
... })
>>> counter = TssMultiCounter(df_alignment)
>>> counter.to_bigwig(chrom_sizes, output_dir) # export counts as bw
>>> df_all, samples = counter.to_df() # or export as df
>>> df_all
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> samples['s1']
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...

build_counter(self, bam)

_count_read_annot(self)
