`lapa.cluster`

Module Contents

Classes

`Cluster`	Represents a cluster in the genome on a chromosome.
`PolyACluster`	Poly(A) cluster used in poly(A)-site clustering; performs peak calling to obtain the exact poly(A) site.
`TssCluster`	TSS cluster used in TSS clustering; performs peak calling to obtain the exact TSS.
`Clustering`	Clustering algorithm that groups regions into clusters based on read end counts.
`PolyAClustering`	Clustering algorithm that obtains poly(A) clusters from read end counts.
`TssClustering`	Clustering algorithm that obtains TSS clusters from read start counts.

Functions

`_tqdm_clustering(iterable)`	Adapter for tqdm that integrates it with logging.

lapa.cluster._tqdm_clustering(iterable): Adapter for tqdm that integrates it with logging.

class lapa.cluster.Cluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)

Represents a cluster on a chromosome of the genome, with start and end locations and a strand. Each position between the start and end locations has a count indicating the number of reads ending at that position. Peak calling is performed on these counts. Counts are stored in a sparse format and converted to a dense format for peak calling.

Parameters

Chromosome – Chromosome of the cluster.
Start – Start position of the cluster.
End – End position of the cluster.
Strand – Strand of the cluster.
counts (List[Tuple[int, int]], optional) – Sparse representation of the counts in the cluster, as a list of (position, count) tuples.
fields (Dict[int, List], optional) – Dictionary of lists in which information about each read in the cluster can be stored. Not used by the clustering algorithm by default.
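The relationship between the sparse `counts` list and the dense form used for peak calling can be illustrated with a short sketch. The helper name `to_dense` is hypothetical and not part of `lapa.cluster`; the sketch assumes every position lies in the half-open interval from `start` to `end`.

```python
from typing import List, Tuple


def to_dense(start: int, end: int, counts: List[Tuple[int, int]]) -> List[int]:
    """Expand sparse (position, count) pairs into one count per position.

    Hypothetical helper for illustration; assumes start <= position < end.
    """
    dense = [0] * (end - start)
    for position, count in counts:
        # Shift the genomic position so that `start` maps to index 0.
        dense[position - start] += count
    return dense


sparse = [(12, 5), (15, 5), (16, 3), (18, 1)]
dense = to_dense(10, 19, sparse)
```

Positions with no supporting reads become zeros in the dense list, which is why the sparse form is the more compact one for long clusters.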

Examples

Peak calling based on the counts.

>>> cluster = Cluster('chr1', 10, 11, '+')
>>> cluster.extend(12, 5)
>>> len(cluster)
3
>>> cluster.peak()
12
>>> cluster.extend(15, 5)
>>> cluster.extend(16, 3)
>>> cluster.extend(18, 1)
>>> len(cluster)
8
>>> cluster.peak()
15

extend(self, end, count)

Extend cluster with end position and count.

Parameters

end – New end location to extend cluster.
count – Number of reads supporting position.

property total_count(self): Total counts in the cluster.

__len__(self): Length of the cluster End - Start.

static _count_arr(counts)

peak(self, window=5, std=1)

Detects the peak position, where the count is at its maximum in the cluster. Counts are smoothed with a moving average before peak calling is performed.

Parameters

window – Window size for smoothing.
std – Standard deviation of the Gaussian kernel used for smoothing.
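How `window` and `std` could interact is easier to see in code. The following is a minimal sketch, not LAPA's implementation: it assumes a Gaussian-weighted moving average, an odd `window`, and zero counts outside the cluster. The names `gaussian_kernel` and `call_peak` are hypothetical.

```python
import math
from typing import List


def gaussian_kernel(window: int, std: float) -> List[float]:
    """Normalized Gaussian weights centred on the middle of the window."""
    half = window // 2
    weights = [math.exp(-0.5 * (offset / std) ** 2) for offset in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def call_peak(start: int, dense: List[int], window: int = 5, std: float = 1.0) -> int:
    """Return the genomic position whose smoothed count is largest."""
    if window % 2 == 0:
        raise ValueError("this sketch assumes an odd window size")
    kernel = gaussian_kernel(window, std)
    half = window // 2
    best_index, best_value = 0, float("-inf")
    for i in range(len(dense)):
        value = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(dense):  # positions outside the cluster count as zero
                value += weight * dense[j]
        if value > best_value:  # strict comparison keeps the leftmost maximum
            best_index, best_value = i, value
    return start + best_index


dense_counts = [0, 0, 5, 0, 0, 5, 3, 0, 1]  # one count per position, 10 through 18
peak_position = call_peak(10, dense_counts)
```

With these counts, positions 12 and 15 are tied before smoothing; after smoothing, the neighbouring reads at position 16 lift position 15 above position 12, so the sketch returns 15.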

to_dict(self, fasta: kipoiseq.extractors.FastaStringExtractor)

Convert the cluster into a dictionary and annotate regulatory elements in the cluster using the fasta file.

Parameters: fasta – FastaStringExtractor object of kipoiseq.

__str__(self): Return str(self).

__repr__(self): Return repr(self).

class lapa.cluster.PolyACluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)

Bases: Cluster

Poly(A) cluster used in poly(A)-site clustering. Performs peak calling to obtain the exact poly(A) site and extracts sequence elements in the vicinity of the cluster.

Examples

Peak calling for poly(A)-site detection.

>>> cluster = PolyACluster('chr1', 100, 101, '+')
>>> cluster.extend(112, 5)
>>> cluster.extend(115, 5)
>>> cluster.extend(116, 3)
>>> cluster.extend(118, 1)
>>> len(cluster)
8
>>> cluster.polyA_site()
115
>>> cluster.polyA_signal_sequence('hg38.fa', polyA_site=115)
118, 'AATAAA'
>>> cluster.fraction_A('hg38.fa', polyA_site=115)
3

polyA_site(self, window=5, std=1)

Detects the poly(A) site with peak calling. Counts are smoothed with a moving average before peak calling is performed.

Parameters

window – Window size for smoothing.
std – Standard deviation of the Gaussian kernel used for smoothing.

polyA_signal_sequence(self, fasta: kipoiseq.extractors.FastaStringExtractor, polyA_site: int)

Poly(A) signal sequence in the vicinity of poly(A) site.

Parameters

fasta – Fasta to extract sequences.
polyA_site – Poly(A) site based on the peak calling.
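A signal search of this kind can be sketched as a motif lookup in the sequence around the site. The function `find_signal`, the motif list, and the search window below are assumptions for illustration only; the motifs and window LAPA actually uses are not stated here, and strand handling is omitted.

```python
from typing import Optional, Sequence, Tuple

# Two well-known poly(A) signal hexamers; the list LAPA actually uses may differ.
MOTIFS = ("AATAAA", "ATTAAA")


def find_signal(
    sequence: str, offset: int, motifs: Sequence[str] = MOTIFS
) -> Tuple[Optional[int], Optional[str]]:
    """Return (genomic position, motif) of the first listed motif found.

    `sequence` is the stretch of genome to search and `offset` is the genomic
    coordinate of its first base. Motifs are tried in the order given.
    """
    upper = sequence.upper()
    for motif in motifs:
        index = upper.find(motif)
        if index != -1:
            return offset + index, motif
    return None, None


window_seq = "GGCTAATAAAGCTTGCATCG"  # made-up sequence starting at coordinate 95
position, motif = find_signal(window_seq, offset=95)
```

When no motif is present the sketch returns `(None, None)`, which mirrors the `None@None` entries shown in the `signal` column of the clustering output.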

fraction_A(self, fasta: kipoiseq.extractors.FastaStringExtractor, polyA_site)

Fraction of A bases following the poly(A) site.

Parameters

fasta – Fasta to extract sequences.
polyA_site – Poly(A) site based on the peak calling.
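The idea of measuring A content downstream of a site can be sketched on a plain string. The function `count_downstream_a` and its default length of 10 bases are hypothetical; the sketch returns a count rather than a ratio, and it ignores strand.

```python
def count_downstream_a(sequence: str, site_index: int, length: int = 10) -> int:
    """Count 'A' bases in the `length` bases that follow `site_index`."""
    downstream = sequence[site_index + 1 : site_index + 1 + length]
    return downstream.upper().count("A")


seq = "GCTAGCAAAAAAAAGT"  # made-up sequence; the site is the C at index 5
n_a = count_downstream_a(seq, site_index=5)
```

Dividing the count by `length` would turn it into a fraction, if that form is preferred.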

to_dict(self, fasta)

Convert the cluster into a dictionary and annotate regulatory elements (polyA_signal and fracA) in the cluster using the fasta file.

Parameters: fasta – FastaStringExtractor object of kipoiseq.

class lapa.cluster.TssCluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)

Bases: Cluster

TSS cluster used in TSS clustering. Performs peak calling to obtain the exact TSS and extracts sequence elements in the vicinity of the cluster.

Examples

Peak calling for TSS detection.

>>> cluster = TssCluster('chr1', 100, 101, '+')
>>> cluster.extend(112, 5)
>>> cluster.extend(115, 5)
>>> cluster.extend(116, 3)
>>> cluster.extend(118, 1)
>>> len(cluster)
8
>>> cluster.peak()
115

to_dict(self, fasta)

Convert the cluster into a dictionary.

Parameters: fasta – FastaStringExtractor object of kipoiseq.

class lapa.cluster.Clustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)

Clustering algorithm that groups regions into clusters based on the read end counts.

Parameters

fasta – Path to the fasta file used to extract regulatory elements in the vicinity of the cluster.
extent_cutoff – Extend the cluster if the number of read end counts is above this cutoff.
ratio_cutoff – Cutoff on the ratio of read end counts to the coverage of the region.
window – Patience window: the cluster is terminated if read numbers stay below the cutoff for this many base pairs.
groupby – Columns used to group reads in the same region; defaults to Chromosome and Strand.
fields – Fields to extract from the counts and store in the cluster object.
progress – Show a progress bar for clustering.
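One way the three cutoffs could interact is sketched below for a single chromosome and strand. This is an illustration under stated assumptions, not LAPA's algorithm: a position is treated as supported when its count is at least `extent_cutoff` and its count-to-coverage ratio is at least `ratio_cutoff`, and a cluster is closed once the gap since the last supported position exceeds `window`. The function `cluster_positions` is hypothetical.

```python
from typing import Iterable, List, Tuple

Row = Tuple[int, int, int]  # (position, count, coverage)


def cluster_positions(
    rows: Iterable[Row],
    extent_cutoff: int = 3,
    ratio_cutoff: float = 0.05,
    window: int = 25,
) -> List[List[Tuple[int, int]]]:
    """Group supported positions into clusters of (position, count) pairs."""
    clusters: List[List[Tuple[int, int]]] = []
    current: List[Tuple[int, int]] = []
    for position, count, coverage in sorted(rows):
        # Assumed support rule: enough reads, and enough of the local coverage.
        supported = (
            count >= extent_cutoff
            and coverage > 0
            and count / coverage >= ratio_cutoff
        )
        # Close the open cluster when the patience window has run out.
        if current and position - current[-1][0] > window:
            clusters.append(current)
            current = []
        if supported:
            current.append((position, count))
    if current:
        clusters.append(current)
    return clusters


rows = [(100, 5, 10), (110, 4, 10), (120, 1, 10), (200, 6, 6), (205, 1, 100)]
clusters = cluster_positions(rows)
```

In this example positions 100 and 110 form one cluster, position 120 is skipped for lack of support, and position 200 starts a second cluster because it lies more than 25 bp past the last supported position.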

Cluster

cluster(self, df_tes)

Parameters

df_tes –

to_df(self, df_tes)

Perform clustering based on read end counts.

Parameters: df_tes – Counts per genomic position, obtained with the counting classes, as a pandas.DataFrame with Chromosome, Start, End, Strand, count, and coverage columns.

class lapa.cluster.PolyAClustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)

Bases: Clustering

Clustering algorithm that obtains poly(A) clusters from read end counts.

Examples

Cluster poly(A) sites from a BAM file.

>>> clustering = PolyAClustering('hg38.fasta')
>>> counter = ThreePrimeCounter(bam_file)
>>> df_counts = counter.to_df()
>>> df_counts.head()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> df_clusters = clustering.to_df(df_counts)
>>> df_clusters
+--------------+-----------+-----------+--------------+-----------+--------------+-----------+---------------+
| Chromosome   |     Start |       End |   polyA_site |     count | Strand       |     fracA | signal        |
| (category)   |   (int32) |   (int32) |      (int64) |   (int64) | (category)   |   (int64) | (object)      |
|--------------+-----------+-----------+--------------+-----------+--------------+-----------+---------------|
| chr17        |    100099 |    100100 |       100100 |        10 | +            |         6 | 100098@GATAAA |
| chr17        |    100199 |    100200 |       100200 |         7 | -            |         2 | None@None     |
| chrM         |      1100 |      1101 |         1101 |        11 | +            |        -1 | None@None     |
...

Cluster

class lapa.cluster.TssClustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)

Bases: Clustering

Clustering algorithm that obtains TSS clusters from read start counts.

Examples

Cluster TSSs from a BAM file.

>>> clustering = TssClustering('hg38.fasta')
>>> counter = FivePrimeCounter(bam_file)
>>> df_counts = counter.to_df()
>>> df_counts.head()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> df_clusters = clustering.to_df(df_counts)
>>> df_clusters
+--------------+-----------+-----------+--------------+-----------+--------------+
| Chromosome   |     Start |       End |     tss_site |     count | Strand       |
| (category)   |   (int32) |   (int32) |      (int64) |   (int64) | (category)   |
|--------------+-----------+-----------+--------------+-----------+--------------|
| chr17        |    100099 |    100100 |       100100 |        10 | +            |
| chr17        |    100199 |    100200 |       100200 |         7 | -            |
| chrM         |      1100 |      1101 |         1101 |        11 | +            |
...

Cluster
