lapa.cluster

Module Contents

Classes

Cluster

Cluster class representing a cluster on a chromosome in the genome.

PolyACluster

PolyA cluster used in poly(A)-site clustering; performs peak calling to obtain the exact poly(A) site.

TssCluster

TSS cluster used in TSS clustering; performs peak calling to obtain the exact TSS.

Clustering

Clustering algorithm that obtains regions clustered together based on the read end counts.

PolyAClustering

Clustering algorithm that obtains poly(A) clusters from read end counts.

TssClustering

Clustering algorithm that obtains TSS clusters from read start counts.

Functions

_tqdm_clustering(iterable)

Adapter that integrates tqdm with logging.

lapa.cluster._tqdm_clustering(iterable)

Adapter that integrates tqdm with logging.

class lapa.cluster.Cluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)
Cluster class representing a cluster on a chromosome in the genome, with start and end locations and a strand. Each position between the start and end locations has a count giving the number of reads ending at that location. Peak calling is performed on these counts. Counts are stored in sparse format but converted to dense format for peak calling.

Parameters
  • Chromosome – Chromosome of the cluster.

  • Start – Start position of the cluster.

  • End – End position of the cluster.

  • Strand – Strand of the cluster.

  • counts (List[Tuple[int, int]], optional) – Sparse representation of the counts in the cluster as a list of (position, count) tuples.

  • fields (Dict[int, List], optional) – Dictionary of lists in which information about each read in the cluster can be stored. Not used by the clustering algorithm by default.
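
The sparse-to-dense conversion described above can be pictured with a minimal sketch. The helper name `to_dense` and the choice to span the dense list from the first to the last counted position are assumptions made for illustration; they are not taken from the LAPA source.

```python
from typing import List, Tuple


def to_dense(counts: List[Tuple[int, int]]) -> Tuple[int, List[int]]:
    """Expand sparse (position, count) pairs into a dense list.

    Hypothetical helper: returns the first position and one count per
    base, with zeros for positions that have no reads.
    """
    if not counts:
        return 0, []
    first = min(pos for pos, _ in counts)
    last = max(pos for pos, _ in counts)
    dense = [0] * (last - first + 1)
    for pos, count in counts:
        # Shift genomic coordinates so that `first` maps to index 0.
        dense[pos - first] += count
    return first, dense


offset, dense = to_dense([(12, 5), (15, 5), (16, 3), (18, 1)])
print(offset, dense)
```

The dense form costs one entry per base, which is likely why only the sparse pairs are kept until peak calling needs a contiguous signal.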

Examples

Peak calling based on the counts.

>>> cluster = Cluster('chr1', 10, 11, '+')
>>> cluster.extend(12, 5)
>>> len(cluster)
3
>>> cluster.peak()
12
>>> cluster.extend(15, 5)
>>> cluster.extend(16, 3)
>>> cluster.extend(18, 1)
>>> len(cluster)
8
>>> cluster.peak()
15
extend(self, end, count)

Extend cluster with end position and count.

Parameters
  • end – New end location to extend cluster.

  • count – Number of reads supporting position.

property total_count(self)

Total counts in the cluster.

__len__(self)

Length of the cluster, computed as End - Start.
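
To make the relationship between `extend`, `total_count`, and `__len__` concrete, here is a small stand-in class. `MiniCluster` is hypothetical and is not LAPA's `Cluster`; it assumes that extending moves `End` to the new position and appends one sparse entry, and it takes the length to be End - Start as stated above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MiniCluster:
    """Hypothetical stand-in for illustration; not LAPA's Cluster."""
    Chromosome: str
    Start: int
    End: int
    Strand: str
    counts: List[Tuple[int, int]] = field(default_factory=list)

    def extend(self, end: int, count: int) -> None:
        # Assumption: extending moves End and records a sparse entry.
        self.End = end
        self.counts.append((end, count))

    @property
    def total_count(self) -> int:
        # Sum of the read counts over all recorded positions.
        return sum(count for _, count in self.counts)

    def __len__(self) -> int:
        return self.End - self.Start


cluster = MiniCluster('chr1', 10, 11, '+')
cluster.extend(15, 5)
cluster.extend(18, 1)
print(len(cluster), cluster.total_count)
```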

static _count_arr(counts)
peak(self, window=5, std=1)
Detect the peak position, where the value is at its maximum in the cluster.

Counts are smoothed with a moving average before peak calling is performed.

Parameters
  • window – Window size for smoothing.

  • std – Standard deviation of the Gaussian kernel used for smoothing.
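
The smoothing step can be sketched as a Gaussian-weighted moving average followed by an argmax. This is an illustrative approximation, not LAPA's implementation: the function names, the zero padding at the cluster edges, and the tie-breaking rule (first maximum wins) are all assumptions.

```python
import math
from typing import List


def gaussian_kernel(window: int, std: float) -> List[float]:
    """Normalized Gaussian weights centred on the middle of the window."""
    center = (window - 1) / 2
    weights = [math.exp(-0.5 * ((i - center) / std) ** 2)
               for i in range(window)]
    total = sum(weights)
    return [w / total for w in weights]


def smoothed_peak(dense: List[int], offset: int,
                  window: int = 5, std: float = 1) -> int:
    """Return the genomic position with the highest smoothed count."""
    kernel = gaussian_kernel(window, std)
    half = window // 2
    best_pos, best_val = offset, float('-inf')
    for i in range(len(dense)):
        # Weighted sum over the window; positions outside the
        # cluster contribute zero (assumed edge handling).
        value = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(dense):
                value += weight * dense[j]
        if value > best_val:
            best_pos, best_val = offset + i, value
    return best_pos


# Dense counts for positions 12..18, i.e. (12, 5), (15, 5), (16, 3), (18, 1).
print(smoothed_peak([5, 0, 0, 5, 3, 0, 1], offset=12))
```

Smoothing lets the neighbouring reads at position 16 tip the balance toward position 15, even though positions 12 and 15 have the same raw count.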

to_dict(self, fasta: kipoiseq.extractors.FastaStringExtractor)
Convert the cluster into a dictionary and annotate regulatory elements in the cluster using the FASTA file.

Parameters

fasta – FastaStringExtractor object of kipoiseq.

__str__(self)

Return str(self).

__repr__(self)

Return repr(self).

class lapa.cluster.PolyACluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)

Bases: Cluster

PolyA cluster used in poly(A)-site clustering. Performs peak calling to obtain the exact poly(A) site and extracts sequence elements in the vicinity of the cluster.

Examples

Peak calling for poly(A)-site detection.

>>> cluster = PolyACluster('chr1', 100, 101, '+')
>>> cluster.extend(112, 5)
>>> cluster.extend(115, 5)
>>> cluster.extend(116, 3)
>>> cluster.extend(118, 1)
>>> len(cluster)
8
>>> cluster.polyA_site()
115
>>> cluster.polyA_signal_sequence('hg38.fa', polyA_site=115)
118, 'AATAAA'
>>> cluster.fraction_A('hg38.fa', polyA_site=115)
3
polyA_site(self, window=5, std=1)
Detects poly(A)-site with peak calling.

Counts are smoothed with a moving average before peak calling is performed.

Parameters
  • window – Window size for smoothing.

  • std – Standard deviation of the Gaussian kernel used for smoothing.

polyA_signal_sequence(self, fasta: kipoiseq.extractors.FastaStringExtractor, polyA_site: int)

Poly(A) signal sequence in the vicinity of the poly(A) site.

Parameters
  • fasta – Fasta to extract sequences.

  • polyA_site – Poly(A) site based on the peak calling.
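
A signal search of this kind can be sketched as a motif scan over the sequence near the site. Everything below is an assumption for illustration: the helper name `find_signal`, the two-motif list (AATAAA and ATTAAA are well-known poly(A) signal hexamers, but the list LAPA uses may differ), the plus-strand-only handling, and the rule of preferring the match closest to the end of the window.

```python
from typing import Optional, Tuple

# Two well-known poly(A) signal hexamers; LAPA's own list may differ.
SIGNALS = ('AATAAA', 'ATTAAA')


def find_signal(seq: str, seq_start: int) -> Tuple[Optional[int], Optional[str]]:
    """Return (position, motif) of the signal closest to the end of seq.

    seq is assumed to be the plus-strand sequence upstream of the
    poly(A) site, and seq_start its genomic start coordinate.
    """
    best_idx, best_motif = -1, None
    for motif in SIGNALS:
        idx = seq.upper().rfind(motif)  # -1 when the motif is absent
        if idx > best_idx:
            best_idx, best_motif = idx, motif
    if best_motif is None:
        return None, None
    return seq_start + best_idx, best_motif


print(find_signal('GGCCAATAAAGGCTTCCGGCC', seq_start=100))
print(find_signal('GGCCGGCC', seq_start=100))
```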

fraction_A(self, fasta: kipoiseq.extractors.FastaStringExtractor, polyA_site)

Fraction of A following the polyA site.

Parameters
  • fasta – Fasta to extract sequences.

  • polyA_site – Poly(A) site based on the peak calling.
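
The example output above (an integer, 3) and the int64 `fracA` column in the clustering output suggest that the value is a count of A bases rather than a ratio. Under that reading, a minimal sketch looks like this. The helper name, the 10-base window, and the use of a string index in place of a genomic coordinate are assumptions, not LAPA's API.

```python
def count_a_downstream(seq: str, site_index: int, length: int = 10) -> int:
    """Count A bases in the `length` bases that follow the site.

    Hypothetical helper: seq is a plain string and site_index is the
    index of the poly(A) site within it.
    """
    downstream = seq[site_index + 1: site_index + 1 + length]
    return downstream.upper().count('A')


seq = 'GGCTTCCGGCAAATAAGCCTGG'
print(count_a_downstream(seq, site_index=9))
```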

to_dict(self, fasta)
Convert the cluster into a dictionary and annotate regulatory elements (polyA_signal and fracA) in the cluster using the FASTA file.

Parameters

fasta – FastaStringExtractor object of kipoiseq.

class lapa.cluster.TssCluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)

Bases: Cluster

TSS cluster used in TSS clustering. Performs peak calling to obtain the exact TSS and extracts sequence elements in the vicinity of the cluster.

Examples

Peak calling for TSS detection.

>>> cluster = TssCluster('chr1', 100, 101, '+')
>>> cluster.extend(112, 5)
>>> cluster.extend(115, 5)
>>> cluster.extend(116, 3)
>>> cluster.extend(118, 1)
>>> len(cluster)
8
>>> cluster.peak()
115
to_dict(self, fasta)

Convert the cluster into a dictionary.

Parameters

fasta – FastaStringExtractor object of kipoiseq.

class lapa.cluster.Clustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)
Clustering algorithm that obtains regions clustered together based on the read end counts.

Parameters
  • fasta – Path to the FASTA file used to extract regulatory elements in the vicinity of the clusters.

  • extent_cutoff – Extend the cluster if the number of read end counts is above this cutoff.

  • ratio_cutoff – Cutoff on the ratio of read end counts to the coverage of the region.

  • window – Patience window: the cluster is terminated if the read counts stay below the cutoff for this many bp.

  • groupby – Columns by which reads in the same region are grouped; defaults to Chromosome and Strand.

  • fields – Fields to extract from the counts and store in the cluster object.

  • progress – Show a progress bar during clustering.
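
How the three numeric parameters interact can be sketched with a simple greedy pass over sorted positions. This is an illustrative reading of the parameter descriptions, not LAPA's algorithm: the function name `cluster_sites`, the `(position, count, coverage)` tuples, and the use of `>=` for both cutoffs are assumptions.

```python
from typing import List, Tuple

Site = Tuple[int, int, int]  # (position, count, coverage)


def cluster_sites(sites: List[Site], extent_cutoff: int = 3,
                  ratio_cutoff: float = 0.05,
                  window: int = 25) -> List[Tuple[int, int, int]]:
    """Group positions into (start, end, total_count) clusters.

    sites must be sorted by position and belong to a single
    chromosome and strand.
    """
    clusters = []
    current = None  # [start, end, total_count]
    for pos, count, coverage in sites:
        supported = (count >= extent_cutoff
                     and coverage > 0
                     and count / coverage >= ratio_cutoff)
        if not supported:
            continue
        if current is not None and pos - current[1] <= window:
            # Within the patience window of the last supported
            # position: extend the open cluster.
            current[1] = pos
            current[2] += count
        else:
            # Patience window exceeded: close the open cluster
            # (if any) and start a new one.
            if current is not None:
                clusters.append(tuple(current))
            current = [pos, pos, count]
    if current is not None:
        clusters.append(tuple(current))
    return clusters


sites = [(100, 5, 5), (103, 4, 10), (110, 1, 10), (120, 3, 100), (200, 8, 10)]
print(cluster_sites(sites))
```

In this toy input, position 110 fails the count cutoff and position 120 fails the ratio cutoff, so the first cluster ends at 103 and a separate cluster opens at 200.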

Cluster
cluster(self, df_tes)
Parameters

df_tes

.

to_df(self, df_tes)

Perform clustering based on read end counts.

Parameters

df_tes – Counts per genomic position, obtained with the counting classes, as a pandas.DataFrame with Chromosome, Start, End, Strand, count, and coverage columns.

class lapa.cluster.PolyAClustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)

Bases: Clustering

Clustering algorithm that obtains poly(A) clusters from read end counts.

Examples

Cluster poly(A) sites from a BAM file.

>>> clustering = PolyAClustering('hg38.fasta')
>>> counter = ThreePrimeCounter(bam_file)
>>> df_counts = counter.to_df()
>>> df_counts.head()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> df_clusters = clustering.to_df(df_counts)
>>> df_clusters
+--------------+-----------+-----------+--------------+-----------+--------------+-----------+---------------+
| Chromosome   |     Start |       End |   polyA_site |     count | Strand       |     fracA | signal        |
| (category)   |   (int32) |   (int32) |      (int64) |   (int64) | (category)   |   (int64) | (object)      |
|--------------+-----------+-----------+--------------+-----------+--------------+-----------+---------------|
| chr17        |    100099 |    100100 |       100100 |        10 | +            |         6 | 100098@GATAAA |
| chr17        |    100199 |    100200 |       100200 |         7 | -            |         2 | None@None     |
| chrM         |      1100 |      1101 |         1101 |        11 | +            |        -1 | None@None     |
...
Cluster
class lapa.cluster.TssClustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)

Bases: Clustering

Clustering algorithm that obtains TSS clusters from read start counts.

Examples

Cluster TSS sites from a BAM file.

>>> clustering = TssClustering('hg38.fasta')
>>> counter = FivePrimeCounter(bam_file)
>>> df_counts = counter.to_df()
>>> df_counts.head()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> df_clusters = clustering.to_df(df_counts)
>>> df_clusters
+--------------+-----------+-----------+--------------+-----------+--------------+
| Chromosome   |     Start |       End |     tss_site |     count | Strand       |
| (category)   |   (int32) |   (int32) |      (int64) |   (int64) | (category)   |
|--------------+-----------+-----------+--------------+-----------+--------------|
| chr17        |    100099 |    100100 |       100100 |        10 | +            |
| chr17        |    100199 |    100200 |       100200 |         7 | -            |
| chrM         |      1100 |      1101 |         1101 |        11 | +            |
...
Cluster