lapa.cluster
Module Contents
Classes
Cluster | Cluster class representing a cluster in the genome on a chromosome
PolyACluster | PolyA cluster used in poly(A)-site clustering; performs peak calling to obtain the exact poly(A) site
TssCluster | TSS cluster used in TSS clustering; performs peak calling to obtain the exact TSS
Clustering | Clustering algorithm that obtains regions clustered together based on the read end counts
PolyAClustering | Clustering algorithm that obtains poly(A) clusters from read end counts
TssClustering | Clustering algorithm that obtains TSS clusters from read start counts
Functions
_tqdm_clustering(iterable) | Adaptor that integrates tqdm with logging
- lapa.cluster._tqdm_clustering(iterable)
Adaptor that integrates tqdm progress reporting with logging.
- class lapa.cluster.Cluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)
- Cluster class representing a cluster in the genome: a region on a chromosome
with start and end locations and a strand. Each position between the start and end locations has a count giving the number of reads ending at that position. Peak calling is performed on these counts. Counts are stored in a sparse format and converted to a dense format for peak calling.
- Parameters
Chromosome – Chromosome of the cluster.
Start – Start position of the cluster.
End – End position of the cluster.
Strand – Strand of the cluster.
counts (List[Tuple[int, int]], optional) – Sparse representation of the counts in the cluster, as a list of (position, count) tuples.
fields (Dict[int, List], optional) – Dictionary of lists in which information about each read in the cluster can be stored. Not used by the clustering algorithm by default.
Examples
Peak calling based on the counts.
>>> cluster = Cluster('chr1', 10, 11, '+')
>>> cluster.extend(12, 5)
>>> len(cluster)
2
>>> cluster.peak()
12
>>> cluster.extend(15, 5)
>>> cluster.extend(16, 3)
>>> cluster.extend(18, 1)
>>> len(cluster)
8
>>> cluster.peak()
15
- extend(self, end, count)
Extend cluster with end position and count.
- Parameters
end – New end location to extend cluster.
count – Number of reads supporting position.
- property total_count(self)
Total counts in the cluster.
- __len__(self)
Length of the cluster (End - Start).
- static _count_arr(counts)
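The class description notes that counts are stored in a sparse format and converted to a dense format for peak calling. The following is a minimal sketch of such a conversion. The helper name `sparse_to_dense` and the inclusive `[start, end]` interval are assumptions made for illustration; this is not the implementation of `_count_arr`.

```python
from typing import List, Tuple


def sparse_to_dense(counts: List[Tuple[int, int]], start: int, end: int) -> List[int]:
    """Expand (position, count) pairs into one count per position in [start, end].

    Hypothetical helper: the inclusive interval is an assumption of this sketch.
    """
    dense = [0] * (end - start + 1)
    for position, count in counts:
        if not start <= position <= end:
            raise ValueError(f"position {position} outside [{start}, {end}]")
        dense[position - start] += count
    return dense


# Sparse counts taken from the Cluster example above.
sparse = [(12, 5), (15, 5), (16, 3), (18, 1)]
print(sparse_to_dense(sparse, start=10, end=18))  # [0, 0, 5, 0, 0, 5, 3, 0, 1]
```

Positions without reads become zeros, so the dense list can be indexed by `position - start`.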
- peak(self, window=5, std=1)
- Detect the peak position, where the count is at its maximum in the cluster.
Counts are smoothed with a moving average before peak calling.
- Parameters
window – Window size for smoothing.
std – Standard deviation of the Gaussian kernel used for smoothing.
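As an illustration of how `window` and `std` could interact, here is a minimal pure-Python sketch of Gaussian-weighted smoothing followed by an argmax. The function names `gaussian_kernel` and `call_peak`, the zero padding at the cluster edges, and the tie-breaking rule are assumptions of this sketch, not the library's implementation.

```python
import math
from typing import List


def gaussian_kernel(window: int, std: float) -> List[float]:
    """Normalized Gaussian weights spanning 2 * (window // 2) + 1 positions."""
    half = window // 2
    weights = [math.exp(-0.5 * (offset / std) ** 2) for offset in range(-half, half + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def call_peak(dense: List[int], start: int, window: int = 5, std: float = 1.0) -> int:
    """Return the genomic position with the highest smoothed count (hypothetical helper)."""
    kernel = gaussian_kernel(window, std)
    half = window // 2
    smoothed = []
    for i in range(len(dense)):
        value = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(dense):  # positions outside the cluster count as zero
                value += weight * dense[j]
        smoothed.append(value)
    best = max(range(len(smoothed)), key=smoothed.__getitem__)  # first maximum on ties
    return start + best


# Dense counts for positions 10..18, matching the Cluster example.
dense = [0, 0, 5, 0, 0, 5, 3, 0, 1]
print(call_peak(dense, start=10))  # 15
```

Positions 12 and 15 both have a raw count of 5, but smoothing favors 15 because of the neighboring reads at 16, which agrees with the peak reported in the class example.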
- to_dict(self, fasta: kipoiseq.extractors.FastaStringExtractor)
- Convert the cluster into a dictionary and annotate regulatory elements in the cluster using the fasta file.
- Parameters
fasta – FastaStringExtractor object of kipoiseq.
- __str__(self)
Return str(self).
- __repr__(self)
Return repr(self).
- class lapa.cluster.PolyACluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)
Bases: Cluster
- PolyA cluster used in poly(A)-site clustering. Performs peak calling to obtain the exact poly(A) site and extracts sequence elements in the vicinity of the cluster.
Examples
Peak calling for poly(A)-site detection.
>>> cluster = PolyACluster('chr1', 100, 101, '+')
>>> cluster.extend(112, 5)
>>> cluster.extend(115, 5)
>>> cluster.extend(116, 3)
>>> cluster.extend(118, 1)
>>> len(cluster)
18
>>> cluster.polyA_site()
115
>>> cluster.polyA_signal_sequence('hg38.fa', polyA_site=115)
(118, 'AATAAA')
>>> cluster.fraction_A('hg38.fa', polyA_site=115)
3
- polyA_site(self, window=5, std=1)
- Detects the poly(A) site with peak calling.
Counts are smoothed with a moving average before peak calling.
- Parameters
window – Window size for smoothing.
std – Standard deviation of the Gaussian kernel used for smoothing.
- polyA_signal_sequence(self, fasta: kipoiseq.extractors.FastaStringExtractor, polyA_site: int)
Poly(A) signal sequence in the vicinity of poly(A) site.
- Parameters
fasta – Fasta to extract sequences.
polyA_site – Poly(A) site based on the peak calling.
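To make the idea of a signal lookup concrete, the sketch below searches a plain sequence string for a poly(A) signal hexamer. The motif list, its priority order, the helper name `find_signal`, and the restriction to plus-strand sequence are all assumptions for illustration; the motifs and search window that `polyA_signal_sequence` actually uses are not stated here.

```python
from typing import Optional, Tuple

# Illustrative motif list in priority order; an assumption, not LAPA's actual list.
SIGNALS = ("AATAAA", "ATTAAA", "GATAAA")


def find_signal(sequence: str, seq_start: int) -> Tuple[Optional[int], Optional[str]]:
    """Find the highest-priority motif in a plus-strand sequence.

    `seq_start` is the genomic coordinate of sequence[0]; the returned position
    is the genomic coordinate where the motif begins.
    """
    sequence = sequence.upper()
    for motif in SIGNALS:
        index = sequence.rfind(motif)  # occurrence closest to the 3' end of the window
        if index != -1:
            return seq_start + index, motif
    return None, None


upstream = "GGCTTAATAAAGTCTGCATTGG"
print(find_signal(upstream, seq_start=80))    # (85, 'AATAAA')
print(find_signal("GGGGCCCC", seq_start=80))  # (None, None)
```

A minus-strand cluster would need the reverse complement of the extracted sequence before searching, which this sketch leaves out.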
- fraction_A(self, fasta: kipoiseq.extractors.FastaStringExtractor, polyA_site)
Fraction of A following the polyA site.
- Parameters
fasta – Fasta to extract sequences.
polyA_site – Poly(A) site based on the peak calling.
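A downstream A measure of this kind can be sketched on a plain string as follows. The helper name `count_downstream_a` and the 10-base window are assumptions; note that the examples on this page show integer values for this quantity, so the sketch reports both the count and the fraction.

```python
def count_downstream_a(sequence: str, site_index: int, length: int = 10) -> int:
    """Count 'A' bases in the `length` bases immediately after `site_index`.

    Hypothetical helper; the window length of 10 is an assumption.
    """
    downstream = sequence[site_index + 1 : site_index + 1 + length].upper()
    return downstream.count("A")


seq = "CCGTTGACAAGAAAATAAAC"
n_a = count_downstream_a(seq, site_index=7)
print(n_a)       # 8
print(n_a / 10)  # 0.8
```

An A-rich stretch directly downstream of a candidate site is commonly treated as a hint of internal priming rather than a genuine poly(A) site, which is why such a value is useful to keep alongside each cluster.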
- to_dict(self, fasta)
- Convert the cluster into a dictionary and annotate regulatory elements (polyA_signal and fracA) in the cluster using the fasta file.
- Parameters
fasta – FastaStringExtractor object of kipoiseq.
- class lapa.cluster.TssCluster(Chromosome: str, Start: int, End: int, Strand: str, counts=None, fields=None)
Bases: Cluster
- TSS cluster used in TSS clustering. Performs peak calling to obtain the exact TSS and extracts sequence elements in the vicinity of the cluster.
Examples
Peak calling for TSS detection.
>>> cluster = TssCluster('chr1', 100, 101, '+')
>>> cluster.extend(112, 5)
>>> cluster.extend(115, 5)
>>> cluster.extend(116, 3)
>>> cluster.extend(118, 1)
>>> len(cluster)
18
>>> cluster.peak()
115
- to_dict(self, fasta)
Convert the cluster into a dictionary.
- Parameters
fasta – FastaStringExtractor object of kipoiseq.
- class lapa.cluster.Clustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)
- Clustering algorithm that obtains regions clustered together based on the read end counts.
- Parameters
fasta – Path to the fasta file used to extract regulatory elements in the vicinity of the cluster.
extent_cutoff – Extend the cluster if the number of read end counts is above this cutoff.
ratio_cutoff – Cutoff on the ratio of read end counts to the coverage of the region.
window – Patience window in bp: the cluster is terminated if the read counts stay below the cutoff for this many bases.
groupby – Columns by which reads in the same region are grouped; defaults to Chromosome and Strand.
fields – Fields to extract from the counts and store in the cluster object.
progress – Show a progress bar during clustering.
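The interplay of `extent_cutoff`, `ratio_cutoff`, and `window` can be pictured with the simplified sketch below. It is an assumption-laden illustration, not the library's algorithm: the function name `cluster_positions`, the use of `>=` for both cutoffs, the inclusive end coordinate, and the choice to ignore sub-cutoff positions in the total are all choices made here.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (position, count, coverage)


def cluster_positions(
    rows: List[Position],
    extent_cutoff: int = 3,
    ratio_cutoff: float = 0.05,
    window: int = 25,
) -> List[Tuple[int, int, int]]:
    """Group passing positions into (start, end, total_count) clusters.

    `rows` must be sorted by position and belong to one chromosome and strand.
    """
    clusters = []
    current = None  # [start, end, total_count] of the open cluster
    for position, count, coverage in rows:
        passes = count >= extent_cutoff and coverage > 0 and count / coverage >= ratio_cutoff
        if not passes:
            continue
        if current is not None and position - current[1] <= window:
            current[1] = position  # still inside the patience window: extend
            current[2] += count
        else:
            if current is not None:
                clusters.append(tuple(current))  # patience exhausted: close the cluster
            current = [position, position, count]
    if current is not None:
        clusters.append(tuple(current))
    return clusters


rows = [(100, 5, 10), (103, 1, 10), (110, 4, 8), (200, 6, 300), (300, 7, 7)]
print(cluster_positions(rows))  # [(100, 110, 9), (300, 300, 7)]
```

In this run, position 103 fails the count cutoff, position 200 fails the ratio cutoff (6 / 300 = 0.02), and position 300 is more than 25 bp past the open cluster, so it starts a new one.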
- Cluster
- cluster(self, df_tes)
- Parameters
df_tes – pandas.DataFrame of counts per genomic position, as described for to_df.
- to_df(self, df_tes)
Perform clustering based on read end counts.
- Parameters
df_tes – Counts per genomic position, obtained with the counting classes, as a pandas.DataFrame with Chromosome, Start, End, Strand, count, and coverage columns.
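For reference, a frame with the expected columns can be assembled by hand as in the sketch below, for example when testing without a BAM file. The rows reuse the values shown in the examples on this page; the column check is an illustrative convenience and not something the library is known to perform.

```python
import pandas as pd

# Hand-built counts table with the columns expected by to_df.
df_tes = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Start": [887771, 994684],
    "End": [887772, 994685],
    "Strand": ["+", "-"],
    "count": [5, 8],
    "coverage": [5, 10],
})

# Illustrative sanity check on the column names (an assumption of this sketch).
required = ["Chromosome", "Start", "End", "Strand", "count", "coverage"]
missing = [column for column in required if column not in df_tes.columns]
if missing:
    raise ValueError(f"missing columns: {missing}")
```

Each row describes a single-base interval, so `End - Start` equals 1, matching the example tables.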
- class lapa.cluster.PolyAClustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)
Bases: Clustering
Clustering algorithm that obtains poly(A) clusters from read end counts.
Examples
Cluster poly(A) sites from a BAM file.
>>> clustering = PolyAClustering('hg38.fasta')
>>> counter = ThreePrimeCounter(bam_file)
>>> df_counts = counter.to_df()
>>> df_counts.head()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> df_clusters = clustering.to_df(df_counts)
>>> df_clusters
+--------------+-----------+-----------+--------------+-----------+--------------+-----------+---------------+
| Chromosome   | Start     | End       | polyA_site   | count     | Strand       | fracA     | signal        |
| (category)   | (int32)   | (int32)   | (int64)      | (int64)   | (category)   | (int64)   | (object)      |
|--------------+-----------+-----------+--------------+-----------+--------------+-----------+---------------|
| chr17        | 100099    | 100100    | 100100       | 10        | +            | 6         | 100098@GATAAA |
| chr17        | 100199    | 100200    | 100200       | 7         | -            | 2         | None@None     |
| chrM         | 1100      | 1101      | 1101         | 11        | +            | -1        | None@None     |
...
- Cluster
- class lapa.cluster.TssClustering(fasta, extent_cutoff=3, ratio_cutoff=0.05, window=25, groupby=None, fields=None, progress=True)
Bases: Clustering
Clustering algorithm that obtains TSS clusters from read start counts.
Examples
Cluster TSS sites from a BAM file.
>>> clustering = TssClustering('hg38.fasta')
>>> counter = FivePrimeCounter(bam_file)
>>> df_counts = counter.to_df()
>>> df_counts.head()
+--------------+-----------+-----------+--------------+-----------+------------+
| Chromosome   | Start     | End       | Strand       | count     | coverage   |
| (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int64)    |
|--------------+-----------+-----------+--------------+-----------+------------|
| chr1         | 887771    | 887772    | +            | 5         | 5          |
| chr1         | 994684    | 994685    | -            | 8         | 10         |
...
>>> df_clusters = clustering.to_df(df_counts)
>>> df_clusters
+--------------+-----------+-----------+--------------+-----------+--------------+
| Chromosome   | Start     | End       | tss_site     | count     | Strand       |
| (category)   | (int32)   | (int32)   | (int64)      | (int64)   | (category)   |
|--------------+-----------+-----------+--------------+-----------+--------------|
| chr17        | 100099    | 100100    | 100100       | 10        | +            |
| chr17        | 100199    | 100200    | 100200       | 7         | -            |
| chrM         | 1100      | 1101      | 1101         | 11        | +            |
...
- Cluster