pydrobert.speech.compute

Compute features from speech signals

class pydrobert.speech.compute.FrameComputer[source]

Bases: AliasedFactory

Construct features from fixed-length segments of a signal

A signal is treated as a (possibly overlapping) time series of frames. Each frame is transformed into a fixed-length vector of coefficients.

Features can be computed one at a time, for example:

>>> chunk_size = 2 ** 10
>>> while len(signal):
>>>     segment = signal[:chunk_size]
>>>     feats = computer.compute_chunk(segment)
>>>     # do something with feats
>>>     signal = signal[chunk_size:]
>>> feats = computer.finalize()

Or all at once (which can be much faster, depending on how the computer is optimized):

>>> feats = computer.compute_full(signal)

The k-th frame can be roughly localized to signal[k * computer.frame_shift]. The exact region of the signal that influences the frame is dictated by the frame_style property.

abstract compute_chunk(chunk)[source]

Compute some coefficients, given a chunk of audio

Parameters:: chunk (ndarray) – A 1D float array of the signal. Should be contiguous and non-overlapping with any previously processed segments in the audio stream
Returns:: chunk (numpy.ndarray) – A 2D float array of shape (num_frames, num_coeffs). num_frames is nonnegative (possibly 0). Contains some number of feature vectors, ordered in time over axis 0.

compute_full(signal)[source]

Compute a full signal’s worth of feature coefficients

Parameters:: signal (ndarray) – A 1D float array of the entire signal
Returns:: spec (numpy.ndarray) – A 2D float array of shape (num_frames, num_coeffs). num_frames is nonnegative (possibly 0). Contains some number of feature vectors, ordered in time over axis 0.
Raises:: ValueError – If frame computation has already begun (started is True) and finalize() has not been called

abstract finalize()[source]

Conclude processing a stream of audio, processing any stored buffer

Returns:: chunk (numpy.ndarray) – A 2D float array of shape (num_frames, num_coeffs). num_frames is either 1 or 0.

abstract property frame_length

Number of samples which dictate a feature vector

Type:: int

property frame_length_ms

Number of milliseconds of audio which dictate a feature vector

Type:: float

abstract property frame_shift

Number of samples absorbed between successive frame computations

Type:: int

property frame_shift_ms

Number of milliseconds between successive frame computations

Type:: float
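
The millisecond properties are presumably the sample-based ones rescaled by sampling_rate. A minimal sketch of that conversion, assuming the relation ms = samples * 1000 / sampling_rate. The helper names samples_to_ms and ms_to_samples are hypothetical, and the rounding used when going from milliseconds to samples is an assumption rather than something this page specifies:

```python
def samples_to_ms(num_samples: int, sampling_rate: float) -> float:
    """Convert a sample count to milliseconds (hypothetical helper)."""
    return num_samples * 1000.0 / sampling_rate


def ms_to_samples(duration_ms: float, sampling_rate: float) -> int:
    """Convert milliseconds to a whole number of samples.

    Rounding to nearest is an assumption; an implementation may truncate.
    """
    return int(round(duration_ms * sampling_rate / 1000.0))


# A 10 ms shift at 16 kHz spans 160 samples.
frame_shift = ms_to_samples(10, 16000)
frame_shift_ms = samples_to_ms(frame_shift, 16000)
```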

abstract property frame_style

Dictates how the signal is split into frames

If 'causal', the k-th frame is computed over the indices signal[k * frame_shift:k * frame_shift + frame_length] (at most). If 'centered', the k-th frame is computed over the indices signal[k * frame_shift - (frame_length + 1) // 2 + 1:k * frame_shift + frame_length // 2 + 1]. Any range beyond the bounds of the signal is generated in an implementation-specific way.

Type:: str
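
The two index formulas above can be written out as a small helper. This is only a sketch: frame_bounds is a hypothetical name, not part of the library, and it returns bounds that may fall outside the signal (how that region is filled is implementation-specific):

```python
def frame_bounds(k, frame_shift, frame_length, frame_style="causal"):
    """Return (start, stop) of the k-th frame as a half-open slice.

    Hypothetical helper mirroring the documented index formulas.
    """
    if frame_style == "causal":
        start = k * frame_shift
        stop = k * frame_shift + frame_length
    elif frame_style == "centered":
        start = k * frame_shift - (frame_length + 1) // 2 + 1
        stop = k * frame_shift + frame_length // 2 + 1
    else:
        raise ValueError(f"unknown frame_style: {frame_style!r}")
    return start, stop


# Example values: 400-sample frames with a 160-sample shift
# (25 ms and 10 ms at 16 kHz).
causal = frame_bounds(2, 160, 400, "causal")
centered = frame_bounds(2, 160, 400, "centered")
```

In both styles the slice covers frame_length samples; the centered style places sample k * frame_shift near the middle of the slice instead of at its start.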

abstract property num_coeffs

Number of coefficients returned per frame

Type:: int

abstract property sampling_rate

Number of samples in a second of a target recording

Type:: float

abstract property started

Whether computations for a signal have started

Becomes True after the first call to compute_chunk(). Becomes False after a call to finalize()

Type:: bool

class pydrobert.speech.compute.LinearFilterBankFrameComputer(bank, include_energy=False)[source]

Bases: FrameComputer

Frame computers whose features are derived from linear filter banks

Computers based on linear filter banks have a predictable number of coefficients and organization. Like the banks, the features with lower indices correspond to filters with lower bandwidths. num_coeffs will be simply bank.num_filts + int(include_energy).

Parameters:

bank (Union[LinearFilterBank, Mapping, str]) – Each filter in the bank corresponds to a coefficient in a frame vector. Can be a LinearFilterBank or something compatible with pydrobert.speech.alias.alias_factory_subclass_from_arg()
include_energy (bool) – Whether to include a coefficient based on the energy of the signal within the frame. If True, the energy coefficient will be inserted at index 0.
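
The layout of a single frame vector can be illustrated with plain lists. This is a sketch with made-up placeholder values, not output from a real bank; num_filts stands in for bank.num_filts:

```python
num_filts = 3  # stands in for bank.num_filts (assumed value)
include_energy = True
num_coeffs = num_filts + int(include_energy)

# Placeholder values for one frame; a real computer derives these from audio.
filter_coeffs = [0.1, 0.2, 0.3]  # ordered like the filters in the bank
energy_coeff = 9.0

# When include_energy is True, the energy coefficient sits at index 0 and
# the filter coefficients follow in bank order.
frame_vector = ([energy_coeff] if include_energy else []) + filter_coeffs
```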

property bank

The LinearFilterBank from which features are derived

Type:: LinearFilterBank

property includes_energy

Whether the first coefficient is an energy coefficient

Type:: bool

pydrobert.speech.compute.SIFrameComputer: alias of ShortIntegrationFrameComputer

pydrobert.speech.compute.STFTFrameComputer: alias of ShortTimeFourierTransformFrameComputer

class pydrobert.speech.compute.ShortIntegrationFrameComputer(bank, frame_shift_ms=10, frame_style=None, include_energy=False, pad_to_nearest_power_of_two=True, window_function=None, use_power=False, use_log=True)[source]

Bases: LinearFilterBankFrameComputer

Compute features by integrating over the filter modulus

Each filter in the bank is convolved with the signal. A pointwise nonlinearity pushes the frequency band towards zero. Most of the energy of the signal can be captured in a short time integration. Though best suited to processing whole utterances at once, short integration is compatible with the frame analogy if the frame is assumed to be the cone of influence of the maximum-length filter.

For computational purposes, each filter’s impulse response is clamped to zero outside the support of the largest filter in the bank, making it a finite impulse response filter. This effectively decreases the frequency resolution of the filters which aren’t already FIR. For better frequency resolution at the cost of computational time, increase pydrobert.speech.config.EFFECTIVE_SUPPORT_THRESHOLD.

Parameters:

bank (Union[LinearFilterBank, Mapping, str]) – Each filter in the bank corresponds to a coefficient in a frame vector. Can be a LinearFilterBank or something compatible with pydrobert.speech.alias.alias_factory_subclass_from_arg()
frame_shift_ms (float) – The offset between successive frames, in milliseconds. Also the length of the integration
frame_style (Optional[Literal['causal', 'centered']]) – Defaults to 'centered' if bank.is_zero_phase, 'causal' otherwise. If 'centered' each filter of the bank is translated so that its support lies in the center of the frame
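include_energy (bool) – Whether to include a coefficient based on the energy of the signal within the frame. If True, the energy coefficient will be inserted at index 0.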
pad_to_nearest_power_of_two (bool) – Pad the DFTs used in computation to a power of two for efficient computation
window_function (pydrobert.speech.filters.WindowFunction, dict, or str) – The window used to weigh integration. Can be a WindowFunction or something compatible with pydrobert.speech.alias.alias_factory_subclass_from_arg(). Defaults to pydrobert.speech.filters.GammaWindow when frame_style is 'causal', otherwise pydrobert.speech.filters.HannWindow.
use_power (bool) – Whether the pointwise nonlinearity is the signal’s power or magnitude
use_log (bool) – Whether to take the log of the integration
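
As an illustration of pad_to_nearest_power_of_two, the padded DFT size can be sketched as below. This assumes the size is rounded up to the next power of two (so no samples are discarded); next_power_of_two is a hypothetical helper, not a library function:

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


# A 400-sample buffer would be zero-padded to a 512-point DFT.
dft_size = next_power_of_two(400)
```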

aliases = {'si'}

class pydrobert.speech.compute.ShortTimeFourierTransformFrameComputer(bank, frame_length_ms=None, frame_shift_ms=10, frame_style=None, include_energy=False, pad_to_nearest_power_of_two=True, window_function=None, use_log=True, use_power=False, kaldi_shift=False)[source]

Bases: LinearFilterBankFrameComputer

Compute features of a signal by integrating STFTs

Computations are per frame and as follows:

1. The current frame is multiplied with some window (rectangular, Hamming, Hanning, etc.)
2. A DFT is performed on the result
3. For each filter in the provided input bank:
   a. Multiply the result of 2. with the frequency response of the filter
   b. Sum either the pointwise square or absolute value of elements in the buffer from 3a.
   c. Optionally take the log of the sum

Warning

This behaviour differs from that of [povey2011] or [young] in three ways. First, the sum (3b) comes after the filtering (3a), which changes the result in the squared case. Second, the sum is over the full power spectrum, rather than just between 0 and the Nyquist. This doubles the value at the end of 3c. if a real filter is used. Third, frame boundaries are calculated differently.
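
The per-frame steps listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the library's implementation: stft_frame_features is a hypothetical function, the filters are assumed to be given as frequency responses sampled at the DFT bins, a Hann window is used as an example, and the floor applied before the log is an addition to avoid log(0):

```python
import numpy as np


def stft_frame_features(frame, filter_responses, use_power=False, use_log=True):
    """Compute one feature vector from one frame (illustrative only).

    frame: 1D float array of samples.
    filter_responses: 2D array (num_filts, dft_size) holding each filter's
        frequency response at the DFT bins (an assumed layout).
    """
    dft_size = filter_responses.shape[1]
    # Step 1: window the frame (Hann chosen here as an example).
    windowed = frame * np.hanning(len(frame))
    # Step 2: DFT over the full spectrum, zero-padded to dft_size.
    spectrum = np.fft.fft(windowed, n=dft_size)
    # Step 3a: apply each filter's frequency response.
    filtered = filter_responses * spectrum
    # Step 3b: sum the pointwise square or absolute value, after filtering.
    magnitude = np.abs(filtered)
    summed = (magnitude ** 2 if use_power else magnitude).sum(axis=1)
    # Step 3c: optionally take the log (floored to avoid log(0)).
    if use_log:
        summed = np.log(np.maximum(summed, np.finfo(float).tiny))
    return summed


rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
# Two toy "filters": all-pass, and one that keeps only the lowest 64 bins.
responses = np.ones((2, 512))
responses[1, 64:] = 0.0
feats = stft_frame_features(frame, responses)
```

Because the sum in step 3b runs over all DFT bins rather than only those up to the Nyquist frequency, a real-valued filter with a symmetric response contributes each positive-frequency term twice.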

Parameters:

bank (Union[LinearFilterBank, Mapping, str]) – Each filter in the bank corresponds to a coefficient in a frame vector. Can be a LinearFilterBank or something compatible with pydrobert.speech.alias.alias_factory_subclass_from_arg()
frame_length_ms (Optional[float]) – The length of a frame, in milliseconds. Defaults to the length of the largest filter in the bank
frame_shift_ms (float, optional) – The offset between successive frames, in milliseconds
frame_style (Optional[Literal['causal', 'centered']]) – Defaults to 'centered' if bank.is_zero_phase, 'causal' otherwise.
include_energy (bool) – Whether to include a coefficient based on the energy of the signal within the frame. If True, the energy coefficient will be inserted at index 0.
pad_to_nearest_power_of_two (bool) – Whether the DFT should be padded to a power of two for computational efficiency
window_function (Union[WindowFunction, Mapping, str, None]) – The window used in step 1. Can be a WindowFunction or something compatible with pydrobert.speech.alias.alias_factory_subclass_from_arg(). Defaults to pydrobert.speech.filters.GammaWindow when frame_style is 'causal', otherwise pydrobert.speech.filters.HannWindow.
use_log (bool) – Whether to take the log of the sum from 3b.
use_power (bool) – Whether to sum the power spectrum or the magnitude spectrum
kaldi_shift (bool) – Dictates how to center frames when frame_style is 'centered'. If True, the k-th frame will be computed using the signal between signal[ k * frame_shift - frame_length // 2 + frame_shift // 2:k * frame_shift + (frame_length + 1) // 2 + frame_shift // 2]. These are the frame bounds for Kaldi [povey2011]. Otherwise, the k-th frame is signal[ k * frame_shift - (frame_length + 1) // 2 + 1: k * frame_shift + frame_length // 2 + 1].
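
The two centering conventions described for kaldi_shift can be compared with a short helper. This is a sketch that only transcribes the index formulas above; centered_frame_bounds is a hypothetical name, and the returned bounds may extend past the signal:

```python
def centered_frame_bounds(k, frame_shift, frame_length, kaldi_shift=False):
    """Return (start, stop) of the k-th centered frame as a half-open slice."""
    if kaldi_shift:
        # Kaldi-style bounds: the frame is offset by half a frame shift.
        start = k * frame_shift - frame_length // 2 + frame_shift // 2
        stop = k * frame_shift + (frame_length + 1) // 2 + frame_shift // 2
    else:
        start = k * frame_shift - (frame_length + 1) // 2 + 1
        stop = k * frame_shift + frame_length // 2 + 1
    return start, stop


# Example values: 400-sample frames, 160-sample shift, first frame (k = 0).
default_first = centered_frame_bounds(0, 160, 400)
kaldi_first = centered_frame_bounds(0, 160, 400, kaldi_shift=True)
```

Both conventions cover frame_length samples; with these example values the Kaldi-style frame starts 79 samples later than the default one.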

aliases = {'stft'}

pydrobert.speech.compute.frame_by_frame_calculation(computer, signal, chunk_size=1024)[source]

Compute feature representation of entire signal iteratively

This function constructs a feature matrix of a signal through successive calls to computer.compute_chunk. Its return value should be identical to that of calling computer.compute_full(signal), but is possibly much slower. computer.compute_full should be favoured.

Parameters:

computer (FrameComputer) –
signal (ndarray) – A 1D float array of the entire signal
chunk_size (int) – The length of the signal buffer to process at a given time

Returns:

spec (numpy.ndarray) – A 2D float array of shape (num_frames, num_coeffs). num_frames is nonnegative (possibly 0). Contains some number of feature vectors, ordered in time over axis 0.

Raises:

ValueError – If frame computation has already begun (computer.started == True)
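
The iterative procedure can be sketched with a toy stand-in for a computer. Everything here is hypothetical: ToyComputer only mimics the compute_chunk/finalize/started interface and emits the mean of each non-overlapping block of frame_shift samples, and chunked_features is an illustrative loop rather than the library's own function:

```python
import numpy as np


class ToyComputer:
    """Stand-in with a compute_chunk/finalize interface (not the library's)."""

    def __init__(self, frame_shift=4):
        self.frame_shift = frame_shift
        self.num_coeffs = 1
        self.started = False
        self._buffer = np.zeros(0)

    def compute_chunk(self, chunk):
        self.started = True
        # Keep unprocessed samples so frames can span chunk boundaries.
        self._buffer = np.concatenate([self._buffer, chunk])
        num_frames = len(self._buffer) // self.frame_shift
        used = num_frames * self.frame_shift
        frames = self._buffer[:used].reshape(num_frames, self.frame_shift)
        self._buffer = self._buffer[used:]
        return frames.mean(axis=1, keepdims=True)

    def finalize(self):
        # Flush a partial trailing frame, if any: zero or one frame.
        leftover, self._buffer = self._buffer, np.zeros(0)
        self.started = False
        if len(leftover) == 0:
            return np.zeros((0, self.num_coeffs))
        return leftover.mean(keepdims=True).reshape(1, 1)


def chunked_features(computer, signal, chunk_size=1024):
    """Build a (num_frames, num_coeffs) matrix chunk by chunk."""
    if computer.started:
        raise ValueError("computer has already started computing frames")
    pieces = []
    while len(signal):
        pieces.append(computer.compute_chunk(signal[:chunk_size]))
        signal = signal[chunk_size:]
    pieces.append(computer.finalize())
    return np.concatenate(pieces, axis=0)


signal = np.arange(10, dtype=float)
feats = chunked_features(ToyComputer(frame_shift=4), signal, chunk_size=3)
```

The chunk size here deliberately does not divide the frame shift, so the buffer carried between calls is what keeps the frames aligned with the signal.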