Overview
This section provides an overview of how pydrobert-speech is organized so that you can get your feature representation just right.
The input to pydrobert-speech is an (acoustic) signal. The output is a set of
features. We call the operator that transforms the signal into some feature
representation a computer. Operators that act on a signal and produce a
signal, like random dithering, are preprocessors, and operators that act on
features and produce features, like unit normalization, are postprocessors. The
latter two kinds of operator live in the submodules pydrobert.speech.pre and
pydrobert.speech.post and follow the usage pattern
>>> y = Op().apply(x)
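To make the pattern concrete, here is a minimal sketch of one preprocessor and one postprocessor. ToyDither and ToyUnitNormalize are hypothetical stand-ins written for this example, not the classes shipped in pydrobert.speech.pre or pydrobert.speech.post, and "unit normalization" is assumed here to mean zero mean and unit variance per coefficient.

```python
import random


class ToyDither:
    """Stand-in preprocessor (hypothetical): signal in, signal out."""

    def __init__(self, amplitude=1.0, seed=0):
        self.amplitude = amplitude
        self.rng = random.Random(seed)

    def apply(self, signal):
        # Add a small amount of uniform noise to every sample.
        return [
            s + self.rng.uniform(-self.amplitude, self.amplitude)
            for s in signal
        ]


class ToyUnitNormalize:
    """Stand-in postprocessor (hypothetical): features in, features out."""

    def apply(self, feats):
        # Normalize each coefficient (column) to zero mean and unit
        # variance over the frames (rows). Constant columns keep std 1.0.
        num_frames = len(feats)
        columns = list(zip(*feats))
        means = [sum(c) / num_frames for c in columns]
        stds = [
            (sum((v - m) ** 2 for v in c) / num_frames) ** 0.5 or 1.0
            for c, m in zip(columns, means)
        ]
        return [
            [(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in feats
        ]


signal = [0.0, 1.0, -1.0, 0.5]
dithered = ToyDither(amplitude=0.01).apply(signal)

feats = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normalized = ToyUnitNormalize().apply(feats)
```

Both operators are configured at construction time and expose a single apply method, which is what lets them be chained before and after a computer.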
A computer, which can be found in pydrobert.speech.compute, requires more
complicated initialization. The standard feature representation, which is
a 2D time-log-frequency matrix of energy, derives from
pydrobert.speech.compute.LinearFilterBankFrameComputer. It calculates
coefficients over uniform time slices (frames) using a bank of filters.
Children of pydrobert.speech.compute.LinearFilterBankFrameComputer all
have similar representations and all use linear banks of filters, but can be
computed in different ways. The classic method of computation is the
pydrobert.speech.compute.ShortTimeFourierTransformFrameComputer.
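The idea of "coefficients over uniform time slices" can be sketched without the library. The helper names, the frame length and shift, and the absence of padding below are all assumptions made for illustration; the actual computers handle framing, windowing, and padding in their own way, and produce one coefficient per filter rather than one per frame.

```python
import math


def frame_signal(signal, frame_length, frame_shift):
    # Slice the signal into equally spaced, possibly overlapping frames.
    # This sketch drops any trailing samples that do not fill a frame.
    last_start = len(signal) - frame_length
    return [
        signal[start:start + frame_length]
        for start in range(0, last_start + 1, frame_shift)
    ]


def log_energy(frame, floor=1e-10):
    # A single log-energy coefficient for the frame. A filter bank
    # computer would compute one such value per filter instead.
    return math.log(max(sum(s * s for s in frame), floor))


signal = [math.sin(0.1 * n) for n in range(100)]
frames = frame_signal(signal, frame_length=25, frame_shift=10)
feats = [log_energy(f) for f in frames]
```

With a bank of filters, each entry of feats would become a row of coefficients, giving the 2D time-by-frequency matrix described above.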
Banks of filters derive from pydrobert.speech.filters.LinearFilterBank.
Children of that parent class, such as
pydrobert.speech.filters.ComplexGammatoneFilterBank, decide on the shape of
the filters.
pydrobert.speech.compute.LinearFilterBankFrameComputer instances
compute coefficients at uniform intervals in time. However, the distribution
over frequencies is decided by the distribution of filter frequency responses
from the filter bank, which, in turn, depends on a scaling function. Scaling
functions, such as pydrobert.speech.scales.MelScaling, can be found in
pydrobert.speech.scales. A scaling function transforms the frequency domain
into some other real domain. In that domain, filter frequency bandwidths are
distributed uniformly. Translated back to the frequency domain, those same
bandwidths can be quite non-uniform.
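A small numeric sketch shows why uniform spacing in the scaled domain turns into non-uniform spacing in Hertz. It uses one commonly cited mel formula. The constants, the function names, and the 20 Hz to 8000 Hz range are assumptions for illustration and need not match what pydrobert.speech.scales.MelScaling or the filter banks actually do.

```python
import math


def hertz_to_mel(hertz):
    # One widely used mel formula (assumed here, not taken from the library).
    return 1127.0 * math.log(1.0 + hertz / 700.0)


def mel_to_hertz(mel):
    # Inverse of hertz_to_mel.
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def uniform_edges(low_hz, high_hz, num_filters):
    # num_filters bands need num_filters + 1 edges, spaced evenly in mel.
    low_mel, high_mel = hertz_to_mel(low_hz), hertz_to_mel(high_hz)
    step = (high_mel - low_mel) / num_filters
    return [mel_to_hertz(low_mel + i * step) for i in range(num_filters + 1)]


edges = uniform_edges(20.0, 8000.0, 10)
# Width of each band in Hertz: equal in mel, growing in Hertz.
bandwidths = [high - low for low, high in zip(edges, edges[1:])]
```

Every band has the same width in mel, yet each entry of bandwidths is larger than the one before it, so low frequencies are covered more densely than high ones.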
In sum, you build a computer by first choosing a scale from
pydrobert.speech.scales. You then pass that as an argument to a filter
bank that you’ve chosen from pydrobert.speech.filters. Finally, you pass
that as an argument to your computer of choice. For example:
>>> from pydrobert.speech import *
>>> scale = scales.MelScaling()
>>> bank = filters.ComplexGammatoneFilterBank(scale)
>>> computer = compute.ShortTimeFourierTransformFrameComputer(bank)
>>> # preprocess the signal
>>> feats = computer.compute_full(signal)
>>> # postprocess the features
This is a bit different from the syntax described in the README. There, we use aliases. Aliases are a simple mechanism for unpacking hierarchies of parameters, such as the hierarchy between these computers, filter banks, and scales. We can streamline the above initialization as
>>> computer = compute.ShortTimeFourierTransformFrameComputer(
... {"name": "tonebank", "scaling_function": "mel"})
or even
>>> computer = compute.FrameComputer.from_alias("stft",
... {"name": "tonebank", "scaling_function": "mel"})
The dictionaries are merely keyword argument dictionaries with the special key
"name" or "alias" referring to an alias of the subclass you wish
to initialize (unless you just pass a string, in which case it’s treated as
the alias with no arguments). Aliases are listed in each subclass’ alias
class member. Besides brevity, aliases provide a principled way of storing
hierarchies on disk via JSON. Thus, it’s possible to access most of
pydrobert-speech’s flexibility from the provided command-line hooks.
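The mechanism can be pictured with a toy registry. Everything below (the Aliased base class, the aliases attribute name, and the Toy* classes) is hypothetical and written only to illustrate how a nested dictionary loaded from JSON could be unpacked into a hierarchy of objects. It is not the library's implementation.

```python
import json


class Aliased:
    """Toy base class: direct subclasses are looked up by alias."""

    aliases = set()

    @classmethod
    def from_alias(cls, spec):
        # A bare string is an alias with no arguments. A dictionary
        # carries keyword arguments plus a "name" or "alias" key.
        if isinstance(spec, str):
            name, kwargs = spec, {}
        else:
            kwargs = dict(spec)
            name = kwargs.pop("name", None) or kwargs.pop("alias")
        for sub in cls.__subclasses__():
            if name in sub.aliases:
                return sub(**kwargs)
        raise ValueError(f"unknown alias {name!r}")


class ToyScale(Aliased):
    pass


class ToyMel(ToyScale):
    aliases = {"mel"}


class ToyBank(Aliased):
    pass


class ToyToneBank(ToyBank):
    aliases = {"tonebank"}

    def __init__(self, scaling_function):
        # Nested specifications are unpacked one level further down.
        if not isinstance(scaling_function, ToyScale):
            scaling_function = ToyScale.from_alias(scaling_function)
        self.scaling_function = scaling_function


# The whole hierarchy fits in a JSON string that could live on disk.
config = json.loads('{"name": "tonebank", "scaling_function": "mel"}')
bank = ToyBank.from_alias(config)
```

Because the configuration is plain JSON, the same text can be saved to a file and handed to a command-line tool, which is the role aliases play in the paragraph above.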
Finally, there are some visualization functions in the pydrobert.speech.vis
module (requires matplotlib) and some extensions to pydrobert-kaldi data
iterators in pydrobert.speech.corpus.