pySLGR Package¶

Modules¶

`pyslgr.GMMModel`
`pyslgr.LLFeatures`
`pyslgr.LLSignal`
`pyslgr.MFCCFeatures`
`pyslgr.FeatPipe`
`pyslgr.iVector`
`pyslgr.GSV`
`pyslgr.sad`

pyslgr.GMMModel module¶

class pyslgr.GMMModel.GMMModel¶

Bases: object

icov()¶: Return the GMM diagonal of the inverse covariance matrix as a numpy array

is_loaded()¶

load()¶

Load a GMM model.

load(model_file_name)

Parameters:

model_file_name: File name of model to load

mean()¶: Return the GMM mean vector as a numpy array

num_fea()¶

num_mix()¶

score_models()¶

Score a list of GMM models against the input features.

score_models(LLFeatures f, list models, int topM, bool use_shortfall, float sf_delta)

Parameters:

f: Feature input
models: List of GMM models to process
topM: Number of Gaussians to score per frame (default: 5)
use_shortfall: Use shortfall method when evaluating Gaussian mixture model (default: True)
sf_delta: Shortfall delta for pruning (default: 10.0)
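
The topM parameter limits how many Gaussians contribute to each frame score. The sketch below illustrates that general idea for a diagonal-covariance GMM in plain Python. It is not the pySLGR implementation: the function names are hypothetical, the shortfall pruning is not modelled, and the inverse-variance layout is assumed to match what icov() returns.

```python
import math


def component_log_likelihoods(x, weights, means, icovs):
    """Return log(w_i * N(x; mean_i, diag(1 / icov_i))) for every mixture i."""
    out = []
    for w, mu, ic in zip(weights, means, icovs):
        # Normalisation term written directly in terms of inverse variances.
        log_norm = 0.5 * sum(math.log(v) for v in ic)
        log_norm -= 0.5 * len(x) * math.log(2.0 * math.pi)
        quad = sum(v * (xi - mi) ** 2 for xi, mi, v in zip(x, mu, ic))
        out.append(math.log(w) + log_norm - 0.5 * quad)
    return out


def top_m_frame_score(x, weights, means, icovs, top_m=5):
    """Approximate the frame log-likelihood using only the top_m best mixtures."""
    scores = sorted(component_log_likelihoods(x, weights, means, icovs),
                    reverse=True)[:top_m]
    peak = scores[0]
    # Log-sum-exp over the retained mixtures, shifted by the peak for stability.
    return peak + math.log(sum(math.exp(s - peak) for s in scores))
```

With top_m equal to the number of mixtures this reduces to the exact log-likelihood; smaller values trade accuracy for speed.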

suff_stats()¶

class pyslgr.GMMModel.GMMSAD¶

Bases: object

Class to process a signal and produce SAD marks using a GMM

GMMSAD (feat_config, gmm_models, label_keep, label_window, min_gap_dur=0.5, min_seg_dur=0.2) or GMMSAD (config) where config is a JSON string or dictionary.

Parameters:

feat_config: Configuration string from a JSON object – see MFCCFeatures.process for more details
model_dir: Base directory where models are stored
gmm_models: Dictionary of models for GMMSAD scoring. Typical keys are ‘speech’, ‘music’, ‘nonspeech’.
label_keep: The key of the model to keep (default: ‘speech’)
label_window: Window length for smoothing frame scores
min_gap_dur: Minimum gap between segments – segments are combined if the gap is smaller than this time
min_seg_dur: Minimum segment duration – ignore segments below this duration

process()¶

Process LLSignal and return a list of tuples. Each tuple is a start time (seconds) and duration of the detected class.

gmmsad (signal)

Parameters:

signal: LLSignal object
f: LLFeatures object (not used, but available for uniformity)
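
The min_gap_dur and min_seg_dur parameters describe a post-processing step on (start, duration) tuples. A minimal sketch of that step follows, assuming gaps are merged first and short segments are dropped afterwards; the order used by GMMSAD is not stated here, and smooth_segments is a hypothetical helper name.

```python
def smooth_segments(segments, min_gap_dur=0.5, min_seg_dur=0.2):
    """Merge and filter (start, duration) tuples given in seconds.

    Assumes the input list is sorted by start time.
    """
    merged = []
    for start, dur in segments:
        if merged:
            prev_start, prev_dur = merged[-1]
            gap = start - (prev_start + prev_dur)
            if gap < min_gap_dur:
                # Gap is too small: extend the previous segment over this one.
                new_end = max(prev_start + prev_dur, start + dur)
                merged[-1] = (prev_start, new_end - prev_start)
                continue
        merged.append((start, dur))
    # Drop anything shorter than the minimum segment duration.
    return [(s, d) for s, d in merged if d >= min_seg_dur]
```

For example, two segments separated by a quarter of a second are joined into one, while an isolated 0.125 second segment is discarded under the default settings.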

pyslgr.LLFeatures module¶

class pyslgr.LLFeatures.LLFeatures¶

Bases: object

accel()¶

Calculate acceleration values from delta values.

accel(accel_spread)

Parameters:

accel_spread: Acceleration at time t is calculated using the frames t-k,...,t,...,t+k where k is ‘accel_spread’

apply_sad()¶

Given that speech activity detection has been calculated or loaded, remove frames corresponding to non-speech.

apply_sad()
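
Conceptually, applying SAD keeps only the frames whose label is 1. The plain-Python sketch below shows that idea on lists; keep_speech_frames is a hypothetical name, and the real method operates in place on the feature object.

```python
def keep_speech_frames(frames, labels):
    """Return only the feature vectors whose SAD label is 1."""
    if len(frames) != len(labels):
        raise ValueError("need exactly one label per feature vector")
    return [frame for frame, label in zip(frames, labels) if label == 1]
```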

delta()¶

Calculate delta values from base features.

delta(delta_spread)

Parameters:

delta_spread: Delta at time t is calculated using the frames t-k,...,t,...,t+k where k is ‘delta_spread’
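
One common way to compute a delta over the frames t-k,...,t+k is a linear-regression slope. The sketch below uses that formula as an assumption; the exact weighting and edge handling in pySLGR are not documented here, so edge frames are simply clamped.

```python
def delta(frames, delta_spread=2):
    """Regression-style deltas for a list of equal-length feature vectors."""
    n = len(frames)
    denom = 2 * sum(i * i for i in range(1, delta_spread + 1))
    out = []
    for t in range(n):
        vec = []
        for j in range(len(frames[t])):
            acc = 0.0
            for i in range(1, delta_spread + 1):
                ahead = frames[min(t + i, n - 1)][j]  # clamp at the last frame
                behind = frames[max(t - i, 0)][j]     # clamp at the first frame
                acc += i * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out
```

On a linear ramp the interior deltas equal the slope of the ramp. Applying the same function to the delta output gives an acceleration-style feature.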

delta2point()¶

Calculate delta values from base features using only 2 values.

delta2point(delta_spread)

Parameters:

delta_spread: Delta at time t is calculated using the frames t-k and t+k where k is ‘delta_spread’
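
The two-point variant uses only the frames at t-k and t+k. A minimal sketch follows, assuming a plain difference with no scaling and clamped edges; neither detail is confirmed for pySLGR.

```python
def delta2point(frames, delta_spread=1):
    """Two-point deltas: frame at t+k minus frame at t-k, with clamped edges."""
    n = len(frames)
    out = []
    for t in range(n):
        ahead = frames[min(t + delta_spread, n - 1)]
        behind = frames[max(t - delta_spread, 0)]
        out.append([a - b for a, b in zip(ahead, behind)])
    return out
```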

feat_norm()¶

Normalize each feature individually to zero mean and unit variance across all frames.

feat_norm()
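
Per-feature normalization can be sketched as below. This version divides by the population standard deviation and maps constant features to zero; both choices are assumptions made for illustration, not documented pySLGR behaviour.

```python
import math


def feat_norm(frames):
    """Normalize each feature dimension to zero mean and unit variance."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[j] for f in frames) / n for j in range(dim)]
    # Population variance (divide by n); the library's convention is not stated.
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in frames) / n)
            for j in range(dim)]
    return [[(f[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(dim)]
            for f in frames]
```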

load_raw()¶

Load a file of raw floats into a feature store. Assumes all features are base features.

load_raw (filename, num_feat)

Parameters:

filename: Name of file to load
num_feat: Number of base features (dimension of feature vector)

load_sad_labels()¶

Load a file of SAD labels (0/1). Assumes whitespace separation between labels.

load_sad_labels(filename, sloppy)

Parameters:

filename: Name of file to load
sloppy: Complain if the number of labels doesn’t match the number of feature vectors (default: True)

num_base_feat()¶

Return the number of base features in each vector. Base features do not include post-processing such as delta, sdc, or acceleration.

num_base_feat() -> int

num_outfeat()¶

Return the number of output features in each vector. Output features are set with the set_outfeat() method.

num_outfeat() -> int

num_total_feat()¶

num_vec()¶

Return the number of feature vectors.

num_vec() -> int

rasta()¶

Apply RASTA to all features.

rasta()

sad_labels()¶: Return the current SAD labels (0/1) per frame.

save_raw()¶

Save features as raw floats to ‘filename’. The features saved are set by ‘set_outfeat’.

save_raw(filename)

Parameters:

filename: Name of file to save features in

save_sad_labels()¶

Save SAD labels (0/1) to file.

save_sad_labels(filename)

Parameters:

filename: Name of output file

sdc()¶

Shifted-delta features – typically used for language recognition. Note: Uses available delta features which must be calculated before invoking ‘sdc()’.

sdc (sdc_p, sdc_k)

Parameters:

sdc_p: p value – shift between delta blocks (typical value, 3)
sdc_k: k value – number of delta blocks to stack (typical value, 7)
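
Shifted-delta stacking can be illustrated as follows: for each frame t, the delta vectors at t, t+p, ..., t+(k-1)p are concatenated. This is a sketch of the general technique with clamped edges, not the pySLGR code, and the input is assumed to be a list of already-computed delta vectors.

```python
def sdc(deltas, sdc_p=3, sdc_k=7):
    """Stack sdc_k delta vectors spaced sdc_p frames apart for every frame."""
    n = len(deltas)
    out = []
    for t in range(n):
        stacked = []
        for i in range(sdc_k):
            # Clamp to the last frame when the shifted index runs past the end.
            stacked.extend(deltas[min(t + i * sdc_p, n - 1)])
        out.append(stacked)
    return out
```

Each output vector has sdc_k times the dimension of a single delta vector.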

set_outfeat()¶

Set the features to output for typical operations. The value can be changed at any time. Order in the parameter string determines the stacking order. By default the ‘outfeat’ is set to ‘all’ when the feature object is created.

set_outfeat(outfeat)

Parameters:

outfeat: Set the output features. If outfeat==’all’, then all base features and calculated features are returned (except energy). Otherwise, ‘outfeat’ is examined a character at a time and features are stacked in that order: ‘f’ base features, ‘d’ delta features, ‘a’ acceleration features, ‘e’ energy, ‘s’ sdc features. E.g., set_outfeat(‘fd’) would set the output to base features in indices 0, ..., num_base_feat-1 and delta features in indices num_base_feat, ..., 2*num_base_feat-1.
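
To make the stacking order concrete, the sketch below computes the index range each feature type would occupy for a given outfeat string, assuming the blocks are laid out contiguously in the order given. outfeat_layout and its dims argument are hypothetical, and the ‘all’ shorthand is not handled.

```python
def outfeat_layout(outfeat, dims):
    """Return (code, first_index, last_index) for each character in outfeat.

    dims maps a feature code ('f', 'd', 'a', 'e', 's') to its width.
    """
    layout = []
    offset = 0
    for code in outfeat:
        if code not in dims:
            raise ValueError("unknown or unavailable feature code: %r" % code)
        width = dims[code]
        layout.append((code, offset, offset + width - 1))
        offset += width
    return layout
```

With 13 base features, outfeat_layout('fd', {'f': 13, 'd': 13}) places the base features in indices 0 to 12 and the deltas in indices 13 to 25.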

xtalk()¶

xtalk energy based speech-activity detection.

xtalk (abs_min_energy, thresh, med_len=1)

Parameters:

abs_min_energy: Below this threshold is non-speech. Typical values, -10 or 0.
thresh: Above this threshold triggers speech activity (the algorithm is adaptive).
med_len: int (default 1): Median filter to smooth activity. Large values imply less abrupt changes in speech activity.
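
The effect of med_len can be seen with a simple median filter over 0/1 labels. The sketch covers only the smoothing stage; the adaptive energy thresholding is not reproduced, and the edge handling (truncated windows) is an assumption.

```python
def median_smooth(labels, med_len=1):
    """Median-filter a list of 0/1 labels with a window of about med_len frames."""
    half = med_len // 2
    n = len(labels)
    out = []
    for t in range(n):
        # The window is truncated at the start and end of the sequence.
        window = sorted(labels[max(t - half, 0):min(t + half + 1, n)])
        out.append(window[len(window) // 2])
    return out
```

With med_len=1 the labels are returned unchanged; larger windows remove isolated flips.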

pyslgr.LLSignal module¶

class pyslgr.LLSignal.LLSignal¶

Bases: object

Class to contain and process 1-dimensional signals–typically speech or audio.

LLSignal() – empty signal with zero samples.

get_f0()¶

Find the fundamental frequency f0 (“pitch”) from the signal using the Entropic algorithm.

get_f0 (min_f0, max_f0, window_dur, frame_step) -> np.array(dtype=float)

Parameters:

min_f0: Minimum allowed fundamental frequency in Hz (e.g., 100)
max_f0: Maximum allowed fundamental frequency in Hz (e.g., 650)
window_dur: Window duration in seconds (e.g., 0.010 – 10 milliseconds)
frame_step: Increment of window position (e.g., 0.002 – 2 milliseconds)

length()¶

Length of the signal in samples.

x.length() -> int

load_pcm_wav()¶

Load a PCM-encoded Microsoft WAV file as a single-channel signal.

load_pcm_wav(filename, sum_channels=True)

Parameters:

filename: Path of file to load
sum_channels: Default True – sum channels if multiple present. Otherwise an error will be thrown for multiple channels.
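
The sum_channels behaviour can be approximated with the standard library wave module. The sketch below is a stand-in for illustration only: read_pcm_wav is a hypothetical helper, it handles 16-bit samples only, and it returns a Python list rather than an LLSignal.

```python
import struct
import wave


def read_pcm_wav(path, sum_channels=True):
    """Read a 16-bit PCM WAV file and return (samples, sampling_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        n_chan = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # WAV sample data is little-endian; each sample is a signed short.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    if n_chan == 1:
        return list(samples), rate
    if not sum_channels:
        raise ValueError("multiple channels present and sum_channels is False")
    # Channels are interleaved: one sample per channel for each frame.
    summed = [sum(samples[i:i + n_chan])
              for i in range(0, len(samples), n_chan)]
    return summed, rate
```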

load_raw_short()¶

Load a single-channel pcm-encoded file of short ints with no header.

load_raw_short(filename, sampling_frequency)

Parameters:

filename: Path to file to load
sampling_frequency: Sampling frequency of the file in Hz (e.g., 8000 for 8 kHz)
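
A headerless file of short ints carries no sampling rate, which is why the caller must supply it. The sketch below reads such a file with the standard library, assuming little-endian 16-bit samples; read_raw_short is a hypothetical helper, and the byte order expected by pySLGR is not stated here.

```python
import array
import sys


def read_raw_short(path, sampling_frequency):
    """Read headerless 16-bit samples and return (samples, sampling_frequency)."""
    samples = array.array("h")  # signed short
    with open(path, "rb") as fh:
        samples.frombytes(fh.read())
    if sys.byteorder == "big":
        samples.byteswap()  # the file is assumed to be little-endian
    return list(samples), sampling_frequency
```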

load_sph()¶

Load a NIST SPHERE file. The channel number is 0 or 1; use 0 for a single-channel file.

load_sph(filename, channel_num)

Parameters:

filename: Path to file to load
channel_num: Number of channel, 0 or 1, to load

normalize()¶

Normalize the amplitude of the waveform to the 16-bit range.

normalize()

preemphasis()¶

Perform pre-emphasis on the waveform; i.e., filter with 1/(1-alpha*z^(-1))

preemphasis (alpha)

Parameters:

alpha: Pre-emphasis coefficient

remove_mean()¶

Remove the mean of the signal.

remove_mean()

resample_16k()¶

Resample the signal to a 16 kHz sampling rate. Note: The resample_init() method must be called before calling this method. Also, if the sample rate is below 16 kHz, no operation will be performed.

resample_16k()

resample_8k()¶

Resample the signal to an 8 kHz sampling rate. Note: The resample_init() method must be called before calling this method.

resample_8k()

sampling_frequency()¶

Return the sampling frequency of the currently loaded signal.

sampling_frequency()

save_pcm_wav()¶

Save the current signal in pcm-encoded Microsoft wav file format.

save_pcm_wav(filename, scale=False)

Parameters:

filename: Path of file to save
scale: Scale the output to full 16-bit range when saving

save_raw_short()¶

Save the current signal as short ints with no header.

save_raw_short(filename, clip, scale=False)

Parameters:

filename: Path of file to save
clip: Clip the output if it is greater than the largest 16-bit value
scale: Scale the output to full 16-bit range when saving

pyslgr.MFCCFeatures module¶

class pyslgr.MFCCFeatures.MFCCFeatures¶

Bases: pyslgr.LLFeatures.LLFeatures

config  String or dictionary containing configuration parameters for MFCCs.
Parameters in the config are:

alpha Warping factor for bilinear method (no warping: 1.0)

dither 0/1 - Add low level noise to the signal (typical: 1)

fb_low Lowest filter bank frequency in Hz (typical: 300)

fb_hi Highest filter bank frequency in Hz (typical: 3140)

fb_only 0/1 - Produce the ‘raw’ filter bank outputs instead of cepstral coefficients

keep_c0 0/1 - Keep the c0 cepstral coefficient; c0 represents frame energy (typical: 0)

linear true/false - linear or mel-warped scale for filter banks (typical: false)

num_cep int - number of cepstral coefficients (c1-c??) to output (typical: 7-19)

tgt_num_filt int - number of filters across the entire bandwidth; only applied for linear=true

win_inc_ms int - window increment in milliseconds (typical: 10)

win_len_ms int - window length in milliseconds (typical: 20-30)

static config_dict_to_str()¶: Converts a feature configuration dictionary to a json string for processing.
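
As an illustration of the configuration format, the dictionary below uses the keys listed above, with values taken from the stated typical settings where given. The remaining values are illustrative, tgt_num_filt is left out because it applies only when linear is true, and json.dumps stands in for config_dict_to_str on the assumption that the conversion is plain JSON serialization.

```python
import json

# Keys come from the parameter list above; values are illustrative.
mfcc_config = {
    "alpha": 1.0,       # no warping
    "dither": 1,
    "fb_low": 300,
    "fb_hi": 3140,
    "fb_only": 0,       # illustrative: produce cepstral coefficients
    "keep_c0": 0,
    "linear": False,    # mel-warped filter banks
    "num_cep": 19,
    "win_inc_ms": 10,
    "win_len_ms": 20,
}

# Stand-in for MFCCFeatures.config_dict_to_str(mfcc_config).
config_str = json.dumps(mfcc_config)
```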

duration()¶

Return the duration that the MFCC data spans in seconds.

duration() -> float seconds

static get_lid_config()¶: Returns default language id configuration as a dictionary. User can modify the entries in the dictionary or use this configuration as is. User must call static method config_dict_to_str(config) before processing a signal with such configuration.

static get_sid_config()¶: Returns default speaker id configuration as a dictionary. User can modify the entries in the dictionary or use this configuration as is. User must call static method config_dict_to_str(config) before processing a signal with such configuration.

get_win_inc_ms()¶

Return the window increment in milliseconds.

get_win_inc_ms()

load_sad_marks()¶

Load SAD marks from a file or list

load_sad_marks(src)

Parameters:

src: Name of input file

or src : list of tuples

Tuples with start, duration in seconds: [(0.0,1.0),(2.0,1.5)]
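
SAD marks in (start, duration) form relate to per-frame 0/1 labels through the window increment. The sketch below shows one plausible conversion; marks_to_labels is a hypothetical helper, and the rounding rule is an assumption rather than documented behaviour.

```python
def marks_to_labels(marks, num_frames, win_inc_ms=10):
    """Convert (start, duration) marks in seconds to per-frame 0/1 labels."""
    labels = [0] * num_frames
    for start, dur in marks:
        first = int(round(start * 1000.0 / win_inc_ms))
        last = int(round((start + dur) * 1000.0 / win_inc_ms))
        # Mark frames first..last-1, clipped to the valid frame range.
        for t in range(max(first, 0), min(last, num_frames)):
            labels[t] = 1
    return labels
```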

process()¶

Process the signal to return mel-frequency cepstral coefficient (MFCC) features.

process(signal) -> features

Parameters:

signal: Input signal – instance of LLSignal class

save_sad_marks()¶

Save SAD marks to a file.

save_sad_marks(filename)

Parameters:

filename: Name of output file

pyslgr.FeatPipe module¶

class pyslgr.FeatPipe.FeatPipe(config, featClass, sadClass)¶

Bases: object

Implementation of a full feature extraction pipeline.


config : a dictionary of config parameters with two main keys ‘pipe_config’, ‘sad_config’.
config[‘pipe_config’] has keys:

accel_spread int

delta_spread int

delta2point True/False

do_accel True/False

do_delta True/False

do_rasta True/False

do_feat_norm True/False

do_sdc True/False

outfeat string to pass to set_outfeat

feat_config dictionary to pass directly into LLFeatures object

sdc_params a tuple to pass to sdc – typically (3,7)

config[‘sad_config’] is passed directly to the sadClass

featClass : an LLFeatures compatible class
sadClass  : a class with constructor sadClass(config) and method sadClass.process(LLSignal x, LLFeatures f)
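
A configuration with the two top-level keys described above might look like the sketch below. All values are illustrative, the feat_config entry is abbreviated, and missing_pipe_keys is a hypothetical helper that only checks which documented keys are absent.

```python
# Keys documented for config['pipe_config'].
PIPE_KEYS = [
    "accel_spread", "delta_spread", "delta2point", "do_accel", "do_delta",
    "do_rasta", "do_feat_norm", "do_sdc", "outfeat", "feat_config",
    "sdc_params",
]

featpipe_config = {
    "pipe_config": {
        "accel_spread": 2,
        "delta_spread": 2,
        "delta2point": False,
        "do_accel": True,
        "do_delta": True,
        "do_rasta": False,
        "do_feat_norm": True,
        "do_sdc": False,
        "outfeat": "fda",
        "feat_config": {"num_cep": 19, "win_inc_ms": 10, "win_len_ms": 20},
        "sdc_params": (3, 7),
    },
    # Passed to the SAD class unchanged; these keys suit an xtalk-style SAD.
    "sad_config": {"abs_min_energy": -10, "thresh": 3, "med_len": 1},
}


def missing_pipe_keys(config):
    """Return the documented pipe_config keys that are absent from config."""
    pipe = config.get("pipe_config", {})
    return [key for key in PIPE_KEYS if key not in pipe]
```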

process(x)¶

Extract features.

x: Input signal

Returns a feature object

pyslgr.iVector module¶

class pyslgr.iVector.iVector(config)¶

Bases: object

iVector extractor

config : a dictionary of config parameters

tv_matrix filename for total variability matrix – raw floats

ubm_model UBM model file

process(f)¶

f : LLFeatures object, input features

returns an ivector (factors with no scaling or transformation)

pyslgr.GSV module¶

class pyslgr.GSV.GSV¶

Bases: object

GSV(config)

config  dictionary or JSON string with config parameters

process()¶

Process input features ‘f’ and produce a GSV expansion

f LLFeatures input

pyslgr.sad module¶

class pyslgr.sad.XtalkSAD(config)¶

Bases: object

Perform energy-based speech activity detection using Xtalk

config : a dictionary of config parameters to pass to xtalk

‘abs_min_energy’, ‘thresh’, ‘med_len’ (optional)

process(x, f)¶

pyslgr.Scores module¶

class pyslgr.Scores.Scores(f_scores, s, u_score)¶: Expects frame scores and scores (f_scores and s respectively) as Python lists. Argument u_score is a UBM score given as a float.
