Audio Time-Scale Modification with Temporal Compressing Networks

Abstract

We propose a novel approach to time-scale modification (TSM) of audio signals. Traditional methods rely on framing or the short-time Fourier transform to preserve frequency during temporal stretching. In contrast, our neural network model encodes the raw audio into a high-level latent representation, dubbed Neuralgram, in which each vector represents 1024 audio samples. Because this compression ratio is sufficiently high, we can apply arbitrary spatial interpolation to the Neuralgram to perform temporal stretching. Finally, a learned neural decoder synthesizes the time-scaled audio samples from the stretched Neuralgram. Both the encoder and the decoder are trained with latent regression losses and adversarial losses to obtain high-fidelity audio. Despite its simplicity, our method performs comparably to existing baselines and opens a new direction for research into modern time-scale modification. Audio samples can be found on our website.
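The stretching step described above can be illustrated with a minimal sketch. It assumes linear interpolation along the time axis, which is only one possible choice of interpolation, and it uses a random array as a stand-in for the encoder output. The function name `stretch_latent`, the 8-dimensional latent size, and the `rate` convention (values above 1 shorten the signal) are hypothetical and not taken from the paper.

```python
import numpy as np

SAMPLES_PER_FRAME = 1024  # from the abstract: one latent vector per 1024 samples


def stretch_latent(latent: np.ndarray, rate: float) -> np.ndarray:
    """Linearly resample a (frames, channels) latent sequence along time.

    rate > 1 shortens the sequence (faster playback); rate < 1 lengthens it.
    Linear interpolation is an assumption made for this sketch.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_in = latent.shape[0]
    n_out = max(1, int(round(n_in / rate)))
    # Positions of the output frames on the input frame axis.
    src = np.linspace(0.0, n_in - 1, num=n_out)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_in - 1)
    w = (src - lo)[:, None]  # per-frame blend weight, broadcast over channels
    return (1.0 - w) * latent[lo] + w * latent[hi]


# Stand-in for an encoder output: 100 frames of 8-dimensional vectors.
latent = np.random.default_rng(0).standard_normal((100, 8))
slow = stretch_latent(latent, rate=0.5)  # twice as many frames
fast = stretch_latent(latent, rate=2.0)  # half as many frames
print(slow.shape, fast.shape)
# Approximate number of audio samples a decoder would produce for `slow`.
print(slow.shape[0] * SAMPLES_PER_FRAME)
```

In the full method, a learned decoder would turn the stretched sequence back into audio; that part is not sketched here.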

Gradio app

Samples

For each dataset, there are 5 audio clips and 5 different stretchers, including 3 compression ratios. Visit each dataset's homepage for more original clips.

The following traditional TSM algorithms are included for comparison:

PV-TSM, implemented by librosa
WSOLA, implemented by SoX
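For readers who want to produce comparable baseline outputs, the sketch below shows one plausible way to call the two tools. It assumes the phase-vocoder baseline corresponds to `librosa.effects.time_stretch` and the WSOLA baseline to the SoX `tempo` effect; the exact commands and parameters used for the samples on this page are not stated, and the helper names and file names are hypothetical.

```python
import shlex


def sox_tempo_command(src: str, dst: str, rate: float) -> list[str]:
    """Build a SoX command line using the `tempo` effect (WSOLA-based)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ["sox", src, dst, "tempo", f"{rate:g}"]


def pv_time_stretch(samples, rate: float):
    """Phase-vocoder stretch via librosa (rate > 1 speeds the audio up)."""
    # Imported lazily so the rest of the sketch runs without librosa installed.
    import librosa

    return librosa.effects.time_stretch(samples, rate=rate)


# Build (but do not run) the SoX command for a 1.5x speed-up.
cmd = sox_tempo_command("input.wav", "output.wav", 1.5)
print(shlex.join(cmd))
```

The command list can be passed to `subprocess.run(cmd, check=True)` on a machine where SoX is installed.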

