High frequency magnitude spectrogram reconstruction for music mixtures using convolutional autoencoders

This webpage contains sound examples to accompany the following paper accepted for the DAFx-2018 conference:

M. Miron and M. E. P. Davies. High frequency magnitude spectrogram reconstruction for music mixtures using convolutional autoencoders.

Abstract

We present a new approach for audio bandwidth extension for music signals using convolutional neural networks (CNNs). Inspired by the concept of inpainting from the field of image processing, we seek to reconstruct the high-frequency region (i.e., above a cutoff frequency) of a time-frequency representation given the observation of a band-limited version. We then invert this reconstructed time-frequency representation using the phase information from the band-limited input to provide an enhanced musical output. We contrast the performance of two musically adapted CNN architectures which are trained separately using the STFT and the invertible CQT. Through our evaluation, we demonstrate that the CQT, with its logarithmic frequency spacing, provides better reconstruction performance as measured by the signal to distortion ratio.

Sound Examples

To give an impression of the signal reconstruction quality, we provide the following sound examples. All sound examples below are either taken from the MedleyDB test set, and were thus not used in training, or are short excerpts of well-known songs that were in neither the training nor the test set.
Please note that all sound examples are stereo, uncompressed .wav files, so they might take a while to load on slow internet connections.

All sound examples are less than 20 s in duration, and their inclusion is considered fair use for nonprofit educational purposes. However, the sound examples will be taken down at the request of the copyright holders.

Comparison of CNN architectures, cutoff frequencies and input representations

Music: ACDC - Back in Black

Filtered version (3500 Hz) Filtered version (7500 Hz)
CQT bottleneck (3500 Hz) CQT bottleneck (7500 Hz)
CQT stride2 (3500 Hz) CQT stride2 (7500 Hz)
STFT bottleneck (3500 Hz) STFT bottleneck (7500 Hz)
STFT stride2 (3500 Hz) STFT stride2 (7500 Hz)
Original

Music: Tracy Chapman - Fast Car

Filtered version (3500 Hz) Filtered version (7500 Hz)
CQT bottleneck (3500 Hz) CQT bottleneck (7500 Hz)
CQT stride2 (3500 Hz) CQT stride2 (7500 Hz)
STFT bottleneck (3500 Hz) STFT bottleneck (7500 Hz)
STFT stride2 (3500 Hz) STFT stride2 (7500 Hz)
Original

Music: Steely Dan - Aja

Filtered version (3500 Hz) Filtered version (7500 Hz)
CQT bottleneck (3500 Hz) CQT bottleneck (7500 Hz)
CQT stride2 (3500 Hz) CQT stride2 (7500 Hz)
STFT bottleneck (3500 Hz) STFT bottleneck (7500 Hz)
STFT stride2 (3500 Hz) STFT stride2 (7500 Hz)
Original

Music: 2Pac feat. Dr. Dre - California Love

Filtered version (3500 Hz) Filtered version (7500 Hz)
CQT bottleneck (3500 Hz) CQT bottleneck (7500 Hz)
CQT stride2 (3500 Hz) CQT stride2 (7500 Hz)
STFT bottleneck (3500 Hz) STFT bottleneck (7500 Hz)
STFT stride2 (3500 Hz) STFT stride2 (7500 Hz)
Original

Phase Reconstruction

These examples provide an informal comparison between two ways of producing the enhanced signal: using the phase from the low-pass filtered version, and using random phase above the cutoff frequency. For completeness, we also include the original version of the input signal and the low-pass filtered version (without any enhancement). For this informal comparison, we only provide examples from the two best performing approaches: CQT 3500 Hz stride-2 and CQT 7500 Hz bottleneck.
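The two phase options compared in these examples can be illustrated with a minimal sketch. It is not the code used for the paper: the function name, the argument names and the use of a bin index for the cutoff are assumptions made for illustration, and NumPy arrays stand in for whichever time-frequency representation (STFT or CQT) is used. The sketch only builds the complex spectrogram; the actual inversion to audio (inverse STFT or inverse CQT) is left out.

```python
import numpy as np


def combine_magnitude_and_phase(mag_enhanced, spec_lowpass, cutoff_bin,
                                random_above_cutoff=False, rng=None):
    """Attach a phase estimate to an enhanced magnitude spectrogram.

    mag_enhanced : (n_bins, n_frames) non-negative magnitudes
                   (hypothetical network output).
    spec_lowpass : (n_bins, n_frames) complex spectrogram of the
                   band-limited input.
    cutoff_bin   : index of the first frequency bin above the cutoff
                   (assumed known from the filter setting).
    """
    # Option 1: reuse the phase of the low-pass filtered input everywhere.
    phase = np.angle(spec_lowpass)

    if random_above_cutoff:
        # Option 2: keep the input phase below the cutoff and draw
        # uniformly random phase for the reconstructed bins above it.
        rng = np.random.default_rng() if rng is None else rng
        phase = phase.copy()
        n_high = phase.shape[0] - cutoff_bin
        phase[cutoff_bin:] = rng.uniform(-np.pi, np.pi,
                                         size=(n_high, phase.shape[1]))

    # Complex spectrogram, ready to be passed to an inverse transform.
    return mag_enhanced * np.exp(1j * phase)


if __name__ == "__main__":
    # Toy data in place of real spectrograms.
    rng = np.random.default_rng(0)
    n_bins, n_frames, cutoff_bin = 16, 5, 6
    spec_lowpass = (rng.standard_normal((n_bins, n_frames))
                    + 1j * rng.standard_normal((n_bins, n_frames)))
    spec_lowpass[cutoff_bin:] = 0.0  # band-limited: nothing above the cutoff
    mag_enhanced = rng.uniform(0.1, 1.0, size=(n_bins, n_frames))

    enh_lpf = combine_magnitude_and_phase(mag_enhanced, spec_lowpass, cutoff_bin)
    enh_rand = combine_magnitude_and_phase(mag_enhanced, spec_lowpass, cutoff_bin,
                                           random_above_cutoff=True, rng=rng)
    print(enh_lpf.shape, enh_rand.shape)
```

In both variants the magnitudes are identical and the bins below the cutoff are unchanged, so any audible difference between the "LPF phase" and "rand phase above cutoff" examples comes only from the phase assigned to the reconstructed high-frequency region.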

CQT: 3500Hz cutoff, CNN: stride-2, Music: Grants - Punch Drunk

Filtered version Enhanced (LPF phase) Enh. (rand phase above cutoff) Original

CQT: 3500Hz cutoff, CNN: stride-2, Music: Music Delta - Chinese Chao Zhou

Filtered version Enhanced (LPF phase) Enh. (rand phase above cutoff) Original

CQT: 7500Hz cutoff, CNN: bottleneck, Music: Auctioneer - Our Future Faces

Filtered version Enhanced (LPF phase) Enh. (rand phase above cutoff) Original

CQT: 7500Hz cutoff, CNN: bottleneck, Music: Helado Negro - Mitad Del Mundo

Filtered version Enhanced (LPF phase) Enh. (rand phase above cutoff) Original


Funding Acknowledgments


M. E. P. Davies is supported by Portuguese National Funds through the FCT -- Foundation for Science and Technology, I.P., under the project IF/01566/2015.
The TITAN X GPU used for this research was donated by the NVIDIA Corporation.

"TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-00020" is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).