SignalTrain Demo Page

Supplemental Materials accompanying the paper
"SignalTrain: Profiling Audio Compressors with Deep Neural Networks"
by Scott H. Hawley, Benjamin Colburn, Stylianos I. Mimilakis
Paper: arXiv:1905.11928


In this work we present a data-driven approach for predicting the behavior of (i.e., profiling) a given non-linear audio signal-processing effect (henceforth "audio effect"). Our objective is to learn a mapping function, operating on time-domain samples, from the unprocessed audio to the audio processed by the effect to be profiled. To that aim, we employ a deep auto-encoder model that is conditioned on both time-domain samples and the control parameters of the target audio effect. As a test case, we focus on the offline profiling of two dynamic range compression audio effects, one software-based and the other analog. Compressors were chosen because they are a widely used and important class of effects, and because their parameterized, nonlinear, time-dependent nature makes them a challenging problem for a system aiming to profile "general" audio effects. Results from our experimental procedure show that the primary functional and auditory characteristics of the compressors can be captured; however, there is still sufficient audible noise to merit further investigation before such methods are applied to real-world audio processing workflows.
There are different ways to model musical audio processing effects units (e.g., rack-mounted analog gear, guitar stomp pedals) in software. One way is to model the physical components or processes at work inside the unit; another is to try to reproduce key aspects of the sound independent of how they are physically created. In this paper we take the latter approach.

We've been trying to take advantage of recent advances in neural networks to create a general system that can "learn" to emulate -- i.e., that can "profile" -- a variety of audio effects. Rather than using "pre-made" effects modules (e.g., reverb, distortion, delay) for which we estimate parameters (or "knob settings"), we build a "generic" system with as few assumptions as possible, one that learns what the effect is purely on the basis of what it does -- i.e., by treating the effect as a "black box" and looking at how the unit's output relates to its input and to the settings of the controls (or "knobs") used to produce that output.

For effects such as distortion or delay this can be a useful test problem, and early experiments by us and others in 2016 with simple recurrent networks seemed promising. But what those early systems couldn't learn -- and what audio engineers challenged us with -- were (dynamic range) compressors, which are important for the audio production industry. Compressors proved simply impossible for our earlier systems to learn: they are both time-dependent and non-linear, and that combination makes them difficult. Moreover, we're trying to create a system that can learn not just a single control setting, but the whole range of what the "knobs" on the effect unit "do." So we focused on compressors until we could get them right, which is what this paper is about. Later we may share analyses of applying our method to other effects, but an in-depth report on compressors merited attention first.
In the end, even though we're following the true compressor output pretty well, we're still getting too much noise in our neural network's output for this to be a "production" utility, and it runs too slowly to be used in real time (on current hardware). Still, the work presented here constitutes progress in the right direction.
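To make concrete why compressors are both time-dependent and non-linear, here is a minimal feed-forward compressor sketch with the same four controls as the paper's Comp-4C effect (threshold, ratio, attack, release). This is an illustrative assumption, not the actual Comp-4C implementation:

```python
import numpy as np

def compress(x, sr=44100, thresh_db=-30.0, ratio=3.0, attack=0.01, release=0.03):
    """Simple dynamic range compression of a mono signal x in [-1, 1].

    Hypothetical sketch: the real Comp-4C effect may differ in detail.
    """
    # One-pole smoothing coefficients derived from attack/release times
    a_att = np.exp(-1.0 / (attack * sr))
    a_rel = np.exp(-1.0 / (release * sr))
    env = 0.0
    out = np.empty_like(x)
    for n, xn in enumerate(x):
        level = abs(xn)
        # Envelope follower: fast rise (attack), slow fall (release).
        # This memory of past samples is the "time-dependent" part.
        coeff = a_att if level > env else a_rel
        env = coeff * env + (1.0 - coeff) * level
        level_db = 20.0 * np.log10(max(env, 1e-9))
        # Above threshold, reduce gain by the compression ratio:
        # this thresholded dB-domain mapping is the "non-linear" part.
        if level_db > thresh_db:
            gain_db = thresh_db + (level_db - thresh_db) / ratio - level_db
        else:
            gain_db = 0.0
        out[n] = xn * 10.0 ** (gain_db / 20.0)
    return out
```

The output at each sample depends on the running envelope of everything that came before, which is exactly the combination that defeated simple pointwise or short-memory models.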

(Give the demo a minute to load)

Interactive Slider Demo:

The neural network deployed for this demo was trained on input and output sizes of 4096 samples.

Note: Best to "tap" where you want the slider to go, rather than dragging it while pressed.

We also have a Jupyter Notebook version where you can use your own audio, to be released with the code (see below).

Audio Samples:

All audio samples are either written & recorded by one of the authors or colleagues (used with permission, all rights reserved), or have a Creative Commons license.

These are from the Testing dataset, which the network never "saw."
Effect: Comp-4C (Software Effect)
Details: The audio samples were created using a network trained for an input size of 16,384 samples and an output size of 8,192, at 44.1 kHz.
Input | Effect & Settings | Target Output (Real Effect) | Predicted Output (Inference) | Difference (Target - Predicted)
"Windy Places" intro
Comp-4C: Thresh=-30dB, Ratio=3, Attack=0.01s, Release=0.03s
"Windy Places" chorus
Comp-4C: Thresh=-30dB, Ratio=3, Attack=0.01s, Release=0.03s
"Windy Places" bridge
Comp-4C: Thresh=-30dB, Ratio=2.5, Attack=0.002s, Release=0.03s
Same as previous | Same as previous | Same as previous | No L1 Reg. | Freq Scaling

Spectrum Comparison: No L1 Reg. | Freq Scaling

"Leadfoot" intro
Comp-4C: Thresh=-20dB, Ratio=5, Attack=0.01s, Release=0.04s
"Leadfoot" chorus
Comp-4C: Thresh=-20dB, Ratio=5, Attack=0.01s, Release=0.04s

Effect: LA-2A (Analog Effect)
Details: The audio samples were created using a network trained for an input size of 163,840 samples and an output size of 8,192, at 44.1 kHz.
Input | Effect & Settings | Target Output (Real Effect) | Predicted Output (Inference) | Difference (Target - Predicted)
Loops 1 (piano, distorted guitar)
Peak Red.=65
Loops 2 (clean guitar, vibes)
Peak Red.=85
"How Good You've Been" outro
Peak Red.=80
+12 dB make-up gain (applied to each sample above)

Effect: Comp-4C-Large (Same as Comp-4C but larger parameter ranges)
Details: The audio samples were created using a network trained for an input size of 131,072 samples (2.9 s) and an output size of 8,192, at 44.1 kHz.
Input | Effect & Settings | Target Output (Real Effect) | Predicted Output | Difference (Target - Predicted)
Speech ("5 to 20")
Speech ("Pure Systems...")


Full code will be released on GitHub pending peer review, and pending "cleanup" for public release.
We also have an interactive slider-based Jupyter Notebook like the one above that allows one to upload new audio and listen to the results.
In the meantime, here are some of the software effects that were trained against:

Model Graph:

1. Schematic version (from the paper):

This model combines both time-domain and spectral-like representations of the signal.
The settings of the effects unit's controls (or “knobs”) get grafted into the middle of the hourglass-shapes (“autoencoders”).
Each long-dashed black line is somewhat like a “50-50 wet-dry mix,” and the short-dashed red line is like an adaptive gain control.
See the paper for details.
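The "grafting" of knob settings into the middle of the autoencoder can be sketched as follows. This is an illustrative toy model with made-up layer sizes, not the actual SignalTrain architecture; it only shows the conditioning mechanism the schematic describes:

```python
import torch
import torch.nn as nn

class ConditionedAutoencoder(nn.Module):
    """Toy hourglass network conditioned on effect control settings.

    Hypothetical sketch: layer widths and activations are illustrative,
    not taken from the paper.
    """
    def __init__(self, io_size=4096, latent=64, n_knobs=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(io_size, 512), nn.LeakyReLU(),
            nn.Linear(512, latent), nn.LeakyReLU(),
        )
        # Knob settings are concatenated ("grafted") onto the bottleneck
        self.decoder = nn.Sequential(
            nn.Linear(latent + n_knobs, 512), nn.LeakyReLU(),
            nn.Linear(512, io_size),
        )

    def forward(self, x, knobs):
        z = self.encoder(x)
        z = torch.cat([z, knobs], dim=-1)  # condition on control settings
        return self.decoder(z)

model = ConditionedAutoencoder()
x = torch.randn(1, 4096)                           # unprocessed audio frame
knobs = torch.tensor([[-30.0, 3.0, 0.01, 0.03]])   # e.g. thresh, ratio, attack, release
y = model(x, knobs)                                # predicted processed audio
```

Because the knob values enter at the bottleneck, the same network weights can reproduce the whole range of the effect's control settings rather than a single fixed configuration.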

2. Detailed PyTorch GraphViz output: SVG Image here

(c) 2019 SignalTrain authors