Can SAM Audio Be Used for Stem Separation, Like Demucs?
TL;DR
No.
A more nuanced summary of my experience running SAM Audio locally, so far
Getting SAM Audio to work on my Mac laptop was laborious, and the performance is not amazing. Some of the limitations arise because I run the model on a macOS laptop. With more firepower, I could probably wring more out of it. That said, macOS laptops are very common in audio processing, so this is not an unrealistic setup. For the record, I use a machine with an M1 Max chip, which has an integrated GPU, and 64 GB of RAM.
Until now I have been using another package, also from Meta, called Demucs. It is no longer supported, so I forked the version one of the original developers had already forked, and I am using that. Demucs is designed to separate a sound file into four tracks: bass, drums, vocals, and other. There is also an experimental branch with additional stems such as guitar and piano, but the results are inconsistent. Demucs mostly does the job, and on my laptop it never takes more than 30 minutes for a track, and it Just Works™. The quality is not great, but it is enough for my needs.
1. SAM Audio cannot handle whole tracks; 10-second segments are recommended
With my setup, you cannot pass a whole track to SAM Audio; you have to split it into segments first.
How long should these segments be? One Meta developer suggested 10 seconds, since those are the clip lengths the models were trained on. I use values between 9 and 18 seconds for the sam-audio-large model, so a typical 3 to 4 minute track ends up split into roughly 20 segments.
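The splitting itself is trivial; here is a minimal sketch, assuming numpy and a mono waveform (the 13.22 s value matches one of the segment lengths I mention below; it is otherwise illustrative):

```python
# Illustrative sketch: split a mono waveform into fixed-length segments
# before feeding each one to the model. numpy only; the last segment
# may be shorter than the rest.
import numpy as np

def split_into_segments(audio: np.ndarray, sample_rate: int,
                        segment_seconds: float) -> list[np.ndarray]:
    """Return consecutive, non-overlapping segments of the waveform."""
    segment_len = int(segment_seconds * sample_rate)
    return [audio[i:i + segment_len] for i in range(0, len(audio), segment_len)]

# A 4-minute track at 44.1 kHz splits into 19 segments of ~13.22 s each.
track = np.zeros(4 * 60 * 44100)
segments = split_into_segments(track, 44100, 13.22)
print(len(segments))  # 19
```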
Splitting into segments is not the end of the world, but it causes two main issues. First, there are sometimes boundary artifacts between segments. Second, the overall tone changes between segments. Both are significant enough to make the model unsuitable for professional use as a stem separator, in my opinion, though still good enough for hobbyists like myself.
An example of boundary artifacts is this bass fragment from the Stevie Wonder track I linked below. It’s the only clearly audible artifact in a 4-minute track (others are more subtle), but it’s quite noticeable.
One common solution to boundary artifacts is Hann windowing, which basically means giving adjacent segments a one- or two-second overlap and then crossfading between them across that overlap. I tried applying this, but it did not consistently remove the artifacts; in fact, sometimes it introduced artifacts of its own. This may improve with more experimentation, but for now I decided against using overlaps and windowing, as the artifacts are not always present and depend on what the instrument is doing.
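The crossfade idea can be sketched as follows. This is an illustrative reconstruction of the approach rather than my actual code, assuming numpy and mono audio, with the overlap length as a free parameter:

```python
# Sketch of the crossfade approach: adjacent segments share a short
# overlap, and a raised-cosine (Hann-style) ramp fades one out while
# the next fades in. numpy only; parameters are illustrative.
import numpy as np

def crossfade_join(a: np.ndarray, b: np.ndarray,
                   overlap_samples: int) -> np.ndarray:
    """Join two segments whose last/first `overlap_samples` cover the same audio."""
    # First half of a Hann window rises from 0 to 1; use it as the
    # fade-in, and its mirror image as the fade-out.
    fade_in = np.hanning(2 * overlap_samples)[:overlap_samples]
    fade_out = fade_in[::-1]
    blended = a[-overlap_samples:] * fade_out + b[:overlap_samples] * fade_in
    return np.concatenate([a[:-overlap_samples], blended, b[overlap_samples:]])

# Example: two 5-second segments at 44.1 kHz with a 1-second overlap.
sr = 44100
a = np.ones(5 * sr)
b = np.ones(5 * sr)
joined = crossfade_join(a, b, sr)
print(len(joined) / sr)  # 9.0 seconds: 5 + 5 minus the 1 s overlap
```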
Tone changes across segments occur because each segment has its own internal mix of instruments. The results of extracting the bass in, say, the intro of a song, which might consist only of kick drum and bass, versus the middle of the song, which has the whole band and maybe a string section, sound slightly different.
In this bass fragment from the D’Angelo piece below, which used 13.22s windows, there is hardly any bass in the first window, so the extracted bass changes character noticeably once the full bass line enters in the later windows.
Demucs handles all of this under the hood, so you do not notice any of the issues I mentioned. I intend to investigate their source code further to see how they manage segmentation, and whether there are tricks worth borrowing.
2. SAM Audio is not actually a stem separator, like Demucs is
As mentioned earlier, Demucs is designed to do one thing: separate a sound file into four tracks, and nothing else.
There is no explicit concept of an “other” stem in SAM Audio.
Getting creative with prompts does not help. Prompts like “other”, “everything except bass or drums”, or “bass and drums” do not work. This is not just a limitation of my local setup. The same behaviour appears when running their demo.
To obtain an “other” track, you have to chain residuals. For example, you extract the bass first and get a bass track and a residual. Then you extract the drums from that residual and get a drum track and another residual. Then you extract vocals from that second residual and get a vocal track and a final residual. That last residual is your “other.” This works in principle, but the quality degrades with each pass. At that point, it is often faster and cleaner to use Demucs.
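The chain above can be sketched like this. `extract` is a hypothetical stand-in for whatever call your SAM Audio wrapper exposes (not a real API); here it is stubbed out so the chaining logic itself is runnable:

```python
# Sketch of the residual chain described above. `extract` is a
# hypothetical placeholder for the actual SAM Audio call; the stub
# just pretends each pass pulls out half the remaining energy.
import numpy as np

def extract(mix: np.ndarray, prompt: str) -> tuple[np.ndarray, np.ndarray]:
    """Placeholder model call: returns (stem, residual)."""
    stem = mix * 0.5
    return stem, mix - stem

def chain_residuals(mix: np.ndarray, prompts: list[str]) -> dict[str, np.ndarray]:
    """Extract each prompt from the previous residual; the last residual is 'other'."""
    stems = {}
    residual = mix
    for prompt in prompts:
        stems[prompt], residual = extract(residual, prompt)
    stems["other"] = residual
    return stems

mix = np.ones(1000)
stems = chain_residuals(mix, ["bass", "drums", "vocals"])
print(sorted(stems))  # ['bass', 'drums', 'other', 'vocals']
```

By construction the four outputs sum back to the original mix, which is exactly the property the real residual chain is meant to preserve; in practice each pass also adds artifacts, which is where the quality degrades.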
Another approach I briefly toyed with was subtracting the extracted instruments from the original mix, to avoid the quality loss associated with multiple passes. This turned out to be problematic because the processed segments do not line up perfectly with the original audio. The lengths differ slightly, possibly due to padding, resampling, or other internal processing. Sample rates also play a part: the output generated by SAM Audio does not necessarily match the sample rate of the source file.
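If I revisit this, cross-correlation is the obvious tool for finding the offset between a stem and the original before subtracting. A minimal sketch on synthetic data, numpy only; note this only finds a constant lag and does not address resampling drift:

```python
# Sketch of aligning a processed stem to the original before subtraction.
# A brute-force cross-correlation over a small lag range finds the shift
# at which the two waveforms line up best. Illustrative only.
import numpy as np

def best_lag(original: np.ndarray, stem: np.ndarray, max_lag: int) -> int:
    """Return the shift (in samples) of `stem` that best matches `original`."""
    lags = range(-max_lag, max_lag + 1)
    ref = original[max_lag:-max_lag]  # trimmed so every lag stays in bounds
    scores = [np.dot(ref, stem[max_lag + k : len(stem) - max_lag + k])
              for k in lags]
    return list(lags)[int(np.argmax(scores))]

rng = np.random.default_rng(0)
original = rng.standard_normal(5000)
stem = np.roll(original, 7)          # simulate a 7-sample misalignment
print(best_lag(original, stem, 20))  # 7
```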
3. Text prompts are an illusion; SAM Audio only knows bass, drums, and vocals
One of the things that initially appealed to me about SAM Audio was the promise of free-form text prompts: describe any sound, however unusual, and the model extracts it as a stem.
In reality, the model appears to be trained on the same dataset as Demucs, and it only really understands bass, drums, and vocals. One of the developers mentioned that it can distinguish between male and female voices, but I have not tested that. Never mind surreal prompts. Even straightforward requests like “keyboard” or “guitar” result in silence or a nearly blank track.
This is all rather disappointing, though the models may improve over time.
4. It is time-consuming
Once you factor in segmentation, running each prompt three times for the main instruments, plus additional runs to obtain a usable residual, splitting a single track into stems takes hours.
For a track that is 4 minutes and 31 seconds long, the total processing time was 5 hours, 36 minutes, and 3 seconds with sam-audio-small, and 5 hours, 34 minutes, and 50 seconds with sam-audio-large. Yes, it was slightly faster with the larger model, but I do not think this difference is meaningful. It likely depends on what else the computer was doing at the time.
For comparison, Demucs processed the same track in about 20 minutes.
Where SAM Audio shines
The raw audio quality from SAM Audio is genuinely good. Bass extractions in particular come out less muffled than Demucs’, and you can clearly hear the articulation.
It also handles slap bass quite well, including popped notes that Demucs often misses. That said, there are cases where Demucs does better on drums. I have encountered examples where SAM Audio drops parts of the kit, such as snares, that Demucs keeps.
So where does that leave us?
I am a bass player, and I use stem separation to extract bass and drums, combine them into a bass-plus-drums track for transcription, and also create bassless tracks to play along to. For this use case, a combination of the two tools works well: SAM Audio for the bass and drum stems, and Demucs for everything else.
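The bass-plus-drums mixdown itself is just a sample-wise sum with a guard against clipping. A minimal sketch, assuming numpy arrays in the [-1, 1] range:

```python
# Sketch of a bass-plus-drums mixdown: sum the stems sample-wise and
# scale down only if the result would clip. numpy only; illustrative.
import numpy as np

def mix_stems(*stems: np.ndarray) -> np.ndarray:
    """Sum stems; peak-normalize only when the sum exceeds full scale."""
    mixed = np.sum(stems, axis=0)
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

bass = np.full(100, 0.8)
drums = np.full(100, 0.6)
combined = mix_stems(bass, drums)
print(combined.max())  # 1.0 after normalization (0.8 + 0.6 would clip)
```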
Beyond that, though, SAM Audio is not yet a practical replacement for a dedicated stem separator.
Some examples
Propaganda · Salami Rose Joe Louis
Here
Drums
Demucs: muffled hi-hat, but has all the parts
SAM Audio: muffled hi-hat, missing snares
Bass
Demucs: muffled
SAM Audio: decent, can hear articulation
Other
Demucs: loud and clear
SAM Audio: includes drums, and not clear
Stevie Wonder - I Wish
SAM Audio does a much better job than Demucs here, but again, the residual chain (i.e. extracting residual from residual from residual) just does not yield good enough quality. I also compared sam-audio-large and sam-audio-small on this track.
Drums
SAM Audio Large: muffled hi-hat, more presence than Demucs
Bass
Demucs: muffled
SAM Audio Small: decent
SAM Audio Large: decent, can hear articulation
Other
Demucs: decent
SAM Audio: not fit for purpose
D’Angelo - Left and Right, Live
This is a low-quality recording with a lot of audience noise.
Drums
Demucs: missing cymbals
SAM Audio: decent
Bass
Demucs: decent
SAM Audio: decent, a bit less muffled
SAM Audio: the start of the song, showing how the segmentation windows change the sound
Other
Demucs: loud and clear
SAM Audio: mangled
Got thoughts on this post? Join the conversation on Mastodon!