SWave: Improving Vocoder Efficiency by Straightening the Waveform Generation Path

Code Link.

Abstract: Diffusion model-based vocoders have exhibited outstanding performance in speech synthesis. However, because their generation paths are curved, they require many sampling steps to guarantee speech quality, which hinders their applicability in real-world scenarios. In this paper, we propose SWave, a novel vocoder based on rectified flow that improves the efficiency of speech synthesis by Straightening the Waveform generation path. Specifically, we employ rectification to transform the noise distribution into the data distribution with a probability flow that is as straight as possible. Subsequently, we use distillation to further enhance generation efficiency and fine-tuning to further enhance generation quality. Experiments on the LJSpeech dataset demonstrate that SWave is more efficient at generation than other methods such as InferGrad and WaveGrad. With a straightforward sampling schedule, SWave generates speech comparable to that of WaveGrad with significantly fewer steps (2 steps vs. 25 steps).
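To illustrate why a straight generation path permits very few sampling steps, here is a minimal sketch of Euler sampling on a uniform ("linear") time grid. It is not the SWave model: `euler_sample`, `toy_velocity`, and `target` are hypothetical names, and the toy velocity field stands in for a trained network. It assumes the idealized case in which every path from noise to data is exactly straight.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)
    with Euler steps on a uniform time grid."""
    x = np.asarray(x0, dtype=float).copy()
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x

# Hypothetical stand-in for a trained model: a field whose paths are
# perfectly straight lines from the starting noise to a fixed target.
target = np.array([0.5, -0.25, 1.0])

def toy_velocity(x, t):
    # On the line x_t = (1 - t) * x0 + t * target, the constant velocity
    # (target - x0) can be written as (target - x_t) / (1 - t).
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(3)

two_step = euler_sample(toy_velocity, noise, num_steps=2)
ten_step = euler_sample(toy_velocity, noise, num_steps=10)
```

In this idealized setting the 2-step and 10-step results both land on `target`, because Euler integration is exact along a straight line. A learned field is only approximately straight, so in practice more steps can still improve quality.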

SWave Workflow Diagram

Samples from SWave and WaveGrad:

We randomly sample three utterances: LJ010-0060, LJ008-0065 and LJ045-0158. The audio generated by SWave (SW) and WaveGrad (WG) is exhibited below; each row label gives the sampling schedule and the number of steps.
Text:
LJ010-0060: The Cato Street conspiracy would have been simply ridiculous but for the recklessness of the desperadoes who planned it.
LJ008-0065: They are both drawn at once by a windlass, and the unhappy culprits remain suspended.
LJ045-0158: He spent quite a bit of time putting away diapers and played with the children on the street.
Reference
SW (Linear 2)
SW (Linear 5)
SW (Linear 10)
WG (Grid Search 2)
WG (Grid Search 6)
WG (Fibonacci 25)

Samples from Models at Different Stages

We synthesize speech using models at different stages: post-rectification (F-step VFE), post-distillation (N-step VFE), and post-fine-tuning (N-step SWave). F is set to 1000, and N is set to 2 or 10.
Text:
LJ010-0060: The Cato Street conspiracy would have been simply ridiculous but for the recklessness of the desperadoes who planned it.
LJ008-0065: They are both drawn at once by a windlass, and the unhappy culprits remain suspended.
LJ045-0158: He spent quite a bit of time putting away diapers and played with the children on the street.
Reference
1000-step VFE (Linear 2)
2-step VFE (Linear 2)
2-step SWave (Linear 2)
1000-step VFE (Linear 10)
10-step VFE (Linear 10)
10-step SWave (Linear 10)
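The relationship between an F-step model and an N-step model can be sketched as follows. This is an assumption-laden illustration, not the training procedure of the paper, whose losses are not given on this page: a fine-grained Euler rollout plays the role of the teacher, and each coarse segment's regression target is the average velocity the teacher covered over that segment. The names `teacher_trajectory`, `distillation_targets`, and `curved_velocity` are hypothetical.

```python
import numpy as np

def teacher_trajectory(velocity, x0, fine_steps):
    """Roll out a fine-grained Euler trajectory; returns times and states."""
    ts = np.linspace(0.0, 1.0, fine_steps + 1)
    xs = [np.asarray(x0, dtype=float)]
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        xs.append(xs[-1] + (t_next - t_cur) * velocity(xs[-1], t_cur))
    return ts, np.stack(xs)

def distillation_targets(ts, xs, coarse_steps):
    """For each coarse segment, return its start time, start state, and the
    average velocity the teacher travelled with over that segment."""
    fine_steps = len(ts) - 1
    assert fine_steps % coarse_steps == 0, "coarse grid must divide fine grid"
    stride = fine_steps // coarse_steps
    t_c, x_c = ts[::stride], xs[::stride]
    dt = np.diff(t_c)[:, None]
    return t_c[:-1], x_c[:-1], np.diff(x_c, axis=0) / dt

# Hypothetical curved field standing in for a post-rectification model.
def curved_velocity(x, t):
    return np.cos(3.0 * t) * np.ones_like(x) + 0.5 * x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)

F, N = 1000, 2  # values taken from the text above
ts, xs = teacher_trajectory(curved_velocity, noise, F)
starts_t, starts_x, targets = distillation_targets(ts, xs, N)

# A student that predicts these targets exactly reproduces the teacher's
# endpoint in only N Euler steps.
x = noise.copy()
for v in targets:
    x = x + (1.0 / N) * v
student_end = x
```

A student network would be trained to predict `targets` from `(starts_x, starts_t)`; the final loop only checks that perfect predictions recover the teacher's output in N steps.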

One-step Generation from 1-step SWave

Text:
LJ010-0060: The Cato Street conspiracy would have been simply ridiculous but for the recklessness of the desperadoes who planned it.
LJ008-0065: They are both drawn at once by a windlass, and the unhappy culprits remain suspended.
LJ045-0158: He spent quite a bit of time putting away diapers and played with the children on the street.
1-step SWave (Linear 1)