Introducing Resemble Enhance: Open Source Speech Super Resolution AI Model

Dec 14, 2023

Open-Source AI-Powered Speech Enhancement 

In digital audio technology, crystal-clear sound quality is paramount, yet achieving it has remained a consistent challenge. Background noise, distortions, and bandwidth limitations can significantly hinder clarity, comprehension, and user experience. Today, we introduce Resemble Enhance, an AI-powered model designed to transform noisy audio into clear and impactful speech. Enhance improves the overall quality of speech with two modules: a sophisticated denoiser and a state-of-the-art enhancer. If you'd like to try out Enhance right now, click the link below. To learn more about use cases and the technology behind the model, continue reading!

Try Enhance Now

The Catalyst for Resemble Enhance

Current speech enhancement techniques are pivotal for a variety of applications, yet they often fall short when faced with the intricate challenges of modern sound environments. Existing methods can be limited, particularly when extracting clarity from a cacophony of background noise or when restoring historical recordings. The need for more sophisticated enhancement technology is evident across a spectrum of industries: podcast producers depend on high-quality audio to connect with their audience through crystal-clear narratives, the entertainment industry relies on immaculate audio tracks to create immersive experiences, and, perhaps most challenging of all, audio restoration works to breathe new life into archived sounds.

Dual Module Speech Enhancement

Resemble Enhance is equipped to address these diverse use cases with unprecedented precision and ease. To correct these speech imperfections, we employ the power of advanced generative models. Enhance not only cleanses audio of noise but also enriches its overall perceptual quality. It consists of two modules: a denoiser, which separates speech from noisy audio, and an enhancer, which further boosts perceptual quality by repairing audio distortions and extending the audio bandwidth. Both models are trained on high-quality 44.1 kHz speech data, which guarantees high-quality enhanced speech. Whether reviving archived audio or extracting clean speech from background music, the two models complement each other. The video below showcases the model enhancing a conversation between two people on a busy street.

 

Listen to the original audio at the start and compare it to the enhanced audio near the end.

Speech Enhancement: Denoiser

At the heart of Resemble Enhance lies a sophisticated denoiser. Think of this module as a filter that meticulously separates speech from unwanted background noise. The denoiser is a UNet model that accepts a noise-infused complex spectrogram as its input and predicts a magnitude mask and a phase rotation, effectively isolating the speech from the original audio. This methodology aligns with the one delineated in the AudioSep paper [1].
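To make the mask-and-rotation idea concrete, here is a minimal sketch (not the production model) of how the denoiser's two predicted outputs can be applied to a noisy complex spectrogram. The UNet that produces them is stubbed out, and the tensor shapes are our assumptions.

```python
import torch

def apply_mask_and_rotation(noisy_spec: torch.Tensor,
                            mag_mask: torch.Tensor,
                            phase_rot: torch.Tensor) -> torch.Tensor:
    """Recover an estimate of the clean complex spectrogram.

    noisy_spec: complex STFT of the noisy input, shape (freq, time)
    mag_mask:   predicted magnitude mask in [0, 1], same shape
    phase_rot:  predicted per-bin phase rotation in radians, same shape
    """
    # Scale each time-frequency bin's magnitude and rotate its phase.
    return noisy_spec * mag_mask * torch.exp(1j * phase_rot)
```

At inference, applying `torch.istft` to the masked-and-rotated spectrogram recovers the denoised waveform.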

Resemble Enhance: Denoised Spectrogram

Resemble Enhance: Denoiser
Learn More About Enhance

Speech Enhancement: Enhancer

The enhancer is a latent conditional flow matching (CFM) model. It consists of an Implicit Rank-Minimizing Autoencoder (IRMAE) [2] and a CFM model that predicts the latents.

Stage 1

This first stage trains an autoencoder that compresses the clean mel spectrogram $M_\text{clean}$ into a compact latent representation $Z_\text{clean}$, which is then decoded and vocoded back into a waveform. The model consists of an encoder, a decoder, and a vocoder. Both the encoder and decoder are based on residual conv1d blocks, and the vocoder is a UnivNet that incorporates the AMP block from BigVGAN.
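As a rough sketch of what such an autoencoder might look like (the layer sizes, activations, and block counts here are our own assumptions, and the vocoder is omitted), note how the narrow bottleneck plus a stack of extra linear layers provides the implicit rank minimization that gives IRMAE [2] its name:

```python
import torch
from torch import nn

class ResConv1dBlock(nn.Module):
    """Residual conv1d block; the exact recipe in Enhance may differ."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)

class MelAutoencoder(nn.Module):
    """Compress a mel spectrogram (batch, n_mels, T) to a narrow latent and back."""
    def __init__(self, n_mels: int = 128, hidden: int = 512, latent: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, 1),
            ResConv1dBlock(hidden),
            nn.Conv1d(hidden, latent, 1),
            # IRMAE trick: extra linear (1x1) layers at the bottleneck
            # implicitly minimize the rank of the latent space.
            nn.Conv1d(latent, latent, 1),
            nn.Conv1d(latent, latent, 1),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(latent, hidden, 1),
            ResConv1dBlock(hidden),
            nn.Conv1d(hidden, n_mels, 1),
        )

    def forward(self, mel: torch.Tensor):
        z = self.encoder(mel)         # Z_clean
        return self.decoder(z), z     # reconstructed mel (vocoder omitted), latent
```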

We train this module in an end-to-end manner with the GAN-based vocoder losses, including the multi-resolution STFT losses and discriminator losses, together with a mel reconstruction loss.
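The multi-resolution STFT component can be sketched as follows; the specific FFT sizes, hop lengths, and equal weighting are our assumptions, and the discriminator and mel reconstruction losses are omitted:

```python
import torch
import torch.nn.functional as F

def stft_loss(pred: torch.Tensor, target: torch.Tensor, n_fft: int) -> torch.Tensor:
    """Spectral convergence + log-magnitude loss at one STFT resolution."""
    window = torch.hann_window(n_fft, device=pred.device)
    s_pred = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                        window=window, return_complex=True).abs()
    s_true = torch.stft(target, n_fft, hop_length=n_fft // 4,
                        window=window, return_complex=True).abs()
    sc = torch.norm(s_true - s_pred) / torch.norm(s_true)   # spectral convergence
    mag = F.l1_loss(torch.log(s_pred + 1e-7), torch.log(s_true + 1e-7))
    return sc + mag

def multi_resolution_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                               ffts=(512, 1024, 2048)) -> torch.Tensor:
    # Averaging several window sizes trades off time and frequency resolution.
    return sum(stft_loss(pred, target, n) for n in ffts) / len(ffts)
```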

Implicit Rank-Minimizing Autoencoder (IRMAE)

Stage 2

After completing the training of the first stage, we freeze the IRMAE and only train the latent CFM model.

The CFM model is conditioned on a blended mel spectrogram $M_\text{blend} = \alpha M_\text{denoised} + (1 - \alpha) M_\text{noisy}$, derived from the noisy mel spectrogram $M_\text{noisy}$ and a denoised mel spectrogram $M_\text{denoised}$.

Here, $\alpha$ is the parameter that adjusts the strength of the denoiser. During training, we set $\alpha$ to follow a uniform distribution $\mathcal{U}(0, 1)$. During inference, the value of $\alpha$ can be controlled by the user.
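In code, this conditioning is a one-liner; the sketch below simply transcribes the formula above (the function and argument names are ours):

```python
import torch

def blend_condition(mel_noisy: torch.Tensor,
                    mel_denoised: torch.Tensor,
                    alpha: float | None = None) -> torch.Tensor:
    """M_blend = alpha * M_denoised + (1 - alpha) * M_noisy."""
    if alpha is None:
        alpha = torch.rand(()).item()   # training: alpha ~ U(0, 1)
    # Inference: alpha = 1.0 conditions fully on the denoised mel,
    # alpha = 0.0 on the raw noisy mel.
    return alpha * mel_denoised + (1 - alpha) * mel_noisy
```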

We load the pre-trained denoiser and train it jointly with the latent CFM model to predict the latent representation of the clean speech.

Conditional Flow Matching Diagram

The CFM model used in our work is based on a non-causal WaveNet model. To train it, we employ the I-CFM training objective [3], which enables training a model that transforms an initial point $Z_0$, drawn from a predefined probability distribution $p(Z_0)$, into a point that resembles one from the target distribution, i.e., the distribution of the clean mel latents, denoted $Z_1 \sim q(Z_1)$.
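Concretely, in the notation above, I-CFM regresses the constant velocity of a straight-line path between $Z_0$ and $Z_1$ (we omit the small path-smoothing $\sigma$ used in [3]):

$$Z_t = (1 - t)\,Z_0 + t\,Z_1, \qquad t \sim \mathcal{U}(0, 1)$$

$$\mathcal{L}_\text{CFM} = \mathbb{E}_{t,\,Z_0,\,Z_1}\left[\big\| v_\theta(Z_t, t, M_\text{blend}) - (Z_1 - Z_0) \big\|^2\right]$$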

The initial distribution $p(Z_0)$ is a blend of the noisy mel latents and noise drawn from a standard Gaussian distribution. We start by taking the latent representation of a noisy mel spectrogram, $Z_\text{noisy}$, and random Gaussian noise $\epsilon \sim \mathcal{N}(0, 1)$, then select a blending parameter $\lambda \sim \mathcal{U}(0, 1)$. $Z_0$ is then computed as $Z_0 = \lambda Z_\text{noisy} + (1 - \lambda)\epsilon$. Blending the noisy mel latents with Gaussian noise lets inference start from the noisy mel latents themselves, while the added Gaussian noise ensures that the prior space is adequately supported.
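Putting the blended prior and the I-CFM objective together, a single training step might look like the following sketch; the signature of `v_theta` and the tensor shapes are our assumptions:

```python
import torch

def icfm_training_step(v_theta, z_noisy: torch.Tensor,
                       z_clean: torch.Tensor, mel_blend: torch.Tensor) -> torch.Tensor:
    """One I-CFM step from the blended prior toward the clean mel latents.

    v_theta:   vector-field network (hypothetical signature)
    z_noisy:   Z_noisy, latents of the noisy mel, shape (batch, dim, T)
    z_clean:   Z_1, latents of the clean mel, same shape
    mel_blend: conditioning mel M_blend
    """
    b = z_clean.shape[0]
    # Prior sample: Z_0 = lambda * Z_noisy + (1 - lambda) * eps, eps ~ N(0, I).
    lam = torch.rand(b, 1, 1, device=z_clean.device)
    eps = torch.randn_like(z_noisy)
    z0 = lam * z_noisy + (1 - lam) * eps

    # Straight-line path from Z_0 to Z_1; its velocity (Z_1 - Z_0) is the target.
    t = torch.rand(b, 1, 1, device=z_clean.device)
    zt = (1 - t) * z0 + t * z_clean
    pred = v_theta(zt, t.view(b), mel_blend)
    return torch.mean((pred - (z_clean - z0)) ** 2)
```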

Below is a before-and-after look at three spectrograms featuring speech over different background audio, ranging from traffic sounds to background music.

Spectrogram Before and After Enhance

The Future of Enhance

Moving forward with the development of Resemble Enhance, our commitment lies in deploying our AI models to elevate even the most antiquated audio (think recordings from over 75 years ago) to unparalleled clarity. Although Enhance already demonstrates remarkable robustness and adaptability, efforts to accelerate processing times are ongoing, and we are dedicated to expanding the user's control over nuanced speech elements such as accentuation and rhythmic patterns. Keep an eye on this space for the latest advancements as we continue to push the boundaries of audio technology.

References

[1] https://arxiv.org/abs/2308.05037
[2] https://openreview.net/forum?id=PqvMRDCJT9t
[3] https://arxiv.org/abs/2302.00482

Try Enhance Now!
