Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Jun Wang*, Xijuan Zeng*, Chunyu Qiang*, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai
Kuaishou Technology

Abstract

We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model. Given an input video and an optional text prompt, the model synthesizes high-fidelity audio that is semantically aligned and temporally synchronized with the video content, including sound effects and background music. Notably, Kling-Foley can produce audio of arbitrary duration, adapting to the length of the input video.

Method

Architecture

The core of Kling-Foley is a multimodally conditioned flow-matching model. Text, video, and temporally extracted video frames serve as conditional inputs. These multimodal features are fused by a Multimodal Joint Conditioning module, whose output feeds into the MMDiT block. The MMDiT block predicts VAE latents, which a pretrained mel decoder then reconstructs into a monaural mel-spectrogram. A Mono2Stereo module converts the monaural spectrogram into a stereo spectrogram. Finally, a vocoder turns the stereo spectrogram into the output waveform.
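The stage order described above can be traced at the level of tensor shapes. The following is only a minimal sketch, not the authors' implementation: every function name and every constant (latent rate, latent width, mel bins, upsampling factor, hop size) is a hypothetical placeholder chosen to make the shapes concrete. It shows how each stage hands its output to the next and why the output length follows the input video length.

```python
from typing import Dict, List, Optional, Tuple

# Every constant here is a hypothetical placeholder, not a value from Kling-Foley.
LATENT_RATE_HZ = 25   # assumed latent frames per second of video
LATENT_DIM = 64       # assumed VAE latent channels
MEL_BINS = 128        # assumed number of mel bins
MEL_PER_LATENT = 4    # assumed mel frames decoded per latent frame
HOP_SAMPLES = 256     # assumed vocoder hop size in samples

Shape = Tuple[int, ...]


def condition_names(prompt: Optional[str]) -> List[str]:
    """Conditions fused by the joint conditioning module; text is optional."""
    names = ["video", "video_frames"]
    if prompt:
        names.append("text")
    return names


def predict_latents(video_seconds: float) -> Shape:
    """Stand-in for the flow-matching MMDiT: latent length tracks video length."""
    frames = max(1, round(video_seconds * LATENT_RATE_HZ))
    return (frames, LATENT_DIM)


def decode_mel(latents: Shape) -> Shape:
    """Stand-in for the pretrained mel decoder: latents -> mono mel-spectrogram."""
    return (MEL_BINS, latents[0] * MEL_PER_LATENT)


def mono_to_stereo(mel: Shape) -> Shape:
    """Stand-in for Mono2Stereo: adds a leading channel axis of size 2."""
    return (2,) + mel


def vocode(stereo_mel: Shape) -> Shape:
    """Stand-in for the vocoder: each mel frame becomes HOP_SAMPLES samples."""
    return (2, stereo_mel[-1] * HOP_SAMPLES)


def trace_pipeline(video_seconds: float, prompt: Optional[str] = None) -> Dict[str, object]:
    """Return the (assumed) shape produced by each stage, in pipeline order."""
    latents = predict_latents(video_seconds)
    mono_mel = decode_mel(latents)
    stereo_mel = mono_to_stereo(mono_mel)
    waveform = vocode(stereo_mel)
    return {
        "conditions": condition_names(prompt),
        "latents": latents,
        "mono_mel": mono_mel,
        "stereo_mel": stereo_mel,
        "waveform": waveform,
    }


if __name__ == "__main__":
    for stage, value in trace_pipeline(2.0, prompt="a tiger roaring").items():
        print(f"{stage:>10}: {value}")
```

Because the latent length in this sketch is derived from the video duration, doubling the duration doubles the number of output samples, which mirrors the arbitrary-duration behaviour described in the abstract.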

Kling-Foley Demo Gallery

Animal Sound Effect

The sound of a tiger roaring.

A dinosaur running forward, the sound of its mighty roar.

The sound of a Dragon-Transformer clashing in battle.

The sound of a puppy eating ice cream.

Object Sound Effect

The sound of a submachine gun firing.

The sound of a chattering submachine gun against unearthly howls.

The sound of punching a car.

The sound of an electronic pulse from a sound wave detector.

Tool Sound Effect

The sound of a blacksmith’s hammer on the anvil.

The sound of a lumberjack working a chainsaw through a tree trunk.

The sound of a construction worker wielding a sledgehammer to demolish a wall.

Natural Sound Effect

The sound of a mountain collapsing.

The sound of a volcanic eruption.

The sound of frantic footsteps pounding through debris as a collapsing building roars overhead.

Traffic Sound Effect

The sound of a racing car accelerating.

The sound of a motorcycle growling through a barrage of gunfire.

Music Sound Effect

The sound of singing to music.

The sound of cheering to dance music.

The sound of playing the violin.

Voice Sound Effect

Footsteps, sound of people talking.

The sound of a man’s howling, hysterical laughter.

The soundless joy of a baby’s beaming smile.