We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model. Taking an input video and an optional text prompt, the model synthesizes high-fidelity audio that is semantically aligned and temporally synchronized with the video content, encompassing elements such as sound effects and background music. Significantly, Kling-Foley can produce audio sequences of arbitrary duration, dynamically adapting to the length of the input video.
The core of Kling-Foley is a flow-matching model with multimodal control. Text, video, and temporally extracted video frames serve as conditional inputs. Their features are fused by a Multimodal Joint Conditioning module, whose output feeds into the MMDiT block. The MMDiT block predicts VAE latents, which a pretrained mel decoder then reconstructs into a monaural mel-spectrogram. A Mono2Stereo module converts the monaural spectrogram into a stereo spectrogram. Finally, the stereo spectrogram is passed through a vocoder to generate the output waveform.
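The data flow described above can be traced with a small NumPy sketch. It is not the Kling-Foley implementation: every function name, dimension, and rate below is a hypothetical stand-in chosen for illustration, the learned networks are replaced by fixed random projections, and the vocoder stage is omitted. The sketch shows only two ideas from the text: the latent sequence length follows the video duration, and flow-matching sampling integrates a predicted velocity field from noise (t = 0) to latents (t = 1), here with plain Euler steps and an idealized linear-path velocity.

```python
import math
import numpy as np

# All constants are illustrative assumptions, not values from Kling-Foley.
LATENT_RATE_HZ = 25   # hypothetical latent frames per second of video
LATENT_DIM = 8        # hypothetical VAE latent width
COND_DIM = 16         # hypothetical conditioning feature width
N_MELS = 64           # hypothetical number of mel bins
MEL_PER_LATENT = 4    # hypothetical mel frames decoded per latent frame

def fuse_conditions(text_emb, video_emb, frame_embs):
    """Stand-in for joint conditioning: add global text/video features
    to every frame-level feature. Shapes: (D,), (D,), (T, D) -> (T, D)."""
    return frame_embs + (text_emb + video_emb)[None, :]

def predict_velocity(x, t, cond, w):
    """Stand-in for the MMDiT block. Returns the velocity of a straight
    path from x to a condition-dependent target (idealized, not learned)."""
    target = np.tanh(cond @ w)
    return (target - x) / (1.0 - t)

def sample_latents(cond, w, steps=16, seed=0):
    """Euler integration of dx/dt = v(x, t, cond) from t=0 (noise) to t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((cond.shape[0], w.shape[1]))
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * predict_velocity(x, i * dt, cond, w)
    return x

def decode_mel(latents, proj):
    """Stand-in mel decoder: upsample in time, then project to mel bins."""
    return np.repeat(latents, MEL_PER_LATENT, axis=0) @ proj

def mono_to_stereo(mel):
    """Placeholder for Mono2Stereo: duplicates the mono channel."""
    return np.stack([mel, mel], axis=0)

def generate_stereo_mel(duration_s, seed=0):
    """Run the stub pipeline for a video of the given duration (seconds)."""
    rng = np.random.default_rng(seed)
    n_latent = math.ceil(duration_s * LATENT_RATE_HZ)  # length tracks video
    text_emb = rng.standard_normal(COND_DIM)
    video_emb = rng.standard_normal(COND_DIM)
    frame_embs = rng.standard_normal((n_latent, COND_DIM))
    w = rng.standard_normal((COND_DIM, LATENT_DIM)) * 0.1
    proj = rng.standard_normal((LATENT_DIM, N_MELS)) * 0.1
    cond = fuse_conditions(text_emb, video_emb, frame_embs)
    latents = sample_latents(cond, w, seed=seed)
    return mono_to_stereo(decode_mel(latents, proj))
```

Calling `generate_stereo_mel(2.0)` and `generate_stereo_mel(4.0)` returns arrays whose time axis doubles with the duration, which mirrors how the output length adapts to the input video. In a real system the velocity would come from the trained network rather than a closed-form target.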
The sound of a tiger roaring.
A dinosaur running forward, the sound of its mighty roar.
The sound of a Dragon-Transformer clashing in battle.
The sound of a puppy eating ice cream.
The sound of a submachine gun firing.
The sound of a chattering submachine gun against unearthly howls.
The sound of punching a car.
The sound of an electronic pulse from a sound wave detector.
The sound of a blacksmith’s hammer on the anvil.
The sound of a lumberjack working a chainsaw through a tree trunk.
The sound of a construction worker wielding a sledgehammer to demolish a wall.
The sound of a mountain collapsing.
The sound of a volcanic eruption.
The sound of frantic footsteps pounding through debris as a collapsing building roars overhead.
The sound of a racing car accelerating.
The sound of a motorcycle growling through a barrage of gunfire.
The sound of singing to music.
The sound of cheering to dance music.
The sound of playing the violin.
The sound of footsteps and people talking.
The sound of a man’s howling, hysterical laughter.
The soundless joy of a baby's beaming smile.