Kling 2.6 Pro Prompt Guide

Learn how to write effective prompts for native audio-visual creation with Kling 2.6 Pro.

1 Welcome to KLING VIDEO 2.6: "See the Sound, Hear the Visual"

Previously, KLING's video models could only generate "silent visuals". Creators had to manually find voiceovers, add sound effects, and adjust the pace—an overly complex process that made it hard to achieve true immersion.

Now the all-new VIDEO 2.6 model is available. It generates visuals, natural voiceovers, matching sound effects, and ambient atmosphere in a single pass, bridging the worlds of "sound" and "visuals". Whether you start from text or from an uploaded image, you get a complete video with sound, rhythm, and immersion, with no tedious editing afterwards. Compared to previous "silent" models, VIDEO 2.6 offers a comprehensive upgrade:

No more "silent films": Visuals, voice, and sound effects are generated together, with seamless integration of camera rhythm and emotional tone, transforming content from "viewable" to "immersive".
Full control over sound: Choose who speaks, what they say, and the emotion behind it. Generate ambient and special effects sounds freely, adjusting the pace and atmosphere to fit various creative needs.
Effortless creation for beginners: No complex operations required. Just input text or images, and the system handles the sound and visual details automatically. Ideal for content creators and small studios that need to produce professional videos quickly.

2 KLING's First "Native Audio" Model is Now Live!

With the VIDEO 2.6 Model, we are introducing the "Native Audio" feature for the first time: a single generation that simultaneously produces video visuals and complete audio, including voiceovers, sound effects, and ambient sounds. This feature achieves seamless coordination in rhythm, emotion, and narrative expression, delivering a true "see what you hear" audio-visual experience.

This upgrade focuses on:

Audio-Visual Coordination: Voice rhythm, ambient sounds, and visual actions are closely aligned, eliminating the disconnect between visuals and separately produced audio.
Audio Quality: Supports sound types such as voice, sound effects, and ambient sounds, with cleaner sound quality and richer layers that closely resemble a real audio mix.
Semantic Understanding: Strong comprehension of text descriptions, spoken language, and complex storylines across different contexts, so the creator's intent is interpreted more accurately.

For the creation process, KLING 2.6 provides two creation paths, one starting from text and one starting from an image, both built around the core need of fast audio-video content generation.

3 What Can the VIDEO 2.6 Model Do?

VIDEO 2.6 supports various sound types, including speech, dialogue, narration, singing, rap, ambient sound effects, and mixed sound effects. Below, we outline the model's capabilities to help you quickly understand its creative potential.

3.1 Solo Monologue

Capability: The character speaks directly to the camera with natural emotion and synchronised lip movements.

Applicable Scenarios: Product showcases, lifestyle vlogs, news broadcasts, public speaking.

Product Showcase

Display products and highlight key selling points. Clear speech, a natural tone, and a voice that matches the product's atmosphere are key.

In a beauty live-streaming room, warm yellow lighting illuminates the table, with lipstick samples displayed on either side. [Caucasian beauty influencer] raises a matte dusty rose lipstick. [Caucasian beauty influencer, sweet and fresh voice] says: "Perfect for yellow undertones! Brightens the complexion without drying, and the finish looks beautifully soft all day." Background: Soft beauty BGM playing.

Lifestyle Vlog

Showcasing easy, natural moments from daily life.

On the beach, the waves crash against the shore. [Young Caucasian male] wearing a backward baseball cap, holding a camera and taking a selfie, with a smile at the corner of his mouth. [Young Caucasian male, sunny voice] says: "The weather is amazing today! All my worries feel totally gone. I've been needing a day like this—sun, breeze, just the sound of the waves." The camera is in vlog close-up style.

News Reporting

Emphasizes professionalism, formality, and a stable tone.

Visual: In front of an outdoor shopping mall, a crowd gathers, cheering. Dialog: [African-American male reporter] stands next to the crowd, holding a microphone, his body slightly turned. [African-American male reporter, steady voice] says: "Now we can see the atmosphere here is absolutely electric. Let's go check it out together! There's so much happening all at once." Background: Cheerful crowd noises and event BGM, with occasional close-ups of the event.

Public Speaking

Shows strong, persuasive delivery.

The main venue of an international tech summit, with delegates from various countries filling the seats. [Indian entrepreneur] stands at the center of the stage. [Indian entrepreneur] gazes steadily at the audience, his hands naturally hanging by his sides. [Indian entrepreneur, loud voice] says: "A decade ago, the world saw India through call centers." After a brief pause, he extends his hands upward. [Indian entrepreneur, passionate voice] says: "Now, Indian innovation is reshaping the world with tech!" The camera slowly zooms in on the Indian entrepreneur's face, and as he finishes his speech, he joins his hands together in a prayer gesture. The audience bursts into applause.

3.2 Narration

Capability: Off-screen voice narrating, explaining, or commenting on visuals.

Applicable Scenarios: Product explanations, event commentary, documentaries, storytelling.

Product Explanation

Static visuals + professional narration, ideal for e-commerce videos.

Visual: In a tidy living room, a white robotic vacuum sits in the center, with no clutter around it.

Dialog: [Narrator, soft female voice] accompanied by the gentle sound of vacuuming: "Are you still troubled by dust in hard-to-reach corners? This robotic vacuum features edge-to-edge cleaning, leaving no gaps behind—making your life easier and effortless!" The camera closely follows the vacuum's path as it cleans.

Event Commentary

Requires dynamic pacing and event atmosphere.

Visual: At the World Cup final, the lights are dazzling, and the stands are roaring with excitement.

Dialog: (No characters, just narration) [Narrator, excited male voice] as the ball hits the net: "The game is over!" Background: Fans erupt in cheers, and the camera captures the moment the ball enters the net from the goalkeeper's perspective.

3.3 Multi-Character Dialogue

Capability: Interaction between multiple characters with natural tone switching.

Applicable Scenarios: Interviews, scripted performances, casual dialogue, comedy skits.

Interview Show

Visual: A modern industrial-style recording studio with brick walls covered in soundproof panels, equipment neatly arranged.

Dialog: [Caucasian male host] sits in front of the microphone, slightly leaning forward. [Caucasian male host, steady voice] says: "Today we're excited to have Dr. Sarah Miller from Stanford AI Lab. Sarah, your research on neural networks is groundbreaking." During this, [African-American female guest] remains silent. Immediately, [African-American female guest] raises her chin slightly, holding the microphone. [African-American female guest, gentle voice] says: "Thank you for having me." During this, [Caucasian male host] remains silent.

Scripted Performance (Short Play)

Visual: A dimly lit casino VIP room with a green-felt poker table at the center, surrounded by swirling smoke. Wall lamps cast warm, silhouetted glows.

Dialog: [Man in suit, elbows on the table leaning forward, deep male voice]: "Three rounds to decide. Win, and all the chips are yours. Lose, and tell me the real reason you're getting close to him." [Woman with curly hair, fingers gently tracing the edge of the table, red lips curling into a faint smile, cool and glamorous female voice]: "I don't care about the chips."

3.4 Music Performance

Singing

Visual: A sunlit garden path, with daisies in full bloom and butterflies fluttering gently.

Dialog: [Asian woman] walks slowly with loose braids, her floral dress brushing against the daisies. [Asian woman, gentle voice] sings: "In this tranquil morning, I've found my way. With dreams in my heart, there's light in my days." The Asian woman reaches out to brush past the flowers, startling a white butterfly into flight.

Rap

Visual: Brooklyn, New York – in front of a graffiti-covered wall, the street vibe is intense, with breakdancers freestyling nearby.

Dialog: An African-American rapper wearing a gold chain and an oversized hoodie, grooving to the beat while facing the camera. [African-American rapper, energetic male voice] Rapping over a drum beat: "Yeah, from the bottom to the top, I'm shining bright like a star. Brooklyn streets raised me tough, fought through the dark. Gold chain swingin', flow hits hard, grindin' daily, never bored. Now I'm livin' in the light, this is my life, raw and hardcore!" Background: Layered with deep bass and turntable scratches. Camera cuts rapidly between close-ups of his facial expressions, hand gestures, and the breakdancers.

Group

Visual: In a bright rehearsal room, sunlight streams through the window, and a standing microphone is placed in the center of the room.

Dialog: [Campus band female lead singer] stands in front of the microphone with her eyes closed, while the other members stand around her. [Campus band female lead singer, full voice] leads: "I will try to fix you, with all my heart and soul..." The background is an a cappella harmony, and the camera slowly circles around the band members.

Instrumental Performance

Visual: In a traditional study room, a scroll hangs on the wall, and a guqin rests on the desk, bathed in soft light.

Dialog: [Scholar] sits calmly at the desk, gently plucking the strings of the guqin with his fingertips, his expression serene. Background: The sound of the scroll turning and the melody of the guqin. The camera focuses on the scholar's fingers as he plucks the strings.

3.5 Creative Scene

Visual Effects

Visual: In a cozy living room, the firewood is burning in the fireplace, and the sofa is placed next to a coffee table.

Dialog: [Male protagonist] enters the living room and speaks. [Male protagonist, gentle voice]: "Babe, taking a break from work?" During this, [Female protagonist] remains silent and smiles, nodding. Immediately, the male protagonist walks over to the sofa, gently sets down his cup, and reaches out to ruffle the female protagonist's hair. The camera focuses on their interaction.

Life Scene Atmosphere

Visual: [Ginger cat] lies on the windowsill.

Dialog: The [ginger cat] breathes slowly, with background sounds of distant birds and rustling leaves. The camera focuses on the light spots shifting with the cat's breath.

ASMR

Visual: In the library's restoration room at night, a warm desk lamp illuminates ancient books, and the restorer wears white gloves.

Dialog: Bringing the brush closer to the microphone. [Book restorer, whispering voice]: "These pages have been asleep for two hundred years. Today, we wake them gently." Background: The soft rustling of book pages, with the camera focusing on the cleaning motion.

Creative Ads / Material

Visual: In a product display scene, with a simple, bright background, a [raisin] is placed in the center.

Dialog: [Raisin] twists and is hydrated, transforming into a plump green grape. [Off-screen voice, crisp female voice]: "Don't want to end up shriveled like I was? Hydrating face cream quenches your skin's thirst and turns back time."

Supported Audio Types

Audio Type	Description
Voice Narration	Character voice narration
Dialogue	Multi-person voice dialogue
Singing/Rap	Characters singing or rapping with lyrics
Ambient Sound Effects	Background sounds like wind, ocean waves, street noise, traffic
Object/Action Sound Effects	Sounds like glass breaking, footsteps, knife slicing, machine rumble
Mixed Sound Effects	A combination of voice, background sounds, and sound effects for an immersive audio-visual experience.

4 How to Write Effective Prompts

When using the VIDEO 2.6 model, simply write "[the scene you want to see] + [the action that happens] + [the sound you want to hear]" to generate high-quality audio-visual output.

Prompt format formula:

Prompt = Scene (Scene Description) + Element (Subject Description) + Movement (Movement Description) + Audio (Dialogue / Singing / Sound Effects / Pure Music) + Other (Style / Emotion / Camera)
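As a rough illustration of the formula, the components can be assembled programmatically. The sketch below is a minimal Python example: `PromptParts` and `build_prompt` are hypothetical helper names, not part of any Kling tool or API, and the result is only a plain-text prompt string. The sample values are adapted from the Product Explanation example in section 3.2.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """One field per term of the formula (hypothetical helper, not a Kling API)."""
    scene: str       # Scene description
    element: str     # Subject description
    movement: str    # Movement description
    audio: str       # Dialogue / singing / sound effects / pure music
    other: str = ""  # Style / emotion / camera (optional)


def build_prompt(parts: PromptParts) -> str:
    """Join the non-empty components in formula order, one sentence each."""
    ordered = [parts.scene, parts.element, parts.movement, parts.audio, parts.other]
    sentences = []
    for text in ordered:
        text = text.strip()
        if not text:
            continue  # skip components the caller left empty
        if text[-1] not in '.!?"':
            text += "."  # close the component so it reads as its own sentence
        sentences.append(text)
    return " ".join(sentences)


example = PromptParts(
    scene="A tidy living room with no clutter",
    element="A white robotic vacuum sits in the center of the room",
    movement="The vacuum cleans along the edge of the wall",
    audio='[Narrator, soft female voice]: "This robotic vacuum features edge-to-edge cleaning."',
    other="The camera closely follows the vacuum's path",
)
print(build_prompt(example))
```

Keeping each component in its own field makes it easy to vary one term, for example the audio, while holding the scene constant between generations.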

Audio components:

Dialogue: "Sentence" + Emotion + Speech Speed + Tone + Character Label. Single character: e.g. [Man speaking] "Sentence" + Deep + Fast. Multiple characters: use clear labels, e.g. [Character A, angrily] says "Sentence", [Character B, calmly] replies "Sentence".
Singing: "Lyrics" + Singing Style + Accompaniment + Emotion. Styles: Pop, Opera, Country. Emotion/techniques: High-pitched, Vibrato, Gentle singing.
Rap: "Sentence (Rhyming)" + Rhythm Style + Emotion. Rhythm: Intense Boom Bap, Trap Style Beat, Fast Flow. Content should show rhyme and meter.
Sound Effects: Sound Source (Action/Object) + State + Professional SFX. E.g. [Object: Wooden Door] suddenly [Action: Slams] + [Sound Effect: Bang]. Material/state: Glass Breaking, Metal Impact, Screeching Brakes.
Ambient Sound: Scene + Sound Elements + Spatial Reverb. Elements: Rain, Insects, Crowd Murmurs, Traffic. Spatial: Echo in Open Hall, Small Room Acoustics.
Pure Music: Instrument Type + Music Genre + Emotion. E.g. Piano Performance + Jazz + Melancholy. Genres: Classical, Rock, Electronic.

Tip: Use quotation marks " " to clarify sound content when writing prompts.
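The dialogue pattern and the quotation-mark tip can be combined in a small formatter. This is a sketch only: `dialogue_line` is a hypothetical helper, and the `[Label, voice] says: "..."` layout simply mirrors the examples in section 3 rather than a documented syntax requirement.

```python
def dialogue_line(label: str, voice: str, sentence: str, verb: str = "says") -> str:
    """Format one spoken line as: [Label, voice] verb: "sentence"."""
    # Remove any quotes the caller already added, then wrap once in straight quotes
    # so the spoken content is clearly delimited.
    spoken = sentence.strip().strip('"“”')
    # The voice description is optional; omit the comma when it is empty.
    tag = f"[{label}, {voice}]" if voice else f"[{label}]"
    return f'{tag} {verb}: "{spoken}"'


print(dialogue_line("Narrator", "soft female voice", "Are you still troubled by dust?"))
print(dialogue_line("Asian woman", "gentle voice", "In this tranquil morning, I've found my way.", verb="sings"))
```

Passing `verb="sings"` or `verb="raps"` reuses the same layout for the Singing and Rap components.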

4.1 Key Tutorial: Multi-Character Dialogue Prompt Examples and Guidelines

Guidelines	Core Principles	Prompt Guidelines and Examples	Incorrect Example (Prone to Model Failure)
P1. Structured Naming	Character labels must be unique and consistent.	Use [Character A: Black-suited Agent] and [Character B: Female Assistant]. Avoid pronouns or synonyms.	[Agent] says... Then, he says...
P2. Visual Anchoring	Bind dialogue to the character's unique action.	First describe the action, then the dialogue: The black-suited agent slams his hand on the table. [Black-suited Agent, angrily shouting]: "Where is the truth?"	[Black-suited Agent]: "Where is the truth?" (The model won't know who slammed the table)
P3. Audio Details	Assign unique tone and emotion labels per character.	[Black-suited Agent, raspy, deep voice]: "Don't move." [Female Assistant, clear, fearful voice]: "I'm scared."	[Man] says... [Woman] says... (Voice characteristics too vague, can confuse the model)
P4. Temporal Control	Use clear linking words to control dialogue sequence and rhythm.	...[Black-suited Agent]: "Why?" Immediately, [Female Assistant]: "Because it's time." (Optional: insert "this is when the speaker switches" between the two.)	[Black-suited Agent]: "Why?" [Female Assistant]: "Because it's time." (The model may generate continuous speech from one character)
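The four guidelines can be applied mechanically when a dialogue is assembled from structured data. The following is a minimal sketch under stated assumptions: `Turn`, `check_turns`, and `build_dialogue` are hypothetical names, the action "steps back" is invented for illustration, and the check covers only one failure mode (two characters sharing a voice description).

```python
from dataclasses import dataclass


@dataclass
class Turn:
    label: str   # P1: reuse the exact same label for a character every time
    action: str  # P2: visible action performed just before speaking
    voice: str   # P3: tone and emotion specific to this character
    line: str    # spoken sentence, without surrounding quotes


def check_turns(turns: list[Turn]) -> None:
    """Raise ValueError when two different labels share one voice description (P3)."""
    owner = {}  # voice description -> label that first used it
    for t in turns:
        key = t.voice.strip().lower()
        if owner.setdefault(key, t.label) != t.label:
            raise ValueError(f"{t.label!r} and {owner[key]!r} share the voice {t.voice!r}")


def build_dialogue(turns: list[Turn], link: str = "Immediately,") -> str:
    """Render the turns in order as one prompt fragment."""
    check_turns(turns)
    segments = []
    for i, t in enumerate(turns):
        # P2: action first, then the line, both tagged with the same label (P1).
        segment = f'[{t.label}] {t.action.rstrip(".")}. [{t.label}, {t.voice}]: "{t.line}"'
        if i > 0:
            segment = f"{link} {segment}"  # P4: explicit linking word before every later turn
        segments.append(segment)
    return " ".join(segments)


turns = [
    Turn("Black-suited Agent", "slams his hand on the table", "raspy, deep voice", "Where is the truth?"),
    Turn("Female Assistant", "steps back", "clear, fearful voice", "I'm scared."),
]
print(build_dialogue(turns))
```

Because the label is stored once per turn and reused for both the action and the line, a pronoun or synonym cannot slip in between them.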

4.2 Common Audio Trigger Words

Reference tables for trigger words by audio type. Use these in prompts to control speech, singing, rap, SFX, and ambient sound.

Speech

Category	Trigger Words	Examples
Core Speech	Speaking / Talking	A woman sitting at a desk, calmly speaking into a microphone.
	Asking / Querying	A curious boy in the garden asking his father a question.
	Telling / Narrating	An old man by the fireplace, slowly telling a story.
	Explaining	A tour guide pointing at a map, clearly explaining the route.
Volume/Clarity	Whispering	Two friends leaning in close in a crowded room, whispering a secret.
	Softly Speaking	A student in the quiet library softly speaking on the phone.
	Clearly Speaking / Crisp Voice	A radio announcer with a clear voice speaking the news.
Emotion/Tone	Excitedly Speaking	The award winner holding a trophy, excitedly speaking their acceptance speech.
	Complaining	A customer at the counter complaining about poor service.
	Sighing	A tired worker by a window, letting out a heavy sighing sound.
	Gently Speaking	A mother rocking a baby, gently speaking a lullaby.
Vocal Quality	Hoarse Voice	A patient waking up, requesting help with a hoarse voice.
	Deep Voice	A middle-aged man telling a scary story in a deep voice.

Pace/Rhythm, Performance & Dialogue

Category	Trigger Words	Examples
Pace/Rhythm	Fast Talking / Rapid Speech	A fast-talking salesperson rapidly describing product features.
	Slow Talking	An old professor slow talking while elaborating on a complex theory.
Performance	Reciting / Reading Aloud	A poet on stage, reciting a dramatic poem.
	Monologue	An actor alone on stage, performing a sad monologue.
	Narration / Voiceover	A film scene with deep narration in the background.
Dialogue: Interaction	Answering / Responding	The interviewee answering the question immediately.
	Arguing / Quarrelling	A couple in the kitchen, arguing loudly.
	Shouting / Yelling	A father at the door shouting at his children playing outside.
	Discussing	A group of students around a table, discussing a difficult problem.
Vocal Action	Crying / Sobbing	A little girl on the ground crying after falling down.
	Screaming	A woman seeing a mouse, letting out a sharp screaming sound.
	Laughing / Chuckling	Three people sharing a joke and laughing loudly.

Singing & Rap

Category	Trigger Words	Examples
Singing: Core Form	A Cappella	A singer performing a cappella on an empty stage.
	Humming	A chef in the kitchen happily humming.
	Loud Singing	A rock musician on a mountaintop singing loudly.
Singing: Technique/Style	Bel Canto / Opera	A soprano in formal dress performing an opera piece.
	Pop Vocals	A young artist in the studio recording a pop song.
	Vibrato	A singer adding beautiful vibrato on a high note.
	Falsetto	A male singer hitting a very high note in falsetto.
	Harmony / Layered Vocals	A quartet performing a harmony section.
Rap: Terminology	Rapping / Hip-Hop	A street artist rapping under neon lights.
	Flow / Rhyme	A rapper performing with smooth flow and tight rhymes.
	Fast Rap / Rapid Delivery	A high-speed, rapid-fire rap verse in a song.
	Strong Rhythm / Heavy Beat	A hip-hop track with strong rhythm and heavy beat.

Sound Effects (SFX)

Category	Type	Example
Daily Actions	Tapping / Knocking	A carpenter tapping a nail with a hammer.
	Footsteps	Slow, heavy footsteps in an empty hallway.
	Chewing / Munching	A person chewing on crunchy chips.
Material Impact	Glass Shattering	A rock hitting a window, followed by glass shattering.
	Metal Clanging	Two large iron blocks clanging in a factory.
	Friction / Rubbing	Two pieces of rough fabric rubbing together.
Natural Elements	Thunder	Lightning flash, followed by low thunder rumble.
	Fire Crackling	A campfire crackling and burning brightly.
	Bubbling / Gurgling	Hot soup on the stove bubbling as it heats up.
Mechanical Noise	Alarm / Siren	A police car at night, its siren wailing.
	Braking	A car making an emergency stop with screeching brakes.
	Gears Whirring	An old clock's subtle gears whirring.
Musical Instruments	Piano Music	A pianist playing classical piano in a concert hall.
	Guitar Plucking	A street artist gently plucking a guitar string.

Ambient Soundscapes

Category	Type	Example
Urban	Traffic Noise / Car Flow	Continuous traffic at a busy intersection.
	Crowd Murmur	Background crowd murmur in a museum.
	Subway Noise	Subway noise as a train arrives and departs.
	Construction Noise	Distant, persistent construction noise in the city.
Nature	Ocean Waves	Soothing ocean waves hitting the beach in the morning.
	Bird Chirping	Various birds chirping in a morning forest.
	Wind Sound (Nature)	Wind blowing across an open field.
	Rainforest	Hot, humid rainforest with bird calls and dripping water.
Indoor Space	Library Silence	Deep library silence with an occasional book drop.
	Café Background Music	Casual café background music with quiet chatter.
	Air Conditioner Hum	Steady, low AC hum in a quiet office.
	Fireplace Burning	Warm, comforting fireplace sound in a winter cabin.

5 Kling VIDEO 2.6: Voice Control for Image to Audio & Video

5.1 Feature Overview

When creating content across multiple videos or characters, you may face inconsistent voices or a lack of personalisation. Kling VIDEO 2.6 introduces a new "Voice Control" feature to address this: provide visual input + voice prompt + target voice, and high-quality audio-visual content can be generated in seconds with an easy, worry-free workflow.

Stable, High-Fidelity Voice Output: Voice remains consistent throughout the video, accurately preserving the target timbre. Ideal for IP characters, brand personas, and long-term voice consistency for recurring roles.
Flexible Style Adaptation: A single voice can be applied seamlessly to multiple scenarios—e.g. narration, dialogue, or speech—with automatic adaptation of tone, rhythm, and expression to match the context.
Natural Cross-Language Performance: No extra setup needed. A voice trained in one language can deliver naturally in another, with fluent pronunciation and consistent expression. Bidirectional Chinese and English (Chinese ↔ English) is currently supported.
Prompt-Based Voice Binding: With a simple prompt such as [Character@VoiceName], the model automatically binds the voice to that character, making it easy to create multi-character dialogue with distinct voices.
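For scripts with several recurring characters, it can help to generate and verify the binding tags before submitting a prompt. The sketch below assumes the tag looks exactly like the `[Character@VoiceName]` pattern quoted above; `bind_voice`, `find_bindings`, and the voice names `WarmBaritone` and `BrightAlto` are hypothetical, and the platform may accept or require a different form.

```python
import re

# Matches [Character@VoiceName]; the exact syntax is assumed from the single example above.
BINDING = re.compile(r"\[([^\[\]@]+)@([^\[\]@]+)\]")


def bind_voice(character: str, voice_name: str) -> str:
    """Return the binding tag for one character."""
    return f"[{character}@{voice_name}]"


def find_bindings(prompt: str) -> dict[str, str]:
    """Map each bound character to its voice; reject conflicting bindings."""
    bindings = {}
    for character, voice in BINDING.findall(prompt):
        character, voice = character.strip(), voice.strip()
        # A character bound to two different voices would defeat voice consistency.
        if bindings.setdefault(character, voice) != voice:
            raise ValueError(f"{character!r} is bound to both {bindings[character]!r} and {voice!r}")
    return bindings


prompt = (
    f'{bind_voice("Host", "WarmBaritone")} says: "Welcome back." '
    f'Immediately, {bind_voice("Guest", "BrightAlto")} says: "Glad to be here."'
)
print(find_bindings(prompt))
```

Running the check over every prompt in a series is one way to confirm that a recurring character keeps the same target voice from video to video.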

7 FAQ

Q: What languages does the current model support for voice output?

The model currently supports voice output in Chinese and English. Input in other languages is automatically translated into English, and the corresponding audio is generated in English. Support for additional languages is being expanded; stay tuned!

Q: Can I generate audio only, without video?

Yes! Go to the platform's "[Sound Effect Generation]" module. There you can: (1) input text to generate standalone audio, or (2) upload a video to extract sound effects. These options let you create pure audio content without generating a video.

Q: How can I improve generation results?

Optimise your prompt: Keep descriptions clear and specific (scene, sound type, style, etc.). Avoid overloading the prompt with too many complex instructions; describe each element separately.
Enhance image-text alignment: If using reference images, ensure the image content matches the text (e.g. for "outdoor camping", avoid indoor photos to prevent conflicting information).
Set parameters accurately: Adjust video duration, resolution, and other settings to match your needs; avoid relying on defaults if they don't meet your expectations.
Simplify the creation scene: Focus on one core theme per creation. Avoid stacking too many elements (e.g. multiple ambient sounds + complex speech) so the model can produce more stable results.