Grok Imagine generated 1.245 billion videos in the 30 days leading up to its 1.0 release in early February 2026. That release bumped the model to 720p, 10-second clips, and dramatically better native audio. With OpenAI’s Sora 2 winding down by September 24, 2026, Grok Imagine now sits next to Google Veo 3.1 and Kling 3.0 as the dominant video APIs going forward, and it is the cheapest of the three.
Curious how Grok’s chatbot stacks up against ChatGPT for everyday tasks? See our full Grok vs ChatGPT comparison.
What separates a usable Grok Imagine clip from a wasted credit is almost always the prompt. This guide is the working prompt manual for Grok Imagine video generation: the formula that produces clean output, 20 ready-to-paste examples across the use cases that actually matter, the anti-patterns that wreck generations, the xAI API specs and per-clip cost math, and what to do when a clip comes back blurry or refused.
- Grok Imagine generates 1–15 second clips at 480p or 720p with native audio (music, SFX, dialogue, lip-sync) on the grok-imagine-video API.
- The xAI API charges $0.05 per second, which is $3.00 per minute at that rate. A 6-second clip costs $0.30, a 15-second clip $0.75, native audio included.
- The prompt formula that works in 90% of cases: [Subject] + [Action] + [Environment] + [Style] + [Camera and lighting].
- One subject, one action, one camera move per prompt. Multiple competing instructions split the model’s attention and produce visual mush.
- Three failure modes account for almost every bad clip: prompt overload, content-policy refusals, and the hard 15-second length cap.
Here is what the model actually outputs as of May 2026.
| Spec | Consumer UI | xAI API |
|---|---|---|
| Duration | 6s (free, Lite), 10s (SuperGrok and above) | 1–15 seconds, edit mode capped at 8.7s |
| Resolution | 720p (default on paid tiers), 480p (Lite) | 480p (default) or 720p |
| Aspect ratios | 1:1, 16:9, 9:16, 4:3 | 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3 |
| Frame rate | 24 fps | 24 fps |
| Audio | Native, automatic | Native, automatic |
| Variations per prompt | 4 | Configurable |
| Speed | Roughly 17 seconds per clip | Up to a few minutes for 15s 720p |
xAI’s documentation lists the 1–15 second range and the two resolution tiers explicitly. Some third-party blogs claim 1080p output, but that is not on the official xAI spec sheet at the time of writing. Stick to 720p as the real ceiling.
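To keep a request inside those limits before spending a credit, a small pre-flight check can help. This is a minimal sketch: the limits are copied from the xAI API column of the table above, and the function name `check_clip_settings` is made up for illustration, not part of any xAI SDK.

```python
# Limits copied from the xAI API column of the spec table in this article.
API_MIN_SECONDS = 1
API_MAX_SECONDS = 15
API_RESOLUTIONS = {"480p", "720p"}
API_ASPECT_RATIOS = {"1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3"}


def check_clip_settings(duration_s, resolution="480p", aspect_ratio="16:9"):
    """Return a list of problems; an empty list means the settings fit the table.

    Hypothetical helper for illustration only.
    """
    problems = []
    if not API_MIN_SECONDS <= duration_s <= API_MAX_SECONDS:
        problems.append(f"duration {duration_s}s is outside 1-15s")
    if resolution not in API_RESOLUTIONS:
        problems.append(f"resolution {resolution!r} is not 480p or 720p")
    if aspect_ratio not in API_ASPECT_RATIOS:
        problems.append(f"aspect ratio {aspect_ratio!r} is not in the listed set")
    return problems
```

Calling `check_clip_settings(20, "1080p", "21:9")` returns three problems, one per setting, which is cheaper than finding out from a rejected request.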
Most underwhelming Grok Imagine clips trace back to vague prompts, not the model itself. The structure that consistently produces clean output is [Subject] + [Action] + [Environment] + [Style] + [Camera and lighting]. Each slot does specific work, and missing one of them is usually why a clip looks generic.
Subject is the one entity the camera is on, a person, a vehicle, an object, or a creature. Resist the urge to describe two subjects in the same prompt; if you need two, put one in the foreground and one in the environment.
Action is the verb. What is the subject doing? Walking, running, pouring coffee, looking up, turning toward camera. The action drives the motion in every frame of the 24 fps output; weak verbs produce static-looking output.
Environment is where the action takes place. A desert canyon, a cyberpunk café, a snow-covered ridge, a kitchen at sunrise. The environment grounds the lighting and the color palette and tells the model which atmosphere to render.
Style is the visual register. Cinematic, photoreal, anime, claymation, watercolor, food-commercial. Style words tell the model which slice of its training data to lean on; without one, you get a generic-looking clip.
Camera and lighting is the cinematography. Wide shot, close-up, slow push-in, tracking shot, drone pull-back, paired with a lighting cue like “golden hour”, “neon-lit”, or “soft morning light”. This is the difference between a flat clip and one that feels intentional.
A working example that uses every slot: “A heavily modified orange off-road buggy races toward camera at high speed through a desert canyon, kicking up a huge dust trail, cinematic wide shot, golden hour, photoreal.” That sentence names the subject (modified buggy), the action (racing toward camera), the environment (desert canyon), the style (cinematic, photoreal), and the camera and lighting (wide shot, golden hour). The model has zero ambiguity about what to render or how to light it, and the clip typically lands in roughly 17 seconds.
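If you generate prompts in bulk, it can help to treat the five slots as fields so none of them gets skipped. The sketch below is one way to do that; `ShotPrompt` and its field names are invented for this illustration, and the output order (camera and lighting before style) simply mirrors the buggy example above rather than any rule the model enforces.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ShotPrompt:
    """The five slots of the formula; a hypothetical helper, not an xAI type."""

    subject: str
    action: str
    environment: str
    style: str
    camera_and_lighting: str

    def render(self) -> str:
        # Refuse to render if any slot was left blank.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError("empty slot(s): " + ", ".join(missing))
        return (
            f"{self.subject} {self.action} {self.environment}, "
            f"{self.camera_and_lighting}, {self.style}."
        )


buggy = ShotPrompt(
    subject="A heavily modified orange off-road buggy",
    action="races toward camera at high speed",
    environment="through a desert canyon, kicking up a huge dust trail",
    style="photoreal",
    camera_and_lighting="cinematic wide shot, golden hour",
)
print(buggy.render())
```

Run as written, this prints the same buggy prompt quoted above. Swapping one field at a time (a new environment, a new lighting cue) is an easy way to produce controlled variations.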
The formula bends to fit different genres. Here are 20 ready-to-paste prompts grouped by what you are actually trying to make. Each one fills every slot in the formula and ships with an aspect ratio, so you can drop them straight into Grok Imagine and tweak from there.
“A surfer carves a turquoise wave at sunset, board cutting a clean spray line, drone tracking shot, low angle, cinematic, 16:9.”
“A motocross rider launches off a dirt jump, mid-air rotation, dust kicking up beneath the bike, shot from below, fisheye lens, photoreal, 16:9.”
“A parkour runner leaps between rooftops at golden hour, hands grabbing the ledge, wide cinematic shot tracking laterally, 16:9.”
“A horse gallops across a misty field at dawn, hooves throwing wet earth, slow-motion, cinematic, 16:9.”
“A weathered sailor grips a ship’s wheel at twilight, salt spray clinging to his beard, waves crashing against jagged cliffs, documentary feel, natural lighting, 16:9.”
“A detective steps into a rain-soaked alley, neon reflections in puddles, slow push-in, noir style, narrow depth of field, 16:9.”
“An astronaut walks across a red Martian plain, helmet reflecting the dawn sun, wide tracking shot, photoreal, 16:9.”
“Two strangers exchange a glance across a crowded train platform, soft warm light, cinematic medium shot, 16:9.”
“Slow push-in on a steaming cup of coffee on a marble counter, warm morning light from a kitchen window, shallow depth of field, food-commercial style, 1:1.”
“A pair of sleek wireless headphones rotates slowly on a glossy black surface, cool studio lighting, product-commercial style, 1:1.”
“A perfume bottle catches a beam of golden light, droplets sliding down the glass, macro shot, luxury commercial style, 9:16.”
“A new running shoe unboxing, hands lifting the lid, soft top-down lighting, social-commerce style, 9:16.”
“A golden retriever wearing aviator sunglasses cruises a convertible down Pacific Coast Highway, tongue out, summer vibe, cinematic but playful, 9:16.”
“A robot barista pours latte art into a glass cup, steam curling upward, neon-lit cyberpunk café in the background, anime style, medium close-up, 9:16.”
“A penguin in a tiny tuxedo tap-dances on an ice floe, snow drifting, comedic tone, hand-drawn animation style, 1:1.”
“A grandmother knits at lightning speed, wool flying, cluttered kitchen background, exaggerated motion, light comedy tone, 16:9.”
“A young swordswoman stands on a clifftop facing a storm, hair whipping in the wind, lightning behind her, Studio Ghibli-inspired anime style, wide shot, 16:9.”
“A lone wolf walks through a forest of glowing mushrooms at night, ethereal mood, painterly style, slow side-tracking shot, 16:9.”
“A floating city in the clouds, airships drifting between towers, golden hour, anime-cinematic style, wide establishing shot, 16:9.”
“A robot child waters a single flower on a barren planet, dust storms in the distance, melancholic mood, watercolor style, 1:1.”
When you start from an existing image, the prompt should describe motion only. The visuals are already locked in the source still, and asking the model to add new elements that are not in the source image is the fastest way to get a janky output. Keep image-to-video prompts to one or two sentences and lean on action verbs.
Three patterns that consistently work:
“Camera slowly pushes in toward the subject’s face, hair lifts gently in a breeze.”
“Subject blinks twice, then smiles softly; warm light shifts from left to right across the scene.”
“Background rain falls steadily, neon signs flicker, no camera movement.”
The shorter the prompt, the better. If the source image is a portrait, ask for a single facial action and a small camera move. If it is a wide environment shot, ask for atmospheric motion (wind, rain, dust, falling leaves) and let the camera stay still.
Three habits sabotage Grok Imagine prompts more than anything else, and they account for the bulk of “why did this come back blurry” complaints.
Stacking multiple subjects or actions in one prompt. “A wizard casts a fireball at a dragon while archers rain arrows from a castle wall in a thunderstorm” sounds cinematic, but the model splits its attention and renders none of it well. Cut to one subject, one action, one camera move. Chain extra moments by generating multiple clips and editing, or by using Extend From Frame on a partner platform like PixVerse.
Treating the prompt as a list of style references. “Wes Anderson meets Blade Runner with a hint of Studio Ghibli” produces averaged output that looks like none of those references. The model collapses competing styles into a generic mid-tier render. Pick one style register and commit; if you need to combine influences, do it in editing or by stacking multiple clips with different styles.
Prompts longer than three sentences. Grok Imagine’s instruction following degrades after the third sentence; it starts ignoring earlier clauses and over-weighting the last instruction. Two tight sentences typically beat a five-sentence shot list. If you genuinely need shot-list precision, switch to Custom mode and use the explicit camera, motion, and lighting parameters rather than packing them into prose.
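A rough lint pass can catch some of these habits before a prompt is submitted. The sketch below is deliberately crude and is an assumption-heavy illustration: the three-sentence threshold is this guide's rule of thumb, and treating the words “while” and “meets” as cues for stacked actions and blended styles is just a heuristic drawn from the two bad examples above. It will miss plenty and occasionally flag a fine prompt.

```python
import re

MAX_SENTENCES = 3  # this guide's rule of thumb, not a documented model limit


def lint_prompt(prompt: str) -> list[str]:
    """Return heuristic warnings for a prompt; an empty list means nothing was flagged."""
    warnings = []

    # Split on sentence-ending punctuation followed by whitespace or end of text,
    # so "16:9." ends a sentence but "3.5" does not.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", prompt.strip()) if s]
    if len(sentences) > MAX_SENTENCES:
        warnings.append(f"{len(sentences)} sentences; aim for {MAX_SENTENCES} or fewer")

    words = set(re.findall(r"[a-z']+", prompt.lower()))
    if "while" in words:
        warnings.append("'while' often signals stacked actions; keep one action per clip")
    if "meets" in words:
        warnings.append("'meets' often signals blended style references; pick one style")
    return warnings
```

The wizard-and-dragon prompt above trips the “while” check, and the Wes Anderson line trips the “meets” check; the single-sentence surfer prompt passes clean.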
The fastest path is the consumer interface:
There is no separate “save audio” step because the audio track is baked into the output. If you need to swap the music or add narration later, do that in your editor, not in Grok.
For developers, the flow is a POST to https://api.x.ai/v1/videos/generations with the grok-imagine-video model, then a GET on https://api.x.ai/v1/videos/{request_id} to poll for completion. Latency runs from tens of seconds to a couple of minutes for the longest 720p jobs. The full xAI video generation docs spell out every parameter.
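As a rough illustration of that submit-then-poll flow, here is a standard-library Python sketch. Only the two URLs and the model name come from this article. Everything else is an assumption to verify against the xAI docs: the JSON field names (`model`, `prompt`, `duration`), the Bearer-token header, and the idea that the status payload carries a `status` key whose in-progress value is `"pending"`.

```python
import json
import time
import urllib.request

GENERATE_URL = "https://api.x.ai/v1/videos/generations"
STATUS_URL = "https://api.x.ai/v1/videos/{request_id}"
MODEL = "grok-imagine-video"


def build_generation_request(api_key, prompt, duration=6):
    """Build (but do not send) the POST that starts a generation job."""
    # ASSUMPTION: the body field names below are illustrative, not confirmed.
    body = json.dumps({"model": MODEL, "prompt": prompt, "duration": duration})
    return urllib.request.Request(
        GENERATE_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # ASSUMPTION: Bearer auth
            "Content-Type": "application/json",
        },
    )


def build_status_request(api_key, request_id):
    """Build the GET that checks on a job."""
    return urllib.request.Request(
        STATUS_URL.format(request_id=request_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_json(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_video(check, interval_s=5.0, timeout_s=300.0, sleep=time.sleep):
    """Call check() until the job leaves the in-progress state or time runs out.

    ASSUMPTION: check() returns a dict with a "status" key, and "pending"
    means the job is still running. Adjust to the real response shape.
    """
    waited = 0.0
    while True:
        payload = check()
        if payload.get("status") != "pending":
            return payload
        if waited >= timeout_s:
            raise TimeoutError(f"no result after {timeout_s:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

In use, you would send `build_generation_request(...)` through `fetch_json`, read the request ID from the response (under whatever key the API actually returns), then pass `lambda: fetch_json(build_status_request(api_key, request_id))` to `wait_for_video`. Keeping the polling loop separate from the network call makes it easy to test with a fake `check` function.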
Pricing is dead simple. The grok-imagine-video model is billed at $0.05 per second of generated video, native audio included. That works out to $3.00 per minute. For context, Google Veo 3.1 Standard sits at about $24 per minute with audio, Kling 3.0 Standard runs around $5.04 per minute (Pro is $6.72), and OpenAI Sora 2 Pro was around $18 per minute at 720p before its API entered shutdown ahead of full retirement on September 24, 2026. That puts Grok Imagine under the price floor of every other live video API, which is what xAI flagged in its launch announcement.
Three worked examples for budgeting: a 6-second Instagram Reel hook costs about $0.30; a 15-second TikTok or YouTube Short costs about $0.75; a 60-second ad cut from four 15-second variations runs about $3.00 in raw API generation costs, before any editing or selection work. That makes burst experimentation cheap: generating 20 prompt variants for an idea costs around $6 if each is 6 seconds, which is hard to match anywhere else.
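The budgeting arithmetic above is easy to script if you want to price a batch before running it. This sketch assumes the flat $0.05-per-second rate quoted in this article and nothing else (no minimum charge, no resolution surcharge); `clip_cost` is a made-up helper name.

```python
from decimal import Decimal

# Rate quoted in this article; check current xAI pricing before relying on it.
PRICE_PER_SECOND = Decimal("0.05")


def clip_cost(seconds, clips=1):
    """Raw generation cost in dollars for `clips` clips of `seconds` each."""
    return PRICE_PER_SECOND * seconds * clips


print(clip_cost(6))             # 0.30  (6-second Reel hook)
print(clip_cost(15))            # 0.75  (15-second Short)
print(clip_cost(15, clips=4))   # 3.00  (four 15-second variations)
print(clip_cost(6, clips=20))   # 6.00  (20 six-second prompt variants)
```

Decimal is used instead of float so the cents come out exact when costs are summed across many clips.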
Three issues account for most underwhelming Grok Imagine outputs.
Blurry or unstable motion. Almost always the prompt-overload problem from earlier. If the same idea keeps coming back blurry, strip the prompt down to one subject, one action, one camera move, and one style cue, then generate again with a different seed.
Hard refusals or watered-down clips. If Grok returns a softened version of your prompt or refuses outright, the prompt has crossed a content boundary. Real-person likenesses (especially celebrities and politicians), minors in any context, and graphic violence are the most common triggers, and switching to Spicy Mode does not bypass them. Rewrite around archetypes and fictional characters instead of named people.
Hard length cap. The 15-second API ceiling and 10-second consumer ceiling are firm. If you need a 30-second clip, generate two 15-second segments with the same seed and stitch them in your editor, or use Extend mode on partner platforms like PixVerse to append a second segment. Grok itself will not exceed the cap on a single request.
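Planning the segments for a longer piece is simple division. The sketch below splits a target length into chunks that respect a cap; the default of 15 reflects the API ceiling described above, and passing `cap=10` models the consumer ceiling. `plan_segments` is a hypothetical helper and works in whole seconds only.

```python
def plan_segments(total_seconds, cap=15):
    """Split a target length into per-request durations no longer than `cap`."""
    if total_seconds < 1 or cap < 1:
        raise ValueError("total_seconds and cap must be at least 1")
    full, remainder = divmod(total_seconds, cap)
    # Full-length segments first, then one shorter segment for any leftover.
    return [cap] * full + ([remainder] if remainder else [])


print(plan_segments(30))          # [15, 15]
print(plan_segments(40))          # [15, 15, 10]
print(plan_segments(30, cap=10))  # [10, 10, 10]
```

Each entry is the duration for one generation request; the stitching itself still happens in your editor.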
Grok Imagine handles video, and not much else. It does not handle long-form writing, code, deep research, or document editing the way ChatGPT, Claude, Gemini, or DeepSeek do. Most creators end up paying for two or three subscriptions to cover the gaps. Fello AI at $9.99/month consolidates that stack. It is a Mac-native AI app that bundles ChatGPT, Claude, Gemini, Grok, DeepSeek, and Perplexity behind a single price, so you can draft a prompt with one model, refine it with another, and run it through Grok Imagine without juggling tabs and accounts.
For the broader Grok product picture, see our Grok 4.3 review and our Grok desktop client guide for Mac.
Grok Imagine is the most cost-effective serious AI video generator on the market, and the only major one with native audio at $0.05 per second ($3.00 per minute). The model is good. What separates a usable clip from a wasted credit is the prompt, and the formula above plus the 20 examples should get you 80% of the way to consistent output. For social clips, ad concepts, mood boards, and any short-form video work where speed and price beat absolute photorealism, it is the default tool to reach for. If you are pushing past 15 seconds or need cinema-grade 4K detail, Kling 3.0 or Veo 3.1 is the better pick.
The simplest place to start is grok.com/imagine on the free tier. For the rest of our coverage, browse the Grok Imagine tag archive or the full Grok tag. And to see where xAI goes next, read everything we know about Grok 5.