Strategies · June 5, 2026 · 6 min read

The SaaS demo ad: a format most founders get wrong

Most SaaS demo ads are 60 seconds of UI tour over elevator music. The structure that converts: face hooks, screens convince, face closes. Script included.

saas video ad · product demo video · ad scripts · video marketing

There is a default SaaS demo ad, and you have seen it a thousand times: a screen recording glides through the dashboard for 60 seconds, royalty-free music plinks underneath, feature labels fade in and out, and the logo lands at the end like a punchline nobody set up.

It feels safe to make. It shows the product, it offends no one, and it dies in the feed almost instantly, because the first frame is a UI nobody recognizes and the first sound is music nobody asked for.

The fix is not better motion graphics. It is a different structure: face, screen, face. Hook with a human, convince with the product, close with the human. Here is the whole format, with a script you can lift.

Why the UI tour fails

A product demo video built as a tour assumes the viewer already cares. It assumes they are watching your settings page because they want to learn your settings page. That is true on your docs site. It is false everywhere ads run.

On TikTok, Reels, and Shorts, the viewer is mid-scroll. The decision to stay happens in the first second or two, off the first frame and the first spoken words. A dashboard screenshot gives them neither a person nor a reason. Music gives them less than silence, because it confirms this is an ad.

The tour fails at the other end too. Sixty seconds of features is a list, and lists don't convert; one resolved pain does. Most of your UI is table stakes to the viewer anyway. They have seen a settings page before.

Faces hook, screens convince

Hooking and convincing are two different jobs, and the standard demo ad assigns both to the screen recording. That is the core mistake.

A face talking into a camera reads as "a person is about to tell me something." A screen recording reads as "a company is about to sell me something." You get one second to pick which of those signals goes first.

But faces can't close the deal for software. A founder saying "it is really fast" is a claim. A screen where the thing visibly happens is evidence. The viewer trusts what they watched occur, not what they were told occurred.

So the format splits the work. The face buys attention with a pain the viewer recognizes. The screen spends that attention on proof. Then the face returns to say what to do next, because people take instructions from people, not from interfaces.

The structure: face, screen, face

0-3 seconds: a face states the pain in 12 words or fewer

Selfie framing, eyes on lens, talking on the first frame. No intro, no name, no "hey guys." The pain line names the viewer's situation, not your product:

  • "You spent four hours on a report nobody read."
  • "Your onboarding emails go out three days late. Every time."
  • "Invoicing clients still eats your whole Friday afternoon."

Twelve words is a hard ceiling, not a guideline. Spoken briskly, twelve words take three to four seconds, which is the whole budget before the swipe.
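
The word ceilings are easy to check mechanically before you record. Here is a minimal sketch in Python. The function names and the `words_per_minute` parameter are illustrative, not part of any tool mentioned here, and the pace you pass in should be one you measured on your own delivery, not a figure from this article.

```python
HOOK_MAX_WORDS = 12  # hook ceiling from the format above
CTA_MAX_WORDS = 10   # CTA ceiling from the format above


def word_count(line: str) -> int:
    """Count whitespace-separated words; punctuation stays attached to its word."""
    return len(line.split())


def within_budget(line: str, max_words: int) -> bool:
    """True when the line fits the word ceiling."""
    return word_count(line) <= max_words


def estimated_seconds(line: str, words_per_minute: float) -> float:
    """Rough spoken duration at a pace you supply (an assumption, so measure your own)."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count(line) * 60.0 / words_per_minute
```

Run each candidate hook through `within_budget(hook, HOOK_MAX_WORDS)` and each CTA through `within_budget(cta, CTA_MAX_WORDS)`. If `estimated_seconds` at your own pace overshoots the opening window, cut words before you try to talk faster.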

3-20 seconds: two or three screen moments, narrated with deixis

Not a tour. Moments. A screen moment is one click and one visible change, narrated by the same voice that hooked them. The narration uses deixis, which is pointing language that anchors the viewer to what is on screen right now:

  • "Watch what happens when I click this."
  • "That number on the left? That used to be 40 minutes of manual work."
  • "See it fill in? I typed nothing."

Deixis is what makes a static screen recording feel live. "Our platform automatically generates reports" is brochure copy. "Watch, it is writing the report right now" is a person showing you a thing, and the viewer leans in to verify it.

Pick your moments by one rule: the change must be visible without explanation. A before-state, a click, an after-state, on one screen. If a moment needs context to land, cut it. Two great moments beat four decent ones.

Final 3-5 seconds: the face returns with a 10-word CTA

Back to the selfie. One sentence, ten words or fewer, one action:

  • "Try it on your own data. Takes two minutes."
  • "Link below. First report is on us."

The return to the face matters more than the words. The screen showed proof; the face closes the loop and makes the ask personal. Ads that end on a logo card end on the weakest frame they have.

A script you can steal

Total runtime: 25-35 seconds. Longer is not better; it is just more places to lose them.

  1. Hook (face, 0-3s): "[Pain in 12 words or fewer.]"
  2. Pivot (voice over the first screen, 3-5s): "So I will just show you."
  3. Moment 1 (screen, 5-12s): "Watch what happens when I [click / paste / drop] this." Click. Visible change.
  4. Moment 2 (screen, 12-19s): "And this part here, [the before-state]. Now look." Change.
  5. Moment 3, optional (screen, 19-25s): the result artifact. The export, the sent email, the finished thing.
  6. CTA (face, last 3-5s): "[Action in 10 words or fewer.]"

Write the hook last. You will know which pain to lead with after you have picked your screen moments, because the hook is just the before-state of moment one, said out loud.
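
If you are drafting several variants, it can help to treat the outline as data and check it before you shoot. The sketch below is one way to do that in Python. The `Segment` type, the `check_script` rules, and the per-segment durations are assumptions read off the outline above, not a published spec.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str       # "hook", "pivot", "moment", or "cta"
    shot: str       # "face" or "screen"
    seconds: float  # planned duration of this segment


def timeline(segments):
    """Lay segments end to end and return (kind, start, end) tuples."""
    out, t = [], 0.0
    for seg in segments:
        out.append((seg.kind, t, t + seg.seconds))
        t += seg.seconds
    return out


def check_script(segments, min_total=25.0, max_total=35.0):
    """Return a list of problems; an empty list means the outline fits."""
    if not segments:
        return ["script has no segments"]
    problems = []
    if segments[0].shot != "face":
        problems.append("open on the face, not the screen")
    if segments[-1].shot != "face":
        problems.append("close on the face, not the screen")
    moments = sum(1 for s in segments if s.kind == "moment")
    if not 2 <= moments <= 3:
        problems.append(f"expected 2 or 3 screen moments, found {moments}")
    total = sum(s.seconds for s in segments)
    if not min_total <= total <= max_total:
        problems.append(f"runtime {total:g}s is outside {min_total:g}-{max_total:g}s")
    return problems


# Durations follow the outline above; the 4-second CTA is one choice within 3-5s.
draft = [
    Segment("hook", "face", 3),
    Segment("pivot", "screen", 2),
    Segment("moment", "screen", 7),
    Segment("moment", "screen", 7),
    Segment("moment", "screen", 6),
    Segment("cta", "face", 4),
]
```

With these durations `check_script(draft)` comes back empty. Dropping the optional third moment brings the total to 23 seconds, so the check reports a runtime problem. Giving the remaining two moments a little more room is one way to get back inside the window.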

The production bar is lower than you think

This format needs a phone, a screen recorder, and a quiet room. The face segments are selfie shots; polish actively hurts them, because polish reads as ad. The screen segments need a clean test account and a cursor that moves with intent.

Where founders actually stall is volume. One SaaS video ad is a guess; you find the hook that works by running five or six variants against each other, and at $300-800 per edited short-form ad from a freelancer or agency, variant five rarely gets made. UGC creators run $60-150+ per video and still need your screen recordings and your script. The structure above is precisely the part they cannot do for you. Only you know which two moments in your product are worth pointing at.
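
The arithmetic behind "variant five rarely gets made" is worth running on your own quotes. This is a small Python sketch that uses only the price ranges quoted above. The helper name is hypothetical, and the UGC upper bound is a floor, since the quoted range is open-ended.

```python
def batch_cost(per_video_low, per_video_high, variants):
    """Return the (low, high) total cost of producing `variants` ads."""
    if variants < 0:
        raise ValueError("variants must be non-negative")
    return per_video_low * variants, per_video_high * variants


FREELANCER = (300, 800)  # dollars per edited short-form ad, as quoted above
UGC_CREATOR = (60, 150)  # quoted as "$60-150+", so the upper bound is a minimum

five_edited = batch_cost(*FREELANCER, 5)
six_ugc = batch_cost(*UGC_CREATOR, 6)
```

Five edited variants at freelancer or agency rates come to $1,500-4,000. Six UGC videos come to $360-900 or more, before you count the time spent on screen recordings and scripts.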

So the honest workflow: pick the moments first, write three hooks against them, ship all three, keep the one that holds attention past second three, then iterate the screen section. The format is stable; the hook is the variable.

This face-screen-face structure is the SaaS-demo template inside Aitachyon: avatar hook, product screenshot cut-ins, avatar CTA. Paste your URL and a rendered ad with captions comes back in about two minutes, at roughly $1.32 a video on the Pro plan. Cheap enough that variant five always gets made.
