paned — video, decoded into context your LLM can read

The gap

Dropping a raw transcript into a prompt throws away most of the video.

A speech transcript captures the words and nothing else: no setting, no gestures, no on-screen text, no record of what actually happened on camera. Models reason over what you give them. Give them the whole scene.

Plain transcript

Words only — no visual context whatsoever
On-screen text, slides and captions vanish
No actions, gestures, or scene changes
Speakers blurred together, timing approximate
Silent and visual-only moments simply disappear

paned multimodal description

Every modality — visuals, speech, text, audio, action
On-screen text and slides read and transcribed
Actions and scene transitions described per pane
Speakers labelled, frame-accurate timestamps
Structured so an LLM can cite and reason across it

The output

One video, decomposed into panes.

Each pane is a time-bounded slice of the video with every modality aligned — exactly the shape a language model reasons over best. Export as markdown for prompts or JSON for pipelines.

interview_clip.mp4 → paned.md · 04:12 runtime · 38 panes · markdown

00:00–00:04

PANE 01

visual: Wide shot, modern kitchen, daylight. A woman (30s, apron) stands at a marble island holding a chef's knife.

action: She slices a red onion into even half-moons, left to right.

on-screen: "Step 1 — Mise en place" (lower third)

speech: Speaker A (female): "First, we get all our prep done."

audio: [rhythmic knife on cutting board]

00:04–00:09

PANE 02

visual: Cut to medium close-up. Hands sweep chopped onion into a steel bowl; steam rises from a pan behind.

camera: Shot change — handheld, slight push-in.

speech: Speaker A: "The pan should already be hot — listen for it."

audio: [oil sizzle, rising]

00:09–00:15

PANE 03

visual: Insert graphic over black: an ingredient list animates in, three items at a time.

on-screen: "1 red onion · 2 tbsp olive oil · 1 tsp cumin · pinch of salt"

audio: [soft synth pad, no speech]
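
For teams wiring this into a pipeline, here is a minimal sketch of how a pane like the ones above could be modelled and rendered. It is illustrative only: the `Pane` class, its field names, and the JSON layout are assumptions made for this example, not paned's published schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Pane:
    """One time-bounded slice of a video (hypothetical shape, not paned's schema)."""
    index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    tracks: dict[str, str] = field(default_factory=dict)  # modality label -> description


def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS, matching the timestamps in the sample output."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_pane(pane: Pane) -> str:
    """Render one pane as a header line followed by one line per modality."""
    lines = [f"{fmt_ts(pane.start)}–{fmt_ts(pane.end)}  PANE {pane.index:02d}"]
    lines += [f"{label}: {text}" for label, text in pane.tracks.items()]
    return "\n".join(lines)


def to_markdown(panes: list[Pane]) -> str:
    """Join rendered panes with blank lines, ready to paste into a prompt."""
    return "\n\n".join(render_pane(p) for p in panes)


def to_json(panes: list[Pane]) -> str:
    """Serialize panes for a pipeline; ensure_ascii=False keeps symbols like '·' readable."""
    return json.dumps([asdict(p) for p in panes], ensure_ascii=False, indent=2)


# Pane 03 from the sample above, expressed in the assumed structure.
pane_03 = Pane(
    index=3,
    start=9,
    end=15,
    tracks={
        "visual": "Insert graphic over black: an ingredient list animates in, three items at a time.",
        "on-screen": '"1 red onion · 2 tbsp olive oil · 1 tsp cumin · pinch of salt"',
        "audio": "[soft synth pad, no speech]",
    },
)

print(to_markdown([pane_03]))
```

Keeping every modality under one timestamped header is what lets a model cite a specific moment rather than the video as a whole.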

What it captures

Five modalities, aligned on one timeline.

Visuals & scene

Setting, subjects, composition and shot type — described per pane so the model knows what it's looking at.

Speech & speakers

Accurate transcription with speaker labels and frame-accurate timing — every line anchored to where it happens.

On-screen text

Captions, slides, lower thirds and UI read straight off the frame via OCR — nothing on screen is lost.

Action & audio

What people and objects do, plus non-speech sound — sizzles, applause, music cues — captured as events.

How it works

From file to prompt-ready in three moves.

Point it at a video

Upload a file or pass a URL. Any length, any format — paned segments it into coherent panes automatically.

It decodes every pane

Vision, speech, OCR and audio models run per pane and are merged onto a single aligned timeline.

Drop it in your prompt

Get back paned.md or paned.json — paste into a prompt or stream it into your pipeline.
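
The merge step in the middle can be pictured with a small sketch. It assumes each model emits timestamped events and that pane boundaries are already known; the function name, the event tuple layout, and the bucketing rule are all hypothetical, not a description of paned's internals.

```python
from collections import defaultdict


def merge_into_panes(boundaries, events):
    """Bucket per-modality events into panes by timestamp.

    boundaries: list of (start, end) pairs in seconds, non-overlapping.
    events: list of (time, modality, text) tuples from separate models.
    Events that fall outside every pane are dropped in this sketch.
    """
    panes = [{"start": s, "end": e, "tracks": defaultdict(list)} for s, e in boundaries]
    for time, modality, text in sorted(events):  # chronological order
        for pane in panes:
            if pane["start"] <= time < pane["end"]:  # half-open interval
                pane["tracks"][modality].append(text)
                break
    # Convert defaultdicts to plain dicts so the result serializes cleanly.
    return [
        {"start": p["start"], "end": p["end"], "tracks": dict(p["tracks"])}
        for p in panes
    ]


# Events loosely based on the sample panes above, arriving out of order.
boundaries = [(0, 4), (4, 9), (9, 15)]
events = [
    (4.5, "speech", 'Speaker A: "The pan should already be hot — listen for it."'),
    (0.5, "visual", "Wide shot, modern kitchen, daylight."),
    (1.0, "audio", "[rhythmic knife on cutting board]"),
    (10.0, "audio", "[soft synth pad, no speech]"),
]

for pane in merge_into_panes(boundaries, events):
    print(pane["start"], pane["end"], pane["tracks"])
```

Half-open intervals mean an event that lands exactly on a cut belongs to the pane that starts there, so nothing is counted twice.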

Where teams use it

Anywhere an LLM needs to understand footage.

› Video RAG & semantic search

› Meeting & interview analysis

› Content moderation

› Course & lecture summarization

› Highlight & clip generation

› Ad & brand-safety review

› Sports & broadcast tagging

› Accessibility & audio description

› Training-data captioning
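
To make the first use case concrete, here is a rough sketch of pane-level retrieval. It uses plain keyword overlap as a stand-in for real semantic search (a production system would more likely use embeddings), and the pane dictionaries and the `search_panes` helper are assumptions for illustration, not part of paned.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_panes(panes, query, top_k=3):
    """Rank panes by how many query tokens appear anywhere in their tracks."""
    query_tokens = tokenize(query)
    scored = []
    for pane in panes:
        pane_text = " ".join(pane["tracks"].values())
        score = len(query_tokens & tokenize(pane_text))
        if score:  # skip panes with no overlap at all
            scored.append((score, pane))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pane for _, pane in scored[:top_k]]


# Abbreviated versions of the three sample panes (hypothetical structure).
panes = [
    {"timestamp": "00:00–00:04", "tracks": {
        "action": "She slices a red onion into even half-moons.",
        "speech": "First, we get all our prep done."}},
    {"timestamp": "00:04–00:09", "tracks": {
        "speech": "The pan should already be hot.",
        "audio": "[oil sizzle, rising]"}},
    {"timestamp": "00:09–00:15", "tracks": {
        "on-screen": "1 red onion · 2 tbsp olive oil · 1 tsp cumin · pinch of salt"}},
]

for hit in search_panes(panes, "olive oil and cumin"):
    print(hit["timestamp"])
```

Because each hit carries its own timestamp, an answer built on it can point back to the exact moment in the footage.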

Early access

Give your model eyes.

paned is opening up to a first wave of teams building with video. Leave your email and we'll be in touch.

No spam — just an invite when your spot opens.