paned turns any video into a rich, frame-by-frame multimodal description — what's on screen, what's said, what's written, and what happens — structured as clean text built to drop straight into your prompts.
A speech transcript captures the words and nothing else — no setting, no gestures, no on-screen text, no idea of what actually happened on camera. Models reason on what you give them. Give them the whole scene.
Each pane is a time-bounded slice of the video with every modality aligned — exactly the shape a language model reasons over best. Export as markdown for prompts or JSON for pipelines.
Setting, subjects, composition and shot type — described per pane so the model knows what it's looking at.
Accurate transcription with speaker labels and frame-accurate timing — every line anchored to where it happens.
Captions, slides, lower thirds and UI read straight off the frame via OCR — nothing on screen is lost.
What people and objects do, plus non-speech sound — sizzles, applause, music cues — captured as events.
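To make the pane idea concrete, here is a minimal sketch of what one pane could look like as data, with the four modalities above held side by side. The `Pane` class, its field names and both export helpers are hypothetical illustrations, not paned's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Pane:
    """Hypothetical shape for one time-bounded slice of a video."""
    start: float                      # seconds from the start of the video
    end: float                        # seconds from the start of the video
    visual: str = ""                  # setting, subjects, composition, shot type
    speech: list[str] = field(default_factory=list)          # speaker-labelled lines
    on_screen_text: list[str] = field(default_factory=list)  # text read via OCR
    events: list[str] = field(default_factory=list)          # actions and non-speech sound


def format_time(seconds: float) -> str:
    """Render seconds as mm:ss."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def pane_to_markdown(pane: Pane) -> str:
    """Render one pane as a markdown block suitable for pasting into a prompt."""
    lines = [f"## {format_time(pane.start)}-{format_time(pane.end)}"]
    if pane.visual:
        lines.append(f"Visual: {pane.visual}")
    for title, entries in (
        ("Speech", pane.speech),
        ("On-screen text", pane.on_screen_text),
        ("Events", pane.events),
    ):
        if entries:
            lines.append(f"{title}:")
            lines.extend(f"- {entry}" for entry in entries)
    return "\n".join(lines)


def pane_to_json(pane: Pane) -> str:
    """Render one pane as a JSON object for pipelines."""
    return json.dumps(asdict(pane))


example = Pane(
    start=60.0,
    end=65.0,
    visual="Close-up of a pan on a stove",
    speech=["Chef: Now add the garlic."],
    on_screen_text=["Step 3"],
    events=["sizzle"],
)
print(pane_to_markdown(example))
```

Keeping every modality on the same object means a prompt or a pipeline reads one slice of time at a time, rather than cross-referencing four separate tracks.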
Upload a file or pass a URL. Any length, any format — paned segments it into coherent panes automatically.
Vision, speech, OCR and audio models run per pane and are merged onto a single aligned timeline.
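One way to picture the merge step: each model emits timestamped items, and each item is filed under the pane whose time range contains it. The sketch below is an assumption about how such an alignment could work, not a description of paned's internals; `merge_onto_timeline`, the modality names and the tuple layout are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical modality names, one per model family.
MODALITIES = ("visual", "speech", "ocr", "audio")


def merge_onto_timeline(pane_starts, video_end, items):
    """Group timestamped items into panes.

    pane_starts: sorted list of pane start times in seconds
    video_end:   end of the last pane in seconds
    items:       iterable of (time_seconds, modality, text) tuples
    """
    ends = pane_starts[1:] + [video_end]
    panes = [
        {"start": start, "end": end, **{m: [] for m in MODALITIES}}
        for start, end in zip(pane_starts, ends)
    ]
    # Sort by time so entries within each pane stay in chronological order.
    for time, modality, text in sorted(items, key=lambda item: item[0]):
        if not pane_starts or time < pane_starts[0] or time >= video_end:
            continue  # outside the video, skip
        # Index of the last pane that starts at or before this timestamp.
        index = bisect_right(pane_starts, time) - 1
        panes[index][modality].append(text)
    return panes


timeline = merge_onto_timeline(
    pane_starts=[0.0, 5.0, 12.0],
    video_end=20.0,
    items=[
        (13.5, "audio", "applause"),
        (1.0, "speech", "Host: Welcome back."),
        (5.0, "ocr", "Quarterly results"),
        (25.0, "speech", "past the end, dropped"),
    ],
)
for pane in timeline:
    print(pane)
```

Here an item that lands exactly on a boundary goes to the pane that starts there, so each item belongs to exactly one pane.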
Get back paned.md or paned.json — paste into a prompt or stream it into your pipeline.
paned is opening up to a first wave of teams building with video. Leave your email and we'll be in touch.
No spam — just an invite when your spot opens.