Documentation

Everything you need to know about staik VOICE.

Getting started

  1. Click "Try for free" on the homepage, or log in with your own API key from api.staik.se.
  2. Drop an audio file or pick one from your device. Supported formats are mp3, wav, m4a, webm and ogg, up to 100 MB and 30 minutes per file.
  3. Choose a language (auto, Swedish or English) and turn speaker diarization on or off.
  4. Click "Transcribe". Whisper-large-v3 runs on a Swedish GPU and pyannote separates the speakers. The result appears on screen.
  5. Copy, download as .txt, .srt, .vtt or JSON, or share directly.
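
Step 5 offers .srt as a download format. As a rough illustration of what such a file contains, the sketch below renders a list of segments as SRT cues. The segment dictionaries (start, end, text and an optional speaker) and both function names are assumptions made for this example, not the code staik VOICE uses for its own export.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments (hypothetical dicts with start, end, text) as SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")
        if speaker:
            # Prefix the cue with the speaker label when one is present.
            text = f"{speaker}: {text}"
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Each cue is a running number, a time range and the text, which is why the segment start and end times are all that is needed to build subtitles.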

Supported formats

Audio formats

  • MP3
  • WAV
  • M4A / AAC
  • WebM (Opus / Vorbis)
  • OGG
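
A client can catch unsupported files before uploading. The sketch below checks the file extension and size against the limits stated on this page. The helper name check_upload is hypothetical, the extension list is taken from the Getting started section, and reading "100 MB" as 100 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Extensions as listed in "Getting started"; adjust if the service accepts more.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".ogg"}
# Assumption: "100 MB" is interpreted as 100 MiB.
MAX_BYTES = 100 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

The check only looks at the file name and size. It does not inspect the audio codec or the duration, so the server may still reject a file that passes.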

Use cases

  • Meeting notes
  • Interviews and podcasts
  • Lectures and seminars
  • Voice memos
  • Internal calls
  • Research interviews

API reference

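staik VOICE uses the staik API (api.staik.se/v1/audio/transcriptions) for transcription. The endpoint is compatible with OpenAI's Whisper transcription API, so an existing OpenAI client works once its base_url points to https://api.staik.se/v1.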

POST https://api.staik.se/v1/audio/transcriptions

Authentication

Bearer token in the Authorization header. Get a key at api.staik.se.
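
For clients that do not use an SDK, the header can be built by hand. This is a minimal sketch: the function name auth_headers and the environment variable STAIK_API_KEY are hypothetical names chosen for the example, not part of the staik API.

```python
import os
from typing import Optional


def auth_headers(api_key: Optional[str] = None) -> dict[str, str]:
    """Build the Authorization header for a staik API request."""
    # STAIK_API_KEY is a hypothetical variable name used only in this sketch.
    key = api_key or os.environ.get("STAIK_API_KEY")
    if not key:
        raise ValueError("no API key provided")
    return {"Authorization": f"Bearer {key}"}
```

Reading the key from the environment keeps it out of source files and shell history.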

Model

whisper-large-v3

curl

curl -X POST https://api.staik.se/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-..." \
  -F file=@meeting.mp3 \
  -F model=whisper-large-v3 \
  -F response_format=verbose_json \
  -F diarize=true \
  -F language=sv

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://api.staik.se/v1",
)

with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        file=f,
        model="whisper-large-v3",
        response_format="verbose_json",
        extra_body={"diarize": True},
    )

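for seg in transcript.segments:
    # Recent openai SDK versions return segment objects; older ones return dicts.
    data = seg if isinstance(seg, dict) else seg.model_dump()
    speaker = data.get("speaker", "?")
    print(f"[{speaker} {data['start']:.1f}s] {data['text']}")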

Response

The response follows OpenAI's verbose_json format with extra fields for speaker per segment and word-level timestamps.
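
With a speaker on every segment, a common post-processing step is to merge consecutive segments from the same speaker into turns. The sketch below works on plain dictionaries, such as the parsed JSON from the curl call. The "speaker" key follows the Python example on this page; the sample data, the SPEAKER_00 style labels and the function name are made up for illustration.

```python
def merge_speaker_turns(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: list[dict] = []
    for seg in segments:
        speaker = seg.get("speaker", "?")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


# Hypothetical sample shaped like verbose_json segments plus a "speaker" field.
sample = [
    {"start": 0.0, "end": 2.1, "text": " Welcome everyone.", "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 4.0, "text": " Let's begin.", "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 5.5, "text": " Thanks.", "speaker": "SPEAKER_01"},
]

for turn in merge_speaker_turns(sample):
    print(f"[{turn['speaker']} {turn['start']:.1f}-{turn['end']:.1f}s] {turn['text']}")
```

Merging into turns gives output that reads like meeting minutes instead of a long list of short fragments.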

Limitations

  • The MVP accepts files up to 100 MB and 30 minutes of audio in sync mode.
  • Longer files (>30 min) will be handled by an async mode planned for the next stage.
  • Pricing model: 1 minute of audio = 1,000 tokens.
  • Speaker diarization needs at least two distinct voices to work well.
  • Whisper-large-v3 supports 99 languages; Swedish and English are primarily tested.
  • Quality depends on recording — avoid heavy background noise and overlapping speech.
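
The pricing rule and the sync limit above can be turned into a quick estimate. This is a sketch under stated assumptions: it treats billing as linear in duration and rounds up to a whole token, neither of which this page specifies, and the function names are hypothetical.

```python
import math

TOKENS_PER_MINUTE = 1_000     # pricing rule stated above
MAX_SYNC_SECONDS = 30 * 60    # 30-minute sync-mode limit stated above


def estimate_tokens(duration_seconds: float) -> int:
    """Estimate token cost, assuming linear billing rounded up to a whole token."""
    return math.ceil(duration_seconds * TOKENS_PER_MINUTE / 60)


def fits_sync_mode(duration_seconds: float) -> bool:
    """Return True if the audio is within the 30-minute sync limit."""
    return duration_seconds <= MAX_SYNC_SECONDS
```

Under these assumptions a recording of 12 minutes 30 seconds (750 s) comes to 12,500 tokens, and a full 30-minute file to 30,000 tokens.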

Plans and pricing

The demo account is free for short clips. With your own key you get more tokens and can transcribe longer files. See all plans at api.staik.se.

Tips for best results

  • Record in a quiet environment — minimize background noise.
  • Place the microphone so all participants are heard equally.
  • Use an external microphone for longer meetings — better quality means better transcription.
  • 16 kHz mono is enough. Whisper works on 16 kHz audio internally, so higher sample rates don't add quality.
  • For diarization: avoid people talking over each other.
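
One way to prepare a recording according to the 16 kHz mono tip is to convert it before uploading. The sketch below assumes ffmpeg is installed and on PATH; ffmpeg is not mentioned elsewhere on this page, and the function names are hypothetical.

```python
import subprocess


def ffmpeg_mono_16k_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that downmixes to mono and resamples to 16 kHz."""
    return [
        "ffmpeg",
        "-i", src,        # input file
        "-ac", "1",       # one audio channel (mono)
        "-ar", "16000",   # 16 kHz sample rate
        dst,              # output file; the format follows the extension
    ]


def convert_to_mono_16k(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_mono_16k_command(src, dst), check=True)
```

Calling convert_to_mono_16k("meeting.m4a", "meeting.wav") would write a smaller file, which also helps stay under the 100 MB limit.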