Privacy deep-dive

Three classes of privacy risk in audio processing — and how browser-only tools address them

Conversations captured in recordings, voice as biometric data, and context leaked through ID3 tags or background sounds — audio tools carry distinct risk layers. Here is how NoSend Tools handles each class without sending data outside your browser, with concrete references to Whisper, the Web Audio API, and lamejs.

Risk class 1 — the conversation itself and the density of transcripts

The most direct layer of risk in an audio file is the content of what was said: meeting minutes, medical consultations, legal advice, family arguments, conversations with children. Once any of these recordings passes through a transcription step, it becomes searchable text that can be quoted out of context later. Audio alone resists casual review because replaying takes time, but a Whisper-class model collapses minutes of recording into seconds of text. Uploading the file to a cloud transcription service hands both the audio and the resulting transcript to the operator’s infrastructure; the terms of service may state that data is used only to deliver the service, but the actual retention windows and access boundaries are no longer in the user’s control.

voice-transcribe and audio-transcribe run OpenAI’s Whisper model (tiny / base / small, in quantised form) inside your browser using transformers.js with ONNX Runtime Web. The weights download once from a CDN, roughly 40 MB to 240 MB depending on which size you pick, and once they are cached the tool works fully offline. The waveform is decoded by the Web Audio API (decodeAudioData on an AudioContext), held in memory as a Float32Array, and passed directly to the ONNX session. Neither the original audio nor the transcript text is transmitted. A server-side equivalent has to receive the binary on the upload path, typically buffering it or writing it to disk, retain it long enough to feed the model, and then pass the produced transcript back through the application stack, where it is often logged. Each of those steps is an independent surface where the content can be exposed.
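The step between decoding and inference can be sketched as follows. Whisper models expect 16 kHz mono input, so decoded channel data usually has to be downmixed and resampled before it reaches the model. The function names and the linear-interpolation resampler below are illustrative assumptions, not taken from the NoSend Tools source; a production pipeline would more likely let the browser resample, for example by decoding in an AudioContext created at 16 kHz.

```typescript
// Hypothetical helpers for illustration only; not the NoSend Tools implementation.
const WHISPER_SAMPLE_RATE = 16000; // Whisper models expect 16 kHz mono audio

// Average all channels into a single mono channel.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Naive linear-interpolation resampler. It applies no low-pass filter, so
// downsampling can alias; it is only meant to show the shape of the step.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Produce the Float32Array that would be handed to the speech model.
function toWhisperInput(channels: Float32Array[], sampleRate: number): Float32Array {
  return resampleLinear(downmixToMono(channels), sampleRate, WHISPER_SAMPLE_RATE);
}
```

In a browser, `channels` would come from calling `getChannelData()` on the AudioBuffer returned by decodeAudioData, and `sampleRate` from that buffer's `sampleRate` property. Everything here operates on in-memory typed arrays, so nothing in this step needs network access.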

Risk class 2 — voice as biometric data

The voice is biometric in the same way a face is. Under the EU GDPR, a voiceprint processed to uniquely identify a person is biometric data and falls into the special categories of personal data (Article 9), whose processing is prohibited unless an exemption such as the data subject’s explicit consent applies. Recent voice-cloning models (Tortoise TTS, the ElevenLabs class of speaker-cloning systems) can reconstruct a speaker’s voice from samples on the order of tens of seconds, which means that once your recording reaches an external operator, the raw material for synthesising arbitrary statements in your voice is in their possession. Routinely uploading recordings to a cloud editor distributes that material to anyone the operator chooses to share data with, intentionally or otherwise.

voice-rec captures the microphone directly through getUserMedia, pulls PCM samples out of an AudioWorkletNode, and encodes them as MP3 using @breezystack/lamejs (the npm package, not a CDN-hosted lamejs build) or writes the same PCM into a RIFF container as WAV. Both the library and the audio bytes stay inside the tab. audio-pitch-shift and audio-tempo-shift run a browser port of the SoundTouch algorithm (SoundTouch.js) to shift pitch independently of tempo (or vice versa), offering a basic way to disguise the voice in recordings where the speaker should not be readily identifiable. In either case the design choice is the same: no step in the pipeline requires handing the voice to an external service.
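Writing PCM into a RIFF container is simple enough to show in full. The sketch below builds a 16-bit PCM WAV file from float samples entirely in memory; the function names are hypothetical and the 44-byte canonical header layout is an assumption about what a minimal encoder would emit, not a description of the voice-rec source.

```typescript
// Hypothetical helper names; a minimal sketch of a canonical 44-byte WAV header.
function writeAscii(view: DataView, offset: number, text: string): void {
  for (let i = 0; i < text.length; i++) {
    view.setUint8(offset + i, text.charCodeAt(i));
  }
}

// `samples` holds floats in [-1, 1]; multi-channel audio must be interleaved.
function encodeWavPcm16(samples: Float32Array, sampleRate: number, numChannels = 1): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  writeAscii(view, 0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(view, 8, "WAVE");

  writeAscii(view, 12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk body size for plain PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * numChannels * bytesPerSample, true); // byte rate
  view.setUint16(32, numChannels * bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample

  writeAscii(view, 36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return new Uint8Array(buffer);
}
```

The returned bytes can be wrapped in a Blob and offered as a download link, which keeps the whole record-and-save path inside the tab.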

Risk class 3 — metadata, ambient sound, and aggregation

Audio files carry their own equivalent of EXIF. The ID3v2 tag format used in MP3 files defines frames for artist, composer, comments, recording date, and the encoder software or hardware, and its user-defined text frames (TXXX) can hold anything else a tool chooses to write, including GPS coordinates; DAWs and recorder apps often write these fields automatically. AAC audio in MP4 containers holds equivalent metadata atoms inherited from iTunes, and WAV files use the LIST chunk of type INFO for the same purpose. Sharing a recording as-is therefore hands the recipient a bundle of fields the original speaker probably never inspected: the device model, the timestamp, the editor’s name.

audio-meta-strip identifies the ID3v2 header in MP3, the udta box inside the moov atom in MP4, and the INFO chunk in WAV by byte offset and removes only the metadata regions without re-encoding the audio stream. Quality is preserved exactly. Aggregation is the second half of this risk: a 10-second clip might reveal only train noise, but three clips merged together can expose a transit line, an arrival station, and the voice of the person being met. audio-cut isolates spans, audio-merge concatenates them, and audio-spectrum visualises the result — each step operates on an AudioBuffer through the Web Audio API, so the assembled output is the only artifact that touches the filesystem. Not one fragment needs to leave the browser to produce the merged piece.
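For the MP3 case, removal by byte offset can be sketched in a few lines. An ID3v2 tag starts with the bytes "ID3", followed by a version, a flags byte, and a four-byte "synchsafe" size (seven bits per byte), so the tag's end offset is computable without touching the audio frames. The function name below is hypothetical and this is a minimal sketch under those assumptions, not the audio-meta-strip source; it ignores the separate 128-byte ID3v1 trailer that some files carry at the end.

```typescript
// Hypothetical helper: removes a leading ID3v2 tag and returns the remaining bytes.
// The MPEG audio frames are copied unchanged, so nothing is re-encoded.
function stripId3v2(bytes: Uint8Array): Uint8Array {
  const HEADER_SIZE = 10;
  const hasTag =
    bytes.length >= HEADER_SIZE &&
    bytes[0] === 0x49 && // "I"
    bytes[1] === 0x44 && // "D"
    bytes[2] === 0x33; // "3"
  if (!hasTag) return bytes.slice();

  // Each size byte contributes 7 bits; a set high bit means the header is malformed.
  let bodySize = 0;
  for (let i = 6; i < 10; i++) {
    if ((bytes[i] & 0x80) !== 0) return bytes.slice();
    bodySize = (bodySize << 7) | bytes[i];
  }

  // ID3v2.4 may append a 10-byte footer, signalled by bit 4 of the flags byte.
  const majorVersion = bytes[3];
  const flags = bytes[5];
  const footerSize = majorVersion === 4 && (flags & 0x10) !== 0 ? 10 : 0;

  const tagEnd = HEADER_SIZE + bodySize + footerSize;
  if (tagEnd > bytes.length) return bytes.slice(); // truncated file, leave untouched
  return bytes.slice(tagEnd);
}
```

Given the bytes of a file read through the File API, the result can be written straight back out as a new Blob. Files without a recognisable tag, or with a malformed one, are returned as an unmodified copy rather than guessed at.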

Verify zero transmission with DevTools

All three risk classes share the same root question: where does processing actually run. The audio category in NoSend Tools is implemented on standard browser APIs and open-source libraries — Web Audio (AudioContext, AudioWorkletNode, OfflineAudioContext), @breezystack/lamejs, transformers.js with ONNX Runtime Web, and SoundTouch.js — with no outbound fetch to any external endpoint as part of the processing path.

To confirm this yourself, open any audio tool, enable Preserve log in the DevTools Network tab, and run a full session: record, transcribe, edit. The only requests that should appear are the initial loads of HTML, JS, WASM, and (on first use) the Whisper weights. The source code is published at otomomik/nosend-tools on GitHub if you want to audit at the implementation level. Conversation content, voiceprint, metadata — every layer of information your audio carries stays inside the browser tab.
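The manual check can also be partly automated. The DevTools Network tab can export a session as a HAR file, which is plain JSON, and a short script can flag any entry that sends a request body, uses a method other than GET, or targets an origin you did not expect. The sketch below assumes only the standard HAR fields `log.entries[].request.method`, `url`, and `bodySize`; the function names and the origins in the usage example are placeholders, not the real hosts used by NoSend Tools.

```typescript
// Minimal subset of the HAR format needed for this check.
interface HarRequest {
  method: string;
  url: string;
  bodySize: number; // -1 when the size is unknown
}
interface HarEntry {
  request: HarRequest;
}
interface Har {
  log: { entries: HarEntry[] };
}

// Extract "scheme://host[:port]" without relying on the URL global.
function originOf(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/[^/?#]+/i.exec(url);
  return match ? match[0].toLowerCase() : url;
}

// Hypothetical helper: returns entries that deserve a manual look.
// A flagged entry is not proof of a leak, only something to inspect.
function findSuspiciousRequests(har: Har, allowedOrigins: string[]): HarEntry[] {
  const allowed = allowedOrigins.map((origin) => origin.toLowerCase());
  return har.log.entries.filter((entry) => {
    const request = entry.request;
    const sendsBody = request.bodySize > 0;
    const isGet = request.method.toUpperCase() === "GET";
    const knownOrigin = allowed.indexOf(originOf(request.url)) >= 0;
    return sendsBody || !isGet || !knownOrigin;
  });
}

// Usage with placeholder origins (assumptions, not the real hosts):
const exampleHar: Har = {
  log: {
    entries: [
      { request: { method: "GET", url: "https://tools.example.test/app.js", bodySize: 0 } },
      { request: { method: "GET", url: "https://cdn.example.test/model.onnx", bodySize: 0 } },
      { request: { method: "POST", url: "https://collector.example.test/upload", bodySize: 1024 } },
    ],
  },
};
const flagged = findSuspiciousRequests(exampleHar, [
  "https://tools.example.test",
  "https://cdn.example.test",
]);
console.log(flagged.map((entry) => entry.request.url));
```

For a session that matches the description above, the flagged list should be empty once the page's own origin and the weights CDN are in the allow list.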