Library notes Published: 2026-06-07

How transformers.js runs ML models inside the browser

HuggingFace's transformers.js uses ONNX Runtime Web to run models like Whisper and RMBG-1.4 entirely in the browser. Model weights are downloaded and cached locally; your audio and images never leave your machine.

transformers.js and ONNX Runtime Web

transformers.js, developed by HuggingFace, is a JavaScript port of the Python Transformers library that runs in the browser. Models trained in PyTorch or JAX are converted to the ONNX (Open Neural Network Exchange) interchange format and executed on ONNX Runtime Web (ORT Web), the JavaScript and WebAssembly build of ONNX Runtime. Because every model is expressed in ONNX, a single inference path works across many model architectures without depending on the original training framework.

Two inference backends are available in the browser. When WebGPU is supported, the runtime dispatches GPU compute shaders for massively parallel matrix operations. When it is not, execution falls back to WebAssembly (WASM) with SIMD, the vector-instruction extension to WebAssembly. NoSend Tools' Whisper worker probes WebGPU availability at startup by calling `navigator.gpu.requestAdapter()`. If an adapter is returned, WebGPU is used; if the call fails or the adapter is null, WASM is selected automatically. Even if WebGPU initialises but fails later during shader compilation, the worker catches the error and transparently retries with WASM, so users see no error, only potentially slower inference.
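
The probe-and-fallback behaviour described above can be sketched roughly as follows. This is not the NoSend Tools worker source: the names `Backend`, `GpuLike`, `pickBackend`, and `runWithFallback` are hypothetical, and the GPU object is passed in as a parameter so the logic can be exercised outside a browser.

```typescript
// Hypothetical names throughout; a sketch, not the actual worker code.
type Backend = "webgpu" | "wasm";

// Minimal shape of navigator.gpu that this sketch relies on.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

// Probe WebGPU: choose it only if an adapter actually comes back.
async function pickBackend(gpu: GpuLike | undefined): Promise<Backend> {
  if (!gpu) return "wasm"; // WebGPU API not exposed at all
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null adapter means no usable GPU
  } catch {
    return "wasm"; // the probe itself failed
  }
}

// Run a task on the preferred backend. If WebGPU fails later
// (for example during shader compilation), retry once on WASM.
async function runWithFallback<T>(
  preferred: Backend,
  task: (backend: Backend) => Promise<T>,
): Promise<{ backend: Backend; result: T }> {
  try {
    return { backend: preferred, result: await task(preferred) };
  } catch (err) {
    if (preferred !== "webgpu") throw err; // nothing left to fall back to
    return { backend: "wasm", result: await task("wasm") };
  }
}
```

In a page or worker, `navigator.gpu` would be passed as the `gpu` argument, and `task` would wrap whatever model-loading call the tool uses.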

The two models NoSend Tools loads

The voice and audio transcription tools (voice-transcribe and audio-transcribe) use ONNX-converted builds of OpenAI's Whisper, published under the `onnx-community` namespace on HuggingFace Hub. Four sizes are available: tiny, base, small, and turbo (whisper-large-v3-turbo). Approximate download sizes: whisper-tiny is roughly 95 MB on WebGPU and 40 MB on WASM; whisper-small is roughly 299 MB on WebGPU and 249 MB on WASM. The turbo encoder's fp32 weights exceed 2.5 GB unquantised, so the worker uses q4-quantised files to keep the download practical for a browser. Model files are fetched from HuggingFace's CDN on first use, then stored in CacheStorage under the cache name `pwt-whisper-cache-v1`. On subsequent visits the files are read from that cache, so no network request is made before inference begins.
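
The size options above could be captured in a small lookup table. In this sketch all identifiers are hypothetical and the repository names are an assumption (the `onnx-community` namespace joined with the model name). Only the figures for tiny and small and the q4 choice for turbo come from the text; everything else is left undefined rather than guessed.

```typescript
// Hypothetical identifiers; a sketch of a model lookup, not the tool's code.
type ModelSize = "tiny" | "base" | "small" | "turbo";
type Device = "webgpu" | "wasm";

// Assumed Hub repository names: namespace plus model name.
const REPO_ID: Record<ModelSize, string> = {
  tiny: "onnx-community/whisper-tiny",
  base: "onnx-community/whisper-base",
  small: "onnx-community/whisper-small",
  turbo: "onnx-community/whisper-large-v3-turbo",
};

// Approximate download size in MB, only where the text states one.
const APPROX_MB: Partial<Record<ModelSize, Record<Device, number>>> = {
  tiny: { webgpu: 95, wasm: 40 },
  small: { webgpu: 299, wasm: 249 },
};

interface ModelChoice {
  repoId: string;
  approxMb: number | undefined;
  quantisation: string | undefined;
}

function describeModel(size: ModelSize, device: Device): ModelChoice {
  return {
    repoId: REPO_ID[size],
    approxMb: APPROX_MB[size]?.[device],
    // Only turbo's quantisation is stated in the text; others stay unspecified.
    quantisation: size === "turbo" ? "q4" : undefined,
  };
}
```

A caller might use `describeModel("small", "wasm").approxMb` to warn the user about a roughly 249 MB first-time download before starting it.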

The background-removal tool (image-bg-remove) uses RMBG-1.4, published by briaai. The image is resized to 1024x1024 and passed through the model, and the resulting single-channel mask tensor is scaled back to the original resolution to separate foreground from background. Model weights are cached under `pwt-rmbg-cache-v1`. Both tools run with `env.allowLocalModels = false` and `env.useBrowserCache = true`, so the model source and caching behaviour are stated explicitly in the worker source code.
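
The step of scaling the mask back to the original resolution and using it to cut out the foreground can be illustrated with two small functions. This is a minimal sketch under stated assumptions: mask values are taken to lie in [0, 1], nearest-neighbour sampling is used for brevity (the actual tool may interpolate differently), and `resizeMask` and `applyAlpha` are hypothetical names.

```typescript
// Resize a single-channel mask with nearest-neighbour sampling.
// Assumption: the real tool may use a smoother interpolation.
function resizeMask(
  mask: Float32Array,
  srcW: number,
  srcH: number,
  dstW: number,
  dstH: number,
): Float32Array {
  const out = new Float32Array(dstW * dstH);
  for (let y = 0; y < dstH; y++) {
    const sy = Math.min(srcH - 1, Math.floor((y * srcH) / dstH));
    for (let x = 0; x < dstW; x++) {
      const sx = Math.min(srcW - 1, Math.floor((x * srcW) / dstW));
      out[y * dstW + x] = mask[sy * srcW + sx];
    }
  }
  return out;
}

// Write the mask into the alpha channel of RGBA pixel data, in place.
// Assumption: mask values are in [0, 1]; 1 means fully foreground.
function applyAlpha(rgba: Uint8ClampedArray, mask: Float32Array): void {
  for (let i = 0; i < mask.length; i++) {
    rgba[i * 4 + 3] = Math.round(mask[i] * 255);
  }
}
```

With a 1024x1024 mask and an original image of width `w` and height `h`, the call would be `resizeMask(mask, 1024, 1024, w, h)`, followed by `applyAlpha` on the image's RGBA pixel buffer.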

The real difference from a cloud speech API

Cloud speech-to-text services like the OpenAI Whisper API work by sending your audio file to a server. Processing happens there, and the text comes back. Along that path, your audio may pass through CDN caches and load balancers and leave traces in access logs, none of which you control. Even when the operator honours a "not used for training" policy, the user has no way to verify that after the fact.

In-browser Whisper works the other way. The ONNX weight files are downloaded from a CDN and cached locally, but those downloads carry no audio. Inference is handled inside a Web Worker by ONNX Runtime Web. The Float32Array representing your audio waveform never leaves the browser's memory. To check, open DevTools, switch to the Network tab, load the model, then run a transcription: you will see zero requests carrying audio content. The first-time wait is longer than a cloud round-trip (a few seconds for tiny, around ten seconds for small), but from the second visit onward the model loads from cache and the gap largely disappears.
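
The Network-tab check can also be approximated in code by wrapping the fetch function and recording whether any request carries a body. This is an illustrative sketch only: `FetchLike` is a deliberately simplified, hypothetical stand-in for the real `fetch` signature (which also accepts `Request` and `URL` objects), and `withRequestLog` and `requestsWithBody` are hypothetical helpers.

```typescript
// Simplified stand-in for fetch; the real signature accepts more input types.
type FetchLike = (
  url: string,
  init?: { method?: string; body?: unknown },
) => Promise<unknown>;

interface RequestRecord {
  url: string;
  method: string;
  hasBody: boolean;
}

// Wrap a fetch function so every call is recorded before it is forwarded.
function withRequestLog(baseFetch: FetchLike, log: RequestRecord[]): FetchLike {
  return (url, init) => {
    log.push({
      url,
      method: (init?.method ?? "GET").toUpperCase(),
      hasBody: init?.body != null,
    });
    return baseFetch(url, init);
  };
}

// Model downloads are plain GETs; any entry returned here would be an upload.
function requestsWithBody(log: RequestRecord[]): RequestRecord[] {
  return log.filter((r) => r.hasBody);
}
```

If the model-loading code were given the wrapped function, an empty result from `requestsWithBody` after a transcription would match what the Network tab shows: weight files coming in, nothing going out.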

Downloading a model is not the same as sending your data

When you open a transformers.js-based tool for the first time, model files begin downloading. This is structurally identical to ffmpeg.wasm loading on an audio tool or kuromoji loading on a Japanese text tool: it is the browser fetching a library or set of weights, not transmitting your input to a server. The HuggingFace Hub CDN delivers the model; it never receives the audio or image you are about to process.

The distinction matters for privacy analysis. Conflating "CDN request to fetch a model" with "data sent to a server" leads to the wrong conclusion about what happens after the cache warms up. NoSend Tools sets `env.useBrowserCache = true` explicitly so that after the first download, the CDN is not contacted again. You can verify this in DevTools under Application > Cache Storage: entries for `pwt-whisper-cache-v1` and `pwt-rmbg-cache-v1` accumulate there, and on repeat visits the Network tab shows no model-fetch requests before inference starts.
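
The DevTools check under Application > Cache Storage has a programmatic counterpart in the CacheStorage API. The sketch below lists the URLs stored in a named cache. The interfaces `CacheLike` and `CacheStorageLike` and the function `listCachedUrls` are hypothetical names that mirror only the parts of the browser API used here (`has`, `open`, `keys`), so the function can be tried outside a browser with a stand-in object.

```typescript
// Minimal shapes mirroring the parts of the browser Cache API used here.
interface CacheLike {
  keys(): Promise<ReadonlyArray<{ url: string }>>;
}

interface CacheStorageLike {
  has(name: string): Promise<boolean>;
  open(name: string): Promise<CacheLike>;
}

// Return the URLs of all entries in the named cache, or an empty list
// if that cache has not been created yet (for example before first use).
async function listCachedUrls(
  storage: CacheStorageLike,
  cacheName: string,
): Promise<string[]> {
  if (!(await storage.has(cacheName))) return [];
  const cache = await storage.open(cacheName);
  const requests = await cache.keys();
  return requests.map((r) => r.url);
}
```

In a page, the global `caches` object could be passed as `storage`, for example `listCachedUrls(caches, "pwt-whisper-cache-v1")`. A non-empty result after the first run suggests the weights are being served locally on later visits.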