I kept running into the same pattern building internal tools: calling an LLM API thousands of times with the same prompt template, just swapping in different text. Classify this contract clause. Route this support ticket. Categorize this log line. Same task, different input, over and over.

Two problems kept coming up:

- **Data sensitivity.** For teams handling contracts, patient records, or internal logs, sending that data to a third-party API isn't always an option.
- **Cost.** At scale, you're paying per-token for what is essentially structured pattern matching.

So I built an open-source CLI that trains a small local text classifier from labeled examples. You give it ~50 input/output pairs, it trains a ~230KB model on your machine, and you run inference locally. No network calls, ever.

```shell
npm install -g expressible
```

```shell
expressible distill init clause-detector
cd clause-detector
```

Add some labeled examples:

```shell
expressible distill add --file ./labeled-clauses.json
```

The JSON is just an array of `{ "input": "...", "output": "..." }` objects, 50 or so pairs. Train:

```shell
expressible distill train
```
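For reference, a `labeled-clauses.json` in that shape might look like the following. The clause text and label names here are invented for illustration; only the `input`/`output` field names come from the format described above:

```json
[
  {
    "input": "Either party may terminate this Agreement upon thirty days written notice.",
    "output": "termination-for-convenience"
  },
  {
    "input": "In no event shall either party be liable for indirect or consequential damages.",
    "output": "limitation-of-liability"
  }
]
```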

```
Training Complete

  Samples               87
  Validation accuracy   93.2%
  Time elapsed          2.8s

✓ Model saved to model/
```
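To make the training step concrete, here is a minimal sketch of what "train a small classifier on top of embedding vectors" means. This is not the actual implementation: it uses tiny 4-dimensional toy vectors in place of real sentence embeddings, a single linear layer instead of the tool's network, and made-up data.

```javascript
const DIM = 4;
const labels = ["termination", "liability"];

// Toy "embeddings" with class labels. In the real tool these vectors
// would come from a sentence embedding model, not be hand-written.
const data = [
  { x: [1.0, 0.9, 0.0, 0.1], y: 0 },
  { x: [0.8, 1.0, 0.1, 0.0], y: 0 },
  { x: [0.0, 0.1, 1.0, 0.9], y: 1 },
  { x: [0.1, 0.0, 0.9, 1.0], y: 1 },
];

// One weight row per class, initialized to zero.
const W = labels.map(() => new Array(DIM).fill(0));

function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function forward(x) {
  return softmax(W.map((row) => row.reduce((s, w, i) => s + w * x[i], 0)));
}

// Plain gradient descent on cross-entropy loss.
const lr = 0.5;
for (let epoch = 0; epoch < 200; epoch++) {
  for (const { x, y } of data) {
    const p = forward(x);
    for (let c = 0; c < labels.length; c++) {
      const grad = p[c] - (c === y ? 1 : 0); // dLoss/dLogit for class c
      for (let i = 0; i < DIM; i++) W[c][i] -= lr * grad * x[i];
    }
  }
}

function predict(x) {
  const p = forward(x);
  const best = p.indexOf(Math.max(...p));
  return { output: labels[best], confidence: p[best] };
}

console.log(predict([0.9, 1, 0, 0.1])); // should lean strongly "termination"
```

The point is the shape of the problem: once text is reduced to fixed-size vectors, the "model" that maps vectors to your labels is small and cheap to train, which is why the saved artifact can be a few hundred kilobytes.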

Run:

```shell
expressible distill run "Either party may terminate this Agreement at any time for any reason by providing 90 days written notice"
```

```json
{ "output": "termination-for-convenience", "confidence": 0.94 }
```
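Because every prediction carries a confidence score, a natural pattern is to act on high-confidence results and escalate the rest to a human (or an LLM fallback). A minimal sketch, assuming a result object shaped like the JSON above; the `route` function name and the 0.8 threshold are invented for illustration:

```javascript
// Trust high-confidence predictions; escalate the rest.
// The 0.8 threshold is an arbitrary example value to tune per task.
function route(result, threshold = 0.8) {
  if (result.confidence >= threshold) {
    return { action: "accept", label: result.output };
  }
  return { action: "escalate", label: null }; // hand off to a human or LLM
}

console.log(route({ output: "termination-for-convenience", confidence: 0.94 }).action); // "accept"
console.log(route({ output: "indemnification", confidence: 0.41 }).action); // "escalate"
```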

That ran locally. No API call. The text never left the machine.

Distill uses all-MiniLM-L6-v2, a sentence embedding model that runs locally. It converts text into 384-dimensional vectors that capture semantic meaning. A small two-layer neural network trained on your labeled examples learns to map those vectors to your categories. The embedding model downloads once (~80MB) and is cached. Everything after that is fully offline.

No Python. No GPU. No Docker. Just Node.js 18+.

This was important for me to document honestly.

**Works well (80–95% accuracy):**

- Topic and domain classification