Published 2026-05-08

Jankmarking: Janky Benchmarking

Local LLM performance frontier: my jankmarks with Ollama on an M5 Pro

Are janky benchmarks useful?

I put together some evals for local benchmarking. I'm mostly interested in roughly measuring quality versus speed, because I don't like waiting while my MacBook imitates a jet turbine in both noise and temperature, and I want good-enough answers locally. For anything bigger, the cloud is still the move. The problem is that my benchmarks are — to use a human em-dash and quote the kids these days — "hella jank".
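For flavor, here's a minimal sketch of what a harness like this can look like. It assumes Ollama's `/api/generate` endpoint on the default port, which returns `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in its non-streaming response; the keyword-fraction grader is a hypothetical stand-in for whatever janky scoring you actually use.

```python
# minimal janky eval harness sketch, assuming Ollama's /api/generate
# JSON response shape (eval_count tokens, eval_duration in nanoseconds)
import json
import urllib.request


def tokens_per_sec(resp: dict) -> float:
    # Ollama reports eval_count (generated tokens) and eval_duration (ns)
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def keyword_score(answer: str, keywords: list[str]) -> float:
    # jank-grade: fraction of expected keywords present in the answer
    hits = sum(1 for k in keywords if k.lower() in answer.lower())
    return hits / len(keywords)


def ask(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    # single non-streaming generation request against a local Ollama server
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)
```

Plot `keyword_score` against `tokens_per_sec` per model and you get a crude quality-versus-speed frontier, which is all a jankmark is.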

My janky benchmark questions:

These aren't very good. They're half AI generated. The summarization one is a single sentence, and it's actually backwards: it asks for a summary longer than the input sentence. (It used to be longer, but I think part of it got lost at some point when I copy-pasted. It never made it into version control...)

And yet, they're directionally correct: the better models cluster together score-wise.

For serious work, make sure your benchmarks aren't jankmarks.