Rambling horizon

Ranking Clippies

I am fascinated by how fast the capability of LLMs is advancing, and by how tremendously useful they can be. Naturally, I don’t want to be stuck with a single subscription (I am not subscribing to multiple services; my pockets are not bottomless), so I would rather have access to the state-of-the-art mainstream models. Hence I like having a ranking of model performance, and I end up using the top three models or so. This is how I rank clippies (that’s what I call LLMs, in homage to this little guy).

Today we have several services that rank clippies on different criteria: coding capability, instruction following, “reasoning”, language, data analysis, and so forth. The best known, as far as I know, are LMArena, Scale, and LiveBench. Those are the ones I use.

LMArena is the weakest of the three because it reflects user preferences: think of it as a popularity contest. The most rigorous one seems to be Scale. Each ranking has its own biases, so to minimize them I average the rankings and use whichever three models come out on top.

First I go to the following pages, in no particular order:

- LiveBench
- LMArena
- Scale’s SEAL leaderboards (Humanity’s Last Exam and TutorBench)

I take note of the rankings for the models I am interested in and fill them into some hash tables (cf. below). In the past I used the Borda method to aggregate the individual rankings into an average ranking; a sketch of that approach follows. These days I just use the prompt template below, which gives similar results and means I no longer have to open a Jupyter notebook every time.
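For the curious, here is a minimal sketch of the kind of Borda aggregation the old notebook performed. It is illustrative rather than the original code, and the `depth` parameter is my own assumption: each site ranks far more models than these tables track, so points are derived from an assumed leaderboard depth.

```python
from collections import defaultdict

def borda(rankings, depth=50):
    """Aggregate per-site rankings with a Borda count (higher is better).

    rankings: list of {model: rank} dicts; a rank of 0 (or a missing key)
    means the model is absent from that site.
    depth: assumed leaderboard depth, so rank r earns depth - r points.
    """
    scores = defaultdict(int)
    for site in rankings:
        for model, rank in site.items():
            if rank > 0:  # 0 means "not ranked on this site"
                scores[model] += max(depth - rank, 0)
    # Sort by points, descending; break ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

One side effect of Borda is that models present on more sites accumulate more points, which already bakes in a mild version of the incomplete-information penalty mentioned in the prompt below.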

## CONTEXT

Here are the rankings of several LLMs, in the form of Python hash tables, according to different websites:

LiveBench={"sonnet”:7, “opus”:2, “haiku”:16,
"gpt”:4, "gpt5chat”:47, "gpt-pro”:5, “gpt-instant”:33,
"gemini”:3, "gemini-flash”:28,
"grok":18, “grok-fast”:19,
"kimi":12,
"llama-maverick":0}

LMArena={"sonnet”:8, “opus”:3, “haiku”:49,
"gpt5chat":22, “gpt”:6,
"gemini":1, "gemini-flash”:46,
"grok":2, “grok-fast”:19,
"kimi":20, 
"llama-maverick":0}

humanity_last_exam={"sonnet":14, “opus”:3, “haiku”:0,
"gpt-pro”:2, “gpt”:3, “gpt-instant”:21,
"gemini":1, "gemini-flash":14,
"llama-maverick":22}

tutor_bench={"sonnet":10, “opus”:5, “haiku”:0,
"gpt-pro”:0, “gpt”:1, “gpt-instant”:14,
"gemini":3, "gemini-flash":0,
"llama-maverick":19}


## YOUR TASK

What is the final ranking of the models, taking an average of the rankings? 

Note that:
- Models that appear in only one ranking should be penalized for the incompleteness of the information.
- A score of 0 means the model is not present in that hash table; the same holds when the corresponding entry is missing altogether.

As of December 16, 2025, the results and average scores are:

  1. gemini (2.00)
  2. opus (3.25)
  3. gpt (3.50)
  4. gpt-pro (3.50)
  5. sonnet (9.75)
  6. grok (10.00)
  7. kimi (16.00)
  8. grok-fast (19.00)
  9. llama-maverick (20.50)
  10. gpt-instant (22.67)
  11. gemini-flash (29.33)
  12. haiku (32.50)
  13. gpt5chat (34.50)

In the above ranking, gemini means gemini-3-pro, opus is opus-4.5, gpt is gpt-5.2, and grok is grok-4.1. The top-ranking models all have max-reasoning enabled.
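To sanity-check the numbers without a clippy, here is a small Python sketch that reproduces the averaging above (same dictionaries as in the prompt; a 0 or a missing key is treated as “not ranked”):

```python
LiveBench = {"sonnet": 7, "opus": 2, "haiku": 16, "gpt": 4, "gpt5chat": 47,
             "gpt-pro": 5, "gpt-instant": 33, "gemini": 3, "gemini-flash": 28,
             "grok": 18, "grok-fast": 19, "kimi": 12, "llama-maverick": 0}
LMArena = {"sonnet": 8, "opus": 3, "haiku": 49, "gpt5chat": 22, "gpt": 6,
           "gemini": 1, "gemini-flash": 46, "grok": 2, "grok-fast": 19,
           "kimi": 20, "llama-maverick": 0}
humanity_last_exam = {"sonnet": 14, "opus": 3, "haiku": 0, "gpt-pro": 2,
                      "gpt": 3, "gpt-instant": 21, "gemini": 1,
                      "gemini-flash": 14, "llama-maverick": 22}
tutor_bench = {"sonnet": 10, "opus": 5, "haiku": 0, "gpt-pro": 0, "gpt": 1,
               "gpt-instant": 14, "gemini": 3, "gemini-flash": 0,
               "llama-maverick": 19}

tables = [LiveBench, LMArena, humanity_last_exam, tutor_bench]
models = {m for t in tables for m in t}

averages = {}
for m in sorted(models):
    ranks = [t[m] for t in tables if t.get(m, 0) > 0]  # drop "not ranked"
    if ranks:
        averages[m] = sum(ranks) / len(ranks)

# Sort by average rank (ascending); ties broken alphabetically.
for i, (m, avg) in enumerate(sorted(averages.items(),
                                    key=lambda kv: (kv[1], kv[0])), 1):
    print(f"{i:2d}. {m} ({avg:.2f})")
```

This prints the same thirteen lines as the list above. Note that it skips the single-ranking penalty, which happens to be moot here because every model appears on at least two sites.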

#ai #reference