Wordle Race at Grokathon London: A Solo Journey to 4th Place
2026-01-19 · 7 minute read · #ai-benchmarking #determinism #grokathon #xai #llm-evaluation #systems #hackathon
4th place at Grokathon London
Project: Wordle-AI-Benchmark
Team size: 1 (just me)
The Night Before
I almost didn’t go. My team had other plans. One by one, they backed out. “It’s too far.” “Something came up.” “Maybe next time.” By the night before Grokathon London, I was alone with a half-formed idea and a choice: stay home or show up solo.
I showed up.
What I Found There
Grokathon London wasn’t just another hackathon. Walking into that room changed everything. The energy was different. People weren’t there to win swag or pad resumes. They were there because they genuinely cared about building systems that matter.
I met engineers from xAI who spoke about latency the way musicians talk about timing. I watched teams debug infrastructure at 2 AM with the same focus I’d only seen in production war rooms. Someone was building a deterministic testing framework for RAG systems. Another person was optimizing inference pipelines for edge deployment.
This wasn’t a hackathon. This was heaven for people who think about systems.
And I was doing it alone.
The Idea That Wouldn’t Let Go
The question that drove me was simple but kept getting ignored everywhere I looked: when you give multiple AI models the exact same problem under identical conditions, which one actually works?
Not which one scores higher on some abstract benchmark. Not which one has the best marketing. Which one solves the problem correctly, quickly, and without waste?
I’d been frustrated with AI evaluation for months. Every benchmark felt like a performance review where everyone got to take the test in different rooms with different questions. How is that useful? How does that help anyone make real decisions?
So I built Wordle Race—a deterministic AI benchmarking platform where models compete head-to-head on the same Wordle puzzle, measured with brutal precision.
Building Alone, Learning from Everyone
Here’s what no one tells you about going solo at a hackathon: you become everyone’s teammate for five minutes at a time.
At 11 PM, someone helped me debug a race condition in my streaming pipeline. At 3 AM, I explained my scoring methodology to a team working on LLM observability, and their questions made my system better. At 6 AM, an engineer from xAI looked at my latency measurements and said, “This is what production evaluation should look like.”
I had no team, but I wasn’t alone.
The conversations I had that weekend reinforced something I’d been thinking for a while: AI evaluation should look more like production engineering and less like academic exams. The xAI CTO said something that stuck with me: “Benchmarks measure what’s easy to measure. Production cares about what matters.”
That became my north star.
What Wordle Race Actually Does
The concept is deliberately simple:
Every AI model gets:
- The same Wordle word
- The same starting state
- The same feedback after each guess
- The same six-guess limit
No advantages. No retries. No excuses.
The platform measures:
- Correctness: Did you solve it?
- Speed: Time to first token, end-to-end latency
- Efficiency: Guess count, token usage, estimated cost
Then it ranks models with brutal honesty:
- Solved vs not solved
- Total time to solve
- Number of guesses
If a model fails, it gets a closeness score based on how many letters it got right. But failure is still failure.
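To make that concrete, here's a rough sketch of the kind of result record and ranking logic involved. The field and function names are mine for illustration, not the platform's actual types:

```typescript
// Hypothetical shape of a per-model race result (illustrative, not the real types).
interface RaceResult {
  model: string;
  solved: boolean;
  guesses: number;            // guesses used, out of six
  timeToFirstTokenMs: number;
  totalTimeMs: number;        // end-to-end latency
  tokensUsed: number;
  estimatedCostUsd: number;
  closeness: number;          // weighted letter score, used when the model fails
}

// Rank: solved before unsolved, then faster total time, then fewer guesses.
// Models that never solved the word fall back to their closeness score.
function rankResults(results: RaceResult[]): RaceResult[] {
  return [...results].sort((a, b) => {
    if (a.solved !== b.solved) return a.solved ? -1 : 1;
    if (a.solved && b.solved) {
      if (a.totalTimeMs !== b.totalTimeMs) return a.totalTimeMs - b.totalTimeMs;
      return a.guesses - b.guesses;
    }
    return b.closeness - a.closeness;
  });
}
```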
The Technical Heart
I built this on principles I care about: determinism, observability, and parallelism.
The Wordle Engine
Lives in lib/wordle-engine.ts. Enforces the rules. Computes feedback. Maintains isolated state. It's strictly deterministic: the same word and the same sequence of guesses always produce the same feedback.
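Here's a minimal sketch of the kind of feedback computation it does. The names are illustrative rather than the actual exports of lib/wordle-engine.ts:

```typescript
type LetterFeedback = "green" | "yellow" | "gray";

// Deterministic Wordle feedback: two passes so duplicate letters are handled
// correctly (greens claim their letter before any yellow can).
function scoreGuess(answer: string, guess: string): LetterFeedback[] {
  const feedback: LetterFeedback[] = Array(guess.length).fill("gray");
  const remaining = new Map<string, number>();

  // Pass 1: exact matches; count the unmatched answer letters.
  for (let i = 0; i < guess.length; i++) {
    if (guess[i] === answer[i]) {
      feedback[i] = "green";
    } else {
      remaining.set(answer[i], (remaining.get(answer[i]) ?? 0) + 1);
    }
  }

  // Pass 2: present-but-misplaced letters, consuming the remaining counts.
  for (let i = 0; i < guess.length; i++) {
    const count = remaining.get(guess[i]) ?? 0;
    if (feedback[i] === "gray" && count > 0) {
      feedback[i] = "yellow";
      remaining.set(guess[i], count - 1);
    }
  }

  return feedback;
}
```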
The AI Runner
Built with the Vercel AI SDK v5. All models start simultaneously. Token-level streaming is captured, and time to first token and end-to-end latency are timestamped for every model. Results stream to clients via Server-Sent Events because I needed predictable ordering and low overhead.
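In spirit, the runner looks something like the sketch below: every model's stream starts at the same instant and gets timestamped as tokens arrive. The TokenStream type stands in for the real Vercel AI SDK call, which I've abstracted away here:

```typescript
// Stand-in for the real model call: anything that yields tokens as they stream.
type TokenStream = (prompt: string) => AsyncIterable<string>;

interface TimedRun {
  model: string;
  text: string;
  timeToFirstTokenMs: number;
  totalTimeMs: number;
}

// Start every model at the same moment and capture per-token timing.
async function raceModels(
  models: Record<string, TokenStream>,
  prompt: string,
): Promise<TimedRun[]> {
  const start = performance.now();

  return Promise.all(
    Object.entries(models).map(async ([model, stream]) => {
      let text = "";
      let firstToken = -1;

      for await (const token of stream(prompt)) {
        if (firstToken < 0) firstToken = performance.now() - start;
        text += token;
        // In the real platform, each token event is also forwarded to the
        // browser over Server-Sent Events at this point.
      }

      return {
        model,
        text,
        timeToFirstTokenMs: firstToken,
        totalTimeMs: performance.now() - start,
      };
    }),
  );
}
```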
The Scoring System
No hand-waving. No “it depends.” Just hard numbers:
- Correct letters (green): weight 3
- Present letters (yellow): weight 1
Your score reflects what you earned.
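Applied to the engine's feedback, the closeness score is just that weighted sum. A sketch, with illustrative names:

```typescript
// Same feedback shape as the engine sketch above.
type LetterFeedback = "green" | "yellow" | "gray";

// Weighted closeness: greens count 3, yellows count 1, grays count nothing.
function closenessScore(feedback: LetterFeedback[]): number {
  return feedback.reduce(
    (score, f) => score + (f === "green" ? 3 : f === "yellow" ? 1 : 0),
    0,
  );
}

// Example: answer CRANE, guess CARGO -> one green (C) and two yellows (A, R) = 3 + 1 + 1 = 5.
```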
Why This Mattered at Grokathon
Modern AI benchmarks have three fatal flaws:
- Non-identical conditions: Models get different prompts, different retries, different advantages
- Latency blindness: Speed is ignored despite being critical in real systems
- Score compression: Single numbers hide variance, failure modes, and waste
In production, none of this flies. Developers need to know: Does it work? How fast? What does it cost?
Grokathon wasn’t about flashy demos. It was about systems that operate under real constraints. My project treated AI as infrastructure:
- Latency is a first-class metric
- Determinism is non-negotiable
- Evaluation mirrors real usage patterns
When I presented to the judges, I didn’t show them a dashboard. I showed them a race where every millisecond mattered and every token was counted.
They got it.
4th Place
When they announced the winners, I’ll be honest—I wasn’t expecting to place at all. I was the solo dev in a room full of teams. I’d spent half the weekend debugging streaming edge cases while everyone else seemed to have it figured out.
4th place out of every team there.
Not because I had the fanciest UI. Not because I used the newest framework. Because I built something that solved a real problem the right way.
The xAI team pulled me aside afterward. We talked about production evaluation, about measuring what matters, about building systems that don’t lie to you. One engineer said, “We need more tools like this.”
That meant more to me than the placement.
What I Learned
Going solo taught me things a team never could:
Latency variance matters more than averages. You can’t average away a bad user experience.
Streaming changes perceived intelligence. When results appear token-by-token, people trust the system more, even if total time is the same.
Determinism is harder than accuracy. It’s easy to get lucky once. It’s hard to be reliable every time.
People want tools that respect them. No one wants to be lied to by benchmarks. They want truth, even when it’s uncomfortable.
The best teams aren’t always the biggest. Sometimes it’s just you, a good idea, and people willing to help.
The Tech Stack
Built with tools I trust:
- Framework: Next.js 16
- Language: TypeScript
- AI: Vercel AI SDK v5 with AI Gateway
- Streaming: Server-Sent Events
- Styling: Tailwind CSS v4
- UI: shadcn/ui
- Deployment: Vercel Edge Runtime
Nothing fancy. Just solid engineering.
What’s Next
This is just the beginning. I’m planning:
- Batch evaluation across large word sets
- Historical leaderboards to track model evolution
- Variance and stability analysis over time
- Cost-versus-accuracy dashboards for real decision-making
- Domain-specific word lists for specialized testing
But more than features, I want to prove a point: AI evaluation can be honest, practical, and useful.
To Everyone at Grokathon London
Thank you.
To the xAI team who took time to discuss deterministic evaluation with a solo dev at 2 AM—you validated ideas I’d been carrying alone for months.
To the teams who shared debugging tips, asked hard questions, and made me think deeper—you made this project better.
To the organizers who created a space where systems thinking mattered more than polish—you built something special.
And to my team that never showed up—thank you for teaching me I didn’t need you. Sometimes the people who let you down give you the greatest gift: the chance to prove you were enough all along.
Final Thoughts
Wordle Race isn’t just a benchmarking tool. It’s proof that you can build something meaningful when you care more about truth than optics.
When you strip away the marketing, the hype, and the vanity metrics, you’re left with a simple question: Does it work?
This project answers that question honestly. No hand-waving. No asterisks. No excuses.
The code is open-source. The methodology is reproducible. The results speak for themselves.
If you care about building AI systems that matter, about measuring what’s real instead of what’s easy, about treating evaluation like engineering instead of theater—this is for you.
4th place at Grokathon London.
Solo.
Worth it.
Author: Rajnikant Dhar Dwivedi
Event: Grokathon London
License: MIT