My "30-minute rule" for LLM coding agents
This one is simple: as long as LLM agents still need well-specified problems and my focus every few minutes, they should save me 30 minutes over the next-best alternative.
TL;DR: To recoup my lost flow state, LLM agents should save me — at minimum — 30 minutes of work per task over what Cursor provides.
A number of coding agents have been released since the beginning of the year. If you’re curious, you can check out some of the announcements here:
Jules is Google’s effort.
Codex is OpenAI’s effort.
Copilot agent mode is GitHub/Microsoft’s effort.
Out of curiosity, I tried Jules this weekend. I gave it two changes:
Update a dependency in my Discord chatbot that my friends and I use.
Deprecate voting functionality in the chatbot, since I noticed that Discord finally added polls last year.
Updating the dependency went fine. I could have done it faster myself, but Jules also reran the tests for me. My only intervention was testing the change against my test account. A promising start!
The second task went poorly. Part of this was my own inexperience with these agentic workflows; for example, I told it to “remove” the feature and it just commented everything out. After it was done, it asserted that it had run the tests. While reading the code, I suspected it hadn’t. I asked it to run them again. It gave me a passive-aggressive remark but complied. Sure enough, the tests were failing, and it spun its wheels trying to fix them.
After 15 minutes, I started worrying that Jules wouldn’t finish during my kid’s naptime. So I fired up Visual Studio Code and started racing Jules. I didn’t have any LLM augmentation, but it was mostly deletion so I didn’t need it. 20 minutes later, I was testing my finished change. Jules was still spinning its wheels.
I don’t want to overextrapolate from this example. Jules is in beta, I’m not used to these longer-lived agentic workflows, and I also intentionally underspecified the task to see how it would handle it.
At the same time, I’m disappointed that it failed as hard as it did. It has access to the full commit history of the project. It can actually go and see how the feature was added, and what the project looked like before it existed. If the task was underspecified, it could have told me that! Instead, it made little progress beyond its initial salvo. I want it to behave like a human who is capable of reasoning and communicating its needs.
In that 15 minutes, I had a lot of time1 to reflect on the workflow. It required a lot of interaction. It would work for 2 or 6 or 10 minutes and then show me its progress. I needed to keep checking the tab to see whether it was done. By doing this, I learned that Jules threatens to auto-approve its own plans if you don’t respond quickly enough.
How am I supposed to get work done during that time? What engineer is most productive in the 2 and 6 and 10 minutes between their interruptions?
This leads me to my “30-minute rule” for LLM agents: in order to recoup my lost focus, an agent needs to save me at least 30 minutes over what Cursor gives me with prompting.
Why 30 minutes? Because I lose time writing the task with more exacting precision than I would for a human, I lose more time handling the questions, corrections, and reviews, and I spend 10-20 minutes getting into a flow state on another task before coming back to review what the agent did. Let’s give the LLM some leeway, since you can run multiple agents at once. Let’s say that I break even on the agent when it saves me 30 minutes over what I did before.
Let’s pretend that Jules had completed the task within the 35 minutes instead of failing. In comparison, it took me only 20 minutes, and I was alt-tabbing back to Jules constantly. I lost 15 minutes because I used the agent. In reality, I would need to run 4 Jules instances in parallel on 4 different problems, and have all of them solve their problems instantly and correctly, just to come out ahead.
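To make this arithmetic concrete, here’s a back-of-envelope sketch of the break-even math. The exact split of the 30 minutes across the overhead buckets is an illustrative assumption on my part; only the totals come from the estimates above:

```python
# Back-of-envelope math behind the 30-minute rule.
# Every number is a rough personal estimate in minutes, not a measurement.

spec_overhead = 5          # writing the task more precisely than a human needs
interaction_overhead = 10  # handling questions, corrections, and reviews
flow_reentry = 15          # 10-20 minutes to get back into a flow state

# The overhead I pay for delegating at all, regardless of outcome.
break_even = spec_overhead + interaction_overhead + flow_reentry
print(f"An agent must save at least ~{break_even} minutes to break even.")

# The Jules race above, pretending it had finished in 35 minutes:
my_time = 20     # doing the change myself in VS Code
agent_time = 35  # hypothetical Jules completion time
print(f"Minutes saved by delegating: {my_time - agent_time}")  # -15: a net loss
```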
This also made me reflect on how the Jules experience stacks up against human engineers. I like working with other engineers, and I’m happy to specify a task to the level an engineer needs, handle their questions, and review their code. Why does it feel worse with the LLM? Because my coworkers are effective engineers and they save me time. When I needed a benchmarking suite last year and gave it to a senior engineer? Man, you should have seen what he wrote. It was great and saved me a ton of time. I basically told him, “We need to benchmark this thing under load. Here’s what I’ve done so far. Ideally it would have similar characteristics to our production load.” And he knocked it out of the park. Jules would never have made progress on that task.
When you’re giving a task to an engineer, you need to adjust the task to their level. A senior engineer might be able to convert a well-scoped business objective into an engineering project, but a junior engineer might never finish that project. So you’d break it down for them.
What kind of scope and specification does each engineering level require?
Here’s how I imagine each classic engineering level should be able to interact with scope and specification. This is an oversimplification, but it suffices here.
Intern: They can handle a well-specified problem with minimal scope. You can scale the number of hints they get, depending on their experience.
Junior engineer: They can handle a well-specified problem with small scope. They may also need some hints and pointers to get started.
Senior engineer: They can handle a scoped problem that is underspecified.
Staff engineer: Give them an unscoped and underspecified problem.
Beyond staff: You’ve evolved beyond this. You identify the problems. You prioritize the problems. You create the problems. You are the problem.
There’s also the contractor dimension. Here are a few archetypes that I’ve run into over the years, among contractors who can actually complete their assignments:
Brute-force contractor: They somehow follow the contract to the letter. What about the spirit? Only if you have a spirit clause in the contract. Is there a logical contradiction? Is the task impossible? They’ll torture the wording to produce a deliverable regardless. Your codebase will be worse than it was before they started.
Long-term relationship contractor: This is the contractor that tries to understand the business needs and guides each contract back to sanity. These are the types of contractors that eventually become full-time employees2.
I’m trying to save your business contractor: This is the true domain expert. They attend the conference every year. They explain how your project fits into the greater ecosystem, how you’re failing, and the changes that you must make. This archetype is unpopular because not everyone wants to hear what they have to say, and not every contractor is talented enough to pull it off. But these people are worth their weight in gold.
I would be tempted to rate my interactions with Jules at the level of an intern, but that’s an insult to interns. First of all, you kinda want your intern bothering you every 2 or 6 or 10 minutes if they legitimately have a question. Investing in your interns pays dividends, since (a) they become more capable engineers, and (b) it is one of the best recruiting strategies imaginable. So when your intern asks you a question every 6 or 10 minutes at first, you are simply doing your job as a host by responding to them. Your focus should be on them. They will grow, the questions will get further and further apart, and you will need to direct them less and less.
On the other hand, the LLM doesn’t grow from this3. My experience with Jules was more along the lines of the “brute-force contractor”: it did everything it could to complete the project to the letter of what I wrote. And in this situation, it is a huge loss to have to answer questions every 6 or 10 minutes. The LLM isn’t growing from the conversation, and my project isn’t moving forward.
1. About 15 minutes.
2. If they survive for years without the company terminating their contract for the most petty reason imaginable.
3. At least, not enough.