> Claude Code navigates a codebase the way a software engineer would: it traverses the file system, reads files, uses grep to find exactly what it needs, and follows references across the codebase. It operates locally on the developer's machine and doesn't require a codebase index to be built, maintained, or uploaded to a server...
> Agentic search avoids those failure modes. There's no embedding pipeline or centralized index to maintain as thousands of engineers commit new code. Each developer's instance works from the live codebase.
The frame of "the way a software engineer would" and the conclusion seem at odds. I'd love to be schooled otherwise?
I use autocomplete/LSPs all the time and they're useful. That's an index? Why wouldn't Claude be able to use one? Also a "software engineer" remembers the codebase - that's definitely a RAG. I have a lot of muscle memory to find the file I need through an auto-completed CMD+P.
It doesn't need to particularly be real-time across thousands of engineers -- just the branch I'm on.
It's rare that I'd be navigating a codebase from first-principles traversal. It would usually be a new codebase and in those cases it's definitely not what I'd call an optimal experience.
It works exactly the way I'd work. I learned to navigate large codebases before LSPs existed. I used vim for many years and would grep to find the relevant files. When I first tried Claude Code last year, I was like WTF, it's doing exactly what I'd be doing.
> Claude Code is running in production across multi-million-line monorepos, decades-old legacy systems, distributed architectures spanning dozens of repositories (…)
So it is optimized for the general case, using robust tooling that works everywhere, especially when large & messy.
That being said, your remark is right, and for well-organised smaller repos there's better tooling it can and should use. But I think it does; at least Codex does in my case, so I guess Claude does it too. For example, Codex uses `go doc` first before doing greps.
But the general use case is not the most efficient for a greenfield codebase that is to be fully managed by an agentic system. It is built to be good around the scaffold (programming like humans) and not the actual problem space.
Anthropic's target should be a codebase designed for agentic comprehension from the first commit. Here the codebase adapts to the agent. You can enforce conventions, structured metadata, semantic indexing, explicit dependency graphs. Whatever makes the agent's job trivial rather than heroic.
The large majority of coding is maintenance work, not greenfield development. Even if you are doing greenfield development, it won't be long before it is maintenance.
Even if there is first-principles traversal of some parts of the codebase, there are other bits that definitely do not change, and where exploring every time is a massive waste of tokens. My arguments with claude often have to do with making it explore a lot less, because I know better, and faster, than its slow, expensive navigation of things that basically never change. And it just goes into the same kind of rabbit holes every time.
I still think the best process with Claude Code is: 1) ask it to gather context that you know is relevant 2) only then ask it to do whatever you want it to do. If you do it the other way around, it will over research, over think and generally make more of a mess.
That's the question, innit? Dumped into a codebase and given a ticket, what's the fastest way to get your bearings and do the ticket? It's gonna depend on the codebase and the ticket, but it would be an interesting contest to see what tools people have. Some form of grep, sped up using an index, is going to get a skilled operator pretty far, but more complex tools for more complex tickets, eg fix something subtle, like a bug that only manifests on Tuesdays in 2% of requests from Poland, I imagine more advanced tools would help the programmer figure it out faster.
Simple: it eats up to 35% of the five-hour usage limit in the first prompt, even on small projects, and then there's a 5-minute timeout for you to respond quickly or the caches go bust and you'll pay another 12% to 15% on the next prompt.
The linked article explains how to avoid this. If you naively turn it loose on a big code base, yes, you'll burn a lot of tokens while it tries to find stuff.
This is such a shame, finding where stuff is in a large codebase is my number 1 use for LLM. I hate it that it relies on grep so much, I can do grep better and faster myself.
If I set a regular expression as a watcher on a filesystem to notify me if any file changes, and I write that in Go, and assuming neither the regular expression nor its implementation is buggy, and then I write rules in a file (as regex), then there's a snowball's chance in hell that it would misnotify or miscategorize anything.
Are LLMs that super reliable in their output already with all the guardrails around?
Don't think so. Hence it is snake oil just like dozens of harnesses.
It might behave differently than specified and a human is required to validate every output carefully or else.
> Are LLMs that super reliable in their output already with all the guardrails around?
Well, what is your definition of "super reliable in the output", and is it a quantifiable/measurable target or just a feeling?
Is it "more than humans", "more than senior developers", "almost perfect", "perfect"?
> It might behave differently than specified and a human is required to validate every output carefully or else.
Sure, just like meatbag developers. All the security flaws AI finds today were introduced years/decades ago by humans and haven't been found (that we know) by humans in ages.
Think of the output of the following program:
```c
#include <stdio.h>
#define MAX_COUNT 10000

int main(void) {
    printf("I'll count up to %d\n", MAX_COUNT);
    for (int i = 1; i <= MAX_COUNT; i++)
        printf("I'm now counting %d\n", i);
}
```
And of the following prompt:
```
You'll count to 10,000. At the start say "I'll count up to 10,000" and then for each number say "I'm now counting <number>" and do not say anything else. Do not miss numbers in between.
```
Which one is going to produce 100% correct results out of a 10,000 run of each?
Now don't give me "these are different tools". We all know. I'm talking about reliability and predictability.
Just an anecdote: I was designing a project for LLM onboarding and orchestration. Claude chose to read only the first 40 lines of each file. Later, in another session, looking for causes of the low-quality result, Claude detected the fault and changed the code to perform an AST analysis, so now the analyzer takes documentation lines and function signatures (input/output) as input.
Claude's initial approach was really poor.
One has to wonder how many times Claude code has to be modified/reviewed for improvement, or whether it is possible at all to make good code with it.
Edit (generalization):
Claude can fix a localized, identifiable poor decision (e.g., "only reading first 40 lines") because the fault is discrete and traceable to one piece of code.
But real software quality problems often arise from many small, individually reasonable decisions that collectively produce bad outcomes. No single one is obviously "the fault." In that scenario, a tool that generates low-quality building blocks piecemeal may never converge on good code, because each piece seems fine in isolation.
I think it's taught to look at source code through a peephole for the sake of context preservation, but I feel like this could be a good use-case for some sub-logic or even a full sub-agent. Like, here sub-agent, you skim that file and tell me a summary, and highlight any areas related to X and Y so that I can look at them in my main context. You can also periodically observe the main work stream and interrupt me if you realise that something in the file you're thinking about is relevant to what I'm working on or might change the direction of what I'm doing.
> I think it's taught to look at source code through a peephole for the sake of context preservation
Yes, to a (real) fault. Fewer than one in fifty times that it ignores an instruction or piece of data in a file has it actually seen that instruction or data before ignoring it. The other times, it's done this sampling nonsense.
Results are night and day using the 1M token models and reading the full files.
I think what you suggest is like a local second-order approximation, and that can help. But I think the real problem is a global one; it's about architectural taste, how the many local pieces interact, and their friction. Currently that demands human expertise.
Why can't Claude Code generate effective harness for us by inspecting the code base?
I tried defining CLAUDE.md (or AGENTS.md), skills, plugins, but I'm not getting the effectiveness others claim. With the LSP plugin, for example, CC doesn't use the LSP's symbol renaming and edits files one by one slowly, or it doesn't invoke the skill when I explicitly ask it to remember to invoke it when the prompt contains a specific clue.
Am I using it wrong? Is there a robust example I can copy the harness?
Harnesses do fix it IMO - it's why Claude Code and Codex had a massive jump in alleged productivity on release and then seem to have flatlined. But a custom harness _would_ allow you to do things like "on every message, run lint validation and tests". That in and of itself would be wildly useful.
Honestly - I think it's because it goes against the "vibe" part of the tooling - why do you care what the code looks like as long as when you run it it does what you want it to do?
I stopped using `/init` and having CLAUDE|AGENTS.md files that explained the codebase. The only thing I kept was how it should explore the codebase and use `git log` when researching, which is probably redundant too. I can't figure it out either.
The codebase I work on is roughly 100k LOC so idk if it is considered large. Personally it's the largest repo I have worked on.
What seems to work in some cases are hooks with scripts that feed into the context window (I've had to strip out some of the unnecessary linter messaging to limit context). Linters and/or other language specific checkers that can be installed via OS package repository and called via script. Also, the model + skill context together could make a difference. Skills that "worked" on 4.6 may not work as well on 4.7, which seems to require more explicit direction, but is more reliable by comparison to 4.6. Updating skills might help too. Test and run before/after to check. CC also injects unnecessary tool calls into context, so you may need to suppress tasks if you're a beads fan for example.
- runs the failing test | grep "x|failing" | tail -n 10
- runs the test again to get the why-it's-failing message | tail -n 10
- runs the test again because tail -n 10 cut off the message
every fucking time.
I have a skill for it to not do that: save the output of whatever test you run to a file, then read from the file using whatever commands you want. Ignores the skill. It's maddening. It's as if, puts on tinfoil hat, it's designed to waste your tokens, while eventually accomplishing its task
> That also includes codebases running on languages that teams don't always associate with AI coding tools, such as C, C++, C#, Java, PHP.
What a strange comment for them to make. Why wouldn't I expect CC to work well with those languages? What languages would I associate it with? Python and Javascript?
How very interesting. In an industry where things shift around in months if not weeks, there's been not only enough time for clear patterns to emerge, but also for these patterns to prove successful on large codebases. What's the success criteria? Didn't delete the production database? Team velocity has increased? Codebase TTL has increased? Operations guys are happier?
I still say if this happens to you with AI tooling, that's both a failure on you and your org for giving a developer prod credentials that could nuke production resources. I don't think I've worked in a place that gave me this level of blind access.
I have only worked in startups and I have been an early engineer in both of them. I would always get high privileges within a short time where I would have the access to create and delete resources. I don't think it's that uncommon.
Sounds like they're still giving the model the keys to the kingdom, which is my point: stop giving the model an avenue to make catastrophic mistakes. It makes no sense.
If your message is in response to me, which I think it is: I deliberately don't give it access to credentials and env variables. I've worked to create restrictions and seen AI models use very interesting methods to bypass them.
Even now my prompt says the AI must verify the path of the files it intends to edit, and must edit one file at a time, only after getting permission. I stop it from ignoring those rules at least once a day.
We kinda need to architect things with the assumption that all token-output from an LLM can be unpredictably sneaky and malicious.
Alas, humans suck at constant vigilance, we're built to avoid it whenever possible, so a "reverse centaur" future of "do what the AI says but only if you see it's good" is going to suck.
I built my own IDE to replace vscode / cursor so I could design the harness and ensure that the model tool access was secure and limited. But the rest of the industry is YOLO
The problem with agents is they regularly sidestep the guardrails and do what they want with a script anyway. The number of times I've seen Claude try to escape the folder it's working in, and then write a python script that does exactly what I told it it's not allowed to do, supports that.
If you use SSO and have an AWS config that Claude is allowed to see to get the correct role in the first place, it will just pick the role and plough on anyway.
And this is why it is the height of irresponsibility to run LLMs on your system. We know they are unreliable and just make things up; it's extremely foolish to go "yeah I'm going to let that run commands".
It's not _really_ any different to running an undocumented third party binary. Is it the height of irresponsibility to run Windows, or VSCode, or Spotify?
I think the model we've got now is wrong, and the harnesses should be OS-level sandboxed, and the agents should be running in harness managed sandboxes.
The first thing I do for any meaningful side project is set up RDS with snapshots. So any startup that doesn't do this one basic step already deserves to fail, in my opinion.
Next, I've used AI agents like crazy; we even have linked MCP servers that let them query the dev database. Haven't seen one try deleting everything a single time. I haven't seen any agent try to do anything destructive. Ever. Perhaps it's just reflecting an outrageously bad engineer and nothing else.
Exactly. So is that level of obvious hygiene where the bar is, or is it somewhere else? What ticks me off is the audacity of blanket claims without even an attempt to state why this is a list of successful patterns and what success means. We're just supposed to eat it up, because, you know, Claude.
I wonder if Anthropic tested their claims on pro, 5x, and 20x subscriptions. When you have an infinite amount of free tokens it sure makes sense; you just throw tokens at the problem. But in limited-usage scenarios it doesn't fly far.
The fishing: 1) install the official `skill-creator`; 2) use that with the above link to create `claude-md-improver`; 3) improve the skill by tasking claude with researching the topic of `progressive-disclosure` in the official docs; 4) point the new skill at your CLAUDE.md file and accept the changes
What I'm curious about is how well LLMs do when they create something from scratch, because so far my experience has been letting them fix issues or add features in an existing codebase where I already shaped the general architecture and put in a lot of guardrails. But what if the architecture is unclear and there is nothing letting the agent know whether a change breaks something or not?
My only experience with a tiny codebase where it did a lot of scaffolding was poor - it did what I asked for, not what I needed. If I had done more of the thinking myself I would have realised it's code that works but doesn't solve the problem I'm after.
Disagree, but also what do you classify as local storage? Does the repo "size" include all projects or just one? What about multiple branches? How much capacity is local storage?
A stock Unreal Engine project is several hundred gigs, consists of multiple solutions, multiple languages, and I would classify as large personally.
Without some kind of indexing itâs very awkward to work with and very slow. To work with LLMs and Unreal projects we create a local index, that index file alone is 46GB.
Without distributed compilers and caches it can take multiple hours to compile the main solution per platform (usually PC, Linux, Xbox, PlayStation, Switch, and sometimes mobile).
So the codebase easily fits on local storage so long as you don't count assets (those are several TB), and especially source assets (10s of TB), and that's per stream per large project.
Anyways, point is I disagree and think Unreal Engine is an example of large codebase that fits locally.
How did they even manage to generate a terabyte sized repo, that's crazy. Do they have something written up on how it's structured and why they'd even go that route?
Probably, but you want to version control assets too.
People usually mention git-lfs at this point, but that is always annoying to use in practice. There is also shallow-clones and sparse-checkouts, but these only mitigate the problem as there is no way around cloning at least one revision completely with git.
I don't have any LSPs hooked up to CC yet (going to fix that today), or particularly sophisticated CLAUDE.md files.
So, if I've read this post correctly, that means that CC is navigating my codebase today by sending lots of it up to a model, and building an understanding. Is that correct? Did I misunderstand it?
I kinda suspected there was more local inference going on somehow -- partly because the iteration times are fairly fast.
I think that's correct. Which is kinda funny, I remember 10y ago that I was heavily relying on IntelliJ features to understand new codebases (jump to definition, find all usages of a function, navigate from SQL to the table in database tab etc.).
It turns out, that for a machine, find and grep is all that's required.
Agents use find+grep because it's available everywhere without any configuration, but they would still be more efficient with an LSP. Once LSPs are more easily configurable for agents, they will use them.
A human could get by with just find and grep too. And in both cases, find and grep will be slower and less precise than an IDE's code navigation features.
Wondering if enterprises have a modified version of CC that doesn't have to optimize to stop the bleeding on fixed-cost subscription plans.
The article really does not align with the current sentiment. Everyone with a choice has mostly moved on to codex (ofc in this world all it takes is a model update/harness update to turn things around).
CC is great at a lot of things, but repeatedly misses out reading on crucial parts of the code base, hallucinates on the work that was done and a bunch of other issues.
The influencer economy trades on hype, on frenzy, and ultimately, eyeballs. The more the better.
They want you to feel like you're missing out. They want you to switch. Being boring is far more productive. Pin your versions. Stick to stable releases and avoid the nightlies.
Significant noise created by the 4.6 to 4.7 Opus transition caused some to interpret it as signal. Excluding certain genuine and real bugs, the noise about perceived quality falling dramatically was just that: noise. Influencers doing influencing turned it into "signal". The reality was that if you had strong planning and spec-driven development, it ranged from manageable to non-existent.
The vast majority of the people I know and work with have not switched off CC or their Max sub.
I have a choice and have not moved to codex (100/mo personal + my employer pays for a subscription). I try codex here and there and it seems to go off the rails every time. I have had some good experiences with codex, but generally trying to get something big accomplished it doesn't work out.
But I may not have paid enough to get the full real experience with codex
I use codex at home for 20 bucks a month; the limits are very high relative to the price. Maybe the gravy train ends soon for these, and then it's probably on to OpenRouter Chinese models.
At work it's CC or sometimes codex; personally I don't see much difference at all, and most normies will notice none. The cultists have their opinions.
What bleeding? Anthropic wants as much of that "bleeding" as possible. The interaction data gathered from genuine human CC subscription usage of their models goes directly into their RL training, it's invaluable and they are more than happy to lose money on the inference to get it. That data is what xAI was recently willing to pay $10b to cursor to get.
They want you to use Claude Code. They hate other UI surfaces like OpenCode etc purely because they lose control over that data, so they're subsidizing the inference without getting what they actually want, the data (they still get some of it of course, but it's much less ergonomic for them. Those tools often abstract away the subagent calls, for example). OpenCode can collect that data themselves, so by allowing subscription there, Anthropic sees itself as subsidizing another org getting that data. Hard no.
And tools like OpenClaw are useless because they're mechanical and don't represent actual users interacting with the service - again, subsidizing but not getting the reward.
It's all very simple once you understand their motivations.
You must be using a different CC. Or what they're writing here is correct, and it's all down to the CLAUDE.md file, and I only occasionally have to yell at claude.
Hmm please share more. I have had the max CC sub since it came out. Religiously follow all of Boris/Cats advice but still struggle with it. Meanwhile a really badly written AGENTS.md will still get the work done.
I find that most "techniques" are basically user hallucinations. Simple plan-write-refactor loops and trivial CLAUDE/AGENTS.md files, generated by the harness itself, work nicely. Maaaaaaaaaybe write a skill or two, but usually it's better to just write a script.
I think it's a good rule of thumb that if you find yourself saying everyone prefers this model or that model you're in a bubble. I've made this mistake before, I used to go around saying everyone knew Claude was the only model for serious professional use, but I was wrong.
I always assume that people making those comments on HN are trying to convince others to switch to their model. Surely no one actually believes their friend circle is a representative sample of the hundreds of millions of people that use these LLMs?
Btw the guy in charge of that stuff for Anthropic is the same guy who said GPT 2 was too dangerous to release, Jack Clark. LMAO. That model could barely string a sentence together.
Interesting that MCP was mentioned over CLI. For production or controlled environments, I would not make MCP the deployment path. I would let MCP help generate or choose commands, but have the actual deployment go through CLI scripts, Git commits, and CI/CD approval.
I'm super interested to know what the back and forth between models and tools really looks like in practice.
Are there any much more detailed walkthroughs of how it works and how it decides the tools to use and the grep to use etc and what the conversations actually look like?
In the UI you see just enough to know it's doing something, but you don't really see the jumps it's making offscreen.
There's a bunch of stuff I include, depending on the project. Some general ones are commenting style and coding standards. In theory it should be able to do it without that by looking at the repo style, but I haven't found that to be the case (especially with overly verbose/repetitive comments).
A specific example in another project is the testing/verification procedure. It's a wasm/WebGPU and the test harness is fairly complex. There are scripts to handle it, but by default Claude will churn for a while to figure it out and sometimes just give up. It definitely saves a lot of tokens/speeds things up.
The tokens it uses up clarifying can be saved, and it's often good to write out intentions. For instance, you may be mid-process on cleaning up some architectural pattern, and giving it guidance about where to find docs to follow, etc, are very project-specific.
>I mean: If there was something you could add to the prompt to consistently increase performance why isn't it in the system prompt already?
I think about this a lot. So far I think we are mostly just being gaslit. That we can influence the AI to be better with a few encouraging words and role playing, actually seems absurd. Maybe there is some element of randomness introduced there or something. All these extra MD files don't seem to do nearly as much for results as people believe they do.
Lots of concepts. Release the harness that made it possible to port Bun to Rust in 9 days. That's what everyone really wants. Then everyone can go "do that but for this other goal".
There you have a verifier though. As in you have test cases (which are written in JS and thus do not need to be translated). The moment you have a verifier signal LLMs become extremely reliable. Now of course they can reward hack your test cases but in a large codebase with many tests it becomes the only small thing you have to worry about.
This is really a zero information blog post. I want to know how they use the LSP to improve their understanding of the code base. Would be great if it was open source for us to review.
A post like this should be providing people with some reassurance about Claude's ability to understand code at a large scale. It's mostly fluff.
Edit: so I did some googling to dig around for thoughts on LSP performance and integration. The author of bun has a tweet saying that LSPs are a big drag on performance for no real gain, and virtually all of the replies agree. Anyone else have any experience/thoughts?
This is already the case for many startups. In fact, the figure might be closer to 100%. The work shifts to requirements analysis, high-level specifications, and final review instead (after AI code review).
"AI will take over almost all the work of software engineers (SWEs) end - to - end in just 6 - 12 months!"
What you describe is >50% of the job of SWEs, even when they write all code by hand.
Are you saying that "for many start-ups" this isn't done by SWEs but by some other career type, or are you implying that it's just the code writing (and first review) that is replaced by AI?
I have watched Dario's interview at WEF referred to in the article and I am quite certain Dario didn't say that. He talked about AI automating most coding already or soon, not software engineering as a whole.
He did say a few months later in an interview in India that AI will eventually take over most of SWE tasks.
---
My statement on startups is largely about automating coding by SWEs. My startup also uses AI to automate part of technical specifications and code review but I am not sure how widespread that is.
Yeah I'm working on one of those now that a 3rd-party vendor cranked out for us. I spent all day ripping out an endpoint that did 98% of what another endpoint did and should never have existed. I also ripped out 80 lines of code that looked like this:
```
const sqlStatement = (!params.mostRecentOnly)
    ? {giant SQL statement}
    : {identical giant SQL statement + 'LIMIT 1' at the end}
```
AI never met a problem that can't be solved with more code. Need some data in a slightly different structure? Don't try to modify an existing endpoint, just build a new one! Need to access a field that's buried in a JSON object in the database? Just create a new column, but don't bother removing the field from the JSON object. The more sources of truth, the merrier! When it comes time to update, just write more code to update the field everywhere it lives!
Factor out the extra sources of truth you say? Good luck scanning the most verbose front-end you've ever seen to make sure nothing is looking at the source you want to remove. In the beginning of big projects, you have to be absolutely ruthless about keeping complexity down so it doesn't get out of control later. AI is terrible at keeping complexity down.
My goal is to halve the lines of code from what the vendor turned over to us. One baby step at a time.
That is a skill issue though. I have rules for my agents to write compositional, reusable, modular, small files and to avoid any sort of boilerplate: being config-driven, keeping a single source of truth, having other agents review that the rules are followed, and so on. Any API or UI or other entry point is very light, just proxying to the modular logic, so that logic can be reused by any entry point easily.
UI components are always presentational only, with logic abstracted modularly, etc...
How do you make it so that the model doesn't forget to follow those rules and skills? How do you make it actually understand the architecture and constraints? You can't, current models don't work that way to make it happen.
Can you share your rules and some of the example PRs that it auto generates and reviews?
The number of times I've seen Claude say "this test was failing already so is ignored" when it _wasn't_, despite me telling it to never do that, makes me doubt it.
I mean, since Opus 4.6 came out, that rings more and more true. You still have to babysit the output, do some planning and be proactive about ways to do things better… but 80-90% isn't out of the question if you're in the domains that are well represented in the training data, e.g. if you're writing a lot of CRUD functionality as a web dev.
Companies will definitely expect devs to ship more with the same headcount, and oftentimes either won't hire juniors to train them up or will straight up do layoffs, sometimes with the AI just being a convenient scapegoat. We kind of can't ignore that either; sure, a lot of those companies will be shooting themselves in the foot, but livelihoods will be impacted a bunch.
The answer to the question above (why not use an index?) is in the introduction:
> Claude Code is running in production across multi-million-line monorepos, decades-old legacy systems, distributed architectures spanning dozens of repositories (…)
So it is optimized for the general case, using robust tooling that works everywhere, especially when large & messy.
That being said, your remark is right, and for well-organised smaller repos there's better tooling it can and should use. But I think it does; at least Codex does in my case, so I guess Claude does too. For example, Codex uses `go doc` first before doing greps.
But the general use case is not the most efficient one for a greenfield codebase that is to be fully managed by an agentic system. It is built to work well around the scaffold (programming like humans do), not around the actual problem space.
Anthropic's target should be a codebase designed for agentic comprehension from the first commit. Here the codebase adapts to the agent. You can enforce conventions, structured metadata, semantic indexing, explicit dependency graphs. Whatever makes the agent's job trivial rather than heroic.
The large majority of coding is maintenance work, not greenfield development. Even if you are doing greenfield development, it won't be long before it is maintenance.
> So it is optimized for the general case, using robust tooling that works everywhere
Where "robust tooling" is "grep with various regexes while completely missing the big picture even in small codebases"
Even if there is first-principles traversal of some parts of the codebase, there are other bits that definitely do not change, and where exploring every time is a massive waste of tokens. My arguments with Claude often have to do with making it explore a lot less, because I know better, and faster, than its slow, expensive navigation of things that basically never change. And it goes into the same kind of rabbit holes every time.
I still think the best process with Claude Code is: 1) ask it to gather context that you know is relevant 2) only then ask it to do whatever you want it to do. If you do it the other way around, it will over research, over think and generally make more of a mess.
The article does have an entire paragraph about LSPs and how Claude can use them.
That's the question, innit? Dumped into a codebase and given a ticket, what's the fastest way to get your bearings and do the ticket? It's gonna depend on the codebase and the ticket, but it would be an interesting contest to see what tools people have. Some form of grep, sped up using an index, is going to get a skilled operator pretty far, but more complex tools for more complex tickets, eg fix something subtle, like a bug that only manifests on Tuesdays in 2% of requests from Poland, I imagine more advanced tools would help the programmer figure it out faster.
> How claud code works in large codebases?
Simple - it eats up to 35% of the five-hour usage limit on the first prompt, even on small projects, and then there's a 5-minute timeout for you to respond quickly, or the caches go bust and you'll pay another 12% to 15% on the next prompt.
The article linked explains how to avoid this. If you naively turn it loose on a big code base, yes, you'll burn a lot of tokens while it tries to find stuff.
This is such a shame, finding where stuff is in a large codebase is my number 1 use for LLM. I hate it that it relies on grep so much, I can do grep better and faster myself.
If I set a regular expression as a watcher on a filesystem to notify me when any file changes, write that in Go, assume the regular expression and its implementation aren't buggy, and then write rules in a file (as regexes), there isn't a snowball's chance in hell that it would misnotify or miscategorize anything.
Are LLMs that super reliable in their output already with all the guardrails around?
Don't think so. Hence it is snake oil just like dozens of harnesses.
It might behave differently than specified and a human is required to validate every output carefully or else.
> Are LLMs that super reliable in their output already with all the guardrails around?
Well, what is your definition of "super reliable in the output", and is it a quantifiable/measurable target or just a feeling?
Is it "more than humans", "more than senior developers", "almost perfect", "perfect"?
> It might behave differently than specified and a human is required to validate every output carefully or else.
Sure, just like meatbag developers. All the security flaws AI finds today were introduced years/decades ago by humans and haven't been found (that we know) by humans in ages.
It is a quantifiable thing, not a feeling.
Between ten thousand runs of:
```
#include <stdio.h>

int main(void) {
    const int MAX_COUNT = 10000;
    printf("I'll count up to %d\n", MAX_COUNT);
    for (int i = 1; i <= MAX_COUNT; i++)
        printf("I'm now counting %d\n", i);
    return 0;
}
```
And of the following prompt:
```
You'll count to 10,000. At the start say "I'll count up to 10,000" and then for each number say "I'm now counting <number>" and do not say anything else. Do not miss numbers in between.
```
Which one is going to produce 100% correct results out of a 10,000 run of each?
Now don't give me "these are different tools". We all know. I'm talking about reliability and predictability.
Just an anecdote: I was designing a project for LLM onboarding and orchestration. Claude chose to read only the first 40 lines of each file. Later, in another session, while looking for causes of a low-quality result, Claude detected the fault and changed the code to perform an AST analysis, so now the analyzer takes documentation lines and function signatures (input/output) as input.
Claude's initial approach was really poor. One has to wonder how many times Claude code has to be modified/reviewed for improvement, or whether it is possible at all to make good code with it.
Edited: Generalization: Claude can fix a localized, identifiable poor decision (e.g., "only reading first 40 lines") because the fault is discrete and traceable to one piece of code.
But real software quality problems often arise from many small, individually reasonable decisions that collectively produce bad outcomes. No single one is obviously "the fault." In that scenario, a tool that generates low-quality building blocks piecemeal may never converge on good code, because each piece seems fine in isolation.
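The localized fix the anecdote describes is easy to picture. A minimal sketch of signature-plus-docstring extraction in Python, using the standard `ast` module; the function name and output format here are invented for illustration, not what Claude actually generated:

```python
import ast

def summarize_module(source: str) -> list[str]:
    """Return one line per function: its signature plus the docstring's first line."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Reconstruct a compact signature from the positional args.
            args = ", ".join(a.arg for a in node.args.args)
            sig = f"def {node.name}({args})"
            doc = ast.get_docstring(node)
            if doc:
                sig += f"  # {doc.splitlines()[0]}"
            lines.append(sig)
    return lines
```

Feeding an agent this kind of summary instead of raw files is the progressive-disclosure idea: the model asks for full source only for the functions that look relevant.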
I think it's taught to look at source code through a peephole for the sake of context preservation, but I feel like this could be a good use-case for some sub-logic or even a full sub-agent. Like, here sub-agent, you skim that file and tell me a summary, and highlight any areas related to X and Y so that I can look at them in my main context. You can also periodically observe the main work stream and interrupt me if you realise that something in the file you're thinking about is relevant to what I'm working on or might change the direction of what I'm doing.
> I think it's taught to look at source code through a peephole for the sake of context preservation
Yes, to a (real) fault. In fewer than one in fifty of the cases where it ignores an instruction or piece of data in a file had it actually seen that instruction or data before ignoring it. The other times, it's done this sampling nonsense.
Results are night and day using the 1M token models and reading the full files.
I think what you suggest is like a local second-order approximation, and that can help. But I think the real problem is a global one: it is about architectural taste, how the many local pieces interact, and their friction. Currently that demands human expertise.
Why can't Claude Code generate effective harness for us by inspecting the code base?
I tried defining CLAUDE.md (or AGENTS.md), skills, plugins, but I'm not getting the effectiveness others claim. With the LSP plugin, for example, CC doesn't use the LSP's symbol renaming and edits files one by one slowly, or it does not invoke the skill even when I explicitly ask it to remember to invoke it when the prompt contains a specific clue.
Am I using it wrong? Is there a robust example I can copy the harness?
This is the pain point that has existed for years now, and it's still not solved at all.
"If A, do X. Do B,C,D. Do A" - and it just never uses X because "it forgot".
You just can't trust that the time you spend building rules will actually pay off; in fact you can trust that it will fail you sooner or later.
RAG, Harness, Skills... all was supposed to fix this, but in reality it never had.
Harnesses do fix it IMO - it's why Claude Code and Codex had a massive jump in alleged productivity on release and then seem to have flatlined. But a custom harness _would_ allow you to do things like "on every message, run lint validation and tests". That in and of itself would be wildly useful.
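Claude Code's hooks feature can already approximate this. A sketch of a project `.claude/settings.json` that runs a lint command after every file edit; check the exact event names and schema against the current hooks docs, and treat `npm run lint` as a placeholder for your own checks:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npm run lint --silent" }
        ]
      }
    ]
  }
}
```

Because the hook runs outside the model, it fires deterministically; the model can't "forget" it the way it forgets prose instructions.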
a colleague using OpenCode was telling me it has linting/formatting configurable at harness level, and I can't see why this isn't in every harness
Honestly - I think it's because it goes against the "vibe" part of the tooling - why do you care what the code looks like as long as when you run it it does what you want it to do?
> Am I using it wrong?
I stopped using `/init` and having CLAUDE|AGENTS.md files that explained the codebase. The only thing I kept was how it should explore the codebase and use `git log` when researching, which is probably redundant too. I can't figure it out either.
The codebase I work on is roughly 100k LOC so idk if it is considered large. Personally it's the largest repo I have worked on.
What seems to work in some cases are hooks with scripts that feed into the context window (I've had to strip out some of the unnecessary linter messaging to limit context). Linters and/or other language specific checkers that can be installed via OS package repository and called via script. Also, the model + skill context together could make a difference. Skills that "worked" on 4.6 may not work as well on 4.7, which seems to require more explicit direction, but is more reliable by comparison to 4.6. Updating skills might help too. Test and run before/after to check. CC also injects unnecessary tool calls into context, so you may need to suppress tasks if you're a beads fan for example.
I don't agree with the statement about indexing the codebase: it works pretty well for IDEs like PhpStorm or other JetBrains IDEs
PHPStorm's indexing is incredible. Aside from a scant few times it's been corrupted, which is easily corrected, I've never gotten stale results.
Although if you've ever used Claude's search tool, you'll be unsurprised that the team knows nothing about indexing.
How a company, whose primary product is text-based chat, doesn't allow users to easily perform text search on said chat is beyond comprehension.
And Claude Code can use JetBrains' MCP to use that index.
It's an odd statement. AI slop? GitHub Copilot has pretty good local indexing too. It's not a super hard problem to put code into a vector DB..
I ask Claude to fix a given test:
- runs the failing test | grep "x|failing" | tail 10
- runs the test again to get the why-it's-failing message | tail 10
- runs the test again because tail 10 cut off the message
every fucking time.
I have a skill for it to not do that = save output for whatever test you run into file, read from file using whatever commands you want. Ignores the skill. It's maddening. It's as if, puts on tinfoil hat, it's designed to waste your tokens, while eventually accomplishing its task
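For what it's worth, the behavior that skill asks for can also be handed over as a plain script, which agents tend to follow more reliably than prose instructions. A hedged sketch (the function name and log path are made up):

```shell
# Sketch of "save test output, then read from the file": run the command once,
# log everything, print only the tail. Later questions grep the log instead
# of re-running the suite.
LOG="${TEST_LOG:-/tmp/last-test-output.log}"

run_once() {
  "$@" >"$LOG" 2>&1
  status=$?
  tail -n 20 "$LOG"       # short view for the context window
  return $status
}

# one real run...
run_once echo "test 12 failed: expected 2, got 3"
# ...every follow-up reads the saved log
grep -c "failed" "$LOG"
```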
> That also includes codebases running on languages that teams don't always associate with AI coding tools, such as C, C++, C#, Java, PHP.
What a strange comment for them to make. Why wouldn't I expect CC to work well with those languages? What languages would I associate it with? Python and JavaScript?
Claude clearly wrote this. A lot of fluff, not much substance.
How very interesting. In an industry where things shift around in months if not weeks, there's apparently been not only enough time for clear patterns to emerge, but also for these patterns to prove successful on large codebases. What's the success criteria? Didn't delete the production database? Team velocity has increased? Codebase TTL has increased? Operations guys are happier?
> Didn't delete production database?
I still say if this happens to you with AI tooling, that's both a failure on you and your org for giving a developer prod credentials that could nuke production resources. I don't think I've worked in a place that gave me this level of blind access.
I have only worked in startups and I have been an early engineer in both of them. I would always get high privileges within a short time where I would have the access to create and delete resources. I don't think it's that uncommon.
But the correct way to do it is to have a separate account with more privileges, and only give AI access to your standard developer account
That's one way to do it, how about backup to a remote location every hour? There's more than one way to be careful.
I have personally seen AI bypass this multiple times.
Sounds like they're still giving the model the keys to the kingdom, which is my point, stop giving the model the avenue to do catastrophic mistakes, it makes no sense.
If your message is in response to me, which I think it is: I deliberately don't give access to credentials and env variables. I've worked to create restrictions and seen AI models use very interesting methods to bypass them.
Even now my prompt says the AI must verify the path of the files it intends to edit, and edit one file at a time, only after getting permission. I stop it from ignoring those rules at least once a day.
This is not privilege separation/sandboxing. Separate virtual machine for an agent with limited credentials is reasonably safe approach
We kinda need to architect things with the assumption that all token-output from an LLM can be unpredictably sneaky and malicious.
Alas, humans suck at constant vigilance, we're built to avoid it whenever possible, so a "reverse centaur" future of "do what the AI says but only if you see it's good" is going to suck.
I built my own IDE to replace vscode / cursor so I could design the harness and ensure that the model tool access was secure and limited. But the rest of the industry is YOLO
I would never have these privileges granted directly to my account.
Indeed it's a good practice to use roles where supported (AWS has them) and explicitly switch when needed
The problem with agents is that they regularly sidestep the guardrails and do what they want with a script anyway. The number of times I've seen Claude try to escape the folder it's working in, and then write a python script that does exactly what I told it it's not allowed to do, supports that.
If you use SSO and have an AWS config that Claude is allowed to see to get the correct role in the first place, it will just pick the role and plough on anyway.
And this is why it is the height of irresponsibility to run LLMs on your system. We know they are unreliable and just make things up; it's extremely foolish to go "yeah I'm going to let that run commands".
It's not _really_ any different to running an undocumented third party binary. Is it the height of irresponsibility to run Windows, or VSCode, or Spotify?
I think the model we've got now is wrong, and the harnesses should be OS-level sandboxed, and the agents should be running in harness managed sandboxes.
The first step I take in any meaningful side project is to set up RDS with snapshots. So any startup that doesn't do this one basic step already deserves to fail, in my opinion.
Then, I've used AI agents like crazy; we even have linked MCP servers that let them query the dev database. I haven't seen an agent try deleting everything a single time. I haven't seen any agent try to do anything destructive. Ever. Perhaps it's just reflecting an outrageously bad engineer and nothing else.
Exactly. So is that level of obvious hygiene where the bar is, or is it somewhere else? What ticks me off is the audacity of blanket claims without even an attempt to state why this is said to be a list of successful patterns and what success means. We're just supposed to eat it up, because, you know, Claude.
Dude, AI has been shown to execute queries on coworkers env files, extract master keys, decrypt variables and push to production.
Why are production push secrets in a dev env config? Btw, human devs make this same mistake all the time.
I wonder if Anthropic tested their claims on the pro, 5x, and 20x subscriptions. When you have an infinite amount of free tokens it sure makes sense; you just throw tokens at the problem. But in limited-usage scenarios it doesn't fly far.
How important are CLAUDE.md files when they don't even describe (in concrete terms) what should even go into each one?
the fish: you can read about that here: https://code.claude.com/docs/en/best-practices#write-an-effe...
the fishing: 1) install the official `skill-creator`; 2) use that with the above link to create `claude-md-improver`; 3) improve the skill by tasking claude with researching the topic of `progressive-disclosure` in the official docs; 4) point the new skill at your CLAUDE.md file and accept the changes
What I'm curious about is how well LLMs do when they create something from scratch, because so far my experience has been with letting them fix issues or add features to an existing codebase where I already shaped the general architecture and put in a lot of guardrails. But what if the architecture is unclear and there is nothing letting the agent know whether a change breaks something or not? My only experience with a tiny codebase where it did a lot of scaffolding was poor - it did what I asked for, not what I needed. If I had done more of the thinking myself I would have realised it's code that works but doesn't solve the problem I'm after.
If the developer can have a local copy of the monorepo it's not a "large" codebase.
Disagree, but also what do you classify as local storage? Does the repo "size" include all projects or just one? What about multiple branches? How much capacity is local storage?
A stock Unreal Engine project is several hundred gigs, consists of multiple solutions, multiple languages, and I would classify as large personally.
Without some kind of indexing it's very awkward to work with and very slow. To work with LLMs and Unreal projects we create a local index; that index file alone is 46GB.
Without distributed compilers and caches it can take multiple hours to compile the main solution per platform (usually PC, Linux, Xbox, PlayStation, Switch, and sometimes mobile).
So the codebase easily fits on local storage so long as you don't count assets (those are several TB), let alone source assets (10s of TB), and that's per stream per large project.
Anyways, point is I disagree and think Unreal Engine is an example of large codebase that fits locally.
If your codebase can't fit on a single developer's dev machine, it's too big.
You mean like Tesla's multi-terabyte repo is not normal?
I think it's obvious that multi terabyte repos are not the norm.
How did they even manage to generate a terabyte sized repo, that's crazy. Do they have something written up on how it's structured and why they'd even go that route?
It couldn't be broken into domain-specific components?
Listen, I am a Rails developer, so a monolith doesn't scare me, and yet, there are limits. Why does it need to be a multi-terabyte monolith?
Ever work on a AAA game?
That probably mostly assets, no?
Probably, but you want to version control assets too.
People usually mention git-lfs at this point, but that is always annoying to use in practice. There is also shallow-clones and sparse-checkouts, but these only mitigate the problem as there is no way around cloning at least one revision completely with git.
My last project was about 400 GB, and probably 2M lines of C++. The size is mostly assets, but there's still a lot of code.
If you can't clone it it's not a repo
Small plug for what I built:
You need a code dependency graph: https://github.com/roboticforce/remembrallmcp Ask "what breaks if I change this?"
Saves 98% of token usage and 95% of tool calls.
Runs as an MCP server, works for 8 languages.
It just works, you need to try it.
I don't have any LSP's hooked up to CC yet (going to fix that today), or particularly sophisticated CLAUDE.md files.
So, if I've read this post correctly, that means that CC is navigating my codebase today by sending lots of it up to a model, and building an understanding. Is that correct? Did I misunderstand it?
I kinda suspected there was more local inference going on somehow -- partly because the iteration times are fairly fast.
I think that's correct. Which is kinda funny, I remember 10y ago that I was heavily relying on IntelliJ features to understand new codebases (jump to definition, find all usages of a function, navigate from SQL to the table in database tab etc.).
It turns out, that for a machine, find and grep is all that's required.
Agents use find+grep because they're available everywhere without any configuration, but they would still be more efficient with an LSP. Once LSPs are more easily configurable for agents, they will use them.
A human could get by with just find and grep too. And in both cases, find and grep will be slower and less precise than an IDE's code navigation features.
Wondering if enterprises have a modified version of CC that doesn't have to optimize to stop the bleeding on fixed-cost subscription plans.
The article really does not align with the current sentiment. Everyone with a choice has mostly moved on to codex (ofc in this world all it takes is a model update/harness update to turn things around).
CC is great at a lot of things, but repeatedly misses out reading on crucial parts of the code base, hallucinates on the work that was done and a bunch of other issues.
The influencer economy trades on hype, on frenzy, and ultimately, eyeballs. The more the better.
They want you to feel like you're missing out. They want you to switch. Being boring is far more productive. Pin your versions. Stick to stable releases and avoid the nightlies.
The significant noise created by the Opus 4.6 to 4.7 transition caused some to interpret it as signal. Excluding certain genuine and real bugs, the talk of quality falling dramatically was noise; influencers doing influencing turned it into "signal". The reality was that if you had strong planning and spec-driven development, the impact ranged from manageable to non-existent.
The vast majority of the people I know and work with have not switched off CC or their Max sub.
I have a choice and have not moved to codex (100/mo personal + my employer pays for a subscription). I try codex here and there and it seems to go off the rails every time. I have had some good experiences with codex, but generally trying to get something big accomplished it doesn't work out.
But I may not have paid enough to get the full real experience with codex
I use Codex at home for 20 bucks a month; the limits are very high relative to the price. Maybe the gravy train ends soon for these, and then it's probably on to OpenRouter Chinese models.
At work it's CC or sometime codex, personally don't see much difference at all and most normies will notice none. The cultists have their opinions.
> stop bleeding on fixed cost subscription plans
What bleeding? Anthropic wants as much of that "bleeding" as possible. The interaction data gathered from genuine human CC subscription usage of their models goes directly into their RL training, it's invaluable and they are more than happy to lose money on the inference to get it. That data is what xAI was recently willing to pay $10b to cursor to get.
They want you to use Claude Code. They hate other UI surfaces like OpenCode etc purely because they lose control over that data, so they're subsidizing the inference without getting what they actually want, the data (they still get some of it of course, but it's much less ergonomic for them. Those tools often abstract away the subagent calls, for example). OpenCode can collect that data themselves, so by allowing subscription there, Anthropic sees itself as subsidizing another org getting that data. Hard no.
And tools like OpenClaw are useless because they're mechanical and don't represent actual users interacting with the service - again, subsidizing but not getting the reward.
It's all very simple once you understand their motivations.
> Everyone with a choice has mostly moved on to codex
Ha!
You must be using a different CC. Or what they're writing here is correct, and it's all due to the CLAUDE.md file that I only occasionally yell at claude.
Hmm please share more. I have had the max CC sub since it came out. Religiously follow all of Boris/Cats advice but still struggle with it. Meanwhile a really badly written AGENTS.md will still get the work done.
Apologies but what is a Boris Cat?
Boris Cherny and Cat Wu are the lead devs of CC at Anthropic who unsurprisingly talk their book and find so many ways to justify tokenmaxing.
As the product they deliver is greenfield and in the newest of domain spaces, there is a serious halo-effect to consider.
On a side note, at a company I know the devs are split between:

Stick to Copilot inside Visual Studio
- suspiciously cheap Opus quotas there
+ they read their code

pi coding agent
+ control all the things my way
- each their own way

Claude Code
+ it's magic
- you mean it did that to my prompt!
Copilot's extreme subsidies end this month. Starting in June, you'll be paying API rates for all models.
I find that most "techniques" are basically user hallucinations. Simple plan-write-refactor loops and trivial CLAUDE/AGENTS.md files, generated by the harness itself, work nicely. Maaaaaaaaaybe write a skill or two, but usually it's better to just write a script.
I think it's a good rule of thumb that if you find yourself saying everyone prefers this model or that model you're in a bubble. I've made this mistake before, I used to go around saying everyone knew Claude was the only model for serious professional use, but I was wrong.
I always assume that people making those comments on HN are trying to convince others to switch to their model. Surely no one actually believes their friend circle is a representative sample of the hundreds of millions of people that use these LLMs?
Anthropic has the best marketing for sure.
Btw the guy in charge of that stuff for Anthropic is the same guy who said GPT 2 was too dangerous to release, Jack Clark. LMAO. That model could barely string a sentence together.
It's probably not a coincidence that I both prefer Claude and think that they made the right judgment call on GPT-2 at the time.
> Everyone with a choice has mostly moved on to codex
You are deep in an information bubble, mostly driven by hype-train influencers with magpie attention spans.
Interesting that MCP was mentioned over CLI. For production or controlled environments, I would not make MCP the deployment path. I would let MCP help generate or choose commands, but have the actual deployment go through CLI scripts, Git commits, and CI/CD approval.
I'm super interested to know what the back and forth between models and tools really looks like in practice.
Are there any much more detailed walkthroughs of how it works and how it decides the tools to use and the grep to use etc and what the conversations actually look like?
In the UI you see just enough to know it's doing something, but you don't really see the jumps it's making offscreen.
You can easily inspect the full requests it makes to the API which contains the full system prompt, tools, tool calls, etc.
or easier, open ~/.claude/projects/[project]/[session].jsonl (excluding the system prompt)
Doesn't really seem easier and it's in a harder to read format
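A few lines of Python can make the JSONL skimmable. The field names below (`type`, `message`, `role`) are assumptions based on what those session files appear to contain, not a documented schema:

```python
import json

def summarize_session(path: str) -> list[str]:
    """One line per transcript entry: entry type, plus the message role if present."""
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            # "message" may be absent on non-message entries; fall back gracefully.
            role = (entry.get("message") or {}).get("role", "")
            rows.append(f"{entry.get('type', '?')} {role}".strip())
    return rows
```

Point it at a `~/.claude/projects/.../*.jsonl` file to get a quick table of what happened in a session without wading through raw JSON.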
Codex is open source if you're interested: https://github.com/openai/codex
So ... the better you explain the codebase to the LLM the better it explains it to you?
A lot of words for not much. The harness taxonomy is fine, but anyone using Claude Code already knows CLAUDE.md exists.
Never make assumptions about what "everyone knows", you'd be surprised.
I use Claude Code quite a bit and quite enjoy it, so I'm a bit confused by how often it's mentioned that you should have CLAUDE.md.
I mean: If there was something you could add to the prompt to consistently increase performance why isn't it in the system prompt already?
If it's all about clarifying a couple of local idiosyncrasies, shouldn't it be able to quickly get them by looking through the repo?
Does anyone have an example of a CLAUDE.md that really makes a difference for them?
In general, this article would really have profited massively from examples of good applications of those patterns.
There's a bunch of stuff I include, depending on the project. Some general ones are commenting style and coding standards. In theory it should be able to do it without that by looking at the repo style, but I haven't found that to be the case (especially with overly verbose/repetitive comments).
A specific example in another project is the testing/verification procedure. It's a wasm/WebGPU and the test harness is fairly complex. There are scripts to handle it, but by default Claude will churn for a while to figure it out and sometimes just give up. It definitely saves a lot of tokens/speeds things up.
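As a concrete illustration of the kind of content described above, here is a made-up CLAUDE.md in that spirit; every command, path, and rule is invented, not a recommended template:

```markdown
# CLAUDE.md

## Commands
- Build: `make build` (never run `make release` locally)
- Test one file: `scripts/test-one.sh <path>` -- the full suite takes 40 min, avoid it

## Conventions
- Comments explain *why*, one line max; don't restate the code
- New endpoints go in `api/handlers/`, never in `api/legacy/`

## In-flight migration
- We are moving from callbacks to async/await; follow `docs/async-migration.md`
```

The common thread: things the model cannot infer from reading the repo (costs of commands, in-progress migrations), rather than restatements of what's already visible.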
The tokens it uses up clarifying can be saved, and it's often good to write out intentions. For instance, you may be mid-process on cleaning up some architectural pattern, and giving it guidance about where to find docs to follow, etc, are very project-specific.
>I mean: If there was something you could add to the prompt to consistently increase performance why isn't it in the system prompt already?
I think about this a lot. So far I think we are mostly just being gaslit. That we can influence the AI to be better with a few encouraging words and role playing, actually seems absurd. Maybe there is some element of randomness introduced there or something. All these extra MD files don't seem to do nearly as much for results as people believe they do.
A long article about nothing, seems written by Claude itself.
Lots of concepts. Release the harness that made it possible to port Bun to Rust in 9 days. That's what everyone really wants. Then everyone can go "do that but for this other goal".
what if this magical harness is just: experienced operator* + claude code + official plugins + opus 4.7 + max effort?
* SWE with practical experience, a code wrangler if you will
+ an infinite supply of free tokens + convincing Claude to just keep working overnight.
There you have a verifier though. As in you have test cases (which are written in JS and thus do not need to be translated). The moment you have a verifier signal LLMs become extremely reliable. Now of course they can reward hack your test cases but in a large codebase with many tests it becomes the only small thing you have to worry about.
This is really a zero information blog post. I want to know how they use the LSP to improve their understanding of the code base. Would be great if it was open source for us to review.
A post like this should be providing people with some reassurance about Claude's ability to understand code at a large scale. It's mostly fluff.
Edit: so I did some googling to dig around for thoughts on LSP performance and integration. The author of Bun has a tweet saying that they are a big drag on performance for no real gain, and virtually all of the replies agree. Anyone else have any experience/thoughts?
https://xcancel.com/jarredsumner/status/2017704989540684176
Really? I thought it explained the point that harnessing for agentic search of a large code base is more beneficial than RAG-indexing a monorepo.
A lot of words about nothing.
Meanwhile we are still waiting for these statements to come true:
https://eu.36kr.com/en/p/3648851352018565
https://www.businessinsider.com/anthropic-ceo-ai-90-percent-...
https://www.reddit.com/r/Anthropic/comments/1nemhxb/futurism...
https://medium.com/@coders.stop/dario-amodei-said-90-of-code...
https://www.youtube.com/shorts/0j1HqEEDThc
Accountability, anyone?
This is already the case for many startups. In fact, the figure might be closer to 100%. The work shifts to requirements analysis, high-level specifications, and final review instead (after AI code review).
The first link states literally
"AI will take over almost all the work of software engineers (SWEs) end-to-end in just 6-12 months!"
What you describe is >50% of the job of SWEs, even when they write all code by hand.
Are you saying that "for many startups" this isn't done by SWEs but by some other career type, or are you implying that only the code writing (and first review) is replaced by AI?
I have watched Dario's interview at WEF referred to in the article and I am quite certain Dario didn't say that. He talked about AI automating most coding already or soon, not software engineering as a whole.
He did say a few months later in an interview in India that AI will eventually take over most of SWE tasks.
---
My statement on startups is largely about automating coding by SWEs. My startup also uses AI to automate part of technical specifications and code review but I am not sure how widespread that is.
Yeah I'm working on one of those now that a 3rd-party vendor cranked out for us. I spent all day ripping out an endpoint that did 98% of what another endpoint did and should never have existed. I also ripped out 80 lines of code that looked like this:
const sqlStatement = (!params.mostRecentOnly) ? {giant SQL statement} : {identical giant SQL statement + 'LIMIT 1' at the end}
AI never met a problem that can't be solved with more code. Need some data in a slightly different structure? Don't try to modify an existing endpoint, just build a new one! Need to access a field that's buried in a JSON object in the database? Just create a new column, but don't bother removing the field from the JSON object. The more sources of truth, the merrier! When it comes time to update, just write more code to update the field everywhere it lives!
Factor out the extra sources of truth you say? Good luck scanning the most verbose front-end you've ever seen to make sure nothing is looking at the source you want to remove. In the beginning of big projects, you have to be absolutely ruthless about keeping complexity down so it doesn't get out of control later. AI is terrible at keeping complexity down.
My goal is to halve the lines of code from what the vendor turned over to us. One baby step at a time.
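For what it's worth, the 80-line ternary pattern above collapses to a few lines once the base statement is built only once and the variant clause is appended conditionally. `buildQuery`, the table, and the column names here are made up for illustration:

```typescript
// Build the giant statement once; append LIMIT only when asked for.
// No second, near-identical copy of the SQL to keep in sync.
function buildQuery(mostRecentOnly: boolean): string {
  const base = "SELECT * FROM events ORDER BY created_at DESC"; // stand-in for the giant statement
  return mostRecentOnly ? `${base} LIMIT 1` : base;
}
```

The point isn't the helper function; it's that there is exactly one copy of the statement, so a future edit can't silently diverge between the two branches.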
If only we had this tech back when managers were looking at how many lines of code you were committing weekly as a performance metric.
Now they're looking at your token consumption, which is even more gameable (and stupid).
That is a skill issue though. I have rules for my agents to write compositional, reusable, modular, small files and to avoid any sort of boilerplate: being config driven, keeping a single source of truth, having other agents review that the rules are followed, etc. Any API, UI, or other entry point stays very light, just proxying to the modular logic, so that logic can be reused by any entrypoint easily.
UI components are always presentational only, with the logic abstracted out modularly, etc...
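A tiny sketch of that split, with made-up names: the logic lives in one plain module (the single source of truth), and any entry point, whether a UI component or an API handler, just calls into it.

```typescript
// logic/cart.ts -- the only place the total is computed,
// reusable from a React component, a CLI, or an HTTP handler alike.
export interface CartItem {
  price: number;
  qty: number;
}

export function cartTotal(items: CartItem[]): number {
  return items.reduce((sum, item) => sum + item.price * item.qty, 0);
}

// An entry point stays a thin proxy, e.g. an API handler would just do:
//   res.json({ total: cartTotal(await loadCart(userId)) })
// and a UI component would only render the number it is handed.
```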
How do you make it so that the model doesn't forget to follow those rules and skills? How do you make it actually understand the architecture and constraints? You can't; current models simply don't work that way.
Can you share your rules and some of the example PRs that it auto generates and reviews?
The number of times I've seen Claude say "this test was failing already so is ignored" when it _wasn't_, despite me telling it to never do that, makes me doubt it.
Ah, the make_no_mistakes.md
I mean quite frankly I have seen enough code that was definitely written by humans that had exactly this "style".
Then again I don't want to pay for AI to give me the coding style of the worst I ever worked with either.
> many startups
which startups? I'm genuinely curious
And not only startups...
He would be right if Claude Code were written by a team of humans. The AI-written blob is slowing progress.
I mean, since Opus 4.6 came out, that rings more and more true. You still have to babysit the output, do some planning and be proactive about ways to do things better… but 80-90% isn't out of the question if you're in the domains that are well represented in the training data, e.g. if you're writing a lot of CRUD functionality as a web dev.
Companies will definitely expect devs to ship more with the same headcount, oftentimes either won't hire juniors to train them up or will straight up do layoffs, sometimes with the AI just being a convenient scapegoat. We kind of can't ignore that either; sure, a lot of those companies will be shooting themselves in the foot, but livelihoods will be impacted a bunch.