Since the publication of State of Copilot two months ago, we've had more interactions with Copilot, and those interactions have given me some new insights.
2 recent events
Let me first describe 2 events that happened in the past 2 weeks.
Server deployment failure
One day, one of the team members discovered that our Python Strawberry GraphQL server deployment had been failing. We looked into the deployment log and determined the cause: a deployment script change we had made requires Docker BuildKit to be enabled, but the cloud deployment machine doesn't enable it by default.
It took me 5 attempts to fix this.
First, I tried Gemini's suggestion, offered directly from the GCP logs. GCP server, GCP logs, GCP's own Gemini suggestion. This must work, right? Nope, the Gemini-suggested fix didn't work.
Then I tried the fix suggested by Cursor's small model. Nope, that didn't work.
Then I tried OpenAI's suggested fix. Nope, that didn't work.
Then I tried Anthropic Claude's suggested fix. Nope, that didn't work.
Finally, I went and read the Google Cloud Build YAML documentation myself, and made the change. That worked.
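For anyone hitting the same thing: in a Cloud Build config, BuildKit can be enabled by setting the DOCKER_BUILDKIT environment variable on the docker build step. A minimal sketch of a cloudbuild.yaml along those lines (the builder image, tag, and args here are illustrative, not our actual configuration):

  steps:
    - name: "gcr.io/cloud-builders/docker"
      # Enable Docker BuildKit for this build step.
      env:
        - "DOCKER_BUILDKIT=1"
      args: ["build", "-t", "gcr.io/$PROJECT_ID/my-app", "."]
  images:
    - "gcr.io/$PROJECT_ID/my-app"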
AI : Human = 4 failures : 1 success.
What's more interesting is that the 4 AI suggestions were all different, and all equally invalid.
Mobile app crash
The following week, we encountered a mobile app crash. After debugging, we nailed it down to this line in our React Native code:
import { Easing } from "react-native/Libraries/Animated/Easing";
This import is incorrect: it leaves Easing undefined at runtime, which crashes our app the moment it is accessed.
The correct import is
import { Easing } from "react-native";
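To make the failure mode concrete, here is roughly how Easing gets used in animation code (a simplified sketch, not our actual component). With the bogus deep import, Easing is undefined, and the first property access on it throws:

  import { Animated, Easing } from "react-native";

  // Animate an opacity value from 0 to 1.
  const opacity = new Animated.Value(0);

  Animated.timing(opacity, {
    toValue: 1,
    duration: 300,
    easing: Easing.inOut(Easing.ease), // crashes here if Easing is undefined
    useNativeDriver: true,
  }).start();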
How did the bogus import line get added in the first place? AI-suggested auto-completion.
Insights / Hypothesis
These 2 events (along with many other past observations) gave me the following hypothesis:
GenAI's code-completion intelligence mostly comes from pattern matching against existing available code.
What is "existing available code"? 2 parts: the public open source code that foundation models are trained on, and private code brought in through RAG.
What works great? Repetitive patterns in these 2 corpora.
Repetitive patterns in the public open source corpus include application code that uses widely adopted open source packages.
Repetitive patterns in private code include the same logic used in multiple places, and test code.
What tends to not work well?
DevOps code in general has much lower volume than business-logic engineering code, so LLM suggestions are less likely to work well there. (My worst personal experiences have all been Copilot suggestions in the DevOps realm.)
Application code using less popular open source packages.
Rare logic which you only use in one place.
In the end, it makes sense to double-check auto-suggested code to make sure it does what you want and doesn't introduce errors or, worse, unintended irreversible outcomes, like last time, when the server was deleted altogether and couldn't be recovered.
The key takeaway is: even if GenAI suggests code that works, it still doesn't know why it works, yet. For now, understanding the "why" still takes human intelligence.
Yup, you're right. Matches my experience too.
There are a lot of fiddly build and deployment bits that are not well documented on the Internet, so the LLMs don't have good suggestions.
Plus, the LLMs don't have genuine core understanding, so they can't extrapolate well from principles.
A bummer here is that the automation isn't helping us much with the harder problems we face.