
Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

The first wave of AI adoption in software development was about productivity. For the past few
years, AI has felt like a magic trick for software developers: We ask a question, and seemingly
perfect code appears. The productivity gains are undeniable, and a generation of developers is
now growing up with an AI assistant as their constant companion. This is a huge leap forward in
the software development world, and it’s here to stay.

The next — and far more critical — wave will be about managing risk. While developers have
embraced large language models (LLMs) for their remarkable ability to solve coding challenges,
it’s time for a conversation about the quality, security, and long-term cost of the code these
models produce. The challenge is no longer about getting AI to write code that works. It’s about
ensuring AI writes code that lasts.

And so far, dealing with the quality and risk issues spawned by LLMs has not made software
developers faster. According to research from METR, it has actually slowed their overall work
by nearly 20%.

The Quality Debt

The first and most widespread risk of the current AI approach is the accumulation of massive,
long-term technical debt. The industry’s focus on performance benchmarks incentivizes models
to find a correct answer at any cost, regardless of the quality of the code itself. While models
can achieve high pass rates on functional tests, those scores say nothing about the code’s
structure or maintainability.

In fact, a deep analysis of their output in our research report, “The Coding Personalities of
Leading LLMs,” shows that for every model, over 90% of the issues found were “code smells”: the
raw material of technical debt. These aren’t functional bugs; they are indicators of poor
structure and high complexity that lead to a higher total cost of ownership.

For some models, the most common issue is leaving behind “Dead/unused/redundant code,”
which can account for over 42% of their quality problems. For other models, the main issue is a
failure to adhere to “Design/framework best practices.” This means that while AI is accelerating
the creation of new features, it is also systematically embedding the maintenance problems of
the future into our codebases today.
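
To make the category concrete, here is a minimal, hypothetical Python snippet (illustrative, not taken from the report) that exhibits all three flavors of the “dead/unused/redundant code” smell:

    import hashlib  # unused import: flagged as dead code by most linters

    def normalize_username(raw: str) -> str:
        cleaned = raw.strip().lower()
        backup = raw  # redundant variable: assigned but never read
        if cleaned:
            return cleaned
        return ""
        print("unreachable")  # dead statement: no execution path reaches it

Nothing here breaks a functional test, which is exactly why benchmark pass rates never see it.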

The Security Deficit

The second risk is a systemic and severe security deficit. This isn’t an occasional mistake or a
stray hallucination; it’s a fundamental lack of security awareness across all evaluated models, a
structural failure rooted in their design and training. LLMs struggle to prevent injection flaws
because doing so requires a non-local data-flow analysis known as taint-tracking, which is often
beyond the scope of their typical context window. LLMs also generate hard-coded secrets, like
API keys or access tokens, because these flaws exist in their training data.
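
A hypothetical sketch shows why the analysis is non-local. The function and variable names below are invented for illustration; the pattern is what matters:

    import sqlite3

    API_KEY = "sk-live-REPLACE-ME"  # hard-coded secret: the exact pattern models absorb from training data

    def fetch_user(conn: sqlite3.Connection, username: str):
        # Viewed alone, this line's danger is easy to miss...
        query = "SELECT * FROM users WHERE name = '%s'" % username  # tainted input reaches the query
        return conn.execute(query).fetchall()

    def handle_request(conn: sqlite3.Connection, form: dict):
        # ...because the taint originates here, in a different function.
        # Linking the source (form input) to the sink (the query string) is
        # taint-tracking: a non-local analysis that can exceed a model's context.
        return fetch_user(conn, form["username"])  # e.g. "x' OR '1'='1" dumps every row

The fix is simple once the flow is visible: bind the value as a parameter, e.g. conn.execute("SELECT * FROM users WHERE name = ?", (username,)).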

The results are stark: All models produce a “frighteningly high percentage of vulnerabilities with the highest severity ratings.” For Meta’s Llama 3.2 90B, over 70% of the vulnerabilities it introduces are of the highest “BLOCKER” severity. The most common flaws across the board are critical vulnerabilities like “Path-traversal & Injection” and “Hard-coded credentials.” This reveals a critical gap: The very process that makes models powerful code generators also makes them efficient at reproducing the insecure patterns they have learned from public data.
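
Path traversal follows the same script. A minimal sketch, with an invented upload directory, of the flaw and one common mitigation (real codebases should lean on vetted framework helpers):

    from pathlib import Path

    BASE_DIR = Path("/var/app/uploads")  # hypothetical upload root

    def read_upload_unsafe(filename: str) -> bytes:
        # Path traversal: filename = "../../etc/passwd" escapes BASE_DIR entirely
        return (BASE_DIR / filename).read_bytes()

    def read_upload_safe(filename: str) -> bytes:
        # One common mitigation: resolve the path, then verify it stays inside the root
        target = (BASE_DIR / filename).resolve()
        if not target.is_relative_to(BASE_DIR.resolve()):  # Python 3.9+
            raise ValueError("path traversal attempt blocked")
        return target.read_bytes()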

The Personality Paradox

The third and most complex risk comes from the models’ unique and measurable “coding
personalities.” These personalities are defined by quantifiable traits like Verbosity (the sheer
volume of code generated), Complexity (the logical intricacy of the code), and Communication
(the density of comments).

Different models introduce different kinds of risk, and the pursuit of “better” personalities can paradoxically lead to more dangerous outcomes. For example, Anthropic’s Claude Sonnet 4, the “senior architect,” introduces risk through complexity. It has the highest functional skill, with a 77.04% pass rate. However, it achieves this by writing an enormous amount of code (370,816 lines of code, or LOC) with the highest cognitive complexity score of any model, at 47,649.
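
Cognitive complexity, the metric cited here, charges extra for nesting: each control-flow structure costs more the deeper it sits. A hypothetical before/after sketch of the effect:

    def classify_order(order):
        # Each nested `if` adds an escalating cognitive-complexity increment
        if order:
            if order.items:
                if order.total > 0:
                    if order.customer.verified:
                        return "accepted"
        return "rejected"

    def classify_order_flat(order):
        # Same logic as guard clauses: far less state to hold in your head
        if not order or not order.items:
            return "rejected"
        if order.total <= 0 or not order.customer.verified:
            return "rejected"
        return "accepted"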

This sophistication is a trap, leading to a high rate of difficult concurrency and threading bugs.
In contrast, a model like the open-source OpenCoder-8B, the “rapid prototyper,” introduces risk
through haste. It is the most concise, writing only 120,288 LOC to solve the same problems. But
this speed comes at the cost of being a “technical debt machine” with the highest issue density of all models, at 32.45 issues per thousand lines of code (issues/KLOC).
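
Issue density is simply issues per thousand lines of code, so the two reported figures imply roughly how many individual problems the model shipped. A back-of-the-envelope check (the implied total is derived here, not a number from the report):

    def issue_density(issues: int, loc: int) -> float:
        """Issues per thousand lines of code (issues/KLOC)."""
        return issues / (loc / 1000)

    # Working backward from the reported figures for OpenCoder-8B:
    implied_issues = 32.45 * 120_288 / 1000
    print(round(implied_issues))  # ~3,903 issues across the benchmark solutions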

This personality paradox is most evident when a model is upgraded. The newer Claude
Sonnet 4 has a better performance score than its predecessor, improving its pass rate by 6.3%.
However, this “smarter” personality is also more reckless: The percentage of its bugs that are of
“BLOCKER” severity skyrocketed by over 93%. The pursuit of a better scorecard can create a
tool that is, in practice, a greater liability.

Growing Up with AI

This isn’t a call to abandon AI — it’s a call to grow with it. The first phase of our relationship with
AI was one of wide-eyed wonder. This next phase must be one of clear-eyed pragmatism.
These models are powerful tools, not replacements for skilled software developers. Their speed
is an incredible asset, but it must be paired with human wisdom, judgment, and oversight.

Or as a recent report from the DORA research program put it: “AI’s primary role in software
development is that of an amplifier. It magnifies the strengths of high-performing organizations
and the dysfunctions of struggling ones.”

The path forward requires a “trust but verify” approach to every line of AI-generated code. We
must expand our evaluation of these models beyond performance benchmarks to include the
crucial, non-functional attributes of security, reliability, and maintainability. We need to choose
the right AI personality for the right task — and build the governance to manage its weaknesses.
The productivity boost from AI is real. But if we’re not careful, it can be erased by the long-term
cost of maintaining the insecure, unreadable, and unstable code it leaves in its wake.
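
What might that governance look like in practice? One lightweight form is a merge gate that applies the same scrutiny to AI-generated changes as to human ones. A hypothetical sketch, where the ScanResult shape and both thresholds are illustrative rather than any specific tool’s API:

    from dataclasses import dataclass

    @dataclass
    class ScanResult:
        loc: int            # lines changed
        issues: int         # maintainability issues found by static analysis
        blocker_vulns: int  # highest-severity vulnerabilities

    MAX_DENSITY = 10.0      # issues/KLOC ceiling; tune per team
    MAX_BLOCKERS = 0        # zero tolerance for BLOCKER-severity vulnerabilities

    def may_merge(result: ScanResult) -> bool:
        """Gate a change: block on any BLOCKER vuln or excessive issue density."""
        if result.blocker_vulns > MAX_BLOCKERS:
            return False
        density = result.issues / max(result.loc / 1000, 0.001)
        return density <= MAX_DENSITY

    # A 1,200-line AI-generated change with one BLOCKER vulnerability is rejected:
    print(may_merge(ScanResult(loc=1200, issues=18, blocker_vulns=1)))  # False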
