<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Viacheslav Bogdanov</title>
<link>https://vv-bogdanov.github.io/posts.html</link>
<atom:link href="https://vv-bogdanov.github.io/posts.xml" rel="self" type="application/rss+xml"/>
<description>Minimal engineering notes on research, architecture, developer tools, AI-assisted software development, and reliable products.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Thu, 04 Jun 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>AI agents generate code faster. Duplicate-code checks need to keep up.</title>
  <dc:creator>Viacheslav Bogdanov</dc:creator>
  <link>https://vv-bogdanov.github.io/posts/fast-duplicate-code-detection-for-agents/</link>
  <description><![CDATA[ 




<p>Software teams have always created duplicate code.</p>
<p>Sometimes it is intentional: copy a working block, adapt it, ship the feature. Sometimes it is accidental: two teams solve the same problem in parallel, a helper function grows in three packages, or a test setup gets copied until no one remembers which version is canonical.</p>
<p>AI coding agents do not create this problem from nothing. They make it cheaper.</p>
<p>That distinction matters. The right conclusion is not “AI writes bad code”. The real shift is that code throughput is rising. Humans can generate more code with less friction, agents can produce repeated structures very quickly, and large teams already had the same issue before agents became common. When code volume goes up, quality gates have to become cheaper too.</p>
<p>One of those gates is duplicate-code detection.</p>
<p><img src="https://vv-bogdanov.github.io/assets/jscpd-rs-duplicate-code-flow-preview.svg" class="launch-diagram img-fluid" alt="jscpd-rs scans a repository, matches duplicate fragments, and feeds duplicate-code feedback back to CI, reviewers, and AI coding agents."></p>
<p>That is the reason for <code>jscpd-rs</code>: a fast native duplicate-code detector for humans, AI coding agents, and CI/CD pipelines.</p>
<p>It is a Rust implementation of the common <code>jscpd</code> workflow. It scans a repository, finds repeated source fragments, writes reports for humans and automation, and can fail CI when duplication crosses a configured threshold.</p>
<p>The goal is not to make refactoring automatic. The goal is to make duplication visible quickly enough that humans and agents can act on it before it becomes maintenance debt.</p>
<section id="what-duplicate-code-detection-is" class="level2">
<h2 class="anchored" data-anchor-id="what-duplicate-code-detection-is">What duplicate-code detection is</h2>
<p>If you have not used <code>jscpd</code>: it is a copy-paste detector for source code. It scans a repository, finds duplicated fragments across files, generates reports, and can fail CI when duplicated code crosses a configured threshold.</p>
<p>This is useful because duplicated code is not only a cleanliness issue. It can increase review load, hide inconsistent fixes, make refactors harder, and spread bugs across several files. In large repositories the problem is rarely a single obvious copy-paste block. It is usually a slow accumulation of repeated handlers, adapters, tests, config objects, validators, API wrappers, and business rules.</p>
<p>For humans, duplicate-code detection is a refactoring map.</p>
<p>For agents, it is a feedback signal.</p>
<p>An agent can generate a new implementation, run a fast duplicate check, inspect the repeated regions, and then decide whether to reuse an existing abstraction, extract a shared helper, or leave the duplication because it is intentionally local. The same loop helps a human reviewer: instead of scanning a large diff for copy-paste manually, the tool gives a concrete list of repeated fragments.</p>
</section>
<section id="why-speed-changes-the-product" class="level2">
<h2 class="anchored" data-anchor-id="why-speed-changes-the-product">Why speed changes the product</h2>
<p>A slow quality gate becomes optional.</p>
<p>Teams start running it only before releases. Or only manually. Or only on small folders. Or they keep the check, but developers learn to avoid it because it adds enough waiting time to hurt the normal pull request loop.</p>
<p>That is why I built <code>jscpd-rs</code>.</p>
<p>The goal is not to invent a completely new workflow. The goal is to make an existing useful workflow cheap enough to leave on.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">npx</span> jscpd-rs <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--threshold</span> 5 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--exitCode</span> 1 .</span></code></pre></div></div>
<p>or:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">cargo</span> install jscpd-rs <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--locked</span></span>
<span id="cb2-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">jscpd</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--threshold</span> 5 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--exitCode</span> 1 .</span></code></pre></div></div>
<p>The npm package installs prebuilt native binaries on common Linux, macOS, and Windows targets, so most npm users do not need a Rust toolchain.</p>
<p>For a team, this turns duplicate-code detection from an occasional cleanup tool into a normal pull-request check. For an agent workflow, it turns duplication into a quick iteration signal instead of a later code review surprise.</p>
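<p>As a rough illustration of what a threshold gate like <code>--threshold 5 --exitCode 1</code> amounts to, here is a minimal Python sketch. It assumes the duplication percentage is duplicated lines over total lines, and that the gate fails only when that percentage is strictly above the threshold. The function names are hypothetical and this is not the tool's actual implementation.</p>

```python
def duplication_percentage(duplicated_lines, total_lines):
    """Share of duplicated lines, in percent (0.0 for an empty project)."""
    if total_lines == 0:
        return 0.0
    return 100.0 * duplicated_lines / total_lines


def gate_exit_code(duplicated_lines, total_lines, threshold, exit_code=1):
    """Return exit_code when duplication exceeds threshold, else 0.

    Assumption: the comparison is strict (percentage > threshold).
    """
    percentage = duplication_percentage(duplicated_lines, total_lines)
    return exit_code if percentage > threshold else 0


if __name__ == "__main__":
    # 620 duplicated lines out of 10,000 is 6.2%, above a threshold of 5.
    print(gate_exit_code(620, 10_000, threshold=5))
```

<p>The point of the sketch is that the gate itself is trivial. The cost is in producing the duplicated-line count, which is why scan speed decides whether the gate stays enabled.</p>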
</section>
<section id="who-this-is-for" class="level2">
<h2 class="anchored" data-anchor-id="who-this-is-for">Who this is for</h2>
<p>The one-line summary of the intended audience:</p>
<blockquote class="blockquote">
<p><code>jscpd-rs</code> is a fast native duplicate-code detector for humans, coding agents, and CI/CD.</p>
</blockquote>
<p>More specifically:</p>
<ul>
<li>for humans, it finds copy-paste regions before they become maintenance debt;</li>
<li>for reviewers, it turns repeated code into concrete report entries;</li>
<li>for agents, it provides a fast feedback loop for deduplication and reuse;</li>
<li>for CI/CD, it keeps the duplicate-code gate cheap enough to run on every pull request;</li>
<li>for teams, it reduces waiting time, review load, and wasted compute.</li>
</ul>
<p>The agent angle is important, but it should stay practical. Agents are not the enemy. Higher code throughput is the change. Faster guardrails are the response.</p>
<p>What it is designed to do:</p>
<ul>
<li>find repeated code quickly enough for normal development loops;</li>
<li>produce reports that humans can inspect;</li>
<li>produce machine-readable output that CI and agent workflows can consume;</li>
<li>stay compatible with familiar <code>jscpd</code> usage where it matters;</li>
<li>avoid adding runtime complexity just to chase theoretical extensibility.</li>
</ul>
<p>What it is not:</p>
<ul>
<li>it does not decide whether every duplicate is bad;</li>
<li>it does not replace human judgment about abstraction boundaries;</li>
<li>it does not currently load dynamic npm reporters, stores, listeners, or plugins;</li>
<li>it is coverage-first compatible (it aims not to miss duplicated lines that upstream reports), not a promise of exact 1:1 internal token streams.</li>
</ul>
</section>
<section id="current-benchmark-baseline" class="level2">
<h2 class="anchored" data-anchor-id="current-benchmark-baseline">Current benchmark baseline</h2>
<p>The current public benchmark suite uses pinned public repositories and compares <code>jscpd-rs</code> with upstream <code>jscpd</code> on the same high-level inputs and options.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Case</th>
<th>Format</th>
<th style="text-align: right;">jscpd-rs avg</th>
<th style="text-align: right;">upstream jscpd avg</th>
<th style="text-align: right;">Speedup</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>React</td>
<td>JavaScript</td>
<td style="text-align: right;">0.197325s</td>
<td style="text-align: right;">10.413453s</td>
<td style="text-align: right;">52.77x</td>
</tr>
<tr class="even">
<td>Next.js</td>
<td>TypeScript</td>
<td style="text-align: right;">0.270786s</td>
<td style="text-align: right;">14.983243s</td>
<td style="text-align: right;">55.33x</td>
</tr>
<tr class="odd">
<td>Prometheus</td>
<td>Go</td>
<td style="text-align: right;">0.083162s</td>
<td style="text-align: right;">4.842499s</td>
<td style="text-align: right;">58.23x</td>
</tr>
</tbody>
</table>
<p>These numbers are not microbenchmarks. They are release-gate numbers for duplicate-code detection on real public repositories pinned by commit.</p>
<p>What matters for the product is not only the exact multiplier. It is that the check moves from “noticeable quality gate” toward “cheap enough to run often”.</p>
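<p>The Speedup column is simply the upstream average divided by the <code>jscpd-rs</code> average. A short Python check, using only the timings from the table above, reproduces the column to two decimal places:</p>

```python
# Average wall-clock times in seconds, copied from the table above:
# (jscpd-rs average, upstream jscpd average).
timings = {
    "React": (0.197325, 10.413453),
    "Next.js": (0.270786, 14.983243),
    "Prometheus": (0.083162, 4.842499),
}


def speedup(rust_avg, upstream_avg):
    """How many times faster the first timing is than the second."""
    return upstream_avg / rust_avg


speedups = {case: round(speedup(*pair), 2) for case, pair in timings.items()}

for case, value in speedups.items():
    print(f"{case}: {value:.2f}x")
```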
</section>
<section id="compatibility-model" class="level2">
<h2 class="anchored" data-anchor-id="compatibility-model">Compatibility model</h2>
<p>The compatibility gate is coverage-first.</p>
<p>On the same inputs and options, <code>jscpd-rs</code> must not miss duplicated source lines reported by upstream <code>jscpd</code>. Extra Rust findings are allowed while compatibility converges, but they remain visible in compatibility reports as <code>extra</code> findings.</p>
<p>This is a deliberate trade-off. Exact pair identity is valuable, but with multi-way duplicates and different tokenizers it is not always the right thing to block a first release on. The important contract is that the Rust implementation should not silently miss duplicated source ranges that upstream reports.</p>
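<p>One way to picture the coverage-first contract is as a set comparison over duplicated source lines. The Python sketch below is an illustration under assumptions, not the project's compatibility harness: the <code>(path, start_line, end_line)</code> range shape and the function names are hypothetical.</p>

```python
def expand(ranges):
    """Turn (path, start_line, end_line) ranges into a set of (path, line)."""
    lines = set()
    for path, start, end in ranges:
        for line in range(start, end + 1):
            lines.add((path, line))
    return lines


def coverage_report(upstream_ranges, rust_ranges):
    """Compare duplicated lines reported by two tools.

    missed: lines upstream reports that the Rust side does not (gate failure).
    extra:  lines only the Rust side reports (allowed, but kept visible).
    """
    upstream = expand(upstream_ranges)
    rust = expand(rust_ranges)
    return {
        "missed": sorted(upstream - rust),
        "extra": sorted(rust - upstream),
        "passes": upstream.issubset(rust),
    }


if __name__ == "__main__":
    upstream = [("src/a.js", 10, 14)]
    rust = [("src/a.js", 9, 14), ("src/b.js", 3, 4)]
    print(coverage_report(upstream, rust))
```

<p>Comparing lines rather than clone pairs is what makes the check tolerant of multi-way duplicates: two tools can pair the same fragments differently and still cover the same source ranges.</p>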
<p>The current release is native-only. It does not embed a JavaScript runtime or call upstream <code>jscpd</code> for core behavior. Dynamic npm reporters, stores, listeners, and plugins are intentionally out of scope for the current 0.x line.</p>
</section>
<section id="what-is-implemented-today" class="level2">
<h2 class="anchored" data-anchor-id="what-is-implemented-today">What is implemented today</h2>
<p>The current release covers the common duplicate-code detection workflow:</p>
<ul>
<li><code>jscpd</code> and <code>jscpd-server</code> command names;</li>
<li><code>.jscpd.json</code> and <code>package.json#jscpd</code> config loading;</li>
<li>upstream-style CLI flags for thresholds, formats, ignores, reports, blame, exit codes, and CI usage;</li>
<li>native built-in reports: console, JSON, SARIF, HTML, XML, CSV, Markdown, badge, Xcode, threshold, AI-oriented output, and silent output;</li>
<li>native Oxc-backed JavaScript, TypeScript, JSX, and TSX token processing;</li>
<li>upstream-synchronized format registry with generic native tokenization for long-tail formats;</li>
<li>native Git blame support;</li>
<li>native REST/MCP server workflow for project statistics and snippet checks.</li>
</ul>
<p>For CI, the most common shape is simple:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">jobs</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">duplication</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runs-on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> ubuntu-latest</span></span>
<span id="cb3-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">steps</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/checkout@v5</span></span>
<span id="cb3-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/setup-node@v5</span></span>
<span id="cb3-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">          </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">node-version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span></span>
<span id="cb3-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> npx jscpd-rs src --reporters console,json --threshold 5 --exitCode 1</span></span></code></pre></div></div>
</section>
<section id="how-this-helps-agents" class="level2">
<h2 class="anchored" data-anchor-id="how-this-helps-agents">How this helps agents</h2>
<p>Fast duplicate-code detection gives an agent a concrete loop:</p>
<ol type="1">
<li>Generate or modify code.</li>
<li>Run a duplicate scan.</li>
<li>Inspect repeated fragments.</li>
<li>Decide whether to reuse an existing function, extract a helper, consolidate test fixtures, or keep the duplication with a reason.</li>
<li>Run the scan again.</li>
</ol>
<p>The same loop works for humans, but agents make the economics more obvious. If an agent can produce a large patch quickly, the project needs fast checks that can keep up with that pace. A slow check becomes a bottleneck. A cheap check becomes part of the normal edit loop.</p>
<p>This is also why reports matter. A tool that only prints a summary is not enough. <code>jscpd-rs</code> can write machine-readable JSON and SARIF reports for CI, HTML and console reports for humans, and compact AI-oriented output for agent prompts and refactoring workflows.</p>
<p>The useful agent pattern is not “generate code and trust it”. The useful pattern is “generate code, run deterministic checks, feed precise findings back into the next edit”. Duplicate-code detection fits that loop well because the finding is concrete: these files contain repeated source fragments.</p>
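<p>To make “feed precise findings back into the next edit” concrete, here is a small Python sketch that turns findings into a few short lines an agent prompt could include. The finding shape (a dict with <code>first</code> and <code>second</code> locations) is a hypothetical structure chosen for illustration. It is not the <code>jscpd-rs</code> report schema, and a real integration would map the tool's JSON output into whatever shape it needs.</p>

```python
def format_feedback(findings, limit=3):
    """Render duplicate findings as short lines for the next agent prompt.

    Each finding is a hypothetical dict with two locations, "first" and
    "second", each holding a path plus start and end lines. This is not
    the tool's report schema.
    """
    # Largest duplicates first, so a truncated list keeps the worst cases.
    ranked = sorted(
        findings,
        key=lambda f: f["first"]["end"] - f["first"]["start"],
        reverse=True,
    )
    lines = []
    for finding in ranked[:limit]:
        a, b = finding["first"], finding["second"]
        lines.append(
            f'{a["path"]}:{a["start"]}-{a["end"]} repeats '
            f'{b["path"]}:{b["start"]}-{b["end"]}'
        )
    return "\n".join(lines)
```

<p>Capping the list keeps the prompt small, and ranking by size means the agent sees the fragments most worth consolidating first.</p>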
</section>
<section id="the-implementation-approach" class="level2">
<h2 class="anchored" data-anchor-id="the-implementation-approach">The implementation approach</h2>
<p>The hot path is intentionally straightforward:</p>
<ol type="1">
<li>parse CLI/config into normalized options;</li>
<li>discover files with Rust ignore and glob handling;</li>
<li>tokenize into compact native token maps;</li>
<li>shard by format and run preparation/detection in parallel;</li>
<li>compare numeric token hashes;</li>
<li>write native reports and apply upstream-style exit-code behavior.</li>
</ol>
<p>For JavaScript and TypeScript, token processing is backed by Oxc. For long-tail formats, the project starts with a synchronized format registry and generic native tokenization, then promotes formats only when real compatibility data shows that generic tokenization is not enough.</p>
<p>That keeps the project maintainable. The point is not to write the maximum amount of Rust. The point is to keep the product fast, native, and practical without rebuilding mature infrastructure unnecessarily.</p>
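<p>Steps 3 to 5 of the list above (tokenize, then compare numeric token hashes) can be illustrated with a deliberately naive Python sketch. It is a generic sliding-window technique under stated assumptions, not the <code>jscpd-rs</code> algorithm: the regex tokenizer, the <code>min_tokens</code> parameter, and the function names are all hypothetical, and there is no sharding or parallelism.</p>

```python
import re
from collections import defaultdict

# Words, or any single non-space punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(source):
    """Naive generic tokenizer: words and single punctuation characters."""
    return TOKEN.findall(source)


def find_duplicates(files, min_tokens=5):
    """Group identical token windows across files by their hash.

    files maps a path to its source text. Returns a list of location
    lists, one per window hash seen at more than one (path, offset).
    A production detector would also verify matches to rule out hash
    collisions and merge adjacent windows into longer fragments.
    """
    seen = defaultdict(list)
    for path, source in files.items():
        tokens = tokenize(source)
        for start in range(len(tokens) - min_tokens + 1):
            window = tuple(tokens[start:start + min_tokens])
            seen[hash(window)].append((path, start))
    return [locations for locations in seen.values() if len(locations) > 1]
```

<p>Because each window is reduced to one integer, finding candidates is a dictionary lookup rather than a pairwise text comparison, which is the general reason hash-based detection scales to large repositories.</p>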
</section>
<section id="try-it" class="level2">
<h2 class="anchored" data-anchor-id="try-it">Try it</h2>
<p>With npm:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">npx</span> jscpd-rs <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--threshold</span> 5 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--exitCode</span> 1 .</span></code></pre></div></div>
<p>With Cargo:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">cargo</span> install jscpd-rs <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--locked</span></span>
<span id="cb5-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">jscpd</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--threshold</span> 5 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--exitCode</span> 1 .</span></code></pre></div></div>
<p>Links:</p>
<ul>
<li>Repository: <a href="https://github.com/vv-bogdanov/jscpd-rs" class="uri">https://github.com/vv-bogdanov/jscpd-rs</a></li>
<li>npm: <a href="https://www.npmjs.com/package/jscpd-rs" class="uri">https://www.npmjs.com/package/jscpd-rs</a></li>
<li>crates.io: <a href="https://crates.io/crates/jscpd-rs" class="uri">https://crates.io/crates/jscpd-rs</a></li>
<li>docs.rs: <a href="https://docs.rs/jscpd-rs" class="uri">https://docs.rs/jscpd-rs</a></li>
</ul>
<p>The feedback I care about most:</p>
<ul>
<li>repositories where <code>jscpd-rs</code> misses duplicated lines found by upstream <code>jscpd</code>;</li>
<li>noisy extra findings that make reports less useful;</li>
<li>install friction on supported npm platforms;</li>
<li>public benchmark cases that represent real large repositories;</li>
<li>workflows where agents or humans need better deduplication output.</li>
</ul>
<p>Duplicate code was already a human problem. Agents make it easier to create at scale. The answer is not to slow down development. The answer is to make the guardrails fast enough to keep up.</p>


</section>

 ]]></description>
  <category>developer-tools</category>
  <category>rust</category>
  <category>ci</category>
  <category>npm</category>
  <category>ai-agents</category>
  <category>code-quality</category>
  <guid>https://vv-bogdanov.github.io/posts/fast-duplicate-code-detection-for-agents/</guid>
  <pubDate>Thu, 04 Jun 2026 00:00:00 GMT</pubDate>
  <media:content url="https://vv-bogdanov.github.io/assets/jscpd-rs-duplicate-code-flow-preview.png" medium="image" type="image/png" height="108" width="144"/>
</item>
</channel>
</rss>
