feat(metrics): add near-duplicate block detection #18

Open
aspala wants to merge 71 commits into main from feat/near-duplicate-blocks

Conversation

aspala (Member) commented Mar 14, 2026

What's in this PR

This branch is a major evolution of the codeqa analysis engine, centered around near-duplicate block detection but encompassing a full architectural restructure.

Architecture changes

Engine layer (lib/codeqa/engine/)

Replaces the monolithic analyzer.ex / collector.ex / pipeline.ex with a layered engine:

  • Analyzer — orchestrates per-file metric collection
  • Collector — aggregates results
  • Pipeline — run coordination
  • Registry — metric registration
  • Parallel — concurrent execution
  • FileContext — per-file context building

AST subsystem (lib/codeqa/ast/)

Language-agnostic code structure analysis:

  • Lexing — tokenizes source into structural tokens (<NL>, <WS>, string tokens)
  • Parsing — detects blocks and applies signals via SignalStream
  • Signals — 13 structural + 10 classification signal detectors that vote on node type
  • Classification — NodeClassifier, NodeTypeDetector, NodeProtocol
  • Nodes — typed node structs: FunctionNode, ModuleNode, ImportNode, DocNode, AttributeNode, CodeNode, TestNode
  • Enrichment — CompoundNodeBuilder assembles enriched compound nodes with metadata
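
The voting idea in the Signals bullet can be sketched as follows. This is a minimal illustration only: `VoteSketch`, both signal functions, the keyword lists, and the weight of 2 for the function signal are assumptions made for the example, not the PR's signal modules. The 3/1 weights for the import signal follow the ImportSignal moduledoc quoted later in this conversation.

```elixir
defmodule VoteSketch do
  # Hypothetical keyword list; the real signals cover more languages.
  @import_keywords ~w(import require use alias)

  # Import signal: weight 3 when the first token is an import keyword,
  # weight 1 when one appears later in the block, no vote otherwise.
  def import_signal([]), do: []

  def import_signal([first | rest]) do
    cond do
      first in @import_keywords -> [{:import, 3}]
      Enum.any?(rest, &(&1 in @import_keywords)) -> [{:import, 1}]
      true -> []
    end
  end

  # Hypothetical function signal with an assumed weight of 2.
  def function_signal(tokens) do
    if Enum.any?(tokens, &(&1 in ["def", "defp"])), do: [{:function, 2}], else: []
  end

  # Run every signal, sum the weights per node type, and pick the heaviest.
  # Falls back to :code when no signal votes.
  def classify(tokens, signals) do
    votes =
      signals
      |> Enum.flat_map(fn signal -> signal.(tokens) end)
      |> Enum.reduce(%{}, fn {type, weight}, acc ->
        Map.update(acc, type, weight, &(&1 + weight))
      end)

    if map_size(votes) == 0 do
      :code
    else
      votes |> Enum.max_by(fn {_type, weight} -> weight end) |> elem(0)
    end
  end
end
```

With `signals = [&VoteSketch.import_signal/1, &VoteSketch.function_signal/1]`, a block starting with `import` is classified as `:import`, while a block with no matching keywords falls back to `:code`.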

Language definitions (lib/codeqa/languages/)

40+ language modules organized by category:

  • Native: C++, Go, Rust, Swift, Zig, Haskell, OCaml
  • Scripting: Python, Ruby, PHP, Lua, Julia, Perl, R, Shell
  • VM: Elixir, Erlang, Java, Kotlin, Scala, C#, F#, Dart, Clojure
  • Web: JavaScript, TypeScript
  • Config: Dockerfile, Makefile, Terraform
  • Data: SQL, YAML, JSON, TOML, GraphQL
  • Markup: HTML, CSS, Markdown, XML

OTP analysis servers (lib/codeqa/analysis/)

Per-run supervised GenServers eliminate repeated disk I/O:

  • BehaviorConfigServer — caches YAML behavior configs
  • FileContextServer — caches per-file language/AST context
  • FileMetricsServer — accumulates per-file metric results
  • RunSupervisor — manages lifecycle of all servers per analysis run
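
To illustrate how a per-run GenServer can avoid repeated disk reads, here is a minimal sketch. `ConfigCacheSketch` and the `loader` function are hypothetical names for this example; the PR's BehaviorConfigServer may be structured differently, and supervision by RunSupervisor is not shown.

```elixir
defmodule ConfigCacheSketch do
  use GenServer

  # `loader` is an assumed 1-arity function that reads and parses a file.
  def start_link(loader), do: GenServer.start_link(__MODULE__, loader)

  # Returns the parsed value for `path`, loading it at most once per process.
  def fetch(pid, path), do: GenServer.call(pid, {:fetch, path})

  @impl true
  def init(loader), do: {:ok, %{loader: loader, cache: %{}}}

  @impl true
  def handle_call({:fetch, path}, _from, %{loader: loader, cache: cache} = state) do
    case cache do
      %{^path => value} ->
        # Cache hit: reply without touching the loader.
        {:reply, value, state}

      _ ->
        # Cache miss: load once and remember the result for this run.
        value = loader.(path)
        {:reply, value, %{state | cache: Map.put(cache, path, value)}}
    end
  end
end
```

Because the cache lives in the process state, stopping the process at the end of an analysis run discards it, which matches the per-run lifecycle described above.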

New features

Near-duplicate block detection

  • NearDuplicateBlocks — winnowing-based fingerprinting + edit distance bucketing (d0=exact … d8=50%)
  • NearDuplicateBlocksFile / NearDuplicateBlocksCodebase — file and codebase level metrics
  • BlockImpactAnalyzer — identifies refactoring opportunities from duplicate blocks
  • LinePatterns — helper for structural pre-filtering
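
A minimal sketch of winnowing-style fingerprinting, assuming each block is already tokenized into a list of strings. `WinnowSketch`, the k-gram size `k`, the window size `w`, and the Jaccard comparison are illustrative assumptions, not the NearDuplicateBlocks implementation; the edit-distance bucketing step is not shown.

```elixir
defmodule WinnowSketch do
  # Hash every k-gram of the token list, then keep the minimum hash of each
  # sliding window of `w` consecutive k-gram hashes (the winnowing step).
  # Inputs with fewer than k + w - 1 tokens yield an empty fingerprint set.
  def fingerprints(tokens, k \\ 3, w \\ 4) do
    tokens
    |> Enum.chunk_every(k, 1, :discard)
    |> Enum.map(&:erlang.phash2/1)
    |> Enum.chunk_every(w, 1, :discard)
    |> Enum.map(&Enum.min/1)
    |> MapSet.new()
  end

  # Jaccard similarity of two fingerprint sets, in the range 0.0..1.0.
  def similarity(a, b) do
    union = MapSet.size(MapSet.union(a, b))

    if union == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union
    end
  end
end
```

Blocks whose fingerprint sets overlap heavily are candidates for a more expensive comparison such as edit distance; blocks with no shared fingerprints can be skipped early.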

Combined metrics framework (lib/codeqa/combined_metrics/)

YAML-driven scoring across 100+ best-practice behaviors:

  • Categories: code smells, consistency, dependencies, documentation, error handling, file structure, function design, naming conventions, scope/assignment, testing, type/value, variable naming
  • FileScorer / Scorer — per-file and aggregate scoring
  • SampleRunner — validates behaviors against good/bad code samples
  • 200+ polyglot sample files (priv/combined_metrics/samples/) for 100+ behaviors across 10+ languages

Diagnostics and config

  • Config — loads and validates .codeqa.yml
  • Diagnostics — surfaces rule violations with structured output
  • diagnose CLI command

Scalar tuner UI (tools/scalar_tuner/)

React/Vite app for interactively tuning scalar metric weights with YAML export.

Metrics reorganization

All file metrics moved from lib/codeqa/metrics/ into lib/codeqa/metrics/file/ and lib/codeqa/metrics/codebase/. New metrics added: Bradford, Brevity, CommentStructure, PunctuationDensity, RFC, LinePatterns, Menzerath (post-processing).

Test plan

  • mix test — full suite passes
  • mix dialyzer — no new warnings
  • Sample runner validates combined metrics behaviors
  • Near-duplicate detection returns expected buckets on known-duplicate fixtures
  • diagnose command produces structured output on a real repo

🤖 Generated with Claude Code

@aspala aspala marked this pull request as draft March 14, 2026 23:51
@aspala aspala force-pushed the feat/near-duplicate-blocks branch from f6d899c to 4fcad28 Compare March 15, 2026 20:42
aspala and others added 15 commits March 19, 2026 18:25
Add dialyzer ignore file, pre-commit hooks config, gitignore updates,
blocks CI workflow, and mix.exs dependency updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move analyzer/collector/pipeline/registry/parallel into engine/ module,
reorganize all file metrics into metrics/file/ and codebase metrics into
metrics/codebase/ namespaces. Delete obsolete telemetry and stopwords modules.
Add new file metrics: bradford, brevity, comment_structure, punctuation_density, rfc,
and post-processing metrics (menzerath).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sification

Introduce a full AST abstraction layer: token lexing (string, whitespace,
newline tokens), structural and classification signals, parser with signal
stream, node types (function, module, import, doc, etc.), and compound node
builder with enrichment. Enables language-agnostic code structure analysis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add language abstractions for 30+ languages across native (C++, Go, Rust,
Swift, Zig, Haskell, OCaml), scripting (Python, Ruby, JS, PHP, Lua, R, etc.),
VM-based (Elixir, Java, Kotlin, Scala, C#, Dart, etc.), web, config (Docker,
Terraform, Makefile), data (SQL, YAML, GraphQL), and markup formats.
Includes test fixtures for each language family.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce supervised GenServer processes for managing analysis runs:
BehaviorConfigServer, FileContextServer, FileMetricsServer, RunContext,
and RunSupervisor. Enables concurrent per-file metric collection with
shared configuration state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a YAML-driven scoring framework that evaluates code quality across
categories (code smells, consistency, dependencies, documentation, error
handling, file structure, function design, naming conventions, etc.).
Each category loads behaviors from YAML config with per-language sample
validation. Includes FileScorer, Scorer, SampleRunner, and mix tasks for
running sample reports and debugging signals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mpact analysis

Introduce winnowing-based near-duplicate block detection at file and codebase
level. BlockImpactAnalyzer computes refactoring potentials by identifying
duplicated code blocks across files. Includes LinePatterns helper and
CRC-based block matching for efficient similarity detection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Config module for loading and validating .codeqa.yml configuration,
and Diagnostics module for surfacing rule violations and code issues
with structured output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ding

Rework health report categories, grader, and formatters (GitHub and plain)
to integrate with combined metrics scoring. Config module now drives
category weighting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add diagnose CLI command for surfacing code quality issues. Update
analyze, compare, correlate, history, health_report, and options
commands to work with the new engine and config layers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a React/Vite web app for interactively tuning scalar metric weights
and visualizing combined metric scores. Includes knob controls, behavior
cards, YAML export, and a bundled metric report for offline use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update README with new feature documentation. Update .codeqa.yml,
action.yml, and run script to reflect the new engine and CLI structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aspala aspala force-pushed the feat/near-duplicate-blocks branch from 961d39e to 45ae2b2 Compare March 19, 2026 17:31
aspala and others added 12 commits March 20, 2026 13:06
…vior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…docs

- Add guard (is_binary/is_list) to catch-all clause of behavior_language_applies?/3
- Add explicit clause for (_, nil, []) to treat empty languages list as "no filter"
- Add comment on the [] catch-all clause clarifying priority semantics
- Document :language and :languages options in diagnose_aggregate/2 @doc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l sites

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fy project_languages param

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aspala aspala marked this pull request as ready for review March 26, 2026 14:40
@num42 num42 deleted a comment from github-actions bot Mar 26, 2026
Adds configurable line range filtering for code blocks to focus on
actionable, refactorable chunks:

- Default: only show blocks between 3 and 20 lines
- Configurable via .codeqa.yml:
  - block_min_lines: 3  (default)
  - block_max_lines: 20 (default)

Blocks outside this range are excluded before ranking. This filters out:
- Tiny blocks (< 3 lines) that are too small to be meaningful
- Large blocks (> 20 lines) that need bigger refactoring
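
The range filter described in this commit message could look roughly like the sketch below. The `block_min_lines` and `block_max_lines` keys and their defaults come from the message; `BlockRangeSketch` and the block shape (maps with inclusive `:start_line` and `:end_line`) are assumptions for illustration.

```elixir
defmodule BlockRangeSketch do
  @defaults %{block_min_lines: 3, block_max_lines: 20}

  # Keeps only blocks whose line count falls inside the configured range.
  # `config` is an assumed map of overrides, e.g. parsed from .codeqa.yml.
  def filter(blocks, config \\ %{}) do
    %{block_min_lines: lo, block_max_lines: hi} = Map.merge(@defaults, config)

    Enum.filter(blocks, fn %{start_line: s, end_line: e} ->
      len = e - s + 1
      len >= lo and len <= hi
    end)
  end
end
```

Filtering before ranking means a very large duplicate never crowds out the smaller, more actionable blocks in the report.
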
github-actions bot commented Mar 26, 2026

🔍 Top Likely Issues (cosine similarity)

Most negative cosine = file's metric profile best matches this anti-pattern.

| Behavior | Cosine | Score |
|---|---|---|
| scope_and_assignment.used_only_once | -0.65 | -4.75 |
| code_smells.no_debug_print_statements | -0.60 | -6.74 |
| scope_and_assignment.declared_close_to_use | -0.51 | -5.42 |
| file_structure.single_responsibility | -0.44 | -8.34 |
| scope_and_assignment.reassigned_multiple_times | -0.44 | 4.85 |
| file_structure.line_count_under_300 | -0.43 | -8.45 |
| code_smells.no_dead_code_after_return | -0.43 | -21.81 |
| scope_and_assignment.shadowed_by_inner_scope | -0.38 | -5.25 |
| dependencies.low_coupling | -0.37 | -6.18 |
| file_structure.uses_standard_indentation_width | -0.35 | 0.26 |
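
For readers unfamiliar with the measure, cosine similarity between a file's metric vector and a behavior's reference vector can be computed as in this sketch. How CodeQA builds those vectors is not stated in the report, so the plain numeric lists and the `CosineSketch` module are assumptions for illustration.

```elixir
defmodule CosineSketch do
  # Cosine of the angle between two equal-length numeric vectors.
  # Returns 0.0 when either vector has zero length (an assumed convention).
  def cosine(a, b) when length(a) == length(b) do
    dot =
      a
      |> Enum.zip(b)
      |> Enum.map(fn {x, y} -> x * y end)
      |> Enum.sum()

    denom = norm(a) * norm(b)
    if denom == 0.0, do: 0.0, else: dot / denom
  end

  # Euclidean norm of a vector.
  defp norm(v) do
    v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()
  end
end
```

A value near -1 means the two vectors point in opposite directions, which is how the report reads a strongly negative cosine as a match for the anti-pattern.
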
🟢 Readability — A (95/100)

Codebase averages: flesch_adapted=98.02, fog_adapted=4.72, avg_tokens_per_line=9.29, avg_line_length=35.07

| Metric | Value | Score |
|---|---|---|
| readability.flesch_adapted | 98.02 | 100 |
| readability.fog_adapted | 4.72 | 100 |
| readability.avg_tokens_per_line | 9.29 | 74 |
| readability.avg_line_length | 35.07 | 100 |

🔴 Complexity — D- (31/100)

Codebase averages: difficulty=40.82, effort=214953.30, volume=3831.96, estimated_bugs=1.28

| Metric | Value | Score |
|---|---|---|
| halstead.difficulty | 40.82 | 42 |
| halstead.effort | 214953.30 | 0 |
| halstead.volume | 3831.96 | 47 |
| halstead.estimated_bugs | 1.28 | 47 |

🟢 Structure — A- (87/100)

Codebase averages: branching_density=0.14, mean_depth=3.91, avg_function_lines=8.65, max_depth=9.77, max_function_lines=20.21, variance=6.89, avg_param_count=1.12, max_param_count=1.97

| Metric | Value | Score |
|---|---|---|
| branching.branching_density | 0.14 | 76 |
| indentation.mean_depth | 3.91 | 88 |
| function_metrics.avg_function_lines | 8.65 | 88 |
| indentation.max_depth | 9.77 | 86 |
| function_metrics.max_function_lines | 20.21 | 90 |
| indentation.variance | 6.89 | 100 |
| function_metrics.avg_param_count | 1.12 | 100 |
| function_metrics.max_param_count | 1.97 | 100 |

🟠 Duplication — C- (48/100)

Codebase averages: redundancy=0.59, bigram_repetition_rate=0.54, trigram_repetition_rate=0.37

| Metric | Value | Score |
|---|---|---|
| compression.redundancy | 0.59 | 58 |
| ngram.bigram_repetition_rate | 0.54 | 37 |
| ngram.trigram_repetition_rate | 0.37 | 41 |

🟢 Naming — A (96/100)

Codebase averages: entropy=0.89, mean=6.65, variance=18.86, avg_sub_words_per_id=1.17

| Metric | Value | Score |
|---|---|---|
| casing_entropy.entropy | 0.89 | 100 |
| identifier_length_variance.mean | 6.65 | 100 |
| identifier_length_variance.variance | 18.86 | 85 |
| readability.avg_sub_words_per_id | 1.17 | 100 |

🟢 Magic Numbers — A (100/100)

Codebase averages: density=0.00

| Metric | Value | Score |
|---|---|---|
| magic_number_density.density | 0.00 | 100 |

🔴 Combined Metrics — F (58/100)
| Category | Score | Grade |
|---|---|---|
| Code Smells | 21 | 🔴 E+ |
| Consistency | 50 | 🟠 C- |
| Dependencies | 27 | 🔴 D- |
| Documentation | 83 | 🟡 B+ |
| Error Handling | 92 | 🟢 A- |
| File Structure | 39 | 🔴 D |
| Function Design | 80 | 🟡 B+ |
| Naming Conventions | 81 | 🟡 B+ |
| Scope And Assignment | 22 | 🔴 E+ |
| Testing | 83 | 🟡 B+ |
| Type And Value | 90 | 🟢 A- |
| Variable Naming | 50 | 🟠 C- |

🔴 Code Smells — E+ (21/100)

Cosine similarity scores for 2 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| no_debug_print_statements | -0.60 | 17 | E |
| no_dead_code_after_return | -0.43 | 24 | E+ |

Worst offender (lib/codeqa/ast/signals/classification/import_signal.ex:1-13):

defmodule CodeQA.AST.Signals.Classification.ImportSignal do
  @moduledoc """
  Classification signal — votes `:import` when an import/require/use/alias keyword
  appears at indent 0.

  Weights:
  - 3 when it is the first content token of the block (strong match)
  - 1 when found later in the block

  Covers: Elixir (import, require, use, alias), Python (import, from),
  JavaScript/Go (import, package), C# (using), Ruby/Lua (require, include).
  Emits at most one vote per token stream.
  """
🟠 Consistency — C- (50/100)

Cosine similarity scores for 0 behaviors.

Behavior Cosine Score Grade

Worst offender (lib/codeqa/ast/signals/classification/comment_density_signal.ex:1-10):

defmodule CodeQA.AST.Signals.Classification.CommentDensitySignal do
  @moduledoc """
  Classification signal — votes `:comment` when more than 60% of non-blank
  lines begin with a comment prefix.

  Requires `comment_prefixes: [String.t()]` in opts (from the language
  module). Returns no vote if no prefixes are configured.

  Emits at the end of the stream.
  """
🔴 Dependencies — D- (27/100)

Cosine similarity scores for 1 behavior.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| low_coupling | -0.37 | 27 | D- |

Worst offender (lib/codeqa/metrics/file/heaps.ex:71-75):

    %{
      "k" => Float.round(k, 4),
      "beta" => Float.round(beta, 4),
      "r_squared" => Float.round(r_sq, 4)
    }
🟡 Documentation — B+ (83/100)

Cosine similarity scores for 3 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| file_has_module_docstring | 0.29 | 76 | B |
| function_has_docstring | 0.43 | 86 | A- |
| docstring_is_nonempty | 0.44 | 86 | A- |

Worst offender (lib/codeqa/ast/lexing/whitespace_token.ex:1-13):

defmodule CodeQA.AST.Lexing.WhitespaceToken do
  @moduledoc """
  A whitespace/indentation token emitted by `TokenNormalizer.normalize_structural/1`.

  Represents one indentation unit (2 spaces or 1 tab) at the start of a line.

  ## Fields

  - `kind`    — always `"<WS>"`.
  - `content` — the original source text for this indentation unit (`"  "`).
  - `line`    — 1-based line number in the source file.
  - `col`     — 0-based byte offset from the start of the line.
  """
🟢 Error Handling — A- (92/100)

Cosine similarity scores for 3 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| error_message_is_descriptive | 0.52 | 90 | A- |
| does_not_swallow_errors | 0.59 | 92 | A- |
| returns_typed_error | 0.70 | 94 | A |

Worst offender (lib/codeqa/ast/lexing/token.ex:39-44):

  @type t :: %__MODULE__{
          kind: String.t(),
          content: String.t(),
          line: non_neg_integer() | nil,
          col: non_neg_integer() | nil
        }
🔴 File Structure — D (39/100)

Cosine similarity scores for 5 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| single_responsibility | -0.44 | 24 | E+ |
| line_count_under_300 | -0.43 | 24 | E+ |
| uses_standard_indentation_width | -0.35 | 28 | D- |
| has_consistent_indentation | -0.33 | 29 | D- |
| no_magic_numbers | 0.58 | 92 | A- |

Worst offender (lib/codeqa/ast/classification/node_classifier.ex:39-47):

  alias CodeQA.AST.Nodes.{
    AttributeNode,
    CodeNode,
    DocNode,
    FunctionNode,
    ImportNode,
    ModuleNode,
    TestNode
  }
🟡 Function Design — B+ (80/100)

Cosine similarity scores for 4 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| boolean_function_has_question_mark | 0.34 | 79 | B+ |
| is_less_than_20_lines | 0.35 | 80 | B+ |
| has_verb_in_name | 0.35 | 80 | B+ |
| no_magic_numbers | 0.37 | 82 | B+ |

Worst offender (lib/codeqa/languages/markup/xml.ex:28-30):

  def delimiters, do: ~w[
    ( ) , . : ; " ' # ! ?
  ] ++ ~w( [ ] )
🟡 Naming Conventions — B+ (81/100)

Cosine similarity scores for 2 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| function_name_matches_return_type | 0.25 | 74 | B |
| function_name_is_not_single_word | 0.45 | 87 | A- |

Worst offender (lib/codeqa/engine/file_context.ex:16-28):

  @type t :: %__MODULE__{
          content: String.t(),
          tokens: [CodeQA.Engine.Pipeline.Token.t()],
          token_counts: map(),
          words: list(),
          identifiers: list(),
          lines: list(),
          encoded: String.t(),
          byte_count: non_neg_integer(),
          line_count: non_neg_integer(),
          path: String.t() | nil,
          blocks: [CodeQA.AST.Enrichment.Node.t()] | nil
        }
🔴 Scope And Assignment — E+ (22/100)

Cosine similarity scores for 4 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| used_only_once | -0.65 | 15 | E |
| declared_close_to_use | -0.51 | 21 | E+ |
| reassigned_multiple_times | -0.44 | 24 | E+ |
| shadowed_by_inner_scope | -0.38 | 27 | D- |

Worst offender (lib/codeqa/metrics/file/comment_structure.ex:1-14):

defmodule CodeQA.Metrics.File.CommentStructure do
  @moduledoc """
  Measures comment density and annotation patterns.

  Counts lines that begin with a comment marker (language-agnostic: `#`, `//`,
  `/*`, ` *`) relative to non-blank lines. Also counts TODO/FIXME/HACK/XXX
  markers which indicate deferred work or known issues.

  ## Output keys

  - `"comment_line_ratio"` — comment lines / non-blank lines
  - `"comment_line_count"` — raw count of comment lines
  - `"todo_fixme_count"` — occurrences of TODO, FIXME, HACK, or XXX
  """
🟡 Testing — B+ (83/100)

Cosine similarity scores for 2 behaviors.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| test_single_concept | 0.26 | 74 | B |
| test_name_describes_behavior | 0.56 | 91 | A- |

🟢 Type And Value — A- (90/100)

Cosine similarity scores for 1 behavior.

| Behavior | Cosine | Score | Grade |
|---|---|---|---|
| hardcoded_url_or_path | 0.50 | 90 | A- |

Worst offender (lib/codeqa/ast/parsing/signal_registry.ex:1-8):

defmodule CodeQA.AST.Parsing.SignalRegistry do
  @moduledoc """
  Registry for structural and classification signals.

  Use `default/0` for the standard signal set. Compose custom registries
  with `register_structural/2` and `register_classification/2` for
  language-specific or analysis-specific configurations.
  """
🟠 Variable Naming — C- (50/100)

Cosine similarity scores for 0 behaviors.

Behavior Cosine Score Grade

Worst offender (lib/codeqa/cli/health_report.ex:1-14):

defmodule CodeQA.CLI.HealthReport do
  @moduledoc false

  @behaviour CodeQA.CLI.Command

  alias CodeQA.CLI.Options
  alias CodeQA.Config
  alias CodeQA.Engine.Analyzer
  alias CodeQA.Engine.Collector
  alias CodeQA.Git
  alias CodeQA.HealthReport

  @impl CodeQA.CLI.Command
  def usage do

aspala and others added 4 commits March 26, 2026 21:22
…ided

When comparing refs, only show blocks whose lines overlap with the
actual diff hunks, not just blocks in changed files. This surfaces
only blocks relevant to the current PR changes.

- Add Git.diff_line_ranges/3 to parse unified diff and extract
  changed line ranges per file
- Add filter_by_diff_overlap/3 to TopBlocks to filter by line overlap
- Integrate in health-report CLI when --base-ref is provided
- Add warning when diff parsing fails (graceful degradation)
- Add comprehensive tests for edge cases
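
The diff-overlap filtering described in this commit can be sketched as below, under simplifying assumptions: a single file's unified diff, blocks as maps with inclusive `:start_line` and `:end_line`, and hypothetical names (`DiffOverlapSketch`, `hunk_ranges/1`, `filter_blocks/2`). These are not the signatures of `Git.diff_line_ranges/3` or `filter_by_diff_overlap/3`.

```elixir
defmodule DiffOverlapSketch do
  # Extracts new-side line ranges from the hunk headers of a unified diff.
  # A header looks like "@@ -10,3 +12,4 @@"; an omitted count means 1 line.
  # Hunks with a new-side count of 0 (pure deletions) produce no range.
  def hunk_ranges(diff) do
    regex = ~r/^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@/m

    for [_, start | rest] <- Regex.scan(regex, diff),
        count = parse_count(rest),
        count > 0 do
      first = String.to_integer(start)
      {first, first + count - 1}
    end
  end

  defp parse_count([]), do: 1
  defp parse_count([""]), do: 1
  defp parse_count([n]), do: String.to_integer(n)

  # Two inclusive ranges overlap when each starts before the other ends.
  def overlaps?({a1, a2}, {b1, b2}), do: a1 <= b2 and b1 <= a2

  # Keeps only blocks that touch at least one changed range.
  def filter_blocks(blocks, ranges) do
    Enum.filter(blocks, fn %{start_line: s, end_line: e} ->
      Enum.any?(ranges, &overlaps?({s, e}, &1))
    end)
  end
end
```

If no hunk headers are found, `hunk_ranges/1` returns an empty list; a caller could treat that as a parse failure and fall back to showing all blocks, in the spirit of the graceful degradation mentioned above.
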
Add worst_per_category/4 to TopBlocks that identifies the single worst
block for each cosine-based category (code_smells, function_design, etc.)
based on cosine_delta. Blocks must overlap with PR diff lines.

Display in GitHub formatter:
- Show source code if block is 4-10 lines
- Show file location only if <4 or >10 lines
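
A rough sketch of the selection and display rule above. The block shape, the module name `WorstBlockSketch`, and the reading that the lowest `cosine_delta` marks the worst block are all assumptions for illustration; the diff-overlap requirement is omitted here.

```elixir
defmodule WorstBlockSketch do
  # Blocks are assumed to be maps with :category, :cosine_delta,
  # :start_line and :end_line (inclusive).
  # Assumption: the most negative cosine_delta is the worst block.
  def worst_per_category(blocks) do
    blocks
    |> Enum.group_by(& &1.category)
    |> Map.new(fn {category, group} ->
      {category, Enum.min_by(group, & &1.cosine_delta)}
    end)
  end

  # Show source for blocks of 4 to 10 lines, otherwise only the location.
  def display(%{start_line: s, end_line: e}) when (e - s + 1) in 4..10, do: :source
  def display(_block), do: :location_only
end
```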