feat(metrics): add near-duplicate block detection #18
f6d899c to 4fcad28
Add dialyzer ignore file, pre-commit hooks config, gitignore updates, blocks CI workflow, and mix.exs dependency updates. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move analyzer/collector/pipeline/registry/parallel into engine/ module, reorganize all file metrics into metrics/file/ and codebase metrics into metrics/codebase/ namespaces. Delete obsolete telemetry and stopwords modules. Add new file metrics: bradford, brevity, comment_structure, punctuation_density, rfc, and post-processing metrics (menzerath). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sification Introduce a full AST abstraction layer: token lexing (string, whitespace, newline tokens), structural and classification signals, parser with signal stream, node types (function, module, import, doc, etc.), and compound node builder with enrichment. Enables language-agnostic code structure analysis. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add language abstractions for 30+ languages across native (C++, Go, Rust, Swift, Zig, Haskell, OCaml), scripting (Python, Ruby, JS, PHP, Lua, R, etc.), VM-based (Elixir, Java, Kotlin, Scala, C#, Dart, etc.), web, config (Docker, Terraform, Makefile), data (SQL, YAML, GraphQL), and markup formats. Includes test fixtures for each language family. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce supervised GenServer processes for managing analysis runs: BehaviorConfigServer, FileContextServer, FileMetricsServer, RunContext, and RunSupervisor. Enables concurrent per-file metric collection with shared configuration state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a YAML-driven scoring framework that evaluates code quality across categories (code smells, consistency, dependencies, documentation, error handling, file structure, function design, naming conventions, etc.). Each category loads behaviors from YAML config with per-language sample validation. Includes FileScorer, Scorer, SampleRunner, and mix tasks for running sample reports and debugging signals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mpact analysis Introduce winnowing-based near-duplicate block detection at file and codebase level. BlockImpactAnalyzer computes refactoring potentials by identifying duplicated code blocks across files. Includes LinePatterns helper and CRC-based block matching for efficient similarity detection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
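For readers unfamiliar with winnowing, the idea behind this commit can be sketched as follows. This is a minimal Python illustration, not the Elixir code in this PR: the function names, the k-gram size, the window size, and the Jaccard comparison are all assumptions made for the example. Only the general technique (hash every k-gram, keep the minimum hash of each sliding window as a fingerprint) and the use of a CRC hash come from the commit message.

```python
import zlib


def kgram_hashes(tokens, k):
    """CRC32 hash of every run of k consecutive tokens."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]


def winnow(hashes, window):
    """Keep the minimum hash of each sliding window as a fingerprint.

    Ties are broken by taking the rightmost minimum. Returns a set of
    (position, hash) pairs, so a hash picked by several overlapping
    windows is recorded once.
    """
    if not hashes:
        return set()
    window = min(window, len(hashes))  # short inputs form a single window
    picked = set()
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        offset = window - 1 - win[::-1].index(smallest)
        picked.add((start + offset, smallest))
    return picked


def similarity(a_tokens, b_tokens, k=5, window=4):
    """Jaccard overlap of the two fingerprint sets (hypothetical measure)."""
    fa = {h for _, h in winnow(kgram_hashes(a_tokens, k), window)}
    fb = {h for _, h in winnow(kgram_hashes(b_tokens, k), window)}
    union = fa | fb
    if not union:
        return 0.0  # too few tokens to form a single k-gram
    return len(fa & fb) / len(union)
```

Two identical token sequences score 1.0, and blocks that share long token runs tend to share fingerprints. Only the selected fingerprints need to be compared, which is what makes the approach cheap enough to run across a whole codebase.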
Add Config module for loading and validating .codeqa.yml configuration, and Diagnostics module for surfacing rule violations and code issues with structured output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ding Rework health report categories, grader, and formatters (GitHub and plain) to integrate with combined metrics scoring. Config module now drives category weighting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add diagnose CLI command for surfacing code quality issues. Update analyze, compare, correlate, history, health_report, and options commands to work with the new engine and config layers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a React/Vite web app for interactively tuning scalar metric weights and visualizing combined metric scores. Includes knob controls, behavior cards, YAML export, and a bundled metric report for offline use. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update README with new feature documentation. Update .codeqa.yml, action.yml, and run script to reflect the new engine and CLI structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
961d39e to 45ae2b2
…vior Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…docs
- Add guard (is_binary/is_list) to catch-all clause of behavior_language_applies?/3
- Add explicit clause for (_, nil, []) to treat empty languages list as "no filter"
- Add comment on the [] catch-all clause clarifying priority semantics
- Document :language and :languages options in diagnose_aggregate/2 @doc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l sites Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fy project_languages param Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds configurable line range filtering for code blocks to focus on actionable, refactorable chunks:
- Default: only show blocks between 3 and 20 lines
- Configurable via .codeqa.yml:
  - block_min_lines: 3 (default)
  - block_max_lines: 20 (default)

Blocks outside this range are excluded before ranking. This filters out:
- Tiny blocks (< 3 lines) that are too small to be meaningful
- Large blocks (> 20 lines) that need bigger refactoring
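The filter described above amounts to an inclusive range check on each block's line count. A minimal Python sketch, assuming a hypothetical Block record with inclusive start and end lines (the PR's actual data structures are Elixir and are not shown here); the defaults mirror block_min_lines and block_max_lines:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical block record; end_line is inclusive."""
    path: str
    start_line: int
    end_line: int

    @property
    def line_count(self):
        return self.end_line - self.start_line + 1


def filter_by_line_count(blocks, min_lines=3, max_lines=20):
    """Keep blocks whose size lies in [min_lines, max_lines], before ranking."""
    return [b for b in blocks if min_lines <= b.line_count <= max_lines]
```

With the defaults, a 2-line block and a 21-line block are both dropped, while 3-line and 20-line blocks are kept.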
🔍 Top Likely Issues (cosine similarity)
🟢 Readability: A (95/100). Codebase averages: flesch_adapted=98.02, fog_adapted=4.72, avg_tokens_per_line=9.29, avg_line_length=35.07
🔴 Complexity: D- (31/100). Codebase averages: difficulty=40.82, effort=214953.30, volume=3831.96, estimated_bugs=1.28
🟢 Structure: A- (87/100). Codebase averages: branching_density=0.14, mean_depth=3.91, avg_function_lines=8.65, max_depth=9.77, max_function_lines=20.21, variance=6.89, avg_param_count=1.12, max_param_count=1.97
🟠 Duplication: C- (48/100). Codebase averages: redundancy=0.59, bigram_repetition_rate=0.54, trigram_repetition_rate=0.37
🟢 Naming: A (96/100). Codebase averages: entropy=0.89, mean=6.65, variance=18.86, avg_sub_words_per_id=1.17
🟢 Magic Numbers: A (100/100). Codebase averages: density=0.00
🔴 Combined Metrics: F (58/100)
🔴 Code Smells: E+ (21/100)
🟠 Consistency: C- (50/100)
🔴 Dependencies: D- (27/100)
🟡 Documentation: B+ (83/100)
🟢 Error Handling: A- (92/100)
🔴 File Structure: D (39/100)
🟡 Function Design: B+ (80/100)
🟡 Naming Conventions: B+ (81/100)
🔴 Scope And Assignment: E+ (22/100)
🟡 Testing: B+ (83/100)
🟢 Type And Value: A- (90/100)
🟠 Variable Naming: C- (50/100)
…ided

When comparing refs, only show blocks whose lines overlap with the actual diff hunks, not just blocks in changed files. This surfaces only blocks relevant to the current PR changes.
- Add Git.diff_line_ranges/3 to parse unified diff and extract changed line ranges per file
- Add filter_by_diff_overlap/3 to TopBlocks to filter by line overlap
- Integrate in health-report CLI when --base-ref is provided
- Add warning when diff parsing fails (graceful degradation)
- Add comprehensive tests for edge cases
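The two steps in this commit (extract changed line ranges from a unified diff, then keep only overlapping blocks) can be sketched roughly as below. This is a Python illustration under assumptions, not the Elixir Git.diff_line_ranges/3 or filter_by_diff_overlap/3 from the PR: the function signatures and the (path, start, end) tuple shape are invented for the example. It reads the new-file side of each `@@` hunk header, so with default context lines the ranges include unchanged context; a diff produced with zero context would give tighter ranges.

```python
import re
from collections import defaultdict

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@"
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def diff_line_ranges(diff_text):
    """Map each changed file to a list of inclusive (start, end) new-side ranges."""
    ranges = defaultdict(list)
    current = None
    for line in diff_text.splitlines():
        # Simplification: an added line whose own text begins with "++ "
        # would be mistaken for a file header here.
        if line.startswith("+++ "):
            target = line[4:].strip()
            if target == "/dev/null":
                current = None  # file was deleted, nothing on the new side
            else:
                current = target[2:] if target.startswith("b/") else target
            continue
        match = HUNK_RE.match(line)
        if match and current is not None:
            start = int(match.group(1))
            count = int(match.group(2)) if match.group(2) is not None else 1
            if count > 0:  # count 0 means the hunk only deletes lines
                ranges[current].append((start, start + count - 1))
    return dict(ranges)


def filter_by_diff_overlap(blocks, ranges):
    """Keep (path, start, end) blocks that intersect a changed range in their file."""
    return [
        (path, start, end)
        for path, start, end in blocks
        if any(start <= r_end and r_start <= end
               for r_start, r_end in ranges.get(path, []))
    ]
```

If parsing yields no ranges for a file, every block in that file is filtered out; the commit's "graceful degradation" warning presumably covers the case where the diff cannot be parsed at all, which this sketch does not model.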
Add worst_per_category/4 to TopBlocks that identifies the single worst block for each cosine-based category (code_smells, function_design, etc.) based on cosine_delta. Blocks must overlap with PR diff lines.

Display in GitHub formatter:
- Show source code if block is 4-10 lines
- Show file location only if <4 or >10 lines
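A rough Python sketch of the selection and display rule described in this commit. The dictionary keys and function signatures are hypothetical, and the sketch assumes a larger cosine_delta means a worse block; the commit message does not state the direction, so flip the comparison if the opposite holds. Blocks are assumed to have been filtered for diff overlap already.

```python
def worst_per_category(blocks):
    """Pick one block per category, assuming a higher cosine_delta is worse."""
    worst = {}
    for block in blocks:
        category = block["category"]
        current = worst.get(category)
        if current is None or block["cosine_delta"] > current["cosine_delta"]:
            worst[category] = block
    return worst


def render_block(block):
    """Show the source for 4-10 line blocks, otherwise only the location."""
    line_count = block["end_line"] - block["start_line"] + 1
    location = f'{block["path"]}:{block["start_line"]}-{block["end_line"]}'
    if 4 <= line_count <= 10:
        return location + "\n" + block["source"]
    return location
```

The 4-10 line window keeps the report compact: shorter blocks carry too little context to be worth quoting, and longer ones would crowd the comment, so both fall back to a file location.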
What's in this PR
This branch is a major evolution of the codeqa analysis engine, centered around near-duplicate block detection but encompassing a full architectural restructure.
Architecture changes
Engine layer (lib/codeqa/engine/)
Replaces the monolithic analyzer.ex/collector.ex/pipeline.ex with a layered engine.
AST subsystem (lib/codeqa/ast/)
Language-agnostic code structure analysis:
- Token lexing (<NL>, <WS>, string tokens)
- SignalStream
- NodeClassifier, NodeTypeDetector, NodeProtocol
- FunctionNode, ModuleNode, ImportNode, DocNode, AttributeNode, CodeNode, TestNode
- CompoundNodeBuilder assembles enriched compound nodes with metadata
Language definitions (lib/codeqa/languages/)
40+ language modules organized by category.
OTP analysis servers (lib/codeqa/analysis/)
Per-run supervised GenServers eliminate repeated disk I/O.
New features
Near-duplicate block detection
- NearDuplicateBlocks: winnowing-based fingerprinting + edit distance bucketing (d0=exact … d8=50%)
- NearDuplicateBlocksFile/NearDuplicateBlocksCodebase: file-level and codebase-level metrics
- BlockImpactAnalyzer: identifies refactoring opportunities from duplicate blocks
- LinePatterns: helper for structural pre-filtering
Combined metrics framework (lib/codeqa/combined_metrics/)
YAML-driven scoring across 100+ best-practice behaviors:
- FileScorer/Scorer: per-file and aggregate scoring
- SampleRunner: validates behaviors against good/bad code samples
- Samples (priv/combined_metrics/samples/) for 100+ behaviors across 10+ languages
Diagnostics and config
- Config: loads and validates .codeqa.yml
- Diagnostics: surfaces rule violations with structured output
- diagnose CLI command
Scalar tuner UI (tools/scalar_tuner/)
React/Vite app for interactively tuning scalar metric weights with YAML export.
Metrics reorganization
File metrics moved from lib/codeqa/metrics/ into lib/codeqa/metrics/file/, and codebase metrics into lib/codeqa/metrics/codebase/. New metrics added: Bradford, Brevity, CommentStructure, PunctuationDensity, RFC, LinePatterns, Menzerath (post-processing).
Test plan
- mix test: full suite passes
- mix dialyzer: no new warnings
- diagnose command produces structured output on a real repo

🤖 Generated with Claude Code