feat(indexer): introduce incremental indexing engine pipeline #21

Open
Chethan-Regala wants to merge 7 commits into AOSSIE-Org:main from
Chethan-Regala:feat/incremental-indexing-engine

@Chethan-Regala
@Chethan-Regala Chethan-Regala commented Mar 8, 2026

Introduce Incremental Indexing Engine Pipeline

This PR introduces the foundational architecture for the Smart Notes incremental indexing engine, which will power future semantic search and AI-assisted features.

Motivation

Smart Notes aims to support offline-first AI-powered knowledge retrieval.
To enable scalable semantic search across large vaults, we need an indexing pipeline that can:

  • process notes incrementally
  • generate semantic embeddings
  • maintain deterministic chunk identities
  • support pluggable storage and embedding models

This PR introduces the core indexing pipeline architecture that enables these capabilities.


Architecture Overview

The indexing engine is designed around a modular pipeline:

[Image: diagram of the indexing pipeline]

Components Introduced

IndexingEngine

Coordinates the full indexing pipeline.

Responsibilities:

  • scheduling update/delete jobs
  • reading notes from the vault
  • chunking markdown content
  • generating embeddings
  • delegating persistence to the storage layer

NoteChunker

Splits markdown notes into deterministic paragraph-based chunks.

Features:

  • stable chunk hashing
  • deterministic chunk IDs
  • predictable ordering
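
As a rough illustration of the chunking behaviour described above, a paragraph splitter with hash-derived IDs might look like the sketch below. The `NoteChunk` shape, the `splitNote` name, and the choice of a NUL delimiter are assumptions made for this example and are not taken from the PR's source.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape; the PR's actual NoteChunk type may differ.
interface NoteChunk {
  id: string;
  notePath: string;
  index: number;
  text: string;
}

// Split markdown on blank lines and derive a stable ID for each paragraph.
function splitNote(notePath: string, markdown: string): NoteChunk[] {
  return markdown
    .split(/\n\s*\n/) // paragraphs are separated by one or more blank lines
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0)
    .map((text, index) => ({
      // Same path, position, and text always hash to the same ID.
      id: createHash("sha1")
        .update([notePath, index, text].join("\0"))
        .digest("hex"),
      notePath,
      index,
      text,
    }));
}
```

Because the ID depends only on the note path, the paragraph position, and the paragraph text, re-running the splitter on unchanged input yields the same IDs in the same order.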

IndexQueue

A lightweight sequential job queue that runs one indexing job at a time, in submission order, so update and delete operations cannot interleave.


Adapter Interfaces

To keep the indexing system extensible, the following adapter contracts were introduced:

  • VaultAdapter – abstraction over vault storage
  • EmbeddingAdapter – abstraction over embedding models
  • IndexStore – abstraction over index persistence

This allows the engine to integrate with different implementations such as:

  • filesystem vaults
  • SQLite metadata stores
  • vector databases
  • local embedding models (e.g. MiniLM / Ollama)
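
To make the adapter boundaries concrete, here is a minimal sketch of the three contracts with in-memory stand-ins. The method names (`readNote`, `listNotes`, `embed`, `saveChunks`, `deleteNote`) come from the review summary in this PR; the parameter and return types, the class names, and the `indexNote` helper are assumptions for illustration only.

```typescript
// Method names follow the PR summary; the signatures are assumptions.
interface VaultAdapter {
  readNote(notePath: string): Promise<string>;
  listNotes(): Promise<string[]>;
}
interface EmbeddingAdapter {
  embed(texts: string[]): Promise<number[][]>;
}
interface IndexStore {
  saveChunks(notePath: string, chunks: string[], embeddings: number[][]): Promise<void>;
  deleteNote(notePath: string): Promise<void>;
}

// In-memory stand-ins (hypothetical names).
class MemoryVault implements VaultAdapter {
  private readonly notes: Map<string, string>;
  constructor(notes: Map<string, string>) {
    this.notes = notes;
  }
  async readNote(notePath: string): Promise<string> {
    const note = this.notes.get(notePath);
    if (note === undefined) throw new Error(`note not found: ${notePath}`);
    return note;
  }
  async listNotes(): Promise<string[]> {
    return Array.from(this.notes.keys());
  }
}

class LengthEmbedder implements EmbeddingAdapter {
  // Toy embedding: a one-dimensional vector holding the text length.
  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => [text.length]);
  }
}

class MemoryStore implements IndexStore {
  readonly rows = new Map<string, { chunks: string[]; embeddings: number[][] }>();
  async saveChunks(notePath: string, chunks: string[], embeddings: number[][]): Promise<void> {
    this.rows.set(notePath, { chunks, embeddings });
  }
  async deleteNote(notePath: string): Promise<void> {
    this.rows.delete(notePath);
  }
}

// The pipeline depends only on the interfaces, so implementations can be swapped.
async function indexNote(
  notePath: string,
  vault: VaultAdapter,
  embedder: EmbeddingAdapter,
  store: IndexStore,
): Promise<void> {
  const markdown = await vault.readNote(notePath);
  const chunks = markdown
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const embeddings = await embedder.embed(chunks);
  await store.saveChunks(notePath, chunks, embeddings);
}
```

Swapping `MemoryStore` for a SQLite-backed or vector-database implementation would then only require another class that satisfies `IndexStore`.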

Demo Harness

A minimal demo runner is included to illustrate the pipeline:

src/demo/DemoRunner.ts

This demonstrates the indexing flow using in-memory adapters.


Scope of This PR

This PR intentionally focuses on architecture and pipeline design, not full storage or embedding implementations.

Future PRs will extend this work with:

  • SQLite-backed index storage
  • incremental filesystem watchers
  • embedding model integration
  • semantic search retrieval

Why This Matters

This indexing pipeline forms the foundation for the AI layer of Smart Notes, enabling:

  • scalable semantic search
  • efficient incremental updates
  • offline-first AI knowledge retrieval

I would appreciate feedback on the architecture and adapter boundaries before expanding this into the full indexing subsystem.

Summary by CodeRabbit

  • New Features

    • Added a new indexer app that chunks notes into semantic paragraphs, generates embeddings, and indexes content.
    • Added an asynchronous sequential job queue to schedule and process index update and delete jobs reliably.
    • Added demo runner with in-memory mock components to exercise the indexing flow.
  • Chores

    • Added package manifest, TypeScript project config, and build/test scripts for the indexer.

@coderabbitai
coderabbitai bot commented Mar 8, 2026

Walkthrough

Adds a new indexer app implementing an incremental markdown indexing pipeline: types, adapter interfaces (vault, embedding, store), core components (queue, chunker, engine), a demo runner, package/tsconfig, and gitignore updates for node artifacts.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Repository ignores & app manifest: /.gitignore, apps/indexer/.gitignore, apps/indexer/package.json, apps/indexer/tsconfig.json | Added node/npm ignore rules and app-specific ignores (node_modules, dist); created package.json with TypeScript dev deps and scripts; added tsconfig for the indexer app. |
| Public types: apps/indexer/src/types.ts | Introduced NoteChunk, IndexJob, and IndexResult type definitions used across the indexer. |
| Adapter interfaces: apps/indexer/src/adapters/VaultAdapter.ts, apps/indexer/src/adapters/EmbeddingAdapter.ts, apps/indexer/src/adapters/IndexStore.ts | Added interfaces to abstract note I/O, embedding generation, and storage operations (readNote, listNotes, embed, saveChunks, deleteNote). |
| Core logic: apps/indexer/src/IndexQueue.ts, apps/indexer/src/NoteChunker.ts, apps/indexer/src/IndexingEngine.ts | Added IndexQueue for sequential job processing, NoteChunker for paragraph-based chunking with SHA-1 ids, and IndexingEngine to orchestrate read → chunk → embed → store flows and schedule update/delete jobs. |
| Demo & exports: apps/indexer/src/demo/DemoRunner.ts, apps/indexer/src/index.ts | Added an in-memory demo implementation (vault, embedder, store) and runner; created a barrel export re-exporting types, engine, queue, chunker, and adapter interfaces. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant IndexingEngine
    participant IndexQueue
    participant VaultAdapter
    participant NoteChunker
    participant EmbeddingAdapter
    participant IndexStore

    Client->>IndexingEngine: scheduleUpdate("demo.md")
    IndexingEngine->>IndexQueue: enqueue({type: "update", notePath})
    IndexingEngine->>IndexQueue: process(handler)
    IndexQueue->>IndexingEngine: handler(job)
    IndexingEngine->>VaultAdapter: readNote(notePath)
    VaultAdapter-->>IndexingEngine: markdown
    IndexingEngine->>NoteChunker: split(notePath, markdown)
    NoteChunker-->>IndexingEngine: NoteChunk[]
    IndexingEngine->>EmbeddingAdapter: embed(chunk texts)
    EmbeddingAdapter-->>IndexingEngine: embeddings
    IndexingEngine->>IndexStore: saveChunks(notePath, chunks, embeddings)
    IndexStore-->>IndexingEngine: void
    IndexingEngine-->>Client: indexing complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested labels

Typescript Lang

Poem

🐇 I nibble lines and split them neat,
Hash each chunk with tiny feet.
Queues hum softly, embeddings sing,
A demo world where indexes spring. 🥕

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change, introducing the incremental indexing engine pipeline, which aligns with the PR's core objective of establishing the foundational indexing architecture. |

@Chethan-Regala
Author

Thanks for reviewing this PR.

This change introduces the foundational architecture for the incremental indexing pipeline that will later support semantic search and AI-powered retrieval in Smart Notes.

The goal of this PR is to establish a clean, modular indexing pipeline before integrating heavier components like:

  • filesystem watchers
  • SQLite metadata storage
  • embedding model integration
  • hybrid retrieval

The current implementation focuses on defining clear adapter boundaries (VaultAdapter, EmbeddingAdapter, IndexStore) so the indexing engine remains decoupled from specific implementations.

I would especially appreciate feedback on:

  • the adapter boundaries
  • the indexing pipeline structure
  • whether this aligns with the intended future architecture of Smart Notes

Happy to refine the design based on maintainer suggestions.

@github-actions github-actions bot added size/L and removed size/L labels Mar 8, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/indexer/package.json`:
- Line 5: The package.json "main" field currently points to "index.js" but
compiled TypeScript outputs to the dist/ directory; update the "main" value in
package.json to point to the compiled entry (e.g., "dist/index.js" or the actual
compiled filename) so imports resolve correctly when consuming this package, and
verify the tsconfig.json/outDir and build outputs match the new main value.
- Around line 6-7: Add a "build" script to the package.json "scripts" section
(next to the existing "test" script) that runs the TypeScript compiler using the
repository's tsconfig (e.g., invoke tsc with -p tsconfig.json) so the project
can be compiled before publishing or running; update any CI or npm lifecycle
hooks to call "npm run build" where needed. Reference symbols: "scripts" object
and existing "test" entry in package.json and the project's tsconfig.json to
ensure the correct compiler configuration is used.

In `@apps/indexer/src/adapters/IndexStore.ts`:
- Around line 8-15: The saveChunks contract on IndexStore currently reads like a
generic save; change it to require an atomic replace of all stored
chunks/embeddings for the given notePath (i.e., replace existing state for
notePath rather than upserting/appending). Update the IndexStore.saveChunks
JSDoc/signature to state “replace all chunks for notePath atomically,” update
implementations of IndexStore (and any concrete classes) to delete/replace the
notePath’s existing chunks in one atomic operation, and ensure IndexingEngine
calls (where it reindexes notes) rely on this replace semantics so removed
chunks are no longer searchable.
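
A minimal in-memory sketch of the replace semantics requested for `saveChunks` is shown here. The `Chunk` and `StoredChunk` shapes, the `ReplacingMemoryStore` class, and its `find` helper are hypothetical; a persistent store would presumably need a transaction to achieve the same effect.

```typescript
// Hypothetical shapes; the PR's NoteChunk and IndexStore may differ.
interface Chunk {
  id: string;
  text: string;
}
interface StoredChunk extends Chunk {
  embedding: number[];
}

class ReplacingMemoryStore {
  private readonly byNote = new Map<string, StoredChunk[]>();

  // Replace semantics: build the complete new row set first, then swap it in
  // with a single assignment, so chunks from an earlier version disappear.
  async saveChunks(notePath: string, chunks: Chunk[], embeddings: number[][]): Promise<void> {
    const next = chunks.map((chunk, i) => ({ ...chunk, embedding: embeddings[i] ?? [] }));
    this.byNote.set(notePath, next);
  }

  async deleteNote(notePath: string): Promise<void> {
    this.byNote.delete(notePath);
  }

  // Naive substring lookup, included only to show what remains searchable.
  find(term: string): string[] {
    const hits: string[] = [];
    this.byNote.forEach((chunks) => {
      for (const chunk of chunks) {
        if (chunk.text.includes(term)) hits.push(chunk.id);
      }
    });
    return hits;
  }
}
```

With an append or upsert implementation, a paragraph deleted from a note would keep matching queries after reindexing; with replacement it does not.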

In `@apps/indexer/src/IndexingEngine.ts`:
- Around line 33-42: scheduleUpdate (and the similar scheduleDelete) currently
fire-and-forget; change their signatures to return a Promise that resolves when
the work completes by either returning this.queue.drain() for a minimal fix or,
better, having enqueue return a per-job Promise that resolves/rejects from
processJob and then return that Promise from scheduleUpdate/scheduleDelete;
update calls to this.queue.enqueue(job) and
this.queue.process(this.processJob.bind(this)) accordingly (or leave process
setup separate) so scheduleUpdate/scheduleDelete return the queue drain or the
enqueue-provided per-job Promise instead of void.
- Around line 81-89: The embeddings array returned by this.embedder.embed may
not match chunks length, so before constructing the IndexResult and calling
this.store.saveChunks you must validate that embeddings.length === chunks.length
(or throw/return an error); update the IndexingEngine code around
embedder.embed, IndexResult creation, and the call to store.saveChunks to check
the count and fail fast with a clear error if it mismatches, ensuring you do not
persist misaligned chunk/vector pairs.
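
The fail-fast check on embedding counts could be sketched as a small helper like the one below. The name `pairChunksWithEmbeddings` and the returned pair shape are assumptions; the engine in the PR may structure this differently.

```typescript
// Pair each chunk with its vector, refusing to continue if the counts differ.
function pairChunksWithEmbeddings<T>(
  chunks: T[],
  embeddings: number[][],
): Array<{ chunk: T; embedding: number[] }> {
  if (embeddings.length !== chunks.length) {
    throw new Error(
      `embedding count mismatch: expected ${chunks.length}, got ${embeddings.length}`,
    );
  }
  // Lengths are equal here, so every index has a matching vector.
  return chunks.map((chunk, i) => ({ chunk, embedding: embeddings[i] as number[] }));
}
```

Calling this before `saveChunks` means a misbehaving embedder produces an error instead of silently shifted chunk/vector pairs.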

In `@apps/indexer/src/IndexQueue.ts`:
- Around line 15-18: The process() method currently returns immediately when
this.running is true, giving callers a false completion signal; change it to
keep and return a shared in-flight promise (e.g., this.inFlightPromise) while a
drain is active instead of returning undefined. Specifically, in
IndexQueue.process(handler) create this.inFlightPromise when you set
this.running = true, resolve/reject that promise when the queue drain finishes
(where you currently clear this.running), and when process() is called while
this.running is true simply return the existing this.inFlightPromise so callers
await actual completion; apply the same pattern to the other occurrence
referenced by the review (the second early-return at line 32).
- Around line 25-29: The catch in IndexQueue.ts around await handler(job)
currently just console.error's and swallows failures; change it to propagate the
error instead of resolving successfully so upstream (IndexingEngine) can requeue
or dead-letter the job—specifically, replace the swallow in the try/catch around
handler(job) with logic that either rethrows the caught err (throw err or return
Promise.reject(err)) or invokes the queue/job-level
negative-ack/retry/dead-letter API if one exists; update any tests/consumers
that assume success to handle the propagated failure.
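
One way the shared in-flight promise and error propagation could fit together is sketched below. `SequentialQueue`, `Job`, and the private `drain` method are hypothetical names for this example; only `enqueue`, `process`, and the idea of an in-flight promise come from the review comments.

```typescript
type Job = { type: "update" | "delete"; notePath: string };
type Handler = (job: Job) => Promise<void>;

class SequentialQueue {
  private readonly jobs: Job[] = [];
  private inFlight: Promise<void> | null = null;

  enqueue(job: Job): void {
    this.jobs.push(job);
  }

  // Callers that arrive during a drain receive the same promise, so awaiting
  // process() always means "the current drain has finished".
  process(handler: Handler): Promise<void> {
    if (this.inFlight === null) {
      this.inFlight = this.drain(handler).finally(() => {
        this.inFlight = null;
      });
    }
    return this.inFlight;
  }

  private async drain(handler: Handler): Promise<void> {
    let job: Job | undefined;
    while ((job = this.jobs.shift()) !== undefined) {
      // Jobs run strictly one after another; a rejection stops the drain
      // and propagates to every caller awaiting the shared promise.
      await handler(job);
    }
  }
}
```

In this sketch a failed job is not retried, and any jobs still queued stay there until the next `process()` call; requeueing or dead-lettering would be layered on top by the caller.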

In `@apps/indexer/src/NoteChunker.ts`:
- Line 1: The import in NoteChunker.ts imports a type-only symbol NoteChunk
using a value import; change it to a type-only import (use import type {
NoteChunk } from "./types") so transpilers/linters know it's a type-only
dependency and avoid including it in runtime output; update any references to
NoteChunk in the file as needed but no runtime code changes are required.
- Around line 16-19: The SHA1 input currently concatenates notePath, index, and
text directly in the id calculation (crypto.createHash(...).update(notePath +
index + text)), which can cause ambiguous collisions; change the update call to
join these pieces with a clear delimiter (e.g., '|' or '\0') between notePath,
index, and text when computing id in NoteChunker so each component boundary is
unambiguous.
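
The ambiguity in undelimited concatenation can be shown with two different (path, index) pairs that produce the same hash input. The helper names `naiveId` and `delimitedId` are made up for this demonstration, and the NUL delimiter is one possible choice among those the comment suggests.

```typescript
import { createHash } from "node:crypto";

const sha1 = (input: string): string =>
  createHash("sha1").update(input).digest("hex");

// Plain concatenation, as described in the review comment.
const naiveId = (notePath: string, index: number, text: string): string =>
  sha1(notePath + index + text);

// Delimited variant: the path contains no NUL and the index is all digits,
// so the first two NUL characters always mark the component boundaries.
const delimitedId = (notePath: string, index: number, text: string): string =>
  sha1([notePath, index, text].join("\0"));

// "note1" + 0 + "x" and "note" + 10 + "x" both concatenate to "note10x".
const naiveCollides = naiveId("note1", 0, "x") === naiveId("note", 10, "x");
const delimitedCollides = delimitedId("note1", 0, "x") === delimitedId("note", 10, "x");

console.log({ naiveCollides, delimitedCollides });
```

The naive form collides because both inputs are the identical string, while the delimited inputs differ and so hash to different values.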

In `@apps/indexer/tsconfig.json`:
- Around line 1-11: The tsconfig.json sets rootDir: "src" but lacks an include
array, which can cause "file is outside rootDir" errors for files like
vitest.config.ts; update the file to add an "include" array that explicitly
includes "src/**/*" and any top-level TS config/test files (e.g.,
"vitest.config.ts", "*.d.ts") or, alternatively, remove/change rootDir to a
build-only config — modify the tsconfig.json keys rootDir and add include to
cover source and config/test files so tsc no longer errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cee0a819-7e16-4592-8c37-7efbacc8a821

📥 Commits

Reviewing files that changed from the base of the PR and between a3ccb2b and fc75321.

⛔ Files ignored due to path filters (1)
  • apps/indexer/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (13)
  • .gitignore
  • apps/indexer/.gitignore
  • apps/indexer/package.json
  • apps/indexer/src/IndexQueue.ts
  • apps/indexer/src/IndexingEngine.ts
  • apps/indexer/src/NoteChunker.ts
  • apps/indexer/src/adapters/EmbeddingAdapter.ts
  • apps/indexer/src/adapters/IndexStore.ts
  • apps/indexer/src/adapters/VaultAdapter.ts
  • apps/indexer/src/demo/DemoRunner.ts
  • apps/indexer/src/index.ts
  • apps/indexer/src/types.ts
  • apps/indexer/tsconfig.json

@github-actions github-actions bot added size/L and removed size/L labels Mar 9, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/indexer/src/adapters/IndexStore.ts`:
- Line 1: The import for NoteChunk is type-only and should use TypeScript's
type-only import syntax; update the import in IndexStore.ts to use "import type
{ NoteChunk } from '../types'" so that the NoteChunk reference in the IndexStore
interface (and any related type annotations) is imported as a type-only import.

In `@apps/indexer/src/IndexingEngine.ts`:
- Around line 1-6: Change the type-only imports to use TypeScript's `import
type` for the interfaces and types: replace the imports for VaultAdapter,
EmbeddingAdapter, IndexStore, IndexResult, and IndexJob in IndexingEngine.ts
with `import type` declarations so only runtime values remain as normal imports;
ensure that NoteChunker and IndexQueue remain regular imports if they are used
at runtime and that no runtime-only imports are accidentally converted.

In `@apps/indexer/src/IndexQueue.ts`:
- Around line 15-30: The process() method on IndexQueue currently lets
exceptions from handler(job) abort the drain; either document this in the method
JSDoc or add resilience: wrap the await handler(job) call in a try/catch, call
an optional failure handler (e.g., this.onJobFailed?.(job, err) or a provided
onFailure callback) with the IndexJob and error, and continue processing
remaining jobs; ensure this.processing is still cleared in finally and keep the
shared-promise behavior intact.
- Line 1: Change the value import to a type-only import: replace the runtime
import of IndexJob with an `import type { IndexJob } from "./types"` import
since `IndexJob` is only used as a type annotation in this module (check usages
in this file to confirm no runtime references), ensuring TypeScript emits no
runtime import for IndexJob.
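
The resilience option for `process()` mentioned above, where a failed job is reported and the drain continues, might look roughly like this. The free function `drainWithFailureHandler` and the `Job` shape are assumptions for this sketch; `onFailure` mirrors the callback name suggested in the comment.

```typescript
type Job = { type: "update" | "delete"; notePath: string };

// Drain the queue in order; report failures instead of aborting the drain.
async function drainWithFailureHandler(
  jobs: Job[],
  handler: (job: Job) => Promise<void>,
  onFailure: (job: Job, err: unknown) => void,
): Promise<void> {
  let job: Job | undefined;
  while ((job = jobs.shift()) !== undefined) {
    try {
      await handler(job);
    } catch (err) {
      // Hand the failed job to the caller, then keep processing the rest.
      onFailure(job, err);
    }
  }
}
```

The trade-off against propagating the error is that the returned promise resolves even when some jobs failed, so the callback is the only place where failures become visible.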

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5deacfc7-7f51-4fa2-a77b-bec162742043

📥 Commits

Reviewing files that changed from the base of the PR and between fc75321 and 487fbfa.

📒 Files selected for processing (6)
  • apps/indexer/package.json
  • apps/indexer/src/IndexQueue.ts
  • apps/indexer/src/IndexingEngine.ts
  • apps/indexer/src/NoteChunker.ts
  • apps/indexer/src/adapters/IndexStore.ts
  • apps/indexer/tsconfig.json

@Chethan-Regala
Author

I've pushed an update addressing the automated review suggestions.
Please let me know if anything should be refined further.
