Codebase Search
The Codebase Search system gives AI assistants semantic search across your entire codebase, combining StarCoder2 tokenization with TF-IDF ranking.
A semantic search system that understands code by meaning, not just by keywords:
- StarCoder2 Tokenization - World-class code tokenization (70+ languages)
- TF-IDF Statistical Ranking - Proven relevance scoring
- Intelligent Indexing - Automatic codebase scanning and indexing
- Context-Aware - Understands code semantically through tokenization
- Real-time Updates - Can reindex as code changes
- Multi-Language - Works with 70+ programming languages
- Semantic Understanding - Find code by what it does, not what it's called
- Fast Search - TF-IDF statistical search in milliseconds
- Relevance Ranking - Results ranked by TF-IDF + semantic tokens
- Auto-Indexing - Keeps search index up-to-date
- Precise Results - Find exact code sections, not entire files
- 70+ Languages - StarCoder2 understands TypeScript, JavaScript, Python, Go, Rust, and 65+ more
- No API Needed - Runs locally, no external dependencies
# Basic search
npx @sylphx/flow codebase search "authentication logic"
# Limit results
npx @sylphx/flow codebase search "api endpoints" --limit 10
# Include more content
npx @sylphx/flow codebase search "database queries" --include-content
# JSON output for scripting
npx @sylphx/flow codebase search "user validation" --output json

# Full reindex
npx @sylphx/flow codebase reindex
# Reindex with progress
npx @sylphx/flow codebase reindex --verbose

# View indexing status
npx @sylphx/flow codebase status

When the MCP server is running, AI assistants can use these tools:
codebase_search: Search the codebase semantically.
Parameters:
- query (required): Search query describing what to find
- limit (optional): Maximum results (default: 10)
- include_content (optional): Include full code content (default: true)
Example:
// AI assistant internally calls:
codebase_search({
query: "user authentication implementation",
limit: 10,
include_content: true
})

Response:
{
"results": [
{
"file": "src/auth/user-auth.ts",
"chunk": "export function authenticateUser(credentials) { ... }",
"score": 0.92,
"line_start": 45,
"line_end": 67,
"metadata": {
"language": "typescript",
"size": 523
}
}
],
"total": 5,
"query": "user authentication implementation"
}

codebase_reindex: Trigger a full codebase reindex.
Parameters:
- None
Example:
// AI assistant internally calls:
codebase_reindex()

codebase_status: Get current indexing status.
Parameters:
- None
Example:
// AI assistant internally calls:
codebase_status()

# Traditional keyword search (limited)
grep -r "login" src/
# Semantic search (powerful)
flow codebase search "user login and authentication"

Finds:
- Login functions
- Auth middleware
- Token validation
- Session management
- Related security code
# Describe what you're looking for
flow codebase search "REST API endpoints for user management"

Finds:
- Route definitions
- Controller methods
- Request handlers
- Validation logic
- Response formatting
# Search by purpose, not syntax
flow codebase search "database queries for user data"

Finds:
- SQL queries
- ORM queries
- Database utilities
- Query builders
- Data access layers
# Find error handling patterns
flow codebase search "error handling and exception management"

Finds:
- Try-catch blocks
- Error middleware
- Custom error classes
- Error logging
- Recovery logic
1. Scan project directory
↓
2. Filter files (ignore node_modules, .git, etc.)
↓
3. Read source files
↓
4. Tokenize with StarCoder2 (70+ languages)
↓
5. Calculate TF-IDF scores
↓
6. Store in .sylphx-flow/codebase.db
↓
7. Ready for fast TF-IDF search
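
Steps 4 and 5 above can be sketched in a few lines of Python. This is a minimal illustration and not Flow's actual implementation: a simple regex tokenizer stands in for StarCoder2, the IDF formula is one common smoothed variant, and the names `tokenize` and `build_tfidf_index` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): lowercase alphanumeric runs.

    The real system uses StarCoder2 tokenization; this regex only
    approximates the idea for illustration.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf_index(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Return a sparse TF-IDF vector (term -> weight) for each document."""
    term_counts = {name: Counter(tokenize(body)) for name, body in docs.items()}

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = {}
    for name, counts in term_counts.items():
        total = sum(counts.values()) or 1
        index[name] = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return index


# Hypothetical sample files, used only to exercise the sketch.
docs = {
    "auth.ts": "function authenticateUser(token) { return verifyToken(token) }",
    "db.ts": "function queryUsers(db) { return db.select('users') }",
}
index = build_tfidf_index(docs)
```

Terms that are frequent in one file but rare across the project (here `token`) end up with higher weights than terms that appear everywhere (here `function`), which is what lets the ranking favor distinctive code.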
1. User/AI searches: "authentication logic"
↓
2. Query tokenized with StarCoder2
↓
3. TF-IDF statistical search
↓
4. Cosine similarity ranking
↓
5. Results ranked by relevance score
↓
6. Code sections returned with context
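
Steps 3 to 5 amount to comparing a query vector against each indexed vector. The following is a minimal sketch, assuming sparse TF-IDF vectors stored as `term -> weight` dictionaries; the function names, file paths, and sample weights are hypothetical and do not come from the real indexer.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, index, limit=10):
    """Return up to `limit` (name, score) pairs, best match first.

    Entries that share no terms with the query (score 0.0) are dropped.
    """
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored = [item for item in scored if item[1] > 0.0]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Illustrative weights only, not output from the real indexer.
index = {
    "src/auth/login.ts": {"authentication": 0.6, "token": 0.4},
    "src/db/users.ts": {"query": 0.7, "token": 0.1},
    "src/ui/button.tsx": {"render": 0.9},
}
query_vec = {"authentication": 1.0, "token": 1.0}
results = rank(query_vec, index, limit=2)
```

Because cosine similarity normalizes by vector length, a short chunk that is mostly about the query terms can outrank a long file that merely mentions them.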
Indexed:
- Source code files (.ts, .js, .py, .go, .rs, etc.)
- Configuration files (.json, .yaml, .toml)
- Documentation (.md)
Ignored:
- Dependencies (node_modules, vendor, etc.)
- Build artifacts (dist, build, target, etc.)
- Version control (.git, .svn)
- Binary files
- Media files
Scenario: New developer needs to understand authentication
# Find all authentication-related code
flow codebase search "authentication and authorization"
# Find specific implementation
flow codebase search "JWT token validation"

Result: AI assistant or developer quickly finds relevant code.
Scenario: Implementing similar feature to existing one
# Find existing implementation
flow codebase search "payment processing workflow"
# Study the patterns
flow codebase search "error handling in payment code"

Result: Consistent implementation following existing patterns.
Scenario: Reviewing security-critical code
# Find all input validation
flow codebase search "user input validation and sanitization"
# Find authentication checks
flow codebase search "authentication middleware"

Result: Comprehensive security review.
Scenario: Refactoring authentication system
# Find all authentication code
flow codebase search "user authentication logic"
# Find related code
flow codebase search "session management"
flow codebase search "token generation"

Result: Complete understanding of authentication system.
Scenario: AI agent implementing a feature
# AI searches for similar code
flow run "implement user registration" --agent coder
# AI internally calls:
# codebase_search("user registration implementation")
# codebase_search("form validation patterns")
# codebase_search("database user creation")

Result: AI implements feature following existing patterns.
# Codebase search uses TF-IDF (primary method)
# No API key required for basic functionality
# Optional: Future support for OpenAI-compatible vector embeddings
# OPENAI_API_KEY=your-key-here # Not yet implemented for codebase
# Note: Knowledge base supports optional vector embeddings
# Codebase search currently uses TF-IDF only (fast and accurate)

# Start with codebase search enabled (default)
flow mcp start
# Disable codebase search
flow mcp start --disable-codebase

File Extensions Indexed:
// Source code
.ts, .tsx, .js, .jsx, .py, .go, .rs, .java, .cpp, .c, .h
// Configuration
.json, .yaml, .yml, .toml, .ini
// Documentation
.md, .mdx, .txt
// Markup & Styles
.html, .css, .scss, .sass

Ignored Patterns:
node_modules/
dist/
build/
target/
.git/
.next/
__pycache__/
*.min.js
*.bundle.js
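
The extension and ignore lists above can be combined into a single filter. This is a minimal sketch under stated assumptions: the function name `should_index` is hypothetical, paths are project-relative with forward slashes, and the real indexer may apply additional rules (for example `.gitignore` handling) that are not modeled here.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Copied from the lists above.
INDEXED_EXTENSIONS = {
    ".ts", ".tsx", ".js", ".jsx", ".py", ".go", ".rs", ".java", ".cpp", ".c", ".h",
    ".json", ".yaml", ".yml", ".toml", ".ini",
    ".md", ".mdx", ".txt",
    ".html", ".css", ".scss", ".sass",
}
IGNORED_DIRS = {"node_modules", "dist", "build", "target", ".git", ".next", "__pycache__"}
IGNORED_FILE_GLOBS = ("*.min.js", "*.bundle.js")


def should_index(relative_path: str) -> bool:
    """Decide whether a project-relative POSIX path would be indexed."""
    path = PurePosixPath(relative_path)
    # Skip anything that lives under an ignored directory.
    if any(part in IGNORED_DIRS for part in path.parts[:-1]):
        return False
    # Skip minified and bundled files by name.
    if any(fnmatch(path.name, pattern) for pattern in IGNORED_FILE_GLOBS):
        return False
    # Otherwise index only known extensions.
    return path.suffix in INDEXED_EXTENSIONS
```

For example, `should_index("src/auth/user-auth.ts")` is true, while `should_index("node_modules/lib/index.js")` and `should_index("public/app.min.js")` are both false.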
# View indexing status and statistics
flow codebase status

Example Output:
Codebase Search Status
=========================
Status: Indexed and ready
Index Statistics:
• Total files: 347
• Indexed files: 285
• Skipped files: 62
• Total chunks: 1,847
• Vector dimensions: 1536
Languages:
• TypeScript: 215 files
• JavaScript: 45 files
• JSON: 18 files
• Markdown: 7 files
Database:
• Size: 12.4 MB
• Last indexed: 2025-10-30 18:45:00
• Index age: 15 minutes
Database: .sylphx-flow/codebase.db
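
For scripting, the size and age figures can be approximated directly from the index file. This is a rough sketch, assuming the file's modification time is a reasonable proxy for the last indexing time; the `status` command may well read these values from the database itself, and the helper name `index_file_stats` is hypothetical.

```python
import os
import time
from typing import Optional


def index_file_stats(db_path: str, now: Optional[float] = None) -> dict:
    """Report size and age of an index file from filesystem metadata."""
    info = os.stat(db_path)
    current = time.time() if now is None else now
    return {
        "size_mb": info.st_size / (1024 * 1024),
        # Assumption: mtime approximates the time of the last reindex.
        "age_minutes": max(0.0, (current - info.st_mtime) / 60),
    }


if __name__ == "__main__":
    path = ".sylphx-flow/codebase.db"
    if os.path.exists(path):
        print(index_file_stats(path))
```

A check like this could feed a git hook or scheduled job that reindexes only when the index is older than some threshold.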
- Small projects (<100 files): 10-30 seconds
- Medium projects (100-500 files): 30-90 seconds
- Large projects (500-2000 files): 2-5 minutes
- Very large projects (2000+ files): 5-15 minutes
- Cold search: ~200-300ms
- Warm search: ~50-100ms
- Large codebase: ~100-200ms
# Reindex during low usage
flow codebase reindex
# Limit search results for speed
flow codebase search "query" --limit 5
# Use specific queries for better results
flow codebase search "specific implementation detail"
# Clean old index before reindexing
rm .sylphx-flow/codebase.db
flow codebase reindex

Good Queries:
# Specific and descriptive
flow codebase search "user authentication with JWT tokens"
flow codebase search "API error handling middleware"
flow codebase search "database connection pooling"
flow codebase search "React component lazy loading"
# Focused on intent
flow codebase search "validate email addresses"
flow codebase search "handle file uploads"
flow codebase search "parse JSON configuration"

Poor Queries:
# Too vague
flow codebase search "code"
flow codebase search "function"
# Too broad
flow codebase search "all user code"
# Just keywords
flow codebase search "auth"

1. Describe the Purpose:
flow codebase search "code that processes payment transactions"

2. Specify the Context:
flow codebase search "React hooks for managing form state"

3. Include Related Concepts:
flow codebase search "authentication middleware with session validation"

4. Use Natural Language:
flow codebase search "how are user permissions checked"

Problem: Search returns no results
Solutions:
# Check if codebase is indexed
flow codebase status
# Reindex the codebase
flow codebase reindex
# Try broader search query
flow codebase search "broader term"
# Verify database exists
ls -la .sylphx-flow/codebase.db

Problem: Indexing takes too long
Solutions:
# Check file count
flow codebase status
# Verify .gitignore is being respected
# Large directories should be ignored
# Consider selective indexing
# Add patterns to .gitignore

Problem: Search returns old code
Solutions:
# Reindex to update
flow codebase reindex
# Set up automatic reindexing
# (Consider adding to git hooks)
# Verify last index time
flow codebase status

Problem: Search returns no results or fails
Solutions:
# Check if codebase is indexed
flow codebase status
# Reindex if needed
flow codebase reindex
# Verify database exists
ls -la .sylphx-flow/codebase.db
# Test with simpler query
flow codebase search "function"
# Note: No API key needed - uses local StarCoder2 tokenization

# Search knowledge for patterns
flow knowledge search "authentication patterns"
# Then search codebase for implementations
flow codebase search "authentication implementation"
# Compare and identify gaps

# Agent automatically uses codebase search
flow run "refactor authentication system" --agent coder
# Agent will internally:
# 1. codebase_search("authentication implementation")
# 2. knowledge_search("authentication best practices")
# 3. Implement refactoring

# Get results as JSON
flow codebase search "api endpoints" --output json > results.json
# Process with jq
flow codebase search "auth" --output json | jq '.results[].file'
# Integrate with other tools

# After significant code changes
flow codebase reindex
# After pulling updates
git pull && flow codebase reindex
# Scheduled (via cron)
0 */6 * * * cd /path/to/project && flow codebase reindex

# Check index size
du -sh .sylphx-flow/codebase.db
# Clean and rebuild
rm .sylphx-flow/codebase.db
flow codebase reindex
# Backup index (optional)
cp .sylphx-flow/codebase.db backup/codebase-$(date +%Y%m%d).db

| Project Size | Files | Chunks | Index Size |
|---|---|---|---|
| Small | 50-100 | ~500 | ~2-3 MB |
| Medium | 100-500 | ~2,000 | ~8-12 MB |
| Large | 500-2000 | ~8,000 | ~30-50 MB |
| Very Large | 2000+ | ~20,000+ | ~80-150 MB |
- Search codebase before implementing new features
- Use search to understand existing patterns
- Combine with knowledge base for best practices
- Verify assumptions with code search
- Use semantic search for code discovery
- Find similar implementations before coding
- Discover undocumented features
- Understand legacy code quickly
- Onboard new developers faster
- Maintain consistency across codebase
- Document patterns through code examples
- Share knowledge through searchable code
- Find all instances of a pattern
- Verify consistent implementation
- Discover edge cases
- Check for similar bugs
- Knowledge Base - Search development guidelines
- Agent Framework - Use agents with codebase search
- MCP Integration - Connect AI tools
- CLI Commands - Complete command reference
Last Updated: 2025-10-30