## Table of Contents

- Semantic Search for Git Repositories
- What You'll Build
- Architecture
- Prerequisites
- Quick Start
  - 1. Start Ollama and Pull Model
  - 2. Start VSKI Server
  - 3. Initialize the Database
  - 4. Index Your Code
  - 5. Search
- How It Works
- Project Structure
- Step-by-Step Tutorial
  - Step 1: Docker Compose Setup
  - Step 2: Define the Schema
  - Step 3: Embedding Generation
  - Step 4: File Indexing
  - Step 5: Semantic Search
  - Step 6: CLI Entry Point
- Configuration
- Advanced Usage
- Troubleshooting
- Next Steps
- Resources
# Semantic Search for Git Repositories

> **Note:** See the source code at https://vski.sh/x/embeddings_tutorial
A tutorial on building a semantic code search service using VSKI's vector embeddings and Ollama. This guide walks you through creating a CLI tool that indexes your codebase and enables natural language search across it.
## What You'll Build

A CLI tool with two commands:

- `index <path>` - indexes a directory, tracks git changes, and generates embeddings
- `search <query>` - performs semantic search and returns ranked results with line numbers
## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    CLI Tool     │────▶│   VSKI Server   │────▶│  SQLite + Vec   │
│   (Deno/TS)     │     │    (Docker)     │     │   (Embedded)    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │
         │ HTTP
         ▼
┌─────────────────┐
│     Ollama      │
│  (Embeddings)   │
│     (Local)     │
└─────────────────┘
```
## Prerequisites

- Docker (with Docker Compose)
- Deno
- Ollama installed locally
- git
## Quick Start

### 1. Start Ollama and Pull Model

```shell
# Ensure Ollama is running
ollama serve

# Pull the embedding model
ollama pull nomic-embed-text
```
### 2. Start VSKI Server

```shell
docker compose up -d
```

This starts the VSKI server on http://localhost:8347.

**Create admin account (first time only).** Navigate to the installer at http://localhost:8347/installer and create your admin account, or use curl:

```shell
curl -X POST http://localhost:8347/api/admins/init \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@example.com","password":"yourpassword"}'
```
### 3. Initialize the Database

```shell
# Install the VSKI CLI
deno install --global -A -f -n vski https://vski.sh/x/vski-js/raw/branch/main/dist/main.js

# Log in via the CLI
vski login --url http://localhost:8347 --email admin@example.com --password yourpassword

# Create the database
vski db create semantic_search

# Apply the migration
vski migrate migrations/001_initial.ts --db=semantic_search
```
### 4. Index Your Code

```shell
# Index the current directory
deno task index .

# Index a specific directory
deno task index ./src
```
### 5. Search

```shell
deno task search "error handling"
deno task search "database connection"
deno task search "authentication middleware"
```
## How It Works

### Indexing Process

- **File discovery** - walks the directory tree and detects text files via MIME type
- **Git integration** - records the current branch and commit hash, and detects uncommitted changes
- **Chunking** - splits files into ~500-token segments with a 50-token overlap
- **Line tracking** - records the line range of each chunk
- **Embedding generation** - sends each chunk to Ollama for vector embedding
- **Storage** - stores chunks and embeddings in VSKI

### Search Process

- **Query embedding** - converts the search query to a vector via Ollama
- **Vector search** - queries VSKI for similar embeddings
- **Ranking** - returns results sorted by similarity score
- **Formatting** - displays file paths, line ranges, and content previews
## Project Structure

```
docs/semantic_search/
├── README.md            # This file
├── docker-compose.yml   # Service definitions
├── deno.json            # Deno configuration
├── migrations/
│   └── 001_initial.ts   # Collection schema migration
└── src/
    ├── main.ts          # CLI entry point
    ├── indexer.ts       # Indexing logic
    ├── embedder.ts      # Ollama integration
    ├── searcher.ts      # Search logic
    ├── types.ts         # Type definitions
    └── utils.ts         # Utility functions
```
## Step-by-Step Tutorial

### Step 1: Docker Compose Setup

The `docker-compose.yml` defines the VSKI service:

```yaml
services:
  vski:
    image: vski.sh/x/vski:latest-standalone
    ports:
      - "8347:8347"
    environment:
      - SERVER_PORT=8347
      - DATA_DIR=/app/data
      - JWT_SECRET=dev-secret-change-in-production
    volumes:
      - vski_data:/app/data

volumes:
  vski_data:
```
Ollama runs locally (not in Docker) for better GPU access.
### Step 2: Define the Schema

The `code_chunks` collection stores indexed code segments:

```typescript
// migrations/001_initial.ts
export const migrations = [
  {
    name: "001_initial",
    up: async (client: VskiClient) => {
      await client.settings.collections.create({
        name: "code_chunks",
        type: "base",
        fields: [
          { name: "file_path", type: "text", required: true },
          { name: "content", type: "text", required: true },
          { name: "line_start", type: "number", required: true },
          { name: "line_end", type: "number", required: true },
          { name: "branch", type: "text" },
          { name: "commit_hash", type: "text" },
          { name: "modified", type: "bool" },
        ],
      });
    },
  },
];
```
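Downstream code can mirror the collection in a TypeScript interface. The shape below is an assumption derived from the fields above (`id` is assigned by VSKI when a record is created, and the optional fields match the non-required columns):

```typescript
// Hypothetical record shape mirroring the code_chunks collection.
export interface CodeChunkRecord {
  id: string; // assigned by VSKI on create
  file_path: string;
  content: string;
  line_start: number;
  line_end: number;
  branch?: string;
  commit_hash?: string;
  modified?: boolean;
}
```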
### Step 3: Embedding Generation

The embedder module handles communication with Ollama:

```typescript
// src/embedder.ts
export async function generateEmbedding(text: string): Promise<number[]> {
  const response = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "nomic-embed-text",
      prompt: text,
    }),
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.embedding;
}
```
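Embedding calls can fail transiently (model still loading, timeouts). A small generic retry helper with exponential backoff, sketched here as an illustration rather than part of the tutorial source, keeps indexing robust:

```typescript
// Hypothetical retry wrapper; backoff parameters are illustrative.
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before retrying, doubling the delay each attempt.
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Indexing code could then call `withRetry(() => generateEmbedding(chunk.content))` instead of calling the embedder directly.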
### Step 4: File Indexing

The indexer walks directories, chunks text, and stores embeddings:

```typescript
// src/indexer.ts
export async function indexDirectory(
  client: VskiClient,
  dirPath: string,
): Promise<void> {
  const files = await discoverTextFiles(dirPath);
  const gitInfo = await getGitInfo(dirPath);

  for (const file of files) {
    const content = await Deno.readTextFile(file.path);
    const chunks = chunkText(content, file.path);

    for (const chunk of chunks) {
      const embedding = await generateEmbedding(chunk.content);
      const record = await client.collection("code_chunks").create({
        file_path: chunk.file_path,
        content: chunk.content,
        line_start: chunk.line_start,
        line_end: chunk.line_end,
        branch: gitInfo.branch,
        commit_hash: gitInfo.commit,
        modified: gitInfo.hasChanges,
      });
      await client.embeddings.upsert("code_chunks", record.id, embedding);
    }
  }
}
```
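The `chunkText` helper used above is not shown in the tutorial source. A minimal sketch follows; note it approximates the ~500-token / 50-token-overlap strategy with line counts rather than real token counts, which is an assumption made here for simplicity:

```typescript
interface Chunk {
  file_path: string;
  content: string;
  line_start: number; // 1-based, inclusive
  line_end: number;   // 1-based, inclusive
}

// Line-based approximation: windows of `chunkLines` lines that
// advance by (chunkLines - overlapLines) so consecutive chunks overlap.
export function chunkText(
  text: string,
  filePath: string,
  chunkLines = 40,
  overlapLines = 4,
): Chunk[] {
  const lines = text.split("\n");
  const step = Math.max(1, chunkLines - overlapLines);
  const chunks: Chunk[] = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + chunkLines, lines.length);
    chunks.push({
      file_path: filePath,
      content: lines.slice(start, end).join("\n"),
      line_start: start + 1,
      line_end: end,
    });
    if (end === lines.length) break; // final window reached
  }
  return chunks;
}
```

A token-aware implementation would substitute a tokenizer for `split("\n")` but keep the same windowing logic.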
### Step 5: Semantic Search

The searcher queries embeddings and formats results:

```typescript
// src/searcher.ts
export async function searchCode(
  client: VskiClient,
  query: string,
  limit: number = 10,
): Promise<SearchResult[]> {
  const embedding = await generateEmbedding(query);
  const results = await client.embeddings.search("code_chunks", embedding, {
    limit,
    threshold: 0.7,
  });

  return results.results.map((r) => ({
    filePath: r.record.file_path,
    lineStart: r.record.line_start,
    lineEnd: r.record.line_end,
    content: r.record.content,
    distance: r.distance,
  }));
}
```
### Step 6: CLI Entry Point

The main module ties everything together:

```typescript
// src/main.ts
import { Command } from "@cliffy/command";
import { indexDirectory } from "./indexer.ts";
import { searchCode } from "./searcher.ts";

// `client` (a VskiClient authenticated via VSKI_URL/VSKI_TOKEN) and
// `displayResults` are defined elsewhere in src/ and omitted here.

await new Command()
  .name("semantic-search")
  .command("index <path:string>")
  .action(async (_, path) => {
    await indexDirectory(client, path);
  })
  .command("search <query:string>")
  .action(async (_, query) => {
    const results = await searchCode(client, query);
    displayResults(results);
  })
  .parse(Deno.args);
```
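`displayResults` is referenced but not shown. A hypothetical formatter consistent with the `SearchResult` shape from Step 5 might look like this (assuming `distance` lies in [0, 1], so `1 - distance` serves as a score):

```typescript
interface SearchResult {
  filePath: string;
  lineStart: number;
  lineEnd: number;
  content: string;
  distance: number;
}

// One header line per hit plus a short, indented content preview.
export function formatResult(r: SearchResult, previewLines = 3): string {
  const score = (1 - r.distance).toFixed(3);
  const header = `${r.filePath}:${r.lineStart}-${r.lineEnd}  (score ${score})`;
  const preview = r.content
    .split("\n")
    .slice(0, previewLines)
    .map((line) => `  │ ${line}`)
    .join("\n");
  return `${header}\n${preview}`;
}

export function displayResults(results: SearchResult[]): void {
  if (results.length === 0) {
    console.log("No matches.");
    return;
  }
  for (const r of results) console.log(formatResult(r) + "\n");
}
```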
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `VSKI_URL` | `http://localhost:8347` | VSKI server URL |
| `VSKI_TOKEN` | - | Admin JWT token |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL |
| `EMBEDDING_MODEL` | `nomic-embed-text` | Ollama model for embeddings |
### Chunking Parameters

| Parameter | Default | Description |
|---|---|---|
| `CHUNK_SIZE` | `500` | Target tokens per chunk |
| `CHUNK_OVERLAP` | `50` | Token overlap between chunks |
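One way to wire these settings up in Deno is a small config object; the variable names follow the tables above, and the structure itself is a sketch, not part of the tutorial source:

```typescript
// Hypothetical config loader: defaults apply when a variable is unset.
export const config = {
  vskiUrl: Deno.env.get("VSKI_URL") ?? "http://localhost:8347",
  vskiToken: Deno.env.get("VSKI_TOKEN") ?? "",
  ollamaUrl: Deno.env.get("OLLAMA_URL") ?? "http://localhost:11434",
  embeddingModel: Deno.env.get("EMBEDDING_MODEL") ?? "nomic-embed-text",
  chunkSize: Number(Deno.env.get("CHUNK_SIZE") ?? "500"),
  chunkOverlap: Number(Deno.env.get("CHUNK_OVERLAP") ?? "50"),
};
```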
## Advanced Usage

### Index Only Changed Files

The indexer automatically detects uncommitted changes via git, so you can filter discovery down to just those files:

```typescript
// Only index modified files; isModified is a helper that checks git status.
const files = await discoverTextFiles(dirPath);
const modifiedFiles = files.filter((f) => isModified(f.path));
```
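`isModified` is not shown in the tutorial source. One way to implement the underlying check is to parse the output of `git status --porcelain` once and test membership; the parser below is a hypothetical sketch:

```typescript
// Parse `git status --porcelain` output into the set of paths with
// uncommitted changes. Each line is "XY <path>", or
// "XY <old> -> <new>" for renames, where we keep the new path.
export function parseModifiedPaths(porcelain: string): Set<string> {
  const paths = new Set<string>();
  for (const line of porcelain.split("\n")) {
    if (line.trim() === "") continue;
    const rest = line.slice(3); // skip the two status chars and a space
    const arrow = rest.indexOf(" -> ");
    paths.add(arrow === -1 ? rest : rest.slice(arrow + 4));
  }
  return paths;
}
```

The porcelain text itself can be captured in Deno with `Deno.Command("git", { args: ["status", "--porcelain"] })`, after which `isModified` reduces to a `Set.has` lookup.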
### Re-indexing

To re-index from scratch, delete the existing chunks first:

```typescript
// Clear existing chunks
const all = await client.collection("code_chunks").getList(1, 10000);
for (const item of all.items) {
  await client.collection("code_chunks").delete(item.id);
}

// Re-index
await indexDirectory(client, dirPath);
```
### Multiple Branches

Index multiple branches for comparison:

```shell
git checkout feature-branch
deno task index .
git checkout main
deno task index .
```

Then search with a branch filter:

```typescript
const results = await client.collection("code_chunks").getList(1, 20, {
  filter: 'branch = "feature-branch"',
});
```
## Troubleshooting

### Ollama Model Not Found

Since Ollama runs locally rather than in Docker, pull the model directly:

```shell
ollama pull nomic-embed-text
```

### VSKI Authentication Failed

```shell
# Log in again
vski login --url http://localhost:8347
```

### Slow Indexing

The indexer processes files and chunks sequentially. For large codebases, consider:

- Reducing chunk size
- Indexing specific directories
- Using a GPU-enabled Ollama setup
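Chunks are also independent of each other, so embedding calls can run concurrently. A small order-preserving concurrency helper (illustrative, not part of the tutorial source) can parallelize the per-chunk work:

```typescript
// Run `fn` over `items` with at most `limit` calls in flight,
// returning results in input order.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item to claim (safe: JS is single-threaded)
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

The inner loop of `indexDirectory` could then become something like `await mapWithConcurrency(chunks, 4, (c) => generateEmbedding(c.content))`, keeping Ollama busy without unbounded parallelism.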
## Next Steps
- Add file watching for automatic re-indexing
- Implement incremental updates (only index changed files)
- Add support for more embedding models
- Build a web UI for search
- Add syntax highlighting in previews