1 Tutorial: Embeddings
x edited this page 2026-03-02 18:44:50 +00:00

Semantic Search for Git Repositories

Note

See source code at https://vski.sh/x/embeddings_tutorial

A tutorial on building a semantic code search service using VSKI's vector embeddings and Ollama. This guide walks you through creating a CLI tool that indexes your codebase and enables natural language search across it.

What You'll Build

A CLI tool with two commands:

  • index <path> - Indexes a directory, tracks git changes, generates embeddings
  • search <query> - Performs semantic search, returns ranked results with line numbers

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   CLI Tool      │────▶│   VSKI Server   │────▶│  SQLite + Vec   │
│   (Deno/TS)     │     │   (Docker)      │     │  (Embedded)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
          │
          │ HTTP
          ▼
┌─────────────────┐
│     Ollama      │
│  (Embeddings)   │
│   (Local)       │
└─────────────────┘

Prerequisites

  • Docker (to run the VSKI server)
  • Deno (to run the CLI tool)
  • Ollama installed locally
  • Git (for change tracking)

Quick Start

1. Start Ollama and Pull Model

# Ensure Ollama is running
ollama serve

# Pull the embedding model
ollama pull nomic-embed-text

2. Start VSKI Server

docker compose up -d

This starts the VSKI server on http://localhost:8347.

Create admin account (first time only)

Navigate to installer http://localhost:8347/installer and create your admin.

Or use curl:

curl -X POST http://localhost:8347/api/admins/init \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@example.com","password":"yourpassword"}'

3. Initialize the Database

# Install VSKI CLI
deno install --global -A -f -n vski https://vski.sh/x/vski-js/raw/branch/main/dist/main.js

# Login via CLI
vski login --url http://localhost:8347 --email admin@example.com --password yourpassword

# Create database
vski db create semantic_search

# Apply migration
vski migrate migrations/001_initial.ts --db=semantic_search

4. Index Your Code

# Index current directory
deno task index .

# Index a specific directory
deno task index ./src

5. Search Your Code

deno task search "error handling"
deno task search "database connection"
deno task search "authentication middleware"

How It Works

Indexing Process

  1. File Discovery - Walks the directory tree, detects text files via MIME type
  2. Git Integration - Gets current branch, commit hash, and detects uncommitted changes
  3. Chunking - Splits files into ~500 token segments with 50 token overlap
  4. Line Tracking - Records line numbers for each chunk
  5. Embedding Generation - Sends chunks to Ollama for vector embedding
  6. Storage - Stores chunks and embeddings in VSKI
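
The chunking and line-tracking steps above can be sketched as a simple splitter. This is a simplified sketch that operates on lines rather than tokens (the real indexer targets ~500 tokens with 50-token overlap); the `Chunk` shape mirrors the fields stored later.

```typescript
// Minimal line-based chunker: approximates the token-based strategy by
// advancing a fixed number of lines per chunk with a small line overlap.
interface Chunk {
  file_path: string;
  content: string;
  line_start: number; // 1-based, inclusive
  line_end: number;   // 1-based, inclusive
}

export function chunkText(
  content: string,
  filePath: string,
  linesPerChunk = 40,
  overlapLines = 4,
): Chunk[] {
  const lines = content.split("\n");
  const chunks: Chunk[] = [];
  const step = linesPerChunk - overlapLines;
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + linesPerChunk, lines.length);
    chunks.push({
      file_path: filePath,
      content: lines.slice(start, end).join("\n"),
      line_start: start + 1,
      line_end: end,
    });
    if (end === lines.length) break;
  }
  return chunks;
}
```

The overlap means a function that straddles a chunk boundary still appears whole in at least one chunk.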

Search Process

  1. Query Embedding - Converts search query to vector via Ollama
  2. Vector Search - Queries VSKI for similar embeddings
  3. Ranking - Returns results sorted by similarity score
  4. Formatting - Displays file paths, line ranges, and content previews
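
Under the hood, vector search ranks candidates by a distance metric between embeddings. As an illustrative aside (not a claim about VSKI's internal metric), cosine similarity is a common choice and can be computed like this:

```typescript
// Cosine similarity between two equal-length vectors: 1.0 means the vectors
// point in the same direction, 0.0 means they are orthogonal. Many vector
// stores rank results by a metric like this (shown for illustration only).
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```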

Project Structure

docs/semantic_search/
├── README.md              # This file
├── docker-compose.yml     # Service definitions
├── deno.json              # Deno configuration
├── migrations/
│   └── 001_initial.ts     # Collection schema migration
└── src/
    ├── main.ts            # CLI entry point
    ├── indexer.ts         # Indexing logic
    ├── embedder.ts        # Ollama integration
    ├── searcher.ts        # Search logic
    ├── types.ts           # Type definitions
    └── utils.ts           # Utility functions

Step-by-Step Tutorial

Step 1: Docker Compose Setup

The docker-compose.yml defines the VSKI service:

services:
  vski:
    image: vski.sh/x/vski:latest-standalone
    ports:
      - "8347:8347"
    environment:
      - SERVER_PORT=8347
      - DATA_DIR=/app/data
      - JWT_SECRET=dev-secret-change-in-production
    volumes:
      - vski_data:/app/data

volumes:
  vski_data:

Ollama runs locally (not in Docker) for better GPU access.

Step 2: Define the Schema

The code_chunks collection stores indexed code segments:

// migrations/001_initial.ts
export const migrations = [
  {
    name: "001_initial",
    up: async (client: VskiClient) => {
      await client.settings.collections.create({
        name: "code_chunks",
        type: "base",
        fields: [
          { name: "file_path", type: "text", required: true },
          { name: "content", type: "text", required: true },
          { name: "line_start", type: "number", required: true },
          { name: "line_end", type: "number", required: true },
          { name: "branch", type: "text" },
          { name: "commit_hash", type: "text" },
          { name: "modified", type: "bool" },
        ],
      });
    },
  },
];

Step 3: Embedding Generation

The embedder module handles communication with Ollama:

// src/embedder.ts
export async function generateEmbedding(text: string): Promise<number[]> {
  const response = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "nomic-embed-text",
      prompt: text,
    }),
  });
  const data = await response.json();
  return data.embedding;
}

Step 4: File Indexing

The indexer walks directories, chunks text, and stores embeddings:

// src/indexer.ts
export async function indexDirectory(
  client: VskiClient,
  dirPath: string
): Promise<void> {
  const files = await discoverTextFiles(dirPath);
  const gitInfo = await getGitInfo(dirPath);
  
  for (const file of files) {
    const content = await Deno.readTextFile(file.path);
    const chunks = chunkText(content, file.path);
    
    for (const chunk of chunks) {
      const embedding = await generateEmbedding(chunk.content);
      
      const record = await client.collection("code_chunks").create({
        file_path: chunk.file_path,
        content: chunk.content,
        line_start: chunk.line_start,
        line_end: chunk.line_end,
        branch: gitInfo.branch,
        commit_hash: gitInfo.commit,
        modified: gitInfo.hasChanges,
      });
      
      await client.embeddings.upsert("code_chunks", record.id, embedding);
    }
  }
}

Step 5: Semantic Search

The searcher queries embeddings and formats results:

// src/searcher.ts
export async function searchCode(
  client: VskiClient,
  query: string,
  limit: number = 10
): Promise<SearchResult[]> {
  const embedding = await generateEmbedding(query);
  
  const results = await client.embeddings.search("code_chunks", embedding, {
    limit,
    threshold: 0.7,
  });
  
  return results.results.map((r) => ({
    filePath: r.record.file_path,
    lineStart: r.record.line_start,
    lineEnd: r.record.line_end,
    content: r.record.content,
    distance: r.distance,
  }));
}
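
The `displayResults` helper used by the CLI is not shown above. A minimal sketch (the exact output format is an assumption, not the tutorial's actual implementation) might be:

```typescript
// Result shape matching what searchCode returns.
interface SearchResult {
  filePath: string;
  lineStart: number;
  lineEnd: number;
  content: string;
  distance: number;
}

// Formats each result as "path:start-end (distance)" plus a short,
// whitespace-collapsed content preview. Pure function, easy to test.
export function formatResults(results: SearchResult[], previewLen = 80): string[] {
  return results.map((r) => {
    const preview = r.content.replace(/\s+/g, " ").slice(0, previewLen);
    return `${r.filePath}:${r.lineStart}-${r.lineEnd} (${r.distance.toFixed(3)})\n  ${preview}`;
  });
}

export function displayResults(results: SearchResult[]): void {
  for (const line of formatResults(results)) console.log(line);
}
```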

Step 6: CLI Entry Point

The main module ties everything together:

// src/main.ts
import { Command } from "@cliffy/command";
import { indexDirectory } from "./indexer.ts";
import { searchCode } from "./searcher.ts";
// client construction and displayResults live in utils.ts (not shown here)
import { client, displayResults } from "./utils.ts";

await new Command()
  .name("semantic-search")
  .command("index <path:string>")
  .action(async (_, path) => {
    await indexDirectory(client, path);
  })
  .command("search <query:string>")
  .action(async (_, query) => {
    const results = await searchCode(client, query);
    displayResults(results);
  })
  .parse(Deno.args);

Configuration

Environment Variables

Variable         Default                 Description
VSKI_URL         http://localhost:8347   VSKI server URL
VSKI_TOKEN       -                       Admin JWT token
OLLAMA_URL       http://localhost:11434  Ollama server URL
EMBEDDING_MODEL  nomic-embed-text        Ollama model for embeddings

Chunking Parameters

Parameter      Default  Description
CHUNK_SIZE     500      Target tokens per chunk
CHUNK_OVERLAP  50       Token overlap between chunks
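
Because each chunk advances by (size - overlap) tokens, a file of N tokens produces roughly 1 + ceil((N - size) / (size - overlap)) chunks. A small sketch of that arithmetic, useful for estimating how many embedding calls indexing will make:

```typescript
// Estimates how many overlapping chunks a file of `totalTokens` produces.
// Larger overlap means a smaller step per chunk, so more chunks (and more
// embedding calls) for the same file.
export function estimateChunkCount(
  totalTokens: number,
  chunkSize = 500,
  overlap = 50,
): number {
  if (totalTokens <= chunkSize) return totalTokens > 0 ? 1 : 0;
  const step = chunkSize - overlap;
  return 1 + Math.ceil((totalTokens - chunkSize) / step);
}
```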

Advanced Usage

Index Only Changed Files

The indexer automatically detects uncommitted changes via git. You can filter:

// Only index modified files
const files = await discoverTextFiles(dirPath);
const modifiedFiles = files.filter(f => isModified(f.path));
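
One way to implement the `isModified` check (this helper is a sketch, not part of the tutorial source) is to parse `git status --porcelain` output once and look paths up in a set:

```typescript
// Parses `git status --porcelain` output into the set of changed paths.
// Each line looks like "XY <path>" where XY is a two-character status
// code, so the path starts at column 3. (Renames are not handled here.)
export function parsePorcelain(output: string): Set<string> {
  const changed = new Set<string>();
  for (const line of output.split("\n")) {
    if (line.length > 3) changed.add(line.slice(3).trim());
  }
  return changed;
}

// In the indexer, the output would come from running git, e.g. in Deno:
//   const { stdout } = await new Deno.Command("git",
//     { args: ["status", "--porcelain"] }).output();
//   const changed = parsePorcelain(new TextDecoder().decode(stdout));
```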

Re-indexing

To re-index from scratch:

// Clear existing chunks
const all = await client.collection("code_chunks").getList(1, 10000);
for (const item of all.items) {
  await client.collection("code_chunks").delete(item.id);
}

// Re-index
await indexDirectory(client, dirPath);

Multiple Branches

Index multiple branches for comparison:

git checkout feature-branch
deno task index .

git checkout main
deno task index .

Then search with branch filter:

const results = await client.collection("code_chunks").getList(1, 20, {
  filter: 'branch = "feature-branch"',
});

Troubleshooting

Ollama Model Not Found

# Pull the model manually (Ollama runs locally, not in Docker)
ollama pull nomic-embed-text

VSKI Authentication Failed

# Re-login
vski login --url http://localhost:8347

Slow Indexing

The indexer processes files sequentially. For large codebases, consider:

  • Reducing chunk size
  • Indexing specific directories
  • Using a GPU-enabled Ollama setup
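
Another option is to embed several chunks concurrently instead of one at a time. A generic bounded-concurrency helper is sketched below (not part of the tutorial source; Ollama may serialize embedding requests internally, so measure before tuning the limit):

```typescript
// Runs `worker` over `items` with at most `limit` tasks in flight,
// preserving input order in the results array.
export async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, run),
  );
  return results;
}
```

In the indexer's inner loop, this could replace the sequential `generateEmbedding` calls, e.g. `await mapWithLimit(chunks, 4, (c) => generateEmbedding(c.content))`.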

Next Steps

  • Add file watching for automatic re-indexing
  • Implement incremental updates (only index changed files)
  • Add support for more embedding models
  • Build a web UI for search
  • Add syntax highlighting in previews

Resources