
Vector Databases: A Complete Deep Dive

Vector databases have become fundamental infrastructure for modern AI applications. Whether you're building semantic search, RAG systems, recommendation engines, or similarity-based features, understanding how vector databases work at a deep level is crucial.

This is a comprehensive technical exploration, from the mathematical foundations to production deployment strategies.

TL;DR

Vector databases enable similarity search by representing data as high-dimensional numerical arrays (vectors). Unlike traditional databases that match exact values, they find semantically similar items, which is crucial for AI applications like semantic search, RAG systems, and recommendations.

Key concepts:

  • Embeddings convert text/images/audio into vectors using ML models (e.g., OpenAI's text-embedding-3)
  • Distance metrics (cosine similarity, Euclidean) measure how "close" vectors are in meaning
  • ANN algorithms (HNSW, IVF) enable fast search across millions of vectors without checking each one
  • Production considerations: chunk documents (256-512 tokens), cache embeddings, combine with metadata filtering, and rerank results

Use cases: Semantic search (find by meaning, not keywords), RAG (give LLMs relevant context), recommendations, image/audio similarity, code search.

Bottom line: Traditional databases find exact matches. Vector databases find similar items by understanding meaning. This unlocks entirely new application capabilities impossible with SQL alone.

Table of Contents

  1. The Problem Traditional Databases Can't Solve
  2. Understanding Vectors: The Mathematical Foundation
  3. Visualising Vector Space
  4. The Magic: How Embeddings Are Created
  5. Mathematical Foundations: Distance Metrics
  6. How Vector Databases Work: Internal Architecture
  7. Indexing Algorithms: Deep Dive
  8. Vector Database Operations
  9. Building Production RAG Systems
  10. Performance Optimisation
  11. Production Architecture Patterns
  12. Monitoring and Observability
  13. Common Pitfalls and Solutions
  14. Cost Optimisation
  15. The Future of Vector Databases
  16. Building the Future: Products and Applications

The Problem Traditional Databases Can't Solve

Traditional databases excel at exact matches and structured queries:

SELECT * FROM products WHERE id = 42;
SELECT * FROM users WHERE email = 'user@example.com';
SELECT * FROM orders WHERE status = 'pending' AND created_at > '2025-01-01';

These queries work because databases use indexes (B-trees, hash indexes) optimised for equality checks and range queries.

But what happens when you need to find similar items?

The Similarity Problem

Consider these real-world requirements:

  • "Find documents similar to this research paper"
  • "Show me images that look like this sketch"
  • "Recommend products based on this user's preferences"
  • "Find code snippets that solve similar problems"

None of these can be expressed as exact matches. You need semantic similarity: understanding meaning, not just matching strings.

Traditional approaches fail here:

-- This only finds exact text matches
SELECT * FROM documents WHERE content LIKE '%machine learning%';

-- This can't understand that "ML" and "machine learning" are the same concept
-- It can't find "neural networks" when you search for "deep learning"
-- It has no notion of semantic similarity

Why Full-Text Search Isn't Enough

Full-text search (Elasticsearch, Solr) improves on basic SQL with:

  • Tokenization: Breaking text into words
  • TF-IDF: Ranking by term frequency
  • BM25: Better ranking algorithm

But it's still fundamentally keyword-based:

// Full-text search example
// Query: "How do I learn Python programming?"

// Finds documents containing: "Python", "programming", "learn"
// MISSES documents about:
// - "Getting started with Python development"
// - "Python tutorial for beginners"
// - "Introduction to coding in Python"

// These are semantically identical but use different words!

This is where vector databases shine. They understand meaning, not just words.

Understanding Vectors: The Mathematical Foundation

What Is a Vector?

A vector is a mathematical object: a list of numbers representing a point in multi-dimensional space.

// A 2D vector (point on a plane)
const vector2D = [3, 4];

// A 3D vector (point in space)
const vector3D = [1, 2, 3];

// An embedding vector (typical in ML)
const embedding = [0.23, -0.45, 0.67, 0.12, -0.89, ...]; // 768 or 1536 dimensions

Vector Properties

Vectors have magnitude (length) and direction:

// Magnitude (length) of a vector
function magnitude(vector) {
  return Math.sqrt(
    vector.reduce((sum, val) => sum + val * val, 0)
  );
}

magnitude([3, 4]); // 5 (Pythagorean theorem: sqrt(3² + 4²))

The direction is what matters for similarity. Two vectors pointing in similar directions represent similar concepts.

High-Dimensional Space

Embedding vectors typically have hundreds or thousands of dimensions:

  • OpenAI text-embedding-3-small: 1536 dimensions
  • OpenAI text-embedding-3-large: 3072 dimensions
  • BERT base: 768 dimensions
  • Sentence transformers: 384-768 dimensions

Why so many dimensions? Each dimension captures a different aspect of meaning:

Dimension 0:   How related to "animals"?
Dimension 1:   How related to "pets"?
Dimension 2:   Formality level?
Dimension 3:   Sentiment (positive/negative)?
Dimension 4:   Time-related concepts?
...
Dimension 1535: [Some learned semantic feature]

The model learns these dimensions during training; we don't manually assign them.

Visualising Vector Space

Here's an interactive 3D visualisation showing how vectors representing different words are positioned in space. Similar concepts (like "cat", "dog", "kitten") cluster together naturally:

In this visualisation, each point represents a word's vector in 3D space. Words with similar meanings are positioned closer together. Try hovering over any point to see which other vectors are nearby; these are exactly the words a vector database would return as "similar" when you search!

Note: Real embeddings use 768-3072 dimensions. I've simplified them to 3D here so you can actually see what's happening; the underlying principles are identical, just harder to visualise when you're working with thousands of dimensions.

The Magic: How Embeddings Are Created

What Is an Embedding?

An embedding is the vector representation produced by converting data (text, images, audio) through an embedding model. The model learns to position similar items close together in vector space.

Text Embeddings

Modern embedding models use transformer architectures trained on massive datasets:

import { OpenAI } from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Generate an embedding
const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "What is a vector database?",
  encoding_format: "float" // Returns float32 array
});

const embedding = response.data[0].embedding;
console.log(embedding.length); // 1536
console.log(embedding.slice(0, 5)); // [0.0234, -0.0456, 0.0789, ...]

How Training Works

Embedding models are trained using contrastive learning:

  1. Positive pairs: Similar items should have similar vectors

    • "cat" and "kitten" → vectors should be close
    • "The cat sat on the mat" and "A feline rested on the rug" → similar vectors
  2. Negative pairs: Different items should have different vectors

    • "cat" and "database" → vectors should be far apart
    • "I love this movie" and "This movie was terrible" → different vectors

The model adjusts weights to minimise distance between positive pairs and maximise distance between negative pairs.

Training Objective: Cosine Similarity

# Simplified training loss (contrastive learning)
def contrastive_loss(anchor, positive, negative, margin=0.5):
    # anchor: "cat"
    # positive: "kitten" (similar)
    # negative: "car" (different)

    pos_similarity = cosine_similarity(anchor, positive)
    neg_similarity = cosine_similarity(anchor, negative)

    # Loss is low when:
    # - positive similarity is high (close to 1)
    # - negative similarity is low (close to 0)
    loss = max(0, neg_similarity - pos_similarity + margin)
    return loss

Domain-Specific Embeddings

Different models specialise in different domains:

// General purpose text
const general = await embed("machine learning", "text-embedding-3-small");

// Code-specific
const code = await embed("function sort(arr) {...}", "code-embedding-model");

// Multilingual
const multilingual = await embed("Hola mundo", "multilingual-e5-large");

// Always use the SAME model for indexing and querying!

Mathematical Foundations: Distance Metrics

How do we measure if two vectors are "similar"? We use distance metrics.

1. Cosine Similarity

Measures the angle between two vectors. Perfect for text embeddings.

function cosineSimilarity(a, b) {
  // Dot product: sum of element-wise multiplication
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);

  // Magnitude of each vector
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));

  // Cosine similarity: dot product / (magnitude_a * magnitude_b)
  return dotProduct / (magnitudeA * magnitudeB);
}

// Returns value between -1 and 1:
// 1.0  = vectors point in same direction (identical)
// 0.0  = vectors are perpendicular (unrelated)
// -1.0 = vectors point in opposite directions (opposites)

Why cosine for text?

Cosine similarity ignores magnitude, focusing only on direction. For text:

  • Direction = semantic meaning
  • Magnitude = less important (depends on document length, not meaning)

const vec1 = [1, 2, 3];
const vec2 = [2, 4, 6]; // Same direction, 2x magnitude

cosineSimilarity(vec1, vec2); // 1.0 (perfect similarity)
// Even though magnitudes differ, direction (meaning) is identical

2. Euclidean Distance (L2)

The straight-line distance between two points in space.

function euclideanDistance(a, b) {
  return Math.sqrt(
    a.reduce((sum, val, i) => sum + Math.pow(val - b[i], 2), 0)
  );
}

// Smaller distance = more similar
// 0 = identical vectors
// Large number = very different vectors

When to use Euclidean?

  • Image embeddings (magnitude matters)
  • When vectors are normalised (magnitude = 1)
  • When you care about absolute differences
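The two metrics are also closely linked: for unit-length (normalised) vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so ranking by either gives the same order once vectors are normalised. A quick sketch:

```javascript
// For unit-length vectors: d^2 = 2 * (1 - cosine).
// Ranking by Euclidean distance or cosine similarity is then equivalent.
function normalise(v) {
  const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map(x => x / mag);
}

function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
}

function cosineSimilarity(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const magB = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (magA * magB);
}

const a = normalise([1, 2, 3]);
const b = normalise([4, 5, 6]);

const dSquared = euclideanDistance(a, b) ** 2;
const expected = 2 * (1 - cosineSimilarity(a, b));
console.log(Math.abs(dSquared - expected) < 1e-12); // true
```

This is why "use Euclidean when vectors are normalised" is safe advice: on normalised vectors it agrees with cosine.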

3. Dot Product

Similar to cosine, but doesn't normalise by magnitude.

function dotProduct(a, b) {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

// Larger value = more similar (if vectors are normalised)

Optimisation trick: If you normalise all vectors to unit length (magnitude = 1), dot product and cosine similarity are equivalent:

function normalise(vector) {
  const mag = Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
  return vector.map(val => val / mag);
}

const a = normalise([1, 2, 3]);
const b = normalise([2, 4, 6]);

// Equal up to floating-point rounding:
Math.abs(dotProduct(a, b) - cosineSimilarity([1,2,3], [2,4,6])) < 1e-12; // true
// Dot product is MUCH faster to compute!

4. Manhattan Distance (L1)

Sum of absolute differences along each dimension.

function manhattanDistance(a, b) {
  return a.reduce((sum, val, i) => sum + Math.abs(val - b[i]), 0);
}

Less common in vector databases, but useful in specific scenarios.

Metric Comparison

const vectorA = [1, 2, 3];
const vectorB = [2, 4, 6];
const vectorC = [10, 0, 0];

console.log({
  cosine: {
    'A vs B': cosineSimilarity(vectorA, vectorB),  // 1.0 (same direction)
    'A vs C': cosineSimilarity(vectorA, vectorC),  // 0.27 (different)
  },
  euclidean: {
    'A vs B': euclideanDistance(vectorA, vectorB), // 3.74
    'A vs C': euclideanDistance(vectorA, vectorC), // 9.70
  },
  manhattan: {
    'A vs B': manhattanDistance(vectorA, vectorB), // 6
    'A vs C': manhattanDistance(vectorA, vectorC), // 14
  }
});

Now let's see these metrics in action with an interactive visualisation:

Try it yourself: Drag the vector endpoints around and watch how each metric responds differently. Notice how:

  • Cosine similarity only cares about the angle between vectors (direction), not their length
  • Euclidean distance gives you the shortest straight-line path between endpoints
  • Manhattan distance measures the path if you could only move along the grid axes

This is why we typically use cosine similarity for text embeddings: the semantic meaning is captured by the direction, and the magnitude (length) of the vector is less important.

How Vector Databases Work: Internal Architecture

Storage Layer

Vector databases store three types of data:

  1. Vector embeddings: The high-dimensional arrays
  2. Metadata: Original content, tags, timestamps, etc.
  3. Index structures: For fast approximate nearest neighbor search

// Conceptual data structure
interface VectorRecord {
  id: string;
  vector: Float32Array;  // The embedding [0.234, -0.456, ...]
  metadata: {
    text?: string;       // Original content
    title?: string;
    url?: string;
    timestamp?: number;
    tags?: string[];
    [key: string]: any;  // Arbitrary metadata
  };
}

Indexing: The Core Problem

Naive approach (exhaustive search):

function naiveSearch(query: Float32Array, database: VectorRecord[], topK: number) {
  // Calculate similarity with EVERY vector in database
  const results = database.map(record => ({
    id: record.id,
    score: cosineSimilarity(query, record.vector),
    metadata: record.metadata
  }));

  // Sort by similarity and return top K
  return results
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Problem: O(n * d) where n = number of vectors, d = dimensions
// For 1M vectors with 1536 dimensions: ~1.5 billion calculations!
// Way too slow for production.

Solution: Approximate Nearest Neighbor (ANN) algorithms

Instead of checking every vector, ANN algorithms use clever data structures to narrow down the search space.

Indexing Algorithms: Deep Dive

1. HNSW (Hierarchical Navigable Small World)

The most popular algorithm for vector search. Think of it as a skip list in multi-dimensional space.

How it works:

  1. Build a multi-layer graph where each node is a vector
  2. Higher layers have fewer nodes (sparse, long-distance connections)
  3. Lower layers have more nodes (dense, short-distance connections)
  4. To search: Start at the top layer, jump to nearby nodes, descend layers

Layer 2 (sparse):   A ----------> B ----------> C
                    |             |             |
Layer 1:            A --> D --> E-B --> F --> G-C
                    |     |     | |     |     | |
Layer 0 (dense):    A-D-E-B-F-G-C-H-I-J-K-L-M-N

Search algorithm:

function hnsw_search(query, topK, entryPoint, layers) {
  let currentNode = entryPoint;

  // Start from top layer, work down
  for (let layer = layers.length - 1; layer >= 0; layer--) {
    // Greedy search: always move to nearest neighbor
    while (true) {
      const neighbors = getNeighbors(currentNode, layer);
      const nearest = findNearest(query, neighbors);

      if (similarity(query, nearest) > similarity(query, currentNode)) {
        currentNode = nearest;
      } else {
        break; // Local optimum found
      }
    }
  }

  // At layer 0, do a local search for top K results
  return expandSearch(currentNode, query, topK);
}

Let's see this in action with an interactive visualisation:

Try it yourself: Drag the purple query point to different locations and click "Run Search" to watch how HNSW navigates through the layers. Notice how:

  • It starts at the top layer (sparse, red nodes) and makes big jumps
  • Descends to middle layer (orange nodes) for medium-grain navigation
  • Finally searches the dense bottom layer (green nodes) for the exact nearest neighbor
  • The search path (blue arrows) shows the greedy traversal at each layer

This hierarchical structure is why HNSW achieves logarithmic search time; it's like a skip list but in multi-dimensional space!

HNSW characteristics:

  • Query speed: Very fast (roughly logarithmic in collection size)
  • Memory usage: High (stores the full graph)
  • Accuracy: Excellent (typically >95% recall)
  • Index build time: Fast
  • Updates: Supports incremental updates

Parameters to tune:

const hnsw_params = {
  m: 16,              // Number of connections per node (higher = better accuracy, more memory)
  ef_construction: 200, // Size of dynamic candidate list during build (higher = better quality, slower build)
  ef_search: 50        // Size of dynamic candidate list during search (higher = better accuracy, slower search)
};

2. IVF (Inverted File Index)

Divides vector space into clusters (Voronoi cells) using k-means clustering.

How it works:

  1. Run k-means clustering to create n_list clusters
  2. Each cluster has a centroid (centre point)
  3. Assign each vector to its nearest cluster
  4. To search: Find nearest cluster centroids, search only those clusters

// Build IVF index
function buildIVF(vectors, n_lists = 100) {
  // 1. Run k-means to create clusters
  const centroids = kmeans(vectors, n_lists);

  // 2. Assign each vector to nearest cluster
  const clusters = Array(n_lists).fill(null).map(() => []);

  vectors.forEach((vector, id) => {
    const nearestCluster = findNearestCentroid(vector, centroids);
    clusters[nearestCluster].push({ id, vector });
  });

  return { centroids, clusters };
}

// Search IVF index
function ivfSearch(query, index, n_probe = 10, topK = 10) {
  // 1. Find n_probe nearest cluster centroids
  const nearestClusters = findTopK(
    index.centroids,
    query,
    n_probe
  );

  // 2. Search only those clusters
  let candidates = [];
  nearestClusters.forEach(clusterId => {
    const clusterVectors = index.clusters[clusterId];
    clusterVectors.forEach(item => {
      candidates.push({
        id: item.id,
        score: cosineSimilarity(query, item.vector)
      });
    });
  });

  // 3. Return top K
  return candidates
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

Try it yourself: Drag the query point (Q) around and adjust the n_probe slider to see the speed vs accuracy trade-off. Higher n_probe searches more clusters for better accuracy but takes longer. The coloured regions show which Voronoi cells (clusters) are being searched. Notice how the percentage of points searched changes with n_probe!

IVF characteristics:

  • Query speed: Fast (but slower than HNSW)
  • Memory usage: Lower than HNSW
  • Accuracy: Good (depends on n_probe)
  • Index build time: Slow (k-means clustering)
  • Updates: Difficult (need to rebalance clusters)

Parameters to tune:

const ivf_params = {
  n_lists: 100,    // Number of clusters (more = better accuracy, slower search)
  n_probe: 10      // Number of clusters to search (more = better accuracy, slower)
};

3. Product Quantization (PQ)

Compresses vectors to save memory, often used with IVF (IVF+PQ).

How it works:

  1. Split each vector into subvectors
  2. Cluster each subvector space independently
  3. Replace subvectors with their cluster IDs
  4. Store only cluster centroids and IDs

// Example: Compress 1536-dim vector to 96 bytes
// Original: 1536 dimensions * 4 bytes/float = 6,144 bytes
// Compressed: 96 subvectors * 1 byte/ID = 96 bytes
// Compression ratio: 64x!

function productQuantization(vectors, m = 96) {
  const d = vectors[0].length;  // e.g., 1536
  const d_sub = d / m;           // 16 dimensions per subvector

  // For each subspace
  const codebooks = [];
  for (let i = 0; i < m; i++) {
    // Extract all subvectors for this subspace
    const subvectors = vectors.map(v =>
      v.slice(i * d_sub, (i + 1) * d_sub)
    );

    // Cluster subvectors (typically 256 clusters for 1 byte)
    const centroids = kmeans(subvectors, 256);
    codebooks.push(centroids);
  }

  // Encode vectors as cluster IDs
  return { codebooks, /* encode function */ };
}

Trade-off: Massive memory savings, but lower accuracy due to quantization error.
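To make that trade-off concrete, here's a back-of-envelope memory estimator. It assumes 256 centroids per subquantizer (1-byte codes), as in the example above; pqMemoryEstimate is an illustrative helper, not a library function:

```javascript
// Rough memory estimate for n vectors: raw float32 storage vs PQ-encoded.
// Assumes m subquantizers with 256 centroids each (1 byte per code), plus
// the shared codebooks (m * 256 centroids of d/m float32 values).
function pqMemoryEstimate(n, d, m) {
  const rawBytes = n * d * 4;                  // float32 vectors
  const codeBytes = n * m;                     // 1 byte per subvector code
  const codebookBytes = m * 256 * (d / m) * 4; // shared centroid tables
  return {
    rawBytes,
    pqBytes: codeBytes + codebookBytes,
    ratio: rawBytes / (codeBytes + codebookBytes)
  };
}

// 1M vectors, 1536 dimensions, 96 subquantizers
const est = pqMemoryEstimate(1_000_000, 1536, 96);
console.log((est.rawBytes / 1e9).toFixed(2) + ' GB raw');    // 6.14 GB raw
console.log((est.pqBytes / 1e6).toFixed(1) + ' MB with PQ'); // 97.6 MB with PQ
```

Note the real ratio lands near 63x rather than exactly 64x once the codebooks themselves are counted.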

4. LSH (Locality-Sensitive Hashing)

Uses hash functions that map similar vectors to the same hash bucket.

How it works:

function lshHash(vector, randomPlanes) {
  // Each random plane defines a hash bit
  let hash = 0;
  randomPlanes.forEach((plane, i) => {
    // Dot product with random plane
    const dotProd = dotProduct(vector, plane);
    // If positive, set bit to 1
    if (dotProd > 0) {
      hash |= (1 << i);
    }
  });
  return hash;
}

// Search: hash query and check vectors in same bucket
function lshSearch(query, index, topK) {
  const queryHash = lshHash(query, index.planes);
  const candidates = index.buckets[queryHash] || [];

  return candidates
    .map(id => ({
      id,
      score: cosineSimilarity(query, index.vectors[id])
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

Less common in modern vector databases, but useful for specific high-dimensional scenarios.

Algorithm Comparison

| Algorithm | Speed | Accuracy | Memory | Build Time | Updates |
|-----------|-------|----------|--------|------------|---------|
| HNSW      | ⚡⚡⚡ | ⭐⭐⭐   | 💾💾💾 | Fast       | Good    |
| IVF       | ⚡⚡   | ⭐⭐     | 💾💾   | Slow       | Poor    |
| IVF+PQ    | ⚡⚡   | ⭐       | 💾     | Slow       | Poor    |
| LSH       | ⚡⚡   | ⭐       | 💾💾   | Fast       | Good    |

In practice:

  • Pinecone, Weaviate, Qdrant: Use HNSW by default
  • Faiss: Supports all algorithms (often uses IVF+PQ for large scale)
  • Milvus: Supports multiple indexes

Vector Database Operations

Inserting Vectors

// Single insert
await vectorDB.upsert({
  id: "doc-123",
  vector: await embed("The quick brown fox..."),
  metadata: {
    text: "The quick brown fox jumps over the lazy dog",
    title: "Example Document",
    category: "example",
    timestamp: Date.now()
  }
});

// Batch insert (much more efficient)
const documents = [...]; // Array of documents
const embeddings = await Promise.all(
  documents.map(doc => embed(doc.content))
);

await vectorDB.upsert(
  documents.map((doc, i) => ({
    id: doc.id,
    vector: embeddings[i],
    metadata: {
      text: doc.content,
      title: doc.title,
      url: doc.url
    }
  }))
);

Best practices:

  • Batch operations: 100-500 vectors per batch
  • Parallel embedding: Use concurrent API calls (respect rate limits)
  • Retry logic: Handle transient failures
  • Idempotency: Use upsert (not insert) to handle duplicates
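The batching and retry advice can be sketched together. Everything here is illustrative: upsertWithRetry is my own name, and vectorDB.upsert stands in for whatever client method your database exposes, assumed to throw on transient failures (rate limits, timeouts):

```javascript
// Batched upserts with exponential-backoff retries.
async function upsertWithRetry(vectorDB, records, {
  batchSize = 200,
  maxRetries = 3,
  baseDelayMs = 500
} = {}) {
  for (let i = 0; i < records.length; i += batchSize) {
    const batch = records.slice(i, i + batchSize);

    for (let attempt = 0; ; attempt++) {
      try {
        await vectorDB.upsert(batch);
        break; // this batch succeeded, move to the next one
      } catch (err) {
        if (attempt >= maxRetries) throw err; // give up after maxRetries
        // Exponential backoff: 500ms, 1s, 2s, ...
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
}
```

Because the operation is an upsert, re-running a failed batch is safe: already-written records are simply overwritten with identical data.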

Querying Vectors

// Basic similarity search
const results = await vectorDB.query({
  vector: await embed("machine learning tutorials"),
  topK: 10,
  includeMetadata: true,
  includeVector: false  // Usually don't need full vectors back
});

// Results structure
results.matches.forEach(match => {
  console.log({
    id: match.id,
    score: match.score,      // Similarity score
    metadata: match.metadata
  });
});

Metadata Filtering

Combine vector search with traditional filters:

// Find similar documents from 2024 in "tech" category
const results = await vectorDB.query({
  vector: queryEmbedding,
  topK: 10,
  filter: {
    $and: [
      { category: { $eq: "tech" } },
      { year: { $eq: 2024 } },
      { status: { $in: ["published", "featured"] } }
    ]
  }
});

Filter strategies:

  1. Pre-filtering: Apply metadata filter first, then vector search

    • Pros: Accurate filters
    • Cons: May have too few vectors to search
  2. Post-filtering: Vector search first, then filter results

    • Pros: Find enough results
    • Cons: May need to over-fetch and filter
  3. Hybrid: Some DBs optimise this automatically
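Post-filtering with over-fetch can be sketched in a few lines. postFilteredSearch and the search callback are illustrative names; overFetchFactor controls how many extra candidates you pull so enough survive the filter:

```javascript
// Over-fetch from the vector index, then apply the metadata predicate
// client-side and keep the best topK survivors.
async function postFilteredSearch(search, queryVector, predicate, topK, overFetchFactor = 4) {
  const candidates = await search(queryVector, topK * overFetchFactor);
  return candidates.filter(predicate).slice(0, topK);
}

// Usage with an in-memory stand-in for the vector DB query:
(async () => {
  const fakeSearch = async (_query, k) =>
    [...Array(k).keys()].map(i => ({
      id: i,
      category: i % 2 ? 'tech' : 'other' // odd ids are "tech"
    }));

  const top = await postFilteredSearch(fakeSearch, [], r => r.category === 'tech', 3);
  console.log(top.map(r => r.id).join(',')); // 1,3,5
})();
```

If the filter is very selective, even a large over-fetch may return too few results; that is exactly the case where pre-filtering (or a database with native hybrid filtering) wins.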

Updating Vectors

// Update vector and/or metadata
await vectorDB.update({
  id: "doc-123",
  vector: newEmbedding,     // Optional: update embedding
  metadata: {
    status: "updated",       // Update metadata
    lastModified: Date.now()
  }
});

Important: If you update the text, you must regenerate the embedding!

// Wrong: Update text without re-embedding
await vectorDB.update({
  id: "doc-123",
  metadata: { text: "New content" }
  // vector is now out of sync with text!
});

// Right: Re-embed when content changes
const newEmbedding = await embed("New content");
await vectorDB.update({
  id: "doc-123",
  vector: newEmbedding,
  metadata: { text: "New content" }
});

Deleting Vectors

// Delete single vector
await vectorDB.delete({ id: "doc-123" });

// Delete by filter
await vectorDB.delete({
  filter: {
    category: "spam",
    createdAt: { $lt: Date.now() - 30 * 24 * 60 * 60 * 1000 }
  }
});

Building Production RAG Systems

Let's build a complete production-ready RAG system.

Architecture Overview

User Query
    ↓
[Query Processing]
    ↓
[Embedding API] → Vector
    ↓
[Vector Database] → Search
    ↓
[Context Retrieval] → Top K documents
    ↓
[Prompt Construction] → "Context: ...\n\nQuestion: ..."
    ↓
[LLM API] → Answer
    ↓
[Post-processing]
    ↓
Response to User

Step 1: Document Ingestion Pipeline

import { OpenAI } from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI();
const pinecone = new Pinecone();
const index = pinecone.index('knowledge-base');

interface Document {
  id: string;
  content: string;
  title: string;
  url?: string;
  metadata?: Record<string, any>;
}

async function ingestDocuments(documents: Document[]) {
  // Step 1: Chunk documents (important for long texts)
  const chunks = documents.flatMap(doc =>
    chunkDocument(doc, { maxTokens: 512, overlap: 50 })
  );

  // Step 2: Generate embeddings in batches
  const BATCH_SIZE = 100;
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);

    // Parallel embedding requests
    const embeddings = await Promise.all(
      batch.map(chunk =>
        openai.embeddings.create({
          model: "text-embedding-3-small",
          input: chunk.content
        })
      )
    );

    // Step 3: Upsert to vector database
    await index.upsert(
      batch.map((chunk, idx) => ({
        id: chunk.id,
        values: embeddings[idx].data[0].embedding,
        metadata: {
          content: chunk.content,
          title: chunk.title,
          url: chunk.url,
          chunkIndex: chunk.chunkIndex,
          ...chunk.metadata
        }
      }))
    );

    console.log(`Processed ${Math.min(i + BATCH_SIZE, chunks.length)} / ${chunks.length}`);
  }
}

Step 2: Chunking Strategy

Why chunk? Embeddings work best on focused, coherent text passages.

interface ChunkOptions {
  maxTokens: number;
  overlap: number;
  method: 'sentence' | 'paragraph' | 'fixed';
}

function chunkDocument(doc: Document, options: ChunkOptions) {
  const { maxTokens, overlap, method } = options;

  if (method === 'sentence') {
    return chunkBySentence(doc, maxTokens, overlap);
  } else if (method === 'paragraph') {
    return chunkByParagraph(doc, maxTokens, overlap);
  } else {
    return chunkFixed(doc, maxTokens, overlap);
  }
}

function chunkBySentence(doc: Document, maxTokens: number, overlap: number) {
  // Split into sentences
  const sentences = doc.content.match(/[^.!?]+[.!?]+/g) || [doc.content];

  const chunks = [];
  let currentChunk = '';
  let currentTokens = 0;

  for (const sentence of sentences) {
    const sentenceTokens = estimateTokens(sentence);

    if (currentTokens + sentenceTokens > maxTokens && currentChunk) {
      // Create chunk
      chunks.push({
        id: `${doc.id}-chunk-${chunks.length}`,
        content: currentChunk.trim(),
        title: doc.title,
        url: doc.url,
        chunkIndex: chunks.length
      });

      // Start new chunk with overlap
      const overlapSentences = getLastNSentences(currentChunk, overlap);
      currentChunk = overlapSentences + sentence;
      currentTokens = estimateTokens(currentChunk);
    } else {
      currentChunk += sentence;
      currentTokens += sentenceTokens;
    }
  }

  // Add final chunk
  if (currentChunk) {
    chunks.push({
      id: `${doc.id}-chunk-${chunks.length}`,
      content: currentChunk.trim(),
      title: doc.title,
      url: doc.url,
      chunkIndex: chunks.length
    });
  }

  return chunks;
}

function estimateTokens(text: string): number {
  // Rough estimate: ~4 characters per token for English
  return Math.ceil(text.length / 4);
}
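The chunker above calls a getLastNSentences helper that isn't shown. A minimal version, interpreting overlap as a token budget pulled from the end of the previous chunk (using the same rough 4-characters-per-token estimate):

```javascript
// Collect sentences from the end of `text` until roughly `overlapTokens`
// worth of text is gathered, preserving their original order.
function getLastNSentences(text, overlapTokens) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const picked = [];
  let tokens = 0;

  for (let i = sentences.length - 1; i >= 0; i--) {
    const t = Math.ceil(sentences[i].length / 4); // same estimate as estimateTokens
    if (tokens + t > overlapTokens && picked.length > 0) break;
    picked.unshift(sentences[i]); // keep original sentence order
    tokens += t;
  }
  return picked.join('');
}
```

With overlap = 50 (as in the ingestion pipeline earlier), this carries roughly the last couple of sentences of each chunk into the next one.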

Chunking best practices:

  • Chunk size: 256-512 tokens for most use cases
  • Overlap: 10-20% overlap between chunks to preserve context
  • Boundaries: Respect semantic boundaries (sentences, paragraphs)
  • Metadata: Include chunk position, parent document info

Step 3: Query Pipeline

async function queryRAG(question: string, options: {
  topK?: number;
  filter?: Record<string, any>;
  temperature?: number;
}) {
  const { topK = 5, filter, temperature = 0.7 } = options;

  // Step 1: Embed the question
  const questionEmbedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question
  });

  // Step 2: Search vector database
  const searchResults = await index.query({
    vector: questionEmbedding.data[0].embedding,
    topK,
    filter,
    includeMetadata: true
  });

  // Step 3: Rerank results (optional but improves quality)
  const rerankedResults = await rerank(question, searchResults.matches);

  // Step 4: Build context from top results
  const context = rerankedResults
    .slice(0, 3)
    .map(match => match.metadata.content)
    .join('\n\n---\n\n');

  // Step 5: Generate answer with LLM
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain enough information, say so."
      },
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${question}`
      }
    ],
    temperature
  });

  return {
    answer: completion.choices[0].message.content,
    sources: rerankedResults.slice(0, 3).map(r => ({
      title: r.metadata.title,
      url: r.metadata.url,
      content: r.metadata.content.substring(0, 200) + '...'
    }))
  };
}

Step 4: Reranking (Critical for Quality)

Vector search returns semantically similar results, but they may not be the best for answering the specific question. Reranking improves relevance.

import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerank(query: string, results: any[]) {
  // Use Cohere's rerank API
  const reranked = await cohere.rerank({
    query,
    documents: results.map(r => r.metadata.content),
    topN: 10,
    model: 'rerank-english-v3.0'
  });

  // Reorder original results based on rerank scores
  return reranked.results
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .map(r => results[r.index]);
}

// Or implement simple BM25 reranking
function bm25Rerank(query: string, results: any[]) {
  const queryTerms = tokenize(query.toLowerCase());

  const scored = results.map(result => {
    const content = result.metadata.content.toLowerCase();
    const score = computeBM25(queryTerms, content);
    return { ...result, bm25Score: score };
  });

  return scored.sort((a, b) => b.bm25Score - a.bm25Score);
}
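The BM25 variant assumes tokenize and computeBM25 helpers. One caveat: full BM25 needs corpus-wide document frequencies for the IDF term, which usually aren't available at rerank time, so the sketch below scores only the term-frequency saturation part (k1, b, and avgLen are assumed defaults):

```javascript
// Lowercase word tokenizer.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// TF-saturation part of BM25 (no IDF: we only have the retrieved set, not
// the whole corpus). Higher = more query terms, with diminishing returns.
function computeBM25(queryTerms, content, { k1 = 1.2, b = 0.75, avgLen = 200 } = {}) {
  const terms = tokenize(content);
  const len = terms.length;
  const tf = new Map();
  for (const t of terms) tf.set(t, (tf.get(t) || 0) + 1);

  let score = 0;
  for (const q of queryTerms) {
    const f = tf.get(q) || 0;
    // Saturating term frequency with length normalisation
    score += (f * (k1 + 1)) / (f + k1 * (1 - b + b * (len / avgLen)));
  }
  return score;
}
```

This is enough to push keyword-exact matches above loosely related ones in a small retrieved set, which is all a reranker needs to do.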

Step 5: Hybrid Search

Combine vector search with keyword search for best results:

async function hybridSearch(query: string, topK: number = 10) {
  // Run both searches in parallel
  const [vectorResults, keywordResults] = await Promise.all([
    vectorSearch(query, topK * 2),
    keywordSearch(query, topK * 2)
  ]);

  // Reciprocal Rank Fusion (RRF)
  const scores = new Map<string, number>();
  const k = 60; // RRF constant

  vectorResults.forEach((result, index) => {
    const score = 1 / (k + index + 1);
    scores.set(result.id, (scores.get(result.id) || 0) + score);
  });

  keywordResults.forEach((result, index) => {
    const score = 1 / (k + index + 1);
    scores.set(result.id, (scores.get(result.id) || 0) + score);
  });

  // Sort by combined score
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id, score]) => ({ id, score }));
}
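A tiny standalone example makes the RRF behaviour concrete: an item ranked highly in both lists beats one that appears in only one. The rrfScores helper below is simply the fusion loop above, extracted for illustration.

```typescript
// Reciprocal Rank Fusion over ranked id lists (k = 60, as above)
function rrfScores(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    });
  }
  return scores;
}

const fused = rrfScores([
  ['a', 'b', 'c'],  // vector search ranking
  ['a', 'c', 'd'],  // keyword search ranking
]);
// 'a' (1st in both) scores ~0.0328, beating 'c' (3rd and 2nd) at ~0.0320,
// which in turn beats 'b' and 'd' (each present in only one list)
```

The large k constant is deliberate: it flattens the difference between adjacent ranks, so agreement between the two rankings matters more than position within either one.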

Performance Optimisation

1. Vector Normalisation

If using dot product for similarity, normalise vectors to unit length:

function normaliseVector(vector: number[]): number[] {
  const magnitude = Math.sqrt(
    vector.reduce((sum, val) => sum + val * val, 0)
  );
  if (magnitude === 0) return vector; // guard against the zero vector
  return vector.map(val => val / magnitude);
}

// Normalise during insertion
const embedding = await embed(text);
const normalised = normaliseVector(embedding);
await vectorDB.upsert({ id, vector: normalised, metadata });

// Now dot product = cosine similarity, but much faster!
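A quick numeric sanity check of that claim (normalise here repeats normaliseVector so the snippet stands alone):

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Same as normaliseVector above, repeated for self-containment
function normalise(v: number[]): number[] {
  const mag = Math.sqrt(dot(v, v));
  return v.map(x => x / mag);
}

const a = [3, 4];
const b = [1, 2];

// Cosine similarity on the raw vectors...
const cosine = dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
// ...equals a plain dot product on the unit vectors
const dotOfUnits = dot(normalise(a), normalise(b));
// Both come out to ~0.9839
```

The saving is real at scale: the dot product skips two square roots and a division per comparison, and normalisation is paid once at insert time rather than on every query.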

2. Caching Embeddings

Embeddings are expensive-cache them aggressively:

import Redis from 'ioredis';
import { createHash } from 'crypto';

const redis = new Redis();

// Stable, bounded-length cache key for arbitrary input text
function hash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

async function embedWithCache(text: string): Promise<number[]> {
  // Check cache
  const cached = await redis.get(`embedding:${hash(text)}`);
  if (cached) {
    return JSON.parse(cached);
  }

  // Generate embedding
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });

  const embedding = response.data[0].embedding;

  // Cache for 30 days
  await redis.setex(
    `embedding:${hash(text)}`,
    30 * 24 * 60 * 60,
    JSON.stringify(embedding)
  );

  return embedding;
}

3. Batch Operations

Always batch when possible:

// Bad: Sequential inserts
for (const doc of documents) {
  await vectorDB.upsert({ /* single doc */ });
}

// Good: Batch inserts
const BATCH_SIZE = 500;
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
  const batch = documents.slice(i, i + BATCH_SIZE);
  await vectorDB.upsert(batch);
}

4. Dimensionality Reduction

Reduce embedding dimensions to save memory and improve speed:

// Use smaller embedding model
const embedding = await openai.embeddings.create({
  model: "text-embedding-3-small",  // 1536 dims
  input: text,
  dimensions: 512  // Reduce to 512 dimensions
});

// Or apply PCA after generation (t-SNE is for visualisation, not for search indexes)
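The dimensions parameter works because the text-embedding-3 models were trained so that a truncated prefix of the vector is itself a usable (lower-fidelity) embedding, per OpenAI's Matryoshka-style training. The manual equivalent is truncate-then-renormalise:

```typescript
// Shorten an embedding by keeping the first N dimensions, then
// renormalising to unit length so cosine/dot-product scores stay sensible
function truncateEmbedding(vector: number[], dims: number): number[] {
  const truncated = vector.slice(0, dims);
  const magnitude = Math.sqrt(truncated.reduce((s, v) => s + v * v, 0));
  return truncated.map(v => v / magnitude);
}

// truncateEmbedding([3, 4, 0, 0, 1], 2) → [0.6, 0.8]
```

Note this only works well for models trained for it; truncating an arbitrary embedding model's output usually degrades quality much more sharply.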

5. Approximate Search Tuning

Trade accuracy for speed by tuning search parameters:

// HNSW tuning
await vectorDB.query({
  vector: queryVec,
  topK: 10,
  params: {
    ef: 50  // Lower ef = faster but less accurate (default: 100)
  }
});

// IVF tuning
await vectorDB.query({
  vector: queryVec,
  topK: 10,
  params: {
    nprobe: 5  // Search fewer clusters (default: 10)
  }
});
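To pick these values empirically rather than by guesswork, measure recall@k on a sample of queries by comparing ANN results against exact brute-force search. A sketch, where both arguments are ranked id lists:

```typescript
// Fraction of the true top-k (from exact search) that the ANN search found
function recallAtK(annIds: string[], exactIds: string[], k: number): number {
  const truth = new Set(exactIds.slice(0, k));
  const hits = annIds.slice(0, k).filter(id => truth.has(id)).length;
  return hits / k;
}

// recallAtK(['a', 'b', 'c'], ['a', 'c', 'd'], 3) → 2/3
```

Sweep ef (or nprobe) upward until recall@10 plateaus on your data; past that point you are paying latency for nothing.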

Production Architecture Patterns

Pattern 1: Simple RAG Stack

┌─────────────┐
│   User App  │
└──────┬──────┘
       │
       ↓
┌─────────────────────────────┐
│   API Gateway / Backend     │
│  ┌───────────────────────┐  │
│  │  Query Handler        │  │
│  │  - Embed query        │  │
│  │  - Search vectors     │  │
│  │  - Call LLM           │  │
│  └───────────────────────┘  │
└──┬────────────────────┬─────┘
   │                    │
   ↓                    ↓
┌──────────────┐  ┌─────────────┐
│ Vector DB    │  │  OpenAI API │
│ (Pinecone)   │  │             │
└──────────────┘  └─────────────┘

Pattern 2: High-Throughput Architecture

┌─────────────┐
│  Load       │
│  Balancer   │
└──────┬──────┘
       │
   ┌───┴───┬───────┬───────┐
   ↓       ↓       ↓       ↓
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│API-1│ │API-2│ │API-3│ │API-N│
└──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
   │       │       │       │
   └───────┴───┬───┴───────┘
               ↓
       ┌───────────────┐
       │  Redis Cache  │
       │  (Embeddings) │
       └───────┬───────┘
               │
   ┌───────────┼───────────┐
   ↓           ↓           ↓
┌─────────┐ ┌──────┐ ┌────────┐
│Vector DB│ │ LLM  │ │ Metrics│
│(Qdrant) │ │ API  │ │ (Prom) │
└─────────┘ └──────┘ └────────┘

Pattern 3: Multi-Tenant SaaS

┌──────────────────────────────┐
│       Application Layer      │
│  ┌────────────────────────┐  │
│  │  Tenant Isolation      │  │
│  │  - API key validation  │  │
│  │  - Rate limiting       │  │
│  │  - Usage tracking      │  │
│  └────────────────────────┘  │
└───────────┬──────────────────┘
            │
    ┌───────┴───────┐
    ↓               ↓
┌───────────┐  ┌────────────────┐
│ Vector DB │  │ Namespace per  │
│           │  │ tenant         │
│ tenant-1  │  │ (Pinecone) or  │
│ tenant-2  │  │ Collection per │
│ tenant-N  │  │ tenant (Qdrant)│
└───────────┘  └────────────────┘

Pattern 4: Real-Time Ingestion

┌─────────────┐
│  Data       │
│  Sources    │
└──────┬──────┘
       │
       ↓
┌─────────────────┐
│  Message Queue  │
│  (Kafka/SQS)    │
└────────┬────────┘
         │
    ┌────┴────┐
    ↓         ↓
┌────────┐ ┌────────┐
│Worker-1│ │Worker-N│
│- Embed │ │- Embed │
│- Index │ │- Index │
└───┬────┘ └───┬────┘
    │          │
    └────┬─────┘
         ↓
   ┌─────────────┐
   │  Vector DB  │
   │  (Write)    │
   └─────────────┘
         ↑
         │
   ┌─────┴──────┐
   │  Vector DB │
   │  (Read     │
   │  Replica)  │
   └────────────┘
         ↑
         │
   ┌─────┴──────┐
   │   Query    │
   │   Service  │
   └────────────┘

Monitoring and Observability

Key metrics to track:

// Illustrative instrumentation: histogram/gauge/counter would come from a
// metrics client such as prom-client
const metrics = {
  // Latency metrics
  embedding_latency_ms: histogram(),
  vector_search_latency_ms: histogram(),
  llm_latency_ms: histogram(),
  end_to_end_latency_ms: histogram(),

  // Quality metrics
  search_accuracy: gauge(),
  average_similarity_score: gauge(),
  results_returned: histogram(),

  // Cost metrics
  embedding_tokens_used: counter(),
  llm_tokens_used: counter(),
  vector_db_reads: counter(),
  vector_db_writes: counter(),

  // Error metrics
  embedding_errors: counter(),
  vector_db_errors: counter(),
  llm_errors: counter(),
};

// Instrumentation example
async function monitoredEmbed(text: string) {
  const start = Date.now();
  try {
    const result = await embed(text);
    metrics.embedding_latency_ms.observe(Date.now() - start);
    metrics.embedding_tokens_used.inc(estimateTokens(text));
    return result;
  } catch (error) {
    metrics.embedding_errors.inc();
    throw error;
  }
}

Common Pitfalls and Solutions

Pitfall 1: Model Mismatch

// WRONG: Different models for indexing vs querying
const docEmbedding = await modelA.embed(document);
await vectorDB.upsert({ vector: docEmbedding });

const queryEmbedding = await modelB.embed(query);
await vectorDB.query({ vector: queryEmbedding }); // Won't find anything!

// RIGHT: Same model everywhere
const MODEL = "text-embedding-3-small";
const docEmbedding = await embed(document, MODEL);
const queryEmbedding = await embed(query, MODEL);

Pitfall 2: Not Chunking Long Documents

// WRONG: Embed 10,000 word document as one vector
const embedding = await embed(longDocument);
// Result: Poor quality, loses nuance

// RIGHT: Chunk into meaningful pieces
const chunks = chunkDocument(longDocument, {
  maxTokens: 512,
  overlap: 50,
  method: 'sentence'
});
const embeddings = await Promise.all(chunks.map(embed));
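The chunkDocument call above assumes a helper; one minimal shape for it splits on sentence boundaries and carries a few sentences forward as overlap. (A real implementation would count tokens with a proper tokeniser such as tiktoken; here tokens are approximated as words / 0.75, and overlap is expressed in sentences rather than tokens.)

```typescript
interface ChunkOptions {
  maxTokens: number;
  overlapSentences: number; // sentences carried into the next chunk
}

function chunkDocument(text: string, opts: ChunkOptions): string[] {
  // Naive sentence split on terminal punctuation
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  const approxTokens = (s: string) => Math.ceil(s.split(/\s+/).length / 0.75);

  const chunks: string[] = [];
  let current: string[] = [];
  let tokens = 0;

  for (const sentence of sentences) {
    const t = approxTokens(sentence);
    if (tokens + t > opts.maxTokens && current.length > 0) {
      chunks.push(current.join(' '));
      // Keep the trailing sentences so context carries across the boundary
      current = current.slice(-opts.overlapSentences);
      tokens = current.reduce((sum, s) => sum + approxTokens(s), 0);
    }
    current.push(sentence);
    tokens += t;
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}
```

Because overlap is added after the size check, chunks can slightly exceed maxTokens; leave headroom below the embedding model's input limit.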

Pitfall 3: Ignoring Metadata Filtering

// SLOW: Return 1000 results and filter in app
const results = await vectorDB.query({ vector, topK: 1000 });
const filtered = results.filter(r => r.metadata.year === 2024);

// FAST: Filter in database
const results = await vectorDB.query({
  vector,
  topK: 10,
  filter: { year: { $eq: 2024 } }
});

Pitfall 4: Not Handling Embedding Failures

// Add retry logic with exponential backoff
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

async function embedWithRetry(text: string, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await embed(text);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000);
    }
  }
}

Pitfall 5: Cold Start Performance

// Warm up index after deployment
async function warmUpIndex() {
  const sampleQueries = [
    "machine learning",
    "data science",
    "python programming"
  ];

  await Promise.all(
    sampleQueries.map(async (q) =>
      vectorDB.query({
        vector: await embed(q),
        topK: 1
      })
    )
  );
}

Cost Optimisation

Embedding Costs

// OpenAI text-embedding-3-small pricing (as of 2024)
// $0.02 per 1M tokens

// Example calculation:
// 1M documents * 500 tokens each = 500M tokens
// 500M tokens * $0.02 / 1M = $10

// Optimisation: cache embeddings and repeated query responses!
const costPerQuery = {
  embedding: 0.02 / 1_000_000 * 50,  // 50 query tokens (~$0.000001)
  vectorDB: 0.0001,                   // Per search
  llm: 30 / 1_000_000 * 1000,         // 1k tokens at ~$30/1M (GPT-4-class)
  total: 0.0301                       // ~$0.03 per query, dominated by the LLM
};

// For 1M queries/month: ~$30,100
// With response caching (50% hit rate): ~$15,050

Storage Costs

// Vector storage calculation
const storageRequired = {
  numVectors: 1_000_000,
  dimensions: 1536,
  bytesPerDimension: 4,  // float32
  totalGB: 1_000_000 * 1536 * 4 / (1024 ** 3), // ~5.7 GB

  // Add overhead for index structures (HNSW ~2-3x)
  withIndex: 5.7 * 2.5, // ~14 GB

  // Monthly cost (varies by provider)
  monthlyUSD: 14 * 0.25  // ~$3.50/month at $0.25/GB
};
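One common lever on top of this: scalar quantisation. Storing int8 codes instead of float32 cuts raw vector storage 4x (roughly 5.7 GB → 1.4 GB before index overhead) at a small recall cost. Some vector databases (e.g. Qdrant) offer this natively; a per-vector sketch of the idea:

```typescript
// Symmetric scalar quantisation: map each float32 value to an int8 code
function quantiseInt8(vector: number[]): { codes: Int8Array; scale: number } {
  const maxAbs = Math.max(...vector.map(Math.abs)) || 1;
  const scale = maxAbs / 127;
  const codes = Int8Array.from(vector.map(v => Math.round(v / scale)));
  return { codes, scale };
}

function dequantise(codes: Int8Array, scale: number): number[] {
  return Array.from(codes, c => c * scale);
}
```

In practice the database scores directly on the int8 codes and often rescores the top candidates against the original float32 vectors to recover accuracy.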

The Future of Vector Databases

Emerging Trends

1. Multimodal Embeddings

// Single vector for text + image
const multimodalEmbedding = await embed({
  text: "A cat sitting on a laptop",
  image: imageBuffer,
  model: "multimodal-embedding-v1"
});

// Search across modalities
const results = await vectorDB.query({
  vector: multimodalEmbedding,
  topK: 10
});
// Returns similar images, text, or both!

2. Sparse + Dense Hybrid Vectors

// Dense: Semantic meaning (from neural network)
const dense = await embed(text); // [0.23, -0.45, ...]

// Sparse: Keyword importance (like TF-IDF)
const sparse = extractKeywords(text); // {machine: 0.8, learning: 0.6}

// Combine both for best of both worlds
await vectorDB.upsert({
  id,
  denseVector: dense,
  sparseVector: sparse,
  metadata
});

3. On-Device Vector Search

// Run vector search on mobile/edge (hypothetical on-device library)
import { VectorDB } from '@edge-vector/core';

const db = new VectorDB({ maxVectors: 10000 });
await db.loadIndex(); // ~5MB index

const results = db.search(queryVector, { topK: 5 });
// No API calls, instant results!

4. Serverless Vector Databases

// Pay only for actual usage
const db = new ServerlessVectorDB({
  provider: 'aws',
  scalingMode: 'on-demand',
  coldStartOptimisation: true
});

// Scales to zero when not in use
// Scales up automatically under load

Building the Future: Products and Applications

Understanding vector databases opens up entirely new categories of products and capabilities. Here's where this technology is heading:

Emerging Product Categories

1. Intelligent Knowledge Bases

  • Companies building internal "ChatGPT for your docs" using RAG
  • Legal firms creating case law search that understands intent, not just keywords
  • Medical databases that find similar diagnoses across millions of patient records

2. Multimodal Search Platforms

  • Search engines that find images using text descriptions
  • E-commerce platforms where you upload a photo to find similar products
  • Music discovery apps that find songs with similar "vibes" across audio, lyrics, and metadata

3. Personalised Recommendation Engines

  • Content platforms that understand nuanced user preferences beyond simple tags
  • B2B tools that suggest relevant documents/experts based on current context
  • Educational platforms that adapt learning paths based on understanding gaps

4. Code Intelligence Tools

  • Search entire codebases for similar logic patterns, not exact matches
  • Find security vulnerabilities by semantic similarity to known exploits
  • Suggest relevant code snippets based on natural language intent

5. Real-Time Context Engines

  • Customer service bots that retrieve relevant knowledge in milliseconds
  • Sales tools that surface similar past deals during live conversations
  • Medical diagnosis assistants that find relevant research papers in real-time

Real-World Examples

Here are actual companies and products using vector databases in production:

Notion AI uses Pinecone to power semantic search across millions of workspace documents. When you search in Notion, it's not just matching keywords-it's understanding the meaning of your query and finding conceptually related content across your entire workspace.

GitHub Copilot leverages vector embeddings to find relevant code snippets from billions of lines of public code. When you start typing, it searches for semantically similar code patterns (not just exact matches) to suggest completions that match your intent.

Shopify powers product discovery and recommendations using vector search. Merchants can enable semantic search where customers describe what they're looking for naturally ("waterproof hiking boots for winter") instead of exact keyword matching.

ChatGPT (OpenAI) relies on vector databases for RAG in custom GPTs and the Assistants API. When you upload documents to a custom GPT, they're chunked, embedded, and stored in a vector database so the model can retrieve relevant context during conversations.

Perplexity AI built their entire search engine around vector databases combined with LLMs. Every query generates embeddings that search across indexed web content, then LLMs synthesise answers from the most relevant sources.

Stripe uses vector search to power their documentation site. Instead of basic keyword search, developers can ask questions naturally and get relevant API references, guides, and code examples.

Retool integrates vector search into their AI-powered app builder. When you describe what you want to build, it finds relevant UI components, workflows, and templates from their library using semantic similarity.

Zapier uses vector embeddings to help users discover automation templates. Instead of browsing categories, you can describe your workflow need and it finds relevant "Zaps" even if they use different terminology.

Common Pattern: Most production implementations combine vector search with traditional filters (metadata, dates, categories) and often add a reranking step using cross-encoders for maximum relevance.

How Applications Are Evolving

From Keyword to Intent

Applications are moving from "what you typed" to "what you meant":

  • Search becoming conversational and context-aware
  • Interfaces understanding natural requests instead of requiring specific syntax
  • Systems that learn from interaction patterns, not just explicit ratings

From Static to Dynamic

Vector embeddings enable applications that adapt in real-time:

  • Live product recommendations based on current browsing session
  • Dynamic documentation that surfaces relevant content as you work
  • Adaptive learning systems that adjust to your current understanding level

From Isolated to Connected

Vector similarity bridges disparate data sources:

  • Unified search across emails, documents, Slack, Jira, and codebases
  • Cross-domain knowledge transfer (medical research → biotech applications)
  • Automatic linking of related concepts across your entire knowledge graph

From Reactive to Proactive

Understanding semantic relationships enables anticipatory features:

  • Suggesting relevant documents before you search for them
  • Warning about potential issues based on similarity to past problems
  • Recommending connections to colleagues working on related topics

The Next Wave of Innovation

Hybrid Intelligence Systems

Combining vector search with traditional databases, graph databases, and LLMs:

  • Start with semantic search (vector DB)
  • Filter by structured criteria (SQL)
  • Traverse relationships (graph DB)
  • Generate insights (LLM)

Privacy-Preserving Search

Running vector search on encrypted data or on-device:

  • Secure medical record search without exposing patient data
  • Personal AI assistants that never send your data to the cloud
  • Compliant enterprise search in regulated industries

Real-Time Personalisation at Scale

Maintaining millions of user-specific vector spaces:

  • Every user gets a personalised embedding space tuned to their interests
  • Applications that understand your unique context and vocabulary
  • Recommendations that account for temporal preferences (morning vs. evening, weekday vs. weekend)

What You Can Build Today

The barrier to entry is lower than ever:

Weekend Projects

  • Personal knowledge base: RAG over your notes, bookmarks, and documents
  • Smart recipe finder: Upload food photos, find similar recipes
  • Code snippet manager: Natural language search over your code snippets

Startup-Scale Products

  • Industry-specific search (legal, medical, financial)
  • Niche recommendation engines (books, courses, tools)
  • Context-aware chatbots for specific domains

Enterprise Solutions

  • Internal knowledge management platforms
  • Customer support automation with high accuracy
  • Market intelligence tools that find similar companies/trends

The key insight: Vector databases don't just improve existing applications-they enable entirely new product categories that were impossible before. The constraint is no longer "can we find similar items?" but "what should we build now that we can?"

Wrapping Up

Vector databases are more than just a trend-they're foundational infrastructure for AI applications. Understanding how they work at a deep level lets you:

  • Build better systems: Make informed architecture decisions
  • Optimise costs: Know where to cache, batch, and tune
  • Debug issues: Understand why results aren't good and how to fix them
  • Scale confidently: Know the trade-offs between different algorithms and approaches

Key Takeaways

  1. Vectors encode meaning as points in high-dimensional space
  2. Distance metrics (cosine, euclidean) measure similarity
  3. ANN algorithms (HNSW, IVF) make search fast at scale
  4. Chunking strategy dramatically affects quality
  5. Metadata filtering combines semantic and structured search
  6. Caching and batching are critical for production
  7. Monitoring search quality is as important as latency

Next Steps

The most effective way to develop a working understanding is through practical implementation. I'd recommend starting with a modest RAG system using around 1000 documents-ideally your own notes or domain-specific documentation. This will surface the key architectural decisions: appropriate chunking strategies, embedding model selection, and relevance tuning.

Focus your measurements on what drives system quality: embedding costs, search relevance, and query latency. Iterate systematically through different chunking sizes, reranking approaches, and search parameters. A single well-instrumented implementation will teach you more than theoretical study alone.

For infrastructure, pgvector offers a sensible starting point with minimal operational overhead. It integrates directly into PostgreSQL, which means you're working within a familiar environment. As your requirements scale-both in terms of traffic and performance expectations-you can evaluate managed services like Pinecone or Qdrant. Early optimisation tends to solve problems you don't yet have.


Questions or want to discuss vector database architecture? Feel free to reach out on X (@rcrebo) to dive deeper into any of these topics.