While the bug fixes were underway, I tackled two core challenges in building an intelligent reading companion: giving an AI a "long-term memory" without blowing the token budget, and making RAG truly smart by combining semantic understanding with precise keyword matching.
The Challenge of Context: Engineering Hierarchical Memory
One of the biggest hurdles when building LLM-powered applications is managing context. Users expect conversations to flow naturally, remembering past interactions. However, passing an entire chat history to the LLM quickly hits token limits, leading to truncated responses or huge costs. On the flip side, only providing the last few messages results in a frustratingly forgetful AI.
To solve this, I implemented a 3-tier hierarchical memory compression system, inspired by how humans process and recall information. My goal was to mimic our ability to hold recent details, recall broader themes, and summarize long-term knowledge.
Here's how it breaks down:
- Tier 1: Recent Memory - A lightly compressed version of the most recent messages, providing immediate conversational context.
- Tier 2: Broader Themes - A more compressed summary of tier 1 plus the conversation's recent past, capturing key points and turns.
- Tier 3: Long-term Summary - A comprehensive, highly compressed summary of the entire session, ensuring the AI always understands the overarching themes and user goals.
By sending the last 6 raw messages along with the tier-1, tier-2, and tier-3 summaries, I can provide rich context without exceeding token limits. For the user interface, I decided to show only the last 5 messages in the chat UI. This keeps the display clean and focused, while the underlying AI still has access to the deeper memory layers.
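In code, assembling that layered context might look roughly like this. This is a minimal sketch with illustrative names (`build_context`, the `[Session summary]` labels, and the message format are my assumptions, not Engrain's actual implementation):

```python
# Sketch of 3-tier context assembly. All names here are illustrative,
# not Engrain's real code.
def build_context(messages, tier1_summary, tier2_summary, tier3_summary,
                  recent_count=6):
    """Combine the three summary tiers with the raw tail of the chat."""
    recent = messages[-recent_count:]  # last 6 raw messages, verbatim
    parts = [
        f"[Session summary] {tier3_summary}",  # tier 3: whole-session overview
        f"[Recent themes] {tier2_summary}",    # tier 2: compressed recent past
        f"[Recent detail] {tier1_summary}",    # tier 1: lightly compressed latest turns
    ]
    parts += [f"{m['role']}: {m['content']}" for m in recent]
    return "\n".join(parts)
```

The key property is that token cost stays roughly constant as the conversation grows: only the six raw messages scale with recency, while everything older is absorbed into fixed-size summaries.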
From Gemini to Cloud Vision OCR
Engrain allows users to upload photos of physical book highlights, which then need to be converted to text. Initially, I experimented with Gemini 3 Flash Preview for this task. While Gemini performed admirably, its broad capabilities as a multimodal LLM were overkill for a simple OCR task.
I switched to the Google Cloud Vision API's dedicated OCR model, available via Vertex AI. This was a straightforward optimization:
- Speed: Cloud Vision API is purpose-built for OCR, making it significantly faster for text extraction.
- Cost-Efficiency: For a small-scale application like Engrain, a dedicated OCR service is far more cost-effective than using a general-purpose LLM.
This move allows me to maintain high accuracy in highlight ingestion while optimizing for both performance and operational costs, a crucial consideration for any growing project.
Hybrid RAG with Keyword and Semantic Search
Effective Retrieval Augmented Generation (RAG) is the backbone of Engrain's ability to connect users with their knowledge. My initial RAG implementation relied solely on dense (semantic) matching, using embeddings to find conceptually similar highlights. While powerful for understanding intent, pure semantic search can sometimes miss exact keyword matches, especially for very specific queries.
To improve the robustness and precision of Engrain's RAG, I implemented a hybrid approach, combining semantic search with traditional keyword matching (Full-Text Search, or FTS, in PostgreSQL). The core idea is to leverage the strengths of both:
- Semantic Search (Dense): Excellent for conceptual understanding, synonyms, and broader topics.
- Keyword Search (Sparse): Unbeatable for exact phrase matching and specific terms.
I decided on a weighting scheme: 0.7 for semantic similarity and 0.3 for keyword matching. This prioritizes conceptual relevance while ensuring that direct keyword hits are not overlooked. The combined score is then used to rank the most relevant highlight chunks.
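In Python terms, the blended score looks roughly like this (a minimal sketch of the formula; the real computation happens inside the SQL function below):

```python
SEMANTIC_WEIGHT = 0.7
KEYWORD_WEIGHT = 0.3

def combined_score(semantic_sim: float, ts_rank: float) -> float:
    """Blend cosine similarity (already in [0, 1]) with a squashed keyword rank.

    ts_rank is unbounded, so x / (1 + x) maps it into [0, 1) before weighting,
    keeping a runaway keyword score from drowning out semantic relevance.
    """
    keyword_norm = ts_rank / (1 + ts_rank)
    return SEMANTIC_WEIGHT * semantic_sim + KEYWORD_WEIGHT * keyword_norm
```

Note the asymmetry this creates: a perfect keyword match can contribute at most 0.3, so a chunk can never rank highly on keywords alone.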
Here's the SQL query that brings this hybrid approach to life:
BEGIN
  RETURN QUERY
  WITH scored_chunks AS (
    SELECT
      hc.highlight_id AS _highlight_id,
      hc.chunk_text AS _chunk_text,
      h.text AS _highlight_text,
      h.book_id AS _book_id,
      h.chapter_id AS _chapter_id,
      (1 - (hc.embedding <=> query_embedding))::FLOAT AS _sim,
      ts_rank(hc.fts, websearch_to_tsquery('english', query_text))::FLOAT AS _kw_score,
      (
        ((1 - (hc.embedding <=> query_embedding)) * 0.7) +
        (
          (ts_rank(hc.fts, websearch_to_tsquery('english', query_text)) /
           (1 + ts_rank(hc.fts, websearch_to_tsquery('english', query_text))))
          * 0.3
        )
      )::FLOAT AS _comb_score
    FROM highlight_chunks hc
    JOIN highlights h ON hc.highlight_id = h.id
    WHERE
      h.user_id = p_user_id
      AND (p_book_id IS NULL OR h.book_id = p_book_id)
      AND (p_chapter_id IS NULL OR h.chapter_id = p_chapter_id)
      -- Hard semantic gate: anything below 0.20 is noise regardless of keyword match
      AND (1 - (hc.embedding <=> query_embedding)) > 0.20
      AND (
        (1 - (hc.embedding <=> query_embedding)) > match_threshold
        OR ts_rank(hc.fts, websearch_to_tsquery('english', query_text)) > 0
      )
  ),
  best_per_parent AS (
    SELECT DISTINCT ON (_highlight_id)
      _highlight_id,
      _chunk_text,
      _highlight_text,
      _book_id,
      _chapter_id,
      _sim,
      _kw_score,
      _comb_score
    FROM scored_chunks
    ORDER BY _highlight_id, _comb_score DESC
  )
  SELECT
    b._highlight_id AS hc_highlight_id,
    b._chunk_text AS hc_chunk_text,
    b._highlight_text AS highlight_text,
    COALESCE(bk.book_name, 'Unknown')::text AS book_name,
    COALESCE(bk.author, 'Unknown')::text AS author,
    COALESCE(ch.chapter_name, 'N/A')::text AS chapter_name,
    COALESCE(ch.chapter_number, 0)::int AS chapter_number,
    b._sim AS sim,
    b._kw_score AS kw_score,
    b._comb_score AS comb_score
  FROM best_per_parent b
  LEFT JOIN books bk ON bk.id = b._book_id
  LEFT JOIN chapters ch ON ch.id = b._chapter_id
  -- Combined score floor: filters weak keyword-only matches after ranking
  WHERE b._comb_score > min_comb_score
  ORDER BY b._comb_score DESC
  LIMIT match_count;
END;
A few key aspects of this query:
- The _comb_score uses a saturation-style normalization (similar in spirit to Reciprocal Rank Fusion, which also uses reciprocal denominators to tame unbounded scores). By placing (1 + ts_rank) in the denominator, the formula "squashes" the raw keyword score into a range between 0 and 1. This prevents a high-frequency keyword match from overwhelming the semantic vector score, ensuring that literal matches provide a helpful boost without drowning out conceptual relevance.
- I've included a "Hard semantic gate" (`(1 - (hc.embedding <=> query_embedding)) > 0.20`) to filter out purely noisy semantic matches, regardless of keyword score. This prevents irrelevant chunks from entering the scoring.
- Finally, a `min_comb_score` floor filters out weak matches, ensuring only truly relevant information is passed to the LLM.
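Conceptually, those three gates can be mirrored in a few lines of Python. This is an illustrative sketch only (the chunk dictionaries and `filter_and_rank` name are hypothetical; the actual filtering happens in the SQL above):

```python
HARD_SEMANTIC_GATE = 0.20  # mirrors the query's hard semantic gate

def filter_and_rank(chunks, match_threshold, min_comb_score, match_count):
    """Mirror the query's gates: hard semantic floor, then semantic-OR-keyword
    admission, then a combined-score floor, ranked by combined score."""
    survivors = [
        c for c in chunks
        if c["sim"] > HARD_SEMANTIC_GATE                  # noise gate
        and (c["sim"] > match_threshold or c["kw"] > 0)   # either signal admits
        and c["comb"] > min_comb_score                    # final relevance floor
    ]
    survivors.sort(key=lambda c: c["comb"], reverse=True)
    return survivors[:match_count]
```

Reading it this way makes the division of labor clear: the semantic gate rejects noise outright, the OR clause lets a keyword hit rescue a chunk that falls below the semantic threshold, and the combined floor then discards rescued chunks that are still too weak overall.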
This hybrid approach significantly enhances the quality of retrieved information, leading to more accurate and helpful AI responses for the user.
More explanation on this SQL function in upcoming blogs!
Thoughtful engineering of memory and retrieval systems, balancing LLM capabilities with efficiency and user experience, is paramount for building truly intelligent and scalable AI applications.
Built with FastAPI, Next.js, Supabase, and Gemini on Google Cloud Run.