
Engineering Hierarchical Memory and Hybrid RAG for Deeper AI Conversations

While the bug fixes were underway, I tackled two core challenges in building an intelligent reading companion: giving the AI a "long-term memory" without blowing the token budget, and making RAG truly smart by combining semantic understanding with precise keyword matching.

The Challenge of Context: Engineering Hierarchical Memory

One of the biggest hurdles when building LLM-powered applications is managing context. Users expect conversations to flow naturally, remembering past interactions. However, passing an entire chat history to the LLM quickly hits token limits, leading to truncated responses or huge costs. On the flip side, only providing the last few messages results in a frustratingly forgetful AI.

To solve this, I implemented a 3-tier hierarchical memory compression system, inspired by how humans process and recall information. My goal was to mimic our ability to hold recent details, recall broader themes, and summarize long-term knowledge.

Here's how it breaks down:

  1. Tier 1: Recent Memory - The most recent messages, lightly compressed, providing immediate conversational context.
  2. Tier 2: Broader Themes - A compressed summary of Tier 1 plus the conversation's recent past, capturing key points and turns.
  3. Tier 3: Long-term Summary - A comprehensive, highly compressed summary of the entire session, ensuring the AI always understands the overarching themes and user goals.

By sending the last 6 messages together with the Tier 1, Tier 2, and Tier 3 summaries, I can provide rich context without exceeding token limits. For the user interface, I decided to show only the last 5 messages in the chat UI. This keeps the display clean and focused, while the underlying AI still has access to the deeper memory layers.
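To make the assembly concrete, here is a minimal Python sketch of how the three tiers and the recent-message tail could be combined into a single LLM context. The names (MemoryTiers, build_context) are illustrative, not Engrain's actual code, and the prompt layout is an assumption:

```python
from dataclasses import dataclass

@dataclass
class MemoryTiers:
    tier1: str = ""  # lightly compressed recent memory
    tier2: str = ""  # compressed summary of the conversation's recent past
    tier3: str = ""  # highly compressed whole-session summary

def build_context(tiers: MemoryTiers, messages: list[dict], n_recent: int = 6) -> list[dict]:
    """Combine the tier summaries with the last `n_recent` raw messages."""
    summary = (
        f"Long-term summary:\n{tiers.tier3}\n\n"
        f"Recent themes:\n{tiers.tier2}\n\n"
        f"Recent memory:\n{tiers.tier1}"
    )
    # One system message carrying the compressed memory, then the raw tail.
    return [{"role": "system", "content": summary}, *messages[-n_recent:]]
```

With this shape, a 40-message conversation collapses to just 7 context entries: one system message holding all three summaries, plus the 6 most recent turns.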

From Gemini to Cloud Vision OCR

Engrain allows users to upload photos of physical book highlights, which then need to be converted to text. Initially, I experimented with Gemini 3 Flash Preview for this task. While Gemini performed admirably, its broad capabilities as a multimodal LLM were overkill for a simple OCR task.

I decided to switch to Google Cloud Vision API's dedicated OCR model. This was a straightforward optimization:

  • Speed: Cloud Vision API is purpose-built for OCR, making it significantly faster for text extraction.
  • Cost-Efficiency: For a small-scale application like Engrain, a dedicated OCR service is far more cost-effective than using a general-purpose LLM.

This move allows me to maintain high accuracy in highlight ingestion while optimizing for both performance and operational costs, a crucial consideration for any growing project.

Hybrid RAG with Keyword and Semantic Search

Effective Retrieval-Augmented Generation (RAG) is the backbone of Engrain's ability to connect users with their knowledge. My initial RAG implementation relied solely on dense (semantic) matching, using embeddings to find conceptually similar highlights. While powerful for understanding intent, pure semantic search can miss exact keyword matches, especially for very specific queries.

To improve the robustness and precision of Engrain's RAG, I implemented a hybrid approach, combining semantic search with traditional keyword matching (Full-Text Search, or FTS, in PostgreSQL). The core idea is to leverage the strengths of both:

  • Semantic Search (Dense): Excellent for conceptual understanding, synonyms, and broader topics.
  • Keyword Search (Sparse): Unbeatable for exact phrase matching and specific terms.

I decided on a weighting scheme: 0.7 for semantic similarity and 0.3 for keyword matching. This prioritizes conceptual relevance while ensuring that direct keyword hits are not overlooked. The combined score is then used to rank the most relevant highlight chunks.
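The scoring formula can be expressed compactly outside SQL. This is a hedged sketch of the same arithmetic the query below performs (the function name is illustrative; the 0.7/0.3 weights and the squashing of ts_rank are taken from the post):

```python
def hybrid_score(semantic_sim: float, ts_rank: float,
                 w_sem: float = 0.7, w_kw: float = 0.3) -> float:
    """Combine cosine similarity with a squashed full-text rank.

    semantic_sim is 1 - cosine_distance (pgvector's `<=>` returns distance,
    so similarity is its complement). PostgreSQL's raw ts_rank is unbounded
    above, so it is squashed into [0, 1) via x / (1 + x) before weighting.
    """
    return w_sem * semantic_sim + w_kw * (ts_rank / (1.0 + ts_rank))
```

Note the asymmetry this creates: semantic similarity can contribute up to 0.7, while even an arbitrarily strong keyword match can never add more than 0.3.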

Here's the body of the PL/pgSQL function that brings this hybrid approach to life:

BEGIN
  RETURN QUERY
  WITH scored_chunks AS (
    SELECT
      hc.highlight_id                                                      AS _highlight_id,
      hc.chunk_text                                                        AS _chunk_text,
      h.text                                                               AS _highlight_text,
      h.book_id                                                            AS _book_id,
      h.chapter_id                                                           AS _chapter_id,
      (1 - (hc.embedding <=> query_embedding))::FLOAT                     AS _sim,
      ts_rank(hc.fts, websearch_to_tsquery('english', query_text))::FLOAT AS _kw_score,
      (
        ((1 - (hc.embedding <=> query_embedding)) * 0.7) +
        (
          (ts_rank(hc.fts, websearch_to_tsquery('english', query_text)) /
          (1 + ts_rank(hc.fts, websearch_to_tsquery('english', query_text))))
          * 0.3
        )
      )::FLOAT AS _comb_score
    FROM highlight_chunks hc
    JOIN highlights h ON hc.highlight_id = h.id
    WHERE
      h.user_id = p_user_id
      AND (p_book_id    IS NULL OR h.book_id    = p_book_id)
      AND (p_chapter_id IS NULL OR h.chapter_id = p_chapter_id)
      -- Hard semantic gate: anything below 0.20 is noise regardless of keyword match
      AND (1 - (hc.embedding <=> query_embedding)) > 0.20
      AND (
        (1 - (hc.embedding <=> query_embedding)) > match_threshold
        OR ts_rank(hc.fts, websearch_to_tsquery('english', query_text)) > 0
      )
  ),
  best_per_parent AS (
    SELECT DISTINCT ON (_highlight_id)
      _highlight_id,
      _chunk_text,
      _highlight_text,
      _book_id,
      _chapter_id,
      _sim,
      _kw_score,
      _comb_score
    FROM scored_chunks
    ORDER BY _highlight_id, _comb_score DESC
  )
  SELECT
    b._highlight_id                                 AS hc_highlight_id,
    b._chunk_text                                   AS hc_chunk_text,
    b._highlight_text                               AS highlight_text,
    COALESCE(bk.book_name,     'Unknown')::text     AS book_name,
    COALESCE(bk.author,        'Unknown')::text     AS author,
    COALESCE(ch.chapter_name,  'N/A')::text         AS chapter_name,
    COALESCE(ch.chapter_number, 0)::int              AS chapter_number,
    b._sim                                          AS sim,
    b._kw_score                                     AS kw_score,
    b._comb_score                                   AS comb_score
  FROM best_per_parent b
  LEFT JOIN books    bk ON bk.id = b._book_id
  LEFT JOIN chapters ch ON ch.id = b._chapter_id
  -- Combined score floor: filters weak keyword-only matches after ranking
  WHERE b._comb_score > min_comb_score
  ORDER BY b._comb_score DESC
  LIMIT match_count;
END;

A few key aspects of this query:

  • The _comb_score applies a saturating normalization to the keyword score (in the spirit of Reciprocal Rank Fusion, which similarly damps large values, though RRF proper operates on ranks rather than raw scores). By placing (1 + ts_rank) in the denominator, the formula "squashes" the unbounded raw keyword score into the range [0, 1). This prevents a high-frequency keyword match from overwhelming the semantic vector score, ensuring that literal matches provide a helpful boost without drowning out conceptual relevance.
  • I've included a "Hard semantic gate" (`(1 - (hc.embedding <=> query_embedding)) > 0.20`) to filter out purely noisy semantic matches, regardless of keyword score. This prevents irrelevant chunks from entering the scoring.
  • Finally, a `min_comb_score` floor filters out weak matches, ensuring only truly relevant information is passed to the LLM.
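For intuition, the three gates can be mirrored in plain Python. This is a sketch, not the production path (the real filtering happens inside the SQL function), and the threshold defaults here are illustrative stand-ins for the function's match_threshold and min_comb_score parameters:

```python
def passes_filters(sim: float, kw: float, comb: float,
                   match_threshold: float = 0.35,
                   min_comb_score: float = 0.30) -> bool:
    """Mirror the SQL WHERE clauses in order:
    1. hard semantic gate (noise cutoff regardless of keywords),
    2. semantic-or-keyword admission test,
    3. combined-score floor applied after ranking."""
    if sim <= 0.20:  # hard semantic gate
        return False
    if not (sim > match_threshold or kw > 0):
        return False
    return comb > min_comb_score
```

Note how the order matters: a chunk with a strong keyword hit but near-zero semantic similarity is rejected at step 1 before its keyword score is ever considered.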

This hybrid approach significantly enhances the quality of retrieved information, leading to more accurate and helpful AI responses for the user.

I'll dig deeper into this SQL function in upcoming posts!


Thoughtful engineering of memory and retrieval systems, balancing LLM capabilities with efficiency and user experience, is paramount for building truly intelligent and scalable AI applications.


Built with FastAPI, Next.js, Supabase, and Gemini on Google Cloud Run.
