Moving from a local prototype to a shared environment requires a shift in mindset from "does it work?" to "can it be broken?"

The Cost of Curiosity

As I prepared to share Engrain with a small group of friends, I had to face a reality of LLM-based applications: tokens are expensive and resources are finite. Without safeguards, a single user (or a malicious bot) could exhaust my Google Cloud budget or trigger a DoS by spamming the Gemini API with massive blocks of text. I needed to move beyond merely functional code and implement defensive layers to protect both the system's availability and my wallet.

Layer 1: Intelligent Rate Limiting

I chose slowapi to handle request throttling. A key decision here was how to identify users. Relying solely on IP addresses is unreliable in the age of VPNs and shared networks, so I implemented a custom key function that prioritizes the authenticated user_id from the request state, falling back to the remote address only f...
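The key-function logic can be sketched as follows. This is a minimal sketch, not Engrain's actual code: `rate_limit_key` is a hypothetical name, and the `SimpleNamespace` objects stand in for a real `fastapi.Request` (whose `state.user_id` I assume is set by auth middleware). With slowapi, such a function would be passed as `key_func` when constructing the `Limiter`.

```python
from types import SimpleNamespace

def rate_limit_key(request) -> str:
    """Identify a caller for rate limiting.

    Prefer the authenticated user_id stored on request.state by the
    auth layer; fall back to the client IP for anonymous requests.
    """
    user_id = getattr(request.state, "user_id", None)
    if user_id:
        return f"user:{user_id}"
    return f"ip:{request.client.host}"

# Stand-in request objects for demonstration only; a real app would
# use fastapi.Request and slowapi's Limiter(key_func=rate_limit_key).
authed = SimpleNamespace(state=SimpleNamespace(user_id="u42"),
                         client=SimpleNamespace(host="203.0.113.9"))
anon = SimpleNamespace(state=SimpleNamespace(),
                       client=SimpleNamespace(host="203.0.113.9"))

print(rate_limit_key(authed))  # user:u42
print(rate_limit_key(anon))    # ip:203.0.113.9
```

Keying on the user rather than the address also means two people behind the same NAT don't share a quota, while an anonymous scraper still gets throttled per IP.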
While the bug fixes were underway, I tackled two core challenges in building an intelligent reading companion: how to give an AI a "long-term memory" without blowing the token budget, and how to make RAG truly smart by combining semantic understanding with precise keyword matching.

The Challenge of Context: Engineering Hierarchical Memory

One of the biggest hurdles in building LLM-powered applications is managing context. Users expect conversations to flow naturally, with the AI remembering past interactions. However, passing an entire chat history to the LLM quickly hits token limits, leading to truncated responses or runaway costs. On the flip side, providing only the last few messages results in a frustratingly forgetful AI.

To solve this, I implemented a 3-tier hierarchical memory compression system, inspired by how humans process and recall information. My goal was to mimic our ability to hold recent details, recall broader themes, and summarize long-term knowledge. Here...
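As a rough illustration of the idea, the three tiers can be sketched as a small data structure. This is a hedged sketch, not Engrain's actual implementation: the class name, tier sizes, and the `summarize` stub are all assumptions, and in the real system summarization would be an LLM call rather than a placeholder string.

```python
from collections import deque

def summarize(messages: list[str]) -> str:
    """Stub: in a real system an LLM call would compress these
    messages into a short thematic summary."""
    return f"summary of {len(messages)} messages"

class HierarchicalMemory:
    """3-tier memory: verbatim recent turns, mid-term theme
    summaries, and one rolling long-term summary."""

    def __init__(self, recent_size: int = 6, themes_size: int = 4):
        self.recent: deque = deque(maxlen=recent_size)
        self.themes: deque = deque(maxlen=themes_size)
        self.long_term: str = ""

    def add(self, message: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            # Recent tier is full: fold the current window into a
            # mid-term theme summary before accepting a new message.
            if len(self.themes) == self.themes.maxlen:
                # Mid-term tier is full too: compress it into the
                # rolling long-term summary first.
                self.long_term = summarize([self.long_term, *self.themes])
                self.themes.clear()
            self.themes.append(summarize(list(self.recent)))
        self.recent.append(message)

    def context(self) -> str:
        """Assemble prompt context: long-term summary first, then
        mid-term themes, then the verbatim recent turns."""
        parts = ([self.long_term] if self.long_term else []) \
            + list(self.themes) + list(self.recent)
        return "\n".join(parts)

mem = HierarchicalMemory(recent_size=3, themes_size=2)
for i in range(10):
    mem.add(f"message {i}")
print(mem.context())  # compressed summaries followed by the last 3 turns
```

The payoff is a roughly constant token budget: no matter how long the conversation runs, the prompt carries a bounded number of verbatim turns plus a fixed amount of compressed history.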