
Hardening the Pipeline: Defensive Engineering for a Public Beta

Moving from a local prototype to a shared environment requires a shift in mindset from "does it work?" to "can it be broken?"

The Cost of Curiosity

As I prepared to share Engrain with a small group of friends, I had to face a reality of LLM-based applications: tokens are expensive and resources are finite. Without safeguards, a single user (or a malicious bot) could exhaust my Google Cloud budget or trigger a DoS by spamming the Gemini API with massive blocks of text. I needed to move beyond functional code and implement defensive layers to protect the system's availability and my wallet.

Layer 1: Intelligent Rate Limiting

I chose slowapi to handle request throttling. A key decision here was how to identify users. Relying solely on IP addresses is unreliable in the age of VPNs and shared networks, so I implemented a custom key function that prioritizes the authenticated user_id from the request state, falling back to the remote address only for unauthenticated endpoints.

from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

def get_user_id_key(request: Request) -> str:
    # Prioritize authenticated user context over raw IP
    return getattr(request.state, "user_id", None) or get_remote_address(request)

limiter = Limiter(key_func=get_user_id_key)
app.state.limiter = limiter

I applied tiered limits to the query endpoints. For example, allowing 10 queries per minute handles natural conversation flow, while a cap of 50 per day prevents long-term resource drain during the beta phase.

Layer 2: Input Hardening and Pydantic Validation

Rate limits stop the frequency of attacks, but not the intensity of a single request. A user could still send a 100,000-word prompt that blows past the model's context window and inflates costs. I used Pydantic's Field to enforce strict constraints at the schema level. I capped queries at 5,000 characters (~1,200 tokens) and restricted top_k retrieval to 20 citations. This ensures that every request remains within a predictable cost and performance envelope.

from pydantic import BaseModel, Field

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=2, max_length=5000)
    top_k: int = Field(5, ge=1, le=20)  # Prevent excessive RAG context
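A quick sanity check of what that schema does in practice (assuming Pydantic v2; the sample query strings are illustrative):

```python
from pydantic import BaseModel, Field, ValidationError

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=2, max_length=5000)
    top_k: int = Field(5, ge=1, le=20)  # Prevent excessive RAG context

# A well-formed request passes and picks up the default top_k
req = QueryRequest(query="What is retrieval-augmented generation?")
print(req.top_k)  # 5

# An oversized prompt is rejected before it ever reaches the LLM
try:
    QueryRequest(query="x" * 6000, top_k=50)
except ValidationError as exc:
    # Two violations reported: query too long, top_k too high
    print(len(exc.errors()))  # 2
```

Because FastAPI runs this validation automatically on the request body, a malicious payload is rejected with a 422 before any tokens are spent.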

Layer 3: Resilience via Exponential Backoff

Distributed systems are inherently flaky. Vertex AI might throw a 503 error, or a network hiccup could drop a connection. Instead of letting the user see a raw error, I implemented a retry mechanism with exponential backoff. This allows the system to gracefully recover from transient failures without overwhelming the upstream provider.
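The pattern can be sketched in pure Python; the real code may lean on a library such as tenacity, and `retry_with_backoff` with its parameters here is illustrative, not Engrain's exact implementation:

```python
import random
import time

def retry_with_backoff(fn, retryable=(ConnectionError,), max_attempts=4,
                       base_delay=0.5, max_delay=8.0):
    """Call fn(), retrying transient failures with exponentially growing waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # out of retries: surface the error to the caller
            # Double the delay each attempt, cap it, and add jitter so
            # concurrent clients don't retry the upstream in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))
```

Wrapping the Gemini call in something like this means a transient 503 usually resolves silently on the second or third attempt instead of surfacing as a raw error.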

The main takeaway from this session was that production readiness isn't just about features; it's about building a system that is economically and operationally resilient against misuse and instability.


Built with FastAPI, Next.js, Supabase, and Gemini on Google Cloud Run.
