Moving from a local prototype to a shared environment requires a shift in mindset from "does it work?" to "can it be broken?"
The Cost of Curiosity
As I prepared to share Engrain with a small group of friends, I had to face a reality of LLM-based applications: tokens are expensive and resources are finite. Without safeguards, a single user (or a malicious bot) could exhaust my Google Cloud budget or trigger a DoS by spamming the Gemini API with massive blocks of text. I needed to move beyond functional code and implement defensive layers to protect the system's availability and my wallet.
Layer 1: Intelligent Rate Limiting
I chose slowapi to handle request throttling. A key decision here was how to identify users. Relying solely on IP addresses is unreliable in the age of VPNs and shared networks, so I implemented a custom key function that prioritizes the authenticated user_id from the request state, falling back to the remote address only for unauthenticated endpoints.
from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

def get_user_id_key(request: Request) -> str:
    # Prioritize authenticated user context over raw IP
    return getattr(request.state, "user_id", get_remote_address(request))

limiter = Limiter(key_func=get_user_id_key)
app.state.limiter = limiter
I applied tiered limits to the query endpoints. For example, allowing 10 queries per minute handles natural conversation flow, while a cap of 50 per day prevents long-term resource drain during the beta phase.
Layer 2: Input Hardening and Pydantic Validation
Rate limits cap the frequency of requests, but not the size of a single one. A user could still send a 100,000-word prompt that blows past context windows and costs. I used Pydantic's Field to enforce strict constraints at the schema level. I capped queries at 5,000 characters (~1,200 tokens) and restricted top_k retrieval to 20 citations. This ensures that every request stays within a predictable cost and performance envelope.
from pydantic import BaseModel, Field

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=2, max_length=5000)
    top_k: int = Field(5, ge=1, le=20)  # Prevent excessive RAG context
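To see what these constraints buy, here is a quick standalone check (the schema is restated so the snippet runs on its own) showing an oversized prompt being rejected before it ever reaches the LLM:

```python
from pydantic import BaseModel, Field, ValidationError

class QueryRequest(BaseModel):  # same schema as above
    query: str = Field(..., min_length=2, max_length=5000)
    top_k: int = Field(5, ge=1, le=20)

# A normal query passes and picks up the default top_k
ok = QueryRequest(query="What did I write about gratitude last spring?")
print(ok.top_k)  # → 5

# A 100,000-character prompt fails validation at the schema boundary
try:
    QueryRequest(query="x" * 100_000)
except ValidationError as e:
    print("rejected with", len(e.errors()), "error(s)")
```

The rejection costs a few microseconds of CPU instead of a paid Gemini call.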
Layer 3: Resilience via Exponential Backoff
Distributed systems are inherently flaky. Vertex AI might throw a 503 error, or a network hiccup could drop a connection. Instead of letting the user see a raw error, I implemented a retry mechanism with exponential backoff. This allows the system to gracefully recover from transient failures without overwhelming the upstream provider.
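The post doesn't show the retry code itself, so here is a minimal stdlib sketch of the idea, with a hypothetical TransientUpstreamError standing in for retryable failures like a Vertex AI 503:

```python
import random
import time

class TransientUpstreamError(Exception):
    """Stand-in for retryable upstream failures, e.g. a 503."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    # Retry transient failures, doubling the wait each attempt, plus
    # jitter so many clients don't retry in lockstep
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientUpstreamError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
```

Capping the delay and the attempt count keeps a persistent outage from stalling a request indefinitely, while the jittered doubling gives the upstream provider room to recover.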
The main takeaway from this session was that production readiness isn't just about features; it's about building a system that is economically and operationally resilient against misuse and instability.
Built with FastAPI, Next.js, Supabase, and Gemini on Google Cloud Run.