The Problem
Support teams are drowning, and throwing more headcount at it does not work
Every business with a complex product faces the same wall at scale: support volume grows faster than headcount can keep up. Hiring is expensive. Training takes months. Even after all that, the most common queries, nuanced questions requiring domain context rather than a simple lookup, still take minutes per ticket instead of seconds.
The instinct is to bolt on a chatbot. But generic chatbots make things worse. They handle low-effort FAQ traffic just fine. The moment a query gets complex, when a user asks something with multiple conditions or references a prior conversation, the bot deflects, loops, or confidently gives the wrong answer. Users lose trust and agents spend time cleaning up after the bot. The promised efficiency never arrives.
The business we were building for operated in a high-stakes, knowledge-intensive domain where users expect the chatbot to reason, not just respond. A wrong or vague answer is not just annoying; it erodes confidence in the entire product. The bar for AI accuracy here was significantly higher than a standard support bot.
"Users do not mind talking to an AI. They mind talking to an AI that does not understand them."
The core challenge was not deploying GPT. Any developer can write a few lines to call the OpenAI API. The challenge was making the AI reliably accurate in a specific domain, keeping it useful across long multi-turn conversations, making it cost-efficient to run at scale, and ensuring the whole system ran on infrastructure that would not buckle under load.
Before We Built This
What the support experience looked like without a capable AI agent
To understand what needed to be built, it helps to see the contrast between where things stood and what we were aiming for. This was not about replacing a working system; it was about building something that did not yet exist in a form that actually worked.
Without the AI agent
Complex queries routed directly to human agents, creating backlogs during peak hours
Generic chatbots deflected or looped on anything outside FAQ scope
No memory between messages; users repeated context every turn
Inconsistent answers depending on which agent handled the query
High cost per conversation as volume scaled
With the AI agent
AI handles multi-turn, complex queries with domain-specific accuracy
Context preserved across the full conversation thread
Confident answers on known topics; graceful escalation on ambiguous ones
Consistent, on-brand tone across every interaction
Roughly 60% cost reduction via intelligent model routing per query complexity
Our Approach
We did not wrap an API. We engineered the intelligence layer.
There is a significant difference between connecting GPT to a chat interface and building an AI agent that actually performs. The first takes an afternoon; the second requires deliberate decisions at every level: model selection, prompt architecture, context management, infrastructure and cost control.
Our approach centred on three core principles. First, domain specificity: the agent needed to understand the business's subject matter deeply, not just respond generically. This meant building a proper prompt engineering framework: structured templates, few-shot examples and output format constraints tested against hundreds of real query patterns.
Second, conversation continuity: most real queries are not single messages. A user asks about pricing, then follows up on that answer, then asks something that only makes sense given both previous turns. The agent needed to maintain a coherent thread with token-efficient context management so long sessions stayed accurate without becoming expensive.
Third, intelligent cost control: at scale, running every query through the most powerful model is not viable. We designed a routing layer that matches query complexity to model capability. Simple clarification questions go to lighter models; multi-variable reasoning tasks escalate to GPT-5. The result is the cost profile of a basic bot with the capability ceiling of an advanced AI.
~60%
API cost reduction via model routing
200+
Query types tested in prompt development
3
GPT models in the routing stack
Multi-turn
Context-aware conversation memory
What This Unlocks
A support agent that gets better as the business grows
The most important design decision was building the system to be model-agnostic and dataset-scalable. As OpenAI releases newer GPT versions, upgrading the agent means updating a configuration file rather than refactoring the backend. As the business adds new product areas or query categories, the prompt framework extends without architectural changes.
For the business, this means the AI support investment compounds over time rather than requiring periodic full rebuilds. The deployment on Azure with auto-scaling means the system handles traffic spikes, a product launch, a news moment or a seasonal surge, without degradation. Application Insights gives the team real-time visibility into response latency and error rates so issues surface before users notice.
The agent also knows its limits. When a query is genuinely ambiguous, out of scope, or likely to mislead, it acknowledges uncertainty and routes to a human agent rather than confidently hallucinating. This trust mechanism is often more valuable than raw accuracy. Users who know the AI will say it is not sure and connect them to a person trust it more than a bot that always has an answer.
How We Built It
The engineering behind the agent
Dual-framework backend: Django Ninja and FastAPI
We separated concerns by framework. Django Ninja handles the full application layer, covering authentication, session management, user data and admin, where its mature ORM and ecosystem shine. FastAPI handles the AI inference endpoints exclusively, where its async-first architecture eliminates blocking on concurrent chatbot requests. The result is the robustness of Django combined with the throughput of FastAPI where latency actually matters.
Django NinjaFastAPIPythonJWT AuthAsync endpoints
Prompt engineering framework, not just a system prompt
We built a structured prompt templating system with domain-specific few-shot examples, output format constraints and explicit uncertainty handling instructions. Templates were developed and tested across 200+ real query types, covering common, edge-case and adversarial inputs, then iterated based on output quality assessment. This framework is the single biggest driver of agent accuracy over vanilla GPT outputs.
Prompt templatesFew-shot examplesOutput constraintsUncertainty routing
Multi-model GPT router: capability matched to cost
Integrated GPT-4.0, GPT-4.1 and GPT-5.2 via the OpenAI API with a routing layer that classifies each query by complexity before dispatching. Simple clarification or FAQ queries route to GPT-4.0. Multi-turn conversations with moderate context route to GPT-4.1. Complex multi-variable reasoning escalates to GPT-5.2. The router runs classification in under 50ms, transparent to the user and significant on the cost line.
OpenAI GPT-4.0GPT-4.1GPT-5.2Model routerComplexity classification
Sliding context window with intent detection
Long conversations degrade AI quality if context is not managed; token limits are hit or irrelevant prior turns dominate the context window. We implemented a sliding window that preserves the most relevant recent turns and key extracted facts while summarising older context. Intent detection identifies when users switch topics or introduce contradictions, triggering clarification prompts rather than letting the agent proceed on false assumptions.
Sliding context windowToken managementIntent detectionTopic switching
Azure deployment with auto-scaling and observability
Deployed on Azure App Service with auto-scaling configured for peak traffic windows, with minimum instance pre-warming to eliminate cold start latency. Azure API Management handles rate limiting and API key lifecycle. Application Insights instruments every inference call, covering response time, token usage, error rates and model distribution, giving the team real-time visibility to tune and troubleshoot in production.
Azure App ServiceAzure API MgmtApp InsightsAuto-scalingPre-warming
System Architecture
Request flow from user to response
→
☁️
Azure API Mgmt
Rate · Auth
→
→
🧠
Model Router
4.0 / 4.1 / 5.2
←
📝
Prompt Engine
Templates · FewShot
←
💾
Context Store
Sliding window
Infrastructure layers
1
Frontend / UI
Chat interfaceWebSocket / RESTStreaming
2
API layer
Django NinjaFastAPIPythonJWT Auth
3
AI orchestration
GPT-4.0GPT-4.1GPT-5.2Model routerPrompt templates
4
Context engine
Sliding windowIntent detectionToken management
5
Cloud infra
Azure App ServiceAzure API MgmtApp InsightsAuto-scaling
Model routing logic
GPT-4.0Simple lookups, FAQ responses and low complexity queries. Fastest and cheapest.
GPT-4.1Multi-turn conversations, moderate analysis and topic comparisons.
GPT-5.2Complex multi-variable reasoning, edge cases and high-stakes escalations.
What Made This Hard
The engineering challenges that mattered
⚡
AI accuracy on ambiguous queriesSolved with structured prompt templates and explicit output format constraints that force the model to acknowledge uncertainty rather than hallucinate.
⚡
Token limits in long sessionsSolved with a sliding context window that retains the most relevant recent turns and extracted key facts. Quality held without exceeding limits.
⚡
Cold start latency on AzureSolved via minimum instance pre-warming and response streaming. Perceived response time drops significantly even before the full answer is ready.
⚡
API cost at scaleSolved with the model routing layer. Simple queries use cheaper models, reducing average cost per conversation by roughly 60% compared to always using the most capable model.
Key Technical Decisions
Why we built it this way
Django Ninja and FastAPI together
Using FastAPI alone loses Django's mature application layer. Using Django alone adds unnecessary blocking overhead to AI inference routes. Splitting them by responsibility, Django for the app and FastAPI for inference, gets the best of both without the compromises of either.
Azure over AWS for GPT workloads
Azure OpenAI Service deploys GPT models within the client's own Azure tenant, which is critical for data residency, enterprise compliance and predictable cost at high API call volumes. For GPT-heavy workloads, Azure's native integration is a structural advantage over AWS's third-party OpenAI access.
Multi-model router over a single-model architecture
Locking into one model creates both cost risk and capability risk. When GPT-5.3 ships, a single-model architecture requires re-evaluation and potential refactoring. The router decouples model selection from business logic entirely. Upgrades happen in configuration, not in code.
Prompt framework is the most undervalued investment
Most GPT deployments treat prompting as an afterthought. The structured prompt engineering framework, covering templates, few-shot examples, output constraints and uncertainty handling, is what separates a reliable domain AI from a general-purpose chatbot with a branded skin. It is where the majority of quality improvement comes from.