RAG Support Assistant

Case Study Summary

Type: Personal project / portfolio piece
Stack: Python, FastAPI, OpenAI gpt-4o-mini, ChromaDB, sentence-transformers, cross-encoder reranker, pydantic-ai, Langfuse, vanilla JS
GitHub: alfredpersson/rag-support-assistant

Evaluation Results:

  • 71% expert hit rate @ 5 (retrieval accuracy)
  • 4.80 / 5 faithfulness (grounding, not hallucination)
  • 4.76 / 5 relevancy (answers the actual question)
  • 7 experiments across retrieval strategy, query expansion, and chunk enrichment

Overview

A customer support RAG pipeline built over the Wix Help Center dataset. Every stage — chunking, retrieval, reranking, classification, generation, and evaluation — is implemented from scratch so the design decisions are visible and auditable.

The focus is on what happens when the pipeline can't confidently answer: when retrieval returns nothing useful, when the model hedges, or when the user is upset and asking to cancel their account. Each of these cases has a specific, designed response path rather than a generic fallback.

In a support context, these failure cases are where the business impact lives. A bot that handles a cancellation request the same way it handles a how-to question will lose that customer. A bot that confidently presents a wrong answer erodes trust faster than having no bot at all. The failure handling in this pipeline is designed to deflect tickets when the bot can help, preserve trust when it can't, and route to a human immediately when the situation calls for it.

Screenshots

The support widget lives in the corner of a mock SaaS dashboard:

Mock SaaS dashboard with closed chat widget

Dashboard with chat widget open

Response types

The widget renders each response type differently based on the pipeline's routing decision:

Fully answered — a grounded answer with step-by-step instructions and source article links:

Fully answered query with source links

Follow-up question — when retrieval finds nothing relevant, the bot asks a clarifying question instead of guessing:

Follow-up clarifying question

High-stakes escalation — sensitive queries (cancellations, billing disputes) get an empathetic response with a path to a human agent:

High-stakes query with empathetic response and escalation

Low confidence — the answer is shown with a disclaimer and no source links:

Low confidence response with disclaimer

How it works

Every question is classified into one of five categories — answerable, nonsense, irrelevant, out-of-scope, or high-stakes — and routed to a purpose-built handler. Only answerable and high-stakes queries go through retrieval; the rest get static responses immediately without wasting compute.
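As a sketch, the routing step boils down to a small dispatch over the five categories. The `Category` enum and handler names below are illustrative, not the project's actual identifiers:

```python
from enum import Enum

class Category(str, Enum):
    ANSWERABLE = "answerable"
    NONSENSE = "nonsense"
    IRRELEVANT = "irrelevant"
    OUT_OF_SCOPE = "out_of_scope"
    HIGH_STAKES = "high_stakes"

def route(category: Category) -> str:
    """Map a classified query to a handler (names are illustrative).

    Only answerable and high-stakes queries reach the retrieval pipeline;
    everything else short-circuits to a static or escalation response.
    """
    if category in (Category.NONSENSE, Category.IRRELEVANT):
        return "static_response"      # polite deflection + topic suggestions
    if category is Category.OUT_OF_SCOPE:
        return "escalation_offer"
    return "retrieval_pipeline"       # answerable or high-stakes
```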

For queries that go through retrieval, the pipeline applies two confidence gates before presenting an answer:

  1. Relevance gate — if the cross-encoder reranker's top score is below 2.0, the retrieved results aren't good enough. Instead of generating from weak context, the bot asks a clarifying follow-up question.
  2. Confidence gate — if the top score clears the relevance gate but is still below 5.0, the answer is shown, but the bot signals that it isn't fully sure and asks the user to be more specific.
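The two gates amount to a simple threshold function over the reranker's top score. A minimal sketch using the 2.0 and 5.0 thresholds described above, with illustrative mode names:

```python
def apply_confidence_gates(top_score: float,
                           relevance_threshold: float = 2.0,
                           confidence_threshold: float = 5.0) -> str:
    """Map the cross-encoder reranker's top score to a response mode.

    Thresholds match the values described above; the returned mode
    names are illustrative, not the project's actual identifiers.
    """
    if top_score < relevance_threshold:
        return "followup_question"        # context too weak to answer from
    if top_score < confidence_threshold:
        return "answer_with_disclaimer"   # answer, but signal uncertainty
    return "confident_answer"
```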

After generation, a self-critique step assesses whether the answer fully addresses, partially addresses, or cannot address the question. This determines whether to offer human escalation and how much confidence to convey to the user.
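The critique verdicts map to presentation policies roughly as follows. The verdict strings match the pipeline diagram; the dict shape and policy names are illustrative:

```python
def presentation_policy(verdict: str) -> dict:
    """Decide source-link count and escalation path from the
    self-critique verdict (verdict strings follow the pipeline diagram;
    the returned structure is illustrative)."""
    policies = {
        "FULLY_ANSWERED":     {"sources": 2, "escalation": None},
        "PARTIALLY_ANSWERED": {"sources": 1, "escalation": "soft_link"},
        "CANNOT_ANSWER":      {"sources": 0, "escalation": "connect_agent"},
    }
    return policies[verdict]
```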

High-stakes queries (cancellations, billing disputes, complaints) bypass normal generation entirely. They get an empathetic acknowledgment with a structured retention offer and escalation path — because a standard RAG answer to "I want to cancel my account" is the wrong response.

Pipeline

flowchart LR

  %% ── Ingest (one-time) ──────────────────────────────────────
  subgraph INGEST["ingest.py  (one-time)"]
    direction LR
    DS["Wix/WixQA dataset<br/>(HuggingFace)"]
    CHUNK["Chunker<br/>1 · split on \\n\\n<br/>2 · merge &lt;50 tok<br/>3 · split &gt;300 tok at sentences<br/>4 · 50-token sliding overlap"]
    EMBED_I["Embed chunks<br/>all-MiniLM-L6-v2"]

    DS -->|"title + body"| CHUNK --> EMBED_I
  end

  EMBED_I --> CHROMA[("ChromaDB<br/>collection: wix_kb")]

  %% ── Runtime ────────────────────────────────────────────────
  USER["User question"] --> API["FastAPI<br/>POST /ask"] --> RL{"Rate limit<br/>≤100 req/day"}

  RL -- "limit exceeded" --> ERR429["HTTP 429"]
  RL -- "ok" --> CLF{"Classifier<br/>gpt-4o-mini<br/>5 categories"}

  CLF -- "nonsense /<br/>irrelevant" --> STATIC["Static response<br/>+ topic suggestions"]
  CLF -- "out of scope" --> OOS["Escalation offer"]
  CLF -- "answerable /<br/>high-stakes" --> QR["Query rewriter<br/>gpt-4o-mini<br/>(short queries only)"]

  QR --> EMBED_R["Embed query<br/>all-MiniLM-L6-v2"] --> VSEARCH["Vector search<br/>ChromaDB · top-20"]
  VSEARCH <--> CHROMA
  VSEARCH -- "20 candidates" --> RERANK["Cross-encoder reranker<br/>ms-marco-MiniLM-L6-v2 · top-5"]

  RERANK --> REL{"Relevance gate<br/>score ≥ 2.0?"}

  REL -- "yes" --> HS_CHECK{"High-stakes?"}
  REL -- "no" --> HS_CHECK_NO{"High-stakes?"}

  HS_CHECK -- "yes" --> HS["High-stakes response<br/>empathy + context<br/>+ escalation"] --> RESP["Response<br/>to user"]
  HS_CHECK -- "no" --> GEN["Generator<br/>gpt-4o-mini"] --> CONF{"Confidence gate<br/>score ≥ 5.0?"}

  HS_CHECK_NO -- "yes" --> HS_NO["High-stakes response<br/>empathy + escalation<br/>(no context)"] --> RESP
  HS_CHECK_NO -- "no" --> FOLLOWUP["Follow-up question<br/>ask for clarification"] --> RESP

  CONF -- "low confidence" --> LC["Answer + disclaimer<br/>no sources"] --> RESP
  CONF -- "confident" --> CRITIQUE{"Self-critique<br/>gpt-4o-mini"}

  CRITIQUE -- "CANNOT_ANSWER" --> CA["Escalation message<br/>+ connect-agent"] --> RESP
  CRITIQUE -- "PARTIALLY_ANSWERED" --> PA["Answer + 1 source<br/>+ soft escalation link"] --> RESP
  CRITIQUE -- "FULLY_ANSWERED" --> FA["Answer + up to 2 sources"] --> RESP

  STATIC --> RESP
  OOS --> RESP
  RESP --> USER

  classDef store     fill:#e8f4fd,stroke:#3b82f6,color:#1e3a5f
  classDef llm       fill:#fef3c7,stroke:#f59e0b,color:#78350f
  classDef gate      fill:#f3f4f6,stroke:#6b7280,color:#111827
  classDef endpoint  fill:#ede9fe,stroke:#7c3aed,color:#3b0764
  classDef io        fill:#dcfce7,stroke:#16a34a,color:#14532d

  class CHROMA store
  class CLF,QR,GEN,FOLLOWUP,HS,HS_NO llm
  class RL,HS_CHECK,HS_CHECK_NO,REL,CONF,CRITIQUE gate
  class API endpoint
  class USER,RESP io

Failure states

| Situation | Response | Reasoning |
| --- | --- | --- |
| Retrieval finds nothing relevant | Clarifying follow-up question | Avoids hallucinating from weak context; keeps the conversation going |
| Low retrieval confidence (score 2.0–5.0) | Disclaimer above the answer, no source links | Context is good enough to attempt an answer but not good enough to cite confidently |
| Self-critique: CANNOT_ANSWER | "I couldn't find an answer" + connect-to-agent button | Immediate escalation path instead of a vague hedge |
| Self-critique: PARTIALLY_ANSWERED | Answer + 1 source link + soft inline escalation link | Partial answers are still useful — a soft text link gives the user an easy path if the answer isn't enough |
| Self-critique: FULLY_ANSWERED | Answer + up to 2 source links | Confident answer with "read more" links to the full articles |
| High-stakes query | Empathetic acknowledgment + retention offer + escalation | Sensitive queries need tone and structure control, not a help article |
| Out-of-scope query | Escalation offer | Doesn't pretend to help with things it can't handle |
| Nonsense or irrelevant query | Polite deflection + topic suggestions | Saves retrieval and generation compute |

Evaluation

The pipeline is evaluated against the WixQA benchmark with retrieval and generation scored separately, so failures can be diagnosed to the right stage. Seven experiments were run across retrieval strategy, query expansion, classifier tuning, and chunk enrichment. The best configuration is title-prepended embeddings + cross-encoder reranking.

| Metric | Score | What it means |
| --- | --- | --- |
| Expert Hit Rate @ 5 | 0.71 (n=200) | The correct help article appeared in the top 5 results 71% of the time |
| Expert MRR | 0.51 (n=200) | On average, the correct article appeared around position 2 |
| Faithfulness | 4.80 / 5 (n=50) | Answers are almost entirely grounded in retrieved content, not hallucinated |
| Relevancy | 4.76 / 5 (n=50) | Answers consistently address what was actually asked |
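For reference, the two retrieval metrics reduce to a few lines each. This sketch assumes each benchmark item has a single gold article id; averaging over the benchmark gives Hit Rate @ 5 and MRR:

```python
def hit_rate_at_k(ranked_ids: list[str], gold_id: str, k: int = 5) -> int:
    """1 if the gold article appears in the top-k retrieved ids, else 0."""
    return int(gold_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids: list[str], gold_id: str) -> float:
    """1/rank of the gold article, or 0.0 if it was not retrieved at all."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0
```

An MRR of 0.51 corresponds to an average reciprocal rank of roughly 1/2, i.e. the correct article typically lands around position 2.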

The hit rate ceiling at 0.71 is a genuine finding: the remaining 29% are KB coverage gaps and vocabulary mismatches that persisted across all seven experiments. The most impactful next steps would be hybrid retrieval (dense + BM25) and a larger embedding model.
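Hybrid retrieval would fuse the dense and BM25 result lists; the standard approach, Reciprocal Rank Fusion, is itself only a few lines. A minimal sketch with the conventional k = 60 smoothing constant:

```python
from collections import defaultdict

def rrf_fuse(dense_ranking: list[str], bm25_ranking: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over two ranked lists of doc ids.

    Each document scores sum(1 / (k + rank)) across the rankings it
    appears in; documents ranked well by either retriever float up.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in (dense_ranking, bm25_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```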

What changes for production

| Area | Demo | Production |
| --- | --- | --- |
| Vector store | ChromaDB (local file) | pgvector or managed service (Pinecone, Qdrant) |
| Retrieval | Dense only | Hybrid: dense + BM25 with RRF |
| Classification | LLM call (gpt-4o-mini) | Fine-tuned small classifier, <50 ms |
| Evaluation | Offline script, LLM-as-judge on 50 questions | CI-gated regression suite on prompt/retrieval changes |
| Rate limiting | JSON file | Redis INCR + EXPIRE, per-user |
| Conversation context | Stateless (single turn) | Sliding window stored in Redis, session ID on API |
| Streaming | None | Server-Sent Events, token-by-token |
| Frontend | Vanilla JS widget | React + TypeScript |

Tech stack

| Layer | Tech | What it does |
| --- | --- | --- |
| Chunking | Paragraph split + merge/split heuristics + 50-token overlap | Breaks help articles into retrievable pieces while preserving natural structure |
| Embedding | sentence-transformers/all-MiniLM-L6-v2 (local) | Converts text to vectors for similarity search, runs locally with no API cost |
| Vector store | ChromaDB (persistent, local) | Stores and searches article chunks by similarity |
| Reranker | cross-encoder/ms-marco-MiniLM-L6-v2 | Two-pass search: fast vector scan, then precise re-scoring of top candidates |
| LLM | OpenAI gpt-4o-mini via pydantic-ai | Generates answers, classifies queries, and self-critiques responses |
| API | FastAPI + CORS + static files | Serves the pipeline and frontend |
| Frontend | Vanilla HTML/CSS/JS chat widget | Renders each response type differently: source links, escalation buttons, suggestion chips |
| Observability | Langfuse (traces + prompt versioning) | Shows exactly what happened on every request |