PaderBot — Multilingual RAG Q&A for International Students

title	PaderBot
emoji	🎓
colorFrom	blue
colorTo	green
sdk	streamlit
sdk_version	1.39.0
app_file	app.py
pinned	false
license	mit

PaderBot answers questions about studying at Paderborn University in English or German, grounded in the university's own web pages. Ask it about admissions, English-taught Master's programmes, the semester fee, housing, the Studierendenwerk, or enrolment steps, and it answers with citations back to the source page. When it doesn't have the information, it says so instead of guessing.

Live demo: https://huggingface.co/spaces/Pra2002/Paderbot
Code: https://github.com/Pra0809/Paderbot

Why this exists

International applicants to German universities hit the same wall: the information they need is real and public, but it is scattered across dozens of pages, split between English and German, and often the English version of a page quietly falls back to German. PaderBot is a focused retrieval-augmented generation (RAG) system over a curated slice of that content — built to be accurate and honest rather than broad, because a prospective student would rather hear "I don't know" than a confident wrong answer about a visa deadline.

How it works

          ┌─────────────────────────────────────────────┐
  Query   │  1. Gated query rewriting (skip if specific) │
  ───────▶│     llama-3.1-8b-instant                     │
          └───────────────────┬─────────────────────────┘
                              ▼
          ┌─────────────────────────────────────────────┐
          │  2. Hybrid retrieval over 586 chunks         │
          │     • Dense: multilingual-e5-base (768-dim)  │
          │     • Sparse: BM25                           │
          │     • Fused with Reciprocal Rank Fusion      │
          └───────────────────┬─────────────────────────┘
                              ▼
          ┌─────────────────────────────────────────────┐
          │  3. Confidence gate                          │
          │     too-weak retrieval → refuse, don't guess │
          └───────────────────┬─────────────────────────┘
                              ▼
          ┌─────────────────────────────────────────────┐
          │  4. Grounded generation                      │
          │     llama-3.3-70b-versatile, strict context, │
          │     answers in the question's language,      │
          │     cites source URLs                        │
          └─────────────────────────────────────────────┘
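
Step 3 amounts to a threshold check before any text is generated. The following is a minimal sketch, assuming the gate looks at the best cosine similarity among the retrieved chunks; the `Hit` type, the `gate` function and the 0.80 threshold are illustrative and not taken from `paderbot.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    url: str
    text: str
    similarity: float  # cosine similarity between chunk and query


# Hypothetical threshold; a real value would be tuned on the benchmark.
MIN_SIMILARITY = 0.80


def gate(hits: list[Hit], min_similarity: float = MIN_SIMILARITY) -> Optional[list[Hit]]:
    """Return the hits to generate from, or None when retrieval is too weak."""
    if not hits:
        return None
    best = max(hit.similarity for hit in hits)
    return hits if best >= min_similarity else None
```

A caller would answer from the returned hits and emit a fixed refusal message when the result is `None`, so the 70B model is never asked to answer from weak context.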

Corpus. 94 pages scraped from uni-paderborn.de and the Paderborn Studierendenwerk (48 English / 46 German, ~762k characters), split on paragraph boundaries into 586 chunks and embedded into a persistent ChromaDB index that ships with the repo, so no rebuild is needed at startup.
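
Paragraph-aware chunking can be sketched as greedy packing of whole paragraphs. This is an illustration under assumptions, not the code in `index.py`: the function name and the 1,500-character budget are hypothetical (762k characters over 586 chunks works out to roughly 1,300 characters per chunk on average), and paragraphs are assumed to be separated by blank lines.

```python
def chunk_paragraphs(text: str, max_chars: int = 1500) -> list[str]:
    """Pack whole paragraphs into chunks without splitting any paragraph.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact means a retrieved chunk never starts or ends mid-sentence, which helps when the answer has to quote or cite it.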

Two-model split. A cheap fast model (8B) handles optional query rewriting; a strong model (70B) handles the actual grounded answer. Rewriting is gated — short, specific queries skip it, because an ablation showed rewriting hurt exact-term lookups while helping vague ones.
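
One way to express the gate is a cheap heuristic that runs before the 8B model is called. The sketch below is an assumption about what "short and specific" could mean (few words plus an acronym, a number or a quoted phrase); the function name, the word limit and the signals are illustrative, and the actual rule in `paderbot.py` may differ.

```python
import re


def needs_rewrite(query: str, max_specific_words: int = 6) -> bool:
    """Return True when the query looks long or vague enough to rewrite."""
    words = query.split()
    # Signals of an exact-term lookup: an acronym, a digit, or a quoted phrase.
    has_exact_term = bool(re.search(r'\b[A-Z]{2,}\b|\d|"[^"]+"', query))
    is_short = len(words) <= max_specific_words
    # Short queries with an exact term go to retrieval unchanged.
    return not (is_short and has_exact_term)
```

With these assumptions, "DAAD scholarship deadline" skips rewriting, while "where can I live" is sent to the rewriter because it has no exact term for BM25 to anchor on.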

Hybrid over dense-only. BM25 catches exact tokens — programme names, acronyms, proper nouns — that dense embeddings smooth over. Dense catches paraphrase and cross-language matches. Reciprocal Rank Fusion combines them without trying to reconcile their incompatible score scales.
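
Reciprocal Rank Fusion only needs each retriever's ranked list of chunk IDs. A minimal sketch, assuming the commonly used constant k = 60 (the value used in this project is not stated) and hypothetical chunk IDs:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["c1", "c2", "c3"]   # hypothetical dense (e5) ranking
sparse = ["c3", "c1", "c4"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Because only ranks enter the formula, cosine similarities and BM25 scores never have to be normalised against each other.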

Evaluation

Evaluated on a hand-built 30-question benchmark (12 English easy lookups, 6 English multi-hop, 8 German, 4 should-refuse questions) using three RAGAS-style metrics implemented from scratch — faithfulness, answer relevance, and context precision.

Group	n	Faithfulness	Answer relevance	Context precision
All	30	0.80	0.91	0.71
English	21	0.79	0.91	0.66
German	9	0.84	0.92	0.80
Easy	18	0.83	0.92	0.60
Multi-hop	8	0.73	0.91	0.80
Refusal	4	n/a	n/a	0.98 *

English and German faithfulness are within 0.05 of each other (0.79 vs 0.84), and German is not the lower of the two. This is the central result: the multilingual approach holds up rather than quietly degrading on German.

* Context precision on the refusal set is an artifact, not a real 0.98. The relevance judge matched surface keywords (e.g. it counted a page mentioning the "International Relations Office" as relevant to a question about the US president). The bot correctly refused all four of these questions, so the high number reflects a limitation of a lightweight LLM-as-judge on adversarial questions, not retrieval quality. It is reported as-is rather than re-run until it went away.

Methodology caveats (stated up front, the way they'd come up in an interview): 30 questions is indicative, not statistical; the judge is from the same model family as the generator, so same-family bias is possible; and results are scoped to this single Paderborn corpus and don't generalise to other domains.
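
For reference, context precision in the RAGAS sense is a rank-weighted average over the judge's per-chunk relevance verdicts. A minimal sketch, assuming that definition; the function name is illustrative and `evaluate.py` may differ in detail.

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks at which a relevant chunk appears.

    relevance[i] is the judge's verdict for the chunk retrieved at rank i + 1.
    """
    relevant_so_far = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            total += relevant_so_far / k  # precision@k at this rank
    return total / relevant_so_far if relevant_so_far else 0.0
```

The score depends only on those verdicts, so a judge that wrongly marks nearly every retrieved chunk as relevant pushes it towards 1.0 regardless of what was retrieved, which is the refusal-set artifact described above.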

Running it locally

# 1. Install
pip install -r requirements.txt

# 2. Set your Groq API key (free tier: https://console.groq.com)
export GROQ_API_KEY="your_key_here"

# 3. Run
streamlit run app.py

The prebuilt index (chroma_db/) and scraped corpus (data/pages.jsonl) ship with the repo, so there is no scrape or index step to run first. To rebuild from scratch instead: python scrape.py → python index.py.

Project layout

File	Purpose
`app.py`	Streamlit UI — bilingual, example questions, citations, refusal handling
`paderbot.py`	Core `PaderBot` class — retrieval, fusion, gating, generation
`scrape.py`, `discover_urls.py`	Corpus collection (curated-URL crawl + clean extraction)
`index.py`	Paragraph-aware chunking + e5 embedding into ChromaDB
`eval_set.py`, `evaluate.py`	30-question benchmark + from-scratch RAGAS metrics
`chroma_db/`	Prebuilt vector index (ships with repo)
`data/pages.jsonl`	Scraped corpus (ships with repo)

Decisions

Curated-URL crawl, not recursive. A broad crawl pulled in 1,200+ noisy URLs (news, research, PhD, equality-office pages). Scoping to ~14 English-taught programmes plus universal info kept the corpus relevant to the actual user — a prospective applicant.
Detected "fake English" pages. For German-only programmes, Paderborn serves an English URL that returns 200 OK but whose body is still German. These pages were detected and removed so the English index isn't polluted with German content.
Scoped out library operational pages. A prospective applicant doesn't yet need loan rules — that detail comes once you've enrolled. Kept 3 overview pages, dropped ~30 operational ones.
Single-turn by design. No conversational memory: every answer is grounded in freshly retrieved context, which keeps the citation story clean. Multi-turn is a deliberate future extension, not an oversight.
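
The "fake English" check can be approximated without a language-detection dependency. The sketch below is an assumption, not the logic in `scrape.py`: it treats a URL containing `/en/` as claiming to be English and compares counts of common German and English function words; the word lists, function names and example URL are illustrative.

```python
import re

GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "nicht", "für", "mit",
                    "den", "von", "ein", "eine", "zu", "auf", "sich"}
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "for", "with", "that",
                     "are", "this", "you", "on", "as", "be"}


def looks_german(body: str) -> bool:
    """Return True when German function words outnumber English ones."""
    tokens = re.findall(r"[a-zäöüß]+", body.lower())
    german = sum(token in GERMAN_STOPWORDS for token in tokens)
    english = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return german > english


def is_fake_english(url: str, body: str) -> bool:
    """Flag pages whose URL claims English but whose body reads as German."""
    return "/en/" in url and looks_german(body)
```

A dedicated language-identification library would be more robust on short or mixed-language pages; the stopword count is only meant to show the shape of the check.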

Limitations & future work

Stale data. The corpus is a snapshot; deadlines and fees change. A scheduled re-scrape + re-index would keep it current.
Same-page domination. Retrieval sometimes returns several chunks from one page. MMR-style diversity reranking would spread coverage.
Free-tier concurrency. The live demo runs on one shared Groq key, so heavy concurrent traffic can hit rate limits.
Judge strength. A stronger, different-family judge model would tighten the evaluation, especially on adversarial refusal cases.
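
The MMR (maximal marginal relevance) idea mentioned above can be sketched in a few lines. This is a possible future addition rather than existing code: the function names and the trade-off weight are assumptions, and plain Python lists stand in for the e5 embedding vectors.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: list[float], doc_vecs: list[list[float]],
        top_k: int, lambda_: float = 0.5) -> list[int]:
    """Pick top_k document indices, trading relevance against redundancy."""
    selected: list[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < top_k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest chunk that was already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1.0 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lambda_=1.0` this reduces to plain relevance ranking; lowering it penalises chunks that are near-duplicates of ones already selected, such as several chunks from the same page.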

Stack

Python · sentence-transformers · ChromaDB · rank-bm25 · Groq (Llama 3.1 / 3.3) · Streamlit
