| title | PaderBot |
|---|---|
| emoji | 🎓 |
| colorFrom | blue |
| colorTo | green |
| sdk | streamlit |
| sdk_version | 1.39.0 |
| app_file | app.py |
| pinned | false |
| license | mit |
PaderBot answers questions about studying at Paderborn University in English or German, grounded in the university's own web pages. Ask it about admissions, English-taught Master's programmes, the semester fee, housing, the Studierendenwerk, enrolment steps — and it answers with citations back to the source page. When it doesn't have the information, it says so instead of guessing.
Live demo: https://huggingface.co/spaces/Pra2002/Paderbot · Code: https://github.com/Pra0809/Paderbot
International applicants to German universities hit the same wall: the information they need is real and public, but it is scattered across dozens of pages, split between English and German, and often the English version of a page quietly falls back to German. PaderBot is a focused retrieval-augmented generation (RAG) system over a curated slice of that content — built to be accurate and honest rather than broad, because a prospective student would rather hear "I don't know" than a confident wrong answer about a visa deadline.

            ┌─────────────────────────────────────────────┐
      Query │ 1. Gated query rewriting (skip if specific) │
    ───────▶│    llama-3.1-8b-instant                     │
            └───────────────────┬─────────────────────────┘
                                ▼
            ┌─────────────────────────────────────────────┐
            │ 2. Hybrid retrieval over 586 chunks         │
            │    • Dense: multilingual-e5-base (768-dim)  │
            │    • Sparse: BM25                           │
            │    • Fused with Reciprocal Rank Fusion      │
            └───────────────────┬─────────────────────────┘
                                ▼
            ┌─────────────────────────────────────────────┐
            │ 3. Confidence gate                          │
            │    too-weak retrieval → refuse, don't guess │
            └───────────────────┬─────────────────────────┘
                                ▼
            ┌─────────────────────────────────────────────┐
            │ 4. Grounded generation                      │
            │    llama-3.3-70b-versatile, strict context, │
            │    answers in the question's language,      │
            │    cites source URLs                        │
            └─────────────────────────────────────────────┘

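
Step 3 of the pipeline, the confidence gate, can be illustrated with a minimal sketch. The README does not say which signal the gate uses, so everything here is an assumption: the gate looks only at the best retrieval similarity, and `min_top_score`, `REFUSAL_MESSAGE`, and the `generate` callable are hypothetical names, not taken from `paderbot.py`.

```python
# Hypothetical refusal text; the real wording in the app is not stated.
REFUSAL_MESSAGE = "I don't have that information in my sources."


def answer_or_refuse(hits, generate, min_top_score=0.80):
    """Refuse when retrieval is too weak, otherwise generate an answer.

    hits: list of (chunk_text, similarity) pairs from retrieval.
    generate: callable that takes a list of chunk texts and returns an answer.
    min_top_score: assumed threshold on the best similarity (illustrative value).
    """
    top_score = max((score for _, score in hits), default=0.0)
    if top_score < min_top_score:
        # Weak or empty retrieval: never call the generator.
        return REFUSAL_MESSAGE
    return generate([text for text, _ in hits])


# Usage with a stand-in generator instead of a real LLM call.
fake_generate = lambda chunks: f"Answer grounded in {len(chunks)} chunk(s)."
weak = answer_or_refuse([("unrelated page", 0.41)], fake_generate)
strong = answer_or_refuse([("semester fee page", 0.91)], fake_generate)
```

Placing the check before generation means a refusal costs no call to the large model.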
Corpus. 94 pages scraped from uni-paderborn.de and the Paderborn
Studierendenwerk (48 English / 46 German, ~762k characters), chunked
paragraph-aware into 586 chunks and embedded into a persistent ChromaDB
index that ships with the repo (no rebuild needed at startup).
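
A rough sketch of what paragraph-aware chunking can look like, assuming paragraphs are separated by blank lines and packed greedily up to a size limit. The function name and `max_chars` value are illustrative, not taken from `index.py`; the corpus figures above only imply an average of roughly 1,300 characters per chunk (762k / 586).

```python
def chunk_paragraphs(text, max_chars=1500):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Paragraphs are never split, so a single paragraph longer than
    max_chars becomes its own oversized chunk (an assumption of this sketch).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            if current:
                chunks.append(current)  # close the open chunk
            current = para  # start a new chunk at a paragraph boundary
    if current:
        chunks.append(current)
    return chunks


page = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
chunks = chunk_paragraphs(page, max_chars=25)
```

Breaking only at paragraph boundaries keeps each chunk readable on its own, which matters when chunks are quoted back as cited context.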
Two-model split. A cheap fast model (8B) handles optional query rewriting; a strong model (70B) handles the actual grounded answer. Rewriting is gated — short, specific queries skip it, because an ablation showed rewriting hurt exact-term lookups while helping vague ones.
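
The gate on query rewriting could be approximated with a simple heuristic like the one below. This is only a sketch: the README says short, specific queries skip rewriting but does not give the rule, so the word limit and the "exact term" pattern (an all-caps acronym or a digit) are assumptions, and the call to the 8B model is left out.

```python
import re

# Assumed signal of a specific lookup: an acronym (two or more capitals) or a digit.
EXACT_TERM = re.compile(r"\b[A-Z]{2,}\b|\d")


def should_rewrite(query, max_words=8):
    """Return True when the query looks vague enough to benefit from rewriting."""
    is_short = len(query.split()) <= max_words
    has_exact_term = bool(EXACT_TERM.search(query))
    # Short queries that already carry an exact term go straight to retrieval.
    return not (is_short and has_exact_term)


skip = should_rewrite("IELTS score for MSc Computer Science")
rewrite = should_rewrite("how do I get started with studying there")
```

Only queries for which `should_rewrite` returns True would be sent to the small model, which protects the exact-term lookups that the ablation found rewriting to hurt.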
Hybrid over dense-only. BM25 catches exact tokens — programme names, acronyms, proper nouns — that dense embeddings smooth over. Dense catches paraphrase and cross-language matches. Reciprocal Rank Fusion combines them without trying to reconcile their incompatible score scales.
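
Reciprocal Rank Fusion itself fits in a few lines. The sketch below assumes each retriever returns a best-first list of chunk ids and uses k=60, a commonly used default; the constant PaderBot actually uses is not stated, and the chunk ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so only rank positions matter and raw scores never need rescaling.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c12", "c07", "c33"]   # hypothetical dense ranking
sparse_hits = ["c07", "c90", "c12"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["c07", "c12", "c90", "c33"]
```

Chunks found by both retrievers (`c07`, `c12`) rise above chunks found by only one, which is the behaviour the hybrid setup relies on.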
Evaluated on a hand-built 30-question benchmark (12 English easy lookups, 6 English multi-hop, 8 German, 4 should-refuse questions) using three RAGAS-style metrics implemented from scratch — faithfulness, answer relevance, and context precision.
| Group | n | Faithfulness | Answer Rel. | Context Prec. |
|---|---|---|---|---|
| All | 30 | 0.80 | 0.91 | 0.71 |
| English | 21 | 0.79 | 0.91 | 0.66 |
| German | 9 | 0.84 | 0.92 | 0.80 |
| Easy | 18 | 0.83 | 0.92 | 0.60 |
| Multi-hop | 8 | 0.73 | 0.91 | 0.80 |
| Refusal | 4 | n/a | n/a | 0.98 * |
English and German faithfulness are within 0.05 of each other — the central result, showing the multilingual approach holds up rather than quietly degrading on German.
\* Context precision on the refusal set is an artifact, not a real 0.98. The relevance judge matched surface keywords (e.g. it counted a page mentioning the "International Relations Office" as relevant to a question about the US president). The bot correctly refused all four of these questions; the high number reflects a limitation of a lightweight LLM-as-judge on adversarial questions, not retrieval quality. It is documented here rather than hidden by re-running the evaluation until the number looked better.
Methodology caveats (stated up front, the way they'd come up in an interview): 30 questions is indicative, not statistical; the judge is from the same model family as the generator, so same-family bias is possible; and results are scoped to this single Paderborn corpus and don't generalise to other domains.
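
To make the context-precision column and its refusal-set artifact concrete, here is a sketch of the rank-weighted definition used by RAGAS-style evaluations. It assumes the judge returns a 0/1 relevance verdict per retrieved chunk in rank order; the from-scratch implementation in `evaluate.py` may differ in detail.

```python
def context_precision(relevance):
    """Rank-weighted context precision from per-chunk 0/1 relevance verdicts.

    relevance: judge verdicts for the retrieved chunks, best-ranked first.
    Returns the mean of precision@k taken at each rank k that holds a
    relevant chunk, or 0.0 when nothing was judged relevant.
    """
    relevant_seen = 0
    precision_sum = 0.0
    for k, verdict in enumerate(relevance, start=1):
        if verdict:
            relevant_seen += 1
            precision_sum += relevant_seen / k
    return precision_sum / relevant_seen if relevant_seen else 0.0


mixed = context_precision([1, 0, 1])     # (1/1 + 2/3) / 2
inflated = context_precision([1, 1, 1])  # judge says everything is relevant
```

The metric only sees the judge's verdicts, so a judge that marks keyword matches as relevant pushes the score toward 1.0 regardless of what was retrieved, which is how the refusal row can end up near 0.98.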

    # 1. Install
    pip install -r requirements.txt

    # 2. Set your Groq API key (free tier: https://console.groq.com)
    export GROQ_API_KEY="your_key_here"

    # 3. Run
    streamlit run app.py

The prebuilt index (`chroma_db/`) and scraped corpus (`data/pages.jsonl`) ship with the repo, so there is no scrape or index step to run first. To rebuild from scratch instead, run `python scrape.py`, then `python index.py`.

| File | Purpose |
|---|---|
| `app.py` | Streamlit UI: bilingual, example questions, citations, refusal handling |
| `paderbot.py` | Core `PaderBot` class: retrieval, fusion, gating, generation |
| `scrape.py`, `discover_urls.py` | Corpus collection (curated-URL crawl + clean extraction) |
| `index.py` | Paragraph-aware chunking + e5 embedding into ChromaDB |
| `eval_set.py`, `evaluate.py` | 30-question benchmark + from-scratch RAGAS metrics |
| `chroma_db/` | Prebuilt vector index (ships with repo) |
| `data/pages.jsonl` | Scraped corpus (ships with repo) |

- Curated-URL crawl, not recursive. A broad crawl pulled in 1,200+ noisy URLs (news, research, PhD, equality-office pages). Scoping to ~14 English-taught programmes plus universal info kept the corpus relevant to the actual user — a prospective applicant.
- Detected "fake English" pages. Paderborn serves a 200-OK English URL for German-only programmes whose body is still German. Caught and removed these so the English index isn't polluted with German content.
- Scoped out library operational pages. A prospective applicant doesn't yet need loan rules — that detail comes once you've enrolled. Kept 3 overview pages, dropped ~30 operational ones.
- Single-turn by design. No conversational memory: every answer is grounded in freshly retrieved context, which keeps the citation story clean. Multi-turn is a deliberate future extension, not an oversight.
- Stale data. The corpus is a snapshot; deadlines and fees change. A scheduled re-scrape + re-index would keep it current.
- Same-page domination. Retrieval sometimes returns several chunks from one page. MMR-style diversity reranking would spread coverage.
- Free-tier concurrency. The live demo runs on one shared Groq key, so heavy concurrent traffic can hit rate limits.
- Judge strength. A stronger, different-family judge model would tighten the evaluation, especially on adversarial refusal cases.
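
The "fake English" check from the design notes above could be approximated with a stopword count. This is a sketch under stated assumptions: the README does not say how the pages were detected, the stopword lists are small hand-picked samples, and `declared_lang` is a hypothetical field standing in for whatever marks a page as English.

```python
import re

# Small illustrative stopword samples, chosen to avoid words shared by both languages.
GERMAN_STOPWORDS = {"der", "das", "und", "ist", "nicht", "für", "mit", "ein",
                    "eine", "den", "von", "zu", "auf", "im", "sich", "wird"}
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "for", "with", "that",
                     "this", "are", "on", "as", "be", "at", "by", "from"}


def looks_german(text):
    """True when German stopwords outnumber English ones in the text."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    german = sum(token in GERMAN_STOPWORDS for token in tokens)
    english = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return german > english


def is_fake_english(declared_lang, body):
    """Flag pages served as English whose body text reads as German."""
    return declared_lang == "en" and looks_german(body)


german_body = "Der Studiengang wird nur auf Deutsch angeboten und ist nicht zulassungsfrei."
english_body = "The programme is taught in English and is open to all applicants."
flagged = is_fake_english("en", german_body)
kept = is_fake_english("en", english_body)
```

A language-identification library would be more robust on short or mixed pages; the stopword count is shown only because it has no dependencies.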
Python · sentence-transformers · ChromaDB · rank-bm25 · Groq (Llama 3.1 / 3.3) · Streamlit