---
title: PaderBot
emoji: 🎓
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.39.0
app_file: app.py
pinned: false
license: mit
---

PaderBot — Multilingual RAG Q&A for International Students

PaderBot answers questions about studying at Paderborn University in English or German, grounded in the university's own web pages. Ask it about admissions, English-taught Master's programmes, the semester fee, housing, the Studierendenwerk, enrolment steps — and it answers with citations back to the source page. When it doesn't have the information, it says so instead of guessing.

Live demo: https://huggingface.co/spaces/Pra2002/Paderbot
Code: https://github.com/Pra0809/Paderbot


Why this exists

International applicants to German universities hit the same wall: the information they need is real and public, but it is scattered across dozens of pages, split between English and German, and often the English version of a page quietly falls back to German. PaderBot is a focused retrieval-augmented generation (RAG) system over a curated slice of that content — built to be accurate and honest rather than broad, because a prospective student would rather hear "I don't know" than a confident wrong answer about a visa deadline.


How it works

          ┌─────────────────────────────────────────────┐
  Query   │  1. Gated query rewriting (skip if specific) │
  ───────▶│     llama-3.1-8b-instant                     │
          └───────────────────┬─────────────────────────┘
                              ▼
          ┌─────────────────────────────────────────────┐
          │  2. Hybrid retrieval over 586 chunks         │
          │     • Dense: multilingual-e5-base (768-dim)  │
          │     • Sparse: BM25                           │
          │     • Fused with Reciprocal Rank Fusion      │
          └───────────────────┬─────────────────────────┘
                              ▼
          ┌─────────────────────────────────────────────┐
          │  3. Confidence gate                          │
          │     too-weak retrieval → refuse, don't guess │
          └───────────────────┬─────────────────────────┘
                              ▼
          ┌─────────────────────────────────────────────┐
          │  4. Grounded generation                      │
          │     llama-3.3-70b-versatile, strict context, │
          │     answers in the question's language,      │
          │     cites source URLs                        │
          └─────────────────────────────────────────────┘
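
Step 3 of the diagram can be sketched as a simple threshold check. This is a minimal illustration, not the code in paderbot.py: the names `confidence_gate`, `answer_or_refuse`, and `REFUSAL`, the use of cosine similarity as the gate signal, and the 0.80 threshold are all assumptions. e5-family embeddings tend to produce cosine similarities bunched in a high range, so any real threshold would need calibrating on held-out queries.

```python
# Hypothetical refusal messages; the real wording lives in the app.
REFUSAL = {
    "en": "I don't have that information in my sources.",
    "de": "Dazu habe ich in meinen Quellen keine Informationen.",
}

def confidence_gate(hits, min_similarity=0.80, min_hits=1):
    """Return True when retrieval looks strong enough to attempt an answer.

    hits: list of (chunk_id, cosine_similarity) pairs, best first.
    min_similarity and min_hits are illustrative values, not the repo's.
    """
    strong = [h for h in hits if h[1] >= min_similarity]
    return len(strong) >= min_hits

def answer_or_refuse(hits, lang, generate):
    """Refuse in the user's language if the gate fails, else call the generator."""
    if not confidence_gate(hits):
        return REFUSAL[lang]
    return generate(hits)

# A weak top hit triggers a refusal; a strong one reaches the generator.
weak = answer_or_refuse([("c1", 0.62)], "en", lambda h: "grounded answer")
strong = answer_or_refuse([("c1", 0.91)], "en", lambda h: "grounded answer")
```

The point of gating before generation is that the strong model never sees a weak context, so it has no opportunity to improvise from it.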

Corpus. 94 pages scraped from uni-paderborn.de and the Paderborn Studierendenwerk (48 English / 46 German, ~762k characters), split with paragraph-aware chunking into 586 chunks and embedded into a persistent ChromaDB index. The index ships with the repo, so nothing is rebuilt at startup.
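
As a rough sketch of what paragraph-aware chunking means here: whole paragraphs are packed into a chunk until a size budget is reached, so no chunk boundary falls mid-paragraph. The function name `chunk_paragraphs`, the blank-line paragraph delimiter, and the `max_chars` budget are assumptions for illustration; index.py may split differently.

```python
def chunk_paragraphs(text, max_chars=1500):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized
    chunk rather than being cut mid-sentence. max_chars is illustrative.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or the chunk is empty and must take it.
            current = candidate
        else:
            # Budget exceeded: close the current chunk, start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

page = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
chunks = chunk_paragraphs(page, max_chars=25)
```

In the example, the first two paragraphs share a chunk (22 characters including the separator) and the third starts a new one.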

Two-model split. A cheap fast model (8B) handles optional query rewriting; a strong model (70B) handles the actual grounded answer. Rewriting is gated — short, specific queries skip it, because an ablation showed rewriting hurt exact-term lookups while helping vague ones.
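
One way such a gate could look, as a hedged sketch: treat a query as "short and specific" when it has few tokens and contains an exact-term signal such as an acronym, a digit, or a quoted phrase. The function `needs_rewrite`, the signals it checks, and the six-token cutoff are hypothetical; the actual gating rule in the repo is not described beyond "short, specific queries skip it".

```python
import re

def needs_rewrite(query, max_specific_tokens=6):
    """Return True if the query looks vague enough to be worth rewriting.

    Heuristic only: short queries carrying an exact-term signal are left
    untouched so that BM25 can match them literally.
    """
    tokens = re.findall(r"\w+", query)
    has_acronym = any(t.isupper() and len(t) >= 2 for t in tokens)
    has_digit = any(ch.isdigit() for ch in query)
    has_quote = '"' in query
    is_short = len(tokens) <= max_specific_tokens
    is_specific = has_acronym or has_digit or has_quote
    return not (is_short and is_specific)

skip = needs_rewrite("DAAD scholarship deadline")                  # exact-term lookup
rewrite = needs_rewrite("how do I get started with studying there")  # vague
```

Only queries that pass this check would be sent to the 8B model, which also saves one API call on the lookups that rewriting was found to hurt.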

Hybrid over dense-only. BM25 catches exact tokens — programme names, acronyms, proper nouns — that dense embeddings smooth over. Dense catches paraphrase and cross-language matches. Reciprocal Rank Fusion combines them without trying to reconcile their incompatible score scales.
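
Reciprocal Rank Fusion is small enough to show in full. The sketch below assumes each retriever returns a best-first list of chunk ids and uses the constant k = 60 that is common in the literature; the value used in paderbot.py and the chunk ids shown are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, never raw scores, which
    is why cosine similarities and BM25 scores need no normalisation.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["c12", "c07", "c31"]   # hypothetical chunk ids from the e5 index
bm25_hits = ["c07", "c44", "c12"]    # hypothetical chunk ids from BM25
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
# fused == ["c07", "c12", "c44", "c31"]
```

Chunks found by both retrievers (c07, c12) rise above chunks found by only one, regardless of how large either retriever's raw score was.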


Evaluation

Evaluated on a hand-built 30-question benchmark (12 English easy lookups, 6 English multi-hop, 8 German, 4 should-refuse questions) using three RAGAS-style metrics implemented from scratch — faithfulness, answer relevance, and context precision.

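| Group     | n  | Faithfulness | Answer Rel. | Context Prec. |
|-----------|----|--------------|-------------|---------------|
| All       | 30 | 0.80         | 0.91        | 0.71          |
| English   | 21 | 0.79         | 0.91        | 0.66          |
| German    | 9  | 0.84         | 0.92        | 0.80          |
| Easy      | 18 | 0.83         | 0.92        | 0.60          |
| Multi-hop | 8  | 0.73         | 0.91        | 0.80          |
| Refusal   | 4  | n/a          | n/a         | 0.98 *        |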

English and German faithfulness are within 0.05 of each other — the central result, showing the multilingual approach holds up rather than quietly degrading on German.

* Context precision on the refusal set is an artifact, not a real 0.98. The relevance judge matched surface keywords (e.g. it counted a page mentioning the "International Relations Office" as relevant to a question about the US president). The bot correctly refused all four of these questions; the high number reflects a limitation of a lightweight LLM-as-judge on adversarial questions, not retrieval quality. It is documented here rather than re-run until the number looked better.

Methodology caveats (stated up front, the way they'd come up in an interview): 30 questions is enough to be indicative but not statistically robust; the judge is from the same model family as the generator, so same-family bias is possible; and the results are scoped to this single Paderborn corpus and should not be assumed to generalise to other domains.
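
For readers unfamiliar with the metric, context precision can be sketched as follows, assuming the from-scratch version in evaluate.py follows the usual RAGAS-style definition (this README does not spell it out). The judge labels each retrieved chunk as relevant or not, and the score averages precision@k over the ranks that hold a relevant chunk. The function name and input format are illustrative.

```python
def context_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    relevance: list of 0/1 judgments for the retrieved chunks, best first.
    Averages precision@k over the positions k that hold a relevant chunk,
    so relevant chunks ranked early score higher than the same chunks
    ranked late. Returns 0.0 when nothing is judged relevant.
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

early = context_precision([1, 0, 1])     # (1/1 + 2/3) / 2
late = context_precision([0, 0, 1])      # (1/3) / 1
inflated = context_precision([1, 1, 1])  # judge marks everything relevant
```

The last case shows how the refusal-set artifact arises: the metric trusts the judge's labels entirely, so a judge that accepts keyword matches as relevant drives the score toward 1.0 no matter what was retrieved.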


Running it locally

# 1. Install
pip install -r requirements.txt

# 2. Set your Groq API key (free tier: https://console.groq.com)
export GROQ_API_KEY="your_key_here"

# 3. Run
streamlit run app.py

The prebuilt index (chroma_db/) and scraped corpus (data/pages.jsonl) ship with the repo, so there is no scrape or index step to run first. To rebuild from scratch instead, run python scrape.py, then python index.py.


Project layout

| File | Purpose |
|------|---------|
| app.py | Streamlit UI: bilingual, example questions, citations, refusal handling |
| paderbot.py | Core PaderBot class: retrieval, fusion, gating, generation |
| scrape.py, discover_urls.py | Corpus collection (curated-URL crawl + clean extraction) |
| index.py | Paragraph-aware chunking + e5 embedding into ChromaDB |
| eval_set.py, evaluate.py | 30-question benchmark + from-scratch RAGAS metrics |
| chroma_db/ | Prebuilt vector index (ships with repo) |
| data/pages.jsonl | Scraped corpus (ships with repo) |

Decisions

  • Curated-URL crawl, not recursive. A broad crawl pulled in 1,200+ noisy URLs (news, research, PhD, equality-office pages). Scoping to ~14 English-taught programmes plus universal info kept the corpus relevant to the actual user — a prospective applicant.
  • Detected "fake English" pages. Paderborn serves a 200-OK English URL for German-only programmes whose body is still German. Caught and removed these so the English index isn't polluted with German content.
  • Scoped out library operational pages. A prospective applicant doesn't yet need loan rules — that detail comes once you've enrolled. Kept 3 overview pages, dropped ~30 operational ones.
  • Single-turn by design. No conversational memory: every answer is grounded in freshly retrieved context, which keeps the citation story clean. Multi-turn is a deliberate future extension, not an oversight.
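
The "fake English" check in the list above could be approximated with a stopword count, shown here as a hedged sketch. The README does not say how the detection was done; the stopword lists, the `margin` parameter, the `/en/` URL convention, and the example.org URL are all assumptions, and a language-identification library would be a sturdier choice in practice.

```python
import re

# Small illustrative stopword lists; not exhaustive.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "nicht", "für", "mit",
                    "ein", "eine", "den", "von", "zu", "auf", "sich", "werden"}
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "for", "with", "that",
                     "are", "this", "on", "be", "as", "you", "not"}

def looks_german(text, margin=1.5, min_hits=3):
    """Return True if German stopwords clearly outnumber English ones."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    de = sum(w in GERMAN_STOPWORDS for w in words)
    en = sum(w in ENGLISH_STOPWORDS for w in words)
    return de >= min_hits and de > margin * en

def is_fake_english(url, body):
    """Flag pages served under an English path whose body reads as German."""
    return "/en/" in url and looks_german(body)

body_de = ("Der Studiengang ist nicht für Bewerber mit einem Abschluss in "
           "Englisch und wird auf Deutsch unterrichtet.")
body_en = ("The programme is taught in English and is open to applicants "
           "with a bachelor's degree.")
flagged = is_fake_english("https://example.org/en/studies/x", body_de)
kept = is_fake_english("https://example.org/en/studies/y", body_en)
```

Checking the body rather than the HTTP status is the key idea, since these pages return 200 OK like any genuine English page.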

Limitations & future work

  • Stale data. The corpus is a snapshot; deadlines and fees change. A scheduled re-scrape + re-index would keep it current.
  • Same-page domination. Retrieval sometimes returns several chunks from one page. MMR-style diversity reranking would spread coverage.
  • Free-tier concurrency. The live demo runs on one shared Groq key, so heavy concurrent traffic can hit rate limits.
  • Judge strength. A stronger, different-family judge model would tighten the evaluation, especially on adversarial refusal cases.
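
The MMR-style reranking mentioned in the list above is not implemented in the repo; as a sketch of what it would involve, Maximal Marginal Relevance picks chunks one at a time, trading relevance to the query against similarity to chunks already picked. The function names, the toy two-dimensional vectors, and the `lam` values are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, candidates, top_k=3, lam=0.7):
    """Maximal Marginal Relevance selection.

    candidates: dict mapping chunk id -> embedding vector.
    lam=1.0 is pure relevance; lower values penalise redundancy harder.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_k:
        def score(cid):
            relevance = cosine(query_vec, remaining[cid])
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

query = [1.0, 0.0]
cands = {
    "page_a#1": [1.0, 0.05],  # two near-duplicate chunks from one page
    "page_a#2": [1.0, 0.06],
    "page_b#1": [0.8, 0.6],   # less relevant, but from a different page
}
diverse = mmr(query, cands, top_k=2, lam=0.3)
greedy = mmr(query, cands, top_k=2, lam=1.0)
```

With `lam=1.0` both slots go to page_a; with `lam=0.3` the second slot goes to page_b, which is the coverage spread the limitation describes.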

Stack

Python · sentence-transformers · ChromaDB · rank-bm25 · Groq (Llama 3.1 / 3.3) · Streamlit
