# 📘 Semantic Search & QA over Policy Documents

GitHub · vxrachit/Semantic-Search-QA-over-Policy-Documents

A FastAPI backend that ingests policy PDFs, indexes them for vector search with FAISS, and answers questions about them using Google Gemini, with citations for every answer and a separate index per user.


# Feature Highlights

| Feature | Role |
|---|---|
| 📄 PDF Ingestion | Upload one or more policy PDFs, tied to a specific user_id |
| 🧮 Embeddings | Convert document text into dense vectors with sentence-transformers |
| 🔍 Semantic Search | Retrieve top-k context snippets using FAISS similarity search |
| 🤖 Gemini QA | Generate natural language answers with the Google Gemini API |
| 📂 Supabase Storage | Store raw PDFs and per-user FAISS indexes |
| 📑 Citations | Every answer includes the source document and page reference |
| 👤 User Isolation | Each user gets a separate vector index in Supabase |

# Workflow Architecture

flowchart TD
    ingest["📄 User Uploads PDF (/ingest)"] --> apiIngest["⚡ FastAPI Endpoint (/ingest)"]
    apiIngest --> extract["📑 Text Extraction (PyPDF / PyMuPDF)"]
    extract --> embed["🧮 Embeddings (SentenceTransformers)"]
    embed --> faissIndex["🔍 FAISS Index Creation"]
    faissIndex --> supabase["☁️ Store Index + PDF in Supabase"]

    query["❓ User Sends Query (/query)"] --> apiQuery["⚡ FastAPI Endpoint (/query)"]
    apiQuery --> retrieve["🔎 Retrieve Top-K Chunks from FAISS"]
    retrieve --> prompt["📝 Build Context + Prompt"]
    prompt --> gemini["🤖 Google Gemini API"]
    gemini --> answer["📑 Answer with Citations"]
    answer --> response["📦 Return JSON Response"]


    %% Styling
    classDef node fill:#f9f9f9,stroke:#333,stroke-width:1px,font-size:14px,padding:10px;
    class ingest,apiIngest,extract,embed,faissIndex,supabase,query,apiQuery,retrieve,prompt,gemini,answer,response node;




# Dependency Breakdown

# ⚙️ Core API Framework

  • fastapi → Defines endpoints (/ingest, /query) and handles the request/response lifecycle.
  • uvicorn → ASGI server used to run the FastAPI app.
  • starlette → Underlying ASGI framework (routing, middleware).
  • anyio, h11, httptools, websockets → Async I/O, HTTP, and WebSocket support.

# 📄 PDF Parsing

  • PyMuPDF (fitz) → Extracts text from PDFs.
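
The contents of `app/chunking.py` are not shown in this README, so the following is only a minimal sketch of how per-page text (for instance, the strings returned by PyMuPDF's `page.get_text()`) might be split into overlapping chunks that keep the document name and page number needed for citations. The `Chunk` dataclass, the `chunk_pages` function, and the size and overlap defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical record for one chunk of text (not taken from the repository)."""
    doc_name: str
    page: int   # 1-based page number, used later for citations
    text: str


def chunk_pages(doc_name, pages, max_words=200, overlap=40):
    """Split per-page text into overlapping word windows.

    `pages` is a list of page strings in reading order (index 0 is page 1).
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    step = max_words - overlap
    chunks = []
    for page_no, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append(Chunk(doc_name, page_no, " ".join(window)))
            if start + max_words >= len(words):
                break  # this window already reached the end of the page
    return chunks
```

Chunking page by page keeps every chunk attributable to exactly one page, at the cost of never joining a sentence that runs across a page break.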

# 🔎 Vector Search & Embeddings

  • faiss-cpu → High-performance vector search.
  • sentence-transformers → Embedding models (all-MiniLM-L6-v2).
  • transformers, tokenizers, torch → HuggingFace stack powering embeddings.
  • safetensors → Efficient model weight storage.
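
To make the retrieval step concrete, here is a dependency-free sketch of top-k search by cosine similarity. It is a brute-force stand-in for what a FAISS index does far more efficiently, not the project's `vectorstore.py`; the function names and toy vectors are illustrative assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, vectors, k=5):
    """Return (index, score) pairs for the k vectors most similar to `query`."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384 dimensions.
    docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
    for idx, score in top_k([1.0, 0.1, 0.0], docs, k=2):
        print(idx, round(score, 4))
```

The returned indices would map back to stored chunk metadata (document name, page, text), which is what makes page-level citations possible.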

# 🤖 LLM Integration (Gemini)

  • google-generativeai → Official SDK for Gemini.
  • google-ai-generativelanguage, google-api-core, google-auth, grpcio → Google Cloud auth and API stack.
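
The exact prompt used in `app/rag.py` is not documented here. As a rough sketch of how retrieved chunks could be turned into a grounded prompt that encourages the `[Doc: ..., p.N]` citation style seen in the example response, one might write something like the following. The `build_prompt` name, the chunk dictionary shape, and the instruction wording are all assumptions.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    `chunks` is a list of dicts with `doc_name`, `page`, and `text` keys
    (a hypothetical shape chosen for this sketch).
    """
    context_blocks = []
    for chunk in chunks:
        # Tag each passage so the model can copy the tag as a citation.
        tag = f"[Doc: {chunk['doc_name']}, p.{chunk['page']}]"
        context_blocks.append(f"{tag}\n{chunk['text']}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite each claim with the tag of the passage it came from, "
        "for example [Doc: file.pdf, p.3]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string would then be sent to Gemini, for example through the SDK's `GenerativeModel.generate_content` call, with the model name taken from `GEMINI_MODEL`.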

# ☁️ Supabase Integration

  • supabase, supabase_auth, supabase_functions, storage3, postgrest → Supabase storage, Postgres, and auth APIs.
  • PyJWT → JSON Web Tokens for user isolation.
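
How objects are named inside the Supabase bucket is not specified in this README. One plausible way to keep users apart is to derive every object key from a validated `user_id`, as in the sketch below. The `users/<id>/...` layout and all function names are hypothetical, not the repository's actual scheme.

```python
import re

# Hypothetical rule: user ids are short and limited to URL-safe characters.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def user_prefix(user_id):
    """Return the storage prefix for one user, rejecting unsafe identifiers."""
    if not _SAFE_ID.fullmatch(user_id):
        raise ValueError(f"invalid user_id: {user_id!r}")
    return f"users/{user_id}"


def pdf_key(user_id, filename):
    """Object key for an uploaded PDF (assumed layout)."""
    # Keep only the final path component so "../x.pdf" cannot escape the prefix.
    name = filename.replace("\\", "/").rsplit("/", 1)[-1]
    if name in ("", ".", ".."):
        raise ValueError("filename must name a file")
    return f"{user_prefix(user_id)}/pdfs/{name}"


def index_key(user_id):
    """Object key for the user's serialized FAISS index (assumed layout)."""
    return f"{user_prefix(user_id)}/index.faiss"
```

Validating the identifier before building a key means a crafted `user_id` such as `../other` is rejected instead of resolving into another user's folder.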

# 📦 Utilities & Helpers

  • python-dotenv → Load .env configs.
  • requests, httpx → HTTP requests.
  • joblib, scikit-learn, scipy, numpy → Vector operations & preprocessing.
  • tqdm → Progress bars during ingestion.
  • regex → Text cleaning.
  • watchfiles → Live reload for dev.
  • colorama, click → Console utilities.

# Project Structure

.
├── .gitignore
├── LICENSE
├── app
│   ├── chunking.py
│   ├── config.py
│   ├── main.py
│   ├── pdf_utils.py
│   ├── rag.py
│   ├── static
│   │   └── favicon.ico
│   ├── storage.py
│   └── vectorstore.py
├── render.yaml
└── requirements.txt

# Setup & Run

# 1. clone repo
git clone https://github.com/vxrachit/Semantic-Search-QA-over-Policy-Documents.git
cd Semantic-Search-QA-over-Policy-Documents

# 2. create env
python -m venv .venv
source .venv/bin/activate   # Mac/Linux
.venv\Scripts\activate      # Windows

# 3. install dependencies
pip install -r requirements.txt

# Configure .env

GOOGLE_API_KEY=your_gemini_api_key
GEMINI_MODEL=gemini-2.0-flash
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2


SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_ROLE_KEY=your_supabase_key
SUPABASE_BUCKET=policyqa
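
As an illustration of how these variables might be read at startup, here is a small standard-library sketch that fails fast when a secret is missing. It is not the project's `app/config.py`. The `Settings` class and `load_settings` function are hypothetical, and treating the example values above as defaults is an assumption. In the real app, python-dotenv's `load_dotenv()` would typically populate `os.environ` from `.env` first.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names mirror the .env keys."""
    google_api_key: str
    supabase_url: str
    supabase_service_role_key: str
    gemini_model: str
    embed_model: str
    supabase_bucket: str


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("GOOGLE_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        google_api_key=env["GOOGLE_API_KEY"],
        supabase_url=env["SUPABASE_URL"],
        supabase_service_role_key=env["SUPABASE_SERVICE_ROLE_KEY"],
        # Assumed defaults, copied from the example .env values.
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.0-flash"),
        embed_model=env.get("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        supabase_bucket=env.get("SUPABASE_BUCKET", "policyqa"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.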

# Run API

uvicorn app.main:app --reload

👉 http://127.0.0.1:8000
👉 http://127.0.0.1:8000/docs


# API Endpoints

# /ingest → Upload & embed PDFs

Uploads PDFs → extracts text → builds embeddings → saves FAISS index in Supabase.

curl -X POST "http://127.0.0.1:8000/ingest" \
  -F "user_id=demo123" \
  -F "files=@NEP_Final_English_0.pdf"

# /query → Ask a question

Retrieves top-k chunks → builds Gemini prompt → returns citation-backed answer.

{
  "user_id": "demo123",
  "question": "What is Education Policy 2020?",
  "top_k": 5
}
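
For readers who prefer to call the endpoint from Python, the sketch below builds and sends the request body shown above using only the standard library. It assumes `/query` accepts a JSON body via POST, which the README implies but does not state; the helper names `build_query_request` and `ask` are made up for this example.

```python
import json
import urllib.request


def build_query_request(base_url, user_id, question, top_k=5):
    """Create (but do not send) a JSON POST request for the /query endpoint."""
    payload = {"user_id": user_id, "question": question, "top_k": top_k}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the endpoint is a POST route
    )


def ask(base_url, user_id, question, top_k=5, timeout=60):
    """Send the request and return the decoded JSON response."""
    request = build_query_request(base_url, user_id, question, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `ask("http://127.0.0.1:8000", user_id, question)` should return a dictionary with `answer` and `sources` keys, matching the response format documented in this section.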

Example response:

{
  "answer": "National Education Policy 2020 is a policy from the Ministry of Human Resource Development, Government of India [Doc: NEP_Final_English_0.pdf, p.1]. A major development since the last Policy of 1986/92 has been the Right of Children to Free and Compulsory Education Act 2009 [Doc: NEP_Final_English_0.pdf, p.5].",
  "sources": [
    {
      "doc_name": "NEP_Final_English_0.pdf",
      "page": 1,
      "score": 0.6943,
      "preview": "1 National Education Policy 2020 Ministry of Human Resource Development Government of India"
    },
    {
      "doc_name": "NEP_Final_English_0.pdf",
      "page": 5,
      "score": 0.6855,
      "preview": "this Policy. A major development since the last Policy of 1986/92 has been the Right of Children to Free and Compulsory Education Act 2009 which laid down legal…"
    },
    {
      "doc_name": "NEP_Final_English_0.pdf",
      "page": 63,
      "score": 0.6647,
      "preview": "National Education Policy 2020 62 systematic manner. Therefore, the implementation of this Policy will be led by various bodies including MHRD, CABE, Union and …"
    },
    {
      "doc_name": "NEP_Final_English_0.pdf",
      "page": 3,
      "score": 0.6631,
      "preview": "National Education Policy 2020 2 19 Effective Governance and Leadership for Higher Education Institutions 49 PART III. OTHER KEY AREAS OF FOCUS 20 Professional …"
    },
    {
      "doc_name": "NEP_Final_English_0.pdf",
      "page": 32,
      "score": 0.6565,
      "preview": "National Education Policy 2020 31 8.4. The public education system is the foundation of a vibrant democratic society, and the way it is run must be transformed …"
    }
  ]
}
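
Because citations are embedded in the answer text as `[Doc: <name>, p.<page>]`, a client can cross-check them against the `sources` array. The following is a small sketch of that check. The regular expression is inferred from the example response above and the function names are illustrative; document names containing a comma or `]` would not be matched.

```python
import re

# Pattern inferred from the sample answer: [Doc: <name>, p.<page>]
CITATION = re.compile(r"\[Doc:\s*(?P<doc>[^,\]]+),\s*p\.(?P<page>\d+)\]")


def extract_citations(answer):
    """Return the (doc_name, page) pairs cited in an answer, in order of appearance."""
    return [
        (match.group("doc").strip(), int(match.group("page")))
        for match in CITATION.finditer(answer)
    ]


def unsupported_citations(response):
    """List citations in `answer` that do not match any entry in `sources`."""
    known = {(src["doc_name"], src["page"]) for src in response["sources"]}
    return [cite for cite in extract_citations(response["answer"]) if cite not in known]
```

An empty list from `unsupported_citations` means every citation in the answer points at a retrieved source; a non-empty list flags references the model may have invented.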

# License

MIT © 2025 Rachit Verma