#
📘 Semantic Search & QA over Policy Documents
GitHub · vxrachit/Semantic-Search-QA-over-Policy-Documents
A FastAPI backend that ingests PDFs, indexes them with FAISS for vector search, and answers questions over policy documents using Google Gemini, with citations and per-user isolation.
#
Feature Highlights
#
Workflow Architecture
flowchart TD
ingest["User Uploads PDF (/ingest)"] --> apiIngest["FastAPI Endpoint (/ingest)"]
apiIngest --> extract["Text Extraction (PyPDF / PyMuPDF)"]
extract --> embed["Embeddings (SentenceTransformers)"]
embed --> faissIndex["FAISS Index Creation"]
faissIndex --> supabase["Store Index + PDF in Supabase"]
query["User Sends Query (/query)"] --> apiQuery["FastAPI Endpoint (/query)"]
apiQuery --> retrieve["Retrieve Top-K Chunks from FAISS"]
retrieve --> prompt["Build Context + Prompt"]
prompt --> gemini["Google Gemini API"]
gemini --> answer["Answer with Citations"]
answer --> response["Return JSON Response"]
%% Styling
classDef node fill:#f9f9f9,stroke:#333,stroke-width:1px,font-size:14px,padding:10px;
class ingest,apiIngest,extract,embed,faissIndex,supabase,query,apiQuery,retrieve,prompt,gemini,answer,response node;
#
Dependency Breakdown
#
⚙️ Core API Framework
- fastapi: Defines endpoints (`/ingest`, `/query`) and handles the request/response lifecycle.
- uvicorn: ASGI server used to run the FastAPI app.
- starlette: Underlying ASGI framework (routing, middleware).
- anyio, h11, httptools, websockets: Async I/O, HTTP, and WebSocket support.
#
📄 PDF Parsing
- PyMuPDF (fitz): Extracts text from PDFs.
#
🔎 Vector Search & Embeddings
- faiss-cpu: High-performance vector search.
- sentence-transformers: Embedding models (`all-MiniLM-L6-v2`).
- transformers, tokenizers, torch: HuggingFace stack powering embeddings.
- safetensors: Efficient model weight storage.
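To make the retrieval step concrete, here is a minimal pure-Python sketch of top-k similarity search over normalized vectors. It only illustrates the idea that a flat inner-product index applies at scale; it does not use FAISS itself, and whether this project uses an inner-product or an L2 index is not stated here. The helper names `normalize` and `top_k` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical data);
# real all-MiniLM-L6-v2 vectors have 384 dimensions.
chunks = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
results = top_k([1.0, 0.1, 0.0], chunks, k=2)
```

A brute-force scan like this is linear in the number of chunks, which is why a dedicated vector index is used once the corpus grows.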
#
🤖 LLM Integration (Gemini)
- google-generativeai: Official SDK for Gemini.
- google-ai-generativelanguage, google-api-core, google-auth, grpcio: Google Cloud auth and API stack.
#
☁️ Supabase Integration
- supabase, supabase_auth, supabase_functions, storage3, postgrest: Supabase storage, Postgres, and auth APIs.
- PyJWT: JSON Web Tokens for user isolation.
#
📦 Utilities & Helpers
- python-dotenv: Loads `.env` configs.
- requests, httpx: HTTP requests.
- joblib, scikit-learn, scipy, numpy: Vector operations & preprocessing.
- tqdm: Progress bars during ingestion.
- regex: Text cleaning.
- watchfiles: Live reload for dev.
- colorama, click: Console utilities.
#
Project Structure
.
├── .gitignore
├── LICENSE
├── app
│   ├── chunking.py
│   ├── config.py
│   ├── main.py
│   ├── pdf_utils.py
│   ├── rag.py
│   ├── static
│   │   └── favicon.ico
│   ├── storage.py
│   └── vectorstore.py
├── render.yaml
└── requirements.txt
#
Setup & Run
# 1. clone repo
git clone https://github.com/vxrachit/Semantic-Search-QA-over-Policy-Documents.git
cd Semantic-Search-QA-over-Policy-Documents
# 2. create env
python -m venv .venv
source .venv/bin/activate # Mac/Linux
.venv\Scripts\activate # Windows
# 3. install dependencies
pip install -r requirements.txt
#
Configure .env
GOOGLE_API_KEY=your_gemini_api_key
GEMINI_MODEL=gemini-2.0-flash
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_ROLE_KEY=your_supabase_key
SUPABASE_BUCKET=policyqa
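As an illustration of how these variables might be consumed, the sketch below reads them from the environment into a typed object and fails early when a secret is missing. It is not the contents of `app/config.py`: the `Settings` class, the `load_settings` helper, and the choice of defaults (taken from the sample values above) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    gemini_model: str
    embed_model: str
    supabase_url: str
    supabase_service_role_key: str
    supabase_bucket: str


# Variables with no sensible default (assumption: these three are mandatory).
REQUIRED = ("GOOGLE_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        google_api_key=env["GOOGLE_API_KEY"],
        # Defaults mirror the sample .env values; adjust as needed.
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.0-flash"),
        embed_model=env.get("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        supabase_url=env["SUPABASE_URL"],
        supabase_service_role_key=env["SUPABASE_SERVICE_ROLE_KEY"],
        supabase_bucket=env.get("SUPABASE_BUCKET", "policyqa"),
    )
```

With python-dotenv installed, calling `load_dotenv()` before `load_settings()` copies the `.env` file into `os.environ` first. Passing a plain dict instead of `os.environ` keeps the function easy to test.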
#
Run API
uvicorn app.main:app --reload
👉 http://127.0.0.1:8000
👉 http://127.0.0.1:8000/docs
#
API Endpoints
#
/ingest: Upload & embed PDFs
Uploads PDFs → extracts text → builds embeddings → saves FAISS index in Supabase.
curl -X POST "http://127.0.0.1:8000/ingest" -F "user_id=demo123" -F "files=@NEP_Final_English_0.pdf"
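Between text extraction and embedding, the page text has to be split into chunks that keep their page number, since the response cites pages. The following is a rough sketch of one common approach, a sliding word window with overlap. The actual logic in `app/chunking.py` is not shown in this README, so the `Chunk` class, `chunk_pages` function, and the `max_words`/`overlap` parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Chunk:
    doc_name: str
    page: int
    text: str


def chunk_pages(
    doc_name: str,
    pages: Iterable[Tuple[int, str]],
    max_words: int = 200,
    overlap: int = 40,
) -> List[Chunk]:
    """Split each page into overlapping word windows, remembering the page number."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    step = max_words - overlap
    chunks: List[Chunk] = []
    for page_no, text in pages:
        words = text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append(Chunk(doc_name, page_no, " ".join(window)))
            # Stop once the window has reached the end of the page.
            if start + max_words >= len(words):
                break
    return chunks


# Tiny example with (page_number, page_text) pairs (hypothetical data).
chunks = chunk_pages(
    "policy.pdf",
    [(1, "one two three four five six"), (2, "seven eight")],
    max_words=4,
    overlap=1,
)
```

Chunking per page keeps citations exact at the cost of never spanning a page boundary; a sentence split across two pages ends up in two chunks.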
#
/query: Ask a question
Retrieves top-k chunks → builds Gemini prompt → returns citation-backed answer.
{
"user_id": "demo123",
"question": "What is Education Policy 2020?",
"top_k": 5
}
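The same request can be issued from Python. This sketch uses only the standard library and assumes that `/query` accepts this JSON body via POST; the helper names `build_query_request` and `ask` are hypothetical.

```python
import json
import urllib.request


def build_query_request(
    base_url: str, user_id: str, question: str, top_k: int = 5
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the /query endpoint."""
    payload = {"user_id": user_id, "question": question, "top_k": top_k}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, user_id: str, question: str, top_k: int = 5) -> dict:
    """Send the request and decode the JSON response (needs the API running)."""
    request = build_query_request(base_url, user_id, question, top_k)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Use the same user_id that was passed to /ingest, since data is isolated per user.
request = build_query_request(
    "http://127.0.0.1:8000", "demo123", "What is Education Policy 2020?"
)
```

Calling `ask(...)` with the server running would return a dictionary shaped like the example response.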
Example response:
{
"answer": "National Education Policy 2020 is a policy from the Ministry of Human Resource Development, Government of India [Doc: NEP_Final_English_0.pdf, p.1]. A major development since the last Policy of 1986/92 has been the Right of Children to Free and Compulsory Education Act 2009 [Doc: NEP_Final_English_0.pdf, p.5].",
"sources": [
{
"doc_name": "NEP_Final_English_0.pdf",
"page": 1,
"score": 0.6943,
"preview": "1 National Education Policy 2020 Ministry of Human Resource Development Government of India"
},
{
"doc_name": "NEP_Final_English_0.pdf",
"page": 5,
"score": 0.6855,
"preview": "this Policy. A major development since the last Policy of 1986/92 has been the Right of Children to Free and Compulsory Education Act 2009 which laid down legal…"
},
{
"doc_name": "NEP_Final_English_0.pdf",
"page": 63,
"score": 0.6647,
"preview": "National Education Policy 2020 62 systematic manner. Therefore, the implementation of this Policy will be led by various bodies including MHRD, CABE, Union and …"
},
{
"doc_name": "NEP_Final_English_0.pdf",
"page": 3,
"score": 0.6631,
"preview": "National Education Policy 2020 2 19 Effective Governance and Leadership for Higher Education Institutions 49 PART III. OTHER KEY AREAS OF FOCUS 20 Professional …"
},
{
"doc_name": "NEP_Final_English_0.pdf",
"page": 32,
"score": 0.6565,
"preview": "National Education Policy 2020 31 8.4. The public education system is the foundation of a vibrant democratic society, and the way it is run must be transformed …"
}
]
}
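The `[Doc: name, p.N]` tags in the answer suggest that each retrieved chunk is labelled before being sent to Gemini. One way this could work is sketched below, along with a helper that pulls the citations back out of an answer so they can be checked against `sources`. The prompt wording, the `text` key on each source, and both function names are assumptions, not the project's actual prompt.

```python
import re
from typing import Dict, List, Tuple

# Matches tags such as [Doc: NEP_Final_English_0.pdf, p.5]
CITATION = re.compile(r"\[Doc: ([^,\]]+), p\.(\d+)\]")


def build_prompt(question: str, sources: List[Dict]) -> str:
    """Label each chunk with its citation tag and assemble the prompt text."""
    context = "\n".join(
        f"[Doc: {s['doc_name']}, p.{s['page']}] {s['text']}" for s in sources
    )
    return (
        "Answer the question using only the context below. "
        "Cite each claim with its [Doc: name, p.N] tag.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def extract_citations(answer: str) -> List[Tuple[str, int]]:
    """Return (doc_name, page) pairs for every citation tag in an answer."""
    return [(name, int(page)) for name, page in CITATION.findall(answer)]


# Hypothetical retrieved chunk, using a preview from the example response as its text.
prompt = build_prompt(
    "What is Education Policy 2020?",
    [{
        "doc_name": "NEP_Final_English_0.pdf",
        "page": 1,
        "text": "National Education Policy 2020 Ministry of Human Resource Development",
    }],
)
cited = extract_citations(
    "It is a policy [Doc: NEP_Final_English_0.pdf, p.1] that builds on the "
    "2009 Act [Doc: NEP_Final_English_0.pdf, p.5]."
)
```

Comparing `cited` with the `(doc_name, page)` pairs in `sources` is a cheap way to flag answers that cite a page that was never retrieved.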
#
License
MIT © 2025 Rachit Verma