# SpecForge — BIS Standards Recommendation Engine
> **BIS × Sigma Squad AI Hackathon** | Track: AI / Retrieval Augmented Generation (RAG)
>
> An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.
---
## Public Test Set Results
> Evaluated on the 10 provided public queries. Judges run: `python inference.py --input <hidden_dataset>.json --output team_results.json`
| Metric | Target | **Our Score** |
|---|---|---|
| Hit Rate @3 | > 80% | **100%** (10/10) |
| MRR @5 | > 0.7 | **0.950** |
| Avg Latency | < 5 s | **~18 ms** |
All 10 public queries returned the expected standard in the top 2 results. Average query latency is about 18 ms once the index is warm, more than 250× faster than the 5 s target.
---
## What It Does
Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge eliminates that.
1. **Describe your product** in plain language — e.g. *"We manufacture 33 Grade Ordinary Portland Cement"*
2. **Get ranked BIS standards** with matched sections and relevance scores in milliseconds
3. **Read AI explanations** of why each standard applies, generated by Groq LLM
The system covers all **586 unique standards** across **25 building material categories** from BIS SP-21 (Summaries of Indian Standards for Building Materials).
---
## System Architecture
### Data Flow
```
data/raw/dataset.pdf  (BIS SP-21, 929 pages)
  → src/parse_bis_pdf.py
      → data/processed/standards.json          586 structured records  [committed]
      → data/processed/standards_chunks.json   1,269 RAG-ready chunks  [committed]
  → inference.py --build
      → data/processed/embeddings.npy          dense vectors           [gitignored, rebuild locally]
      → data/processed/faiss.index             FAISS index             [gitignored, rebuild locally]
```
### Request Pipeline
```
Browser / API Client
  → POST /api/recommend  { query, top_n, rewrite }
  → Express server (web/server/index.js)
      ├─ [optional] llmService.rewriteQuery()       Groq: expands to IS-standard vocabulary
      ├─ retrieverService.retrieve()
      │    └─ PythonRetriever singleton             EventEmitter, queues concurrent requests
      │         └─ bridge/retrieve.py daemon        stdin/stdout newline-delimited JSON
      │              └─ inference.py                FAISS 0.6 + BM25 0.4 → re-rank → top-N
      └─ llmService.generateExplanation() × N       Promise.allSettled: parallel, non-blocking
  → JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
```
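
The daemon layer in the pipeline above exchanges newline-delimited JSON over stdin/stdout. The loop below is a minimal sketch of that pattern, not the actual `bridge/retrieve.py`: the message fields (`id`, `query`, `top_n`, `results`, `error`) and the `fake_retrieve` stand-in are assumptions made for illustration.

```python
import io
import json


def fake_retrieve(query, top_n):
    """Hypothetical stand-in for the real retrieval call in inference.py."""
    return [{"standard_id": "IS 269: 1989", "score": 0.89}][:top_n]


def serve(stdin, stdout, retrieve=fake_retrieve):
    """Read one JSON request per line and write one JSON response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            request = json.loads(line)
            results = retrieve(request["query"], request.get("top_n", 5))
            response = {"id": request.get("id"), "results": results}
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Report the failure instead of crashing the long-lived process.
            response = {"id": None, "error": str(exc)}
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # the parent process waits on complete lines


# In-memory demo: one valid request and one malformed line.
requests = io.StringIO('{"id": 1, "query": "33 grade OPC", "top_n": 1}\nnot json\n')
responses = io.StringIO()
serve(requests, responses)
```

In a real daemon the same function would be called with `sys.stdin` and `sys.stdout`. Flushing after every line matters because the Node.js side can only resolve a queued request once a complete line arrives.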
### Chunking & Retrieval Strategy
**Chunking** (`src/parse_bis_pdf.py`):
- 4-pass boundary detection splits the 929-page PDF into per-standard records
  - Passes 1-2: primary block splitting and secondary boundary recovery
  - Pass 3: recovers scope text stolen by the preceding block (SP-21 PDF layout quirk)
  - Pass 4: truncates next-standard content bleed at a second `1. Scope` marker
- Each standard is further split by section with **50-word overlap** to prevent context loss at boundaries
- Weak chunks (<30 words) are merged with their neighbour
- Result: 1,269 chunks from 586 standards (avg 2.2 chunks/standard)
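
The overlap and merge rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/parse_bis_pdf.py`: it assumes a weak section is merged into the preceding chunk (or the following one when it comes first) and that the overlap is taken from the tail of the previous section. The function name and the `(name, text)` input shape are hypothetical.

```python
def chunk_sections(sections, overlap_words=50, min_words=30):
    """Turn ordered (name, text) sections into overlapping chunks."""
    # Step 1: merge weak sections (fewer than min_words) into a neighbour.
    merged = []
    for name, text in sections:
        words = text.split()
        if merged and len(words) < min_words:
            prev_name, prev_words = merged[-1]
            merged[-1] = (prev_name, prev_words + words)
        else:
            merged.append((name, words))
    # A weak first section is folded into the section that follows it.
    if len(merged) > 1 and len(merged[0][1]) < min_words:
        _, first_words = merged.pop(0)
        next_name, next_words = merged[0]
        merged[0] = (next_name, first_words + next_words)

    # Step 2: prefix each chunk with the tail of the previous section.
    chunks = []
    for i, (name, words) in enumerate(merged):
        prefix = merged[i - 1][1][-overlap_words:] if i > 0 else []
        chunks.append({"section": name, "text": " ".join(prefix + words)})
    return chunks


# Synthetic sections: 40, 10, and 60 words long.
sections = [("Scope", "a " * 40), ("Note", "b " * 10), ("Requirements", "c " * 60)]
chunks = chunk_sections(sections)
```

Here the 10-word `Note` section is absorbed into `Scope`, and the `Requirements` chunk starts with the last 50 words of the merged `Scope` text.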
**Hybrid Retrieval** (`inference.py`):
- **Dense**: FAISS `IndexFlatIP` with `all-MiniLM-L6-v2` embeddings (384-dim cosine similarity)
- **Sparse**: BM25Okapi with weighted document construction — title ×4, keywords ×3, section ×2, body ×1
- **Fusion**: `score = 0.6 × dense_norm + 0.4 × sparse_norm`
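
The fusion formula needs both score lists on a comparable scale before weighting. The sketch below assumes min-max normalisation over the candidate set; the README does not say which normalisation `inference.py` uses, so treat `min_max` as one plausible choice.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """Combine per-candidate dense and sparse scores with fixed weights."""
    dense_norm, sparse_norm = min_max(dense), min_max(sparse)
    return [w_dense * d + w_sparse * s for d, s in zip(dense_norm, sparse_norm)]


# Three candidates with made-up cosine similarities and raw BM25 scores.
dense_scores = [0.82, 0.75, 0.40]
sparse_scores = [3.1, 9.4, 0.0]
fused = fuse(dense_scores, sparse_scores)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

In this example the second candidate overtakes the first: its strong BM25 score outweighs a slightly lower dense score, which is the behaviour the 0.6/0.4 split is meant to allow.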
**Re-ranking** bonuses applied per candidate:
- +0.05 per overlapping keyword (max 4) between query and standard's keyword list
- +0.05 per overlapping title word (max 5)
- +0.25 if ≥60% of significant title words appear in the query (strong title match)
- +0.20 if an exact IS ID from the query matches this standard
- +0.35 / -0.40 grade discriminator: boosts/penalises OPC-grade standards (33/43/53) when query names a specific grade
- +0.30 / -0.20 part-number discriminator: boosts matching Part N and penalises non-matching parts when query explicitly names a part number (handles "Part 1", "PART2" etc.)
- -0.15 penalty for very short chunks (<40 body words)
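
A few of the bonuses above can be sketched as a single scoring function. This is a simplified illustration, not the logic in `inference.py`: the grade and part-number discriminators are omitted, and the tokeniser, the stop-word list, the IS-number regex, and the candidate field names (`keywords`, `title`, `standard_id`, `body`) are assumptions.

```python
import re

# Assumed stop-word list used to pick "significant" title words.
STOPWORDS = {"of", "for", "and", "the", "in", "to", "a", "an", "on", "with"}


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_bonus(query, candidate):
    q = tokens(query)
    bonus = 0.0

    # +0.05 per overlapping keyword, capped at 4 (single-word keywords only here).
    kw_hits = sum(1 for kw in candidate["keywords"] if kw.lower() in q)
    bonus += 0.05 * min(kw_hits, 4)

    # +0.05 per overlapping title word, capped at 5.
    title_words = tokens(candidate["title"]) - STOPWORDS
    title_hits = len(title_words & q)
    bonus += 0.05 * min(title_hits, 5)

    # +0.25 when at least 60% of significant title words appear in the query.
    if title_words and title_hits / len(title_words) >= 0.6:
        bonus += 0.25

    # +0.20 when the query names this standard's IS number.
    query_ids = set(re.findall(r"\bis\s*(\d+)", query.lower()))
    cand_id = re.search(r"\bis\s*(\d+)", candidate["standard_id"].lower())
    if cand_id and cand_id.group(1) in query_ids:
        bonus += 0.20

    # -0.15 for very short chunks (fewer than 40 body words).
    if len(candidate["body"].split()) < 40:
        bonus -= 0.15
    return bonus


candidate = {
    "standard_id": "IS 269: 1989",
    "title": "Ordinary Portland Cement, 33 Grade",
    "keywords": ["cement", "OPC"],
    "body": "word " * 50,
}
long_query_bonus = rerank_bonus("Which standard covers 33 grade ordinary portland cement?", candidate)
```

The bonus is added to the fused score, so a strong title match can lift a candidate above one that only matched on body text.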
**Post-grouping Part disambiguation**: when multiple parts of the same IS base number survive into the candidate set with identical titles, IDF-weighted discriminating keyword scores break the tie — rarer corpus terms (e.g. "lightweight") carry proportionally more weight.
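
As a rough sketch of the IDF-weighted tie-break, assuming the common `log(N / df)` form of IDF and hypothetical `id` and `keywords` fields (the exact weighting in `inference.py` is not documented here):

```python
import math
from collections import Counter


def idf_table(corpus_keywords):
    """Map each keyword to log(N / document_frequency)."""
    n_docs = len(corpus_keywords)
    df = Counter(kw for kws in corpus_keywords for kw in set(kws))
    return {kw: math.log(n_docs / count) for kw, count in df.items()}


def pick_part(query, parts, idf):
    """Return the part whose query-matching keywords carry the most IDF weight."""
    q = set(query.lower().split())

    def weight(part):
        return sum(idf.get(kw, 0.0) for kw in set(part["keywords"]) if kw in q)

    return max(parts, key=weight)


# Illustrative parts and corpus; the keyword lists are made up.
parts = [
    {"id": "Part 1", "keywords": ["concrete", "block", "hollow"]},
    {"id": "Part 2", "keywords": ["concrete", "block", "lightweight"]},
]
corpus = [p["keywords"] for p in parts] + [["concrete", "cement"], ["concrete", "hollow", "tile"]]
chosen = pick_part("lightweight hollow concrete block", parts, idf_table(corpus))
```

Because "concrete" appears in every document its weight is zero, while "lightweight" appears once and dominates the comparison.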
**Deduplication**: candidates grouped by `standard_id`; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.
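
The grouping step can be sketched in a few lines. The `standard_id` and `score` field names follow the output format used elsewhere in this README; the function name is hypothetical.

```python
def dedupe_top_n(candidates, top_n=5):
    """Keep the best-scoring chunk per standard, then return the top N."""
    best = {}
    for cand in candidates:
        sid = cand["standard_id"]
        if sid not in best or cand["score"] > best[sid]["score"]:
            best[sid] = cand
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)[:top_n]


# Two chunks of the same standard compete; only the stronger one survives.
chunks = [
    {"standard_id": "IS 269: 1989", "matched_section": "Scope", "score": 0.89},
    {"standard_id": "IS 8112: 1989", "matched_section": "Scope", "score": 0.71},
    {"standard_id": "IS 269: 1989", "matched_section": "Requirements", "score": 0.64},
]
top = dedupe_top_n(chunks, top_n=2)
```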
### Key Design Decisions
| Decision | Rationale |
|---|---|
| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
| `inference.py` never modified | Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server uses the bridge. Both paths execute the same retrieval code. |
| In-memory data | 586 standards + 1,269 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
| LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. `Promise.allSettled` for parallel calls. Server starts and retrieval works without a `GROQ_API_KEY`. |
| Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |
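
The weighted BM25 document in the last row can be built by repeating each field's tokens before indexing. The sketch below assumes a simple lowercase alphanumeric tokeniser and hypothetical chunk field names; the resulting token list is the kind of input that `rank_bm25.BM25Okapi` expects for each document.

```python
import re


def weighted_tokens(chunk):
    """Repeat field tokens so title x4, keywords x3, section x2, body x1."""
    def tok(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    return (
        tok(chunk["title"]) * 4
        + tok(" ".join(chunk["keywords"])) * 3
        + tok(chunk["section"]) * 2
        + tok(chunk["body"])
    )


doc = weighted_tokens({
    "title": "Ordinary Portland Cement",
    "keywords": ["opc"],
    "section": "Scope",
    "body": "covers cement",
})
```

Repetition raises the term frequency of title words, so a query that names the standard scores higher against that document than against one that mentions the same words only in its body.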
---
## Project Structure
```
SpecForge/
├── inference.py                      # Entry point for judges
├── requirements.txt                  # All Python dependencies
├── eval_script.py                    # Provided evaluation script (Hit@3, MRR@5, latency)
├── data/
│   └── processed/
│       ├── standards.json            # 586 parsed standards (committed)
│       ├── standards_chunks.json     # 1,269 RAG chunks (committed)
│       ├── public_test_set.json      # 10 public evaluation queries
│       └── retrieval_results.json    # Our results on public test set
├── src/
│   └── parse_bis_pdf.py              # PDF → JSON parsing pipeline
└── web/
    ├── server/
    │   ├── index.js                  # Express API, all routes
    │   ├── start.js                  # Safe launcher (kills stale port process)
    │   ├── .env.example              # Environment template
    │   ├── bridge/
    │   │   └── retrieve.py           # Daemon wrapping inference.py for the web server
    │   └── services/
    │       ├── llmService.js         # Groq wrappers with fallbacks
    │       └── retrieverService.js   # PythonRetriever, daemon lifecycle manager
    └── client/
        └── src/
            ├── App.jsx               # React router (5 pages)
            ├── api/standards.js      # Typed fetch wrappers
            ├── pages/                # Home, Standards, Categories, Recommend, About
            ├── components/           # Navbar, Footer, StandardCard, StandardModal
            └── locales/              # en/ and hi/ (English + Hindi i18n)
```
---
## External APIs & Data Sources
All sources disclosed per hackathon transparency requirements.
| Source | Purpose | Key required? | Notes |
|---|---|---|---|
| **BIS SP-21** (Bureau of Indian Standards, Special Publication 21) | Source dataset — 929-page PDF of building material standard summaries | No | Provided by organisers; processed JSON committed to repo |
| **HuggingFace `all-MiniLM-L6-v2`** | 384-dimension sentence embedding model for FAISS dense retrieval | No | Downloaded automatically by `sentence-transformers` on first `--build` (~90 MB) |
| **Groq API** (`llama-3.1-8b-instant`) | Query rewriting, per-result explanation, conversational QA | Yes — `GROQ_API_KEY` | Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key. |
No other external APIs, databases, or paid services are used.
---
## Environment Dependencies
### System Requirements
| Dependency | Minimum | Notes |
|---|---|---|
| Python | 3.10 | For retrieval pipeline and `inference.py` |
| Node.js | 18 | For Express server and React client |
| npm | 9 | Ships with Node 18 |
| `fuser` | any | Linux — used by `start.js` to clear stale port; install via `psmisc` if missing |
### Hardware
- **CPU**: Any x86-64 or ARM64 — no GPU required
- **RAM**: 2 GB minimum; index + embeddings use ~500 MB
- **GPU**: Optional — a CUDA GPU reduces index build time but `faiss-cpu` and `sentence-transformers` run fully on CPU
- **Disk**: ~1 GB free for venv and generated index files
---
## Setup & Running
### Step 1 — Clone
```bash
git clone https://github.com/kshitij-ka/SpecForge.git
cd SpecForge
```
### Step 2 — Python virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
`requirements.txt`:
```
pymupdf>=1.24.0
faiss-cpu>=1.7.4
rank-bm25>=0.2.2
sentence-transformers>=3.0.0
numpy>=1.26.0
```
> `sentence-transformers` downloads `all-MiniLM-L6-v2` (~90 MB) from HuggingFace on first use.
### Step 3 — Build the FAISS index
The processed JSON is committed. Index files are gitignored and must be built once locally.
```bash
source .venv/bin/activate
python inference.py --build
```
This encodes the 1,269 chunks and writes `embeddings.npy` and `faiss.index` to `data/processed/`. It takes **~2 min on CPU**. Subsequent starts load from cache, so no rebuild is needed unless the chunks change.
### Step 4 — Node.js dependencies
```bash
cd web/server && npm install
cd ../client && npm install
```
### Step 5 — Environment variables
```bash
cp web/server/.env.example web/server/.env
```
Edit `web/server/.env`:
```env
# Required for LLM explanations, query rewriting, and /api/chat
GROQ_API_KEY=your_groq_api_key_here
# Optional — defaults to 5000
PORT=5000
# Required if "python" is not Python 3 — point to your venv
PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3
```
> `PYTHON_BIN` accepts only `"python"`, `"python3"`, or an absolute path. The server validates and rejects arbitrary values on startup.
### Step 6 — Start the application
**Terminal 1 — API server (port 5000):**
```bash
cd web/server
npm start
```
Wait for the log line `Python retriever ready` (~20 s first boot). The server is accepting queries after that.
**Terminal 2 — Frontend dev server (port 5173):**
```bash
cd web/client
npm run dev
```
Open **http://localhost:5173**. The Vite dev server proxies all `/api/*` requests to `:5000`.
---
## Using `inference.py` (Judge Entry Point)
`inference.py` is the mandatory entry point. It runs independently of the web server.
> Always activate the virtual environment first: `source .venv/bin/activate`
### Build / force-rebuild the index
```bash
python inference.py --build
```
### Single query (interactive testing)
```bash
python inference.py --query "Which standard covers 33 grade OPC cement?"
```
Output:
```
============================================================
Query : Which standard covers 33 grade OPC cement?
Latency: 0.019s
Top results:
1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
Category: Cement and Concrete | Section: Scope | Score: 0.8921
2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
...
```
### Batch evaluation (judge command)
```bash
python inference.py \
--input data/processed/public_test_set.json \
--output data/processed/retrieval_results.json
```
Input format:
```json
[
  {
    "id": "PUB-01",
    "query": "We are a small enterprise manufacturing 33 Grade OPC...",
    "expected_standards": ["IS 269: 1989"]
  }
]
```
Output format:
```json
[
  {
    "id": "PUB-01",
    "query": "...",
    "retrieved_standards": ["IS 269: 1989", "IS 8112: 1989", "..."],
    "details": [
      {
        "standard_id": "IS 269: 1989",
        "title": "Ordinary Portland Cement, 33 Grade",
        "category": "Cement and Concrete",
        "score": 0.8921,
        "matched_section": "Scope"
      }
    ],
    "latency_seconds": 0.019,
    "expected_standards": ["IS 269: 1989"]
  }
]
```
---
## Evaluation
```bash
# Step 1: generate results
python inference.py \
--input data/processed/public_test_set.json \
--output data/processed/retrieval_results.json
# Step 2: score
python eval_script.py \
--results data/processed/retrieval_results.json
```
Targets and our results on the public set:
| Metric | Formula | Target | Achieved |
|---|---|---|---|
| Hit Rate @3 | queries with an expected standard in the top 3 / total queries | > 80% | **100%** |
| MRR @5 | Σ(1/rank_i) / N | > 0.7 | **0.950** |
| Avg Latency | total_time / num_queries | < 5 s | **~0.018 s** |
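
The two ranking metrics in the table can be computed from the output format shown earlier. This is a minimal sketch that assumes exact string matching on standard IDs; the provided `eval_script.py` may normalise IDs differently and remains the authoritative scorer.

```python
def hit_rate_at_k(results, k=3):
    """Fraction of queries with at least one expected standard in the top k."""
    hits = sum(
        1 for r in results
        if set(r["expected_standards"]) & set(r["retrieved_standards"][:k])
    )
    return hits / len(results)


def mrr_at_k(results, k=5):
    """Mean of 1/rank of the first expected standard within the top k."""
    total = 0.0
    for r in results:
        expected = set(r["expected_standards"])
        for rank, sid in enumerate(r["retrieved_standards"][:k], start=1):
            if sid in expected:
                total += 1.0 / rank
                break
    return total / len(results)


# Synthetic run: nine queries hit at rank 1, one hits at rank 2.
rank1 = {"expected_standards": ["A"], "retrieved_standards": ["A", "B", "C"]}
rank2 = {"expected_standards": ["A"], "retrieved_standards": ["B", "A", "C"]}
results = [rank1] * 9 + [rank2]
hit_rate, mrr = hit_rate_at_k(results), mrr_at_k(results)
```

With 10 queries all answered in the top 2, an MRR of 0.950 implies exactly this split: nine hits at rank 1 and one at rank 2, since (9 × 1 + 1 × 0.5) / 10 = 0.95.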
---
## API Reference
All endpoints are served by the Express server (default `http://localhost:5000`).
### `POST /api/recommend`
Core RAG endpoint. Retrieval + optional LLM explanations.
```json
// Request
{ "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }

// Response
{
  "standards": [
    {
      "standard_id": "IS 1905: 1987",
      "title": "Code of Practice for Structural Use of Unreinforced Masonry",
      "category": "Masonry",
      "score": 0.812,
      "matched_section": "Fire Resistance",
      "explanation": "This standard specifies..."
    }
  ],
  "latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Natural-language product description or compliance question |
| `top_n` | integer | 5 | Results to return (1-10) |
| `rewrite` | boolean | `false` | Expand query to IS-standard vocabulary via LLM before retrieval |
Rate limit: 20 req/min.
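
A small Python client for this endpoint might look like the sketch below. It uses only the standard library; the helper names are hypothetical, and the `top_n` bounds check assumes the accepted range is 1 to 10.

```python
import json
import urllib.request


def build_recommend_request(query, top_n=5, rewrite=False,
                            base_url="http://localhost:5000"):
    """Build the POST request for /api/recommend without sending it."""
    if not 1 <= top_n <= 10:  # assumed server-side range
        raise ValueError("top_n must be between 1 and 10")
    payload = json.dumps({"query": query, "top_n": top_n, "rewrite": rewrite})
    return urllib.request.Request(
        f"{base_url}/api/recommend",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recommend(query, **kwargs):
    """Send the request and return the parsed JSON response."""
    request = build_recommend_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


req = build_recommend_request("fire resistance for brick masonry", top_n=3)
```

Calling `recommend(...)` requires the API server from Step 6 to be running. At 20 requests per minute, a batch client should pace itself to roughly one call every 3 seconds.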
### `POST /api/ask`
Chunk-grounded QA for a specific standard.
```json
{ "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }
```
### `POST /api/chat`
Conversational QA over the standards corpus. Requires `GROQ_API_KEY`; returns `503` if absent.
```json
{ "message": "What grades of Portland cement does BIS cover?" }
```
### `GET /api/standards`
Paginated list. Query params: `q` (keyword search), `category`, `page` (default 1), `limit` (default 20, max 100).
### `GET /api/standards/:id`
Single standard. `:id` is URL-encoded IS ID, e.g. `IS%20269%3A%201989`.
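
IS IDs contain spaces and a colon, so they need percent-encoding before they go into the path. A minimal sketch using the standard library (the `standard_url` helper is hypothetical):

```python
from urllib.parse import quote


def standard_url(standard_id, base_url="http://localhost:5000"):
    """Build the URL for a single standard, percent-encoding the IS ID."""
    # safe="" also encodes "/" so the whole ID stays one path segment.
    return f"{base_url}/api/standards/{quote(standard_id, safe='')}"


url = standard_url("IS 269: 1989")
```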
### `GET /api/categories`
All 25 material categories sorted alphabetically.
### `GET /api/stats`
```json
{ "standards": 586, "chunks": 1269, "categories": 25 }
```
---
## Features
| Feature | Description |
|---|---|
| **Hybrid RAG retrieval** | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
| **Re-ranking** | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
| **Part-number disambiguation** | An explicit Part N in the query boosts the matching part (+0.30) and penalises sibling parts (-0.20); handles em-dash and PART2 variants |
| **AI explanations** | Groq `llama-3.1-8b-instant` — parallel, fallback-safe |
| **Query rewriting** | LLM expands natural language to IS-standard vocabulary (optional) |
| **Chunk-grounded QA** | Question answered from the most relevant chunk of a specific standard |
| **Conversational chat** | Open-ended QA against the full corpus |
| **Browse & filter** | Paginated standards list with keyword scoring; category gallery |
| **Persistent daemon** | Python retrieval process spawned once at boot; auto-restarts on crash |
| **Internationalisation** | UI in English and Hindi (i18next + react-i18next) |
| **Rate limiting** | 60 req/min global, 20 req/min on LLM endpoints (express-rate-limit); Helmet adds security headers |
| **Production-ready API** | Input validation, sanitisation, structured JSON logging, latency breakdown |
---
## Tech Stack
| Layer | Technology |
|---|---|
| Embedding model | `all-MiniLM-L6-v2` via `sentence-transformers` |
| Dense index | FAISS `IndexFlatIP` (cosine via inner product) |
| Sparse index | BM25Okapi (`rank-bm25`) |
| PDF parsing | PyMuPDF |
| LLM | Groq API (`llama-3.1-8b-instant`) |
| Backend | Node.js 18 + Express 5 |
| Security middleware | Helmet, CORS, express-rate-limit |
| Frontend | React 19, Vite 8, React Router 7 |
| Internationalisation | i18next, react-i18next, i18next-browser-languagedetector |
---
## Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| `PYTHON_BIN validation failed` on start | Invalid `PYTHON_BIN` | Set to `python`, `python3`, or absolute venv path |
| `ModuleNotFoundError: faiss` | Wrong Python binary (system Python instead of venv) | Set `PYTHON_BIN=/path/to/.venv/bin/python3` in `.env` |
| `Python daemon boot timeout` (90 s) | Index files missing | Run `python inference.py --build` with venv active |
| Results return but no `explanation` field | `GROQ_API_KEY` absent or invalid | Set key in `.env`; retrieval still works, explanations fall back silently |
| `fuser: command not found` on Linux | `psmisc` not installed | `sudo apt install psmisc` / `sudo dnf install psmisc` |
| Port 5000 still in use after crash | `fuser` not available | Manually: `kill $(lsof -t -i:5000)` |
---
## License
See [LICENSE](LICENSE).