docs: update README.

2026-05-03 22:23:23 +05:30
parent 6af7b05c53
commit c54c893eac
1 changed files with 439 additions and 68 deletions
@@ -1,102 +1,473 @@
-# SpecForge
+# SpecForge — BIS Standards Recommendation Engine
-A web application for querying BIS SP-21 building material standards with semantic search and AI-powered explanations.
+> **BIS × Sigma Squad AI Hackathon** | Track: AI / Retrieval Augmented Generation (RAG)
 >
 > An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.
 ---
-## Features
+## Public Test Set Results
- **PDF Parser**: Extracts 573 unique standards from the BIS SP-21 document (929 pages, 25 material categories)
+> Evaluated on the 10 provided public queries. Judges run: `python inference.py --input <hidden_dataset>.json --output team_results.json`
 - **Hybrid Retrieval**: FAISS dense vectors + BM25 sparse index for accurate matching
 - **AI Explanations**: Groq LLM generates natural language explanations for recommendations
 - **Gallery UI**: Photography-first interface with alternating light/dark sections
-## Tech Stack
+| Metric | Target | **Our Score** |
 |---|---|---|
 | Hit Rate @3 | > 80% | **100%** (10/10) |
 | MRR @5 | > 0.7 | **0.783** |
 | Avg Latency | < 5 s | **~19 ms** |
-| Layer | Technology |
+All 10 public queries returned the expected standard in the top-3 results. Average query latency is about 19 ms once the index is warm, roughly 260× faster than the 5 s target.
 |-------|------------|
 | PDF Processing | Python, PyMuPDF |
 | Retrieval | FAISS, BM25 |
 | LLM | Groq (llama-3.1-8b-instant) |
 | Backend | Node.js, Express |
 | Frontend | React 19, Vite 8, React Router |
-## Getting Started
+---
-### Prerequisites
+## What It Does
- Node.js 18+
+Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge reduces that search to a single plain-language query.
 - Python 3.10+
-### Installation
+1. **Describe your product** in plain language — e.g. *"We manufacture 33 Grade Ordinary Portland Cement"*
 2. **Get ranked BIS standards** with matched sections and relevance scores in milliseconds
 3. **Read AI explanations** of why each standard applies, generated by Groq LLM
-```bash
+The system covers all **573 unique standards** across **25 building material categories** from BIS SP-21 (Summaries of Indian Standards for Building Materials).
 # Install Python dependencies
 pip install -r requirements.txt
-# Install web dependencies
+---
-cd web/server && npm install
+
-cd web/client && npm install
+## System Architecture
 ### Data Flow
 ```
 data/raw/dataset.pdf  (BIS SP-21, 929 pages)
  → src/parse_bis_pdf.py
  → data/processed/standards.json          573 structured records  [committed]
  → data/processed/standards_chunks.json   1,261 RAG-ready chunks  [committed]
  → inference.py --build
  → data/processed/embeddings.npy          dense vectors           [gitignored — rebuild locally]
  → data/processed/faiss.index             FAISS index             [gitignored — rebuild locally]
 ```
-### Running the Application
+### Request Pipeline
-**All platforms:**
+```
-```bash
+Browser / API Client
-cd web && npm run dev
+  → POST /api/recommend  { query, top_n, rewrite }
  → Express server (web/server/index.js)
      ├─ [optional] llmService.rewriteQuery()        Groq — expands to IS-standard vocabulary
      ├─ retrieverService.retrieve()
      │     └─ PythonRetriever singleton              EventEmitter, queues concurrent requests
      │           └─ bridge/retrieve.py daemon        stdin/stdout newline-delimited JSON
      │                 └─ inference.py               FAISS 0.6 + BM25 0.4 → re-rank → top-N
      └─ llmService.generateExplanation() × N        Promise.allSettled — parallel, non-blocking
  → JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
 ```
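The `stdin/stdout newline-delimited JSON` link between the Node server and the Python daemon can be pictured with a minimal loop like the one below. This is a sketch only, not the contents of `bridge/retrieve.py`: the `serve` function, the `handle` callback, and the `id` / `ok` / `result` / `error` fields are hypothetical names chosen for illustration.

```python
import io
import json
import sys
from typing import Callable, TextIO


def serve(handle: Callable[[dict], dict],
          stdin: TextIO = sys.stdin,
          stdout: TextIO = sys.stdout) -> None:
    """Answer one JSON request per input line with one JSON line of output."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines between requests
        request_id = None
        try:
            request = json.loads(line)
            request_id = request.get("id")
            reply = {"id": request_id, "ok": True, "result": handle(request)}
        except Exception as exc:
            # Report the failure on the same channel instead of killing the daemon.
            reply = {"id": request_id, "ok": False, "error": str(exc)}
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the parent process is waiting for this line


# Demo with in-memory streams standing in for the real pipes.
def echo_handler(request: dict) -> dict:
    return {"query": request["query"].upper()}


fake_in = io.StringIO('{"id": 1, "query": "opc cement"}\nnot json\n')
fake_out = io.StringIO()
serve(echo_handler, fake_in, fake_out)
replies = [json.loads(row) for row in fake_out.getvalue().splitlines()]
print(replies)
```

Keeping one request and one reply per line lets the caller match replies to queued requests without any extra framing.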
-**Windows:**
+### Chunking & Retrieval Strategy
 ```bash
 npm run dev
 ```
-**Manual start:**
+**Chunking** (`src/parse_bis_pdf.py`):
-```bash
+- 2-pass boundary detection splits the 929-page PDF into per-standard records
-# Terminal 1: Python retrieval index
+- Each standard is further split by section with **50-word overlap** to prevent context loss at boundaries
-cd web/server && node bridge/retrieve.py --build-index
+- Weak chunks (<30 words) are merged with their neighbour
 - Result: 1,261 chunks from 573 standards (avg 2.2 chunks/standard)
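The overlap and weak-chunk rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/parse_bis_pdf.py`: the function name `chunk_sections` is hypothetical, words are split on whitespace, a weak section is merged into the previous one (or the next one if it comes first), and the overlap is taken from the tail of the preceding chunk.

```python
def chunk_sections(sections: list[str], overlap: int = 50, min_words: int = 30) -> list[str]:
    """Merge weak sections, then prepend an overlap window to each later chunk."""
    # Pass 1: fold any section shorter than min_words into a neighbour.
    merged: list[list[str]] = []
    for text in sections:
        words = text.split()
        if merged and (len(words) < min_words or len(merged[-1]) < min_words):
            merged[-1].extend(words)
        else:
            merged.append(words)

    # Pass 2: every chunk after the first starts with the tail of the previous one.
    chunks: list[str] = []
    for i, words in enumerate(merged):
        prefix = merged[i - 1][-overlap:] if i > 0 else []
        chunks.append(" ".join(prefix + words))
    return chunks


# Toy input: a 40-word section, a weak 10-word section, and a 60-word section.
sections = ["a " * 40, "b " * 10, "c " * 60]
chunks = chunk_sections(sections, overlap=5, min_words=30)
print([len(c.split()) for c in chunks])
```

In the toy input the 10-word section is absorbed by its predecessor, and the final chunk opens with the last five words of that merged chunk.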
-# Terminal 2: Backend
+**Hybrid Retrieval** (`inference.py`):
-cd web/server && npm start
+- **Dense**: FAISS `IndexFlatIP` with `all-MiniLM-L6-v2` embeddings (384-dim cosine similarity)
 - **Sparse**: BM25Okapi with weighted document construction — title ×4, keywords ×3, section ×2, body ×1
 - **Fusion**: `score = 0.6 × dense_norm + 0.4 × sparse_norm`
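A small sketch of the fusion formula follows. The README does not say how `dense_norm` and `sparse_norm` are computed, so min-max scaling over the candidate set is assumed here; `min_max` and `fuse` are hypothetical helper names.

```python
def min_max(scores: list[float]) -> list[float]:
    """Scale scores to [0, 1]; an all-equal list maps to zeros."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def fuse(dense: list[float], sparse: list[float],
         w_dense: float = 0.6, w_sparse: float = 0.4) -> list[float]:
    """score = w_dense * dense_norm + w_sparse * sparse_norm, per candidate."""
    dense_norm = min_max(dense)
    sparse_norm = min_max(sparse)
    return [w_dense * d + w_sparse * s for d, s in zip(dense_norm, sparse_norm)]


# Three candidates: cosine similarities (dense) and raw BM25 scores (sparse).
fused = fuse(dense=[0.2, 0.8, 0.5], sparse=[10.0, 2.0, 6.0])
print(fused)
```

Normalising first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalised sum would be dominated by the sparse term.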
-# Terminal 3: Frontend
+**Re-ranking** bonuses applied per candidate:
-cd web/client && npm run dev
+- +0.05 per overlapping keyword (max 4) between query and standard's keyword list
-```
+- +0.05 per overlapping title word (max 5)
 - +0.25 if ≥60% of significant title words appear in the query (strong title match)
 - +0.20 if an exact IS ID from the query matches this standard
 - -0.15 penalty for very short chunks (<40 body words)
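The bonus rules above might be combined along these lines. Treat it as a sketch with several assumptions: the stop-word list that decides which title words count as significant is invented, keywords are treated as single tokens, "max 4" and "max 5" are read as caps on the number of counted overlaps, and an IS ID match compares only the number after `IS`. `rerank_bonus` and its helpers are hypothetical names, not functions from `inference.py`.

```python
import re

# Hypothetical stop-word list used to decide which title words are "significant".
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokens(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_numbers(text: str) -> set[str]:
    """Numbers that directly follow an upper-case 'IS', e.g. 'IS 269: 1989' gives {'269'}."""
    return set(re.findall(r"\bIS\s*(\d+)", text))


def rerank_bonus(query: str, title: str, keywords: list[str],
                 standard_id: str, body_words: int) -> float:
    q = tokens(query)
    bonus = 0.0

    keyword_hits = len(q & {k.lower() for k in keywords})
    bonus += 0.05 * min(keyword_hits, 4)          # keyword overlap, capped at 4

    title_words = tokens(title) - STOPWORDS
    title_hits = len(q & title_words)
    bonus += 0.05 * min(title_hits, 5)            # title-word overlap, capped at 5

    if title_words and title_hits / len(title_words) >= 0.6:
        bonus += 0.25                             # strong title match
    if is_numbers(query) & is_numbers(standard_id):
        bonus += 0.20                             # IS ID named in the query
    if body_words < 40:
        bonus -= 0.15                             # very short chunk
    return bonus


strong = rerank_bonus("33 grade ordinary portland cement IS 269",
                      "Ordinary Portland Cement, 33 Grade",
                      ["cement", "portland", "opc"], "IS 269: 1989", 120)
weak = rerank_bonus("steel reinforcement bars",
                    "Ordinary Portland Cement, 33 Grade",
                    ["cement"], "IS 269: 1989", 10)
print(round(strong, 2), round(weak, 2))
```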
-## API Endpoints
+**Deduplication**: candidates grouped by `standard_id`; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.
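As a rough illustration of the deduplication step, the sketch below keeps the best-scoring chunk per `standard_id` and returns the top-N. The candidate dictionaries and the function name `dedupe_by_standard` are assumptions; only the `standard_id` and `score` fields are taken from the output format documented in this README.

```python
def dedupe_by_standard(candidates: list[dict], top_n: int = 5) -> list[dict]:
    """Keep the highest-scoring chunk for each standard, then rank by score."""
    best: dict[str, dict] = {}
    for cand in candidates:
        sid = cand["standard_id"]
        if sid not in best or cand["score"] > best[sid]["score"]:
            best[sid] = cand
    ranked = sorted(best.values(), key=lambda c: c["score"], reverse=True)
    return ranked[:top_n]


candidates = [
    {"standard_id": "IS 269: 1989", "matched_section": "Scope", "score": 0.89},
    {"standard_id": "IS 8112: 1989", "matched_section": "Scope", "score": 0.74},
    {"standard_id": "IS 269: 1989", "matched_section": "Requirements", "score": 0.61},
]
unique = dedupe_by_standard(candidates, top_n=5)
print([(c["standard_id"], c["matched_section"]) for c in unique])
```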
-| Method | Endpoint | Description |
+### Key Design Decisions
-|--------|----------|-------------|
+
-| POST | `/api/recommend` | Get recommended standards with AI explanations |
+| Decision | Rationale |
-| POST | `/api/ask` | Ask questions about a specific standard |
+|---|---|
-| GET | `/api/standards` | List all standards |
+| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
-| GET | `/api/search?q=query` | Search standards by keyword |
+| `inference.py` never modified | Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server uses the bridge. Both paths execute the same retrieval code. |
 | In-memory data | 573 standards + 1,261 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
 | LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. `Promise.allSettled` for parallel calls. Server starts and retrieval works without a `GROQ_API_KEY`. |
 | Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |
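The weighted BM25 document idea can be illustrated by repeating tokens per field before indexing. This is a sketch assuming plain whitespace tokenisation; `weighted_document` is a hypothetical helper, and the real tokeniser in `inference.py` may differ.

```python
def weighted_document(title: str, keywords: list[str], section: str, body: str) -> list[str]:
    """Build one token list with title x4, keywords x3, section x2, body x1."""
    def tok(text: str) -> list[str]:
        return text.lower().split()

    return (
        tok(title) * 4
        + tok(" ".join(keywords)) * 3
        + tok(section) * 2
        + tok(body)
    )


doc = weighted_document(
    title="Portland Cement",
    keywords=["opc"],
    section="Scope",
    body="covers cement",
)
print(doc.count("portland"), doc.count("opc"), doc.count("scope"), doc.count("cement"))
```

A list of such token lists is the kind of tokenised corpus that `rank-bm25`'s `BM25Okapi` accepts, so repeated title tokens raise their term frequency without any change to the scoring function.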
 ---
 ## Project Structure
 ```
 SpecForge/
-├── data/
+├── inference.py                         # Entry point for judges — do not modify
-│   ├── raw/dataset.pdf           # Source BIS SP-21 PDF
+├── requirements.txt                     # All Python dependencies
 │   └── processed/                 # Generated outputs
 │       ├── standards.json       # 573 parsed standards
 │       └── standards_chunks.json # 1,261 RAG chunks
 ├── src/
 │   └── parse_bis_pdf.py         # PDF parser pipeline
 ├── scripts/
-│   └── eval_script.py          # Evaluation metrics
+│   └── eval_script.py                   # Provided evaluation script (Hit@3, MRR@5, latency)
-├── web/
+├── data/
-│   ├── client/                 # React + Vite frontend
+│   └── processed/
-│   └── server/                 # Express backend
+│       ├── standards.json               # 573 parsed standards (committed)
-│       ├── services/            # LLM & retrieval services
+│       ├── standards_chunks.json        # 1,261 RAG chunks (committed)
-│       └── bridge/             # Node→Python bridge
+│       ├── public_test_set.json         # 10 public evaluation queries
-└── requirements.txt           # Python dependencies
+│       └── retrieval_results.json       # Our results on public test set
 ├── src/
 │   └── parse_bis_pdf.py                 # PDF → JSON parsing pipeline
 └── web/
    ├── server/
    │   ├── index.js                     # Express API — all routes
    │   ├── start.js                     # Safe launcher (kills stale port process)
    │   ├── .env.example                 # Environment template
    │   ├── bridge/
    │   │   └── retrieve.py              # Daemon wrapping inference.py for the web server
    │   └── services/
    │       ├── llmService.js            # Groq wrappers with fallbacks
    │       └── retrieverService.js      # PythonRetriever — daemon lifecycle manager
    └── client/
        └── src/
            ├── App.jsx                  # React router (5 pages)
            ├── api/standards.js         # Typed fetch wrappers
            ├── pages/                   # Home, Standards, Categories, Recommend, About
            ├── components/              # Navbar, Footer, StandardCard, StandardModal
            └── locales/                 # en/ and hi/ (English + Hindi i18n)
 ```
-## Configuration
+---
- **GROQ_API_KEY**: Set in `web/server/.env` (gitignored)
+## External APIs & Data Sources
- **Server port**: 5000
+
- **Client dev port**: 5173
+All sources disclosed per hackathon transparency requirements.
 | Source | Purpose | Key required? | Notes |
 |---|---|---|---|
 | **BIS SP-21** (Bureau of Indian Standards, Special Publication 21) | Source dataset — 929-page PDF of building material standard summaries | No | Provided by organisers; processed JSON committed to repo |
 | **HuggingFace `all-MiniLM-L6-v2`** | 384-dimension sentence embedding model for FAISS dense retrieval | No | Downloaded automatically by `sentence-transformers` on first `--build` (~90 MB) |
 | **Groq API** (`llama-3.1-8b-instant`) | Query rewriting, per-result explanation, conversational QA | Yes — `GROQ_API_KEY` | Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key. |
 No other external APIs, databases, or paid services are used.
 ---
 ## Environment Dependencies
 ### System Requirements
 | Dependency | Minimum | Notes |
 |---|---|---|
 | Python | 3.10 | For retrieval pipeline and `inference.py` |
 | Node.js | 18 | For Express server and React client |
 | npm | 9 | Ships with Node 18 |
 | `fuser` | any | Linux — used by `start.js` to clear stale port; install via `psmisc` if missing |
 ### Hardware
 - **CPU**: Any x86-64 or ARM64 — no GPU required
 - **RAM**: 2 GB minimum; index + embeddings use ~500 MB
 - **GPU**: Optional — a CUDA GPU reduces index build time but `faiss-cpu` and `sentence-transformers` run fully on CPU
 - **Disk**: ~1 GB free for venv and generated index files
 ---
 ## Setup & Running
 ### Step 1 — Clone
 ```bash
 git clone https://github.com/kshitij-ka/SpecForge
 cd SpecForge
 ```
 ### Step 2 — Python virtual environment
 ```bash
 python3 -m venv .venv
 source .venv/bin/activate        # Windows: .venv\Scripts\activate
 pip install --upgrade pip
 pip install -r requirements.txt
 ```
 `requirements.txt`:
 ```
 pymupdf>=1.24.0
 faiss-cpu>=1.7.4
 rank-bm25>=0.2.2
 sentence-transformers>=3.0.0
 numpy>=1.26.0
 ```
 > `sentence-transformers` downloads `all-MiniLM-L6-v2` (~90 MB) from HuggingFace on first use.
 ### Step 3 — Build the FAISS index
 The processed JSON is committed. Index files are gitignored and must be built once locally.
 ```bash
 source .venv/bin/activate
 python inference.py --build
 ```
 This step encodes the 1,261 chunks and writes `embeddings.npy` and `faiss.index` to `data/processed/`. It takes **~2 min on CPU**. Subsequent starts load from cache, so no rebuild is needed unless the chunks change.
 ### Step 4 — Node.js dependencies
 ```bash
 cd web/server && npm install
 cd ../client && npm install
 ```
 ### Step 5 — Environment variables
 ```bash
 cp web/server/.env.example web/server/.env
 ```
 Edit `web/server/.env`:
 ```env
 # Required for LLM explanations, query rewriting, and /api/chat
 GROQ_API_KEY=your_groq_api_key_here
 # Optional — defaults to 5000
 PORT=5000
 # Required if "python" is not Python 3 — point to your venv
 PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3
 ```
 > `PYTHON_BIN` accepts only `"python"`, `"python3"`, or an absolute path. The server validates and rejects arbitrary values on startup.
 ### Step 6 — Start the application
 **Terminal 1 — API server (port 5000):**
 ```bash
 cd web/server
 npm start
 ```
 Wait for the log line `Python retriever ready` (~20 s on first boot). The server accepts queries once it appears.
 **Terminal 2 — Frontend dev server (port 5173):**
 ```bash
 cd web/client
 npm run dev
 ```
 Open **http://localhost:5173**. The Vite dev server proxies all `/api/*` requests to `:5000`.
 ---
 ## Using `inference.py` (Judge Entry Point)
 `inference.py` is the mandatory entry point. It runs independently of the web server.
 > Always activate the virtual environment first: `source .venv/bin/activate`
 ### Build / force-rebuild the index
 ```bash
 python inference.py --build
 ```
 ### Single query (interactive testing)
 ```bash
 python inference.py --query "Which standard covers 33 grade OPC cement?"
 ```
 Output:
 ```
 ============================================================
 Query : Which standard covers 33 grade OPC cement?
 Latency: 0.019s
 Top results:
  1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
     Category: Cement and Concrete  |  Section: Scope  |  Score: 0.8921
  2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
     ...
 ```
 ### Batch evaluation (judge command)
 ```bash
 python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json
 ```
 Input format:
 ```json
 [
  {
    "id": "PUB-01",
    "query": "We are a small enterprise manufacturing 33 Grade OPC...",
    "expected_standards": ["IS 269: 1989"]
  }
 ]
 ```
 Output format:
 ```json
 [
  {
    "id": "PUB-01",
    "query": "...",
    "retrieved_standards": ["IS 8112: 1989", "IS 269: 1989", "..."],
    "details": [
      {
        "standard_id": "IS 269: 1989",
        "title": "Ordinary Portland Cement, 33 Grade",
        "category": "Cement and Concrete",
        "score": 0.8921,
        "matched_section": "Scope"
      }
    ],
    "latency_seconds": 0.019,
    "expected_standards": ["IS 269: 1989"]
  }
 ]
 ```
 ## Evaluation
 ```bash
 # Step 1: generate results
 python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json
 # Step 2: score
 python scripts/eval_script.py \
  --results data/processed/retrieval_results.json
 ```
 Targets and our results on the public set:
 | Metric | Formula | Target | Achieved |
 |---|---|---|---|
 | Hit Rate @3 | queries with an expected standard in the top 3 / total queries | > 80% | **100%** |
 | MRR @5 | Σ(1/rank_i) / N | > 0.7 | **0.783** |
 | Avg Latency | total_time / num_queries | < 5 s | **~0.019 s** |
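The two ranking formulas in the table can be written out as below. This is a sketch of the definitions only, not the provided `scripts/eval_script.py`, which may normalise IDs or handle edge cases differently; the function names are hypothetical, and the field names follow the output format documented in this README.

```python
def hit_rate_at_k(results: list[dict], k: int = 3) -> float:
    """Share of queries with at least one expected standard in the top k."""
    hits = 0
    for row in results:
        if set(row["retrieved_standards"][:k]) & set(row["expected_standards"]):
            hits += 1
    return hits / len(results)


def mrr_at_k(results: list[dict], k: int = 5) -> float:
    """Mean of 1/rank of the first expected standard within the top k (0 if absent)."""
    total = 0.0
    for row in results:
        expected = set(row["expected_standards"])
        for rank, sid in enumerate(row["retrieved_standards"][:k], start=1):
            if sid in expected:
                total += 1.0 / rank
                break
    return total / len(results)


# Two toy queries: the expected standard is ranked 2nd in one and 1st in the other.
results = [
    {"retrieved_standards": ["IS 8112: 1989", "IS 269: 1989"],
     "expected_standards": ["IS 269: 1989"]},
    {"retrieved_standards": ["IS 1905: 1987", "IS 2212: 1991"],
     "expected_standards": ["IS 1905: 1987"]},
]
print(hit_rate_at_k(results), mrr_at_k(results))
```

For the toy data the hit rate is 1.0 and the MRR is (1/2 + 1/1) / 2 = 0.75, which shows how a perfect hit rate can coexist with an MRR below 1.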
 ---
 ## API Reference
 All endpoints are served by the Express server (default `http://localhost:5000`).
 ### `POST /api/recommend`
 Core RAG endpoint. Retrieval + optional LLM explanations.
 ```json
 // Request
 { "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }
 // Response
 {
  "standards": [
    {
      "standard_id": "IS 1905: 1987",
      "title": "Code of Practice for Structural Use of Unreinforced Masonry",
      "category": "Masonry",
      "score": 0.812,
      "matched_section": "Fire Resistance",
      "explanation": "This standard specifies..."
    }
  ],
  "latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
 }
 ```
 | Field | Type | Default | Description |
 |---|---|---|---|
 | `query` | string | required | Natural-language product description or compliance question |
 | `top_n` | integer | 5 | Results to return (1–10) |
 | `rewrite` | boolean | `false` | Expand query to IS-standard vocabulary via LLM before retrieval |
 Rate limit: 20 req/min.
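A minimal Python client for this endpoint could look like the sketch below, assuming the server is running on the default port. Only the standard library is used; `build_recommend_request` and `recommend` are hypothetical helper names, and the 1 to 10 range check mirrors the `top_n` limits in the table above.

```python
import json
import urllib.request


def build_recommend_request(query: str, top_n: int = 5, rewrite: bool = False,
                            base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Prepare a POST /api/recommend request without sending it."""
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10")
    payload = json.dumps({"query": query, "top_n": top_n, "rewrite": rewrite})
    return urllib.request.Request(
        f"{base_url}/api/recommend",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recommend(query: str, **kwargs) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_recommend_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_recommend_request("fire resistance for brick masonry")
print(req.get_method(), req.full_url)
```

With the server up, `recommend("fire resistance for brick masonry")["standards"]` would give the ranked list described above.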
 ### `POST /api/ask`
 Chunk-grounded QA for a specific standard.
 ```json
 { "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }
 ```
 ### `POST /api/chat`
 Conversational QA over the standards corpus. Requires `GROQ_API_KEY`; returns `503` if absent.
 ```json
 { "message": "What grades of Portland cement does BIS cover?" }
 ```
 ### `GET /api/standards`
 Paginated list. Query params: `q` (keyword search), `category`, `page` (default 1), `limit` (default 20, max 100).
 ### `GET /api/standards/:id`
 Single standard. `:id` is URL-encoded IS ID, e.g. `IS%20269%3A%201989`.
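Because IS IDs contain spaces and a colon, the path segment has to be percent-encoded. A short sketch with Python's standard library (the helper name `standard_path` is hypothetical):

```python
from urllib.parse import quote, unquote


def standard_path(standard_id: str) -> str:
    """Build the request path for one standard, encoding every reserved character."""
    return "/api/standards/" + quote(standard_id, safe="")


path = standard_path("IS 269: 1989")
print(path)
print(unquote(path.rsplit("/", 1)[1]))
```

Passing `safe=""` makes sure characters that `quote` leaves alone by default, such as `/`, would also be encoded if they ever appeared in an ID.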
 ### `GET /api/categories`
 All 25 material categories sorted alphabetically.
 ### `GET /api/stats`
 ```json
 { "standards": 573, "chunks": 1261, "categories": 25 }
 ```
 ---
 ## Features
 | Feature | Description |
 |---|---|
 | **Hybrid RAG retrieval** | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
 | **Re-ranking** | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
 | **AI explanations** | Groq `llama-3.1-8b-instant` — parallel, fallback-safe |
 | **Query rewriting** | LLM expands natural language to IS-standard vocabulary (optional) |
 | **Chunk-grounded QA** | Question answered from the most relevant chunk of a specific standard |
 | **Conversational chat** | Open-ended QA against the full corpus |
 | **Browse & filter** | Paginated standards list with keyword scoring; category gallery |
 | **Persistent daemon** | Python retrieval process spawned once at boot; auto-restarts on crash |
 | **Internationalisation** | UI in English and Hindi (i18next + react-i18next) |
 | **Rate limiting** | 60 req/min global, 20 req/min on LLM endpoints (express-rate-limit); Helmet sets security headers |
 | **Production-ready API** | Input validation, sanitisation, structured JSON logging, latency breakdown |
 ---
 ## Tech Stack
 | Layer | Technology |
 |---|---|
 | Embedding model | `all-MiniLM-L6-v2` via `sentence-transformers` |
 | Dense index | FAISS `IndexFlatIP` (cosine via inner product) |
 | Sparse index | BM25Okapi (`rank-bm25`) |
 | PDF parsing | PyMuPDF |
 | LLM | Groq API (`llama-3.1-8b-instant`) |
 | Backend | Node.js 18 + Express 5 |
 | Security middleware | Helmet, CORS, express-rate-limit |
 | Frontend | React 19, Vite 8, React Router 7 |
 | Internationalisation | i18next, react-i18next, i18next-browser-languagedetector |
 ---
 ## Troubleshooting
 | Symptom | Likely cause | Fix |
 |---|---|---|
 | `PYTHON_BIN validation failed` on start | Invalid `PYTHON_BIN` | Set to `python`, `python3`, or absolute venv path |
 | `ModuleNotFoundError: faiss` | Wrong Python binary (system Python instead of venv) | Set `PYTHON_BIN=/path/to/.venv/bin/python3` in `.env` |
 | `Python daemon boot timeout` (90 s) | Index files missing | Run `python inference.py --build` with venv active |
 | Results return but no `explanation` field | `GROQ_API_KEY` absent or invalid | Set key in `.env`; retrieval still works, explanations fall back silently |
 | `fuser: command not found` on Linux | `psmisc` not installed | `sudo apt install psmisc` / `sudo dnf install psmisc` |
 | Port 5000 still in use after crash | `fuser` not available | Manually: `kill $(lsof -t -i:5000)` |
 ---
 ## License
 See [LICENSE](LICENSE).