# SpecForge — BIS Standards Recommendation Engine

> **BIS × Sigma Squad AI Hackathon** | Track: AI / Retrieval Augmented Generation (RAG)
>
> An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.

---

## Public Test Set Results

> Evaluated on the 10 provided public queries. Judges run: `python inference.py --input .json --output team_results.json`

| Metric | Target | **Our Score** |
|---|---|---|
| Hit Rate @3 | > 80% | **100%** (10/10) |
| MRR @5 | > 0.7 | **0.950** |
| Avg Latency | < 5 s | **~18 ms** |

All 10 public queries returned the expected standard in the top 2 results. Average query latency is about 18 ms once the index is warm, more than 250× faster than the 5 s target.

---

## What It Does

Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge replaces that manual search.

1. **Describe your product** in plain language — e.g. *"We manufacture 33 Grade Ordinary Portland Cement"*
2. **Get ranked BIS standards** with matched sections and relevance scores in milliseconds
3. **Read AI explanations** of why each standard applies, generated by a Groq-hosted LLM

The system covers all **586 unique standards** across **25 building material categories** from BIS SP-21 (Summaries of Indian Standards for Building Materials).
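The Hit Rate @3 and MRR @5 figures in the results table follow the usual definitions of those metrics. The sketch below is a minimal illustration, not the provided `eval_script.py`: the function names are hypothetical, and it assumes each result carries `retrieved_standards` and `expected_standards` lists whose IDs match by exact string comparison.

```python
def hit_rate_at_k(results, k=3):
    """Fraction of queries with at least one expected standard in the top k."""
    hits = sum(
        1 for r in results
        if set(r["retrieved_standards"][:k]) & set(r["expected_standards"])
    )
    return hits / len(results)


def mrr_at_k(results, k=5):
    """Mean reciprocal rank of the first expected standard within the top k."""
    total = 0.0
    for r in results:
        expected = set(r["expected_standards"])
        for rank, std in enumerate(r["retrieved_standards"][:k], start=1):
            if std in expected:
                total += 1.0 / rank
                break  # only the first correct hit counts
    return total / len(results)


# Toy data (assumption for illustration): nine queries answered at rank 1, one at rank 2.
toy = [{"retrieved_standards": ["IS 269: 1989", "IS 8112: 1989"],
        "expected_standards": ["IS 269: 1989"]}] * 9
toy.append({"retrieved_standards": ["IS 8112: 1989", "IS 269: 1989"],
            "expected_standards": ["IS 269: 1989"]})

print(hit_rate_at_k(toy), mrr_at_k(toy))
```

On this toy data the sketch gives a hit rate of 1.0 and an MRR of 0.95, which is consistent with the reported scores: ten queries all answered in the top 2 can only average 0.950 if exactly one of them landed at rank 2.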
---

## System Architecture

### Data Flow

```
data/raw/dataset.pdf  (BIS SP-21, 929 pages)
  → src/parse_bis_pdf.py
      → data/processed/standards.json          586 structured records   [committed]
      → data/processed/standards_chunks.json   1,269 RAG-ready chunks   [committed]
  → inference.py --build
      → data/processed/embeddings.npy          dense vectors            [gitignored — rebuild locally]
      → data/processed/faiss.index             FAISS index              [gitignored — rebuild locally]
```

### Request Pipeline

```
Browser / API Client
  → POST /api/recommend  { query, top_n, rewrite }
      → Express server (web/server/index.js)
          ├─ [optional] llmService.rewriteQuery()      Groq — expands to IS-standard vocabulary
          ├─ retrieverService.retrieve()
          │    └─ PythonRetriever singleton            EventEmitter, queues concurrent requests
          │         └─ bridge/retrieve.py daemon       stdin/stdout newline-delimited JSON
          │              └─ inference.py               FAISS 0.6 + BM25 0.4 → re-rank → top-N
          └─ llmService.generateExplanation() × N      Promise.allSettled — parallel, non-blocking
      → JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
```

### Chunking & Retrieval Strategy

**Chunking** (`src/parse_bis_pdf.py`):
- 4-pass boundary detection splits the 929-page PDF into per-standard records
  - Pass 1–2: primary block splitting and secondary boundary recovery
  - Pass 3: recovers scope text stolen by the preceding block (SP-21 PDF layout quirk)
  - Pass 4: truncates next-standard content bleed at a second `1.
Scope` marker
- Each standard is further split by section with **50-word overlap** to prevent context loss at boundaries
- Weak chunks (<30 words) are merged with their neighbour
- Result: 1,269 chunks from 586 standards (avg 2.2 chunks/standard)

**Hybrid Retrieval** (`inference.py`):
- **Dense**: FAISS `IndexFlatIP` with `all-MiniLM-L6-v2` embeddings (384-dim cosine similarity)
- **Sparse**: BM25Okapi with weighted document construction — title ×4, keywords ×3, section ×2, body ×1
- **Fusion**: `score = 0.6 × dense_norm + 0.4 × sparse_norm`

**Re-ranking** adjustments applied per candidate:
- +0.05 per overlapping keyword (max 4) between query and standard's keyword list
- +0.05 per overlapping title word (max 5)
- +0.25 if ≥60% of significant title words appear in the query (strong title match)
- +0.20 if an exact IS ID from the query matches this standard
- +0.35 / -0.40 grade discriminator: boosts/penalises OPC-grade standards (33/43/53) when query names a specific grade
- +0.30 / -0.20 part-number discriminator: boosts matching Part N and penalises non-matching parts when query explicitly names a part number (handles "Part – 1", "PART2" etc.)
- -0.15 penalty for very short chunks (<40 body words)

**Post-grouping Part disambiguation**: when multiple parts of the same IS base number survive into the candidate set with identical titles, IDF-weighted discriminating keyword scores break the tie — rarer corpus terms (e.g. "lightweight") carry proportionally more weight.

**Deduplication**: candidates grouped by `standard_id`; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.

### Key Design Decisions

| Decision | Rationale |
|---|---|
| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
| `inference.py` never modified | Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server uses the bridge. Both paths run the same retrieval code. |
| In-memory data | 586 standards + 1,269 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
| LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. `Promise.allSettled` for parallel calls. Server starts and retrieval works without a `GROQ_API_KEY`. |
| Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |

---

## Project Structure

```
SpecForge/
├── inference.py                    # Entry point for judges
├── requirements.txt                # All Python dependencies
├── eval_script.py                  # Provided evaluation script (Hit@3, MRR@5, latency)
├── data/
│   └── processed/
│       ├── standards.json          # 586 parsed standards (committed)
│       ├── standards_chunks.json   # 1,269 RAG chunks (committed)
│       ├── public_test_set.json    # 10 public evaluation queries
│       └── retrieval_results.json  # Our results on public test set
├── src/
│   └── parse_bis_pdf.py            # PDF → JSON parsing pipeline
└── web/
    ├── server/
    │   ├── index.js                # Express API — all routes
    │   ├── start.js                # Safe launcher (kills stale port process)
    │   ├── .env.example            # Environment template
    │   ├── bridge/
    │   │   └── retrieve.py         # Daemon wrapping inference.py for the web server
    │   └── services/
    │       ├── llmService.js       # Groq wrappers with fallbacks
    │       └── retrieverService.js # PythonRetriever — daemon lifecycle manager
    └── client/
        └── src/
            ├── App.jsx             # React router (5 pages)
            ├── api/standards.js    # Typed fetch wrappers
            ├── pages/              # Home, Standards, Categories, Recommend, About
            ├── components/         # Navbar, Footer, StandardCard, StandardModal
            └── locales/            # en/ and hi/ (English + Hindi i18n)
```

---

## External APIs & Data Sources

All sources disclosed per hackathon transparency requirements.

| Source | Purpose | Key required? | Notes |
|---|---|---|---|
| **BIS SP-21** (Bureau of Indian Standards, Special Publication 21) | Source dataset — 929-page PDF of building material standard summaries | No | Provided by organisers; processed JSON committed to repo |
| **HuggingFace `all-MiniLM-L6-v2`** | 384-dimension sentence embedding model for FAISS dense retrieval | No | Downloaded automatically by `sentence-transformers` on first `--build` (~90 MB) |
| **Groq API** (`llama-3.1-8b-instant`) | Query rewriting, per-result explanation, conversational QA | Yes — `GROQ_API_KEY` | Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key. |

No other external APIs, databases, or paid services are used.

---

## Environment Dependencies

### System Requirements

| Dependency | Minimum | Notes |
|---|---|---|
| Python | 3.10 | For retrieval pipeline and `inference.py` |
| Node.js | 18 | For Express server and React client |
| npm | 9 | Ships with Node 18 |
| `fuser` | any | Linux only; used by `start.js` to clear a stale port. Install via `psmisc` if missing |

### Hardware

- **CPU**: Any x86-64 or ARM64 — no GPU required
- **RAM**: 2 GB minimum; index + embeddings use ~500 MB
- **GPU**: Optional — a CUDA GPU reduces index build time but `faiss-cpu` and `sentence-transformers` run fully on CPU
- **Disk**: ~1 GB free for venv and generated index files

---

## Setup & Running

### Step 1 — Clone

```bash
git clone https://github.com/kshitij-ka/SpecForge.git
cd SpecForge
```

### Step 2 — Python virtual environment

```bash
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```

`requirements.txt`:

```
pymupdf>=1.24.0
faiss-cpu>=1.7.4
rank-bm25>=0.2.2
sentence-transformers>=3.0.0
numpy>=1.26.0
```

> `sentence-transformers` downloads `all-MiniLM-L6-v2` (~90 MB) from HuggingFace on first use.

### Step 3 — Build the FAISS index

The processed JSON is committed.
Index files are gitignored and must be built once locally.

```bash
source .venv/bin/activate
python inference.py --build
```

This encodes 1,269 chunks and writes `embeddings.npy` and `faiss.index` to `data/processed/`. It takes **~2 min on CPU**. Subsequent starts load from cache — no rebuild needed unless chunks change.

### Step 4 — Node.js dependencies

```bash
cd web/server && npm install
cd ../client && npm install
```

### Step 5 — Environment variables

```bash
# Run from the repository root
cp web/server/.env.example web/server/.env
```

Edit `web/server/.env`:

```env
# Required for LLM explanations, query rewriting, and /api/chat
GROQ_API_KEY=your_groq_api_key_here

# Optional — defaults to 5000
PORT=5000

# Required if "python" is not Python 3 — point to your venv
PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3
```

> `PYTHON_BIN` accepts only `"python"`, `"python3"`, or an absolute path. The server validates and rejects arbitrary values on startup.

### Step 6 — Start the application

**Terminal 1 — API server (port 5000):**

```bash
cd web/server
npm start
```

Wait for the log line `Python retriever ready` (~20 s first boot). The server accepts queries once it appears.

**Terminal 2 — Frontend dev server (port 5173):**

```bash
cd web/client
npm run dev
```

Open **http://localhost:5173**. The Vite dev server proxies all `/api/*` requests to `:5000`.

---

## Using `inference.py` (Judge Entry Point)

`inference.py` is the mandatory entry point. It runs independently of the web server.

> Always activate the virtual environment first: `source .venv/bin/activate`

### Build / force-rebuild the index

```bash
python inference.py --build
```

### Single query (interactive testing)

```bash
python inference.py --query "Which standard covers 33 grade OPC cement?"
```

Output:

```
============================================================
Query : Which standard covers 33 grade OPC cement?
Latency: 0.019s
Top results:
  1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
     Category: Cement and Concrete | Section: Scope | Score: 0.8921
  2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
     ...
```

### Batch evaluation (judge command)

```bash
python inference.py \
  --input data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json
```

Input format:

```json
[
  {
    "id": "PUB-01",
    "query": "We are a small enterprise manufacturing 33 Grade OPC...",
    "expected_standards": ["IS 269: 1989"]
  }
]
```

Output format:

```json
[
  {
    "id": "PUB-01",
    "query": "...",
    "retrieved_standards": ["IS 8112: 1989", "IS 269: 1989", "..."],
    "details": [
      {
        "standard_id": "IS 269: 1989",
        "title": "Ordinary Portland Cement, 33 Grade",
        "category": "Cement and Concrete",
        "score": 0.8921,
        "matched_section": "Scope"
      }
    ],
    "latency_seconds": 0.019,
    "expected_standards": ["IS 269: 1989"]
  }
]
```

---

## Evaluation

```bash
# Step 1: generate results
python inference.py \
  --input data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json

# Step 2: score
python eval_script.py \
  --results data/processed/retrieval_results.json
```

Targets and our results on the public set:

| Metric | Formula | Target | Achieved |
|---|---|---|---|
| Hit Rate @3 | queries with an expected standard in the top 3 / total queries | > 80% | **100%** |
| MRR @5 | Σ(1/rank_i) / N | > 0.7 | **0.950** |
| Avg Latency | total_time / num_queries | < 5 s | **~0.018 s** |

---

## API Reference

All endpoints are served by the Express server (default `http://localhost:5000`).

### `POST /api/recommend`

Core RAG endpoint. Retrieval + optional LLM explanations.

```json
// Request
{ "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }

// Response
{
  "standards": [
    {
      "standard_id": "IS 1905: 1987",
      "title": "Code of Practice for Structural Use of Unreinforced Masonry",
      "category": "Masonry",
      "score": 0.812,
      "matched_section": "Fire Resistance",
      "explanation": "This standard specifies..."
    }
  ],
  "latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Natural-language product description or compliance question |
| `top_n` | integer | 5 | Results to return (1–10) |
| `rewrite` | boolean | `false` | Expand query to IS-standard vocabulary via LLM before retrieval |

Rate limit: 20 req/min.

### `POST /api/ask`

Chunk-grounded QA for a specific standard.

```json
{ "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }
```

### `POST /api/chat`

Conversational QA over the standards corpus. Requires `GROQ_API_KEY`; returns `503` if absent.

```json
{ "message": "What grades of Portland cement does BIS cover?" }
```

### `GET /api/standards`

Paginated list. Query params: `q` (keyword search), `category`, `page` (default 1), `limit` (default 20, max 100).

### `GET /api/standards/:id`

Single standard. `:id` is URL-encoded IS ID, e.g. `IS%20269%3A%201989`.

### `GET /api/categories`

All 25 material categories sorted alphabetically.
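Because IS IDs contain spaces and a colon, the `:id` segment of `GET /api/standards/:id` has to be percent-encoded before it goes into the path. A minimal Python sketch using only the standard library is shown here; the helper name `standard_url` and the `BASE_URL` constant are illustrative assumptions, not part of the project.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:5000"  # default server address from this README


def standard_url(standard_id: str, base_url: str = BASE_URL) -> str:
    """Build the lookup URL for one standard (hypothetical helper)."""
    # safe="" makes quote() escape every reserved character, including "/" and ":"
    return f"{base_url}/api/standards/{quote(standard_id, safe='')}"


print(standard_url("IS 269: 1989"))
# → http://localhost:5000/api/standards/IS%20269%3A%201989
```

Passing `safe=""` matters: by default `quote()` leaves `/` unescaped, which would split an ID containing a slash into extra path segments.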
### `GET /api/stats`

```json
{ "standards": 586, "chunks": 1269, "categories": 25 }
```

---

## Features

| Feature | Description |
|---|---|
| **Hybrid RAG retrieval** | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
| **Re-ranking** | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
| **Part-number disambiguation** | Explicit Part N in query boosts the matching part (+0.30) and penalises sibling parts (-0.20); handles dash and `PART2` variants |
| **AI explanations** | Groq `llama-3.1-8b-instant` — parallel, fallback-safe |
| **Query rewriting** | LLM expands natural language to IS-standard vocabulary (optional) |
| **Chunk-grounded QA** | Question answered from the most relevant chunk of a specific standard |
| **Conversational chat** | Open-ended QA against the full corpus |
| **Browse & filter** | Paginated standards list with keyword scoring; category gallery |
| **Persistent daemon** | Python retrieval process spawned once at boot; auto-restarts on crash |
| **Internationalisation** | UI in English and Hindi (i18next + react-i18next) |
| **Rate limiting** | 60 req/min global, 20 req/min on LLM endpoints (express-rate-limit) |
| **Production-ready API** | Input validation, sanitisation, structured JSON logging, latency breakdown |

---

## Tech Stack

| Layer | Technology |
|---|---|
| Embedding model | `all-MiniLM-L6-v2` via `sentence-transformers` |
| Dense index | FAISS `IndexFlatIP` (cosine via inner product) |
| Sparse index | BM25Okapi (`rank-bm25`) |
| PDF parsing | PyMuPDF |
| LLM | Groq API (`llama-3.1-8b-instant`) |
| Backend | Node.js 18 + Express 5 |
| Security middleware | Helmet, CORS, express-rate-limit |
| Frontend | React 19, Vite 8, React Router 7 |
| Internationalisation | i18next, react-i18next, i18next-browser-languagedetector |

---

## Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| `PYTHON_BIN validation failed` on start | Invalid `PYTHON_BIN` | Set to `python`, `python3`, or absolute venv path |
| `ModuleNotFoundError: faiss` | Wrong Python binary (system Python instead of venv) | Set `PYTHON_BIN=/path/to/.venv/bin/python3` in `.env` |
| `Python daemon boot timeout` (90 s) | Index files missing | Run `python inference.py --build` with venv active |
| Results return but no `explanation` field | `GROQ_API_KEY` absent or invalid | Set key in `.env`; retrieval still works, explanations fall back silently |
| `fuser: command not found` on Linux | `psmisc` not installed | `sudo apt install psmisc` / `sudo dnf install psmisc` |
| Port 5000 still in use after crash | `fuser` not available | Manually: `kill $(lsof -t -i:5000)` |

---

## License

See [LICENSE](LICENSE).