SpecForge — BIS Standards Recommendation Engine
BIS × Sigma Squad AI Hackathon | Track: AI / Retrieval Augmented Generation (RAG)
An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.
Public Test Set Results
Evaluated on the 10 provided public queries. Judges run:
python inference.py --input <hidden_dataset>.json --output team_results.json
| Metric | Target | Our Score |
|---|---|---|
| Hit Rate @3 | > 80% | 100% (10/10) |
| MRR @5 | > 0.7 | 0.783 |
| Avg Latency | < 5 s | ~19 ms |
All 10 public queries returned the expected standard in the top-3 results. Average query latency is about 19 ms once the index is warm, roughly 260× faster than the 5 s target.
What It Does
Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge eliminates that.
- Describe your product in plain language — e.g. "We manufacture 33 Grade Ordinary Portland Cement"
- Get ranked BIS standards with matched sections and relevance scores in milliseconds
- Read AI explanations of why each standard applies, generated by Groq LLM
The system covers all 573 unique standards across 25 building material categories from BIS SP-21 (Summaries of Indian Standards for Building Materials).
System Architecture
Data Flow
data/raw/dataset.pdf (BIS SP-21, 929 pages)
→ src/parse_bis_pdf.py
→ data/processed/standards.json 573 structured records [committed]
→ data/processed/standards_chunks.json 1,261 RAG-ready chunks [committed]
→ inference.py --build
→ data/processed/embeddings.npy dense vectors [gitignored — rebuild locally]
→ data/processed/faiss.index FAISS index [gitignored — rebuild locally]
Request Pipeline
Browser / API Client
→ POST /api/recommend { query, top_n, rewrite }
→ Express server (web/server/index.js)
├─ [optional] llmService.rewriteQuery() Groq — expands to IS-standard vocabulary
├─ retrieverService.retrieve()
│ └─ PythonRetriever singleton EventEmitter, queues concurrent requests
│ └─ bridge/retrieve.py daemon stdin/stdout newline-delimited JSON
│ └─ inference.py FAISS 0.6 + BM25 0.4 → re-rank → top-N
└─ llmService.generateExplanation() × N Promise.allSettled — parallel, non-blocking
→ JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
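The newline-delimited JSON exchange between the Node server and the Python daemon can be sketched as below. This is a minimal illustration, not the contents of bridge/retrieve.py: the message fields (`id`, `ok`, `result`, `error`) and the `fake_retrieve` stand-in are assumptions made for the example.

```python
import io
import json
from typing import Callable, TextIO


def serve(stdin: TextIO, stdout: TextIO, handle: Callable[[dict], dict]) -> None:
    """Answer one JSON request per input line with one JSON response per output line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between requests
        try:
            request = json.loads(line)
            response = {"id": request.get("id"), "ok": True, "result": handle(request)}
        except Exception as exc:  # keep the daemon alive on malformed input
            response = {"id": None, "ok": False, "error": str(exc)}
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # the parent process reads line by line


def fake_retrieve(request: dict) -> dict:
    # Hypothetical stand-in for the real retrieval call in inference.py.
    return {"query": request["query"], "standards": []}


# Simulate two requests arriving on stdin; the second one is malformed.
out = io.StringIO()
serve(io.StringIO('{"id": 1, "query": "33 grade OPC"}\nnot json\n'), out, fake_retrieve)
replies = [json.loads(line) for line in out.getvalue().splitlines()]
```

In a real daemon the same loop would run over `sys.stdin` and `sys.stdout`. Answering every request line with exactly one response line is what lets the parent process queue concurrent requests against a single child.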
Chunking & Retrieval Strategy
Chunking (src/parse_bis_pdf.py):
- 2-pass boundary detection splits the 929-page PDF into per-standard records
- Each standard is further split by section with 50-word overlap to prevent context loss at boundaries
- Weak chunks (<30 words) are merged with their neighbour
- Result: 1,261 chunks from 573 standards (avg 2.2 chunks/standard)
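The overlap and merge rules above can be sketched roughly as follows. The function names and the choice to merge a weak section into the preceding one are assumptions for illustration; the actual logic in src/parse_bis_pdf.py may differ.

```python
from typing import List

OVERLAP_WORDS = 50  # words carried over from the previous section
MIN_WORDS = 30      # sections shorter than this are merged into a neighbour


def merge_weak(sections: List[str], min_words: int = MIN_WORDS) -> List[str]:
    """Fold sections with fewer than min_words words into an adjacent section."""
    merged: List[str] = []
    for text in sections:
        current_is_weak = len(text.split()) < min_words
        previous_is_weak = bool(merged) and len(merged[-1].split()) < min_words
        if merged and (current_is_weak or previous_is_weak):
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


def add_overlap(sections: List[str], overlap: int = OVERLAP_WORDS) -> List[str]:
    """Prefix each section after the first with the tail of the section before it."""
    chunks: List[str] = []
    for i, text in enumerate(sections):
        if i > 0:
            tail = sections[i - 1].split()[-overlap:]
            text = " ".join(tail) + " " + text
        chunks.append(text)
    return chunks


# Three toy sections of 100, 10 and 60 words; the 10-word one is "weak".
sections = [" ".join(["a"] * 100), " ".join(["b"] * 10), " ".join(["c"] * 60)]
merged = merge_weak(sections)
chunks = add_overlap(merged)
```

Here the 10-word section is absorbed into the first, and the second chunk starts with the last 50 words of the merged first section.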
Hybrid Retrieval (inference.py):
- Dense: FAISS `IndexFlatIP` with `all-MiniLM-L6-v2` embeddings (384-dim cosine similarity)
- Sparse: BM25Okapi with weighted document construction: title ×4, keywords ×3, section ×2, body ×1
- Fusion: `score = 0.6 × dense_norm + 0.4 × sparse_norm`
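A small sketch of the fusion step, assuming min-max normalisation for `dense_norm` and `sparse_norm` (the README does not say which normalisation inference.py uses) and treating a chunk missing from one retriever as scoring 0 there:

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant or empty score set maps to zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse(dense: Dict[str, float], sparse: Dict[str, float],
         w_dense: float = 0.6, w_sparse: float = 0.4) -> Dict[str, float]:
    """Weighted sum of normalised dense and sparse scores per chunk id."""
    d, s = min_max(dense), min_max(sparse)
    return {cid: w_dense * d.get(cid, 0.0) + w_sparse * s.get(cid, 0.0)
            for cid in set(d) | set(s)}


# Hypothetical raw scores for three chunks.
dense_scores = {"c1": 0.82, "c2": 0.40, "c3": 0.61}
sparse_scores = {"c1": 3.0, "c2": 9.0, "c3": 0.0}
fused = fuse(dense_scores, sparse_scores)
```

Normalising first matters because raw inner-product scores and raw BM25 scores live on different scales; without it the 0.6/0.4 weights would not mean what they appear to.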
Re-ranking bonuses applied per candidate:
- +0.05 per overlapping keyword (max 4) between query and standard's keyword list
- +0.05 per overlapping title word (max 5)
- +0.25 if ≥60% of significant title words appear in the query (strong title match)
- +0.20 if an exact IS ID from the query matches this standard
- -0.15 penalty for very short chunks (<40 body words)
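The bonuses listed above could be combined as in the sketch below. Several details are assumptions rather than documented behaviour: the stopword list used to decide which title words are "significant", the regular expression that pulls an IS number out of the query, and the function signature itself.

```python
import re
from typing import List, Set

STOPWORDS = {"a", "and", "for", "in", "of", "the", "to", "with"}  # assumed list


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_bonus(query: str, title: str, keywords: List[str],
                 standard_id: str, body_words: int) -> float:
    """Sum the re-ranking bonuses and penalty for one candidate chunk."""
    q = tokens(query)
    bonus = 0.0

    # +0.05 per keyword fully contained in the query, capped at 4.
    kw_hits = sum(1 for kw in keywords if tokens(kw) and tokens(kw) <= q)
    bonus += 0.05 * min(kw_hits, 4)

    # +0.05 per overlapping title word, capped at 5.
    title_words = tokens(title) - STOPWORDS
    title_hits = len(title_words & q)
    bonus += 0.05 * min(title_hits, 5)

    # +0.25 when at least 60% of the significant title words are in the query.
    if title_words and title_hits / len(title_words) >= 0.6:
        bonus += 0.25

    # +0.20 when the query names this standard's IS number (case-sensitive "IS").
    query_ids = set(re.findall(r"\bIS\s*(\d+)", query))
    own_id = re.search(r"\bIS\s*(\d+)", standard_id)
    if own_id and own_id.group(1) in query_ids:
        bonus += 0.20

    # -0.15 for very short chunks.
    if body_words < 40:
        bonus -= 0.15
    return bonus


strong = rerank_bonus("We manufacture 33 Grade Ordinary Portland Cement",
                      "Ordinary Portland Cement, 33 Grade",
                      ["cement", "portland", "opc"], "IS 269: 1989", 120)
by_id = rerank_bonus("IS 269 requirements", "Ordinary Portland Cement, 33 Grade",
                     [], "IS 269: 1989", 10)
```

In the first call the candidate earns 0.10 for two keywords, 0.25 for five title words and 0.25 for the strong title match. In the second, the exact-ID bonus is partly offset by the short-chunk penalty.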
Deduplication: candidates grouped by standard_id; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.
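A minimal sketch of the deduplication step, assuming each candidate is a dict with `standard_id` and `score` keys (the field names follow the output format shown later in this README; the function name is illustrative):

```python
from typing import Dict, List


def dedupe_by_standard(candidates: List[dict], top_n: int = 5) -> List[dict]:
    """Keep the best-scoring chunk per standard, then return the top-N standards."""
    best: Dict[str, dict] = {}
    for cand in candidates:
        sid = cand["standard_id"]
        if sid not in best or cand["score"] > best[sid]["score"]:
            best[sid] = cand
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)[:top_n]


candidates = [
    {"standard_id": "IS 269: 1989", "score": 0.71, "matched_section": "Scope"},
    {"standard_id": "IS 8112: 1989", "score": 0.80, "matched_section": "Scope"},
    {"standard_id": "IS 269: 1989", "score": 0.89, "matched_section": "Requirements"},
]
top = dedupe_by_standard(candidates, top_n=2)
```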
Key Design Decisions
| Decision | Rationale |
|---|---|
| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
| `inference.py` never modified | Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server uses the bridge. Both paths are identical. |
| In-memory data | 573 standards + 1,261 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
| LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. Promise.allSettled for parallel calls. Server starts and retrieval works without a GROQ_API_KEY. |
| Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |
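The weighted BM25 document can be pictured as plain token repetition before indexing. The sketch below assumes each chunk is a dict with `title`, `keywords`, `section` and `body` string fields and uses a simple lowercase tokenizer; both are illustrative choices rather than the project's actual schema.

```python
import re
from typing import Dict, List

FIELD_WEIGHTS = {"title": 4, "keywords": 3, "section": 2, "body": 1}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_document(chunk: Dict[str, str]) -> List[str]:
    """Repeat each field's tokens by its weight so BM25 term frequency reflects it."""
    doc: List[str] = []
    for field, weight in FIELD_WEIGHTS.items():
        doc.extend(tokenize(chunk.get(field, "")) * weight)
    return doc


chunk = {
    "title": "Ordinary Portland Cement, 33 Grade",
    "keywords": "cement opc",
    "section": "Scope",
    "body": "Covers manufacture of cement",
}
doc = weighted_document(chunk)
```

A list of such token lists, one per chunk, is the shape of corpus that `BM25Okapi` from rank-bm25 accepts. In this toy chunk "cement" ends up with a term frequency of 8 (4 from the title, 3 from the keywords, 1 from the body).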
Project Structure
SpecForge/
├── inference.py # Entry point for judges — do not modify
├── requirements.txt # All Python dependencies
├── scripts/
│ └── eval_script.py # Provided evaluation script (Hit@3, MRR@5, latency)
├── data/
│ └── processed/
│ ├── standards.json # 573 parsed standards (committed)
│ ├── standards_chunks.json # 1,261 RAG chunks (committed)
│ ├── public_test_set.json # 10 public evaluation queries
│ └── retrieval_results.json # Our results on public test set
├── src/
│ └── parse_bis_pdf.py # PDF → JSON parsing pipeline
└── web/
├── server/
│ ├── index.js # Express API — all routes
│ ├── start.js # Safe launcher (kills stale port process)
│ ├── .env.example # Environment template
│ ├── bridge/
│ │ └── retrieve.py # Daemon wrapping inference.py for the web server
│ └── services/
│ ├── llmService.js # Groq wrappers with fallbacks
│ └── retrieverService.js # PythonRetriever — daemon lifecycle manager
└── client/
└── src/
├── App.jsx # React router (5 pages)
├── api/standards.js # Typed fetch wrappers
├── pages/ # Home, Standards, Categories, Recommend, About
├── components/ # Navbar, Footer, StandardCard, StandardModal
└── locales/ # en/ and hi/ (English + Hindi i18n)
External APIs & Data Sources
All sources disclosed per hackathon transparency requirements.
| Source | Purpose | Key required? | Notes |
|---|---|---|---|
| BIS SP-21 (Bureau of Indian Standards, Special Publication 21) | Source dataset — 929-page PDF of building material standard summaries | No | Provided by organisers; processed JSON committed to repo |
| HuggingFace `all-MiniLM-L6-v2` | 384-dimension sentence embedding model for FAISS dense retrieval | No | Downloaded automatically by sentence-transformers on first `--build` (~90 MB) |
| Groq API (`llama-3.1-8b-instant`) | Query rewriting, per-result explanation, conversational QA | Yes: `GROQ_API_KEY` | Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key. |
No other external APIs, databases, or paid services are used.
Environment Dependencies
System Requirements
| Dependency | Minimum | Notes |
|---|---|---|
| Python | 3.10 | For retrieval pipeline and inference.py |
| Node.js | 18 | For Express server and React client |
| npm | 9 | Ships with Node 18 |
| `fuser` | any | Linux only; used by start.js to clear a stale port; install via `psmisc` if missing |
Hardware
- CPU: Any x86-64 or ARM64 — no GPU required
- RAM: 2 GB minimum; index + embeddings use ~500 MB
- GPU: Optional. A CUDA GPU reduces index build time, but `faiss-cpu` and `sentence-transformers` run fully on CPU
- Disk: ~1 GB free for venv and generated index files
Setup & Running
Step 1 — Clone
git clone https://github.com/kshitij-ka/SpecForge
cd SpecForge
Step 2 — Python virtual environment
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
requirements.txt:
pymupdf>=1.24.0
faiss-cpu>=1.7.4
rank-bm25>=0.2.2
sentence-transformers>=3.0.0
numpy>=1.26.0
`sentence-transformers` downloads `all-MiniLM-L6-v2` (~90 MB) from HuggingFace on first use.
Step 3 — Build the FAISS index
The processed JSON is committed. Index files are gitignored and must be built once locally.
source .venv/bin/activate
python inference.py --build
Encodes 1,261 chunks, writes embeddings.npy + faiss.index to data/processed/. Takes ~2 min on CPU. Subsequent starts load from cache — no rebuild needed unless chunks change.
Step 4 — Node.js dependencies
cd web/server && npm install
cd ../client && npm install
Step 5 — Environment variables
cp web/server/.env.example web/server/.env
Edit web/server/.env:
# Required for LLM explanations, query rewriting, and /api/chat
GROQ_API_KEY=your_groq_api_key_here
# Optional — defaults to 5000
PORT=5000
# Required if "python" is not Python 3 — point to your venv
PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3
`PYTHON_BIN` accepts only `"python"`, `"python3"`, or an absolute path. The server validates the value on startup and rejects anything else.
Step 6 — Start the application
Terminal 1 — API server (port 5000):
cd web/server
npm start
Wait for the log line Python retriever ready (~20 s on first boot). The server accepts queries once it appears.
Terminal 2 — Frontend dev server (port 5173):
cd web/client
npm run dev
Open http://localhost:5173. The Vite dev server proxies all /api/* requests to :5000.
Using inference.py (Judge Entry Point)
inference.py is the mandatory entry point. It runs independently of the web server.
Always activate the virtual environment first:
source .venv/bin/activate
Build / force-rebuild the index
python inference.py --build
Single query (interactive testing)
python inference.py --query "Which standard covers 33 grade OPC cement?"
Output:
============================================================
Query : Which standard covers 33 grade OPC cement?
Latency: 0.019s
Top results:
1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
Category: Cement and Concrete | Section: Scope | Score: 0.8921
2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
...
Batch evaluation (judge command)
python inference.py \
--input data/processed/public_test_set.json \
--output data/processed/retrieval_results.json
Input format:
[
{
"id": "PUB-01",
"query": "We are a small enterprise manufacturing 33 Grade OPC...",
"expected_standards": ["IS 269: 1989"]
}
]
Output format:
[
{
"id": "PUB-01",
"query": "...",
"retrieved_standards": ["IS 8112: 1989", "IS 269: 1989", "..."],
"details": [
{
"standard_id": "IS 269: 1989",
"title": "Ordinary Portland Cement, 33 Grade",
"category": "Cement and Concrete",
"score": 0.8921,
"matched_section": "Scope"
}
],
"latency_seconds": 0.019,
"expected_standards": ["IS 269: 1989"]
}
]
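The mapping from an input item to an output record can be sketched as below. The `retrieve` callable is a hypothetical stand-in for the real pipeline, and copying `expected_standards` only when it is present is an assumption about how a hidden dataset without labels would be handled.

```python
import time
from typing import Callable, Dict, List


def run_batch(items: List[dict], retrieve: Callable[[str], List[dict]]) -> List[dict]:
    """Run each query and build one output record per input item."""
    results = []
    for item in items:
        start = time.perf_counter()
        details = retrieve(item["query"])
        elapsed = time.perf_counter() - start
        record: Dict[str, object] = {
            "id": item["id"],
            "query": item["query"],
            "retrieved_standards": [d["standard_id"] for d in details],
            "details": details,
            "latency_seconds": round(elapsed, 4),
        }
        if "expected_standards" in item:  # absent in unlabeled data (assumption)
            record["expected_standards"] = item["expected_standards"]
        results.append(record)
    return results


def fake_retrieve(query: str) -> List[dict]:
    # Hypothetical retriever that always returns the same single standard.
    return [{"standard_id": "IS 269: 1989", "title": "Ordinary Portland Cement, 33 Grade",
             "category": "Cement and Concrete", "score": 0.8921, "matched_section": "Scope"}]


items = [{"id": "PUB-01", "query": "33 Grade OPC", "expected_standards": ["IS 269: 1989"]}]
records = run_batch(items, fake_retrieve)
```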
Evaluation
# Step 1: generate results
python inference.py \
--input data/processed/public_test_set.json \
--output data/processed/retrieval_results.json
# Step 2: score
python scripts/eval_script.py \
--results data/processed/retrieval_results.json
Targets and our results on the public set:
| Metric | Formula | Target | Achieved |
|---|---|---|---|
| Hit Rate @3 | queries with an expected standard in the top 3 / total queries | > 80% | 100% |
| MRR @5 | Σ(1/rank_i) / N | > 0.7 | 0.783 |
| Avg Latency | total_time / num_queries | < 5 s | ~0.019 s |
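For reference, the two ranking metrics can be computed from the results file roughly as follows. This is a sketch that compares standard IDs by exact string match; the provided scripts/eval_script.py is the authoritative implementation and may normalise IDs differently.

```python
from typing import List


def hit_rate_at_k(results: List[dict], k: int = 3) -> float:
    """Fraction of queries with at least one expected standard in the top k."""
    hits = sum(
        1 for r in results
        if set(r["retrieved_standards"][:k]) & set(r["expected_standards"])
    )
    return hits / len(results)


def mrr_at_k(results: List[dict], k: int = 5) -> float:
    """Mean of 1/rank of the first expected standard within the top k (0 if none)."""
    total = 0.0
    for r in results:
        expected = set(r["expected_standards"])
        for rank, sid in enumerate(r["retrieved_standards"][:k], start=1):
            if sid in expected:
                total += 1.0 / rank
                break
    return total / len(results)


# Two toy results: the expected standard is ranked 2nd, then 1st.
results = [
    {"retrieved_standards": ["IS 8112: 1989", "IS 269: 1989"],
     "expected_standards": ["IS 269: 1989"]},
    {"retrieved_standards": ["IS 1905: 1987"],
     "expected_standards": ["IS 1905: 1987"]},
]
hit = hit_rate_at_k(results)
mrr = mrr_at_k(results)
```

With these toy inputs the hit rate is 1.0 and the MRR is (1/2 + 1/1) / 2 = 0.75, which shows how a perfect Hit Rate @3 can coexist with an MRR below 1.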
API Reference
All endpoints are served by the Express server (default http://localhost:5000).
POST /api/recommend
Core RAG endpoint. Retrieval + optional LLM explanations.
// Request
{ "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }
// Response
{
"standards": [
{
"standard_id": "IS 1905: 1987",
"title": "Code of Practice for Structural Use of Unreinforced Masonry",
"category": "Masonry",
"score": 0.812,
"matched_section": "Fire Resistance",
"explanation": "This standard specifies..."
}
],
"latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
}
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Natural-language product description or compliance question |
| `top_n` | integer | 5 | Results to return (1–10) |
| `rewrite` | boolean | `false` | Expand query to IS-standard vocabulary via LLM before retrieval |
Rate limit: 20 req/min.
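A minimal Python client for this endpoint, using only the standard library. The helper names are illustrative, and the base URL assumes the default port; the request body mirrors the fields documented above.

```python
import json
import urllib.request

API_BASE = "http://localhost:5000"  # default from this README; adjust if PORT differs


def build_recommend_request(query: str, top_n: int = 5,
                            rewrite: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/recommend."""
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10")
    body = json.dumps({"query": query, "top_n": top_n, "rewrite": rewrite}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/api/recommend",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recommend(query: str, **kwargs) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_recommend_request(query, **kwargs), timeout=30) as resp:
        return json.load(resp)


req = build_recommend_request("fire resistance for brick masonry")
```

Calling `recommend("fire resistance for brick masonry")` against a running server should return the `standards` and `latency` structure shown above. A client that loops over many queries would need to stay under the 20 req/min limit.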
POST /api/ask
Chunk-grounded QA for a specific standard.
{ "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }
POST /api/chat
Conversational QA over the standards corpus. Requires GROQ_API_KEY; returns 503 if absent.
{ "message": "What grades of Portland cement does BIS cover?" }
GET /api/standards
Paginated list. Query params: q (keyword search), category, page (default 1), limit (default 20, max 100).
GET /api/standards/:id
Single standard. :id is URL-encoded IS ID, e.g. IS%20269%3A%201989.
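IS IDs contain spaces and a colon, so they need percent-encoding before being placed in the path. A short sketch using the standard library (the helper name is illustrative):

```python
from urllib.parse import quote, unquote


def standard_path(standard_id: str) -> str:
    """Percent-encode an IS ID for use as the :id path segment."""
    # safe="" so that every reserved character, including ":" and "/", is encoded.
    return "/api/standards/" + quote(standard_id, safe="")


path = standard_path("IS 269: 1989")
# path == "/api/standards/IS%20269%3A%201989"
round_trip = unquote(path.rsplit("/", 1)[1])
```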
GET /api/categories
All 25 material categories sorted alphabetically.
GET /api/stats
{ "standards": 573, "chunks": 1261, "categories": 25 }
Features
| Feature | Description |
|---|---|
| Hybrid RAG retrieval | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
| Re-ranking | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
| AI explanations | Groq llama-3.1-8b-instant — parallel, fallback-safe |
| Query rewriting | LLM expands natural language to IS-standard vocabulary (optional) |
| Chunk-grounded QA | Question answered from the most relevant chunk of a specific standard |
| Conversational chat | Open-ended QA against the full corpus |
| Browse & filter | Paginated standards list with keyword scoring; category gallery |
| Persistent daemon | Python retrieval process spawned once at boot; auto-restarts on crash |
| Internationalisation | UI in English and Hindi (i18next + react-i18next) |
| Rate limiting | 60 req/min global, 20 req/min on LLM endpoints (express-rate-limit; Helmet adds security headers) |
| Production-ready API | Input validation, sanitisation, structured JSON logging, latency breakdown |
Tech Stack
| Layer | Technology |
|---|---|
| Embedding model | all-MiniLM-L6-v2 via sentence-transformers |
| Dense index | FAISS IndexFlatIP (cosine via inner product) |
| Sparse index | BM25Okapi (rank-bm25) |
| PDF parsing | PyMuPDF |
| LLM | Groq API (llama-3.1-8b-instant) |
| Backend | Node.js 18 + Express 5 |
| Security middleware | Helmet, CORS, express-rate-limit |
| Frontend | React 19, Vite 8, React Router 7 |
| Internationalisation | i18next, react-i18next, i18next-browser-languagedetector |
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| `PYTHON_BIN validation failed` on start | Invalid `PYTHON_BIN` | Set to `python`, `python3`, or an absolute venv path |
| `ModuleNotFoundError: faiss` | Wrong Python binary (system Python instead of venv) | Set `PYTHON_BIN=/path/to/.venv/bin/python3` in `.env` |
| `Python daemon boot timeout (90 s)` | Index files missing | Run `python inference.py --build` with venv active |
| Results return but no `explanation` field | `GROQ_API_KEY` absent or invalid | Set key in `.env`; retrieval still works, explanations fall back silently |
| `fuser: command not found` on Linux | `psmisc` not installed | `sudo apt install psmisc` / `sudo dnf install psmisc` |
| Port 5000 still in use after crash | `fuser` not available | Manually: `kill $(lsof -t -i:5000)` |
License
See LICENSE.