SpecForge — BIS Standards Recommendation Engine
BIS × Sigma Squad AI Hackathon | Track: AI / Retrieval Augmented Generation (RAG)
An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.
Public Test Set Results
Evaluated on the 10 provided public queries. Judges run:
python inference.py --input <hidden_dataset>.json --output team_results.json
| Metric | Target | Our Score |
|---|---|---|
| Hit Rate @3 | > 80% | 100% (10/10) |
| MRR @5 | > 0.7 | 0.950 |
| Avg Latency | < 5 s | ~18 ms |
All 10 public queries returned the expected standard in the top 2 results. Average query latency is about 18 ms once the index is warm, more than 250× faster than the 5 s target.
What It Does
Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge eliminates that.
- Describe your product in plain language — e.g. "We manufacture 33 Grade Ordinary Portland Cement"
- Get ranked BIS standards with matched sections and relevance scores in milliseconds
- Read AI explanations of why each standard applies, generated by Groq LLM
The system covers all 586 unique standards across 25 building material categories from BIS SP-21 (Summaries of Indian Standards for Building Materials).
System Architecture
Data Flow
data/raw/dataset.pdf (BIS SP-21, 929 pages)
→ src/parse_bis_pdf.py
→ data/processed/standards.json 586 structured records [committed]
→ data/processed/standards_chunks.json 1,269 RAG-ready chunks [committed]
→ inference.py --build
→ data/processed/embeddings.npy dense vectors [gitignored — rebuild locally]
→ data/processed/faiss.index FAISS index [gitignored — rebuild locally]
Request Pipeline
Browser / API Client
→ POST /api/recommend { query, top_n, rewrite }
→ Express server (web/server/index.js)
├─ [optional] llmService.rewriteQuery() Groq — expands to IS-standard vocabulary
├─ retrieverService.retrieve()
│ └─ PythonRetriever singleton EventEmitter, queues concurrent requests
│ └─ bridge/retrieve.py daemon stdin/stdout newline-delimited JSON
│ └─ inference.py FAISS 0.6 + BM25 0.4 → re-rank → top-N
└─ llmService.generateExplanation() × N Promise.allSettled — parallel, non-blocking
→ JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
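The newline-delimited JSON exchange between the Express server and the Python daemon can be sketched as a simple read loop. This is a minimal illustration, not the contents of bridge/retrieve.py: the message fields (id, query, top_n, results, error) and the handler function are assumptions made for the example.

```python
import io
import json


def serve(stdin, stdout, handler):
    """Answer one JSON request per input line with one JSON response per output line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        request_id = None
        try:
            request = json.loads(line)
            request_id = request.get("id")
            response = {"id": request_id, "results": handler(request)}
        except Exception as exc:  # keep the daemon alive on bad input
            response = {"id": request_id, "error": str(exc)}
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # the parent process reads line by line


def fake_retrieve(request):
    # Hypothetical stand-in for the real retriever: returns one fixed hit.
    hits = [{"standard_id": "IS 269: 1989", "query": request["query"]}]
    return hits[: request.get("top_n", 5)]


# In-memory streams stand in for the daemon's real stdin/stdout.
requests = io.StringIO('{"id": 1, "query": "33 grade OPC", "top_n": 1}\nnot json\n')
replies = io.StringIO()
serve(requests, replies, fake_retrieve)
lines = replies.getvalue().splitlines()
```

Echoing the request id in each reply is what would let a single long-lived process serve queued concurrent requests, since the caller can match replies to pending promises.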
Chunking & Retrieval Strategy
Chunking (src/parse_bis_pdf.py):
- 4-pass boundary detection splits the 929-page PDF into per-standard records
- Pass 1–2: primary block splitting and secondary boundary recovery
- Pass 3: recovers scope text stolen by the preceding block (SP-21 PDF layout quirk)
- Pass 4: truncates next-standard content bleed at a second "1. Scope" marker
- Each standard is further split by section with 50-word overlap to prevent context loss at boundaries
- Weak chunks (<30 words) are merged with their neighbour
- Result: 1,269 chunks from 586 standards (avg 2.2 chunks/standard)
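The overlap and weak-chunk rules above can be sketched as two small helpers. This is an illustration under stated assumptions, not the code in src/parse_bis_pdf.py: the function names are hypothetical, sections are treated as plain word lists, and merging is assumed to happen before the overlap is added.

```python
def merge_weak(sections, min_words=30):
    """Fold any section shorter than `min_words` into its previous neighbour."""
    merged = []
    for words in sections:
        if merged and len(words) < min_words:
            merged[-1] = merged[-1] + words
        else:
            merged.append(list(words))
    # A weak first section has no previous neighbour, so fold it forward.
    if len(merged) > 1 and len(merged[0]) < min_words:
        merged[1] = merged[0] + merged[1]
        merged.pop(0)
    return merged


def add_overlap(sections, overlap=50):
    """Prefix each section with the last `overlap` words of the one before it."""
    chunks = []
    for i, words in enumerate(sections):
        carry = sections[i - 1][-overlap:] if i > 0 else []
        chunks.append(carry + words)
    return chunks


# Three sections of 100, 10 and 80 words; the 10-word one is "weak".
scope = [f"a{i}" for i in range(100)]
note = ["b"] * 10
tests = [f"c{i}" for i in range(80)]

chunks = add_overlap(merge_weak([scope, note, tests]))
chunk_lengths = [len(c) for c in chunks]
```

Here the weak section is absorbed into the first chunk, and the second chunk starts with the last 50 words of the merged first one.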
Hybrid Retrieval (inference.py):
- Dense: FAISS IndexFlatIP with all-MiniLM-L6-v2 embeddings (384-dim cosine similarity)
- Sparse: BM25Okapi with weighted document construction: title ×4, keywords ×3, section ×2, body ×1
- Fusion:
score = 0.6 × dense_norm + 0.4 × sparse_norm
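As a worked example of the fusion formula, the sketch below normalises each score list and combines them with the 0.6 / 0.4 weights. The README only says the scores are normalised; min-max scaling is an assumption here, and the function names are hypothetical.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """Weighted sum of normalised dense and sparse scores, candidate by candidate."""
    dense_norm = min_max(dense)
    sparse_norm = min_max(sparse)
    return [w_dense * d + w_sparse * s for d, s in zip(dense_norm, sparse_norm)]


# Three candidates: cosine similarities from FAISS and raw BM25 scores.
dense = [0.82, 0.40, 0.61]
sparse = [3.0, 12.0, 7.5]
fused = fuse(dense, sparse)
```

Normalising first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unscaled sum would let the sparse side dominate.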
Re-ranking bonuses applied per candidate:
- +0.05 per overlapping keyword (max 4) between query and standard's keyword list
- +0.05 per overlapping title word (max 5)
- +0.25 if ≥60% of significant title words appear in the query (strong title match)
- +0.20 if an exact IS ID from the query matches this standard
- +0.35 / -0.40 grade discriminator: boosts/penalises OPC-grade standards (33/43/53) when query names a specific grade
- +0.30 / -0.20 part-number discriminator: boosts matching Part N and penalises non-matching parts when query explicitly names a part number (handles "Part – 1", "PART2" etc.)
- -0.15 penalty for very short chunks (<40 body words)
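A few of the bonuses listed above can be sketched as follows. This covers only the keyword overlap, title overlap, exact IS-ID match and short-chunk penalty; the grade, part-number and strong-title rules are omitted. The candidate field names, the tokenisation, and the ID-matching regex are assumptions for illustration, not the logic in inference.py.

```python
import re


def rerank_bonus(query, candidate):
    """Return the additive bonus for one candidate chunk."""
    q_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    bonus = 0.0

    # +0.05 per overlapping keyword, capped at 4 (single-token keywords assumed).
    kw_hits = len(q_words & {k.lower() for k in candidate["keywords"]})
    bonus += 0.05 * min(kw_hits, 4)

    # +0.05 per overlapping title word, capped at 5.
    title_words = set(re.findall(r"[a-z0-9]+", candidate["title"].lower()))
    bonus += 0.05 * min(len(q_words & title_words), 5)

    # +0.20 when the query names this standard's IS number (upper-case "IS" only,
    # so the ordinary word "is" followed by a number is not mistaken for an ID).
    ids_in_query = set(re.findall(r"\bIS\s*(\d+)", query))
    cand_num = re.search(r"\bIS\s*(\d+)", candidate["standard_id"])
    if cand_num and cand_num.group(1) in ids_in_query:
        bonus += 0.20

    # -0.15 for very short chunks (<40 body words).
    if len(candidate["body"].split()) < 40:
        bonus -= 0.15
    return bonus


candidate = {
    "standard_id": "IS 269: 1989",
    "title": "Ordinary Portland Cement, 33 Grade",
    "keywords": ["cement", "portland", "opc"],
    "body": "short scope text",
}
bonus = rerank_bonus("Does IS 269 cover 33 grade ordinary portland cement?", candidate)
```

For this query the candidate collects two keyword hits (+0.10), five title-word hits (+0.25) and the ID match (+0.20), then loses 0.15 for its short body, for a net bonus of about 0.40.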
Post-grouping Part disambiguation: when multiple parts of the same IS base number survive into the candidate set with identical titles, IDF-weighted discriminating keyword scores break the tie — rarer corpus terms (e.g. "lightweight") carry proportionally more weight.
Deduplication: candidates grouped by standard_id; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.
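The deduplication step amounts to a group-by on standard_id that keeps the highest-scoring chunk. A minimal sketch, with a hypothetical function name and candidate dictionaries shaped like the details entries in the output format:

```python
def best_per_standard(candidates, top_n=5):
    """Keep the best-scoring chunk per standard_id, then return the top N."""
    best = {}
    for cand in candidates:
        sid = cand["standard_id"]
        if sid not in best or cand["score"] > best[sid]["score"]:
            best[sid] = cand
    ranked = sorted(best.values(), key=lambda c: c["score"], reverse=True)
    return ranked[:top_n]


candidates = [
    {"standard_id": "IS 269: 1989", "matched_section": "Scope", "score": 0.89},
    {"standard_id": "IS 8112: 1989", "matched_section": "Scope", "score": 0.71},
    {"standard_id": "IS 269: 1989", "matched_section": "Requirements", "score": 0.64},
]
top = best_per_standard(candidates, top_n=2)
```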
Key Design Decisions
| Decision | Rationale |
|---|---|
| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
| inference.py never modified | Bridge pattern: bridge/retrieve.py imports inference.py as a module. Judges run inference.py directly; the web server uses the bridge. Both paths are identical. |
| In-memory data | 586 standards + 1,269 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
| LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. Promise.allSettled for parallel calls. Server starts and retrieval works without a GROQ_API_KEY. |
| Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |
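The weighted BM25 document can be approximated by repeating each field's tokens according to its weight before indexing. The sketch below assumes whitespace tokenisation and record fields named title, keywords, section and body; both are illustrative choices rather than the tokeniser or schema used in inference.py.

```python
def weighted_tokens(record):
    """Build a BM25 token list: title ×4, keywords ×3, section ×2, body ×1."""
    weights = (("title", 4), ("keywords", 3), ("section", 2), ("body", 1))
    tokens = []
    for field, repeat in weights:
        value = record.get(field, "")
        text = " ".join(value) if isinstance(value, list) else value
        tokens.extend(text.lower().split() * repeat)
    return tokens


record = {
    "title": "Portland Cement",
    "keywords": ["opc"],
    "section": "Scope",
    "body": "covers cement",
}
tokens = weighted_tokens(record)
```

A list of such token lists, one per chunk, is the shape that rank-bm25's BM25Okapi constructor expects as its tokenised corpus. Repetition raises the term frequency of title words, so a query that names the standard scores higher against that chunk than against chunks that only mention the term in body text.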
Project Structure
SpecForge/
├── inference.py # Entry point for judges
├── requirements.txt # All Python dependencies
├── eval_script.py # Provided evaluation script (Hit@3, MRR@5, latency)
├── data/
│ └── processed/
│ ├── standards.json # 586 parsed standards (committed)
│ ├── standards_chunks.json # 1,269 RAG chunks (committed)
│ ├── public_test_set.json # 10 public evaluation queries
│ └── retrieval_results.json # Our results on public test set
├── src/
│ └── parse_bis_pdf.py # PDF → JSON parsing pipeline
└── web/
├── server/
│ ├── index.js # Express API — all routes
│ ├── start.js # Safe launcher (kills stale port process)
│ ├── .env.example # Environment template
│ ├── bridge/
│ │ └── retrieve.py # Daemon wrapping inference.py for the web server
│ └── services/
│ ├── llmService.js # Groq wrappers with fallbacks
│ └── retrieverService.js # PythonRetriever — daemon lifecycle manager
└── client/
└── src/
├── App.jsx # React router (5 pages)
├── api/standards.js # Typed fetch wrappers
├── pages/ # Home, Standards, Categories, Recommend, About
├── components/ # Navbar, Footer, StandardCard, StandardModal
└── locales/ # en/ and hi/ (English + Hindi i18n)
External APIs & Data Sources
All sources disclosed per hackathon transparency requirements.
| Source | Purpose | Key required? | Notes |
|---|---|---|---|
| BIS SP-21 (Bureau of Indian Standards, Special Publication 21) | Source dataset — 929-page PDF of building material standard summaries | No | Provided by organisers; processed JSON committed to repo |
| HuggingFace all-MiniLM-L6-v2 | 384-dimension sentence embedding model for FAISS dense retrieval | No | Downloaded automatically by sentence-transformers on first --build (~90 MB) |
| Groq API (llama-3.1-8b-instant) | Query rewriting, per-result explanation, conversational QA | Yes (GROQ_API_KEY) | Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key. |
No other external APIs, databases, or paid services are used.
Environment Dependencies
System Requirements
| Dependency | Minimum | Notes |
|---|---|---|
| Python | 3.10 | For retrieval pipeline and inference.py |
| Node.js | 18 | For Express server and React client |
| npm | 9 | Ships with Node 18 |
| fuser | any | Linux: used by start.js to clear stale port; install via psmisc if missing |
Hardware
- CPU: Any x86-64 or ARM64 — no GPU required
- RAM: 2 GB minimum; index + embeddings use ~500 MB
- GPU: Optional. A CUDA GPU reduces index build time, but faiss-cpu and sentence-transformers run fully on CPU
- Disk: ~1 GB free for venv and generated index files
Setup & Running
Step 1 — Clone
git clone https://github.com/kshitij-ka/SpecForge
cd SpecForge
Step 2 — Python virtual environment
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
requirements.txt:
pymupdf>=1.24.0
faiss-cpu>=1.7.4
rank-bm25>=0.2.2
sentence-transformers>=3.0.0
numpy>=1.26.0
sentence-transformers downloads all-MiniLM-L6-v2 (~90 MB) from HuggingFace on first use.
Step 3 — Build the FAISS index
The processed JSON is committed. Index files are gitignored and must be built once locally.
source .venv/bin/activate
python inference.py --build
Encodes 1,269 chunks, writes embeddings.npy + faiss.index to data/processed/. Takes ~2 min on CPU. Subsequent starts load from cache — no rebuild needed unless chunks change.
Step 4 — Node.js dependencies
cd web/server && npm install
cd ../client && npm install
Step 5 — Environment variables
cp web/server/.env.example web/server/.env
Edit web/server/.env:
# Required for LLM explanations, query rewriting, and /api/chat
GROQ_API_KEY=your_groq_api_key_here
# Optional — defaults to 5000
PORT=5000
# Required if "python" is not Python 3 — point to your venv
PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3
PYTHON_BIN accepts only "python", "python3", or an absolute path. The server validates and rejects arbitrary values on startup.
Step 6 — Start the application
Terminal 1 — API server (port 5000):
cd web/server
npm start
Wait for the log line "Python retriever ready" (~20 s on first boot). The server accepts queries once it appears.
Terminal 2 — Frontend dev server (port 5173):
cd web/client
npm run dev
Open http://localhost:5173. The Vite dev server proxies all /api/* requests to :5000.
Using inference.py (Judge Entry Point)
inference.py is the mandatory entry point. It runs independently of the web server.
Always activate the virtual environment first:
source .venv/bin/activate
Build / force-rebuild the index
python inference.py --build
Single query (interactive testing)
python inference.py --query "Which standard covers 33 grade OPC cement?"
Output:
============================================================
Query : Which standard covers 33 grade OPC cement?
Latency: 0.019s
Top results:
1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
Category: Cement and Concrete | Section: Scope | Score: 0.8921
2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
...
Batch evaluation (judge command)
python inference.py \
--input data/processed/public_test_set.json \
--output data/processed/retrieval_results.json
Input format:
[
{
"id": "PUB-01",
"query": "We are a small enterprise manufacturing 33 Grade OPC...",
"expected_standards": ["IS 269: 1989"]
}
]
Output format:
[
{
"id": "PUB-01",
"query": "...",
"retrieved_standards": ["IS 8112: 1989", "IS 269: 1989", "..."],
"details": [
{
"standard_id": "IS 269: 1989",
"title": "Ordinary Portland Cement, 33 Grade",
"category": "Cement and Concrete",
"score": 0.8921,
"matched_section": "Scope"
}
],
"latency_seconds": 0.019,
"expected_standards": ["IS 269: 1989"]
}
]
Evaluation
# Step 1: generate results
python inference.py \
--input data/processed/public_test_set.json \
--output data/processed/retrieval_results.json
# Step 2: score
python eval_script.py \
--results data/processed/retrieval_results.json
Targets and our results on the public set:
| Metric | Formula | Target | Achieved |
|---|---|---|---|
| Hit Rate @3 | queries with an expected standard in the top 3 / total queries | > 80% | 100% |
| MRR @5 | Σ(1/rank_i) / N | > 0.7 | 0.950 |
| Avg Latency | total_time / num_queries | < 5 s | ~0.018 s |
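The two ranking metrics in the table can be computed directly from the output file format shown earlier. This is a sketch of the formulas only, assuming exact string matching between expected and retrieved IDs; the provided eval_script.py may normalise IDs differently.

```python
def hit_rate_at_k(results, k=3):
    """Fraction of queries with at least one expected standard in the top k."""
    hits = sum(
        1
        for r in results
        if set(r["expected_standards"]) & set(r["retrieved_standards"][:k])
    )
    return hits / len(results)


def mrr_at_k(results, k=5):
    """Mean of 1/rank of the first expected standard within the top k (0 if absent)."""
    total = 0.0
    for r in results:
        expected = set(r["expected_standards"])
        for rank, sid in enumerate(r["retrieved_standards"][:k], start=1):
            if sid in expected:
                total += 1.0 / rank
                break
    return total / len(results)


# Illustrative data: nine queries answered at rank 1 and one at rank 2.
rank_one = {"expected_standards": ["A"], "retrieved_standards": ["A", "B"]}
rank_two = {"expected_standards": ["A"], "retrieved_standards": ["B", "A"]}
results = [rank_one] * 9 + [rank_two]

hit_rate = hit_rate_at_k(results)
mrr = mrr_at_k(results)
```

With nine first-place hits and one second-place hit, the MRR is (9 × 1 + 1 × 0.5) / 10 = 0.95, which is consistent with the reported score and with every expected standard appearing in the top 2.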
API Reference
All endpoints on Express server (default http://localhost:5000).
POST /api/recommend
Core RAG endpoint. Retrieval + optional LLM explanations.
// Request
{ "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }
// Response
{
"standards": [
{
"standard_id": "IS 1905: 1987",
"title": "Code of Practice for Structural Use of Unreinforced Masonry",
"category": "Masonry",
"score": 0.812,
"matched_section": "Fire Resistance",
"explanation": "This standard specifies..."
}
],
"latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
}
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | Natural-language product description or compliance question |
| top_n | integer | 5 | Results to return (1–10) |
| rewrite | boolean | false | Expand query to IS-standard vocabulary via LLM before retrieval |
Rate limit: 20 req/min.
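A small Python client for this endpoint might look like the sketch below, using only the standard library. The helper names and the 30-second timeout are assumptions; the URL, fields and defaults follow the request table above.

```python
import json
import urllib.request


def build_recommend_request(query, top_n=5, rewrite=False,
                            base_url="http://localhost:5000"):
    """Build (but do not send) a POST /api/recommend request."""
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10")
    body = json.dumps({"query": query, "top_n": top_n, "rewrite": rewrite})
    return urllib.request.Request(
        base_url + "/api/recommend",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recommend(query, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_recommend_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


request = build_recommend_request("fire resistance for brick masonry")
```

With the server running, recommend("fire resistance for brick masonry", top_n=3) would return the standards list and latency breakdown shown in the response example. Callers that loop over many queries should stay under the 20 req/min limit.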
POST /api/ask
Chunk-grounded QA for a specific standard.
{ "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }
POST /api/chat
Conversational QA over the standards corpus. Requires GROQ_API_KEY; returns 503 if absent.
{ "message": "What grades of Portland cement does BIS cover?" }
GET /api/standards
Paginated list. Query params: q (keyword search), category, page (default 1), limit (default 20, max 100).
GET /api/standards/:id
Single standard. :id is URL-encoded IS ID, e.g. IS%20269%3A%201989.
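The URL encoding of an IS ID can be produced with the standard library, as in this short sketch. The helper name is hypothetical; passing safe="" makes sure the colon and spaces are all percent-encoded.

```python
from urllib.parse import quote, unquote


def standard_path(standard_id):
    """Return the request path for GET /api/standards/:id."""
    return "/api/standards/" + quote(standard_id, safe="")


path = standard_path("IS 269: 1989")
# path is "/api/standards/IS%20269%3A%201989"
round_trip = unquote(path.rsplit("/", 1)[1])
```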
GET /api/categories
All 25 material categories sorted alphabetically.
GET /api/stats
{ "standards": 586, "chunks": 1269, "categories": 25 }
Features
| Feature | Description |
|---|---|
| Hybrid RAG retrieval | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
| Re-ranking | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
| Part-number disambiguation | Explicit Part N in query boosts the matching part (+0.30) and penalises sibling parts (-0.20); handles em-dash/PART2 variants |
| AI explanations | Groq llama-3.1-8b-instant — parallel, fallback-safe |
| Query rewriting | LLM expands natural language to IS-standard vocabulary (optional) |
| Chunk-grounded QA | Question answered from the most relevant chunk of a specific standard |
| Conversational chat | Open-ended QA against the full corpus |
| Browse & filter | Paginated standards list with keyword scoring; category gallery |
| Persistent daemon | Python retrieval process spawned once at boot; auto-restarts on crash |
| Internationalisation | UI in English and Hindi (i18next + react-i18next) |
| Rate limiting | 60 req/min global, 20 req/min on LLM endpoints (Helmet + express-rate-limit) |
| Production-ready API | Input validation, sanitisation, structured JSON logging, latency breakdown |
Tech Stack
| Layer | Technology |
|---|---|
| Embedding model | all-MiniLM-L6-v2 via sentence-transformers |
| Dense index | FAISS IndexFlatIP (cosine via inner product) |
| Sparse index | BM25Okapi (rank-bm25) |
| PDF parsing | PyMuPDF |
| LLM | Groq API (llama-3.1-8b-instant) |
| Backend | Node.js 18 + Express 5 |
| Security middleware | Helmet, CORS, express-rate-limit |
| Frontend | React 19, Vite 8, React Router 7 |
| Internationalisation | i18next, react-i18next, i18next-browser-languagedetector |
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| PYTHON_BIN validation failed on start | Invalid PYTHON_BIN | Set to python, python3, or absolute venv path |
| ModuleNotFoundError: faiss | Wrong Python binary (system Python instead of venv) | Set PYTHON_BIN=/path/to/.venv/bin/python3 in .env |
| Python daemon boot timeout (90 s) | Index files missing | Run python inference.py --build with venv active |
| Results return but no explanation field | GROQ_API_KEY absent or invalid | Set key in .env; retrieval still works, explanations fall back silently |
| fuser: command not found on Linux | psmisc not installed | sudo apt install psmisc / sudo dnf install psmisc |
| Port 5000 still in use after crash | fuser not available | Manually: kill $(lsof -t -i:5000) |
License
See LICENSE.