SpecForge — BIS Standards Recommendation Engine

BIS × Sigma Squad AI Hackathon | Track: AI / Retrieval Augmented Generation (RAG)

An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.


Public Test Set Results

Evaluated on the 10 provided public queries. Judges run: python inference.py --input <hidden_dataset>.json --output team_results.json

Metric | Target | Our Score
Hit Rate @3 | > 80% | 100% (10/10)
MRR @5 | > 0.7 | 0.783
Avg Latency | < 5 s | ~19 ms

All 10 public queries returned the expected standard in the top-3 results. Average query latency is about 19 ms once the index is warm, roughly 260× below the 5 s target.


What It Does

Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge eliminates that.

  1. Describe your product in plain language — e.g. "We manufacture 33 Grade Ordinary Portland Cement"
  2. Get ranked BIS standards with matched sections and relevance scores in milliseconds
  3. Read AI explanations of why each standard applies, generated by Groq LLM

The system covers all 573 unique standards across 25 building material categories from BIS SP-21 (Summaries of Indian Standards for Building Materials).


System Architecture

Data Flow

data/raw/dataset.pdf  (BIS SP-21, 929 pages)
  → src/parse_bis_pdf.py
  → data/processed/standards.json          573 structured records  [committed]
  → data/processed/standards_chunks.json   1,261 RAG-ready chunks  [committed]
  → inference.py --build
  → data/processed/embeddings.npy          dense vectors           [gitignored — rebuild locally]
  → data/processed/faiss.index             FAISS index             [gitignored — rebuild locally]

Request Pipeline

Browser / API Client
  → POST /api/recommend  { query, top_n, rewrite }
  → Express server (web/server/index.js)
      ├─ [optional] llmService.rewriteQuery()        Groq — expands to IS-standard vocabulary
      ├─ retrieverService.retrieve()
      │     └─ PythonRetriever singleton              EventEmitter, queues concurrent requests
      │           └─ bridge/retrieve.py daemon        stdin/stdout newline-delimited JSON
      │                 └─ inference.py               FAISS 0.6 + BM25 0.4 → re-rank → top-N
      └─ llmService.generateExplanation() × N        Promise.allSettled — parallel, non-blocking
  → JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
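
The daemon step in the pipeline above (one long-lived Python process exchanging newline-delimited JSON over stdin/stdout) can be sketched as follows. This is a minimal illustration, not the contents of bridge/retrieve.py: the function name serve, the response fields ok and error, and the stand-in retriever are all assumptions.

```python
import io
import json
from typing import Callable, TextIO


def serve(retrieve: Callable[[str, int], list], stdin: TextIO, stdout: TextIO) -> None:
    """Read one JSON request per line and write one JSON response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between requests
        try:
            req = json.loads(line)
            results = retrieve(req["query"], int(req.get("top_n", 5)))
            resp = {"ok": True, "standards": results}
        except Exception as exc:
            # Report the failure on the same channel instead of crashing the daemon.
            resp = {"ok": False, "error": str(exc)}
        stdout.write(json.dumps(resp) + "\n")
        stdout.flush()  # the parent process reads responses line by line


# Demo with in-memory streams and a stand-in retriever (hypothetical).
def fake_retrieve(query: str, top_n: int) -> list:
    return [{"standard_id": "IS 269: 1989", "score": 0.89}][:top_n]


requests_in = io.StringIO('{"query": "33 grade OPC", "top_n": 1}\nnot json\n')
responses_out = io.StringIO()
serve(fake_retrieve, requests_in, responses_out)
responses = [json.loads(raw) for raw in responses_out.getvalue().splitlines()]
```

In a real daemon the two streams would be sys.stdin and sys.stdout. Flushing after every response matters because the parent process waits on complete lines.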

Chunking & Retrieval Strategy

Chunking (src/parse_bis_pdf.py):

  • 2-pass boundary detection splits the 929-page PDF into per-standard records
  • Each standard is further split by section with 50-word overlap to prevent context loss at boundaries
  • Weak chunks (<30 words) are merged with their neighbour
  • Result: 1,261 chunks from 573 standards (avg 2.2 chunks/standard)
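
The overlap and merge rules above can be sketched roughly as below. The 50-word overlap and the 30-word threshold come from the list; the 200-word window size and both function names are assumptions made for illustration, and the real parser in src/parse_bis_pdf.py may differ.

```python
def chunk_words(words, size=200, overlap=50):
    """Split a word list into windows of `size` words sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def merge_weak(sections, min_words=30):
    """Fold any section shorter than `min_words` into its neighbour."""
    merged = []
    for section in sections:
        if merged and (len(section) < min_words or len(merged[-1]) < min_words):
            merged[-1] = merged[-1] + section
        else:
            merged.append(section)
    return merged


words = [f"w{i}" for i in range(450)]
windows = chunk_words(words)                      # windows start at 0, 150, 300
sections = merge_weak([["a"] * 10, ["b"] * 40, ["c"] * 5])
```

Here merge_weak is meant to run on whole sections before windowing, so merging never duplicates the overlapping words.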

Hybrid Retrieval (inference.py):

  • Dense: FAISS IndexFlatIP with all-MiniLM-L6-v2 embeddings (384-dim cosine similarity)
  • Sparse: BM25Okapi with weighted document construction — title ×4, keywords ×3, section ×2, body ×1
  • Fusion: score = 0.6 × dense_norm + 0.4 × sparse_norm
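
As a worked example of the fusion formula, the sketch below normalises each score list and combines them with the 0.6 / 0.4 weights. Min-max scaling is an assumption: the source only names dense_norm and sparse_norm without saying how they are computed.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """score = w_dense * dense_norm + w_sparse * sparse_norm, per candidate."""
    return [
        w_dense * d + w_sparse * s
        for d, s in zip(min_max(dense), min_max(sparse))
    ]


dense_scores = [0.82, 0.40, 0.61]   # e.g. cosine similarities from FAISS
sparse_scores = [12.0, 15.0, 3.0]   # e.g. raw BM25 scores
fused = fuse(dense_scores, sparse_scores)
# Candidate 0: 0.6 * 1.0 + 0.4 * 0.75 = 0.9
```

Normalising first keeps the unbounded BM25 scores from swamping the cosine similarities, which sit on a much smaller scale.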

Re-ranking bonuses applied per candidate:

  • +0.05 per overlapping keyword (max 4) between query and standard's keyword list
  • +0.05 per overlapping title word (max 5)
  • +0.25 if ≥60% of significant title words appear in the query (strong title match)
  • +0.20 if an exact IS ID from the query matches this standard
  • -0.15 penalty for very short chunks (<40 body words)
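
The bonus rules above can be read as a single scoring function. The sketch below is one possible interpretation, not the code in inference.py: it assumes "max 4" and "max 5" cap the number of counted overlaps, treats title words longer than three characters as "significant", matches single-word keywords only, and compares IS IDs by their leading number. The function name and parameters are hypothetical.

```python
import re

TOKEN = re.compile(r"[a-z0-9]+")


def rerank_bonus(query, title, keywords, standard_id, body_word_count):
    """Return the additive re-ranking bonus for one candidate chunk."""
    q_tokens = set(TOKEN.findall(query.lower()))
    title_tokens = set(TOKEN.findall(title.lower()))
    bonus = 0.0

    # +0.05 per keyword found in the query, counting at most 4.
    kw_hits = sum(1 for kw in keywords if kw.lower() in q_tokens)
    bonus += 0.05 * min(kw_hits, 4)

    # +0.05 per title word found in the query, counting at most 5.
    bonus += 0.05 * min(len(title_tokens & q_tokens), 5)

    # +0.25 when at least 60% of significant title words are in the query.
    significant = {t for t in title_tokens if len(t) > 3}  # assumed definition
    if significant and len(significant & q_tokens) / len(significant) >= 0.6:
        bonus += 0.25

    # +0.20 when the query names this standard's IS number.
    id_match = re.match(r"is\s*(\d+)", standard_id.lower())
    query_ids = set(re.findall(r"\bis\s*(\d+)", query.lower()))
    if id_match and id_match.group(1) in query_ids:
        bonus += 0.20

    # -0.15 for very short chunks.
    if body_word_count < 40:
        bonus -= 0.15
    return bonus


strong = rerank_bonus(
    "33 grade ordinary portland cement IS 269",
    "Ordinary Portland Cement, 33 Grade",
    ["cement", "opc", "portland"],
    "IS 269: 1989",
    120,
)  # 0.10 keywords + 0.25 title words + 0.25 strong title + 0.20 ID
weak = rerank_bonus("clay bricks", "Ordinary Portland Cement, 33 Grade", [], "IS 269: 1989", 20)
```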

Deduplication: candidates grouped by standard_id; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.
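
A minimal sketch of this deduplication step, assuming each candidate is a dict with standard_id and score fields (the field names mirror the output format; the function name is hypothetical):

```python
def dedupe_by_standard(candidates, top_n=5):
    """Keep the best-scoring chunk per standard, then return the top N."""
    best = {}
    for cand in candidates:
        sid = cand["standard_id"]
        if sid not in best or cand["score"] > best[sid]["score"]:
            best[sid] = cand
    ranked = sorted(best.values(), key=lambda c: c["score"], reverse=True)
    return ranked[:top_n]


chunks = [
    {"standard_id": "IS 269: 1989", "matched_section": "Scope", "score": 0.89},
    {"standard_id": "IS 8112: 1989", "matched_section": "Scope", "score": 0.74},
    {"standard_id": "IS 269: 1989", "matched_section": "Requirements", "score": 0.61},
]
unique = dedupe_by_standard(chunks, top_n=2)
```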

Key Design Decisions

Decision | Rationale
Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot and queue all requests through a single process, so no query pays a cold start.
inference.py never modified | Bridge pattern: bridge/retrieve.py imports inference.py as a module. Judges run inference.py directly; the web server uses the bridge. Both paths run the same retrieval code.
In-memory data | 573 standards + 1,261 chunks fit comfortably in RAM. No database dependency, no I/O per request.
LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. Promise.allSettled for parallel calls. Server starts and retrieval works without a GROQ_API_KEY.
Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominate body-text noise, which is critical for the BIS domain where standard names are precise.
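
The weighted BM25 document can be pictured as plain token repetition, using the ×4 / ×3 / ×2 / ×1 weights stated under Hybrid Retrieval. The sketch below is an assumption about how such a document might be assembled; the function name and whitespace tokeniser are hypothetical.

```python
def weighted_tokens(title, keywords, section, body):
    """Build one BM25 document: title x4, keywords x3, section x2, body x1."""
    def tok(text):
        return text.lower().split()

    return (
        tok(title) * 4
        + tok(" ".join(keywords)) * 3
        + tok(section) * 2
        + tok(body)
    )


doc = weighted_tokens(
    title="Ordinary Portland Cement",
    keywords=["cement", "opc"],
    section="Scope",
    body="covers cement manufacture",
)
# "cement" appears 4 times from the title, 3 from keywords, 1 from the body.
```

A list of such token lists is the corpus shape that rank-bm25's BM25Okapi expects, so term frequency in the repeated fields directly raises the BM25 score for title and keyword matches.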

Project Structure

SpecForge/
├── inference.py                         # Entry point for judges — do not modify
├── requirements.txt                     # All Python dependencies
├── scripts/
│   └── eval_script.py                   # Provided evaluation script (Hit@3, MRR@5, latency)
├── data/
│   └── processed/
│       ├── standards.json               # 573 parsed standards (committed)
│       ├── standards_chunks.json        # 1,261 RAG chunks (committed)
│       ├── public_test_set.json         # 10 public evaluation queries
│       └── retrieval_results.json       # Our results on public test set
├── src/
│   └── parse_bis_pdf.py                 # PDF → JSON parsing pipeline
└── web/
    ├── server/
    │   ├── index.js                     # Express API — all routes
    │   ├── start.js                     # Safe launcher (kills stale port process)
    │   ├── .env.example                 # Environment template
    │   ├── bridge/
    │   │   └── retrieve.py              # Daemon wrapping inference.py for the web server
    │   └── services/
    │       ├── llmService.js            # Groq wrappers with fallbacks
    │       └── retrieverService.js      # PythonRetriever — daemon lifecycle manager
    └── client/
        └── src/
            ├── App.jsx                  # React router (5 pages)
            ├── api/standards.js         # Typed fetch wrappers
            ├── pages/                   # Home, Standards, Categories, Recommend, About
            ├── components/              # Navbar, Footer, StandardCard, StandardModal
            └── locales/                 # en/ and hi/ (English + Hindi i18n)

External APIs & Data Sources

All sources disclosed per hackathon transparency requirements.

Source | Purpose | Key required? | Notes
BIS SP-21 (Bureau of Indian Standards, Special Publication 21) | Source dataset: 929-page PDF of building material standard summaries | No | Provided by organisers; processed JSON committed to repo
HuggingFace all-MiniLM-L6-v2 | 384-dimension sentence embedding model for FAISS dense retrieval | No | Downloaded automatically by sentence-transformers on first --build (~90 MB)
Groq API (llama-3.1-8b-instant) | Query rewriting, per-result explanation, conversational QA | Yes (GROQ_API_KEY) | Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key.

No other external APIs, databases, or paid services are used.


Environment Dependencies

System Requirements

Dependency | Minimum | Notes
Python | 3.10 | For retrieval pipeline and inference.py
Node.js | 18 | For Express server and React client
npm | 9 | Ships with Node 18
fuser | any | Linux: used by start.js to clear a stale port; install via psmisc if missing

Hardware

  • CPU: Any x86-64 or ARM64 — no GPU required
  • RAM: 2 GB minimum; index + embeddings use ~500 MB
  • GPU: Optional — a CUDA GPU reduces index build time but faiss-cpu and sentence-transformers run fully on CPU
  • Disk: ~1 GB free for venv and generated index files

Setup & Running

Step 1 — Clone

git clone https://github.com/kshitij-ka/SpecForge
cd SpecForge

Step 2 — Python virtual environment

python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install --upgrade pip
pip install -r requirements.txt

requirements.txt:

pymupdf>=1.24.0
faiss-cpu>=1.7.4
rank-bm25>=0.2.2
sentence-transformers>=3.0.0
numpy>=1.26.0

sentence-transformers downloads all-MiniLM-L6-v2 (~90 MB) from HuggingFace on first use.

Step 3 — Build the FAISS index

The processed JSON is committed. Index files are gitignored and must be built once locally.

source .venv/bin/activate
python inference.py --build

Encodes 1,261 chunks, writes embeddings.npy + faiss.index to data/processed/. Takes ~2 min on CPU. Subsequent starts load from cache — no rebuild needed unless chunks change.

Step 4 — Node.js dependencies

cd web/server && npm install
cd ../client && npm install

Step 5 — Environment variables

cp web/server/.env.example web/server/.env

Edit web/server/.env:

# Required for LLM explanations, query rewriting, and /api/chat
GROQ_API_KEY=your_groq_api_key_here

# Optional — defaults to 5000
PORT=5000

# Required if "python" is not Python 3 — point to your venv
PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3

PYTHON_BIN accepts only "python", "python3", or an absolute path. The server validates and rejects arbitrary values on startup.
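
The rule can be stated as a small predicate. The server performs this check in JavaScript; the Python version below only illustrates the stated rule, and the function name is hypothetical. It uses POSIX path semantics, so Windows-style absolute paths are not covered.

```python
import posixpath

ALLOWED_NAMES = {"python", "python3"}


def is_valid_python_bin(value: str) -> bool:
    """Accept the two bare interpreter names or any absolute POSIX path."""
    return value in ALLOWED_NAMES or posixpath.isabs(value)


checks = {
    "python3": is_valid_python_bin("python3"),
    "/path/to/SpecForge/.venv/bin/python3": is_valid_python_bin(
        "/path/to/SpecForge/.venv/bin/python3"
    ),
    "./python": is_valid_python_bin("./python"),
    "python3; echo hi": is_valid_python_bin("python3; echo hi"),
}
```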

Step 6 — Start the application

Terminal 1 — API server (port 5000):

cd web/server
npm start

Wait for the log line "Python retriever ready" (~20 s on first boot). The server accepts queries from that point on.

Terminal 2 — Frontend dev server (port 5173):

cd web/client
npm run dev

Open http://localhost:5173. The Vite dev server proxies all /api/* requests to :5000.


Using inference.py (Judge Entry Point)

inference.py is the mandatory entry point. It runs independently of the web server.

Always activate the virtual environment first: source .venv/bin/activate

Build / force-rebuild the index

python inference.py --build

Single query (interactive testing)

python inference.py --query "Which standard covers 33 grade OPC cement?"

Output:

============================================================
Query : Which standard covers 33 grade OPC cement?
Latency: 0.019s

Top results:
  1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
     Category: Cement and Concrete  |  Section: Scope  |  Score: 0.8921
  2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
     ...

Batch evaluation (judge command)

python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json

Input format:

[
  {
    "id": "PUB-01",
    "query": "We are a small enterprise manufacturing 33 Grade OPC...",
    "expected_standards": ["IS 269: 1989"]
  }
]

Output format:

[
  {
    "id": "PUB-01",
    "query": "...",
    "retrieved_standards": ["IS 8112: 1989", "IS 269: 1989", "..."],
    "details": [
      {
        "standard_id": "IS 269: 1989",
        "title": "Ordinary Portland Cement, 33 Grade",
        "category": "Cement and Concrete",
        "score": 0.8921,
        "matched_section": "Scope"
      }
    ],
    "latency_seconds": 0.019,
    "expected_standards": ["IS 269: 1989"]
  }
]

Evaluation

# Step 1: generate results
python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json

# Step 2: score
python scripts/eval_script.py \
  --results data/processed/retrieval_results.json

Targets and our results on the public set:

Metric | Formula | Target | Achieved
Hit Rate @3 | queries with an expected standard in the top 3 / total queries | > 80% | 100%
MRR @5 | Σ(1/rank_i) / N | > 0.7 | 0.783
Avg Latency | total_time / num_queries | < 5 s | ~0.019 s
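
For reference, the two ranking metrics can be computed from the output format shown earlier roughly as follows. This is an independent sketch, not the provided scripts/eval_script.py; it assumes exact string matching of standard IDs and that a query with no hit in the top 5 contributes 0 to MRR.

```python
def hit_rate_at_k(results, k=3):
    """Fraction of queries with at least one expected standard in the top k."""
    hits = sum(
        1
        for r in results
        if any(s in r["expected_standards"] for s in r["retrieved_standards"][:k])
    )
    return hits / len(results)


def mrr_at_k(results, k=5):
    """Mean of 1/rank of the first expected standard within the top k."""
    total = 0.0
    for r in results:
        for rank, sid in enumerate(r["retrieved_standards"][:k], start=1):
            if sid in r["expected_standards"]:
                total += 1.0 / rank
                break
    return total / len(results)


sample = [
    {"retrieved_standards": ["IS 269: 1989", "IS 8112: 1989"],
     "expected_standards": ["IS 269: 1989"]},   # rank 1 -> 1.0
    {"retrieved_standards": ["IS 8112: 1989", "IS 269: 1989"],
     "expected_standards": ["IS 269: 1989"]},   # rank 2 -> 0.5
]
hit3 = hit_rate_at_k(sample)
mrr5 = mrr_at_k(sample)
```

The sample shows why a 100% Hit Rate @3 can coexist with an MRR below 1.0: a correct standard at rank 2 or 3 still counts as a hit but contributes only 1/2 or 1/3 to the reciprocal-rank sum.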

API Reference

All endpoints are served by the Express server (default http://localhost:5000).

POST /api/recommend

Core RAG endpoint. Retrieval + optional LLM explanations.

// Request
{ "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }

// Response
{
  "standards": [
    {
      "standard_id": "IS 1905: 1987",
      "title": "Code of Practice for Structural Use of Unreinforced Masonry",
      "category": "Masonry",
      "score": 0.812,
      "matched_section": "Fire Resistance",
      "explanation": "This standard specifies..."
    }
  ],
  "latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
}

Field | Type | Default | Description
query | string | required | Natural-language product description or compliance question
top_n | integer | 5 | Results to return (1-10)
rewrite | boolean | false | Expand query to IS-standard vocabulary via LLM before retrieval

Rate limit: 20 req/min.
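
A client call to this endpoint might be assembled with the Python standard library as sketched below. The helper name is hypothetical and the base URL assumes the default port; the block only builds the request and does not send it.

```python
import json
import urllib.request


def build_recommend_request(query, top_n=5, rewrite=False,
                            base_url="http://localhost:5000"):
    """Build a POST request for /api/recommend with a JSON body."""
    payload = json.dumps(
        {"query": query, "top_n": top_n, "rewrite": rewrite}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/recommend",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_recommend_request("fire resistance for brick masonry")
body = json.loads(req.data)
```

With the server running, the request can be sent with urllib.request.urlopen(req, timeout=30) and the response decoded with json.load. At 20 req/min, a batch client should space its calls at least three seconds apart.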

POST /api/ask

Chunk-grounded QA for a specific standard.

{ "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }

POST /api/chat

Conversational QA over the standards corpus. Requires GROQ_API_KEY; returns 503 if absent.

{ "message": "What grades of Portland cement does BIS cover?" }

GET /api/standards

Paginated list. Query params: q (keyword search), category, page (default 1), limit (default 20, max 100).

GET /api/standards/:id

Single standard. :id is URL-encoded IS ID, e.g. IS%20269%3A%201989.
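
The encoded form can be produced with urllib.parse.quote. Passing safe="" ensures the colon is escaped as well as the spaces; the helper name below is hypothetical.

```python
from urllib.parse import quote, unquote


def standard_path(standard_id: str) -> str:
    """Return the request path for one standard, with the ID percent-encoded."""
    return "/api/standards/" + quote(standard_id, safe="")


path = standard_path("IS 269: 1989")
# path == "/api/standards/IS%20269%3A%201989"
round_trip = unquote(path.rsplit("/", 1)[1])
```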

GET /api/categories

All 25 material categories sorted alphabetically.

GET /api/stats

{ "standards": 573, "chunks": 1261, "categories": 25 }

Features

Feature | Description
Hybrid RAG retrieval | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked
Re-ranking | Keyword overlap, title match, exact IS-ID match, short-chunk penalty
AI explanations | Groq llama-3.1-8b-instant, run in parallel with safe fallbacks
Query rewriting | LLM expands natural language to IS-standard vocabulary (optional)
Chunk-grounded QA | Question answered from the most relevant chunk of a specific standard
Conversational chat | Open-ended QA against the full corpus
Browse & filter | Paginated standards list with keyword scoring; category gallery
Persistent daemon | Python retrieval process spawned once at boot; auto-restarts on crash
Internationalisation | UI in English and Hindi (i18next + react-i18next)
Rate limiting | 60 req/min global, 20 req/min on LLM endpoints (express-rate-limit, alongside Helmet security headers)
Production-ready API | Input validation, sanitisation, structured JSON logging, latency breakdown

Tech Stack

Layer | Technology
Embedding model | all-MiniLM-L6-v2 via sentence-transformers
Dense index | FAISS IndexFlatIP (cosine via inner product)
Sparse index | BM25Okapi (rank-bm25)
PDF parsing | PyMuPDF
LLM | Groq API (llama-3.1-8b-instant)
Backend | Node.js 18 + Express 5
Security middleware | Helmet, CORS, express-rate-limit
Frontend | React 19, Vite 8, React Router 7
Internationalisation | i18next, react-i18next, i18next-browser-languagedetector

Troubleshooting

Symptom | Likely cause | Fix
PYTHON_BIN validation failed on start | Invalid PYTHON_BIN | Set to python, python3, or absolute venv path
ModuleNotFoundError: faiss | Wrong Python binary (system Python instead of venv) | Set PYTHON_BIN=/path/to/.venv/bin/python3 in .env
Python daemon boot timeout (90 s) | Index files missing | Run python inference.py --build with venv active
Results return but no explanation field | GROQ_API_KEY absent or invalid | Set key in .env; retrieval still works, explanations fall back silently
fuser: command not found on Linux | psmisc not installed | sudo apt install psmisc / sudo dnf install psmisc
Port 5000 still in use after crash | fuser not available | Manually: kill $(lsof -t -i:5000)

License

See LICENSE.
