BIS × Sigma Squad AI Hackathon | Track: AI / Retrieval Augmented Generation (RAG)

An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.

Public Test Set Results

Evaluated on the 10 provided public queries. Judges run: python inference.py --input <hidden_dataset>.json --output team_results.json

Metric	Target	Our Score
Hit Rate @3	> 80%	100% (10/10)
MRR @5	> 0.7	0.950
Avg Latency	< 5 s	~18 ms

All 10 public queries returned the expected standard in the top 2 results. Average query latency is ~18 ms once the index is warm, more than 250× faster than the 5 s target.

What It Does

Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge eliminates that.

Describe your product in plain language — e.g. "We manufacture 33 Grade Ordinary Portland Cement"
Get ranked BIS standards with matched sections and relevance scores in milliseconds
Read AI explanations of why each standard applies, generated by Groq LLM

The system covers all 586 unique standards across 25 building material categories from BIS SP-21 (Summaries of Indian Standards for Building Materials).

System Architecture

Data Flow

data/raw/dataset.pdf  (BIS SP-21, 929 pages)
  → src/parse_bis_pdf.py
  → data/processed/standards.json          586 structured records  [committed]
  → data/processed/standards_chunks.json   1,269 RAG-ready chunks  [committed]
  → inference.py --build
  → data/processed/embeddings.npy          dense vectors           [gitignored — rebuild locally]
  → data/processed/faiss.index             FAISS index             [gitignored — rebuild locally]

Request Pipeline

Browser / API Client
  → POST /api/recommend  { query, top_n, rewrite }
  → Express server (web/server/index.js)
      ├─ [optional] llmService.rewriteQuery()        Groq — expands to IS-standard vocabulary
      ├─ retrieverService.retrieve()
      │     └─ PythonRetriever singleton              EventEmitter, queues concurrent requests
      │           └─ bridge/retrieve.py daemon        stdin/stdout newline-delimited JSON
      │                 └─ inference.py               FAISS 0.6 + BM25 0.4 → re-rank → top-N
      └─ llmService.generateExplanation() × N        Promise.allSettled — parallel, non-blocking
  → JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
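
The stdin/stdout newline-delimited JSON exchange between the Node server and the Python daemon can be sketched as below. This is a minimal illustration, not the contents of `bridge/retrieve.py`: the `serve` function, the `id`/`ok`/`error` message fields, and the `fake_retrieve` stand-in are all assumptions made for the example.

```python
import io
import json
from typing import Callable, TextIO


def serve(stdin: TextIO, stdout: TextIO, handler: Callable[[dict], dict]) -> None:
    """Answer one JSON request per input line with one JSON response per output line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between requests
        try:
            request = json.loads(line)
            response = {"id": request.get("id"), "ok": True, **handler(request)}
        except Exception as exc:  # keep the daemon alive on malformed input
            response = {"id": None, "ok": False, "error": str(exc)}
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # the parent process is blocked waiting for this line


def fake_retrieve(request: dict) -> dict:
    # Hypothetical stand-in for the real retrieval call; returns a fixed result.
    return {"standards": ["IS 269: 1989"][: request.get("top_n", 5)]}


# Simulate the parent process with in-memory streams instead of real pipes.
requests_in = io.StringIO('{"id": 1, "query": "33 grade OPC", "top_n": 1}\nnot json\n')
responses_out = io.StringIO()
serve(requests_in, responses_out, fake_retrieve)
replies = [json.loads(line) for line in responses_out.getvalue().splitlines()]
```

Echoing a request `id` back in each reply is one way a queueing parent could match responses to callers; whether the actual bridge does this is not stated here.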

Chunking & Retrieval Strategy

Chunking (src/parse_bis_pdf.py):

4-pass boundary detection splits the 929-page PDF into per-standard records
- Pass 1–2: primary block splitting and secondary boundary recovery
- Pass 3: recovers scope text stolen by the preceding block (SP-21 PDF layout quirk)
- Pass 4: truncates next-standard content bleed at a second 1. Scope marker
Each standard is further split by section with 50-word overlap to prevent context loss at boundaries
Weak chunks (<30 words) are merged with their neighbour
Result: 1,269 chunks from 586 standards (avg 2.2 chunks/standard)
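
The overlap and weak-chunk rules above can be sketched as two small helpers. This is one possible reading only: the 200-word window size and the function names are assumptions, and the real parser splits on section boundaries rather than on a fixed window.

```python
def merge_weak(pieces, min_words=30):
    """Fold any piece shorter than min_words into its neighbour.

    A short piece joins the piece before it; a short first piece
    absorbs the piece after it.
    """
    merged = []
    for piece in pieces:
        if merged and (len(piece) < min_words or len(merged[-1]) < min_words):
            merged[-1] = merged[-1] + piece
        else:
            merged.append(piece)
    return merged


def split_with_overlap(words, size=200, overlap=50):
    """Slide a window of `size` words, repeating the last `overlap` words
    of each chunk at the start of the next one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(450)]
chunks = split_with_overlap(words)                       # window size is an assumed value
sections = merge_weak([["a"] * 100, ["b"] * 10, ["c"] * 100])
```

With these inputs the 450 words become three chunks whose neighbours share 50 words, and the 10-word section is folded into the 100-word section before it.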

Hybrid Retrieval (inference.py):

Dense: FAISS IndexFlatIP with all-MiniLM-L6-v2 embeddings (384-dim cosine similarity)
Sparse: BM25Okapi with weighted document construction — title ×4, keywords ×3, section ×2, body ×1
Fusion: score = 0.6 × dense_norm + 0.4 × sparse_norm
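
A minimal sketch of the fusion formula follows. The README does not say how `dense_norm` and `sparse_norm` are computed, so min-max scaling over the candidate set is an assumption here, as are the function names.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """score = w_dense * dense_norm + w_sparse * sparse_norm, per candidate."""
    dense_norm, sparse_norm = min_max(dense), min_max(sparse)
    return [w_dense * d + w_sparse * s for d, s in zip(dense_norm, sparse_norm)]


# Three candidates: cosine similarities from FAISS, raw BM25 scores.
dense = [0.9, 0.5, 0.7]
sparse = [3.0, 12.0, 7.5]
fused = fuse(dense, sparse)
```

Normalising first matters because BM25 scores are unbounded while cosine similarities are not; without it the sparse term would dominate regardless of the 0.6/0.4 weights. In this example the candidate that is best on dense similarity alone ends up ahead of the one that is best on BM25 alone.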

Re-ranking bonuses applied per candidate:

+0.05 per overlapping keyword (max 4) between query and standard's keyword list
+0.05 per overlapping title word (max 5)
+0.25 if ≥60% of significant title words appear in the query (strong title match)
+0.20 if an exact IS ID from the query matches this standard
+0.35 / -0.40 grade discriminator: boosts/penalises OPC-grade standards (33/43/53) when query names a specific grade
+0.30 / -0.20 part-number discriminator: boosts matching Part N and penalises non-matching parts when query explicitly names a part number (handles "Part – 1", "PART2" etc.)
-0.15 penalty for very short chunks (<40 body words)
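
Three of the adjustments listed above (keyword overlap, part-number discriminator, short-chunk penalty) can be sketched as follows. The candidate dictionary layout, the single-word keyword assumption, the regular expression, and the example IDs are illustrative and not taken from `inference.py`.

```python
import re

# Matches "Part 2", "PART2", "Part - 1" and en/em-dash variants.
PART_RE = re.compile(r"\bpart\s*[-\u2013\u2014]?\s*(\d+)", re.IGNORECASE)


def part_number(text):
    """Return the part number named in text, or None if there is none."""
    match = PART_RE.search(text)
    return int(match.group(1)) if match else None


def rerank_bonus(query, candidate):
    """Sum a subset of the listed adjustments for one candidate."""
    bonus = 0.0
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))

    # +0.05 per overlapping keyword, counting at most 4
    overlap = query_words & {k.lower() for k in candidate["keywords"]}
    bonus += 0.05 * min(len(overlap), 4)

    # +0.30 / -0.20, applied only when the query explicitly names a part
    wanted = part_number(query)
    have = part_number(candidate["standard_id"])
    if wanted is not None and have is not None:
        bonus += 0.30 if wanted == have else -0.20

    # -0.15 for very short chunks (<40 body words)
    if len(candidate["body"].split()) < 40:
        bonus -= 0.15
    return bonus


query = "hollow concrete blocks PART2"
match = {"standard_id": "IS 2185 (Part 2): 1983",
         "keywords": ["hollow", "lightweight", "blocks"],
         "body": " ".join(["word"] * 50)}
sibling = {"standard_id": "IS 2185 (Part 1): 2005",
           "keywords": ["hollow", "blocks"],
           "body": "short text"}
bonuses = [rerank_bonus(query, match), rerank_bonus(query, sibling)]
```

Here the matching part gains two keyword bonuses plus the part boost, while the sibling keeps the keyword bonuses but takes both the part penalty and the short-chunk penalty.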

Post-grouping Part disambiguation: when multiple parts of the same IS base number survive into the candidate set with identical titles, IDF-weighted discriminating keyword scores break the tie — rarer corpus terms (e.g. "lightweight") carry proportionally more weight.
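
The IDF-weighted tie-break can be illustrated with a toy corpus. This is a sketch under assumptions: the plain `log(N / df)` formula, the function names, and the keyword lists are invented for the example and may differ from what `inference.py` does.

```python
import math
from collections import Counter


def idf_weights(keyword_lists):
    """IDF per keyword: log(N / document frequency). Rarer terms weigh more."""
    n_docs = len(keyword_lists)
    doc_freq = Counter(k for kws in keyword_lists for k in set(kws))
    return {k: math.log(n_docs / df) for k, df in doc_freq.items()}


def tie_break(query_words, parts, idf):
    """Pick the part whose query-matching keywords carry the most IDF weight."""
    def score(part):
        return sum(idf.get(k, 0.0) for k in set(part["keywords"]) & query_words)
    return max(parts, key=score)


corpus_keywords = [
    ["concrete", "blocks", "hollow"],       # keywords of a hypothetical Part 1
    ["concrete", "blocks", "lightweight"],  # keywords of a hypothetical Part 2
    ["concrete", "aggregate"],
    ["concrete", "cement"],
]
parts = [
    {"standard_id": "Part 1", "keywords": corpus_keywords[0]},
    {"standard_id": "Part 2", "keywords": corpus_keywords[1]},
]
idf = idf_weights(corpus_keywords)
winner = tie_break({"lightweight", "concrete", "blocks"}, parts, idf)
```

Because "concrete" appears in every record its weight is zero, so the decision rests on "lightweight", which only one part carries.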

Deduplication: candidates grouped by standard_id; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.
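
The deduplication step amounts to a group-by with a max per group. A minimal sketch, assuming each candidate is a dictionary with `standard_id` and `score` keys (the function name is hypothetical):

```python
def best_per_standard(candidates, top_n=5):
    """Keep the highest-scoring chunk per standard, then return the top N."""
    best = {}
    for cand in candidates:
        sid = cand["standard_id"]
        if sid not in best or cand["score"] > best[sid]["score"]:
            best[sid] = cand
    ranked = sorted(best.values(), key=lambda c: c["score"], reverse=True)
    return ranked[:top_n]


chunks = [
    {"standard_id": "IS 269: 1989", "section": "Scope", "score": 0.89},
    {"standard_id": "IS 8112: 1989", "section": "Scope", "score": 0.74},
    {"standard_id": "IS 269: 1989", "section": "Requirements", "score": 0.61},
]
top = best_per_standard(chunks, top_n=2)
```

Grouping before truncation is what keeps one standard with several strong chunks from occupying multiple slots in the top-N list.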

Key Design Decisions

Decision	Rationale
Persistent Python daemon	FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query.
`inference.py` never modified	Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server goes through the bridge. Both paths execute the same retrieval code.
In-memory data	586 standards + 1,269 chunks fit comfortably in RAM. No database dependency, no I/O per request.
LLM fallbacks everywhere	Every Groq call is wrapped with a timeout (8 s) and a safe default return. `Promise.allSettled` for parallel calls. Server starts and retrieval works without a `GROQ_API_KEY`.
Weighted BM25 document	Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise.
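
The weighted BM25 document can be built by repeating tokens per field before indexing. The sketch below assumes a simple lowercase alphanumeric tokenizer and a hypothetical `weighted_tokens` helper; the real tokenization may differ.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_tokens(title, keywords, section, body):
    """Repeat each field's tokens by its weight: title x4, keywords x3,
    section x2, body x1. BM25 then sees title terms as more frequent."""
    keyword_tokens = [t for k in keywords for t in tokenize(k)]
    return (tokenize(title) * 4
            + keyword_tokens * 3
            + tokenize(section) * 2
            + tokenize(body))


doc = weighted_tokens(
    title="Ordinary Portland Cement, 33 Grade",
    keywords=["cement", "OPC"],
    section="Scope",
    body="Covers manufacture of cement",
)
cement_count = doc.count("cement")
```

A list of such token lists, one per chunk, is the shape that `rank_bm25.BM25Okapi` takes as its corpus argument, with `get_scores(tokenize(query))` returning one score per chunk.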

Project Structure

SpecForge/
├── inference.py                         # Entry point for judges
├── requirements.txt                     # All Python dependencies
├── eval_script.py                       # Provided evaluation script (Hit@3, MRR@5, latency)
├── data/
│   └── processed/
│       ├── standards.json               # 586 parsed standards (committed)
│       ├── standards_chunks.json        # 1,269 RAG chunks (committed)
│       ├── public_test_set.json         # 10 public evaluation queries
│       └── retrieval_results.json       # Our results on public test set
├── src/
│   └── parse_bis_pdf.py                 # PDF → JSON parsing pipeline
└── web/
    ├── server/
    │   ├── index.js                     # Express API — all routes
    │   ├── start.js                     # Safe launcher (kills stale port process)
    │   ├── .env.example                 # Environment template
    │   ├── bridge/
    │   │   └── retrieve.py              # Daemon wrapping inference.py for the web server
    │   └── services/
    │       ├── llmService.js            # Groq wrappers with fallbacks
    │       └── retrieverService.js      # PythonRetriever — daemon lifecycle manager
    └── client/
        └── src/
            ├── App.jsx                  # React router (5 pages)
            ├── api/standards.js         # Typed fetch wrappers
            ├── pages/                   # Home, Standards, Categories, Recommend, About
            ├── components/              # Navbar, Footer, StandardCard, StandardModal
            └── locales/                 # en/ and hi/ (English + Hindi i18n)

External APIs & Data Sources

All sources disclosed per hackathon transparency requirements.

Source	Purpose	Key required?	Notes
BIS SP-21 (Bureau of Indian Standards, Special Publication 21)	Source dataset — 929-page PDF of building material standard summaries	No	Provided by organisers; processed JSON committed to repo
HuggingFace `all-MiniLM-L6-v2`	384-dimension sentence embedding model for FAISS dense retrieval	No	Downloaded automatically by `sentence-transformers` on first `--build` (~90 MB)
Groq API (`llama-3.1-8b-instant`)	Query rewriting, per-result explanation, conversational QA	Yes — `GROQ_API_KEY`	Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key.

No other external APIs, databases, or paid services are used.

Environment Dependencies

System Requirements

Dependency	Minimum	Notes
Python	3.10	For retrieval pipeline and `inference.py`
Node.js	18	For Express server and React client
npm	9	Ships with Node 18
`fuser`	any	Linux — used by `start.js` to clear stale port; install via `psmisc` if missing

Hardware

CPU: Any x86-64 or ARM64 — no GPU required
RAM: 2 GB minimum; index + embeddings use ~500 MB
GPU: Optional — a CUDA GPU reduces index build time but faiss-cpu and sentence-transformers run fully on CPU
Disk: ~1 GB free for venv and generated index files

Setup & Running

Step 1 — Clone

git clone https://github.com/kshitij-ka/SpecForge
cd SpecForge

Step 2 — Python virtual environment

python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install --upgrade pip
pip install -r requirements.txt

requirements.txt:

pymupdf>=1.24.0
faiss-cpu>=1.7.4
rank-bm25>=0.2.2
sentence-transformers>=3.0.0
numpy>=1.26.0

sentence-transformers downloads all-MiniLM-L6-v2 (~90 MB) from HuggingFace on first use.

Step 3 — Build the FAISS index

The processed JSON is committed. Index files are gitignored and must be built once locally.

source .venv/bin/activate
python inference.py --build

Encodes 1,269 chunks, writes embeddings.npy + faiss.index to data/processed/. Takes ~2 min on CPU. Subsequent starts load from cache — no rebuild needed unless chunks change.

Step 4 — Node.js dependencies

cd web/server && npm install
cd ../client && npm install

Step 5 — Environment variables

cp web/server/.env.example web/server/.env

Edit web/server/.env:

# Required for LLM explanations, query rewriting, and /api/chat
GROQ_API_KEY=your_groq_api_key_here

# Optional — defaults to 5000
PORT=5000

# Required if "python" is not Python 3 — point to your venv
PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3

PYTHON_BIN accepts only "python", "python3", or an absolute path. The server validates and rejects arbitrary values on startup.

Step 6 — Start the application

Terminal 1 — API server (port 5000):

cd web/server
npm start

Wait for the log line "Python retriever ready" (~20 s on first boot). The server accepts queries after that.

Terminal 2 — Frontend dev server (port 5173):

cd web/client
npm run dev

Open http://localhost:5173. The Vite dev server proxies all /api/* requests to :5000.

Using `inference.py` (Judge Entry Point)

inference.py is the mandatory entry point. It runs independently of the web server.

Always activate the virtual environment first: source .venv/bin/activate

Build / force-rebuild the index

python inference.py --build

Single query (interactive testing)

python inference.py --query "Which standard covers 33 grade OPC cement?"

Output:

============================================================
Query : Which standard covers 33 grade OPC cement?
Latency: 0.019s

Top results:
  1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
     Category: Cement and Concrete  |  Section: Scope  |  Score: 0.8921
  2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
     ...

Batch evaluation (judge command)

python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json

Input format:

[
  {
    "id": "PUB-01",
    "query": "We are a small enterprise manufacturing 33 Grade OPC...",
    "expected_standards": ["IS 269: 1989"]
  }
]

Output format:

[
  {
    "id": "PUB-01",
    "query": "...",
    "retrieved_standards": ["IS 8112: 1989", "IS 269: 1989", "..."],
    "details": [
      {
        "standard_id": "IS 269: 1989",
        "title": "Ordinary Portland Cement, 33 Grade",
        "category": "Cement and Concrete",
        "score": 0.8921,
        "matched_section": "Scope"
      }
    ],
    "latency_seconds": 0.019,
    "expected_standards": ["IS 269: 1989"]
  }
]

Evaluation

# Step 1: generate results
python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json

# Step 2: score
python eval_script.py \
  --results data/processed/retrieval_results.json

Targets and our results on the public set:

Metric	Formula	Target	Achieved
Hit Rate @3	queries with an expected standard in the top 3 / total queries	> 80%	100%
MRR @5	Σ(1/rank_i) / N	> 0.7	0.950
Avg Latency	total_time / num_queries	< 5 s	~0.018 s
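
The two ranking formulas in the table can be reproduced in a few lines. This is an independent sketch over the output format shown earlier, not the provided `eval_script.py`, which may normalise standard IDs differently.

```python
def hit_rate_at_k(results, k=3):
    """Fraction of queries with at least one expected standard in the top k."""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results
        if any(s in r["expected_standards"] for s in r["retrieved_standards"][:k])
    )
    return hits / len(results)


def mrr_at_k(results, k=5):
    """Mean of 1/rank of the first expected standard within the top k
    (a query with no hit in the top k contributes 0)."""
    if not results:
        return 0.0
    total = 0.0
    for r in results:
        for rank, s in enumerate(r["retrieved_standards"][:k], start=1):
            if s in r["expected_standards"]:
                total += 1.0 / rank
                break
    return total / len(results)


# Ten toy queries: nine hits at rank 1, one hit at rank 2.
rank_one = {"expected_standards": ["A"], "retrieved_standards": ["A", "B"]}
rank_two = {"expected_standards": ["A"], "retrieved_standards": ["B", "A"]}
results = [rank_one] * 9 + [rank_two]
hit = hit_rate_at_k(results)
mrr = mrr_at_k(results)
```

Nine first-place hits and one second-place hit give (9 × 1 + 1 × 0.5) / 10 = 0.95, which is the only split of ten top-2 results consistent with the reported MRR @5 of 0.950.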

API Reference

All endpoints on Express server (default http://localhost:5000).

`POST /api/recommend`

Core RAG endpoint. Retrieval + optional LLM explanations.

// Request
{ "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }

// Response
{
  "standards": [
    {
      "standard_id": "IS 1905: 1987",
      "title": "Code of Practice for Structural Use of Unreinforced Masonry",
      "category": "Masonry",
      "score": 0.812,
      "matched_section": "Fire Resistance",
      "explanation": "This standard specifies..."
    }
  ],
  "latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
}

Field	Type	Default	Description
`query`	string	required	Natural-language product description or compliance question
`top_n`	integer	5	Results to return (1–10)
`rewrite`	boolean	`false`	Expand query to IS-standard vocabulary via LLM before retrieval

Rate limit: 20 req/min.
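
A client call can be assembled with the Python standard library alone. The helper name below is hypothetical; the field names, value range, and default port come from the table above. The request is only built here, and the commented lines show how it would be sent to a running server.

```python
import json
import urllib.request


def build_recommend_request(query, top_n=5, rewrite=False,
                            base_url="http://localhost:5000"):
    """Build a POST request for /api/recommend without sending it."""
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10")
    payload = json.dumps(
        {"query": query, "top_n": top_n, "rewrite": rewrite}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/recommend",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_recommend_request("fire resistance for brick masonry")

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=10) as resp:
#     body = json.load(resp)
#     print(body["latency"]["total_ms"])
```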

`POST /api/ask`

Chunk-grounded QA for a specific standard.

{ "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }

`POST /api/chat`

Conversational QA over the standards corpus. Requires GROQ_API_KEY; returns 503 if absent.

{ "message": "What grades of Portland cement does BIS cover?" }

`GET /api/standards`

Paginated list. Query params: q (keyword search), category, page (default 1), limit (default 20, max 100).

`GET /api/standards/:id`

Single standard. :id is URL-encoded IS ID, e.g. IS%20269%3A%201989.
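
IS IDs contain spaces and a colon, so they need percent-encoding before going into the path. A short sketch using `urllib.parse.quote` (the `standard_path` helper is hypothetical):

```python
from urllib.parse import quote, unquote


def standard_path(standard_id):
    """Percent-encode an IS ID for use as the :id path segment."""
    # safe="" also encodes "/" so the ID stays within one path segment
    return "/api/standards/" + quote(standard_id, safe="")


path = standard_path("IS 269: 1989")
round_trip = unquote(path.rsplit("/", 1)[1])
```

For the ID above this yields `/api/standards/IS%20269%3A%201989`, matching the encoded form in the endpoint description.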

`GET /api/categories`

All 25 material categories sorted alphabetically.

`GET /api/stats`

{ "standards": 586, "chunks": 1269, "categories": 25 }

Features

Feature	Description
Hybrid RAG retrieval	FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked
Re-ranking	Keyword overlap, title match, exact IS-ID match, short-chunk penalty
Part-number disambiguation	Explicit Part N in query boosts the matching part (+0.30) and penalises sibling parts (-0.20); handles em-dash/PART2 variants
AI explanations	Groq `llama-3.1-8b-instant` — parallel, fallback-safe
Query rewriting	LLM expands natural language to IS-standard vocabulary (optional)
Chunk-grounded QA	Question answered from the most relevant chunk of a specific standard
Conversational chat	Open-ended QA against the full corpus
Browse & filter	Paginated standards list with keyword scoring; category gallery
Persistent daemon	Python retrieval process spawned once at boot; auto-restarts on crash
Internationalisation	UI in English and Hindi (i18next + react-i18next)
Rate limiting	60 req/min global, 20 req/min on LLM endpoints (express-rate-limit; Helmet adds security headers)
Production-ready API	Input validation, sanitisation, structured JSON logging, latency breakdown

Tech Stack

Layer	Technology
Embedding model	`all-MiniLM-L6-v2` via `sentence-transformers`
Dense index	FAISS `IndexFlatIP` (cosine via inner product)
Sparse index	BM25Okapi (`rank-bm25`)
PDF parsing	PyMuPDF
LLM	Groq API (`llama-3.1-8b-instant`)
Backend	Node.js 18 + Express 5
Security middleware	Helmet, CORS, express-rate-limit
Frontend	React 19, Vite 8, React Router 7
Internationalisation	i18next, react-i18next, i18next-browser-languagedetector

Troubleshooting

Symptom	Likely cause	Fix
`PYTHON_BIN validation failed` on start	Invalid `PYTHON_BIN`	Set to `python`, `python3`, or absolute venv path
`ModuleNotFoundError: faiss`	Wrong Python binary (system Python instead of venv)	Set `PYTHON_BIN=/path/to/.venv/bin/python3` in `.env`
`Python daemon boot timeout` (90 s)	Index files missing	Run `python inference.py --build` with venv active
Results return but no `explanation` field	`GROQ_API_KEY` absent or invalid	Set key in `.env`; retrieval still works, explanations fall back silently
`fuser: command not found` on Linux	`psmisc` not installed	`sudo apt install psmisc` / `sudo dnf install psmisc`
Port 5000 still in use after crash	`fuser` not available	Manually: `kill $(lsof -t -i:5000)`

License

See LICENSE.

SpecForge — BIS Standards Recommendation Engine