# SpecForge — BIS Standards Recommendation Engine

> **BIS × Sigma Squad AI Hackathon** | Track: AI / Retrieval Augmented Generation (RAG)
>
> An end-to-end RAG system that turns plain-language product descriptions into accurate BIS standard recommendations in milliseconds — helping Indian MSEs find compliance requirements in seconds instead of weeks.

---

## Public Test Set Results

> Evaluated on the 10 provided public queries. Judges run: `python inference.py --input <hidden_dataset>.json --output team_results.json`

| Metric | Target | **Our Score** |
|---|---|---|
| Hit Rate @3 | > 80% | **100%** (10/10) |
| MRR @5 | > 0.7 | **0.950** |
| Avg Latency | < 5 s | **~18 ms** |

All 10 public queries returned the expected standard in the top 2 results. Average query latency is about 18 ms once the index is warm, more than 250× faster than the 5 s target.

---

## What It Does

Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-21 to identify which standards apply to their products. SpecForge eliminates that.

1. **Describe your product** in plain language — e.g. *"We manufacture 33 Grade Ordinary Portland Cement"*
2. **Get ranked BIS standards** with matched sections and relevance scores in milliseconds
3. **Read AI explanations** of why each standard applies, generated by an LLM served through the Groq API

The system covers all **586 unique standards** across **25 building material categories** from BIS SP-21 (Summaries of Indian Standards for Building Materials).

---

## System Architecture

### Data Flow

```
data/raw/dataset.pdf  (BIS SP-21, 929 pages)
  → src/parse_bis_pdf.py
  → data/processed/standards.json          586 structured records  [committed]
  → data/processed/standards_chunks.json   1,269 RAG-ready chunks  [committed]
  → inference.py --build
  → data/processed/embeddings.npy          dense vectors           [gitignored — rebuild locally]
  → data/processed/faiss.index             FAISS index             [gitignored — rebuild locally]
```

### Request Pipeline

```
Browser / API Client
  → POST /api/recommend  { query, top_n, rewrite }
  → Express server (web/server/index.js)
      ├─ [optional] llmService.rewriteQuery()        Groq — expands to IS-standard vocabulary
      ├─ retrieverService.retrieve()
      │     └─ PythonRetriever singleton              EventEmitter, queues concurrent requests
      │           └─ bridge/retrieve.py daemon        stdin/stdout newline-delimited JSON
      │                 └─ inference.py               FAISS 0.6 + BM25 0.4 → re-rank → top-N
      └─ llmService.generateExplanation() × N        Promise.allSettled — parallel, non-blocking
  → JSON { standards[], latency: { retrieval_ms, llm_ms, total_ms } }
```
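
The bridge in the pipeline above exchanges newline-delimited JSON over stdin/stdout. The loop below is a minimal sketch of that pattern, not the actual `bridge/retrieve.py`: the `retrieve(query, top_n)` callable and the `id`, `ok`, `results`, and `error` message fields are assumptions made for illustration.

```python
import io
import json
import sys
from typing import Callable, TextIO


def serve(retrieve: Callable[[str, int], list],
          stdin: TextIO = sys.stdin,
          stdout: TextIO = sys.stdout) -> None:
    """Answer one JSON request per input line with one JSON reply per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between messages
        try:
            request = json.loads(line)
            results = retrieve(request["query"], int(request.get("top_n", 5)))
            reply = {"id": request.get("id"), "ok": True, "results": results}
        except Exception as exc:  # report the failure but keep the daemon alive
            reply = {"id": None, "ok": False, "error": str(exc)}
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the parent process reads replies line by line


def fake_retrieve(query: str, top_n: int) -> list:
    # Hypothetical stand-in for the real retriever.
    return [f"result for {query!r}"][:top_n]


# Drive the loop with in-memory streams instead of a real parent process.
out = io.StringIO()
serve(fake_retrieve, io.StringIO('{"id": 1, "query": "cement"}\nnot json\n'), out)
replies = [json.loads(text) for text in out.getvalue().splitlines()]
```

Flushing after every reply matters here: without it, the parent process could wait indefinitely on a buffered line.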

### Chunking & Retrieval Strategy

**Chunking** (`src/parse_bis_pdf.py`):
- 4-pass boundary detection splits the 929-page PDF into per-standard records
  - Pass 1–2: primary block splitting and secondary boundary recovery
  - Pass 3: recovers scope text stolen by the preceding block (SP-21 PDF layout quirk)
  - Pass 4: truncates next-standard content bleed at a second `1. Scope` marker
- Each standard is further split by section with **50-word overlap** to prevent context loss at boundaries
- Weak chunks (<30 words) are merged with their neighbour
- Result: 1,269 chunks from 586 standards (avg 2.2 chunks/standard)
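
The section-level steps above (overlap between neighbouring sections, merging of weak chunks) can be sketched as follows. This is an illustration rather than the code in `src/parse_bis_pdf.py`: the function name, the `(section_name, text)` input shape, and the choice to merge a weak section into the one before it are all assumptions.

```python
def chunk_sections(sections, overlap=50, min_words=30):
    """Turn (section_name, text) pairs into overlapping chunks."""
    # Pass 1: fold weak sections into the section before them.
    # A weak first section has no predecessor, so it is kept as-is.
    merged = []
    for name, text in sections:
        words = text.split()
        if merged and len(words) < min_words:
            merged[-1]["words"].extend(words)
        else:
            merged.append({"section": name, "words": words})

    # Pass 2: prefix each chunk with the last `overlap` words of the previous one.
    chunks = []
    tail = []
    for item in merged:
        chunks.append({"section": item["section"],
                       "text": " ".join(tail + item["words"])})
        tail = item["words"][-overlap:] if overlap > 0 else []
    return chunks


# Synthetic sections: 40, 5, and 60 words.
sections = [
    ("Scope", " ".join(f"s{i}" for i in range(40))),
    ("Note", " ".join(f"n{i}" for i in range(5))),
    ("Requirements", " ".join(f"r{i}" for i in range(60))),
]
chunks = chunk_sections(sections, overlap=10, min_words=30)
```

In this toy run the 5-word "Note" section is absorbed into "Scope", and the "Requirements" chunk starts with the last 10 words of the merged chunk before it.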

**Hybrid Retrieval** (`inference.py`):
- **Dense**: FAISS `IndexFlatIP` over `all-MiniLM-L6-v2` embeddings (384 dimensions; inner product equals cosine similarity when the vectors are L2-normalised)
- **Sparse**: BM25Okapi with weighted document construction — title ×4, keywords ×3, section ×2, body ×1
- **Fusion**: `score = 0.6 × dense_norm + 0.4 × sparse_norm`
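
The fusion formula above leaves the normalisation method unstated. The sketch below assumes min-max scaling of each score list to the 0–1 range before weighting; the function names are hypothetical and the real `inference.py` may normalise differently.

```python
def min_max(scores):
    """Scale scores to the 0-1 range; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """Weighted sum of normalised dense and sparse scores, one per candidate."""
    d, s = min_max(dense), min_max(sparse)
    return [w_dense * a + w_sparse * b for a, b in zip(d, s)]


# Three candidates with raw cosine scores and raw BM25 scores.
dense_scores = [0.82, 0.40, 0.61]
sparse_scores = [3.0, 9.0, 0.0]
fused = fuse(dense_scores, sparse_scores)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

Normalising first keeps the unbounded BM25 scores from swamping the cosine scores, so the 0.6/0.4 weights carry the meaning they appear to.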

**Re-ranking** bonuses applied per candidate:
- +0.05 per overlapping keyword (max 4) between query and standard's keyword list
- +0.05 per overlapping title word (max 5)
- +0.25 if ≥60% of significant title words appear in the query (strong title match)
- +0.20 if an exact IS ID from the query matches this standard
- +0.35 / -0.40 grade discriminator: boosts/penalises OPC-grade standards (33/43/53) when query names a specific grade
- +0.30 / -0.20 part-number discriminator: boosts matching Part N and penalises non-matching parts when query explicitly names a part number (handles "Part – 1", "PART2" etc.)
- -0.15 penalty for very short chunks (<40 body words)
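
A few of the bonuses listed above can be sketched as a single scoring function. The tokeniser, the stop-word list, and the IS-ID regular expression are assumptions, and the grade and part-number discriminators are left out for brevity.

```python
import re

# Assumed stop-word list used to decide which title words are "significant".
STOPWORDS = {"of", "for", "and", "the", "in", "to", "a", "an", "with"}


def tokens(text):
    """Lower-case alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_bonus(query, title, keywords, standard_id, body_words):
    q = tokens(query)
    bonus = 0.0

    # +0.05 per keyword whose words all appear in the query (max 4).
    kw_hits = 0
    for kw in keywords:
        kw_tokens = tokens(kw)
        if kw_tokens and kw_tokens <= q:
            kw_hits += 1
    bonus += 0.05 * min(kw_hits, 4)

    # +0.05 per overlapping title word (max 5), plus the strong-title bonus.
    title_words = tokens(title) - STOPWORDS
    title_hits = len(title_words & q)
    bonus += 0.05 * min(title_hits, 5)
    if title_words and title_hits / len(title_words) >= 0.6:
        bonus += 0.25

    # +0.20 when the query names this standard's IS number.
    query_ids = set(re.findall(r"\bis\s*(\d+)", query.lower()))
    match = re.match(r"is\s*(\d+)", standard_id.lower())
    if match and match.group(1) in query_ids:
        bonus += 0.20

    # -0.15 for very short chunks.
    if body_words < 40:
        bonus -= 0.15
    return bonus


strong = rerank_bonus(
    "We manufacture 33 grade ordinary portland cement as per IS 269",
    "Ordinary Portland Cement, 33 Grade",
    ["portland cement", "33 grade", "clinker"],
    "IS 269: 1989",
    body_words=120,
)
weak = rerank_bonus("steel bars", "Ordinary Portland Cement, 33 Grade",
                    [], "IS 269: 1989", body_words=10)
```

In this example the first call collects the keyword, title, strong-title, and IS-ID bonuses, while the second only incurs the short-chunk penalty.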

**Post-grouping Part disambiguation**: when multiple parts of the same IS base number survive into the candidate set with identical titles, IDF-weighted discriminating keyword scores break the tie — rarer corpus terms (e.g. "lightweight") carry proportionally more weight.
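
One way to picture the IDF-weighted tie-break is shown below. It is a sketch under assumptions: the smoothed IDF formula, the function names, and the placeholder IDs `IS 0000 (Part 1)` and `IS 0000 (Part 2)` are illustrative and not taken from `inference.py`.

```python
import math


def idf_table(corpus_keywords):
    """Map each keyword to a smoothed inverse document frequency."""
    n_docs = len(corpus_keywords)
    doc_freq = {}
    for keywords in corpus_keywords:
        for kw in set(keywords):
            doc_freq[kw] = doc_freq.get(kw, 0) + 1
    # Rare terms get a larger weight; +1 keeps every weight positive.
    return {kw: math.log(n_docs / df) + 1.0 for kw, df in doc_freq.items()}


def tie_break(query_terms, candidates, idf):
    """Pick the candidate whose keywords overlap the query with most IDF mass."""
    query = set(query_terms)

    def score(keywords):
        return sum(idf.get(kw, 0.0) for kw in set(keywords) & query)

    return max(candidates, key=lambda sid: score(candidates[sid]))


# Toy corpus: "concrete" is common, "lightweight" is rare.
corpus = [
    ["concrete", "blocks", "hollow"],
    ["concrete", "blocks", "lightweight"],
    ["concrete", "cement"],
    ["concrete", "steel"],
]
idf = idf_table(corpus)
candidates = {
    "IS 0000 (Part 1)": ["concrete", "blocks", "hollow"],
    "IS 0000 (Part 2)": ["concrete", "blocks", "lightweight"],
}
winner = tie_break(["lightweight", "concrete", "blocks"], candidates, idf)
```

Both parts share "concrete" and "blocks", so the rare term "lightweight" decides the tie in favour of Part 2.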

**Deduplication**: candidates grouped by `standard_id`; only the best-scoring chunk per standard survives. Final output is top-N unique IS standards.
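
The deduplication step amounts to a group-by on `standard_id` that keeps the maximum score. A minimal sketch, assuming each candidate is a dict with `standard_id` and `score` keys (the `section` key and the function name are illustrative):

```python
def dedupe_top_n(candidates, top_n=5):
    """Keep the best-scoring chunk per standard, then return the top N."""
    best = {}
    for cand in candidates:
        sid = cand["standard_id"]
        if sid not in best or cand["score"] > best[sid]["score"]:
            best[sid] = cand
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)[:top_n]


chunks = [
    {"standard_id": "IS 269: 1989", "section": "Scope", "score": 0.89},
    {"standard_id": "IS 269: 1989", "section": "Requirements", "score": 0.71},
    {"standard_id": "IS 8112: 1989", "section": "Scope", "score": 0.64},
]
top = dedupe_top_n(chunks, top_n=2)
```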

### Key Design Decisions

| Decision | Rationale |
|---|---|
| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
| `inference.py` never modified | Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server uses the bridge. Both paths run the same retrieval code. |
| In-memory data | 586 standards + 1,269 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
| LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. `Promise.allSettled` for parallel calls. Server starts and retrieval works without a `GROQ_API_KEY`. |
| Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |
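
The weighted BM25 document from the last row can be built by repeating each field's tokens according to its weight. The sketch below assumes chunk dicts with `title`, `keywords`, `section`, and `body` fields and a simple alphanumeric tokeniser; the real field names and tokenisation may differ.

```python
import re

# Field weights from the retrieval strategy: title x4, keywords x3, section x2, body x1.
WEIGHTS = {"title": 4, "keywords": 3, "section": 2, "body": 1}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_document(chunk):
    """Repeat each field's tokens `weight` times to bias BM25 term frequencies."""
    doc_tokens = []
    for field, weight in WEIGHTS.items():
        value = chunk.get(field, "")
        if isinstance(value, list):
            value = " ".join(value)
        doc_tokens.extend(tokenize(value) * weight)
    return doc_tokens


chunk = {
    "title": "Ordinary Portland Cement",
    "keywords": ["cement", "clinker"],
    "section": "Scope",
    "body": "Covers manufacture of cement",
}
doc = weighted_document(chunk)
```

A list of such token lists is the kind of tokenised corpus that `rank_bm25.BM25Okapi` accepts, and `get_scores` on a tokenised query then returns one score per document.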

---

## Project Structure

```
SpecForge/
├── inference.py                         # Entry point for judges
├── requirements.txt                     # All Python dependencies
├── eval_script.py                       # Provided evaluation script (Hit@3, MRR@5, latency)
├── data/
│   └── processed/
│       ├── standards.json               # 586 parsed standards (committed)
│       ├── standards_chunks.json        # 1,269 RAG chunks (committed)
│       ├── public_test_set.json         # 10 public evaluation queries
│       └── retrieval_results.json       # Our results on public test set
├── src/
│   └── parse_bis_pdf.py                 # PDF → JSON parsing pipeline
└── web/
    ├── server/
    │   ├── index.js                     # Express API — all routes
    │   ├── start.js                     # Safe launcher (kills stale port process)
    │   ├── .env.example                 # Environment template
    │   ├── bridge/
    │   │   └── retrieve.py              # Daemon wrapping inference.py for the web server
    │   └── services/
    │       ├── llmService.js            # Groq wrappers with fallbacks
    │       └── retrieverService.js      # PythonRetriever — daemon lifecycle manager
    └── client/
        └── src/
            ├── App.jsx                  # React router (5 pages)
            ├── api/standards.js         # Typed fetch wrappers
            ├── pages/                   # Home, Standards, Categories, Recommend, About
            ├── components/              # Navbar, Footer, StandardCard, StandardModal
            └── locales/                 # en/ and hi/ (English + Hindi i18n)
```

---

## External APIs & Data Sources

All sources disclosed per hackathon transparency requirements.

| Source | Purpose | Key required? | Notes |
|---|---|---|---|
| **BIS SP-21** (Bureau of Indian Standards, Special Publication 21) | Source dataset — 929-page PDF of building material standard summaries | No | Provided by organisers; processed JSON committed to repo |
| **HuggingFace `all-MiniLM-L6-v2`** | 384-dimension sentence embedding model for FAISS dense retrieval | No | Downloaded automatically by `sentence-transformers` on first `--build` (~90 MB) |
| **Groq API** (`llama-3.1-8b-instant`) | Query rewriting, per-result explanation, conversational QA | Yes — `GROQ_API_KEY` | Free tier sufficient. Groq chosen for sub-second inference latency. Retrieval works without this key. |

No other external APIs, databases, or paid services are used.

---

## Environment Dependencies

### System Requirements

| Dependency | Minimum | Notes |
|---|---|---|
| Python | 3.10 | For retrieval pipeline and `inference.py` |
| Node.js | 18 | For Express server and React client |
| npm | 9 | Bundled with Node.js (recent 18.x releases ship npm 9) |
| `fuser` | any | Linux — used by `start.js` to clear stale port; install via `psmisc` if missing |

### Hardware

- **CPU**: Any x86-64 or ARM64 — no GPU required
- **RAM**: 2 GB minimum; index + embeddings use ~500 MB
- **GPU**: Optional — a CUDA GPU reduces index build time but `faiss-cpu` and `sentence-transformers` run fully on CPU
- **Disk**: ~1 GB free for venv and generated index files

---

## Setup & Running

### Step 1 — Clone

```bash
git clone https://github.com/kshitij-ka/SpecForge.git
cd SpecForge
```

### Step 2 — Python virtual environment

```bash
python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install --upgrade pip
pip install -r requirements.txt
```

`requirements.txt`:
```
pymupdf>=1.24.0
faiss-cpu>=1.7.4
rank-bm25>=0.2.2
sentence-transformers>=3.0.0
numpy>=1.26.0
```

> `sentence-transformers` downloads `all-MiniLM-L6-v2` (~90 MB) from HuggingFace on first use.

### Step 3 — Build the FAISS index

The processed JSON is committed. Index files are gitignored and must be built once locally.

```bash
source .venv/bin/activate
python inference.py --build
```

This encodes all 1,269 chunks and writes `embeddings.npy` and `faiss.index` to `data/processed/`. It takes **~2 min on CPU**. Subsequent starts load the cached files, so no rebuild is needed unless the chunks change.

### Step 4 — Node.js dependencies

```bash
cd web/server && npm install
cd ../client && npm install
```

### Step 5 — Environment variables

```bash
cp web/server/.env.example web/server/.env
```

Edit `web/server/.env`:

```env
# Required for LLM explanations, query rewriting, and /api/chat
GROQ_API_KEY=your_groq_api_key_here

# Optional — defaults to 5000
PORT=5000

# Required if "python" is not Python 3 — point to your venv
PYTHON_BIN=/path/to/SpecForge/.venv/bin/python3
```

> `PYTHON_BIN` accepts only `"python"`, `"python3"`, or an absolute path. The server validates and rejects arbitrary values on startup.

### Step 6 — Start the application

**Terminal 1 — API server (port 5000):**
```bash
cd web/server
npm start
```
Wait for the log line `Python retriever ready` (~20 s on first boot). The server accepts queries once it appears.

**Terminal 2 — Frontend dev server (port 5173):**
```bash
cd web/client
npm run dev
```

Open **http://localhost:5173**. The Vite dev server proxies all `/api/*` requests to `:5000`.

---

## Using `inference.py` (Judge Entry Point)

`inference.py` is the mandatory entry point. It runs independently of the web server.

> Always activate the virtual environment first: `source .venv/bin/activate`

### Build / force-rebuild the index

```bash
python inference.py --build
```

### Single query (interactive testing)

```bash
python inference.py --query "Which standard covers 33 grade OPC cement?"
```

Output:
```
============================================================
Query : Which standard covers 33 grade OPC cement?
Latency: 0.019s

Top results:
  1. IS 269: 1989 — Ordinary Portland Cement, 33 Grade
     Category: Cement and Concrete  |  Section: Scope  |  Score: 0.8921
  2. IS 8112: 1989 — 43 Grade Ordinary Portland Cement
     ...
```

### Batch evaluation (judge command)

```bash
python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json
```

Input format:
```json
[
  {
    "id": "PUB-01",
    "query": "We are a small enterprise manufacturing 33 Grade OPC...",
    "expected_standards": ["IS 269: 1989"]
  }
]
```

Output format:
```json
[
  {
    "id": "PUB-01",
    "query": "...",
    "retrieved_standards": ["IS 269: 1989", "IS 8112: 1989", "..."],
    "details": [
      {
        "standard_id": "IS 269: 1989",
        "title": "Ordinary Portland Cement, 33 Grade",
        "category": "Cement and Concrete",
        "score": 0.8921,
        "matched_section": "Scope"
      }
    ],
    "latency_seconds": 0.019,
    "expected_standards": ["IS 269: 1989"]
  }
]
```

---

## Evaluation

```bash
# Step 1: generate results
python inference.py \
  --input  data/processed/public_test_set.json \
  --output data/processed/retrieval_results.json

# Step 2: score
python eval_script.py \
  --results data/processed/retrieval_results.json
```

Targets and our results on the public set:

| Metric | Formula | Target | Achieved |
|---|---|---|---|
| Hit Rate @3 | queries with an expected standard in the top 3 / total queries | > 80% | **100%** |
| MRR @5 | Σ(1/rank_i) / N | > 0.7 | **0.950** |
| Avg Latency | total_time / num_queries | < 5 s | **~0.018 s** |
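
The two ranking formulas in the table can be reproduced in a few lines. This sketch assumes exact string matching on standard IDs and the result format shown under "Output format"; the provided `eval_script.py` may normalise IDs differently.

```python
def hit_rate_at_k(results, k=3):
    """Fraction of queries with at least one expected standard in the top k."""
    hits = sum(
        1 for r in results
        if set(r["expected_standards"]) & set(r["retrieved_standards"][:k])
    )
    return hits / len(results)


def mrr_at_k(results, k=5):
    """Mean of 1/rank of the first expected standard within the top k (0 if absent)."""
    total = 0.0
    for r in results:
        expected = set(r["expected_standards"])
        for rank, sid in enumerate(r["retrieved_standards"][:k], start=1):
            if sid in expected:
                total += 1.0 / rank
                break
    return total / len(results)


# Synthetic run: nine queries hit at rank 1, one hits at rank 2.
results = [
    {"expected_standards": ["A"], "retrieved_standards": ["A", "B", "C"]}
    for _ in range(9)
] + [{"expected_standards": ["A"], "retrieved_standards": ["B", "A", "C"]}]

hit_rate = hit_rate_at_k(results)
mrr = mrr_at_k(results)
```

Nine hits at rank 1 and one at rank 2 give an MRR of (9 × 1 + 1 × 0.5) / 10 = 0.95, which matches the reported score for a run where every expected standard lands in the top 2.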

---

## API Reference

All endpoints on Express server (default `http://localhost:5000`).

### `POST /api/recommend`

Core RAG endpoint. Retrieval + optional LLM explanations.

```json
// Request
{ "query": "fire resistance for brick masonry", "top_n": 5, "rewrite": false }

// Response
{
  "standards": [
    {
      "standard_id": "IS 1905: 1987",
      "title": "Code of Practice for Structural Use of Unreinforced Masonry",
      "category": "Masonry",
      "score": 0.812,
      "matched_section": "Fire Resistance",
      "explanation": "This standard specifies..."
    }
  ],
  "latency": { "retrieval_ms": 19, "llm_ms": 820, "total_ms": 839 }
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Natural-language product description or compliance question |
| `top_n` | integer | 5 | Results to return (1–10) |
| `rewrite` | boolean | `false` | Expand query to IS-standard vocabulary via LLM before retrieval |

Rate limit: 20 req/min.
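
A client call can be sketched with the Python standard library alone. The helper names below are hypothetical, and the sketch assumes the server is reachable at the default `http://localhost:5000`.

```python
import json
import urllib.request


def build_recommend_request(query, top_n=5, rewrite=False,
                            base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for /api/recommend."""
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10")
    payload = json.dumps(
        {"query": query, "top_n": top_n, "rewrite": rewrite}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/recommend",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recommend(query, **kwargs):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_recommend_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_recommend_request("fire resistance for brick masonry")
```

With the server running, `recommend("fire resistance for brick masonry", top_n=3)` would return the response object described above.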

### `POST /api/ask`

Chunk-grounded QA for a specific standard.

```json
{ "standard_id": "IS 1905: 1987", "question": "What is the minimum wall thickness?" }
```

### `POST /api/chat`

Conversational QA over the standards corpus. Requires `GROQ_API_KEY`; returns `503` if absent.

```json
{ "message": "What grades of Portland cement does BIS cover?" }
```

### `GET /api/standards`

Paginated list. Query params: `q` (keyword search), `category`, `page` (default 1), `limit` (default 20, max 100).

### `GET /api/standards/:id`

Returns a single standard. `:id` is the URL-encoded IS ID, e.g. `IS%20269%3A%201989`.
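
IS IDs contain spaces and a colon, so they need percent-encoding before they go into the path. A small sketch using `urllib.parse.quote` (the `standard_path` helper is hypothetical):

```python
from urllib.parse import quote, unquote


def standard_path(standard_id):
    """Percent-encode every reserved character, including ':' and spaces."""
    return "/api/standards/" + quote(standard_id, safe="")


path = standard_path("IS 269: 1989")
round_trip = unquote(path.rsplit("/", 1)[1])
```

Passing `safe=""` ensures that no reserved character, such as a `/` inside an ID, is left unencoded.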

### `GET /api/categories`

All 25 material categories sorted alphabetically.

### `GET /api/stats`

```json
{ "standards": 586, "chunks": 1269, "categories": 25 }
```

---

## Features

| Feature | Description |
|---|---|
| **Hybrid RAG retrieval** | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
| **Re-ranking** | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
| **Part-number disambiguation** | Explicit Part N in query boosts the matching part (+0.30) and penalises sibling parts (-0.20); handles variants such as "Part – 1" and "PART2" |
| **AI explanations** | Groq `llama-3.1-8b-instant` — parallel, fallback-safe |
| **Query rewriting** | LLM expands natural language to IS-standard vocabulary (optional) |
| **Chunk-grounded QA** | Question answered from the most relevant chunk of a specific standard |
| **Conversational chat** | Open-ended QA against the full corpus |
| **Browse & filter** | Paginated standards list with keyword scoring; category gallery |
| **Persistent daemon** | Python retrieval process spawned once at boot; auto-restarts on crash |
| **Internationalisation** | UI in English and Hindi (i18next + react-i18next) |
| **Rate limiting** | 60 req/min global, 20 req/min on LLM endpoints (express-rate-limit), alongside Helmet security headers |
| **Production-ready API** | Input validation, sanitisation, structured JSON logging, latency breakdown |

---

## Tech Stack

| Layer | Technology |
|---|---|
| Embedding model | `all-MiniLM-L6-v2` via `sentence-transformers` |
| Dense index | FAISS `IndexFlatIP` (cosine via inner product) |
| Sparse index | BM25Okapi (`rank-bm25`) |
| PDF parsing | PyMuPDF |
| LLM | Groq API (`llama-3.1-8b-instant`) |
| Backend | Node.js 18 + Express 5 |
| Security middleware | Helmet, CORS, express-rate-limit |
| Frontend | React 19, Vite 8, React Router 7 |
| Internationalisation | i18next, react-i18next, i18next-browser-languagedetector |

---

## Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| `PYTHON_BIN validation failed` on start | Invalid `PYTHON_BIN` | Set to `python`, `python3`, or absolute venv path |
| `ModuleNotFoundError: faiss` | Wrong Python binary (system Python instead of venv) | Set `PYTHON_BIN=/path/to/.venv/bin/python3` in `.env` |
| `Python daemon boot timeout` (90 s) | Index files missing | Run `python inference.py --build` with venv active |
| Results return but no `explanation` field | `GROQ_API_KEY` absent or invalid | Set key in `.env`; retrieval still works, explanations fall back silently |
| `fuser: command not found` on Linux | `psmisc` not installed | `sudo apt install psmisc` / `sudo dnf install psmisc` |
| Port 5000 still in use after crash | `fuser` not available | Manually: `kill $(lsof -t -i:5000)` |

---

## License

See [LICENSE](LICENSE).