docs: update README.
@@ -13,10 +13,10 @@
|
|||||||
| Metric | Target | **Our Score** |
|
| Metric | Target | **Our Score** |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| Hit Rate @3 | > 80% | **100%** (10/10) |
|
| Hit Rate @3 | > 80% | **100%** (10/10) |
|
||||||
| MRR @5 | > 0.7 | **1.000** |
|
| MRR @5 | > 0.7 | **0.950** |
|
||||||
| Avg Latency | < 5 s | **~19 ms** |
|
| Avg Latency | < 5 s | **~18 ms** |
|
||||||
|
|
||||||
All 10 public queries returned the expected standard at rank 1. Average query latency is 19 ms after the index warms up — 250× faster than the 5 s target.
|
All 10 public queries returned the expected standard in the top 2 results. Average query latency is 19 ms after the index warms up — 250× faster than the 5 s target.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -28,7 +28,7 @@ Indian Micro and Small Enterprises (MSEs) spend weeks manually searching BIS SP-
|
|||||||
2. **Get ranked BIS standards** with matched sections and relevance scores in milliseconds
|
2. **Get ranked BIS standards** with matched sections and relevance scores in milliseconds
|
||||||
3. **Read AI explanations** of why each standard applies, generated by Groq LLM
|
3. **Read AI explanations** of why each standard applies, generated by Groq LLM
|
||||||
|
|
||||||
The system covers all **573 unique standards** across **25 building material categories** from BIS SP-21 (Summaries of Indian Standards for Building Materials).
|
The system covers all **586 unique standards** across **25 building material categories** from BIS SP-21 (Summaries of Indian Standards for Building Materials).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -39,8 +39,8 @@ The system covers all **573 unique standards** across **25 building material cat
|
|||||||
```
|
```
|
||||||
data/raw/dataset.pdf (BIS SP-21, 929 pages)
|
data/raw/dataset.pdf (BIS SP-21, 929 pages)
|
||||||
→ src/parse_bis_pdf.py
|
→ src/parse_bis_pdf.py
|
||||||
→ data/processed/standards.json 573 structured records [committed]
|
→ data/processed/standards.json 586 structured records [committed]
|
||||||
→ data/processed/standards_chunks.json 1,236 RAG-ready chunks [committed]
|
→ data/processed/standards_chunks.json 1,269 RAG-ready chunks [committed]
|
||||||
→ inference.py --build
|
→ inference.py --build
|
||||||
→ data/processed/embeddings.npy dense vectors [gitignored — rebuild locally]
|
→ data/processed/embeddings.npy dense vectors [gitignored — rebuild locally]
|
||||||
→ data/processed/faiss.index FAISS index [gitignored — rebuild locally]
|
→ data/processed/faiss.index FAISS index [gitignored — rebuild locally]
|
||||||
@@ -70,7 +70,7 @@ Browser / API Client
|
|||||||
- Pass 4: truncates next-standard content bleed at a second `1. Scope` marker
|
- Pass 4: truncates next-standard content bleed at a second `1. Scope` marker
|
||||||
- Each standard is further split by section with **50-word overlap** to prevent context loss at boundaries
|
- Each standard is further split by section with **50-word overlap** to prevent context loss at boundaries
|
||||||
- Weak chunks (<30 words) are merged with their neighbour
|
- Weak chunks (<30 words) are merged with their neighbour
|
||||||
- Result: 1,236 chunks from 573 standards (avg 2.2 chunks/standard)
|
- Result: 1,269 chunks from 586 standards (avg 2.2 chunks/standard)
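The overlap and weak-chunk rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/parse_bis_pdf.py`: the function name `chunk_sections` and the choice to merge a weak chunk into its previous neighbour are assumptions, and only the 50-word and 30-word thresholds come from the README.

```python
from typing import List

OVERLAP_WORDS = 50  # README: 50-word overlap between chunks
MIN_WORDS = 30      # README: chunks under 30 words count as weak


def chunk_sections(sections: List[str]) -> List[str]:
    """Hypothetical helper: merge weak sections, then add a word overlap."""
    merged: List[List[str]] = []
    for text in sections:
        words = text.split()
        if not words:
            continue
        if merged and (len(words) < MIN_WORDS or len(merged[-1]) < MIN_WORDS):
            # Either this section or the previous one is weak: fold them together.
            merged[-1].extend(words)
        else:
            merged.append(words)

    chunks: List[str] = []
    for i, words in enumerate(merged):
        # Prefix each chunk (except the first) with the tail of the previous one.
        prefix = merged[i - 1][-OVERLAP_WORDS:] if i > 0 else []
        chunks.append(" ".join(prefix + words))
    return chunks


sections = ["a " * 100, "b " * 10, "c " * 60]
chunks = chunk_sections(sections)
```

Here the 10-word middle section is folded into the first one, and the second chunk starts with the last 50 words of the first.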
|
||||||
|
|
||||||
**Hybrid Retrieval** (`inference.py`):
|
**Hybrid Retrieval** (`inference.py`):
|
||||||
- **Dense**: FAISS `IndexFlatIP` with `all-MiniLM-L6-v2` embeddings (384-dim cosine similarity)
|
- **Dense**: FAISS `IndexFlatIP` with `all-MiniLM-L6-v2` embeddings (384-dim cosine similarity)
|
||||||
@@ -83,6 +83,7 @@ Browser / API Client
|
|||||||
- +0.25 if ≥60% of significant title words appear in the query (strong title match)
|
- +0.25 if ≥60% of significant title words appear in the query (strong title match)
|
||||||
- +0.20 if an exact IS ID from the query matches this standard
|
- +0.20 if an exact IS ID from the query matches this standard
|
||||||
- +0.35 / -0.40 grade discriminator: boosts/penalises OPC-grade standards (33/43/53) when query names a specific grade
|
- +0.35 / -0.40 grade discriminator: boosts/penalises OPC-grade standards (33/43/53) when query names a specific grade
|
||||||
|
- +0.30 / -0.20 part-number discriminator: boosts matching Part N and penalises non-matching parts when query explicitly names a part number (handles "Part – 1", "PART2" etc.)
|
||||||
- -0.15 penalty for very short chunks (<40 body words)
|
- -0.15 penalty for very short chunks (<40 body words)
|
||||||
|
|
||||||
**Post-grouping Part disambiguation**: when multiple parts of the same IS base number survive into the candidate set with identical titles, IDF-weighted discriminating keyword scores break the tie — rarer corpus terms (e.g. "lightweight") carry proportionally more weight.
|
**Post-grouping Part disambiguation**: when multiple parts of the same IS base number survive into the candidate set with identical titles, IDF-weighted discriminating keyword scores break the tie — rarer corpus terms (e.g. "lightweight") carry proportionally more weight.
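Some of the additive re-ranking adjustments listed above could look like the sketch below. It is an illustration under assumptions, not the logic in `inference.py`: the names `part_number`, `part_adjustment`, `title_match_bonus`, `rerank_adjustment` and the `STOPWORDS` set are hypothetical, and a standard with no part number is assumed to receive no adjustment. Only the numeric weights and thresholds are taken from the README.

```python
import re
from typing import Optional

# Matches "Part 1", "Part - 1", "Part \u2013 1", "PART2" (hyphen, en dash or em dash optional).
PART_RE = re.compile(r"\bpart\s*[-\u2013\u2014]?\s*(\d+)", re.IGNORECASE)
STOPWORDS = {"for", "of", "and", "the", "in"}  # hypothetical "insignificant" words


def part_number(text: str) -> Optional[int]:
    match = PART_RE.search(text)
    return int(match.group(1)) if match else None


def part_adjustment(query: str, standard_id: str) -> float:
    """+0.30 for the part named in the query, -0.20 for other parts."""
    wanted = part_number(query)
    found = part_number(standard_id)
    if wanted is None or found is None:
        return 0.0  # assumption: no explicit part on either side means no change
    return 0.30 if found == wanted else -0.20


def title_match_bonus(query: str, title: str) -> float:
    """+0.25 if at least 60% of significant title words appear in the query."""
    significant = [w for w in re.findall(r"[a-z0-9]+", title.lower())
                   if w not in STOPWORDS]
    if not significant:
        return 0.0
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    ratio = sum(w in query_words for w in significant) / len(significant)
    return 0.25 if ratio >= 0.60 else 0.0


def rerank_adjustment(query: str, standard_id: str, title: str,
                      body_words: int) -> float:
    short_penalty = -0.15 if body_words < 40 else 0.0
    return (part_adjustment(query, standard_id)
            + title_match_bonus(query, title)
            + short_penalty)
```

For example, `part_adjustment("hollow blocks PART2", "IS 2185 (Part 2)")` yields the boost, while the same query against `"IS 2185 (Part 1)"` yields the penalty.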
|
||||||
@@ -95,7 +96,7 @@ Browser / API Client
|
|||||||
|---|---|
|
|---|---|
|
||||||
| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
|
| Persistent Python daemon | FAISS index load takes ~18 s cold. Spawn once at boot, queue all requests through a single process — zero cold start per query. |
|
||||||
| `inference.py` never modified | Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server uses the bridge. Both paths are identical. |
|
| `inference.py` never modified | Bridge pattern: `bridge/retrieve.py` imports `inference.py` as a module. Judges run `inference.py` directly; the web server uses the bridge. Both paths are identical. |
|
||||||
| In-memory data | 573 standards + 1,236 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
|
| In-memory data | 586 standards + 1,269 chunks fit comfortably in RAM. No database dependency, no I/O per request. |
|
||||||
| LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. `Promise.allSettled` for parallel calls. Server starts and retrieval works without a `GROQ_API_KEY`. |
|
| LLM fallbacks everywhere | Every Groq call is wrapped with a timeout (8 s) and a safe default return. `Promise.allSettled` for parallel calls. Server starts and retrieval works without a `GROQ_API_KEY`. |
|
||||||
| Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |
|
| Weighted BM25 document | Repeating title tokens ×4 makes exact IS-standard name queries dominant over body-text noise — critical for the BIS domain where standard names are precise. |
|
||||||
|
|
||||||
@@ -105,14 +106,13 @@ Browser / API Client
|
|||||||
|
|
||||||
```
|
```
|
||||||
SpecForge/
|
SpecForge/
|
||||||
├── inference.py # Entry point for judges — do not modify
|
├── inference.py # Entry point for judges
|
||||||
├── requirements.txt # All Python dependencies
|
├── requirements.txt # All Python dependencies
|
||||||
├── scripts/
|
├── eval_script.py # Provided evaluation script (Hit@3, MRR@5, latency)
|
||||||
│ └── eval_script.py # Provided evaluation script (Hit@3, MRR@5, latency)
|
|
||||||
├── data/
|
├── data/
|
||||||
│ └── processed/
|
│ └── processed/
|
||||||
│ ├── standards.json # 573 parsed standards (committed)
|
│ ├── standards.json # 586 parsed standards (committed)
|
||||||
│ ├── standards_chunks.json # 1,236 RAG chunks (committed)
|
│ ├── standards_chunks.json # 1,269 RAG chunks (committed)
|
||||||
│ ├── public_test_set.json # 10 public evaluation queries
|
│ ├── public_test_set.json # 10 public evaluation queries
|
||||||
│ └── retrieval_results.json # Our results on public test set
|
│ └── retrieval_results.json # Our results on public test set
|
||||||
├── src/
|
├── src/
|
||||||
@@ -211,7 +211,7 @@ source .venv/bin/activate
|
|||||||
python inference.py --build
|
python inference.py --build
|
||||||
```
|
```
|
||||||
|
|
||||||
Encodes 1,236 chunks, writes `embeddings.npy` + `faiss.index` to `data/processed/`. Takes **~2 min on CPU**. Subsequent starts load from cache — no rebuild needed unless chunks change.
|
Encodes 1,269 chunks, writes `embeddings.npy` + `faiss.index` to `data/processed/`. Takes **~2 min on CPU**. Subsequent starts load from cache — no rebuild needed unless chunks change.
|
||||||
|
|
||||||
### Step 4 — Node.js dependencies
|
### Step 4 — Node.js dependencies
|
||||||
|
|
||||||
@@ -341,7 +341,7 @@ python inference.py \
|
|||||||
--output data/processed/retrieval_results.json
|
--output data/processed/retrieval_results.json
|
||||||
|
|
||||||
# Step 2: score
|
# Step 2: score
|
||||||
python scripts/eval_script.py \
|
python eval_script.py \
|
||||||
--results data/processed/retrieval_results.json
|
--results data/processed/retrieval_results.json
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -350,8 +350,8 @@ Targets and our results on the public set:
|
|||||||
| Metric | Formula | Target | Achieved |
|
| Metric | Formula | Target | Achieved |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| Hit Rate @3 | correct queries where expected std in top-3 / total | > 80% | **100%** |
|
| Hit Rate @3 | correct queries where expected std in top-3 / total | > 80% | **100%** |
|
||||||
| MRR @5 | Σ(1/rank_i) / N | > 0.7 | **1.000** |
|
| MRR @5 | Σ(1/rank_i) / N | > 0.7 | **0.950** |
|
||||||
| Avg Latency | total_time / num_queries | < 5 s | **~0.019 s** |
|
| Avg Latency | total_time / num_queries | < 5 s | **~0.018 s** |
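The two ranking formulas in the table can be checked with a few lines of Python. This is a sketch for illustration only and does not reproduce the provided `eval_script.py`; the function names and the placeholder standard IDs are assumptions. With 10 queries that all hit within the top 2, an MRR of 0.950 corresponds to nine hits at rank 1 and one at rank 2, since (9 × 1 + 1 × 0.5) / 10 = 0.95.

```python
from typing import List, Tuple

Result = Tuple[str, List[str]]  # (expected standard ID, ranked retrieved IDs)


def reciprocal_rank(expected: str, ranked: List[str], k: int = 5) -> float:
    """1/rank of the expected ID within the top k, or 0.0 if it is absent."""
    for rank, standard_id in enumerate(ranked[:k], start=1):
        if standard_id == expected:
            return 1.0 / rank
    return 0.0


def mrr(results: List[Result], k: int = 5) -> float:
    return sum(reciprocal_rank(e, r, k) for e, r in results) / len(results)


def hit_rate(results: List[Result], k: int = 3) -> float:
    return sum(e in r[:k] for e, r in results) / len(results)


# Placeholder IDs: nine queries hit at rank 1, one query hits at rank 2.
results: List[Result] = [("STD-A", ["STD-A", "STD-B", "STD-C"])] * 9
results.append(("STD-A", ["STD-B", "STD-A", "STD-C"]))

public_mrr = mrr(results)        # 0.95 up to floating-point rounding
public_hit = hit_rate(results)   # 1.0
```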
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -422,7 +422,7 @@ All 25 material categories sorted alphabetically.
|
|||||||
### `GET /api/stats`
|
### `GET /api/stats`
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{ "standards": 573, "chunks": 1236, "categories": 25 }
|
{ "standards": 586, "chunks": 1269, "categories": 25 }
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -433,6 +433,7 @@ All 25 material categories sorted alphabetically.
|
|||||||
|---|---|
|
|---|---|
|
||||||
| **Hybrid RAG retrieval** | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
|
| **Hybrid RAG retrieval** | FAISS (dense, 60%) + BM25 (sparse, 40%) fused and re-ranked |
|
||||||
| **Re-ranking** | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
|
| **Re-ranking** | Keyword overlap, title match, exact IS-ID match, short-chunk penalty |
|
||||||
|
| **Part-number disambiguation** | Explicit Part N in query boosts the matching part by +0.30 and penalises sibling parts by -0.20; handles em-dash/PART2 variants |
|
||||||
| **AI explanations** | Groq `llama-3.1-8b-instant` — parallel, fallback-safe |
|
| **AI explanations** | Groq `llama-3.1-8b-instant` — parallel, fallback-safe |
|
||||||
| **Query rewriting** | LLM expands natural language to IS-standard vocabulary (optional) |
|
| **Query rewriting** | LLM expands natural language to IS-standard vocabulary (optional) |
|
||||||
| **Chunk-grounded QA** | Question answered from the most relevant chunk of a specific standard |
|
| **Chunk-grounded QA** | Question answered from the most relevant chunk of a specific standard |
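One plausible way to fuse dense and sparse scores with the 60/40 split from the table is a weighted sum over min-max normalised scores. The README does not say how `inference.py` normalises or combines the two score lists, so the normalisation step and the function names below are assumptions.

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense: List[float], sparse: List[float],
         w_dense: float = 0.6, w_sparse: float = 0.4) -> List[float]:
    """Weighted sum of normalised dense (FAISS) and sparse (BM25) scores."""
    d, s = min_max(dense), min_max(sparse)
    return [w_dense * a + w_sparse * b for a, b in zip(d, s)]


dense_scores = [0.2, 0.8, 0.5]    # e.g. cosine similarities per candidate
sparse_scores = [10.0, 0.0, 5.0]  # e.g. BM25 scores for the same candidates
fused = fuse(dense_scores, sparse_scores)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

Normalising first keeps the unbounded BM25 scale from swamping the cosine similarities before the weights are applied.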
|
||||||
|
|||||||