# GraphRAG Pipeline Architecture
This document details the 7-phase architecture of the GraphRAG pipeline, covering both Indexing (Graph Construction) and Query (Retrieval & Generation).
## Phase 1: Chunking (Indexing)
**Goal:** Split documents into manageable segments for processing.
> [!NOTE]
> This phase **only** handles text splitting. Vectorization (Embedding) happens in Phase 5, either immediately for indexing or later during query processing.
### Configuration
| `chunk_size` | Maximum number of tokens per chunk | 300 |
| `chunk_overlap` | Number of overlapping tokens between chunks | 30 |
### Guidelines
- **Small Chunks (100-300 tokens):** Best for granular retrieval and precise fact extraction.
- **Large Chunks (500-1000 tokens):** Better for capturing broader context and thematic relationships.
- **Overlap:** Ensure critical information isn't split across boundaries. 10-20% overlap is recommended.
## Phase 2: Entity Extraction (Indexing)
**Goal:** Identify key entities (People, Places, Organizations, Concepts) from text chunks.
### Configuration
| `approach` | Extraction method: `algorithmic` (NLP), `semantic` (LLM), or `hybrid` | `hybrid` |
| `entity_types` | List of entity types to extract | `["organization", "person", "geo", "event"]` |
| `use_gleaning` | Enable multi-pass extraction for missed entities | `true` |
### Methods
1. **Algorithmic:** Uses NLP libraries (e.g., SpaCy) for fast, rule-based extraction. Good for standard entities.
2. **Semantic:** Uses LLMs to identify entities based on context and meaning. Best for domain-specific or abstract concepts.
3. **Hybrid:** Combines both for speed and accuracy.
## Phase 3: Relationship Extraction (Indexing)
**Goal:** Identify relationships between extracted entities.
### Configuration
| `extract_relationships` | Enable/disable relationship extraction | `true` |
| `use_gleaning` | Enable multi-pass extraction for missed relationships | `true` |
| `max_gleanings` | Maximum number of gleaning passes | `1` |
### Methods
1. **Co-occurrence:** rapid identification based on proximity in text.
2. **LLM:** Detailed relationship extraction with semantic understanding.
3. **Gleaning:** Iterative refinement to catch subtle connections.
## Phase 4: Graph Construction (Indexing)
**Goal:** Build a knowledge graph from entities and relationships.
### Configuration
| `enable_pagerank` | Calculate node importance using PageRank | `true` |
| `max_connections` | limit max degree for nodes to reduce noise | `50` |
| `graph.traversal` | Traversal depth for sub-graph queries | `2` |
| `clustering_algorithm` | Algorithm for community detection (e.g., `leiden`) | `leiden` |
### Features
- **PageRank:** Identifies central nodes in the graph.
- **Leiden:** Detects communities and hierarchical structures.
- **Clustering:** Groups related entities for better context retrieval.
## Phase 5: Embedding (Indexing & Query)
**Goal:** Generate vector embeddings for text chunks and graph nodes.
> [!IMPORTANT]
> This phase runs during **Indexing** (to vectorize chunks/nodes) AND during **Query** (to vectorize the user's question).
### Configuration
| `backend` | Embedding provider (OpenAI, Azure, Ollama, etc.) | `openai` |
| `dimension` | Vector dimension size | `1536` |
| `model` | Specific embedding model name | `text-embedding-3-small` |
| `batch_size` | Number of items to embed at once | `16` |
### Providers & Costs
| OpenAI | text-embedding-3-small | $0.02 | 1536 |
| OpenAI | text-embedding-3-large | $0.13 | 3072 |
| Azure | (varies) | (varies) | (varies) |
| Ollama | (local) | $0.00 | (varies) |
## Phase 6: Retrieval (Query)
**Goal:** Find relevant information based on user query.
### Configuration
| `strategy` | Retrieval strategy (`vector`, `graph`, `hybrid`, `local`, `global`) | `hybrid` |
| `top_k` | Number of results to return | `10` |
| `retrieval.score_threshold` | Minimum similarity score | `0.7` |
### Strategies
1. **Vector:** Standard semantic search using embeddings.
2. **Graph:** Traverses the knowledge graph to find related entities.
3. **Hybrid:** Combines Vector and Graph search for best coverage.
4. **Local:** Focuses on immediate neighborhood of query entities.
5. **Global:** Scans entire graph/communities for broad queries.
6. **Text:** Keyword-based search (BM25 or similar).
## Phase 7: Answer Generation (Query)
**Goal:** Synthesize a final answer from retrieved context.
### Configuration
| `chat_model` | LLM for generation | `gpt-4o` |
| `temperature` | Creativity control (0.0 - 1.0) | `0.0` |
| `max_tokens` | Maximum length of generated answer | `2000` |
| `ollama.model` | Local model name if using Ollama | `llama3` |
### Methods
1. **LLM:** Direct generation using retrieved context.
2. **Concatenation:** Simple joining of retrieved text segments.
3. **Explained:** Generates answer with citations and reasoning steps.
---
## Quick Reference: All Configurable Parameters
| **Chunking** | `chunk_size` | int | 300 | Max tokens per chunk |
| | `chunk_overlap` | int | 30 | Overlap tokens |
| **Extraction** | `entity_extraction.approach` | enum | hybrid | Extraction method |
| | `entity_extraction.entity_types` | list | [...] | Target entities |
| | `entity_extraction.use_gleaning` | bool | true | Multi-pass extraction |
| | `relationship_extraction.enabled` | bool | true | Extract relationships |
| **Graph** | `graph.enable_pagerank` | bool | true | Calculate PageRank |
| | `graph.max_connections` | int | 50 | Max node degree |
| | `graph.clustering_algorithm` | enum | leiden | Community detection |
| **Embedding** | `embeddings.backend` | string | openai | Provider name |
| | `embeddings.model` | string | ... | Model name |
| | `embeddings.dimension` | int | 1536 | Vector size |
| **Retrieval** | `retrieval.strategy` | enum | hybrid | Search strategy |
| | `retrieval.top_k` | int | 10 | Result count |
| **Generation** | `llm.model` | string | gpt-4o | Chat model |
| | `llm.temperature` | float | 0.0 | Creativity |
## Performance Table
| **Light** (Algo Extract, Vector Only) | ~30s | ~500ms | Low |
| **Balanced** (Hybrid Extract, Hybrid Search) | ~2m | ~1.5s | Medium |
| **Deep** (All Semantic, Global Search) | ~10m | ~5s+ | High |
---
## Scientific Basis & Inspirations
GraphRAG-rs implements techniques from several cutting-edge research papers. Here is how they map to our 7-phase pipeline:
| **1. Chunking** | **Semantic Chunking** | *LangChain / Gregory Kamradt (2024)* | Respects semantic boundaries better than fixed-size splitting. |
| **2. Extraction** | **Gleaning** | *Microsoft GraphRAG (2024)* | Iterative extraction to catch entities missed in the first pass. |
| **4. Graph** | **Leiden Algorithm** | *Traag et al. (Nature, 2019)* | Superior community detection (modularity) vs. Louvain. |
| **4. Graph** | **Hierarchical Communities** | *Microsoft GraphRAG (2024)* | Enables "Global Search" by summarizing communities at different levels. |
| **6. Retrieval** | **Fast-GraphRAG** | *Pang et al. (2024)* | Uses PageRank/PPR to traverse the graph efficiently. |
| **6. Retrieval** | **LightRAG** | *Zhang et al. (2024)* | Dual-level retrieval (Low-level keywords + High-level topics) for 6000x efficiency. |
| **6. Retrieval** | **HippORAG** | *He et al. (NeurIPS 2024)* | "In-context" learning on the graph via Personalized PageRank (PPR). |
| **6. Retrieval** | **Cross-Encoder** | *Reimers et al. (2019)* | Reranking step to drastically improve precision of retrieved results. |
## Time Estimates (per 100 chunks)
- **Chunking:** < 1s
- **Entity/Rel Extraction (LLM):** ~2-5 mins
- **Graph Construction:** ~10-30s
- **Embedding:** ~10-30s
- **Total Indexing:** ~3-6 mins (highly dependent on LLM speed)