Piyush Choudhari

KnowFlow

KnowFlow is a hybrid Retrieval-Augmented Generation (RAG) system that combines semantic vector search with a knowledge graph for document processing and querying.

Jul 2025 - Aug 2025
Impact:
  • 3x query types supported
  • Session-based context preservation
  • Multi-format document processing

Deployment (HLD)

[Diagram: deployment high-level design]

AI Infra

[Diagram: AI infrastructure]

Features

  • Advanced Document Processing

    • Multi-format support (PDF, DOCX, CSV, TXT)
    • Intelligent chunking with configurable size and overlap
    • Parallel batch processing with S3 storage
    • Document status tracking (PENDING, PROCESSING, INDEXED, FAILED)
    • Secure per-user document isolation
  • Hybrid RAG + Knowledge Graph Architecture

    • Dense semantic embeddings via Google Gemini + pgvector
    • Structured knowledge extraction to Neo4j
    • Multi-hop reasoning through graph relationships
    • Automatic entity and relationship mapping
    • Query decomposition for complex questions
  • Smart Query Processing

    • Automatic query decomposition for complex questions
    • Hybrid vector + graph-based retrieval
    • Retrieval quality evaluation and improvement
    • Context-aware response synthesis
    • Conversation memory with graph context
  • Chat & Session Management

    • Persistent chat sessions with history
    • Context-aware follow-up questions
    • Session renaming and management
    • Message tracking with context preservation
    • Multi-user support with isolation
  • Security & Authentication

    • JWT-based authentication
    • Secure password hashing with bcrypt
    • Role-based access control
    • Per-user data isolation
    • Document access verification
  • Storage & Infrastructure

    • S3-compatible object storage
    • PostgreSQL for structured data
    • Neo4j for graph relationships
    • Concurrent file operations
    • Efficient batch processing
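
The four document statuses listed under Advanced Document Processing can be modelled as a small state machine. The sketch below is illustrative only: the status names come from the feature list, but the `DocumentStatus` enum, the `ALLOWED_TRANSITIONS` table, and the `advance` helper are hypothetical, and the transition rules (including retrying a failed document) are assumptions rather than KnowFlow's confirmed behaviour.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    """Status values taken from the feature list above."""
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    INDEXED = "INDEXED"
    FAILED = "FAILED"


# Assumption: these transitions are a plausible lifecycle, not the project's actual rules.
ALLOWED_TRANSITIONS = {
    DocumentStatus.PENDING: {DocumentStatus.PROCESSING},
    DocumentStatus.PROCESSING: {DocumentStatus.INDEXED, DocumentStatus.FAILED},
    DocumentStatus.FAILED: {DocumentStatus.PROCESSING},  # assumed: failed documents may be retried
    DocumentStatus.INDEXED: set(),  # assumed: indexing is terminal
}


def advance(current: DocumentStatus, new: DocumentStatus) -> DocumentStatus:
    """Return the new status, or raise ValueError if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new
```

Keeping the allowed transitions in one table makes it hard for a background worker to mark a document INDEXED without it having passed through PROCESSING first.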

Quick Start

Prerequisites

  • Python 3.8+
  • PostgreSQL 14+ with pgvector extension
  • Neo4j 5.0+
  • S3-compatible storage
  • Google Cloud API key for Gemini

Environment Variables

# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/knowflow
VECTOR_COLLECTION_NAME=document_embeddings

# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

# Google API
GOOGLE_API_KEY=your_gemini_api_key
GEMINI_MODEL_NAME=gemini-pro
GEMINI_EMBEDDING_MODEL=embedding-001

# AWS S3
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
S3_BUCKET_NAME=knowflow-documents

# App Settings
SECRET_KEY=your_jwt_secret_key
ACCESS_TOKEN_EXPIRE_MINUTES=60
CHUNK_SIZE=1000
CHUNK_OVERLAP=100
TOP_K_RESULTS=3
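
CHUNK_SIZE and CHUNK_OVERLAP control how documents are split before embedding. As a rough illustration of what the two values mean, the sketch below splits text into fixed-size windows that share `chunk_overlap` characters with their predecessor. It assumes the sizes are measured in characters, which the settings above do not state; `load_chunk_settings` and `chunk_text` are hypothetical helper names, not functions from the KnowFlow codebase.

```python
import os
from typing import List, Mapping, Tuple


def load_chunk_settings(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read CHUNK_SIZE and CHUNK_OVERLAP, defaulting to the values shown above."""
    return int(env.get("CHUNK_SIZE", "1000")), int(env.get("CHUNK_OVERLAP", "100"))


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters (assumed unit) that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", 4, 1)` returns `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one. With the defaults above, a 2,500-character document yields three chunks of 1,000, 1,000, and 700 characters.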

Development Setup

  1. Clone the repository:
git clone https://github.com/yourusername/knowflow.git
cd knowflow
  2. Install dependencies:
pip install -r requirements.txt
  3. Run migrations:
alembic upgrade head
  4. Start the development server:
uvicorn src.main:app --reload

API Documentation

Authentication

  • POST /auth/register - Register new user
  • POST /auth/login - Login and get JWT token
  • GET /auth/me - Get current user info

Documents

  • POST /documents/upload - Upload multiple documents
  • POST /documents/{doc_id}/index - Index document content
  • GET /documents - List user documents
  • GET /documents/{doc_id} - Get document details

Chat

  • POST /chat/query - Process a new query
  • POST /chat/sessions/{session_id}/messages - Send follow-up message
  • GET /chat/sessions - List chat sessions
  • PUT /chat/sessions/{session_id}/rename - Rename session
  • DELETE /chat/sessions/{session_id} - Delete session
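
A client call to the chat endpoints might look like the sketch below, which only builds the HTTP request and does not send it. Several details are assumptions not documented above: the base URL (uvicorn's default `127.0.0.1:8000`), the `Bearer` authorization scheme, and the JSON field name `query`. `build_query_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: the dev server runs on uvicorn's default host and port.
BASE_URL = "http://127.0.0.1:8000"


def build_query_request(token: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST /chat/query request.

    Assumptions: the JWT is passed as a Bearer token and the body
    uses a "query" field; check the running API's schema for the real names.
    """
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/query",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Once the server is running, the request can be sent with `urllib.request.urlopen(request)`, using the token returned by POST /auth/login.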

Security Features

  • JWT-based authentication with expiration
  • Bcrypt password hashing
  • Per-user document isolation
  • Access control verification
  • Secure file storage paths
  • Input validation and sanitization
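
To make "JWT-based authentication with expiration" concrete, the sketch below signs and verifies an HS256 token with an `exp` claim using only the Python standard library. It is an illustration of the token format, not KnowFlow's implementation: the README does not say which JWT library or signing algorithm the project uses, and `create_token` and `verify_token` are hypothetical names. A real deployment should rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(subject: str, secret: str, expire_minutes: int, now: Optional[float] = None) -> str:
    """Create an HS256 token (assumed algorithm) that expires after expire_minutes."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "exp": issued + expire_minutes * 60}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token: str, secret: str, now: Optional[float] = None) -> dict:
    """Return the payload, or raise ValueError if the token is invalid or expired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if current >= payload["exp"]:
        raise ValueError("token expired")
    return payload
```

In this sketch, `secret` and `expire_minutes` would correspond to the SECRET_KEY and ACCESS_TOKEN_EXPIRE_MINUTES settings. The signature is compared with `hmac.compare_digest` to avoid leaking timing information.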

Monitoring & Logging

  • Structured logging with levels
  • Request/response tracking
  • Error handling and reporting
  • Performance metrics
  • Document processing status
  • Chat session analytics
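
One common way to get structured, levelled logs in Python is a JSON formatter on top of the standard `logging` module. The sketch below shows that approach; it is not taken from the KnowFlow source, and the `JsonFormatter` class and the extra field names (`document_id`, `status`, `session_id`) are assumptions chosen to match the items listed above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Assumption: hypothetical context fields a caller may attach via `extra=`.
    CONTEXT_FIELDS = ("document_id", "status", "session_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(name: str = "knowflow") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A document-processing worker could then call `build_logger().info("document indexed", extra={"document_id": "doc-1", "status": "INDEXED"})`, and each line can be parsed by a log aggregator without regular expressions.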

License

This project is licensed under the terms of the LICENSE file included in the repository.