Essential Kubeflow

Engineering ML Workflows on Kubernetes

  • 1st Edition - July 1, 2026
  • Latest edition
  • Authors: Prashanth Josyula, Sonika Arora, Anant Kumar
  • Language: English

Description

Essential Kubeflow: Engineering ML Workflows on Kubernetes provides the tools needed to transform ML workflows from experimental notebooks to production-ready platforms. Through hands-on examples and production-tested patterns, readers will master essential skills for building enterprise-grade Machine Learning platforms, including architecting production systems on Kubernetes, designing end-to-end ML pipelines, implementing robust model serving, efficiently scaling workloads, managing multi-user environments, deploying automated MLOps workflows, and integrating with existing ML tools. Whether you're a Machine Learning engineer looking to operationalize models, a platform engineer diving into ML infrastructure, or a technical leader architecting ML systems, this book provides solutions for real-world challenges.

With this comprehensive guide to Kubeflow, a widely adopted open-source MLOps platform for automating ML workloads, readers will gain the expertise to build and maintain scalable ML platforms that can handle the demands of modern enterprise AI initiatives.

Key features

  • Provides readers with a comprehensive, step-by-step guide to building reliable ML pipelines with automated workflows, testing, and deployment using Kubeflow's pipeline components
  • Includes clear strategies for monitoring ML workloads, managing resources, handling multi-user environments, and maintaining production platforms at scale
  • Presents proven solutions and architectural patterns drawn from actual production deployments, showing readers how to avoid common pitfalls and accelerate ML initiatives

Readership

Computer scientists, academics, and researchers in Artificial Intelligence and Machine Learning, as well as professionals such as ML Engineers, DevOps Engineers, Platform Engineers, MLOps Engineers, Infrastructure Engineers, Site Reliability Engineers, Data Scientists, Software Engineers focused on ML, Cloud Engineers, and Technical Architects.

Table of contents

Part I: Foundation

1. Kubernetes Essentials for ML Engineers

1.2. Container Fundamentals and Docker Basics

1.3. Kubernetes Architecture Overview

1.4. Key Concepts: Pods, Deployments, Services

1.5. Resource Management and Scheduling

1.6. StatefulSets and Persistent Storage

1.7. Networking and Service Discovery

2. Getting Started with Kubeflow

2.1. Understanding ML Platforms and MLOps

2.2. Kubeflow Architecture and Components

2.3. Installation and Environment Setup

2.4. Multi-user Management Basics

2.5. Platform Security Fundamentals

Part II: Building ML Workflows

3. Understanding Kubeflow Pipelines

3.1. Pipeline Architecture Fundamentals

3.2. The Pipeline SDK and DSL

3.3. Building Your First Pipeline

3.4. Pipeline Components and Artifacts

3.5. Pipeline Execution and Debugging

4. Advanced Pipeline Development

4.1. Designing Reusable Components

4.2. Managing Pipeline Parameters

4.3. Error Handling Strategies

4.4. Pipeline Versioning and Storage

4.5. CI/CD Integration Patterns

5. Experimentation with Notebooks

5.1. JupyterHub in Kubeflow

5.2. Managing Notebook Servers

5.3. Resource Allocation and Quotas

5.4. Persistent Storage Configuration

5.5. From Notebooks to Pipelines

Part III: Model Development and Training

6. Training at Scale

6.1. Understanding Training Operators

6.2. Distributed Training Basics

6.3. TensorFlow Training on Kubeflow

6.4. PyTorch Training on Kubeflow

6.5. Resource Management for Training

7. Hyperparameter Tuning with Katib

7.1. Experiment Configuration

7.2. Defining Search Spaces

7.3. Understanding Search Algorithms

7.4. Managing Training Trials

7.5. Analyzing Experiment Results

Part IV: Model Deployment

8. Serving Models with KServe

8.1. KServe Architecture Overview

8.2. Model Server Deployment

8.3. Inference Service Configuration

8.4. Model Updates and Versioning

8.5. Performance Monitoring

9. Production Operations

9.1. Monitoring ML Workloads

9.2. Resource Management

9.3. Security Best Practices

9.4. Platform Maintenance

9.5. Troubleshooting Guide

Part V: Enterprise Implementation

10. Production Best Practices

10.1. Building Enterprise ML Platforms

10.2. Multi-tenant Architecture Design

10.3. Scaling Strategies and Patterns

10.4. Cost Optimization Techniques

10.5. Team Collaboration Models

11. Platform Integration and Ecosystem

11.1. Integrating with Data Lakes

11.2. CI/CD Pipeline Integration

11.3. Monitoring Stack Integration

11.4. External Model Registry Systems

11.5. Cloud Provider Integrations


About the authors

Prashanth Josyula

Prashanth Josyula is a seasoned IT professional based in San Francisco, USA, with over 16 years of industry experience spanning enterprise software engineering, artificial intelligence, and cloud-native infrastructure. He specializes in AI/ML systems, Kubernetes, MLOps, and service mesh technologies, and has consistently contributed to building intelligent, scalable, and resilient platforms that power next-generation applications.

In his current role as a Principal Member of Technical Staff (PMTS) at Salesforce, Prashanth is at the forefront of architecting cloud-native solutions that seamlessly integrate AI-driven automation, real-time data processing, and large-scale distributed systems. His work spans platform services, ML infrastructure, and enterprise-grade deployments, enabling cross-functional teams to build, deploy, and manage intelligent applications with speed and reliability.

Prashanth is also an active thought leader and speaker, regularly presenting at industry-leading conferences. His talks cover advanced topics such as ML/AI Ops, Retrieval-Augmented Generation (RAG), AI Agents, Responsible AI, and Time-Series Forecasting, drawing on practical insights from real-world enterprise experience. With a strong passion for both innovation and knowledge sharing, Prashanth combines deep technical expertise with a commitment to advancing the field through mentorship, public speaking, authorship, and contributions to research and open-source communities.

Affiliations and expertise
Salesforce AI Cloud, Fremont, CA, USA

Sonika Arora

Sonika Arora is a seasoned software engineer with over a decade of experience building scalable, resilient, and intelligent distributed systems. She currently serves as a Lead Member of Technical Staff at Salesforce, where she architects and delivers complex microservice-based platforms that power machine learning workflows at scale.

At Salesforce, Sonika has played a pivotal role in designing orchestration platforms that seamlessly integrate compute services such as training, prediction, and modeling of ML jobs. By leveraging technologies like AWS Lambda, DynamoDB Streams, Kubernetes, and Terraform, she has led initiatives that ensure concurrency, reliability, and observability across distributed architectures. Prior to Salesforce, she made significant contributions at PayPal, where she helped build real-time monitoring systems and QR code payment infrastructure, delivering solutions optimized for scale, fault tolerance, and performance.

Sonika's strength lies in fusing backend engineering with system-level thinking to create cloud-native systems enriched with automation, monitoring, and intelligent orchestration. She remains passionate about advancing AI-powered platforms, stream processing, and high-throughput infrastructure.

Affiliations and expertise
Salesforce AI Cloud, Sunnyvale, CA, USA

Anant Kumar

Anant Kumar is a seasoned technology leader at Salesforce, where he leads the Data Lake team within the Einstein AI Platform. With over 20 years of experience in distributed systems, AI/ML infrastructure, and cloud-native architectures, he architects enterprise-scale Apache Spark services and data lake solutions that power Salesforce’s predictive and generative AI.

His technical expertise includes building scalable Spark services on Kubernetes, developing cloud-native data pipelines processing billions of events, and designing secure, high-performance infrastructure for AI/ML workloads. He holds multiple U.S. patents in network visibility and security, and his innovations have earned him industry recognition.

Anant is a passionate advocate for responsible AI, contributing to IEEE conferences, peer-reviewed journals, and academic reviews. He mentors emerging researchers and students through nonprofit organizations and serves as a technical reviewer for leading publishers such as O'Reilly, Packt, Manning, and PLOS ONE.

He is an alumnus of the Stanford Graduate School of Business Ignite Program and actively supports interdisciplinary collaboration across AI, cloud infrastructure, and data science. Recognized for his leadership, mentorship, and commitment to ethical innovation, Anant continues to shape the future of enterprise AI platforms.

Affiliations and expertise
Salesforce AI Cloud, San Jose, CA, USA