Overview of the MLOps Ecosystem
Machine Learning Operations (MLOps) bridges the gap between experimental ML notebooks and production systems that serve predictions at scale. As enterprises move from proof-of-concept models to revenue-critical inference workloads, the need for a repeatable, auditable, and secure ML lifecycle becomes essential.
The core MLOps lifecycle covers:
- Data ingestion and validation — Ensuring data quality before training begins
- Experiment tracking — Recording hyperparameters, metrics, and artifacts across runs
- Model training and tuning — Scaling training across GPUs and distributed clusters
- Model registry and governance — Versioning, approving, and auditing models
- Deployment and serving — Rolling models into production with canary or blue-green strategies
- Monitoring and observability — Detecting data drift, latency regressions, and accuracy degradation
Two platforms dominate the open-source MLOps landscape:
- MLflow — Focused on experiment tracking, model registry, and lightweight deployment
- Kubeflow — A full Kubernetes-native ML platform covering pipelines, hyperparameter tuning, and serving
Other tools complement or overlap with these:
- Apache Airflow — General-purpose workflow orchestration, often used for data pipelines that feed ML training jobs
- Metaflow — Originally developed at Netflix, focused on ergonomic Python-based ML workflows with built-in versioning
This guide focuses on MLflow and Kubeflow as the primary MLOps building blocks for enterprise teams.
Architecture Comparison
MLflow Architecture
MLflow is organized around four core components:
┌──────────────────────────────────────────────────────┐
│                   MLflow Platform                    │
│                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │   Tracking   │  │     Model    │  │  Projects  │  │
│  │    Server    │  │   Registry   │  │            │  │
│  │              │  │              │  │            │  │
│  │ • Experiments│  │ • Versioning │  │ • Repro-   │  │
│  │ • Runs       │  │ • Staging    │  │   ducible  │  │
│  │ • Metrics    │  │ • Production │  │   runs     │  │
│  │ • Params     │  │ • Archived   │  │            │  │
│  └───────┬──────┘  └───────┬──────┘  └────────────┘  │
│          │                 │                         │
│  ┌───────▼─────────────────▼────┐  ┌────────────┐  │
│  │        Backend Store         │  │    Model   │  │
│  │ (PostgreSQL / MySQL / SQLite)│  │   Serving  │  │
│  └───────────────┬──────────────┘  │            │  │
│                  │                 │ • REST API │  │
│  ┌───────────────▼──────────────┐  │ • Batch    │  │
│  │        Artifact Store        │  │ • Streaming│  │
│  │  (S3 / MinIO / GCS / ADLS)   │  └────────────┘  │
│  └──────────────────────────────┘                  │
└──────────────────────────────────────────────────────┘
- Tracking Server — Logs experiments, parameters, metrics, and artifacts via a REST API
- Model Registry — Manages model lifecycle stages (Staging → Production → Archived)
- Projects — Packages ML code for reproducible runs across environments
- Model Serving — Deploys registered models as REST endpoints
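In practice, these components map onto a small Python API surface. A minimal tracking sketch, assuming a server is reachable at http://localhost:5000 (the experiment name and logged values are illustrative):

# Minimal MLflow tracking sketch; assumes a running tracking server
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)        # hyperparameters
    mlflow.log_metric("val_accuracy", 0.93)        # evaluation metrics
    mlflow.log_dict({"features": ["age", "plan"]}, "schema.json")  # artifact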
Kubeflow Architecture
Kubeflow is a Kubernetes-native platform with multiple integrated components:
┌────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                   │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │                Kubeflow Platform                 │  │
│  │                                                  │  │
│  │  ┌──────────┐  ┌──────────┐  ┌────────────────┐  │  │
│  │  │ Kubeflow │  │   Katib  │  │    Notebooks   │  │  │
│  │  │ Pipelines│  │ (AutoML) │  │    (Jupyter)   │  │  │
│  │  │          │  │          │  │                │  │  │
│  │  │ • DAGs   │  │ • HPO    │  │ • Workspace    │  │  │
│  │  │ • Steps  │  │ • NAS    │  │ • GPU-enabled  │  │  │
│  │  │ • Caching│  │ • Trials │  │                │  │  │
│  │  └──────────┘  └──────────┘  └────────────────┘  │  │
│  │                                                  │  │
│  │  ┌──────────┐  ┌──────────┐  ┌────────────────┐  │  │
│  │  │ Training │  │  KServe  │  │     Central    │  │  │
│  │  │ Operator │  │ (Serving)│  │    Dashboard   │  │  │
│  │  │          │  │          │  │                │  │  │
│  │  │ • TFJob  │  │ • Canary │  │ • UI Portal    │  │  │
│  │  │ • PyTorch│  │ • A/B    │  │ • Namespace    │  │  │
│  │  │   Job    │  │ • Scale  │  │   mgmt         │  │  │
│  │  └──────────┘  └──────────┘  └────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  ┌────────────┐  ┌────────────┐  ┌────────────────┐    │
│  │    Istio   │  │   Knative  │  │  Cert-Manager  │    │
│  │  (Service  │  │ (Serverless│  │     (TLS)      │    │
│  │    Mesh)   │  │  Serving)  │  │                │    │
│  └────────────┘  └────────────┘  └────────────────┘    │
└────────────────────────────────────────────────────────┘
Key Kubeflow components:
- Pipelines — Define and execute multi-step ML workflows as DAGs
- Katib — Hyperparameter tuning and neural architecture search
- Training Operators — Kubernetes-native distributed training for TensorFlow, PyTorch, MPI, and XGBoost
- KServe — High-performance model serving with autoscaling, canary deployments, and explainability
- Notebooks — Managed Jupyter environments with GPU access
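These components are driven from Python through the KFP SDK. A minimal two-step pipeline sketch, assuming the kfp v2 SDK is installed (the component bodies are illustrative placeholders):

# Minimal Kubeflow Pipelines (KFP v2) sketch; component logic is illustrative
from kfp import dsl, compiler

@dsl.component
def preprocess(rows: int) -> int:
    # Stand-in for real data preparation
    return rows * 2

@dsl.component
def train(rows: int) -> str:
    return f"trained on {rows} rows"

@dsl.pipeline(name="demo-training-pipeline")
def pipeline(rows: int = 1000):
    prep = preprocess(rows=rows)
    train(rows=prep.output)

# Compile to an IR YAML that the Pipelines backend can execute
compiler.Compiler().compile(pipeline, "pipeline.yaml")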
Head-to-Head Comparison
| Aspect | MLflow | Kubeflow |
|---|---|---|
| Primary focus | Experiment tracking + model registry | Full ML platform on Kubernetes |
| Infrastructure | Runs anywhere (local, VM, K8s) | Kubernetes-native (requires cluster) |
| Learning curve | Low — pip install, immediate use | High — requires K8s expertise |
| Pipelines | Basic (MLflow Recipes) | Advanced (Argo-based DAGs) |
| Model serving | Built-in REST endpoint | KServe with autoscaling |
| Hyperparameter tuning | External (Optuna, Ray Tune) | Built-in (Katib) |
| Multi-tenancy | Limited (manual RBAC) | Native (namespace isolation) |
| Best for | Small-to-mid teams, rapid iteration | Large teams, regulated industries |
Installation on Ubuntu
Prerequisites
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install essential build tools
sudo apt install -y build-essential curl wget git unzip \
  python3 python3-pip python3-venv
Installing MLflow
Step 1: Create a Virtual Environment
python3 -m venv ~/mlflow-env
source ~/mlflow-env/bin/activate
Step 2: Install MLflow
pip install mlflow
Verify the installation:
mlflow --version
Step 3: Configure a PostgreSQL Backend Store
For production use, MLflow should persist experiment data in a relational database rather than local files.
# Install PostgreSQL
sudo apt install -y postgresql postgresql-contrib
# Create database and user
sudo -u postgres psql -c "CREATE DATABASE mlflow_db;"
sudo -u postgres psql -c "CREATE USER mlflow_user WITH ENCRYPTED PASSWORD 'your-secure-password';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow_user;"
# Install the PostgreSQL driver
pip install psycopg2-binary
⚠️ Warning: Replace the placeholder password above and never reuse default credentials in production. Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) to inject credentials at runtime.
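Before moving on, a quick connectivity check confirms the backend store is reachable (connection values mirror the setup above):

# Connectivity check for the backend store; password is the placeholder above
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="mlflow_db",
    user="mlflow_user",
    password="your-secure-password",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()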
Step 4: Configure an Artifact Store (MinIO / S3)
For storing model artifacts, use S3-compatible storage:
# Install MinIO (for local/on-prem S3-compatible storage)
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin/
# Start MinIO (development — use systemd in production)
mkdir -p ~/minio-data
MINIO_ROOT_USER=minioadmin MINIO_ROOT_PASSWORD=minioadmin \
  minio server ~/minio-data --console-address ":9001"
# Install boto3 for S3 integration
pip install boto3
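The mlflow-artifacts bucket referenced in the next step must exist before MLflow writes to it. A short boto3 sketch that creates it against the local MinIO endpoint (using the development credentials above):

# Create the artifact bucket on the local MinIO instance (development values)
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO API port; 9001 is the console
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.create_bucket(Bucket="mlflow-artifacts")
print(s3.list_buckets()["Buckets"])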
Step 5: Start the MLflow Tracking Server
# MinIO endpoint and credentials, needed by any process that reads or writes artifacts
export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

mlflow server \
  --backend-store-uri postgresql://mlflow_user:your-secure-password@localhost:5432/mlflow_db \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5000
⚠️ Warning: In production, always place the MLflow server behind a reverse proxy (NGINX, Traefik) with TLS termination. The MLflow UI ships with no authentication enabled by default.
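To verify the server and artifact store are wired together, run a quick smoke test from a client machine (URI, credentials, and names mirror the development setup above):

# End-to-end smoke test of the tracking server; names are illustrative
import os
import mlflow

os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("smoke-test")

with mlflow.start_run() as run:
    mlflow.log_metric("ping", 1.0)
    mlflow.log_dict({"status": "ok"}, "healthcheck.json")  # lands in MinIO

print("run id:", run.info.run_id)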
Installing Kubeflow
Step 1: Set Up a Kubernetes Cluster
For local development, use kind (Kubernetes IN Docker):
# Install Docker
sudo apt install -y docker.io
sudo usermod -aG docker $USER
newgrp docker
# Install kind
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.24.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Create a cluster with sufficient resources
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
EOF
Step 2: Install kubectl and kustomize
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
# Install kustomize (version-pinned with checksum verification)
KUSTOMIZE_VERSION="v5.4.1"
curl -LO "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz"
curl -LO "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz.sha256"
# Verify checksum before installing
echo "$(cat kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz.sha256) kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz" | sha256sum --check
tar -xzf "kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz"
sudo mv kustomize /usr/local/bin/
Step 3: Deploy Kubeflow Using Manifests
The official Kubeflow manifests repository provides a kustomize-based installation:
# Clone the Kubeflow manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests
# Deploy the full Kubeflow platform
while ! kustomize build example | kubectl apply -f -; do
echo "Retrying to apply resources..."
sleep 20
done
⚠️ Warning: The full Kubeflow installation requires significant cluster resources (16 GB+ RAM recommended). For resource-constrained environments, consider installing individual components (e.g., Pipelines only).
Step 4: Verify Kubeflow Installation
# Check that all pods are running
kubectl get pods -n kubeflow --watch
# Port-forward the Kubeflow dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Access the dashboard at http://localhost:8080. Default credentials are user@example.com / 12341234.
⚠️ Warning: Change the default credentials immediately. In production, integrate with your organization’s identity provider (OIDC, LDAP) via Istio and Dex.
Enterprise Deployment Patterns
MLflow on Kubernetes
Deploy MLflow as a Kubernetes workload for high availability and scalability:
# mlflow-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
  namespace: mlops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow-tracking
  template:
    metadata:
      labels:
        app: mlflow-tracking
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_BACKEND_STORE_URI
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: backend-store-uri
        - name: MLFLOW_DEFAULT_ARTIFACT_ROOT
          value: "s3://mlflow-artifacts/"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-secret-key
        command: ["mlflow", "server",
                  "--backend-store-uri", "$(MLFLOW_BACKEND_STORE_URI)",
                  "--default-artifact-root", "$(MLFLOW_DEFAULT_ARTIFACT_ROOT)",
                  "--host", "0.0.0.0",
                  "--port", "5000"]
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking
  namespace: mlops
spec:
  selector:
    app: mlflow-tracking
  ports:
  - port: 5000
    targetPort: 5000
  type: ClusterIP
Kubeflow Full Platform Deployment
For production Kubeflow, use a managed Kubernetes service (EKS, GKE, AKS) with:
- Istio for service mesh, mTLS, and traffic management
- Cert-Manager for automatic TLS certificate provisioning
- External DNS for automatic DNS record management
- PersistentVolumeClaims for durable pipeline storage
Integrating MLflow with Kubeflow
A common enterprise pattern uses both platforms together:
┌──────────────────────────────────────────────────────┐
│                Enterprise ML Platform                │
│                                                      │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │  Kubeflow  │───▶│   MLflow   │───▶│   KServe   │  │
│  │  Pipelines │    │  Tracking  │    │  (Serving) │  │
│  │            │    │ + Registry │    │            │  │
│  │ Orchestrate│    │   Track +  │    │  Deploy +  │  │
│  │  training  │    │   version  │    │    scale   │  │
│  └──────┬─────┘    └────────────┘    └──────┬─────┘  │
│         │                                   │        │
│         ▼                                   ▼        │
│  ┌────────────┐                     ┌─────────────┐  │
│  │    GPU     │                     │ Monitoring  │  │
│  │  Cluster   │                     │ (Prometheus │  │
│  │            │                     │  + Grafana) │  │
│  └────────────┘                     └─────────────┘  │
└──────────────────────────────────────────────────────┘
In this pattern:
- Kubeflow Pipelines orchestrate the training workflow (data preprocessing → training → evaluation)
- MLflow tracks experiments and registers approved models
- KServe deploys registered models with autoscaling and canary rollouts
- Prometheus + Grafana monitor inference latency, throughput, and model drift
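The glue between the two platforms is usually a pipeline step that logs to MLflow from inside the Kubeflow run. A hedged sketch, assuming the kfp v2 SDK and the mlflow-tracking Service from the earlier manifest (experiment name and metric values are illustrative):

# KFP v2 component that reports results to MLflow; the tracking URI
# assumes the mlflow-tracking Service in the mlops namespace
from kfp import dsl

@dsl.component(packages_to_install=["mlflow"])
def train_and_log(lr: float) -> str:
    import mlflow

    mlflow.set_tracking_uri("http://mlflow-tracking.mlops:5000")
    mlflow.set_experiment("kubeflow-runs")
    with mlflow.start_run() as run:
        mlflow.log_param("learning_rate", lr)
        mlflow.log_metric("val_accuracy", 0.9)  # stand-in for a real score
        return run.info.run_id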
CI/CD Pipeline Integration
Use GitHub Actions or Argo Workflows to automate the ML lifecycle:
# .github/workflows/ml-pipeline.yaml
name: ML Training Pipeline
on:
  push:
    paths:
      - 'models/**'
      - 'pipelines/**'
jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install mlflow tensorflow
      - name: Run training
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python pipelines/train.py
      - name: Register model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python pipelines/register_model.py
      - name: Set cluster context
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}
      - name: Deploy to KServe
        run: kubectl apply -f deployments/inference-service.yaml
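The workflow above assumes two small scripts. A sketch of what pipelines/register_model.py might look like, assuming each training run logged a model under the "model" artifact path (experiment name, metric, and model name are illustrative):

# pipelines/register_model.py sketch: pick the best recent run and register it
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()  # uses MLFLOW_TRACKING_URI from the environment
experiment = client.get_experiment_by_name("churn-model")
best = client.search_runs(
    [experiment.experiment_id],
    order_by=["metrics.val_accuracy DESC"],
    max_results=1,
)[0]

version = mlflow.register_model(f"runs:/{best.info.run_id}/model", "churn-model")
print(f"Registered churn-model version {version.version}")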
Security Best Practices
Secrets Management
- Never hard-code database credentials, API keys, or cloud access keys in source code, container images, or plain-text configuration
- Use Kubernetes Secrets with encryption at rest enabled (EncryptionConfiguration)
- Integrate with HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault via the CSI Secrets Store Driver
# Enable encryption at rest for Kubernetes secrets
# Add to kube-apiserver configuration:
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
RBAC in Kubernetes
Apply the principle of least privilege to MLOps namespaces:
# mlops-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: mlops
  name: ml-engineer
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["kubeflow.org"]
  resources: ["tfjobs", "pytorchjobs", "experiments"]
  verbs: ["get", "list", "create", "delete"]
- apiGroups: ["serving.kserve.io"]
  resources: ["inferenceservices"]
  verbs: ["get", "list", "create", "update"]
Network Policies
Restrict traffic between MLOps components:
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mlflow-isolation
  namespace: mlops
spec:
  podSelector:
    matchLabels:
      app: mlflow-tracking
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          role: ml-training
    ports:
    - protocol: TCP
      port: 5000
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          role: storage
    ports:
    - protocol: TCP
      port: 5432
    - protocol: TCP
      port: 9000
Model Governance
- Sign model artifacts using Sigstore/Cosign to verify provenance
- Enforce approval gates in the MLflow Model Registry before promoting to Production (see the sketch after this list)
- Log all model transitions with audit trails (who promoted, when, from which experiment)
- Run adversarial robustness checks before deployment
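Approval gates can be scripted against the registry API. A sketch of an audited promotion using the classic stage-based registry (newer MLflow releases favor model aliases; model name, version, and tag values are illustrative):

# Audited stage transition; a sketch using the classic stage-based registry
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.set_model_version_tag("churn-model", "3", "approved_by", "jane.doe")
client.set_model_version_tag("churn-model", "3", "approval_ticket", "MLOPS-142")
client.transition_model_version_stage(
    name="churn-model",
    version="3",
    stage="Production",
    archive_existing_versions=True,  # demote the previous Production model
)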
Real-World Architecture: End-to-End Pipeline
A production enterprise ML pipeline typically follows this flow:
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Data   │───▶│ Feature  │───▶│ Training │───▶│  Model   │───▶│ Serving  │
│  Lake   │    │  Store   │    │ Pipeline │    │ Registry │    │ (KServe) │
│ (S3/GCS)│    │ (Feast)  │    │(Kubeflow)│    │ (MLflow) │    │          │
└─────────┘    └──────────┘    └────┬─────┘    └──────────┘    └────┬─────┘
                                    │                               │
                                    │    ┌──────────┐               │
                                    └───▶│Experiment│◀──────────────┘
                                         │ Tracking │
                                         │ (MLflow) │
                                         └────┬─────┘
                                              │
                                        ┌─────▼──────┐
                                        │ Monitoring │
                                        │(Prometheus │
                                        │ + Grafana) │
                                        └────────────┘
Stage 1 — Data: Raw data lands in the data lake. Validation runs (Great Expectations, TFX Data Validation) ensure schema compliance.
Stage 2 — Features: A feature store (Feast) provides consistent features for training and serving, preventing training-serving skew.
Stage 3 — Training: Kubeflow Pipelines orchestrate distributed training jobs across GPU nodes. Katib runs hyperparameter optimization.
Stage 4 — Tracking: MLflow logs all experiments — parameters, metrics, artifacts, and code versions. The Model Registry gates promotions.
Stage 5 — Serving: KServe deploys the approved model with autoscaling, canary rollouts, and A/B testing capabilities.
Stage 6 — Monitoring: Prometheus collects inference metrics. Grafana dashboards visualize latency, throughput, and model accuracy. Alertmanager triggers on drift detection.
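On the serving side, custom metrics can be exposed for Prometheus to scrape. A minimal sketch using the prometheus_client library (metric names, the port, and the dummy predict function are illustrative):

# Expose inference metrics for Prometheus scraping; names are illustrative
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

@LATENCY.time()
def predict(features):
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for model inference
    return 0.5

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at :8000/metrics
    while True:
        REQUESTS.inc()
        predict([1.0, 2.0])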
Top GitHub Projects
1. kubeflow/kubeflow
- Description: Machine Learning Toolkit for Kubernetes — the full platform for running ML workflows on K8s
- Stars: 14,500+
- Use case: End-to-end ML platform for teams that already run workloads on Kubernetes
- Link: github.com/kubeflow/kubeflow
2. mlflow/mlflow
- Description: Open source platform for the complete machine learning lifecycle — tracking, registry, serving
- Stars: 19,000+
- Use case: Experiment tracking and model versioning across any infrastructure (local, cloud, K8s)
- Link: github.com/mlflow/mlflow
3. kubeflow/pipelines
- Description: ML pipeline SDK and execution engine built on Argo Workflows for Kubernetes
- Stars: 3,600+
- Use case: Defining, deploying, and managing reproducible ML workflows as code
- Link: github.com/kubeflow/pipelines
4. SeldonIO/seldon-core
- Description: ML deployment platform for Kubernetes with advanced inference graphs, A/B testing, and explainability
- Stars: 4,300+
- Use case: Complex inference pipelines with multi-model routing and real-time monitoring
- Link: github.com/SeldonIO/seldon-core
5. kserve/kserve
- Description: Standardized serverless ML inference on Kubernetes — supports TensorFlow, PyTorch, ONNX, and more
- Stars: 3,500+
- Use case: Production model serving with autoscaling, canary rollouts, and multi-framework support
- Link: github.com/kserve/kserve