Overview of the MLOps Ecosystem
Machine Learning Operations (MLOps) bridges the gap between experimental ML notebooks and production systems that serve predictions at scale. As enterprises move from proof-of-concept models to revenue-critical inference workloads, the need for a repeatable, auditable, and secure ML lifecycle becomes essential.
The core MLOps lifecycle covers:
- Data ingestion and validation — Ensuring data quality before training begins
- Experiment tracking — Recording hyperparameters, metrics, and artifacts across runs
- Model training and tuning — Scaling training across GPUs and distributed clusters
- Model registry and governance — Versioning, approving, and auditing models
- Deployment and serving — Rolling models into production with canary or blue-green strategies
- Monitoring and observability — Detecting data drift, latency regressions, and accuracy degradation
Two platforms dominate the open-source MLOps landscape:
- MLflow — Focused on experiment tracking, model registry, and lightweight deployment
- Kubeflow — A full Kubernetes-native ML platform covering pipelines, hyperparameter tuning, and serving
Other tools complement or overlap with these:
- Apache Airflow — General-purpose workflow orchestration, often used for data pipelines that feed ML training jobs
- Metaflow — Originally developed at Netflix, focused on ergonomic Python-based ML workflows with built-in versioning
This guide focuses on MLflow and Kubeflow as the primary MLOps building blocks for enterprise teams.
Architecture Comparison
MLflow Architecture
MLflow is organized around four core components:
┌──────────────────────────────────────────────────────┐
│                   MLflow Platform                    │
│                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐  │
│  │   Tracking   │  │     Model    │  │  Projects  │  │
│  │    Server    │  │   Registry   │  │            │  │
│  │              │  │              │  │            │  │
│  │ • Experiments│  │ • Versioning │  │ • Repro-   │  │
│  │ • Runs       │  │ • Staging    │  │   ducible  │  │
│  │ • Metrics    │  │ • Production │  │   runs     │  │
│  │ • Params     │  │ • Archived   │  │            │  │
│  └───────┬──────┘  └───────┬──────┘  └────────────┘  │
│          │                 │                         │
│  ┌───────▼─────────────────▼────┐  ┌────────────┐  │
│  │        Backend Store         │  │    Model   │  │
│  │ (PostgreSQL / MySQL / SQLite)│  │   Serving  │  │
│  └───────────────┬──────────────┘  │            │  │
│                  │                 │ • REST API │  │
│  ┌───────────────▼──────────────┐  │ • Batch    │  │
│  │        Artifact Store        │  │ • Streaming│  │
│  │  (S3 / MinIO / GCS / ADLS)   │  └────────────┘  │
│  └──────────────────────────────┘                  │
└──────────────────────────────────────────────────────┘
- Tracking Server — Logs experiments, parameters, metrics, and artifacts via a REST API
- Model Registry — Manages model lifecycle stages (Staging → Production → Archived)
- Projects — Packages ML code for reproducible runs across environments
- Model Serving — Deploys registered models as REST endpoints
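In practice, these components map onto a small Python API surface. A minimal tracking sketch, assuming a server is reachable at http://localhost:5000 (the experiment name and logged values are illustrative):

# Minimal MLflow tracking sketch; assumes a running tracking server
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)        # hyperparameters
    mlflow.log_metric("val_accuracy", 0.93)        # evaluation metrics
    mlflow.log_dict({"features": ["age", "plan"]}, "schema.json")  # artifact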
Kubeflow Architecture
Kubeflow is a Kubernetes-native platform with multiple integrated components:
┌────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                   │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │                Kubeflow Platform                 │  │
│  │                                                  │  │
│  │  ┌──────────┐  ┌──────────┐  ┌────────────────┐  │  │
│  │  │ Kubeflow │  │   Katib  │  │    Notebooks   │  │  │
│  │  │ Pipelines│  │ (AutoML) │  │    (Jupyter)   │  │  │
│  │  │          │  │          │  │                │  │  │
│  │  │ • DAGs   │  │ • HPO    │  │ • Workspace    │  │  │
│  │  │ • Steps  │  │ • NAS    │  │ • GPU-enabled  │  │  │
│  │  │ • Caching│  │ • Trials │  │                │  │  │
│  │  └──────────┘  └──────────┘  └────────────────┘  │  │
│  │                                                  │  │
│  │  ┌──────────┐  ┌──────────┐  ┌────────────────┐  │  │
│  │  │ Training │  │  KServe  │  │     Central    │  │  │
│  │  │ Operator │  │ (Serving)│  │    Dashboard   │  │  │
│  │  │          │  │          │  │                │  │  │
│  │  │ • TFJob  │  │ • Canary │  │ • UI Portal    │  │  │
│  │  │ • PyTorch│  │ • A/B    │  │ • Namespace    │  │  │
│  │  │   Job    │  │ • Scale  │  │   mgmt         │  │  │
│  │  └──────────┘  └──────────┘  └────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  ┌────────────┐  ┌────────────┐  ┌────────────────┐    │
│  │    Istio   │  │   Knative  │  │  Cert-Manager  │    │
│  │  (Service  │  │ (Serverless│  │     (TLS)      │    │
│  │    Mesh)   │  │  Serving)  │  │                │    │
│  └────────────┘  └────────────┘  └────────────────┘    │
└────────────────────────────────────────────────────────┘
Key Kubeflow components:
- Pipelines — Define and execute multi-step ML workflows as DAGs
- Katib — Hyperparameter tuning and neural architecture search
- Training Operators — Kubernetes-native distributed training for TensorFlow, PyTorch, MPI, and XGBoost
- KServe — High-performance model serving with autoscaling, canary deployments, and explainability
- Notebooks — Managed Jupyter environments with GPU access
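These components are driven from Python through the KFP SDK. A minimal two-step pipeline sketch, assuming the kfp v2 SDK is installed (the component bodies are illustrative placeholders):

# Minimal Kubeflow Pipelines (KFP v2) sketch; component logic is illustrative
from kfp import dsl, compiler

@dsl.component
def preprocess(rows: int) -> int:
    # Stand-in for real data preparation
    return rows * 2

@dsl.component
def train(rows: int) -> str:
    return f"trained on {rows} rows"

@dsl.pipeline(name="demo-training-pipeline")
def pipeline(rows: int = 1000):
    prep = preprocess(rows=rows)
    train(rows=prep.output)

# Compile to an IR YAML that the Pipelines backend can execute
compiler.Compiler().compile(pipeline, "pipeline.yaml")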
Head-to-Head Comparison
| Aspect | MLflow | Kubeflow |
|---|---|---|
| Primary focus | Experiment tracking + model registry | Full ML platform on Kubernetes |
| Infrastructure | Runs anywhere (local, VM, K8s) | Kubernetes-native (requires cluster) |
| Learning curve | Low — pip install, immediate use | High — requires K8s expertise |
| Pipelines | Basic (MLflow Recipes) | Advanced (Argo-based DAGs) |
| Model serving | Built-in REST endpoint | KServe with autoscaling |
| Hyperparameter tuning | External (Optuna, Ray Tune) | Built-in (Katib) |
| Multi-tenancy | Limited (manual RBAC) | Native (namespace isolation) |
| Best for | Small-to-mid teams, rapid iteration | Large teams, regulated industries |
Installation on Ubuntu
Prerequisites
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install essential build tools
sudo apt install -y build-essential curl wget git unzip \
  python3 python3-pip python3-venv
Installing MLflow
Step 1: Create a Virtual Environment
python3 -m venv ~/mlflow-env
source ~/mlflow-env/bin/activate
Step 2: Install MLflow
pip install mlflow
Verify the installation:
mlflow --version
Step 3: Configure a PostgreSQL Backend Store
For production use, MLflow should persist experiment data in a relational database rather than local files.
# Install PostgreSQL
sudo apt install -y postgresql postgresql-contrib
# Create database and user
sudo -u postgres psql -c "CREATE DATABASE mlflow_db;"
sudo -u postgres psql -c "CREATE USER mlflow_user WITH ENCRYPTED PASSWORD 'your-secure-password';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow_user;"
# Install the PostgreSQL driver
pip install psycopg2-binary
⚠️ Warning: Replace the placeholder password above and never reuse default credentials in production. Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) to inject credentials at runtime.
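Before moving on, a quick connectivity check confirms the backend store is reachable (connection values mirror the setup above):

# Connectivity check for the backend store; password is the placeholder above
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="mlflow_db",
    user="mlflow_user",
    password="your-secure-password",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()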
Step 4: Configure an Artifact Store (MinIO / S3)
For storing model artifacts, use S3-compatible storage:
# Install MinIO (for local/on-prem S3-compatible storage)
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin/
# Start MinIO (development — use systemd in production)
mkdir -p ~/minio-data
MINIO_ROOT_USER=minioadmin MINIO_ROOT_PASSWORD=minioadmin \
  minio server ~/minio-data --console-address ":9001"
# Install boto3 for S3 integration
pip install boto3
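The mlflow-artifacts bucket referenced in the next step must exist before MLflow writes to it. A short boto3 sketch that creates it against the local MinIO endpoint (using the development credentials above):

# Create the artifact bucket on the local MinIO instance (development values)
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO API port; 9001 is the console
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.create_bucket(Bucket="mlflow-artifacts")
print(s3.list_buckets()["Buckets"])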
Step 5: Start the MLflow Tracking Server
# MinIO endpoint and credentials, needed by any process that reads or writes artifacts
export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

mlflow server \
  --backend-store-uri postgresql://mlflow_user:your-secure-password@localhost:5432/mlflow_db \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5000
⚠️ Warning: In production, always place the MLflow server behind a reverse proxy (NGINX, Traefik) with TLS termination. The MLflow UI ships with no authentication enabled by default.
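To verify the server and artifact store are wired together, run a quick smoke test from a client machine (URI, credentials, and names mirror the development setup above):

# End-to-end smoke test of the tracking server; names are illustrative
import os
import mlflow

os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("smoke-test")

with mlflow.start_run() as run:
    mlflow.log_metric("ping", 1.0)
    mlflow.log_dict({"status": "ok"}, "healthcheck.json")  # lands in MinIO

print("run id:", run.info.run_id)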
Installing Kubeflow
Step 1: Set Up a Kubernetes Cluster
For local development, use kind (Kubernetes IN Docker):
# Install Docker
sudo apt install -y docker.io
sudo usermod -aG docker $USER
newgrp docker
# Install kind
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.24.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Create a cluster with sufficient resources
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
EOF
Step 2: Install kubectl and kustomize
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
# Install kustomize (version-pinned with checksum verification)
KUSTOMIZE_VERSION="v5.4.1"
curl -LO "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz"
curl -LO "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz.sha256"
# Verify checksum before installing
echo "$(cat kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz.sha256) kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz" | sha256sum --check
tar -xzf "kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz"
sudo mv kustomize /usr/local/bin/
Step 3: Deploy Kubeflow Using Manifests
The official Kubeflow manifests repository provides a kustomize-based installation:
# Clone the Kubeflow manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests
# Deploy the full Kubeflow platform
while ! kustomize build example | kubectl apply -f -; do
echo "Retrying to apply resources..."
sleep 20
done
⚠️ Warning: The full Kubeflow installation requires significant cluster resources (16 GB+ RAM recommended). For resource-constrained environments, consider installing individual components (e.g., Pipelines only).
Step 4: Verify Kubeflow Installation
# Check that all pods are running
kubectl get pods -n kubeflow --watch
# Port-forward the Kubeflow dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Access the dashboard at http://localhost:8080. Default credentials are user@example.com / 12341234.
⚠️ Warning: Change the default credentials immediately. In production, integrate with your organization’s identity provider (OIDC, LDAP) via Istio and Dex.
Enterprise Deployment Patterns
MLflow on Kubernetes
Deploy MLflow as a Kubernetes workload for high availability and scalability:
# mlflow-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
  namespace: mlops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow-tracking
  template:
    metadata:
      labels:
        app: mlflow-tracking
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_BACKEND_STORE_URI
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: backend-store-uri
        - name: MLFLOW_DEFAULT_ARTIFACT_ROOT
          value: "s3://mlflow-artifacts/"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-secret-key
        command: ["mlflow", "server",
                  "--backend-store-uri", "$(MLFLOW_BACKEND_STORE_URI)",
                  "--default-artifact-root", "$(MLFLOW_DEFAULT_ARTIFACT_ROOT)",
                  "--host", "0.0.0.0",
                  "--port", "5000"]
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking
  namespace: mlops
spec:
  selector:
    app: mlflow-tracking
  ports:
  - port: 5000
    targetPort: 5000
  type: ClusterIP
Kubeflow Full Platform Deployment
For production Kubeflow, use a managed Kubernetes service (EKS, GKE, AKS) with:
- Istio for service mesh, mTLS, and traffic management
- Cert-Manager for automatic TLS certificate provisioning
- External DNS for automatic DNS record management
- PersistentVolumeClaims for durable pipeline storage
Integrating MLflow with Kubeflow
A common enterprise pattern uses both platforms together:
┌──────────────────────────────────────────────────────┐
│                Enterprise ML Platform                │
│                                                      │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │  Kubeflow  │───▶│   MLflow   │───▶│   KServe   │  │
│  │  Pipelines │    │  Tracking  │    │  (Serving) │  │
│  │            │    │ + Registry │    │            │  │
│  │ Orchestrate│    │   Track +  │    │  Deploy +  │  │
│  │  training  │    │   version  │    │    scale   │  │
│  └──────┬─────┘    └────────────┘    └──────┬─────┘  │
│         │                                   │        │
│         ▼                                   ▼        │
│  ┌────────────┐                     ┌─────────────┐  │
│  │    GPU     │                     │ Monitoring  │  │
│  │  Cluster   │                     │ (Prometheus │  │
│  │            │                     │  + Grafana) │  │
│  └────────────┘                     └─────────────┘  │
└──────────────────────────────────────────────────────┘
In this pattern:
- Kubeflow Pipelines orchestrate the training workflow (data preprocessing → training → evaluation)
- MLflow tracks experiments and registers approved models
- KServe deploys registered models with autoscaling and canary rollouts
- Prometheus + Grafana monitor inference latency, throughput, and model drift
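The glue between the two platforms is usually a pipeline step that logs to MLflow from inside the Kubeflow run. A hedged sketch, assuming the kfp v2 SDK and the mlflow-tracking Service from the earlier manifest (experiment name and metric values are illustrative):

# KFP v2 component that reports results to MLflow; the tracking URI
# assumes the mlflow-tracking Service in the mlops namespace
from kfp import dsl

@dsl.component(packages_to_install=["mlflow"])
def train_and_log(lr: float) -> str:
    import mlflow

    mlflow.set_tracking_uri("http://mlflow-tracking.mlops:5000")
    mlflow.set_experiment("kubeflow-runs")
    with mlflow.start_run() as run:
        mlflow.log_param("learning_rate", lr)
        mlflow.log_metric("val_accuracy", 0.9)  # stand-in for a real score
        return run.info.run_id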
CI/CD Pipeline Integration
Use GitHub Actions or Argo Workflows to automate the ML lifecycle:
# .github/workflows/ml-pipeline.yaml
name: ML Training Pipeline
on:
  push:
    paths:
      - 'models/**'
      - 'pipelines/**'
jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install mlflow tensorflow
      - name: Run training
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python pipelines/train.py
      - name: Register model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python pipelines/register_model.py
      - name: Set cluster context
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}
      - name: Deploy to KServe
        run: kubectl apply -f deployments/inference-service.yaml
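The workflow above assumes two small scripts. A sketch of what pipelines/register_model.py might look like, assuming each training run logged a model under the "model" artifact path (experiment name, metric, and model name are illustrative):

# pipelines/register_model.py sketch: pick the best recent run and register it
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()  # uses MLFLOW_TRACKING_URI from the environment
experiment = client.get_experiment_by_name("churn-model")
best = client.search_runs(
    [experiment.experiment_id],
    order_by=["metrics.val_accuracy DESC"],
    max_results=1,
)[0]

version = mlflow.register_model(f"runs:/{best.info.run_id}/model", "churn-model")
print(f"Registered churn-model version {version.version}")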
Security Best Practices
Secrets Management
- Never hard-code database credentials, API keys, or cloud access keys in source code, container images, or plain-text configuration
- Use Kubernetes Secrets with encryption at rest enabled (EncryptionConfiguration)
- Integrate with HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault via the CSI Secrets Store Driver
# Enable encryption at rest for Kubernetes secrets
# Add to kube-apiserver configuration:
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
RBAC in Kubernetes
Apply the principle of least privilege to MLOps namespaces:
# mlops-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: mlops
  name: ml-engineer
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["kubeflow.org"]
  resources: ["tfjobs", "pytorchjobs", "experiments"]
  verbs: ["get", "list", "create", "delete"]
- apiGroups: ["serving.kserve.io"]
  resources: ["inferenceservices"]
  verbs: ["get", "list", "create", "update"]
Network Policies
Restrict traffic between MLOps components:
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mlflow-isolation
  namespace: mlops
spec:
  podSelector:
    matchLabels:
      app: mlflow-tracking
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          role: ml-training
    ports:
    - protocol: TCP
      port: 5000
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          role: storage
    ports:
    - protocol: TCP
      port: 5432
    - protocol: TCP
      port: 9000
Model Governance
- Sign model artifacts using Sigstore/Cosign to verify provenance
- Enforce approval gates in the MLflow Model Registry before promoting to Production (see the sketch after this list)
- Log all model transitions with audit trails (who promoted, when, from which experiment)
- Run adversarial robustness checks before deployment
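Approval gates can be scripted against the registry API. A sketch of an audited promotion using the classic stage-based registry (newer MLflow releases favor model aliases; model name, version, and tag values are illustrative):

# Audited stage transition; a sketch using the classic stage-based registry
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.set_model_version_tag("churn-model", "3", "approved_by", "jane.doe")
client.set_model_version_tag("churn-model", "3", "approval_ticket", "MLOPS-142")
client.transition_model_version_stage(
    name="churn-model",
    version="3",
    stage="Production",
    archive_existing_versions=True,  # demote the previous Production model
)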
Real-World Architecture: End-to-End Pipeline
A production enterprise ML pipeline typically follows this flow:
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Data   │───▶│ Feature  │───▶│ Training │───▶│  Model   │───▶│ Serving  │
│  Lake   │    │  Store   │    │ Pipeline │    │ Registry │    │ (KServe) │
│ (S3/GCS)│    │ (Feast)  │    │(Kubeflow)│    │ (MLflow) │    │          │
└─────────┘    └──────────┘    └────┬─────┘    └──────────┘    └────┬─────┘
                                    │                               │
                                    │    ┌──────────┐               │
                                    └───▶│Experiment│◀──────────────┘
                                         │ Tracking │
                                         │ (MLflow) │
                                         └────┬─────┘
                                              │
                                        ┌─────▼──────┐
                                        │ Monitoring │
                                        │(Prometheus │
                                        │ + Grafana) │
                                        └────────────┘
Stage 1 — Data: Raw data lands in the data lake. Validation runs (Great Expectations, TFX Data Validation) ensure schema compliance.
Stage 2 — Features: A feature store (Feast) provides consistent features for training and serving, preventing training-serving skew.
Stage 3 — Training: Kubeflow Pipelines orchestrate distributed training jobs across GPU nodes. Katib runs hyperparameter optimization.
Stage 4 — Tracking: MLflow logs all experiments — parameters, metrics, artifacts, and code versions. The Model Registry gates promotions.
Stage 5 — Serving: KServe deploys the approved model with autoscaling, canary rollouts, and A/B testing capabilities.
Stage 6 — Monitoring: Prometheus collects inference metrics. Grafana dashboards visualize latency, throughput, and model accuracy. Alertmanager triggers on drift detection.
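On the serving side, custom metrics can be exposed for Prometheus to scrape. A minimal sketch using the prometheus_client library (metric names, the port, and the dummy predict function are illustrative):

# Expose inference metrics for Prometheus scraping; names are illustrative
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

@LATENCY.time()
def predict(features):
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for model inference
    return 0.5

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at :8000/metrics
    while True:
        REQUESTS.inc()
        predict([1.0, 2.0])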
Top GitHub Projects
1. kubeflow/kubeflow
- Description: Machine Learning Toolkit for Kubernetes — the full platform for running ML workflows on K8s
- Stars: 14,500+
- Use case: End-to-end ML platform for teams that already run workloads on Kubernetes
- Link: github.com/kubeflow/kubeflow
2. mlflow/mlflow
- Description: Open source platform for the complete machine learning lifecycle — tracking, registry, serving
- Stars: 19,000+
- Use case: Experiment tracking and model versioning across any infrastructure (local, cloud, K8s)
- Link: github.com/mlflow/mlflow
3. kubeflow/pipelines
- Description: ML pipeline SDK and execution engine built on Argo Workflows for Kubernetes
- Stars: 3,600+
- Use case: Defining, deploying, and managing reproducible ML workflows as code
- Link: github.com/kubeflow/pipelines
4. SeldonIO/seldon-core
- Description: ML deployment platform for Kubernetes with advanced inference graphs, A/B testing, and explainability
- Stars: 4,300+
- Use case: Complex inference pipelines with multi-model routing and real-time monitoring
- Link: github.com/SeldonIO/seldon-core
5. kserve/kserve
- Description: Standardized serverless ML inference on Kubernetes — supports TensorFlow, PyTorch, ONNX, and more
- Stars: 3,500+
- Use case: Production model serving with autoscaling, canary rollouts, and multi-framework support
- Link: github.com/kserve/kserve