yisusvii Blog

MLflow vs Kubeflow (and Modern MLOps Tools): Enterprise Installation and Architecture Guide



Overview of the MLOps Ecosystem

Machine Learning Operations (MLOps) bridges the gap between experimental ML notebooks and production systems that serve predictions at scale. As enterprises move from proof-of-concept models to revenue-critical inference workloads, the need for a repeatable, auditable, and secure ML lifecycle becomes essential.

The core MLOps lifecycle covers:

  1. Data ingestion and validation — Ensuring data quality before training begins
  2. Experiment tracking — Recording hyperparameters, metrics, and artifacts across runs
  3. Model training and tuning — Scaling training across GPUs and distributed clusters
  4. Model registry and governance — Versioning, approving, and auditing models
  5. Deployment and serving — Rolling models into production with canary or blue-green strategies
  6. Monitoring and observability — Detecting data drift, latency regressions, and accuracy degradation
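The stages compose naturally: each consumes the previous stage's output. A toy, stdlib-only sketch of the chain (all function names are hypothetical stand-ins, not a real framework):

```python
from typing import Callable

# Hypothetical stand-ins for the six lifecycle stages.
def ingest() -> list[float]:
    return [1.0, 2.0, 3.0]                      # 1. data ingestion

def validate(data: list[float]) -> list[float]:
    assert data and all(x == x for x in data)   # 2. reject empty/NaN data
    return data

def train(data: list[float]) -> float:
    return sum(data) / len(data)                # 3. "model" = the mean

def register(model: float) -> dict:
    return {"version": 1, "model": model}       # 4. versioned registry entry

def deploy(entry: dict) -> Callable[[float], float]:
    return lambda x: x * entry["model"]         # 5. serving endpoint

def monitor(predict: Callable[[float], float]) -> float:
    return predict(1.0)                         # 6. probe the endpoint

predict = deploy(register(train(validate(ingest()))))
print(monitor(predict))  # → 2.0
```

The point is the shape, not the math: every stage boundary in this chain is a place where a real platform adds tracking, versioning, or monitoring.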

Two platforms dominate the open-source MLOps landscape: MLflow, a lightweight, framework-agnostic toolkit for experiment tracking and model management, and Kubeflow, a Kubernetes-native platform covering the full ML lifecycle.

Other tools complement or overlap with these: KServe and Seldon Core for model serving, Feast for feature storage, and Optuna or Ray Tune for hyperparameter tuning.

This guide focuses on MLflow and Kubeflow as the primary MLOps building blocks for enterprise teams.


Architecture Comparison

MLflow Architecture

MLflow is organized around four core components:

┌─────────────────────────────────────────────────────┐
│                    MLflow Platform                   │
│                                                     │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────┐ │
│  │  Tracking   │  │   Model      │  │  Projects  │ │
│  │  Server     │  │   Registry   │  │            │ │
│  │             │  │              │  │            │ │
│  │• Experiments│  │ • Versioning │  │ • Repro-   │ │
│  │ • Runs      │  │ • Staging    │  │   ducible  │ │
│  │ • Metrics   │  │ • Production │  │   runs     │ │
│  │ • Params    │  │ • Archived   │  │            │ │
│  └──────┬──────┘  └──────┬───────┘  └────────────┘ │
│         │                │                          │
│  ┌──────▼────────────────▼───────┐  ┌────────────┐ │
│  │       Backend Store           │  │  Model     │ │
│  │  (PostgreSQL / MySQL / SQLite)│  │  Serving   │ │
│  └──────────────┬────────────────┘  │            │ │
│                 │                    │ • REST API │ │
│  ┌──────────────▼────────────────┐  │ • Batch    │ │
│  │       Artifact Store          │  │ • Streaming│ │
│  │   (S3 / MinIO / GCS / ADLS)   │  └────────────┘ │
│  └───────────────────────────────┘                  │
└─────────────────────────────────────────────────────┘

Kubeflow Architecture

Kubeflow is a Kubernetes-native platform with multiple integrated components:

┌──────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                     │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │              Kubeflow Platform                    │    │
│  │                                                   │    │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────────┐   │    │
│  │  │ Kubeflow │ │  Katib   │ │   Notebooks    │   │    │
│  │  │ Pipelines│ │ (AutoML) │ │   (Jupyter)    │   │    │
│  │  │          │ │          │ │                │   │    │
│  │  │ • DAGs   │ │ • HPO    │ │ • Workspace    │   │    │
│  │  │ • Steps  │ │ • NAS    │ │ • GPU-enabled  │   │    │
│  │  │ • Caching│ │ • Trials │ │                │   │    │
│  │  └──────────┘ └──────────┘ └────────────────┘   │    │
│  │                                                   │    │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────────┐   │    │
│  │  │ Training │ │  KServe  │ │   Central      │   │    │
│  │  │ Operator │ │ (Serving)│ │   Dashboard    │   │    │
│  │  │          │ │          │ │                │   │    │
│  │  │ • TFJob  │ │ • Canary │ │ • UI Portal    │   │    │
│  │  │ • PyTorch│ │ • A/B    │ │ • Namespace    │   │    │
│  │  │   Job    │ │ • Scale  │ │   mgmt         │   │    │
│  │  └──────────┘ └──────────┘ └────────────────┘   │    │
│  └──────────────────────────────────────────────────┘    │
│                                                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────────┐     │
│  │   Istio    │  │   Knative  │  │   Cert-Manager │     │
│  │  (Service  │  │ (Serverless│  │   (TLS)        │     │
│  │   Mesh)    │  │   Serving) │  │                │     │
│  └────────────┘  └────────────┘  └────────────────┘     │
└──────────────────────────────────────────────────────────┘

Key Kubeflow components:

  - Kubeflow Pipelines: Argo-based DAG orchestration with step caching
  - Katib: AutoML via hyperparameter optimization (HPO) and neural architecture search (NAS)
  - Notebooks: managed, GPU-enabled Jupyter workspaces
  - Training Operator: CRDs for distributed training jobs (TFJob, PyTorchJob)
  - KServe: model serving with canary rollouts, A/B testing, and autoscaling
  - Central Dashboard: the central UI portal and namespace management

Head-to-Head Comparison

| Aspect | MLflow | Kubeflow |
|---|---|---|
| Primary focus | Experiment tracking + model registry | Full ML platform on Kubernetes |
| Infrastructure | Runs anywhere (local, VM, K8s) | Kubernetes-native (requires cluster) |
| Learning curve | Low — pip install, immediate use | High — requires K8s expertise |
| Pipelines | Basic (MLflow Recipes) | Advanced (Argo-based DAGs) |
| Model serving | Built-in REST endpoint | KServe with autoscaling |
| Hyperparameter tuning | External (Optuna, Ray Tune) | Built-in (Katib) |
| Multi-tenancy | Limited (manual RBAC) | Native (namespace isolation) |
| Best for | Small-to-mid teams, rapid iteration | Large teams, regulated industries |

Installation on Ubuntu

Prerequisites

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential build tools
sudo apt install -y build-essential curl wget git unzip \
  python3 python3-pip python3-venv

Installing MLflow

Step 1: Create a Virtual Environment

python3 -m venv ~/mlflow-env
source ~/mlflow-env/bin/activate

Step 2: Install MLflow

pip install mlflow

Verify the installation:

mlflow --version

Step 3: Configure a PostgreSQL Backend Store

For production use, MLflow should persist experiment data in a relational database rather than local files.

# Install PostgreSQL
sudo apt install -y postgresql postgresql-contrib

# Create database and user
sudo -u postgres psql -c "CREATE DATABASE mlflow_db;"
sudo -u postgres psql -c "CREATE USER mlflow_user WITH ENCRYPTED PASSWORD 'your-secure-password';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow_user;"

# Install the PostgreSQL driver
pip install psycopg2-binary

⚠️ Warning: Never hard-code database credentials in scripts or configuration committed to version control. Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) to inject credentials at runtime.
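In that spirit, the backend-store URI can be assembled at runtime from injected environment variables instead of being hard-coded; a small sketch (the MLFLOW_DB_* variable names are an assumption, not an MLflow convention):

```python
import os
from urllib.parse import quote_plus

def backend_store_uri() -> str:
    """Build the MLflow backend-store URI from environment variables,
    URL-encoding the password so special characters survive."""
    user = os.environ["MLFLOW_DB_USER"]
    password = quote_plus(os.environ["MLFLOW_DB_PASSWORD"])
    host = os.environ.get("MLFLOW_DB_HOST", "localhost")
    db = os.environ.get("MLFLOW_DB_NAME", "mlflow_db")
    return f"postgresql://{user}:{password}@{host}:5432/{db}"

# Demo values only — in production these come from the secrets manager.
os.environ.update({"MLFLOW_DB_USER": "mlflow_user",
                   "MLFLOW_DB_PASSWORD": "p@ss:word/1"})
print(backend_store_uri())
# → postgresql://mlflow_user:p%40ss%3Aword%2F1@localhost:5432/mlflow_db
```

URL-encoding matters: a password containing `@`, `:`, or `/` would otherwise break URI parsing.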

Step 4: Configure an Artifact Store (MinIO / S3)

For storing model artifacts, use S3-compatible storage:

# Install MinIO (for local/on-prem S3-compatible storage)
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin/

# Start MinIO (development — use systemd in production)
mkdir -p ~/minio-data
MINIO_ROOT_USER=minioadmin MINIO_ROOT_PASSWORD=minioadmin \
  minio server ~/minio-data --console-address ":9001"

# Install boto3 for S3 integration
pip install boto3
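On the client side, MLflow's boto3-based artifact client picks up its endpoint and credentials from the environment; for the development MinIO instance above (assuming a pre-created mlflow-artifacts bucket):

```python
import os

# MLFLOW_S3_ENDPOINT_URL redirects MLflow's S3 artifact client to MinIO
# instead of AWS S3; the credentials match the dev MinIO started above.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
```

Set the same three variables in the environment of the tracking server process so server-side artifact operations reach MinIO as well.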

Step 5: Start the MLflow Tracking Server

mlflow server \
  --backend-store-uri postgresql://mlflow_user:your-secure-password@localhost:5432/mlflow_db \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5000

⚠️ Warning: In production, always place the MLflow server behind a reverse proxy (NGINX, Traefik) with TLS termination. The MLflow UI ships with no authentication enabled by default.

Installing Kubeflow

Step 1: Set Up a Kubernetes Cluster

For local development, use kind (Kubernetes IN Docker):

# Install Docker
sudo apt install -y docker.io
sudo usermod -aG docker $USER
newgrp docker

# Install kind
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.24.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Create a cluster with sufficient resources
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
EOF

Step 2: Install kubectl and kustomize

# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

# Install kustomize (version-pinned with checksum verification)
KUSTOMIZE_VERSION="v5.4.1"
curl -LO "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz"
curl -LO "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${KUSTOMIZE_VERSION}/kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz.sha256"

# Verify checksum before installing
echo "$(cat kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz.sha256)  kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz" | sha256sum --check

tar -xzf "kustomize_${KUSTOMIZE_VERSION}_linux_amd64.tar.gz"
sudo mv kustomize /usr/local/bin/

Step 3: Deploy Kubeflow Using Manifests

The official Kubeflow manifests repository provides a kustomize-based installation:

# Clone the Kubeflow manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Deploy the full Kubeflow platform
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources..."
  sleep 20
done

⚠️ Warning: The full Kubeflow installation requires significant cluster resources (16 GB+ RAM recommended). For resource-constrained environments, consider installing individual components (e.g., Pipelines only).

Step 4: Verify Kubeflow Installation

# Check that all pods are running
kubectl get pods -n kubeflow --watch

# Port-forward the Kubeflow dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Access the dashboard at http://localhost:8080. Default credentials are user@example.com / 12341234.

⚠️ Warning: Change the default credentials immediately. In production, integrate with your organization’s identity provider (OIDC, LDAP) via Istio and Dex.


Enterprise Deployment Patterns

MLflow on Kubernetes

Deploy MLflow as a Kubernetes workload for high availability and scalability:

# mlflow-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
  namespace: mlops
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow-tracking
  template:
    metadata:
      labels:
        app: mlflow-tracking
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest  # pin a specific version tag in production
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_BACKEND_STORE_URI
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: backend-store-uri
        - name: MLFLOW_DEFAULT_ARTIFACT_ROOT
          value: "s3://mlflow-artifacts/"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-secret-key
        command: ["mlflow", "server",
          "--backend-store-uri", "$(MLFLOW_BACKEND_STORE_URI)",
          "--default-artifact-root", "$(MLFLOW_DEFAULT_ARTIFACT_ROOT)",
          "--host", "0.0.0.0",
          "--port", "5000"]
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking
  namespace: mlops
spec:
  selector:
    app: mlflow-tracking
  ports:
  - port: 5000
    targetPort: 5000
  type: ClusterIP
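The Deployment reads its credentials from a mlflow-secrets Secret. One way to provision it (placeholder values; the postgres.mlops hostname is an assumption about your database Service; in production, sync the Secret from a secrets manager instead of creating it by hand):

```shell
# Create the namespace and the Secret referenced by the Deployment above.
kubectl create namespace mlops

kubectl -n mlops create secret generic mlflow-secrets \
  --from-literal=backend-store-uri='postgresql://mlflow_user:your-secure-password@postgres.mlops:5432/mlflow_db' \
  --from-literal=aws-access-key='minioadmin' \
  --from-literal=aws-secret-key='minioadmin'
```

The key names (backend-store-uri, aws-access-key, aws-secret-key) must match the secretKeyRef entries in the Deployment manifest exactly.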

Kubeflow Full Platform Deployment

For production Kubeflow, use a managed Kubernetes service (EKS, GKE, AKS) with GPU node pools for training workloads, identity-provider integration (OIDC via Dex), encrypted secrets, and cluster autoscaling.

Integrating MLflow with Kubeflow

A common enterprise pattern uses both platforms together:

┌────────────────────────────────────────────────────┐
│               Enterprise ML Platform               │
│                                                    │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐  │
│  │  Kubeflow  │──▶│   MLflow   │──▶│   KServe   │  │
│  │  Pipelines │   │ Tracking + │   │  (Serving) │  │
│  │            │   │  Registry  │   │            │  │
│  │ Orchestrate│   │  Track +   │   │  Deploy +  │  │
│  │  training  │   │  version   │   │   scale    │  │
│  └─────┬──────┘   └────────────┘   └─────┬──────┘  │
│        │                                 │         │
│        ▼                                 ▼         │
│  ┌────────────┐                    ┌────────────┐  │
│  │    GPU     │                    │ Monitoring │  │
│  │  Cluster   │                    │(Prometheus │  │
│  │            │                    │ + Grafana) │  │
│  └────────────┘                    └────────────┘  │
└────────────────────────────────────────────────────┘

In this pattern:

  1. Kubeflow Pipelines orchestrate the training workflow (data preprocessing → training → evaluation)
  2. MLflow tracks experiments and registers approved models
  3. KServe deploys registered models with autoscaling and canary rollouts
  4. Prometheus + Grafana monitor inference latency, throughput, and model drift

CI/CD Pipeline Integration

Use GitHub Actions or Argo Workflows to automate the ML lifecycle:

# .github/workflows/ml-pipeline.yaml
name: ML Training Pipeline

on:
  push:
    paths:
      - 'models/**'
      - 'pipelines/**'

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: pip install mlflow tensorflow

    - name: Run training
      env:
        MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
      run: python pipelines/train.py

    - name: Register model
      env:
        MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
      run: python pipelines/register_model.py

    - name: Deploy to KServe
      uses: azure/k8s-set-context@v4
      with:
        kubeconfig: ${{ secrets.KUBECONFIG }}
    - run: kubectl apply -f deployments/inference-service.yaml

Security Best Practices

Secrets Management

# Enable encryption at rest for Kubernetes secrets
# Add to kube-apiserver configuration:
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
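The referenced encryption-config.yaml follows the standard Kubernetes EncryptionConfiguration format; a sketch (the AES key is a placeholder; generate one with `head -c 32 /dev/urandom | base64`):

```yaml
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}   # fallback so existing unencrypted secrets stay readable
```

After enabling it, rewrite existing secrets so they are stored encrypted: `kubectl get secrets --all-namespaces -o json | kubectl replace -f -`.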

RBAC in Kubernetes

Apply the principle of least privilege to MLOps namespaces:

# mlops-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: mlops
  name: ml-engineer
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["kubeflow.org"]
  resources: ["tfjobs", "pytorchjobs", "experiments"]
  verbs: ["get", "list", "create", "delete"]
- apiGroups: ["serving.kserve.io"]
  resources: ["inferenceservices"]
  verbs: ["get", "list", "create", "update"]

Network Policies

Restrict traffic between MLOps components:

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mlflow-isolation
  namespace: mlops
spec:
  podSelector:
    matchLabels:
      app: mlflow-tracking
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          role: ml-training
    ports:
    - protocol: TCP
      port: 5000
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          role: storage
    ports:
    - protocol: TCP
      port: 5432
    - protocol: TCP
      port: 9000

Model Governance

Treat the model registry as the governance control point: every registered version should be traceable to the run, code revision, and data that produced it; stage transitions (Staging → Production → Archived) should require an explicit approval step; and the transition history then serves as the audit trail that regulated industries require.

Real-World Architecture: End-to-End Pipeline

A production enterprise ML pipeline typically follows this flow:

┌─────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Data   │───▶│  Feature │───▶│ Training │───▶│  Model   │───▶│ Serving  │
│  Lake   │    │  Store   │    │ Pipeline │    │ Registry │    │ (KServe) │
│ (S3/GCS)│    │ (Feast)  │    │(Kubeflow)│    │ (MLflow) │    │          │
└─────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
                                    │                               │
                                    │         ┌──────────┐          │
                                    └────────▶│Experiment│◀─────────┘
                                              │ Tracking │
                                              │ (MLflow) │
                                              └──────────┘

                                              ┌───────────┐
                                              │ Monitoring│
                                              │(Prometheus│
                                              │+ Grafana) │
                                              └───────────┘

Stage 1 — Data: Raw data lands in the data lake. Validation runs (Great Expectations, TFX Data Validation) ensure schema compliance.

Stage 2 — Features: A feature store (Feast) provides consistent features for training and serving, preventing training-serving skew.

Stage 3 — Training: Kubeflow Pipelines orchestrate distributed training jobs across GPU nodes. Katib runs hyperparameter optimization.

Stage 4 — Tracking: MLflow logs all experiments — parameters, metrics, artifacts, and code versions. The Model Registry gates promotions.

Stage 5 — Serving: KServe deploys the approved model with autoscaling, canary rollouts, and A/B testing capabilities.

Stage 6 — Monitoring: Prometheus collects inference metrics. Grafana dashboards visualize latency, throughput, and model accuracy. Alertmanager triggers on drift detection.
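Drift detection in Stage 6 commonly compares binned feature distributions between a training-time baseline and live traffic; a stdlib sketch of the Population Stability Index (PSI):

```python
import math

def population_stability_index(expected, actual):
    """PSI between two binned distributions (bin fractions summing to 1).
    Common rule of thumb: < 0.1 stable, 0.1–0.25 moderate shift,
    > 0.25 significant drift worth an alert."""
    eps = 1e-6  # guard against empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]      # training-time bin fractions
live = [0.10, 0.20, 0.30, 0.40]          # production bin fractions
print(round(population_stability_index(baseline, live), 3))
```

A monitoring job can export this value as a Prometheus gauge per feature and let Alertmanager fire when it crosses the drift threshold.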


Top GitHub Projects

1. kubeflow/kubeflow: the core Kubeflow platform (central dashboard, notebooks, multi-user namespaces)

2. mlflow/mlflow: experiment tracking, model packaging, and the model registry

3. kubeflow/pipelines: Argo-based workflow orchestration for ML pipelines

4. SeldonIO/seldon-core: production model serving and inference graphs on Kubernetes

5. kserve/kserve: serverless model inference on Kubernetes (formerly KFServing)

