yisusvii Blog

Python, TensorFlow, and PyTorch: Enterprise AI Stack Setup and Best Practices


Overview

Python is the undisputed backbone of AI and machine learning development. Its ecosystem of scientific computing libraries (NumPy, pandas, scikit-learn), visualization tools (Matplotlib, Plotly), and deep learning frameworks (TensorFlow, PyTorch) makes it the default language for everything from research prototypes to production inference systems.

Two deep learning frameworks dominate the enterprise landscape: TensorFlow, developed by Google, and PyTorch, developed by Meta.

Enterprise adoption patterns: TensorFlow remains entrenched in mature production pipelines, mobile/edge deployments, and GCP environments, while PyTorch dominates new research and is growing rapidly in enterprise use.


Installation on Ubuntu

Prerequisites

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential build tools and dependencies
sudo apt install -y build-essential curl wget git \
  libssl-dev zlib1g-dev libbz2-dev libreadline-dev \
  libsqlite3-dev libncursesw5-dev xz-utils tk-dev \
  libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

Python Setup with pyenv

Using pyenv ensures consistent Python versions across development and production environments:

# Install pyenv (download, inspect, then run)
curl -fsSL https://pyenv.run -o /tmp/pyenv-installer.sh
# (Optional) Inspect the installer before running
less /tmp/pyenv-installer.sh
bash /tmp/pyenv-installer.sh

# Add pyenv to shell configuration
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# Install Python 3.11 (recommended for TensorFlow/PyTorch compatibility)
pyenv install 3.11.9
pyenv global 3.11.9

# Verify installation
python --version

⚠️ Warning: Avoid using the system Python for ML workloads. System Python is managed by apt and modifying it can break OS utilities. Always use pyenv or virtual environments.

Virtual Environments

Always isolate project dependencies:

# Create a project-specific virtual environment
python -m venv ~/ai-stack-env
source ~/ai-stack-env/bin/activate

# Upgrade pip
pip install --upgrade pip setuptools wheel

GPU Setup (CUDA)

For GPU-accelerated training, install NVIDIA drivers and CUDA:

# Check for NVIDIA GPU
lspci | grep -i nvidia

# Install NVIDIA drivers (Ubuntu)
sudo apt install -y nvidia-driver-550

# Verify driver installation
nvidia-smi

# Install CUDA Toolkit
# IMPORTANT: Check the official compatibility matrices BEFORE choosing a version.
# TensorFlow and PyTorch each support specific CUDA/cuDNN combinations.
# See the links below for the exact versions tested with your target framework.
#
# Example for CUDA 12.5 (tested with TF 2.18+ and PyTorch 2.5+):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-5

# Install cuDNN (match your CUDA version)
sudo apt install -y libcudnn9-cuda-12

⚠️ Warning: Always verify the CUDA/cuDNN/TensorFlow/PyTorch compatibility matrix before installing. Mismatched versions are the #1 cause of GPU initialization failures. The commands above are examples — always cross-reference the official matrices for your specific framework version.

Check the official compatibility pages before installing:

- TensorFlow tested build configurations: www.tensorflow.org/install/source#gpu
- PyTorch install selector: pytorch.org/get-started/locally/

As of the latest stable releases:

Framework           Tested CUDA     Tested cuDNN
TensorFlow 2.18+    12.5            9.3
PyTorch 2.5+        12.4 / 12.6     9.x

⚠️ Warning: If you plan to run both frameworks on the same machine, choose a CUDA version that is compatible with both. Check each framework’s install page for the overlap.
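To make that overlap check concrete, here is a tiny illustrative helper (not an official tool) that intersects the tested CUDA versions from the table above. With these exact values there is no exactly-matching tested version, which is precisely why each framework's install page must be consulted:

```python
# Tested CUDA versions from the table above (example values; always verify
# against the official compatibility matrices before installing).
TESTED_CUDA = {
    "tensorflow": {"12.5"},        # TensorFlow 2.18+
    "pytorch": {"12.4", "12.6"},   # PyTorch 2.5+
}

def shared_cuda_versions(matrix):
    """Return the CUDA versions tested with every framework in the matrix."""
    common = None
    for versions in matrix.values():
        common = set(versions) if common is None else common & versions
    return sorted(common or [])

print(shared_cuda_versions(TESTED_CUDA))  # [] -> no exactly-matching version
```

In practice, minor-version CUDA compatibility often bridges such gaps, but that is a per-release detail to confirm, not assume.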

Installing TensorFlow

# CPU-only installation
pip install tensorflow

# GPU installation (CUDA 12 — pip automatically resolves GPU packages)
pip install tensorflow[and-cuda]

# Verify installation
python -c "import tensorflow as tf; print(tf.__version__); print('GPU:', tf.config.list_physical_devices('GPU'))"

Installing PyTorch

Use the official selector at pytorch.org/get-started/locally/ to get the correct command for your setup:

# CPU-only installation
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# GPU installation (CUDA 12.4)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Verify installation
python -c "import torch; print(torch.__version__); print('CUDA:', torch.cuda.is_available())"

Architecture in Enterprise

Training Pipeline Architecture

┌──────────────────────────────────────────────────────────┐
│             Enterprise AI Training Platform              │
│                                                          │
│  ┌───────────┐    ┌──────────────┐    ┌───────────────┐  │
│  │  Data     │───▶│  Feature     │───▶│  Training     │  │
│  │  Pipeline │    │  Engineering │    │  Orchestrator │  │
│  │           │    │              │    │  (Kubeflow)   │  │
│  │ • Spark   │    │ • Transform  │    │               │  │
│  │ • Airflow │    │ • Validate   │    │ • TFJob       │  │
│  │ • dbt     │    │ • Feature    │    │ • PyTorchJob  │  │
│  │           │    │   Store      │    │ • Distributed │  │
│  └───────────┘    └──────────────┘    └───────┬───────┘  │
│                                               │          │
│                                               ▼          │
│  ┌───────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │  Monitoring   │◀───│  Model       │◀───│ Experiment│  │
│  │               │    │  Serving     │    │ Tracking  │  │
│  │ • Prometheus  │    │              │    │ (MLflow)  │  │
│  │ • Grafana     │    │ • TF Serving │    │           │  │
│  │ • Evidently   │    │ • TorchServe │    │ • Params  │  │
│  │   (Drift)     │    │ • KServe     │    │ • Metrics │  │
│  └───────────────┘    └──────────────┘    └───────────┘  │
└──────────────────────────────────────────────────────────┘

Distributed Training

Both frameworks support distributed training across multiple GPUs and nodes:
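Under the hood, both strategies below implement data parallelism: each worker computes gradients on its own data shard, and gradients are averaged across workers before every weight update. A framework-free toy sketch of that all-reduce step:

```python
# Toy data-parallel step: average per-worker gradients element-wise,
# which is what NCCL's all-reduce does for DDP and
# MultiWorkerMirroredStrategy under the hood.
def allreduce_mean(per_worker_grads):
    """Average gradient vectors across workers, element by element."""
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]

# Two workers, two model parameters each:
print(allreduce_mean([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```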

TensorFlow — MultiWorkerMirroredStrategy:

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

model.fit(train_dataset, epochs=10)
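MultiWorkerMirroredStrategy discovers the cluster through the TF_CONFIG environment variable, which must be set on every worker before the strategy is created. A minimal sketch (hostnames and port are placeholders):

```python
import json
import os

def make_tf_config(workers, index):
    # "cluster" lists every worker address; "task" identifies this process.
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })

# On worker 0 of a two-worker job (example hostnames):
os.environ["TF_CONFIG"] = make_tf_config(
    ["worker-0.ml-training:2222", "worker-1.ml-training:2222"], index=0)
```

When you run under the Kubeflow TFJob operator, TF_CONFIG is injected into each pod automatically.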

PyTorch — DistributedDataParallel (DDP):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # NCCL backend for GPU collectives; MASTER_ADDR and MASTER_PORT must be
    # set in the environment (torchrun sets them automatically)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    model = MyModel().to(rank)  # MyModel: your torch.nn.Module subclass
    ddp_model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.Adam(ddp_model.parameters())
    # Training loop...
    dist.destroy_process_group()

On Kubernetes with Kubeflow Training Operators:

# pytorch-training-job.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-training
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: your-registry/training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
                memory: "16Gi"
                cpu: "4"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: your-registry/training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
                memory: "16Gi"
                cpu: "4"

Model Serving

Each framework has its own production serving solution:

TensorFlow Serving:

# Pull and run TF Serving container
docker run -p 8501:8501 \
  --mount type=bind,source=/models/my_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving
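Once the container is up, TF Serving answers REST predict requests at /v1/models/&lt;name&gt;:predict with a JSON body of the form {"instances": [...]}. A stdlib-only sketch that builds such a request against the container above (the input shape is hypothetical and depends on your model):

```python
import json
from urllib.request import Request  # pass the Request to urlopen() to send it

def build_predict_request(model_name, instances, host="localhost", port=8501):
    """Build a TF Serving REST predict request for the given instances."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode()
    return Request(url, data=body, headers={"Content-Type": "application/json"})

req = build_predict_request("my_model", [[1.0, 2.0, 3.0]])
print(req.full_url)  # http://localhost:8501/v1/models/my_model:predict
```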

TorchServe:

# Install TorchServe
pip install torchserve torch-model-archiver

# Archive a model
torch-model-archiver --model-name my_model \
  --version 1.0 \
  --serialized-file model.pt \
  --handler image_classifier

# Start TorchServe
torchserve --start --model-store model_store --models my_model=my_model.mar
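TorchServe then serves inference at /predictions/&lt;model_name&gt; on port 8080 by default. A stdlib-only sketch of building that request (the payload is a placeholder; the image_classifier handler above expects raw image bytes):

```python
from urllib.request import Request  # pass the Request to urlopen() to send it

def torchserve_request(model_name, payload, host="localhost", port=8080):
    """Build a TorchServe inference request for the given model."""
    url = f"http://{host}:{port}/predictions/{model_name}"
    return Request(url, data=payload)

req = torchserve_request("my_model", payload=b"<raw image bytes>")
print(req.full_url)  # http://localhost:8080/predictions/my_model
```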

KServe (Framework-Agnostic on Kubernetes):

# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/my-model"
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "8Gi"
        requests:
          cpu: "2"
          memory: "4Gi"

TensorFlow vs PyTorch Comparison

Aspect                  TensorFlow                              PyTorch
Developer               Google                                  Meta
Computation graph       Static (tf.function) + Eager            Dynamic by default
API style               Keras (high-level), tf.* (low-level)    Pythonic, module-based
Production serving      TF Serving (mature, battle-tested)      TorchServe (growing ecosystem)
Mobile/Edge             TF Lite (strong)                        PyTorch Mobile, ExecuTorch
Research adoption       Declining in new papers                 Dominant (80%+ of new research)
Enterprise adoption     Strong in legacy + GCP environments     Growing rapidly
Distributed training    MultiWorkerMirrored, ParameterServer    DDP, FSDP (Fully Sharded)
Visualization           TensorBoard (built-in)                  TensorBoard (via integration)
ONNX export             tf2onnx (community)                     Native torch.onnx.export
Compiler optimization   XLA                                     torch.compile (TorchDynamo)

When to Choose TensorFlow

Prefer TensorFlow for battle-tested serving infrastructure (TF Serving), strong mobile/edge deployment (TF Lite), and organizations already invested in GCP or existing TensorFlow pipelines.

When to Choose PyTorch

Prefer PyTorch when research velocity matters: dynamic graphs, a Pythonic API, dominant adoption in new papers, and first-class distributed training (DDP, FSDP) make it the default for new model development.

When to Use Both

Many enterprises adopt a dual-framework strategy: PyTorch for experimentation and new model development, TensorFlow for established production and edge pipelines, with ONNX or framework-agnostic serving (KServe, Triton) bridging the two.


Enterprise Implementation

Training at Scale (GPU Clusters)

Enterprise GPU clusters typically use Kubernetes with the NVIDIA GPU Operator (which manages drivers, the container toolkit, and the device plugin) plus Kubeflow Training Operators for scheduling TFJob and PyTorchJob workloads:

# Install NVIDIA GPU Operator on Kubernetes
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

Model Versioning

Integrate with MLflow for consistent model versioning across both frameworks:

import mlflow

# Log a TensorFlow model
with mlflow.start_run():
    mlflow.tensorflow.log_model(model, "tf-model")
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    mlflow.log_metrics({"accuracy": 0.94, "loss": 0.18})

# Log a PyTorch model
with mlflow.start_run():
    mlflow.pytorch.log_model(model, "pytorch-model")
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    mlflow.log_metrics({"accuracy": 0.95, "loss": 0.15})

Deployment Architecture

┌────────────────────────────────────────────────────┐
│             Production Serving Layer               │
│                                                    │
│  ┌────────────────────────────────────────────┐    │
│  │           Load Balancer / Ingress          │    │
│  └─────────────────┬──────────────────────────┘    │
│                    │                               │
│    ┌───────────────┼───────────────┐               │
│    │               │               │               │
│    ▼               ▼               ▼               │
│  ┌───────┐   ┌──────────┐    ┌──────────┐          │
│  │  TF   │   │  Torch   │    │  KServe  │          │
│  │Serving│   │  Serve   │    │ (ONNX/   │          │
│  │       │   │          │    │  Triton) │          │
│  │v1:90% │   │ v2:100%  │    │ v3:canary│          │
│  │v2:10% │   │          │    │   (10%)  │          │
│  └───────┘   └──────────┘    └──────────┘          │
│                                                    │
│  ┌────────────────────────────────────────────┐    │
│  │     GPU Node Pool (NVIDIA A100 / H100)     │    │
│  └────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────┘

Security Best Practices

Dependency Isolation

# Pin dependencies
pip install pip-tools
pip-compile requirements.in --generate-hashes

# Audit for vulnerabilities
pip install pip-audit
pip-audit

Supply Chain Security

ML dependencies have a large attack surface. Protect against supply chain attacks:

# Scan a training container image
trivy image your-registry/training:latest

# Install only from verified sources with hash verification
pip install --require-hashes -r requirements.txt
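Conceptually, --require-hashes recomputes each downloaded artifact's SHA-256 and refuses to install on a mismatch. A minimal sketch of that check:

```python
import hashlib

def sha256_matches(path, expected_hex):
    """Stream a file and compare its SHA-256 digest to the expected one."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large wheels don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex.lower()
```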

⚠️ Warning: Never install packages from untrusted sources. Typosquatting attacks on PyPI are common — always verify package names and publishers before installing.

Model Validation

Before deploying a model to production: validate accuracy on a held-out evaluation set, compare metrics against the currently deployed version, load-test inference latency, and capture a baseline for drift monitoring (e.g. with Evidently).

Reproducibility

Reproducibility is critical for auditing and debugging production models:

# Lock the exact environment
pip freeze > requirements-lock.txt

# Set random seeds for reproducibility
python -c "
import random, numpy as np, torch, tensorflow as tf
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
tf.random.set_seed(42)
"

# Use Docker for environment reproducibility
# Dockerfile
# FROM nvidia/cuda:12.4.1-runtime-ubuntu24.04
# RUN apt-get update && apt-get install -y python3 python3-pip
# COPY requirements-lock.txt .
# RUN pip3 install -r requirements-lock.txt
# COPY . /app
# WORKDIR /app
# CMD ["python3", "train.py"]

Network Security for Model Serving

Expose inference endpoints only through a TLS-terminating ingress, require authentication (mTLS or API tokens) on prediction APIs, and restrict pod-to-pod access to the serving namespace with Kubernetes NetworkPolicies.

Top GitHub Projects

1. tensorflow/tensorflow: Google's end-to-end machine learning framework.

2. pytorch/pytorch: Meta's deep learning framework with dynamic computation graphs.

3. huggingface/transformers: state-of-the-art pretrained models for NLP, vision, and audio, supporting both PyTorch and TensorFlow.

4. keras-team/keras: high-level deep learning API; since Keras 3 it runs on TensorFlow, PyTorch, and JAX backends.

5. Lightning-AI/pytorch-lightning: lightweight PyTorch wrapper that organizes training code and handles distributed-training boilerplate.

