Overview
Python is the undisputed backbone of AI and machine learning development. Its ecosystem of scientific computing libraries (NumPy, pandas, scikit-learn), visualization tools (Matplotlib, Plotly), and deep learning frameworks (TensorFlow, PyTorch) makes it the default language for everything from research prototypes to production inference systems.
Two deep learning frameworks dominate the enterprise landscape:
- TensorFlow — Developed by Google, designed for production deployment at scale. Offers TensorFlow Serving, TensorFlow Lite, and TensorFlow.js for multi-platform inference. Widely adopted in large enterprises and cloud-native environments.
- PyTorch — Developed by Meta, favored for its Pythonic API and dynamic computation graphs. Dominant in research and increasingly adopted in production through TorchServe and the PyTorch ecosystem. PyTorch now accounts for the majority of new ML research papers.
Enterprise adoption patterns:
- TensorFlow tends to dominate in organizations with existing Google Cloud Platform (GCP) infrastructure, edge/mobile deployment requirements, and teams that prioritize static graph optimization
- PyTorch leads in research-heavy organizations, NLP/LLM workloads (most Hugging Face models are PyTorch-native), and teams that value rapid prototyping
- Many enterprises run both frameworks, using PyTorch for experimentation and TensorFlow for optimized production serving
Installation on Ubuntu
Prerequisites
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install essential build tools and dependencies
sudo apt install -y build-essential curl wget git \
libssl-dev zlib1g-dev libbz2-dev libreadline-dev \
libsqlite3-dev libncursesw5-dev xz-utils tk-dev \
libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
Python Setup with pyenv
Using pyenv ensures consistent Python versions across development and production environments:
# Install pyenv (download, inspect, then run)
curl -fsSL https://pyenv.run -o /tmp/pyenv-installer.sh
# (Optional) Inspect the installer before running
less /tmp/pyenv-installer.sh
bash /tmp/pyenv-installer.sh
# Add pyenv to shell configuration
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc
# Install Python 3.11 (recommended for TensorFlow/PyTorch compatibility)
pyenv install 3.11.9
pyenv global 3.11.9
# Verify installation
python --version
⚠️ Warning: Avoid using the system Python for ML workloads. System Python is managed by apt and modifying it can break OS utilities. Always use pyenv or virtual environments.
Virtual Environments
Always isolate project dependencies:
# Create a project-specific virtual environment
python -m venv ~/ai-stack-env
source ~/ai-stack-env/bin/activate
# Upgrade pip
pip install --upgrade pip setuptools wheel
GPU Setup (CUDA)
For GPU-accelerated training, install NVIDIA drivers and CUDA:
# Check for NVIDIA GPU
lspci | grep -i nvidia
# Install NVIDIA drivers (Ubuntu)
sudo apt install -y nvidia-driver-550
# Verify driver installation
nvidia-smi
# Install CUDA Toolkit
# IMPORTANT: Check the official compatibility matrices BEFORE choosing a version.
# TensorFlow and PyTorch each support specific CUDA/cuDNN combinations.
# See the links below for the exact versions tested with your target framework.
#
# Example for CUDA 12.5 (tested with TF 2.18+ and PyTorch 2.5+):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-5
# Install cuDNN (match your CUDA version)
sudo apt install -y libcudnn9-cuda-12
⚠️ Warning: Always verify the CUDA/cuDNN/TensorFlow/PyTorch compatibility matrix before installing. Mismatched versions are the #1 cause of GPU initialization failures. The commands above are examples — always cross-reference the official matrices for your specific framework version.
As of the latest stable releases (always check the official compatibility pages before installing):
| Framework | Tested CUDA | Tested cuDNN |
|---|---|---|
| TensorFlow 2.18+ | 12.5 | 9.3 |
| PyTorch 2.5+ | 12.4 / 12.6 | 9.x |
⚠️ Warning: If you plan to run both frameworks on the same machine, choose a CUDA version that is compatible with both. Check each framework’s install page for the overlap.
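The overlap check is easy to script. The sketch below hard-codes the snapshot from the table above — treat it as illustrative, not authoritative, since the official matrices change with every release:

```python
# Snapshot of the tested-CUDA table above; refresh these values from the
# official compatibility matrices before relying on them.
TESTED_CUDA = {
    "tensorflow": {"12.5"},
    "pytorch": {"12.4", "12.6"},
}

def shared_cuda_versions(*frameworks):
    """Return the CUDA versions tested by every listed framework."""
    return set.intersection(*(TESTED_CUDA[f] for f in frameworks))

# With this snapshot there is no common tested version, which is exactly
# why the warning above says to check each framework's install page.
print(shared_cuda_versions("tensorflow", "pytorch"))
```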
Installing TensorFlow
# CPU-only installation
pip install tensorflow
# GPU installation (CUDA 12 — pip automatically resolves GPU packages)
pip install tensorflow[and-cuda]
# Verify installation
python -c "import tensorflow as tf; print(tf.__version__); print('GPU:', tf.config.list_physical_devices('GPU'))"
Installing PyTorch
Use the official selector at pytorch.org/get-started/locally/ to get the correct command for your setup:
# CPU-only installation
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# GPU installation (CUDA 12.4)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Verify installation
python -c "import torch; print(torch.__version__); print('CUDA:', torch.cuda.is_available())"
Architecture in Enterprise
Training Pipeline Architecture
┌────────────────────────────────────────────────────────────┐
│              Enterprise AI Training Platform               │
│                                                            │
│  ┌────────────┐    ┌──────────────┐    ┌───────────────┐   │
│  │    Data    │───▶│   Feature    │───▶│   Training    │   │
│  │  Pipeline  │    │ Engineering  │    │ Orchestrator  │   │
│  │            │    │              │    │  (Kubeflow)   │   │
│  │ • Spark    │    │ • Transform  │    │ • TFJob       │   │
│  │ • Airflow  │    │ • Validate   │    │ • PyTorchJob  │   │
│  │ • dbt      │    │ • Feature    │    │ • Distributed │   │
│  │            │    │   Store      │    │               │   │
│  └────────────┘    └──────────────┘    └───────┬───────┘   │
│                                                │           │
│                                                ▼           │
│  ┌──────────────┐    ┌──────────────┐    ┌────────────┐    │
│  │  Monitoring  │◀───│    Model     │◀───│ Experiment │    │
│  │              │    │   Serving    │    │  Tracking  │    │
│  │ • Prometheus │    │              │    │  (MLflow)  │    │
│  │ • Grafana    │    │ • TF Serving │    │            │    │
│  │ • Evidently  │    │ • TorchServe │    │ • Params   │    │
│  │   (Drift)    │    │ • KServe     │    │ • Metrics  │    │
│  └──────────────┘    └──────────────┘    └────────────┘    │
└────────────────────────────────────────────────────────────┘
Distributed Training
Both frameworks support distributed training across multiple GPUs and nodes:
TensorFlow — MultiWorkerMirroredStrategy:
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model creation and compilation must happen inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

model.fit(train_dataset, epochs=10)
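MultiWorkerMirroredStrategy discovers its peers through the TF_CONFIG environment variable, which must be set on every worker before the strategy is created. A two-worker sketch (the hostnames and port are placeholders):

```python
import json
import os

# The same "cluster" dict goes on every node; "task.index" identifies
# which entry in the worker list this particular node is.
tf_config = {
    "cluster": {
        "worker": ["worker0.example.com:12345", "worker1.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0},  # set to 1 on the second worker
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```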
PyTorch — DistributedDataParallel (DDP):
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    model = MyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters())
    # Training loop...
    dist.destroy_process_group()
On Kubernetes with Kubeflow Training Operators:
# pytorch-training-job.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-training
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/training:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
                  memory: "16Gi"
                  cpu: "4"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/training:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
                  memory: "16Gi"
                  cpu: "4"
Model Serving
Each framework has its own production serving solution:
TensorFlow Serving:
# Pull and run TF Serving container
docker run -p 8501:8501 \
--mount type=bind,source=/models/my_model,target=/models/my_model \
-e MODEL_NAME=my_model \
-t tensorflow/serving
TorchServe:
# Install TorchServe
pip install torchserve torch-model-archiver
# Archive a model
torch-model-archiver --model-name my_model \
--version 1.0 \
--serialized-file model.pt \
--handler image_classifier
# Start TorchServe
torchserve --start --model-store model_store --models my_model=my_model.mar
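Once TorchServe is running, predictions are served over its inference API — by default `POST /predictions/<model>` on port 8080. A minimal client sketch using only the standard library (the payload format depends on the handler you registered; the image-file example is hypothetical):

```python
import json
import urllib.request

def inference_url(host: str, model: str, port: int = 8080) -> str:
    """Build the TorchServe inference endpoint URL for a model."""
    return f"http://{host}:{port}/predictions/{model}"

def predict(host: str, model: str, payload: bytes) -> dict:
    """POST raw bytes to the inference endpoint and parse the JSON reply."""
    req = urllib.request.Request(
        inference_url(host, model),
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the server started above to be running):
# result = predict("localhost", "my_model", open("kitten.jpg", "rb").read())
```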
KServe (Framework-Agnostic on Kubernetes):
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/my-model"
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "8Gi"
        requests:
          cpu: "2"
          memory: "4Gi"
TensorFlow vs PyTorch Comparison
| Aspect | TensorFlow | PyTorch |
|---|---|---|
| Developer | Google | Meta |
| Computation graph | Static (tf.function) + Eager | Dynamic by default |
| API style | Keras (high-level), tf.* (low-level) | Pythonic, module-based |
| Production serving | TF Serving (mature, battle-tested) | TorchServe (growing ecosystem) |
| Mobile/Edge | TF Lite (strong) | PyTorch Mobile, ExecuTorch |
| Research adoption | Declining in new papers | Dominant (80%+ of new research) |
| Enterprise adoption | Strong in legacy + GCP environments | Growing rapidly |
| Distributed training | MultiWorkerMirrored, ParameterServer | DDP, FSDP (Fully Sharded) |
| Visualization | TensorBoard (built-in) | TensorBoard (via integration) |
| ONNX export | tf2onnx (community) | Native torch.onnx.export |
| Compiler optimization | XLA | torch.compile (TorchDynamo) |
When to Choose TensorFlow
- You need TF Lite for mobile or edge deployment
- Your team has existing TensorFlow infrastructure and models
- You need TF.js for browser-based inference
- Your organization is heavily invested in GCP (Vertex AI)
When to Choose PyTorch
- Your team does research and experimentation that becomes production
- You work heavily with Hugging Face Transformers (most models are PyTorch-first)
- You need dynamic computation graphs (e.g., for variable-length sequences or graph neural networks)
- You want to leverage torch.compile for automatic performance optimization
When to Use Both
Many enterprises adopt a dual-framework strategy:
- Research teams use PyTorch for rapid experimentation
- Production teams convert critical models to ONNX or TensorFlow SavedModel format
- Model serving uses KServe or Triton Inference Server, which supports both frameworks
Enterprise Implementation
Training at Scale (GPU Clusters)
Enterprise GPU clusters typically use:
- NVIDIA DGX systems or cloud GPU instances (A100, H100)
- Kubernetes with the NVIDIA GPU Operator for device plugin management
- NCCL (NVIDIA Collective Communications Library) for multi-GPU/multi-node communication
# Install NVIDIA GPU Operator on Kubernetes
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace
Model Versioning
Integrate with MLflow for consistent model versioning across both frameworks:
import mlflow

# Log a TensorFlow model
with mlflow.start_run():
    mlflow.tensorflow.log_model(model, "tf-model")
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    mlflow.log_metrics({"accuracy": 0.94, "loss": 0.18})

# Log a PyTorch model
with mlflow.start_run():
    mlflow.pytorch.log_model(model, "pytorch-model")
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    mlflow.log_metrics({"accuracy": 0.95, "loss": 0.15})
Deployment Architecture
┌────────────────────────────────────────────────────┐
│              Production Serving Layer              │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │           Load Balancer / Ingress            │  │
│  └──────────────────────┬───────────────────────┘  │
│                         │                          │
│         ┌───────────────┼───────────────┐          │
│         │               │               │          │
│         ▼               ▼               ▼          │
│    ┌──────────┐    ┌──────────┐    ┌──────────┐    │
│    │    TF    │    │  Torch   │    │  KServe  │    │
│    │ Serving  │    │  Serve   │    │  (ONNX/  │    │
│    │          │    │          │    │  Triton) │    │
│    │  v1:90%  │    │ v2:100%  │    │v3:canary │    │
│    │  v2:10%  │    │          │    │  (10%)   │    │
│    └──────────┘    └──────────┘    └──────────┘    │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │      GPU Node Pool (NVIDIA A100 / H100)      │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
Security Best Practices
Dependency Isolation
- Always use virtual environments or containers to isolate dependencies
- Pin exact dependency versions in `requirements.txt` or use `pip-compile` (pip-tools)
- Run `pip-audit` regularly to check for known vulnerabilities
# Pin dependencies
pip install pip-tools
pip-compile requirements.in --generate-hashes
# Audit for vulnerabilities
pip install pip-audit
pip-audit
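A quick helper (illustrative, not a replacement for pip-tools) to flag any requirements line that is not fully pinned with `==`:

```python
def unpinned(requirement_lines):
    """Return requirement entries that lack an exact `==` version pin.

    Comments and option lines (e.g. `-r base.txt`) are ignored.
    """
    entries = [
        line.strip() for line in requirement_lines
        if line.strip() and not line.strip().startswith(("#", "-"))
    ]
    return [entry for entry in entries if "==" not in entry]

# A fully pinned file yields no findings; a loose constraint is flagged:
print(unpinned(["torch==2.5.1", "numpy>=1.26", "# comment"]))  # → ['numpy>=1.26']
```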
Supply Chain Security
ML dependencies have a large attack surface. Protect against supply chain attacks:
- Verify package hashes when installing (`--require-hashes` with pip)
- Use a private PyPI mirror (Artifactory, Nexus) for production environments
- Scan container images with Trivy or Grype before deployment
# Scan a training container image
trivy image your-registry/training:latest
# Install only from verified sources with hash verification
pip install --require-hashes -r requirements.txt
⚠️ Warning: Never install packages from untrusted sources. Typosquatting attacks on PyPI are common — always verify package names and publishers before installing.
Model Validation
Before deploying a model to production:
- Run validation datasets to ensure the model meets accuracy thresholds
- Check for adversarial robustness using frameworks like Adversarial Robustness Toolbox (ART)
- Validate fairness metrics to detect bias (AI Fairness 360, Fairlearn)
- Ensure model explainability using SHAP or LIME
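These checks are easiest to enforce as an automated gate in the deployment pipeline. A minimal sketch — the metric names and thresholds below are illustrative:

```python
def passes_gate(metrics, thresholds):
    """Return (ok, failures), where failures lists metrics below their floor.

    Metrics absent from the evaluation report count as failures, so a model
    cannot pass simply by not reporting a required metric.
    """
    failures = sorted(
        name for name, floor in thresholds.items()
        if metrics.get(name, float("-inf")) < floor
    )
    return (not failures, failures)

ok, failed = passes_gate(
    {"accuracy": 0.94, "f1": 0.91},
    {"accuracy": 0.90, "f1": 0.92},
)
# f1 misses its 0.92 floor here, so the model is not promoted
```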
Reproducibility
Reproducibility is critical for auditing and debugging production models:
# Lock the exact environment
pip freeze > requirements-lock.txt
# Set random seeds for reproducibility
python -c "
import random, numpy as np, torch, tensorflow as tf
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
tf.random.set_seed(42)
"
# Use Docker for environment reproducibility
# Dockerfile
# FROM nvidia/cuda:12.4.1-runtime-ubuntu24.04
# COPY requirements-lock.txt .
# RUN pip install -r requirements-lock.txt
# COPY . /app
# WORKDIR /app
# CMD ["python", "train.py"]
Network Security for Model Serving
- Place inference endpoints behind an API gateway with rate limiting
- Use mTLS between services in a service mesh (Istio, Linkerd)
- Implement input validation to prevent adversarial inputs and injection attacks
- Log all inference requests for audit and anomaly detection
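Input validation in particular is cheap to enforce before a request ever reaches the model. A sketch for a fixed-length numeric feature vector — the expected length and bounds are illustrative and should match your model's actual input contract:

```python
def validate_input(payload, expected_len=4, lo=-1e6, hi=1e6):
    """Accept only a list of plain numbers with the expected length and range."""
    if not isinstance(payload, list) or len(payload) != expected_len:
        return False
    return all(
        isinstance(x, (int, float)) and not isinstance(x, bool) and lo <= x <= hi
        for x in payload
    )

print(validate_input([0.1, 2.0, -3.5, 4]))    # well-formed vector → True
print(validate_input([0.1, "2.0", -3.5, 4]))  # string smuggled in → False
```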
Top GitHub Projects
1. tensorflow/tensorflow
- Description: An end-to-end open source platform for machine learning with comprehensive tools, libraries, and community resources
- Stars: 187,000+
- Enterprise usage: Google, Intel, Airbus, PayPal — widely used for production ML at scale, edge deployment (TF Lite), and browser inference (TF.js)
- Link: github.com/tensorflow/tensorflow
2. pytorch/pytorch
- Description: Tensors and dynamic neural networks in Python with strong GPU acceleration
- Stars: 85,000+
- Enterprise usage: Meta, Microsoft, Tesla, OpenAI — the dominant framework for research and increasingly for production LLM workloads
- Link: github.com/pytorch/pytorch
3. huggingface/transformers
- Description: State-of-the-art pretrained models for NLP, computer vision, and audio — supports PyTorch, TensorFlow, and JAX
- Stars: 140,000+
- Enterprise usage: Used by thousands of companies for NLP pipelines, text generation, sentiment analysis, and as the foundation for fine-tuning LLMs
- Link: github.com/huggingface/transformers
4. keras-team/keras
- Description: Deep learning for humans — a multi-backend high-level neural network API supporting TensorFlow, JAX, and PyTorch
- Stars: 62,000+
- Enterprise usage: The default high-level API for TensorFlow development. Keras 3 supports multiple backends, making it a universal deep learning interface
- Link: github.com/keras-team/keras
5. Lightning-AI/pytorch-lightning
- Description: The deep learning framework to pretrain, finetune, and deploy AI models — reduces boilerplate and scales PyTorch training
- Stars: 28,000+
- Enterprise usage: Simplifies distributed training, mixed precision, and multi-GPU workflows. Used by teams migrating from research to production PyTorch
- Link: github.com/Lightning-AI/pytorch-lightning
References
- TensorFlow Official Documentation
- PyTorch Official Documentation
- TensorFlow Installation Guide
- PyTorch Get Started
- KServe Documentation
- TorchServe Documentation
- TensorFlow Serving
- NVIDIA GPU Operator
- MLflow TensorFlow Integration
- MLflow PyTorch Integration
- pip-audit — Security Auditing for Python
- Adversarial Robustness Toolbox