yisusvii Blog

Python, TensorFlow, and PyTorch: Enterprise AI Stack Setup and Best Practices


Overview

Python is the undisputed backbone of AI and machine learning development. Its ecosystem of scientific computing libraries (NumPy, pandas, scikit-learn), visualization tools (Matplotlib, Plotly), and deep learning frameworks (TensorFlow, PyTorch) makes it the default language for everything from research prototypes to production inference systems.

Two deep learning frameworks dominate the enterprise landscape: TensorFlow, developed by Google, and PyTorch, developed by Meta.

Enterprise adoption patterns: TensorFlow remains entrenched in mature production pipelines, mobile/edge deployments, and GCP environments, while PyTorch dominates new research and is growing rapidly in enterprise use.


Installation on Ubuntu

Prerequisites

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential build tools and dependencies
sudo apt install -y build-essential curl wget git \
  libssl-dev zlib1g-dev libbz2-dev libreadline-dev \
  libsqlite3-dev libncursesw5-dev xz-utils tk-dev \
  libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

Python Setup with pyenv

Using pyenv ensures consistent Python versions across development and production environments:

# Install pyenv (download, inspect, then run)
curl -fsSL https://pyenv.run -o /tmp/pyenv-installer.sh
# (Optional) Inspect the installer before running
less /tmp/pyenv-installer.sh
bash /tmp/pyenv-installer.sh

# Add pyenv to shell configuration
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# Install Python 3.11 (recommended for TensorFlow/PyTorch compatibility)
pyenv install 3.11.9
pyenv global 3.11.9

# Verify installation
python --version

⚠️ Warning: Avoid using the system Python for ML workloads. System Python is managed by apt and modifying it can break OS utilities. Always use pyenv or virtual environments.

Virtual Environments

Always isolate project dependencies:

# Create a project-specific virtual environment
python -m venv ~/ai-stack-env
source ~/ai-stack-env/bin/activate

# Upgrade pip
pip install --upgrade pip setuptools wheel

GPU Setup (CUDA)

For GPU-accelerated training, install NVIDIA drivers and CUDA:

# Check for NVIDIA GPU
lspci | grep -i nvidia

# Install NVIDIA drivers (Ubuntu)
sudo apt install -y nvidia-driver-550

# Verify driver installation
nvidia-smi

# Install CUDA Toolkit
# IMPORTANT: Check the official compatibility matrices BEFORE choosing a version.
# TensorFlow and PyTorch each support specific CUDA/cuDNN combinations.
# See the links below for the exact versions tested with your target framework.
#
# Example for CUDA 12.5 (tested with TF 2.18+ and PyTorch 2.5+):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-5

# Install cuDNN (match your CUDA version)
sudo apt install -y libcudnn9-cuda-12

⚠️ Warning: Always verify the CUDA/cuDNN/TensorFlow/PyTorch compatibility matrix before installing. Mismatched versions are the #1 cause of GPU initialization failures. The commands above are examples — always cross-reference the official matrices for your specific framework version.

Check the official compatibility pages before installing:

- TensorFlow tested build configurations: www.tensorflow.org/install/source#gpu
- PyTorch install selector: pytorch.org/get-started/locally/

As of the latest stable releases:

Framework           Tested CUDA     Tested cuDNN
TensorFlow 2.18+    12.5            9.3
PyTorch 2.5+        12.4 / 12.6     9.x

⚠️ Warning: If you plan to run both frameworks on the same machine, choose a CUDA version that is compatible with both. Check each framework’s install page for the overlap.
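To make that overlap check concrete, here is a tiny illustrative helper (not an official tool) that intersects the tested CUDA versions from the table above. With these exact values there is no exactly-matching tested version, which is precisely why each framework's install page must be consulted:

```python
# Tested CUDA versions from the table above (example values; always verify
# against the official compatibility matrices before installing).
TESTED_CUDA = {
    "tensorflow": {"12.5"},        # TensorFlow 2.18+
    "pytorch": {"12.4", "12.6"},   # PyTorch 2.5+
}

def shared_cuda_versions(matrix):
    """Return the CUDA versions tested with every framework in the matrix."""
    common = None
    for versions in matrix.values():
        common = set(versions) if common is None else common & versions
    return sorted(common or [])

print(shared_cuda_versions(TESTED_CUDA))  # [] -> no exactly-matching version
```

In practice, minor-version CUDA compatibility often bridges such gaps, but that is a per-release detail to confirm, not assume.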

Installing TensorFlow

# CPU-only installation
pip install tensorflow

# GPU installation (CUDA 12 — pip automatically resolves GPU packages)
pip install tensorflow[and-cuda]

# Verify installation
python -c "import tensorflow as tf; print(tf.__version__); print('GPU:', tf.config.list_physical_devices('GPU'))"

Installing PyTorch

Use the official selector at pytorch.org/get-started/locally/ to get the correct command for your setup:

# CPU-only installation
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# GPU installation (CUDA 12.4)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Verify installation
python -c "import torch; print(torch.__version__); print('CUDA:', torch.cuda.is_available())"

Architecture in Enterprise

Training Pipeline Architecture

┌──────────────────────────────────────────────────────────┐
│             Enterprise AI Training Platform              │
│                                                          │
│  ┌───────────┐    ┌──────────────┐    ┌───────────────┐  │
│  │  Data     │───▶│  Feature     │───▶│  Training     │  │
│  │  Pipeline │    │  Engineering │    │  Orchestrator │  │
│  │           │    │              │    │  (Kubeflow)   │  │
│  │ • Spark   │    │ • Transform  │    │               │  │
│  │ • Airflow │    │ • Validate   │    │ • TFJob       │  │
│  │ • dbt     │    │ • Feature    │    │ • PyTorchJob  │  │
│  │           │    │   Store      │    │ • Distributed │  │
│  └───────────┘    └──────────────┘    └───────┬───────┘  │
│                                               │          │
│                                               ▼          │
│  ┌───────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │  Monitoring   │◀───│  Model       │◀───│ Experiment│  │
│  │               │    │  Serving     │    │ Tracking  │  │
│  │ • Prometheus  │    │              │    │ (MLflow)  │  │
│  │ • Grafana     │    │ • TF Serving │    │           │  │
│  │ • Evidently   │    │ • TorchServe │    │ • Params  │  │
│  │   (Drift)     │    │ • KServe     │    │ • Metrics │  │
│  └───────────────┘    └──────────────┘    └───────────┘  │
└──────────────────────────────────────────────────────────┘

Distributed Training

Both frameworks support distributed training across multiple GPUs and nodes:
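Under the hood, both strategies below implement data parallelism: each worker computes gradients on its own data shard, and gradients are averaged across workers before every weight update. A framework-free toy sketch of that all-reduce step:

```python
# Toy data-parallel step: average per-worker gradients element-wise,
# which is what NCCL's all-reduce does for DDP and
# MultiWorkerMirroredStrategy under the hood.
def allreduce_mean(per_worker_grads):
    """Average gradient vectors across workers, element by element."""
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]

# Two workers, two model parameters each:
print(allreduce_mean([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```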

TensorFlow — MultiWorkerMirroredStrategy:

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

model.fit(train_dataset, epochs=10)
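MultiWorkerMirroredStrategy discovers the cluster through the TF_CONFIG environment variable, which must be set on every worker before the strategy is created. A minimal sketch (hostnames and port are placeholders):

```python
import json
import os

def make_tf_config(workers, index):
    # "cluster" lists every worker address; "task" identifies this process.
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })

# On worker 0 of a two-worker job (example hostnames):
os.environ["TF_CONFIG"] = make_tf_config(
    ["worker-0.ml-training:2222", "worker-1.ml-training:2222"], index=0)
```

When you run under the Kubeflow TFJob operator, TF_CONFIG is injected into each pod automatically.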

PyTorch — DistributedDataParallel (DDP):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # NCCL backend for GPU collectives; MASTER_ADDR and MASTER_PORT must be
    # set in the environment (torchrun sets them automatically)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    model = MyModel().to(rank)  # MyModel: your torch.nn.Module subclass
    ddp_model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.Adam(ddp_model.parameters())
    # Training loop...
    dist.destroy_process_group()

On Kubernetes with Kubeflow Training Operators:

# pytorch-training-job.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-training
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: your-registry/training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
                memory: "16Gi"
                cpu: "4"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: your-registry/training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
                memory: "16Gi"
                cpu: "4"

Model Serving

Each framework has its own production serving solution:

TensorFlow Serving:

# Pull and run TF Serving container
docker run -p 8501:8501 \
  --mount type=bind,source=/models/my_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving
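Once the container is up, TF Serving answers REST predict requests at /v1/models/&lt;name&gt;:predict with a JSON body of the form {"instances": [...]}. A stdlib-only sketch that builds such a request against the container above (the input shape is hypothetical and depends on your model):

```python
import json
from urllib.request import Request  # pass the Request to urlopen() to send it

def build_predict_request(model_name, instances, host="localhost", port=8501):
    """Build a TF Serving REST predict request for the given instances."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode()
    return Request(url, data=body, headers={"Content-Type": "application/json"})

req = build_predict_request("my_model", [[1.0, 2.0, 3.0]])
print(req.full_url)  # http://localhost:8501/v1/models/my_model:predict
```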

TorchServe:

# Install TorchServe
pip install torchserve torch-model-archiver

# Archive a model
torch-model-archiver --model-name my_model \
  --version 1.0 \
  --serialized-file model.pt \
  --handler image_classifier

# Start TorchServe
torchserve --start --model-store model_store --models my_model=my_model.mar
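TorchServe then serves inference at /predictions/&lt;model_name&gt; on port 8080 by default. A stdlib-only sketch of building that request (the payload is a placeholder; the image_classifier handler above expects raw image bytes):

```python
from urllib.request import Request  # pass the Request to urlopen() to send it

def torchserve_request(model_name, payload, host="localhost", port=8080):
    """Build a TorchServe inference request for the given model."""
    url = f"http://{host}:{port}/predictions/{model_name}"
    return Request(url, data=payload)

req = torchserve_request("my_model", payload=b"<raw image bytes>")
print(req.full_url)  # http://localhost:8080/predictions/my_model
```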

KServe (Framework-Agnostic on Kubernetes):

# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/my-model"
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "8Gi"
        requests:
          cpu: "2"
          memory: "4Gi"

TensorFlow vs PyTorch Comparison

Aspect                  TensorFlow                              PyTorch
Developer               Google                                  Meta
Computation graph       Static (tf.function) + Eager            Dynamic by default
API style               Keras (high-level), tf.* (low-level)    Pythonic, module-based
Production serving      TF Serving (mature, battle-tested)      TorchServe (growing ecosystem)
Mobile/Edge             TF Lite (strong)                        PyTorch Mobile, ExecuTorch
Research adoption       Declining in new papers                 Dominant (80%+ of new research)
Enterprise adoption     Strong in legacy + GCP environments     Growing rapidly
Distributed training    MultiWorkerMirrored, ParameterServer    DDP, FSDP (Fully Sharded)
Visualization           TensorBoard (built-in)                  TensorBoard (via integration)
ONNX export             tf2onnx (community)                     Native torch.onnx.export
Compiler optimization   XLA                                     torch.compile (TorchDynamo)

When to Choose TensorFlow

Prefer TensorFlow for battle-tested serving infrastructure (TF Serving), strong mobile/edge deployment (TF Lite), and organizations already invested in GCP or existing TensorFlow pipelines.

When to Choose PyTorch

Prefer PyTorch when research velocity matters: dynamic graphs, a Pythonic API, dominant adoption in new papers, and first-class distributed training (DDP, FSDP) make it the default for new model development.

When to Use Both

Many enterprises adopt a dual-framework strategy: PyTorch for experimentation and new model development, TensorFlow for established production and edge pipelines, with ONNX or framework-agnostic serving (KServe, Triton) bridging the two.


Enterprise Implementation

Training at Scale (GPU Clusters)

Enterprise GPU clusters typically use Kubernetes with the NVIDIA GPU Operator (which manages drivers, the container toolkit, and the device plugin) plus Kubeflow Training Operators for scheduling TFJob and PyTorchJob workloads:

# Install NVIDIA GPU Operator on Kubernetes
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

Model Versioning

Integrate with MLflow for consistent model versioning across both frameworks:

import mlflow

# Log a TensorFlow model
with mlflow.start_run():
    mlflow.tensorflow.log_model(model, "tf-model")
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    mlflow.log_metrics({"accuracy": 0.94, "loss": 0.18})

# Log a PyTorch model
with mlflow.start_run():
    mlflow.pytorch.log_model(model, "pytorch-model")
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    mlflow.log_metrics({"accuracy": 0.95, "loss": 0.15})

Deployment Architecture

┌────────────────────────────────────────────────────┐
│             Production Serving Layer               │
│                                                    │
│  ┌────────────────────────────────────────────┐    │
│  │           Load Balancer / Ingress          │    │
│  └─────────────────┬──────────────────────────┘    │
│                    │                               │
│    ┌───────────────┼───────────────┐               │
│    │               │               │               │
│    ▼               ▼               ▼               │
│  ┌───────┐   ┌──────────┐    ┌──────────┐          │
│  │  TF   │   │  Torch   │    │  KServe  │          │
│  │Serving│   │  Serve   │    │ (ONNX/   │          │
│  │       │   │          │    │  Triton) │          │
│  │v1:90% │   │ v2:100%  │    │ v3:canary│          │
│  │v2:10% │   │          │    │   (10%)  │          │
│  └───────┘   └──────────┘    └──────────┘          │
│                                                    │
│  ┌────────────────────────────────────────────┐    │
│  │     GPU Node Pool (NVIDIA A100 / H100)     │    │
│  └────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────┘

Security Best Practices

Dependency Isolation

# Pin dependencies
pip install pip-tools
pip-compile requirements.in --generate-hashes

# Audit for vulnerabilities
pip install pip-audit
pip-audit

Supply Chain Security

ML dependencies have a large attack surface. Protect against supply chain attacks:

# Scan a training container image
trivy image your-registry/training:latest

# Install only from verified sources with hash verification
pip install --require-hashes -r requirements.txt
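Conceptually, --require-hashes recomputes each downloaded artifact's SHA-256 and refuses to install on a mismatch. A minimal sketch of that check:

```python
import hashlib

def sha256_matches(path, expected_hex):
    """Stream a file and compare its SHA-256 digest to the expected one."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large wheels don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex.lower()
```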

⚠️ Warning: Never install packages from untrusted sources. Typosquatting attacks on PyPI are common — always verify package names and publishers before installing.

Model Validation

Before deploying a model to production: validate accuracy on a held-out evaluation set, compare metrics against the currently deployed version, load-test inference latency, and capture a baseline for drift monitoring (e.g. with Evidently).

Reproducibility

Reproducibility is critical for auditing and debugging production models:

# Lock the exact environment
pip freeze > requirements-lock.txt

# Set random seeds for reproducibility
python -c "
import random, numpy as np, torch, tensorflow as tf
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
tf.random.set_seed(42)
"

# Use Docker for environment reproducibility
# Dockerfile
# FROM nvidia/cuda:12.4.1-runtime-ubuntu24.04
# RUN apt-get update && apt-get install -y python3 python3-pip
# COPY requirements-lock.txt .
# RUN pip3 install -r requirements-lock.txt
# COPY . /app
# WORKDIR /app
# CMD ["python3", "train.py"]

Network Security for Model Serving

Expose inference endpoints only through a TLS-terminating ingress, require authentication (mTLS or API tokens) on prediction APIs, and restrict pod-to-pod access to the serving namespace with Kubernetes NetworkPolicies.

Top GitHub Projects

1. tensorflow/tensorflow: Google's end-to-end machine learning framework.

2. pytorch/pytorch: Meta's deep learning framework with dynamic computation graphs.

3. huggingface/transformers: state-of-the-art pretrained models for NLP, vision, and audio, supporting both PyTorch and TensorFlow.

4. keras-team/keras: high-level deep learning API; since Keras 3 it runs on TensorFlow, PyTorch, and JAX backends.

5. Lightning-AI/pytorch-lightning: lightweight PyTorch wrapper that organizes training code and handles distributed-training boilerplate.

