# Comparing changes
- base repository: deepspeedai/DeepSpeedExamples (branch: master)
- head repository: TamerSoliman/DeepSpeedExamples (branch: master)
- 7 commits · 49 files changed · 2 contributors
## Commits on Nov 18, 2025

### Add comprehensive DeepSpeed ZeRO tutorials and documentation
This commit adds detailed educational materials mapping DeepSpeed ZeRO optimization stages to code and configuration:

**Annotated Scripts (4 files):**
- 01_hello_deepspeed_annotated.py - Basic ZeRO-1 with CPU offload
- 02_cifar10_annotated.py - Configurable ZeRO stages (0-3)
- 03_superoffload_zero3_annotated.py - ZeRO-3 with detailed parameter lifecycle
- 04_zenflow_zero2_annotated.py - ZeRO-2 with sparse optimizer updates

**Annotated Configurations (3 files):**
- zero3_nvme_offload_annotated.json - NVMe offloading with AIO
- zero3_cpu_offload_annotated.json - CPU offloading configuration
- zero2_zenflow_annotated.json - ZenFlow sparse optimization

**Comprehensive Guides (2 files):**
- ZeRO3_Concept_to_Code.md - Maps ZeRO-3 theory to DeepSpeed source code
- Distributed_Training_Guide.md - Complete data flow for a gradient step

**Key Features:**
- Line-by-line annotations explaining distributed training mechanics
- Explicit mapping to DeepSpeed source code (stage3.py, partition_parameters.py)
- Memory breakdown examples and performance comparisons
- Communication pattern diagrams and optimization strategies
- Detailed explanation of All-Gather and Reduce-Scatter operations
- Parameter lifecycle through forward/backward/optimizer steps

All materials are placed in the claude_tutorials/ directory for easy access.
Commit: `5599f42`
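The "memory breakdown examples" these annotations describe follow the standard ZeRO model-state accounting: per parameter, 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of mixed-precision Adam state, with each ZeRO stage partitioning one more of those buffers across GPUs. A minimal sketch of that arithmetic (not code from the tutorials themselves):

```python
def zero_model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory for mixed-precision Adam
    under ZeRO: 2 bytes/param fp16 weights, 2 bytes/param fp16 grads,
    12 bytes/param optimizer state (fp32 copy + momentum + variance)."""
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params
    if stage == 0:                       # full replica on every GPU
        return p + g + o
    if stage == 1:                       # optimizer states partitioned
        return p + g + o / num_gpus
    if stage == 2:                       # + gradients partitioned
        return p + (g + o) / num_gpus
    if stage == 3:                       # + parameters partitioned
        return (p + g + o) / num_gpus
    raise ValueError(f"unknown ZeRO stage: {stage}")

# 7B params on 8 GPUs: ~104 GiB of model states per GPU fully
# replicated, versus ~13 GiB under ZeRO-3 partitioning.
gib = 1024 ** 3
replicated = zero_model_state_bytes_per_gpu(7e9, 8, 0) / gib
zero3 = zero_model_state_bytes_per_gpu(7e9, 8, 3) / gib
```

Activations, temporary buffers, and fragmentation come on top of these figures, which is why the annotated configs also cover offloading.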
## Commits on Nov 20, 2025

### Add remaining 4 annotated DeepSpeed training scripts
Complete the full set of 8 annotated training examples:

**Script 5 - DeepSpeed-Chat SFT:**
- Production RLHF pipeline (Step 1: Supervised Fine-Tuning)
- Dynamic DeepSpeed config generation pattern
- LoRA integration for parameter-efficient training
- Conditional optimizer selection (CPU vs GPU)
- ZeRO-3 model saving utilities
- Distributed evaluation with metric aggregation

**Script 6 - Domino + Megatron:**
- 3D parallelism (tensor + pipeline + data)
- Megatron-LM integration with DeepSpeed
- Tensor parallelism within nodes (NVLink)
- Pipeline parallelism across nodes (InfiniBand)
- Interleaved pipeline scheduling
- Communication group explanations
- Record GPT-3 training implementation

**Script 7 - Tensor Parallelism:**
- Tensor parallelism with the transformers library
- ZeRO-1 + tensor parallel combination
- Layer-wise model splitting
- All-Reduce communication patterns
- Comparison with data parallelism
- Optimal configuration guidelines

**Script 8 - Bing BERT:**
- Production-scale BERT pre-training (44-minute record)
- Gradient accumulation boundaries
- Custom dataset provider with prefetching
- Multi-phase training strategy (128→512 tokens)
- LAMB optimizer for large batches
- Production monitoring and checkpointing
- 1024-GPU scaling patterns

All scripts include:
- Line-by-line annotations of distributed mechanisms
- Communication pattern diagrams
- Memory breakdown examples
- Production best practices
- Usage examples and configurations

Total: 8 comprehensive annotated scripts covering all major DeepSpeed features and production patterns.
Commit: `4d2975b`
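The "dynamic DeepSpeed config generation pattern" mentioned for the SFT script means building the config dict at runtime from CLI arguments instead of shipping a static JSON file. A hedged sketch of the idea, using real DeepSpeed config keys but a hypothetical helper name (this is not DeepSpeed-Chat's exact helper):

```python
def make_ds_config(stage, offload, micro_batch, grad_accum, world_size):
    """Build a DeepSpeed config dict at runtime so batch sizes and
    offload targets follow command-line arguments.
    Illustrative sketch of the pattern, not DeepSpeed-Chat's exact code."""
    device = "cpu" if offload else "none"
    return {
        # DeepSpeed requires: train_batch_size ==
        # micro_batch * gradient_accumulation_steps * world_size
        "train_batch_size": micro_batch * grad_accum * world_size,
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": stage,
            "offload_param": {"device": device},      # stage 3 only
            "offload_optimizer": {"device": device},
        },
    }

cfg = make_ds_config(stage=3, offload=True, micro_batch=4,
                     grad_accum=8, world_size=16)
```

The resulting dict can be passed straight to `deepspeed.initialize(config=...)`, which is what makes conditional choices such as CPU-vs-GPU optimizers easy to express.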
### Add Tier 1: Benchmarking Suite, Troubleshooting, and Migration Guides
This commit adds comprehensive tooling and documentation:

**Benchmarking Suite** (claude_tutorials/benchmarks/):
- zero_stage_comparison.py: Benchmark ZeRO stages 0-3 with detailed metrics
- offload_comparison.py: Compare CPU and NVMe offloading strategies
- README.md: Complete guide to interpreting benchmarks and choosing an optimal config

**Troubleshooting Guide** (claude_tutorials/guides/):
- Troubleshooting_Guide.md: 20 common issues with detailed solutions
- OOM errors, NCCL timeouts, NaN losses, checkpoint issues
- Configuration errors, multi-node problems, offloading issues
- Debugging tools and a quick-reference table

**Migration Guides** (claude_tutorials/migrations/):
- Migration_from_PyTorch_DDP.md: Migrate from PyTorch DDP to DeepSpeed
- Migration_from_HF_Trainer.md: Enable DeepSpeed in the HuggingFace Trainer
- Migration_from_FSDP.md: Migrate from PyTorch FSDP to DeepSpeed

Each migration guide includes:
- Side-by-side code comparisons
- Feature mapping tables
- A step-by-step migration checklist
- Common issues and solutions
- Performance benchmarks

Total additions: ~8,000 lines of documentation and tools
Commit: `9f05803`
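The core metric the benchmark scripts compare across ZeRO stages is training throughput. A generic timing harness in this spirit (an illustrative sketch, not the scripts' actual code; the real scripts would also record peak GPU memory, e.g. via `torch.cuda.max_memory_allocated`):

```python
import time

def measure_throughput(step_fn, num_steps, batch_size, warmup=3):
    """Time a training-step callable and return samples/sec.
    Illustrative benchmarking sketch; `step_fn` stands in for one
    forward/backward/optimizer step."""
    for _ in range(warmup):
        step_fn()                      # exclude JIT/allocator warmup
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_steps * batch_size / elapsed

# Usage: compare the same step_fn under different DeepSpeed configs
samples_per_sec = measure_throughput(lambda: time.sleep(0.001),
                                     num_steps=10, batch_size=32)
```

In real distributed runs the timing should also wrap a `torch.cuda.synchronize()` (or a barrier) so asynchronous kernel launches are not undercounted.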
## Commits on Nov 21, 2025

### Add Tier 2: Advanced Features, Tools, and Visual Guide
This commit adds comprehensive advanced tutorials and automation tools:

**Advanced Feature Guides** (claude_tutorials/guides/):
- MoE_Tutorial.md: Complete Mixture of Experts training guide (1,500 lines)
  - Expert Parallelism (EP) implementation and optimization
  - Load balancing strategies and capacity tuning
  - Switch Transformer and GPT-MoE examples
- Compression_Tutorial.md: Gradient compression for multi-node training (1,200 lines)
  - 1-bit Adam and 1-bit LAMB optimizers
  - 8-bit quantization techniques
  - Communication reduction (32× compression)
- Inference_Optimization.md: DeepSpeed-Inference guide (1,500 lines)
  - Kernel injection and fusion
  - INT8/FP16 quantization
  - Tensor parallelism for inference
  - Production deployment patterns
- Custom_Kernels.md: Writing CUDA kernels for DeepSpeed (1,000 lines)
  - OpBuilder system for JIT compilation
  - Kernel fusion and optimization techniques
  - Memory coalescing and shared memory
  - Tensor Core utilization
- Visual_Guide.md: Architecture diagrams and visualizations (800 lines)
  - ASCII diagrams of ZeRO stages 0-3
  - Memory layout comparisons
  - Communication pattern visualizations
  - Pipeline and tensor parallelism diagrams

**Configuration Tools** (claude_tutorials/tools/):
- config_generator.py: Interactive config generator (600 lines)
  - CLI tool for generating optimized DeepSpeed configs
  - Automatic ZeRO stage selection based on model size
  - Memory requirement estimation
  - Command-line and interactive modes
- config_optimizer.py: Auto-tuning via benchmarks (500 lines)
  - Automated configuration optimization
  - Grid search over ZeRO stages, batch sizes, and communication settings
  - Performance tracking and best-config selection
  - Goal-based optimization (speed/memory/balanced)

Total additions: 7 files, ~6,100 lines

These tutorials cover advanced DeepSpeed features for production deployment, research experimentation, and performance optimization.
Commit: `de4cf08`
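The "automatic ZeRO stage selection based on model size" that config_generator.py performs can be approximated as: pick the lowest stage whose estimated model-state footprint fits in GPU memory. An illustrative heuristic in that spirit (not the tool's exact logic), again using 16 bytes/param for mixed-precision Adam:

```python
def pick_zero_stage(num_params, num_gpus, gpu_mem_gib, headroom=0.8):
    """Return the lowest ZeRO stage whose estimated per-GPU model-state
    memory fits the budget, or '3+offload' if none does.
    `headroom` reserves a fraction for activations and fragmentation.
    Illustrative heuristic only."""
    budget = gpu_mem_gib * 1024**3 * headroom
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params
    per_gpu = {
        0: p + g + o,                  # fully replicated
        1: p + g + o / num_gpus,       # optimizer states sharded
        2: p + (g + o) / num_gpus,     # + gradients sharded
        3: (p + g + o) / num_gpus,     # + parameters sharded
    }
    for stage in (0, 1, 2, 3):
        if per_gpu[stage] <= budget:
            return stage
    return "3+offload"                 # fall back to CPU/NVMe offload

# 1.3B params on 8×40 GiB GPUs fits without partitioning;
# 13B on the same hardware needs ZeRO-3.
small = pick_zero_stage(1.3e9, 8, 40)
large = pick_zero_stage(13e9, 8, 40)
```

A real tool would also weigh communication cost, since each higher stage trades memory for extra collectives.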
## Commits on Nov 22, 2025

### Add Tier 3: Multi-node setup, cost optimization, and model configs
Tier 3 Progress (Part 1/2):
- Multi-node training guide with SSH, NCCL, SLURM setup
- Cost optimization strategies across cloud providers
- Cost calculator tool for training estimates
- 13 production-ready model configurations:
  - LLaMA: 7B, 13B, 70B, LoRA fine-tuning
  - GPT: GPT-2, GPT-J 6B, GPT-NeoX 20B
  - BERT: Base fine-tuning, Large pre-training
  - T5: Small, Base, Large, XL configurations

Files: 16, Lines: 2,847
Commit: `f2fc312`
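The cost-calculator idea reduces to GPU-hours times an hourly rate, optionally discounted for spot or preemptible capacity. A minimal sketch of that arithmetic (the function name and the rate in the example are assumptions, not the tool's actual interface or current cloud pricing):

```python
def training_cost_usd(num_gpus, hours, price_per_gpu_hour, spot_discount=0.0):
    """Estimate cloud training cost in USD: GPU count × wall-clock hours
    × hourly per-GPU rate, with an optional spot/preemptible discount
    expressed as a fraction (0.6 = 60% off). Illustrative sketch."""
    return num_gpus * hours * price_per_gpu_hour * (1.0 - spot_discount)

# e.g. 64 GPUs for 48 hours at an assumed $2.50/GPU-hour:
on_demand = training_cost_usd(64, 48, 2.50)             # 7680.0
spot = training_cost_usd(64, 48, 2.50, spot_discount=0.6)  # 3072.0
```

Spot savings come with preemption risk, which is why such estimates pair naturally with the checkpointing guidance elsewhere in these tutorials.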
### Complete Tier 3: Framework comparisons and model-specific guide
Tier 3 Progress (Part 2/2 - FINAL):
- Model-Specific Configuration Guide (1,301 lines)
  - Comprehensive guide for using 13 production-ready configs
  - Covers LLaMA, GPT, BERT, T5 models
  - Includes customization tips and troubleshooting
- Framework Comparison Guides (3,176 lines total):
  - DeepSpeed vs PyTorch FSDP (1,288 lines)
  - DeepSpeed vs Megatron-LM (984 lines)
  - DeepSpeed vs HF Accelerate (904 lines)
  - Performance benchmarks, code examples, use cases
- Framework Comparison Tool (720 lines):
  - Benchmark DeepSpeed, FSDP, Accelerate
  - Measure throughput, memory, scaling efficiency
  - Generate comparison tables and reports

Files: 5, Lines: 5,197
Total Tier 3: 22 files, 8,044 lines

COMPLETE PROJECT SUMMARY:
- Tier 0: 14 files, 6,002 lines ✅
- Tier 1: 7 files, 5,492 lines ✅
- Tier 2: 7 files, 5,467 lines ✅
- Tier 3: 22 files, 8,044 lines ✅

GRAND TOTAL: 50 files, 25,005 lines
Commit: `9dddfb6`
### Merge pull request #1 from TamerSoliman/claude/deepspeed-zero-mapping-01DbSwtx6qb4Nd7MLASQo99o
Commit: `9a0083b`