Kinematics, Learning & Control

Humanoid Walking with Deep Reinforcement Learning

Bipedal Locomotion Control Using Soft Actor-Critic and DDPG Algorithms for 17-DOF Humanoid in MuJoCo Physics Simulation

20 December 2024

Project

Introduction

This project implements advanced deep reinforcement learning algorithms to achieve stable bipedal locomotion for a complex 17-DOF humanoid robot in the MuJoCo physics simulator. The system employs both Soft Actor-Critic (SAC) with automatic temperature tuning and Deep Deterministic Policy Gradient (DDPG) with Ornstein-Uhlenbeck noise exploration to learn natural walking gaits from scratch. The humanoid learns to balance, walk forward, and maintain stability for over 8 hours of continuous locomotion without falls, operating in a challenging 376-dimensional observation space that includes joint positions, velocities, and contact forces. Through careful reward shaping including standing bonuses and forward velocity incentives, the agent develops robust walking behaviors that emerge purely from trial-and-error learning without any motion capture data or hand-crafted controllers.

Objectives

  • To implement and compare SAC and DDPG algorithms for high-dimensional continuous control in bipedal locomotion

  • To develop effective reward shaping strategies that encourage stable walking without falls or unnatural gaits

  • To achieve long-duration stable walking (8+ hours) demonstrating learned balance and coordination

  • To create a robust training pipeline handling the complexity of 17-DOF control with 376-dimensional observations

  • To optimize hyperparameters including learning rates, buffer sizes, and exploration noise for efficient learning

  • To demonstrate emergent locomotion behaviors learned purely from reinforcement signals without demonstrations

Tools and Technologies

  • Programming Languages: Python

  • Deep Learning Framework: PyTorch with CUDA acceleration

  • RL Algorithms: Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG)

  • Physics Simulation: MuJoCo (Multi-Joint dynamics with Contact)

  • Environment: Gymnasium Humanoid-v5

  • Exploration Strategies: Ornstein-Uhlenbeck noise (DDPG), Stochastic policy (SAC)

  • Experience Replay: Prioritized replay buffer with 1M capacity

  • Network Architecture: Actor-Critic with separate Q-networks

  • Optimization: Adam optimizer with different learning rates for actor/critic

  • Monitoring: TensorBoard for training curves, Matplotlib for live plotting

  • Hardware: NVIDIA GPU for accelerated training

  • Version Control: Git

Source Code

Video Result

  • Humanoid Walking Demo: Trained agent demonstration showing stable bipedal locomotion

  • Training Architecture: Dual algorithm implementation with SAC achieving 2x faster convergence than DDPG baseline

  • Performance Metrics: 8+ hours continuous walking, 6000+ average reward after convergence

Process and Development

The project is structured into five critical components: environment setup and reward engineering, DDPG implementation with deterministic policy, SAC implementation with entropy regularization, hyperparameter optimization and ablation studies, and comparative performance analysis between algorithms.

Task 1: Environment Configuration and Reward Shaping

MuJoCo Integration: Configured Humanoid-v5 environment with 376-dimensional observation space including joint angles, velocities, quaternion orientation, and contact forces from feet sensors.

Reward Engineering: Developed multi-component reward function combining survival bonus (+1 per timestep), forward velocity reward (proportional to x-velocity), standing bonus (+2.0 when torso height > 1.0m), and velocity scaling (2x multiplier) to encourage upright forward locomotion.
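
As a rough illustration of this shaping, the sketch below layers the described bonuses on top of a Gymnasium wrapper; the info key, the qpos index for torso height, and the exact way the terms are combined are assumptions rather than the project's exact code.

import gymnasium as gym

class WalkingRewardWrapper(gym.Wrapper):
    """Shaped reward: survival bonus + scaled forward velocity + standing bonus (illustrative)."""

    def __init__(self, env, standing_bonus=2.0, velocity_scale=2.0, min_torso_height=1.0):
        super().__init__(env)
        self.standing_bonus = standing_bonus
        self.velocity_scale = velocity_scale
        self.min_torso_height = min_torso_height

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        forward_vel = info.get("x_velocity", 0.0)        # forward speed reported by the env
        torso_height = self.env.unwrapped.data.qpos[2]   # z-coordinate of the root body
        reward = 1.0                                      # survival bonus per timestep
        reward += self.velocity_scale * forward_vel       # forward-velocity incentive
        if torso_height > self.min_torso_height:
            reward += self.standing_bonus                 # upright-posture bonus
        return obs, reward, terminated, truncated, info

env = WalkingRewardWrapper(gym.make("Humanoid-v5"))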

Action Space Design: Implemented 17-DOF continuous control with torque limits, mapping neural network outputs through a tanh activation so that commanded joint torques stay within valid ranges and avoid motor damage.

Task 2: DDPG Algorithm Implementation

Deterministic Policy Network: Created 3-layer Actor network (400-300-17 neurons) with ReLU activations outputting continuous torque values scaled by action bounds for direct motor control.
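
A possible PyTorch rendering of this actor, following the 400-300-17 layout described above; the variable names and the 0.4 torque bound (the Gymnasium Humanoid action limit) are assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: 376-dim state -> 17 torques, squashed by tanh and scaled."""

    def __init__(self, state_dim=376, action_dim=17, max_action=0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),   # bounds raw outputs to [-1, 1]
        )
        self.max_action = max_action                 # joint torque limit

    def forward(self, state):
        return self.max_action * self.net(state)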

Critic Q-Network: Implemented state-action value function approximator concatenating 376-dim state and 17-dim action vectors, outputting single Q-value for policy gradient computation.
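
A matching critic sketch under the same assumptions, with hidden sizes mirroring the actor:

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): concatenates the 376-dim state with the 17-dim action, outputs one Q-value."""

    def __init__(self, state_dim=376, action_dim=17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))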

Ornstein-Uhlenbeck Noise: Developed temporally correlated exploration noise with theta=0.15, initial sigma=0.3 decaying to 0.05, providing smooth action perturbations suitable for continuous control.
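
A minimal version of such a noise process is sketched below; the decay rate used to anneal sigma from 0.3 toward 0.05 is an assumed value.

import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise added to the deterministic actions."""

    def __init__(self, action_dim=17, mu=0.0, theta=0.15, sigma=0.3):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(action_dim, mu)

    def reset(self):
        self.state = np.full_like(self.state, self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

    def decay(self, rate=0.999, sigma_min=0.05):
        """Anneal sigma toward its floor (decay rate is illustrative)."""
        self.sigma = max(self.sigma * rate, sigma_min)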

Task 3: SAC Algorithm Enhancement

Stochastic Policy: Implemented Gaussian policy outputting mean and log-std for each action dimension, using reparameterization trick for differentiable sampling through policy network.
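
The core of such a squashed Gaussian policy might look like the sketch below; the hidden-layer sizes and the log-std clamp are assumptions.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs mean and log-std per action; samples with the reparameterization trick."""

    def __init__(self, state_dim=376, action_dim=17, max_action=0.4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.mean = nn.Linear(256, action_dim)
        self.log_std = nn.Linear(256, action_dim)
        self.max_action = max_action

    def forward(self, state):
        h = self.trunk(state)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        x = dist.rsample()                                  # differentiable (reparameterized) sample
        action = torch.tanh(x)                              # squash to [-1, 1]
        # tanh change-of-variables correction for the log-probability
        log_prob = (dist.log_prob(x) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1, keepdim=True)
        return self.max_action * action, log_prob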

Automatic Temperature Tuning: Created a learnable entropy coefficient (alpha) with target entropy = -17 (the negative of the action dimensionality), automatically balancing exploration and exploitation without manual tuning.
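
A sketch of that temperature update, assuming a learning rate of 3e-4 and log-probabilities supplied by the stochastic policy above:

import torch

log_alpha = torch.zeros(1, requires_grad=True)           # alpha = exp(log_alpha), starts at 1.0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -17.0                                    # negative of the action dimensionality

def update_alpha(log_prob):
    """One gradient step pushing the policy entropy toward the target."""
    alpha_loss = -(log_alpha.exp() * (log_prob + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()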

Twin Q-Networks: Developed dual critic networks taking minimum Q-value to address overestimation bias, improving learning stability in high-dimensional continuous spaces.
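
The resulting clipped double-Q target could be computed roughly as follows (the discount factor and tensor shapes are assumptions):

import torch

def td_target(reward, next_state, done, policy, q1_target, q2_target, alpha, gamma=0.99):
    """Entropy-regularized bootstrap target using the minimum of the two target critics."""
    with torch.no_grad():
        next_action, next_log_prob = policy(next_state)
        q_next = torch.min(q1_target(next_state, next_action),
                           q2_target(next_state, next_action))
        return reward + gamma * (1.0 - done) * (q_next - alpha * next_log_prob)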

Task 4: Training Pipeline and Optimization

Experience Replay Buffer: Implemented circular buffer with 1M capacity storing (state, action, reward, next_state, done) tuples, enabling off-policy learning and sample efficiency.
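
A plain uniform-sampling version of such a buffer is sketched below; the prioritized variant listed under Tools would additionally track sampling priorities and importance weights.

import numpy as np

class ReplayBuffer:
    """Fixed-capacity circular buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, state_dim=376, action_dim=17, capacity=1_000_000):
        self.capacity, self.ptr, self.size = capacity, 0, 0
        self.s  = np.zeros((capacity, state_dim), dtype=np.float32)
        self.a  = np.zeros((capacity, action_dim), dtype=np.float32)
        self.r  = np.zeros((capacity, 1), dtype=np.float32)
        self.s2 = np.zeros((capacity, state_dim), dtype=np.float32)
        self.d  = np.zeros((capacity, 1), dtype=np.float32)

    def add(self, s, a, r, s2, done):
        i = self.ptr
        self.s[i], self.a[i], self.r[i], self.s2[i], self.d[i] = s, a, r, s2, float(done)
        self.ptr = (self.ptr + 1) % self.capacity         # overwrite oldest transitions when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size=256):
        idx = np.random.randint(0, self.size, size=batch_size)
        return self.s[idx], self.a[idx], self.r[idx], self.s2[idx], self.d[idx]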

Soft Target Updates: Applied exponential moving average with tau=0.005 (SAC) and tau=0.001 (DDPG) for target network updates, stabilizing temporal difference learning.
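
This amounts to a Polyak averaging step applied to every target parameter, for example:

import torch

def soft_update(target_net, online_net, tau=0.005):
    """target <- tau * online + (1 - tau) * target (tau=0.005 for SAC, 0.001 for DDPG)."""
    with torch.no_grad():
        for t_param, param in zip(target_net.parameters(), online_net.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * param)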

Batch Training Schedule: Configured a batch size of 256 with learning starting after 1,000 warm-up steps for SAC and 5,000 for DDPG, performing one gradient update per environment step after warmup.
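
Put together, the interaction schedule could be organised as in the sketch below; agent.select_action and agent.update are hypothetical interface names standing in for the networks and update rules described above.

def train(env, agent, buffer, total_steps=1_000_000, warmup_steps=1_000, batch_size=256):
    """Random warm-up phase, then one gradient update per environment step."""
    state, _ = env.reset()
    for step in range(total_steps):
        if step < warmup_steps:
            action = env.action_space.sample()        # uniform random exploration during warm-up
        else:
            action = agent.select_action(state)       # hypothetical agent interface
        next_state, reward, terminated, truncated, _ = env.step(action)
        buffer.add(state, action, reward, next_state, terminated)
        state = next_state
        if terminated or truncated:
            state, _ = env.reset()
        if step >= warmup_steps:
            agent.update(buffer.sample(batch_size))   # one update per step after warm-up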

Task 5: Performance Analysis and Validation

Convergence Metrics: Tracked episode rewards, moving averages, Q-values, and actor losses using TensorBoard, observing SAC convergence at ~20k steps versus DDPG at ~40k steps.
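
Logging of this kind typically reduces to a few SummaryWriter calls; a minimal helper, with an assumed log directory and tag names, might be:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/humanoid_sac")      # log directory is an assumption

def log_metrics(step, episode_reward, actor_loss, critic_loss, q_value):
    """Record the tracked training quantities at a given environment step."""
    writer.add_scalar("reward/episode", episode_reward, step)
    writer.add_scalar("loss/actor", actor_loss, step)
    writer.add_scalar("loss/critic", critic_loss, step)
    writer.add_scalar("q_values/mean", q_value, step)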

Stability Testing: Evaluated trained policies for continuous walking duration, achieving 8+ hours without falls for SAC while DDPG showed occasional instabilities after 2-3 hours.

Ablation Studies: Tested reward component contributions finding forward velocity bonus critical for gait development, standing bonus essential for stability, and entropy regularization improving exploration efficiency.

Results

The SAC implementation achieves stable bipedal locomotion after approximately 20,000 environment interactions, learning natural walking gaits without any human demonstrations. The agent maintains balance for over 8 hours of continuous walking with average episode rewards exceeding 6000 after convergence. Walking speed reaches 2.5 m/s forward velocity while maintaining upright posture with torso height consistently above 1.0 meters. The learned policy demonstrates robustness to initial conditions, successfully recovering from various starting poses. Comparison with DDPG shows 2x faster learning for SAC with more stable long-term performance. The entropy-regularized SAC policy exhibits more natural motion with smoother joint trajectories compared to the deterministic DDPG policy.

Key Insights

  • Entropy Regularization Advantage: SAC's automatic temperature adjustment eliminates manual tuning while maintaining optimal exploration-exploitation balance throughout training.

  • Reward Shaping Criticality: Standing bonus prevents early convergence to crawling behaviors while forward velocity scaling encourages efficient gaits over simple survival.

  • Exploration Noise Impact: Ornstein-Uhlenbeck noise in DDPG provides smoother exploration than Gaussian noise but still underperforms SAC's principled stochastic policy.

  • Sample Efficiency: Both algorithms require approximately 1M total environment steps for robust policies, with SAC achieving stable gaits 2x faster.

  • Emergent Behaviors: Natural arm swinging and weight shifting emerge without explicit programming, demonstrating the power of end-to-end reinforcement learning.

Future Work

  • Curriculum Learning: Implement progressive difficulty increase starting with balance, then stepping, finally continuous walking to accelerate training

  • Domain Randomization: Add variations in mass distribution, joint friction, and ground properties for sim-to-real transfer

  • Hierarchical Control: Develop high-level navigation policy commanding low-level walking controller for goal-directed locomotion

  • Multi-Task Learning: Train single policy for walking, running, turning, and climbing stairs using task-conditioned rewards

  • Model-Based Enhancement: Integrate learned dynamics models for planning and reduced sample complexity

  • Real Robot Deployment: Transfer learned policies to physical humanoid robots using domain adaptation techniques

Ritwik Rohan

A Robotics Developer
