Glossary

GELU (Gaussian Error Linear Unit)

Discover how the GELU activation function enhances transformer models like GPT-4, boosting gradient flow, stability, and efficiency.

The Gaussian Error Linear Unit, or GELU, is a high-performing activation function widely used in modern neural networks (NN), particularly in transformer models. Proposed in the paper "Gaussian Error Linear Units (GELUs)" by Dan Hendrycks and Kevin Gimpel, GELU introduces a probabilistic approach to neuron activation, departing from the hard, sign-based gating of functions like ReLU. It weights inputs by their value rather than simply gating them by sign, effectively combining properties from dropout, zoneout, and ReLU.

How GELU Works

GELU determines a neuron's output by multiplying the input by the standard Gaussian cumulative distribution function (CDF) evaluated at that input: GELU(x) = x · Φ(x). Although motivated by a stochastic regularizer, the function itself is deterministic; it weights each input by the probability that a standard normal variable falls below it. Unlike ReLU, which sharply cuts off negative values, GELU provides a smoother curve: large positive inputs pass through almost unchanged, strongly negative inputs are suppressed toward zero, and inputs near zero are partially attenuated. This smooth, probabilistic weighting allows for richer representations and potentially better gradient flow during backpropagation, which is crucial for training deep networks.
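As a concrete illustration, here is a minimal sketch in plain Python (the function name and sample values are just for demonstration) that evaluates the exact form GELU(x) = x · Φ(x) using the standard library's error function:

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Large positive inputs pass through almost unchanged, strongly negative
# inputs are suppressed toward zero, and inputs near zero are attenuated.
for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu(x):+.4f}")
```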

Comparison with Other Activation Functions

GELU offers distinct characteristics compared to other common activation functions:

  • ReLU (Rectified Linear Unit): ReLU is computationally simple (output is the input if positive, zero otherwise). GELU is smoother and non-monotonic (it can decrease as the input increases for negative values), which can sometimes help capture more complex patterns. However, GELU is more computationally intensive than ReLU.
  • Sigmoid and Tanh: These functions squash inputs into a fixed range (0 to 1 for Sigmoid, -1 to 1 for Tanh). While useful in certain contexts (like output layers for probabilities), they can suffer from the vanishing gradient problem in deep networks. GELU, like ReLU, does not have an upper bound, mitigating this issue for positive values.
  • SiLU (Sigmoid Linear Unit) / Swish: SiLU is another smooth, non-monotonic activation function that multiplies the input by its sigmoid. It shares similarities with GELU in terms of shape and performance, often considered a close alternative. Both have shown strong empirical results.
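These differences are easiest to see numerically. The sketch below (plain Python, with illustrative helper names) evaluates ReLU, Tanh, SiLU, and GELU at a few sample inputs; note GELU's small negative dip around x ≈ -0.75, which is where its non-monotonicity appears:

```python
import math

def relu(x): return max(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def silu(x): return x * sigmoid(x)          # also known as Swish
def gelu(x): return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# GELU's non-monotonic dip is visible near x = -0.75, where its output is
# slightly negative before returning toward zero for more negative inputs.
print(f"{'x':>6} {'ReLU':>8} {'Tanh':>8} {'SiLU':>8} {'GELU':>8}")
for x in (-2.0, -1.0, -0.75, -0.25, 0.0, 0.5, 2.0):
    print(f"{x:>6.2f} {relu(x):>8.4f} {math.tanh(x):>8.4f} "
          f"{silu(x):>8.4f} {gelu(x):>8.4f}")
```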

Advantages of GELU

  • Smoothness: Its smooth curve allows for better gradient descent dynamics than ReLU's sharp corner at zero.
  • Non-Monotonicity: Allows for more complex function approximation.
  • Probabilistic Interpretation: Can be viewed as the expected value of stochastically gating the input with probability given by the Gaussian CDF, so the chance of keeping an input grows with its value.
  • State-of-the-Art Performance: Frequently used in top-performing models, especially transformers.
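To make the smoothness point concrete, the following sketch (plain Python, illustrative helper names) compares the analytic GELU derivative, Φ(x) + x·φ(x), with ReLU's step-like gradient near zero:

```python
import math

def std_normal_pdf(x): return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
def std_normal_cdf(x): return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    """Analytic GELU derivative: Phi(x) + x * phi(x); smooth everywhere."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)

def relu_grad(x):
    """ReLU derivative: a hard step, 0 for negative inputs and 1 for positive."""
    return 0.0 if x < 0 else 1.0

# Near zero, GELU's gradient changes gradually while ReLU's jumps from 0 to 1.
for x in (-1.0, -0.1, 0.0, 0.1, 1.0):
    print(f"x={x:+.1f}  GELU'={gelu_grad(x):+.4f}  ReLU'={relu_grad(x):+.1f}")
```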

Disadvantages and Considerations

  • Computational Cost: Calculating the Gaussian CDF is more expensive than the simple operations in ReLU. Efficient approximations are often used in practice (see the sketch after this list).
  • Complexity: Slightly more complex to understand and implement from scratch compared to simpler functions like ReLU.
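One widely used approximation is the tanh-based form given in the original GELU paper. The sketch below (plain Python, illustrative names) compares it against the exact erf-based definition at a few points:

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh-based GELU approximation from the original paper,
    widely used in BERT/GPT-style implementations."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# The approximation tracks the exact function closely over typical input ranges.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  tanh={gelu_tanh(x):+.5f}")
```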

Applications and Significance

GELU has become a popular choice in many advanced deep learning models due to its strong empirical performance, most notably in transformer architectures such as BERT and the GPT family.

The function's ability to provide smooth non-linearity and weight activations by the input's value makes it effective for training deep networks. While slightly more computationally intensive than ReLU, its performance benefits often justify its use in large-scale models, and it is available out of the box in frameworks like PyTorch and TensorFlow. You can explore various models and train them using tools like Ultralytics HUB.
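As a usage sketch, assuming a recent PyTorch release (the approximate="tanh" argument and the layer sizes are illustrative choices, not requirements), a transformer-style feed-forward block with GELU might look like this:

```python
import torch
import torch.nn as nn

# A transformer-style feed-forward block using GELU, as a minimal sketch.
# The layer sizes are arbitrary; approximate="tanh" selects the faster
# tanh-based form available in recent PyTorch releases.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.GELU(approximate="tanh"),
    nn.Linear(2048, 512),
)

x = torch.randn(4, 512)   # a batch of 4 token embeddings
print(ffn(x).shape)       # torch.Size([4, 512])
```

TensorFlow offers a comparable built-in (for example, tf.keras.activations.gelu).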
