
GELU (Gaussian Error Linear Unit)

Discover how the GELU activation function enhances transformer models like GPT-4, boosting gradient flow, stability, and efficiency.

GELU (Gaussian Error Linear Unit) is a type of activation function commonly used in modern neural networks, particularly in Transformer architectures. Proposed by Dan Hendrycks and Kevin Gimpel in the paper "Gaussian Error Linear Units (GELUs)", it aims to combine properties from dropout, zoneout, and ReLU (Rectified Linear Unit) to improve model performance. Unlike ReLU, which sharply cuts off negative values, GELU provides a smoother curve, weighting inputs by their value rather than gating them solely by their sign.

How GELU Works

The GELU function modulates the input based on its value, effectively deciding how strongly to "activate" a neuron. It multiplies the input by the standard Gaussian cumulative distribution function (CDF) evaluated at that input: GELU(x) = x · Φ(x). Intuitively, the larger the input, the more likely it is to be passed through largely unchanged, while increasingly negative inputs are pushed toward zero; near zero the transition is smooth rather than the hard cutoff used by ReLU. This probabilistic weighting can be viewed as a deterministic counterpart to the stochastic regularization of dropout, determined by the input value itself, and yields a non-linear function that can capture more complex patterns in data.
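As a concrete reference, the short sketch below implements GELU in plain Python in its two common forms: the exact definition using the Gaussian CDF (via the error function) and the tanh approximation given in the original paper. The function names here are illustrative; deep learning frameworks ship their own optimized implementations.

    import math

    def gelu_exact(x: float) -> float:
        # Exact GELU: x * Phi(x), where Phi is the standard Gaussian CDF.
        return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def gelu_tanh(x: float) -> float:
        # Tanh approximation of GELU from the original paper.
        return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

    for x in (-3.0, -1.0, -0.1, 0.0, 1.0, 3.0):
        print(f"x = {x:+.1f}   exact = {gelu_exact(x):+.4f}   tanh approx = {gelu_tanh(x):+.4f}")

Running this shows that large positive inputs pass through nearly unchanged, large negative inputs are driven close to zero, and the two forms agree to several decimal places.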

GELU vs. Other Activation Functions

GELU offers advantages over simpler activation functions, contributing to its adoption in state-of-the-art models:

  • ReLU: ReLU is computationally simple but can suffer from the "dying ReLU" problem, where neurons that consistently receive negative inputs output zero, get zero gradient, and stop learning. GELU's smooth curve keeps a small, non-zero gradient for moderately negative values, which can help mitigate this issue.
  • Leaky ReLU: While Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient for negative inputs, it maintains a simple linear relationship in the negative domain. GELU offers a more complex, non-linear transformation.
  • SiLU (Swish): SiLU (Sigmoid Linear Unit) is another smooth activation function that often performs similarly to GELU. The choice between GELU and SiLU can depend on the specific architecture and dataset, and is often settled by empirical testing or hyperparameter tuning; the sketch after this list compares how these functions treat the same inputs.
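To make the comparison concrete, the following sketch (assuming PyTorch is installed) applies ReLU, Leaky ReLU, SiLU, and GELU to the same handful of inputs, so the different treatments of negative values are visible side by side.

    import torch
    import torch.nn.functional as F

    x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 1.0, 3.0])

    # Each activation applied to the same inputs; note how they differ for negatives.
    print("input     ", x.tolist())
    print("ReLU      ", F.relu(x).tolist())
    print("LeakyReLU ", F.leaky_relu(x, negative_slope=0.01).tolist())
    print("SiLU      ", F.silu(x).tolist())
    print("GELU      ", F.gelu(x).tolist())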

Applications and Significance

GELU has become a popular choice in many advanced deep learning models due to its strong empirical performance:

  1. Natural Language Processing (NLP): It is widely used in Transformer-based models like BERT and GPT models, contributing to their success in tasks such as text generation and natural language understanding.
  2. Computer Vision: GELU is also found in Vision Transformers (ViT) and subsequent vision models. For instance, components like the Generalized Efficient Layer Aggregation Network (GELAN) used in Ultralytics YOLOv9 employ GELU to enhance feature extraction and improve accuracy in object detection tasks, as detailed in the YOLOv9 paper.

The function's ability to provide smooth non-linearity and incorporate the input value into activation decisions makes it effective for training deep networks. While slightly more computationally intensive than ReLU, its performance benefits often justify its use in large-scale models, and ready-made implementations are available in frameworks like PyTorch and TensorFlow.
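To illustrate where GELU typically appears in practice, here is a minimal sketch of a Transformer-style feed-forward block built with PyTorch's nn.GELU. The class name TransformerMLP and the layer sizes are illustrative choices, not taken from any particular model.

    import torch
    import torch.nn as nn

    class TransformerMLP(nn.Module):
        """Minimal Transformer feed-forward block using GELU (illustrative only)."""

        def __init__(self, d_model: int = 256, d_hidden: int = 1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),               # smooth non-linearity between the two projections
                nn.Linear(d_hidden, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    block = TransformerMLP()
    tokens = torch.randn(2, 10, 256)  # (batch, sequence length, d_model)
    print(block(tokens).shape)        # torch.Size([2, 10, 256])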
