Glossary

GPT-4

Explore GPT-4, OpenAI's advanced multimodal AI model, which excels at combined text-and-vision tasks, complex reasoning, and real-world applications in fields such as healthcare and education.


GPT-4 (Generative Pre-trained Transformer 4) is a large multimodal model created by OpenAI, representing a significant advancement in the field of Artificial Intelligence (AI). As the successor to GPT-3, GPT-4 demonstrates enhanced capabilities in understanding and generating human-like text, solving complex problems with improved reasoning, and exhibiting greater creativity. A key distinction from its predecessors is that GPT-4 is a Multi-modal Model, meaning it can accept both text and image inputs, allowing for richer interactions and a broader range of applications in Machine Learning (ML).

Core Concepts and Architecture

GPT-4, like other models in the GPT series, is built upon the Transformer architecture. This architecture, introduced in the influential paper "Attention Is All You Need", heavily relies on self-attention mechanisms. These mechanisms allow the model to weigh the importance of different words (or tokens) within an input sequence, enabling it to effectively capture long-range dependencies and context in text. GPT-4 was trained using vast amounts of data scraped from the internet and licensed data sources, encompassing both text and images. While specific details about its architecture size (number of parameters) and the exact training dataset remain proprietary, the GPT-4 Technical Report documents its significantly improved performance on various professional and academic benchmarks compared to earlier models. It operates as a powerful Large Language Model (LLM), capable of performing diverse language and vision-related tasks.
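The self-attention mechanism described above can be illustrated with a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer. This is a simplified single-head version for intuition, not GPT-4's actual (proprietary) implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention: each query token is
    compared against every key token, and the resulting softmax weights
    mix the value vectors into a context-aware output."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise token similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights

# Toy example: self-attention over 3 tokens with 4-dimensional embeddings
# (Q = K = V = the token embeddings themselves).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)
print(w)  # each row is a distribution over the 3 input tokens
```

Because each attention weight can connect any pair of positions directly, the model captures long-range dependencies without the step-by-step propagation a recurrent network would need.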

Key Features and Improvements

GPT-4 introduces several notable improvements over models like GPT-3:

- Multimodal input: GPT-4 accepts images alongside text, enabling tasks such as describing charts, screenshots, or photographs.
- Stronger reasoning: the GPT-4 Technical Report documents markedly better scores on professional and academic benchmarks, including a simulated bar exam.
- Longer context: GPT-4 launched with context windows of up to 8,192 and 32,768 tokens, allowing it to work with much longer documents and conversations.
- Better steerability: developers can shape the model's tone and behavior through system messages.
- Improved safety: OpenAI reports that GPT-4 is less likely to respond to disallowed requests and more likely to produce factual responses than GPT-3.5.

Real-World Applications

GPT-4 powers a diverse set of applications across various industries, often accessed via an API:

- Accessibility: Be My Eyes uses GPT-4's vision capabilities to describe a user's surroundings for blind and low-vision people.
- Education: Duolingo and Khan Academy build GPT-4-powered tutors that explain answers and converse with learners.
- Healthcare and knowledge work: organizations use GPT-4 to summarize notes, draft documents, and search internal knowledge bases, as Morgan Stanley does for its research library.
- Software development: GPT-4 assists with code generation, explanation, and debugging, for example in GitHub Copilot.
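Applications like these typically reach GPT-4 through OpenAI's Chat Completions API, where a multimodal request pairs text and image content parts in a single message. Below is a hedged sketch: the helper name `build_vision_message` is illustrative, and the commented-out call assumes the official `openai` Python client and an API key (the exact model identifier may differ):

```python
def build_vision_message(prompt: str, image_url: str) -> list[dict]:
    """Build a chat message combining a text prompt with an image,
    in the content-part format GPT-4's multimodal endpoint expects."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_vision_message(
    "What is in this image?", "https://example.com/cat.jpg"
)

# Sending the request would then look like this (requires an API key;
# model name is illustrative):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
# print(response.choices[0].message.content)
```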

GPT-4 in Context

While GPT-4 is a versatile foundation model excelling at language understanding, text generation, and basic image interpretation, it differs significantly from specialized models in fields like Computer Vision (CV). For instance, Ultralytics YOLO models, such as YOLOv8 or YOLO11, are specifically designed using Deep Learning (DL) for high-speed, accurate Object Detection, Image Segmentation, and Instance Segmentation within images or videos. GPT-4 can describe what is in an image (e.g., "There is a cat on a mat"), but YOLO models pinpoint where objects are located with precise bounding boxes or pixel-level masks, making them suitable for different computer vision tasks.

These different types of models can be highly complementary within complex AI systems. For example, a YOLO model could detect objects in a video stream, and GPT-4 could then generate descriptions or answer questions about the interactions between those detected objects. Managing the development, training, and model deployment of such combined systems can be streamlined using platforms like Ultralytics HUB or tools from communities like Hugging Face. Read more about AI advancements on the Ultralytics Blog.
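The hand-off between a detector and a language model described above can be sketched as a small glue function: structured YOLO-style detections (label plus bounding box) are rendered into a natural-language prompt that GPT-4 could then answer. The function name and the detection dictionary layout here are illustrative assumptions, not a fixed Ultralytics or OpenAI interface:

```python
def detections_to_prompt(detections: list[dict]) -> str:
    """Turn structured detector output (e.g. from a YOLO model) into a
    natural-language question for a language model like GPT-4.

    Each detection is assumed to be {"label": str, "box": (x1, y1, x2, y2)}.
    """
    if not detections:
        return "No objects were detected. Describe what might be in the scene."
    parts = [
        f"{d['label']} at box {tuple(d['box'])}" for d in detections
    ]
    return (
        "An object detector found: " + "; ".join(parts) + ". "
        "Describe the likely interactions between these objects."
    )

# Example: detections a YOLO model might produce for the cat-on-a-mat scene.
dets = [
    {"label": "cat", "box": (10, 20, 110, 220)},
    {"label": "mat", "box": (0, 180, 300, 320)},
]
print(detections_to_prompt(dets))
```

The resulting string could be sent to GPT-4 as a user message, combining YOLO's precise localization with GPT-4's descriptive reasoning.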
