Glossary

Confidence

Define AI confidence scores. Learn how models measure prediction certainty, how to set confidence thresholds, and how to distinguish confidence from accuracy.

Confidence, in the context of Artificial Intelligence (AI) and Machine Learning (ML), represents a score assigned by a model to its prediction, indicating how certain the model is about that specific output. For tasks like object detection or image classification, each detected object or assigned class label comes with a confidence score, typically ranging from 0 to 1 (or 0% to 100%). This score helps users gauge the reliability of individual predictions made by models such as Ultralytics YOLO. A higher score suggests the model is more certain about its prediction based on the patterns learned during training. Understanding confidence is crucial for interpreting model outputs and making informed decisions based on AI predictions, especially in safety-critical applications like AI in automotive solutions.

How Confidence Is Determined

Confidence scores are usually derived from the output layer of a neural network (NN). For classification tasks, this often involves applying an activation function like Softmax or Sigmoid to the raw outputs (logits) to produce probability-like values for each class. In object detection models like YOLO, the confidence score might combine the probability of an object being present in a proposed bounding box (often called an "objectness score") and the probability of that object belonging to a specific class, conditioned on an object being present. It's a key output used during the inference process to assess the validity of detections. This score is calculated based on the model weights learned from datasets like COCO.
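
As a rough illustration, the minimal NumPy sketch below converts hypothetical raw logits into probability-like scores with softmax and then combines a made-up objectness score with the top class probability, mirroring the detector-style confidence described above; none of the values or names come from a real model.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw logits into probability-like values that sum to 1."""
    exps = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exps / exps.sum()

# Hypothetical raw outputs (logits) from a 3-class classifier head.
logits = np.array([2.1, 0.3, -1.2])
class_probs = softmax(logits)
predicted_class = int(class_probs.argmax())
confidence = float(class_probs.max())  # confidence score of the top class
print(f"class {predicted_class} with confidence {confidence:.2f}")

# In a detector, a box's final score often multiplies an objectness score
# (how likely any object is present) by the conditional class probability.
objectness = 0.9  # made-up objectness score for one proposed box
box_confidence = objectness * class_probs[predicted_class]
print(f"box confidence: {box_confidence:.2f}")
```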

Confidence Thresholds

In practice, not all predictions from a model are equally useful or reliable. Predictions with very low confidence scores often represent background noise or uncertain classifications. To filter these out, a "confidence threshold" is typically applied. This is a user-defined value (e.g., 0.5 or 50%); only predictions with a confidence score above this threshold are considered valid outputs. Setting an appropriate threshold is vital and often depends on the specific application:

  • High-Recall Scenarios: In applications like medical image analysis for screening, a lower threshold might be used initially to minimize the chance of missing potential findings (high recall), even if it means more false positives that require human review. AI in healthcare often involves careful threshold tuning.
  • High-Precision Scenarios: In applications like autonomous driving or quality control in AI in manufacturing, a higher threshold is preferred to ensure that actions are taken only based on highly certain predictions (high precision), reducing the risk of errors. AI safety research emphasizes robust decision-making.

The confidence threshold often works in conjunction with techniques like Non-Maximum Suppression (NMS) to refine the final set of detections by removing overlapping bounding boxes for the same object. You can easily configure this threshold when using Ultralytics models via the command-line interface (CLI) or Python API. Finding the optimal threshold may involve hyperparameter tuning.
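
As a minimal sketch of the Python route (assuming the `ultralytics` package is installed; the weights file and image path below are placeholders), the threshold is passed as the `conf` argument at prediction time:

```python
from ultralytics import YOLO

# Load a pretrained model (placeholder weights file).
model = YOLO("yolo11n.pt")

# Run inference, keeping only detections whose confidence is at least 0.5.
results = model.predict("path/to/image.jpg", conf=0.5)

# Inspect the confidence score attached to each surviving detection.
for box in results[0].boxes:
    print(f"class={int(box.cls)} confidence={float(box.conf):.2f}")
```

The equivalent CLI call would look something like `yolo predict model=yolo11n.pt source=path/to/image.jpg conf=0.5`.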

Real-World Applications

Confidence scores are fundamental in deploying AI models responsibly and effectively:

  1. Medical Diagnosis Support: In systems analyzing medical scans (like X-rays or MRIs) for potential anomalies (like tumor detection), the confidence score helps prioritize cases. A prediction with low confidence might indicate an ambiguous finding requiring closer examination by a radiologist, while high-confidence predictions can streamline the review process. Research in Radiology AI often discusses confidence levels.
  2. Autonomous Systems: For self-driving cars or robotics, confidence scores are critical for safety. A detection of a pedestrian or another vehicle (learn about Waymo's approach) must meet a high confidence threshold before the system initiates an action like braking or swerving. Low-confidence detections might be ignored or trigger less critical alerts. This ensures the system acts decisively only when certain.

Confidence vs. Other Metrics

It's important not to confuse the confidence score of an individual prediction with overall model evaluation metrics. While related, they measure different aspects of performance, as the short sketch after the list below illustrates:

  • Accuracy: Measures the overall percentage of correct predictions across the entire dataset. It provides a general sense of model performance but doesn't reflect the certainty of individual predictions. A model can have high accuracy but still make some predictions with low confidence.
  • Precision: Indicates the proportion of positive predictions that were actually correct (True Positives / (True Positives + False Positives)). High precision means fewer false alarms. Confidence reflects the model's belief in its prediction, which might or might not align with correctness.
  • Recall (Sensitivity): Measures the proportion of actual positive instances that the model correctly identified (True Positives / (True Positives + False Negatives)). High recall means fewer missed detections. Confidence doesn't directly relate to how many actual positives were found.
  • F1-Score: The harmonic mean of Precision and Recall, providing a single metric that balances both. Confidence remains a prediction-level score.
  • Mean Average Precision (mAP): A common metric in object detection that summarizes the precision-recall curve across different confidence thresholds and classes. While mAP calculation involves confidence thresholds, the confidence score itself applies to each individual detection.
  • Calibration: Refers to how well the confidence scores align with the actual probability of correctness. A well-calibrated model's predictions with 80% confidence should be correct about 80% of the time. Confidence scores from models are not always inherently well-calibrated (see research on calibration).
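
To make the distinction concrete, the short sketch below computes precision, recall, and F1 from hypothetical detection counts and then runs a toy calibration check inside a single confidence bin; every number is invented purely for illustration.

```python
import numpy as np

# Hypothetical counts for one class on a validation set.
tp, fp, fn = 80, 20, 10

precision = tp / (tp + fp)                          # TP / (TP + FP)
recall = tp / (tp + fn)                             # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Toy calibration check: for predictions whose confidence falls in the 0.7-0.8 bin,
# a well-calibrated model should be correct roughly 70-80% of the time.
confidences = np.array([0.72, 0.75, 0.78, 0.71, 0.79, 0.74])  # made-up confidence scores
correct = np.array([1, 1, 0, 1, 1, 0])                        # 1 = the prediction was right
print(f"mean confidence={confidences.mean():.2f} observed accuracy={correct.mean():.2f}")
```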

In summary, confidence is a valuable output for assessing the certainty of individual AI predictions, enabling better filtering, prioritization, and decision-making in real-world applications. It complements, but is distinct from, metrics that evaluate a model's overall performance, such as those tracked in Ultralytics HUB.
