Glossary

Confidence

Learn how AI confidence scores are defined, how models measure the certainty of their predictions, how to set confidence thresholds, and how to distinguish confidence from accuracy.

Confidence, in the context of Artificial Intelligence (AI) and Machine Learning (ML), represents a score assigned by a model to its prediction, indicating how certain the model is about that specific output. For tasks like object detection or image classification, each detected object or assigned class label comes with a confidence score, typically ranging from 0 to 1 (or 0% to 100%). This score helps users gauge the reliability of individual predictions made by models such as Ultralytics YOLO. A higher score suggests the model is more certain about its prediction based on the patterns learned during training. Understanding confidence is crucial for interpreting model outputs and making informed decisions based on AI predictions, especially in safety-critical applications like AI in automotive solutions.

How Confidence Is Determined

Confidence scores are usually derived from the output layer of a neural network (NN). For classification tasks, this often involves applying an activation function like Softmax or Sigmoid to the raw outputs (logits) to produce probability-like values for each class. In object detection models like YOLO, the confidence score might combine the probability of an object being present in a proposed bounding box (often called an "objectness score") and the probability of that object belonging to a specific class, conditioned on an object being present. It's a key output used during the inference process to assess the validity of detections. This score is calculated based on the model weights learned from datasets like COCO.
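The following is a minimal NumPy sketch of the two ideas above: applying Softmax to raw classification logits, and combining a detection-style objectness score with a conditional class probability. The numeric values and variable names are illustrative only and are not taken from any particular model.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits into probability-like values that sum to 1."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

# Classification: raw outputs (logits) for three classes -> per-class confidence
logits = np.array([2.1, 0.3, -1.2])
class_probs = softmax(logits)
print(class_probs, class_probs.argmax())  # the largest value is the confidence of the predicted class

# Detection-style score: objectness * conditional class probability (illustrative numbers)
objectness = 0.91          # P(an object is present in this box)
class_conditional = 0.85   # P(class = "person" | object present)
confidence = objectness * class_conditional
print(confidence)          # ~0.77, the score later compared against the confidence threshold
```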

Confidence Thresholds

In practice, not all predictions from a model are equally useful or reliable. Predictions with very low confidence scores often represent background noise or uncertain classifications. To filter these out, a "confidence threshold" is typically applied. This is a user-defined value (e.g., 0.5 or 50%); only predictions with a confidence score above this threshold are considered valid outputs. Setting an appropriate threshold is vital and often depends on the specific application:

  • High-Recall Scenarios: In applications like medical image analysis for screening, a lower threshold might be used initially to minimize the chance of missing potential findings (high recall), even if it means more false positives that require human review. AI in healthcare often involves careful threshold tuning.
  • High-Precision Scenarios: In applications like autonomous driving or quality control in AI in manufacturing, a higher threshold is preferred to ensure that actions are taken only based on highly certain predictions (high precision), reducing the risk of errors. AI safety research emphasizes robust decision-making.

The confidence threshold often works in conjunction with techniques like Non-Maximum Suppression (NMS) to refine the final set of detections by removing overlapping bounding boxes for the same object. You can easily configure this threshold when using Ultralytics models via the command-line interface (CLI) or Python API. Finding the optimal threshold may involve hyperparameter tuning.
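As a sketch of how this looks in practice, the snippet below uses the Ultralytics Python API to run inference with a confidence threshold of 0.5 and an NMS IoU threshold of 0.7; the checkpoint name and image path are placeholders, and the thresholds are example values rather than recommendations.

```python
from ultralytics import YOLO

# Load a pretrained detection model (file name is a placeholder; any YOLO checkpoint works)
model = YOLO("yolov8n.pt")

# Keep only detections whose confidence exceeds 0.5; iou controls the NMS overlap threshold
results = model.predict("image.jpg", conf=0.5, iou=0.7)

# Each surviving box carries its class index and confidence score
for box in results[0].boxes:
    print(f"class={int(box.cls)}  confidence={float(box.conf):.2f}")
```

The same threshold can be set from the command line with the documented CLI form, e.g. `yolo predict model=yolov8n.pt source=image.jpg conf=0.5`.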

Real-World Applications

Confidence scores are fundamental in deploying AI models responsibly and effectively:

  1. Medical Diagnosis Support: In systems analyzing medical scans (like X-rays or MRIs) for potential anomalies (like tumor detection), the confidence score helps prioritize cases. A prediction with low confidence might indicate an ambiguous finding requiring closer examination by a radiologist, while high-confidence predictions can streamline the review process. Research in Radiology AI often discusses confidence levels.
  2. Autonomous Systems: For self-driving cars or robotics, confidence scores are critical for safety. A detection of a pedestrian or another vehicle (learn about Waymo's approach) must meet a high confidence threshold before the system initiates an action like braking or swerving. Low-confidence detections might be ignored or trigger less critical alerts. This ensures the system acts decisively only when certain.
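Both examples above reduce to routing each prediction by its confidence score. The sketch below shows one such triage policy; the function name and the 0.90/0.50 cut-offs are purely illustrative assumptions, not values from any deployed system.

```python
def triage(prediction_conf, accept_at=0.90, review_at=0.50):
    """Route a single prediction based on its confidence score (thresholds are illustrative)."""
    if prediction_conf >= accept_at:
        return "auto-accept"      # act on the prediction directly
    elif prediction_conf >= review_at:
        return "human review"     # ambiguous finding: escalate to a radiologist or operator
    return "discard"              # likely noise; ignore or log only

for conf in (0.97, 0.62, 0.18):
    print(conf, "->", triage(conf))
```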

Confidence vs. Other Metrics

It's important not to confuse the confidence score of an individual prediction with overall model evaluation metrics. While related, they measure different aspects of performance:

  • Accuracy: Measures the overall percentage of correct predictions across the entire dataset. It provides a general sense of model performance but doesn't reflect the certainty of individual predictions. A model can have high accuracy but still make some predictions with low confidence.
  • Precision: Indicates the proportion of positive predictions that were actually correct (True Positives / (True Positives + False Positives)). High precision means fewer false alarms. Confidence reflects the model's belief in its prediction, which might or might not align with correctness.
  • Recall (Sensitivity): Measures the proportion of actual positive instances that the model correctly identified (True Positives / (True Positives + False Negatives)). High recall means fewer missed detections. Confidence doesn't directly relate to how many actual positives were found.
  • F1-Score: The harmonic mean of Precision and Recall, providing a single metric that balances both. Confidence remains a prediction-level score.
  • Mean Average Precision (mAP): A common metric in object detection that summarizes the precision-recall curve across different confidence thresholds and classes. While mAP calculation involves confidence thresholds, the confidence score itself applies to each individual detection.
  • Calibration: Refers to how well the confidence scores align with the actual probability of correctness. A well-calibrated model's predictions with 80% confidence should be correct about 80% of the time. Confidence scores from models are not always inherently well-calibrated (see research on calibration).
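To make the distinction concrete, the sketch below computes Precision, Recall, and F1 from true/false positive counts, and then runs a simple bin-based calibration check comparing mean confidence to observed accuracy. All counts, scores, and labels are toy values for illustration only.

```python
import numpy as np

# Illustrative counts -- not from a real evaluation run
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)                       # correct positives among predicted positives
recall = tp / (tp + fn)                          # correct positives among actual positives
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Simple calibration check: within each confidence bin, compare mean confidence
# to the fraction of predictions that were actually correct.
confidences = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30])
correct = np.array([1, 1, 1, 0, 1, 0, 0, 0])     # 1 = prediction was right (toy labels)

bins = np.linspace(0.0, 1.0, 6)                  # five equal-width confidence bins
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidences >= lo) & (confidences < hi)
    if mask.any():
        print(f"bin [{lo:.1f}, {hi:.1f}): mean conf={confidences[mask].mean():.2f}, "
              f"accuracy={correct[mask].mean():.2f}")
```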

In summary, confidence is a valuable output for assessing the certainty of individual AI predictions, enabling better filtering, prioritization, and decision-making in real-world applications. It complements, but is distinct from, metrics that evaluate the overall performance of a model like those tracked in Ultralytics HUB.
