An Introduction to Evaluation Metrics for Object Detection

Introduction

The purpose of this post is to summarize some common metrics for object detection adopted by various popular competitions. This post mainly focuses on the definitions of the metrics; I'll write another post to discuss the interpretations and intuitions.

Some concepts

Before diving into the competition metrics, let's first review some foundational concepts.

Confidence score is the probability that an anchor box contains an object. It is usually predicted by a classifier.

Intersection over Union (IoU) is defined as the area of the intersection divided by the area of the union of a predicted bounding box (\(B_p\)) and a ground-truth box (\(B_{gt}\)):

\[ IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \quad (1) \]
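As a concrete illustration, below is a minimal Python sketch of Equation (1) for axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples; the function name and box format are my own choices for illustration, not part of any challenge's code.

    def iou(box_p, box_gt):
        """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
        # Coordinates of the intersection rectangle.
        x_left = max(box_p[0], box_gt[0])
        y_top = max(box_p[1], box_gt[1])
        x_right = min(box_p[2], box_gt[2])
        y_bottom = min(box_p[3], box_gt[3])
        if x_right <= x_left or y_bottom <= y_top:
            return 0.0  # the boxes do not overlap
        intersection = (x_right - x_left) * (y_bottom - y_top)
        area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
        area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
        return intersection / (area_p + area_gt - intersection)

    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143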

Both the confidence score and IoU are used as the criteria that determine whether a detection is a true positive or a false positive. The pseudocode below shows how:

    for each detection that has a confidence score > threshold:
        among the ground-truths, choose the one that belongs to the same class
        and has the highest IoU with the detection
        if no ground-truth can be chosen or IoU < threshold (e.g., 0.5):
            the detection is a false positive
        else:
            the detection is a true positive

As the pseudocode indicates, a detection is considered a true positive (TP) only if it satisfies three conditions: the confidence score is above the threshold; the predicted class matches the class of a ground truth; the predicted bounding box has an IoU greater than a threshold (e.g., 0.5) with the ground-truth. Violation of either of the latter two conditions makes it a false positive (FP). It is worth mentioning that the PASCAL VOC Challenge includes some additional rules to define true/false positives. In case multiple predictions correspond to the same ground-truth, only the one with the highest confidence score counts as a true positive, while the rest are considered false positives.
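To make the matching procedure concrete, here is a rough Python sketch written from the description above (not the official code of any challenge); it assumes the detections and ground-truths of a single image and a single class, and reuses the iou() helper sketched earlier.

    def match_detections(detections, gt_boxes, score_thresh=0.5, iou_thresh=0.5):
        """Label each detection as a TP or FP following the rules described above.

        detections: list of (box, confidence_score) pairs for one image and one class.
        gt_boxes:   list of ground-truth boxes for the same image and class.
        """
        matched_gt = set()  # ground-truths already claimed by a higher-scoring detection
        labels = []
        # Process detections from the highest to the lowest confidence score, so that
        # only the highest-scoring duplicate of a ground-truth counts as a true positive.
        for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
            if score <= score_thresh:
                continue  # below the confidence threshold: not counted as a detection
            best_iou, best_gt = 0.0, None
            for i, gt in enumerate(gt_boxes):
                overlap = iou(box, gt)
                if overlap > best_iou:
                    best_iou, best_gt = overlap, i
            if best_gt is not None and best_iou >= iou_thresh and best_gt not in matched_gt:
                matched_gt.add(best_gt)
                labels.append((score, "TP"))
            else:
                labels.append((score, "FP"))  # duplicate match, wrong location, or no overlap
        return labels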

When the confidence score of a detection that is supposed to detect a ground-truth is lower than the threshold, the detection counts as a false negative (FN). You may wonder how the number of false negatives is counted so as to calculate the following metrics. However, as will be shown, we don't actually need to count it to get the result.

When the confidence score of a detection that is not supposed to detect anything is lower than the threshold, the detection counts as a true negative (TN). However, in object detection we usually don't care about this kind of detection.

Precision is defined as the number of true positives divided by the sum of true positives and false positives:

\[ precision = \frac{TP}{TP + FP} \quad (2) \]

Recall is defined as the number of true positives divided by the sum of true positives and false negatives (note that the sum is simply the number of ground-truths, so there's no need to count the number of false negatives):

\[ recall = \frac{TP}{TP + FN} \quad (3) \]
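Putting Equations (2) and (3) together, a small sketch (again with my own hypothetical helper names) computes both metrics from the labelled detections and the number of ground-truths:

    def precision_recall(labels, num_gt):
        """labels: list of (score, "TP"/"FP") pairs; num_gt: total number of ground-truths."""
        tp = sum(1 for _, kind in labels if kind == "TP")
        fp = sum(1 for _, kind in labels if kind == "FP")
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / num_gt if num_gt > 0 else 0.0  # TP + FN equals the number of ground-truths
        return precision, recall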

By setting the threshold for the confidence score at different levels, we get different pairs of precision and recall. With recall on the x-axis and precision on the y-axis, we can draw a precision-recall curve, which indicates the association between the two metrics. Fig. 1 shows a simulated plot.

Figure 1

Note that as the threshold for the confidence score decreases, recall increases monotonically; precision can go up and down, but the general tendency is to decrease.
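One common way to trace the curve is to sort the detections by confidence score and sweep the threshold through the sorted scores, accumulating true and false positives along the way. The sketch below follows this idea under the same assumptions as the earlier snippets; it is illustrative, not the exact procedure of any particular challenge.

    import numpy as np

    def precision_recall_curve(labels, num_gt):
        """labels: (score, "TP"/"FP") pairs for one class; num_gt: number of ground-truths."""
        labels = sorted(labels, key=lambda d: d[0], reverse=True)
        tps = np.cumsum([kind == "TP" for _, kind in labels])
        fps = np.cumsum([kind == "FP" for _, kind in labels])
        recalls = tps / num_gt          # monotonically non-decreasing
        precisions = tps / (tps + fps)  # may go up and down
        return recalls, precisions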

In addition to the precision-recall curve, there is another kind of curve called the recall-IoU curve. Traditionally, this curve is used to evaluate the effectiveness of detection proposals (Hosang et al. 2016), but it is also the foundation of a metric called average recall, which will be introduced in the next section.

By setting the threshold for IoU at different levels, the detector achieves different recall levels accordingly. With these values, we can draw the recall-IoU curve by mapping \(IoU \in [0.5, 1.0]\) on the x-axis and recall on the y-axis (Fig. 2 shows a simulated plot).

Figure 2

The curve shows that recall decreases as IoU increases.
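A recall-IoU curve can be sketched in a similar fashion: for each IoU threshold, re-run the matching and record the resulting recall. The snippet below is illustrative only and reuses the hypothetical match_detections() and precision_recall() helpers from above.

    import numpy as np

    def recall_iou_curve(detections, gt_boxes, iou_levels=np.arange(0.5, 1.0, 0.05)):
        recalls = []
        for t in iou_levels:
            labels = match_detections(detections, gt_boxes, iou_thresh=t)
            _, r = precision_recall(labels, num_gt=len(gt_boxes))
            recalls.append(r)
        return iou_levels, np.array(recalls)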

Definitions of various metrics

This section introduces the following metrics: average precision (AP), mean average precision (mAP), average recall (AR) and mean average recall (mAR).

Average precision

Although the precision-recall curve can be used to evaluate the performance of a detector, it is not easy to compare among different detectors when the curves intersect with each other. It would be better to have a numerical metric that can be used directly for the comparison. This is where average precision (AP), which is based on the precision-recall curve, comes into play. In essence, AP is the precision averaged across all unique recall levels.

Note that in order to reduce the impact of the wiggles in the curve, we first interpolate the precision at multiple recall levels before actually calculating AP. The interpolated precision \(p_{interp}\) at a certain recall level \(r\) is defined as the highest precision found for any recall level \(r' \geq r\):

\[ p_{interp}(r) = \max_{r' \geq r} p(r') \quad (4) \]

Note that there are two ways to choose the levels of recall (denoted as \(r\) above) at which the precision should be interpolated. The traditional way is to choose 11 equally spaced recall levels (i.e., 0.0, 0.1, 0.2, …, 1.0), while a newer standard adopted by the PASCAL VOC challenge chooses all unique recall levels presented by the data. The new standard is said to be more capable of improving precision and measuring differences between methods with low AP. Fig. 3 shows how the interpolated precision-recall curve is obtained over the original curve, using the new standard.

Figure 3

AP can then be defined as the area under the interpolated precision-recall curve, which can be calculated using the following formula:

\[ AP = \sum_{i = 1}^{n - 1} (r_{i + 1} - r_i)\,p_{interp}(r_{i + 1}) \quad (5) \]

where \(r_1, r_2, ..., r_n\) are the recall levels (in ascending order) at which the precision is first interpolated.
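The interpolation in Equation (4) and the summation in Equation (5) translate fairly directly into code. The following sketch implements the "all unique recall levels" variant and is my own illustration, not the reference PASCAL VOC implementation; it expects the outputs of the precision_recall_curve() helper sketched above.

    import numpy as np

    def average_precision(recalls, precisions):
        """AP as the area under the interpolated precision-recall curve (Eq. 4 and 5)."""
        # Prepend recall = 0 so the segment before the first recall level is included.
        r = np.concatenate(([0.0], recalls))
        p = np.concatenate(([0.0], precisions))
        # Eq. 4: the interpolated precision at r is the highest precision at any recall >= r.
        for i in range(len(p) - 2, -1, -1):
            p[i] = max(p[i], p[i + 1])
        # Eq. 5: sum the rectangles between consecutive unique recall levels.
        idx = np.where(r[1:] != r[:-1])[0] + 1
        return float(np.sum((r[idx] - r[idx - 1]) * p[idx]))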

Mean average precision

The calculation of AP only involves one class. However, in object detection, there are usually \(K > 1\) classes. Mean average precision (mAP) is defined as the mean of AP across all \(K\) classes:

\[ mAP = \frac{\sum_{i = 1}^{K}{AP_i}}{K} \quad (6) \]
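Equation (6) is then just an average over the per-class AP values, as in the tiny sketch below (mAR, introduced later, is computed the same way from per-class AR values):

    def mean_average_precision(ap_per_class):
        """ap_per_class: dict mapping each class to its AP (Eq. 6)."""
        return sum(ap_per_class.values()) / len(ap_per_class)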

Average recall

Like AP, average recall (AR) is also a numerical metric that can be used to compare detector performance. In essence, AR is the recall averaged over all \(IoU \in [0.5, 1.0]\) and can be computed as two times the area under the recall-IoU curve:

\[ AR = 2 \int_{0.5}^{1} recall(o) \, do \quad (7) \]

where \(o\) is IoU and \(recall(o)\) is the corresponding recall.
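In practice, the integral in Equation (7) is approximated numerically from recall values sampled at discrete IoU thresholds. A sketch using the trapezoidal rule (my own choice of approximation) could look like this, reusing the recall_iou_curve() helper from above:

    import numpy as np

    def average_recall(iou_levels, recalls):
        """AR as two times the area under the recall-IoU curve (Eq. 7)."""
        return 2.0 * float(np.trapz(recalls, x=iou_levels))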

It should be noted that for its original purpose (Hosang et al. 2016), the recall-IoU curve does not distinguish among different classes. However, the COCO challenge makes such distinctions and its AR metric is calculated on a per-class basis, just like AP.

Mean average recall

Mean average recall (mAR) is defined as the mean of AR across all \(K\) classes:

\[ mAR = \frac{\sum_{i = 1}^{K}{AR_i}}{K} \quad (8) \]

Variations among the metrics

The PASCAL VOC challenge's mAP metric can be seen as the standard metric to evaluate the performance of object detectors; the major metrics adopted by the other two competitions can be seen as variants of the same metric.

The COCO challenge's variants

Recall that the PASCAL VOC challenge defines the mAP metric using a single IoU threshold of 0.5. However, the COCO challenge defines several mAP metrics using different thresholds, including:

  • \(mAP^{IoU=.50:.05:.95}\), which is mAP averaged over 10 IoU thresholds (i.e., 0.50, 0.55, 0.60, …, 0.95) and is the primary challenge metric;
  • \(mAP^{IoU=.50}\), which is identical to the Pascal VOC metric;
  • \(mAP^{IoU=.75}\), which is a strict metric.

In addition to different IoU thresholds, there are also mAP metrics calculated across different object scales; these variants of mAP are all averaged over 10 IoU thresholds (i.e., 0.50, 0.55, 0.60, …, 0.95):

  • \(mAP^{small}\), which is mAP for small objects that cover an area less than \(32^2\);
  • \(mAP^{medium}\), which is mAP for medium objects that cover an area greater than \(32^2\) but less than \(96^2\);
  • \(mAP^{large}\), which is mAP for large objects that cover an area greater than \(96^2\).

Like mAP, the mAR metric also has many variations. One set of mAR variants varies across different numbers of detections per image:

  • \(mAR^{max = 1}\), which is mAR given 1 detection per image;
  • \(mAR^{max = 10}\), which is mAR given 10 detections per image;
  • \(mAR^{max = 100}\), which is mAR given 100 detections per image.

The other set of mAR variants varies across the size of the detected objects:

  • \(mAR^{small}\), which is mAR for small objects that cover an area less than \(32^2\);
  • \(mAR^{medium}\), which is mAR for medium objects that cover an area greater than \(32^2\) but less than \(96^2\);
  • \(mAR^{large}\), which is mAR for large objects that cover an area greater than \(96^2\).
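All of the COCO variants listed above can be computed with the pycocotools package. A minimal usage sketch, assuming COCO-format annotation and detection-result files with the hypothetical names instances_val.json and detections.json, looks roughly like this:

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO("instances_val.json")          # ground-truth annotations (COCO format)
    coco_dt = coco_gt.loadRes("detections.json")  # detection results (COCO format)

    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints the mAP and mAR variants listed above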

The Open Images challenge's variants

The Open Images challenge's object detection metric is a variant of the PASCAL VOC challenge's mAP metric, which accommodates three fundamental features of the dataset that affect how true positives and false positives are counted:

  • non-exhaustive image-level labeling;
  • semantic hierarchy of classes;
  • some ground-truth boxes may contain groups of objects and the exact location of a single object inside the group is unknown.

The official site provides a more detailed description of how to deal with these cases.

Implementations

The TensorFlow Object Detection API provides implementations of various metrics.

There is also another open-source project that implements various metrics that respect the competitions' specifications, with the advantage of unifying the input format.

References

Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. "The Pascal Visual Object Classes (VOC) Challenge." International Journal of Computer Vision 88 (2): 303–38. doi:10.1007/s11263-009-0275-4.

Hosang, Jan, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. 2016. "What Makes for Effective Detection Proposals?" IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (4): 814–30. doi:10.1109/TPAMI.2015.2465908.

Kuznetsova, Alina. 2018. "The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale."

Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. "Microsoft COCO: Common Objects in Context," May. http://arxiv.org/abs/1405.0312.

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. 2015. "ImageNet Large Scale Visual Recognition Challenge." International Journal of Computer Vision 115 (3): 211–52. doi:10.1007/s11263-015-0816-y.


Source: https://blog.zenggyu.com/en/post/2018-12-16/an-introduction-to-evaluation-metrics-for-object-detection/
