Its use differs between Information Retrieval (References [1][2]) and multi-class classification (Object Detection) settings.
To calculate it for Object Detection, you compute the average precision for each class in your data based on your model's predictions. Average precision is related to the area under the precision-recall curve for a class. Taking the mean of these per-class average precisions gives you the Mean Average Precision.
First, an "overlap criterion" is defined as an intersection-over-union greater than 0.5 (i.e. if a predicted box satisfies this criterion with respect to a ground-truth box, it is considered a detection). Then a matching is made between the ground-truth (GT) boxes and the predicted boxes using this "greedy" approach:
Detections output by a method were assigned to ground truth objects
satisfying the overlap criterion in order ranked by the (decreasing)
confidence output. Multiple detections of the same object in an image
were considered false detections e.g. 5 detections of a single object
counted as 1 correct detection and 4 false detections
Hence each predicted box is either True-Positive or False-Positive.
Each ground-truth box is either matched by a True-Positive detection or missed (a False-Negative).
There are no True-Negatives.
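The greedy assignment quoted above can be sketched as follows. This is a minimal illustration for one image and one class, assuming boxes are `(x1, y1, x2, y2)` tuples. The names `iou` and `match_detections` are hypothetical, and comparing each detection against its best-overlapping ground-truth box is one common reading of the quoted rule, not necessarily the only one.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, iou_threshold=0.5):
    """Label detections as TP (True) or FP (False) for one image and one class.

    detections: list of (confidence, box) pairs (assumed format).
    gt_boxes:   list of ground-truth boxes.
    Returns (labels in decreasing-confidence order, number of missed GT boxes).
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    claimed = [False] * len(gt_boxes)
    labels = []
    for _, box in ordered:
        # Find the ground-truth box that overlaps this detection the most.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_iou > iou_threshold and not claimed[best_j]:
            claimed[best_j] = True   # first (most confident) hit on this object
            labels.append(True)
        else:
            labels.append(False)     # poor overlap or duplicate detection
    return labels, claimed.count(False)
```

With several overlapping detections of one object, only the most confident one is labeled a True-Positive in this sketch; the rest become False-Positives, and any ground-truth box left unclaimed is counted as missed.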
Then the average precision is computed by averaging the precision values on the precision-recall curve at the recall levels 0, 0.1, ..., 1 (i.e. the average of 11 precision values). To be more precise, we consider a slightly corrected PR curve: for each curve point (p, r), if there is a different curve point (p', r') such that p' > p and r' >= r, we replace p with the maximum p' of those points.
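The curve correction can be written almost literally from that definition. The sketch below assumes the detections have already been labeled True-Positive or False-Positive and sorted by decreasing confidence; `pr_points` and `corrected_precisions` are hypothetical names, and the quadratic loop is chosen for clarity rather than speed.

```python
def pr_points(labels, num_gt):
    """Precision and recall after each detection, in decreasing-confidence order."""
    precisions, recalls, tp = [], [], 0
    for k, is_tp in enumerate(labels, start=1):
        tp += int(is_tp)
        precisions.append(tp / k)      # TP / (TP + FP) among the top-k detections
        recalls.append(tp / num_gt)    # TP / number of ground-truth boxes
    return precisions, recalls


def corrected_precisions(precisions, recalls):
    """For each point, take the largest precision among points with recall >= its recall."""
    return [
        max(p2 for p2, r2 in zip(precisions, recalls) if r2 >= r)
        for r in recalls
    ]


precisions, recalls = pr_points([True, False, True, False], num_gt=2)
smoothed = corrected_precisions(precisions, recalls)
```

Each point always qualifies for its own maximum, so the corrected precision is never lower than the raw one, and the corrected curve is non-increasing in recall.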
What is still unclear to me is what is done with those GT boxes that are never detected (even when the confidence threshold is 0). This means that there are certain recall values that the precision-recall curve will never reach, which makes the average precision computation above undefined.
Edit:
Short answer: in the region where the recall is unreachable, the precision drops to 0.
One way to explain this is to assume that when the threshold for the confidence approaches 0, an infinite number of predicted bounding boxes light up all over the image. The precision then immediately goes to 0 (since there is only a finite number of GT boxes) and the recall keeps growing on this flat curve until we reach 100%.
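Putting the 11-point average and the zero-precision convention together gives a short sketch like the one below. It assumes a list of True-Positive/False-Positive flags sorted by decreasing confidence and the total number of ground-truth boxes for the class; the function name `eleven_point_ap` is hypothetical.

```python
def eleven_point_ap(labels, num_gt):
    """11-point interpolated average precision for one class.

    labels: TP (True) / FP (False) flags, sorted by decreasing confidence.
    num_gt: total number of ground-truth boxes of this class.
    """
    precisions, recalls, tp = [], [], 0
    for k, is_tp in enumerate(labels, start=1):
        tp += int(is_tp)
        precisions.append(tp / k)
        recalls.append(tp / num_gt)

    total = 0.0
    for i in range(11):
        r = i / 10  # recall levels 0, 0.1, ..., 1
        # Best precision at any recall >= r; 0 if that recall is never reached.
        reachable = [p for p, rc in zip(precisions, recalls) if rc >= r]
        total += max(reachable, default=0.0)
    return total / 11


# One correct detection, but two ground-truth objects: recall never exceeds 0.5.
ap = eleven_point_ap([True], num_gt=2)
```

In this usage example the six recall levels from 0 to 0.5 contribute a precision of 1 and the five unreachable levels contribute 0, so the result is 6/11.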
I think the important part here is linking how object detection can be considered the same as the standard information retrieval problems for which there exists at least one excellent description of average precision.
The output of an object detection algorithm is a set of proposed bounding boxes and, for each one, a confidence and a set of classification scores (one score per class). Let's ignore the classification scores for now, and use the confidence as the input to a thresholded binary classification. Intuitively, the average precision is an aggregation over all choices for the threshold/cut-off value. But wait; in order to calculate precision, we need to know if a box is correct!
This is where it gets confusing/difficult; as opposed to typical information retrieval problems, we actually have an extra level of classification here. That is, we can't do exact matching between boxes, so we need to classify if a bounding box is correct or not. The solution is to essentially do a hard-coded classification on the box dimensions; we check if it sufficiently overlaps with any ground truth to be considered 'correct'. The threshold for this part is chosen by common sense. The dataset you are working on will likely define what this threshold for a 'correct' bounding box is. Most datasets just set it at 0.5 IoU and leave it at that (I recommend doing a few manual IoU calculations [they're not hard] to get a feel for how strict IoU of 0.5 actually is).
Now that we have actually defined what it means to be 'correct', we can just use the same process as information retrieval.
To find mean average precision (mAP), you just stratify your proposed boxes based on the maximum of the classification scores associated with those boxes, then average (take the mean) of the average precision (AP) over the classes.
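A rough sketch of that stratification step is shown below. The box format (a dict with a `class_scores` mapping) and the names `stratify_by_class` and `mean_ap` are assumptions made for illustration; the per-class AP values are taken as given here.

```python
from collections import defaultdict


def stratify_by_class(boxes):
    """Group proposed boxes by the class with the highest classification score."""
    groups = defaultdict(list)
    for box in boxes:
        scores = box["class_scores"]          # assumed: {class_name: score}
        label = max(scores, key=scores.get)   # class with the maximum score
        groups[label].append(box)
    return dict(groups)


def mean_ap(ap_per_class):
    """Mean of the per-class average precisions."""
    return sum(ap_per_class.values()) / len(ap_per_class)


boxes = [
    {"confidence": 0.9, "class_scores": {"cat": 0.8, "dog": 0.2}},
    {"confidence": 0.6, "class_scores": {"cat": 0.1, "dog": 0.9}},
    {"confidence": 0.4, "class_scores": {"cat": 0.7, "dog": 0.3}},
]
groups = stratify_by_class(boxes)
```

Each group would then be scored against the ground-truth boxes of its class to obtain one AP per class, and `mean_ap` averages those values.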
TL;DR: distinguish between determining whether a bounding box prediction is 'correct' (the extra level of classification) and evaluating how well the box confidence tells you that a prediction is 'correct' (completely analogous to the information retrieval case). With that distinction in mind, the typical descriptions of mAP will make sense.
For detection, a common way to determine if one object proposal was right is Intersection over Union (IoU, IU). This takes the set A of proposed object pixels and the set B of true object pixels and calculates:
$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
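For pixel sets, IoU is the size of the intersection divided by the size of the union. A minimal sketch, assuming each object is given as a collection of `(row, column)` pixel coordinates (the function name `iou_pixels` is hypothetical):

```python
def iou_pixels(proposed, true):
    """IoU of two pixel collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(proposed), set(true)
    union = a | b
    # Two empty sets have no overlap to measure; return 0 by convention here.
    return len(a & b) / len(union) if union else 0.0


# Two 2x2 squares that share one column of two pixels.
A = {(0, 0), (0, 1), (1, 0), (1, 1)}
B = {(0, 1), (1, 1), (0, 2), (1, 2)}
overlap = iou_pixels(A, B)
```

Here the two squares share 2 pixels out of 6 in their union, so the IoU is 1/3.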
Commonly, IoU > 0.5 means that it was a hit; otherwise it was a miss. For each class c, one can count the
True Positives TP(c): a proposal was made for class c and there actually was an object of class c
False Positives FP(c): a proposal was made for class c, but there is no object of class c
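Precision for class c at a fixed confidence threshold:
$$Precision(c) = \frac{TP(c)}{TP(c) + FP(c)}$$
The Average Precision for class c, AP(c), summarizes this precision over all recall levels (that is, over all confidence thresholds).
The mAP (mean average precision) is then:
$$mAP = \frac{1}{|classes|} \sum_{c \in classes} AP(c)$$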
Note:
If one wants better proposals, one can raise the IoU threshold from 0.5 to a higher value (up to 1.0, which would require a perfect match). One can denote this with mAP@p, where $p \in (0, 1)$ is the IoU threshold.
mAP@[.5:.95] means that the mAP is calculated at multiple IoU thresholds (0.5 to 0.95 in steps of 0.05, as in the COCO evaluation) and then averaged again over those thresholds.
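Averaging over IoU thresholds can be sketched as below. The callable `ap_at(class_name, iou_threshold)` is a hypothetical stand-in for whatever computes the AP of one class at one IoU threshold, and the default threshold grid of 0.50 to 0.95 in steps of 0.05 follows the COCO convention.

```python
def map_over_iou_thresholds(ap_at, classes, thresholds=None):
    """Mean over IoU thresholds of the mean AP over classes.

    ap_at: hypothetical callable (class_name, iou_threshold) -> AP in [0, 1].
    """
    if thresholds is None:
        # 0.50, 0.55, ..., 0.95 (rounded to avoid floating-point drift)
        thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    map_per_threshold = []
    for t in thresholds:
        aps = [ap_at(c, t) for c in classes]
        map_per_threshold.append(sum(aps) / len(aps))   # mAP@t
    return sum(map_per_threshold) / len(map_per_threshold)


# Toy AP function for illustration only: AP falls linearly as the threshold rises.
score = map_over_iou_thresholds(lambda c, t: 1.0 - t, ["cat", "dog"])
```

Passing `thresholds=[0.5]` reduces this to the plain mAP@0.5.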
In most object detection contests there are many categories to detect, and the model is evaluated on one specific category at a time; the evaluation result is the AP of that category.
When every category has been evaluated, the mean of all APs is calculated as the final result of the model, which is the mAP.
Intersection Over Union (IOU) is a measure based on the Jaccard Index that evaluates the overlap between two bounding boxes. It requires a ground truth bounding box and a predicted bounding box. By applying the IOU we can tell if a detection is valid (True Positive) or not (False Positive). IOU is given by the overlapping area between the predicted bounding box and the ground truth bounding box divided by the area of union between them.
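For axis-aligned boxes this ratio takes only a few lines to compute. The sketch below assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates with `x1 < x2` and `y1 < y2`; the function name `box_iou` is hypothetical.

```python
def box_iou(pred, gt):
    """Area of overlap divided by area of union for two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


gt = (0, 0, 100, 100)
shift_20 = box_iou((20, 0, 120, 100), gt)   # same size, shifted 20 pixels
shift_50 = box_iou((50, 0, 150, 100), gt)   # same size, shifted 50 pixels
```

A 100x100 box shifted sideways by 20 pixels has an IOU of 2/3 with the original, and a shift of 50 pixels gives 1/3, which already fails a 0.5 threshold.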