There are potentially multiple ways of representing the same number,
using decimal as an example, the number 0.1 could be represented as
1*10-1 or 0.1*100 or even 0.01 * 10. The standard dictates that the
numbers are always stored with the first bit as a one. In decimal that
corresponds to the 1*10-1 example.
Now suppose that the lowest exponent that can be represented is -100.
So the smallest number that can be represented in normal form is
1*10-100. However, if we relax the constraint that the leading bit be
a one, then we can actually represent smaller numbers in the same
space. Taking a decimal example we could represent 0.1*10-100. This
is called a subnormal number. The purpose of having subnormal numbers
is to smooth the gap between the smallest normal number and zero.
It is very important to realise that subnormal numbers are represented
with less precision than normal numbers. In fact, they are trading
reduced precision for their smaller size. Hence calculations that use
subnormal numbers are not going to have the same precision as
calculations on normal numbers. So an application which does
significant computation on subnormal numbers is probably worth
investigating to see if rescaling (i.e. multiplying the numbers by
some scaling factor) would yield fewer subnormals, and more accurate
results.
In the IEEE754 standard, floating point numbers are represented as binary scientific notation, x = M × 2e. Here M is the mantissa and e is the exponent. Mathematically, you can always choose the exponent so that 1 ≤ M < 2.* However, since in the computer representation the exponent can only have a finite range, there are some numbers which are bigger than zero, but smaller than 1.0 × 2emin. Those numbers are the subnormals or M0.
Practically, the mantissa is stored without the leading 1, since there is always a leading 1, except for subnormal numbers (and zero). Thus the interpretation is that if the exponent is non-minimal, there is an implicit leading 1, and if the exponent is minimal, there isn't, and the number is subnormal.
*) More generally, 1 ≤ M < B for any base-B scientific notation.
Therefore, it would be wasteful to let that digit take up one precision bit almost every single number.
For this reason, they created the "leading bit convention":
always assume that the number starts with one
But then how to deal with 0.0? Well, they decided to create an exception:
if the exponent is 0
and the fraction is 0
then the number represents plus or minus 0.0
so that the bytes 00 00 00 00 also represent 0.0, which looks good.
If we only considered these rules, then the smallest non-zero number that can be represented would be:
exponent: 0
fraction: 1
which looks something like this in a hex fraction due to the leading bit convention:
1.000002 * 2 ^ (-127)
where .000002 is 22 zeroes with a 1 at the end.
We cannot take fraction = 0, otherwise that number would be 0.0.
But then the engineers, who also had a keen aesthetic sense, thought: isn't that ugly? That we jump from straight 0.0 to something that is not even a proper power of 2? Couldn't we represent even smaller numbers somehow? (OK, it was a bit more concerning than "ugly": it was actually people getting bad results for their computations, see "How subnormals improve computations" below).
Subnormal numbers
The engineers scratched their heads for a while, and came back, as usual, with another good idea. What if we create a new rule:
If the exponent is 0, then:
the leading bit becomes 0
the exponent is fixed to -126 (not -127 as if we didn't have this exception)
Such numbers are called subnormal numbers (or denormal numbers which is synonym).
This rule immediately implies that the number such that:
exponent: 0
fraction: 0
is still 0.0, which is kind of elegant as it means one less rule to keep track of.
So 0.0 is actually a subnormal number according to our definition!
With this new rule then, the smallest non-subnormal number is:
exponent: 1 (0 would be subnormal)
fraction: 0
which represents:
1.0 * 2 ^ (-126)
Then, the largest subnormal number is:
exponent: 0
fraction: 0x7FFFFF (23 bits 1)
which equals:
0.FFFFFE * 2 ^ (-126)
where .FFFFFE is once again 23 bits one to the right of the dot.
This is pretty close to the smallest non-subnormal number, which sounds sane.
And the smallest non-zero subnormal number is:
exponent: 0
fraction: 1
which equals:
0.000002 * 2 ^ (-126)
which also looks pretty close to 0.0!
Unable to find any sensible way to represent numbers smaller than that, the engineers were happy, and went back to viewing cat pictures online, or whatever it is that they did in the 70s instead.
As you can see, subnormal numbers do a trade-off between precision and representation length.
As the most extreme example, the smallest non-zero subnormal:
0.000002 * 2 ^ (-126)
has essentially a precision of a single bit instead of 32-bits. For example, if we divide it by two:
0.000002 * 2 ^ (-126) / 2
we actually reach 0.0 exactly!
Visualization
It is always a good idea to have a geometric intuition about what we learn, so here goes.
If we plot IEEE 754 floating point numbers on a line for each given exponent, it looks something like this:
In addition to exposing all of C's APIs, C++ also exposes some extra subnormal related functionality that is not as readily available in C in <limits>, e.g.:
denorm_min: Returns the minimum positive subnormal value of the type T
In C++ the whole API is templated for each floating point type, and is much nicer.
Implementations
x86_64 and ARMv8 implemens IEEE 754 directly on hardware, which the C code translates to.
The performance of floating-point processing can be reduced when doing calculations involving denormalized numbers and Underflow exceptions. In many algorithms, this performance can be recovered, without significantly affecting the accuracy of the final result, by replacing the denormalized operands and intermediate results with zeros. To permit this optimization, ARM floating-point implementations allow a Flush-to-zero mode to be used for different floating-point formats as follows:
For AArch64:
If FPCR.FZ==1, then Flush-to-Zero mode is used for all Single-Precision and Double-Precision inputs and outputs of all instructions.
If FPCR.FZ16==1, then Flush-to-Zero mode is used for all Half-Precision inputs and outputs of floating-point instructions, other than:—Conversions between Half-Precision and Single-Precision numbers.—Conversions between Half-Precision and Double-Precision numbers.
A1.5.2 "Floating-point standards, and terminology" Table A1-3 "Floating-point terminology" confirms that subnormals and denormals are synonyms:
This manual IEEE 754-2008
------------------------- -------------
[...]
Denormal, or denormalized Subnormal
C5.2.7 "FPCR, Floating-point Control Register" describes how ARMv8 can optionally raise exceptions or set a flag bits whenever the input of a floating point operation is subnormal:
FPCR.IDE, bit [15] Input Denormal floating-point exception trap enable. Possible values are:
0b0 Untrapped exception handling selected. If the floating-point exception occurs then the FPSR.IDC bit is set to 1.
0b1 Trapped exception handling selected. If the floating-point exception occurs, the PE does not update the FPSR.IDC bit. The trap handling software can decide whether to set the FPSR.IDC bit to 1.
D12.2.88 "MVFR1_EL1, AArch32 Media and VFP Feature Register 1" shows that denormal support is completely optional in fact, and offers a bit to detect if there is support:
FPFtZ, bits [3:0]
Flush to Zero mode. Indicates whether the floating-point implementation provides support only for the Flush-to-Zero mode of operation. Defined values are:
0b0000 Not implemented, or hardware supports only the Flush-to-Zero mode of operation.
0b0001 Hardware supports full denormalized number arithmetic.
All other values are reserved.
In ARMv8-A, the permitted values are 0b0000 and 0b0001.
This suggests that when subnormals are not implemented, implementations just revert to flush-to-zero.
[S]ubnormal numbers eliminate underflow as a cause for concern for a variety of computations (typically, multiply followed by add). ... The class of problems that succeed in the presence of gradual underflow, but fail with Store 0, is larger than the fans of Store 0 may realize. ... In the absence of gradual underflow, user programs need to be sensitive to the implicit inaccuracy threshold. For example, in single precision, if underflow occurs in some parts of a calculation, and Store 0 is used to replace underflowed results with 0, then accuracy can be guaranteed only to around 10-31, not 10-38, the usual lower range for single-precision exponents.
The Numerical Computation Guide refers the reader to two other papers: