Bfloat16

bfloat16
Term	bfloat16
English name	bfloat16 floating-point type
Aliases	bfloat16, BF16, brain 16-bit floating-point

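bfloat16 (brain floating-point) is a 16-bit floating-point data type. It is a truncated form of the IEEE 754 binary32 format and is mainly used in deep learning. The name bfloat16 reflects its length and its purpose.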

For technical reasons, the first letter of the title is capitalized. The term should normally begin with a lowercase letter.

Although bfloat16 is a relatively new format, the growth of machine learning in recent years has led a large amount of hardware to support it.

Definition

bfloat16 is a 16-bit floating-point type whose sign field and exponent field match those of the IEEE 754 binary32 floating-point format.

The 16 bits consist of 1 sign bit, 8 exponent bits, and 7 significand (mantissa) bits.
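The layout above can be illustrated with a minimal Python sketch. It assumes conversion by plain truncation, that is, keeping the upper 16 bits of the binary32 pattern; real hardware and libraries often round to nearest instead, so this is an illustration, not a reference conversion. The function names `float32_to_bfloat16_bits` and `split_fields` are hypothetical.

```python
import struct

def float32_to_bfloat16_bits(x):
    """Return the 16-bit pattern made of the upper half of x's binary32 encoding.

    Assumption: truncation only, no rounding of the discarded 16 bits.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def split_fields(bits16):
    """Split a 16-bit pattern into (sign, exponent, mantissa) = (1, 8, 7) bits."""
    sign = bits16 >> 15               # top bit
    exponent = (bits16 >> 7) & 0xFF   # next 8 bits, stored with bias 127
    mantissa = bits16 & 0x7F          # low 7 bits
    return sign, exponent, mantissa

bits = float32_to_bfloat16_bits(1.5)
print(hex(bits), split_fields(bits))  # 0x3fc0 (0, 127, 64)
```

For 1.5 the stored exponent equals the bias 127 and the mantissa field is 64, which corresponds to a fraction of 64/128 = 0.5.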

Range

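With 8 exponent bits, the exponent bias is [math]\displaystyle{ b=2^{8-1}-1=127 }[/math]. The exponent range is [math]\displaystyle{ e_\min = 1-b=-126, e_\max=b=127 }[/math]. With 7 stored significand bits, the precision is [math]\displaystyle{ p=8 }[/math], so the significand resolution is [math]\displaystyle{ 2^{-7} }[/math], and the stored fraction takes values from 0 to [math]\displaystyle{ 1-2^{-7} }[/math].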

The significand precision is 8 binary digits (including the hidden bit), which corresponds to roughly 2 to 3 significant decimal digits ([math]\displaystyle{ \lg 2^{8} \approx 2.41 }[/math]).
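The precision figures can be checked numerically with a short Python sketch. It assumes conversion by truncating the low 16 bits of the binary32 encoding (no rounding); the helper name `truncate_to_bfloat16` is hypothetical.

```python
import math
import struct

def truncate_to_bfloat16(x):
    """Return x with the low 16 bits of its binary32 encoding cleared.

    Assumption: plain truncation; rounding modes are ignored.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    (y,) = struct.unpack(">f", struct.pack(">I", bits32 & 0xFFFF0000))
    return y

# 8 binary digits of precision expressed in decimal digits
decimal_digits = math.log10(2 ** 8)

# Spacing between neighbouring values in [1, 2) is 2**-7
print(truncate_to_bfloat16(1.0 + 2.0 ** -7))  # 1.0078125, representable
print(truncate_to_bfloat16(1.0 + 2.0 ** -8))  # 1.0, the extra bit is lost
print(truncate_to_bfloat16(3.14159))          # 3.140625
```

The last line shows that 3.14159 keeps only about three significant decimal digits after conversion, in line with the estimate of 2 to 3 digits.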

For bfloat16, apart from 0, infinity, and NaN:

Normalized numbers [math]\displaystyle{ (-1)^s 2^{E-127} (1+M) }[/math], positive (absolute value) range: [math]\displaystyle{ 2^{-126} \approx 1.175 \times 10^{-38} }[/math] to [math]\displaystyle{ 2^{127} \times (2 - 2^{-7}) \approx 3.390 \times 10^{38} }[/math]
Subnormal (denormalized) numbers [math]\displaystyle{ (-1)^s 2^{e_\min} M }[/math], positive (absolute value) range: [math]\displaystyle{ 2^{-126} \times 2^{-7} \approx 9.184 \times 10^{-41} }[/math] to [math]\displaystyle{ 2^{-126} \times (1 - 2^{-7}) \approx 1.166 \times 10^{-38} }[/math]
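The range limits above can be reproduced from the two formulas with a small Python sketch. The constant names and the decoder `decode_bfloat16` are hypothetical, and the decoder handles finite values only.

```python
BIAS = 2 ** (8 - 1) - 1   # 127
E_MIN = 1 - BIAS          # -126
E_MAX = BIAS              # 127
FRACTION_BITS = 7

def decode_bfloat16(bits16):
    """Decode a finite 16-bit pattern using the formulas given above."""
    sign = -1.0 if bits16 >> 15 else 1.0
    e = (bits16 >> 7) & 0xFF
    m = (bits16 & 0x7F) / 2 ** FRACTION_BITS   # fraction M in [0, 1)
    if e == 0xFF:
        raise ValueError("infinity or NaN is not handled in this sketch")
    if e == 0:
        return sign * 2.0 ** E_MIN * m          # subnormal (or zero)
    return sign * 2.0 ** (e - BIAS) * (1 + m)   # normalized

smallest_normal = 2.0 ** E_MIN
largest_normal = 2.0 ** E_MAX * (2 - 2.0 ** -FRACTION_BITS)
smallest_subnormal = 2.0 ** E_MIN * 2.0 ** -FRACTION_BITS
largest_subnormal = 2.0 ** E_MIN * (1 - 2.0 ** -FRACTION_BITS)

print(f"{smallest_normal:.3e} to {largest_normal:.3e}")
print(f"{smallest_subnormal:.3e} to {largest_subnormal:.3e}")
```

The extreme bit patterns 0x0080, 0x7F7F, 0x0001, and 0x007F decode to the four limits, which gives a quick consistency check between the field layout and the range formulas.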