x86/bytecode.txt: improve byte code documentation

H. Peter Anvin (Intel) · H. Peter Anvin (Intel) · commit 587ed5e36d07 · 2025-10-12T11:23:28.000-07:00
Improve the byte code reference documentation to make a few opcodes
more clear and add some general properties about the byte codes,
including the files that need to be changed when the byte code
changes.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
diff --git a/x86/bytecode.txt b/x86/bytecode.txt
@@ -1,3 +1,5 @@
+-*- text -*-
+
 Bytecode specification
 ----------------------
 
@@ -9,31 +11,72 @@ hexadecimal.
 
 The mnemonics are the ones used in x86/insns.dat, where applicable.
 
+The byte code is not stable. Byte codes can be moved around and
+recycled at any time. x86/insnsb.c contains a generated table of
+byte code use frequencies as a comment near the end that can be
+used to identify candidates for recycling, if necessary.
+
+Several byte codes are equivalent to sequences of other byte codes; if
+those have low usage counts they can be good candidates for
+recycling.
+
+Operands are numbered starting with 0.
+
+Operand numbers encoded in byte codes only encode two bits of the
+operand number, with the opcodes \5, \6 and \7 used as prefixes to
+escape to operands 4+. This saves a lot of byte coding space, as these
+operands are extremely rare.
+
+When byte codes are changed, the following files MUST be updated
+accordingly:
+
+	this file
+	x86/insns.pl	- many locations
+	disasm/disasm.c	- matches()
+	asm/assemble.c	- calcsize(), gencode(), find_match(), jmp_match()
+
 In x86/insns.dat, the encoding slot of each operand is encoded as:
 
 	-	implicit operand (no encoding)
 	x+y	multiple encoding slots for one operand
-	r	"r" position in modr/m, or base register with "+r"
+	r	"r" position in modr/m[1], or base register with "+r"[2]
 	m	"m" position in modr/m
-	n	immediate encoded in the "m" position in modr/m
-	b	register encoded in the "m" position in modr/m
+	n	immediate encoded in the "m" position in modr/m[3]
+	b	register encoded in the "m" position in modr/m[4]
 	x	register encoded in the "x" position in modr/m + sib (MIB)
 	v	"v" register position in vex/evex
-	s	"s" registe rposition in /is4
-	w	immediate encoded in the "v" position in vex/evex
-	i	first immediate or mem_offs
-	j	second immediate or mem_offs
-
-Codes            Mnemonic        Explanation
-
-\0                                       terminates the code. (Unless it's a literal of course.)
-\1..\4                                   that many literal bytes follow in the code stream
-\5                                       add 4 to the primary operand number (b, low octdigit)
-\6                                       add 4 to the secondary operand number (a, middle octdigit)
-\7                                       add 4 to both the primary and the secondary operand number
-\10..\13                                 a literal byte follows in the code stream, to be added
+	s	"s" register position in /is4
+	w	immediate encoded in the "v" position in vex/evex[3]
+	i	first immediate or mem_offs[5]
+	j	second immediate or mem_offs[6]
+
+[1] currently used even for register operands, even though "b" is an
+    alias in that case.
+[2] this is technically incorrect and should be "b", but that is the
+    way it is currently encoded.
+[3] separate letter code for the benefit of the insns.pl sanity checker.
+[4] currently used mainly when "x" is also used.
+[5] when the modr/m displacement is used as an immediate, it is byte
+    coded as an *address-sized* immediate and uses "i". A seg:offs
+    pair uses "i" for the offset (thus "ji").
+[6] when the modr/m displacement is used as an immediate and
+    another ("true") immediate is present, the "true" immediate uses "j".
+    A seg:offs pair uses "j" for the segment (thus "ji").
+
+
+XX below indicates a hexadecimal byte; NN a decimal number.
+
+Codes            Mnemonic		 Definition
+
+\0               (auto-generated)        end of code sequence (but 0 can be part of a multi-byte
+                                         sequence, so byte codes are NOT null-terminated strings.)
+\1..\4           XX XX...                that many literal bytes follow in the code stream
+\5               (auto-generated)        add 4 to the primary operand number (b, low octdigit)
+\6               (auto-generated)        add 4 to the secondary operand number (a, middle octdigit)
+\7               (auto-generated)        add 4 to both the primary and the secondary operand number
+\10..\13         +r                      a literal byte follows in the code stream, to be added
                                          to the register value of operand 0..3
-\14..\17                                 the position of index register operand in MIB (BND insns)
+\14..\17         (auto-generated)        the position of index register operand in MIB (BND insns)
 \20..\23         ib                      a byte immediate operand, from operand 0..3
 \24..\27         ib,u                    a zero-extended byte immediate operand, from operand 0..3
 \30..\33         iw                      a word immediate operand, from operand 0..3
@@ -54,17 +97,20 @@ Codes            Mnemonic        Explanation
 \171\mab         /mrb (e.g /3r0)         a ModRM, with the reg field taken from operand a, and the m
                                          and b fields set to the specified values.
 \172\ab          /is4                    the register number from operand a in bits 7..4, with
-                                         the 4-bit immediate from operand b in bits 3..0.
-\173\xab                                 the register number from operand a in bits 7..4, with
+                                         the 4-bit immediate from operand b in bits 2..0.
+					 For EVEX- or REX2-encodable instructions, the operand is encoded in
+                                         bits [3:7..4] and the immediate is restricted to 3 bits
+					 unless the register operand is given the rn_l16 operand flag.
+\173\xab         /is4=NN                 the register number from operand a in bits 7..4, with
                                          the value b in bits 3..0.
-\174..\177                               the register number from operand 0..3 in bits 7..4, and
+\174..\177       /is4                    the register number from operand 0..3 in bits 7..4, and
                                          an arbitrary value in bits 3..0 (assembled as zero.)
 \2ab             /b                      a ModRM, calculated on EA in operand a, with the reg
                                          field equal to digit b.
-\240..\243                               this instruction uses EVEX rather than REX or VEX/XOP, with the
+\240..\243       evex.*                  this instruction uses EVEX rather than REX or VEX/XOP, with the
                                          V register number taken from operand "b" (0..3) (which may
 					 be an immediate, as is used for DFV.)
-\250                                     this instruction uses EVEX rather than REX or VEX/XOP, with the
+\250             evex.*                  this instruction uses EVEX rather than REX or VEX/XOP, with the
                                          V register number set to 0 (subject to the XOR as defined
 					 below)
 
@@ -88,10 +134,10 @@ EVEX prefixes are followed by the sequence:
                    (compressed displacement encoding)
 
 \254..\257      id,s                     a signed 32-bit operand to be extended to 64 bits.
-\260..\263                               this instruction uses VEX/XOP rather than REX, with the
+\260..\263      vex.*                    this instruction uses VEX/XOP rather than REX, with the
                                          V register taken from operand "b" 0..3.
 \264..\267	id,u			 an unsigned 32-bit operand to be extended to 64 bits.
-\270                                     this instruction uses VEX/XOP rather than REX, with the
+\270            vex.*                    this instruction uses VEX/XOP rather than REX, with the
                                          V register set to 0.
 VEX/XOP prefixes are followed by the sequence:
 \tmm\wlp	tmm format:	tt 0mm mmm
@@ -112,16 +158,20 @@ VEX/XOP prefixes are followed by the sequence:
 
 t = 0 for VEX (C4/C5), t = 1 for XOP (8F).
 
-\271             hlex                       instruction takes XRELEASE (F3) with or without lock
+		 vex+.*			     instruction is encodable either with VEX or EVEX,
+                                             depending on the operands. Generates multiple
+                                             instruction patterns with different operand encoding
+                                             and byte codes.
+\271             hlex                        instruction takes XRELEASE (F3) with or without lock
 \272             hlenl                       instruction takes XACQUIRE/XRELEASE with or without lock
 \273             hle                         instruction takes XACQUIRE/XRELEASE with lock only
 \274..\277       ib,s                        a byte immediate operand, from operand 0..3, sign-extended
                                              to the operand size (if o16/o32/o64 present) or the bit size
 \300..\303	 ibn			     a valid 0F NOP opcode.
-\304..\307
-	\0\xNN	 ib^NN			     intermediate byte XOR 0xNN
-	\1\xNN	 ib,s^NN		     signed intermediate byte XOR 0xNN
-	\2\xNN	 ib,u^NN		     unsigned intermediate byte XOR 0xNN
+\304..\307				     a byte immediate from operand 0..3, XOR a specific constant.
+	\0\xXX	 ib^XX			     immediate byte XOR 0xXX
+	\1\xXX	 ib,s^XX		     signed immediate byte XOR 0xXX
+	\2\xXX	 ib,u^XX		     unsigned immediate byte XOR 0xXX
 \310             a16                         indicates fixed 16-bit address size, i.e. optional 0x67.
 \311             a32                         indicates fixed 32-bit address size, i.e. optional 0x67.
 \312             adf, asz                    (disassembler only) invalid with non-default address size.
@@ -185,3 +235,7 @@ t = 0 for VEX (C4/C5), t = 1 for XOP (8F).
 \376             vsibz|vm32z|vm64z           this instruction takes an ZMM VSIB memory EA
 
 * No 66 prefix is emitted if combined with VEX/EVEX, np, 66, osp or !osp.
+
+## Local variables:
+## fill-column: 99
+## End
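
Note on the operand-number escape documented in this patch: the following is a minimal sketch, not NASM's actual code, of how a walker over a byte code sequence might treat \0, the literal codes \1..\4, the \5/\6/\7 escapes and the ib codes \20..\23. The function name `walk` and the event tuples are hypothetical. The sketch assumes that an escape applies only to the next byte code and is then cleared; it ignores \6 because none of the modelled codes has a secondary operand.

```python
# Simplified, hypothetical model of a small subset of the byte codes.
# Assumption: a \5/\6/\7 escape affects only the next byte code.

def walk(codes: bytes):
    """Return a list of (kind, value) events for a tiny subset of byte codes."""
    events = []
    i = 0
    add_primary = 0  # becomes 4 after \5 or \7
    while i < len(codes):
        c = codes[i]
        i += 1
        if c == 0o0:
            # End of code sequence. Only reached when 0 is read as a byte
            # code, never while consuming literal bytes.
            break
        elif 0o1 <= c <= 0o4:
            # That many literal bytes follow; they may legitimately be 0x00,
            # which is why the sequence is not a null-terminated string.
            events.append(("literal", codes[i:i + c]))
            i += c
            add_primary = 0
        elif c in (0o5, 0o7):
            # Add 4 to the primary operand number of the next byte code.
            add_primary = 4
        elif c == 0o6:
            # Secondary operand escape: not modelled in this sketch.
            pass
        elif 0o20 <= c <= 0o23:
            # ib: byte immediate; the low two bits select operand 0..3.
            events.append(("ib", (c & 3) + add_primary))
            add_primary = 0
        else:
            raise ValueError(f"byte code \\{c:o} is outside this sketch")
    return events
```

For example, the sequence \2 0F 00 \5 \20 \0 would be read as two literal bytes (the second of which is zero) followed by a byte immediate taken from operand 4.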