//! 32-way packed bitsliced SM4 S-box (v0.6 W6).
//!
//! Public entry point: [`sbox_x32`]. Operates on 32 independent
//! S-box inputs packed as `[u8; 32]`, returning `[u8; 32]`. The
//! intended consumer is the 8-block batched CBC-decrypt fanout in
//! `gmcrypto_core::sm4::cbc_streaming::Sm4CbcDecryptor::process_chunk`:
//! 8 SM4 blocks × 4 `tau` bytes per round = 32 bytes per call, with
//! no wasted lanes (vs phase 2's [`super::sbox_x8`], which uses only
//! 8 of the 32 lanes per round).
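//!
//! The lane arithmetic above can be illustrated with a minimal
//! sketch. It is not taken from this crate: `pack_tau_inputs` and
//! `unpack_tau_outputs` are hypothetical helper names, and the lane
//! order (block-major, big-endian bytes within each 32-bit `tau`
//! input word) is an assumption made only for the example.
//!
//! ```rust
//! /// Hypothetical helper: spread the 32-bit `tau` input word of each
//! /// of 8 blocks across 32 byte lanes (4 lanes per block).
//! /// Lane order is an assumption of this sketch.
//! fn pack_tau_inputs(words: &[u32; 8]) -> [u8; 32] {
//!     let mut lanes = [0u8; 32];
//!     for (chunk, w) in lanes.chunks_exact_mut(4).zip(words.iter()) {
//!         chunk.copy_from_slice(&w.to_be_bytes());
//!     }
//!     lanes
//! }
//!
//! /// Hypothetical inverse of `pack_tau_inputs`: rebuild the 8 words
//! /// from the 32 byte lanes after the S-box has been applied.
//! fn unpack_tau_outputs(lanes: &[u8; 32]) -> [u32; 8] {
//!     let mut words = [0u32; 8];
//!     for (w, chunk) in words.iter_mut().zip(lanes.chunks_exact(4)) {
//!         *w = u32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
//!     }
//!     words
//! }
//! ```
//!
//! With this layout every one of the 32 lanes carries a live byte,
//! which is the "no wasted lanes" property claimed for the 8-block
//! batch.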
//!
//! # Dispatch
//!
//! - On `x86_64` with AVX2 available at runtime: [`sbox_x32_avx2`],
//!   the full 32-byte AVX2 path. It runs the same shared gate
//!   sequence as [`super::sbox_x8`] ([`super::avx2::sbox_round`]);
//!   the only difference is that there is no staging-buffer overhead.
//! - Elsewhere (non-x86_64, or x86_64 without AVX2): falls back to
//!   [`sbox_x32_scalar`], a 32-iteration loop calling the local
//!   single-byte [`super::scalar::sbox_byte`]. Designed so the
//!   non-AVX2 fallback is not slower than calling
//!   [`super::sbox_x8_scalar`] four times (codex flag #1 from the
//!   v0.6 W6 phase 3 scope consultation).
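//!
//! A minimal sketch of the dispatch shape described in this section,
//! not the crate's implementation. Assumptions: `sbox_byte_stand_in`
//! is a hypothetical placeholder (a plain XOR, not the SM4 S-box),
//! the `*_sketch` names are illustrative, and runtime detection uses
//! std's `is_x86_feature_detected!` rather than the cached
//! `cpufeatures` check this module relies on.
//!
//! ```rust
//! /// Hypothetical stand-in for `super::scalar::sbox_byte`.
//! /// NOT the SM4 S-box: a branch-free placeholder so the sketch
//! /// stays self-contained.
//! fn sbox_byte_stand_in(x: u8) -> u8 {
//!     x ^ 0x5a
//! }
//!
//! /// Scalar fallback shape: a single 32-iteration loop over all lanes.
//! fn sbox_x32_scalar_sketch(input: &[u8; 32]) -> [u8; 32] {
//!     let mut out = [0u8; 32];
//!     for (o, &i) in out.iter_mut().zip(input.iter()) {
//!         *o = sbox_byte_stand_in(i);
//!     }
//!     out
//! }
//!
//! /// AVX2 shape: load 32 bytes into one register, transform, store.
//! ///
//! /// # Safety
//! ///
//! /// Caller must guarantee the host CPU supports AVX2.
//! #[cfg(target_arch = "x86_64")]
//! #[target_feature(enable = "avx2")]
//! unsafe fn sbox_x32_avx2_sketch(input: &[u8; 32]) -> [u8; 32] {
//!     use std::arch::x86_64::{
//!         __m256i, _mm256_loadu_si256, _mm256_set1_epi8, _mm256_storeu_si256,
//!         _mm256_xor_si256,
//!     };
//!     let mut out = [0u8; 32];
//!     // Unaligned load/store: `[u8; 32]` has no 32-byte alignment guarantee.
//!     let v = _mm256_loadu_si256(input.as_ptr().cast::<__m256i>());
//!     // Placeholder for the shared gate sequence: same XOR as the stand-in.
//!     let r = _mm256_xor_si256(v, _mm256_set1_epi8(0x5a));
//!     _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>(), r);
//!     out
//! }
//!
//! /// Dispatch shape: AVX2 when detected at runtime, scalar otherwise.
//! fn sbox_x32_sketch(input: &[u8; 32]) -> [u8; 32] {
//!     #[cfg(target_arch = "x86_64")]
//!     {
//!         if is_x86_feature_detected!("avx2") {
//!             // SAFETY: AVX2 support was verified on the line above.
//!             return unsafe { sbox_x32_avx2_sketch(input) };
//!         }
//!     }
//!     sbox_x32_scalar_sketch(input)
//! }
//! ```
//!
//! Both paths must agree byte for byte on every input, so the scalar
//! function can serve as the reference when testing the AVX2 one.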
//!
//! # Constant-time discipline
//!
//! Same as [`super::sbox_x8`]: shared AVX2 gate sequence (no table
//! lookups, no secret-derived branches); scalar path is the same
//! gate-only `sbox_byte` from [`super::scalar`].
use super::scalar::sbox_byte;
use cratehas_avx2;
/// Scalar fallback: 32 sequential calls into
/// [`super::scalar::sbox_byte`]. Always available.
/// 32-way packed bitsliced SM4 S-box dispatch.
///
/// On `x86_64` with AVX2: calls [`sbox_x32_avx2`]. Otherwise
/// [`sbox_x32_scalar`].
///
/// Byte-identical output to applying [`super::scalar::sbox_byte`]
/// to each input byte (verified exhaustively in
/// `tests/lane_position_x32.rs` with lane-position-shifted sweeps
/// per Q6.8 / codex's phase 3 flag #4).
// ============================================================
// x86_64 AVX2 path
// ============================================================
use ;
/// AVX2 byte-parallel SM4 S-box on 32 independent inputs (full
/// `__m256i` register width).
///
/// Loads the 32 input bytes into one AVX2 register, runs the shared
/// AVX2 gate sequence from [`super::avx2`], stores back to a 32-byte
/// output buffer. No staging-buffer overhead vs [`super::sbox_x8`].
///
/// # Safety
///
/// Caller must guarantee the host CPU supports AVX2. The public
/// entry point [`sbox_x32`] verifies this via [`has_avx2`] (cached
/// `cpufeatures` check) before calling.
pub unsafe