//! 8-way packed bitsliced SM4 S-box (v0.5 W4 phase 2).
//!
//! Public entry point: [`sbox_x8`]. Operates on 8 independent S-box
//! inputs packed as `[u8; 8]`, returning `[u8; 8]`. Internally
//! dispatches to one of two paths:
//!
//! - `sbox_x8_avx2` (x86_64 only, guarded by runtime AVX2 detection
//!   via [`crate::has_avx2`]): translates the v0.4 W3 single-block
//!   Itoh-Tsujii gate sequence to byte-parallel AVX2 intrinsics on
//!   `__m256i`. Only the low 8 bytes of the 256-bit register carry
//!   real data; the upper 24 bytes are unused. v0.6 W6 added
//!   [`super::sbox_x32`], which uses the full 32-byte width for the
//!   `Sm4CbcDecryptor` batch fanout path.
//! - [`sbox_x8_scalar`] (always available): calls the local
//!   single-block `super::scalar::sbox_byte` 8 times.
//!
//! # Algorithm — re-implementation note
//!
//! The Boyar-Peralta GF(2^8) Itoh-Tsujii gate sequence is duplicated
//! between this crate and `gmcrypto_core::sm4::sbox_bitsliced` rather
//! than shared by widening the helpers' `pub(crate)` visibility.
//! CLAUDE.md pins "Don't expose the bitsliced helpers publicly", so
//! this sibling crate carries its own copy (in [`super::scalar`]), and
//! `tests/lane_equivalence.rs` cross-checks both paths against the
//! public GB/T 32907-2016 §6.2 S-box table.
//!
//! # Constant-time discipline
//!
//! Both paths are constant-time by construction. The AVX2 path uses
//! `_mm256_*` intrinsics with publicly-fixed loop counts; no table
//! lookups, no secret-dependent branches, no `_mm256_shuffle_*`
//! against secret-derived indices. The scalar path's gate sequence
//! mirrors the v0.4 W3 single-block bitslice already gated by the
//! existing `ct_sm4_encrypt_block_bitsliced` dudect target.
use super::scalar::sbox_byte;
use crate::has_avx2;
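
// ------------------------------------------------------------
// Illustrative sketch (not part of the original module)
// ------------------------------------------------------------
// The module docs above describe a runtime dispatch between an AVX2
// path that stages 8 bytes into the low lanes of a 256-bit register
// and a scalar fallback that makes 8 sequential calls. The module
// below is a minimal, self-contained sketch of that pattern only.
//
// Assumptions, all hypothetical:
// - Every `toy_*` name is invented for illustration.
// - `toy_gate` (XOR with 0xA5) is a stand-in for the real gate
//   sequence. It is NOT the SM4 S-box.
// - `std::arch::is_x86_feature_detected!` is swapped in for the
//   cached `cpufeatures` check behind `has_avx2`, so the sketch
//   needs `std`.
#[allow(dead_code)]
mod dispatch_sketch {
    /// Stand-in byte transform (hypothetical; NOT the SM4 S-box).
    fn toy_gate(b: u8) -> u8 {
        b ^ 0xA5
    }

    /// Scalar path: one call per lane, with a fixed trip count of 8.
    pub fn toy_x8_scalar(input: [u8; 8]) -> [u8; 8] {
        let mut out = [0u8; 8];
        for (o, &b) in out.iter_mut().zip(input.iter()) {
            *o = toy_gate(b);
        }
        out
    }

    /// AVX2 path: stage the 8 input bytes into the low 64 bits of a
    /// 256-bit register, apply the stand-in gate to all 32 byte
    /// lanes at once, and read back only the low 8 bytes.
    ///
    /// # Safety
    ///
    /// Caller must guarantee the host CPU supports AVX2.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    pub unsafe fn toy_x8_avx2(input: [u8; 8]) -> [u8; 8] {
        use core::arch::x86_64::{
            _mm256_extract_epi64, _mm256_set1_epi8, _mm256_set_epi64x, _mm256_xor_si256,
        };
        // The last argument of `_mm256_set_epi64x` is the lowest
        // 64-bit element; the upper 24 bytes are zero-filled here.
        let staged = _mm256_set_epi64x(0, 0, 0, i64::from_le_bytes(input));
        let mixed = _mm256_xor_si256(staged, _mm256_set1_epi8(0xA5u8 as i8));
        // Only element 0 (the low 8 bytes) is ever read out.
        _mm256_extract_epi64::<0>(mixed).to_le_bytes()
    }

    /// Dispatch: the branch depends only on CPU features, never on
    /// the input bytes, so it is not a secret-dependent branch.
    pub fn toy_x8(input: [u8; 8]) -> [u8; 8] {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                // SAFETY: AVX2 support was verified on the line above.
                return unsafe { toy_x8_avx2(input) };
            }
        }
        toy_x8_scalar(input)
    }

    /// Exhaustive check in the spirit of the lane-equivalence test
    /// mentioned in the docs: every byte value in every lane must
    /// give the same result on the dispatched and scalar paths.
    pub fn toy_lanes_agree() -> bool {
        (0u8..=255).all(|b| {
            (0..8).all(|lane| {
                let mut input = [0u8; 8];
                input[lane] = b;
                toy_x8(input) == toy_x8_scalar(input)
            })
        })
    }
}
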
/// Scalar fallback: 8 sequential calls into
/// `super::scalar::sbox_byte`. Always available; selected at
/// runtime when AVX2 is not present.
/// 8-way packed bitsliced SM4 S-box dispatch.
///
/// On x86_64 with AVX2 available at runtime, calls
/// `sbox_x8_avx2`. Otherwise delegates to [`sbox_x8_scalar`].
///
/// Byte-identical output to the v0.4 W3 single-block bitslice for
/// every input byte across every lane (verified exhaustively in
/// `tests/lane_equivalence.rs`).
// ============================================================
// x86_64 AVX2 path
// ============================================================
use ;
/// AVX2 byte-parallel SM4 S-box on 8 independent inputs.
///
/// Stages 8 input bytes into the low 8 lanes of a 256-bit register
/// (upper 24 lanes carry junk but are never read out) and runs the
/// shared AVX2 gate sequence from [`super::avx2`].
///
/// # Safety
///
/// Caller must guarantee the host CPU supports AVX2. The public
/// entry point [`sbox_x8`] verifies this via [`has_avx2`] (cached
/// `cpufeatures` check) before calling.
pub unsafe