Per-instance allocation for max_n, max_batch (B):
WORKING STORAGE:
A_work : [B, max_n, max_n] # working copy (destroyed)
V_accum : [B, max_n, max_n] # eigenvector accumulator
householder : [max_n-2, B, max_n] # stored reflectors (padded)
d : [B, max_n] # tridiagonal diagonal
e : [B, max_n-1] # tridiagonal off-diagonal
Subtotal: ~3 × max_n² × B floats
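The subtotal can be checked by summing the five shapes listed above. This is a minimal sketch: the function name `working_storage_floats` is hypothetical, and it only counts elements rather than allocating anything.

```python
from math import prod

def working_storage_floats(max_n: int, B: int) -> int:
    """Count elements in the working-storage arrays (hypothetical helper)."""
    shapes = {
        "A_work": (B, max_n, max_n),           # working copy
        "V_accum": (B, max_n, max_n),          # eigenvector accumulator
        "householder": (max_n - 2, B, max_n),  # stored reflectors (padded)
        "d": (B, max_n),                       # tridiagonal diagonal
        "e": (B, max_n - 1),                   # tridiagonal off-diagonal
    }
    return sum(prod(shape) for shape in shapes.values())

exact = working_storage_floats(64, 512)
approx = 3 * 64**2 * 512
print(exact, approx, exact / approx)
```

The exact count works out to (3 × max_n² − 1) × B, so the ~3 × max_n² × B approximation is tight for any practical max_n.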
D&C TREE (depth = ⌈log₂(max_n)⌉ levels):
FOR each level l (0 to depth-1):
num_sub = 2^l
sub_size = max_n // 2^l (padded up to power of 2)
delta : [B, num_sub, sub_size] # merged eigenvalues
z_vec : [B, num_sub, sub_size] # merge vectors
rho : [B, num_sub] # coupling strengths
mask : [B, num_sub, sub_size] # valid element mask
# Newton state (per root):
lam : [B, num_sub, sub_size] # current root estimates
lo : [B, num_sub, sub_size] # bracket lower
hi : [B, num_sub, sub_size] # bracket upper
f_val : [B, num_sub, sub_size] # secular function value
converge : [B, num_sub, sub_size] # convergence mask
# Eigenvector fragments:
V_frag : [B, num_sub, sub_size, sub_size]
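Subtotal per level: (8 × sub_size + 1 + sub_size²) × num_sub × B
Total across levels: num_sub × sub_size = max_n at every level, so
vectors ≈ 8 × max_n × depth × B
V_frags = max_n × sub_size × B per level, halving with each level,
< 2 × max_n² × B in total (max_n a power of 2)
≈ (8 × max_n × depth + 2 × max_n²) × B
(max_n² × depth × B overestimates the V_frag term whenever depth ≥ 2)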
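The per-level sizes can also be tallied directly from the shapes in the listing. The sketch below makes two assumptions: the helper names are hypothetical, and "padded up to power of 2" is read as rounding `max_n // 2^l` up to the next power of two (minimum 1). It counts eight per-root vectors, one `rho` per subproblem, and one `V_frag` block per subproblem.

```python
def next_pow2(x: int) -> int:
    """Smallest power of two >= x, for x >= 1."""
    return 1 << (x - 1).bit_length()

def tree_floats_per_level(max_n: int, B: int) -> list:
    """Element count of the D&C tree buffers at each level (hypothetical helper)."""
    depth = (max_n - 1).bit_length()  # equals ceil(log2(max_n)) for max_n >= 1
    counts = []
    for l in range(depth):
        num_sub = 2 ** l
        # Assumed padding rule: round the subproblem size up to a power of two.
        sub_size = next_pow2(max(1, max_n // 2 ** l))
        vectors = 8 * sub_size   # delta, z_vec, mask, lam, lo, hi, f_val, converge
        rho = 1                  # one coupling strength per subproblem
        v_frag = sub_size ** 2   # eigenvector fragment block
        counts.append((vectors + rho + v_frag) * num_sub * B)
    return counts

per_level = tree_floats_per_level(8, 1)
print(per_level, sum(per_level))
```

Under these assumptions the V_frag contribution shrinks from level to level (64, 32, 16 elements per batch entry for max_n = 8), while the vector contribution stays constant at 8 × max_n.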
CONCRETE NUMBERS (fp32, 4 bytes each; rough estimate max_n² × depth × 3 × B × 4 bytes, 1 MB = 2^20 bytes):
max_n=8, B=4096: ~8² × 3 × 3 × 4096 × 4 ≈ 9 MB
max_n=32, B=1024: ~32² × 5 × 3 × 1024 × 4 ≈ 60 MB
max_n=64, B=512: ~64² × 6 × 3 × 512 × 4 ≈ 144 MB
max_n=128, B=256: ~128² × 7 × 3 × 256 × 4 ≈ 336 MB
max_n=256, B=128: ~256² × 8 × 3 × 128 × 4 ≈ 768 MB
max_n=6, B=8192: ~6² × 3 × 3 × 8192 × 4 ≈ 10 MB ← your CM case
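The rows above all follow the same pattern, max_n² × depth × 3 × B × 4 bytes with depth = ⌈log₂(max_n)⌉, so they can be regenerated for other sizes. This is a rough sketch of that estimate only: `estimate_mib` is a hypothetical name, and the result is a budget figure, not a measured allocation.

```python
def estimate_mib(max_n: int, B: int, bytes_per_elem: int = 4) -> float:
    """Rough memory estimate in MiB, following the pattern used in the table."""
    depth = (max_n - 1).bit_length()  # equals ceil(log2(max_n)) for max_n >= 1
    return max_n**2 * depth * 3 * B * bytes_per_elem / 2**20

cases = [(8, 4096), (32, 1024), (64, 512), (128, 256), (256, 128), (6, 8192)]
for max_n, B in cases:
    print(f"max_n={max_n:>3}, B={B:>4}: {estimate_mib(max_n, B):7.1f} MiB")
```

Passing `bytes_per_elem=8` gives the corresponding fp64 budget.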