8348868: AArch64: Add backend support for SelectFromTwoVector #23570

Open: wants to merge 2 commits into base: master
30 changes: 30 additions & 0 deletions src/hotspot/cpu/aarch64/aarch64.ad
@@ -881,6 +881,16 @@ reg_class vectorx_reg(
V31, V31_H, V31_J, V31_K
);

// Class for vector register V17
reg_class v17_veca_reg(
V17, V17_H, V17_J, V17_K
);

// Class for vector register v18

Contributor

nit: use upper case

Suggested change:
- // Class for vector register v18
+ // Class for vector register V18

reg_class v18_veca_reg(
V18, V18_H, V18_J, V18_K
);

// Class for 128 bit register v0
reg_class v0_reg(
V0, V0_H
@@ -4974,6 +4984,26 @@ operand vReg()
interface(REG_INTER);
%}

operand vReg_V17()
%{
constraint(ALLOC_IN_RC(v17_veca_reg));
match(vReg);

op_cost(0);
format %{ %}
interface(REG_INTER);
%}

operand vReg_V18()
%{
constraint(ALLOC_IN_RC(v18_veca_reg));
match(vReg);

op_cost(0);
format %{ %}
interface(REG_INTER);
%}

operand vecA()
%{
constraint(ALLOC_IN_RC(vectora_reg));
47 changes: 47 additions & 0 deletions src/hotspot/cpu/aarch64/aarch64_vector.ad
@@ -245,6 +245,18 @@ source %{
return false;
}
break;
// The "tbl" instruction for two vector table is supported only in Neon and SVE2. Return
// false if vector length > 16B but supported SVE version < 2.
// For vector length of 16B, generate SVE2 "tbl" instruction if SVE2 is supported, else
// generate Neon "tbl" instruction to select from two vectors.
// Currently, as we support only vector sizes of 8B and 16B, we disable this operation for
// T_LONG and T_DOUBLE on Neon as "mul" does not support 2D arrangement. However, these
// types are supported on machines with UseSVE == 2.
case Op_SelectFromTwoVector:
if (UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) {
return false;
}
break;
default:
break;
}
@@ -7150,3 +7162,38 @@ instruct vexpandBits(vReg dst, vReg src1, vReg src2) %{
%}
ins_pipe(pipe_slow);
%}

// --------------------------------SelectFromTwoVector -----------------------------

instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,

We have a similar rule for VectorRearrange, namely rearrange_HS_neon. For consistency, can we use a similar naming style for this rule?

Suggested change:
- instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,
+ instruct vselect_from_two_vectors_HS_neon(vReg dst, vReg_V17 src1, vReg_V18 src2,

vReg index, vReg tmp1, vReg tmp2) %{
predicate((Matcher::vector_element_basic_type(n) == T_SHORT ||
type2aelembytes(Matcher::vector_element_basic_type(n)) == 4) &&
(UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16));
match(Set dst (SelectFromTwoVector (Binary index src1) src2));
effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2);
format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %}

Please use the same match rule name in the format. Thanks!

ins_encode %{
BasicType bt = Matcher::vector_element_basic_type(this);
uint length_in_bytes = Matcher::vector_length_in_bytes(this);
__ select_from_two_vectors_SIFNeon($dst$$FloatRegister, $src1$$FloatRegister,
$src2$$FloatRegister,$index$$FloatRegister,
$tmp1$$FloatRegister, $tmp2$$FloatRegister,
bt, length_in_bytes);
%}
ins_pipe(pipe_slow);
%}

instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{

Could you please add a comment before the rule explaining why v17 and v18 are used explicitly here?

Contributor

I'm still curious.

Contributor Author

Hi @theRealAph, apologies for the late response. The two-vector tbl instruction requires its source registers to be consecutive. I could not find a way to make the register allocator choose two consecutive registers for this operation, so I decided to hard-code them. As v0-v7 are used for function arguments and v8-v15 are non-volatile, which is not what we want here (we don't want to be preserving these values across function calls), I chose two of the volatile registers from v16-v31 as the source registers. Please let me know if this is the right approach.

Contributor

I suppose it is, yes. Thanks.

predicate(Matcher::vector_element_basic_type(n) == T_BYTE ||
(UseSVE == 2 && Matcher::vector_length_in_bytes(n) >= 16));
match(Set dst (SelectFromTwoVector (Binary index src1) src2));
format %{ "vselect_from_two_vectors $dst, $src1, $src2, $index" %}
ins_encode %{
BasicType bt = Matcher::vector_element_basic_type(this);
uint length_in_bytes = Matcher::vector_length_in_bytes(this);
__ select_from_two_vectors($dst$$FloatRegister, $src1$$FloatRegister, $src2$$FloatRegister,
$index$$FloatRegister, bt, length_in_bytes);
%}
ins_pipe(pipe_slow);
%}
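
As a reading aid for the two match rules above, here is a minimal standalone sketch (not HotSpot code) that mirrors their predicates and the Op_SelectFromTwoVector check in match_rule_supported. The ElemType enum, the function names, and passing UseSVE as a plain integer are assumptions made only for illustration.

```cpp
#include <cassert>

// Hypothetical stand-ins for HotSpot's BasicType values; not the real enum.
enum ElemType { BYTE_T, SHORT_T, INT_T, FLOAT_T, LONG_T, DOUBLE_T };

// Element size in bytes, playing the role of type2aelembytes().
inline int elem_bytes(ElemType t) {
  switch (t) {
    case BYTE_T:  return 1;
    case SHORT_T: return 2;
    case INT_T:
    case FLOAT_T: return 4;
    default:      return 8;  // LONG_T, DOUBLE_T
  }
}

// Mirrors the Op_SelectFromTwoVector case in match_rule_supported.
inline bool op_supported(int use_sve, ElemType t, int length_in_bytes) {
  return !(use_sve < 2 && (elem_bytes(t) == 8 || length_in_bytes > 16));
}

// Mirrors the predicate of vselect_from_two_vectors_SIFNeon.
inline bool matches_sif_neon(int use_sve, ElemType t, int length_in_bytes) {
  return (t == SHORT_T || elem_bytes(t) == 4) &&
         (use_sve < 2 || length_in_bytes < 16);
}

// Mirrors the predicate of vselect_from_two_vectors.
inline bool matches_generic(int use_sve, ElemType t, int length_in_bytes) {
  return t == BYTE_T || (use_sve == 2 && length_in_bytes >= 16);
}
```

Under these assumptions, a 16-byte short vector matches the Neon rule when UseSVE < 2 and the generic (SVE2 tbl) rule when UseSVE == 2, so the two predicates do not overlap for that case.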
47 changes: 47 additions & 0 deletions src/hotspot/cpu/aarch64/aarch64_vector_ad.m4
@@ -235,6 +235,18 @@ source %{
return false;
}
break;
// The "tbl" instruction for two vector table is supported only in Neon and SVE2. Return
// false if vector length > 16B but supported SVE version < 2.
// For vector length of 16B, generate SVE2 "tbl" instruction if SVE2 is supported, else
// generate Neon "tbl" instruction to select from two vectors.
// Currently, as we support only vector sizes of 8B and 16B, we disable this operation for
// T_LONG and T_DOUBLE on Neon as "mul" does not support 2D arrangement. However, these
// types are supported on machines with UseSVE == 2.
case Op_SelectFromTwoVector:
if (UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) {
return false;
}
break;
default:
break;
}
@@ -5132,3 +5144,38 @@ BITPERM(vcompressBits, CompressBitsV, sve_bext)

// ----------------------------------- ExpandBitsV ---------------------------------
BITPERM(vexpandBits, ExpandBitsV, sve_bdep)

// --------------------------------SelectFromTwoVector -----------------------------

instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,
vReg index, vReg tmp1, vReg tmp2) %{
predicate((Matcher::vector_element_basic_type(n) == T_SHORT ||
type2aelembytes(Matcher::vector_element_basic_type(n)) == 4) &&
(UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16));
match(Set dst (SelectFromTwoVector (Binary index src1) src2));
effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2);
format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %}

Contributor

Be careful here. select_from_two_vectors_SIFNeon seems to alter src1, so you need a USE_KILL effect.

Contributor Author

@theRealAph Thanks for the suggestion! It makes sense to add USE_KILL for the src1 usage here. I am running into some errors when I do that. I'll resolve them and get back soon. Thanks!

Contributor

> @theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks!

Maybe that should be USE_DEF or TEMP_DEF.

ins_encode %{
BasicType bt = Matcher::vector_element_basic_type(this);
uint length_in_bytes = Matcher::vector_length_in_bytes(this);
__ select_from_two_vectors_SIFNeon($dst$$FloatRegister, $src1$$FloatRegister,
$src2$$FloatRegister,$index$$FloatRegister,
$tmp1$$FloatRegister, $tmp2$$FloatRegister,
bt, length_in_bytes);
%}
ins_pipe(pipe_slow);
%}

instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{
predicate(Matcher::vector_element_basic_type(n) == T_BYTE ||
(UseSVE == 2 && Matcher::vector_length_in_bytes(n) >= 16));
match(Set dst (SelectFromTwoVector (Binary index src1) src2));
format %{ "vselect_from_two_vectors $dst, $src1, $src2, $index" %}
ins_encode %{
BasicType bt = Matcher::vector_element_basic_type(this);
uint length_in_bytes = Matcher::vector_length_in_bytes(this);
__ select_from_two_vectors($dst$$FloatRegister, $src1$$FloatRegister, $src2$$FloatRegister,
$index$$FloatRegister, bt, length_in_bytes);
%}
ins_pipe(pipe_slow);
%}
10 changes: 10 additions & 0 deletions src/hotspot/cpu/aarch64/assembler_aarch64.hpp
@@ -4294,6 +4294,16 @@ template<typename R, typename... Rx>
Assembler(CodeBuffer* code) : AbstractAssembler(code) {
}

// SVE2 programmable table lookup in two vector table
void sve2_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1,
FloatRegister Zn2, FloatRegister Zm) {
starti;
assert(T != Q, "invalid size");
assert(Zn1->successor() == Zn2, "invalid order of registers");
f(0b00000101, 31, 24), f(T, 23, 22), f(0b1, 21), rf(Zm, 16);
f(0b001010, 15, 10), rf(Zn1, 5), rf(Zd, 0);
}
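
To make the bit layout in sve2_tbl easier to check by hand, the following standalone sketch assembles the same fields into a 32-bit word. It is not the HotSpot emitter: the function name encode_sve2_tbl, the use of plain register numbers, and the size values B=0, H=1, S=2, D=3 are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Size field values, assumed to follow the B/H/S/D order of SIMD_RegVariant.
enum RegVariant : uint32_t { B = 0, H = 1, S = 2, D = 3 };

// Assemble the 32-bit word from the same bit fields as the sve2_tbl() emitter above.
// zd, zn1 and zm are register numbers in [0, 31]. The second table register is
// implied to be the one following zn1 and is not encoded separately.
inline uint32_t encode_sve2_tbl(uint32_t zd, RegVariant t, uint32_t zn1, uint32_t zm) {
  assert(zd < 32 && zn1 < 32 && zm < 32);
  return (0x05u << 24)                      // bits 31..24 = 0b00000101
       | (static_cast<uint32_t>(t) << 22)   // bits 23..22 = element size
       | (1u << 21)                         // bit 21 = 1
       | (zm << 16)                         // bits 20..16 = index register
       | (0x0Au << 10)                      // bits 15..10 = 0b001010
       | (zn1 << 5)                         // bits 9..5 = first table register
       | zd;                                // bits 4..0 = destination register
}
```

With these assumptions, the operands used in the new asmtest row (z16, S, z17, z18, z16) give encode_sve2_tbl(16, S, 17, 16) == 0x05b02a30.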

// Stack overflow checking
virtual void bang_stack_with_offset(int offset);

74 changes: 74 additions & 0 deletions src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp
@@ -2853,3 +2853,77 @@ void C2_MacroAssembler::reconstruct_frame_pointer(Register rtmp) {
add(rfp, sp, framesize - 2 * wordSize);
}
}

void C2_MacroAssembler::select_from_two_vectors_SIFNeon(FloatRegister dst, FloatRegister src1,
                                                        FloatRegister src2, FloatRegister index,
                                                        FloatRegister tmp1, FloatRegister tmp2,
                                                        BasicType bt, unsigned vector_length_in_bytes) {
  assert_different_registers(src1, src2, tmp1, tmp2);
  assert(bt == T_SHORT || bt == T_INT || bt == T_FLOAT, "unsupported basic type");
  assert(vector_length_in_bytes == 8 || vector_length_in_bytes == 16, "unsupported vector length");

  // Neon "tbl" instruction only supports byte tables, so we need to look at chunks of
  // 2B for selecting shorts or chunks of 4B for selecting ints/floats from the table.
  // The index values in "index" register are in the range of [0, 2 * NUM_ELEM) where NUM_ELEM
  // is the number of elements that can fit in a vector. For ex. for T_SHORT with 64-bit vector length,
  // the indices can range from [0, 7].
  // As an example with 64-bit vector length and T_SHORT type - let index = [2, 5, 1, 0]
  // Move a constant 0x02 in every byte of tmp1 - tmp1 = [0x0202, 0x0202, 0x0202, 0x0202]
  // Move a constant 0x0100 in every 2B of tmp2 - tmp2 = [0x0100, 0x0100, 0x0100, 0x0100]
  // Multiply index vector with tmp1 to yield - dst = [0x0404, 0x0a0a, 0x0202, 0x0000]
  // Add the multiplied result to the vector in tmp2 to obtain the byte level
  // offsets - dst = [0x0504, 0x0b0a, 0x0302, 0x0100]
  // Use these offsets in the "tbl" instruction to select chunks of 2B.

  SIMD_Arrangement size1 = vector_length_in_bytes == 16 ? T16B : T8B;
  SIMD_Arrangement size2 = vector_length_in_bytes == 16 ? T8H : T4H;
  if (bt == T_INT || bt == T_FLOAT) {
    size2 = vector_length_in_bytes == 16 ? T4S : T2S;
  }

  switch (bt) {
    case T_SHORT:
      mov(tmp1, size1, 0x02);
      mov(tmp2, size2, 0x0100);
      break;
    case T_INT:
    case T_FLOAT:
      // Similarly, for int/float the index values for the "tbl" instruction are computed to
      // select chunks of 4B for every int/float element
      mov(tmp1, size1, 0x04);
      mov(tmp2, size2, 0x03020100);
      break;
    default:
      ShouldNotReachHere();
  }
  mulv(dst, size2, index, tmp1);
Comment on lines +2898 to +2899

Can we use a vector lsl instead of mul here, so that we can also support D types for NEON/SVE1?

Contributor Author

@XiaohongGong, thanks, I'll give it a try and get back.

@Bhavana-Kilambi, a left shift cannot produce the right indexes here, because the values 0x2 and 0x4 are placed in each B lane. Maybe we can just try bsl for D-size types, as there are only two lanes for long/double types with a 128-bit vector length.

Contributor Author

Hi @XiaohongGong, thanks, but the bsl instruction only has the 8B/16B arrangements, not a D arrangement. I'll see how I can do this with bsl.

Yes, bsl only accepts 8B/16B, but it can also work for other types. We need every lane of the mask to be all 1s or all 0s (e.g. [0xffffffffffffffff, 0x0000000000000000] for the T2D type). You can take the implementation of VectorBlend as a reference.

BTW, I'm currently working on adding vector rearrange support for 2D (i.e. 128-bit long/double vector) types, and I ran into the same issue. I have tested that a pattern using bsl can implement the op. The main idea is to 1) compare the shuffle input with an iota index vector, and 2) use bsl to either keep the src input or swap its two elements, based on the comparison result. Hope this helps!
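
The point about all-ones/all-zeros lanes can be seen in a scalar model of bitwise select. This is a minimal sketch, not HotSpot code: Long2 and bitwise_select are names invented here, and the model only assumes the usual bsl semantics of picking each result bit from one of two inputs according to the mask bit.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Long2 = std::array<uint64_t, 2>;  // models a 128-bit vector as two 64-bit lanes

// Bitwise select: for every bit, take the bit from "if_set" where the mask bit is 1,
// otherwise from "if_clear". Neon "bsl" does this with the mask held in the
// destination register. Because the operation is purely bitwise, it selects whole
// 64-bit lanes as long as each mask lane is all 1s or all 0s.
inline Long2 bitwise_select(const Long2& mask, const Long2& if_set, const Long2& if_clear) {
  Long2 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = (mask[i] & if_set[i]) | (~mask[i] & if_clear[i]);
  }
  return out;
}
```

With the mask [0xffffffffffffffff, 0x0000000000000000], lane 0 comes entirely from if_set and lane 1 entirely from if_clear. A mask lane that is only partly set would mix bits from both inputs, which is why the mask has to be prepared per lane first.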

Contributor Author

@XiaohongGong thank you! I will check it out. Apologies for being so slow in responding (got pulled into something else). I will update this PR with my latest patch soon. Thanks!

Contributor Author

Hi @XiaohongGong, I just got back to working on this PR again!
I have been trying to implement this operation for doubles/longs, but the performance is 0.8x that of the default implementation (two vector rearranges and a vector blend). The bsl-based implementation I used is given below:

    dup(tmp1, T2D, src1, 0);
    dup(tmp2, T2D, src1, 1);

    mov(tmp3, T2D, 0x01);
    andr(tmp4, T16B, index, tmp3);
    negr(tmp4, T2D, tmp4);
    orr(tmp5, T16B, tmp4, tmp4);

    bsl(tmp4, T16B, tmp2, tmp1);

    dup(tmp1, T2D, src2, 0);
    dup(tmp2, T2D, src2, 1);

    bsl(tmp5, T16B, tmp2, tmp1);

    sshr(dst, T2D, index, 1);
    andr(dst, T16B, dst, tmp3);
    negr(dst, T2D, dst);

    bsl(dst, T16B, tmp5, tmp4);

This relies on the fact that the index vector can only contain the values 0 to 3. Bit 0 of an index selects the first (0) or second (1) double/long, and bit 1 selects the source (0 for src1, 1 for src2):
index = 00 -> choose first double/long of src1
index = 01 -> choose second double/long of src1
index = 10 -> choose first double/long of src2
index = 11 -> choose second double/long of src2

I am not able to avoid duplicating the source elements.
Would it be ok if I do not support SelectFromTwoVector for doubles/longs or do you have any suggestion on how I can improve my implementation?
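
The index decoding described in this comment can be written as a scalar reference model. This is only a sketch of the intended result, not the Neon sequence itself; Long2 and select_2d are hypothetical names, and indices are assumed to be in the range 0 to 3.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Long2 = std::array<uint64_t, 2>;  // models a 128-bit vector of two longs

// Reference result for SelectFromTwoVector on 2 x 64-bit lanes.
// Each index is assumed to be in [0, 3]:
//   bit 0 picks the element within a source, bit 1 picks the source.
inline Long2 select_2d(const Long2& src1, const Long2& src2, const Long2& index) {
  Long2 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    const uint64_t lane = index[i] & 1;                 // 0 = first element, 1 = second
    const bool from_src2 = ((index[i] >> 1) & 1) != 0;  // 0 = src1, 1 = src2
    out[i] = from_src2 ? src2[lane] : src1[lane];
  }
  return out;
}
```

For example, with src1 = [10, 11], src2 = [20, 21] and index = [3, 0], the model picks the second element of src2 and the first element of src1.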

Oh, I forgot that we have the blend + rearrange pattern if this op is not supported directly. Since VectorRearrange for 2D has been implemented now, did you check the final codegen of the default pattern? I think we can first revisit the codegen of the default pattern (i.e. VectorBlend + VectorRearrange + VectorRearrange) and see whether there is room for further improvement. If so, we can implement the SelectFromTwoVector op directly based on that improvement. Otherwise, just keeping the default pattern is fine with me.
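
For reference, the idea behind the default pattern (two rearranges followed by a blend) can be sketched on plain arrays. This is an illustrative model under the assumption that indices lie in [0, 2N); it is not the IR that the mid-end actually emits, and Vec, rearrange, blend and select_from_two_lowered are names made up for the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t N>
using Vec = std::array<int, N>;

// Rearrange: out[i] = src[shuffle[i]]; every shuffle[i] must already be in [0, N).
template <std::size_t N>
Vec<N> rearrange(const Vec<N>& src, const Vec<N>& shuffle) {
  Vec<N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    out[i] = src[static_cast<std::size_t>(shuffle[i])];
  }
  return out;
}

// Blend: out[i] = take_b[i] ? b[i] : a[i].
template <std::size_t N>
Vec<N> blend(const Vec<N>& a, const Vec<N>& b, const std::array<bool, N>& take_b) {
  Vec<N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    out[i] = take_b[i] ? b[i] : a[i];
  }
  return out;
}

// Select from the 2N-entry table {src1, src2}; indices are assumed to lie in [0, 2N).
template <std::size_t N>
Vec<N> select_from_two_lowered(const Vec<N>& src1, const Vec<N>& src2, const Vec<N>& index) {
  Vec<N> wrapped{};
  std::array<bool, N> from_src2{};
  for (std::size_t i = 0; i < N; ++i) {
    const std::size_t idx = static_cast<std::size_t>(index[i]);
    wrapped[i] = static_cast<int>(idx % N);  // position inside whichever source is chosen
    from_src2[i] = idx >= N;                 // upper half of the index range means src2
  }
  return blend(rearrange(src1, wrapped), rearrange(src2, wrapped), from_src2);
}
```

Both sources are rearranged with the same wrapped indices, and the blend mask then decides per lane which rearranged value survives.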

Contributor Author

@Bhavana-Kilambi Jun 3, 2025

Hi @XiaohongGong, thanks for the idea. I did check the codegen and saw that the iota vectors were being loaded twice, once for each source vector, which I felt could be eliminated. So I created a separate implementation for SelectFromTwoVector containing the code for both the VectorRearrange and the VectorBlend, as shown below:


    lea(rscratch1,
        ExternalAddress(StubRoutines::aarch64::vector_iota_indices() + 48));
    ldrq(tmp1, rscratch1);
    mov(tmp2, T2D, 0x01);
    andr(tmp3, size1, index, tmp2);
    cm(EQ, tmp3, size2, tmp1, tmp3);
    orr(tmp1, T16B, tmp3, tmp3);
    ext(tmp4, size1, src1, src1, 8);
    ext(tmp5, size1, src2, src2, 8);

    cm(GE, dst, size2, tmp2, index);
    bsl(tmp3, size1, src1, tmp4);
 
    bsl(tmp1, size1, src2, tmp5);

    bsl(dst, size1, tmp3, tmp1);

I have rearranged the instructions and used tmp5 (I could have reused tmp4 in the second ext) to allow for more ILP.

This implementation is certainly better than my previous one, by ~23% for double and 31% for long, but its performance is not much different from the default implementation (VectorRearrange + VectorBlend). For double, the performance is exactly the same, and for long it is 0.97x. I collected some perf numbers for the cases with and without this patch. My implementation executes fewer instructions than the default implementation, but the default implementation has more ILP, which is why its performance is either better than or the same as mine. I feel we can use the default implementation for doubles and longs. WDYT?

That's fine with me. Thanks for your testing! Using the mid-end IR pattern looks better, since it may expose other mid-end optimization opportunities in some cases.

  addv(dst, size1, dst, tmp2); // "dst" now contains the processed index elements
                               // to select a set of bytes (2B/4B) depending on the datatype

  if (vector_length_in_bytes == 8) {
    // We need to fit both the source vectors (src1, src2) in a 128-bit register as the
    // Neon "tbl" instruction supports only looking up 16B vectors and use the Neon "tbl"
    // instruction with one vector lookup
    ins(src1, D, src2, 1, 0);
    tbl(dst, size1, src1, 1, dst);
  } else {
    // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table
    assert(vector_length_in_bytes == 16, "must be");
    tbl(dst, size1, src1, 2, dst);
  }
}
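
The index arithmetic described in the comments of select_from_two_vectors_SIFNeon can be checked with a scalar model of the T_SHORT, 64-bit case. This is a sketch for illustration only: Short4, expand_indices and select_from_two are hypothetical names, and the byte table is laid out explicitly in little-endian order to match how AArch64 stores lanes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Short4 = std::array<uint16_t, 4>;  // models a 64-bit vector of four shorts

// Step 1: turn element indices into per-byte offsets, as mulv + addv do for T_SHORT.
// Each 16-bit lane becomes ((2*i + 1) << 8) | (2*i), the two byte positions of element i.
inline Short4 expand_indices(const Short4& index) {
  Short4 out{};
  for (std::size_t lane = 0; lane < out.size(); ++lane) {
    out[lane] = static_cast<uint16_t>(index[lane] * 0x0202u + 0x0100u);
  }
  return out;
}

// Step 2: byte-wise lookup in the 16-byte table {src1, src2}, like a one-register
// Neon "tbl" after src2 has been inserted into the upper half of src1.
inline Short4 select_from_two(const Short4& src1, const Short4& src2, const Short4& index) {
  // Lay out both sources as one little-endian table: src1 in bytes 0..7, src2 in 8..15.
  std::array<uint8_t, 16> table{};
  for (std::size_t i = 0; i < 4; ++i) {
    table[2 * i]         = static_cast<uint8_t>(src1[i] & 0xff);
    table[2 * i + 1]     = static_cast<uint8_t>(src1[i] >> 8);
    table[8 + 2 * i]     = static_cast<uint8_t>(src2[i] & 0xff);
    table[8 + 2 * i + 1] = static_cast<uint8_t>(src2[i] >> 8);
  }
  auto lookup = [&table](uint8_t byte_index) -> uint8_t {
    return byte_index < table.size() ? table[byte_index] : 0;  // "tbl" yields 0 when out of range
  };
  const Short4 offsets = expand_indices(index);
  Short4 result{};
  for (std::size_t lane = 0; lane < 4; ++lane) {
    const uint8_t lo = lookup(static_cast<uint8_t>(offsets[lane] & 0xff));
    const uint8_t hi = lookup(static_cast<uint8_t>(offsets[lane] >> 8));
    result[lane] = static_cast<uint16_t>(lo | (hi << 8));
  }
  return result;
}
```

For index = [2, 5, 1, 0], expand_indices returns [0x0504, 0x0b0a, 0x0302, 0x0100]: element 5 occupies bytes 10 and 11 of the combined table, so its lane holds 0x0b0a.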

void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1,
                                                FloatRegister src2, FloatRegister index,
                                                BasicType bt, unsigned vector_length_in_bytes) {
  if (bt == T_BYTE && vector_length_in_bytes == 8) {
    ins(src1, D, src2, 1, 0);
    tbl(dst, T8B, src1, 1, index);
  } else if (bt == T_BYTE && vector_length_in_bytes == 16 && UseSVE < 2) {
    tbl(dst, T16B, src1, 2, index);
  } else {
    assert(UseSVE == 2, "must be sve2");
    SIMD_RegVariant size = elemType_to_regVariant(bt);
    sve2_tbl(dst, size, src1, src2, index);
  }
}
9 changes: 8 additions & 1 deletion src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.hpp
@@ -188,9 +188,16 @@
void vector_signum_sve(FloatRegister dst, FloatRegister src, FloatRegister zero,
FloatRegister one, FloatRegister vtmp, PRegister pgtmp, SIMD_RegVariant T);

- void verify_int_in_range(uint idx, const TypeInt* t, Register val, Register tmp);
+ void verify_int_in_range(uint idx, const TypeInt* t, Register val, Register tmp);
void verify_long_in_range(uint idx, const TypeLong* t, Register val, Register tmp);

void reconstruct_frame_pointer(Register rtmp);

// Select from a table of two vectors
void select_from_two_vectors_SIFNeon(FloatRegister dst, FloatRegister src1, FloatRegister src2,
FloatRegister index, FloatRegister tmp1, FloatRegister tmp2,
BasicType bt, unsigned length_in_bytes);

void select_from_two_vectors(FloatRegister dst, FloatRegister src1, FloatRegister src2,
FloatRegister index, BasicType bt, unsigned length_in_bytes);
#endif // CPU_AARCH64_C2_MACROASSEMBLER_AARCH64_HPP
5 changes: 4 additions & 1 deletion src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -2739,6 +2739,9 @@ bool LibraryCallKit::inline_vector_select_from_two_vectors() {
index_elem_bt = T_LONG;
}

// Check if the platform requires a VectorLoadShuffle node to be generated
bool need_load_shuffle = Matcher::vector_rearrange_requires_load_shuffle(index_elem_bt, num_elem);

bool lowerSelectFromOp = false;
if (!arch_supports_vector(Op_SelectFromTwoVector, num_elem, elem_bt, VecMaskNotUsed)) {
int cast_vopc = VectorCastNode::opcode(-1, elem_bt, true);
@@ -2748,7 +2751,7 @@ bool LibraryCallKit::inline_vector_select_from_two_vectors() {
!arch_supports_vector(Op_VectorMaskCast, num_elem, elem_bt, VecMaskNotUsed) ||
!arch_supports_vector(Op_VectorBlend, num_elem, elem_bt, VecMaskUseLoad) ||
!arch_supports_vector(Op_VectorRearrange, num_elem, elem_bt, VecMaskNotUsed) ||
- !arch_supports_vector(Op_VectorLoadShuffle, num_elem, index_elem_bt, VecMaskNotUsed) ||
+ (need_load_shuffle && !arch_supports_vector(Op_VectorLoadShuffle, num_elem, index_elem_bt, VecMaskNotUsed)) ||
!arch_supports_vector(Op_Replicate, num_elem, index_elem_bt, VecMaskNotUsed)) {
log_if_needed(" ** not supported: opc=%d vlen=%d etype=%s ismask=useload",
Op_SelectFromTwoVector, num_elem, type2name(elem_bt));
1 change: 1 addition & 0 deletions test/hotspot/gtest/aarch64/aarch64-asmtest.py
@@ -2087,6 +2087,7 @@ def generate(kind, names):
["index", "__ sve_index(z7, __ D, r5, 5);", "index\tz7.d, x5, #5"],
["cpy", "__ sve_cpy(z7, __ H, p3, r5);", "cpy\tz7.h, p3/m, w5"],
["tbl", "__ sve_tbl(z16, __ S, z17, z18);", "tbl\tz16.s, {z17.s}, z18.s"],
["tbl", "__ sve2_tbl(z16, __ S, z17, z18, z16);", "tbl\tz16.s, {z17.s, z18.s}, z16.s"],
["ld1w", "__ sve_ld1w_gather(z15, p0, r5, z16);", "ld1w\t{z15.s}, p0/z, [x5, z16.s, uxtw #2]"],
["ld1d", "__ sve_ld1d_gather(z15, p0, r5, z16);", "ld1d\t{z15.d}, p0/z, [x5, z16.d, uxtw #3]"],
["st1w", "__ sve_st1w_scatter(z15, p0, r5, z16);", "st1w\t{z15.s}, p0, [x5, z16.s, uxtw #2]"],