[MLIR][RFC]: introduce linalg.contract

Motivation

Based on prior discussions, e.g. here and here, a linalg.contract op would be a worthwhile addition to the arsenal of Linalg named ops. The main benefit is to be able to directly represent contractions, especially ones not covered by current ops - e.g. matmul-like ops which employ arbitrary transposes/permuted dims and/or have high-dimensional operands - without needing to go to linalg.generics. Our main motivation is that contractions represented as linalg.contract (rather than linalg.generic) allow for optimizations to be more easily expressed, e.g. applying/undoing arbitrary transposes, and for straightforward lowerings, in particular to vector.contract.

A contraction op in Linalg is also a (standalone) piece of the Linalg operation tree puzzle.

Limitations of status quo

  • We have a wild growth of matmul-like ops that we wish to halt (& eventually prune): linalg.matmul(_transpose_a|_transpose_b)?, linalg.batch_matmul(_transpose_a|_transpose_b)?, linalg.batch_reduce_matmul, linalg.matvec, linalg.vecmat, linalg.batch_matvec, linalg.batch_vecmat, linalg.mmt4d, linalg.batch_mmt4d, linalg.dot

    • Though there is progress (see, e.g., here), these ops do not allow all the possible dim permutations people care about (e.g. selecting the contracting dim(s) in a batch_(reduce_)matmul).
    • Ever more specific ops do not give a path to a general contraction, one supporting arbitrary broadcasts and transposes on operands with an arbitrary number of dims.
    • The availability of an op of appropriate generality would mean the introduction of further named ops similar to the ones above would need much stronger justification.
  • linalg.generic is too general for this class of einsum-like ops:

    • The guarantee that an op’s indexing_maps are restricted to projected permutations means we know that a number of transforms can be applied without reservation.
      • For example, transposes and broadcasts can be applied to/folded into such a contraction op itself, without going to linalg.generic.
      • The more gradual lowering to a contraction op (vs. linalg.generic) allows for, e.g., permutation decisions to be easily amended after the packing transform.
    • In practice, matching linalg.generics to lower to vector.contract can be a pain – a suitable contraction op would have a straightforward and efficient “canonical” lowering to vector.contract that always applies.
    • Named ops, at the right abstraction level, allow for encoding invariants like op-is-a-contraction into the IR, which can, e.g.,
      • be a convenient anchor for hero op matching;
      • be used by tools (like the Transform dialect) to prove transforms, and compositions thereof, are valid/well-defined for every well-typed input.

Proposal: introduce linalg.contract

In essence: vector.contract but at the Linalg level.

Syntax

contract-op ::= `linalg.contract`
                  `indexing_maps` `=` `[` affine-map+ `]`
                  (`iterator_types` `=` `[` ( `parallel` | `reduction` )+ `]`)?
                  (`kind` `=` reduction-op)?
                  `ins(` $A `,` $B `:` tensor-or-memref-type `,` tensor-or-memref-type `)`
                  `outs(` $C `:` tensor-or-memref-type `)`

reduction-op ::= `#linalg.kind` `<` reduction-op-kind `>`
reduction-op-kind ::= `add` | `mul` | `minui` | … | `minimumf`
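
For illustration, a matmul-like contraction over two reduction dims (k0, k1) with a “transposed” A operand - a case no current named op covers - might be written as follows (a sketch only; SSA names and shapes are illustrative and the final assembly format is up to the implementation):

// Contract over k0 and k1; iterator_types and kind are omitted: the iterator
// types are inferable from the maps and the reduction kind defaults to add.
linalg.contract
    indexing_maps = [affine_map<(m, n, k0, k1) -> (k0, m, k1)>,
                     affine_map<(m, n, k0, k1) -> (k0, k1, n)>,
                     affine_map<(m, n, k0, k1) -> (m, n)>]
    ins(%A, %B : memref<8x64x16xf32>, memref<8x16x32xf32>)
    outs(%C : memref<64x32xf32>)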

The verifier checks

  1. the indexing_maps attribute consists of affine_maps which
    1. are projected permutations;
    2. encode a valid contraction - reducing at least one dimension - w.r.t. the ins operands’ types; and
    3. map the outs operand’s dims to a subset of the dims of the ins operands;
  2. if provided, iterator_types matches the iterator types implied by the projected permutation maps.

As with vector.contract, see docs,

  1. the optional kind attribute controls which operator is used for reducing/combining - it defaults to standard addition;
  2. a dim that only occurs in A or B, but not in the output, is a “free” dimension, one over which to reduce.

Semantics

Einsum semantics per

C[J] = ⊕_{(I^A ∪ I^B) ∖ J} A[I^A] · B[I^B]

where I^A, I^B, and J are multi-indices, i.e. sequences/ordered sets of dimension identifiers (meant to range over index ranges) corresponding to the co-domains of the respective affine_maps, ⊕ is the selected kind of reduction operation, and ⊕_{dims} means reduce over all valid indices for the dimensions in the set dims (NB: per the verifier, dims cannot be empty).

Example: for matmul we have I^A = ⟨ m, k ⟩, I^B = ⟨ k, n ⟩, J = ⟨ m, n ⟩ and ⊕ is normal addition/summation.
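
To make this concrete (a sketch, assuming the usual generalization scheme of recent named ops), the matmul instance above corresponds to a linalg.generic whose body spells out the multiply and the ⊕ = add reduction:

// Matmul as the equivalent linalg.generic: I^A = (m, k), I^B = (k, n),
// J = (m, n); the reduction over k is carried by the "reduction" iterator.
#mapA = affine_map<(m, n, k) -> (m, k)>
#mapB = affine_map<(m, n, k) -> (k, n)>
#mapC = affine_map<(m, n, k) -> (m, n)>
%0 = linalg.generic
       {indexing_maps = [#mapA, #mapB, #mapC],
        iterator_types = ["parallel", "parallel", "reduction"]}
       ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
       outs(%C : tensor<?x?xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %p = arith.mulf %a, %b : f32
    %s = arith.addf %c, %p : f32
    linalg.yield %s : f32
} -> tensor<?x?xf32>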

As with all recent linalg named ops, and per vector.contract’s docs: “Numeric casting is performed on the operands to the inner multiply, promoting them to the same data type as the accumulator/output.”

Design choices (+ pros & cons of alternatives)

1. Require at least one contraction/reduction dimension

Rationale: is what vector.contract does; restrict generality of op

Alternative: allow elementwise products like outerproduct

Pros:

  • Gain ability to represent all binary einsums
  • Some strategies for rewriting trees of contractions/einsums convert contractions into elementwise “contractions” and back, hence it might be desirable to have one op to represent all intermediate states – if one op is desired, a separate einsum op would probably be a more appropriate solution.

Cons:

  • Would be yet another way to represent elementwise products (vs. linalg.mul, linalg.elementwise_binary, and linalg.generics), which certainly complicates matching.
  • Lose property that each linalg.contract can be lowered to a vector.contract.
  • Would mean linalg.contract cannot implement ContractionOpInterface, as it requires that a contraction op “has at least one reduction dimension”, as the name suggests.

2. affine_maps to encode projected dim permutations

Rationale: is what vector.contract and recent linalg ops use

Alternative: specify reduction/contraction dims and permutation of dims as separate attributes (e.g. two arrays)

Pros:

  • On the face of it more separation of concerns, though changing the permutation array could necessitate changing the reduction/contraction dims array.

Cons:

  • After long discussions, transposes on linalg.matmul were merged with an indexing_maps attribute instead of an attribute encoding projected permutations some other way.
  • Per the ContractionOpInterface: “Has only projected permutation indexing maps”, which, to be fair, could still be derived from the array attributes.
  • For lowering to vector.contract we would need to infer the corresponding affine_maps anyway.

3. iterator_types attribute is optional

Rationale: a middle ground between needing to do inference in most (though not all) cases and the ability to opt in to verification.

Alternative: require iterator_types to be provided

Pros:

  • As currently implemented, vector.contract requires the attribute, so for lowering to vector.contract you need it anyway.
  • The attribute serves as a cache, and the (linear-scan) inference only needs to run when the verifier does.

Cons:

  • Unlike for linalg.generic, iterator_types can always be inferred.
  • Consensus was against inclusion of iterator_types when transposes on linalg.matmul (which uses indexing_maps to permute dims) got merged recently.

Alternative: no IR representation; only available through inference

Pros:

  • Less verbose IR.
  • Can still be cached internally, e.g. across calls to LinalgStructuredInterface::getIteratorTypesArray – potentially already by the verifier.

Cons:

  • Opting in to validation of supposed iterator_types by the verifier is not possible – which, to be fair, is the case for a number of linalg ops.

4. linalg.contract is a binary op

Rationale: Keep to the convention of vector.contract and most linalg named ops (making for an easier time matching up semantics); could always be generalized later

Alternative: allow single input and/or more than two input operands

Pros:

  • Multi-operand contractions are valid operations and could be part of a valid lowering strategy (e.g. they can be represented by linalg.generic).
  • Can represent more versions of einsum, closer to their original form.

Cons:

  • Binary contractions suffice to implement multi-operand contractions [e.g., per an under-review paper].
  • Lose existence of a “canonical” lowering to vector.contract, which is a binary op.
  • Single operand version is already served by linalg.reduce – as above: two ways of writing the same thing (at the same abstraction level) complicates matching.
  • ContractionOpInterface mandates a binary op, though “In the future, we may wish to allow more input arguments”.

Actions

Primary / 1st PR:

  • Implement the proposed abstraction, with it implementing the ContractionOpInterface.
  • Implement generalization to linalg.generic and lowering to vector.contract.
  • Change inferContractionDims to additionally return “free” dimensions, i.e. reduction dims that occur in the LHS or RHS but not both.
    • To maintain current API expectations, add allowFreeDims=false as argument to inferContractionDims.

Secondary / follow-up PRs:

  • Implement folding of transposes before and after the linalg.contract into the op.
  • Rewrite packMatmulGreedily transform to lower matmuls to linalg.contract instead of linalg.generic.
    • Enables easier cleanup, e.g. fiddling with transposes after this transform has run
  • Implement raising/specialization transform from linalg.generic to linalg.contract
  • In line with [RFC][MLIR] Linalg operation tree, implement generalization/coercion transforms to linalg.contract for
    • linalg.dot
    • linalg.matmul(_transpose_a|_transpose_b)?
    • linalg.batch_matmul(_transpose_a|_transpose_b)?
    • linalg.batch_reduce_matmul
    • linalg.matvec, linalg.vecmat
    • linalg.batch_matvec, linalg.batch_vecmat
    • linalg.mmt4d, linalg.batch_mmt4d

Alternatives

  • Generalize current collection of matmul-like ops to support higher-dimensions and more permutations.
    • This would just give us linalg.contract in sheep’s clothing, though probably spread out over a number of ops. Unlikely to yield the same generality as a proper contraction op.
  • Status quo: no linalg op that sits between current matmul variants and linalg.generics; linalg.generic remains only representation for non-named op contractions.
    • The matching story would need to be improved – e.g. by adopting “match linalg.generic and perform isContractionOpInterface func call” as the preferred approach.
      • Each such scheme entails running checks with non-trivial cost on each and every linalg.generic, e.g. incurring scans over indexing_maps and region-matching costs.
      • Hard to justify in face of efficient named-op matching infrastructure being available.
      • Arguably, in general we do need such matching for contractions that happened to have been prematurely lowered to linalg.generic - though these ops could also be raised to linalg.contract, in which case you still incur the matching complexity.
    • Could introduce a projected_perm_map attribute to be used in indexing_maps on linalg.generic (& other ops) to easily identify projected permutations on dims. This would reduce the cost of matching the attributes.
      • An advantage would be that all multi-argument einsums expressed as a single linalg.generic are easier to recognize through their indexing_maps.
      • We would have overlapping representations with the affine_map attribute representation still being valid.
      • Does not yield a scheme to simplify region matching. Two such schemes that are always used together in order to constrain attributes and the op’s region ought to be enough motivation for a new op – this is what this proposal is about.

Unresolved questions:

  • Should repeated indices for a single operand be allowed?
    • The proposed semantics extends to this use case but vector.contract for example explicitly disallows it (as the corresponding affine_map is not a projected permutation). E.g. trace as an einsum is supported at least in some frameworks (as it is unambiguous in Einstein notation, also in the multi-operand case).

This RFC benefitted from comments from and discussion with @rengolin, @asiemien, Alex Heinecke & Alex Breuer.


Thanks @rolfmorel for the RFC. Couple of points for discussion as I read through:

  • addition to the arsenal of Linalg named ops. I am not picking on this line, but more like using it to highlight a source of future confusion. I am wondering if it is time to introduce new terminology to differentiate between a proper base-level named op (e.g. linalg.matmul) and something like linalg.contract. Abstractions like contract sit between generic and matmul. Especially following @rengolin’s proposed [classification], it would be good to have terminology that identifies and separates the mid-level meta-ops like ‘contract’ from ‘generic’ on the one hand and from very specific named ops like linalg.matmul on the other. My proposal: category ops. linalg.generic is explicitly an op, named op is just something we all agreed to mean something, and category ops would be the same kind of agreed-upon term (to talk about linalg.contract, linalg.elemwise, …).

  • (iterator_types = [(parallel|reduction)+])? I think we should avoid having iterator types and just infer them. Optional is where one can add more information other than default/inferred. Here, as you mention, it’s for verification. But that is essentially duplication of information in the IR. Just my opinion.

  • reduction-op ::= #linalg.kind<reduction-op-kind> – A contraction can be two ops in the linalg.generic body (e.g. mul and add for the case of matmul). Could you elaborate how that is captured here?

  • +1 for linalg.contract is a binary op. Otherwise it gets too generic (pun intended). Also, it crops up checks for unused args?

  • Should repeated indices for a single operand be allowed? Allowing it would break the projected-permutation contract.

While it’s true that contract is in between matmul and generic, it is a named operation in its own right. For example, my comment on the linalg tree RFC about reducing the number of named ops (which is a thing we’re both working on together):

The idea of adding a contract is not just to create an intermediary, it’s to actually use it on corner cases, when the benefit of adding yet another named op is marginal.

This is what we have today and the “we” that agreed 5 years ago doesn’t remember anymore what they agreed upon with the same clarity.

My objective in creating the tree, and Rolf’s objective in being deeply technical in this proposal is precisely to improve on that situation and have an actual encoded semantics that composes with the rest of the dialect and other dialects.

Thanks for the reply @rengolin.

Agree. That’s fine. From time to time, when we want to specifically refer to the set {contract, elemwise, conv, pool, ...} we are categorizing and creating, we can refer to the collection as a category.

Agree. And to do transformations/opts on them.
BTW, I added some more thoughts/points in my first reply (it doesn’t contradict my initial reply but adds some new points).

Thanks for the replies, @javedabsar and @rengolin.

Regarding the classification of linalg.contract, I think we are in agreement: it is a named op - in the sense of having restricted syntax and clear semantics - while at the same time covering a “category” that encompasses other named ops. As @rengolin says, there are cases where we want it as the most specific (named) op that Linalg provides and it can serve as an intermediary.


Indeed, the RFC’s current proposal does not allow for the “MulOp” to be anything but standard multiplication. The advantage of this is that the one-to-one mapping to vector.contract is always available (on non-dynamic shapes) - this is due to vector.contract only allowing control over the “AddOp” kind.
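
For reference, the vector-level counterpart this is meant to map onto one-to-one looks roughly as follows (a sketch with illustrative static shapes, using vector.contract’s existing syntax; only the combining kind is controllable there):

// The kind attribute controls the combining/reduction op; the multiply is fixed.
%0 = vector.contract {
       indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                        affine_map<(m, n, k) -> (k, n)>,
                        affine_map<(m, n, k) -> (m, n)>],
       iterator_types = ["parallel", "parallel", "reduction"],
       kind = #vector.kind<add>}
     %A, %B, %C : vector<4x8xf32>, vector<8x16xf32> into vector<4x16xf32>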

Considering the docs of LinalgContractionOpInterface say

A Linalg contraction is defined in general terms:
    [...]
    4. its body computes `u5(u1(c) + u2(u3(a) * u4(b)))` on some field
    (AddOpType, MulOpType), where u1, u2, u3, u4 and u5 represent scalar unary
    operations that may change the type (e.g. for mixed-precision).
As a consequence, when vectorization of such an op occurs, the only special 
behavior is that the (unique) MulOpType is vectorized into a
`vector.contract`. All other ops are handled in a generic fashion.

it seems that @nicolasvasilache did already have in mind that the “canonical” lowering to vector.contract is conditional on the contraction’s “MulOp” being multiplication. As such, I will modify the RFC and add an attribute to control the “MulOp”. Will change the original post to reflect this. (EDIT: I am not allowed to edit the post anymore :slightly_frowning_face:) Thanks @javedabsar.


On this topic, I will just note that the most recent iteration of linalg.matmul has affine_maps and no iterator_types as they can be inferred. I also note that vector.contract does the opposite and requires iterator_types while they can be inferred. The third route is optional verification.

At the moment, I am inclined to go with no iterator_types attribute, primarily to be consistent within Linalg and as it is the conservative option (i.e. we could always add (an optional) iterator_types later, when a (new) consensus emerges).

I’m sceptical that this is really required in the first iteration. Nico’s definition is in general terms and not something that we need to follow to the letter.

I’d start with the standard mul + add contraction assumption and add a FIXME in the code saying we could support other types. After we implemented all the currently useful patterns, if someone feels inclined to add support for other types, we do it.

Agreed.


I’d say go for consistency. Either do mul/add or allow both to be changed, but allowing only the reducer part to be changed (original post) looks odd.


Thanks for the proposal @rolfmorel. Most of it looks good to me, but we need a solution for mixed data types here. Something that can help specify:

  1. What is the compute bit width
  2. How to convert the input data types to the compute data type
  3. How to convert from the compute data type to the result data type.

Any thoughts?


I’d recommend working through the f16 and bf16 cases specifically as part of the design vs leaving that as a future exercise. There are more corners than that, but in my experience, if you can represent those and ensure lowering to various intrinsics, then that is most of the way there. Note also that frontends like torch do have to specify what these types are (it isn’t just left to the backend to infer).

Hi @MaheshRavishankar and @stellaraccident - thanks for prompting a look at the mixed types scenario. (And thanks to @rengolin for a quick offline sanity check regarding this.)


The current proposal is that “numeric casting is performed on the operands to the inner multiply, promoting them to the same data type as the accumulator/output” (same as for recent linalg named ops and vector.contract). Implicit in this is that the compute type and the output type do not differ: that is, semantically speaking, accumulation happens on an element of the output.

Contrary to that, in principle, we could allow a compute_type attribute. However, how would the generalization of a linalg.contract with a compute_type != output type look as a linalg.generic? The issue I see is that the linalg.generic’s body would run multiple times to achieve the accumulation, hence the cast from the compute type to the output type needs to be captured outside of the linalg.generic (or at least outside its body).

So the main question is how we represent accumulation on a ā€œcompute typeā€ that differs from the out type in generics. Would you maybe be able to give an example of what you had in mind here, i.e. a linalg.generic-representation?

Meanwhile, in order to represent the intent in IR, we can treat the out type of a contraction op as the “compute type” and then apply a cast to the whole output of the linalg.contract or linalg.generic to capture the intended out type. Might this approach be sufficient for the use cases you had in mind? (Through, e.g., a tiling transform we could rewrite to a more efficient form than really casting an entire tensor.)
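
To illustrate (a sketch only - names, shapes and the choice of cast op are illustrative, and this is just the representational idea above, not a settled design): an f16 x f16 matmul with f32 accumulation would then be a contraction accumulating into an f32 out, followed by an elementwise cast of the whole result:

// Accumulate into the f32 out, which doubles as the compute type here.
%acc = linalg.contract
         indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                          affine_map<(m, n, k) -> (k, n)>,
                          affine_map<(m, n, k) -> (m, n)>]
         ins(%A, %B : tensor<64x32xf16>, tensor<32x16xf16>)
         outs(%Cf32 : tensor<64x16xf32>) -> tensor<64x16xf32>
// Capture the intended f16 result type with a separate (elementwise) cast.
%res = arith.truncf %acc : tensor<64x16xf32> to tensor<64x16xf16>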


All in all, the distinct compute and out data types issue does not seem solely related to linalg.contract. In my understanding, it affects linalg.generic and the other named (contraction) ops as well. As the solution does not appear straightforward either - and IMO seems largely orthogonal to the introduction of the contract op - my preference would be for a separate RFC where people can share their use cases and at least the linalg.generic representation they had in mind.

It comes down to a judgment call: do you want the most common op forms in modern ML to only be representable by fusion (i.e. of a contract+cast)? We do that today and it is one of the things that just pushes us to generalize everything, because the named forms approximately never represent useful forms.

Consider:

  • An f16f16f16 contract is almost always formulated with an f32 accumulator. This decision is made on the frontend and linalg must encode it. A lot of hardware has intrinsics for either f16 or f32 accumulation, but usually the latter is needed for most cases.
  • Bf16 is similar
  • Libraries are almost always explicit about accumulator type because it is such an important aspect.

XLA came to similar conclusions as I’m indicating here: accumulation precision is a non-optional part of the operation (the XLA architects used to say “there is no such thing as a matmul without a precision specifier”).

I think my main point is that this is not niche. If the contract op doesn’t have a representation that can work for f16 and bf16 arithmetic, then I’m not sure it is the right design for a modern op. If everyone agrees that representing this by fusion is what we are ok living with, then I’ll drop the objection: mainly, we need to decide on that point and not just let it float.

(Edit: unless someone beats me to it, I can dig up some examples on Monday. These are the most common ops in modern ML so are pretty easy to get concrete about)

vector.contract semantics for mixed precision are not a good model to follow. Implicit numeric casting is not the best way to capture different extension semantics. There has been effort to either remove mixed precision semantics from vector.contract or to add a per-operand attribute to capture extension semantics. Please see the discussion on: [mlir][Vector] Support mixed mode vector.contract lowering by Groverkss · Pull Request #117753 · llvm/llvm-project · GitHub and [mlir][Vector] Update docs for vector.contract extension semantics by Groverkss · Pull Request #118598 · llvm/llvm-project · GitHub. We should probably decide how we are going to do this and be consistent with it. Note that the linalg.generic form of contractions allows this specification because it explicitly has an arith.extf in its body, which can capture different extension semantics.
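
For concreteness, this is roughly what that generic form looks like (a sketch; for integer inputs, arith.extsi/arith.extui in the same position would capture signed vs. unsigned extension):

// bf16 x bf16 -> f32 contraction in generic form: the extension semantics are
// explicit ops in the body rather than implicit in the op's definition.
%0 = linalg.generic
       {indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                         affine_map<(m, n, k) -> (k, n)>,
                         affine_map<(m, n, k) -> (m, n)>],
        iterator_types = ["parallel", "parallel", "reduction"]}
       ins(%A, %B : tensor<?x?xbf16>, tensor<?x?xbf16>)
       outs(%C : tensor<?x?xf32>) {
  ^bb0(%a: bf16, %b: bf16, %c: f32):
    %ae = arith.extf %a : bf16 to f32
    %be = arith.extf %b : bf16 to f32
    %p = arith.mulf %ae, %be : f32
    %s = arith.addf %c, %p : f32
    linalg.yield %s : f32
} -> tensor<?x?xf32>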

CC: @dcaballe

I think we all agree with this. I have been adding operations without such magical casts into linalg to avoid these pitfalls, but the end result was that fusion was needed at the named op level. That’s why I proposed at the time (almost 2 years ago) to add the compute_type attribute into all linalg operations.

My point to Rolf was that this is not an exclusive problem of the contract op, but of all linalg and vector ops. Here, we’re just discussing the mechanics of contraction and we only mapped to vector.contract because we want a 1:1 lowering for as many cases as possible.

We do not want this to be the default behaviour. But we want to handle one problem at a time. If we try to fix mixed type, cast semantics and contraction at the same time, we’ll go round in circles and not get anywhere.

If we force all linalg operations to have to wait until we fix the type system, we’ll be waiting a long time and still have a lot of work to do. All I’m asking now is to work in parallel as much as possible.

We add linalg.contract like the others, while we’re already discussing mixed types and cast semantics (other threads and PRs), then when we agree on a model, we match-and-replace on all core operations (linalg AND vector) at once.

Meanwhile, we can already use more basic examples (ex. f32) to get the core semantics well understood and correct. Linalg is far from having complete and safe semantics, even for the basic cases. As we try to represent more complex type support, permutations and fusion opportunities, it’s bound to get a lot harder. The work on contract is to simplify the zoo of matmul operations and is orthogonal to the type system.

Moreover, whatever we design, has to represent the solutions that help us on CPUs, GPUs and other accelerators, which will need more time to consider all the cases. Let’s not get ahead of ourselves and work in parallel to make most of our time and efforts in converging linalg/vector lowering to a reasonable state.

Sorry, I should’ve been more clear about what I’m asking for. I’m not blocking anything here, only mentioning that we were recently discussing vector.contract extension semantics and we should be consistent with whatever we eventually decide (we don’t have to decide it before this patch lands).

I understand, and agree. My point is the same: let’s be consistent now, and make the changes (we all want) consistently, across linalg and vector.


I’m fine with sequencing. But I do think we need to have a strong statement that these ops need to grow a compute_type attribute as our plan of record. Or we need to have an agreed way that we expect a canonical f16 matmul to be represented.

The reason this isn’t just a “well, all of them do it so we keep doing it” thing is that these new ops are claiming to be designed to fit specific purposes that their predecessors and the current norm of “just make everything generic” don’t. And my claim is that if you can’t represent f16 arithmetic on them, they don’t satisfy that purpose.

Sorry for being a bit picky on this, but we’ve suffered in silence for many years because mixed precision was not strongly considered in the original linalg design. Back then, it was possibly ok to treat it as a niche thing, but it is a complete non starter for a next version today. I’m not exaggerating when I say that I cannot find any modern workload that does not need at least f16 arithmetic to be more than a kludge. I could go further and ask for more mixed precision consideration, but I think a good representation for f16 is table stakes.

So to recap: totally fine sequencing the work. But I think we need a plausible direction defined for at least the level of mixed precision support needed for f16. In practice, nothing can start using these ops until that exists anyway (i.e. torch-mlir will not generate them and they will not be safe to introduce in any transformation pipeline if they can’t have a uniform representation for the most common datatypes).

(edit: I don’t feel like I’m asking for much here – just a bit more of a written record than “we’ll think about that later”. But this is also one of the major reasons we ended up with the current op explosion and a lot of switchy code that has to treat less-than-32-bit precisions in a completely divergent way at each level. I need to know that we aren’t considering these ops done and ready to use without having addressed those original issues that led the prior version to the state we are trying to fix)


Absolutely agree with this statement. This is a big (but necessary) deviation from the current status quo, and I want to de-couple the progress: simplify the matmul forest independently but concurrently with fixing the compute and cast semantics.

Rolf and I discussed how this cannot be done in generics: currently, the accumulator and output types are one and the same, the outs argument. The alternative is to decouple the accumulator (iter_args) type from the return type, but generalizing becomes impossible in a single op, unless generic gains a second region that only executes after the last iteration. That’s why he mentioned two ops (generic + cast).

The main problem here is that such semantics in memory means “cast the data after the last accumulation”, which for a large tensor can only be after all the compute is finished. Such semantics in register is “cast the data on the last store”, which is the same thing as above, but in a tile. Optimizations aim to keep the C tile in registers, but this is not always possible, so any intermediate store (hopefully to L1/shared) needs to be in the accumulation type, not the output type.

If you tile your large contraction using outer product, the last store is a loop construct, not a linalg construct, and such tiling needs to lower to two linalg forms (an intermediate one, with acc == out type, and a final one, with acc != out type). This needs not only the addition of an accumulation type and the separation in the generics, but also special logic in the tiling interface implementation, etc.

Those are not trivial changes to do all at once.

Didn’t get to read through most of the text but +1 to @rengolin’s snippet above and quite a strong -1 to just:

have a strong statement that these ops need to grow a `compute_type` attribute

To keep it simple: the mixed precision case is the simplest form of AggregateOpInterface.

In this particular case, I recommend leaving linalg.contract like it is proposed by @rolfmorel with acc_type == store_type by construction. On top, we can have an aggregate linalg.aggregate.mixed_contract (or any other syntactic sugar people prefer), that is a contract + cast, if people want a single op.

Overall this is not fundamentally different from softmax or attention: a mixed_contract (like a softmax or attention) tiles in a fused perfectly-nested fashion on the parallel dimensions and requires imperfect loop nesting for tiling on the reduction dimension.

FWIW, I routinely use ext/ext/contract/trunc patterns to map to various mixed-precision instructions on a bunch of different HW and it is actually very low surprise and pain from my perspective (but I am happy to discuss use cases I may not have considered yet).

Let’s please schedule a call to get to the bottom of this and avoid conflations between looping structure and payload representation.

Thanks!

I agree that the compute type and the output type cannot differ (more on that below, since this seems to have been brought up in the thread later). But the input type and compute type can differ, and we need a way to capture how to convert the input type to the accumulator type. I realize that I did ask for how the compute type gets converted to the result type, and that was a mistake. For example, do we need to use unsigned extension or signed extension for converting from i16 to i32? This is effectively input from the layer that lowers into Linalg. That needs to be added to op semantics. I captured more details in this post. (It’s a HackMD post, you just need to sign into HackMD to see it.)

I think Nico says this later as well. I don’t know if we need a compute type. I think the output type serves as a compute type. The reason they can’t be different is that the downcasting makes it an imperfectly nested loop computation that is outside of the LinalgStructuredOp domain. Essentially, an (i16, i16) → (i16) with i32 accumulate is this computation:

for (i = 0; i < M; i++) {
  for (j = 0; j < N; j++) {
    int32_t acc = 0;                                    // accumulate in i32
    for (k = 0; k < K; k++) {
      acc += sign_ext(A[i][k]) * sign_ext(B[k][j]);     // i16 -> i32 extension
    }
    C[i][j] = trunc(acc);                               // truncate to the i16 result
  }
}

As Nico mentioned, we can have separate ops that implement the AggregateOpInterface to represent cases where the result type is different from the compute type, but for the linalg.contract op we can do that.

I do want to re-iterate that the linalg.contract op needs a way for the user to specify extension semantics for the input type → output/compute type conversion. Without that we will have an explosion of ops to deal with the different extension semantics. A simple enum should do, and that could have any number of entries. For the most part you only need it when generalizing or converting to loops.

Even after all of these years, I still admittedly always struggle with the way linalg does mixed precision because it is so far removed from how the algebra and library/kernel world typically define things. Basically for everything except f32/f64 (which are arguably the least useful datatypes for ML), the only way to express the most used operations on the most used datatypes is by fusion of an output cast.

I get the arguments about the way things compose if sticking to a purely perfectly nested generic, but I thought that a goal of this op refresh was to be able to more closely model these ops, with some of those constraints lifted, so that they better encapsulated the most common cases driving pattern matching and algorithm selection.

In my original message, I did say I just wanted this decided and documented as an explicit design choice. If a contract+cast is our canonical way of modeling everything, then so be it. I know this has been a source of a lot of subtle bugs and questions over the years because, for those of us who come from different levels/backgrounds, this feels utterly wrong (and because it relies on reasonably advanced fusion for even the most basic of computations, it has a lot of extra moving parts). If that’s the plan, you’ll have to forgive me for probably getting confused a few more times over the years because my mind refuses to accept this as right :slight_smile: