[RFC][MLIR] Linalg operation tree

Background

I keep repeating this to multiple people, so I thought I should just get it out in the open so we can discuss this together. @rolfmorel @asiemien @mshahid @javedabsar @MaheshRavishankar @banach-space @ftynse

There is a disconnect between the Linalg operations being declared and upstreamed by various groups, and we need to get the house in order. Some are very generic (pun intended), others are very specific. Named ops have been added with restrictive semantics, which we’re trying to fix with the new scheme.

We have discussed this among ourselves and I think we mostly agree on the overall idea, but we need to get to a specific level of detail to help @mshahid propose batch_matmul and batch_reduce_matmul, @rolfmorel propose contraction, @javedabsar propose element-wise and @MaheshRavishankar propose convolution. We also don’t need to create this tree in one go, and can take multiple steps for different branches.

Tree Proposal

An operation tree, where generic is at the root, and the operations get more and more specialized. For example, generic → contract → matmul. When doing tiling, we want to avoid having to coerce a matmul all the way down to a generic if we can stop half-way through as a contract.

The reason to have the intermediate operations has been widely discussed (and agreed): to help pattern matchers (avoid checking generics every time, for every possible combination, in every transform/pass in a schedule/pipeline) by persisting semantics (memoization) across multiple transforms. It also helps match intermediate semantics for vectorization or micro-kernel lowering when we support more than just the standard matmuls, but don’t want to create a gazillion variations.
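
For concreteness, here is roughly the same computation at three levels of the tree. Treat this as a sketch: the linalg.contract syntax below is the one from the pending contraction RFC, so it is illustrative, not final.

 // Leaf: a fully named op; all semantics are implicit.
 %0 = linalg.matmul ins(%A, %B : tensor<32x64xf32>, tensor<64x16xf32>)
                    outs(%C : tensor<32x16xf32>) -> tensor<32x16xf32>

 // Middle: a contraction with explicit indexing maps
 // (illustrative syntax from the pending contract RFC).
 %1 = linalg.contract
        indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                         affine_map<(d0, d1, d2) -> (d2, d1)>,
                         affine_map<(d0, d1, d2) -> (d0, d1)>]
        ins(%A, %B : tensor<32x64xf32>, tensor<64x16xf32>)
        outs(%C : tensor<32x16xf32>) -> tensor<32x16xf32>

 // Root: linalg.generic; everything is explicit, including the payload.
 %2 = linalg.generic {
        indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                         affine_map<(d0, d1, d2) -> (d2, d1)>,
                         affine_map<(d0, d1, d2) -> (d0, d1)>],
        iterator_types = [#linalg.iterator_type<parallel>,
                          #linalg.iterator_type<parallel>,
                          #linalg.iterator_type<reduction>]}
        ins(%A, %B : tensor<32x64xf32>, tensor<64x16xf32>)
        outs(%C : tensor<32x16xf32>) {
      ^bb0(%a: f32, %b: f32, %c: f32):
        %m = arith.mulf %a, %b : f32
        %s = arith.addf %c, %m : f32
        linalg.yield %s : f32
      } -> tensor<32x16xf32>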

Here’s the idea for a tree:

generic
 |- contract
 |   |- matmul
 |   |- vecmat
 |   |- matvec
 |   |- batch_...
 |   |- batch_reduce_...
 |   |- quantized_... (we need to solve this problem later)
 |   |- mmt4d
 |   |- dot
 |- element_wise
 |   |- add
 |   |- max
 |   |- select
 |   |- (all compute unary, binary, ternary)
 |- convolution
 |   |- conv_3d_ncdhw_fcdhw
 |   |- conv_2d_...
 |   |- conv_...
 |   |- depthwise_conv_...
 |- pooling
 |   |- pooling_nchw_max
 |   |- pooling_...
 |- winograd
 |   |- winograd_...
...

A class of operations that doesn’t have a common ancestor is the memory ops (copy, transpose, broadcast, reduce, pack/unpack (soon)). They can all map to generics, so if they branch straight off generic, they’d be coerced directly into generic by transforms that change the shapes.
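
For example, generalizing a linalg.transpose (a sketch) just turns the permutation into an indexing map:

 %t = linalg.transpose ins(%in : tensor<16x32xf32>)
                       outs(%out : tensor<32x16xf32>) permutation = [1, 0]

 // After generalization, the permutation is just the input map:
 %g = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>,
                         affine_map<(d0, d1) -> (d0, d1)>],
        iterator_types = [#linalg.iterator_type<parallel>,
                          #linalg.iterator_type<parallel>]}
        ins(%in : tensor<16x32xf32>) outs(%out : tensor<32x16xf32>) {
      ^bb0(%i: f32, %o: f32):
        linalg.yield %i : f32
      } -> tensor<32x16xf32>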

There are some left-overs that cannot be converted to generics:

  • softmax, which can be lowered to multiple generics and can be thought of as a meta operation (see the sketch after this list). This would not be in the tree at all.
  • index and yield, which are used inside the payload region and do not have a lowering per se. These are also meta operations, and also outside of the tree.
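
To illustrate the softmax case, a rough sketch (not the exact upstream decomposition):

 %out = linalg.softmax dimension(1)
          ins(%in : tensor<4x64xf32>)
          outs(%init : tensor<4x64xf32>) -> tensor<4x64xf32>

 // Decomposes into a chain of named ops / generics, not a single one:
 //   %max = max-reduce(%in) over dim 1    : tensor<4xf32>
 //   %exp = exp(%in - broadcast(%max))    : tensor<4x64xf32>
 //   %sum = add-reduce(%exp) over dim 1   : tensor<4xf32>
 //   %out = %exp / broadcast(%sum)        : tensor<4x64xf32>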

Current Implementation

Right now, with OpDSL, named ops are essentially aliases to linalg.generic, so tiling and fusion can happen on them without additional code. But it also restricts what we can do with the ops, so we started moving them to ODS (TableGen).

We have discussed creating the operations as aliases, similar to OpDSL, but that means different things to different people. Some may think this is a TableGen alias definition, others may think the ops just implement the same interfaces, etc. So we need to find a common language, which in this case is reasonably simple because they all mean the same thing in the end.

Also note that not all current operations need to be named in the new tree. For example, having a matmul operation is really useful, but perhaps a batch_reduce_matmul can be matched from a contract instead. The same goes for element-wise operations (do we really need linalg.add if we already have linalg.element_wise add?) and the near-infinite variations of convolutions.
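
For illustration, the element-wise case could look like this (hypothetical syntax; the op and attribute names are placeholders for whatever the element-wise RFC settles on):

 // Today: one named op per arithmetic function.
 %0 = linalg.add ins(%a, %b : tensor<8xf32>, tensor<8xf32>)
                 outs(%c : tensor<8xf32>) -> tensor<8xf32>

 // Proposed: a single op parameterized by its kind.
 %1 = linalg.element_wise kind = #linalg.elementwise_kind<add>
        ins(%a, %b : tensor<8xf32>, tensor<8xf32>)
        outs(%c : tensor<8xf32>) -> tensor<8xf32>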

This is another strong motivation for doing this tree approach, replacing the flat grassland we have today.

Implementation Proposal

The criteria we need for the aliases are:

  • A transform on an operation should still try to employ the same operation. Ex: tiling a linalg.add should create a loop of adds.
  • A transform on an operation that cannot use the same op (ex. some batch_matmul tiling) should coerce it to its immediate parent in the tree. Failing that, its grandparent, and so on until reaching the root, linalg.generic.
  • An optional coercion can be made in cases where another sibling operation provides the appropriate semantics. For example, batch-tiling a batch_matmul into a loop over the batch dimension using matmul (see the sketch after this list).
  • Transforms should operate on interfaces and operations should implement those interfaces. But you still can have a transform that is specific to a particular operation where an interface doesn’t make sense.
  • There is a trivial path to generalize operations (towards generic).
  • There can be a path to specialize operations (towards the leaves) with enough pattern-matching. Those matches are expensive, which is why we want to persist as much as possible in the IR for future transforms. Note that not every generic can be specialized.
  • There can be helper functions that match features of more generic operations that mean “the same thing” as a more specific operation. This is helpful if you know you won’t be handling it after this transform (ex. lowering). These should use the same matcher infrastructure as the degeneralization transforms.
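
As a sketch of the sibling coercion above (index constants elided), batch-tiling a batch_matmul by 1 yields a loop over plain matmuls on slices:

 %r = linalg.batch_matmul
        ins(%A, %B : tensor<8x32x64xf32>, tensor<8x64x16xf32>)
        outs(%C : tensor<8x32x16xf32>) -> tensor<8x32x16xf32>

 // After tiling the batch dimension by 1 and coercing to the sibling:
 %r2 = scf.for %b = %c0 to %c8 step %c1 iter_args(%acc = %C)
         -> (tensor<8x32x16xf32>) {
   %sa = tensor.extract_slice %A[%b, 0, 0] [1, 32, 64] [1, 1, 1]
           : tensor<8x32x64xf32> to tensor<32x64xf32>
   %sb = tensor.extract_slice %B[%b, 0, 0] [1, 64, 16] [1, 1, 1]
           : tensor<8x64x16xf32> to tensor<64x16xf32>
   %sc = tensor.extract_slice %acc[%b, 0, 0] [1, 32, 16] [1, 1, 1]
           : tensor<8x32x16xf32> to tensor<32x16xf32>
   %mm = linalg.matmul ins(%sa, %sb : tensor<32x64xf32>, tensor<64x16xf32>)
           outs(%sc : tensor<32x16xf32>) -> tensor<32x16xf32>
   %u = tensor.insert_slice %mm into %acc[%b, 0, 0] [1, 32, 16] [1, 1, 1]
           : tensor<32x16xf32> into tensor<8x32x16xf32>
   scf.yield %u : tensor<8x32x16xf32>
 }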

A naive approach is to encode these relationships in TableGen as def-aliases. But we still need to implement the functionality in C++ that covers the bridges between these representations. Technically, because we’ll implement those anyway, having def-aliases is not required and may get in the way.

Having a back-end for converting def-aliases to C++ code that de/generalizes and guarantees interface implementation is out of scope for now and should not be attempted before we know the actual semantics we’ll go for.

Next Steps

None of these changes will happen now. But this is the design we’re going for in the linalg dialect in the short term, so the new operations will follow this path towards a more flexible transform infrastructure that keeps the power to represent what we need without an op explosion.

RFCs on the aforementioned operations will start popping up soon, and this is the context that we’re operating in.


Thank you for working on this, Renato!

Strong +1 from me, this aligns with what we have been discussing (and implementing) as a community for a good part of this year.

softmax is a bit of an outlier - we should make sure that we avoid “outlier explosion” in the future. And if that happens, it would be good for the “tree” that you are proposing to somehow capture that :thinking:

No specific suggestion here. I think that it’s fine if we wait for an appropriate “bucket” to emerge with time.

I’m not sure whether we will be able to map tensor.pack/tensor.unpack to linalg.generic - it’s a bit like a “composite” Op, akin to softmax. This is a minor technical detail though.

Indeed!

At some point we should discuss how this affects Vectorization - does Vectorization fall under “transforms” or “lowering” and are we making this distinction?

Anyway, these are very minor points! Thank you for your continued effort in this area :pray:

-Andrzej

That’s a good point. I think because pack is basically reshapes and transposes, we could possibly lower it to a very convoluted generic, but like softmax, we don’t want to.
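
To make that concrete, a sketch of the non-padding case:

 %p = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 16]
        into %dst : tensor<128x256xf32> -> tensor<16x16x8x16xf32>

 // ...is equivalent (absent padding) to a reshape plus a transpose:
 %e = tensor.expand_shape %src [[0, 1], [2, 3]] output_shape [16, 8, 16, 16]
        : tensor<128x256xf32> into tensor<16x8x16x16xf32>
 %t = linalg.transpose ins(%e : tensor<16x8x16x16xf32>)
        outs(%dst : tensor<16x16x8x16xf32>) permutation = [0, 2, 1, 3]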

The tree is used for generalization and degeneralization patterns, with the root as a generic, and these patterns don’t fit that model.

We can expand the idea (not the tree) to accommodate generalization and degeneralization patterns based not on generic, but on multiple operations. I think that’s what you mean, just wanted to clarify.

I want to do that later when mapping DAG-to-DAG patterns, or when controlling how I lower my softmax (all named ops) and pack (block copy instead of transpose).


The overall idea of a deeper structure for linalg ops makes sense to me.

I don’t understand what the proposed aliases are going to look like in practice. Are these just regular operations in ODS + lowerings/raisings implemented in C++ + some new ODS mechanism to describe the relation between some ops? Do we get a proper ContractionOpInterface?

Some smaller comments on classification:

  • transpose, copy and broadcast can be part of a special nullary/non-arithmetic op class;
  • copy is actually a special case of transpose with identity maps;
  • pack/unpack can also be in this class, especially if we eventually move towards relaxing the requirement on indexing schemes to be expressible as affine maps;
  • reduce is a generalization of a contraction over operations (but not over dimensions): a contraction is implicitly an inner multiplication with an add-reduction; one can imagine other contractions, like an inner addition with a max-reduction (sketched after this list).
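
To illustrate the last point, a sketch of such an “other” contraction: the same maps and iterators as a matmul, but with inner addition and max-reduction in the region:

 %0 = linalg.generic {
        indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                         affine_map<(d0, d1, d2) -> (d2, d1)>,
                         affine_map<(d0, d1, d2) -> (d0, d1)>],
        iterator_types = [#linalg.iterator_type<parallel>,
                          #linalg.iterator_type<parallel>,
                          #linalg.iterator_type<reduction>]}
        ins(%A, %B : tensor<32x64xf32>, tensor<64x16xf32>)
        outs(%C : tensor<32x16xf32>) {
      ^bb0(%a: f32, %b: f32, %c: f32):
        %s = arith.addf %a, %b : f32      // inner addition
        %m = arith.maximumf %c, %s : f32  // max-reduction
        linalg.yield %m : f32
      } -> tensor<32x16xf32>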

Also noting that there is a deeper structure in contractions and, to some extent, convolutions. Specifically, a contraction conceptually has batch, lhs, rhs and reduction dimensions. If some of these dimensions have unit size, they can be omitted giving rise to particular kinds of named ops. For example, a batch matmul with batch=1 is a plain matmul, and a batch matmul with batch=1 and rhs=1 (or a matmul with rhs=1) is a matvec, etc. If such unit dimensions are introduced progressively, this becomes a DAG, where one can get from batch matmul to matmul to matvec or from batch matmul to batch matvec to matvec. Similarly, one can lower the rank of convolutions. Treating all these as siblings may be a pragmatic choice, but it’s worth giving a brief thought to the idea of encoding this DAG programmatically somehow.
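
For instance (a sketch), dropping a unit batch dimension takes one step down this DAG:

 // A batch matmul with batch = 1...
 %0 = linalg.batch_matmul
        ins(%A, %B : tensor<1x32x64xf32>, tensor<1x64x16xf32>)
        outs(%C : tensor<1x32x16xf32>) -> tensor<1x32x16xf32>

 // ...is a plain matmul once the unit dimension is folded away:
 %a = tensor.collapse_shape %A [[0, 1], [2]]
        : tensor<1x32x64xf32> into tensor<32x64xf32>
 %b = tensor.collapse_shape %B [[0, 1], [2]]
        : tensor<1x64x16xf32> into tensor<64x16xf32>
 %c = tensor.collapse_shape %C [[0, 1], [2]]
        : tensor<1x32x16xf32> into tensor<32x16xf32>
 %1 = linalg.matmul ins(%a, %b : tensor<32x64xf32>, tensor<64x16xf32>)
        outs(%c : tensor<32x16xf32>) -> tensor<32x16xf32>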

What’s more interesting is that the “sibling jump” is also possible between mid-level classes. In particular, a contraction with only batch dimension(s) is an elementwise multiplication. An elementwise operation with the neutral element of the operation is a copy. If we introduce “window” dimensions for convolutions, a convolution with all “window” dimensions having unit size is a contraction. Again, unclear if we want to encode all these algebraic relations. This should depend on what kind of information we want to preserve in the IR directly.
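
As a sketch of the first sibling jump mentioned above: a “contraction” whose dimensions are all parallel is exactly an elementwise multiplication.

 %0 = linalg.generic {
        indexing_maps = [affine_map<(d0) -> (d0)>,
                         affine_map<(d0) -> (d0)>,
                         affine_map<(d0) -> (d0)>],
        iterator_types = [#linalg.iterator_type<parallel>]}
        ins(%a, %b : tensor<8xf32>, tensor<8xf32>)
        outs(%c : tensor<8xf32>) {
      ^bb0(%x: f32, %y: f32, %z: f32):
        %m = arith.mulf %x, %y : f32
        linalg.yield %m : f32
      } -> tensor<8xf32>
 // ...i.e. exactly what linalg.mul means.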

I don’t think anyone does at the moment, and this is why I left it as an exercise for the reader. For now, we keep them as regular ops and create the tree structure in plain old C++.

Indeed. In libxsmm, those operations are all part of an identity class and the difference is the equivalent of affine maps. In my mind, transpose is a special type of copy with an inverse affine map. For me, the parent op for this group would be copy, and others would be “copies” with non-trivial affine maps.

pack has the issue that we don’t want to generalize it quickly, and it could be coerced into a loop over copy (block-transpose) or a transpose. Having copy coerced into a transpose makes less sense to me.

Yes, but I do not want to add this to the contract branch. We don’t want to encode this special case in the contract op, for arguments similar to those against supporting element-wise einsums. This family will be heavily used in hard transformations, and we don’t want to keep adding exceptions everywhere.

That’s the idea, and this is encoded in @rolfmorel’s RFC.

Yup, @MaheshRavishankar is working on that one. We’re all working on this design.

Exactly! We did that in tpp-mlir last year. This is one of the reasons why I introduced the idea.

My main objective with this design is to create something specific enough that we can fix the main problems we have right now in linalg, but generic enough that we can continue to expand it to more and more complex cases.

The first iteration would not need more representations, since we’re already using the ones we have right now. It’s just about organizing what we have in a clear way.

The second iteration is to refine the current representations to more accurately reproduce the intention of the front-ends and the ability of the back-ends to execute it. The key here is to carry enough information through the various transforms so that we can make that link. I think this is what you mean.

The third iteration is to look at DAG-to-DAG (you mention this too), but this needs strong semantics for the ops we already have and a robust matching infrastructure (I have the design for one, but haven’t had time to implement it).

Lol, probably the third most important op in all the world :slight_smile:

It and a few others like it are also heavily implicated in transformations. In a prior time, I would have lumped these all under a generic “algorithm” or “ml” category, because they were kind of terminals (and often library-backed, etc.).

But as with many things, it has been true for some years that if the compiler can’t transform, fuse across, and perform custom optimizations on softmax, the compiler doesn’t get very far. There’s actually a small universe of transformation-related variants, not dissimilar to winograd in principle, in that they represent ways to recompose softmax so that it can be more efficiently fused.

Rather than outliers, these are the things you want to make sure something like linalg captures so that it can have a full suite of transformations for optimizing it in the various forms. And there will be more…

In my world, convolutions are still the most important - though sometimes it feels like I’m living in ancient history! :smile: Jokes aside, “outlier” was a poor choice of wording on my part.

I’m still working on developing a good mental model to connect these non-generics to the rest. If we’re expecting more Ops like this, it might be worth capturing that distinction somehow. Perhaps … non-generics?

Naming is hard, and as I mentioned earlier, I don’t have a great suggestion yet.

Agree. I’d call them composite, because they compose multiple other named ops or generics. softmax is a clear candidate. pack may be too. convolutions, depending on their complexity, could be too?

I think the assumption is that a composite op would always decompose into a sequence of other named ops or ultimately, generic. This way we have a closed system.
