# 'vector' Dialect

[TOC]

MLIR supports multi-dimensional `vector` types and custom operations on those
types. A generic, retargetable, higher-order `vector` type (`n-D` with `n > 1`)
is a structured type that carries semantic information useful for
transformations. This document discusses retargetable abstractions that exist
in MLIR today and operate on SSA values of type `vector`, along with pattern
rewrites and lowerings that enable targeting specific instructions on concrete
targets. These abstractions serve to separate concerns between operations on
`memref` (a.k.a. buffers) and operations on `vector` values. This is not a new
proposal but rather a textual documentation of existing MLIR components along
with a rationale.

## Positioning in the Codegen Infrastructure

The following diagram, recently presented with the
[StructuredOps abstractions](https://drive.google.com/corp/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
captures the current codegen paths implemented in MLIR across the various
existing lowering paths.
![](https://user-images.githubusercontent.com/10148468/71177417-f78e4d80-2239-11ea-92ef-700f42ea503f.png)

The following diagram seeks to isolate `vector` dialects from the complexity of
the codegen paths and focus on the payload-carrying ops that operate on std and
`vector` types. This diagram is not set in stone, nor an exact representation
of what exists today; rather, it illustrates the layering of abstractions in
MLIR.

![`vector` Abstractions in MLIR](https://user-images.githubusercontent.com/10148468/71176949-e85ad000-2238-11ea-9806-200843bc4943.png)

This separates concerns related to (a) defining efficient operations on
`vector` types from (b) program analyses and transformations on `memref`, loops
and other types of structured ops (be they `HLO`, `LHLO`, `Linalg` or others).
Looking a bit forward in time, we can put a stake in the ground and venture
that the higher the level of `vector` primitives we build and target from
codegen (or from some user/language level), the simpler our task will be, the
more complex the patterns that can be expressed, and the better the resulting
performance.

## Components of a Generic Retargetable Vector-Level Dialect

The existing MLIR `vector`-level dialects are related to the following
bottom-up abstractions:

1.  Representation in `LLVMIR` via data structures, instructions and
    intrinsics. This is referred to as the `LLVM` level.
2.  Set of machine-specific operations and types that are built to translate
    almost 1-1 with the HW ISA. This is referred to as the Hardware Vector
    level; a.k.a. `HWV`. For instance, we have (a) the `NVVM` dialect (for
    `CUDA`) with tensor core ops, (b) accelerator-specific dialects
    (internal), (c) a potential (future) `CPU` dialect to capture `LLVM`
    intrinsics more closely, and other dialects for specific hardware. Ideally
    this should be auto-generated as much as possible from the `LLVM` level.
3.  Set of virtual, machine-agnostic, operations that are informed by costs at
    the `HWV`-level. This is referred to as the Virtual Vector level; a.k.a.
    `VV`. This is the level that higher-level abstractions (codegen, automatic
    vectorization, potential vector language, ...) target.

The existing generic, retargetable, `vector`-level dialect is related to the
following top-down rewrites and conversions:

1.  MLIR Rewrite Patterns applied by the MLIR `PatternRewrite` infrastructure
    to progressively lower to implementations that match the `HWV` ever more
    closely. Some patterns are "in-dialect" `VV -> VV` and some are
    conversions `VV -> HWV`.
2.  `Virtual Vector -> Hardware Vector` lowering is expressed as a set of MLIR
    lowering patterns that are written manually for now.
3.  `Hardware Vector -> LLVM` lowering is a mechanical process that is written
    manually at the moment and that should be automated, following the `LLVM
    -> Hardware Vector` ops generation as closely as possible.

## Short Description of the Existing Infrastructure

### LLVM level

On CPU, the `n-D` `vector` type currently lowers to an aggregate of `1-D`
vectors. More concretely, `vector<4x8x128xf32>` lowers to
`!llvm.array<4 x array<8 x vector<128xf32>>>`. There are tradeoffs involved
related to how one can access subvectors and how one uses
`llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. A
[deeper dive section](#DeeperDive) discusses the current lowering choices and
tradeoffs.
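
The following is a hand-written sketch (not compiler output; exact `llvm`
dialect type and op spelling may vary across MLIR versions) of what this
lowering looks like for an insertion into an `n-D` vector value:

```mlir
// Before lowering: insert a 1-D subvector into a 3-D vector value.
%w = vector.insert %sub, %v[2, 3]
    : vector<128xf32> into vector<4x8x128xf32>

// After lowering to the LLVM dialect, the 3-D vector has become a nested
// aggregate of 1-D vectors and the insertion uses llvm.insertvalue with
// static indices:
//   %w = llvm.insertvalue %sub, %v[2, 3]
//       : !llvm.array<4 x array<8 x vector<128xf32>>>
```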

### Hardware Vector Ops

Hardware Vector Ops are implemented as one dialect per target. For internal
hardware, we are auto-generating the specific HW dialects. For `GPU`, the
`NVVM` dialect adds operations such as `mma.sync`, `shfl` and tests. For `CPU`
things are somewhat in-flight because the abstraction is close to `LLVMIR`.
The jury is still out on whether a generic `CPU` dialect is concretely needed,
but it seems reasonable to have the same levels of abstraction for all targets
and perform cost-based lowering decisions in MLIR even for `LLVM`. Specialized
`CPU` dialects that would capture specific features not well captured by LLVM
peephole optimizations, or that operate on types that core MLIR supports (e.g.
Scalable Vectors), are welcome future extensions.

### Virtual Vector Ops

Some existing Arithmetic and Vector dialect operations on `n-D` `vector` types
include:

```mlir
%2 = arith.addf %0, %1 : vector<3x7x8xf32>  // -> vector<3x7x8xf32>
%2 = arith.mulf %0, %1 : vector<3x7x8xf32>  // -> vector<3x7x8xf32>
%2 = vector.splat %1 : vector<3x7x8xf32>    // -> vector<3x7x8xf32>

%1 = vector.extract %0[1] : vector<3x7x8xf32>     // -> vector<7x8xf32>
%1 = vector.extract %0[1, 5] : vector<3x7x8xf32>  // -> vector<8xf32>
%2 = vector.outerproduct %0, %1 : vector<4xf32>, vector<8xf32>      // -> vector<4x8xf32>
%3 = vector.outerproduct %0, %1, %2 : vector<4xf32>, vector<8xf32>  // fma when adding %2
%3 = vector.strided_slice %0 {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}
    : vector<4x8x16xf32>  // Returns a slice of type vector<2x2x16xf32>

%2 = vector.transfer_read %A[%0, %1] {permutation_map = (d0, d1) -> (d0)}
    : memref<7x?xf32>, vector<4xf32>

vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
    {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)}
    : vector<5x4x3xf32>, memref<?x?x?x?xf32>
```

The list of Vector ops is currently evolving and is best kept track of by
following the evolution of the
[VectorOps.td](https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/Vector/IR/VectorOps.td)
ODS file (markdown documentation is automatically generated locally when
building and populates the
[Vector doc](https://github.com/llvm/llvm-project/blob/main/mlir/docs/Dialects/Vector.md)).
Recent extensions are driven by concrete use cases of interest. A notable such
use case is the `vector.contract` op which applies principles of the
StructuredOps abstraction to `vector` types; a sketch follows.
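
As an illustration, here is a minimal sketch of a matmul-like contraction
expressed with `vector.contract` (the affine maps and iterator types follow
the op's documented form; exact attribute spelling may vary across MLIR
versions):

```mlir
#map_a = affine_map<(i, j, k) -> (i, k)>
#map_b = affine_map<(i, j, k) -> (k, j)>
#map_c = affine_map<(i, j, k) -> (i, j)>

// %res = %C + %A * %B, contracting over the k dimension.
%res = vector.contract {
    indexing_maps = [#map_a, #map_b, #map_c],
    iterator_types = ["parallel", "parallel", "reduction"]
  } %A, %B, %C
  : vector<4x16xf32>, vector<16x8xf32> into vector<4x8xf32>
```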

### Virtual Vector Rewrite Patterns

The following rewrite patterns exist at the `VV->VV` level:

1.  The now retired `MaterializeVector` pass used to legalize ops on a
    coarse-grained virtual `vector` to a finer-grained virtual `vector` by
    unrolling. This has been rewritten as a retargetable unroll-and-jam
    pattern on `vector` ops and `vector` types (see the sketch after this
    list).
2.  The lowering of `vector_transfer` ops legalizes `vector` load/store ops to
    permuted loops over scalar load/stores. This should evolve to loops over
    `vector` load/stores + `mask` operations as they become available as
    `vector` ops at the `VV` level.
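
The effect of the unrolling pattern can be pictured as follows (a hand-written
sketch assuming a native `2x8` shape; the actual pattern emits one
`vector.extract_strided_slice` / `vector.insert_strided_slice` pair per
piece):

```mlir
// Original op on a coarse 4x8 virtual vector:
//   %c = arith.addf %a, %b : vector<4x8xf32>
// becomes, after unrolling into two 2x8 pieces:
%a0 = vector.extract_strided_slice %a
    {offsets = [0, 0], sizes = [2, 8], strides = [1, 1]}
    : vector<4x8xf32> to vector<2x8xf32>
%b0 = vector.extract_strided_slice %b
    {offsets = [0, 0], sizes = [2, 8], strides = [1, 1]}
    : vector<4x8xf32> to vector<2x8xf32>
%c0 = arith.addf %a0, %b0 : vector<2x8xf32>
// ... the same extract/compute steps for the piece at offsets = [2, 0] ...
%init = arith.constant dense<0.0> : vector<4x8xf32>
%r0 = vector.insert_strided_slice %c0, %init
    {offsets = [0, 0], strides = [1, 1]}
    : vector<2x8xf32> into vector<4x8xf32>
// %r1 would insert the second piece into %r0.
```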

The general direction is to add more Virtual Vector level ops and implement
more useful `VV -> VV` rewrites as composable patterns that the PatternRewrite
infrastructure can apply iteratively.

### Virtual Vector to Hardware Vector Lowering

For now, `VV -> HWV` lowerings are specified in C++ (see for instance the
[SplatOpLowering for n-D vectors](https://github.com/tensorflow/mlir/commit/0a0c4867c6a6fcb0a2f17ef26a791c1d551fe33d)
or the
[VectorOuterProductOp lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).

Simple
[conversion tests](https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir)
are available for the `LLVM` target starting from the Virtual Vector Level.

## Rationale

### Hardware as `vector` Machines of Minimum Granularity

Higher-dimensional `vector`s are ubiquitous in modern HPC hardware. One way to
think about a generic retargetable `vector`-level dialect is that it operates
on `vector` types that are multiples of a "good" `vector` size so the HW can
efficiently implement a set of high-level primitives (e.g.
`vector<8x8x8x16xf32>` when the HW `vector` size is, say, `vector<4x8xf32>`).

Some notable `vector` sizes of interest include:

1.  CPU: `vector<HW_vector_size * k>`, `vector<core_count * k’ x
    HW_vector_size * k>` and `vector<socket_count x core_count * k’ x
    HW_vector_size * k>`.
2.  GPU: `vector<warp_size * k>`, `vector<warp_size * k x float4>` and
    `vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes.
3.  Other accelerators: n-D `vector` as first-class citizens in the HW.

Depending on the target, ops on sizes that are not multiples of the HW
`vector` size may either produce slow code (e.g. by going through `LLVM`
legalization) or may not legalize at all (e.g. some unsupported accelerator X
combination of ops and types).

### Transformation Problems Avoided

A `vector<16x32x64xf32>` virtual `vector` is a coarse-grained type that can be
“unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
carried in the IR by the `vector` type. After unrolling, traditional
instruction-level scheduling can be run.

The following key transformations (along with the supporting analyses and
structural constraints) are completely avoided by operating on a `vector`
SSA value abstraction:

1.  Loop unroll and unroll-and-jam.
2.  Loop and load-store restructuring for register reuse.
3.  Load to store forwarding and Mem2reg.
4.  Coarsening (raising) from finer-grained `vector` form.

Note that “unrolling” in the context of `vector`s corresponds to partial loop
unroll-and-jam and not full unrolling. As a consequence, this is expected to
compose with SW pipelining where applicable and does not result in ICache
blowup.

### The Big Out-Of-Scope Piece: Automatic Vectorization

One important piece not discussed here is automatic vectorization
(automatically raising from scalar to n-D `vector` ops and types). The TL;DR
is that when the first "super-vectorization" prototype was implemented, MLIR
was nowhere near as mature as it is today. As we continue building more
abstractions in `VV -> HWV`, there is an opportunity to revisit vectorization
in MLIR.

Since this topic touches on codegen abstractions, it is technically out of the
scope of this survey document but there is a lot to discuss in light of
structured op type representations and how a vectorization transformation can
be reused across dialects. In particular, MLIR allows the definition of
dialects at arbitrary levels of granularity and lends itself favorably to
progressive lowering. The argument can be made that automatic vectorization on
a loops + ops abstraction is akin to raising structural information that has
been lost. Instead, it is possible to revisit vectorization as simple pattern
rewrites, provided the IR is in a suitable form. For instance, vectorizing a
`linalg.generic` op whose semantics match a `matmul` can be done
[quite easily with a pattern](https://github.com/tensorflow/mlir/commit/bff722d6b59ab99b998f0c2b9fccd0267d9f93b5)
(a sketch is given below). In fact, this pattern is trivial to generalize to
any type of contraction when targeting the `vector.contract` op, as well as to
any field (`+/*`, `min/+`, `max/+`, `or/and`, `logsumexp/+`, ...). In other
words, by operating on a higher level of generic abstractions than affine
loops, non-trivial transformations become significantly simpler and composable
at a finer granularity.
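
For intuition, a hand-written sketch of such a rewrite (shapes, `%c0` index
constants and the `%f0` padding value are illustrative assumptions):

```mlir
// Before: a linalg op whose semantics match a matmul.
linalg.matmul ins(%A, %B : memref<4x16xf32>, memref<16x8xf32>)
             outs(%C : memref<4x8xf32>)

// After the rewrite: transfers into vector values plus one vector.contract
// with matmul indexing maps.
%a = vector.transfer_read %A[%c0, %c0], %f0
    : memref<4x16xf32>, vector<4x16xf32>
%b = vector.transfer_read %B[%c0, %c0], %f0
    : memref<16x8xf32>, vector<16x8xf32>
%c = vector.transfer_read %C[%c0, %c0], %f0
    : memref<4x8xf32>, vector<4x8xf32>
%d = vector.contract {
    indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
                     affine_map<(i, j, k) -> (k, j)>,
                     affine_map<(i, j, k) -> (i, j)>],
    iterator_types = ["parallel", "parallel", "reduction"]
  } %a, %b, %c : vector<4x16xf32>, vector<16x8xf32> into vector<4x8xf32>
vector.transfer_write %d, %C[%c0, %c0]
    : vector<4x8xf32>, memref<4x8xf32>
```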

Irrespective of the existence of an auto-vectorizer, one can build a notional
vector language based on the VectorOps dialect and build end-to-end models by
expressing `vector`s directly in the IR and applying simple pattern rewrites.
[EDSC](https://github.com/llvm/llvm-project/blob/main/mlir/docs/EDSC.md)s
provide a simple way of driving such a notional language directly in C++.

## Bikeshed Naming Discussion

There are arguments against naming an n-D level of abstraction `vector`
because most people associate it with 1-D `vector`s. On the other hand,
`vector`s are first-class n-D values in MLIR. The alternative name Tile has
been proposed, which conveys higher-D meaning. But it is also one of the most
overloaded terms in compilers and hardware. For now, we generally use the
`n-D` `vector` name and are open to better suggestions.

## DeeperDive

This section describes the tradeoffs involved in lowering the MLIR n-D vector
type and operations on it to LLVM-IR. Putting aside the
[LLVM Matrix](http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html)
proposal for now, this assumes LLVM only has built-in support for `1-D`
vectors. The relationship with the LLVM Matrix proposal is discussed at the
end of this document.

MLIR does not currently support dynamic vector sizes (i.e. SVE-style) so the
discussion is limited to static rank and static vector sizes (e.g.
`vector<4x8x16x32xf32>`). This section discusses operations on vectors in LLVM
and MLIR.

LLVM instructions are prefixed by the `llvm.` dialect prefix (e.g.
`llvm.insertvalue`). Such ops operate exclusively on 1-D vectors and
aggregates following the [LLVM LangRef](https://llvm.org/docs/LangRef.html).
MLIR operations are prefixed by the `vector.` dialect prefix (e.g.
`vector.insertelement`). Such ops operate exclusively on MLIR `n-D` `vector`
types.

### Alternatives For Lowering an n-D Vector Type to LLVM

Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}` (i.e. an
MLIR `vector<s_0x...s_{n-1}xf32>`). Lowering such an `n-D` MLIR vector type to
an LLVM descriptor can be done by either:

1.  Flattening to a `1-D` vector: `vector<s_0*...*s_{n-1}xf32>` in the MLIR
    LLVM dialect.
2.  Nested aggregate type of `1-D` vectors:
    `!llvm.array<s_0 x array<s_1 x ... vector<s_{n-1}xf32>>>` in the MLIR LLVM
    dialect.
3.  A mix of both.

There are multiple tradeoffs involved in choosing one or the other that we
discuss. It is important to note that “a mix of both” immediately reduces to
“nested aggregate type of 1-D vector” with a `vector.cast %0:
vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens the `k`
most minor dimensions.

### Constraints Inherited from LLVM (see LangRef)

The first constraint was already mentioned: LLVM only supports `1-D` `vector`
types natively. Additional constraints are related to the difference in LLVM
between vector and aggregate types: `“Aggregate Types are a subset of derived
types that can contain multiple member types. Arrays and structs are aggregate
types. Vectors are not considered to be aggregate types.”`

This distinction is also reflected in some of the operations. For `1-D`
vectors, the operations `llvm.extractelement`, `llvm.insertelement`, and
`llvm.shufflevector` apply, with direct support for dynamic indices. For `n-D`
vectors with `n>1`, and thus aggregate types at LLVM level, the more
restrictive operations `llvm.extractvalue` and `llvm.insertvalue` apply, which
only accept static indices. There is no direct shuffling support for aggregate
types.
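
Concretely (a hand-written sketch in the MLIR LLVM dialect; exact op spelling
may vary across MLIR versions):

```mlir
// 1-D vector: a dynamic index %i is fine.
%e = llvm.extractelement %v[%i : i64] : vector<16xf32>

// Aggregate of 1-D vectors (the lowered form of a 2-D MLIR vector):
// indices must be compile-time constants.
%row = llvm.extractvalue %agg[3] : !llvm.array<4 x vector<16xf32>>
```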

The next sentence illustrates a recurrent tradeoff, also found in MLIR,
between “value types” (subject to SSA use-def chains) and “memory types”
(subject to aliasing and side-effects): `“Structures in memory are accessed
using ‘load’ and ‘store’ by getting a pointer to a field with the
llvm.getelementptr instruction. Structures in registers are accessed using the
llvm.extractvalue and llvm.insertvalue instructions.”`

When transposing this to MLIR, `llvm.getelementptr` works on pointers to `n-D`
vectors in memory. For `n-D` vector values that live in registers, we can use
`vector.extract` and `vector.insert`, which do not accept dynamic indices.
Note that this is consistent with hardware considerations as discussed below.

An alternative is to use an LLVM `1-D` `vector` type for which one can use
`llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. These
operations accept dynamic indices. The implication is that one has to use a
flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
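
A sketch of what dynamic indexing would look like in this flattened
alternative (hand-written; `%v`, `%i` and `%j` are assumed given, and the
linearization arithmetic would be generated by the lowering):

```mlir
// Dynamic extraction of element (%i, %j) from a conceptual vector<4x8xf32>
// flattened to vector<32xf32>: a linearized index must be materialized.
%c8 = llvm.mlir.constant(8 : i64) : i64
%lin0 = llvm.mul %i, %c8 : i64
%lin = llvm.add %lin0, %j : i64
%e = llvm.extractelement %v[%lin : i64] : vector<32xf32>
```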

There are multiple tradeoffs involved that mix implications on the programming
model, execution on actual HW and what is visible or hidden from codegen. They
are discussed in the following sections.

### Nested Aggregate

Pros:

1.  Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
2.  No need for linearization / delinearization logic inserted everywhere.
3.  `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
4.  `llvm.insertelement`, `llvm.extractelement`, `llvm.shufflevector` over
    `1-D` vector type is natural.

Cons:

1.  `llvm.insertvalue` / `llvm.extractvalue` does not accept dynamic indices
    but only static ones.
2.  Dynamic indexing on the non-most-minor dimension requires roundtrips to
    memory.
3.  Special intrinsics and native instructions in LLVM operate on `1-D`
    vectors. This is not expected to be a practical limitation thanks to a
    `vector.cast %0: vector<4x8x16x32xf32> to vector<4x4096xf32>` operation,
    that flattens the most minor dimensions (see the bigger picture in
    implications on codegen).

### Flattened 1-D Vector Type

Pros:

1.  `insertelement` / `extractelement` / `shufflevector` with dynamic indexing
    is possible over the whole lowered `n-D` vector type.
2.  Supports special intrinsics and native operations.

Cons:

1.  Requires linearization/delinearization logic everywhere, translations are
    complex.
2.  Hides away the real HW structure behind dynamic indexing: at the end of
    the day, HW vector sizes are generally fixed and multiple vectors will be
    needed to hold a vector that is larger than the HW.
3.  Peephole optimizations are unlikely to result in good code: arbitrary
    dynamic accesses, especially across HW vector boundaries, are unlikely to
    produce regular patterns.

### Discussion

#### HW Vectors and Implications on the SW and the Programming Model

As of today, the LLVM model only supports `1-D` vector types. This is
unsurprising because historically, the vast majority of HW only supports `1-D`
vector registers. We note that multiple HW vendors are in the process of
evolving to higher-dimensional physical vectors.

In the following discussion, let's assume the HW vector size is `1-D` and the
SW vector size is `n-D`, with `n >= 1`. The same discussion would apply with
`2-D` HW `vector` size and `n >= 2`. In this context, most HW exhibit a vector
register file. The number of such vectors is fixed. Depending on the rank and
sizes of the SW vector abstraction and the HW vector sizes and number of
registers, an `n-D` SW vector type may be materialized by a mix of multiple
`1-D` HW vector registers + memory locations at a given point in time.

The implication of the physical HW constraints on the programming model is
that one cannot index dynamically across hardware registers: a register file
can generally not be indexed dynamically. This is because the register number
is fixed and one either needs to unroll explicitly to obtain fixed register
numbers or go through memory. This is a constraint familiar to CUDA
programmers: declaring a `private float a[4]` and subsequently indexing it
with a *dynamic* value results in so-called **local memory** usage (i.e.
roundtripping to memory).

#### Implication on codegen

MLIR `n-D` vector types are currently represented as `(n-1)-D` arrays of `1-D`
vectors when lowered to LLVM. This introduces the consequences on static vs
dynamic indexing discussed previously: `extractelement`, `insertelement` and
`shufflevector` on `n-D` vectors in MLIR only support static indices. Dynamic
indices are only supported on the most minor `1-D` vector but not the outer
`(n-1)-D`. For other cases, explicit load / stores are required.

The implications on codegen are as follows:

1.  Loops around `vector` values are indirect addressing of vector values, so
    they must operate on explicit load / store operations over `n-D` vector
    types.
2.  Once an `n-D` `vector` type is loaded into an SSA value (that may or may
    not live in `n` registers, with or without spilling, when eventually
    lowered), it may be unrolled to smaller `k-D` `vector` types and
    operations that correspond to the HW. This level of MLIR codegen is
    related to register allocation and spilling that occur much later in the
    LLVM pipeline.
3.  HW may support >1-D vectors with intrinsics for indirect addressing within
    these vectors. These can be targeted thanks to explicit `vector_cast`
    operations from MLIR `k-D` vector types and operations to LLVM `1-D`
    vectors + intrinsics.

Alternatively, we argue that directly lowering to a linearized abstraction
hides away the codegen complexities related to memory accesses by giving a
false impression of magical dynamic indexing across registers. Instead we
prefer to make those very explicit in MLIR and allow codegen to explore
tradeoffs. Different HW will require different tradeoffs in the sizes involved
in steps 1., 2. and 3.

Decisions made at the MLIR level will have implications at a much later stage
in LLVM (after register allocation). We do not envision exposing concerns
related to modeling of register allocation and spilling to MLIR explicitly.
Instead, each target will expose a set of "good" target operations and `n-D`
vector types, associated with costs that `PatternRewriters` at the MLIR level
will be able to target. Such costs at the MLIR level will be abstract and used
for ranking, not for accurate performance modeling. In the future such costs
will be learned.

#### Implication on Lowering to Accelerators

To target accelerators that support higher-dimensional vectors natively, we
can start from either `1-D` or `n-D` vectors in MLIR and use `vector.cast` to
flatten the most minor dimensions to `1-D` `vector<Kxf32>` where `K` is an
appropriate constant. Then, the existing lowering to LLVM-IR immediately
applies, with extensions for accelerator-specific intrinsics.

It is the role of an Accelerator-specific vector dialect (see codegen flow in
the figure above) to lower the `vector.cast`. Accelerator -> LLVM lowering
would then consist of a bunch of `Accelerator -> Accelerator` rewrites to
perform the casts composed with `Accelerator -> LLVM` conversions + intrinsics
that operate on `1-D` `vector<Kxf32>`.
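
For instance, flattening the two most minor dimensions ahead of an
accelerator-specific intrinsic could look as follows (a sketch using the
notional `vector.cast` op from this document; the accelerator op is
hypothetical):

```mlir
// Flatten the two most minor dimensions: 8x16 = 128.
%flat = vector.cast %0 : vector<4x8x16xf32> to vector<4x128xf32>
// A hypothetical accelerator intrinsic on 1-D vectors of 128 elements
// would then apply to each of the 4 rows:
// accel.op %flat : vector<4x128xf32>
```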

Some of those rewrites may need extra handling, especially if a reduction is
involved. For example, `vector.cast %0: vector<K1x...xKnxf32> to
vector<Kxf32>` when `K != K1 * … * Kn` and some arbitrary irregular
`vector.cast %0: vector<4x4x17xf32> to vector<Kxf32>` may introduce masking
and intra-vector shuffling that may not be worthwhile or even feasible, i.e.
have infinite cost.

However, `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>` when `K = K1
* … * Kn` should be close to a noop.

As we start building accelerator-specific abstractions, we hope to achieve
retargetable codegen: the same infra is used for CPU, GPU and accelerators
with extra MLIR patterns and costs.

#### Implication on calling external functions that operate on vectors

It is possible (likely) that we additionally need to linearize when calling an
external function.

### Relationship to LLVM matrix type proposal

The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat
stalled until recently. In its current form, it is limited to 2-D matrix types
and operations are implemented with LLVM intrinsics. In contrast, MLIR sits at
a higher level of abstraction and allows the lowering of generic operations on
generic n-D vector types from MLIR to aggregates of 1-D LLVM vectors. In the
future, it could make sense to lower to the LLVM matrix abstraction also for
CPU even though MLIR will continue needing higher level abstractions.

On the other hand, one should note that as MLIR is moving to LLVM, this
document could become the unifying abstraction that people should target for
1-D vectors and the LLVM matrix proposal can be viewed as a subset of this
work.

### Conclusion

The flattened 1-D vector design in the LLVM matrix proposal is good in a
HW-specific world with special intrinsics. This is a good abstraction for
register allocation, Instruction-Level-Parallelism and
SoftWare-Pipelining/Modulo Scheduling optimizations at the register level.
However, MLIR codegen operates at a higher level of abstraction where we want
to target operations on coarser-grained vectors than the HW size and on which
unroll-and-jam is applied and patterns across multiple HW vectors can be
matched.

This makes “nested aggregate type of 1-D vector” an appealing abstraction for
lowering from MLIR because:

1.  it does not hide complexity related to the buffer vs value semantics and
    the memory subsystem and
2.  it does not rely on LLVM to magically make all the things work from a too
    low-level abstraction.

The use of special intrinsics in a `1-D` LLVM world is still available thanks
to an explicit `vector.cast` op.

## Operations

[include "Dialects/VectorOps.md"]