You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
BoUpSLP::getSpillCost() is said to do the following:
// Walk from the bottom of the tree to the top, tracking which values are
// live. When we see a call instruction that is not part of our tree,
// query TTI to see if there is a cost to keeping values live over it
// (for example, if spills and fills are required).
Some issues (with the implementation, not the documentation ;-):
The first sentence above assumes that a traversal of the VectorizableTree[] array from its first entry (0, the root) to its last entry, corresponds to a bottom-up traversal of the associated instructions in program order, thereby supporting liveness data-flow analysis. However, the entries in this array are not guaranteed to appear in program order. In the example below involving a subtraction the minuend appears closer to the root of the VectorizableTree than the subtrahend, contrary to the order in which they appear inside the basic block. This messes up the liveness scan. Thanks to Michael Kuperstein for a lively discussion on this.
The second sentence above indicates that we accumulate a spill/fill cost per call instruction. However, if the associated calling conventions provide callee-saved vector registers, we may be able to pay this cost once. Furthermore, even if all associated vector registers are caller-saved, if a live range between two VectorizableTree entries contains several calls, as in the example below, we can spill/fill once for all.
A live range may span more than two basic blocks. If we want to keep checking only two (at both ends of the live range) in the interest of compile time, add a comment.
Finally, targets that do not override the default getCostOfKeepingLiveOverCall() which returns 0, should avoid searching for calls throughout the live ranges of all values in the SLP chain, just to accumulate zero costs.
Consider the following example, SLP vectorizing to VF=4:
void slp (int, int);
void slp_call (int *a, int * b, int n)
{
int a0 = a[1];
int a1 = a[2];
int a2 = a[3];
int a3 = a[4];
slp(n, *a);
slp(n, *a);
slp(n, *a);
int b0 = b[2];
int b1 = b[3];
int b2 = b[4];
int b3 = b[5];
a0 = a0 - b0;
a1 = a1 - b1;
a2 = a2 - b2;
a3 = a3 - b3;
b[2] = a0;
b[3] = a1;
b[4] = a2;
b[5] = a3;
}
Clearly only one vector register needs to be spilled and filled before and after the 3 calls to slp() (if at all, considering remat, but that's another story). But as we start to traverse up the tree, we first account for the single live value feeding the store, then the 2 live values feeding the subtract (erasing the former single value); next we will reach the loads from a[1:4] (not the loads from b[2:5]), and before doing so we encounter the 3 calls and accumulate a cost of spilling/filling the above 2 live values each.
This order is evident by running -slp-vectorizer -dump:
SLP: #LV: 1 sub, Looking at %sub = sub nsw i32 %0, %7
SLP: #LV: 2 , Looking at %0 = load i32, i32* %arrayidx, align 4, !tbaa !1
SLP: #LV: 1 , Looking at %7 = load i32, i32* %arrayidx4, align 4, !tbaa !1
The text was updated successfully, but these errors were encountered:
Eyal, thank you for looking at this. I think that modeling spills at the SLP-vectorizer level is a really bad idea. The SLP vectorizer usually reduces register pressure because it enables the use of vector registers in addition to scalar registers. Attempting to predict register pressure is pointless. It's very difficult to guess what ISel and the scheduler would do, especially when vectorizing across basic blocks. This whole thing feels like a hack that was inserted to handle one specific workload and I suspect that if we simply remove this code we won't see any regressions in the LLVM test suite.
Indeed, remat or e.g. scheduling calls above loads can mitigate spill costs, as hinted in the attached example. And using vector in addition to scalar registers should be an advantage, if they are available. However, consider the 3rd issue - in order to best support TSLP, if we encounter calls whose conventions have no callee saved vector registers, this should affect our throttling process, right?
As for ripping getSpillCost out, this should best be championed by its authors. This PR suggests improvements and implicitly that its tests should cover a wider range of tree(s).
Extended Description
BoUpSLP::getSpillCost() is said to do the following:
// Walk from the bottom of the tree to the top, tracking which values are
// live. When we see a call instruction that is not part of our tree,
// query TTI to see if there is a cost to keeping values live over it
// (for example, if spills and fills are required).
Some issues (with the implementation, not the documentation ;-):
The first sentence above assumes that a traversal of the VectorizableTree[] array from its first entry (0, the root) to its last entry, corresponds to a bottom-up traversal of the associated instructions in program order, thereby supporting liveness data-flow analysis. However, the entries in this array are not guaranteed to appear in program order. In the example below involving a subtraction the minuend appears closer to the root of the VectorizableTree than the subtrahend, contrary to the order in which they appear inside the basic block. This messes up the liveness scan. Thanks to Michael Kuperstein for a lively discussion on this.
The second sentence above indicates that we accumulate a spill/fill cost per call instruction. However, if the associated calling conventions provide callee-saved vector registers, we may be able to pay this cost once. Furthermore, even if all associated vector registers are caller-saved, if a live range between two VectorizableTree entries contains several calls, as in the example below, we can spill/fill once for all.
In order to support SLP Vectorizer: Partial Tree Vectorization #18119 (SLP Vectorizer: Partial Tree Vectorization), we should attribute an individual spill cost to each VectorizableTree entry.
A live range may span more than two basic blocks. If we want to keep checking only two (at both ends of the live range) in the interest of compile time, add a comment.
Finally, targets that do not override the default getCostOfKeepingLiveOverCall() which returns 0, should avoid searching for calls throughout the live ranges of all values in the SLP chain, just to accumulate zero costs.
Consider the following example, SLP vectorizing to VF=4:
void slp (int, int);
void slp_call (int *a, int * b, int n)
{
int a0 = a[1];
int a1 = a[2];
int a2 = a[3];
int a3 = a[4];
slp(n, *a);
slp(n, *a);
slp(n, *a);
int b0 = b[2];
int b1 = b[3];
int b2 = b[4];
int b3 = b[5];
a0 = a0 - b0;
a1 = a1 - b1;
a2 = a2 - b2;
a3 = a3 - b3;
b[2] = a0;
b[3] = a1;
b[4] = a2;
b[5] = a3;
}
Clearly only one vector register needs to be spilled and filled before and after the 3 calls to slp() (if at all, considering remat, but that's another story). But as we start to traverse up the tree, we first account for the single live value feeding the store, then the 2 live values feeding the subtract (erasing the former single value); next we will reach the loads from a[1:4] (not the loads from b[2:5]), and before doing so we encounter the 3 calls and accumulate a cost of spilling/filling the above 2 live values each.
This order is evident by running -slp-vectorizer -dump:
SLP: #LV: 1 sub, Looking at %sub = sub nsw i32 %0, %7
SLP: #LV: 2 , Looking at %0 = load i32, i32* %arrayidx, align 4, !tbaa !1
SLP: #LV: 1 , Looking at %7 = load i32, i32* %arrayidx4, align 4, !tbaa !1
The text was updated successfully, but these errors were encountered: