Elbow Metrics

The document proposes a new approach called Kneedle for detecting "knees" or optimal operating points in systems. It begins by discussing how knees represent good trade-off points in systems but current detection methods are ad hoc or system-specific. It then defines a knee mathematically as the point of maximum curvature in a continuous function, and discusses challenges in applying this to discrete data sets. The document aims to provide a general knee detection tool rather than system-specific solutions.

Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior

Ville Satopää†, Jeannie Albrecht†, David Irwin‡, and Barath Raghavan§
† Williams College, Williamstown, MA
‡ University of Massachusetts Amherst, Amherst, MA
§ International Computer Science Institute, Berkeley, CA

Abstract: Computer systems often reach a point at which the relative cost to increase some tunable parameter is no longer worth the corresponding performance benefit. These “knees” typically represent beneficial points that system designers have long selected to best balance inherent trade-offs. While prior work largely uses ad hoc, system-specific approaches to detect knees, we present Kneedle, a general approach to online and offline knee detection that is applicable to a wide range of systems. We define a knee formally for continuous functions using the mathematical concept of curvature and compare our definition against alternatives. We then evaluate Kneedle’s accuracy against existing algorithms on both synthetic and real data sets, and evaluate its performance in two different applications.

I. INTRODUCTION

Selecting the “right” operating point for a given system is often thought of as an art form, since the direct and indirect costs and benefits of changing different system parameters are difficult or even impossible to quantify. For example, an important operating point in a large MapReduce job occurs when the job should no longer wait for “slow” tasks to finish, but instead speculatively re-execute work on other nodes in hopes of finishing the job sooner [1]. Since MapReduce’s goal is to finish all tasks as fast as possible, it must decide when the cost, in terms of a job’s running time and cluster utilization, is worth the corresponding performance benefit, in terms of task completion percentage. Congestion-responsive network protocols face a related challenge when setting a sending rate: a protocol must decide a rate that maximizes performance without exceeding its fair share and causing congestion.

In prior work, the issue has frequently been couched as identifying one or more “knees”: operating points, based on recent trends, where the perceived cost to alter a system parameter is no longer worth the expected performance benefit. For MapReduce, triggering speculative execution after observing a knee in the task completion percentage ensures that the system re-executes tasks that are significantly slower than other similar tasks that have finished execution. In the case of a network protocol, successive increases to the sending rate should cease if delay signals congestion by increasing steeply, forming a knee. However, while the problem of knee detection (finding “good” operating points in system behavior) seems straightforward, to the best of our knowledge there exists neither an accepted definition of a knee nor a general systematic approach for detecting one.

Numerous researchers in widely disparate areas frequently encounter knee detection problems similar to those we describe [1], [2], [3], [4], [5]. In these systems, researchers either use ad hoc or system-specific approaches to detect knees, or defer the problem to future work. While a finely-crafted system-specific approach will perform better than a general knee detection approach, a designer may not take the time to design one. Thus, our aim is not to improve or optimize a specific system or protocol, but to provide system designers a general tool for improving the parts of their system they generally do not take the time to optimize. In network protocol and system design, rules-of-thumb often serve researchers and operators well in the absence of an optimal solution. We believe that a tool for knee detection adds to their problem solving arsenal. Our hypothesis is that a knee detection algorithm that does not require tuning for a specific system or operational characteristics is applicable in a wide range of settings where developers do not take the time to design, test, and optimize a system-specific algorithm.

II. DEFINING AND DETECTING KNEES

While the notion of a knee is well-known, we are not aware of a broadly accepted definition in prior literature. The confusion stems from the fact that researchers, in many cases unknowingly, use knees as a substitute for a more comprehensive cost-benefit analysis that is either difficult or impossible to perform. Performing a direct cost-benefit analysis is often complex, since it is inherently system-, platform-, and workload-specific. Further, many systems are not predictable due to volatile operating conditions.

For example, unpredictable failure rates in large clusters, which may change over time, are the root cause of stragglers in MapReduce jobs [1]. Likewise, since multiple flows share network links in the Internet, network protocols cannot predict in advance the rapidly changing level of TCP-friendly bandwidth available, but must instead continuously adapt to the indirect signals of packet loss and delay [6]. In lieu of a complex system-specific analysis, operators tend to select operating points, or knees, that are “good enough” by observing where performance improvements start to level off as a function of one or more tunable system parameters. Note that we focus on knee detection for complex systems that change their behavior according to volatile, and potentially unpredictable, operating conditions, and not for simple systems that permit standard closed-form models, e.g., M/M/1 queues [7].
Fig. 1: CDF of a standard Gaussian distribution with mean=0 and standard deviation=1. Vertical bar indicates point of maximum curvature. The inflection point of this curve occurs at x = 0. [Plot omitted: the top panel shows the CDF (“Arrivals” versus “Time”) and the bottom panel shows the associated curvature K versus x, both for x from −4 to 4.]

A. Knee Definition

The difficulty with defining a knee formally is that “good enough” in one system may not be “good enough” in another. Since knees only serve as an approximation, operators interpret them differently in different situations. Thus, knee detection is an inherently heuristic process. However, to design a general application-independent knee detection algorithm, we require a consistent definition applicable to any system. In this work, as in [8], we use the mathematical definition of curvature for a continuous function as the basis for our knee definition. For any continuous function f, there exists a standard closed-form K_f(x) that defines the curvature of f at any point as a function of its first and second derivative:

$$K_f(x) = \frac{f''(x)}{\left(1 + f'(x)^2\right)^{1.5}}$$

The point of maximum curvature is well-matched to the ad hoc methods operators use to select a knee, since curvature is a mathematical measure of how much a function differs from a straight line. As a result, maximum curvature captures the leveling off effect operators use to identify knees. Importantly, unlike other common definitions, curvature is application-independent: it (i) does not depend on the relationship between system parameters and performance, and (ii) does not require setting system-specific thresholds. Note that knee detection does depend on the selection of proper adjustable system parameters and performance metrics, as we show for our examples in Section V.

It is important to realize why a knee definition based only on the first derivative is not enough to identify a knee. Consider the simple example in Figure 1, where the y-axis represents some performance metric, the x-axis represents a tunable system parameter, and the vertical bar represents the point of maximum curvature. The maximum of the first derivative is the inflection point of the curve, which occurs at x = 0 in Figure 1. The inflection point is not representative of the knee since performance continues to improve significantly beyond it. Instead, the inflection point only captures where the rate of performance increase reaches a maximum. In contrast, the curvature definition precisely matches the concept of a knee. Reference [8] includes a survey of a range of other knee definitions from prior work, primarily in the context of clustering algorithms [7], [9], [10], [11], [12]. We discuss alternative definitions below.

While curvature is well-defined for continuous functions, it is not well-defined for discrete data sets. In the discrete case, we could determine curvature by fitting a continuous function to the data and using the function’s point of maximum curvature. However, fitting a continuous function to a set of arbitrary data points is difficult, especially if the data is noisy. Further, determining the maximum curvature of the resulting function may not be sufficient, since the curvature at any point of a function is dependent on the entire function, including points not in the relevant data set. Thus, maximum curvature may fall outside the data’s valid range or be one of the set’s end-points. Since an approximation of curvature requires at least three points (the minimum number of points that define a circle), end-points in a data set do not have curvature values by definition. Thus, using the closed-form formulation as a direct basis for knee detection on discrete data is not possible.

B. Knee Detection in Discrete Data Sets

Researchers have proposed multiple previous approaches to detecting knees in discrete data. Before formulating our curvature-inspired algorithm in Section III, we present two existing approaches from prior research, Angle-based and EWMA, for comparison, as well as another approach we formulate based on Menger curvature, a direct discrete equivalent of continuous curvature. Note that the Angle-based and Menger algorithms are designed specifically for offline cases, where the entire data set is known in advance, while EWMA is designed to detect knees online as data points become known.

Angle-based. The geometric “angle-based” approach of Zhao et al. [13] is an extension of the L-method for detecting knees in clustering applications [8]. The Angle-based approach first finds the local minima of the successive differences (y_1 + y_3 − 2y_2) for each consecutive triple of points. For example, consider a straight line that goes through the consecutive points (x_1, y_1), (x_2, y_2), and (x_3, y_3). Assuming x-values are evenly spaced, then y_1 + y_3 − 2y_2 = 0 for any straight segment. However, if these three points form a knee, (x_2, y_2) must be above the straight line that goes through (x_1, y_1) and (x_3, y_3). In this case y_1 + y_3 − 2y_2 < 0. “Sharper” knees have more negative difference values.

Next, since successive differences are local measures and ignore the overall trend of the curve, the algorithm combines the differences with an angle value. After obtaining the local minima of the successive differences, the algorithm sorts the minima, and, starting from the point with the largest difference value, calculates the two angles formed by the y-axis and the line going through each successive pair of points associated with the corresponding difference value. The sum of these two angles is the angle value. Knees are detected at the local maxima of these angle values.

Menger Curvature. While curvature is not well-defined for arbitrary discrete data sets, Menger curvature defines the curvature for three discrete points as the curvature of the circle circumscribed about those points [14]. Thus, we define the Menger curvature for each point p_i = (x_i, y_i) in an n point data set as being equal to 1/r for the circle of radius r circumscribed about p_1, p_i, and p_n. The curvature of the circumscribed circle is straightforward to compute.
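As a rough illustration of the Menger-based detector just defined, the following Python sketch computes 1/r from the side lengths and area of the triangle through p_1, p_i, and p_n, using the standard circumradius identity r = abc / (4 · area). The function names `menger_curvature` and `menger_knee` are hypothetical and are not taken from the paper; this is a minimal sketch, not the authors' implementation.

```python
import math


def menger_curvature(p, q, r):
    """Curvature 1/R of the circle through three 2-D points (0.0 if collinear)."""
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(p, r)
    if a * b * c == 0.0:
        raise ValueError("points must be distinct")
    # Triangle area from the 2-D cross product of two edge vectors.
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    area = abs(cross) / 2.0
    # Circumradius R = abc / (4 * area), so curvature = 4 * area / (abc).
    return 4.0 * area / (a * b * c)


def menger_knee(points):
    """Index and curvature of the interior point with the largest Menger
    curvature, measured against the first and last points of the data set."""
    first, last = points[0], points[-1]
    best_index, best_curvature = None, -1.0
    for i in range(1, len(points) - 1):
        k = menger_curvature(first, points[i], last)
        if k > best_curvature:
            best_index, best_curvature = i, k
    return best_index, best_curvature
```

On three points of a circle of radius 2, for example (2, 0), (0, 2), and (−2, 0), `menger_curvature` returns 0.5, as expected for 1/r.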
Menger curvature is simply a function of the lengths of the sides of the triangle with the points as vertices. However, as we show in Section IV, while Menger closely approximates curvature for offline data drawn from ideal continuous functions, it does not work well for the noisy online data sets typical of computing systems.

EWMA. The EWMA approach uses techniques similar to those employed by Bollinger Bands [15] and Geometric Moving Average algorithms for change detection [16]. The algorithm that we use is based on the methodology described by Albrecht et al. in their work on partial barriers [3], which derives from previous work on MONET [17]. EWMA is an online algorithm that uses two exponentially weighted moving averages. The first EWMA, called arr, is used to smooth the input data, which is viewed as host arrival times. The second EWMA, arrvar, keeps track of the average deviation from arr, and is an estimate of the variance in arrival times. Finally, these two values are used to compute a maximum wait threshold of arr + 4 · arrvar, which represents the maximum amount of time to wait for the next point to arrive. If the point arrives after this threshold, or the threshold is reached without seeing the next arrival, EWMA declares a knee. One important attribute of this algorithm is that EWMA does not directly report where the knee point is; it only determines if a knee has been passed. As a result, EWMA is only applicable in an online setting.

III. KNEEDLE ALGORITHM

Kneedle is based on the notion that the points of maximum curvature in a data set (the knees) are approximately the set of points in a curve that are local maxima if the curve is rotated θ degrees clockwise about (x_min, y_min) through the line formed by the points (x_min, y_min) and (x_max, y_max). We choose this line because we want to preserve the overall behavior of the data set; using a line of best fit, for example, risks cutting off the end points due to a higher concentration of points in the middle of the curve. After rotating about this line, the local maxima, and thus knees, are the points at which the curve differs most from the straight line segment connecting the first and last data point, thereby approximating the point of maximum curvature for a discrete set of points. Since maximum curvature is an inherent measure of the point where a continuous function differs most from a straight line, Kneedle uses a literal measure of the point that differs most from the straight line connecting the set’s end-points.

Figure 2 depicts how Kneedle works for data points drawn from the curve y = −1/x + 5 where x-values are between 0 and 1. Note that we assume that the curves under consideration have negative concavity. For curves with consistently positive concavity (e.g., forming “elbows” rather than knees) it is trivial to invert the graph by replacing each y_i with y_max − y_i and x_i with x_max − x_i.

Fig. 2: Kneedle algorithm for online knee detection. (a) depicts the smoothed and normalized data, with dashed bars indicating the perpendicular distance from y = x with the maximum distance indicated. (b) shows the same data, but this time the dashed bars are rotated 45 degrees. The magnitude of these bars corresponds to the difference values used in Kneedle. (c) shows the plot of these difference values and the corresponding threshold values (with S = 1). The knee is found at x = 0.22 and is detected after receiving the point x = 0.55. [Plots omitted: three panels (a), (b), (c), each on the unit square; panel (c) legend: Difference, Threshold.]

We summarize Kneedle below. Put simply, knees occur when a curve becomes more “flat,” indicating a decrease in curvature. The algorithm works as follows:

1. First we use a smoothing spline to preserve the shape of the original data set as much as possible, although other smoothing techniques, such as an exponentially weighted moving average, could also be used. Let D_s represent the finite set of x- and y-values that define a smooth curve, i.e., one that has been fit to a smoothing spline.

$$D_s = \{(x_{s_i}, y_{s_i}) \in \mathbb{R}^2 \mid x_{s_i}, y_{s_i} \geq 0\}$$

2. We want our algorithm to function in the same way regardless of the magnitude of the values in the underlying data. Thus, we next normalize the points of the smooth curve to the unit square, as shown in Figure 2(a). This does not change the shape or trends of the data set:

$$D_{sn} = \{(x_{sn_i}, y_{sn_i})\}, \text{ where}$$
$$x_{sn_i} = \frac{x_{s_i} - \min\{x_s\}}{\max\{x_s\} - \min\{x_s\}}, \qquad y_{sn_i} = \frac{y_{s_i} - \min\{y_s\}}{\max\{y_s\} - \min\{y_s\}}$$

3. Next, we let D_d represent the set of differences between the x- and y-values, i.e., the set of points (x, y − x) as illustrated in Figure 2(b). The goal is to find out when the difference curve changes from horizontal to sharply decreasing, since this indicates the presence of a knee in the original data set. Note that the actual values of the difference points are irrelevant. We are only interested in observing the trends of the difference curve, as seen in Figure 2(c).

$$D_d = \{(x_{d_i}, y_{d_i})\}, \text{ where } x_{d_i} = x_{sn_i}, \qquad y_{d_i} = y_{sn_i} - x_{sn_i}$$
4. To find the knee points in the normalized curve, i.e., the places where the curve flattens out, we calculate the local maxima of the difference curve. These points indicate the instances where the rate of increase of y begins to decrease. Each of these local maximum points is a candidate knee point in the original data curve:

$$D_{lmx} = \{(x_{lmx_i}, y_{lmx_i})\}, \text{ where } x_{lmx_i} = x_{d_i}, \qquad y_{lmx_i} = y_{d_i} \mid y_{d_{i-1}} < y_{d_i},\ y_{d_{i+1}} < y_{d_i}$$

5. For each local maximum (x_lmx_i, y_lmx_i) in the difference curve, we define a unique threshold value, T_lmx_i, that is based on the average difference between consecutive x-values and a sensitivity parameter, S. The sensitivity parameter allows us to adjust how aggressive we want Kneedle to be when detecting knees. Smaller values for S detect knees quicker, while larger values are more conservative. Put simply, S is a measure of how many “flat” points we expect to see in the unmodified data curve before declaring a knee. We explore the choice of S in Section IV. In Figure 2(c), the threshold line is plotted with S = 1.

$$T_{lmx_i} = y_{lmx_i} - S \cdot \frac{\sum_{i=1}^{n-1} \left(x_{sn_{i+1}} - x_{sn_i}\right)}{n-1}$$

6. If any difference value (x_d_j, y_d_j), where j > i, drops below the threshold y = T_lmx_i for (x_lmx_i, y_lmx_i) before the next local maximum in the difference curve is reached, Kneedle declares a knee at the x-value of the corresponding local maximum x = x_lmx_i. If the difference values reach a local minimum and start to increase before y = T_lmx_i is reached, we reset the threshold value to 0 and wait for another local maximum to be reached.

Note that Kneedle can be run offline or online. In the online case, Kneedle can “correct” old knee values if necessary as points are received. Kneedle’s online run time for any given n pairs of x- and y-values is bounded by $\sum_{i=1}^{n} i = O(n^2)$.

IV. EVALUATING KNEEDLE

We compare the performance of Kneedle to the offline (Angle-based, Menger) and online (EWMA) algorithms separately, since their goals are different. In offline settings, our aim is to determine a base-line accuracy for each algorithm using synthetic data sets drawn from continuous functions where the true knees are well-known. After showing that Kneedle closely approximates the true knees, we then compare its online behavior against EWMA to evaluate how quickly it is able to detect knees once they “appear” in the data.

A. Detecting Knees in Synthetic Data Sets

To evaluate Kneedle, we developed a synthetic data source which we call NoisyGaussian that yields data similar to many of the real data sets of interest, but allows us to vary the overall shape of the curve. To generate a NoisyGaussian, we start with a Gaussian function with a randomly selected standard deviation and mean. Then we generate the NoisyGaussian data set using the cumulative count of the randomly generated points whose value is less than x. The resulting curve is similar to a Gaussian cumulative distribution function in overall shape.

The benefit of evaluating the knee detection algorithms using NoisyGaussian is that an approximate closed-form solution exists for the point of maximum curvature. We derive the point of maximum curvature by computing it for the underlying Gaussian CDF in terms of standard deviation σ and mean µ. Although we omit the details for brevity, the point of maximum curvature is approximately x ≈ µ + σ with a small bounded error. We use this closed-form expression to represent the “correct” knee in our evaluation.

To illustrate the general behavior of each knee detector, we plot the knees each algorithm detects in Figure 3 for a sample NoisyGaussian data set with µ = 50 and σ = 10.

Fig. 3: Kneedle, Menger, Angle-based, and EWMA for synthetic data set. Maximum curvature occurs at x = 60. [Plot omitted: x-axis from 0 to 100; legend: Definition, Kneedle, Menger, Angle-based, EWMA.]

Fig. 4: Measured offline F-Score of knee detection algorithms using NoisyGaussian data. [Plot omitted: x-axis “Allowable error (number of points)”, y-axis “F-Score”; legend: Kneedle, Menger, Angle-based.]

Fig. 5: Histogram showing measured offline distances (numbers of x-values) to “correct” knees. [Plot omitted: x-axis “Difference to Maximum Curvature”, y-axis “Probability Density”; legend: Kneedle, Menger, Angle-based.]

B. Offline Accuracy

To evaluate offline accuracy, we use three common statistical metrics: precision, recall, and F-Score. Precision measures the correctness of each knee an algorithm detects. A low precision value indicates the presence of numerous false positives, where a false positive is any detected knee that does not align with maximum curvature. Recall measures completeness by quantifying the percentage of correct knees an algorithm detects out of the total number of correct knees. Note, however, that recall does not penalize for incorrect detections. Our third metric, F-Score, is the harmonic mean of precision and recall. An ideal knee detection algorithm has both high recall and high precision.
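To make the three metrics concrete, the following Python sketch scores a list of detected knee x-values against a list of true knees. The function name `knee_scores` and the `allowable_error` tolerance (how far a detection may sit from a true knee and still count as correct) are assumptions of this sketch rather than details taken from the paper's evaluation code.

```python
def knee_scores(detected, true_knees, allowable_error=0):
    """Return (precision, recall, f_score) for detected knee x-values.

    A detection counts as correct if it lies within `allowable_error`
    of some true knee (tolerance is an assumption of this sketch).
    """
    def close(a, b):
        return abs(a - b) <= allowable_error

    # Precision: fraction of detections that match some true knee.
    true_pos = sum(1 for d in detected if any(close(d, t) for t in true_knees))
    # Recall: fraction of true knees matched by some detection.
    found = sum(1 for t in true_knees if any(close(d, t) for d in detected))

    precision = true_pos / len(detected) if detected else 0.0
    recall = found / len(true_knees) if true_knees else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F-Score: harmonic mean of precision and recall.
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For instance, with detections at x = 3 and x = 9, one true knee at x = 4, and a tolerance of 1, the sketch gives precision 0.5 and recall 1.0, so the F-Score is 2/3. The extra false positive lowers precision and F-Score but leaves recall untouched, which is why recall alone is not sufficient.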
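The NoisyGaussian source of Section IV-A can be approximated in a few lines. The sketch below draws random points from a Gaussian and, for each integer x, counts how many of them fall below x. The function name `noisy_gaussian`, the sample count, the integer x-grid, and the fixed seed are all assumptions of this sketch; the paper does not specify them.

```python
import numpy as np


def noisy_gaussian(mu, sigma, n_samples=1000, x_max=100, seed=0):
    """Cumulative count of Gaussian samples below each x in 0..x_max.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    samples = np.sort(rng.normal(mu, sigma, size=n_samples))
    xs = np.arange(x_max + 1)
    # side="left" gives, for each x, the number of samples strictly below x.
    ys = np.searchsorted(samples, xs, side="left")
    return xs, ys


# Example with the parameters used for Figure 3; the closed-form
# approximation in the text then places the "correct" knee near mu + sigma.
xs, ys = noisy_gaussian(mu=50.0, sigma=10.0)
expected_knee = 50.0 + 10.0
```

The resulting curve is a noisy, step-like version of a Gaussian CDF scaled by the number of samples, which is the overall shape the text describes.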

We use F-Score to capture precision and recall in a single value. An F-Score value of 1 is best.

To evaluate our algorithms, we generate 10,000 NoisyGaussian data sets. Since none of the algorithms detect knees at exactly the point of maximum curvature, we vary how many data points we allow for error. For example, suppose our data set includes points at x = 1, 2, 3, 4, 5, and the point of maximum curvature is x = 4. With an allowable error of 1, we declare the algorithm as finding a “correct” knee if it detects a knee at x = 3, 4, or 5. Figure 4 shows that Kneedle’s F-Score is better than the Angle-based or Menger algorithm.

Using the closed-form approximation for the point of maximum curvature in our NoisyGaussian data sets, we can identify “true” knees in the data. This allows us to quantify the accuracy of each algorithm by measuring the distance, in terms of the number of x-values, between the true knees and the detected knees. Figure 5 shows these distances as a histogram. Kneedle approximates the point of maximum curvature much more closely than either Menger or Angle-based, since the density of the histogram is highest between 0 and 25, while Menger and Angle-based show a wider variation.

C. Online Detection Latency

In this section, we evaluate detection latency (the number of data points beyond the knee required for detection) for both EWMA and Kneedle. For online Kneedle, we execute the knee detection algorithm after receiving each new data point, in order of increasing x. For both EWMA and Kneedle, we compute the detection latency as the number of data points between when the algorithm detects a knee and the actual knee point as determined by the point of maximum curvature. For example, suppose the data set has points at x = 1, 2, 3, 4, and 5, with a true knee at x = 3. Now suppose that after receiving the point at x = 5, the knee detection algorithm detects a knee. In this case, we compute the latency as 5 − 3 = 2. In Figure 6 we plot a histogram of the detection latency for EWMA and Kneedle with S = 1. The experiment highlights the fact that Kneedle rarely has a significant detection latency, while EWMA often has high detection latencies.

Fig. 6: Online detection latency. Negative values indicate early detections. [Plot omitted: x-axis “Latency of Detection”, y-axis “Probability Density”; legend: Kneedle, EWMA.]

D. Sensitivity

To better understand the importance of sensitivity, S, to Kneedle’s performance, we again use F-Score. Figures 7 and 8 show the results of our sensitivity analysis in offline and online settings respectively. In both graphs, we compute Kneedle’s F-Score using a wide range of sensitivity values. We compare the F-Score from 10,000 data sets for each value of S. In the offline graph, we use the points of maximum curvature as the true knees, and compute the F-Score based on those values. In the online graph, our goal is to determine how quickly Kneedle approaches the offline case, and thus we use the knees detected by offline Kneedle as the correct knees. Not surprisingly, in offline settings where Kneedle has perfect information, the highest F-Score occurs when S = 0. In online settings, the results vary depending on the number of points received, but overall S = 1 has the best results.

Fig. 7: Measured offline F-Scores for varying sensitivity values in Kneedle. [Plot omitted: x-axis “Sensitivity Parameter, S”, y-axis “F-Score”.]

Fig. 8: Measured online F-Scores for varying sensitivity values in Kneedle. [Plot omitted: x-axis “Percent of Points Received”, y-axis “F-Score”; legend (Sensitivity): 0.001, 1.0, 5.0.]

V. APPLICATION RESULTS

This section demonstrates Kneedle’s usefulness in real applications. First, we identify knees in a data set from prior work, and show that we find close to the same knees that the authors found with system-specific techniques. Next we evaluate Kneedle’s performance for two sample applications: a MapReduce-like system and a TCP-friendly network protocol.

A. Using Kneedle in Existing Applications

Figure 9 applies knees to object replication, where the knees represent the optimal degrees of replication for high availability given various object distributions (data from Figure 5 in [5]). The application requires the detection of multiple knees in object popularity curves, each of which has considerable noise. Unlike other knee detection algorithms, such as Menger, Kneedle is capable of detecting multiple knees, where the sensitivity of this detection depends on the selected value of S. Note that we consider this knee detection application to be offline, since Zhong et al. observe: “[w]e expect that the replica adjustment overhead due to object request popularity changes would not be excessive in practice...our analysis of real system object request traces in Section 3.2 suggests that the popularities of most data objects tend to remain stable over multi-week periods.” The knees found by Kneedle in this graph concur with those identified by the original authors.
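For readers who want to experiment with offline detection of the kind used in this subsection, steps 2 through 6 of Section III can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `kneedle_offline` is hypothetical, the input is assumed to be already smoothed (step 1 is omitted), x is assumed to be strictly increasing, and the curve is assumed to have negative concavity. The threshold "reset" of step 6 is represented by clearing the pending candidate.

```python
import numpy as np


def kneedle_offline(x, y, S=1.0):
    """Sketch of Kneedle steps 2-6; returns the x-values of detected knees.

    Assumes (x, y) is already smoothed and x is strictly increasing.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 2: normalize both axes to the unit square.
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())

    # Step 3: difference curve y - x on the normalized data.
    yd = yn - xn

    # Step 5: each threshold sits S mean x-spacings below its local maximum.
    drop = S * np.mean(np.diff(xn))

    knees = []
    candidate = None  # index of the most recent local maximum still pending
    threshold = None
    n = len(yd)
    for j in range(1, n):
        # Step 6: a pending candidate becomes a knee once the difference
        # curve drops below its threshold.
        if candidate is not None and yd[j] < threshold:
            knees.append(float(x[candidate]))
            candidate, threshold = None, None
        if j == n - 1:
            break  # the last point has no right-hand neighbour
        if yd[j - 1] < yd[j] and yd[j + 1] < yd[j]:
            # Step 4: a new local maximum replaces any pending candidate.
            candidate, threshold = j, yd[j] - drop
        elif yd[j - 1] > yd[j] and yd[j + 1] > yd[j]:
            # Local minimum reached before the threshold: reset and wait.
            candidate, threshold = None, None
    return knees


# Example: the curve y = -1/x + 5, sampled at x = 0.1, 0.2, ..., 1.0
# (this sampling grid is an assumption of the sketch).
x = np.linspace(0.1, 1.0, 10)
knees = kneedle_offline(x, -1.0 / x + 5.0, S=1.0)
```

On this sample the sketch reports a single knee at the grid point x ≈ 0.3, the sample closest to where the normalized difference curve peaks. Because the loop keeps scanning after each detection, a curve with several flattening regions can yield several knees, and larger values of S delay or suppress detections.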
Fig. 9: Offline knee detection using Kneedle for replication [5]. [Plot omitted: x-axis “Sorted objects”, y-axis “Popularity-to-size ratios”.]

B. Using Kneedle for Speculative Execution

To test the effectiveness of Kneedle in our own MapReduce-like setting, we integrated our algorithm into a prototypical distributed batch computing system that farms out tasks to PlanetLab nodes [18]. We ran an experiment in which we gave 300 PlanetLab nodes a small task that required a mix of CPU and I/O resources to process. We then used Kneedle to find knee points and reallocate tasks from slow nodes to fast nodes, reducing the overall task completion time. This task reallocation is similar to MapReduce’s strategy for coping with the presence of stragglers. Figure 10 demonstrates that Kneedle can be successfully integrated into existing systems with minimal effort: the only change required to our work allocation system was a single function call. That is, each time a task completed, we called Kneedle with the new data point. When Kneedle returned a knee, we simply reallocated unfinished tasks to idle nodes, reducing the total completion time from 827 seconds down to 143 seconds.

Fig. 10: Distributed work allocator with/without online Kneedle. [Plot omitted: x-axis “Time (s)”, y-axis “Cumulative fraction”; legend: Without Kneedle, With Kneedle.]

C. Using Kneedle for Congestion Control

Next we show the applicability of Kneedle to an entirely different domain: network congestion control. Here we use Kneedle to find the knee, as described by Jain [2], in the offered load versus packet delay curve. We implemented a TCP-friendly congestion control algorithm for UDP flows, similar to TFRC [6], but without the careful measurement of packet transmissions and complicated equations. For each reply packet from the receiver, our algorithm simply calculates the round-trip delay. We increment the rate every time a packet is transmitted and pace the packets evenly; for every 100 packets sent, we compute the knee point and use it as the new target rate. Figure 11 shows the behavior of our UDP sender versus an ordinary TCP flow that joins and then departs. We use dummynet to limit bandwidth to 200Kbps. We find that our simple algorithm is both able to stabilize to the bottleneck bandwidth and to share bandwidth fairly with a TCP flow. While this experiment is simplistic, it highlights the benefits of Kneedle’s general approach.

Fig. 11: Congestion control using Kneedle (UDP) vs. TCP. [Plot omitted: x-axis “Time (s)”, y-axis “Sending rate (kbps)”; legend: UDP, TCP.]

VI. CONCLUSION

In this paper, we present a formal definition for a knee in discrete data sets based on the mathematical definition of curvature for continuous functions. We propose a new algorithm, Kneedle, for approximating these knee points in discrete data sets, and compare with existing knee detection approaches using both synthetic and real application data sets. We then show that Kneedle is useful in real systems by integrating it with minimal effort into two distinct systems that encounter the knee detection problem. We believe Kneedle addresses a common problem that arises in systems research and engineering. In addition, we believe that our methodology for evaluating knee detection approaches will allow any future knee detection designs to be compared in a straightforward manner.

ACKNOWLEDGMENT

This material is supported by NSF grant CNS-0845349.

REFERENCES

[1] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in OSDI, 2004.
[2] R. Jain, “Congestion Control in Computer Networks: Issues and Trends,” IEEE Network, vol. 4, no. 3, 1990.
[3] J. Albrecht, C. Tuttle, A. C. Snoeren, and A. Vahdat, “Loose Synchronization for Large-Scale Networked Systems,” in USENIX, 2006.
[4] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker, “Making Gnutella-like P2P Systems Scalable,” in SIGCOMM, 2003.
[5] M. Zhong, K. Shen, and J. Seiferas, “Replication Degree Customization for High Availability,” in EuroSys, 2008.
[6] S. Floyd, M. Handley, J. Padhye, and J. Widmer, “Equation-Based Congestion Control for Unicast Applications,” in SIGCOMM, 2000.
[7] C. Millsap, “Performance Management: Myths and Facts,” Oracle Corporation, Tech. Rep., 1999.
[8] S. Salvador and P. Chan, “Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms,” in ICTAI, 2004.
[9] T. Chiu, D. Fang, J. Chen, Y. Wang, and C. Jeris, “A Robust and Scalable Clustering Algorithm for Mixed Type Attributes in Large Database Environments,” in KDD, 2001.
[10] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in KDD, 1996.
[11] A. Foss and A. Zaiane, “A Parameterless Method for Efficiently Discovering Clusters of Arbitrary Shape in Large Datasets,” in ICDM, 2002.
[12] C. Millsap and J. Holt, Optimizing Oracle Performance. O’Reilly Media, Inc., 2003.
[13] Q. Zhao, V. Hautamaki, and P. Fränti, “Knee Point Detection in BIC for Detecting the Number of Clusters,” in ACIVS, 2008.
[14] X. Tolsa, “Principal Values for the Cauchy Integral and Rectifiability,” in American Mathematical Society, 2000.
[15] J. A. Bollinger, Bollinger on Bollinger Bands. McGraw-Hill, 2001.
[16] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, N.J.: Prentice-Hall, 1993.
[17] D. G. Andersen, H. Balakrishnan, M. F. Kaashoek, and R. N. Rao, “Improving Web Availability for Clients with MONET,” in NSDI, 2005.
[18] L. L. Peterson, A. C. Bavier, M. E. Fiuczynski, and S. Muir, “Experiences Building PlanetLab,” in OSDI, 2006.
