2nd IEEE International Conference on Cloud Computing Technology and Science




Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System

Rini T. Kaushik, University of Illinois, Urbana-Champaign, kaushik1@illinois.edu
Milind Bhandarkar, Yahoo! Inc., milindb@yahoo-inc.com
Klara Nahrstedt, University of Illinois, Urbana-Champaign, klara@cs.uiuc.edu


Abstract

We present a detailed evaluation and sensitivity analysis of an energy-conserving, highly scalable variant of the Hadoop Distributed File System (HDFS) called GreenHDFS. GreenHDFS logically divides the servers in a Hadoop cluster into Hot and Cold Zones and relies on insightful, data-classification-driven, energy-conserving data placement to realize guaranteed, substantially long periods (several days) of idleness in a significant subset of servers in the Cold Zone. Detailed lifespan analysis of the files in a large-scale production Hadoop cluster at Yahoo! points at the viability of GreenHDFS. Simulation results with real-world Yahoo! HDFS traces show that GreenHDFS can achieve a 24% energy cost reduction by doing power management in only one top-level tenant directory in the cluster, and that it meets all the scale-down mandates in spite of the unique scale-down challenges present in a Hadoop cluster. If the GreenHDFS technique is applied to all the Hadoop clusters at Yahoo! (amounting to 38000 servers), $2.1 million can be saved in energy costs per annum. Sensitivity analysis shows that energy-conservation is minimally sensitive to the thresholds in GreenHDFS. Lifespan analysis points out that one-size-fits-all energy-management policies won't suffice in a multi-tenant Hadoop cluster.

1 Introduction

Cloud computing is gaining rapid popularity. Data-intensive computing needs range from advertising optimizations, user-interest predictions, mail anti-spam, and data analytics to deriving search rankings. An increasing number of companies and academic institutions have started to rely on Hadoop [1], an open-source version of Google's Map-reduce framework, for their data-intensive computing needs [13]. Hadoop's data-intensive computing framework is built on a large-scale, highly resilient, object-based cluster storage managed by the Hadoop Distributed File System (HDFS) [24].

With the increase in the sheer volume of data that needs to be processed, the storage and server demands of computing workloads are rising rapidly. Yahoo!'s compute infrastructure already hosts 170 petabytes of data and deploys over 38000 servers [15]. Over the lifetime of IT equipment, the operating energy cost is comparable to the initial equipment acquisition cost [11] and constitutes a significant part of the total cost of ownership of a datacenter [6]. Hence, energy-conservation of these extremely large-scale, commodity server farms has become a priority.

Scale-down (i.e., transitioning servers to an inactive, low-power-consuming sleep/standby state) is an attractive technique to conserve energy, as it allows energy proportionality with non-energy-proportional components such as disks [17] and significantly reduces power consumption (idle power draw of 132.46W vs. sleep power draw of 13.16W in a typical server, as shown in Table 1). However, scale-down cannot be done naively, as discussed in Section 3.2.

One technique is to scale down servers by manufacturing idleness, i.e., by migrating workloads and their corresponding state to fewer machines during periods of low activity [5, 9, 10, 25, 30, 34, 36]. This can be relatively easy to accomplish when servers are state-less (i.e., serving data that resides on a shared NAS or SAN storage system). However, servers in a Hadoop cluster are not state-less.

HDFS distributes data chunks and replicas across servers for resiliency, performance, load-balancing and data-locality reasons. With data distributed across all nodes, any node may be participating in the reading, writing, or computation of a data-block at any time. Such data placement makes it hard to generate significant periods of idleness in

the Hadoop clusters and renders the use of inactive power modes infeasible [26].

Recent research on scale-down in GFS- and HDFS-managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on. However, these solutions suffer from degraded write-performance as they rely on a write-offloading technique [31] to avoid server wakeups at the time of writes. Write-performance is an important consideration in Hadoop, and even more so in a production Hadoop cluster, as discussed in Section 3.1.

We took a different approach and proposed GreenHDFS, an energy-conserving, self-adaptive, hybrid, logically multi-zoned variant of HDFS, in our paper [23]. Instead of an energy-efficient placement of computations or the use of a small covering set for primary replicas as done in earlier research, GreenHDFS focuses on data-classification techniques to extract energy savings by doing energy-aware placement of data.

GreenHDFS trades off cost, performance and power by separating the cluster into logical zones of servers. Each cluster zone has a different temperature characteristic, where temperature is measured by the power consumption and the performance requirements of the zone. GreenHDFS relies on the inherent heterogeneity in the access patterns of the data stored in HDFS to differentiate the data and to come up with an energy-conserving data layout and data placement onto the zones. Since computations exhibit high data locality in the Hadoop framework, the computations then flow naturally to the data in the right temperature zones.

The contribution of this paper lies in showing that the energy-aware, data-differentiation-based data placement in GreenHDFS is able to meet all the effective scale-down mandates (i.e., it generates significant idleness, results in few power state transitions, and doesn't degrade write performance) despite the significant challenges a Hadoop cluster poses to scale-down. We do a detailed evaluation and sensitivity analysis of the policy thresholds used in GreenHDFS with a trace-driven simulator and real-world HDFS traces from a production Hadoop cluster at Yahoo!. While some aspects of GreenHDFS are sensitive to the policy thresholds, we found that energy-conservation is minimally sensitive to the policy thresholds in GreenHDFS.

The remainder of the paper is structured as follows. In Section 2, we list some of the key observations from our analysis of the production Hadoop cluster at Yahoo!. In Section 3, we provide background on HDFS and discuss scale-down mandates. In Section 4, we give an overview of the energy-management policies of GreenHDFS. In Section 5, we present an analysis of the Yahoo! cluster. In Section 6, we include experimental results demonstrating the effectiveness and robustness of our design and algorithms in a simulation environment. In Section 7, we discuss related work and conclude.

2 Key observations

We did a detailed analysis of the evolution and lifespan of the files in a production Yahoo! Hadoop cluster using one-month-long HDFS traces and Namespace metadata checkpoints. We analyzed each top-level directory in the production multi-tenant Yahoo! Hadoop cluster separately, as each top-level directory in the namespace exhibited different access patterns and lifespan distributions. The key observations from the analysis are:

∙ There is significant heterogeneity in the access patterns and the lifespan distributions across the various top-level directories in the production Hadoop cluster, and one-size-fits-all energy-management policies don't suffice across all directories.

∙ A significant amount of data, amounting to 60% of used capacity, is cold (i.e., lying dormant in the system without getting accessed) in the production Hadoop cluster. A majority of this cold data needs to exist for regulatory and historical trend analysis purposes.

∙ 95-98% of the files in the majority of the top-level directories had a very short hotness lifespan of less than 3 days. Only one directory had files with a longer hotness lifespan. Even in that directory, 80% of the files were hot for less than 8 days.

∙ 90% of the files, amounting to 80.1% of the total used capacity in the most storage-heavy top-level directory, were dormant, and hence cold, for more than 18 days. Dormancy periods were much shorter in the rest of the directories, where only 20% of the files were dormant beyond 1 day.

∙ The majority of the data in the production Hadoop cluster has a news-server-like access pattern, whereby most of the computations on the data happen soon after the data's creation.

3 Background

Map-reduce is a programming model designed to simplify data processing [13]. Google, Yahoo!, Facebook, Twitter, etc. use Map-reduce to process massive amounts of data on large-scale commodity clusters. Hadoop is an open-source, cluster-based Map-reduce implementation written in Java [1]. It is logically separated into two subsystems: a highly resilient and scalable Hadoop Distributed File System (HDFS), and a Map-reduce task execution framework. HDFS runs on clusters of commodity hardware and is an


object-based distributed file system. The namespace and the metadata (modification and access times, permissions, and quotas) are stored on a dedicated server called the NameNode and are decoupled from the actual data, which is stored on servers called the DataNodes. Each file in HDFS is replicated for resiliency and split into blocks of typically 128MB, and individual blocks and replicas are placed on the DataNodes for fine-grained load-balancing.
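As a rough illustration of the block and replica layout described above, the following sketch computes how many block replicas a single file contributes to the cluster. It is an assumption-laden example: the 128MB block size matches the text, while the 3x replication factor is a common HDFS default and is not stated in this paper.

import math

def block_replicas(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Return (num_blocks, num_block_replicas) for one HDFS file.

    block_size follows the 128MB figure quoted in the text; replication=3 is
    an assumed (typical) HDFS replication factor, not a number from the paper.
    """
    num_blocks = max(1, math.ceil(file_size_bytes / block_size))
    return num_blocks, num_blocks * replication

# Example: a 1 GB file occupies 8 blocks, i.e., 24 block replicas spread
# across DataNodes for load-balancing and resiliency.
print(block_replicas(1 * 1024**3))   # -> (8, 24)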
                                                                      this paper, GreenHDFS employs data chunking, placement
3.1 Importance of Write-Performance in a Production Hadoop Cluster

The Reduce phase of a Map-reduce task writes intermediate computation results back to the Hadoop cluster and relies on high write performance for the overall performance of the Map-reduce task. Furthermore, we observed that the majority of the data in a production Hadoop cluster has a news-server-like access pattern. The predominant number of computations happen on newly created data, thereby mandating good read and write performance for newly created data.

3.2 Scale-down Mandates

Scale-down, in which server components such as CPU, disks, and DRAM are transitioned to an inactive, low-power-consuming mode, is a popular energy-conservation technique. However, scale-down cannot be applied naively. Energy is expended and a transition time penalty is incurred when the components are transitioned back to an active power mode. For example, the transition time of components such as disks can be as high as 10 seconds. Hence, an effective scale-down technique mandates the following:

∙ Sufficient idleness, to ensure that the energy savings are higher than the energy spent in the transition. A worked break-even example follows this list.

∙ Few power state transitions, as some components (e.g., disks) have a limited number of start/stop cycles and too-frequent transitions may adversely impact the lifetime of the disks.

∙ No performance degradation. Steps need to be taken to amortize the performance penalty of power state transitions and to ensure that load concentration on the remaining active-state servers doesn't adversely impact the overall performance of the system.
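The first mandate can be made concrete with a simple break-even calculation: a server should stay asleep at least long enough for the idle-versus-sleep power difference to pay back the energy spent entering and leaving the sleep state. The sketch below uses the idle and sleep power draws quoted in Section 1; the transition energy and transition time are illustrative assumptions, not measurements from the paper.

def breakeven_idle_seconds(p_idle_w=132.46, p_sleep_w=13.16,
                           transition_energy_j=2000.0, transition_time_s=10.0):
    """Minimum idle period (seconds) for which sleeping saves energy.

    p_idle_w / p_sleep_w come from the typical-server numbers in Section 1.
    transition_energy_j (energy to suspend + resume) and transition_time_s
    are assumed values used only for illustration.
    """
    # Sleeping for t seconds saves (p_idle - p_sleep) * t joules but costs the
    # fixed transition energy; break-even is where the two are equal.
    saved_per_second = p_idle_w - p_sleep_w
    t_breakeven = transition_energy_j / saved_per_second
    # The idle period must also cover the transition itself.
    return max(t_breakeven, transition_time_s)

print(f"break-even idle period: {breakeven_idle_seconds():.1f} s")

With these assumed numbers the break-even point is well under a minute, which is why the multi-day idleness that GreenHDFS extracts in the Cold zone (Section 4) comfortably satisfies this mandate.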
                                                                          We argue that zoning in GreenHDFS will not affect the
4 GreenHDFS Design

GreenHDFS is a variant of the Hadoop Distributed File System (HDFS) that logically organizes the servers in the datacenter into multiple, dynamically provisioned Hot and Cold zones. Each zone has a distinct performance, cost, and power characteristic, and each zone is managed by the power and data placement policies most conducive to the class of data residing in that zone. Differentiating the zones in terms of power is crucial to attaining our energy-conservation goal.

The Hot zone consists of the files that are currently being accessed and the newly created files. This zone has strict SLA (Service Level Agreement) requirements and hence performance is of the greatest importance. We trade off energy savings in the interest of very high performance in this zone. In this paper, GreenHDFS employs data chunking, placement and replication policies similar to the policies in baseline HDFS or GFS.

The Cold zone consists of files with low to rare accesses. Files are moved by the File Migration Policy from the Hot zone to the Cold zone as their temperature decreases beyond a certain threshold. Performance and SLA requirements are not as critical for this zone, and GreenHDFS employs aggressive energy-management schemes and policies in this zone to transition servers to a low-power inactive state. Hence, GreenHDFS trades off performance for high energy savings in the Cold zone.

For optimal energy savings, it is important to increase the idle times of the servers and to limit the wakeups of servers that have transitioned to the power-saving mode. Keeping this rationale in mind, and recognizing the low performance needs and the infrequency of data accesses in the Cold zone, this zone does not chunk the data. This ensures that upon a future access only the server containing the data will be woken up.

By default, the servers in the Cold zone are in a sleeping mode. A server is woken up when either new data needs to be placed on it or data already residing on the server is accessed. GreenHDFS tries to avoid powering on a server in the Cold zone and maximizes the use of the existing powered-on servers in its server allocation decisions, in the interest of maximizing the energy savings. One server is woken up and filled completely to its capacity before the next server, from an ordered list of servers in the Cold zone, is chosen to be transitioned to an active power state.

The goal of GreenHDFS is to maximize the allocation of servers to the Hot zone, to minimize the performance impact of zoning, and to minimize the number of servers allocated to the Cold zone. We introduced a hybrid, storage-heavy cluster model in [23] whereby servers in the Cold zone are storage-heavy and have 12 1TB disks per server.

We argue that zoning in GreenHDFS will not affect the Hot zone's performance adversely and that the computational workload can be consolidated on the servers in the Hot zone without pushing CPU utilization above the provisioning guidelines. A study of 5000 Google compute servers showed that most of the time is spent within the 10%-50% CPU utilization range [4]. Hence, significant opportunities exist for workload consolidation. And the compute capacity of the Cold zone can always be harnessed under peak load


scenarios.

4.1 Energy-management Policies

Files are moved from the Hot zone to the Cold zone as their temperature changes over time, as shown in Figure 1. In this paper, we use the dormancy of a file, defined as the elapsed time since the last access to the file, as the measure of the temperature of the file. The higher the dormancy, the lower the temperature of the file and hence the higher its coldness. Conversely, the lower the dormancy, the higher the heat of the file. GreenHDFS uses the existing mechanism in baseline HDFS to record and update the last access time of a file upon every file read.

4.1.1 File Migration Policy

The File Migration Policy runs in the Hot zone, monitors the dormancy of the files as shown in Algorithm 1, and moves dormant, i.e., cold, files to the Cold zone. The advantages of this policy are two-fold: 1) it leads to higher space-efficiency, as space is freed up in the Hot zone for files which have higher SLA requirements by moving rarely accessed files out of the servers in this zone, and 2) it allows significant energy-conservation. Data-locality is an important consideration in the Map-reduce framework and computations are co-located with data. Thus, computations naturally happen on the data residing in the Hot zone. This results in significant idleness in all the components of the servers in the Cold zone (i.e., CPU, DRAM and disks), allowing effective scale-down of these servers.

Figure 1. State diagram of a file's zone allocation based on the migration policies. (Hot zone → Cold zone when Coldness > Threshold_FMP; Cold zone → Hot zone when Hotness > Threshold_FRP.)

Algorithm 1 File Migration Policy: classifies cold data in the Hot zones and migrates it to the Cold zone
  {For every file i in the Hot zone}
  for i = 1 to n do
    dormancy_i ⇐ current_time − last_access_time_i
    if dormancy_i ≥ Threshold_FMP then
      {Cold Zone} ⇐ {Cold Zone} ∪ {f_i}
      {Hot Zone} ⇐ {Hot Zone} ∖ {f_i}   // file system metadata structures are changed to the Cold zone
    end if
  end for
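The following is a minimal Python sketch of Algorithm 1, included for concreteness. The in-memory structures (hot_zone and cold_zone dictionaries keyed by file name) and the threshold value are assumptions made for illustration; in GreenHDFS the move also involves updating the file system metadata and relocating the file's data to Cold zone servers, as noted in Algorithm 1's comment.

import time

THRESHOLD_FMP_DAYS = 15          # illustrative policy threshold, not a value from the paper
SECONDS_PER_DAY = 24 * 3600

def file_migration_policy(hot_zone, cold_zone, now=None):
    """One pass of the File Migration Policy (Algorithm 1).

    hot_zone / cold_zone: dict mapping file name -> last_access_time (epoch seconds).
    Files whose dormancy exceeds Threshold_FMP are moved to the Cold zone.
    """
    now = time.time() if now is None else now
    for name, last_access in list(hot_zone.items()):
        dormancy_days = (now - last_access) / SECONDS_PER_DAY
        if dormancy_days >= THRESHOLD_FMP_DAYS:
            # File is classified as cold: record it in the Cold zone's metadata
            # and remove it from the Hot zone.
            cold_zone[name] = hot_zone.pop(name)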

4.1.2 Server Power Conserver Policy

The Server Power Conserver Policy runs in the Cold zone and determines the servers which can be transitioned into a power-saving standby/sleep mode, as shown in Algorithm 2. The current trend in internet-scale data warehouses and Hadoop clusters is to use commodity servers with 4-6 directly attached disks instead of expensive RAID controllers. In such systems, disks constitute just 10% of the entire power usage, as illustrated in a study performed at Google [21], while CPU and DRAM constitute 63% of the total power usage. Hence, power management of any one component is not sufficient. We leverage energy cost savings at the entire server granularity (CPU, disks, and DRAM) in the Cold zone.

GreenHDFS uses hardware techniques similar to [28] to transition the processors, disks and DRAM into a low-power state. In the Cold zone, GreenHDFS uses the disk Sleep mode¹; the CPU's ACPI S3 sleep state, as it consumes minimal power and requires only 30us to transition from sleep back to active execution; and DRAM's self-refresh operating mode, in which transitions into and out of self-refresh can be completed in less than a microsecond.

The servers are transitioned back to an active power mode under three conditions: 1) data residing on the server is accessed, 2) additional data needs to be placed on the server, or 3) the block scanner needs to run on the server to ensure the integrity of the data residing on the Cold zone servers. GreenHDFS relies on Wake-on-LAN in the NICs to send a magic packet to transition a server back to an active power state.

Figure 2. Triggering events leading to power state transitions in the Cold zone. (Active → Inactive when Coldness > Threshold_SPC, per the Server Power Conserver Policy; Inactive → Active on wake-up events: file access, bit-rot integrity checker, file placement, file deletion.)

Algorithm 2 Server Power Conserver Policy
  {For every server i in the Cold zone}
  for i = 1 to n do
    coldness_i ⇐ current_time − max_(0≤j≤m) last_access_time_j   // most recent access to any file j on server i
    if coldness_i ≥ Threshold_SPC then
      S_i ⇐ INACTIVE_STATE
    end if
  end for

¹ In the Sleep mode the drive buffer is disabled, the heads are parked and the spindle is at rest.
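A minimal sketch of Algorithm 2 together with the wake-up events of Figure 2 is shown below. The server abstraction (a list of per-file last-access times plus an active flag), the threshold value, and the wake-up/power-down hooks are assumptions made for illustration; the real policy acts through ACPI S3, disk sleep, DRAM self-refresh and Wake-on-LAN as described above.

import time

THRESHOLD_SPC_DAYS = 5           # illustrative policy threshold, not a value from the paper
SECONDS_PER_DAY = 24 * 3600

class ColdZoneServer:
    def __init__(self, name, last_access_times):
        self.name = name
        self.last_access_times = last_access_times  # per-file last access, epoch seconds
        self.active = True

def server_power_conserver(servers, now=None):
    """One pass of the Server Power Conserver Policy (Algorithm 2)."""
    now = time.time() if now is None else now
    for s in servers:
        if not s.active or not s.last_access_times:
            continue
        # Coldness = time since the most recently accessed file on the server.
        coldness_days = (now - max(s.last_access_times)) / SECONDS_PER_DAY
        if coldness_days >= THRESHOLD_SPC_DAYS:
            s.active = False             # stand-in for S3 + disk sleep + DRAM self-refresh

def on_wakeup_event(server):
    """Wake-up events (Figure 2): file access, integrity check, file placement/deletion."""
    if not server.active:
        server.active = True             # stand-in for a Wake-on-LAN magic packet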


4.1.3 File Reversal Policy

The File Reversal Policy runs in the Cold zone and ensures that the QoS, bandwidth and response time of files that become popular again after a period of dormancy are not impacted. If the number of accesses to a file residing in the Cold zone becomes higher than the threshold Threshold_FRP, the file is moved back to the Hot zone, as shown in Algorithm 3. The file is chunked and placed onto the servers in the Hot zone in congruence with the policies of the Hot zone.

Algorithm 3 File Reversal Policy: monitors the temperature of the cold files in the Cold zone and moves files back to the Hot zone if their temperature changes
  {For every file i in the Cold zone}
  for i = 1 to n do
    if num_accesses_i ≥ Threshold_FRP then
      {Hot Zone} ⇐ {Hot Zone} ∪ {f_i}
      {Cold Zone} ⇐ {Cold Zone} ∖ {f_i}   // file system metadata is changed to the Hot zone
    end if
  end for

4.1.4 Policy Thresholds Discussion

A good data migration scheme should result in maximal energy savings, minimal data oscillations between the GreenHDFS zones and minimal performance degradation. Minimizing the accesses to the Cold zone files results in maximal energy savings and minimal performance impact. For this, the policy thresholds should be chosen in a way that minimizes the number of accesses to the files residing in the Cold zone while maximizing the movement of dormant data to the Cold zone. Results from our detailed sensitivity analysis of the thresholds used in GreenHDFS are covered in Section 6.3.5.

Threshold_FMP: A low (i.e., aggressive) value of Threshold_FMP results in an ultra-greedy selection of files as potential candidates for migration to the Cold zone. While there are several advantages of an aggressive Threshold_FMP, such as higher space-savings in the Cold zone, there are disadvantages as well. If files have intermittent periods of dormancy, the files may incorrectly get labeled as cold and get moved to the Cold zone. There is a high probability that such files will get accessed in the near future. Such accesses may suffer performance degradation, as they may be subject to a power transition penalty, and may trigger data oscillations because of file reversals back to the Hot zone.

A higher value of Threshold_FMP results in a higher accuracy in determining the really cold files. Hence, the number of reversals, server wakeups and the associated performance degradation decreases as the threshold is increased. On the other hand, a higher value of Threshold_FMP signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be an overkill for files with a very short Lifespan_CLR (hotness lifespan), as such files will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

Threshold_SPC: A high Threshold_SPC increases the number of days the servers in the Cold zone remain in an active power state and hence lowers the energy savings. On the other hand, it results in a reduction in the power state transitions, which improves the performance of accesses to the Cold zone. Thus, a trade-off needs to be made between energy-conservation and data access performance in the selection of the value of Threshold_SPC.

Threshold_FRP: A relatively high value of Threshold_FRP ensures that files are accurately classified as hot-again files before they are moved back to the Hot zone from the Cold zone. This reduces data oscillations in the system and reduces unnecessary file reversals.
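To make the interplay of Threshold_FMP and Threshold_FRP concrete, the sketch below replays a single file's access history against candidate thresholds and counts migrations, reversals and Cold-zone accesses (the events that cost wakeups and performance). This is an illustrative toy model, not the simulator used in Section 6; the access trace and the threshold values are invented for the example.

def replay_file(access_days, threshold_fmp, threshold_frp, horizon_days=60):
    """Replay one file's accesses (days since creation) day by day.

    Returns (migrations, reversals, cold_zone_accesses) under the File
    Migration and File Reversal policies. Toy model for illustration only.
    """
    zone, last_access, cold_hits = "hot", 0, 0
    migrations = reversals = cold_accesses = 0
    accesses = set(access_days)
    for day in range(1, horizon_days + 1):
        if day in accesses:
            if zone == "cold":
                cold_accesses += 1          # this access pays the wakeup/transition penalty
                cold_hits += 1
                if cold_hits >= threshold_frp:
                    zone, cold_hits = "hot", 0
                    reversals += 1
            last_access = day
        elif zone == "hot" and day - last_access >= threshold_fmp:
            zone, cold_hits = "cold", 0
            migrations += 1
    return migrations, reversals, cold_accesses

# A file with an intermittent gap in its accesses: an aggressive 2-day
# Threshold_FMP triggers an extra migration, a penalized Cold-zone access and
# a reversal, while a 10-day threshold classifies the same file correctly.
trace = [1, 2, 8, 9]
print(replay_file(trace, threshold_fmp=2, threshold_frp=1))    # -> (2, 1, 1)
print(replay_file(trace, threshold_fmp=10, threshold_frp=1))   # -> (1, 0, 0)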
5 Analysis of a production Hadoop cluster at Yahoo!

We analyzed one month of HDFS logs² and namespace checkpoints in a multi-tenant cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and its data set size was 6 Petabytes. There were 425 million entries in the HDFS logs and each namespace checkpoint contained 30-40 million files. The cluster namespace was divided into six main top-level directories, whereby each directory addresses different workloads and access patterns. We considered only 4 main directories and refer to them as d, p, u, and m in our analysis instead of referring to them by their real names. The total number of unique files seen in the HDFS logs over the one-month duration was 70 million (d-1.8 million, p-30 million, u-23 million, and m-2 million).

The logs and the metadata checkpoints were huge in size and we used a large-scale research Hadoop cluster at Yahoo! extensively for our analysis. We wrote the analysis scripts in Pig. We considered several cases in our analysis, as shown below:

∙ Files created before the analysis period and which were not read or deleted subsequently at all. We classify these files as long-living cold files.

∙ Files created before the analysis period and which were read during the analysis period.

² The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image is called a checkpoint. HDFS has the ability to log all file system access requests, which is required for auditing purposes in enterprises. The audit logging is implemented using log4j and, once enabled, logs every HDFS event in the NameNode's log [37]. We used the above-mentioned checkpoints and HDFS logs for our analysis.

∙ Files created before the analysis period and which were both read and deleted during the analysis period.

∙ Files created during the analysis period and which were neither read during the analysis period nor deleted.

∙ Files created during the analysis period and which were not read during the analysis period, but were deleted.

∙ Files created during the analysis period and which were read and deleted during the analysis period.

To accurately account for the file lifespan and lifetime, we handled the following cases: (a) Filename reuse: we appended a timestamp to each file create to accurately track the audit log entries following the file create entry in the audit log. (b) File renames: we used a unique id per file to accurately track its lifetime across create, rename and delete. (c) Renames and deletes at a higher level in the path hierarchy had to be translated to leaf-level renames and deletes for our analysis. (d) HDFS logs do not have file size information; hence, we did a join of the dataset found in the HDFS logs with the namespace checkpoint to get the file size information.

5.1 File Lifespan Analysis of the Yahoo! Hadoop Cluster

A file goes through several stages in its lifetime: 1) file creation, 2) a hot period during which the file is frequently accessed, 3) a dormant period during which the file is not accessed, and 4) deletion. We introduced and considered various lifespan metrics in our analysis to characterize a file's evolution. A study of the various lifespan distributions helps in deciding the energy-management policy thresholds that need to be in place in GreenHDFS. A sketch that derives these metrics from per-file timestamps follows the list.

∙ The FileLifeSpan_CFR metric is defined as the file lifespan between the file creation and the first read access. This metric is used to find the clustering of the read accesses around the file creation.

∙ The FileLifeSpan_CLR metric is defined as the file lifespan between creation and the last read access. This metric is used to determine the hotness profile of the files.

∙ The FileLifeSpan_LRD metric is defined as the file lifespan between the last read access and file deletion. This metric helps in determining the coldness profile of the files, as this is the period for which files are dormant in the system.

∙ The FileLifeSpan_FLR metric is defined as the file lifespan between the first read access and the last read access. This metric helps in determining another dimension of the hotness profile of the files.

∙ FileLifetime. This metric helps in determining the lifetime of the file between its creation and its deletion.
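The sketch below shows how these metrics can be derived once each file's create, read and delete timestamps have been reconstructed from the audit log (after the filename-reuse and rename handling described above). The per-file event record used here is an assumed intermediate representation, not the actual Pig schema used in the study.

from dataclasses import dataclass
from typing import List, Optional

DAY = 24 * 3600.0

@dataclass
class FileEvents:
    create: float                  # epoch seconds of the create entry
    reads: List[float]             # epoch seconds of every read entry
    delete: Optional[float] = None # epoch seconds of the delete entry, if any

def days(a, b):
    """Span in days between two epoch timestamps, or None if either is missing."""
    return None if a is None or b is None else (b - a) / DAY

def lifespan_metrics(ev: FileEvents):
    """Compute the Section 5.1 lifespan metrics (in days) for one file."""
    first_read = min(ev.reads) if ev.reads else None
    last_read = max(ev.reads) if ev.reads else None
    return {
        "FileLifeSpan_CFR": days(ev.create, first_read),
        "FileLifeSpan_CLR": days(ev.create, last_read),
        "FileLifeSpan_FLR": days(first_read, last_read),
        "FileLifeSpan_LRD": days(last_read, ev.delete),
        "FileLifetime": days(ev.create, ev.delete),
    }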
5.1.1 FileLifeSpan_CFR

The FileLifeSpan_CFR distribution throws light on the clustering of the file reads around the file creation. As shown in Figure 3, 99% of the files have a FileLifeSpan_CFR of less than 2 days.

5.1.2 FileLifeSpan_CLR

Figure 4 shows the distribution of FileLifeSpan_CLR in the cluster. In directory d, 80% of the files are hot for less than 8 days and 90% of the files, amounting to 94.62% of storage, are hot for less than 24 days. The FileLifeSpan_CLR of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days, and the FileLifeSpan_CLR of 100% of the files in directory m and of 98% of the files in directory a is as small as 2 days. In directory u, 98% of the files have a FileLifeSpan_CLR of less than 1 day. Thus, the majority of the files in the cluster have a short hotness lifespan.

5.1.3 FileLifeSpan_LRD

FileLifeSpan_LRD indicates the time for which a file stays in a dormant state in the system. The longer the dormancy period, the higher the coldness of the file and hence the higher the suitability of the file for migration to the Cold zone. Figure 5 shows the distribution of FileLifeSpan_LRD in the cluster. In directory d, 90% of the files are dormant beyond 1 day and 80% of the files, amounting to 80.1% of storage, exist in a dormant state past 20 days. In directory p, only 25% of the files are dormant beyond 1 day and only 20% of the files remain dormant in the system beyond 10 days. In directory m, only 0.02% of the files are dormant for more than 1 day, and in directory u, 20% of the files are dormant beyond 10 days. The FileLifeSpan_LRD needs to be considered to find the true migration suitability of a file. For example, given the extremely short dormancy period of the files in directory m, there is no point in exercising the File Migration Policy on directory m. For directories p and u, a Threshold_FMP of less than 5 days will result in unnecessary movement of files to the Cold zone, as these files are due for deletion in any case. On the other hand, given the short FileLifeSpan_CLR in these directories, a high value of Threshold_FMP won't do justice to space-efficiency in the Cold zone, as discussed in Section 4.1.4.

5.1.4 File Lifetime Analysis

Knowledge of the FileLifetime further assists in the selection of migration candidates and needs to be accounted for in addition to the FileLifeSpan_LRD and FileLifeSpan_CLR metrics covered earlier.


Figure 3. FileLifeSpan_CFR distribution (% of total file count and % of total used capacity vs. FileLifeSpan_CFR in days, for directories d, p, m and u). 99% of the files in directory d and 98% of the files in directory p were accessed for the first time less than 2 days after creation.

Figure 4. FileLifeSpan_CLR distribution in the four main top-level directories in the Yahoo! production cluster (% of total file count and % of total used capacity vs. FileLifeSpan_CLR in days). FileLifeSpan_CLR characterizes the lifespan for which files are hot. In directory d, 80% of the files were hot for less than 8 days and 90% of the files, amounting to 94.62% of storage, were hot for less than 24 days. The hotness lifespan of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days; the hotness lifespan of 100% of the files in directory m, and of 98% of the files in directory u, is less than 1 day.

Figure 5. FileLifeSpan_LRD distribution of the top-level directories in the Yahoo! production cluster (% of total file count and % of total used capacity vs. FileLifeSpan_LRD in days). FileLifeSpan_LRD characterizes the coldness in the cluster and is indicative of the time a file stays in a dormant state in the system. 80% of the files, amounting to 80.1% of storage, in directory d have a dormancy period of more than 20 days. 20% of the files, amounting to 28.6% of storage, in directory p are dormant beyond 10 days. 0.02% of the files in directory m are dormant beyond 1 day.

Figure 6. FileLifetime distribution (% of total file count and % of total used capacity vs. FileLifetime in days). 67% of the files in directory p are deleted within one day of their creation; only 23% of the files live beyond 20 days. On the other hand, in directory d, 80% of the files have a FileLifetime of more than 30 days.


Figure 7. File size and file count percentage of long-living cold files (% of total file count and % of total used storage, for directories d, p and u). The cold files are defined as the files that were created prior to the start of the one-month observation period and were not accessed during the period of observation at all. In the case of directory d, 13% of the total file count in the cluster, amounting to 33% of the total used capacity, is cold. In the case of directory p, 37% of the total file count in the cluster, amounting to 16% of the total used capacity, is cold. Overall, 63.16% of the total file count and 56.23% of the total used capacity is cold in the system.

Figure 8. Dormant-period analysis of the file count and used-storage distributions and histograms in one namespace checkpoint (% of total file count, file count in millions, % of total used storage, and used storage capacity in TB vs. dormancy greater than a given number of days, for directories d, p and u). The dormancy of a file is defined as the elapsed time between the last access time recorded in the checkpoint and the day of observation. 34% of the files in directory p and 58% of the files in directory d were not accessed in the last 40 days.




FileLifeSpan_CLR metrics covered earlier. As shown in Figure 6, only 23% of the files in directory p live beyond 20 days. On the other hand, 80% of the files in directory d live for more than 30 days, and 80% of the files have a hot lifespan of less than 8 days. Thus, directory d is a very good candidate for invoking the File Migration Policy.

5.2     Coldness Characterization of the Files

   In this section, we show the file count and the storage capacity used by the long-living cold files. The long-living cold files are defined as the files that were created prior to the start of the observation period and were not accessed at all during the one-month period of observation. As shown in Figure 13, 63.16% of the files, amounting to 56.23% of the total used capacity, are cold in the system. Such long-living cold files present a significant opportunity to conserve energy in GreenHDFS.

5.3     Dormancy Characterization of the Files

   The HDFS trace analysis gives information only about the files that were accessed in the one-month duration. To get a better picture, we analyzed the namespace checkpoints for historical data on the file temperatures and periods of dormancy. The namespace checkpoints contain the last access time of every file, and we used this information to calculate the dormancy of the files. The Dormancy metric is defined as the elapsed time between the last noted access time of a file and the day of observation. Figure 8 contains the frequency histograms and distributions of the dormancy. 34% of the files, amounting to 37% of the storage in directory p present in the namespace checkpoint, were not accessed in the last 40 days. 58% of the files, amounting to 53% of the storage in directory d, were not accessed in the last 40 days. The extent of dormancy exhibited in the system again shows the viability of the GreenHDFS solution. (The number of files present in the namespace checkpoints was less than half the number of files seen in the one-month trace.)
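   To make the Dormancy metric concrete, the following minimal sketch (our own illustration rather than the analysis code used for this study; class, method, and date values are hypothetical) computes a file's dormancy from the last access time recorded in a namespace checkpoint and applies the long-living-cold definition above:

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical illustration of the Dormancy metric: elapsed time between the last
    // access time recorded in the namespace checkpoint and the day of observation.
    public final class DormancyExample {
        static long dormancyDays(Instant lastAccess, Instant observationDay) {
            return Duration.between(lastAccess, observationDay).toDays();
        }

        // A file is treated as long-living cold if it was created before the observation
        // window and was not accessed during the window (its last access predates it).
        static boolean isLongLivingCold(Instant created, Instant lastAccess, Instant windowStart) {
            return created.isBefore(windowStart) && lastAccess.isBefore(windowStart);
        }

        public static void main(String[] args) {
            Instant created     = Instant.parse("2010-01-05T00:00:00Z");
            Instant lastAccess  = Instant.parse("2010-05-20T00:00:00Z");
            Instant windowStart = Instant.parse("2010-06-01T00:00:00Z");
            Instant observed    = Instant.parse("2010-06-30T00:00:00Z");
            System.out.println("Dormancy (days): " + dormancyDays(lastAccess, observed)); // 41
            System.out.println("Long-living cold: " + isLongLivingCold(created, lastAccess, windowStart)); // true
        }
    }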
6   Evaluation

   In this section, we first present our experimental platform and methodology, followed by a description of the workloads used, and then we give our experimental results. Our goal is to answer seven high-level sets of questions:

   ∙ How much energy is GreenHDFS able to conserve compared to a baseline HDFS with no energy management?

   ∙ What is the penalty of the energy management on average response time?

   ∙ What is the sensitivity of the various policy thresholds used in GreenHDFS on the energy-savings results?

   ∙ How many power state transitions does a server in the Cold Zone go through on average?

   ∙ How many migrations happen daily?

   ∙ How many power state transitions occur during the simulation run?

   ∙ Finally, what is the number of accesses that happen to the files in the Cold Zone, how many days are its servers powered on, and how many migrations and reversals are observed in the system?

The following evaluation sections answer these questions, beginning with a description of our methodology and the trace workloads we use as inputs to the experiments.

6.1     Evaluation methodology

   We evaluated GreenHDFS using a trace-driven simulator. The simulator was driven by real-world HDFS traces generated by a production Hadoop cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and the data set size was 6 Petabytes.
   We focused our analysis on directory d, as this directory constituted 60% of the used storage capacity in the cluster (4PB out of the 6PB total used capacity). Focusing the analysis on directory d cut down our simulation time significantly and reduced our analysis time, an important consideration given the massive scale of the traces. We used 60% of the total cluster nodes in our analysis to make the results realistic for the analysis of directory d alone. The total number of unique files seen in the HDFS traces for directory d in the one-month duration was 0.9 million. In our experiments, we compare GreenHDFS to the baseline case (HDFS without energy management). The baseline results give us the upper bound for energy consumption and the lower bound for average response time.
   Simulation Platform: We used a trace-driven simulator for GreenHDFS to perform our experiments. We used models for the power levels, power state transition times, and access times of the disk, processor, and DRAM in the simulator. The GreenHDFS simulator was implemented in Java and MySQL distribution 5.1.41 and executed using Java 2 SDK, version 1.6.0-17. Both performance and energy statistics were calculated based on the information extracted from the datasheets of the Seagate Barracuda ES.2 1TB SATA hard drive and the quad-core Intel Xeon X5400 processor. Table 1 lists the various power, latency, and transition times used in the Simulator. The simulator was run on 10 nodes in a development cluster at Yahoo!.
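   The simulator's power model itself is not reproduced here; as a simplified sketch (our own illustration under assumed names, not the GreenHDFS simulator code), per-server energy can be accumulated from the time spent in each power state, using per-state power draws such as those in Table 1:

    import java.util.EnumMap;
    import java.util.Map;

    // Simplified, hypothetical energy accounting:
    // energy (kWh) = sum over states of power(state) in Watts * hours in that state / 1000.
    public final class EnergyAccounting {
        enum PowerState { ACTIVE, IDLE, SLEEP }

        static double energyKWh(Map<PowerState, Double> powerWatts, Map<PowerState, Double> hoursInState) {
            double wattHours = 0.0;
            for (PowerState s : PowerState.values()) {
                wattHours += powerWatts.getOrDefault(s, 0.0) * hoursInState.getOrDefault(s, 0.0);
            }
            return wattHours / 1000.0;
        }

        public static void main(String[] args) {
            // Hot-server draws from Table 1: 445.34 W active, 132.46 W idle, 13.16 W sleep.
            Map<PowerState, Double> power = new EnumMap<>(PowerState.class);
            power.put(PowerState.ACTIVE, 445.34);
            power.put(PowerState.IDLE, 132.46);
            power.put(PowerState.SLEEP, 13.16);

            // Illustrative 24-hour breakdown for one server.
            Map<PowerState, Double> hours = new EnumMap<>(PowerState.class);
            hours.put(PowerState.ACTIVE, 6.0);
            hours.put(PowerState.IDLE, 10.0);
            hours.put(PowerState.SLEEP, 8.0);

            double kwh = energyKWh(power, hours);
            System.out.printf("Energy: %.2f kWh, cost at $0.063/kWh: $%.3f%n", kwh, kwh * 0.063);
        }
    }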


   Table 1. Power and power-on penalties used in the Simulator

   Component                                      Active Power (W)   Idle Power (W)   Sleep Power (W)   Power-up time
   CPU (Quad core, Intel Xeon X5400 [22])         80-150             12.0-20.0        3.4               30 us
   DRAM DIMM [29]                                 3.5-5              1.8-2.5          0.2               1 us
   NIC [35]                                       0.7                0.3              0.3               NA
   SATA HDD (Seagate Barracuda ES.2 1TB [16])     11.16              9.29             0.99              10 sec
   PSU [2]                                        50-60              25-35            0.5               300 us
   Hot server (2 CPU, 8 DRAM DIMM, 4 1TB HDD)     445.34             132.46           13.16
   Cold server (2 CPU, 8 DRAM DIMM, 12 1TB HDD)   534.62             206.78           21.08

6.2     Simulator Parameters

   The default simulation parameters used in this paper are shown in Table 2.
                Table 2. Simulator Parameters

           Parameter            Value
           NumServer            1560
           NumZones             2
           Interval_FMP         1 Day
           Threshold_FMP        5, 10, 15, 20 Days
           Interval_SPC         1 Day
           Threshold_SPC        2, 4, 6, 8 Days
           Interval_FRP         1 Day
           Threshold_FRP        1, 5, 10 Accesses
           NumServersPerZone    Hot 1170, Cold 390
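   The policies themselves are described in Section 4; purely to illustrate how the knobs in Table 2 could interact, the sketch below (a hypothetical sketch with assumed names, not the GreenHDFS implementation) applies the three threshold checks that the daily policy intervals would drive: a file whose dormancy exceeds Threshold_FMP becomes a migration candidate, a Cold Zone server that has seen no accesses for Threshold_SPC days becomes a power-down candidate, and a migrated file that accumulates Threshold_FRP accesses becomes a reversal candidate.

    // Hypothetical sketch of the threshold checks driven by the parameters in Table 2.
    // It is not the GreenHDFS source; it only illustrates how the knobs interact.
    public final class PolicyThresholds {
        static final int THRESHOLD_FMP_DAYS = 10;  // File Migration Policy (5, 10, 15, 20 in Table 2)
        static final int THRESHOLD_SPC_DAYS = 4;   // Server Power Conserver Policy (2, 4, 6, 8)
        static final int THRESHOLD_FRP_HITS = 1;   // File Reversal Policy (1, 5, 10)

        // Run daily (Interval_FMP = 1 day): migrate files that have been dormant too long.
        static boolean shouldMigrateToColdZone(long dormancyDays) {
            return dormancyDays >= THRESHOLD_FMP_DAYS;
        }

        // Run daily (Interval_SPC = 1 day): power down Cold Zone servers with no recent accesses.
        static boolean shouldPowerDownServer(long daysSinceLastAccess) {
            return daysSinceLastAccess >= THRESHOLD_SPC_DAYS;
        }

        // Run daily (Interval_FRP = 1 day): move a cold file back to the Hot Zone if it becomes popular again.
        static boolean shouldReverseToHotZone(int accessesSinceMigration) {
            return accessesSinceMigration >= THRESHOLD_FRP_HITS;
        }

        public static void main(String[] args) {
            System.out.println(shouldMigrateToColdZone(12));  // true: 12 days dormant, threshold 10
            System.out.println(shouldPowerDownServer(3));     // false: last access 3 days ago, threshold 4
            System.out.println(shouldReverseToHotZone(1));    // true with Threshold_FRP = 1
        }
    }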
6.3     Simulation results

6.3.1   Energy-Conservation

In this section, we show the energy savings made possible by GreenHDFS, compared to the baseline, in one month simply by doing power management in one of the main tenant directories of the Hadoop cluster. The cost of electricity was assumed to be $0.063/kWh. Figure 9 (Left) shows a 24% reduction in the energy consumption of a 1560-server datacenter with 80% capacity utilization. Extrapolating, $2.1 million can be saved in energy costs if the GreenHDFS technique is applied to all the Hadoop clusters at Yahoo! (upwards of 38000 servers). Energy savings from powered-off servers will be further compounded in the cooling system of a real datacenter: for every Watt of power consumed by the compute infrastructure, a modern data center expends another one-half to one Watt to power the cooling infrastructure [32]. These energy-saving results underscore the importance of supporting access-time recording in Hadoop compute clusters.
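   As a back-of-the-envelope illustration (our own arithmetic, not the simulator's accounting), the per-server saving from sleeping instead of idling follows directly from the Hot-server power draws in Table 1 and the assumed electricity price:

    // Back-of-the-envelope saving for one Hot-Zone server that sleeps instead of idling.
    // Idle draw 132.46 W vs. sleep draw 13.16 W (Table 1), $0.063/kWh, 720 hours/month.
    public final class EnergySavingEstimate {
        public static void main(String[] args) {
            double idleWatts = 132.46, sleepWatts = 13.16, dollarsPerKWh = 0.063, hoursPerMonth = 720.0;
            double savedKWh = (idleWatts - sleepWatts) * hoursPerMonth / 1000.0; // ~85.9 kWh
            double savedDollars = savedKWh * dollarsPerKWh;                      // ~$5.41 per server-month
            System.out.printf("Saved per server-month: %.1f kWh, $%.2f%n", savedKWh, savedDollars);
        }
    }

The cluster-wide figures reported above depend, in addition, on how many servers actually sleep and for how long, which is what the trace-driven simulation measures.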
6.3.2   Storage-Efficiency

In this section, we show the increased storage efficiency of the Hot Zones compared to the baseline. Figure 10 shows that in the baseline case, the average capacity utilization of the 1560 servers is higher than that of GreenHDFS, which has just 1170 of the 1560 servers provisioned in the Hot Zone. GreenHDFS has a much higher amount of free space available in the Hot Zone, which greatly increases the potential for better data placement techniques in the Hot Zone. The more aggressive the policy threshold, the more space is available in the Hot Zone for truly hot data, as more data is migrated out to the Cold Zone.

6.3.3   File Migrations and Reversals

Figure 10 (right-most) shows the number and total size of the files which were migrated to the Cold Zone daily with a Threshold_FMP value of 10 days. Every day, on average, 6.38TB worth of data and 28.9 thousand files are migrated to the Cold Zone. Since we have assumed storage-heavy servers in the Cold Zone, where each server has 12 1TB disks, and assuming 80MB/sec of disk bandwidth, 6.38TB of data can be absorbed in less than 2 hours by one server. The migration policy can be run during off-peak hours to minimize any performance impact.
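   The absorption-time claim can be checked with simple arithmetic (our own, under the stated assumptions of 12 disks per Cold Zone server, 80MB/sec per disk, and fully parallel writes):

    // Rough absorption time for 6.38 TB of migrated data on one storage-heavy Cold Zone
    // server with 12 disks, assuming 80 MB/s per disk and parallel writes across disks.
    public final class MigrationAbsorption {
        public static void main(String[] args) {
            double dataMB = 6.38e6;        // 6.38 TB expressed in MB (decimal units)
            double perDiskMBps = 80.0;
            int disks = 12;
            double seconds = dataMB / (perDiskMBps * disks);
            System.out.printf("Absorption time: %.2f hours%n", seconds / 3600.0); // ~1.85 hours
        }
    }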


6.3.4   Impact of Power Management on Response Time

We examined the impact of server power management on the response time of a file which was moved to the Cold Zone following a period of dormancy and was accessed again for some reason. The files residing in the Cold Zone may suffer performance degradation in two ways: 1) if the file resides on a server that is not currently powered ON, the access will incur a server wakeup time penalty, and 2) transfer time degradation because there is no striping in the lower zones. Such a file is moved back to the Hot Zone and chunked again by the File Reversal Policy. Figure 11 shows the impact on the average response time. 97.8% of the total read requests are not impacted by the power management; an impact is seen only by 2.1% of the reads. With a less aggressive Threshold_FMP (15, 20 days), the impact on the response time will reduce much further.

6.3.5   Sensitivity Analysis

We tried different values of the thresholds for the File Migration Policy and the Server Power Conserver Policy to understand the sensitivity of these thresholds on storage-efficiency, energy-conservation, and the number of power state transitions. A discussion of the impact of the various thresholds appears in Section 4.1.4.
   Figure 9. (Left) Energy Savings with GreenHDFS and (Middle) Days Servers in the Cold Zone were ON compared to the Baseline. Energy cost savings are minimally sensitive to the policy threshold values. GreenHDFS achieves 24% savings in the energy costs in one month simply by doing power management in one of the main tenant directories of the Hadoop cluster. (Right) Number of migrations and reversals in GreenHDFS with different values of the Threshold_FMP threshold.
   Figure 10. Capacity Growth and Utilization in the Hot and Cold Zones compared to the Baseline, and Daily Migrations. GreenHDFS substantially increases the free space in the Hot Zones by migrating cold data to the Cold Zones. In the left and middle charts, we only consider the new data that was introduced in the data directory and the old data that was accessed during the one-month period. The right chart shows the number and total size of the files migrated daily to the Cold Zone with a Threshold_FMP value of 10 days.



   Threshold_FMP: We found that the energy costs are minimally sensitive to the Threshold_FMP threshold value. As shown in Figure 9 (Left), the energy cost savings varied minimally when Threshold_FMP was changed to 5, 10, 15, and 20 days. The performance impact and the number of file reversals are minimally sensitive to the Threshold_FMP value as well. This behavior can be explained by the observation that the majority of the data in the production Hadoop cluster at Yahoo! has a news-server-like access pattern. This implies that once data is deemed cold, there is a low probability of the data being accessed again.
   Figure 9 (right-most) shows the total number of migrations of the files which were deemed cold by the File Migration Policy, and the reversals of the moved files in case they were later accessed by a client, in the one-month simulation run. There were more instances (40,170, i.e., 4% of the overall file count) of file reversals with the most aggressive Threshold_FMP of 5 days. With a less aggressive Threshold_FMP of 15 days, the number of reversals in the system went down to 6,548 (i.e., 0.7% of the file count). The experiments were done with a Threshold_FRP value of 1. The number of file reversals is substantially reduced by increasing the Threshold_FRP value; with a Threshold_FRP value of 10, zero reversals happen in the system.
   The storage-efficiency is sensitive to the value of the Threshold_FMP threshold, as shown in Figure 10 (Left). An increase in the Threshold_FMP value results in less efficient capacity utilization of the Hot Zones. A higher value of the Threshold_FMP threshold signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be overkill for files with a very short FileLifeSpan_CLR, as they would unnecessarily lie dormant in the system, occupying precious Hot Zone capacity for a longer period of time.
   Threshold_SPC: As Figure 12 (Right) illustrates, increasing the Threshold_SPC value minimally increases the number of days the servers in the Cold Zone remain ON and hence minimally lowers the energy savings.


   Figure 11. Performance Analysis: Impact on Response Time because of power management with a Threshold_FMP of 10 days. 97.8% of the total read requests are not impacted by the power management; an impact is seen only by 2.1% of the reads. With a less aggressive Threshold_FMP (15, 20 days), the impact on the response time will reduce much more.

   Figure 12. Sensitivity Analysis: Sensitivity of the Number of Servers Used in the Cold Zone, the Number of Power State Transitions, and the Capacity per Zone to the File Migration Policy's Age Threshold and the Server Power Conserver Policy's Access Threshold.


On the other hand, increasing the Threshold_SPC value results in a reduction in the number of power state transitions, which improves the performance of accesses to the Cold Zone. Thus, a trade-off needs to be made between energy-conservation and data access performance.
   Summary of the Sensitivity Analysis: From the above evaluation, it is clear that a trade-off needs to be made in choosing the right thresholds in GreenHDFS based on an enterprise's needs. If Hot Zone space is at a premium, a more aggressive Threshold_FMP needs to be used. This can be done without impacting the energy-conservation that can be derived in GreenHDFS.

6.3.6   Number of Server Power Transitions

Figure 13 (Left) shows the number of power transitions incurred by the servers in the Cold Zone. Frequently starting and stopping disks is suspected to affect disk longevity, and the number of start/stop cycles a disk can tolerate during its service lifetime is still limited. Making power transitions infrequently reduces the risk of running into this limit. The maximum number of power state transitions incurred by a server in the one-month simulation run is just 11, and only 1 server out of the 390 servers provisioned in the Cold Zone exhibited this behavior. Most disks are designed for a maximum service lifetime of 5 years and can tolerate 500,000 start/stop cycles. Given the very small number of transitions incurred by a Cold Zone server in a year, GreenHDFS runs no risk of exceeding the start/stop cycle limit during the service lifetime of the disks.
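   The margin with respect to the start/stop rating is easy to quantify (again a back-of-the-envelope check, assuming the worst observed server keeps transitioning at the same rate):

    // Worst observed Cold Zone server: 11 power state transitions in a one-month run.
    // Projected over a 5-year disk service life versus a 500,000 start/stop cycle rating.
    public final class StartStopBudget {
        public static void main(String[] args) {
            int transitionsPerMonth = 11;
            long projected = (long) transitionsPerMonth * 12 * 5;   // 660 cycles over 5 years
            long rating = 500_000;
            System.out.println("Projected cycles: " + projected + " of " + rating
                    + " (" + (100.0 * projected / rating) + "% of the rating)");
        }
    }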
7   Related Work

   Management of energy, peak power, and temperature of data centers and warehouses is becoming the target of an increasing number of research studies. However, to the best of our knowledge, none of the existing systems exploits data-classification-driven data placement to derive energy-efficiency, nor do they have a file-system-managed, multi-zoned, hybrid data center layout. Most of the prior work focuses on workload placement to manage the thermal distribution within a data center. [30, 34] considered the placement of computational workload for energy-efficiency. Chase et al. [8] do energy-conscious provisioning which configures switches to concentrate request load on a minimal active set of servers for the current aggregate load level.


   Figure 13. Cold Zone Behavior: Number of Times Servers Transitioned Power State with Threshold_FMP of 10 Days. We only show those servers in the Cold zone that either received newly cold data or had data accesses targeted to them in the one-month simulation run.


of servers for the current aggregate load level.
    Le et al. [25] focus on a multi-datacenter internet service. They exploit the inherent heterogeneity among the datacenters in electricity pricing, time-zone differences, and collocation with renewable energy sources to reduce energy consumption without impacting the SLA requirements of the applications. Bash et al. [5] allocate heavy, long-running computational workloads onto servers located in more thermally efficient parts of the data center. Chun et al. [12] propose a hybrid datacenter comprising low-power Atom processors and high-power, high-performance Xeon processors; however, they do not specify any zoning in the system and focus on task migration rather than data migration. Narayanan et al. [31] offload the write workload directed at one volume to other storage elsewhere in the data center. Meisner et al. [28] reduce power costs by transitioning servers to a "PowerNap" state whenever there is a period of low utilization.
    In addition, there is research on hardware-level techniques such as dynamic voltage scaling as a mechanism to reduce peak power consumption in datacenters [7, 14], and Raghavendra et al. [33] coordinate hardware-level power capping with virtual machine dispatching mechanisms. Managing temperature is the subject of the systems proposed in [20].
    Recent research on increasing energy-efficiency in GFS- and HDFS-managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on and which represent the lowest power setting. The remaining replicas are stored in a larger set of secondary nodes, and performance is scaled up by increasing the number of secondary nodes. However, these solutions suffer from degraded write-performance and increased DFS code complexity. They also do not do any data differentiation and treat all the data in the system alike.
    Existing highly scalable file systems such as the Google file system [19] and HDFS [37] do not do energy management. Recently, an energy-efficient Log Structured File System was proposed by Hakim et al. [18]. However, it aims to concentrate load on one disk at a time and hence this design will impact availability and performance.

8   Conclusion and Future Work

    We presented a detailed evaluation and sensitivity analysis of GreenHDFS, a policy-driven, self-adaptive variant of the Hadoop Distributed File System. GreenHDFS relies on data-classification-driven data placement to realize guaranteed, substantially long periods of idleness in a significant subset of servers in the datacenter. Detailed experimental results with real-world traces from a production Yahoo! Hadoop cluster show that GreenHDFS is capable of achieving 24% savings in the energy costs of a Hadoop cluster by doing power management in only one of the main tenant top-level directories in the cluster. These savings will be further compounded by savings in cooling costs. Detailed lifespan analysis of the files in a large-scale production Hadoop cluster at Yahoo! points at the viability of GreenHDFS. Evaluation results show that GreenHDFS is able to meet all the scale-down mandates (i.e., it generates significant idleness in the cluster, results in very few power state transitions, and does not degrade write performance) in spite of the unique scale-down challenges present in a Hadoop cluster.
9   Acknowledgement

    This work was supported by NSF grant CNS 05-51665 and an internship at Yahoo!. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF or the U.S. government.

References

 [1] http://hadoop.apache.org/.
 [2] Introduction to power supplies. National Semiconductor, 2002.
 [3] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing, pages 217–228, New York, NY, USA, 2010. ACM.
 [4] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, 40(12), 2007.
 [5] C. Bash and G. Forman. Cool job allocation: measuring the power savings of placing jobs at cooling-efficient locations in the data center. In ATC '07: Proceedings of the 2007 USENIX Annual Technical Conference, pages 1–6, Berkeley, CA, USA, 2007. USENIX Association.
 [6] C. Belady. In the data center, power and cooling costs more than the IT equipment it supports. Electronics Cooling, February 2010.
 [7] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In HPCA, pages 171–, 2001.
 [8] J. S. Chase and R. P. Doyle. Balance of power: Energy management for server clusters. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS), 2001.
 [9] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao. Energy-aware server provisioning and load dispatching for connection-intensive internet services. In NSDI '08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, Berkeley, CA, USA, 2008. USENIX Association.
[10] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam. Managing server energy and operational costs in hosting centers. SIGMETRICS Perform. Eval. Rev., 33(1), 2005.
[11] Y. Chen, A. Ganapathi, A. Fox, R. H. Katz, and D. A. Patterson. Statistical workloads for energy efficient MapReduce. Technical report, UC Berkeley, 2010.
[12] B.-G. Chun, G. Iannaccone, G. Iannaccone, R. Katz, G. Lee, and L. Niccolini. An energy case for hybrid datacenters. In HotPower, 2009.
[13] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation. USENIX Association, 2004.
[14] M. E. Femal and V. W. Freeh. Boosting data center performance through non-uniform power allocation. In ICAC '05: Proceedings of the Second International Conference on Automatic Computing, Washington, DC, USA, 2005. IEEE Computer Society.
[15] E. Baldeschwieler, Yahoo! Inc. http://developer.yahoo.com/events/hadoopsummit2010.
[16] Seagate Barracuda ES.2. http://www.seagate.com/staticfiles/support/disc/manuals/nl35 series & bc es series/barracuda es.2 series/100468393e.pdf, 2008.
[17] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 13–23, New York, NY, USA, 2007. ACM.
[18] L. Ganesh, H. Weatherspoon, M. Balakrishnan, and K. Birman. Optimizing power consumption in large scale storage systems. In HotOS '07: Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems, Berkeley, CA, USA, 2007. USENIX Association.
[19] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
[20] T. Heath, A. P. Centeno, P. George, L. Ramos, Y. Jaluria, and R. Bianchini. Mercury and Freon: temperature emulation and management for server systems. In ASPLOS, pages 106–116, 2006.
[21] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, May 2009.
[22] Intel. Quad-core Intel Xeon processor 5400 series. 2008.
[23] R. T. Kaushik and M. Bhandarkar. GreenHDFS: Towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. HotPower, 2010.
[24] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Symposium on Massive Storage Systems and Technologies, 2010.
[25] K. Le, R. Bianchini, M. Martonosi, and T. Nguyen. Cost- and energy-aware load distribution across data centers. In HotPower, 2009.
[26] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. HotPower, 2009.
[27] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. SIGOPS Oper. Syst. Rev., 44(1):61–65, 2010.
[28] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: eliminating server idle power. In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 205–216, New York, NY, USA, 2009. ACM.
[29] Micron. DDR2 SDRAM SODIMM. 2004.
[30] J. Moore, J. Chase, P. Ranganathan, and R. Sharma. Making scheduling "cool": temperature-aware workload placement in data centers. In ATEC '05: Proceedings of the USENIX Annual Technical Conference, pages 5–5, Berkeley, CA, USA, 2005. USENIX Association.
[31] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. Trans. Storage, 4(3):1–23, 2008.
[32] C. Patel, E. Bash, R. Sharma, and M. Beitelmal. Smart cooling of data centers. In Proceedings of the Pacific Rim/ASME International Electronics Packaging Technical Conference and Exhibition (IPACK '03), 2003.
[33] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu. No "power" struggles: coordinated multi-level power management for the data center. In ASPLOS XIII, pages 48–59, New York, NY, USA, 2008. ACM.
[34] R. K. Sharma, C. E. Bash, C. D. Patel, R. J. Friedrich, and J. S. Chase. Balance of power: Dynamic thermal management for internet data centers. IEEE Internet Computing, 9:42–49, 2005.
[35] SMSC. LAN9420/LAN9420i single-chip Ethernet controller with HP Auto-MDIX support and PCI interface. 2008.
[36] N. Tolia, Z. Wang, M. Marwah, C. Bash, P. Ranganathan, and X. Zhu. Delivering energy proportionality with non energy-proportional systems - optimizing the ensemble. In HotPower, 2008.
[37] T. White. Hadoop: The Definitive Guide. O'Reilly Media, May 2009.

Evaluation and analysis of green hdfs a self-adaptive, energy-conserving variant of the hadoop distributed file system

  • 1. 2nd IEEE International Conference on Cloud Computing Technology and Science Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System Rini T. Kaushik Milind Bhandarkar University of Illinois, Urbana-Champaign Yahoo! Inc. [email protected] [email protected] Klara Nahrstedt University of Illinois, Urbana-Champaign [email protected] Abstract needs [13]. Hadoop’s data-intensive computing framework is built on a large-scale, highly resilient object-based clus- We present a detailed evaluation and sensitivity anal- ter storage managed by Hadoop Distributed File System ysis of an energy-conserving, highly scalable variant of (HDFS) [24]. the Hadoop Distributed File System (HDFS) called Green- With the increase in the sheer volume of the data that HDFS. GreenHDFS logically divides the servers in a needs to be processed, storage and server demands of com- Hadoop cluster into Hot and Cold Zones and relies on in- puting workloads are on a rapid increase. Yahoo!’s com- sightful data-classification driven energy-conserving data pute infrastructure already hosts 170 petabytes of data and placement to realize guaranteed, substantially long periods deploys over 38000 servers [15]. Over the lifetime of IT (several days) of idleness in a significant subset of servers equipment, the operating energy cost is comparable to the in the Cold Zone. Detailed lifespan analysis of the files in initial equipment acquisition cost [11] and constitutes a sig- a large-scale production Hadoop cluster at Yahoo! points at nificant part of the total cost of ownership of a datacen- the viability of GreenHDFS. Simulation results with real- ter [6]. Hence, energy-conservation of the extremely large- world Yahoo! HDFS traces show that GreenHDFS can scale, commodity server farms has become a priority. achieve 24% energy cost reduction by doing power man- Scale-down (i.e., transitioning servers to an inactive, low agement in only one top-level tenant directory in the clus- power consuming sleep/standby state) is an attractive tech- ter and meets all the scale-down mandates in spite of the nique to conserve energy as it allows energy proportional- unique scale-down challenges present in a Hadoop cluster. ity with non energy-proportional components such as the If GreenHDFS technique is applied to all the Hadoop clus- disks [17] and significantly reduces power consumption ters at Yahoo! (amounting to 38000 servers), $2.1million (idle power draw of 132.46W vs. sleep power draw of can be saved in energy costs per annum. Sensitivity anal- 13.16W in a typical server as shown in Table 1). However, ysis shows that energy-conservation is minimally sensitive scale-down cannot be done naively as discussed in Section to the thresholds in GreenHDFS. Lifespan analysis points 3.2. out that one-size-fits-all energy-management policies won’t One technique is to scale-down servers by manufactur- suffice in a multi-tenant Hadoop Cluster. ing idleness by migrating workloads and their correspond- ing state to fewer machines during periods of low activ- ity [5, 9, 10, 25, 30, 34, 36]. This can be relatively easy to ac- 1 Introduction complish when servers are state-less (i.e., serving data that resides on a shared NAS or SAN storage system). However, Cloud computing is gaining rapid popularity. Data- servers in a Hadoop cluster are not state-less. 
intensive computing needs range from advertising optimiza- HDFS distributes data chunks and replicas across servers tions, user-interest predictions, mail anti-spam, and data an- for resiliency, performance, load-balancing and data- alytics to deriving search rankings. An increasing number locality reasons. With data distributed across all nodes, any of companies and academic institutions have started to rely node may be participating in the reading, writing, or com- on Hadoop [1] which is an open-source version of Google’s putation of a data-block at any time. Such data placement Map-reduce framework for their data-intensive computing makes it hard to generate significant periods of idleness in 978-0-7695-4302-4/10 $26.00 © 2010 IEEE 274 DOI 10.1109/CloudCom.2010.109
  • 2. the Hadoop clusters and renders usage of inactive power work and conclude. modes infeasible [26]. Recent research on scale-down in GFS and HDFS man- 2 Key observations aged clusters [3, 27] propose maintaining a primary replica of the data on a small covering subset of nodes that are guar- We did a detailed analysis of the evolution and lifespan anteed to be on. However, these solutions suffer from de- of the files in in a production Yahoo! Hadoop cluster us- graded write-performance as they rely on write-offloading ing one-month long HDFS traces and Namespace metadata technique [31] to avoid server wakeups at the time of writes. checkpoints. We analyzed each top-level directory sepa- Write-performance is an important consideration in Hadoop rately in the production multi-tenant Yahoo! Hadoop clus- and even more so in a production Hadoop cluster as dis- ter as each top-level directory in the namespace exhibited cussed in Section 3.1. different access patterns and lifespan distributions. The key We took a different approach and proposed GreenHDFS, observations from the analysis are: an energy-conserving, self-adaptive, hybrid, logical multi- zoned variant of HDFS in our paper [23]. Instead of an ∙ There is significant heterogeneity in the access pat- energy-efficient placement of computations or using a small terns and the lifespan distributions across the various covering set for primary replicas as done in earlier research, top-level directories in the production Hadoop clus- GreenHDFS focuses on data-classification techniques to ter and one-size-fits-all energy-management policies extract energy savings by doing energy-aware placement of don’t suffice across all directories. data. ∙ Significant amount of data amounting to 60% of used GreenHDFS trades cost, performance and power by sep- capacity is cold (i.e., is lying dormant in the system arating cluster into logical zones of servers. Each cluster without getting accessed) in the production Hadoop zone has a different temperature characteristic where tem- cluster. A majority of this cold data needs to exist for perature is measured by the power consumption and the per- regulatory and historical trend analysis purposes. formance requirements of the zone. GreenHDFS relies on the inherent heterogeneity in the access patterns in the data ∙ We found that the 95-98% files in majority of the top- stored in HDFS to differentiate the data and to come up with level directories had a very short hotness lifespan of an energy-conserving data layout and data placement onto less than 3 days. Only one directory had files with the zones. Since, computations exhibit high data locality in longer hotness lifespan. Even in that directory 80% the Hadoop framework, the computations then flow natu- of files were hot for less than 8 days. rally to the data in the right temperature zones. The contribution of this paper lies in showing that the ∙ We found that 90% of files amounting to 80.1% of the energy-aware data-differentiation based data-placement in total used capacity in the most storage-heavy top-level GreenHDFS is able to meet all the effective scale-down directory were dormant and hence, cold for more than mandates (i.e., generates significant idleness, results in 18 days. Dormancy periods were much shorter in the few power state transitions, and doesn’t degrade write per- rest of the directories and only 20% files were dormant formance) despite the significant challenges posed by a beyond 1 day. Hadoop cluster to scale-down. 
We do a detailed evaluation ∙ Access pattern to majority of the data in the production and sensitivity analysis of the policy thresholds in use in Hadoop cluster have a news-server-like access pattern GreenHDFS with a trace-driven simulator with real-world whereby most of the computations to the data happens HDFS traces from a production Hadoop cluster at Yahoo!. soon after the data’s creation. While some aspects of GreenHDFS are sensitive to the pol- icy thresholds, we found that energy-conservation is mini- mally sensitive to the policy thresholds in GreenHDFS. 3 Background The remainder of the paper is structured as follows. In Section 2, we list some of the key observations from our Map-reduce is a programming model designed to sim- analysis of the production Hadoop cluster at Yahoo!. In plify data processing [13]. Google, Yahoo!, Facebook, Section 3, we provide background on HDFS, and discuss Twitter etc. use Map-reduce to process massive amount of scale-down mandates. In Section 4, we give an overview of data on large-scale commodity clusters. Hadoop is an open- the energy management policies of GreenHDFS. In Section source cluster-based Map-reduce implementation written in 5, we present an analysis of the Yahoo! cluster. In Section Java [1]. It is logically separated into two subsystems: a 6, we include experimental results demonstrating the effec- highly resilient and scalable Hadoop Distributed File Sys- tiveness and robustness of our design and algorithms in a tem (HDFS), and a Map-reduce task execution framework. simulation environment. In Section 7, we discuss related HDFS runs on clusters of commodity hardware and is an 275
  • 3. object-based distributed file system. The namespace and to the class of data residing in that zone. Differentiating the metadata (modification, access times, permissions, and the zones in terms of power is crucial towards attaining our quotas) are stored on a dedicated server called the NameN- energy-conservation goal. ode and are decoupled from the actual data which is stored Hot zone consists of files that are being accessed cur- on servers called the DataNodes. Each file in HDFS is repli- rently and the newly created files. This zone has strict SLA cated for resiliency and split into blocks of typically 128MB (Service Level Agreements) requirements and hence, per- and individual blocks and replicas are placed on the DataN- formance is of the greatest importance. We trade-off energy odes for fine-grained load-balancing. savings in interest of very high performance in this zone. In this paper, GreenHDFS employs data chunking, placement 3.1 Importance of Write-Performance in and replication policies similar to the policies in baseline Production Hadoop Cluster HDFS or GFS. Cold zone consists of files with low to rare accesses. Reduce phase of a Map-reduce task writes intermediate Files are moved by File Migration policy from the Hot computation results back to the Hadoop cluster and relies on zones to the Cold zone as their temperature decreases be- high write performance for overall performance of a Map- yond a certain threshold. Performance and SLA require- reduce task. Furthermore, we observed that the majority of ments are not as critical for this zone and GreenHDFS em- the data in a production Hadoop cluster has a news-server ploys aggressive energy-management schemes and policies like access pattern. Predominant number of computations in this zone to transition servers to low power inactive state. happen on newly created data; thereby mandating good read Hence, GreenHDFS trades-off performance with high en- and write performance of the newly created data. ergy savings in the Cold zone. For optimal energy savings, it is important to increase 3.2 Scale-down Mandates the idle times of the servers and limit the wakeups of servers that have transitioned to the power saving mode. Keeping Scale-down, in which server components such as CPU, this rationale in mind and recognizing the low performance disks, and DRAM are transitioned to inactive, low power needs and infrequency of data accesses to the Cold zone; consuming mode, is a popular energy-conservation tech- this zone will not chunk the data. This will ensure that upon nique. However, scale-down cannot be applied naively. En- a future access only the server containing the data will be ergy is expended and transition time penalty is incurred woken up. when the components are transitioned back to an active By default, the servers in Cold zone are in a sleeping power mode. For example, transition time of components mode. A server is woken up when either new data needs such as the disks can be as high as 10secs. Hence, an effec- to be placed on it or when data already residing on the tive scale-down technique mandates the following: server is accessed. GreenHDFS tries to avoid powering-on ∙ Sufficient idleness to ensure that energy savings are a server in the Cold zone and maximizes the use of the exist- higher than the energy spent in the transition. ing powered-on servers in its server allocation decisions in ∙ Less number of power state transitions as some com- interest of maximizing the energy savings. 
3.1 Importance of Write-Performance in a Production Hadoop Cluster

The Reduce phase of a Map-reduce task writes intermediate computation results back to the Hadoop cluster and relies on high write performance for the overall performance of the Map-reduce task. Furthermore, we observed that the majority of the data in a production Hadoop cluster has a news-server-like access pattern: the predominant number of computations happen on newly created data, thereby mandating good read and write performance for newly created data.

3.2 Scale-down Mandates

Scale-down, in which server components such as the CPU, disks, and DRAM are transitioned to an inactive, low-power-consuming mode, is a popular energy-conservation technique. However, scale-down cannot be applied naively. Energy is expended and a transition time penalty is incurred when the components are transitioned back to an active power mode. For example, the transition time of components such as the disks can be as high as 10 seconds. Hence, an effective scale-down technique mandates the following:

∙ Sufficient idleness, to ensure that the energy savings are higher than the energy spent in the transition.

∙ Few power state transitions, as some components (e.g., disks) have a limited number of start/stop cycles and too-frequent transitions may adversely impact the lifetime of the disks.

∙ No performance degradation. Steps need to be taken to amortize the performance penalty of power state transitions and to ensure that load concentration on the remaining active-state servers doesn't adversely impact the overall performance of the system.

4 GreenHDFS Design

GreenHDFS is a variant of the Hadoop Distributed File System (HDFS); it logically organizes the servers in the datacenter into multiple dynamically provisioned Hot and Cold zones. Each zone has a distinct performance, cost, and power characteristic, and each zone is managed by the power and data placement policies most conducive to the class of data residing in that zone. Differentiating the zones in terms of power is crucial towards attaining our energy-conservation goal.

The Hot zone consists of the files that are being accessed currently and the newly created files. This zone has strict SLA (Service Level Agreement) requirements and hence performance is of the greatest importance. We trade off energy savings in the interest of very high performance in this zone. In this paper, GreenHDFS employs data chunking, placement, and replication policies similar to the policies in baseline HDFS or GFS.

The Cold zone consists of files with low to rare accesses. Files are moved by the File Migration Policy from the Hot zones to the Cold zone as their temperature decreases beyond a certain threshold. Performance and SLA requirements are not as critical for this zone, and GreenHDFS employs aggressive energy-management schemes and policies in this zone to transition servers to a low-power inactive state. Hence, GreenHDFS trades off performance for high energy savings in the Cold zone.

For optimal energy savings, it is important to increase the idle times of the servers and to limit the wakeups of servers that have transitioned to the power-saving mode. Keeping this rationale in mind, and recognizing the low performance needs and infrequency of data accesses in the Cold zone, this zone does not chunk the data. This ensures that upon a future access only the server containing the data will be woken up.

By default, the servers in the Cold zone are in a sleeping mode. A server is woken up when either new data needs to be placed on it or data already residing on the server is accessed. GreenHDFS tries to avoid powering on a server in the Cold zone and maximizes the use of the existing powered-on servers in its server allocation decisions, in the interest of maximizing the energy savings. One server is woken up and filled completely to its capacity before the next server from an ordered list of servers in the Cold zone is chosen to be transitioned to an active power state (a short sketch of this allocation is given at the end of this section).

The goal of GreenHDFS is to maximize the allocation of servers to the Hot zone, to minimize the performance impact of zoning, and to minimize the number of servers allocated to the Cold zone. We introduced a hybrid, storage-heavy cluster model in [23] whereby servers in the Cold zone are storage-heavy and have 12 1TB disks per server. We argue that zoning in GreenHDFS will not affect the Hot zone's performance adversely and that the computational workload can be consolidated on the servers in the Hot zone without raising CPU utilization above the provisioning guidelines. A study of 5000 Google compute servers showed that most of the time is spent within the 10%-50% CPU utilization range [4]. Hence, significant opportunities exist in workload consolidation. And the compute capacity of the Cold zone can always be harnessed under peak load scenarios.
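The fill-one-server-before-waking-the-next allocation described above can be summarized in a short sketch. This is an illustration only, not GreenHDFS code: the class, field, and method names (ColdZoneAllocator, pickTarget, and so on) are stand-ins under the stated assumption of storage-heavy Cold-zone servers.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative Cold-zone allocator: keep filling the currently powered-on
    // server and wake the next one from an ordered list only when it is full.
    class ColdZoneAllocator {
        static class ColdServer {
            final String host;
            final long capacityBytes;     // e.g., 12 x 1 TB per Cold-zone server
            long usedBytes;
            boolean active;               // false == sleeping
            ColdServer(String host, long capacityBytes) {
                this.host = host;
                this.capacityBytes = capacityBytes;
            }
        }

        private final Deque<ColdServer> ordered = new ArrayDeque<>();
        private ColdServer current;

        ColdZoneAllocator(Iterable<ColdServer> servers) {
            for (ColdServer s : servers) ordered.add(s);
        }

        // Returns the server that should receive a migrated file of the given size.
        ColdServer pickTarget(long fileBytes) {
            if (current == null || current.usedBytes + fileBytes > current.capacityBytes) {
                current = ordered.poll();  // next server in the ordered list
                if (current == null) throw new IllegalStateException("Cold zone full");
                current.active = true;     // wake it up (e.g., via Wake-on-LAN)
            }
            current.usedBytes += fileBytes;
            return current;
        }
    }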
4.1 Energy-management Policies

Files are moved from the Hot zones to the Cold zone as their temperature changes over time, as shown in Figure 1. In this paper, we use the dormancy of a file, defined as the elapsed time since the last access to the file, as the measure of the file's temperature. The higher the dormancy, the lower the temperature of the file and hence the higher its coldness; conversely, the lower the dormancy, the higher the heat of the file. GreenHDFS uses the existing mechanism in baseline HDFS to record and update the last access time of a file upon every file read.

[Figure 1. State diagram of a file's zone allocation based on the migration policies: a file moves from the Hot zone to the Cold zone when its coldness exceeds Threshold_FMP, moves back when its hotness exceeds Threshold_FRP, and may be deleted from either zone.]

4.1.1 File Migration Policy

The File Migration Policy runs in the Hot zone, monitors the dormancy of the files as shown in Algorithm 1, and moves dormant, i.e., cold, files to the Cold zone. The advantages of this policy are two-fold: 1) it leads to higher space-efficiency, as space is freed up in the Hot zone for files with higher SLA requirements by moving rarely accessed files out of the servers in this zone, and 2) it allows significant energy-conservation. Data-locality is an important consideration in the Map-reduce framework and computations are co-located with data; thus, computations naturally happen on the data residing in the Hot zone. This results in significant idleness in all the components of the servers in the Cold zone (i.e., CPU, DRAM, and disks), allowing effective scale-down of these servers.

Algorithm 1: File Migration Policy, which classifies cold data in the Hot zones and migrates it to the Cold zone.
    for every file i in the Hot zone do
        dormancy_i <- current_time - last_access_time_i
        if dormancy_i >= Threshold_FMP then
            ColdZone <- ColdZone + {f_i}
            HotZone  <- HotZone - {f_i}    // file-system metadata structures are changed to the Cold zone
        end if
    end for
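As a concrete illustration of Algorithm 1, the sketch below walks a Hot-zone directory with the standard Hadoop FileSystem API and flags files whose dormancy exceeds Threshold_FMP. The actual policy operates on the file-system metadata inside GreenHDFS; the client-side calls, the class name, and the moveToColdZone() hook here are assumptions made for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative file migration pass (Algorithm 1): classify files in a
    // Hot-zone directory as cold when their dormancy exceeds the threshold.
    public class FileMigrationPolicy {
        private final long thresholdFmpMillis;   // Threshold_FMP, e.g. 10 days

        public FileMigrationPolicy(long thresholdFmpDays) {
            this.thresholdFmpMillis = thresholdFmpDays * 24L * 3600L * 1000L;
        }

        public void run(FileSystem fs, Path hotZoneDir) throws Exception {
            long now = System.currentTimeMillis();
            for (FileStatus f : fs.listStatus(hotZoneDir)) {
                if (f.isDirectory()) { run(fs, f.getPath()); continue; }
                long dormancy = now - f.getAccessTime();   // last read time kept by HDFS
                if (dormancy >= thresholdFmpMillis) {
                    moveToColdZone(f.getPath());           // placeholder for the actual migration
                }
            }
        }

        private void moveToColdZone(Path file) {
            // In GreenHDFS this updates the file-system metadata so the file's
            // blocks are re-placed, unchunked, on a Cold-zone server.
            System.out.println("migrate " + file);
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            new FileMigrationPolicy(10).run(fs, new Path(args[0]));
        }
    }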
4.1.2 Server Power Conserver Policy

The Server Power Conserver Policy runs in the Cold zone and determines the servers which can be transitioned into a power-saving standby/sleep mode, as shown in Algorithm 2. The current trend in internet-scale data warehouses and Hadoop clusters is to use commodity servers with 4-6 directly attached disks instead of expensive RAID controllers. In such systems the disks constitute just 10% of the entire power usage, as illustrated in a study performed at Google [21], while the CPU and DRAM constitute 63% of the total power usage. Hence, power management of any one component is not sufficient; we leverage energy cost savings at the entire server granularity (CPU, disks, and DRAM) in the Cold zone.

GreenHDFS uses hardware techniques similar to [28] to transition the processors, disks, and DRAM into a low-power state. In the Cold zone, GreenHDFS uses the disk Sleep mode (in which the drive buffer is disabled, the heads are parked, and the spindle is at rest), the CPU's ACPI S3 sleep state, which consumes minimal power and requires only 30us to transition from sleep back to active execution, and DRAM's self-refresh operating mode, in which transitions into and out of self-refresh complete in less than a microsecond.

The servers are transitioned back to an active power mode under three conditions: 1) data residing on the server is accessed, 2) additional data needs to be placed on the server, or 3) the block scanner needs to run on the server to ensure the integrity of the data residing on the Cold-zone servers. GreenHDFS relies on Wake-on-LAN in the NICs to send a magic packet that transitions a server back to an active power state.

[Figure 2. Triggering events leading to power state transitions in the Cold zone. Wake-up events: file access, file placement, and the bit-rot integrity checker; a server becomes inactive when its coldness exceeds the Server Power Conserver Policy threshold (Threshold_SPC).]

Algorithm 2: Server Power Conserver Policy.
    for every server i in the Cold zone do
        coldness_i <- current_time - max_{0 <= j <= m} last_access_time_j
        if coldness_i >= Threshold_SPC then
            S_i <- INACTIVE_STATE
        end if
    end for
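A matching sketch of Algorithm 2 follows, interpreting a server's coldness as the elapsed time since the most recent access to any file it stores. The map-based input, the class name, and the transitionToStandby() placeholder are assumptions; the real standby and wake-up mechanics (ACPI S3, disk sleep, DRAM self-refresh, Wake-on-LAN) are hardware- and OS-level operations that are not shown.

    import java.util.List;
    import java.util.Map;

    // Illustrative Server Power Conserver pass (Algorithm 2): a Cold-zone server
    // is put into standby when even its most recently touched file has not been
    // accessed for Threshold_SPC days.
    class ServerPowerConserver {
        private final long thresholdSpcMillis;

        ServerPowerConserver(long thresholdSpcDays) {
            this.thresholdSpcMillis = thresholdSpcDays * 24L * 3600L * 1000L;
        }

        // filesByServer: host -> last-access timestamps of the files it stores.
        void run(Map<String, List<Long>> filesByServer) {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, List<Long>> e : filesByServer.entrySet()) {
                long newestAccess = e.getValue().stream()
                                     .mapToLong(Long::longValue).max().orElse(0L);
                long coldness = now - newestAccess;
                if (coldness >= thresholdSpcMillis) {
                    transitionToStandby(e.getKey());  // CPU S3, disk sleep, DRAM self-refresh
                }
            }
        }

        private void transitionToStandby(String host) {
            // Placeholder: the server is later woken with a Wake-on-LAN magic
            // packet on a file access, new placement, or an integrity scan.
            System.out.println("standby " + host);
        }
    }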
4.1.3 File Reversal Policy

The File Reversal Policy runs in the Cold zone and ensures that the QoS, bandwidth, and response time of files that become popular again after a period of dormancy are not impacted. If the number of accesses to a file residing in the Cold zone becomes higher than the threshold Threshold_FRP, the file is moved back to the Hot zone, as shown in Algorithm 3. The file is chunked and placed onto the servers in the Hot zone in congruence with the policies of the Hot zone.

Algorithm 3: File Reversal Policy, which monitors the temperature of the cold files in the Cold zone and moves files back to the Hot zones if their temperature changes.
    for every file i in the Cold zone do
        if num_accesses_i >= Threshold_FRP then
            HotZone  <- HotZone + {f_i}
            ColdZone <- ColdZone - {f_i}    // file-system metadata are changed to the Hot zone
        end if
    end for
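The reversal check can likewise be sketched as a small access-driven counter, shown below. This is illustrative only: the class and method names are hypothetical, and reverseToHotZone() stands in for the metadata update and re-chunking that GreenHDFS performs.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative File Reversal Policy (Algorithm 3): count accesses to files in
    // the Cold zone and move a file back once its count crosses Threshold_FRP.
    class FileReversalPolicy {
        private final int thresholdFrp;
        private final Map<String, Integer> coldAccessCounts = new ConcurrentHashMap<>();

        FileReversalPolicy(int thresholdFrp) { this.thresholdFrp = thresholdFrp; }

        // Called on every read of a Cold-zone file.
        void onColdFileRead(String path) {
            int count = coldAccessCounts.merge(path, 1, Integer::sum);
            if (count >= thresholdFrp) {
                coldAccessCounts.remove(path);
                reverseToHotZone(path);   // re-chunk and re-place per Hot-zone policies
            }
        }

        private void reverseToHotZone(String path) {
            System.out.println("reverse " + path);   // placeholder for the metadata update
        }
    }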
4.1.4 Policy Thresholds Discussion

A good data migration scheme should result in maximal energy savings, minimal data oscillations between the GreenHDFS zones, and minimal performance degradation. Minimizing the accesses to the Cold-zone files yields maximal energy savings and minimal performance impact. For this, the policy thresholds should be chosen in a way that minimizes the number of accesses to the files residing in the Cold zone while maximizing the movement of dormant data to the Cold zone. Results from our detailed sensitivity analysis of the thresholds used in GreenHDFS are covered in Section 6.3.5.

Threshold_FMP: A low (i.e., aggressive) value of Threshold_FMP results in an ultra-greedy selection of files as potential candidates for migration to the Cold zone. While an aggressive Threshold_FMP has several advantages, such as higher space-savings in the Hot zone as more data is migrated out to the Cold zone, there are disadvantages as well. If files have intermittent periods of dormancy, they may incorrectly get labeled as cold and be moved to the Cold zone. There is a high probability that such files will be accessed in the near future; such accesses may suffer performance degradation, as they may be subject to the power transition penalty, and may trigger data oscillations because of file reversals back to the Hot zone. A higher value of Threshold_FMP results in higher accuracy in determining the really cold files. Hence, the number of reversals, server wakeups, and the associated performance degradation decrease as the threshold is increased. On the other hand, a higher value of Threshold_FMP signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be overkill for files with a very short FileLifeSpan_CLR (hotness lifespan), as such files will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

Threshold_SPC: A high Threshold_SPC increases the number of days the servers in the Cold zone remain in an active power state and hence lowers the energy savings. On the other hand, it results in a reduction in the power state transitions, which improves the performance of accesses to the Cold zone. Thus, a trade-off needs to be made between energy-conservation and data access performance in the selection of the value for Threshold_SPC.

Threshold_FRP: A relatively high value of Threshold_FRP ensures that files are accurately classified as hot-again files before they are moved back to the Hot zone from the Cold zone. This reduces data oscillations in the system and reduces unnecessary file reversals.
5 Analysis of a Production Hadoop Cluster at Yahoo!

We analyzed one month of HDFS audit logs and namespace checkpoints from a multi-tenant cluster at Yahoo!. (The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image; the persistent record of the image is called a checkpoint. HDFS can log all file system access requests, as required for auditing purposes in enterprises; the audit logging is implemented using log4j and, once enabled, logs every HDFS event in the NameNode's log [37]. We used these checkpoints and HDFS logs for our analysis.) The cluster had 2600 servers, hosted 34 million files in the namespace, and its data set size was 6 Petabytes. There were 425 million entries in the HDFS logs and each namespace checkpoint contained 30-40 million files. The cluster namespace was divided into six main top-level directories, where each directory addresses different workloads and access patterns. We only considered 4 main directories and refer to them as d, p, u, and m in our analysis instead of referring to them by their real names. The total number of unique files seen in the HDFS logs over the one-month duration was 70 million (d: 1.8 million, p: 30 million, u: 23 million, and m: 2 million).

The logs and the metadata checkpoints were huge in size, and we used a large-scale research Hadoop cluster at Yahoo! extensively for our analysis. We wrote the analysis scripts in Pig. We considered several cases in our analysis, as shown below:

∙ Files created before the analysis period which were not read or deleted subsequently at all. We classify these files as long-living cold files.
∙ Files created before the analysis period which were read during the analysis period.
∙ Files created before the analysis period which were both read and deleted during the analysis period.
∙ Files created during the analysis period which were neither read during the analysis period nor deleted.
∙ Files created during the analysis period which were not read during the analysis period, but were deleted.
∙ Files created during the analysis period which were read and deleted during the analysis period.

To accurately account for the file lifespans and lifetimes, we handled the following cases: (a) filename reuse: we appended a timestamp to each file create to accurately track the audit log entries following the file create entry in the audit log; (b) file renames: we used a unique id per file to accurately track its lifetime across create, rename, and delete; (c) renames and deletes at a higher level in the path hierarchy had to be translated to leaf-level renames and deletes for our analysis; and (d) HDFS logs do not have file size information, so we joined the dataset found in the HDFS logs with the namespace checkpoint to get the file size information.

5.1 File Lifespan Analysis of the Yahoo! Hadoop Cluster

A file goes through several stages in its lifetime: 1) file creation, 2) a hot period during which the file is frequently accessed, 3) a dormant period during which the file is not accessed, and 4) deletion. We introduced and considered various lifespan metrics in our analysis to characterize a file's evolution; a study of the various lifespan distributions helps in deciding the energy-management policy thresholds that need to be in place in GreenHDFS. The metrics are listed below (a sketch of how they can be derived from the trace timestamps follows the list).

∙ FileLifeSpan_CFR is defined as the file lifespan between the file creation and the first read access. This metric is used to find the clustering of the read accesses around the file creation.
∙ FileLifeSpan_CLR is defined as the file lifespan between creation and the last read access. This metric is used to determine the hotness profile of the files.
∙ FileLifeSpan_LRD is defined as the file lifespan between the last read access and file deletion. This metric helps in determining the coldness profile of the files, as this is the period for which files are dormant in the system.
∙ FileLifeSpan_FLR is defined as the file lifespan between the first read access and the last read access. This metric helps in determining another dimension of the hotness profile of the files.
∙ FileLifetime is the lifetime of the file between its creation and its deletion.
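The analysis scripts themselves were written in Pig; purely as an illustration of the definitions above, the Java sketch below computes the five metrics for a single file from the per-file timestamps reconstructed out of the audit log. The record layout is hypothetical.

    // Illustrative computation of the lifespan metrics from per-file timestamps
    // (creation, first read, last read, deletion), all in milliseconds.
    class FileLifespans {
        static class FileRecord {
            long createTime, firstReadTime, lastReadTime, deleteTime;
        }

        static long days(long millis) { return millis / (24L * 3600L * 1000L); }

        static void report(FileRecord f) {
            System.out.println("FileLifeSpan_CFR (days): " + days(f.firstReadTime - f.createTime));
            System.out.println("FileLifeSpan_CLR (days): " + days(f.lastReadTime - f.createTime));
            System.out.println("FileLifeSpan_FLR (days): " + days(f.lastReadTime - f.firstReadTime));
            System.out.println("FileLifeSpan_LRD (days): " + days(f.deleteTime - f.lastReadTime));
            System.out.println("FileLifetime     (days): " + days(f.deleteTime - f.createTime));
        }
    }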
5.1.1 FileLifeSpan_CFR

The FileLifeSpan_CFR distribution throws light on the clustering of the file reads around the file creation. As shown in Figure 3, 99% of the files have a FileLifeSpan_CFR of less than 2 days.

5.1.2 FileLifeSpan_CLR

Figure 4 shows the distribution of FileLifeSpan_CLR in the cluster. In directory d, 80% of the files are hot for less than 8 days and 90% of the files, amounting to 94.62% of the storage, are hot for less than 24 days. The FileLifeSpan_CLR of 95% of the files, amounting to 96.51% of the storage, in directory p is less than 3 days, and the FileLifeSpan_CLR of 100% of the files in directory m and 98% of the files in directory a is as small as 2 days. In directory u, 98% of the files have a FileLifeSpan_CLR of less than 1 day. Thus, the majority of the files in the cluster have a short hotness lifespan.

5.1.3 FileLifeSpan_LRD

FileLifeSpan_LRD indicates the time for which a file stays in a dormant state in the system. The longer the dormancy period, the higher the coldness of the file and hence the higher the file's suitability for migration to the Cold zone. Figure 5 shows the distribution of FileLifeSpan_LRD in the cluster. In directory d, 90% of the files are dormant beyond 1 day and 80% of the files, amounting to 80.1% of the storage, exist in a dormant state past 20 days. In directory p, only 25% of the files are dormant beyond 1 day and only 20% of the files remain dormant in the system beyond 10 days. In directory m, only 0.02% of the files are dormant for more than 1 day, and in directory u, 20% of the files are dormant beyond 10 days.

The FileLifeSpan_LRD needs to be considered to find the true migration suitability of a file. For example, given the extremely short dormancy period of the files in directory m, there is no point in exercising the File Migration Policy on directory m. For directories p and u, a Threshold_FMP of less than 5 days will result in unnecessary movement of files to the Cold zone, as these files are due for deletion in any case. On the other hand, given the short FileLifeSpan_CLR in these directories, a high value of Threshold_FMP will not do justice to space-efficiency in the Hot zone, as discussed in Section 4.1.4.

5.1.4 File Lifetime Analysis

Knowledge of the FileLifetime further assists in migration candidate selection and needs to be accounted for in addition to the FileLifeSpan_LRD and FileLifeSpan_CLR metrics covered earlier. As shown in Figure 6, directory p only has 23% of files that live beyond 20 days. On the other hand, 80% of the files in directory d live for more than 30 days while 80% of the files have a hot lifespan of less than 8 days. Thus, directory d is a very good candidate for invoking the File Migration Policy.
[Figure 3. FileLifeSpan_CFR distribution. 99% of the files in directory d and 98% of the files in directory p were accessed for the first time less than 2 days after creation.]

[Figure 4. FileLifeSpan_CLR distribution in the four main top-level directories in the Yahoo! production cluster. FileLifeSpan_CLR characterizes the lifespan for which files are hot. In directory d, 80% of the files were hot for less than 8 days and 90% of the files, amounting to 94.62% of the storage, are hot for less than 24 days. The hotness lifespan of 95% of the files, amounting to 96.51% of the storage, in directory p is less than 3 days; 100% of the files in directory m have a similarly short hotness lifespan, and in directory u, 98% of the files are hot for less than 1 day.]

[Figure 5. FileLifeSpan_LRD distribution of the top-level directories in the Yahoo! production cluster. FileLifeSpan_LRD characterizes the coldness in the cluster and is indicative of the time a file stays in a dormant state in the system. 80% of the files, amounting to 80.1% of the storage, in directory d have a dormancy period of higher than 20 days. 20% of the files, amounting to 28.6% of the storage, in directory p are dormant beyond 10 days. 0.02% of the files in directory m are dormant beyond 1 day.]

[Figure 6. FileLifetime distribution. 67% of the files in directory p are deleted within one day of their creation and only 23% of its files live beyond 20 days. On the other hand, in directory d 80% of the files have a FileLifetime of more than 30 days.]
[Figure 7. File size and file count percentage of long-living cold files. The cold files are defined as the files that were created prior to the start of the one-month observation period and were not accessed during the period of observation at all. In the case of directory d, 13% of the total file count in the cluster, which amounts to 33% of the total used capacity, is cold. In the case of directory p, 37% of the total file count in the cluster, which amounts to 16% of the total used capacity, is cold. Overall, 63.16% of the total file count and 56.23% of the total used capacity is cold in the system.]

[Figure 8. Dormant-period analysis of the file count distribution and histogram in one namespace checkpoint. Dormancy of a file is defined as the elapsed time between the last access time recorded in the checkpoint and the day of observation. 34% of the files in directory p and 58% of the files in directory d were not accessed in the last 40 days.]
5.2 Coldness Characterization of the Files

In this section, we show the file count and the storage capacity used by the long-living cold files. The long-living cold files are defined as the files that were created prior to the start of the observation period and were not accessed during the one-month period of observation at all. As shown in Figure 7, 63.16% of the files, amounting to 56.23% of the total used capacity, are cold in the system. Such long-living cold files present a significant opportunity to conserve energy in GreenHDFS.

5.3 Dormancy Characterization of the Files

The HDFS trace analysis gives information only about the files that were accessed in the one-month duration. To get a better picture, we analyzed the namespace checkpoints for historical data on the file temperatures and periods of dormancy. The namespace checkpoints contain the last access time of each file, and we used this information to calculate the dormancy of the files. The Dormancy metric is defined as the elapsed time between the last noted access time of the file and the day of observation. Figure 8 contains the frequency histograms and distributions of the dormancy. 34% of the files, amounting to 37% of the storage, in directory p present in the namespace checkpoint were not accessed in the last 40 days. 58% of the files, amounting to 53% of the storage, in directory d were not accessed in the last 40 days. The extent of dormancy exhibited in the system again shows the viability of the GreenHDFS solution. (The number of files present in the namespace checkpoints was less than half the number of files seen in the one-month trace.)

6 Evaluation

In this section, we first present our experimental platform and methodology, followed by a description of the workloads used, and then we give our experimental results. Our goal is to answer seven high-level sets of questions:

∙ How much energy is GreenHDFS able to conserve compared to a baseline HDFS with no energy management?
∙ What is the penalty of the energy management on average response time?
∙ What is the sensitivity of the various policy thresholds used in GreenHDFS on the energy savings results?
∙ How many power state transitions does a server go through on average in the Cold zone?
∙ How many migrations happen daily?
∙ How many power state transitions occur during the simulation run?
∙ Finally, what is the number of accesses that happen to the files in the Cold zone, how many days are the servers powered on, and how many migrations and reversals are observed in the system?

The following evaluation sections answer these questions, beginning with a description of our methodology and the trace workloads we use as inputs to the experiments.

6.1 Evaluation Methodology

We evaluated GreenHDFS using a trace-driven simulator. The simulator was driven by real-world HDFS traces generated by a production Hadoop cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and its data set size was 6 Petabytes.

We focused our analysis on directory d, as this directory constituted 60% of the used storage capacity in the cluster (4PB out of the 6PB total used capacity). Focusing our analysis on directory d cut down our simulation time significantly and reduced our analysis time, an important consideration given the massive scale of the traces. We used 60% of the total cluster nodes in our analysis to make the results realistic for the directory-d-only analysis. The total number of unique files seen in the HDFS traces for directory d in the one-month duration was 0.9 million. In our experiments, we compare GreenHDFS to the baseline case (HDFS without energy management). The baseline results give us the upper bound for energy consumption and the lower bound for average response time.
Simulation Platform: We used a trace-driven simulator for GreenHDFS to perform our experiments. We used models for the power levels, power state transition times, and access times of the disk, processor, and DRAM in the simulator. The GreenHDFS simulator was implemented in Java and MySQL distribution 5.1.41 and executed using Java 2 SDK, version 1.6.0-17. Performance and energy statistics were calculated based on information extracted from the datasheets of the Seagate Barracuda ES.2 (a 1TB SATA hard drive) and a quad-core Intel Xeon X5400 processor. Table 1 lists the various power levels, latencies, transition times, etc. used in the simulator. The simulator was run on 10 nodes in a development cluster at Yahoo!.
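The paper does not spell out the simulator's internals, so the following is only a minimal sketch, under the assumption that a server is in exactly one power state at a time, of how per-server energy could be accumulated from the power levels in Table 1; transition times and costs are ignored here.

    // Minimal per-server energy model: energy is the time-weighted sum of the
    // state power levels (Table 1); transition overheads are not modeled.
    class ServerEnergyModel {
        enum State { ACTIVE, IDLE, SLEEP }

        private final double activeW, idleW, sleepW;
        private double energyJoules;

        ServerEnergyModel(double activeW, double idleW, double sleepW) {
            this.activeW = activeW; this.idleW = idleW; this.sleepW = sleepW;
        }

        void accumulate(State s, double seconds) {
            double watts = (s == State.ACTIVE) ? activeW : (s == State.IDLE) ? idleW : sleepW;
            energyJoules += watts * seconds;
        }

        double kWh() { return energyJoules / 3.6e6; }

        public static void main(String[] args) {
            // Cold-zone server from Table 1: 534.62 W active, 206.78 W idle, 21.08 W sleep.
            ServerEnergyModel cold = new ServerEnergyModel(534.62, 206.78, 21.08);
            cold.accumulate(State.SLEEP, 29 * 24 * 3600.0);   // asleep most of the month
            cold.accumulate(State.ACTIVE, 1 * 24 * 3600.0);   // one day absorbing migrated data
            System.out.printf("monthly energy: %.1f kWh (~$%.2f at $0.063/kWh)%n",
                              cold.kWh(), cold.kWh() * 0.063);
        }
    }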
Table 1. Power and power-on penalties used in the simulator.

    Component                                      | Active Power (W) | Idle Power (W) | Sleep Power (W) | Power-up time
    CPU (quad-core Intel Xeon X5400 [22])          | 80-150           | 12.0-20.0      | 3.4             | 30 us
    DRAM DIMM [29]                                 | 3.5-5            | 1.8-2.5        | 0.2             | 1 us
    NIC [35]                                       | 0.7              | 0.3            | 0.3             | NA
    SATA HDD (Seagate Barracuda ES.2 1TB [16])     | 11.16            | 9.29           | 0.99            | 10 sec
    PSU [2]                                        | 50-60            | 25-35          | 0.5             | 300 us
    Hot server (2 CPU, 8 DRAM DIMM, 4 1TB HDD)     | 445.34           | 132.46         | 13.16           |
    Cold server (2 CPU, 8 DRAM DIMM, 12 1TB HDD)   | 534.62           | 206.78         | 21.08           |

6.2 Simulator Parameters

The default simulation parameters used in this paper are shown in Table 2.

Table 2. Simulator parameters.

    Parameter           | Value
    NumServers          | 1560
    NumZones            | 2
    Interval_FMP        | 1 day
    Threshold_FMP       | 5, 10, 15, 20 days
    Interval_SPC        | 1 day
    Threshold_SPC       | 2, 4, 6, 8 days
    Interval_FRP        | 1 day
    Threshold_FRP       | 1, 5, 10 accesses
    NumServersPerZone   | Hot: 1170, Cold: 390

6.3 Simulation Results

6.3.1 Energy-Conservation

In this section, we show the energy savings made possible by GreenHDFS, compared to the baseline, in one month, simply by doing power management in one of the main tenant directories of the Hadoop cluster. The cost of electricity was assumed to be $0.063/KWh. Figure 9 (left) shows a 24% reduction in the energy consumption of a 1560-server datacenter with 80% capacity utilization. Extrapolating, $2.1 million can be saved in energy costs if the GreenHDFS technique is applied to all the Hadoop clusters at Yahoo! (upwards of 38000 servers). Energy savings from powered-off servers will be further compounded in the cooling system of a real datacenter: for every Watt of power consumed by the compute infrastructure, a modern data center expends another one-half to one Watt to power the cooling infrastructure [32]. The energy-saving results underscore the importance of supporting access time recording in Hadoop compute clusters.

6.3.2 Storage-Efficiency

In this section, we show the increased storage efficiency of the Hot zone compared to the baseline. Figure 10 shows that in the baseline case, the average capacity utilization of the 1560 servers is higher than that of GreenHDFS, which has just 1170 of the 1560 servers provisioned to the Hot zone. GreenHDFS has a much higher amount of free space available in the Hot zone, which tremendously increases the potential for better data placement techniques in the Hot zone. The more aggressive the policy threshold, the more space is available in the Hot zone for truly hot data, as more data is migrated out to the Cold zone.

6.3.3 File Migrations and Reversals

Figure 10 (right-most) shows the number and total size of the files which were migrated to the Cold zone daily with a Threshold_FMP value of 10 days. Every day, on average, 6.38TB worth of data and 28.9 thousand files are migrated to the Cold zone. Since we have assumed storage-heavy servers in the Cold zone, where each server has 12 1TB disks, and assuming 80MB/sec of disk bandwidth, 6.38TB of data can be absorbed in less than 2 hours by one server. The migration policy can be run during off-peak hours to minimize any performance impact.
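For reference, the absorption estimate follows directly from the stated numbers (decimal units assumed):

    6.38 TB / (12 disks x 80 MB/s) = 6,380,000 MB / 960 MB/s ≈ 6,650 s ≈ 1.85 hours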
6.3.4 Impact of Power Management on Response Time

We examined the impact of server power management on the response time of a file which was moved to the Cold zone following a period of dormancy and was then accessed again for some reason. The files residing in the Cold zone may suffer performance degradation in two ways: 1) if the file resides on a server that is not currently powered on, the access incurs a server wakeup time penalty, and 2) transfer time degrades because there is no striping in the lower zone. The file is moved back to the Hot zone and chunked again by the File Reversal Policy. Figure 11 shows the impact on the average response time: 97.8% of the total read requests are not impacted by the power management, and an impact is seen by only 2.1% of the reads. With a less aggressive Threshold_FMP (15 or 20 days), the impact on the response time would reduce much further.
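The simulator's exact response-time model is not given, but a rough, illustrative view of where the impacted tail comes from is the following; the 10-second term is the disk power-up time from Table 1 and applies only when the target Cold-zone server is asleep, while the CPU and DRAM wake-up times (30 us and under 1 us) are negligible by comparison:

    cold-zone read ≈ wake-up penalty (≈10 s if the server is asleep) + file size / single-disk bandwidth
    hot-zone read  ≈ file size / (number of servers holding chunks x per-disk bandwidth)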
6.3.5 Sensitivity Analysis

We tried different values of the thresholds for the File Migration Policy and the Server Power Conserver Policy to understand the sensitivity of these thresholds with respect to storage-efficiency, energy-conservation, and the number of power state transitions. The impact of the various thresholds is discussed in Section 4.1.4.

[Figure 9. (Left) Energy savings with GreenHDFS and (middle) days servers in the Cold zone were ON, compared to the baseline. Energy cost savings are minimally sensitive to the policy threshold values; GreenHDFS achieves 24% savings in energy costs in one month simply by doing power management in one of the main tenant directories of the Hadoop cluster. (Right) Number of migrations and reversals in GreenHDFS with different values of the Threshold_FMP threshold.]

[Figure 10. Capacity growth and utilization in the Hot and Cold zones compared to the baseline, and daily migrations. GreenHDFS substantially increases the free space in the Hot zone by migrating cold data to the Cold zone. The left and middle charts only consider the new data that was introduced in the data directory and the old data which was accessed during the one-month period. The right chart shows the number and total size of the files migrated daily to the Cold zone with a Threshold_FMP value of 10 days.]

Threshold_FMP: We found that the energy costs are minimally sensitive to the Threshold_FMP value. As shown in Figure 9 (left), the energy cost savings varied minimally when Threshold_FMP was changed to 5, 10, 15, and 20 days.

The performance impact and the number of file reversals are minimally sensitive to the Threshold_FMP value as well. This behavior can be explained by the observation that the majority of the data in the production Hadoop cluster at Yahoo! has a news-server-like access pattern: once data is deemed cold, there is a low probability of it being accessed again.

Figure 9 (right-most) shows the total number of migrations of the files deemed cold by the File Migration Policy, and the reversals of the moved files that were later accessed by a client, in the one-month simulation run.
There were more instances of file reversals (40,170, i.e., 4% of the overall file count) with the most aggressive Threshold_FMP of 5 days. With a less aggressive Threshold_FMP of 15 days, the number of reversals in the system went down to 6,548 (i.e., 0.7% of the file count). These experiments were done with a Threshold_FRP value of 1. The number of file reversals is substantially reduced by increasing the Threshold_FRP value: with a Threshold_FRP value of 10, zero reversals happen in the system.

The storage-efficiency is sensitive to the value of the Threshold_FMP threshold, as shown in Figure 10 (left). An increase in the Threshold_FMP value results in less efficient capacity utilization of the Hot zone. A higher Threshold_FMP signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be overkill for files with a very short FileLifeSpan_CLR, as they will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

Threshold_SPC: As Figure 12 (right) illustrates, increasing the Threshold_SPC value minimally increases the number of days the servers in the Cold zone remain ON and hence minimally lowers the energy savings. On the other hand, increasing the Threshold_SPC value results in a reduction in the power state transitions, which improves the performance of the accesses to the Cold zone. Thus, a trade-off needs to be made between energy-conservation and data access performance.
[Figure 11. Performance analysis: impact on response time because of power management with a Threshold_FMP of 10 days. 97.8% of the total read requests are not impacted by the power management; an impact is seen by only 2.1% of the reads. With a less aggressive Threshold_FMP (15, 20 days), the impact on the response time reduces much more.]

[Figure 12. Sensitivity analysis: sensitivity of the number of servers used in the Cold zone, the number of power state transitions, and the capacity per zone to the File Migration Policy's age threshold and the Server Power Conserver Policy's access threshold.]

Summary of the Sensitivity Analysis: From the above evaluation, it is clear that a trade-off needs to be made in choosing the right thresholds in GreenHDFS based on an enterprise's needs. If Hot zone space is at a premium, a more aggressive Threshold_FMP needs to be used. This can be done without impacting the energy-conservation that can be derived from GreenHDFS.
6.3.6 Number of Server Power Transitions

Figure 13 (left) shows the number of power transitions incurred by the servers in the Cold zone. Frequently starting and stopping disks is suspected to affect disk longevity, and the number of start/stop cycles a disk can tolerate during its service lifetime is still limited; making the power transitions infrequent reduces the risk of running into this limit. The maximum number of power state transitions incurred by a server in a one-month simulation run is just 11, and only 1 server out of the 390 servers provisioned in the Cold zone exhibited this behavior. Most disks are designed for a maximum service lifetime of 5 years and can tolerate 500,000 start/stop cycles. Given the very small number of transitions incurred by a server in the Cold zone in a year, GreenHDFS runs no risk of exceeding the start/stop cycle limit during the service lifetime of the disks.

[Figure 13. Cold zone behavior: number of times servers transitioned power state with a Threshold_FMP of 10 days. We only show those servers in the Cold zone that either received newly cold data or had data accesses targeted at them in the one-month simulation run.]
7 Related Work

Management of energy, peak power, and temperature of data centers and warehouses is becoming the target of an increasing number of research studies. However, to the best of our knowledge, none of the existing systems exploit data-classification-driven data placement to derive energy-efficiency, nor do they have a file-system-managed, multi-zoned, hybrid data center layout. Most of the prior work focuses on workload placement to manage the thermal distribution within a data center. [30, 34] considered the placement of computational workload for energy-efficiency. Chase et al. [8] do energy-conscious provisioning, which configures switches to concentrate the request load on a minimal active set of servers for the current aggregate load level.

Le et al. [25] focus on a multi-datacenter internet service. They exploit the inherent heterogeneity among datacenters in electricity pricing, time-zone differences, and collocation with renewable energy sources to reduce energy consumption without impacting the SLA requirements of the applications. Bash et al. [5] allocate heavy computational, long-running workloads onto servers that are in more thermally-efficient places. Chun et al. [12] propose a hybrid datacenter comprising low-power Atom processors and high-power, high-performance Xeon processors; however, they do not specify any zoning in the system and focus more on task migration rather than data migration. Narayanan et al. [31] use a technique that offloads the write workload of one volume to other storage elsewhere in the data center. Meisner et al. [28] reduce power costs by transitioning servers to a "powernap" state whenever there is a period of low utilization.

In addition, there is research on hardware-level techniques such as dynamic voltage scaling as a mechanism to reduce peak power consumption in datacenters [7, 14], and Raghavendra et al. [33] coordinate hardware-level power capping with virtual machine dispatching mechanisms. Managing temperature is the subject of the systems proposed in [20].

Recent research on increasing energy-efficiency in GFS- and HDFS-managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on and which represent the lowest power setting; the remaining replicas are stored on a larger set of secondary nodes, and performance is scaled up by increasing the number of secondary nodes. However, these solutions suffer from degraded write-performance and increased DFS code complexity. These solutions also do not do any data differentiation and treat all the data in the system alike.

Existing highly scalable file systems such as the Google file system [19] and HDFS [37] do not do energy management. Recently, an energy-efficient log-structured file system was proposed by Hakim et al. [18]; however, it aims to concentrate load on one disk at a time and hence this design will impact availability and performance.
8 Conclusion and Future Work

We presented a detailed evaluation and sensitivity analysis of GreenHDFS, a policy-driven, self-adaptive variant of the Hadoop Distributed File System. GreenHDFS relies on data-classification-driven data placement to realize guaranteed, substantially long periods of idleness in a significant subset of servers in the datacenter. Detailed experimental results with real-world traces from a production Yahoo! Hadoop cluster show that GreenHDFS is capable of achieving 24% savings in the energy costs of a Hadoop cluster by doing power management in only one of the main tenant top-level directories in the cluster. These savings will be further compounded by savings in cooling costs. Detailed lifespan analysis of the files in a large-scale production Hadoop cluster at Yahoo! points at the viability of GreenHDFS. Evaluation results show that GreenHDFS is able to meet all the scale-down mandates (i.e., it generates significant idleness in the cluster, results in very few power state transitions, and doesn't degrade write performance) in spite of the unique scale-down challenges present in a Hadoop cluster.

9 Acknowledgement

This work was supported by NSF grant CNS 05-51665 and an internship at Yahoo!. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF or the U.S. government.

References

[1] Apache Hadoop. http://hadoop.apache.org/.
[2] Introduction to power supplies. National Semiconductor, 2002.
[3] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing, pages 217-228, New York, NY, USA, 2010. ACM.
[4] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, 40(12), 2007.
[5] C. Bash and G. Forman. Cool job allocation: measuring the power savings of placing jobs at cooling-efficient locations in the data center. In ATC '07: Proceedings of the 2007 USENIX Annual Technical Conference, pages 1-6, Berkeley, CA, USA, 2007. USENIX Association.
[6] C. Belady. In the data center, power and cooling costs more than the IT equipment it supports. Electronics Cooling, February 2010.
[7] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In HPCA, pages 171-, 2001.
[8] J. S. Chase and R. P. Doyle. Balance of power: Energy management for server clusters. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS), 2001.
[9] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao. Energy-aware server provisioning and load dispatching for connection-intensive internet services. In NSDI '08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, Berkeley, CA, USA, 2008. USENIX Association.
[10] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam. Managing server energy and operational costs in hosting centers. SIGMETRICS Perform. Eval. Rev., 33(1), 2005.
[11] Y. Chen, A. Ganapathi, A. Fox, R. H. Katz, and D. A. Patterson. Statistical workloads for energy efficient MapReduce. Technical report, UC Berkeley, 2010.
[12] B.-G. Chun, G. Iannaccone, G. Iannaccone, R. Katz, G. Lee, and L. Niccolini. An energy case for hybrid datacenters. In HotPower, 2009.
[13] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation. USENIX Association, 2004.
[14] M. E. Femal and V. W. Freeh. Boosting data center performance through non-uniform power allocation. In ICAC '05: Proceedings of the Second International Conference on Automatic Computing, Washington, DC, USA, 2005. IEEE Computer Society.
[15] E. Baldeschwieler, Yahoo! Inc. Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010.
[16] Seagate Barracuda ES.2 datasheet. http://www.seagate.com/staticfiles/support/disc/manuals/nl35 series & bc es series/barracuda es.2 series/100468393e.pdf, 2008.
[17] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 13-23, New York, NY, USA, 2007. ACM.
[18] L. Ganesh, H. Weatherspoon, M. Balakrishnan, and K. Birman. Optimizing power consumption in large scale storage systems. In HotOS '07: Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems, Berkeley, CA, USA, 2007. USENIX Association.
[19] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29-43, 2003.
[20] T. Heath, A. P. Centeno, P. George, L. Ramos, Y. Jaluria, and R. Bianchini. Mercury and Freon: temperature emulation and management for server systems. In ASPLOS, pages 106-116, 2006.
[21] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, May 2009.
[22] Intel. Quad-core Intel Xeon processor 5400 series. 2008.
[23] R. T. Kaushik and M. Bhandarkar. GreenHDFS: Towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. In HotPower, 2010.
[24] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Symposium on Massive Storage Systems and Technologies, 2010.
[25] K. Le, R. Bianchini, M. Martonosi, and T. Nguyen. Cost- and energy-aware load distribution across data centers. In HotPower, 2009.
[26] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. In HotPower, 2009.
[27] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. SIGOPS Oper. Syst. Rev., 44(1):61-65, 2010.
[28] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: eliminating server idle power. In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 205-216, New York, NY, USA, 2009. ACM.
[29] Micron. DDR2 SDRAM SODIMM. 2004.
[30] J. Moore, J. Chase, P. Ranganathan, and R. Sharma. Making scheduling "cool": temperature-aware workload placement in data centers. In ATEC '05: Proceedings of the USENIX Annual Technical Conference, pages 5-5, Berkeley, CA, USA, 2005. USENIX Association.
[31] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. Trans. Storage, 4(3):1-23, 2008.
[32] C. Patel, E. Bash, R. Sharma, and M. Beitelmal. Smart cooling of data centers. In Proceedings of the Pacific Rim/ASME International Electronics Packaging Technical Conference and Exhibition (IPACK '03), 2003.
[33] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu. No "power" struggles: coordinated multi-level power management for the data center. In ASPLOS XIII, pages 48-59, New York, NY, USA, 2008. ACM.
[34] R. K. Sharma, C. E. Bash, C. D. Patel, R. J. Friedrich, and J. S. Chase. Balance of power: Dynamic thermal management for internet data centers. IEEE Internet Computing, 9:42-49, 2005.
[35] SMSC. LAN9420/LAN9420i single-chip Ethernet controller with HP Auto-MDIX support and PCI interface. 2008.
[36] N. Tolia, Z. Wang, M. Marwah, C. Bash, P. Ranganathan, and X. Zhu. Delivering energy proportionality with non energy-proportional systems: optimizing the ensemble. In HotPower, 2008.
[37] T. White. Hadoop: The Definitive Guide. O'Reilly Media, May 2009.