This document summarizes a research paper that proposes GreenHDFS, an energy-efficient variant of HDFS that uses data classification to place data in hot and cold zones for power management. The authors analyzed file access patterns and lifespans in a large Yahoo! HDFS cluster and found that: 1) Patterns and lifespans varied significantly across directories; 2) 60% of data was cold/unused but needed for regulatory/historical purposes; and 3) 95-98% of files were hot for less than 3 days, though one directory had longer lifespans. GreenHDFS aims to generate long idle periods to power down servers while maintaining performance.
Survey on Performance of Hadoop MapReduce Optimization Methods - paperpublications3
Abstract: Hadoop is an open-source software framework for storing and processing large-scale datasets on clusters of commodity hardware. Hadoop provides a reliable shared storage and analysis system: storage is provided by HDFS and analysis by MapReduce. MapReduce frameworks are foraying into the domain of high-performance computing, with stringent non-functional requirements such as execution time and throughput. MapReduce provides a simple programming interface with two functions, map and reduce, which can be executed automatically in parallel on a cluster without any intervention from the programmer. Moreover, MapReduce offers other benefits, including load balancing, high scalability, and fault tolerance. The challenge arises when data is produced dynamically and continuously from different geographical locations. For dynamically generated data, an efficient algorithm is needed to guide the timely transfer of data into the cloud, and for geo-dispersed data sets the best data center must be selected to aggregate all data, since a MapReduce-like framework is most efficient when the data to be processed is in one place rather than spread across data centers, owing to the enormous overhead of inter-data-center data movement in the shuffle and reduce stages. Recently, many researchers have implemented and deployed data-intensive and/or computation-intensive algorithms on the MapReduce parallel computing framework for high processing efficiency.
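As a rough illustration of the two-function interface described above, the following word-count sketch mimics the Hadoop Streaming contract (mapper emits key/value pairs, the framework groups by key, the reducer folds each group). It is a minimal local simulation I wrote for this summary, not code from the surveyed paper; the word-count task itself is an illustrative assumption.

```python
#!/usr/bin/env python3
"""Minimal word-count sketch in the MapReduce style (illustrative only)."""
import sys
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    # Emit (word, 1) for every word in the input, as a streaming mapper would.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    # The framework normally sorts/shuffles pairs by key; sorted() stands in here.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```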
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned - DataWorks Summit
Scientific data services are a critical aspect of the NASA Center for Climate Simulation's (NCCS) mission. Hadoop, via MapReduce, provides an approach to high-performance analytics that is proving useful for data-intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. The NCCS is particularly interested in the potential of Hadoop to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we prototyped a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. The initial focus was on averaging operations over arbitrary spatial and temporal extents within Modern-Era Retrospective Analysis for Research and Applications (MERRA) data. After preliminary results suggested that this approach improves efficiencies within data-intensive analytic workflows, we invested in building a cyberinfrastructure resource for developing a new generation of climate data analysis capabilities using Hadoop. This resource is focused on reducing the time spent preparing reanalysis data used in data-model intercomparison, a long-sought goal of the climate community. This paper summarizes the related use cases and lessons learned.
IRJET - A Novel Approach to Process Small HDFS Files with Apache Spark - IRJET Journal
This document proposes a novel approach to improve the efficiency of processing small files in the Hadoop Distributed File System (HDFS) using Apache Spark. It discusses how HDFS is optimized for large files but suffers from low efficiency when handling many small files. The proposed approach uses Spark to judge file sizes, merge small files to improve block utilization, and process the files in-memory for faster performance compared to the traditional MapReduce approach. Evaluation results show the Spark-based system reduces NameNode memory usage and improves processing speeds by up to 100 times compared to conventional Hadoop processing.
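The merge step could look roughly like the PySpark sketch below. It reads a directory of many small files and rewrites them as a handful of larger files so fewer HDFS blocks (and NameNode objects) are needed; the paths, the coalescing heuristic, and the target sizing are my assumptions for illustration, not details from the paper.

```python
from pyspark.sql import SparkSession

# Assumed locations; a real deployment would parameterize these.
SMALL_FILES_DIR = "hdfs:///data/incoming_small_files"
MERGED_DIR = "hdfs:///data/merged"

spark = SparkSession.builder.appName("merge-small-files").getOrCreate()

df = spark.read.text(SMALL_FILES_DIR)

# Rough heuristic: size the output by the number of input partitions rather
# than exact byte counts (a real system would inspect file sizes first).
num_output_files = max(1, df.rdd.getNumPartitions() // 32)

df.coalesce(num_output_files).write.mode("overwrite").text(MERGED_DIR)
spark.stop()
```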
Building a geospatial processing pipeline using Hadoop and HBase and how Mons... - DataWorks Summit
Monsanto uses geospatial data and analytics to improve sustainable agriculture. They process vast amounts of spatial data on Hadoop to generate prescription maps that optimize seeding rates. Their previous SQL-based system could only handle a small fraction of the data and took over 30 days to process. Monsanto's new Hadoop/HBase architecture loads the entire US dataset in 18 hours, representing significant cost savings over the SQL approach. This foundational system provides agronomic insights to farmers and supports Monsanto's vision of doubling yields by 2030 through information-driven farming.
This document describes a proposed architecture for improving data retrieval performance in a Hadoop Distributed File System (HDFS) deployed in a cloud environment. The key aspects are:
1) A web server would replace the map phase of MapReduce to provide faster searching of data. The web server uses multi-level indexing for real-time processing on HDFS.
2) An Apache load balancer distributes requests across backend application servers to improve throughput and scalability.
3) The NameNode is divided into master and slave servers, with the master containing the multi-level index and slaves storing data and lower-level indexes. This allows distributed data retrieval.
Cisco and Greenplum Partner to Deliver High-Performance Hadoop Reference ... - EMC
The document describes a partnership between Cisco and Greenplum to deliver optimized high-performance Hadoop reference configurations. Key elements include:
- Greenplum MR provides a high-performance distribution of Hadoop with features like direct data access, high availability, and advanced management.
- Cisco UCS is the exclusive hardware platform and provides a flexible, scalable computing platform optimized for Hadoop workloads.
- The Cisco Greenplum MR Reference Configuration combines these software and hardware components into an integrated solution for running Hadoop and big data analytics workloads.
Asserting that Big Data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly Big Data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones, tablets, etc.), social Web sites (e.g., Facebook, Twitter, LinkedIn, etc.) and other sources like GPS, Google Maps, heat/pressure sensors, etc.
IRJET - Performing Load Balancing between NameNodes in HDFS - IRJET Journal
This document proposes a new architecture for HDFS to address the single point of failure issue of the NameNode. The current HDFS architecture uses a single NameNode that manages file system metadata. If it fails, the entire system fails. The proposed architecture uses multiple interconnected NameNodes that maintain mirrors of each other's metadata using the Chord system. This allows load balancing between NameNodes and prevents failure if one NameNode goes down, as other NameNodes can handle the load and client requests/responses. The goal is to improve scalability, availability and reduce downtime of the NameNode in HDFS through this new distributed architecture.
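In a Chord-style design, each NameNode takes a position on a hash ring and a metadata key is served by its successor node. The toy sketch below only illustrates that successor lookup; the node names, the 2^32 ring size, and the use of SHA-1 are assumptions for illustration, not the paper's implementation.

```python
import hashlib
from bisect import bisect_right

RING_BITS = 32  # toy ring of size 2^32


def ring_hash(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** RING_BITS)


class ChordRing:
    def __init__(self, nodes):
        # Sort nodes by ring position once; lookups are then a binary search.
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def successor(self, key: str) -> str:
        positions = [pos for pos, _ in self._ring]
        idx = bisect_right(positions, ring_hash(key))
        # Wrap around to the first node when the key hashes past the last one.
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = ChordRing(["namenode-1", "namenode-2", "namenode-3"])
    print(ring.successor("/user/alice/part-00000"))  # node owning this path's metadata
```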
This document discusses a proposed data-aware caching framework called Dache that could be used with big data applications built on MapReduce. Dache aims to cache intermediate data generated during MapReduce jobs to avoid duplicate computations. When tasks run, they would first check the cache for existing results before running the actual computations. The goal is to improve efficiency by reducing redundant work. The document outlines the objectives and scope of extending MapReduce with Dache, provides background on MapReduce and Hadoop, and concludes that initial experiments show Dache can eliminate duplicate tasks in incremental jobs.
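The check-cache-then-compute behaviour could be sketched as follows. The in-memory dictionary stands in for the cache service, and keying on a hash of the operation name plus input split is my assumption about how such a framework might identify duplicate work; this is not Dache's actual protocol.

```python
import hashlib

_cache: dict[str, bytes] = {}  # stand-in for a shared cache service


def cache_key(operation: str, input_split: bytes) -> str:
    digest = hashlib.sha256(input_split).hexdigest()
    return f"{operation}:{digest}"


def run_task(operation: str, input_split: bytes, compute):
    key = cache_key(operation, input_split)
    if key in _cache:
        return _cache[key]          # duplicate computation avoided
    result = compute(input_split)   # fall back to the real map/reduce work
    _cache[key] = result
    return result


if __name__ == "__main__":
    split = b"2015-01-01 click-log records ..."
    word_count = lambda data: str(len(data.split())).encode()
    print(run_task("wordcount-map", split, word_count))  # computed
    print(run_task("wordcount-map", split, word_count))  # served from cache
```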
Integrating DBMSs as a read-only execution layer into Hadoop - João Gabriel Lima
The document discusses integrating database management systems (DBMSs) as a read-only execution layer into Hadoop. It proposes a new system architecture that incorporates modified DBMS engines augmented with a customized storage engine capable of directly accessing data from the Hadoop Distributed File System (HDFS) and using global indexes. This allows DBMS to efficiently provide read-only operators while not being responsible for data management. The system is designed to address limitations of HadoopDB by improving performance, fault tolerance, and data loading speed.
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Big data processing using Hadoop Technology - Shital Kat
This document summarizes a report on Hadoop technology as a solution to big data processing. It discusses the big data problem, including defining big data, its characteristics and challenges. It then introduces Hadoop as a solution, describing its components HDFS for storage and MapReduce for parallel processing. Examples of common friend lists and word counting are provided. Finally, it briefly mentions some Hadoop projects and companies that use Hadoop.
Survey of Parallel Data Processing in Context with MapReduce - cscpconf
MapReduce is a parallel programming model and an associated implementation introduced by Google. In the programming model, a user specifies the computation through two functions, Map and Reduce. The underlying MapReduce library automatically parallelizes the computation and handles complicated issues like data distribution, load balancing, and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed at parallelizing computation on large clusters of commodity machines. This paper gives an overview of the MapReduce programming model and its applications, describes the workflow of the MapReduce process, studies some important issues, such as fault tolerance, in more detail, and illustrates how MapReduce works. The data locality issue in heterogeneous environments can noticeably reduce MapReduce performance. The author addresses the distribution of data across nodes so that each node carries a balanced data-processing load stored in parallel. For a data-intensive application running on a Hadoop MapReduce cluster, the author shows how data placement is done in the Hadoop architecture, the role of MapReduce within it, and the amount of data stored in each node to achieve improved data-processing performance.
CloudBATCH: a batch job queuing system on clouds with Hadoop and HBase - João Gabriel Lima
CloudBATCH is a batch job queuing system that allows Hadoop to function like a traditional batch job queuing system with enhanced functionality. It uses HBase tables to store resource management information like user credentials and job/queue details. Job brokers submit "wrappers" (MapReduce programs) as agents to execute jobs in Hadoop. Users can submit and check the status of jobs through a client. CloudBATCH aims to enable Hadoop to manage hybrid computing needs on a cluster by providing traditional management features like user access control, accounting, and customizable job scheduling.
The document describes Megastore, a storage system developed by Google to meet the requirements of interactive online services. Megastore blends the scalability of NoSQL databases with the features of relational databases. It uses partitioning and synchronous replication across datacenters using Paxos to provide strong consistency and high availability. Megastore has been widely deployed at Google to handle billions of transactions daily storing nearly a petabyte of data across global datacenters.
Characterization of Hadoop jobs using unsupervised learning - João Gabriel Lima
This document summarizes research characterizing Hadoop jobs using unsupervised learning techniques. The researchers clustered over 11,000 Hadoop jobs from Yahoo production clusters into 8 groups based on job metrics. The centroids of each cluster represent characteristic jobs and show differences in map/reduce tasks and data processed. Identifying common job profiles can help benchmark and optimize Hadoop performance.
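A clustering pass of this kind might look like the scikit-learn sketch below. The metric columns, the synthetic data, and the log scaling are assumptions for illustration; only the choice of k=8 mirrors the eight groups reported in the summary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-job metrics; columns are assumed to be
# map tasks, reduce tasks, HDFS bytes read, HDFS bytes written.
rng = np.random.default_rng(0)
metrics = rng.lognormal(mean=3.0, sigma=1.0, size=(11_000, 4))

scaled = StandardScaler().fit_transform(np.log1p(metrics))
model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(scaled)

# The centroids characterize the "typical" job in each group.
for label, centroid in enumerate(model.cluster_centers_):
    print(f"cluster {label}: centroid (scaled log metrics) = {np.round(centroid, 2)}")
```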
This document summarizes a research paper that analyzes and evaluates the performance of processing large data sets using Hadoop. It discusses how Hadoop Distributed File System (HDFS) and MapReduce provide parallel and distributed processing of large structured and unstructured data at scale. The paper also presents the results of experiments conducted on Hadoop to classify and cluster large data sets using machine learning algorithms. The experiments showed that Hadoop can process large data sets more efficiently and reliably compared to processing on a single computer.
Attaching cloud storage to a campus grid using Parrot, Chirp, and Hadoop - João Gabriel Lima
This document discusses attaching cloud storage to a campus grid using Parrot, Chirp, and Hadoop. The authors present a solution that bridges the Chirp distributed filesystem to Hadoop to provide simple access to large datasets on Hadoop for jobs running on the campus grid. Chirp layers additional grid computing features on top of Hadoop like simple deployment without special privileges, easy access via Parrot, and strong flexible security access control lists. The authors evaluate the performance of connecting Parrot directly to Hadoop for better scalability versus connecting Parrot to Chirp and then to Hadoop for greater stability.
This paper discusses implementing NoSQL databases for robotics applications. NoSQL databases are well-suited for robotics because they can store massive amounts of data, retrieve information quickly, and easily scale. The paper proposes using a NoSQL graph database to store robot instructions and relate them according to tasks. MapReduce processing is also suggested to break large robot data problems into parallel pieces. Implementing a NoSQL system would allow building more intelligent humanoid robots that can process billions of objects and learn quickly from massive sensory inputs.
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al... - IRJET Journal
This document presents two variations of a job-driven scheduling scheme called JOSS for efficiently executing MapReduce jobs on remote outsourced data across multiple data centers. The goal of JOSS is to improve data locality for map and reduce tasks, avoid job starvation, and improve job performance. Extensive experiments show that the two JOSS variations, called JOSS-T and JOSS-J, outperform other scheduling algorithms in terms of data locality and network overhead without significant overhead. JOSS-T performs best for workloads of small jobs, while JOSS-J provides the shortest workload time for jobs of varying sizes distributed across data centers.
A SQL implementation on the MapReduce framework - eldariof
Tenzing is a SQL query engine built on top of MapReduce for analyzing large datasets at Google. It provides low latency queries over heterogeneous data sources at petabyte scale. Tenzing supports a full SQL implementation with extensions and optimizations to achieve high performance comparable to commercial parallel databases by leveraging MapReduce and traditional database techniques. It is used by over 1,000 Google employees to query over 1.5 petabytes of data and serve 10,000 queries daily.
A comparative survey based on processing network traffic data using hadoop pi... - ijcses
Big data analysis has now become an integral part of many computational and statistical departments. Analysis of petabyte-scale data has taken on increased importance in the present-day scenario. Big data manipulation is now considered a key area of research in the field of data analytics, and novel techniques are evolving day by day. Thousands of transaction requests are processed every minute by different websites related to e-commerce, shopping carts, and online banking. Here arises the need for network traffic and weblog analysis, for which Hadoop is a suggested solution: it can efficiently process the NetFlow data collected from routers, switches, or even website access logs at fixed intervals.
Global-warming big data is analyzed and processed using Pig, Hive, MapReduce, HDFS, and HBase. Results are visualized using Tableau and stored in HBase as part of the Programming in Data Analytics project.
EMC Isilon Best Practices for Hadoop Data Storage - EMC
This document provides best practices for setting up and managing HDFS on an EMC Isilon cluster to optimize storage for Hadoop analytics. Key points include:
- An Isilon cluster implements the HDFS protocol and presents every node as both a namenode and datanode for redundancy and load balancing.
- Virtual racks can mimic data locality to optimize performance.
- Enterprise features like SmartPools, deduplication, and InsightIQ help manage and monitor large Hadoop data sets on the Isilon platform.
The document describes the Seagate Hadoop Workflow Accelerator, which enables organizations to optimize Hadoop workflows and centralize data storage. It accelerates Hadoop applications by leveraging ClusterStor's high-performance Lustre parallel file system and bypassing the HDFS software layer. This provides improved Hadoop performance, flexibility to scale compute and storage independently, and reduced total cost of ownership.
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ... - CitiusTech
This document discusses de-duplicating data in a healthcare data lake using big data processing frameworks. It describes keeping duplicate records and querying the latest one, or rewriting records to create a golden copy. The preferred approach uses Spark to partition data, identify new/updated records, de-duplicate by selecting the latest from incremental and refined data, and overwrite only affected partitions. This creates a non-ambiguous, de-duplicated dataset for analysis in a scalable and cost-effective manner.
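The "keep the latest record per key and rewrite only affected partitions" step could be expressed roughly as in the Spark sketch below. The column names, partition column, and paths are assumptions for illustration, and the output is written to a separate location rather than back over the input; this is not the exact pipeline from the document.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-refined-zone").getOrCreate()

# Assumed input locations for incremental and already-refined records.
incremental = spark.read.parquet("hdfs:///lake/incremental/claims")
refined = spark.read.parquet("hdfs:///lake/refined/claims")

combined = incremental.unionByName(refined)
latest_first = Window.partitionBy("record_id").orderBy(F.col("updated_at").desc())

# Rank versions of each record and keep only the most recent one.
deduped = (combined
           .withColumn("rn", F.row_number().over(latest_first))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Dynamic partition overwrite rewrites only the partitions present in `deduped`.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(deduped.write
        .mode("overwrite")
        .partitionBy("service_date")
        .parquet("hdfs:///lake/refined/claims_dedup"))  # assumed staging output
```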
The document discusses GreenHDFS, a self-adaptive variant of HDFS that aims to reduce energy consumption. It does this through techniques like data classification to place "hot" and "cold" data in different zones, power management policies to transition servers to low-power states, and machine learning to predict file access patterns and inform placement. An evaluation of GreenHDFS found it reduced energy consumption by 24% and saved $2.1 million annually in a 38,000 server cluster.
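A simple temperature policy of this kind might classify files by time since last access, as in the sketch below. The 30-day dormancy threshold and the file-metadata shape are illustrative assumptions, not GreenHDFS's actual policy parameters.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed dormancy threshold: files untouched for longer go to the cold zone.
DORMANCY_THRESHOLD = timedelta(days=30)


@dataclass
class FileMeta:
    path: str
    last_access: datetime


def classify(files, now=None):
    """Return a hot/cold zone placement for each file based on last access."""
    now = now or datetime.utcnow()
    placement = {}
    for f in files:
        zone = "cold" if now - f.last_access > DORMANCY_THRESHOLD else "hot"
        placement[f.path] = zone
    return placement


if __name__ == "__main__":
    files = [
        FileMeta("/logs/2010/part-0001", datetime(2010, 1, 5)),
        FileMeta("/feeds/today/events", datetime.utcnow()),
    ]
    print(classify(files))  # old log goes cold, fresh feed stays hot
```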
Strata + Hadoop World 2012: HDFS: Now and Future - Cloudera, Inc.
Hadoop 1.0 is a significant milestone as the most stable and robust Hadoop release, tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, WebHDFS, and more over previous releases. The next major release, Hadoop 2.0, offers several significant HDFS improvements, including a new append pipeline, federation, wire compatibility, NameNode HA, and further performance improvements. We describe how to take advantage of the new features and their benefits. We also discuss some of the misconceptions and myths about HDFS.
Architecting a Scalable Hadoop Platform: Top 10 considerations for success - DataWorks Summit
This document discusses 10 considerations for architecting a scalable Hadoop platform:
1. Choosing between on-premise or public cloud deployment.
2. Evaluating total cost of ownership which includes hardware, software, support and other recurring costs.
3. Configuring hardware including servers, storage, networking and heterogeneous resources.
4. Ensuring a high performance network backbone that avoids bottlenecks.
5. Maintaining a software stack that focuses on use cases over specific technologies.
Next Generation Hadoop: High Availability for YARN - Arinto Murdopo
The document proposes a new architecture for YARN to solve its availability limitation of single-point-of-failure in the resource manager. The key aspects of the proposed architecture are:
1. It utilizes a stateless failure model where all necessary states and information used by the resource manager are stored in a persistent storage.
2. MySQL Cluster (NDB) is proposed as the storage technology due to its high availability, linear scalability, and high throughput of up to 1.8 million writes per second.
3. A proof-of-concept implementation was done using NDB to store application states and their corresponding attempts. Evaluations showed the architecture is able to increase YARN's availability and NDB
Energy efficient task scheduling algorithms for cloud data centers - eSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
Energy efficient task scheduling algorithms for cloud data centers - eSAT Journals
Abstract
The lifetime of Wireless Sensor Networks (WSNs) has always been an important issue and has received increased attention in recent years. Typically, wireless sensor nodes are equipped with low-power batteries that are impossible to recharge, yet wireless sensor networks need enough energy to satisfy the requirements of their applications. In this paper, we propose the Energy Efficient Routing and Fault Node Replacement (EERFNR) algorithm to extend the lifetime of a wireless sensor network, reduce data loss, and reduce sensor node replacement cost. The transmission problem and the sensor node loading problem are solved by adding several relay nodes and composing the sensor nodes' routing using hierarchical gradient diffusion. A sensor node can save backup nodes to reduce the energy spent re-searching the route when its routing is broken. A genetic algorithm calculates which sensor nodes to exchange, applying the most available routing paths while replacing the fewest sensor nodes.
Keywords: genetic algorithm, hierarchical gradient diffusion, grade diffusion, wireless sensor networks
Energy efficient task scheduling algorithms for cloud data centers - eSAT Journals
Abstract: Cloud computing is a modern technology in which a network of systems forms a cloud. Energy conservation is one of the major concerns in cloud computing. A large amount of energy is wasted by computers and other devices, and carbon dioxide is released into the atmosphere, polluting the environment. Green computing is an emerging technology that focuses on preserving the environment by reducing various kinds of pollution, including excessive greenhouse-gas emissions and the disposal of e-waste, which contribute to the greenhouse effect. Pollution therefore needs to be reduced by lowering energy usage, but without reducing resource utilization: with less energy, maximum resource utilization should still be possible. For this purpose, many green task-scheduling algorithms are used so that energy consumption in the servers of cloud data centers can be minimized. In this paper, the ESF-ES algorithm is developed, which focuses on minimizing energy consumption by minimizing the number of servers used. A comparison is made with hybrid algorithms and the most-efficient-server-first scheme. Keywords: Cloud computing, Green computing, Energy efficiency, Green data centers, Task scheduling.
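The "use as few servers as possible" idea resembles bin packing. The greedy first-fit sketch below only illustrates that consolidation principle under assumed task demands and server capacity; it is not the paper's ESF-ES algorithm.

```python
def first_fit_schedule(task_demands, server_capacity):
    """Place tasks on the fewest servers using first-fit; returns (placement, servers used)."""
    servers = []  # each entry is the remaining capacity of an active server
    placement = []
    for task_id, demand in enumerate(task_demands):
        for server_id, remaining in enumerate(servers):
            if demand <= remaining:
                servers[server_id] -= demand
                placement.append((task_id, server_id))
                break
        else:
            # No active server can host the task: power on a new one.
            servers.append(server_capacity - demand)
            placement.append((task_id, len(servers) - 1))
    return placement, len(servers)


if __name__ == "__main__":
    # Sort descending (first-fit decreasing) to pack tightly before placing;
    # the demands and unit capacity are invented for illustration.
    demands = sorted([0.6, 0.2, 0.5, 0.3, 0.7, 0.1], reverse=True)
    plan, active_servers = first_fit_schedule(demands, server_capacity=1.0)
    print(plan, "servers powered on:", active_servers)
```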
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera - Mark Kerzner
The document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It describes Hadoop's core components - the Hadoop Distributed File System (HDFS) for scalable data storage, and MapReduce for distributed processing of large datasets in parallel. Typical problems suited for Hadoop involve complex data from multiple sources that need to be consolidated, stored inexpensively at scale, and processed in parallel across the cluster.
Energy aware load balancing and application scaling for the cloud ecosystem - Pvrtechnologies Nellore
This document summarizes an article that introduces an energy-aware operation model for load balancing and application scaling in cloud computing environments. It aims to define an energy-optimal operation regime for servers and maximize the number operating in this regime. Idle and lightly loaded servers would be switched to low-power sleep states to save energy. The document provides background on energy efficiency in data centers and systems, discusses related work, and outlines the contributions and evaluation approach of the article.
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10... - Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop's scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone into making the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges of scale, security, and multi-tenancy that we have dealt with in the last several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and networking, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo's Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
This document discusses the problem of storing and processing many small files in HDFS and Hadoop. It introduces the concept of "harballing" where Hadoop uses an archiving technique called Hadoop Archive (HAR) to collect many small files into a single large file to reduce overhead on the namenode and improve performance. HAR packs small files into an archive file with a .har extension so the original files can still be accessed efficiently and in parallel. This reduction of small files through harballing increases scalability by reducing namespace usage and load on the namenode.
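Creating a Hadoop Archive is done with the `hadoop archive` tool; the sketch below simply wraps that command from Python. The archive name and the source and destination paths are illustrative assumptions.

```python
import subprocess

# Assumed HDFS locations for this example.
archive_name = "smallfiles.har"
parent_dir = "/user/etl/incoming"      # parent of the directories to archive
source_subdirs = ["logs", "metrics"]   # subdirectories to pack into the HAR
dest_dir = "/user/etl/archives"        # where the .har archive is written

# Equivalent to: hadoop archive -archiveName smallfiles.har -p /user/etl/incoming logs metrics /user/etl/archives
cmd = ["hadoop", "archive", "-archiveName", archive_name,
       "-p", parent_dir, *source_subdirs, dest_dir]
subprocess.run(cmd, check=True)

# The archived files remain readable through the har:// scheme, e.g.:
#   hdfs dfs -ls har:///user/etl/archives/smallfiles.har/logs
```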
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
This document provides an introduction to Hadoop, including:
- Hadoop challenges such as deployment, change management, and complexity in tuning its many parameters
- The main node types in Hadoop including NameNode, DataNode, and EdgeNode
- Common uses of Hadoop including distributed computing, storage, and presenting data in a SQL-like format for analysis
What it takes to run Hadoop at Scale: Yahoo! Perspectives - DataWorks Summit
This document discusses considerations for scaling Hadoop platforms at Yahoo. It covers topics such as deployment models (on-premise vs. public cloud), total cost of ownership, hardware configuration, networking, software stack, security, data lifecycle management, metering and governance, and debunking myths. The key takeaways are that utilization matters for cost analysis, hardware becomes increasingly heterogeneous over time, advanced networking designs are needed to avoid bottlenecks, security and access management must be flexible, and data lifecycles require policy-based management.
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER - ijdpsjournal
With the interconnection of ever more computers online, the data shared by users is multiplying daily. As a result, the amount of data to be processed by dedicated servers rises very quickly. However, the instantaneous increase in the volume of data to be processed by the server runs into latency during processing. This requires a model to manage the distribution of tasks across several machines. This article presents a study of load balancing for large data sets on a cluster of Hadoop nodes. In this paper, we use MapReduce to implement parallel programming and YARN to monitor task execution and submission in a node cluster.
Hadoop As The Platform For The Smartgrid At TVA - Cloudera, Inc.
Cloudera's Josh Paterson presented how Hadoop is used as the platform for smartgrid technologies at the Tennessee Valley Authority. This presentation encompasses a retrospective on the openPDC project, what Hadoop is, current smartgrid obstacles, and Cloudera Enterprise as The New Smartgrid Platform.
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides - Cloudera, Inc.
Slides describing Cloudera and Karmasphere, and how, combined, their products can install a Hadoop cluster, import data, run queries, and generate results.
AN INTEGER-LINEAR ALGORITHM FOR OPTIMIZING ENERGY EFFICIENCY IN DATA CENTERS - ijfcstjournal
This document summarizes an integer-linear programming model for optimizing energy efficiency in data centers. The model seeks to minimize energy consumption and maximize user satisfaction by determining the optimal migration of requests between resources/servers. Key aspects of the model include: (1) defining variables to represent server capacity usage and migrations between servers, (2) formulating an objective function that minimizes setup costs, energy usage, and migration costs, and (3) adding constraints to ensure capacity limits are met and minimum servers remain active. Relaxing integer variables to continuous values provides faster approximating solutions. Near-feasible solutions with remissible infeasibility are also considered.
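A toy version of such a model can be written with an off-the-shelf ILP solver. The sketch below, using the PuLP library with invented loads, capacities, and costs, only illustrates the binary activate/assign structure; it is not the paper's actual formulation (which also models migrations and user satisfaction).

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# Invented instance: request loads, server capacities, and per-server energy cost.
requests = {"r1": 4, "r2": 3, "r3": 5}
servers = {"s1": 8, "s2": 8}
energy_cost = {"s1": 10, "s2": 12}

prob = LpProblem("min_energy_placement", LpMinimize)
assign = {(r, s): LpVariable(f"x_{r}_{s}", cat=LpBinary)
          for r in requests for s in servers}
active = {s: LpVariable(f"y_{s}", cat=LpBinary) for s in servers}

# Objective: total energy cost of the servers left switched on.
prob += lpSum(energy_cost[s] * active[s] for s in servers)

# Every request is placed on exactly one server.
for r in requests:
    prob += lpSum(assign[r, s] for s in servers) == 1

# Capacity holds, and a server must be active to host anything.
for s in servers:
    prob += lpSum(requests[r] * assign[r, s] for r in requests) <= servers[s] * active[s]

prob.solve()
print({s: int(active[s].value()) for s in servers})  # which servers stay on
```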
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
The document discusses the steps of a data analysis project and provides a case study example. The key steps are:
1) Defining the business problem and purpose of the analysis.
2) Choosing and preparing appropriate data sources.
3) Applying relevant techniques such as data preparation, pattern recognition, and data analysis.
4) Evaluating the results and delivering value or insights to address the original business problem.
The case study examines building a neuromarketing tool using AI to predict areas of visual attention in images and memory retention. Pattern recognition techniques are trained on labeled datasets to help identify these patterns autonomously.
We are a company that delivers value to our customers by lowering costs with digital marketing and increasing the efficiency of campaigns and their conversions. Using the most advanced artificial intelligence models from a neuromarketing perspective, we have been able to predict the effectiveness of a marketing campaign before it is published. After publication, we evaluate the campaign, segmenting the audience according to the pattern extracted from each market segment and delivering information for strategic and efficient management.
High-Performance Applications with JHipster Full Stack - João Gabriel Lima
A presentation on the JHipster framework for building full-stack Java applications. The framework provides generators, scaffolding, and structures to create scalable web applications and APIs with Spring Boot and Angular/React. The document discusses JHipster's architecture, project generation, folder structure, debugging, production, and usage tips.
The document discusses augmented reality with React Native and ARKit. It presents examples of impressive augmented reality applications and lists the requirements for using ARKit on iOS. It explains how to start an augmented reality project with React Native and ARKit, including how to create the application and link the dependencies.
The document discusses the concepts of Big Data, Artificial Intelligence, and Machine Learning. It presents the main tools and techniques of these areas, including deep neural networks, clustering, linear regression, and random forests. It also addresses the importance of these technologies for strategic decision-making and for generating knowledge from data.
The regression model is then used to predict the value of an unknown dependent variable, given the values of the independent variables.
In this class, I give a step-by-step theoretical and practical walkthrough of how to perform linear regression using WEKA.
The document presents internet security case studies conducted by Professor João Gabriel Lima, including the attack on the Ashley Madison site, malvertising attacks hiding malware in the pixels of advertising banners, and the large 2016 DDoS attack against Dyn servers that caused instability across many sites and services.
The document describes various types of internet security threats, including viruses, worms, adware, phishing, and denial-of-service attacks. The author, Prof. João Gabriel Lima, provides concise definitions of each threat, from basic concepts to advanced techniques used by crackers.
The document presents an introduction to machine learning with JavaScript, discussing artificial intelligence and machine learning concepts, application examples, tools, and the challenges of implementing machine learning on the web with JavaScript.
Data Mining with RapidMiner - A Case Study on Churn Rate in... - João Gabriel Lima
In this talk, we work through a step-by-step approach to building a classification model that identifies the patterns of customers of a telephone company who cancelled the service, so that the carrier can predict the risk of cancellation and start working to prevent it.
Data mining with RapidMiner + WEKA - Clustering - João Gabriel Lima
In this presentation, I give a practical step-by-step guide to clustering and, more importantly, to interpreting the results and applying them to support decision-making.
At the end there is a very interesting exercise that gives us the opportunity to apply the knowledge we have acquired.
Data mining in practice with RapidMiner and Weka - João Gabriel Lima
The document presents an introduction to linear regression using the WEKA data mining software. It explains what data mining and regression are, how to load and format data in WEKA, how to create a linear regression model to predict house prices based on variables such as size and number of rooms, and how to interpret the model's results.
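The same kind of model can be fit in a few lines outside WEKA. The scikit-learn sketch below, with a tiny made-up house dataset, mirrors the size-and-rooms price regression described in the class; it is an analogous illustration, not the class's exact dataset or WEKA workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: [size in m^2, number of rooms] and corresponding prices.
X = np.array([[70, 2], [90, 3], [120, 3], [150, 4], [200, 5]])
y = np.array([180_000, 230_000, 290_000, 360_000, 470_000])

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("predicted price for 110 m^2, 3 rooms:", model.predict([[110, 3]])[0])
```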
In this presentation I present both architectures and show that, instead of choosing between one and the other, we can take the best of each and use them in a clean, simple, and objective way.
Game of data - Prediction and Analysis of the Game of Thrones series using... - João Gabriel Lima
In this presentation I show a study carried out by the University of Munich that aims to predict the probability of a character dying in the next season based on 24 pre-selected characteristics.
The document discusses the IPVA Cidadão Mobile app, which allows citizens to pay the IPVA (vehicle ownership tax) on their mobile devices. The app is available for the states of Ceará and Mato Grosso and allows payments to be made quickly and conveniently.
[Estácio - IESAM] Automating Tasks with Gulp.js - João Gabriel Lima
The document describes how to automate tasks with the Gulp.js tool. It explains that Gulp helps automate repetitive tasks such as concatenating files, minifying, and running tests. It also provides examples of how to use Gulp to run JavaScript tests, minify HTML, CSS, and JavaScript, and optimize images, recommends useful preprocessors and plugins, and encourages exploring more of Gulp's functionality.
The document discusses how JavaScript can be used to connect Internet of Things devices such as smart appliances and smart clothing. It presents JavaScript as a language widely used on the internet with many libraries and frameworks useful for IoT. It also lists some IoT projects built with JavaScript and areas where it can be applied, such as smart cities and agriculture.
Droidal: AI Agents Revolutionizing Healthcare - Droidal LLC
Droidal’s AI Agents are transforming healthcare by bringing intelligence, speed, and efficiency to key areas such as Revenue Cycle Management (RCM), clinical operations, and patient engagement. Built specifically for the needs of U.S. hospitals and clinics, Droidal's solutions are designed to improve outcomes and reduce administrative burden.
Through simple visuals and clear examples, the presentation explains how AI Agents can support medical coding, streamline claims processing, manage denials, ensure compliance, and enhance communication between providers and patients. By integrating seamlessly with existing systems, these agents act as digital coworkers that deliver faster reimbursements, reduce errors, and enable teams to focus more on patient care.
Droidal's AI technology is more than just automation — it's a shift toward intelligent healthcare operations that are scalable, secure, and cost-effective. The presentation also offers insights into future developments in AI-driven healthcare, including how continuous learning and agent autonomy will redefine daily workflows.
Whether you're a healthcare administrator, a tech leader, or a provider looking for smarter solutions, this presentation offers a compelling overview of how Droidal’s AI Agents can help your organization achieve operational excellence and better patient outcomes.
A free demo trial is available for those interested in experiencing Droidal’s AI Agents firsthand. Our team will walk you through a live demo tailored to your specific workflows, helping you understand the immediate value and long-term impact of adopting AI in your healthcare environment.
To request a free trial or learn more:
https://droidal.com/
SAP Sapphire 2025 ERP1612 Enhancing User Experience with SAP Fiori and AI - Peter Spielvogel
Explore how AI in SAP Fiori apps enhances productivity and collaboration. Learn best practices for SAPUI5, Fiori elements, and tools to build enterprise-grade apps efficiently. Discover practical tips to deploy apps quickly, leveraging AI, and bring your questions for a deep dive into innovative solutions.
Introducing FME Realize: A New Era of Spatial Computing and AR - Safe Software
A new era for the FME Platform has arrived – and it’s taking data into the real world.
Meet FME Realize: marking a new chapter in how organizations connect digital information with the physical environment around them. With the addition of FME Realize, FME has evolved into an All-data, Any-AI Spatial Computing Platform.
FME Realize brings spatial computing, augmented reality (AR), and the full power of FME to mobile teams: making it easy to visualize, interact with, and update data right in the field. From infrastructure management to asset inspections, you can put any data into real-world context, instantly.
Join us to discover how spatial computing, powered by FME, enables digital twins, AI-driven insights, and real-time field interactions: all through an intuitive no-code experience.
In this one-hour webinar, you’ll:
-Explore what FME Realize includes and how it fits into the FME Platform
-Learn how to deliver real-time AR experiences, fast
-See how FME enables live, contextual interactions with enterprise data across systems
-See demos, including ones you can try yourself
-Get tutorials and downloadable resources to help you start right away
Whether you’re exploring spatial computing for the first time or looking to scale AR across your organization, this session will give you the tools and insights to get started with confidence.
DePIN = Real-World Infra + Blockchain
DePIN stands for Decentralized Physical Infrastructure Networks.
It connects physical devices to Web3 using token incentives.
How Does It Work?
Individuals contribute to infrastructure like:
Wireless networks (e.g., Helium)
Storage (e.g., Filecoin)
Sensors, compute, and energy
They earn tokens for their participation.
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025 - Lorenzo Miniero
Slides for my "Multistream support in the Janus SIP and NoSIP plugins" presentation at the OpenSIPS Summit 2025 event.
They describe my efforts refactoring the Janus SIP and NoSIP plugins to allow for the gatewaying of an arbitrary number of audio/video streams per call (thus breaking the current 1-audio/1-video limitation), plus some additional considerations on what this could mean when dealing with application protocols negotiated via SIP as well.
Adtran’s SDG 9000 Series brings high-performance, cloud-managed Wi-Fi 7 to homes, businesses and public spaces. Built on a unified SmartOS platform, the portfolio includes outdoor access points, ceiling-mount APs and a 10G PoE router. Intellifi and Mosaic One simplify deployment, deliver AI-driven insights and unlock powerful new revenue streams for service providers.
Big Data Analytics Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
UiPath Community Zurich: Release Management and Build PipelinesUiPathCommunity
Ensuring robust, reliable, and repeatable delivery processes is more critical than ever - it's a success factor for your automations and for automation programmes as a whole. In this session, we’ll dive into modern best practices for release management and explore how tools like the UiPathCLI can streamline your CI/CD pipelines. Whether you’re just starting with automation or scaling enterprise-grade deployments, our event promises to deliver helpful insights to you. This topic is relevant for both on-premise and cloud users - as well as for automation developers and software testers alike.
📕 Agenda:
- Best Practices for Release Management
- What it is and why it matters
- UiPath Build Pipelines Deep Dive
- Exploring CI/CD workflows, the UiPathCLI and showcasing scenarios for both on-premise and cloud
- Discussion, Q&A
👨🏫 Speakers
Roman Tobler, CEO@ Routinuum
Johans Brink, CTO@ MvR Digital Workforce
We look forward to bringing best practices and showcasing build pipelines to you - and to having interesting discussions on this important topic!
If you have any questions or inputs prior to the event, don't hesitate to reach out to us.
This event streamed live on May 27, 16:00 pm CET.
Check out all our upcoming UiPath Community sessions at:
👉 https://community.uipath.com/events/
Join UiPath Community Zurich chapter:
👉 https://community.uipath.com/zurich/
As data privacy regulations become more pervasive across the globe and organizations increasingly handle and transfer (including across borders) meaningful volumes of personal and confidential information, the need for robust contracts to be in place is more important than ever.
This webinar will provide a deep dive into privacy contracting, covering essential terms and concepts, negotiation strategies, and key practices for managing data privacy risks.
Whether you're in legal, privacy, security, compliance, GRC, procurement, or otherwise, this session will include actionable insights and practical strategies to help you enhance your agreements, reduce risk, and enable your business to move fast while protecting itself.
This webinar will review key aspects and considerations in privacy contracting, including:
- Data processing addenda, cross-border transfer terms including EU Model Clauses/Standard Contractual Clauses, etc.
- Certain legally-required provisions (as well as how to ensure compliance with those provisions)
- Negotiation tactics and common issues
- Recent lessons from recent regulatory actions and disputes
AI stands for Artificial Intelligence.
It refers to the ability of a computer system or machine to perform tasks that usually require human intelligence, such as:
thinking,
learning from experience,
solving problems, and
making decisions.
In recent years, the proliferation of generative AI technology has revolutionized the landscape of media content creation, enabling even the average user to fabricate convincing videos, images, text, and audio. However, this advancement has also exacerbated the issue of online disinformation, which is spiraling out of control due to the vast reach of social media platforms, sophisticated campaigns, and the proliferation of deepfakes. After an introduction including the significant impact on key societal values such as Democracy, Public Health and Peace, the talk focuses on techniques to detect visual disinformation, manipulated photos/video, deepfakes and visuals out of context. While AI technologies offer promising avenues for addressing disinformation, it is clear that they alone are not sufficient to address this complex and multifaceted problem. Limitations of current AI approaches will be discussed, along with broader human behaviour, societal and financial challenges that must be addressed to effectively combat online disinformation. A holistic approach that encompasses technological, regulatory, and educational interventions, developing critical thought will be finally presented.
Reducing Bugs With Static Code Analysis php tek 2025Scott Keck-Warren
Have you ever deployed code only to have it causes errors and unexpected results? By using static code analysis we can reduce, if not completely remove this risk. In this session, we'll discuss the basics of static code analysis, some free and inexpensive tools we can use, and how we can run the tools successfully.
For those who have ever wanted to recreate classic games, this presentation covers my five-year journey to build a NES emulator in Kotlin. Starting from scratch in 2020 (you can probably guess why), I’ll share the challenges posed by the architecture of old hardware, performance optimization (surprise, surprise), and the difficulties of emulating sound. I’ll also highlight which Kotlin features shine (and why concurrency isn’t one of them). This high-level overview will walk through each step of the process—from reading ROM formats to where GPT can help, though it won’t write the code for us just yet. We’ll wrap up by launching Mario on the emulator (hopefully without a call from Nintendo).
"AI in the browser: predicting user actions in real time with TensorflowJS", ...Fwdays
With AI becoming increasingly present in our everyday lives, the latest advancements in the field now make it easier than ever to integrate it into our software projects. In this session, we’ll explore how machine learning models can be embedded directly into front-end applications. We'll walk through practical examples, including running basic models such as linear regression and random forest classifiers, all within the browser environment.
Once we grasp the fundamentals of running ML models on the client side, we’ll dive into real-world use cases for web applications—ranging from real-time data classification and interpolation to object tracking in the browser. We'll also introduce a novel approach: dynamically optimizing web applications by predicting user behavior in real time using a machine learning model. This opens the door to smarter, more adaptive user experiences and can significantly improve both performance and engagement.
In addition to the technical insights, we’ll also touch on best practices, potential challenges, and the tools that make browser-based machine learning development more accessible. Whether you're a developer looking to experiment with ML or someone aiming to bring more intelligence into your web apps, this session will offer practical takeaways and inspiration for your next project.
AI Emotional Actors: “When Machines Learn to Feel and Perform"AkashKumar809858
Welcome to the era of AI Emotional Actors.
The entertainment landscape is undergoing a seismic transformation. What started as motion capture and CGI enhancements has evolved into a full-blown revolution: synthetic beings not only perform but express, emote, and adapt in real time.
For reading further follow this link -
https://akash97.gumroad.com/l/meioex
Master tester AI toolbox - Kari Kakkonen at Testaus ja AI 2025 ProfessioKari Kakkonen
My slides at Professio Testaus ja AI 2025 seminar in Espoo, Finland.
Deck in English, even though I talked in Finnish this time, in addition to chairing the event.
I discuss the different motivations for testing to use AI tools to help in testing, and give several examples in each categories, some open source, some commercial.
the Hadoop clusters and renders usage of inactive power modes infeasible [26].

Recent research on scale-down in GFS- and HDFS-managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on. However, these solutions suffer from degraded write performance as they rely on a write-offloading technique [31] to avoid server wakeups at the time of writes. Write performance is an important consideration in Hadoop, and even more so in a production Hadoop cluster, as discussed in Section 3.1.

We took a different approach and proposed GreenHDFS, an energy-conserving, self-adaptive, hybrid, logically multi-zoned variant of HDFS, in our paper [23]. Instead of an energy-efficient placement of computations or a small covering set for primary replicas as done in earlier research, GreenHDFS focuses on data-classification techniques to extract energy savings by doing energy-aware placement of data.

GreenHDFS trades off cost, performance and power by separating the cluster into logical zones of servers. Each zone has a different temperature characteristic, where temperature is measured by the power consumption and the performance requirements of the zone. GreenHDFS relies on the inherent heterogeneity in the access patterns of the data stored in HDFS to differentiate the data and to arrive at an energy-conserving data layout and data placement onto the zones. Since computations exhibit high data locality in the Hadoop framework, the computations then flow naturally to the data in the right temperature zones.

The contribution of this paper lies in showing that the energy-aware, data-differentiation-based data placement in GreenHDFS is able to meet all the effective scale-down mandates (i.e., it generates significant idleness, results in few power state transitions, and doesn't degrade write performance) despite the significant challenges a Hadoop cluster poses to scale-down. We do a detailed evaluation and sensitivity analysis of the policy thresholds used in GreenHDFS with a trace-driven simulator and real-world HDFS traces from a production Hadoop cluster at Yahoo!. While some aspects of GreenHDFS are sensitive to the policy thresholds, we found that energy conservation is minimally sensitive to the policy thresholds in GreenHDFS.

The remainder of the paper is structured as follows. In Section 2, we list some of the key observations from our analysis of the production Hadoop cluster at Yahoo!. In Section 3, we provide background on HDFS and discuss scale-down mandates. In Section 4, we give an overview of the energy-management policies of GreenHDFS. In Section 5, we present an analysis of the Yahoo! cluster. In Section 6, we include experimental results demonstrating the effectiveness and robustness of our design and algorithms in a simulation environment. In Section 7, we discuss related work and conclude.

2 Key observations

We did a detailed analysis of the evolution and lifespan of the files in a production Yahoo! Hadoop cluster using one-month-long HDFS traces and Namespace metadata checkpoints. We analyzed each top-level directory separately in the production multi-tenant Yahoo! Hadoop cluster, as each top-level directory in the namespace exhibited different access patterns and lifespan distributions. The key observations from the analysis are:

∙ There is significant heterogeneity in the access patterns and the lifespan distributions across the various top-level directories in the production Hadoop cluster, and one-size-fits-all energy-management policies don't suffice across all directories.

∙ A significant amount of data, amounting to 60% of used capacity, is cold (i.e., lying dormant in the system without getting accessed) in the production Hadoop cluster. A majority of this cold data needs to exist for regulatory and historical trend-analysis purposes.

∙ We found that 95-98% of the files in the majority of the top-level directories had a very short hotness lifespan of less than 3 days. Only one directory had files with a longer hotness lifespan; even in that directory, 80% of the files were hot for less than 8 days.

∙ We found that 90% of the files, amounting to 80.1% of the total used capacity in the most storage-heavy top-level directory, were dormant and hence cold for more than 18 days. Dormancy periods were much shorter in the rest of the directories, and only 20% of the files were dormant beyond 1 day.

∙ Access patterns to the majority of the data in the production Hadoop cluster are news-server-like, whereby most of the computations on the data happen soon after the data's creation.

3 Background

Map-reduce is a programming model designed to simplify data processing [13]. Google, Yahoo!, Facebook, Twitter, etc. use Map-reduce to process massive amounts of data on large-scale commodity clusters. Hadoop is an open-source, cluster-based Map-reduce implementation written in Java [1]. It is logically separated into two subsystems: a highly resilient and scalable Hadoop Distributed File System (HDFS), and a Map-reduce task execution framework. HDFS runs on clusters of commodity hardware and is an object-based distributed file system.
The namespace and the metadata (modification and access times, permissions, and quotas) are stored on a dedicated server called the NameNode and are decoupled from the actual data, which is stored on servers called the DataNodes. Each file in HDFS is replicated for resiliency and split into blocks of typically 128MB, and individual blocks and replicas are placed on the DataNodes for fine-grained load-balancing.

3.1 Importance of Write-Performance in a Production Hadoop Cluster

The Reduce phase of a Map-reduce task writes intermediate computation results back to the Hadoop cluster and relies on high write performance for the overall performance of a Map-reduce task. Furthermore, we observed that the majority of the data in a production Hadoop cluster has a news-server-like access pattern. The predominant number of computations happen on newly created data, thereby mandating good read and write performance for newly created data.

3.2 Scale-down Mandates

Scale-down, in which server components such as CPU, disks, and DRAM are transitioned to an inactive, low-power-consuming mode, is a popular energy-conservation technique. However, scale-down cannot be applied naively. Energy is expended and a transition time penalty is incurred when the components are transitioned back to an active power mode. For example, the transition time of components such as disks can be as high as 10 seconds. Hence, an effective scale-down technique mandates the following:

∙ Sufficient idleness to ensure that energy savings are higher than the energy spent in the transition (see the sketch after this list).

∙ A small number of power state transitions, as some components (e.g., disks) have a limited number of start/stop cycles and too-frequent transitions may adversely impact the lifetime of the disks.

∙ No performance degradation. Steps need to be taken to amortize the performance penalty of power state transitions and to ensure that load concentration on the remaining active-state servers doesn't adversely impact the overall performance of the system.
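The first mandate is a simple break-even condition: sleeping only pays off if the energy saved over the idle period exceeds the energy spent on the wake-up transition. The following minimal Java sketch illustrates that check; all power and transition figures are caller-supplied assumptions, not values from this paper.

    /**
     * Illustrative break-even test for the first scale-down mandate: the energy
     * saved while a component sleeps must exceed the energy spent transitioning
     * it back to the active state. All inputs are assumptions supplied by the caller.
     */
    final class ScaleDownCheck {
        static boolean worthwhile(double idleSeconds,
                                  double activePowerWatts,
                                  double sleepPowerWatts,
                                  double transitionSeconds,
                                  double transitionPowerWatts) {
            // Energy saved by sleeping instead of staying active for the idle period.
            double energySavedJoules = idleSeconds * (activePowerWatts - sleepPowerWatts);
            // Energy spent waking the component back up.
            double energySpentJoules = transitionSeconds * transitionPowerWatts;
            return energySavedJoules > energySpentJoules;
        }
    }

In other words, the longer and more predictable the idle periods a placement scheme can generate, the easier it is to satisfy this inequality for every component of a server.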
4 GreenHDFS Design

GreenHDFS is a variant of the Hadoop Distributed File System (HDFS) that logically organizes the servers in the datacenter into multiple dynamically provisioned Hot and Cold zones. Each zone has a distinct performance, cost, and power characteristic, and each zone is managed by the power and data placement policies most conducive to the class of data residing in that zone. Differentiating the zones in terms of power is crucial to attaining our energy-conservation goal.

The Hot zone consists of files that are currently being accessed and of newly created files. This zone has strict SLA (Service Level Agreement) requirements and hence performance is of the greatest importance. We trade off energy savings in the interest of very high performance in this zone. In this paper, GreenHDFS employs data chunking, placement and replication policies similar to the policies in baseline HDFS or GFS.

The Cold zone consists of files with low to rare accesses. Files are moved by the File Migration policy from the Hot zones to the Cold zone as their temperature decreases beyond a certain threshold. Performance and SLA requirements are not as critical for this zone, and GreenHDFS employs aggressive energy-management schemes and policies in this zone to transition servers to a low-power inactive state. Hence, GreenHDFS trades off performance for high energy savings in the Cold zone.

For optimal energy savings, it is important to increase the idle times of the servers and limit the wakeups of servers that have transitioned to the power-saving mode. Keeping this rationale in mind, and recognizing the low performance needs and infrequency of data accesses in the Cold zone, this zone does not chunk the data. This ensures that upon a future access only the server containing the data will be woken up.

By default, the servers in the Cold zone are in a sleeping mode. A server is woken up when either new data needs to be placed on it or data already residing on the server is accessed. GreenHDFS tries to avoid powering on a server in the Cold zone and maximizes the use of the existing powered-on servers in its server allocation decisions, in the interest of maximizing the energy savings. One server is woken up and filled completely to its capacity before the next server, chosen from an ordered list of servers in the Cold zone, is transitioned to an active power state.

The goal of GreenHDFS is to maximize the allocation of servers to the Hot zone, to minimize the performance impact of zoning, and to minimize the number of servers allocated to the Cold zone. We introduced a hybrid, storage-heavy cluster model in [23] whereby servers in the Cold zone are storage-heavy and have 12 1TB disks per server. We argue that zoning in GreenHDFS will not affect the Hot zone's performance adversely and that the computational workload can be consolidated on the servers in the Hot zone without pushing CPU utilization above the provisioning guidelines. A study of 5000 Google compute servers showed that most of the time is spent within the 10%-50% CPU utilization range [4]. Hence, significant opportunities exist in workload consolidation. And the compute capacity of the Cold zone can always be harnessed under peak load scenarios.
4.1 Energy-management Policies

Files are moved from the Hot Zones to the Cold Zone as their temperature changes over time, as shown in Figure 1. In this paper, we use the dormancy of a file, defined as the elapsed time since the last access to the file, as the measure of the temperature of the file. The higher the dormancy, the lower the temperature of the file and hence the higher its coldness; conversely, the lower the dormancy, the higher its heat. GreenHDFS uses the existing mechanism in baseline HDFS to record and update the last access time of a file upon every file read.

4.1.1 File Migration Policy

The File Migration Policy runs in the Hot zone, monitors the dormancy of the files as shown in Algorithm 1, and moves dormant, i.e., cold, files to the Cold Zone. The advantages of this policy are two-fold: 1) it leads to higher space-efficiency, as space is freed up in the Hot Zone for files with higher SLA requirements by moving rarely accessed files out of the servers in these zones, and 2) it allows significant energy-conservation. Data-locality is an important consideration in the Map-reduce framework and computations are co-located with data. Thus, computations naturally happen on the data residing in the Hot zone. This results in significant idleness in all the components of the servers in the Cold zone (i.e., CPU, DRAM and disks), allowing effective scale-down of these servers.

Figure 1. State diagram of a file's zone allocation based on the migration policies (Hot Zone → Cold Zone when Coldness > Threshold_FMP; Cold Zone → Hot Zone when Hotness > Threshold_FRP; other events shown: File Placement, File Deletion).

Algorithm 1 File Migration Policy: classifies and migrates cold data from the Hot Zones to the Cold Zone
  {For every file i in Hot Zone}
  for i = 1 to n do
    dormancy_i ⇐ current_time − last_access_time_i
    if dormancy_i ≥ Threshold_FMP then
      {Cold Zone} ⇐ {Cold Zone} ∪ {f_i}
      {Hot Zone} ⇐ {Hot Zone} ∖ {f_i}   // filesystem metadata structures are changed to Cold Zone
    end if
  end for

4.1.2 Server Power Conserver Policy

The Server Power Conserver Policy runs in the Cold zone and determines the servers that can be transitioned into a power-saving standby/sleep mode, as shown in Algorithm 2. The current trend in internet-scale data warehouses and Hadoop clusters is to use commodity servers with 4-6 directly attached disks instead of expensive RAID controllers. In such systems, disks constitute just 10% of the entire power usage, as illustrated in a study performed at Google [21], while CPU and DRAM constitute 63% of the total power usage. Hence, power management of any one component is not sufficient. We leverage energy cost savings at the entire-server granularity (CPU, disks, and DRAM) in the Cold zone.

GreenHDFS uses hardware techniques similar to [28] to transition the processors, disks and the DRAM into a low power state. In the Cold zone, GreenHDFS uses the disk Sleep mode,¹ the CPU's ACPI S3 sleep state, as it consumes minimal power and requires only 30us to transition from sleep back to active execution, and the DRAM self-refresh operating mode, in which transitions into and out of self-refresh can be completed in less than a microsecond.

¹ In the Sleep mode the drive buffer is disabled, the heads are parked and the spindle is at rest.

The servers are transitioned back to an active power mode under three conditions: 1) data residing on the server is accessed, 2) additional data needs to be placed on the server, or 3) the block scanner needs to run on the server to ensure the integrity of the data residing on the Cold zone servers. GreenHDFS relies on Wake-on-LAN in the NICs to send a magic packet to transition a server back to an active power state.

Figure 2. Triggering events leading to power state transitions in the Cold Zone (wake-up events: File Access, Bit Rot Integrity Checker, File Placement; Server Power Conserver Policy: Active → Inactive when Coldness > Threshold_SPC).

Algorithm 2 Server Power Conserver Policy
  {For every Server i in Cold Zone}
  for i = 1 to n do
    coldness_i ⇐ max_{0≤j≤m} last_access_time_j
    if coldness_i ≥ Threshold_SPC then
      S_i ⇐ INACTIVE STATE
    end if
  end for
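To make Algorithms 1 and 2 concrete, the following minimal Java sketch mirrors the two policy loops as described above. It assumes a simple in-memory view of per-file and per-server metadata; FileMeta, ColdServer and the zone sets are illustrative names, and the actual block movement and hardware power-state calls are elided. This is a reading of the pseudocode, not GreenHDFS code.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Set;

    // Sketch of the File Migration Policy (Algorithm 1) and the Server Power
    // Conserver Policy (Algorithm 2). All names are illustrative assumptions.
    final class ZonePolicies {

        static final class FileMeta {
            final String path;
            Instant lastAccessTime;
            FileMeta(String path, Instant lastAccessTime) {
                this.path = path;
                this.lastAccessTime = lastAccessTime;
            }
        }

        static final class ColdServer {
            final String host;
            final List<FileMeta> residentFiles;
            boolean active = true;
            ColdServer(String host, List<FileMeta> residentFiles) {
                this.host = host;
                this.residentFiles = residentFiles;
            }
        }

        /** Algorithm 1: move files whose dormancy exceeds Threshold_FMP to the Cold zone. */
        static void runFileMigrationPolicy(Set<FileMeta> hotZone,
                                           Set<FileMeta> coldZone,
                                           Duration thresholdFMP,
                                           Instant now) {
            for (FileMeta f : Set.copyOf(hotZone)) {
                Duration dormancy = Duration.between(f.lastAccessTime, now);
                if (dormancy.compareTo(thresholdFMP) >= 0) {
                    coldZone.add(f);    // metadata re-pointed to the Cold zone
                    hotZone.remove(f);  // actual data movement is elided
                }
            }
        }

        /** Algorithm 2: put Cold-zone servers whose most recent resident access is old enough to sleep. */
        static void runServerPowerConserverPolicy(List<ColdServer> coldZoneServers,
                                                  Duration thresholdSPC,
                                                  Instant now) {
            for (ColdServer s : coldZoneServers) {
                Instant newestAccess = Instant.MIN;
                for (FileMeta f : s.residentFiles) {
                    if (f.lastAccessTime.isAfter(newestAccess)) {
                        newestAccess = f.lastAccessTime;
                    }
                }
                // Server coldness interpreted as time since its most recently accessed file.
                Duration coldness = Duration.between(newestAccess, now);
                if (coldness.compareTo(thresholdSPC) >= 0) {
                    s.active = false;   // stand-in for transitioning CPU, DRAM and disks to sleep
                }
            }
        }
    }

In practice each policy would simply be invoked periodically, once per its policy interval.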
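Separately, the wake-up path above relies on Wake-on-LAN. The standard magic packet is 6 bytes of 0xFF followed by the target NIC's MAC address repeated 16 times, typically sent as a UDP broadcast (port 9). The sketch below is a generic implementation of that packet format; the MAC and broadcast addresses are placeholders, not values from the paper.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    // Sends a standard Wake-on-LAN magic packet: 6 x 0xFF followed by the target
    // MAC address repeated 16 times, as a UDP datagram to the broadcast address.
    public final class WakeOnLan {

        static byte[] parseMac(String mac) {
            String[] parts = mac.split("[:\\-]");
            byte[] bytes = new byte[6];
            for (int i = 0; i < 6; i++) {
                bytes[i] = (byte) Integer.parseInt(parts[i], 16);
            }
            return bytes;
        }

        static void wake(String macAddress, String broadcastAddress) throws Exception {
            byte[] mac = parseMac(macAddress);
            byte[] payload = new byte[6 + 16 * mac.length];
            for (int i = 0; i < 6; i++) {
                payload[i] = (byte) 0xFF;                         // synchronization stream
            }
            for (int i = 6; i < payload.length; i += mac.length) {
                System.arraycopy(mac, 0, payload, i, mac.length); // MAC repeated 16 times
            }
            try (DatagramSocket socket = new DatagramSocket()) {
                socket.setBroadcast(true);
                InetAddress addr = InetAddress.getByName(broadcastAddress);
                socket.send(new DatagramPacket(payload, payload.length, addr, 9));
            }
        }

        public static void main(String[] args) throws Exception {
            wake("00:11:22:33:44:55", "192.168.1.255");           // placeholder values
        }
    }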
4.1.3 File Reversal Policy

The File Reversal Policy runs in the Cold zone and ensures that the QoS, bandwidth and response time of files that become popular again after a period of dormancy are not impacted. If the number of accesses to a file residing in the Cold zone becomes higher than the threshold Threshold_FRP, the file is moved back to the Hot zone as shown in Algorithm 3. The file is chunked and placed onto the servers in the Hot zone in congruence with the policies of the Hot zone.

Algorithm 3 File Reversal Policy: monitors the temperature of the cold files in the Cold Zones and moves files back to the Hot Zones if their temperature changes
  {For every file i in Cold Zone}
  for i = 1 to n do
    if num_accesses_i ≥ Threshold_FRP then
      {Hot Zone} ⇐ {Hot Zone} ∪ {f_i}
      {Cold Zone} ⇐ {Cold Zone} ∖ {f_i}   // filesystem metadata are changed to Hot Zone
    end if
  end for
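A matching minimal sketch of Algorithm 3, in the same illustrative style as the earlier policy sketches; the re-chunking and placement onto Hot-zone servers is left out.

    import java.util.Map;
    import java.util.Set;

    // Sketch of the File Reversal Policy (Algorithm 3): files in the Cold zone
    // that accumulate at least Threshold_FRP accesses are moved back to the Hot
    // zone. Names are illustrative; chunking and placement are elided.
    final class FileReversalPolicy {

        static void run(Set<String> hotZone,
                        Set<String> coldZone,
                        Map<String, Integer> accessCounts,  // accesses seen while in the Cold zone
                        int thresholdFRP) {
            for (String file : Set.copyOf(coldZone)) {
                if (accessCounts.getOrDefault(file, 0) >= thresholdFRP) {
                    hotZone.add(file);      // metadata re-pointed to the Hot zone
                    coldZone.remove(file);  // file would then be re-chunked per Hot-zone policy
                    accessCounts.put(file, 0);
                }
            }
        }
    }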
4.1.4 Policy Thresholds Discussion

A good data migration scheme should result in maximal energy savings, minimal data oscillations between GreenHDFS zones and minimal performance degradation. Minimizing the accesses to Cold zone files results in maximal energy savings and minimal performance impact. For this, policy thresholds should be chosen in a way that minimizes the number of accesses to the files residing in the Cold zone while maximizing the movement of dormant data to the Cold zone. Results from our detailed sensitivity analysis of the thresholds used in GreenHDFS are covered in Section 6.3.5.

Threshold_FMP: A low (i.e., aggressive) value of Threshold_FMP results in an ultra-greedy selection of files as potential candidates for migration to the Cold zone. While an aggressive Threshold_FMP has several advantages, such as higher space-savings in the Cold zone, there are disadvantages as well. If files have intermittent periods of dormancy, they may incorrectly get labeled as cold and get moved to the Cold zone. There is a high probability that such files will get accessed in the near future. Such accesses may suffer performance degradation, as they may be subject to the power transition penalty, and may trigger data oscillations because of file reversals back to the Hot zone.

A higher value of Threshold_FMP results in higher accuracy in determining the really cold files. Hence, the number of reversals, server wakeups and associated performance degradation decreases as the threshold is increased. On the other hand, a higher value of Threshold_FMP means that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be overkill for files with a very short FileLifeSpan_CLR (hotness lifespan), as such files will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

Threshold_SPC: A high Threshold_SPC increases the number of days the servers in the Cold Zone remain in an active power state and hence lowers the energy savings. On the other hand, it results in a reduction in power state transitions, which improves the performance of accesses to the Cold Zone. Thus, a trade-off needs to be made between energy-conservation and data access performance in the selection of the value of Threshold_SPC.

Threshold_FRP: A relatively high value of Threshold_FRP ensures that files are accurately classified as hot-again files before they are moved back to the Hot zone from the Cold zone. This reduces data oscillations in the system and reduces unnecessary file reversals.

5 Analysis of a production Hadoop cluster at Yahoo!

We analyzed one month of HDFS logs² and namespace checkpoints in a multi-tenant cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and the data set size was 6 Petabytes. There were 425 million entries in the HDFS logs and each namespace checkpoint contained 30-40 million files. The cluster namespace was divided into six main top-level directories, whereby each directory addresses different workloads and access patterns. We only considered 4 main directories and refer to them as d, p, u, and m in our analysis instead of referring to them by their real names. The total number of unique files seen in the HDFS logs in the one-month duration was 70 million (d: 1.8 million, p: 30 million, u: 23 million, and m: 2 million).

² The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image is called a checkpoint. HDFS has the ability to log all file system access requests, which is required for auditing purposes in enterprises. The audit logging is implemented using log4j and, once enabled, logs every HDFS event in the NameNode's log [37]. We used the above-mentioned checkpoints and HDFS logs for our analysis.

The logs and the metadata checkpoints were huge in size and we used a large-scale research Hadoop cluster at Yahoo! extensively for our analysis. We wrote the analysis scripts in Pig. We considered several cases in our analysis, as shown below:

∙ Files created before the analysis period and which were not read or deleted subsequently at all. We classify these files as long-living cold files.

∙ Files created before the analysis period and which were read during the analysis period.
∙ Files created before the analysis period and which were both read and deleted during the analysis period.

∙ Files created during the analysis period and which were not read during the analysis period or deleted.

∙ Files created during the analysis period and which were not read during the analysis period, but were deleted.

∙ Files created during the analysis period and which were read and deleted during the analysis period.

To accurately account for the file lifespan and lifetime, we handled the following cases: (a) Filename reuse. We appended a timestamp to each file create to accurately track the audit log entries following the file create entry in the audit log. (b) File renames. We used a unique id per file to accurately track its lifetime across create, rename and delete. (c) Renames and deletes at a higher level in the path hierarchy had to be translated to leaf-level renames and deletes for our analysis. (d) HDFS logs do not have file size information; hence, we did a join of the dataset found in the HDFS logs and the namespace checkpoint to get the file size information.

5.1 File Lifespan Analysis of the Yahoo! Hadoop Cluster

A file goes through several stages in its lifetime: 1) file creation, 2) a hot period during which the file is frequently accessed, 3) a dormant period during which the file is not accessed, and 4) deletion. We introduced and considered various lifespan metrics in our analysis to characterize a file's evolution. A study of the various lifespan distributions helps in deciding the energy-management policy thresholds that need to be in place in GreenHDFS (a small sketch that computes these metrics follows the list).

∙ The FileLifeSpan_CFR metric is defined as the file lifespan between the file creation and the first read access. This metric is used to find the clustering of the read accesses around the file creation.

∙ The FileLifeSpan_CLR metric is defined as the file lifespan between creation and the last read access. This metric is used to determine the hotness profile of the files.

∙ The FileLifeSpan_LRD metric is defined as the file lifespan between the last read access and file deletion. This metric helps in determining the coldness profile of the files, as this is the period for which files are dormant in the system.

∙ The FileLifeSpan_FLR metric is defined as the file lifespan between the first read access and the last read access. This metric helps in determining another dimension of the hotness profile of the files.

∙ FileLifetime. This metric helps in determining the lifetime of the file between its creation and its deletion.
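Given per-file create, read, and delete timestamps recovered from the audit log, these metrics reduce to simple time differences. A sketch under that assumption (the extraction of events from the actual HDFS audit log format is not shown); a null result marks a metric that is undefined for a file, e.g., one that was never read or not deleted in the window.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    // Computes the Section 5.1 lifespan metrics for one file, assuming its
    // creation time, ordered read times, and deletion time are already known.
    final class LifespanMetrics {

        static Duration fileLifeSpanCFR(Instant created, List<Instant> reads) {
            return reads.isEmpty() ? null : Duration.between(created, reads.get(0));
        }

        static Duration fileLifeSpanCLR(Instant created, List<Instant> reads) {
            return reads.isEmpty() ? null : Duration.between(created, reads.get(reads.size() - 1));
        }

        static Duration fileLifeSpanLRD(List<Instant> reads, Instant deleted) {
            if (reads.isEmpty() || deleted == null) {
                return null;
            }
            return Duration.between(reads.get(reads.size() - 1), deleted);
        }

        static Duration fileLifeSpanFLR(List<Instant> reads) {
            return reads.isEmpty() ? null : Duration.between(reads.get(0), reads.get(reads.size() - 1));
        }

        static Duration fileLifetime(Instant created, Instant deleted) {
            return deleted == null ? null : Duration.between(created, deleted);
        }
    }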
5.1.1 FileLifeSpan_CFR

The FileLifeSpan_CFR distribution throws light on the clustering of the file reads around the file creation. As shown in Figure 3, 99% of the files have a FileLifeSpan_CFR of less than 2 days.

5.1.2 FileLifeSpan_CLR

Figure 4 shows the distribution of FileLifeSpan_CLR in the cluster. In directory d, 80% of files are hot for less than 8 days, and 90% of the files, amounting to 94.62% of storage, are hot for less than 24 days. The FileLifeSpan_CLR of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days, and the FileLifeSpan_CLR of 100% of files in directory m and 98% of files in directory a is as small as 2 days. In directory u, 98% of files have a FileLifeSpan_CLR of less than 1 day. Thus, the majority of the files in the cluster have a short hotness lifespan.

5.1.3 FileLifeSpan_LRD

FileLifeSpan_LRD indicates the time for which a file stays in a dormant state in the system. The longer the dormancy period, the higher the coldness of the file and hence the higher the suitability of the file for migration to the Cold zone. Figure 5 shows the distribution of FileLifeSpan_LRD in the cluster. In directory d, 90% of files are dormant beyond 1 day, and 80% of files, amounting to 80.1% of storage, exist in a dormant state past 20 days. In directory p, only 25% of files are dormant beyond 1 day and only 20% of the files remain dormant in the system beyond 10 days. In directory m, only 0.02% of files are dormant for more than 1 day, and in directory u, 20% of files are dormant beyond 10 days. The FileLifeSpan_LRD needs to be considered to find the true migration suitability of a file. For example, given the extremely short dormancy period of the files in directory m, there is no point in exercising the File Migration Policy on directory m. For directories p and u, a Threshold_FMP of less than 5 days will result in unnecessary movement of files to the Cold zone, as these files are due for deletion in any case. On the other hand, given the short FileLifeSpan_CLR in these directories, a high value of Threshold_FMP won't do justice to space-efficiency in the Cold zone, as discussed in Section 4.1.4.

5.1.4 File Lifetime Analysis

Knowledge of the FileLifetime further assists in migration candidate selection and needs to be accounted for in addition to the FileLifeSpan_LRD and FileLifeSpan_CLR metrics covered earlier.
Figure 3. FileLifeSpan_CFR distribution. 99% of files in directory d and 98% of files in directory p were accessed for the first time within 2 days of creation. [Charts omitted; x-axis: FileLifeSpanCFR (Days), y-axes: % of Total File Count and % of Total Used Capacity, per directory d, p, m, u.]

Figure 4. FileLifeSpan_CLR distribution in the four main top-level directories in the Yahoo! production cluster. FileLifeSpan_CLR characterizes the lifespan for which files are hot. In directory d, 80% of files were hot for less than 8 days, and 90% of the files, amounting to 94.62% of storage, are hot for less than 24 days. The hotness lifespan of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days, and the hotness lifespan of 100% of files in directory m and of 98% of files in directory u is less than 1 day. [Charts omitted; x-axis: FileLifeSpanCLR (Days), y-axes: % of Total File Count and % of Total Used Capacity.]

Figure 5. FileLifeSpan_LRD distribution of the top-level directories in the Yahoo! production cluster. FileLifeSpan_LRD characterizes the coldness in the cluster and is indicative of the time a file stays in a dormant state in the system. 80% of files, amounting to 80.1% of storage, in directory d have a dormancy period higher than 20 days. 20% of files, amounting to 28.6% of storage, in directory p are dormant beyond 10 days. 0.02% of files in directory m are dormant beyond 1 day. [Charts omitted; x-axis: FileLifeSpanLRD (Days), y-axes: % of Total File Count and % of Total Used Capacity.]

Figure 6. FileLifetime distribution. 67% of the files in directory p are deleted within one day of their creation. Only 23% of files live beyond 20 days. On the other hand, in directory d, 80% of the files have a FileLifetime of more than 30 days. [Charts omitted; x-axis: FileLifetime (Days), y-axes: % of Total File Count and % of Total Used Capacity.]
Figure 7. File size and file count percentage of long-living cold files. The cold files are defined as the files that were created prior to the start of the one-month observation period and were not accessed during the period of observation at all. In the case of directory d, 13% of the total file count in the cluster, which amounts to 33% of total used capacity, is cold. In the case of directory p, 37% of the total file count in the cluster, which amounts to 16% of total used capacity, is cold. Overall, 63.16% of total file count and 56.23% of total used capacity is cold in the system. [Chart omitted; bars per directory d, p, u showing % of Total File Count and % of Total Used Storage.]

Figure 8. Dormant period analysis of the file count distribution and histogram in one namespace checkpoint. Dormancy of a file is defined as the elapsed time between the last access time recorded in the checkpoint and the day of observation. 34% of the files in directory p and 58% of the files in directory d were not accessed in the last 40 days. [Charts omitted; x-axis: Dormancy > than (Days), y-axes: % of Total File Count, File Count (Millions), % of Total Used Storage, and Used Storage Capacity (TB), per directory d, p, u.]
As shown in Figure 6, directory p only has 23% of files that live beyond 20 days. On the other hand, 80% of files in directory d live for more than 30 days, while 80% of its files have a hot lifespan of less than 8 days. Thus, directory d is a very good candidate for invoking the File Migration Policy.

5.2 Coldness Characterization of the Files

In this section, we show the file count and the storage capacity used by the long-living cold files. The long-living cold files are defined as the files that were created prior to the start of the observation period and were not accessed during the one-month period of observation at all. As shown in Figure 7, 63.16% of files, amounting to 56.23% of the total used capacity, are cold in the system. Such long-living cold files present a significant opportunity to conserve energy in GreenHDFS.

5.3 Dormancy Characterization of the Files

The HDFS trace analysis gives information only about the files that were accessed in the one-month duration. To get a better picture, we analyzed the namespace checkpoints for historical data on the file temperatures and periods of dormancy. The namespace checkpoints contain the last access time information of the files, and we used this information to calculate the dormancy of the files. The Dormancy metric defines the elapsed time between the last noted access time of the file and the day of observation. Figure 8 contains the frequency histograms and distributions of the dormancy. 34% of files, amounting to 37% of storage, in directory p present in the namespace checkpoint were not accessed in the last 40 days. 58% of files, amounting to 53% of storage, in directory d were not accessed in the last 40 days. The extent of dormancy exhibited in the system again shows the viability of the GreenHDFS solution.³

³ The number of files present in the namespace checkpoints was less than half the number of files seen in the one-month trace.
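Concretely, the dormancy distribution of Figure 8 can be tabulated directly from the checkpoint's last-access times. The following sketch assumes the checkpoint has already been parsed into an in-memory map from file path to last access time; the cutoff values mirror the buckets plotted in Figure 8.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.TreeMap;

    // For each cutoff (in days), counts how many files have been dormant for
    // more than that many days as of the day of observation.
    final class DormancyAnalysis {

        static Map<Integer, Long> dormancyGreaterThan(Map<String, Instant> lastAccessByFile,
                                                      Instant observationDay,
                                                      int[] cutoffDays) {
            Map<Integer, Long> counts = new TreeMap<>();
            for (int cutoff : cutoffDays) {
                counts.put(cutoff, 0L);
            }
            for (Instant lastAccess : lastAccessByFile.values()) {
                long dormantDays = Duration.between(lastAccess, observationDay).toDays();
                for (int cutoff : cutoffDays) {
                    if (dormantDays > cutoff) {
                        counts.merge(cutoff, 1L, Long::sum);
                    }
                }
            }
            return counts;
        }
    }

For example, dormancyGreaterThan(checkpoint, today, new int[]{10, 20, 40, 60, 80, 100, 120, 140}) would reproduce the bucket counts behind Figure 8's file-count panels.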
6 Evaluation

In this section, we first present our experimental platform and methodology, followed by a description of the workloads used, and then we give our experimental results. Our goal is to answer seven high-level sets of questions:

∙ How much energy is GreenHDFS able to conserve compared to a baseline HDFS with no energy management?

∙ What is the penalty of the energy management on average response time?

∙ What is the sensitivity of the various policy thresholds used in GreenHDFS on the energy savings results?

∙ How many power state transitions does a server go through on average in the Cold Zone?

∙ Finally, what is the number of accesses that happen to the files in the Cold Zones, the number of days servers are powered on, and the number of migrations and reversals observed in the system?

∙ How many migrations happen daily?

∙ How many power state transitions occur during the simulation run?

The following evaluation sections answer these questions, beginning with a description of our methodology and the trace workloads we use as inputs to the experiments.

6.1 Evaluation methodology

We evaluated GreenHDFS using a trace-driven simulator. The simulator was driven by real-world HDFS traces generated by a production Hadoop cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and the data set size was 6 Petabytes.

We focused our analysis on directory d, as this directory constituted 60% of the used storage capacity in the cluster (4PB out of the 6PB total used capacity). Focusing on directory d cut down our simulation time significantly and reduced our analysis time.⁴ We used 60% of the total cluster nodes in our analysis to make the results realistic for a directory-d-only analysis. The total number of unique files seen in the HDFS traces for directory d in the one-month duration was 0.9 million. In our experiments, we compare GreenHDFS to the baseline case (HDFS without energy management). The baseline results give us the upper bound for energy consumption and the lower bound for average response time.

⁴ An important consideration given the massive scale of the traces.

Simulation Platform: We used a trace-driven simulator for GreenHDFS to perform our experiments. We used models for the power levels, power state transition times and access times of the disk, processor and the DRAM in the simulator. The GreenHDFS simulator was implemented in Java and MySQL distribution 5.1.41 and executed using Java 2 SDK, version 1.6.0-17.⁵ Table 1 lists the various power, latency and transition time values used in the simulator. The simulator was run on 10 nodes in a development cluster at Yahoo!.

⁵ Both performance and energy statistics were calculated based on the information extracted from the datasheets of the Seagate Barracuda ES.2, which is a 1TB SATA hard drive, and a quad-core Intel Xeon X5400 processor.
Table 1. Power and power-on penalties used in the Simulator

  Component                                      Active Power (W)   Idle Power (W)   Sleep Power (W)   Power-up time
  CPU (Quad-core Intel Xeon X5400 [22])          80-150             12.0-20.0        3.4               30 us
  DRAM DIMM [29]                                 3.5-5              1.8-2.5          0.2               1 us
  NIC [35]                                       0.7                0.3              0.3               NA
  SATA HDD (Seagate Barracuda ES.2 1TB [16])     11.16              9.29             0.99              10 sec
  PSU [2]                                        50-60              25-35            0.5               300 us
  Hot server (2 CPU, 8 DRAM DIMM, 4 1TB HDD)     445.34             132.46           13.16             -
  Cold server (2 CPU, 8 DRAM DIMM, 12 1TB HDD)   534.62             206.78           21.08             -

6.2 Simulator Parameters

The default simulation parameters used in this paper are shown in Table 2.

Table 2. Simulator Parameters

  Parameter            Value
  NumServer            1560
  NumZones             2
  Interval_FMP         1 Day
  Threshold_FMP        5, 10, 15, 20 Days
  Interval_SPC         1 Day
  Threshold_SPC        2, 4, 6, 8 Days
  Interval_FRP         1 Day
  Threshold_FRP        1, 5, 10 Accesses
  NumServersPerZone    Hot 1170, Cold 390

6.3 Simulation results

6.3.1 Energy-Conservation

In this section, we show the energy savings made possible by GreenHDFS, compared to the baseline, in one month simply by doing power management in one of the main tenant directories of the Hadoop Cluster. The cost of electricity was assumed to be $0.063/KWh. Figure 9 (Left) shows a 24% reduction in the energy consumption of a 1560-server datacenter with 80% capacity utilization. Extrapolating, $2.1 million can be saved in energy costs if the GreenHDFS technique is applied to all the Hadoop clusters at Yahoo! (upwards of 38000 servers). Energy savings from powered-off servers will be further compounded in the cooling system of a real datacenter: for every Watt of power consumed by the compute infrastructure, a modern data center expends another one-half to one Watt to power the cooling infrastructure [32]. The energy-saving results underscore the importance of supporting access time recording in Hadoop compute clusters.
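The simulator's energy accounting amounts to integrating per-state power draw over the time each server spends in each state and pricing the result at the assumed electricity rate. The sketch below is our own illustration of that bookkeeping, not the simulator's code; the wattages in main() follow the server-level rows of Table 1 and the $0.063/KWh rate above, while the hours are made-up inputs.

    // Simplified energy and cost accounting: energy = sum over states of
    // (power in that state x time spent), cost = energy in kWh x price.
    final class EnergyAccounting {

        static double energyKWh(double activeWatts, double hoursActive,
                                double idleWatts, double hoursIdle,
                                double sleepWatts, double hoursSleep) {
            double wattHours = activeWatts * hoursActive
                             + idleWatts * hoursIdle
                             + sleepWatts * hoursSleep;
            return wattHours / 1000.0;
        }

        static double cost(double kWh, double dollarsPerKWh) {
            return kWh * dollarsPerKWh;
        }

        public static void main(String[] args) {
            // Cold-zone server (Table 1): 534.62 W active, 206.78 W idle, 21.08 W sleep.
            // A 30-day month split across states (illustrative split, not a result):
            double kWh = energyKWh(534.62, 24.0, 206.78, 96.0, 21.08, 600.0);
            System.out.printf("energy = %.1f kWh, cost = $%.2f%n", kWh, cost(kWh, 0.063));
        }
    }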
6.3.2 Storage-Efficiency

In this section, we show the increased storage efficiency of the Hot Zones compared to the baseline. Figure 10 shows that in the baseline case, the average capacity utilization of the 1560 servers is higher than that of GreenHDFS, which has just 1170 of the 1560 servers provisioned to the Hot Zone. GreenHDFS has a much higher amount of free space available in the Hot zone, which tremendously increases the potential for better data placement techniques in the Hot zone. The more aggressive the policy threshold, the more space is available in the Hot zone for truly hot data, as more data is migrated out to the Cold zone.

6.3.3 File Migrations and Reversals

Figure 10 (right-most) shows the number and total size of the files migrated to the Cold zone daily with a Threshold_FMP value of 10 Days. Every day, on average, 6.38TB worth of data and 28.9 thousand files are migrated to the Cold zone. Since we have assumed storage-heavy servers in the Cold zone, where each server has 12 1TB disks, and assuming 80MB/sec of bandwidth per disk, 6.38TB of data can be absorbed in less than 2 hours by one server. The migration policy can be run during off-peak hours to minimize any performance impact.

6.3.4 Impact of Power Management on Response Time

We examined the impact of server power management on the response time of a file that was moved to the Cold Zone following a period of dormancy and was accessed again for some reason. The files residing in the Cold Zone may suffer performance degradation in two ways: 1) if the file resides on a server that is not currently powered ON, the access will incur a server wakeup time penalty, and 2) transfer time degradation courtesy of no striping in the lower zones. The file is moved back to the Hot zone and chunked again by the File Reversal policy. Figure 11 shows the impact on the average response time. 97.8% of the total read requests are not impacted by the power management; impact is seen only by 2.1% of the reads. With a less aggressive Threshold_FMP (15, 20 days), the impact on the response time will reduce much further.

6.3.5 Sensitivity Analysis

We tried different values of the thresholds for the File Migration policy and the Server Power Conserver policy to understand the sensitivity of these thresholds on storage-efficiency, energy-conservation and the number of power state transitions. A discussion of the impact of the various thresholds is given in Section 4.1.4.
Figure 9. (Left) Energy Savings with GreenHDFS and (Middle) Days Servers in Cold Zone were ON compared to the Baseline. Energy Cost Savings are Minimally Sensitive to the Policy Threshold Values. GreenHDFS achieves 24% savings in the energy costs in one month simply by doing power management in one of the main tenant directories of the Hadoop Cluster. (Right) Number of migrations and reversals in GreenHDFS with different values of the Threshold_FMP threshold. [Charts omitted; axis labels include Energy Costs, File Migration Policy (Days), Days Server ON, Cold Zone Servers, Count (x100000), and File Migration Policy Interval (Days).]

Figure 10. Capacity Growth and Utilization in the Hot and Cold Zone compared to the Baseline and Daily Migrations. GreenHDFS substantially increases the free space in the Hot Zones by migrating cold data to the Cold Zones. In the left and middle charts, we only consider the new data that was introduced in the data directory and old data which was accessed during the 1-month period. The right chart shows the number and total size of the files migrated daily to the Cold zone with a Threshold_FMP value of 10 Days. [Charts omitted; axis labels include Used Storage Capacity (GB/TB), File Migration Policy Interval (Days), Cold Zone Used Capacity (TB), Days, File Size (TB), File Count (x1000), and Server Number.]
Threshold_FMP: We found that the energy costs are minimally sensitive to the Threshold_FMP value. As shown in Figure 9 (Left), the energy cost savings varied minimally when Threshold_FMP was changed to 5, 10, 15 and 20 days.

The performance impact and the number of file reversals are minimally sensitive to the Threshold_FMP value as well. This behavior can be explained by the observation that the majority of the data in the production Hadoop cluster at Yahoo! has a news-server-like access pattern. This implies that once data is deemed cold, there is a low probability of the data getting accessed again.

Figure 9 (right-most) shows the total number of migrations of the files deemed cold by the File Migration policy and the reversals of the moved files in case they were later accessed by a client in the one-month simulation run. There were more instances (40,170, i.e., 4% of overall file count) of file reversals with the most aggressive Threshold_FMP of 5 days. With a less aggressive Threshold_FMP of 15 days, the number of reversals in the system went down to 6,548 (i.e., 0.7% of file count). These experiments were done with a Threshold_FRP value of 1. The number of file reversals is substantially reduced by increasing the Threshold_FRP value; with a Threshold_FRP value of 10, zero reversals happen in the system.

The storage-efficiency is sensitive to the value of the Threshold_FMP threshold, as shown in Figure 10 (Left). An increase in the Threshold_FMP value results in less efficient capacity utilization of the Hot Zones. A higher value of the Threshold_FMP threshold signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be overkill for files with a very short FileLifeSpan_CLR, as they will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

Threshold_SPC: As Figure 12 (Right) illustrates, increasing the Threshold_SPC value minimally increases the number of days the servers in the Cold Zone remain ON and hence minimally lowers the energy savings. On the other hand, increasing the Threshold_SPC value results in a reduction in the power state transitions, which improves the performance of the accesses to the Cold Zone. Thus, a trade-off needs to be made between energy-conservation and data access performance.
Figure 11. Performance Analysis: Impact on Response Time because of power management with a Threshold_FMP of 10 days. 97.8% of the total read requests are not impacted by the power management. Impact is seen only by 2.1% of the reads. With a less aggressive Threshold_FMP (15, 20), the impact on the Response Time will reduce much more. [Charts omitted; x-axis: Read Response Time (msecs), y-axes: % of Total File Reads and File Count in log scale.]

Figure 12. Sensitivity Analysis: Sensitivity of Number of Servers Used in Cold Zone, Number of Power State Transitions and Capacity per Zone to the File Migration Policy's Age Threshold and the Server Power Conserver Policy's Access Threshold. [Charts omitted; axis labels include Used Storage Capacity (TB), Used Capacity Hot/Cold (TB), File Migration Policy Interval (Days), Used Cold Zone Servers, timesOn, daysOn, and Server Power Conserver Policy Interval (Days).]
Summary of the Sensitivity Analysis: From the above evaluation, it is clear that a trade-off needs to be made in choosing the right thresholds in GreenHDFS based on an enterprise's needs. If Hot zone space is at a premium, a more aggressive Threshold_FMP needs to be used. This can be done without impacting the energy-conservation that can be derived in GreenHDFS.

6.3.6 Number of Server Power Transitions

Figure 13 (Left) shows the number of power transitions incurred by the servers in the Cold Zones. Frequently starting and stopping disks is suspected to affect disk longevity; the number of start/stop cycles a disk can tolerate during its service lifetime is still limited. Making the power transitions infrequently reduces the risk of running into this limit. The maximum number of power state transitions incurred by a server in a one-month simulation run is just 11, and only 1 server out of the 390 servers provisioned in the Cold Zone exhibited this behavior. Most disks are designed for a maximum service lifetime of 5 years and can tolerate 500,000 start/stop cycles. Given the very small number of transitions incurred by a server in the Cold Zone in a year, GreenHDFS has no risk of exceeding the start/stop cycles during the service lifetime of the disks.

7 Related Work

Management of energy, peak power, and temperature of data centers and warehouses is becoming the target of an increasing number of research studies. However, to the best of our knowledge, none of the existing systems exploit data-classification-driven data placement to derive energy-efficiency, nor do they have a file-system-managed multi-zoned, hybrid data center layout. Most of the prior work focuses on workload placement to manage the thermal distribution within a data center. [30, 34] considered the placement of computational workload for energy-efficiency. Chase et al. [8] do energy-conscious provisioning, which configures switches to concentrate request load on a minimal active set of servers for the current aggregate load level.
Figure 13. Cold Zone Behavior: Number of Times Servers Transitioned Power State with a Threshold_FMP of 10 Days. We only show those servers in the Cold zone that either received newly cold data or had data accesses targeted to them in the one-month simulation run. [Chart omitted; x-axis: Servers in Cold Zone, y-axis: Number of Power State Transitions.]
Le et al. [25] focus on a multi-datacenter internet service. They exploit the inherent heterogeneity of the datacenters in electricity pricing, time-zone differences and collocation with renewable energy sources to reduce energy consumption without impacting the SLA requirements of the applications. Bash et al. [5] allocate heavy computational, long-running workloads onto servers that are in more thermally efficient places. Chun et al. [12] propose a hybrid datacenter comprising low-power Atom processors and high-power, high-performance Xeon processors. However, they do not specify any zoning in the system and focus more on task migration rather than data migration. Narayanan et al. [31] use a technique to offload the write workload of one volume to other storage elsewhere in the data center. Meisner et al. [28] reduce power costs by transitioning servers to a "powernap" state whenever there is a period of low utilization.

In addition, there is research on hardware-level techniques such as dynamic voltage scaling as a mechanism to reduce peak power consumption in datacenters [7, 14], and Raghavendra et al. [33] coordinate hardware-level power capping with virtual machine dispatching mechanisms. Managing temperature is the subject of the systems proposed in [20].

Recent research on increasing energy-efficiency in GFS- and HDFS-managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on and which represent the lowest power setting. Remaining replicas are stored in a larger set of secondary nodes, and performance is scaled up by increasing the number of secondary nodes. However, these solutions suffer from degraded write-performance and increased DFS code complexity. These solutions also do not do any data differentiation and treat all the data in the system alike. Existing highly scalable file systems such as the Google file system [19] and HDFS [37] do not do energy management. Recently, an energy-efficient Log Structured File System was proposed by Hakim et al. [18]. However, it aims to concentrate load on one disk at a time and hence this design will impact availability and performance.

8 Conclusion and Future Work

We presented a detailed evaluation and sensitivity analysis of GreenHDFS, a policy-driven, self-adaptive variant of the Hadoop Distributed File System. GreenHDFS relies on data-classification-driven data placement to realize guaranteed, substantially long periods of idleness in a significant subset of servers in the datacenter. Detailed experimental results with real-world traces from a production Yahoo! Hadoop cluster show that GreenHDFS is capable of achieving 24% savings in the energy costs of a Hadoop cluster by doing power management in only one of the main tenant top-level directories in the cluster. These savings will be further compounded by savings in cooling costs. Detailed lifespan analysis of the files in a large-scale production Hadoop cluster at Yahoo! points at the viability of GreenHDFS. Evaluation results show that GreenHDFS is able to meet all the scale-down mandates (i.e., it generates significant idleness in the cluster, results in very few power state transitions, and doesn't degrade write performance) in spite of the unique scale-down challenges present in a Hadoop cluster.

9 Acknowledgement

This work was supported by NSF grant CNS 05-51665 and an internship at Yahoo!. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF or the U.S. government.

References

[1] http://hadoop.apache.org/.
[2] Introduction to power supplies. National Semiconductor, 2002.
[3] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing, pages 217–228, New York, NY, USA, 2010. ACM.
[4] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, 40(12), 2007.
[5] C. Bash and G. Forman. Cool job allocation: Measuring the power savings of placing jobs at cooling-efficient locations in the data center. In ATC '07: Proceedings of the 2007 USENIX Annual Technical Conference, pages 1–6, Berkeley, CA, USA, 2007. USENIX Association.
[6] C. Belady. In the data center, power and cooling costs more than the IT equipment it supports. Electronics Cooling, February 2010.
[7] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In HPCA, pages 171–, 2001.
[8] J. S. Chase and R. P. Doyle. Balance of power: Energy management for server clusters. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS), 2001.
[9] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao. Energy-aware server provisioning and load dispatching for connection-intensive internet services. In NSDI '08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, Berkeley, CA, USA, 2008. USENIX Association.
[10] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam. Managing server energy and operational costs in hosting centers. SIGMETRICS Perform. Eval. Rev., 33(1), 2005.
[11] Y. Chen, A. Ganapathi, A. Fox, R. H. Katz, and D. A. Patterson. Statistical workloads for energy efficient MapReduce. Technical report, UC Berkeley, 2010.
[12] B.-G. Chun, G. Iannaccone, G. Iannaccone, R. Katz, G. Lee, and L. Niccolini. An energy case for hybrid datacenters. In HotPower, 2009.
[13] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. USENIX Association, 2004.
[14] M. E. Femal and V. W. Freeh. Boosting data center performance through non-uniform power allocation. In ICAC '05: Proceedings of the Second International Conference on Automatic Computing, Washington, DC, USA, 2005. IEEE Computer Society.
[15] Y. I. Eric Baldeschwieler. http://developer.yahoo.com/events/hadoopsummit2010.
[16] Seagate Barracuda ES.2. http://www.seagate.com/staticfiles/support/disc/manuals/nl35 series & bc es series/barracuda es.2 series/100468393e.pdf, 2008.
[17] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 13–23, New York, NY, USA, 2007. ACM.
[18] L. Ganesh, H. Weatherspoon, M. Balakrishnan, and K. Birman. Optimizing power consumption in large scale storage systems. In HotOS '07: Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems, Berkeley, CA, USA, 2007. USENIX Association.
[19] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
[20] T. Heath, A. P. Centeno, P. George, L. Ramos, Y. Jaluria, and R. Bianchini. Mercury and Freon: Temperature emulation and management for server systems. In ASPLOS, pages 106–116, 2006.
[21] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, May 2009.
[22] Intel. Quad-core Intel Xeon processor 5400 series. 2008.
[23] R. T. Kaushik and M. Bhandarkar. GreenHDFS: Towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. In HotPower, 2010.
[24] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Symposium on Massive Storage Systems and Technologies, 2010.
[25] K. Le, R. Bianchini, M. Martonosi, and T. Nguyen. Cost- and energy-aware load distribution across data centers. In HotPower, 2009.
[26] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. In HotPower, 2009.
[27] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. SIGOPS Oper. Syst. Rev., 44(1):61–65, 2010.
[28] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating server idle power. In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 205–216, New York, NY, USA, 2009. ACM.
[29] Micron. DDR2 SDRAM SODIMM. 2004.
[30] J. Moore, J. Chase, P. Ranganathan, and R. Sharma. Making scheduling "cool": Temperature-aware workload placement in data centers. In ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 5–5, Berkeley, CA, USA, 2005. USENIX Association.
[31] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. Trans. Storage, 4(3):1–23, 2008.
[32] C. Patel, E. Bash, R. Sharma, and M. Beitelmal. Smart cooling of data centers. In Proceedings of the Pacific Rim/ASME International Electronics Packaging Technical Conference and Exhibition (IPACK '03), 2003.
[33] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu. No "power" struggles: Coordinated multi-level power management for the data center. In ASPLOS XIII, pages 48–59, New York, NY, USA, 2008. ACM.
[34] R. K. Sharma, C. E. Bash, C. D. Patel, R. J. Friedrich, and J. S. Chase. Balance of power: Dynamic thermal management for internet data centers. IEEE Internet Computing, 9:42–49, 2005.
[35] SMSC. LAN9420/LAN9420i single-chip Ethernet controller with HP Auto-MDIX support and PCI interface. 2008.
[36] N. Tolia, Z. Wang, M. Marwah, C. Bash, P. Ranganathan, and X. Zhu. Delivering energy proportionality with non energy-proportional systems: Optimizing the ensemble. In HotPower, 2008.
[37] T. White. Hadoop: The Definitive Guide. O'Reilly Media, May 2009.