Introduction to SARA's Hadoop Hackathon - Dec 7th 2010 - Evert Lammerts
This document summarizes the agenda for the SARA Hadoop Hackathon on December 7, 2010. It provides background on Hadoop and how it relates to earlier technologies like Nutch and MapReduce. It then outlines the agenda for the day, which includes introductions, presentations on MapReduce at the University of Twente, and a kickoff for the hackathon project-building period. An optional tour of the SARA facilities is also included. The day will conclude with presentations of hackathon results.
Hadoop is an open-source software platform for distributed storage and processing of large datasets across clusters of computers. It was designed to scale up from single servers to thousands of machines, with very high fault tolerance. The document outlines the history of Hadoop, why it was created, its core components HDFS for storage and MapReduce for processing, and provides an example word count problem. It also includes information on installing Hadoop and additional resources.
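The word-count example can be made concrete with a small sketch in the Hadoop Streaming style: a mapper that emits (word, 1) pairs and a reducer that sums them. This is an illustrative Python sketch, not code from the original slides; the command-line invocation is an assumption.

#!/usr/bin/env python
# Word count in the MapReduce style (Hadoop Streaming convention):
# the mapper emits "word<TAB>1" lines; the framework sorts them by key;
# the reducer sums the counts for each word.
import sys

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Run as "wordcount.py map" or "wordcount.py reduce", reading stdin,
    # which is how Hadoop Streaming would invoke the two phases.
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)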
Toulouse Data Science meetup - Apache Zeppelin - Gérard Dupont
Apache Zeppelin is a web-based notebook for interactive data analytics. It allows for interactive coding and visualization with out-of-the-box support for Spark integration. Some key features include interactive notebooks, built-in visualization options, and extensibility through additional interpreters and custom visualization. While it is easy to configure and use, installation from source is required for customization and it currently lacks multi-user support.
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University - Node.js Foundation
Today, more data is accumulated than ever before. It has been estimated that over 80% of data collected by businesses is unstructured, mostly in the form of free text. The statistical community has developed many tools for analysing textual data, both in the areas of exploratory data analysis (e.g. clustering methods) and predictive analytics. In this talk, Philipp Burckhardt will discuss tools and libraries that you can use today to perform text mining with Node.js. Creative strategies to overcome the limitations of the V8 engine in the areas of high-performance and memory-intensive computing will be discussed. You will be introduced to how you can use Node.js streams to analyse text in real-time, how to leverage native add-ons for performance-intensive code and how to build command-line interfaces to process text directly from the terminal.
BigScience is a one-year research workshop involving over 800 researchers from 60 countries to build and study very large multilingual language models and datasets. It was granted 5 million GPU hours on the Jean Zay supercomputer in France. The workshop aims to advance AI/NLP research by creating shared models and data as well as tools for researchers. Several working groups are studying issues like bias, scaling, and engineering challenges of training such large models. The first model, T0, showed strong zero-shot performance. Upcoming work includes further model training and papers.
Introduction to Spark: Or how I learned to love 'big data' after all. - Peadar Coyle
Slides from a talk I will give in early 2016 at the Luxembourg Data Science Meetup. The aim is to give an introduction to Apache Spark from a machine learning expert's point of view, based on various other tutorials out there. It is aimed at non-specialists.
“BIG DATA” is data that is big in
volume,
velocity, and
variety.
“TODAY’S BIG MAY BE TOMORROW’S NORMAL”
Variety deals with a wide range of data types:
Structured data – RDBMS
Semi-structured data – HTML, XML
Unstructured data – audio, video, emails, photos, PDFs, social media
Hadoop
It was created by Doug Cutting and Michael Cafarella in 2005.
2003 – Nutch, an open-source search engine (cf. Lucene, Sphinx, etc.)
(Google published papers describing its distributed file system, GFS, and MapReduce)
Yahoo then took the initiative,
and the creation of Hadoop followed.
Hadoop 0.1.0 was released in April 2006.
As of this writing, Hadoop 2.8 is available.
This document provides an overview of big data and Hadoop. It introduces big data concepts and architectures, describes the Hadoop ecosystem including its core components of HDFS and MapReduce. It also provides an example of how MapReduce works for a word count problem, splitting the documents, mapping to count word frequencies, and reducing to sum the counts. The document aims to give the reader an understanding of big data and how Hadoop is used for distributed storage and processing of large datasets.
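To see the split, map, shuffle, and reduce steps described above without a cluster, a tiny in-memory simulation is enough; the documents below are made up for illustration.

from collections import defaultdict

# Hypothetical input "documents" standing in for the splits of a large file.
documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: each split independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key, as the framework does between map and reduce.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}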
This document discusses how to deal with nested lists in R using the purrr, furrr, and future packages. It summarizes working with nested list data from a JSON response, including adding custom IDs to nested data frames and parallelizing the process using future_map to speed it up. Anonymous functions are also discussed as they are used with the apply functions in purrr, and examples are provided of their syntax.
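The post itself is about R's purrr and furrr; purely as a rough Python analogue of the same idea (attach the parent ID to each nested record, then parallelize the per-record work), a sketch with concurrent.futures might look like this. The JSON shape and field names are hypothetical.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical nested response: a list of users, each with a nested list of orders.
response = [
    {"user": "a", "orders": [{"item": "x", "qty": 2}, {"item": "y", "qty": 1}]},
    {"user": "b", "orders": [{"item": "z", "qty": 5}]},
]

def add_ids(record):
    # Attach the parent ID to every nested row and return the flattened rows
    # (the custom-ID step the original performs with purrr).
    return [{"user_id": record["user"], **order} for order in record["orders"]]

# Parallel map over the records, loosely analogous to furrr::future_map.
with ThreadPoolExecutor() as pool:
    nested = list(pool.map(add_ids, response))

flat = [row for rows in nested for row in rows]
print(flat)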
CityLABS Workshop: Working with large tables - Enrico Daga
This document discusses working with large tables and big data processing. It introduces distributed computing as an approach to process large datasets by distributing data across multiple nodes and parallelizing operations. The document then outlines using Apache Hadoop and the MK Data Hub cluster to distribute data storage and processing. It demonstrates how to use tools like Hue, Hive, and Pig to analyze tabular data in a distributed manner at scale. Finally, hands-on examples are provided for computing TF-IDF statistics on the large Gutenberg text corpus.
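Before running it at scale with Hive or Pig, the TF-IDF statistic itself is easy to prototype on a single machine; the toy corpus and the log-based IDF variant below are assumptions for illustration.

import math
from collections import Counter

# Toy corpus standing in for the Gutenberg texts.
docs = {"doc1": "to be or not to be", "doc2": "to do is to be", "doc3": "do be do be do"}
tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for words in tokenized.values() for term in set(words))

# TF-IDF per (document, term): relative term frequency times log(N / document frequency).
tfidf = {}
for name, words in tokenized.items():
    tf = Counter(words)
    for term, count in tf.items():
        tfidf[(name, term)] = (count / len(words)) * math.log(n_docs / df[term])

print(sorted(tfidf.items(), key=lambda kv: -kv[1])[:5])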
Analysis of historical movie data by BHADRA - Bhadra Gowdra
A recommendation system provides the facility to understand a person's taste and automatically find new, desirable content for them based on patterns in their likes and ratings of different items. In this paper, we have proposed a recommendation system, built using the Hadoop framework, for the large amount of data available on the web in the form of ratings, reviews, opinions, complaints, remarks, feedback, and comments about any item (product, event, individual, or service).
An introduction to Hadoop for large scale data analysis - Abhijit Sharma
This document provides an overview of Hadoop and how it can be used for large scale data analysis. Some key points discussed include:
- Hadoop uses MapReduce, a simple programming model for processing large datasets in parallel across clusters of computers.
- It also uses HDFS for reliable storage of very large files across clusters of commodity servers.
- Examples of how Hadoop can be used include distributed logging, search, analytics, and data mining of large datasets.
Data engineering and analytics using Python - Purna Chander
This document provides an overview of data engineering and analytics using Python. It discusses Jupyter notebooks and commonly used Python modules for data science like Pandas, NumPy, SciPy, Matplotlib and Seaborn. It describes Anaconda distribution and the key features of Pandas including data loading, structures like DataFrames and Series, and core operations like filtering, mapping, joining, sorting, cleaning and grouping. It also demonstrates data visualization using Seaborn and a machine learning example of linear regression.
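A compact pandas sketch of the operations listed above (loading data into DataFrames, filtering, mapping, joining, sorting, and grouping); the tables and column names are hypothetical, and real data would typically come from pd.read_csv.

import pandas as pd

# Hypothetical data; in practice this might be loaded with pd.read_csv("sales.csv").
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "product": ["a", "a", "b", "b"],
    "amount": [120, 80, 200, 150],
})
products = pd.DataFrame({"product": ["a", "b"], "category": ["toys", "tools"]})

large_sales = sales[sales["amount"] > 100]                     # filtering rows
sales["amount_eur"] = sales["amount"].map(lambda x: x * 0.9)   # mapping a column
joined = sales.merge(products, on="product")                   # joining two frames
by_region = (joined.groupby("region")["amount"]                # grouping and sorting
             .sum()
             .sort_values(ascending=False))
print(large_sales, by_region, sep="\n")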
Apache Hive provides a SQL-like interface to query and manipulate data stored in Hadoop, while Apache Pig provides a scripting language to define data flows and transformations. Hive is better suited for business intelligence analysis and ad-hoc queries through its familiar SQL interface, while Pig is more appropriate for data pipelines, iterative data processing, and research through its scripting capabilities. Both can perform similar functions but may differ in performance depending on the use case.
Making Machine Learning Scale: Single Machine and Distributed - Turi, Inc.
This document summarizes machine learning scalability from single machine to distributed systems. It discusses how true scalability is about how long it takes to reach a target accuracy level using any available hardware resources. It introduces GraphLab Create and SFrame/SGraph for scalable machine learning and graph processing. Key points include distributed optimization techniques, graph partitioning strategies, and benchmarks showing GraphLab Create can solve problems faster than other systems by using fewer machines.
• What is MapReduce?
• What are MapReduce implementations?
Facing these questions, I did some personal research and produced a synthesis, which helped me clarify some ideas. The attached presentation does not claim to be exhaustive on the subject, but may bring you some useful insights.
Beyond Kaggle: Solving Data Science Challenges at Scale - Turi, Inc.
This document summarizes a presentation on entity resolution and data deduplication using Dato toolkits. It discusses key concepts like entity resolution, challenges in entity resolution like missing data and data integration from multiple sources, and provides an example dataset of matching Amazon and Google products. It also outlines the preprocessing steps, describes using a nearest neighbors algorithm to find duplicate records, and shares some resources on entity resolution.
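The talk uses the Dato toolkits; as a library-neutral sketch of the nearest-neighbour idea, duplicate product records can be matched by token (Jaccard) similarity of their titles. The records and threshold below are made up.

# Toy entity resolution: link each Amazon record to its most similar Google record.
amazon = {"a1": "apple iphone 4 16gb black", "a2": "logitech wireless mouse m185"}
google = {"g1": "iphone 4 black 16gb apple", "g2": "dell xps 13 laptop"}

def jaccard(s1, s2):
    t1, t2 = set(s1.split()), set(s2.split())
    return len(t1 & t2) / len(t1 | t2)

THRESHOLD = 0.4  # arbitrary cut-off for calling two records duplicates
for aid, atitle in amazon.items():
    gid, gtitle = max(google.items(), key=lambda kv: jaccard(atitle, kv[1]))
    score = jaccard(atitle, gtitle)
    if score >= THRESHOLD:
        print(f"{aid} matches {gid} (jaccard = {score:.2f})")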
Tech Talk - Underutilized Resources in Distributed Systems - Rishabh Dugar
This document discusses using underutilized distributed computing resources and the Chord protocol. It first introduces the problem of processing growing data and costs of hardware. It then defines distributed systems and describes MapReduce for parallel processing. The document outlines the Chord protocol for distributed lookup, including the finger table, successor list, consistent hashing, and Chord ring. It notes that Chord lookup scales as O(log n). Finally, it mentions comparing Chord to Hadoop with and without node churn.
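Chord's finger tables and successor lists are hard to show in a few lines, but the consistent-hashing idea underneath it (nodes and keys hashed onto one ring, each key owned by its clockwise successor) can be sketched briefly; the node names are hypothetical, and this local lookup stands in for the O(log n) finger-table routing a real Chord node performs.

import hashlib
from bisect import bisect_right

def ring_hash(value, ring_bits=16):
    # Hash a node name or key onto a position on the ring.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** ring_bits)

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((ring_hash(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def successor(key):
    # The key belongs to the first node clockwise from its hash (its successor).
    idx = bisect_right(positions, ring_hash(key)) % len(ring)
    return ring[idx][1]

for key in ["alpha", "beta", "gamma"]:
    print(key, "->", successor(key))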
Open source databases like MySQL, PostgreSQL, and Berkeley DB are flexible alternatives to proprietary databases. PostGIS extends PostgreSQL with spatial database capabilities for storing and querying geographic data and objects. It implements OpenGIS standards and provides functions for spatial indexing, analysis, and data access. PostGIS allows integrating geographic data into web and desktop applications and is used successfully in various real-world GIS projects and systems.
The document discusses the core components of Hadoop, including storage, transformation, and analysis using components like HDFS, MapReduce, Tez and Spark. It describes Generation 1 core components as HDFS for storage and MapReduce for processing. HDFS uses a master-slave architecture with the NameNode tracking metadata and DataNodes storing replicated blocks. MapReduce uses mappers to create key-value pairs, a shuffle to group related pairs, and reducers to aggregate pairs for output. Sample MapReduce jobs for word counting and tracking smart phones are provided.
DBpedia past, present and future - Dimitris Kontokostas. Reveals recent developments in the Linked Data and knowledge graph fields and how DBpedia progresses with Wikipedia data.
The document discusses Dremel, an interactive query system for analyzing large-scale datasets. Dremel uses a columnar data storage format and a multi-level query execution tree to enable fast querying. It evaluates Dremel's performance on interactive queries, showing it can count terms in a field within seconds using 3000 workers, while MapReduce takes hours. Dremel also scales linearly and handles stragglers well. Today, similar systems like Google BigQuery and Apache Drill use Dremel-like techniques for interactive analysis of web-scale data.
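A toy comparison may help show why a columnar layout suits a query like "count terms in one field": only the needed column has to be read. The records are made up and this is not Dremel's actual nested columnar format.

from collections import Counter

# The same toy table stored row-wise and column-wise.
rows = [
    {"url": "a.com", "title": "big data at scale", "lang": "en"},
    {"url": "b.com", "title": "interactive query systems", "lang": "en"},
    {"url": "c.com", "title": "daten im grossen stil", "lang": "de"},
]
columns = {field: [r[field] for r in rows] for field in rows[0]}

# Row store: every whole record is touched even though only "title" is needed.
row_scan = Counter(word for r in rows for word in r["title"].split())

# Column store: scan just the "title" column, which is what Dremel-style engines exploit.
col_scan = Counter(word for title in columns["title"] for word in title.split())

assert row_scan == col_scan
print(col_scan.most_common(3))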
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac... - Cloudera, Inc.
Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed an ETL framework called Lumberjack in Python for extracting, transforming, and loading data from web logs into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse, clean, look up dimensions, and sessionize web logs before loading into a data warehouse for reporting and analytics (a generic sketch of this streaming style follows this list).
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their previous infrastructure.
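Lumberjack itself is CBS Interactive's internal framework; a generic sketch of the same streaming style, chained Python generators that parse, clean, enrich, and sessionize log lines, might look like the following. The log format, field names, and session rule are all assumptions.

# Toy streaming log pipeline: parse -> clean -> dimension lookup -> sessionize.
RAW_LOGS = [
    "2011-06-01T10:00:01 user1 /home US",
    "2011-06-01T10:00:05 user1 /products US",
    "bad line",
    "2011-06-01T11:30:00 user1 /home US",
]
COUNTRY_DIM = {"US": "United States"}

def parse(lines):
    for line in lines:
        parts = line.split()
        if len(parts) == 4:                          # clean: drop malformed lines
            yield {"ts": parts[0], "user": parts[1], "path": parts[2], "cc": parts[3]}

def lookup_dimensions(events):
    for e in events:
        e["country"] = COUNTRY_DIM.get(e["cc"], "unknown")   # enrich from a dimension table
        yield e

def sessionize(events):
    for e in events:
        # Naive rule: a new hour for the same user starts a new session.
        e["session"] = f'{e["user"]}-{e["ts"][:13]}'
        yield e

for event in sessionize(lookup_dimensions(parse(RAW_LOGS))):
    print(event)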
This document provides an overview of the Apache Hadoop ecosystem. It discusses key components like HDFS, MapReduce, YARN, Pig Latin, and performance tuning for MapReduce jobs. HDFS is introduced as the distributed file system that provides high throughput and scalability. MapReduce is described as the framework for distributed processing of large datasets across clusters. YARN is presented as an improvement over the static resource allocation in Hadoop 1.x. Pig Latin is demonstrated as a high-level language for expressing data analysis jobs. The document concludes by discussing extensions beyond MapReduce, like iterative processing and indexing approaches.
Experience SQL Server 2017: The Modern Data Platform - Bob Ward
This is an overview of SQL Server 2017 and its features and capabilities. You can get the recording at https://youtu.be/qgSEwpaRul0
This document provides an overview of Hadoop and how it can be used for data consolidation, schema flexibility, and query flexibility compared to a relational database. It describes the key components of Hadoop including HDFS for storage and MapReduce for distributed processing. Examples of industry use cases are also presented, showing how Hadoop enables affordable long-term storage and scalable processing of large amounts of structured and unstructured data.
Hive Training -- Motivations and Real World Use Cases - nzhang
Hive is an open-source data warehouse system based on Hadoop, a MapReduce implementation.
This presentation introduces the motivations for developing Hive and how Hive is used in real-world situations, particularly at Facebook.
PASS Summit - SQL Server 2017 Deep Dive - Travis Wright
Deep dive into SQL Server 2017 covering SQL Server on Linux, containers, HA improvements, SQL graph, machine learning, python, adaptive query processing, and much much more.
The document provides an overview of distributed computing and related technologies. It discusses the history of distributed computing including local, parallel, grid and distributed computing. It then discusses applications of distributed computing like web indexing and recommendations. The document introduces Hadoop and its core components HDFS and MapReduce. It also discusses related technologies like HBase, Mahout and challenges in designing distributed systems. It provides examples of using Mahout for machine learning tasks like classification, clustering and recommendations.
Big Data with Hadoop – For Data Management, Processing and Storing - IRJET Journal
This document discusses big data and Hadoop. It begins with defining big data and explaining its characteristics of volume, variety, velocity, and veracity. It then provides an overview of Hadoop, describing its core components of HDFS for storage and MapReduce for processing. Key technologies in Hadoop's ecosystem are also summarized like Hive, Pig, and HBase. The document concludes by outlining some challenges of big data like issues of heterogeneity and incompleteness of data.
What it takes to run Hadoop at Scale: Yahoo! Perspectives - DataWorks Summit
This document discusses considerations for scaling Hadoop platforms at Yahoo. It covers topics such as deployment models (on-premise vs. public cloud), total cost of ownership, hardware configuration, networking, software stack, security, data lifecycle management, metering and governance, and debunking myths. The key takeaways are that utilization matters for cost analysis, hardware becomes increasingly heterogeneous over time, advanced networking designs are needed to avoid bottlenecks, security and access management must be flexible, and data lifecycles require policy-based management.
The document discusses using Hadoop and Hive at Zing for log collecting, analyzing, and reporting. It provides an overview of Hadoop and Hive and how they are used at Zing to store and analyze large amounts of log and user data in a scalable, fault-tolerant manner. A case study is presented that describes how Zing evolved its log analysis system from using MySQL to using Scribe, Hadoop, and Hive to more efficiently collect, transform, analyze and report on log data.
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution - Etu Solution
Speaker: Informatica Senior Product Consultant | 尹寒柏
Session overview: In the Big Data era, what counts is not the quantity of data but the depth at which you understand it. Now that Big Data technologies have matured, CXOs without an IT background can turn CI (Customer Intelligence), once merely a buzzword, into a verb: moving from BI to CI, connecting with the pulse of the consumer economy and gaining insight into customer intent. One mindset to keep in the Big Data era is that, in the end, competition is not only about growth in data volume but about who understands the data more deeply, and Informatica is the answer. Informatica addresses the enormous pressure on enterprises to deliver trustworthy data on time; and as data volume and complexity keep rising, Informatica can also aggregate data faster, making it meaningful and usable for improving efficiency, quality, certainty, and competitive advantage. Informatica offers a faster and more effective way to reach this goal and is SYSTEX Group's tool of choice in the Big Data era.
This document discusses big data and Hadoop. It defines big data as large datasets that are difficult to process using traditional methods due to their volume, variety, and velocity. Hadoop is presented as an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. The key components of Hadoop are the Hadoop Distributed File System (HDFS) for storage and MapReduce as a programming model for distributed processing. A number of other technologies in Hadoop's ecosystem are also described such as HBase, Avro, Pig, Hive, Sqoop, Zookeeper and Mahout. The document concludes that Hadoop provides solutions for efficiently processing and analyzing big data.
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015 - Andrey Vykhodtsev
The document discusses big data concepts and Hadoop technologies. It provides an overview of massive parallel processing and the Hadoop architecture. It describes common processing engines like MapReduce, Spark, Hive, Pig and BigSQL. It also discusses Hadoop distributions from Hortonworks, Cloudera and IBM along with stream processing and advanced analytics on Hadoop platforms.
6th Session - Application areas in the search for advanced statistical te... - Jürgen Ambrosi
In this session we will see, with the usual hands-on demo approach, how to use the R language to perform value-added analyses.
We will experience first-hand the parallelization performance of the algorithms, a fundamental aspect in helping researchers reach their goals.
In this session we will be joined by Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
Introduction to Hadoop.
What are Hadoop, MapReduce, and the Hadoop Distributed File System?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, Mahout?
Hive is a data warehouse system built on top of Hadoop that allows users to query large datasets using SQL. It is used at Facebook to manage over 15TB of new data added daily across a 300+ node Hadoop cluster. Key features include using SQL for queries, extensibility through custom functions and file formats, and optimizations for performance like predicate pushdown and partition pruning.
This document provides a high-level overview of Hadoop and big data concepts for DBAs with SQL experience. It introduces key big data terminology like the four V's of big data, and discusses how Hadoop uses HDFS for distributed storage and MapReduce for distributed processing at massive scales. Example use cases like word counting are demonstrated using both SQL Server and the Pig framework in Hadoop.
3. What is cool?
big data
distributed systems
libs (algorithms, collections, network, multithreading, serialization, ...)
patterns, methodologies, best practices
trends
13. Upcoming presentations...
Distributed caching with HazelCast
Storm - real time stream processing
TDD - myth or good practice.
Handling failures in distributed systems
Serialization for everybody
Test your code. Always.
SQL Server Reporting Services - make your users happy and your life easier
14. Upcoming presentations...
Reading (un)real-time feeds in Event Platform
Distributed computing and clustering done right
ActiveMQ usage in a SEM's Live Transcript process.
33 things we did wrong. EP lessons learned.
Who does it better? GitFlow implemented in EP and SEM.
Why is Kafka a standard?
20. NoSQL (often interpreted as Not only SQL[1][2]) database provides a
mechanism for storage and retrieval of data that is modeled in means other
than the tabular relations used in relational databases
30. In 2006, Cutting went to work with Yahoo, which was
equally impressed by the Google File System and
MapReduce papers and wanted to build open source
technologies based on them
31. The transformation into Hadoop being “behind every click”
(or every batch process, technically) at Yahoo was pretty
much complete by 2008
32. By the time Yahoo spun out Hortonworks into a separate,
Hadoop-focused software company in 2011, Yahoo’s
Hadoop infrastructure consisted of 42,000 nodes and
hundreds of petabytes of storage
42. Hive is a data warehousing infrastructure based on
Hadoop. Hadoop provides massive scale out and fault
tolerance capabilities for data storage and processing
43. Example
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
44. Example
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN
friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';
47. Pig is a high level scripting language that is used with
Apache Hadoop. Pig excels at describing data analysis
problems as data flows. Pig is complete in that you can do
all the required data manipulations in Apache Hadoop with
Pig
48. Example
players = load 'baseball' as (name:chararray, team:chararray,
position:bag{t:(p:chararray)}, bat:map[]);
noempty = foreach players generate name,
((position is null or IsEmpty(position)) ? {('unknown')} :
position)as position;
pos = foreach noempty generate name, flatten(position) as position;
bypos = group pos by position;
53. When Would I Use Apache HBase?
Use Apache HBase™ when you need random, realtime read/write access to your
Big Data. This project's goal is the hosting of very large tables -- billions of rows X
millions of columns -- atop clusters of commodity hardware
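As a small client-side taste of that random, real-time access, here is a sketch using the Python happybase library (a Thrift-based HBase client); the table name, column family, and a locally reachable HBase Thrift server are assumptions.

import happybase

# Connect to an HBase Thrift server assumed to be running locally.
connection = happybase.Connection("localhost")
table = connection.table("web_pages")   # hypothetical table with column family "cf"

# Random, real-time write and read by row key.
table.put(b"row-example.com", {b"cf:title": b"Example Domain", b"cf:lang": b"en"})
row = table.row(b"row-example.com")
print(row[b"cf:title"])

# Scan a range of sorted row keys.
for key, data in table.scan(row_prefix=b"row-exa"):
    print(key, data)

connection.close()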
#2: to start, I'll tire you out a little…
we'll answer a few questions together…
I know, if you had known there would be questions, you wouldn't have come…, which is why I'm only saying it now
#5: show ourselves outside the company,
do you think there is nothing interesting to show?
well, just like when I hear that tests make no sense below 10k lines of code
#6: if not, there are two possibilities:
either you are wrong
or something is generally not right
#8: this can stem from various things:
lack of knowledge sharing - everyone sits in their own sandbox digging a hole with a toy shovel, while the room next door has an excavator
#11: 1. you are our future speakers… :)
2. there is a lot to be gained;
-respect
-presentation skills
-preparing a presentation can be very educational
-building your own personal brand
-a place for people who want to do this outside the company but have nowhere to try it first
We provide support:
-help with preparing the presentation
-choosing a topic - you want to show 'something' but don't have a topic and don't know what might interest other people? we'll find you a topic