A First Look at HCFS
Introduction to
Hadoop Compatible File System
Jazz Yao-Tsung Wang
Co-founder of Hadoop.TW
https://fb.com/groups/hadoop.tw
2017-01-21 Hadoop.TW & GCPUG.TW Meetup #1 2017
HELLO!
I am Jazz Wang
Co-Founder of Hadoop.TW.
Hadoop Evangelist since 2008.
Open Source Promoter. System Admin (Ops).
You can find me at @jazzwang_tw or
https://fb.com/groups/hadoop.tw ,
https://forum.hadoop.tw
1.
What is HCFS?
Let’s start with a brief introduction to Apache Hadoop
Apache Hadoop from 0.x to 1.x
[Diagram: one Master and Workers #1 to #3. Computation layer (MapReduce): a JobTracker and four TaskTrackers. Storage layer (HDFS): a NameNode and four DataNodes.]
Apache Hadoop from 2.x to 3.x
[Diagram: one Master and Workers #1 to #3. Computation layer (YARN): a ResourceManager and four NodeManagers, which run Containers. Storage layer (HDFS): a NameNode and four DataNodes.]
Needs / Trends:
Hadoop on the Cloud
http://www.slideshare.net/jazzwang/hadoop-deployment-model-osdctw
Why Hadoop on the Cloud ?
http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production
https://www.youtube.com/watch?v=XehH3iJJy3Q
Why might you need HCFS ...
https://www.facebook.com/groups/hadoop.tw/permalink/1061706333938741/?comment_id=1072414466201261&reply_comment_id=1073302882779086&comment_tracking={%22tn%22%3A%22R%22}
Spark / Hive / Impala ...
https://aws.amazon.com/lambda/
https://cloud.google.com/functions/
http://www.forbes.com/sites/janakirammsv/2016/02/09/google-brings-serverless-computing-to-its-cloud-platform/#76e1aa9425b8
Docker
Microservice
Serverless
NoOps !?!
$$$
[Diagram: one Master and Workers #1 to #3. Computation layer (YARN): a ResourceManager and four NodeManagers. Storage layer: HCFS in place of HDFS.]
What is HCFS ?
Hadoop Compatible File System
▸ Windows Azure Blob
▸ AWS S3
▸ Google Cloud Storage
▸ CephFS
HCFS implementations
- Cloud Storage Connector ( for Public Cloud Provider )
https://wiki.apache.org/hadoop/HCFS
AWS S3
▸ s3:// : Hadoop 0.10 ~ Hadoop 2.7
▸ s3n:// : Hadoop 0.18 ~ Hadoop 2.6
▸ s3a:// : Hadoop 2.7+
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/
AWS EMRFS
▸ ?? : 3rd party
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
Windows Azure Storage Blob
▸ wasb:// : Hadoop 2.7+
http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/
https://issues.apache.org/jira/browse/HADOOP-9629
Azure Data Lake
▸ adl:// : Hadoop 3.0+
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/
https://docs.microsoft.com/zh-tw/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
Google Cloud Storage
▸ gs:// : 3rd party, Hadoop 1.x and Hadoop 2.x
https://cloud.google.com/hadoop/google-cloud-storage-connector
https://github.com/GoogleCloudPlatform/bigdata-interop
HCFS implementations ( for Private Cloud Provider )
OpenStack Swift ( rackspace )
▸ swift:// : Hadoop 2.7+
https://issues.apache.org/jira/browse/HADOOP-8545
http://hadoop.apache.org/docs/r2.7.3/hadoop-openstack/
https://github.com/steveloughran/Hadoop-and-Swift-integration/
CephFS ( OpenStack )
▸ ceph:// : 3rd party, Hadoop 1.1.x
http://docs.ceph.com/docs/master/cephfs/hadoop/
https://github.com/houbin/cephfs-hadoop
Cassandra File System
▸ cfs:// : 3rd party
http://www.datastax.com/dev/blog/cassandra-file-system-design
http://www.datastax.com/resources/whitepapers/hdfs-vs-cfs
GlusterFS
▸ glusterfs:/// : 3rd party
https://github.com/gluster/glusterfs-hadoop
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Hadoop/
OrangeFS
▸ 3rd party, Hadoop 1.2.1 and Hadoop 2.6.0
http://docs.orangefs.com/v_2_8_8/index.htm#Hadoop_Client.htm
http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
QFS ( KFS )
▸ qfs:// : 3rd party
https://github.com/quantcast/qfs/wiki/Migration-Guide
Lustre
▸ 3rd party
http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
MapR File System
▸ 3rd party
https://www.mapr.com/products/mapr-fs
https://community.mapr.com/thread/7027
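The scheme lists above amount to a lookup from URI scheme to connector. A minimal Python sketch of that lookup follows; the dictionary `HCFS_SCHEMES` and the function `describe` are illustrative names only, and the entries are copied from the slides rather than from any Hadoop API.

```python
from urllib.parse import urlsplit

# URI scheme -> (storage system, availability), copied from the lists above.
# The table name and layout are hypothetical, not part of Hadoop.
HCFS_SCHEMES = {
    "s3": ("AWS S3", "Hadoop 0.10 ~ Hadoop 2.7"),
    "s3n": ("AWS S3", "Hadoop 0.18 ~ Hadoop 2.6"),
    "s3a": ("AWS S3", "Hadoop 2.7+"),
    "wasb": ("Windows Azure Storage Blob", "Hadoop 2.7+"),
    "adl": ("Azure Data Lake", "Hadoop 3.0+"),
    "gs": ("Google Cloud Storage", "3rd party"),
    "swift": ("OpenStack Swift", "Hadoop 2.7+"),
    "ceph": ("CephFS", "3rd party"),
    "cfs": ("Cassandra File System", "3rd party"),
    "glusterfs": ("GlusterFS", "3rd party"),
    "qfs": ("QFS ( KFS )", "3rd party"),
}


def describe(uri):
    """Return (storage system, availability) for a URI, or None if unknown."""
    scheme = urlsplit(uri).scheme
    return HCFS_SCHEMES.get(scheme)


if __name__ == "__main__":
    for uri in ("s3a://my-bucket/logs", "glusterfs:///data", "hdfs://namenode/tmp"):
        print(uri, "->", describe(uri))
```

Schemes that are not in the slides, such as hdfs://, simply return None here.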
HCFS Architecture
http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production
https://www.youtube.com/watch?v=XehH3iJJy3Q
New API
https://strata.oreilly.com.cn/hadoop-big-data-cn/public/schedule/detail/51169
http://www.slideshare.net/jazzwang/hadoop-69818883
▸ AWS S3: authentication support
▸ Azure Blob: supports encrypted keys
▸ CephFS does not work well with YARN because of JNI (Java Native Interface) :(
▸ Only HDFS and Azure Blob support HBase !!
2.
AWS S3
Use Case : Amazon EMR
Three generations of S3 support
s3://
▸ The ‘classic’ s3: filesystem.
▸ Introduced in Hadoop 0.10.0 (HADOOP-574).
▸ Deprecated and will be removed from Hadoop 3.0.
▸ Uploaded files can be larger than 5GB, but they are not interoperable with other S3 tools.
s3n://
▸ The second-generation s3n: filesystem, making it easy to share data between Hadoop and other applications via the S3 object store.
▸ Introduced in Hadoop 0.18.0 (HADOOP-930); rename support in Hadoop 0.19.0 (HADOOP-3361).
▸ For Hadoop 2.6 and earlier.
▸ Requires a compatible version of jets3t.
s3a://
▸ The third-generation s3a: filesystem, a replacement for s3n: that supports larger files and promises higher performance.
▸ Introduced in Hadoop 2.6.0 (HADOOP-11571).
▸ Recommended for Hadoop 2.7 and later.
▸ Requires the exact version of amazon-aws-sdk.
core-site.xml for s3://
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>AWS access key ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>AWS secret key</value>
</property>
core-site.xml for s3n://
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS access key ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS secret key</value>
</property>
core-site.xml for s3a://
<property>
  <name>fs.s3a.access.key</name>
  <value>AWS access key ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>AWS secret key</value>
</property>
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
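Because the three schemes use different property names, it can help to generate the credential block rather than hand-edit it. The following is a sketch only: the property names come from the snippets above, while `CREDENTIAL_KEYS` and `core_site_xml` are hypothetical helper names, and the output is a bare `<configuration>` element rather than a complete, merged core-site.xml.

```python
import xml.etree.ElementTree as ET

# Property names per scheme, taken from the core-site.xml snippets above.
CREDENTIAL_KEYS = {
    "s3": ("fs.s3.awsAccessKeyId", "fs.s3.awsSecretAccessKey"),
    "s3n": ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"),
    "s3a": ("fs.s3a.access.key", "fs.s3a.secret.key"),
}


def core_site_xml(scheme, access_key, secret_key):
    """Build a <configuration> element holding the two credential properties."""
    id_name, secret_name = CREDENTIAL_KEYS[scheme]
    root = ET.Element("configuration")
    for name, value in ((id_name, access_key), (secret_name, secret_key)):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; never commit real keys.
    print(core_site_xml("s3a", "AWS access key ID", "AWS secret key"))
```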
WARNING!!
1. You cannot use S3 as a replacement for HDFS.
2. Amazon S3 is an "object store":
▸ eventual consistency
▸ non-atomic rename and delete operations
3. Your AWS credentials are valuable:
▸ core-site.xml is readable cluster-wide
▸ Do not embed the credentials in the URI
▸ S3A supports more authentication mechanisms
4. Amazon's EMR service is based upon Apache Hadoop, but contains modifications and its own proprietary S3 client.
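To illustrate the advice about credentials in URIs: a URI of the form s3n://KEY:SECRET@bucket/path leaks the secret into logs and shell history. The sketch below, using only the Python standard library, strips such user-info before a URI is logged or stored; `strip_credentials` is a hypothetical helper name, not a Hadoop function.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(uri):
    """Return the URI without any embedded 'user:password@' part."""
    parts = urlsplit(uri)
    if parts.username is None and parts.password is None:
        return uri  # nothing embedded, leave untouched
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=host))


if __name__ == "__main__":
    # Dummy key and secret for illustration only.
    print(strip_credentials("s3n://AKIAEXAMPLE:notasecret@my-bucket/data/part-00000"))
```

Note that `hostname` is lower-cased by `urlsplit`, which is harmless for bucket names that are already lower case.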
DEMO
For Mac OS X + Homebrew
brew install hadoop
export HADOOP_CONF_DIR=/path/to/conf    # directory that contains core-site.xml
export HADOOP_CLASSPATH=/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/*
hadoop fs -ls s3n://${bucket}/
For Linux / Windows - use BigTop docker image
docker run -it --name hcfs -h hcfs -v $(pwd):/data jazzwang/bigtop-hdfs
# cd /data
/data# export HADOOP_CONF_DIR=/path/to/conf    # directory that contains core-site.xml
/data# hadoop fs -ls s3n://${bucket}/
Undocumented Secrets: debugging and workaround tricks
To enable more log4j messages, you could try:
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls s3n://${bucket}/
To access unofficial (S3-compatible) services such as hicloud S3 and Ceph S3 (RGW):
Using s3n://, you have to provide a jets3t.properties config file
$ cat jets3t.properties
s3service.s3-endpoint=s3.hicloud.net
s3service.https-only=false
Using s3a://, you could add the following to core-site.xml
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.hicloud.net</value>
  <description>default is s3.amazonaws.com</description>
</property>
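The two endpoint overrides above live in different files depending on the scheme. As a rough sketch, the helper below returns the file name and the text to put in it; `endpoint_settings` is a hypothetical name, and only the keys shown in the slides (`s3service.s3-endpoint`, `s3service.https-only`, `fs.s3a.endpoint`) are used.

```python
def endpoint_settings(scheme, endpoint, https_only=True):
    """Return (file name, text) that points s3n:// or s3a:// at a custom endpoint."""
    if scheme == "s3n":
        # s3n:// is configured through jets3t.properties.
        lines = [
            f"s3service.s3-endpoint={endpoint}",
            f"s3service.https-only={'true' if https_only else 'false'}",
        ]
        return "jets3t.properties", "\n".join(lines) + "\n"
    if scheme == "s3a":
        # s3a:// is configured through a core-site.xml property.
        xml = (
            "<property>\n"
            "  <name>fs.s3a.endpoint</name>\n"
            f"  <value>{endpoint}</value>\n"
            "</property>\n"
        )
        return "core-site.xml", xml
    raise ValueError(f"no endpoint override known for scheme {scheme!r}")


if __name__ == "__main__":
    for scheme in ("s3n", "s3a"):
        name, text = endpoint_settings(scheme, "s3.hicloud.net", https_only=False)
        print(f"# {name}\n{text}")
```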
3.
Windows Azure Storage Blob
Use Case : HDInsight / Azure Data Lake
Hadoop Azure Support: Azure Blob Storage
http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
1. hadoop-azure.jar is located at
- /usr/lib/hadoop-mapreduce/hadoop-azure.jar ( BigTop, CDH )
- ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure.jar ( official tar.gz, Mac brew )
2. Depends on Azure Storage SDK for Java -
https://github.com/Azure/azure-storage-java
3. Features
▸ Supports configuration of multiple Azure Blob Storage accounts.
▸ Supports both page blobs and block blobs.
▸ wasbs:// scheme for SSL encrypted access.
▸ Can act as a source of data in a MapReduce job, or a sink.
▸ Tested on both Linux and Windows.
4. Limitations
▸ The append operation is not implemented.
▸ File owner and group are persisted, but the permissions model is not enforced.
▸ File last access time is not tracked.
Configurations
In core-site.xml
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>YOUR ACCESS KEY</value>
</property>
Examples:
> hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
> hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
> hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
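The WASB URI and the account-key property name both follow the pattern container@account.blob.core.windows.net. A small sketch of that pattern, with hypothetical helper names `wasb_uri` and `account_key_property`, might look like this:

```python
BLOB_SUFFIX = "blob.core.windows.net"


def wasb_uri(container, account, path="", secure=False):
    """Build a wasb:// (or wasbs:// when secure=True) URI for a blob path."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.{BLOB_SUFFIX}/{path.lstrip('/')}"


def account_key_property(account):
    """Name of the core-site.xml property that holds the account access key."""
    return f"fs.azure.account.key.{account}.{BLOB_SUFFIX}"


if __name__ == "__main__":
    print(account_key_property("youraccount"))
    print(wasb_uri("yourcontainer", "youraccount", "testDir/testFile"))
    print(wasb_uri("yourcontainer", "youraccount", "testDir/testFile", secure=True))
```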
My Use Case :
rsync between local and wasb
Take advantage of hadoop distcp
- Backup
hadoop distcp -update ${SOURCE_DIR} wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR}
- Restore
hadoop distcp wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR} ${RESTORE_DIR}
Take Hadoop as an rsync tool to sync with Hybrid Cloud Storage
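If the backup and restore commands are driven from a script, building them as argument lists avoids shell-quoting mistakes. The sketch below only constructs the commands and does not run them; `backup_command`, `restore_command` and `wasb_dir` are hypothetical names, and the container, account and directory values are placeholders.

```python
import shlex

BLOB_SUFFIX = "blob.core.windows.net"


def wasb_dir(container, account, directory):
    """wasb:// URI of a directory inside a container."""
    return f"wasb://{container}@{account}.{BLOB_SUFFIX}/{directory.strip('/')}"


def backup_command(source_dir, container, account, backup_dir):
    """argv for 'hadoop distcp -update <source> <wasb target>'."""
    return ["hadoop", "distcp", "-update", source_dir,
            wasb_dir(container, account, backup_dir)]


def restore_command(container, account, backup_dir, restore_dir):
    """argv for 'hadoop distcp <wasb source> <restore target>'."""
    return ["hadoop", "distcp", wasb_dir(container, account, backup_dir), restore_dir]


if __name__ == "__main__":
    print(shlex.join(backup_command("/data/logs", "yourcontainer", "youraccount", "backup")))
    print(shlex.join(restore_command("yourcontainer", "youraccount", "backup", "/data/restore")))
```

Each list can be handed to `subprocess.run(cmd, check=True)` on a machine where the `hadoop` command and the WASB configuration are available.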
Use Case in TenMax:
Read / Write files from/to Azure Blob Storage
[Diagram: a Spring Boot web application uses the Hadoop FileSystem class as a file system abstraction layer, configured through core-site.xml, in front of Azure Blob Storage / Cloud Storage.]
Take Hadoop as a Java Library to access Hybrid Cloud Storage
4.
Ceph
High Level Architecture of Hadoop 2.x with CephFS
[Diagram: one Master and Workers #1 to #3. Computation layer (YARN): a ResourceManager and four NodeManagers. Storage layer (Ceph): Mon and four OSDs.]
[Diagram: hosts on a virtual network ( hub ): hdfs01 192.168.1.239, hdfs02 192.168.1.238, hdfs03 192.168.1.237, hdfs04 192.168.1.236, node11 192.168.1.201, node21 192.168.1.211, node31 192.168.1.221. Roles shown: one Ceph mon, four Ceph OSDs, one ResourceManager, three NodeManagers.]
CephFS installation
http://docs.ceph.com/docs/master/cephfs/hadoop/
1. Compile https://github.com/ceph/cephfs-hadoop
2. Copy cephfs-hadoop.jar to ${HADOOP_HOME}/lib/
3. Copy ceph.conf and ceph.client.${ID}.keyring to /etc/ceph
4. Copy cephfs-java.jar to ${HADOOP_HOME}/lib/
5. Copy JNI related files to ${HADOOP_HOME}/lib/native/
ln -s libcephfs.so.1 /usr/lib/hadoop/lib/native/libcephfs.so
ln -s libcephfs_jni.so.1 /usr/lib/hadoop/lib/native/libcephfs_jni.so
Known Issue :
MRAppMaster cannot find cephfs_jni
Root Cause :
There is no -Djava.library.path for MRAppMaster
G.G
Official support is limited to Hadoop 1.1.x
http://docs.ceph.com/docs/master/cephfs/hadoop/
Why does it work for MRv1??
Let’s take a look at MapReduce v1 Architecture
Why doesn’t it work on YARN??
Let’s take a look at YARN Architecture
Without correct configuration,
HCFS or YARN applications that use JNI will fail :(
http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
How to solve this issue ?
The official documentation and source code say so ...
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Native_Shared_Libraries
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml#L267
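Following the hint in the warning message, one possible direction is to pass the native library directory through LD_LIBRARY_PATH instead of -Djava.library.path. The sketch below generates the corresponding mapred-site.xml properties. It rests on assumptions: the directory /usr/lib/hadoop/lib/native is taken from the ln -s commands in the CephFS installation steps, `mapreduce.admin.user.env` is the property named in the warning, and `yarn.app.mapreduce.am.env` is the MapReduce ApplicationMaster environment property from mapred-default.xml. Whether this is sufficient for cephfs_jni was not verified in the slides.

```python
import xml.etree.ElementTree as ET

# Assumption: native libraries (including libcephfs_jni.so) live here,
# as in the ln -s commands of the CephFS installation steps.
NATIVE_DIR = "/usr/lib/hadoop/lib/native"

# Environment properties for task JVMs and for the MR ApplicationMaster.
ENV_PROPERTIES = ("mapreduce.admin.user.env", "yarn.app.mapreduce.am.env")


def native_env_properties(native_dir=NATIVE_DIR):
    """Build mapred-site.xml properties that export LD_LIBRARY_PATH."""
    value = f"LD_LIBRARY_PATH={native_dir}"
    root = ET.Element("configuration")
    for name in ENV_PROPERTIES:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(native_env_properties())
```

Setting `mapreduce.admin.user.env` replaces its default value, so the directory given here should also contain Hadoop's own native libraries.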
Conclusion
▸ S3 and WASB are the most mature HCFS implementations.
▹ Sorry that I’m not sure about Google Cloud Storage :(
▸ You’ll need more integration tests for the Hadoop ecosystem when using HCFS.
Take Hadoop as an rsync tool to sync with Hybrid Cloud Storage
Take Hadoop as a Java Library to access Hybrid Cloud Storage
THANKS!
Any questions?
You can find me at @jazzwang_tw &
https://fb.com/groups/hadoop.tw
CREDITS
Special thanks to all the people who made and released these
awesome resources for free:
▸ Presentation template by SlidesCarnival
▸ Photographs by Death to the Stock Photo (license)
PRESENTATION DESIGN
This presentation uses the following typefaces and colors:
▸ Titles: Montserrat
▸ Body copy: Karla
You can download the fonts on this page:
http://www.google.com/fonts/#UsePlace:use/Collection:Montserrat:400,700|Karla:400,400italic,700,700italic
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 

Introduction to HCFS

  • 1. HCFS 初探 (A First Look at HCFS): Introduction to Hadoop Compatible File System. Jazz Yao-Tsung Wang, Co-founder of Hadoop.TW, https://fb.com/groups/hadoop.tw. 2017-01-21, Hadoop.TW & GCPUG.TW Meetup #1 2017
  • 2. HELLO! I am Jazz Wang, Co-Founder of Hadoop.TW. Hadoop Evangelist since 2008. Open Source Promoter. System Admin (Ops). You can find me at @jazzwang_tw or https://fb.com/groups/hadoop.tw , https://forum.hadoop.tw
  • 3. 1. What is HCFS? Let’s start with a brief introduction to Apache Hadoop
  • 4. Apache Hadoop from 0.x to 1.x. [Diagram: four nodes (Master, Worker #1, Worker #2, Worker #3). Computation layer: MapReduce, with a JobTracker on the Master and a TaskTracker on every node. Storage layer: HDFS, with a NameNode on the Master and a DataNode on every node.]
  • 5. Apache Hadoop from 2.x to 3.x. [Diagram: four nodes (Master, Worker #1, Worker #2, Worker #3). Computation layer: YARN, with a ResourceManager on the Master and a NodeManager on every node, running Containers. Storage layer: HDFS, with a NameNode on the Master and a DataNode on every node.]
  • 6. Needs / Trends: Hadoop on the Cloud http://www.slideshare.net/jazzwang/hadoop-deployment-model-osdctw
  • 7. Why Hadoop on the Cloud? http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production https://www.youtube.com/watch?v=XehH3iJJy3Q
  • 8. Why might you need HCFS ... https://www.facebook.com/groups/hadoop.tw/permalink/1061706333938741/?comment_id=1072414466201261&reply_comment_id=1073302882779086&comment_tracking={%22tn%22%3A%22R%22}
  • 11. What is HCFS? Hadoop Compatible File System. [Diagram: four nodes (Master, Worker #1, Worker #2, Worker #3). Computation layer: YARN, with a ResourceManager on the Master and a NodeManager on every node. Storage layer: HCFS, backed by Windows Azure Blob, AWS S3, Google Cloud Storage, or CephFS.]
  • 12. HCFS implementations: Cloud Storage Connector (for Public Cloud Provider). https://wiki.apache.org/hadoop/HCFS
      ▸ AWS S3. https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/
          ▹ s3:// : Hadoop 0.10 ~ Hadoop 2.7
          ▹ s3n:// : Hadoop 0.18 ~ Hadoop 2.6
          ▹ s3a:// : Hadoop 2.7+
      ▸ AWS EMRFS. Scheme: ?? . 3rd party. http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
      ▸ Windows Azure Storage Blob. wasb:// : Hadoop 2.7+. http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/ https://issues.apache.org/jira/browse/HADOOP-9629
      ▸ Azure Data Lake. adl:// : Hadoop 3.0+. https://hadoop.apache.org/docs/current/hadoop-azure-datalake/ https://docs.microsoft.com/zh-tw/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
      ▸ Google Cloud Storage. gs:// : 3rd party, Hadoop 1.x and Hadoop 2.x. https://cloud.google.com/hadoop/google-cloud-storage-connector https://github.com/GoogleCloudPlatform/bigdata-interop
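The scheme prefix of a path is what selects the connector, so the table on this slide can be read as a lookup. The following is only a sketch of that idea in plain Python: the `HCFS_SCHEMES` dictionary and the `describe` helper are hypothetical names introduced here, the entries are copied from the slide, and the bucket name is a placeholder.

```python
from urllib.parse import urlsplit

# Scheme -> (storage service, Hadoop support), copied from the slide.
HCFS_SCHEMES = {
    "s3": ("AWS S3", "Hadoop 0.10 ~ Hadoop 2.7"),
    "s3n": ("AWS S3", "Hadoop 0.18 ~ Hadoop 2.6"),
    "s3a": ("AWS S3", "Hadoop 2.7+"),
    "wasb": ("Windows Azure Storage Blob", "Hadoop 2.7+"),
    "adl": ("Azure Data Lake", "Hadoop 3.0+"),
    "gs": ("Google Cloud Storage", "3rd party (Hadoop 1.x, Hadoop 2.x)"),
}


def describe(uri):
    """Return (service, support) for an HCFS URI, or None if the scheme is not listed."""
    return HCFS_SCHEMES.get(urlsplit(uri).scheme)


# "example-bucket" is a hypothetical bucket name.
print(describe("s3a://example-bucket/logs/"))
```

A path whose scheme is not in the table (for example `hdfs://`) simply yields `None` here.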
  • 13. HCFS implementations (for Private Cloud Provider)
      ▸ OpenStack Swift (Rackspace). swift:// : Hadoop 2.7+. https://issues.apache.org/jira/browse/HADOOP-8545 http://hadoop.apache.org/docs/r2.7.3/hadoop-openstack/ https://github.com/steveloughran/Hadoop-and-Swift-integration/
      ▸ CephFS (OpenStack). ceph:// : 3rd party, Hadoop 1.1.x. http://docs.ceph.com/docs/master/cephfs/hadoop/ https://github.com/houbin/cephfs-hadoop
      ▸ Cassandra File System. cfs:// : 3rd party. http://www.datastax.com/dev/blog/cassandra-file-system-design http://www.datastax.com/resources/whitepapers/hdfs-vs-cfs
      ▸ GlusterFS. glusterfs:/// : 3rd party. https://github.com/gluster/glusterfs-hadoop https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Hadoop/
      ▸ OrangeFS. 3rd party, Hadoop 1.2.1 and Hadoop 2.6.0. http://docs.orangefs.com/v_2_8_8/index.htm#Hadoop_Client.htm http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
      ▸ QFS (KFS). qfs:// : 3rd party. https://github.com/quantcast/qfs/wiki/Migration-Guide
      ▸ Lustre. 3rd party. http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
      ▸ MapR File System. 3rd party. https://www.mapr.com/products/mapr-fs https://community.mapr.com/thread/7027
  • 16. https://strata.oreilly.com.cn/hadoop-big-data-cn/public/schedule/detail/51169 http://www.slideshare.net/jazzwang/hadoop-69818883 Notes on the comparison: AWS S3 has authentication support. Azure Blob supports encrypted keys. CephFS does not work well with YARN because of JNI (Java Native Interface) :( Only HDFS and Azure Blob support HBase!!
  • 17. 2. AWS S3 Use Case : Amazon EMR
  • 18. Three generations of S3 support: s3://, s3n://, s3a://
      ▸ s3:// : The ‘classic’ s3: filesystem.
          ▹ Introduced in Hadoop 0.10.0 (HADOOP-574); deprecated and will be removed from Hadoop 3.0.
          ▹ Uploaded files can be larger than 5GB, but they are not interoperable with other S3 tools.
          ▹ core-site.xml:
              <property> <name>fs.s3.awsAccessKeyId</name> <value>AWS access key ID</value> </property>
              <property> <name>fs.s3.awsSecretAccessKey</name> <value>AWS secret key</value> </property>
      ▸ s3n:// : The second-generation s3n: filesystem, making it easy to share data between Hadoop and other applications via the S3 object store.
          ▹ Introduced in Hadoop 0.18.0 (HADOOP-930); rename support in Hadoop 0.19.0 (HADOOP-3361); for Hadoop 2.6 and earlier.
          ▹ Requires a compatible version of jets3t.
          ▹ core-site.xml:
              <property> <name>fs.s3n.awsAccessKeyId</name> <value>AWS access key ID</value> </property>
              <property> <name>fs.s3n.awsSecretAccessKey</name> <value>AWS secret key</value> </property>
      ▸ s3a:// : The third-generation s3a: filesystem, a replacement for s3n: that supports larger files and promises higher performance.
          ▹ Introduced in Hadoop 2.6.0 (HADOOP-11571); recommended for Hadoop 2.7 and later.
          ▹ Requires the exact version of amazon-aws-sdk.
          ▹ core-site.xml:
              <property> <name>fs.s3a.access.key</name> <value>AWS access key ID</value> </property>
              <property> <name>fs.s3a.secret.key</name> <value>AWS secret key</value> </property>
      https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
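Since the three schemes differ only in the property names they read, a small generator can keep the three core-site.xml variants straight. This is a sketch, not part of the talk: `CREDENTIAL_KEYS` and `core_site_xml` are hypothetical helper names, the property names are the ones listed on this slide, and the key values are placeholders.

```python
import xml.etree.ElementTree as ET

# Credential property names per scheme, as listed on the slide.
CREDENTIAL_KEYS = {
    "s3": ("fs.s3.awsAccessKeyId", "fs.s3.awsSecretAccessKey"),
    "s3n": ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"),
    "s3a": ("fs.s3a.access.key", "fs.s3a.secret.key"),
}


def core_site_xml(scheme, access_key, secret_key):
    """Build a core-site.xml document holding the credentials for one scheme."""
    id_name, secret_name = CREDENTIAL_KEYS[scheme]
    root = ET.Element("configuration")
    for name, value in ((id_name, access_key), (secret_name, secret_key)):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; real keys should never be committed or shared.
print(core_site_xml("s3a", "AWS access key ID", "AWS secret key"))
```

Writing the returned string to `core-site.xml` in the directory that `HADOOP_CONF_DIR` points to would be the natural next step.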
  • 19. WARNING!!
      1. You cannot use S3 as a replacement for HDFS.
      2. Amazon S3 is an "object store":
          ▸ eventual consistency
          ▸ non-atomic rename and delete operations
      3. Your AWS credentials are valuable:
          ▸ core-site.xml is readable cluster-wide
          ▸ Don’t embed the credentials in the URI
          ▸ S3A supports more authentication mechanisms
      4. Amazon's EMR Service is based upon Apache Hadoop, but contains modifications and their own, proprietary, S3 client.
      https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
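To see why a non-atomic rename matters, it may help to model a "directory" rename on an object store as a per-object copy followed by a delete. The toy class below is purely illustrative and assumes that model; `ToyObjectStore`, `rename_prefix`, and `fail_after` are hypothetical names, and no real S3 client is involved.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: a flat map from key to bytes."""

    def __init__(self):
        self.objects = {}

    def rename_prefix(self, src, dst, fail_after=None):
        """Emulate a directory rename as per-object copy then delete.

        fail_after simulates the client dying after that many objects were moved.
        """
        moved = 0
        for key in sorted(k for k in self.objects if k.startswith(src)):
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError("client died mid-rename")
            self.objects[dst + key[len(src):]] = self.objects[key]  # copy
            del self.objects[key]                                   # delete
            moved += 1


store = ToyObjectStore()
store.objects = {"tmp/part-0": b"a", "tmp/part-1": b"b"}
try:
    store.rename_prefix("tmp/", "out/", fail_after=1)
except RuntimeError:
    pass
print(sorted(store.objects))  # the data is now split across both prefixes
```

On HDFS a directory rename is a single metadata operation, so a job that commits its output by renaming never leaves this half-moved state behind; on an object store it can.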
  • 20. DEMO
      For Mac OS X:
          brew install hadoop
          export HADOOP_CONF_DIR=${PATH of core-site.xml}
          export HADOOP_CLASSPATH=/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/*
          hadoop fs -ls s3n://${bucket}/
      For Linux / Windows, use the BigTop docker image:
          docker run -it --name hcfs -h hcfs -v $(pwd):/data jazzwang/bigtop-hdfs
          # cd /data
          /data# export HADOOP_CONF_DIR=${PATH of core-site.xml}
          /data# hadoop fs -ls s3n://${bucket}/
      https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
  • 21. Undocumented Secrets: debugging and workaround tips
      To enable more log4j messages, you could try:
          export HADOOP_ROOT_LOGGER=DEBUG,console
          hadoop fs -ls s3n://${bucket}/
      To access unofficial S3 services such as hicloud S3 and Ceph S3 (RGW):
      ▸ Using s3n://, you have to provide a config file named jets3t.properties:
          $ cat jets3t.properties
          s3service.s3-endpoint=s3.hicloud.net
          s3service.https-only=false
      ▸ Using s3a://, you could add the following to core-site.xml:
          <property> <name>fs.s3a.endpoint</name> <value>s3.hicloud.net</value> <description>default is s3.amazonaws.com</description> </property>
  • 22. 3. Windows Azure Storage Blob Use Case : HDInsight / Azure Data Lake
  • 23. Hadoop Azure Support: Azure Blob Storage
      1. hadoop-azure.jar is located at:
          - /usr/lib/hadoop-mapreduce/hadoop-azure.jar (bigtop, CDH)
          - ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure.jar (official tar.gz, Mac brew)
      2. Depends on the Azure Storage SDK for Java:
          - https://github.com/Azure/azure-storage-java
      3. Features
          ▸ Supports configuration of multiple Azure Blob Storage accounts.
          ▸ Supports both page blobs and block blobs.
          ▸ wasbs:// scheme for SSL encrypted access.
          ▸ Can act as a source of data in a MapReduce job, or a sink.
          ▸ Tested on both Linux and Windows.
      4. Limitations
          ▸ The append operation is not implemented.
          ▸ File owner and group are persisted, but the permissions model is not enforced.
          ▸ File last access time is not tracked.
      http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
  • 24. Configurations
      In core-site.xml:
          <property> <name>fs.azure.account.key.youraccount.blob.core.windows.net</name> <value>YOUR ACCESS KEY</value> </property>
      Examples:
          > hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
          > hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
          > hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
      http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
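A wasb:// path and its account-key property share the same host name, which is easy to get wrong by hand. The sketch below assumes the `container@account.blob.core.windows.net` URI layout described in the hadoop-azure documentation linked on this slide; `wasb_uri` and `account_key_property` are hypothetical helper names, and the container and account names are placeholders.

```python
from urllib.parse import urlsplit


def wasb_uri(container, account, path, secure=False):
    """Build a wasb:// URI (or wasbs:// for SSL encrypted access) for a blob path."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def account_key_property(uri):
    """Name of the core-site.xml property that must hold the access key for this URI."""
    return "fs.azure.account.key." + urlsplit(uri).hostname


uri = wasb_uri("yourcontainer", "youraccount", "testDir/testFile", secure=True)
print(uri)
print(account_key_property(uri))
```

Deriving the property name from the URI makes it harder for the two to drift apart when several storage accounts are configured.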
  • 25. My Use Case: rsync between local and wasb. Use Hadoop as an rsync tool to sync with Hybrid Cloud Storage.
      Take advantage of hadoop distcp:
      - Backup
          hadoop distcp -update ${SOURCE_DIR} wasb://[email protected]/${BACKUP_DIR}
      - Restore
          hadoop distcp wasb://[email protected]/${BACKUP_DIR} ${RESTOR_DIR}
      http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
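If the backup and restore commands are to be scripted, building them as argument lists avoids shell-quoting surprises. This is a minimal sketch, assuming only the two `hadoop distcp` invocations shown on this slide; the function names, the container and account in the URI, and the local paths are all hypothetical.

```python
import shlex


def distcp_backup(source_dir, remote_dir):
    """Arguments for a backup; -update copies only files missing or different at the target."""
    return ["hadoop", "distcp", "-update", source_dir, remote_dir]


def distcp_restore(remote_dir, restore_dir):
    """Arguments for a restore from the remote store to a local or HDFS directory."""
    return ["hadoop", "distcp", remote_dir, restore_dir]


# Hypothetical locations; substitute your own container, account, and paths.
remote = "wasb://backup@example.blob.core.windows.net/nightly"
print(shlex.join(distcp_backup("/data/reports", remote)))
print(shlex.join(distcp_restore(remote, "/data/restore")))
```

The lists can be handed to `subprocess.run(..., check=True)` on a machine where the `hadoop` command and the account key are configured.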
  • 26. Use Case in TenMax: Read / Write files from/to Azure Blob Storage. Use Hadoop as a Java Library to access Hybrid Cloud Storage. [Diagram: a Spring Boot web application uses the Hadoop FileSystem API as a file system abstraction layer, configured through core-site.xml, to reach Azure Blob Storage and other cloud storage.]
  • 28. High Level Architecture of Hadoop 2.x with CephFS. [Diagram: four nodes (Master, Worker #1, Worker #2, Worker #3). Computation layer: YARN, with a ResourceManager on the Master and a NodeManager on every node. Storage layer: Ceph, with three Mons and four OSDs.]
  • 29. [Diagram: test environment on a virtual network (hub). Hosts: hdfs01 192.168.1.239, hdfs02 192.168.1.238, hdfs03 192.168.1.237, hdfs04 192.168.1.236, node11 192.168.1.201, node21 192.168.1.211, node31 192.168.1.221. Roles shown: one Ceph mon, four Ceph OSDs, one ResourceManager, three NodeManagers.]
  • 30. CephFS installation
      1. Compile https://github.com/ceph/cephfs-hadoop
      2. Copy cephfs-hadoop.jar and place it at ${HADOOP_HOME}/lib/
      3. Copy ceph.conf and ceph.client.${ID}.keyring to /etc/ceph
      4. Copy cephfs-java.jar to ${HADOOP_HOME}/lib/
      5. Copy JNI related files to ${HADOOP_HOME}/lib/native/
          ln -s libcephfs.so.1 /usr/lib/hadoop/lib/native/libcephfs.so
          ln -s libcephfs_jni.so.1 /usr/lib/hadoop/lib/native/libcephfs_jni.so
      http://docs.ceph.com/docs/master/cephfs/hadoop/ https://github.com/ceph/cephfs-hadoop
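Because the installation touches several directories, a quick check for what is still missing can save a debugging round. The sketch below only encodes the file locations named in the steps on this slide; `missing_cephfs_files` is a hypothetical helper, and the client ID `admin` is an assumption for illustration.

```python
from pathlib import Path


def missing_cephfs_files(hadoop_home, client_id, ceph_dir="/etc/ceph"):
    """List the files from the installation steps that are not in place yet."""
    home, ceph = Path(hadoop_home), Path(ceph_dir)
    expected = [
        home / "lib" / "cephfs-hadoop.jar",
        home / "lib" / "cephfs-java.jar",
        home / "lib" / "native" / "libcephfs.so",
        home / "lib" / "native" / "libcephfs_jni.so",
        ceph / "ceph.conf",
        ceph / f"ceph.client.{client_id}.keyring",
    ]
    # Path.exists() follows symlinks, so a dangling link is reported as missing too.
    return [str(p) for p in expected if not p.exists()]


# "admin" is a hypothetical client ID; /usr/lib/hadoop matches the ln -s targets in step 5.
for path in missing_cephfs_files("/usr/lib/hadoop", "admin"):
    print("missing:", path)
```

An empty result only means the files exist; it does not verify that the JNI libraries are actually loadable by the JVM.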
  • 31. Known Issue: MRAppMaster cannot find cephfs_jni
  • 32. Root Cause : There is no -Djava.library.path for MRAppMaster
  • 33. Root Cause : There is no -Djava.library.path for MRAppMaster
  • 34. G.G Official support is limited to Hadoop 1.1.x http://docs.ceph.com/docs/master/cephfs/hadoop/
  • 35. Why does it work for MRv1?? Let’s take a look at the MapReduce v1 architecture
  • 36. Why doesn’t it work on YARN?? Let’s take a look at YARN Architecture
  • 37. Without the correct configuration, an HCFS or YARN application that uses JNI will fail :( http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
  • 38. How to solve this issue? The official documentation and source code say so:
      WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
      https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Native_Shared_Libraries https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml#L267
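The warning boils down to one rule: native library paths belong in `LD_LIBRARY_PATH` (via `mapreduce.admin.user.env`), not in `-Djava.library.path` inside the JVM options. A small lint over a job configuration can mimic that rule. This is a sketch modeled on the warning text, not Hadoop's own check: `native_path_warnings` is a hypothetical helper, only the one option key quoted in the warning is checked by default, and the native path is taken from the CephFS installation slide.

```python
# Only the key quoted in the warning is checked by default; pass more keys if needed.
JAVA_OPTS_KEYS = ("mapreduce.admin.map.child.java.opts",)


def native_path_warnings(conf, keys=JAVA_OPTS_KEYS):
    """Flag JVM option settings that carry -Djava.library.path."""
    return [
        f"{key}: set LD_LIBRARY_PATH via mapreduce.admin.user.env instead"
        for key in keys
        if "-Djava.library.path" in conf.get(key, "")
    ]


bad = {"mapreduce.admin.map.child.java.opts": "-Djava.library.path=/usr/lib/hadoop/lib/native"}
good = {"mapreduce.admin.user.env": "LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native"}
print(native_path_warnings(bad))
print(native_path_warnings(good))  # no warnings for the recommended form
```

Whether the MRAppMaster itself picks up the same environment depends on the cluster's settings, so the result should still be confirmed with an actual job run.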
  • 39. Conclusion ▸ S3 and WASB are the most mature HCFS implementations. ▹ Sorry that I’m not sure about Google Cloud Storage :( ▸ You’ll need more integration tests for the Hadoop Ecosystem when using HCFS. ▸ Use Hadoop as an rsync tool to sync with Hybrid Cloud Storage. ▸ Use Hadoop as a Java Library to access Hybrid Cloud Storage.
  • 40. THANKS! Any questions? You can find me at @jazzwang_tw & https://fb.com/groups/hadoop.tw
  • 41. CREDITS Special thanks to all the people who made and released these awesome resources for free: ▸ Presentation template by SlidesCarnival ▸ Photographs by Death to the Stock Photo (license) PRESENTATION DESIGN This presentation uses the following typographies and colors: ▸ Titles: Montserrat ▸ Body copy: Karla You can download the fonts on this page: http://www.google.com/fonts/#UsePlace:use/Collection:Montserrat:400,700|Karla:400,400italic,700,700italic