Scaling HDFS at Xiaomi

Chen Zhang

Outline
• Introduction to Xiaomi
• Scenarios and challenges
• Improvements on HDFS federation
• Experience scaling up a single NameNode
• Efficient management of hundreds of clusters

About Xiaomi
World’s 4th largest smartphone maker
Sold 118 million phones in 2018

About Xiaomi
World’s largest consumer IoT platform
Over 150 million smart devices connected

Software and Internet Services
• MIUI
• MiPay/Finance
• App Market
• Ads
• MiCloud
• Game
• MiPush
• Smart Home
• News Feeds
• …

Scenarios
[Diagram: HDFS, HBase, EMQ, Yarn, Talos, FDS (S3), Spark, Hive, Impala]

Scenarios
[Diagram: MiCloud, MiPush, Feeds, User Profile, Talos, Ads]
Online Services
• 100+ independent clusters
• Low latency
• High availability
Offline Services (Hadoop)
• Several huge clusters
• High throughput
• High scalability, high availability

Data Growth
Data Growth of the Largest Cluster
Year | File count (10 million) | Data size (PB)
2015 | 2 | 3
2016 | 23 | 30
2017 | 41 | 60
2018 | 71 | 150

Challenges
• Challenges in late 2016
– Data growth is too fast
– Dependencies are too complex
– Code changes are almost impossible

What We Need
We need a single huge HDFS cluster

Improvements on HDFS Federation
• Problems with HDFS Federation in late 2016
– NameNodes are independent; metadata is not shared
– MountTable is configured on the client side and is hard to maintain
– MountTable does not support nested mount points
– ViewFileSystem is not compatible with DistributedFileSystem
– RBF was not yet stable or fully functional in late 2016

Original HDFS Federation
[Diagram: viewfs on top of NameNodes NN-1 … NN-k … NN-n (Namespace layer: NS 1 … NS k … foreign NS n). Block Storage layer: block pools Pool 1 … Pool k … Pool n on common storage shared by Datanode 1, Datanode 2 … Datanode m. Directory tree: /, user, yarn, hive, service1, service2, plus many small directories and small services]

[Diagram: the same federation layout with an added NN-0 / NS 0 and its block pool next to NN-1 … NN-k … NN-n; Datanode 1, Datanode 2, Datanode 3 … Datanode m; directory tree /, user, yarn, hive, service1, service2]
• Support nested mount points
• hdfs:// -> FederatedDFSFileSystem, which extends DistributedFileSystem
• Add a default namespace
• Support rename across namespaces
• Compatible with hdfs://, so no code changes are needed
• Update MountTable config from ZooKeeper

Nested Mount table and Default NameSpace
1. Xiaomi is not only a hardware company but also an Internet company, and it grows very fast
2. There are more than 100 internet services; new businesses and services emerge quickly, built on our smart devices and more than 300 million users
3. A fixed, pre-divided mount table is hard for us to use

Nested Mount table and Default NameSpace
[Diagram: NN-0, NN-1 … NN-k … NN-n with namespaces NS 0, NS 1 … NS k … NS n; directory tree /, user, yarn, hive, service1, service2; new paths /some_new_nosql_service, /user/live_show_services, /user/short_video_services]
1. At first, we divided the initial mount points by data volume and QPS. Only about a dozen mount points need to be configured, for the largest services; all others fall into the default namespace
2. When new infrastructure services and internet services emerge, the mount table does not need any update
3. HADOOP-13055 supports linkFallback, but our solution is more flexible
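The resolution rule described on this slide (nested mount points plus a default namespace) can be sketched roughly as a longest-prefix lookup. This is a minimal illustration, not the FederatedDFSFileSystem code: the function name, the mount table contents, and the namespace names ns0 … ns3 are all hypothetical.

```python
from typing import Dict

def resolve_namespace(path: str, mount_table: Dict[str, str], default_ns: str) -> str:
    """Return the namespace that owns `path` (hypothetical helper).

    The deepest matching mount point wins, so nested mount points override
    their parents. Paths with no matching mount point fall into the default
    namespace, so new top-level services need no mount table update.
    """
    parts = [p for p in path.split("/") if p]
    # Try the longest candidate prefix first, comparing whole path components.
    for depth in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in mount_table:
            return mount_table[prefix]
    return default_ns

# Hypothetical mount table: a handful of large services, one nested entry.
MOUNT_TABLE = {
    "/user/service1": "ns1",
    "/user/service1/logs": "ns2",  # nested under /user/service1
    "/hive": "ns3",
}
DEFAULT_NS = "ns0"
```

With this table, /user/service1/logs/2018 resolves to ns2, /user/service1/data to ns1, and an unlisted path such as /user/live_show_services to the default namespace.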

Client Transparency
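[Diagram: ViewFileSystem (/user/service1, /user/service2) compared with FederatedDFSFileSystem]
• Client config: fs.hdfs.impl=FederatedDFSFileSystem
• Client accesses hdfs://clustername/user/service1
• Client fetches the mount table from ZooKeeper and watches it for changes
• Admin tool updates the mount table in ZooKeeper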
Client Transparency
RPC integration
• listStatus
• getContentSummary
• setQuota/getQuota
Admin Tools
• refreshNodes
• setBalancerBandwidth
• DataNode decommission
[Diagram: NN-0, NN-1 … NN-k … NN-n with namespaces NS 0, NS 1 … NS k … NS n; directory tree /, user, yarn, hive, service1, service2; trash path /user/service1/.Trash/]
Trash optimization
• moveToTrash is a rename operation
• moveToTrash across NameNodes is very expensive
Rename Across NameSpaces
[Diagram: Client, namenode1 and namenode2; the source is locked; on datanode1, datanode2 and datanode3, blocks in blockpool1 are hardlinked into blockpool2 (Link block)]

Rename Across NameSpaces in Detail
Source Phase 1
1. Sanity Check.
• Existence
• Permission
• Can’t be a reserved directory
• Can’t be a symlink
• Not in encryption zones
2. Serialize the inode tree and block information with Protobuf
• Name
• Permissions
• mtime/atime
• Replication factor
• Block locations
• Acl / Xattr / Quota …

Source Phase 1
3. Lock the directory
• Add a FederationRenameFeature that records the renameId and the source and destination paths
• With FederationRenameFeature, all sub-directories and files in this directory, and all inodes on the parent path, are not writable
4. Add a federation-rename record
5. Return the serialized data to client
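The locking rule in step 3 can be sketched as a path check: a write to a path is refused if the path lies inside a locked directory or is one of the inodes on its parent path. This is only an illustration with plain path strings; the helper names are hypothetical, and the real check presumably works on inodes inside the NameNode.

```python
from typing import Iterable

def _is_ancestor_or_same(a: str, b: str) -> bool:
    """True if path `a` equals `b` or is an ancestor directory of `b`."""
    a = a.rstrip("/") or "/"
    b = b.rstrip("/") or "/"
    return a == b or a == "/" or b.startswith(a + "/")

def is_write_blocked(path: str, locked_dirs: Iterable[str]) -> bool:
    """Hypothetical check: may the inode at `path` be modified right now?

    `locked_dirs` stands in for the directories that currently carry a
    FederationRenameFeature.
    """
    for locked in locked_dirs:
        inside_locked_subtree = _is_ancestor_or_same(locked, path)
        on_parent_path = _is_ancestor_or_same(path, locked)
        if inside_locked_subtree or on_parent_path:
            return True
    return False
```

Sibling directories such as /user/service1/other stay writable while /user/service1/data is being renamed, which keeps the lock scope small.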

Dest Phase 1
1. Sanity Check
• permission, quota, not in encryption zones
2. Deserialize the inode-tree, graft it to the destination path
• Allocate inode id for each inode
• Allocate block id and new GS for each block
• Update acl and other features

Dest Phase 1
3. Lock the directory
• Also use FederationRenameFeature
4. Update quota count
5. Add a federation-rename record
6. Return a list of block information, including:
• srcBlockId, destBlockId, blockSize, srcGenStamp, destGenStamp for each block
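A rough sketch of the block list returned in step 6 follows. The record fields mirror the names on the slide; everything else is an assumption made for illustration: the function name, sequential block id allocation, and the use of one new generation stamp for all blocks of a rename.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterable, List, Tuple

@dataclass(frozen=True)
class BlockLink:
    src_block_id: int
    dest_block_id: int
    block_size: int
    src_gen_stamp: int
    dest_gen_stamp: int

def plan_block_links(src_blocks: Iterable[Tuple[int, int, int]],
                     next_block_id: int,
                     dest_gen_stamp: int) -> List[BlockLink]:
    """Map each source block to a newly allocated destination block.

    `src_blocks` yields (block_id, size, gen_stamp) tuples taken from the
    deserialized inode tree. Ids are allocated sequentially starting at
    `next_block_id` (an assumption for this sketch).
    """
    ids = count(next_block_id)
    return [BlockLink(block_id, next(ids), size, gen_stamp, dest_gen_stamp)
            for block_id, size, gen_stamp in src_blocks]
```

The client can then hand this list to the DataNodes, which need both ids and both generation stamps to locate the source block file and name the hardlinked copy.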

Link Block
1. For each DN, send request in batch
• Create new block file by hardlink, one by one
• With a total operation timeout
2. Using a ThreadPoolExecutor
3. For each block, count as complete if at least 2/3 replicas succeed
• Slow DN will not affect the total progress
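The Link Block step can be sketched as follows: one batched request per DataNode, issued from a thread pool with an overall timeout, and a block counted as complete once at least 2/3 of its replicas are linked. This is a simplified illustration; `link_batch` is a hypothetical stand-in for the DataNode RPC, and the pool size and timeout values are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List

def link_blocks(replicas: Dict[int, List[str]],
                link_batch: Callable[[str, List[int]], List[int]],
                timeout_s: float = 30.0,
                max_workers: int = 8) -> Dict[int, bool]:
    """Return {block_id: complete?} after trying to hardlink every replica.

    `replicas` maps a block id to the DataNodes holding it. `link_batch(dn,
    block_ids)` returns the block ids it linked successfully on that DataNode.
    """
    # Group blocks per DataNode so each DataNode receives one batched request.
    per_dn: Dict[str, List[int]] = {}
    for block_id, dns in replicas.items():
        for dn in dns:
            per_dn.setdefault(dn, []).append(block_id)

    linked = {block_id: 0 for block_id in replicas}
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = [pool.submit(link_batch, dn, blocks) for dn, blocks in per_dn.items()]
    # Total operation timeout: requests still running afterwards are ignored.
    done, _pending = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)

    for future in done:
        if future.exception() is None:
            for block_id in future.result():
                linked[block_id] += 1

    # Complete if at least 2/3 of the replicas succeeded (integer arithmetic).
    return {b: 3 * linked[b] >= 2 * len(replicas[b]) for b in replicas}
```

Because a failed or timed-out DataNode only reduces the replica count, a single slow DataNode does not hold back blocks that already reached the 2/3 threshold.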

Source Phase 2
1. Delete the source directory/file
2. Delete all the inodes and blocks asynchronously
3. Remove federation-rename record
Dest Phase 2
1. Remove FederationRenameFeature, making the target directory visible
2. Remove federation-rename record

Error Handling
Failed at | How to handle | Result
Source Phase 1 | Fail | Fail
Dest Phase 1 | Cancel source phase 1 | Fail
Link Block | Request fails; NameNode Fixer will redo the remaining steps | Will eventually succeed
Source Phase 2 | Request fails; NameNode Fixer will redo the remaining steps | Will eventually succeed
Dest Phase 2 | Request fails; NameNode Fixer will redo the remaining steps | Will eventually succeed
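The recovery policy in this table can be written down as a small decision function: failures before Link Block abort the rename, failures from Link Block onward are rolled forward. The enum, function names, and action strings below are hypothetical and only restate the table.

```python
from enum import Enum
from typing import List, Tuple

class Step(Enum):
    SOURCE_PHASE_1 = 1
    DEST_PHASE_1 = 2
    LINK_BLOCK = 3
    SOURCE_PHASE_2 = 4
    DEST_PHASE_2 = 5

def on_failure(failed_at: Step) -> Tuple[str, str]:
    """Return (how to handle, final result) for a failure at `failed_at`."""
    if failed_at is Step.SOURCE_PHASE_1:
        return ("fail", "fail")
    if failed_at is Step.DEST_PHASE_1:
        return ("cancel source phase 1", "fail")
    # From Link Block onward the rename is rolled forward, never undone.
    return ("fixer redoes remaining steps", "succeeds eventually")

def remaining_steps(failed_at: Step) -> List[Step]:
    """Steps the fixer would redo, starting with the step that failed."""
    return [step for step in Step if step.value >= failed_at.value]
```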

Error Handling
NameNode Failover and Restart
1. All operations are written to the edit log
2. FederationRenameFeature is serialized to the FsImage
3. Federation-rename records are not serialized to the FsImage; they are rebuilt during log replay or FsImage loading (if an inode has a FederationRenameFeature, a federation-rename record is added)
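Point 3 can be illustrated with a short sketch: after loading, the in-memory federation-rename records are rebuilt by scanning for inodes that carry the feature. Inodes are modeled here as plain dictionaries, and the key names are hypothetical; only the recorded fields (renameId, source path, destination path) come from the slides.

```python
from typing import Dict, Iterable, Tuple

def rebuild_rename_records(inodes: Iterable[dict]) -> Dict[str, Tuple[str, str]]:
    """Rebuild {rename_id: (source_path, dest_path)} from loaded inodes."""
    records: Dict[str, Tuple[str, str]] = {}
    for inode in inodes:
        feature = inode.get("federation_rename")
        if feature is not None:
            records[feature["rename_id"]] = (feature["src"], feature["dst"])
    return records
```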

Scaling up NameNodes
Our Largest NameNode
1. 150GB heap
2. Use CMS GC
3. More than 500 million objects (240 million files and 260 million
blocks)
4. More than 20000 QPS

Scaling up NameNodes
Experience
• Throttle
– BlockReport / Incremental-BlockReport throttle
– Concurrent GetContentSummary throttle
• Lock optimization
• Config optimization
• Add more tracing information

Block Report Throttle
• Problem: Full GC when the NameNode starts up
– Thousands of DNs send block reports at almost the same time
– The NameNode can only process one block report at a time
• Throttle the maximum number of concurrent block reports; extra reports are rejected, and the DN retries later
[Diagram: many DNs sending block reports to one NameNode, labeled 60%]
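The throttle can be sketched with a bounded semaphore and a non-blocking acquire: a report either gets a slot immediately or is rejected so that the sender retries later, instead of queueing in NameNode memory. The class and method names are hypothetical; this is not the NameNode implementation.

```python
import threading
from typing import Callable

class BlockReportThrottle:
    """Admit at most `max_concurrent` block reports at a time."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_process(self, process_report: Callable[[], None]) -> bool:
        """Run `process_report` if a slot is free; return False if rejected."""
        # Non-blocking acquire: reject right away rather than wait in a queue.
        if not self._slots.acquire(blocking=False):
            return False
        try:
            process_report()
            return True
        finally:
            self._slots.release()
```

A caller that receives False would back off and resend the report, which spreads the startup load over time.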

Other optimization
• Lock optimization on expensive operations
– When processing a block report, release and re-acquire the lock for every storage
– When processing getContentSummary, release the lock every N files
• Config optimization
– More handlers
– Longer heartbeat interval
– Longer full block report interval
– Disable retry cache and access time
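The lock optimization for getContentSummary can be sketched as a long scan that briefly gives up the lock every N files so other operations can run in between. The function name and the flat list of file sizes are simplifications for illustration; the trade-off is that the tree may change while the lock is released, so the summary is not a strict snapshot.

```python
import threading
from typing import Iterable, Tuple

def content_summary(file_sizes: Iterable[int],
                    lock: threading.Lock,
                    yield_every: int = 1000) -> Tuple[int, int]:
    """Return (file count, total bytes), yielding the lock every N files."""
    total_files = 0
    total_bytes = 0
    lock.acquire()
    try:
        for size in file_sizes:
            total_files += 1
            total_bytes += size
            if total_files % yield_every == 0:
                # Let waiting operations in, then continue the scan.
                lock.release()
                lock.acquire()
    finally:
        lock.release()
    return total_files, total_bytes
```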

More tracing information
• Record Operations that hold the FSNamesystem lock too long
• Record QPS on both the server side and the client side, and push the data to our internal monitoring system
• Record failure reason and statistics of block allocation failure
• Add log for slow block report processing
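The first bullet (recording operations that hold the FSNamesystem lock too long) can be sketched as a small wrapper that times each lock hold and records the slow ones. The names, the threshold, and the `slow_ops` list standing in for a log or metrics sink are all hypothetical.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, List, Tuple

@contextmanager
def traced_lock(lock: threading.Lock,
                op_name: str,
                threshold_s: float,
                slow_ops: List[Tuple[str, float]],
                clock: Callable[[], float] = time.monotonic):
    """Hold `lock` for the body and record the operation if it held too long."""
    lock.acquire()
    start = clock()
    try:
        yield
    finally:
        held = clock() - start
        lock.release()
        # Record after releasing so the bookkeeping does not extend the hold.
        if held > threshold_s:
            slow_ops.append((op_name, held))
```

Usage would look like `with traced_lock(fs_lock, "getContentSummary", 1.0, slow_ops): ...`, with the collected entries pushed to whatever monitoring system is in place.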

How We Efficiently Manage 100+ Clusters
• We use HBase heavily at Xiaomi
• 20~30 HBase clusters for sensitive services and businesses in each datacenter
• With the rapid growth of the global business, there are now more than 5 datacenters distributed around the world
• The total number of clusters also grows very quickly, which makes them hard to maintain

How We Efficiently Manage 100+ Clusters
• Initially, each cluster ran its own Canary
[Diagram: cluster-1, cluster-2, cluster-3 … cluster-n, each with its own Canary]

Efficiently manage 100+ clusters
[Diagram: the ClusterOne Monitor System runs Canary Tasks and Balancer Tasks for cluster-1, cluster-2, cluster-3 … cluster-n; other labels: ZooKeeper, NameService, metrics, generated configuration]
