
QUESTION 1:

1. Discuss the Hadoop Distributed File System (HDFS)


HDFS (Hadoop Distributed File System) is a file system designed to store extremely large
files with a streaming data access pattern, and it runs on commodity hardware.

 Extremely large files: Files in the range of hundreds of gigabytes, terabytes and even
petabytes (1 PB = 1000 TB).
 Streaming data access pattern: HDFS is designed around the principle of write-once,
read-many-times. Once data is written, large portions of the dataset can be processed any
number of times.
 Commodity hardware: Hardware that is inexpensive and easily available in the market.
This is one of the features that distinguishes HDFS from many other file systems.
Nodes: Master and slave nodes typically form the HDFS cluster.
 MasterNode:
 Manages all the slave nodes and assigns work to them.
 It executes file system namespace operations like opening, closing and renaming files
and directories.
 It should be deployed on reliable, high-specification hardware, not on
commodity hardware.
 SlaveNode:
 The actual worker nodes, which do the actual work like reading, writing, processing,
etc.
 They also perform block creation, deletion and replication upon instruction from the
master.
 They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
 NameNode:
 Runs on the master node.
 Stores metadata (data about data) like the file path, the number of blocks, block IDs, etc.
 Requires a high amount of RAM.
 Stores the metadata in RAM for fast retrieval, i.e. to reduce seek time, though a
persistent copy of it is kept on disk.
 DataNode:
 Runs on the slave nodes.
 Requires a large amount of disk space, as the data is actually stored here.
Data storage in HDFS:
Let's assume a 100 TB file is inserted. The master node (namenode) will first divide the file into
blocks (the default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored
across different datanodes (slave nodes). The datanodes (slave nodes) replicate the blocks among
themselves, and the information about which blocks they hold is sent to the master. The default
replication factor is 3, meaning three replicas are kept for each block (including the original).
In hdfs-site.xml we can increase or decrease the replication factor, i.e. we can edit this
configuration there, as sketched below.
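For illustration only, here is a minimal sketch of the relevant hdfs-site.xml fragment, assuming a standard Hadoop 2.x installation; dfs.replication and dfs.blocksize are the stock property names, and the values shown are arbitrary examples, not recommendations:

<!-- hdfs-site.xml: illustrative fragment only; values are examples -->
<configuration>
  <!-- Number of replicas kept for each block (default is 3) -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Block size in bytes (268435456 = 256 MB; default is 128 MB) -->
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
</configuration>

New files written after such a change pick up the new settings; files already in HDFS keep the replication they were written with unless it is changed explicitly (e.g. with hdfs dfs -setrep).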
The master node has a record of everything: it knows the location and info of every single
datanode and the blocks it contains, i.e. nothing is done without the permission of the
master node.
Why divide the file into blocks?
Answer: Let's assume we don't divide the file. It is very difficult to store a 100 TB file on a
single machine, and even if we could, every read and write operation on that whole file would
incur a very high seek time. If instead we have multiple blocks of 128 MB, it becomes easy
to perform the various read and write operations on them compared to doing it on the whole
file at once. So we divide the file to get faster data access, i.e. to reduce seek time.
Why replicate the blocks in data nodes while storing?
Let's assume we don't replicate, and only one copy of a given block is present, on datanode D1.
If datanode D1 crashes we will lose that block, which will make the overall data inconsistent
and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
 HeartBeat: The signal that each datanode continuously sends to the namenode. If the
namenode doesn't receive a heartbeat from a datanode, it considers that datanode dead.
 Balancing: If a datanode crashes, the blocks present on it are gone too, and those
blocks will be under-replicated compared to the remaining blocks. The master
node (namenode) then signals datanodes containing replicas of the lost blocks to
replicate them, so that the overall distribution of blocks is balanced again.
 Replication: It is done by the datanodes, under instruction from the namenode (see the
command sketch below).
Note: No two replicas of the same block are present on the same datanode.
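To make these terms concrete, here is a small, hedged sketch of standard HDFS shell commands one might use to inspect replication and datanode health; the file and directory paths are hypothetical examples:

# Copy a local file into HDFS (paths are examples)
hdfs dfs -put bigfile.csv /data/bigfile.csv

# Report the blocks, their locations and replication for that file
hdfs fsck /data/bigfile.csv -files -blocks -locations

# Show live/dead datanodes and overall cluster capacity
hdfs dfsadmin -report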
Features:
 Distributed data storage.
 Blocks reduce seek time.
 The data is highly available as the same block is present at multiple datanodes.
 Even if multiple datanodes are down we can still do our work, thus making it highly
reliable.
 High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it doesn't work
well.
 Low latency data access: Applications that require low-latency access to data, i.e. in the
range of milliseconds, will not work well with HDFS, because HDFS is designed for
high throughput of data even at the cost of latency.
 Small file problem: Having lots of small files results in lots of seeks and lots of
movement from one datanode to another to retrieve each small file, which is a
very inefficient data access pattern.
QUESTION 2:
2. How might cyber security affect the development and implementation of the
Internet of Things (IoT), especially in Kenya? (4 marks)
 Possible gaps in firewall and security settings: - When companies do not build
security directly into IoT devices before introducing the hardware into the system, and
instead bolt security around the device almost as an afterthought, the security is not sound
and cannot be tested ahead of time, which increases the chance of gaps in
the firewall and security settings.

 Security that is not sound out of the gate, opening the entire network up to the outside: -
When new hardware is brought into the fold, it often becomes the target of hackers
attempting to poke and prod for weaknesses.
QUESTION 3
3. Blockchain has drawn attention as the next-generation financial technology due to
its security that suits the informatization era. Justify the statement and also explain
its challenges with appropriate examples

Blockchain's distributed, consensus-based architecture eliminates single points of failure and reduces the
need for data intermediaries such as transfer agents, messaging system operators and inefficient
monopolistic utilities. Platforms such as Ethereum also enable the implementation of secure application
code designed to be tamper-proof against fraud and malicious third parties, making it very
difficult to hack or manipulate.
Challenges facing blockchain

 Scalability: - Legacy transaction processing networks are known to process thousands of
transactions in a second. Conversely, blockchain networks are considerably slow when it
comes to transactions per second. As an indication, among the most prominent public
blockchains, the Bitcoin blockchain can process three to seven transactions per second,
and Ethereum can handle approximately 20 transactions in a second.
 Interoperability: - With over 2,300 cryptocurrencies and thousands of projects that are
leveraging distributed ledger technology, numerous blockchain networks have floated to
the surface. Most of these blockchains work in silos and do not communicate with the
other peer-to-peer networks. There are no standards that allow for seamless interaction
between these blockchain projects.
 Limited Developer Supply: - Every instance of groundbreaking technology requires
time for the developer community to adopt it, and for educational institutions to introduce
relevant courses. The blockchain landscape is currently in its infancy, and therefore
suffers from an acute shortage of skilled developers. The lack of an adequately trained
and skilled workforce for managing the complexity of peer-to-peer networks further
translates into a sluggish rate of innovation.
 Standardization: - With the wide variety of networks that exist today, there are no
universal standards for blockchain applications. Standardization can help reduce costs,
develop more efficient consensus mechanisms, and introduce interoperability. The lack of
such uniformity across blockchain protocols aggravates the problem of onboarding
new developers and takes away consistency from basic processes like security,
making mass adoption very difficult. This has become a barrier to entry for
professionals as well as investors.
 Energy-Intensive: - Proof-of-work (PoW) was the first consensus mechanism for
validating transactions while eliminating the need for centralization, and was introduced by
Bitcoin's blockchain. These protocols require participants to submit proof of 'work' by solving
computationally hard puzzles, which requires tremendous computing power. While proof-
of-work paves the way for disintermediation by offering a trustless and distributed
consensus, it also consumes huge amounts of energy (a rough sketch of the hashing loop
follows below).
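To illustrate why proof-of-work is so compute-hungry, the following is a minimal, illustrative sketch only (not any real blockchain's mining code): a miner repeatedly hashes the block data together with an incrementing nonce until the SHA-256 digest starts with a required number of zero characters. All names and values are made up for the example, and Java 11+ is assumed for String.repeat.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class PowSketch {
    // Hash an input string and return its SHA-256 digest as hex.
    static String sha256Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String blockData = "example transactions"; // hypothetical block contents
        int difficulty = 5;                        // required leading zero hex digits
        String target = "0".repeat(difficulty);

        long nonce = 0;
        String hash;
        // Brute-force search: this loop is what burns the energy.
        do {
            hash = sha256Hex(blockData + nonce);
            nonce++;
        } while (!hash.startsWith(target));

        System.out.println("Found nonce " + (nonce - 1) + " -> " + hash);
    }
}

Raising the difficulty by one digit multiplies the expected number of hash attempts by sixteen; this exponential knob is what real networks tune, and it is why mining at scale consumes so much energy.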

QUESTION 4
4. There are a number of security and privacy risks associated with cloud computing.
Discuss at least five of them and explain how each must be adequately addressed
Data breaches: - Cloud providers are an attractive target for hackers because massive amounts of
data are stored in the cloud. How severe an attack is depends upon the confidentiality of
the data which is exposed. Exposed financial or other business information is damaging, but
the damage is especially severe if the exposed information is personal health
information, trade secrets or the intellectual property of a person or an organization. When a
data breach happens, companies may be fined, lawsuits may be filed against them, and
criminal charges may follow. Breach investigations and customer notifications can also pile
up significant costs.
Compromised credentials and broken authentication: - Many cloud applications are
geared towards user collaboration, but free software trials and open sign-up
expose cloud services to malicious users. Several serious attack types can ride in on a
download or sign-in: DoS attacks, email spam, automated click fraud, and pirated
content are only a few of them. Your cloud provider is responsible for strong incident
response structures to detect and remediate this source of attack. IT is responsible for
checking the quality of that structure and for monitoring its own cloud environment for
abuse of resources.
Network security: - In SaaS, sensitive enterprise data is processed and stored by the SaaS
provider. To avoid leakage of confidential information, data travelling over the internet
must be secured. Strong encryption of network traffic must be used to secure the data in
transit.
DoS attacks: - One cannot entirely prevent denial-of-service attacks; one can only mitigate
their effect. DoS attacks overwhelm the resources of a cloud service so that clients cannot
access their data or applications. Politically motivated attacks get the headlines, but hackers
are just as likely to launch DoS attacks for malicious goals, including extortion. Moreover,
when a DoS attack happens in a cloud computing environment, compute usage charges go
through the roof. The cloud provider ought to reverse the charges, but negotiating over what
was an attack and what wasn't takes extra time and aggravation. Most cloud providers are set
up to fend off DoS attacks, which takes constant monitoring and instant mitigation.
Account hijacking: - You may have seen an email that looks legitimate. You click on a
link, at which point sirens blare and warning lights flash as your antivirus
program goes to battle. Or you may have been genuinely unlucky and had no
clue that you were just the victim of a phishing attack. When a user picks a
weak password, or clicks on a link in a phishing attempt, they are at real
risk of becoming the channel for a serious threat to their data. Cloud-based accounts are
no exception. Establish strong two-factor authentication and automate strong passwords
and password cycling to help secure yourself against this sort of cyber attack.

QUESTION 5
5 “I think [block chain] is a fascinating area to keep an eye out for, but I think
it’s being over-hyped right now… from the aspect of its short-term impact
because there are still technical things that you need to solve and scale and
there are still counter-aspects – business model wise – that aren’t necessarily
fully clear.”
What is your take on the statement by bringing out a proper and deep support for or against the
statement with clear and well explained examples?
Answer
Blockchain technology has gained significant attention due to its various use cases and potential for
disruption, but there are a number of shortcomings that limit it in the short term. Some of the
drawbacks that make blockchain inefficient or risky today include:
 Transaction privacy leakage
Because user behaviour in a blockchain is traceable, blockchain systems need to protect the
transaction privacy of users. In practice, users are expected to use a fresh key pair (address) for each
transaction so that attackers cannot determine whether the cryptocurrency in different transactions is
received by the same user. Unfortunately, the privacy measures in blockchains are not very robust; some
research has found that 66% of sampled transactions did not contain any mixins or chaff coins, which are
what prevent attackers from inferring the linkage of the coins spent by a transaction.
 Double spending
This refers to a user spending the same cryptocurrency multiple times in different transactions. An
attacker can leverage a race attack to achieve double spending: the attacker exploits the interval
between the initiation of two conflicting transactions and their confirmation to launch the attack.
Before the second transaction is mined and recognized as invalid, the attacker has already obtained
the output of the first transaction, resulting in double spending.
 Private key security
Once a user's private key is lost, it cannot be recovered. If the private key is stolen by criminals, the
user's blockchain account can be tampered with by others, and because there is no centralized institution
that manages the blockchain, it is difficult to track the criminal's behaviour or recover the modified
blockchain information.

Solving the above-mentioned problems would make blockchain a genuinely fascinating area to keep an
eye on. Until then, the technology remains susceptible to attack through vulnerabilities such as the ones
highlighted, which supports the view that its short-term impact is currently over-hyped.

QUESTION 6
6 How is Hadoop related to Big Data? Describe its components

Big Data and Hadoop are closely related to one another, but they are not the same thing:
Big Data is the problem, and Hadoop is a solution to that problem.

Big data:

Big Data is simply a huge amount of data. Big Data is generally measured in terms of petabytes,
and it keeps growing with the large volumes of data that arrive daily. It includes regular streams
of data, like the attendance records of employees in organizations and companies, as well as
records from social network websites such as images, likes, videos and shares. New data is
constantly added to the already existing huge volume. It is not easy to store Big Data in a single
place, as it would require an enormous amount of space.

Hadoop:

Before Hadoop, data was stored in traditional data storage systems, i.e. RDBMSs. But an
RDBMS can store only structured data, which limits its usefulness here; vast amounts of
unstructured data cannot be handled well by an RDBMS. Hadoop emerged as a solution to
this problem. HDFS, which is Hadoop's storage layer, can reliably store any kind of data at
massive scale, which makes it well suited to ever-growing data. Hadoop stores all kinds of data,
in tremendous volumes, and it can also process that data efficiently.

In other words, Hadoop splits big data across multiple servers for storage and, when necessary,
fetches the data from those servers and presents it as a single whole.
Components of Hadoop
1) MapReduce - Distributed Data Processing Framework of Apache Hadoop:

Hadoop MapReduce is a Java-based framework, based on a programming model introduced by Google,
in which the actual data in the HDFS store gets processed efficiently. MapReduce breaks a big data
processing job down into smaller tasks: it analyzes large datasets in parallel in the map phase before
reducing the intermediate results to produce the final output. In the Hadoop ecosystem, Hadoop
MapReduce runs on the YARN architecture. The YARN-based Hadoop architecture supports parallel
processing of huge data sets, and MapReduce provides the framework for easily writing applications
that run across thousands of nodes, with fault and failure management taken care of (see the sketch
below).
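As a hedged illustration of the programming model, here is a minimal sketch of the classic word-count job written against the standard org.apache.hadoop.mapreduce API; the class name and the input/output paths are example choices, not part of any prescribed setup. The mapper emits (word, 1) pairs and the reducer sums the counts for each word:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum all counts received for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // example path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // example path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a JAR and submitted with something like hadoop jar wordcount.jar WordCount, after which YARN schedules the map and reduce tasks across the cluster's nodes.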

2) Hadoop Distributed File System (HDFS) –

The default big data storage layer for Apache Hadoop is HDFS. HDFS is the "secret sauce" of the
Apache Hadoop components, as users can dump huge datasets into HDFS and the data will sit
there until the user wants to leverage it for analysis. The HDFS component creates several
replicas of each data block, distributed across the cluster for reliable and quick data
access. HDFS comprises three important components: NameNode, DataNode and Secondary
NameNode. HDFS operates on a master-slave architecture model where the NameNode acts as
the master node, keeping track of the storage cluster, and the DataNodes act as slave nodes on
the various machines within a Hadoop cluster.

3) Hadoop Common -

The Apache Foundation provides a pre-defined set of utilities and libraries that can be used by other
modules within the Hadoop ecosystem. For example, if HBase and Hive want to access HDFS,
they need to make use of the Java archive (JAR) files that are stored in Hadoop Common.

4) YARN

YARN forms an integral part of Hadoop 2.0. YARN is a great enabler for dynamic resource
utilization in the Hadoop framework, as users can run various Hadoop applications without having to
worry about increasing workloads.
QUESTION 7
7 It’s important for companies to formally establish and publish their policies
regarding forensic investigations.
i. Give 4 (four) aspects or areas that these policies should address
 Database Forensics and eDiscovery
 Email and Social Media Forensics
 Computer/Disk Drive forensics
 Cyber Terrorism

ii. Give 4 (four) benefits that the company can get from establishing these
policies
 The company is able to mitigate the risk of sampling
 Enables the team to compare relevant data collected from different
sources, helping complete the big picture when investigating a certain
crime
 Helps in understanding control environment
 Helps in containing network costs
QUESTION 8
8 Discuss trust issues in cloud computing and explain how they can be addressed.
Trust issues in cloud computing arise in the following areas, each examined below:
 Privileged access
 Data Handling
 Technology
 IT Operations
 Governance
 Audit and Compliance

How to address Trust issues in cloud computing


Cloud computing has brought in tremendous opportunities to the IT world. Startups to large
enterprises view cloud as a desirable option, thanks to the advantages of resource sharing, on
demand provisioning and pay per use model. As we all know, the cloud represents a large pool of
resources unified through virtualization, which is capable of scaling up with the requirements
and accessible from anywhere anytime. Cloud also brings in increased efficiency and rapid
deployment of services within a matter of minutes rather than months for the clients without
them even worrying about the underlying infrastructure or maintenance costs. However, despite
these opportunities and benefits, trust concerns are holding back many global CXOs from
moving aggressively to the cloud. The concerns regarding trust and security start
arising as soon as the organizational data leave the designated internal firewall and move towards
a public cloud domain. In short, trust management and information security have become a major
challenge in this age of advanced technologies and anywhere anytime connectivity.
Trust in the cloud is a rather fuzzy concept for which there is no globally accepted definition.
In the cloud environment, trust involves two aspects: trust management for the service
provider and for the cloud service requester. Trust is the factor which ensures a reliable
experience for users without them becoming concerned about the security of their assets in the
cloud. On the service provider side, trust covers factors like the credibility of the service,
data security, performance and availability. In addition, cloud vendors expect cloud
resources to be protected and to be utilized by trustworthy customers. Weak links in the trust
chain, for example vendors subcontracting work without informing the customer, or a lack of
proper information on contractual terms, can create serious implications for the client
organization. In short, trust becomes the most crucial factor in deciding whether the engagement
between the cloud service provider and the service requester is right.

How to build a trusted cloud ecosystem?


Creating confidence and trust in the cloud is significant for both the cloud service provider and
the service requester. For this, organizations should frame a comprehensive and well-
organized cloud strategy, secured with proper checks and balances and keeping future requirements in
mind, when allocating cloud infrastructure to customers or receiving a service from a
provider. This helps organizations to assess, monitor, improve and enhance their operations in the
cloud environment and thereby efficiently deal with any trust issues within the cloud.
Cultivating a trusted cloud environment helps organizations to address client concerns in a
healthier way and can transform the general concerns around cloud into new opportunities. From
a customer point of view, organizations have to weigh the full range of risks in their on-premise
environment against those of externally hosted cloud environments. This approach will enable
them to maintain a similar risk exposure, particularly as data and functions are moved from
an internal to an external environment.
Let us examine the key factors that contribute to a trusted cloud environment.
Privileged access: Who has access to what, and at what level?
As information moves to cloud service providers, end users require different levels of
access, from both internal and external data centers via remote access, to perform the tasks
associated with their various roles. A lack of proper access control policies and guidelines can result in
unauthorized access and mishandling of sensitive data, especially in this age of revolutionary
concepts like Bring Your Own Device (BYOD). A centralized access management
solution which is affordable, secure, easily manageable and always available is required. This
allows authorized end users and service providers to access any cloud device or platform
securely. Information stored in the cloud environment must be secured, and access to the
information must be limited on a 'need to know' basis. Organizations must place emphasis
on reviewing access control policies in a timely manner and dedicate enough time
to increasing awareness around access control policies and concepts such as privileged access
management, segregation of duties, 'need to know', etc. All these measures help reduce
trust violations and improve trust among cloud customers.
Data Handling: Are appropriate controls, including encryption and proper segregation, in
place?
To build trust in the cloud, both the service provider and the service requester should have a
clear understanding of the data stored, processed and accessed in the cloud. In order to ensure
trusted data handling within a cloud environment, all information must be properly classified and
segregated, and access to the information or resources must be limited by enforcing policies
that grant privileged access to data only on a 'need to know' basis. Organizations must give
enough focus to data handling so that a customer's requirements around data handling within the
cloud, as well as the respective regional regulations on how data is stored and handled, are
taken care of.
Technology: Is the technology foolproof enough to rely on?
The technology base the cloud is built upon is significant for both service providers and
requesters in keeping data assets safe. If the virtualization technology or the third-party data
center technologies are not secure enough to store and handle customer data, this will impact the
trust factor. From a technology point of view, cloud service providers should maintain compliance
with proper technical controls, industry certifications and robust virtualization technologies
for their customers. They should monitor vulnerabilities and provide audit reports at regular
intervals. Service requesters must evaluate a cloud service provider based on factors
including infrastructure management, scalability, security controls and the encryption in place.

IT Operations: Are the IT operations clear and transparent?


While providing cloud services, cloud service providers have to give a complete picture of
their policies, infrastructure, support/maintenance procedures and DR strategies as part of their IT
operations. They should outline potential risks, if any, and the mitigation measures to address them.
With transparency and appropriate communication on both sides, enterprises can deploy better
cloud solutions to address complex technology problems.
Governance: Does it adhere to a proper governance model?
To build trust in the cloud, organizations need to have a proper governance model in place
comprising the cloud infrastructure, third-party entities and in-house resources. Cloud service
providers should ensure that users are educated on the significance of adhering to the cloud
governance model. Regular evaluations must be done of people, processes and technology. Providers
should establish periodic assessments with consumers to review the contractual terms and
discuss the risks or any potential issues that may affect the service.
Audit and Compliance: Does it meet audit and compliance standards?
It is highly critical to meet audit and compliance standards in the cloud environment. Cloud
customers should adopt a service only after analyzing the history of the service provider in terms of
its security policies. To ensure trust, service providers must adhere to and comply with the cloud
and data security regulations pertaining to each country they operate in. They should arrange
third-party audits with regular reviews and assessments to identify whether there are any issues in
the established policies or contractual terms. They should also document, and keep reviewing and
updating, their legal, statutory and regulatory compliance.
