Chapter 1: Introduction to Big Data
■ 2 years as an IT Instructor
■ Certificates:
2. Machine Learning
■ What is Data?
■ What is Information?
Introduction
■ What is Data?
– It may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
– Big Data is also data, but of enormous size and growing exponentially over time.
– It is so large that none of the traditional data management tools can store or process it efficiently.
Introduction
■ What is Information?
– Information is data that has been processed in a meaningful way according to a given requirement.
– Information is processed, structured, or presented in a given context to make it meaningful and useful.
Data Warehouse
■ What is ETL?
– ETL stands for Extract, Transform, and Load: the process of extracting data from source systems, transforming it, and loading it into a data warehouse.
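A minimal sketch of the ETL flow in Python (the file name, field names, and the SQLite target are illustrative assumptions; a real pipeline would load into an actual data warehouse):

```python
import csv
import sqlite3

# Extract: read raw records from an operational export (hypothetical file).
with open("sales_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape the records into the warehouse format.
cleaned = [
    {"order_id": int(r["order_id"]), "amount_usd": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")  # drop rows with a missing amount
]

# Load: write the transformed records into the warehouse table (SQLite stands in here).
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id INTEGER, amount_usd REAL)")
conn.executemany(
    "INSERT INTO fact_sales (order_id, amount_usd) VALUES (:order_id, :amount_usd)",
    cleaned,
)
conn.commit()
conn.close()
```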
Introduction
1 Kilobyte (KB) = 2^10 Bytes = 1,024 Bytes
1 Megabyte (MB) = 10^3 KB
1 Petabyte (PB) = 10^3 TB
1 Exabyte (EB) = 10^3 PB
1 Zettabyte (ZB) = 10^3 EB
Example: a mobile phone holds about 128 GB; a single photo is about 10-15 MB.
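As a quick sanity check on the figures above, a short Python snippet using the slide's own numbers (128 GB phone, 10-15 MB photos):

```python
# Unit sizes following the slide's convention (1 KB = 1,024 bytes, decimal above that).
KB = 1024
MB = 1000 * KB
GB = 1000 * MB

phone_capacity = 128 * GB   # storage of the example phone
photo_size = 15 * MB        # upper end of the 10-15 MB estimate

print(phone_capacity // photo_size)   # about 8,500 photos fit on the phone
```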
Introduction
• By 2025, it's estimated that 463 exabytes of data will be created each day globally – that's the equivalent of 212,765,957 DVDs per day.
➢ Volume:
– The sheer amount of data generated and stored; with Big Data, sizes are far larger than traditional systems are designed to handle.
➢ Velocity:
– Large amounts of data arrive from transactions with a high refresh rate, resulting in data streams coming at great speed, and the time to act on these streams is often very short. There is a shift from batch processing to real-time streaming.
➢ Variety:
– Data comes in many forms (structured, semi-structured, and unstructured) and from both internal and external data sources.
OUTLINE
Introduction
■ Structured Data
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
■ Unstructured Data
Any data with an unknown form or structure is classified as unstructured data.
■ Semi-structured Data
Semi-structured data can contain elements of both forms. It appears structured in form, but it is not defined with a fixed schema such as a table definition in a relational DBMS (for example, XML or JSON files).
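A small Python sketch contrasting the three forms (the sample records are made up for illustration):

```python
import csv
import io
import json

# Structured: fixed format, every row follows the same column layout (relational-table style).
structured = io.StringIO("id,name,gpa\n1,Alice,3.9\n2,Bob,3.4\n")
for row in csv.DictReader(structured):
    print(row["name"], row["gpa"])

# Semi-structured: self-describing keys, but no fixed table definition (e.g., a JSON record).
semi_structured = '{"id": 3, "name": "Carol", "courses": ["ML", "Databases"]}'
print(json.loads(semi_structured)["courses"])

# Unstructured: no predefined model at all (free text, images, audio, video).
unstructured = "Lecture transcript: today we introduced Big Data and its three Vs..."
print(len(unstructured.split()), "words")
```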
• Example
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, much of it seemed useless. Now administrators can use analytics and data visualization on this data to draw out patterns among students, revolutionizing the university's operations, recruitment, and retention efforts.
Application and use cases of Big Data
Example
• Wearable devices and sensors have been introduced in the healthcare industry that can provide a real-time feed to a patient's electronic health record.
• One example is Apple, which has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.
• Example
The Food and Drug Administration (FDA), which operates under the jurisdiction of the US federal government, uses big data analysis to identify and examine patterns of food-related illnesses and infections.
Application and use cases of Big Data
■ Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, posting comments, etc.
Application and use cases of Big Data
■ Weather Patterns
■ Problem—Schema-On-Write:
– Traditional systems require the schema to be designed and the data validated before it is written (schema-on-write). This means that a lot of work must be done before new data sources can be analyzed.
– Example: Suppose a company wants to start analyzing a new source of data from unstructured or semi-
structured sources. A company will usually spend months (3–6 months) designing schemas and so on to store
the data in a data warehouse. That is 3 to 6 months that the company cannot use the data to make business
decisions. Then when the data warehouse design is completed 6 months later, often the data has changed
again. If you look at data structures from social media, they change on a regular basis. The schema-on-write
environment is too slow and rigid to deal with the dynamics of semi-structured and unstructured data
environments that are changing over a period of time.
■ The other problem with unstructured data is that traditional systems usually use large object (LOB) types to handle it, which are often very inconvenient and difficult to work with.
Limitations of traditional large-scale systems
■ Solution—Schema-On-Read:
– Hadoop systems are schema-on-read, which means any data can be written to the storage system
immediately. Data are not validated until they are read. This enables Hadoop systems to load any type of data
and begin analyzing it quickly.
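A minimal sketch of the schema-on-read idea (HDFS is not used here; a local JSON-lines file stands in, and the field names are illustrative):

```python
import json

# Write step: raw events are stored as-is, with no schema validation up front.
raw_events = [
    '{"user": "u1", "action": "click", "ts": 1700000000}',
    '{"user": "u2", "action": "purchase", "amount": 19.99}',  # different fields are fine
    'not even valid JSON',                                    # malformed records are accepted too
]
with open("events.log", "w") as f:
    f.write("\n".join(raw_events))

# Read step: the expected structure is applied only now, at analysis time.
with open("events.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # handling of bad records is deferred to read time
        print(event.get("user"), event.get("action"))
```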
Limitations of traditional large-scale systems
■ Problem—Cost of Storage: Traditional systems use shared storage. As organizations start to ingest larger volumes of data, shared storage becomes prohibitively expensive.
■ Solution—Local Storage: Hadoop can use the Hadoop Distributed File System (HDFS), a distributed file
system that leverages local disks on commodity servers. Shared storage is about $1.20/GB, whereas local storage is about $0.04/GB. Hadoop's HDFS creates three replicas by default for high availability, so at $0.12 (12 cents) per GB it is still a fraction of the cost of traditional shared storage.
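The cost comparison works out as follows (a quick check using the per-GB prices quoted above):

```python
shared_storage_per_gb = 1.20     # traditional shared storage, USD per GB
local_storage_per_gb = 0.04      # commodity local disk, USD per GB
hdfs_replication_factor = 3      # HDFS keeps three copies by default

effective_hdfs_cost = local_storage_per_gb * hdfs_replication_factor
print(round(effective_hdfs_cost, 2))                       # 0.12 USD/GB, i.e. 12 cents
print(round(shared_storage_per_gb / effective_hdfs_cost))  # shared storage is ~10x the cost
```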
Limitations of traditional large-scale systems
■ Problem—Cost of Proprietary Hardware: Large proprietary hardware solutions can be cost prohibitive when deployed
to process extremely large volumes of data. Organizations are spending millions of dollars in hardware and software
licensing costs while supporting large data environments. Organizations are often growing their hardware in million
dollar increments to handle the increasing data. New technology from traditional vendors that can grow to petabyte scale with good performance is extremely expensive.
■ Problem—Complexity: When you look at any traditional proprietary solution, it is full of extremely complex silos of
system administrators, DBAs, application server teams, storage teams, and network teams. Often there is one DBA
for every 40 to 50 database servers. Anyone running traditional systems knows that complex systems fail in complex
ways.
■ Solution—Simplicity: Because Hadoop uses commodity hardware and follows the “shared-nothing” architecture, it is
a platform that one person can understand very easily. Numerous organizations running Hadoop have one
administrator for every 1,000 data nodes. With commodity hardware, one person can understand the entire
technology stack.
Limitations of traditional large-scale systems
■ Problem—Causation: Because data is so expensive to store in traditional systems, data is filtered and aggregated,
and large volumes are thrown out because of the cost of storage. Minimizing the data to be analyzed reduces the
accuracy and confidence of the results. Not only are the accuracy and confidence of the results affected, but minimizing the data also limits an organization's ability to identify business opportunities. Atomic data can yield more insights into the
data than aggregated data.
■ Solution—Correlation: Because of the relatively low cost of storage of Hadoop, the detailed records are stored in
Hadoop’s storage system HDFS. Traditional data can then be analyzed with non-traditional data in Hadoop to find
correlation points that can provide much higher accuracy of data analysis. We are moving to a world of correlation
because the accuracy and confidence of the results are factors higher than with traditional systems. Organizations are seeing big data as transformational: companies that used to spend weeks or months building predictive models and customer profiles can now build them in a few days. One company had a data load that took 20 hours to complete, which was not ideal; after moving to Hadoop, the same load took 3 hours.
OUTLINE
Introduction
[Figure: slow vs. fast]
How a distributed way of computing is superior (cost and scale)
■ Scale horizontally (scale out): add more commodity machines to the cluster rather than buying a single bigger machine.
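A toy sketch of the divide-and-conquer style that horizontal scaling enables, using Python's standard multiprocessing pool as a stand-in for a cluster of commodity machines (MapReduce-style word count; the sample documents are made up):

```python
from collections import Counter
from multiprocessing import Pool

def count_words(document: str) -> Counter:
    # Map step: each "node" counts words in its own chunk of the data independently.
    return Counter(document.lower().split())

def merge_counts(partials):
    # Reduce step: partial counts from all nodes are merged into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    documents = [
        "big data needs distributed processing",
        "distributed processing scales horizontally",
        "commodity machines keep the cost low",
    ]
    with Pool(processes=3) as pool:   # three workers stand in for three data nodes
        partial_counts = pool.map(count_words, documents)
    print(merge_counts(partial_counts).most_common(3))
```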
OUTLINE
Introduction
■ Big Data is characterized by heterogeneous data sources such as images, videos, and audio.
■ The way Big Data is stored affects not only cost but also analysis and processing. To meet the service and analysis requirements of Big Data, storage that is reliable, high-performance, highly available, and low-cost needs to be developed.
■ Traditional databases and warehouses are unsatisfactory for processing unstructured and semi-structured data. With Big Data, read/write operations are highly concurrent for a large number of users, and as the size of the database increases, existing algorithms may become insufficient.
Opportunities and challenges with Big Data
■ Big Data enables enhanced discovery, access, availability, exploitation, and provisioning of information
within companies and the supply chain. It can enable the discovery of new data sets that are not yet
being used to drive value.
■ Big Data analytics can enhance customer segmentation, allowing for better scalability and mass personalization. It can improve customer service levels, enhance customer acquisition and sales strategies (through web and social), and enable customization of delivery.
Opportunities and challenges with Big Data
■ A wide variety of data streams can aid innovation and product design. These include product usage data, point-of-sale data, field data from devices, customer data, and supplier suggestions that drive product and process innovation.
■ Big Data can reduce long-term costs, increase ability to invest, and improve understanding of cost drivers
and impacts.
References
1. https://www.erpublication.org/published_paper/IJETR042630.pdf
2. https://www.pearsonitcertification.com/articles/article.aspx?p=2427073&seqNum=2
3. https://intellipaat.com/blog/7-big-data-examples-application-of-big-data-in-real-life/
Thank you