BDA NOTES With Questions Included
ANALYTICS
UNIT-01
INTRODUCTION TO BIG DATA
For example: when you visit any website, it might store your IP address; that is data. In return it might add a cookie in your browser, marking that you visited the website; that is also data. Your name is data; your age is data.
• What is Data?
– The quantities, characters, or symbols on which operations are performed by a
computer,
– which may be stored and transmitted in the form of electrical signals and
– recorded on magnetic, optical, or mechanical recording media.
• 3 Actions on Data
– Capture
– Transform
– Store
Big Data
• Big Data may well be the Next Big Thing in the IT world.
• Big data burst upon the scene in the first decade of the 21st century.
• Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.
• Like many new information technologies, big data can bring about dramatic cost reductions and substantial improvements in the time required to perform a computing task.
• Big Data refers to a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Examples of Big Data
– The New York Stock Exchange generates about one terabyte of new trade data per day.
– Other sources of Big Data include stock exchanges, social media platforms, and sensor networks.
Types of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured Data
• Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data.
• Over time, techniques have been developed for working with such data (where the format is well known in advance) and for deriving value out of it.
• Foreseeing issues of today:
– Problems arise when the size of such data grows to a huge extent; typical sizes are now in the range of multiple zettabytes.
• Do you know?
– 10^21 bytes (one billion terabytes) equal one zettabyte.
– That is why the name Big Data is given; imagine the challenges involved in its storage and processing.
• Do you know?
– Data stored in a relational database management system is one example of 'structured' data.
Unstructured Data
• Any data with unknown form or structure is classified as unstructured data.
• In addition to the size being huge, organizations often do not know how to derive value out of it, since this data is in its raw form or unstructured format.
Semi-structured Data
• Semi-structured data has some organizing structure, but it is not defined with, e.g., a table definition in a relational DBMS.
• An example of semi-structured data is personal data stored in an XML file:
<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age></rec>
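As a small illustrative sketch (not part of the original notes), such semi-structured records can be parsed programmatically; the wrapping <records> root element is an assumption added so the fragment forms a valid document.

```python
# Minimal sketch: parsing the semi-structured personal records shown above.
# Assumption: the <rec> elements are wrapped in a single <records> root element.
import xml.etree.ElementTree as ET

xml_text = """
<records>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
  <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
  <rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</records>
"""

root = ET.fromstring(xml_text)
for rec in root.findall("rec"):
    name = rec.findtext("name")
    sex = rec.findtext("sex")
    age = int(rec.findtext("age"))
    print(f"{name} ({sex}), age {age}")
```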
Storing Big Data
– HBase
– Hive
Processing Big Data
• Integrating disparate data stores
– Mapping data to the programming framework
– Connecting and extracting data from storage
– Transforming data for processing
– Subdividing data in preparation for Hadoop MapReduce
• Employing Hadoop MapReduce
Sources of data:
• Users
• Applications
• Systems
• Sensors
Challenges:
• Organizations can be so overwhelmed by data that they need the right people solving the right problems
• Costs escalate too fast
BIG DATA PLATFORM
A Big Data platform is an IT solution which combines several Big Data tools and utilities into one packaged solution for managing and analyzing Big Data.
A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.
It is an enterprise-class IT platform that enables organizations to develop, deploy, operate and manage a big data infrastructure/environment.
A Big Data Platform is an integrated IT solution for Big Data management which combines several software systems, software tools and hardware to provide easy-to-use tools and systems to enterprises.
It is a single one-stop solution for all Big Data needs of an enterprise irrespective of size and data volume. A Big Data Platform is an enterprise-class IT solution for developing, deploying and managing Big Data.
There are several open-source and commercial Big Data Platforms in the market with varied features which can be used in a Big Data environment.
A big data platform generally consists of big data storage, servers, databases, big data management, business intelligence and other big data management utilities.
It also supports custom development, querying and integration with other systems.
The primary benefit of a big data platform is to reduce the complexity of multiple vendors/solutions into one cohesive solution.
Big data platforms are also delivered through the cloud, where the provider offers an all-inclusive big data solution and services.
Here are the most important features of any good Big Data Analytics Platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on the business requirement, because business needs can change due to new technologies or due to changes in business processes.
g) It should have tools for searching through large data sets.
Big data is a term for data sets that are so large or complex that traditional data processing applications are
inadequate.
Challenges include
Analysis,
Capture,
Data Curation,
Search,
Sharing,
Storage,
Transfer,
Visualization,
Querying,
Updating
Information Privacy.
The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.
Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed
time. Big data "size" is a constantly moving target.
Big data requires a set of techniques and technologies with new forms of integration to
reveal insights from datasets that are diverse, complex, and of a massive scale
a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) Datastax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD
a) Hadoop
What is Hadoop?
Hadoop is an open-source, Java-based programming framework and server software which is used to store and analyze data with the help of hundreds or even thousands of commodity servers in a clustered environment.
Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity computers.
If any server goes down, it knows how to replicate the data, and there is no loss of data even in the case of hardware failure.
Hadoop is an Apache-sponsored project and it consists of many software packages which run on top of the Apache Hadoop system.
Top Hadoop-based Commercial Big Data Analytics Platforms
Hadoop provides the set of tools and software for making the backbone of a Big Data analytics system.
The Hadoop ecosystem provides the necessary tools and software for handling and analyzing Big Data.
On top of the Hadoop system, many applications can be developed and plugged in to provide an ideal solution for Big Data needs.
b) Cloudera
Cloudera is one of the first commercial Hadoop-based Big Data Analytics Platforms offering Big Data solutions.
Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science & Engineering and Cloudera Essentials.
All these products are based on Apache Hadoop and provide real-time processing and analytics of massive data sets.
Website: https://www.cloudera.com
c) Amazon Web Services
Amazon offers a Hadoop environment in the cloud as part of its Amazon Web Services package.
The AWS Hadoop solution is a hosted solution which runs on Amazon's Elastic Compute Cloud (EC2) and Simple Storage Service (S3).
Enterprises can use Amazon AWS to run their Big Data processing and analytics in the cloud environment.
Amazon EMR allows companies to set up and easily scale Apache Hadoop, Spark, HBase, Presto, Hive, and other Big Data frameworks using its cloud hosting environment.
Website: https://aws.amazon.com/emr/
d) Hortonworks
Hortonworks uses 100% open-source software without any proprietary software.
Hortonworks was the first to integrate support for Apache HCatalog.
Hortonworks is a Big Data company based in California.
The company develops and supports applications for Apache Hadoop.
The Hortonworks Hadoop distribution is 100% open source and enterprise-ready with the following features:
e) MapR
MapR is another Big Data platform which uses the Unix file system for handling data.
It does not use HDFS, and the system is easy to learn for anyone familiar with the Unix system.
This solution integrates Hadoop, Spark, and Apache Drill with a real-time data processing feature.
Website: https://mapr.com
f) IBM Open Platform
IBM Open Platform uses the latest Hadoop software and provides the following features:
Support for heterogeneous storage, which includes HDFS for in-memory and SSD in addition to HDD
Native support for Spark; developers can use Java, Python and Scala to write programs
The platform includes Ambari, which is a tool for provisioning, managing and monitoring Apache Hadoop clusters
IBM Open Platform includes all the software of the Hadoop ecosystem, e.g. HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider
Developers can download the trial Docker image or native installer for testing and learning the system
The application is well supported by the IBM technology team
Website: https://www.ibm.com/analytics/us/en/technology/hadoop/
g) Microsoft HDInsight
Microsoft HDInsight is also based on the Hadoop distribution and is a commercial Big Data platform from Microsoft.
Microsoft is a software giant involved in the development of the Windows operating system for desktop and server users.
HDInsight is a big Hadoop distribution offering which runs in the Windows and Azure environments.
It offers customized, optimized open-source Hadoop-based analytics clusters which use Spark, Hive, MapReduce, HBase, Storm, Kafka and R Server, running on the Hadoop system in the Windows/Azure environment.
Website: https://azure.microsoft.com/en-in/services/hdinsight/
h) Intel Distribution for Apache Hadoop
Website: http://www.intel.com/content/www/us/en/software/intel-distribution-for-apache-hadoop-software-solutions.html
i) Datastax Enterprise Analytics
Features:
It provides powerful indexing, search, analytics and graph functionality in the Big Data system
It supports advanced indexing and searching features
Website: http://www.datastax.com/
j) Teradata Enterprise Access for Hadoop
Company offers:
Teradata
Hadoop
Website: http://www.teradata.com
k) Pivotal HD
Pivotal HD is another Hadoop distribution which includes the database tool Greenplum and the analytics platform Gemfire.
Features:
Indian Railways, BMW, China CITIC Bank and many other big players are using this distribution of Big Data Platform.
Website: https://pivotal.io/
Open Source Big Data Platforms
There are various open-source Big Data Platforms which can be used for Big Data handling and data analytics in real-time environments.
Both small and big enterprises can use these tools for managing their enterprise data to get the best value from their enterprise data.
i) Apache Hadoop
Apache Hadoop is a Big Data platform and software package which is an Apache-sponsored project.
Under the Apache Hadoop project, various other software packages are being developed which run on top of the Hadoop system to provide enterprise-grade data management and analytics solutions to enterprises.
Apache Hadoop is an open-source, distributed file system which provides a data processing and analysis engine for analyzing large sets of data.
Hadoop can run on Windows, Linux and OS X operating systems, but it is mostly used on Ubuntu and other Linux variants.
ii) MapReduce
The MapReduce engine was originally written by Google, and it is the system which enables developers to write programs that can run in parallel on hundreds or even thousands of computer nodes to process vast data sets.
After processing the job on the different nodes, it combines the results and returns them to the program which executed the MapReduce job.
This software is platform independent and runs on top of the Hadoop ecosystem. It can process tremendous amounts of data at very high speed in a Big Data environment.
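To make the map/shuffle/reduce idea concrete, here is a minimal single-machine Python sketch of the MapReduce pattern (a word count). It only simulates the programming model; it is not Hadoop code, and real jobs would be written against the Hadoop APIs and run distributed across many nodes.

```python
# Minimal single-process simulation of the MapReduce pattern (word count).
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs -- here (word, 1) for every word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would do between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the list of values for each key.
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data is big", "hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))   # e.g. {'big': 3, 'data': 2, 'is': 1, ...}
```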
iii) GridGain
GridGain is another software system for parallel processing of data, just like MapReduce.
GridGain is an alternative to Apache MapReduce.
GridGain is used for the processing of in-memory data and is based on the Apache Ignite framework.
GridGain is compatible with Hadoop HDFS and runs on top of the Hadoop ecosystem.
The enterprise version of GridGain can be purchased from the official GridGain website, while the free version can be downloaded from the GitHub repository.
Website: https://www.gridgain.com/
iv) HPCC Systems
HPCC Systems stands for "high performance computing cluster" and this system is developed by LexisNexis Risk Solutions.
According to the company, this software is much faster than Hadoop and can be used in the cloud environment.
HPCC Systems is developed in C++ and compiled into binary code for distribution.
HPCC Systems is an open-source, massively parallel processing system which is installed in a cluster to process data in real time.
It requires the Linux operating system and runs on commodity servers connected by a high-speed network.
It is scalable from one node to thousands of nodes to provide performance and scalability.
Website: https://hpccsystems.com/
v) Apache Storm
Apache Storm is used for:
Realtime analytics
Continuous computation
Distributed RPC
ETL
Apache Storm is used by Yahoo, Twitter, Spotify, Yelp, Flipboard and many other data giants.
Website: http://storm.apache.org/
vii) Apache Spark
Apache Spark is software that runs on top of Hadoop and provides an API for real-time, in-memory processing and analysis of large sets of data stored in HDFS.
It stores the data in memory for faster processing.
Apache Spark runs programs up to 100 times faster in memory and 10 times faster on disk compared to MapReduce.
Apache Spark aims to speed up the processing and analysis of big data sets in a Big Data environment.
Apache Spark is being adopted very fast by businesses to analyze their data sets and get the real value of their data.
Website: http://spark.apache.org/
viii) SAMOA
Thus, the Big Data industry is growing very fast in 2017 and companies are rapidly moving their data to Big Data platforms. There is a huge requirement for Big Data skills in the job market, and many companies are providing training and certifications in Big Data technologies.
*********************
UNIT-II
Conventional Systems.
The system consists of one or more zones each having either manually operated call points
or automatic detection devices, or a combination of both.
Big data is a huge amount of data which is beyond the processing capacity of conventional database systems to manage and analyze within a specific time interval.
The conventional computing functions logically with a set of rules and calculations while
the neural computing can function via images, pictures, and concepts.
Conventional computing is often unable to manage the variability of data obtained in the real
world.
On the other hand, neural computing, like our own brains, is well suited to situations that have no clear algorithmic solution and is able to manage noisy, imprecise data. This allows it to excel in those areas that conventional computing often finds difficult.
Big Data vs. Conventional Data
Big Data: Unstructured data such as text, video, and audio. | Conventional Data: Normally structured data such as numbers and categories, but it can take other forms as well.
Big Data: Hard-to-perform queries and analysis. | Conventional Data: Relatively easy-to-perform queries and analysis.
Big Data: Needs a new methodology for analysis. | Conventional Data: Data analysis can be achieved by using conventional methods.
Big Data: Needs tools such as Hadoop, Hive, HBase, Pig, Sqoop, and so on. | Conventional Data: Tools such as SQL, SAS, R, and Excel alone may be sufficient.
Big Data: Used for reporting, basic analysis, and text mining; advanced analytics is only in a starting stage in big data. | Conventional Data: Used for reporting, advanced analysis, and predictive modeling.
Big Data: Analysis needs both programming skills (such as Java) and analytical skills. | Conventional Data: Analytical skills are sufficient; advanced analysis tools don't require expert programming skills.
Big Data: Petabytes/exabytes of data; millions/billions of accounts; generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and so on. | Conventional Data: Generated by small enterprises and small banks.
The following challenges have been dominating in the case of conventional systems in real-time scenarios:
1) Uncertainty of Data Management Landscape
2) The Big Data Talent Gap
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big data analytics
1) Uncertainty of the Data Management Landscape:
A big challenge for companies is to find out which technology works best for them without the introduction of new risks and problems.
2) The Big Data Talent Gap:
While Big Data is a growing field, there are very few experts available in this field.
This is because Big Data is a complex field, and people who understand the complexity and intricate nature of this field are few and far between.
3) Getting data into the big data platform:
Data is increasing every single day. This means that companies have to tackle a limitless amount of data on a regular basis.
The scale and variety of data that is available today can overwhelm any data practitioner, and that is why it is important to make data accessibility simple and convenient for brand managers and owners.
5) Getting important insights through the use of Big Data analytics:
It is important that companies gain proper insights from big data analytics, and it is important that the correct department has access to this information.
A major challenge in big data analytics is bridging this gap in an effective fashion.
Big data challenges can be grouped into three categories:
1. Data
2. Process
3. Management
1. Data Challenges
Volume
Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day. Facebook, 10 TB.
• Mobile devices play a key role as well, as there were an estimated 6 billion mobile phones in 2011.
• The challenge is how to deal with the size of Big Data.
Variety
• A lot of this data is unstructured, or has a complex structure that's hard to represent in rows and columns.
2. Processing
More than 80% of today's information is unstructured, and it is typically too big to manage effectively.
Today, companies are looking to leverage a lot more data from a wider variety of
sources both inside and outside the organization.
Things like documents, contracts, machine data, sensor data, social media, health
records, emails, etc. The list is endless really.
3. Management
A lot of this data is unstructured, or has a complex structure that's hard to represent in rows and columns.
Big Data Challenges
• Big data sets are valuable due to the additional information derivable from analysis of a single large set of related data,
– as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to
• "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
b) Visualization helps organizations perform analyses and make decisions much more
rapidly, but the challenge is going through the sheer volumes of data and accessing the level
of detail needed, all at a high speed.
c) The challenge only grows as the degree of granularity increases. One possible solution is
hardware. Some vendors are using increased memory and powerful parallel
processing to crunch large volumes of data extremely quickly
Intelligent Data Analysis (IDA) is one of the hot issues in the field of artificial intelligence and information.
IDA is used for extracting useful information from large quantities of online data; extracting desirable knowledge or interesting patterns from existing databases;
the distillation of information that has been collected, classified, organized, integrated,
abstracted and value-added;
at a level of abstraction higher than the data, and information on which it is based and can be
used to deduce new information and new knowledge;
Goal:
The goal of intelligent data analysis is to extract useful knowledge; the process demands a combination of extraction, analysis, conversion, classification, organization, reasoning, and so on.
1.3.2 Uses / Benefits of IDA
Intelligent Data Analysis provides a forum for the examination of issues related to the research and
applications of Artificial Intelligence techniques in data analysis across a variety of disciplines and the
techniques include (but are not limited to):
Data Visualization
Data pre-processing (fusion, editing, transformation, filtering, sampling)
Data Engineering
Database mining techniques, tools and applications
Use of domain knowledge in data analysis
Big Data applications
Evolutionary algorithms
Machine Learning(ML)
Neural nets
Fuzzy logic
Statistical pattern recognition
Knowledge Filtering and
Post-processing
Why IDA?
The multidimensionality of problems calls for methods for adequate and deep data processing and analysis.
Example of IDA
– Sample: examinees who died from cardiovascular diseases during the period.
– Group 2: they were ill (drug treatment, positive clinical and laboratory findings).
– Knowledge Acquisition: a rule.
Illustration of IDA by using See5
application.names - lists the classes to which cases may belong and the attributes
used to describe each case.
Attributes are of two types: discrete attributes have a value drawn from a set of
possibilities, and continuous attributes have numeric values.
application.data - provides information on the training cases from which See5 will extract
patterns.
The entry for each case consists of one or more lines that give the values for all
attributes.
application.test - provides information on the test cases (used for evaluation of results).
The entry for each case consists of one or more lines that give the values for all
attributes.
Goal 1.1:
application.names – example
gender: M,F
activity: 1,2,3
age: continuous
smoking: No,Yes
…
Goal 1.2:
application.data – example
M,1,59,Yes,0,0,0,0,119,73,103,86,247,87,15979,?,?,?,1,73,2.5
M,1,66,Yes,0,0,0,0,132,81,183,239,?,783,14403,27221,19153,23187,1,73,2.6
M,1,61,No,0,0,0,0,130,79,148,86,209,115,21719,12324,10593,11458,1,74,2.5
… …
Results – example
Sensitivity=0.97
Specificity=0.81
Sensitivity=0.97
Specificity=0.81
Sensitivity=0.98
Specificity=0.90
NATURE OF DATA
INTRODUCTION
Data
Properties of Data
For examining the properties of data, we refer to the various definitions of data.
Reference to these definitions reveals that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
a) Amenability of use: From the dictionary meaning of data, it is learnt that data are facts used in deciding something. In short, data are meant to be used as a base for arriving at definitive conclusions.
b) Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be communicated will remain hidden.
c) Accuracy: Data should be real, complete and accurate. Accuracy is thus an essential property of data.
d) Essence: Large quantities of data are collected and they have to be compressed and refined. Data so refined can present the essence, or derived qualitative value, of the matter.
f) Compression: Large amounts of data are always compressed to make them more meaningful, i.e., compressed to a manageable size. Graphs and charts are some examples of compressed data.
g) Refinement: Data require processing or refinement. When refined, they are capable of leading to conclusions or even generalizations. Conclusions can be drawn only when data are processed or refined.
TYPES OF DATA
In order to understand the nature of data it is necessary to categorize them into various types.
Different categorizations of data are possible.
The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
Within each of these fields, there may be several ways in which data can be categorized into types.
Data can be measured on four types of scales:
Nominal
Ordinal
Interval
Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be performed.
The distinction between the four types of scales centers on three different characteristics: the order of responses, the distance between observations, and the presence of a true zero.
Nominal Scales
Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the
same as a 2 and 3.
True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Ordinal Scales
At the risk of providing a tautological definition, ordinal scales measure, well, order. So, our
characteristics for ordinal scales are:
True Zero: There is no true or real zero. An item, observation, or category cannot finish zero.
Interval Scales
Interval scales provide insight into the variability of the observations or data.
Classic interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and Semantic
Differential scales (e.g., 1 - dark and 9 - light).
In an interval scale, users could respond to "I enjoy opening links to the website from a company email"
with a response ranging on a scale of values.
Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the same as
4 to 5. Also, six is twice as much as three and two is half of four. Hence, we can perform
arithmetic operations on the data.
True Zero: There is no zero with interval scales. However, data can be rescaled in a manner that contains zero. An interval scale measure from 1 to 9 remains the same as 11 to 19, because we added 10 to all values. Similarly, a 1 to 9 interval scale is the same as a -4 to 4 scale, because we subtracted 5 from all values. Although the new scale contains zero, zero remains uninterpretable, because it only appears in the scale as a result of the transformation.
Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard deviation
(and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
Ratio Scales
Zero dollars means we have no income (or, in accounting terms, our revenue exactly equals
our expenses!)
Distance is interpretable, in that $20 appears as twice $10 and $50 is half of a $100. For
the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard deviation (and
variance), skewness, and kurtosis.
The table below summarizes the characteristics of all four types of scales.
DATA CONVERSION
We can convert or transform our data from ratio to interval to ordinal to nominal. However,
we cannot convert or transform our data from nominal to ordinal to interval to ratio.
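As an illustrative sketch (not from the notes), the "downward" conversion from ratio to ordinal to nominal can be shown in a few lines; the age values and cut-points below are arbitrary assumptions, and the point is that the lost detail cannot be recovered.

```python
# Sketch of "downward" data conversion: ratio -> ordinal -> nominal.
# The ages and the cut-points are arbitrary illustrations.
ages = [23, 35, 47, 52, 61, 70]      # ratio scale: true zero, distances meaningful

def to_ordinal(age):
    # Ordinal: ordered categories; distances between categories no longer meaningful.
    if age < 30:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ordinal = [to_ordinal(a) for a in ages]
nominal = ["adult" for _ in ages]     # nominal: labels only, no order

print(ordinal)
print(nominal)
# Converting back from 'young' to an exact age is impossible: information is lost,
# which is why conversion works in one direction only.
```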
DATA SELECTION
Example: 60 degrees, 12.5 feet, 80 miles per hour
In this case, 93% of all hospitals have lower patient satisfaction scores than Eastridge Hospital, and 31% have lower satisfaction scores than Westridge Hospital.
Thus, the nature of data and its value have a great influence on the insights that can be drawn from it.
***********************
UNIT-III
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment
– In this phase,
• Business Understanding
– For the further process, we need to gather initial data, describe and explore the data
and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its application and
the need for the project in this phase.
– we need to select data as per the need, clean it, construct it to get useful
information and
• Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
• we need to
– select a modeling technique, generate a test design, build a model and assess the model built.
• The data model is built to
– Distributed Storage (e.g. Amazon S3)
• What is the programming model?
– Distributed Processing (e.g. MapReduce)
– SAS
– R
– Hadoop
Thus, BDA tools are used throughout the development of BDA applications.
ANALYSIS AND REPORTING
What is Analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
• What is Reporting ?
• Reporting is
• A firm may be focused on the general area of analytics (strategy, implementation,
reporting, etc.)
– but not necessarily on the specific aspect of analysis.
• It’s almost like some organizations run out of gas after the initial set-up-related
activities and don’t make it to the analysis stage
Analysis vs. Reporting
Analysis: Provides what is needed | Reporting: Provides what is asked for
Analysis: Is typically customized | Reporting: Is typically standardized
Analysis: Involves a person | Reporting: Does not involve a person
Analysis: Is extremely flexible | Reporting: Is fairly inflexible
• Reporting shows what is happening, while analysis focuses on explaining why it is happening and what you can do about it.
• Reports are like robots that monitor and alert you, whereas analysis is like a parent who can figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
Reporting and analysis can go hand in hand:
Reporting provides little or no context about what is happening in the data; context is critical to good analysis.
Reporting translates raw data into information.
Reporting usually raises a question – What is happening?
Analysis transforms the data into insights – Why is it happening? What can you do about it?
Thus, analysis and reporting complement each other, each being needed and used in its appropriate context.
MODERN ANALYTIC TOOLS
a) Batch Processing tools
• An organization can schedule batch processing for a time when there is little activity on their computer systems, for example overnight or at weekends.
• One of the most famous and powerful batch process-based Big Data tools is
Apache Hadoop.
It provides infrastructures and platforms for other specific Big Data
applications.
b) Stream Processing tools
• Stream processing – Envisioning (predicting) the life in data as and when it
transpires.
• The key strength of stream processing is that it can provide insights faster, often
within milliseconds to seconds.
It helps understanding the hidden patterns in millions of data records in real
time.
It translates into processing of data from single or multiple sources
in real or near-real time applying the desired business logic and emitting the
processed information to the sink.
• Stream processing serves multiple applications.
b) Apache Flink
• Apache Flink is:
– a top-level Apache project; Flink is a scalable data analytics framework that is fully compatible with Hadoop.
– Flink can execute both stream processing and batch processing easily.
– Flink was designed as an alternative to MapReduce.
c) Kinesis
– Kinesis as an out of the box streaming data tool.
– Kinesis comprises shards, which Kafka calls partitions.
– For organizations that take advantage of real-time or near real-time access to large
stores of data,
– Amazon Kinesis is great.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data which is followed by
loading the aggregate data into a data warehouse.
– Data is put into Kinesis streams.
– This ensures durability and elasticity.
c) Interactive Analysis tools
• The data can be reviewed, compared and analyzed in tabular or graphic format, or both at the same time.
IA Big Data Tools:
a) Google's Dremel: Google proposed an interactive analysis system in 2010 and named it Dremel,
– which is scalable for processing nested data.
– Dremel provides
• a very fast SQL like interface to the data by using a different technique than
MapReduce.
• Dremel has a very different architecture.
b) Apache drill
• Apache drill is:
Drill is an Apache open-source SQL query engine for Big Data exploration.
It is similar to Google’s Dremel.
• For Drill, there is:
• Drill is designed from the ground up to:
v. Collective model
a) MapReduce Model
Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI
2004.
b) Apache Hadoop (2005)
Apache Hadoop YARN: Yet Another Resource Negotiator, SOCC 2013.
Key Features of the MapReduce Model
Designed for clouds
Large clusters of commodity machines
Designed for big data
Support from local-disk-based distributed file systems (GFS / HDFS)
Disk-based intermediate data transfer in shuffling
MapReduce programming model
Computation pattern: Map tasks and Reduce tasks
Data abstraction: KeyValue pairs
• HaLoop
• An efficient Data Processing on Large clusters
• Have features:
– RDD operations
• MapReduce-like parallel operations
– DAG of execution stages and pipelined transformations
– Simple collectives: broadcasting and aggregation
DAG (Directed Acyclic Graph) Model
– Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
• Apache Spark
GraphLab (2010)
• GraphLab: A New Parallel Framework for Machine Learning. UAI 2010.
• Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud.
• Data graph
• PowerGraph (2012)
– PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.
– Gather, apply, Scatter (GAS) model
• GraphX (2013)
– A Resilient Distributed Graph System on Spark. GRADES
Collective Model
• Harp (2013)
Thus the modern analytical tools play an important role in the modern data world.
**********
STATISTICAL CONCEPTS: SAMPLING DISTRIBUTIONS
Fundamental Statistics
Statistics is a very broad subject, with applications in a vast number of different fields.
In general, one can say that statistics is the methodology for collecting, analyzing, interpreting and drawing conclusions from information.
Putting it in other words, statistics is the methodology which scientists and mathematicians have developed for interpreting and drawing conclusions from collected data.
Everything that deals even remotely with the collection, processing, interpretation and presentation of data belongs to the domain of statistics, and so does the detailed planning that precedes all these activities.
Data and Statistics
• Data consists of
– information coming from observations, counts, measurements, or responses.
Statistics is :
-science of collecting, organizing, analyzing, and interpreting data in order to make decisions.
A population is:
- the collection of all outcomes, responses, measurement, or counts that are of
interest.
A sample is:
- a subset of a population.
• All data can be classified as either,
– 1) Categorical or
– 2) Quantitative.
Definition 1.1 (Statistics). Statistics consists of a body of methods for collecting and analyzing data.
(Agresti & Finlay, 1997)
– collecting,
– classifying,
– summarizing,
– organizing,
– analyzing, and
– interpreting numerical information.
Statistics is much more than just the tabulation of numbers and the graphical
presentation of these tabulated numbers.
Statistics is the science of gaining information from numerical and categorical data.
• How can we assess the strength of the conclusions and evaluate their
uncertainty?
Categorical data (or qualitative data) results from descriptions, e.g. the blood type of
person, marital status or religious affiliation.
Why Statistics ?
Statistical Concepts
Data shape
Definition Sample: Sample is that part of the population from which information is collected. (Weiss,
1999)
A population is:
- the collection of all outcomes, responses, measurement, or counts that are of
interest.
A sample is:
- a subset of a population.
Question: In a recent survey,
• 250 college students at Union College were asked:
– if they smoked cigarettes regularly.
• Answer:
– 35 of the students said yes.
• Identify the population and the sample.
A parameter is
-a numerical description of a population characteristic.
A statistic is:
- a numerical description of a sample characteristic.
Statistical process
• Nominal - Categorical variables with no inherent order or ranking sequence, such as names or classes (e.g., gender). Values may be numerical, but without numerical meaning (e.g., I, II, III). The only operation that can be applied to nominal variables is enumeration.
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Can be
compared for equality, or greater or less, but not how much greater or less.
• Interval - Values of the variable are ordered as in Ordinal, and additionally, differences
between values are meaningful, however, the scale is not absolutely anchored. Calendar dates
and temperatures on the Fahrenheit scale are examples. Addition and subtraction, but not
multiplication and division are meaningful operations.
• Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero point,
e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication, and division are
all meaningful operations.
• Economics
– Forecasting
– Demographics
• Sports
– Individual & Team Performance
• Engineering
– Construction
– Materials
• Business
– Consumer Preferences
– Financial Trends
• Statistical methods are used to investigate data so that useful decision-making information results.
1. Experimental unit
• Object upon which we collect data
2. Population
• All items of interest
3. Variable
• Characteristic of an individual
experimental unit
4. Sample
• Subset of the units of a population
• Mnemonic: the P in Population goes with Parameter; the S in Sample goes with Statistic.
5. Statistical Inference
• Estimate or prediction or generalization about a population based on information
contained in a sample
6. Measure of Reliability
• Statement (usually qualified) about the degree of uncertainty associated with a
statistical inference
The study of statistics has two major branches: descriptive statistics and inferential statistics.
• Descriptive statistics: –
– Methods of organizing, summarizing, and presenting data in an informative way.
– Involves: collecting, presenting, and characterizing data
– Purpose: to describe data
• Inferential statistics: –
– The methods used to determine something about a population on the basis of a
sample:
• Example:
• In a recent study, volunteers who had less than 6 hours of sleep were four times more likely
to answer incorrectly on a science test than were participants who had at least 8 hours of
sleep.
• Decide which part is the descriptive statistic and what conclusion might be drawn using
inferential statistics?
Answer:-
The statement “four times more likely to answer incorrectly” is a descriptive statistic.
An inference drawn from the sample is that
all individuals sleeping less than 6 hours are more likely to answer science question
incorrectly than individuals who sleep at least 8 hours.
Inferential Statistics & Its Techniques
• Estimation
• Hypothesis Testing
Purpose:
• Make decisions about population characteristics
• Estimation:-
– e.g., Estimate the population mean weight using the sample mean weight
• Hypothesis testing:-
– e.g., Test the claim that the population mean weight is 70 kg
SAMPLING DISTRIBUTION
Introduction to Sampling
• What is sample?
• A sample is “a smaller (but hopefully representative) collection of units from a
population used to determine truths about that population” (Field, 2005)
Why sample?
– Resources (time, money) and workload
– Gives results with known accuracy that can be calculated mathematically
• The sampling frame is the list from which the potential respondents are drawn
– Registrar’s office
– Class rosters
– Must assess sampling frame errors
Major Types of Samples
1. Stratified Samples
2. Cluster Samples
3. Systematic Samples
4. Convenience Samples
1. Stratified Samples
A stratified sample has:
i. members from each segment of a population.
ii. This ensures that each segment from the population is represented.
2. Cluster Samples
• This is used when the population falls into naturally occurring subgroups.
3. Systematic Samples
• A systematic sample is a sample in which each member of the population is assigned a number.
• A starting number is randomly selected, and sample members are then chosen at regular intervals.
A convenience sample consists only of available members of the population.
Example:
You are doing a study to determine the number of years of education each teacher at your
college has.
Identify the sampling technique used if you select the samples listed.
1.8.5.2 Sampling Distribution
• From Vogt:
A theoretical frequency distribution of the scores for or values of a
statistic, such as a mean.
– Any statistic that can be computed for a sample has a sampling distribution.
• Sampling distributions is
– all possible values of a statistic and
– their probabilities of occurring for a sample of a particular size.
• Sampling distributions are used to
– calculate the probability that sample statistics
– could have occurred by chance and
– thus to decide whether something that is true of a sample statistic is
• also likely to be true of a population parameter.
Examples of Sampling Distribution
Take another sample of size 1,500 from the US. Record the mean income. Our census said
the mean is $30K.
Take another sample of size 1,500 from the US. Record the mean income. Our census said
the mean is $30K.
Take another sample of size 1,500 from the US. Record the mean income. Our census said
the mean is $30K.
….
Say that the standard deviation of this distribution is $10K.
Think back to the empirical rule. What are the odds you would get a sample mean that is more than $20K off?
• The first sampling distribution above, a, has a lower standard error. Now
a definition!
The standard deviation of a normal sampling distribution is called the standard error.
• Statisticians have found that the standard error of a sampling distribution is quite directly affected by:
– the number of cases in the sample(s), and
– the variability of the population.
Population Variability:
The standard error of income’s sampling distribution will be a lot higher than car price’s.
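The following sketch (not from the notes; the population mean of $30K and standard deviation of $10K are illustrative assumptions) simulates a sampling distribution of the mean and shows how the standard error shrinks as the sample size grows.

```python
# Simulating a sampling distribution of the sample mean and estimating its standard error.
import random
import statistics

random.seed(0)

def standard_error_by_simulation(pop_mean, pop_sd, n, n_samples=2000):
    sample_means = []
    for _ in range(n_samples):
        sample = [random.gauss(pop_mean, pop_sd) for _ in range(n)]
        sample_means.append(statistics.mean(sample))
    # The sd of the sample means approximates the standard error (pop_sd / sqrt(n)).
    return statistics.stdev(sample_means)

for n in (25, 100, 1500):
    print(n, round(standard_error_by_simulation(30_000, 10_000, n), 1))
# Larger samples give a smaller standard error; a more variable population gives a larger one.
```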
Probability Distributions
• A Note:
– Not all theoretical probability distributions are Normal. One example of many is the
binomial distribution.
• The binomial distribution gives
– the discrete probability distribution of obtaining exactly n successes out of N trials
• where the result of each trial is true with a known probability of success and false with the complementary probability.
• The binomial distribution has
– a formula and
– changes shape with each probability of success and number of trials.
• However, in this class the normal probability distribution is the most useful!
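As a small hedged example, the binomial probabilities just described can be computed directly from the formula; the parameters n = 10 and p = 0.3 are arbitrary illustrations.

```python
# Binomial distribution: probability of exactly k successes in n independent trials,
# each succeeding with probability p. Parameters are arbitrary illustrations.
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4))
# The shape changes with each probability of success and number of trials.
```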
Permutation Tests
In classical hypothesis testing,
we start with assumptions about the underlying distribution and
then derive the sampling distribution of the test statistic under H0.
In permutation testing,
the initial assumptions are not needed (except exchangeability), and
the sampling distribution of the test statistic under H0 is computed by using permutations of the data.
• Under H1,
• the outcome tends to be different, say larger for rats labeled Tx.
• A test statistic T measures
• the difference in observed outcomes for the two groups.
• T may be the difference in the two group means (or medians), denoted as t for the
observed data.
• Under H0,
• the individual labels of Tx and C are unimportant, since they have no impact on the
outcome. Since they are unimportant, the label can be randomly shuffled among the
rats without changing the joint null distribution of the data.
• Shuffling the data creates a:
• “new” dataset.
• It has the same rats, but with the group labels changed so as to appear as there were
different group assignments.
• Let t be the value of the test statistic from the original dataset.
• Let t1 be the value of the test statistic computed from a one dataset with permuted labels.
• Consider all M possible permutations of the labels, obtaining the test statistics, t 1, …,
tM .
• Under H0, t1, …, tM are all generated from the same underlying distribution that
generated t.
• Thus, t can be compared to the permuted data test statistics, t 1, …, tM , to test the
hypothesis and obtain a p-value or to construct confidence limits for the statistic.
• Survival times
• Treated mice 94, 38, 23, 197, 99, 16, 141
• Mean: 86.8
• Untreated mice 52, 10, 40, 104, 51, 27, 146, 30, 46
• Mean: 56.2
(Efron & Tibshirani)
Calculate the difference between the means of the two observed samples – it’s 30.6 days in
favor of the treated mice.
Consider the two samples combined (16 observations) as the relevant universe to
resample from.
Draw 7 hypothetical observations and designate them "Treatment"; draw 9 hypothetical
observations and designate them "Control".
Compute and record the difference between the means of the two samples.
Repeat steps 3 and 4 perhaps 1000 times.
Determine how often the resampled difference exceeds the observed difference of 30.6
If the group means are truly equal,
then shifting the group labels will not have a big impact on the sum of the two groups (or the mean, with equal sample sizes).
Some group sums will be larger than in the original data set and some will be smaller.
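A minimal sketch of the label-shuffling test just described, using the mouse survival times from the example; 1,000 shuffles follow the listed steps, and the exact p-value will vary from run to run.

```python
# Permutation test for the difference in mean survival time (treated vs. untreated mice).
import random

random.seed(1)

treated = [94, 38, 23, 197, 99, 16, 141]
untreated = [52, 10, 40, 104, 51, 27, 146, 30, 46]
observed_diff = sum(treated)/len(treated) - sum(untreated)/len(untreated)   # about 30.6

combined = treated + untreated
n_treated = len(treated)
count = 0
n_perm = 1000
for _ in range(n_perm):
    random.shuffle(combined)                 # shuffle the group labels
    new_t = combined[:n_treated]
    new_c = combined[n_treated:]
    diff = sum(new_t)/len(new_t) - sum(new_c)/len(new_c)
    if diff >= observed_diff:
        count += 1

print("observed difference:", round(observed_diff, 1))
print("approximate one-sided p-value:", count / n_perm)
```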
1. Bootstrap
• The bootstrap is
– a widely applicable tool that
– can be used to quantify the uncertainty associated with a given estimator or
statistical learning approach, including those for which it is difficult to obtain a
measure of variability.
• The bootstrap generates:
– distinct data sets by repeatedly sampling observations from the original data set.
– These generated data sets can be used to estimate variability in lieu of sampling
independent data sets from the full population.
• 1969 Simon publishes the bootstrap as an example in Basic Research Methods in Social
Science (the earlier pigfood example)
• 1979 Efron names and publishes first paper on the bootstrap
• The sampling employed by the bootstrap involves randomly selecting n observations with
replacement,
– which means some observations can be selected multiple times while other
observations are not included at all.
• This process is repeated B times to yield B bootstrap data sets,
– Z∗1,Z∗2,…,Z∗B,which can be used to estimate other quantities such as standard error.
The bootstrap procedure is a means of estimating the statistical accuracy . . . from the data in
a single sample.
Bootstrapping is used to mimic the process of selecting many samples when the
population is too small to do otherwise
The samples are generated from the data in the original sample by copying it many
number of times (Monte Carlo Simulation)
Samples can then selected at random and descriptive statistics calculated or regressions run for
each sample
The results generated from the bootstrap samples can be treated as if they were the result of actual sampling from the original population.
Characteristics of Bootstrapping
Bootstrapping Example
Bootstrapping is especially useful in situations when no analytic formula for the sampling distribution
is available.
• Traditional forecasting methods, like exponential smoothing, work well when demand is
constant – patterns easily recognized by software
• In contrast, when demand is irregular, patterns may be difficult to recognize.
• Therefore, when faced with irregular demand, bootstrapping may be used to provide more
accurate forecasts, making some important assumptions…
• No equal variance
• Allows for accurate forecasts of intermittent demand
• If the sample is a good approximation of the population, the sampling distribution may be
estimated by generating a large number of new samples
• For small data sets, taking a small representative sample of the data and replicating it will
yield superior results
Bootstrap Types
a) Parametric Bootstrap
b) Non-parametric Bootstrap
a) Parametric Bootstrap
• Re-sampling makes no assumptions about the population distribution.
• If we have information about the population distr., this can be used in resampling.
• In this case, when we draw randomly from the sample we can use population distr.
• For example, if we know that the population distr. is normal then estimate its parameters using
the sample mean and variance.
• Then approximate the population distr. with the sample distr. and use it to draw new
samples.
• As expected, if the assumption about population distribution is correct then the
parametric bootstrap will perform better than the nonparametric bootstrap.
• If not correct, then the nonparametric bootstrap will perform better.
Bootstrap Example
• A new pigfood ration is tested on twelve pigs, with six-week weight gains as follows:
• 496 544 464 416 512 560 608 544 480 466 512 496
Draw simulated samples from a hypothetical universe that embodies all we know about the universe that
this sample came from – our sample, replicated an infinite number of times.
The Bootstrap process steps
1. Put the observed weight gains in a hat
2. Sample 12 with replacement
3. Record the mean
4. Repeat steps 2-3, say, 1000 times
5. Record the 5th and 95th percentiles (for a 90% confidence interval)
Bootstrapped sample means
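A sketch of the five bootstrap steps just listed, applied to the pigfood weight gains: resample with replacement 1,000 times, record the mean each time, then read off the 5th and 95th percentiles for a 90% confidence interval.

```python
# Bootstrap of the mean six-week weight gain for the twelve pigs listed above.
import random

random.seed(2)

gains = [496, 544, 464, 416, 512, 560, 608, 544, 480, 466, 512, 496]

boot_means = []
for _ in range(1000):
    resample = random.choices(gains, k=len(gains))   # sample 12 with replacement
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
lower = boot_means[int(0.05 * len(boot_means))]
upper = boot_means[int(0.95 * len(boot_means))]
print("90% bootstrap CI for the mean gain:", round(lower, 1), "to", round(upper, 1))
```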
b) Nonparametric bootstrap
The nonparametric bootstrap method relies on the empirical distribution function.
The idea of the nonparametric bootstrap is to simulate data from the empirical cdf Fn.
Since the bootstrap samples are generated from Fn, this method is called the
nonparametric bootstrap.
Here Fn is a discrete probability distribution that gives probability 1/n to each
observed value x1, · · · , xn.
A sample of size n from Fn is thus a sample of size n drawn with replacement from the
collection x1, · · · , xn.
The standard deviation of $\hat{\theta}$ is then estimated by
$$ s_{\hat{\theta}} = \sqrt{\frac{1}{B}\sum_{i=1}^{B}\left(\theta^{*}_{i} - \bar{\theta}^{*}\right)^{2}} $$
where $\theta^{*}_{1}, \ldots, \theta^{*}_{B}$ are produced from B samples of size n drawn from the collection $x_{1}, \cdots, x_{n}$.
• Want to obtain the correlation between the test scores and the variance of the correlation
estimate.
2. Jackknife Method
• Jackknife method was introduced by Quenouille (1949)
– to estimate the bias of an estimator.
• The method is later shown to be useful in reducing the bias as well as in estimating the
variance of an estimator.
• Let $\hat{\theta}_n$ be an estimator of $\theta$ based on n i.i.d. random vectors $X_1, \ldots, X_n$, i.e., $\hat{\theta}_n = f_n(X_1, \ldots, X_n)$, for some function $f_n$.
• The jackknife is a statistical method for estimating and removing bias and for deriving robust estimates of standard errors and confidence intervals.
• It is carried out by systematically dropping out subsets of the data one at a time and assessing the resulting variation.
• Let $\hat{\theta}_{n,-i} = f_{n-1}(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ be the corresponding recomputed statistic based on all but the i-th observation. The jackknife estimator of the bias $E(\hat{\theta}_n) - \theta$ is given by
$$ \widehat{\mathrm{bias}}_J = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{n,-i} - \hat{\theta}_n\right). $$
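A small sketch of the leave-one-out jackknife bias estimate defined above, applied to the plug-in (divide-by-n) variance estimator, which is known to be biased; the data values are arbitrary illustrations.

```python
# Jackknife bias estimate for the plug-in (divide-by-n) variance estimator.
def plugin_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

data = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 16.0]   # arbitrary sample
n = len(data)
theta_hat = plugin_variance(data)

# Recompute the statistic leaving out one observation at a time.
leave_one_out = [plugin_variance(data[:i] + data[i + 1:]) for i in range(n)]

# Jackknife bias estimate: ((n - 1) / n) * sum_i (theta_{-i} - theta_hat)
bias_j = (n - 1) / n * sum(t - theta_hat for t in leave_one_out)

print("plug-in variance:", round(theta_hat, 3))
print("jackknife bias estimate:", round(bias_j, 3))
print("bias-corrected estimate:", round(theta_hat - bias_j, 3))
```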
• In cross-validation, you make a fixed number of folds (or partitions) of the data, run the
analysis on each fold, and then average the overall error estimate.
• Cross-Validation
– Not a resampling technique
– Requires large amounts of data
– Extremely useful in data mining and artificial intelligence
1. holdout method
– give the mean absolute test set error, which is used to evaluate the model.
• The advantage of this method is that
• K-fold cross validation is one way to improve over the holdout method.
– The data set is divided into k subsets, and the holdout method is repeated k times.
• Each time, one of the k subsets is used as
• Every data point gets to be in a test set exactly once, and gets to be in a training set k- 1
times.
• The variance of the resulting estimate is reduced as k is increased.
• The disadvantage of this method is that the training algorithm has to be rerun from
scratch k times, which means it takes k times as much computation to make an evaluation.
• A variant of this method is to randomly divide the data into a test and training set k different times.
• The advantage of doing this is that
– you can independently choose how large each test set is and how many trials you
average over.
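A minimal sketch of k-fold cross-validation as described above: each point is in the test fold exactly once and in the training folds k-1 times, and the errors from the k runs are averaged. The tiny dataset and the trivial mean-predictor "model" are stand-ins for a real learning algorithm.

```python
# Minimal k-fold cross-validation sketch with a mean-predictor stand-in model.
def k_fold_cv(xs, ys, k=5):
    n = len(xs)
    fold_size = n // k
    errors = []
    for fold in range(k):
        start, end = fold * fold_size, (fold + 1) * fold_size
        test_y = ys[start:end]                       # this fold is the test set
        train_y = ys[:start] + ys[end:]              # the rest is the training set
        prediction = sum(train_y) / len(train_y)     # "model": predict the training mean
        fold_error = sum(abs(y - prediction) for y in test_y) / len(test_y)
        errors.append(fold_error)
    return sum(errors) / k                           # average error over the k folds

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
print("mean absolute CV error:", round(k_fold_cv(xs, ys, k=5), 2))
```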
STATISTICAL INFERENCE
1) Estimation of parameters
– Point Estimation (X̄ or p)
– Interval Estimation
2) Hypothesis Testing
Statistical Inference
Use a statistical approach to make an inference about the distribution of a sample of data we
collect.
The population or macroscopic phenomenon is always unknown itself, because some, but not all,
of the data of our interest can be taken.
There are Two most common types of Statistical Inference and they are:
1. Confidence intervals and
2. Tests of significance.
1. Confidence Intervals
• If σ is known
• Compute the 95% C.I.
• IQ scores
– σ = 15
• Sample: 114, 118, 122, 126
– ΣXi = 480, X̄ = 120, σX̄ = σ/√n = 15/√4 = 7.5
– 120 ± 1.96(7.5)
– 120 ± 14.7
– 105.3 < μ < 134.7
• We are 95% confident that the population mean lies between 105.3 and 134.7 ~
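The IQ example above can be reproduced numerically; this sketch assumes σ = 15 is known, as stated, and uses the z critical value 1.96 for 95% confidence.

```python
# Reproducing the 95% confidence interval from the IQ example (sigma known).
from math import sqrt

sigma = 15.0
sample = [114, 118, 122, 126]
n = len(sample)
x_bar = sum(sample) / n                 # 120.0
se = sigma / sqrt(n)                    # 15 / 2 = 7.5
z = 1.96                                # critical value for 95% confidence

lower, upper = x_bar - z * se, x_bar + z * se
print(f"95% CI: {lower:.1f} < mu < {upper:.1f}")   # 105.3 < mu < 134.7
```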
Degrees of Freedom
• Width of t depends on n
• Degrees of Freedom
– related to sample size
– larger sample ---> better estimate
– n - 1 to compute s ~
Critical Values of t
• Table A.2: "Critical Values of t"
• df = n - 1
• level of significance for two-tailed test
– α
– area in both tails for critical value
• level of confidence for CI
– 1 - α ~
Confidence Intervals: σ unknown
• Same as σ known, but use t
• Narrower CI by...
1. Increasing n
– decreases standard error
2. Decreasing σ or s
– little control over this ~
3. σ known
– use z distribution, critical values
4. Decreasing level of confidence
– increases uncertainty that μ lies within the interval
– costs / benefits ~
2. Test of Significance ( Hypothesis testing)
• It is intended to help researchers differentiate between real and random patterns in the data.
What is a Hypothesis?
• A hypothesis is an assumption about the population parameter.
– A parameter is a Population mean or proportion
– The parameter must be identified before analysis.
Hypothesis Testing
• Is also called significance testing
• Tests a claim about a parameter using evidence (data in a sample
• The technique is introduced by considering a one-sample z test
The Alternative Hypothesis, H1
• Is the opposite of the null hypothesis, e.g., the average # of TV sets in US homes is less than 3 (H1: μ < 3)
• Challenges the Status Quo
• Never contains the ‘=‘ sign
Sampling Distribution :
Level of Significance, α
• Defines unlikely values of the sample statistic if the null hypothesis is true
• Type II Error
– Do not reject a false null hypothesis
– Probability of Type II Error is β (Beta)
Hypothesis Testing: Steps
Test the assumption that the true mean SBP of participants is 120 mmHg.
State H0: H0: μ = 120
State H1: H1: μ ≠ 120
Choose α: α = 0.05
Choose n: n = 100
Choose Test: Z, t, χ² test (or p-value)
• t test statistic
Example 2:
• In a survey of diabetics in a large city, it was found that 100 out of 400 have diabetic foot. Can we conclude that 20 percent of diabetics in the sampled population have diabetic foot?
• Test at the α = 0.05 significance level.
Solution
Decision:
We have sufficient evidence to reject the H0 value of 20%.
We conclude that in the population of diabetics, the proportion who have diabetic foot does not equal 0.20.
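A sketch reproducing Example 2 numerically as a one-sample z test for a proportion; the normal approximation and the two-sided alternative are assumptions made for the illustration.

```python
# One-sample z test for a proportion, reproducing Example 2 above.
# H0: p = 0.20 vs. H1: p != 0.20 (two-sided), normal approximation assumed.
from math import sqrt, erf

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

x, n, p0, alpha = 100, 400, 0.20, 0.05
p_hat = x / n                                     # 0.25
se = sqrt(p0 * (1 - p0) / n)                      # 0.02
z = (p_hat - p0) / se                             # 2.5
p_value = 2 * (1 - normal_cdf(abs(z)))            # about 0.012

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the proportion with diabetic foot differs from 0.20")
else:
    print("Fail to reject H0")
```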
****************
PREDICTION ERROR
Errors are an inescapable element of predictive analytics that should also be quantified and
presented along with any model, often in the form of a confidence interval that indicates how
accurate its predictions are expected to be.
Analysis of prediction errors from similar or previous models can help determine
confidence intervals.
Predictions always contain errors
Predictive analytics has many applications, the above mentioned examples are just the tip of
the iceberg.
Many of them will add value, but it remains important to stress that the outcome of a
prediction model will always contain an error. Decision makers need to know how big that
error is.
To illustrate, in using historic data to predict the future you assume that the future will have
the same dynamics as the past, an assumption which history has proven to be dangerous.
In artificial intelligence (AI), the analysis of prediction errors can help guide machine
learning (ML), similarly to the way it does for human learning.
In reinforcement learning, for example, an agent might use the goal of minimizing error
feedback as a way to improve.
Prediction errors, in that case, might be assigned a negative value and predicted outcomes
a positive value, in which case the AI would be programmed to attempt to maximize its
score.
That approach to ML, sometimes known as error-driven learning, seeks to stimulate
learning by approximating the human drive for mastery.
The standard error of the estimate is closely related to this quantity and is defined below:
$$ \sigma_{est} = \sqrt{\frac{\sum (Y - Y')^{2}}{N}} $$
where $\sigma_{est}$ is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores.
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an
estimator (of a procedure for estimating an unobserved quantity) measures
the average of the squares of the errors—that is, the average squared difference between the
estimated values and what is estimated.
MSE is a risk function, corresponding to the expected value of the squared error loss.
The fact that MSE is almost always strictly positive (and not zero) is because
of randomness or because the estimator does not account for information that could produce a
more accurate estimate.
Mean squared prediction error
RMSD is always non-negative, and a value of 0 (almost never achieved in practice) would
indicate a perfect fit to the data.
In general, a lower RMSD is better than a higher one. However, comparisons across different
types of data would be invalid because the measure is dependent on the scale of the numbers
used.
RMSD is the square root of the average of squared errors.
The effect of each error on RMSD is proportional to the size of the squared error; thus larger
errors have a disproportionately large effect on RMSD.
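A short sketch computing MSE and RMSD for a set of predictions; the actual and predicted values are made-up illustrations.

```python
# Computing MSE and RMSD (RMSE) for a set of predictions.
from math import sqrt

actual    = [3.0, 5.0, 2.5, 7.0, 4.5]   # made-up observed values
predicted = [2.8, 5.4, 2.0, 6.5, 5.0]   # made-up model predictions

errors = [a - p for a, p in zip(actual, predicted)]
mse = sum(e ** 2 for e in errors) / len(errors)   # mean of the squared errors
rmsd = sqrt(mse)                                  # square root of the average squared error

print(f"MSE  = {mse:.4f}")
print(f"RMSD = {rmsd:.4f}")
# Because errors are squared before averaging, large errors dominate RMSD.
```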
Thus, prediction error influences the analytics functionalities and their application areas.
******************