DA Unit-1
BASIC TERMINOLOGIES
BIG DATA
Big data is a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional
data-processing application software.
Big Data is commonly characterized by four Vs:
• Volume
• Variety
• Velocity
• Veracity
The volume of data refers to the size of the data sets that need to be analyzed and processed,
which are now frequently larger than terabytes and petabytes. The sheer volume of the data
requires processing technologies distinct from traditional storage and processing
capabilities. In other words, the data sets in Big Data are too large to process
with a regular laptop or desktop processor. An example of a high-volume data set would be
all credit card transactions made in Europe on a single day.
Velocity refers to the speed with which data is generated. High-velocity data is generated
at such a pace that it requires distinct (distributed) processing techniques. An example of
data that is generated with high velocity would be Twitter messages or Facebook posts.
Variety makes Big Data really big. Big Data comes from a great variety of sources and
generally falls into one of three types: structured, semi-structured and unstructured data. The
variety in data types frequently requires distinct processing capabilities and specialist
algorithms. An example of a high-variety data set would be the CCTV audio and video files
that are generated at various locations in a city.
Veracity refers to the quality of the data that is being analyzed. High veracity data has many
records that are valuable to analyze and that contribute in a meaningful way to the overall
results. Low veracity data, on the other hand, contains a high percentage of meaningless data.
The non-valuable data in these data sets is referred to as noise. An example of a high-veracity data
set would be data from a medical experiment or trial.
Data that is high volume, high velocity and high variety must be processed with advanced
tools (analytics and algorithms) to reveal meaningful information. Because of these
characteristics of the data, the knowledge domain that deals with the storage, processing, and
analysis of these data sets has been labeled Big Data.
FORMS OF DATA
– STRUCTURED FORM
– UNSTRUCTURED FORM
• Any form of data that follows a predefined structure (for example, rows and
columns in a relational table) is represented as the structured form of data.
Eg: database records, spreadsheets.
• Any form of data that does not have a predefined structure is represented as the
unstructured form of data. Eg: video, images, comments, posts, and websites
such as blogs and Wikipedia.
SOURCES OF DATA
DATA ANALYSIS
Data analysis is a process of inspecting, cleansing, transforming and modeling data with
the goal of discovering useful information, informing conclusions and supporting decision-
making.
DATA ANALYTICS
• Data analytics is the science of analyzing raw data in order to draw conclusions about
that information. This information can then be used to optimize processes to increase
the overall efficiency of a business or system.
Types:
– Descriptive analytics: In descriptive statistics, the result is always given as a probability
among ‘n’ number of options, where each option has an equal chance of probability.
– Predictive analytics Eg: healthcare, sports, weather, insurance, social media analysis.
This type of analytics deals with using past data to make predictions and decisions
based on certain algorithms. In the case of a doctor, the doctor questions the patient
about the past to treat the illness through already existing procedures.
– Prescriptive analytics works with predictive analytics, which uses data to determine
near-term outcomes. Prescriptive analytics makes use of machine learning to help
businesses decide a course of action based on a computer program's predictions.
DIFFERENCE BETWEEN DATA ANALYTICS AND DATA ANALYSIS
– Data analytics (prediction) means we are trying to find conclusions about the future.
– Data analysis means we analyze what has happened in the past.
MACHINE LEARNING
In general, data is passed to a machine learning tool to perform descriptive data analytics
through a set of algorithms built into it. Here both data analytics and data analysis are done by the
tool automatically. Hence we can say that data analysis is a sub-component of data analytics,
and data analytics is a sub-component of the machine learning tool. All these are described in
figure 0.2. The output of this machine learning tool is a model. From this model,
predictive analytics and prescriptive analytics can be performed, because the model feeds its
output back as data to the machine learning tool. This cycle continues till we get an efficient output.
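As a rough illustration of this loop, the sketch below (Python with the pandas and scikit-learn libraries; the file sales.csv and its "target" column are hypothetical) passes data to a learning algorithm, obtains a model, and then uses that model for prediction:

```python
# Minimal sketch of the analytics loop: data -> machine learning tool -> model -> predictions.
# Assumes pandas and scikit-learn are installed; "sales.csv" and its "target" column are made up.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("sales.csv")          # raw data passed to the tool
X = data.drop(columns=["target"])        # descriptive attributes
y = data["target"]                       # the value we want to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression().fit(X_train, y_train)   # the tool generates a model
predictions = model.predict(X_test)                # predictive analytics on unseen data
print(model.score(X_test, y_test))                 # rough check of how good the model is
```

In practice the results are fed back, the data or algorithm is adjusted, and the model is retrained until the output is efficient enough.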
UNIT - I
1.1 DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS
Data architecture is composed of models, policies, rules or standards that govern which
data
is collected, and how it is stored, arranged, integrated, and put to use in data systems and
in organizations. Data is usually one of several architecture domains that form the pillars of
an enterprise architecture or solution architecture.
Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.
• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such
features as data warehouses is also a common organizational requirement, since this enables
managerial decision making and other organizational processes. One of the architecture
techniques is the split between managing transaction data and (master) reference data.
Another one is splitting data capture systems from data retrieval systems (as done in a
data warehouse).
• Technology drivers
These are usually suggested by the completed data architecture and database
architecture designs. In addition, some technology drivers will derive from existing
organizational integration frameworks and standards, organizational economics, and
existing site resources (e.g. previously purchased software licensing).
• Economics
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost. External factors such as the business cycle, interest rates,
market conditions, and legal considerations could all have an effect on decisions relevant to
data architecture.
• Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental
laws that can vary by applicable agency. These policies and rules help describe the
manner in which the enterprise wishes to process its data.
• Data processing needs
These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data
mining), repetitive periodic reporting, ad hoc reporting, and support of various
organizational initiatives as required (e.g. annual budgets, new product development).
The logical view (user's view) of data analytics represents data in a format that is
meaningful to a user and to the programs that process those data. That is, the logical
view tells the user, in user terms, what is in the database. The logical level consists of data
requirements and process models, which are processed using data modelling techniques to
result in a logical data model.
The physical level is created when we translate the top-level design into physical tables in
the database. This model is created by the database architect, software architects, software
developers or the database administrator. The input to this level comes from the logical level,
and various data modelling techniques are used here with input from software developers or
the database administrator. These data modelling techniques are various formats of
representation of data such as the relational data model, network model, hierarchical model,
object-oriented model and entity-relationship model.
The implementation level contains details about the modification and presentation of data through
the use of various data mining tools such as R-Studio, WEKA, Orange, etc. Each tool
has its own specific way of working and its own representation for viewing the same data.
These tools are very helpful to the user since they are user friendly and do not require much
programming knowledge from the user.
Data can be generated from two types of sources, namely Primary and Secondary.
Sources of Primary Data
Observation Method:
There exist various observation practices, and our role as an observer may
vary according to the research approach. We make observations from either the
outsider or insider point of view in relation to the researched phenomenon and the
observation technique can be structured or unstructured. The degree of the outsider
or insider points of view can be seen as a movable point in a continuum between
the extremes of outsider and insider. If you decide to take the insider point of view,
you will be a participant observer in situ and actively participate in the observed
situation or community. The activity of a Participant observer in situ is called field
work. This observation technique has traditionally belonged to the data collection
methods of ethnology and anthropology. If you decide to take the outsider point of
view, you try to distance yourself from your own cultural ties and observe the
researched community as an outsider observer. These details are seen in figure 1.2.
Experimental Designs
There are a number of experimental designs that are used in carrying out an
experiment. However, market researchers have used 4 experimental designs most
frequently. These are –
LSD - Latin Square Design
The balanced arrangement achieved in a Latin Square is its main strength. In this
design, the comparisons among treatments will be free from both differences
between rows and columns. Thus the magnitude of error will be smaller than in any
other design.
FD - Factorial Designs
This design allows the experimenter to test two or more variables simultaneously. It
also measures interaction effects of the variables and analyzes the impacts of each of the
variables. In a true experiment, randomization is essential so that the experimenter can infer
cause and effect without any bias.
Sources of Secondary Data
Internal sources
If available, internal secondary data may be obtained with less time, effort and
money than the external secondary data. In addition, they may also be more
pertinent to the situation at hand since they are from within the organization. The
internal sources include
Accounting resources- This gives so much information which can be used by the
marketing researcher. They give information about internal factors.
Sales Force Report- It gives information about the sales of a product. The
information provided is from outside the organization.
Internal Experts- These are people who are heading the various departments. They
can give an idea of how a particular thing is working
Miscellaneous Reports- These are the pieces of information obtained from
operational reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data sources.
Based on various features (cost, data, process, source time etc.) various
sources of data can be compared as per table 1.
Sensor data is the output of a device that detects and responds to some type
of input from the physical environment. The output may be used to provide
information or input to another system or to guide a process. Examples are as follows
The simplest form of signal is a direct current (DC) that is switched on and
off; this is the principle by which the early telegraph worked. More complex signals
consist of an alternating-current (AC) or electromagnetic carrier that contains one
or more data streams.
Data must be transformed into electromagnetic signals prior to transmission
across a network. Data and signals can be either analog or digital. A signal is
periodic if it consists of a continuously repeating pattern.
1.5 Understanding Sources of Data from GPS
Accuracy and Precision: This characteristic refers to the exactness of the data.
It cannot have any erroneous elements and must convey the correct message without
being misleading. This accuracy and precision have a component that relates to its
intended use. Without understanding how the data will be consumed, ensuring
accuracy and precision could be off-target or more costly than necessary. For
example, accuracy in healthcare might be more important than in another industry
(which is to say, inaccurate data in healthcare could have more serious
consequences) and, therefore, justifiably worth higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this
characteristic. For example, on surveys, items such as gender, ethnicity, and
nationality are typically limited to a set of options and open answers are not
permitted. Any answers other than these would not be considered valid or legitimate
based on the survey’s requirement. This is the case for most data and must be
carefully considered when determining its quality. The people in each department
in an organization understand what data is valid or not to them, so the requirements
must be leveraged when evaluating data quality.
Timeliness and Relevance: There must be a valid reason to collect the data to
justify the effort required, which also means it has to be collected at the right
moment in time. Data collected too soon or too late could misrepresent a
situation and drive
inaccurate decisions.
Availability and Accessibility: This characteristic can be tricky at times due to legal
and regulatory constraints. Regardless of the challenge, though, individuals need the
right level of access to the data in order to perform their jobs. This presumes that
the data exists and is available for access to be granted.
Noisy data
Origins of noise
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of
one another. This is a major issue when merging data from multiple, heterogeneous
sources.
Example: the same person recorded with multiple email addresses.
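A minimal pandas sketch (the names and email addresses are made up) of flagging and dropping such duplicate records after merging sources:

```python
import pandas as pd

# Hypothetical records merged from two heterogeneous sources.
records = pd.DataFrame({
    "name":  ["Asha", "Asha", "Ravi"],
    "email": ["asha@mail.com", "asha@mail.com", "ravi@mail.com"],
})

print(records.duplicated())            # flags the repeated row
deduplicated = records.drop_duplicates()
print(deduplicated)
```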
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
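As a small illustration of handling missing data, the pandas sketch below (the "age" column and its values are hypothetical) shows the two usual options, dropping incomplete rows or filling the gaps with a summary value such as the mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 42, np.nan, 67]})

dropped = df.dropna()                    # remove rows with missing values
filled = df.fillna(df["age"].mean())     # or fill missing values with the column mean
print(dropped)
print(filled)
```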
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
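A short sketch of smoothing by bin means, assuming the data is already sorted and is divided into equal-size segments (the sample values are made up):

```python
import numpy as np

# Sorted data divided into 4 equal-size segments (bins) of 3 values each.
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = data.reshape(4, 3)

# Smoothing by bin means: every value in a segment is replaced by the segment mean.
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)
```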
Data discretization is the process of converting continuous data into
discrete buckets or intervals. Here's an example:
Example: Discretizing Age into Age Groups
Suppose we have a dataset with a continuous "Age" column:
Person Age
A 23
B 35
C 42
D 51
E 67
If we want to discretize "Age" into categorical bins, we can define age
groups:
0-30 → "Young"
31-50 → "Middle-aged"
51+ → "Senior"
Applying discretization:
Person Age Age Group
A 23 Young
B 35 Middle-aged
C 42 Middle-aged
D 51 Senior
E 67 Senior
This process simplifies analysis, especially for machine learning
models that prefer categorical features. You can use methods like:
Equal-width binning (dividing data into equal-sized ranges)
Equal-frequency binning (each bin has roughly the same number of observations)
K-means clustering (grouping similar values)
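The age-group example above can be reproduced in pandas; a minimal sketch using pd.cut with the bin edges from the example (pd.cut with an integer bin count gives equal-width bins, and pd.qcut gives equal-frequency bins):

```python
import pandas as pd

ages = pd.DataFrame({"person": list("ABCDE"), "age": [23, 35, 42, 51, 67]})

# Custom bins matching the example: 0-30 Young, 31-50 Middle-aged, 51+ Senior.
ages["age_group"] = pd.cut(ages["age"],
                           bins=[0, 30, 50, 120],
                           labels=["Young", "Middle-aged", "Senior"])
print(ages)

# Equal-width binning:      pd.cut(ages["age"], bins=3)
# Equal-frequency binning:  pd.qcut(ages["age"], q=3)
```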
3. Smoothing by Clustering (K-Means)
Data points are grouped using clustering algorithms.
Each value is replaced by its cluster centroid.
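A minimal sketch, assuming scikit-learn and made-up values, of smoothing by clustering where each value is replaced by the centroid of the cluster it belongs to:

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([4, 8, 9, 21, 24, 25, 28, 29, 34]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
# Replace each value with the centroid of its cluster.
smoothed = kmeans.cluster_centers_[kmeans.labels_].ravel()
print(smoothed)
```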
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
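A small sketch of smoothing by fitting a linear regression function with one independent variable (numpy's polyfit is used here as one possible choice; the data is synthetic):

```python
import numpy as np

x = np.arange(10)
y = 2 * x + np.random.normal(scale=2.0, size=10)   # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)          # fit a linear regression function
y_smoothed = slope * x + intercept                  # replace values with the fitted ones
print(y_smoothed)
```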
3. Clustering:
This approach groups similar data into clusters. Values that fall
outside the clusters can then be detected as outliers.
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the
mining process. This involves the following ways.
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to
1.0 or 0.0 to 1.0).
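A tiny sketch of min-max normalization, which rescales made-up values into the 0.0 to 1.0 range:

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: (v - min) / (max - min) maps every value into [0.0, 1.0].
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)   # [0.    0.125 0.25  0.5   1.   ]
```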
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels
or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute “city” can be converted to “country”.
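A tiny sketch of climbing the concept hierarchy from “city” to “country” (the city-to-country mapping is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Hyderabad", "Paris", "Chennai"]})

# Climb the hierarchy one level: city -> country.
city_to_country = {"Hyderabad": "India", "Paris": "France", "Chennai": "India"}
df["country"] = df["city"].map(city_to_country)
print(df)
```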
3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data, and
analysis becomes harder when working with such huge volumes. In order to
get rid of this problem, we use data reduction techniques. They aim to increase
storage efficiency and reduce data storage and analysis costs.
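The text does not name a particular reduction technique; one widely used option is dimensionality reduction, sketched below with scikit-learn's PCA on a made-up data set:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data set: 100 records with 4 attributes each.
X = np.random.rand(100, 4)

# Reduce the 4 attributes to 2 principal components, cutting storage and
# analysis cost while keeping as much of the variance as possible.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)
```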