DA total notes
(Data Analytics)
Data Management: Design Data Architecture and manage the data for analysis,
understand various sources of Data like Sensors/Signals/GPS etc. Data
Management, Data Quality (noise, outliers, missing values, duplicate data) and
Data Pre-processing & Processing.
Conceptual model:
It is a business-level model which uses the Entity Relationship (ER) model to describe
the relationships between entities and their attributes.
Logical model: It is a model where problems are represented in logical form, such as
rows and columns of data, classes, XML tags and other DBMS techniques.
Physical model:
Physical models hold the database design details, such as which type of database
technology will be suitable for the architecture.
Factors that influence Data Architecture:
A few influences that can affect data architecture are business policies, business
requirements, technology in use, business economics, and data processing needs.
➢ Business requirements
➢ Business policies
➢ Technology in use
➢ Business economics
➢ Data processing needs
Business requirements:
These include factors such as the expansion of the business, the performance
of system access, data management, transaction management, and making
use of raw data by converting it into image files and records and then
storing it in data warehouses. Data warehouses are the main means of storing
transactions in business.
Business policies:
The policies are rules that are useful for describing the way of processing
data. These policies are made by internal organizational bodies and other
government agencies.
Technology in use:
This includes drawing on previously completed data architecture designs and
also making use of existing licensed software purchases and database
technology.
Business economics:
Economic factors such as business growth and loss, interest rates,
loans, the condition of the market, and the overall cost will also have an effect
on the design of the architecture.
Data processing needs:
These include factors such as mining of the data, large continuous
transactions, database management, and other data preprocessing needs.
Data management:
Data management is an administrative process that includes acquiring,
validating, storing, protecting, and processing required data to ensure the
accessibility, reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and
consuming data at unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization. Managing data effectively requires having
a data strategy and reliable methods to access, integrate, cleanse, govern, store
and prepare data for analytics. In our digital world, data pours into organizations
from many sources – operational and transactional systems, scanners, sensors,
smart devices, social media, video and text. But the value of data is not based on
its source, quality or format. Its value depends on what you do with it.
Motivation/Importance of Data management:
➢ Data management plays a significant role in an organization's ability to
generate revenue and control costs.
➢ Data management helps organizations to mitigate risks.
➢ It enables decision making in organizations.
What are the benefits of good data management?
➢ Optimum data quality
➢ Improved user confidence
➢ Efficient and timely access to data
➢ Improves decision making in an organization
Managing data Resources:
➢ An information system provides users with timely, accurate, and relevant
information.
➢ The information is stored in computer files. When files are properly arranged
and maintained, users can easily access and retrieve the information when
they need it.
➢ If the files are not properly managed, they can lead to chaos in information
processing.
➢ Even if the hardware and software are excellent, the information system
can be very inefficient because of poor file management.
Areas of Data Management:
Data Modeling: This is first creating a structure for the data that you collect and use,
and then organizing this data in a way that is easily accessible and efficient to
store and pull for reports and analysis.
Data Warehousing: This is storing data effectively so that it can be accessed and used
efficiently in the future.
Data Movement: This is the ability to move data from one place to another. For instance,
data needs to be moved from where it is collected to a database and then to an
end user.
Understand various sources of the Data:
Data is a special type of information generally obtained through observations,
surveys or inquiries, or generated as a result of human activity. Methods of data
collection are essential for anyone who wishes to collect data.
Data collection is a fundamental aspect and, as a result, there are different
methods of collecting data which, when used on one particular set, will result in
different kinds of data.
Collection of data refers to a purposeful gathering of information relevant to the
subject-matter of the study from the units under investigation. The method of
collection of data mainly depends upon the nature, purpose and scope of the inquiry
on one hand, and the availability of resources and time on the other.
Data can be generated from two types of sources namely
1. Primary sources of data
2. Secondary sources of data
1. Primary sources of data:
Primary data refers to the first hand data gathered by the researcher himself.
Sources of primary data are surveys, observations, Experimental Methods.
Survey: The survey method is one of the primary sources of data, used
to collect quantitative information about items in a population. Surveys are
used in different areas for collecting data, in both the public and private sectors.
A survey may be conducted in the field by the researcher. The respondents are
contacted by the researcher personally, telephonically or through mail. This
method takes a lot of time, effort and money, but the data collected are of high
accuracy, current and relevant to the topic.
When the questions are administered by a researcher, the survey is called a
structured interview or a researcher-administered survey.
Observations: Observation is one of the primary sources of data. It is a
technique for obtaining information that involves measuring variables or gathering
the data necessary for measuring the variable under investigation.
Observation is defined as accurate watching and noting of phenomena as they
occur in nature with regard to cause and effect relations.
Interview: Interviewing is a technique that is primarily used to gain an
understanding of the underlying reasons and motivations for people’s attitudes,
preferences or behavior. Interviews can be undertaken on a personal one-to-one
basis or in a group.
Experimental Method: There are a number of experimental designs that are used
in carrying out an experiment. However, market researchers have used four
experimental designs most frequently. These are:
CRD - Completely Randomized Design
RBD - Randomized Block Design
LSD - Latin Square Design
FD - Factorial Designs
CRD: A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of
receiving any one treatment.
CRD is appropriate only for experiments with homogeneous experimental
units.
RBD - The term Randomized Block Design originated from agricultural
research. In this design several treatments of variables are applied to different
blocks of land to ascertain their effect on the yield of the crop. Blocks are formed
in such a manner that each block contains as many plots as the number of
treatments, so that one plot from each block is selected at random for each treatment.
The production of each plot is measured after the treatment is given. These data
are then interpreted and inferences are drawn by using the Analysis of Variance
technique, so as to know the effect of various treatments like different doses of
fertilizers, different types of irrigation, etc.
LSD - Latin Square Design - A Latin square is one of the experimental designs
which has a balanced two-way classification scheme, say for example a 4 x 4
arrangement. In this scheme each letter from A to D occurs only once in each row
and also only once in each column. It may be noted that the balanced arrangement
will not get disturbed if any row is interchanged with another.
The balanced arrangement achieved in a Latin Square is its main strength. In this
design, the comparisons among treatments will be free from both differences
between rows and columns. Thus the magnitude of error will be smaller than in any
other design.
FD - Factorial Designs - This design allows the experimenter to test two or more
variables simultaneously. It also measures interaction effects of the variables and
analyzes the impacts of each of the variables. In a true experiment, randomization
is essential so that the experimenter can infer cause and effect without any bias.
An experiment which involves multiple independent variables is known as a factorial
design.
A factor is a major independent variable. For example, suppose we have two factors:
time in instruction and setting. A level is a subdivision of a factor. In this example,
time in instruction has two levels and setting has two levels.
2. Secondary sources of data:
Secondary data is data gathered from sources other than first-hand collection by the
researcher; it can be obtained from internal or external sources.
➢ Internal Sources:
If available, internal secondary data may be obtained with less time, effort and
money than the external secondary data. In addition, they may also be more
pertinent to the situation at hand since they are from within the organization.
The internal sources include
Accounting resources - These give a lot of information which can be used by
the marketing researcher. They give information about internal factors.
Sales Force Reports - These give information about the sales of a product. The
information provided comes from outside the organization.
Internal Experts - These are the people heading the various departments.
They can give an idea of how a particular thing is working.
Miscellaneous Reports - This is the information obtained from operational
reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data
sources.
Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a
five-number summary:
• Minimum
• First quartile (Q1)
• Median
• Third quartile (Q3)
• Maximum
A quick sketch of computing this summary is shown below.
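As an illustration, here is a minimal Python sketch (assuming NumPy and Matplotlib are available; the sample data is made up) that computes the five-number summary and draws a boxplot:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample data
data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# Five-number summary
minimum = data.min()
q1 = np.percentile(data, 25)
median = np.median(data)
q3 = np.percentile(data, 75)
maximum = data.max()
print(minimum, q1, median, q3, maximum)

# Boxplot built from the same distribution
plt.boxplot(data)
plt.show()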
Most common causes of outliers on a data set:
➢ Data entry errors (human errors)
➢ Measurement errors (instrument errors)
➢ Experimental errors (data extraction or experiment
planning/executing errors)
➢ Intentional (dummy outliers made to test detection methods)
➢ Data processing errors (data manipulation or data set unintended
mutations)
➢ Sampling errors (extracting or mixing data from wrong or various
sources)
➢ Natural (not an error, novelties in data)
How to remove Outliers?
Most of the ways to deal with outliers are similar to the methods for missing values,
such as deleting observations, transforming them, binning them, treating them as a
separate group, imputing values and other statistical methods. Here, we discuss the
common techniques used to deal with outliers (a small code sketch follows below):
Deleting observations: We delete outlier values if they are due to data entry errors or
data processing errors, or if the outlier observations are very small in number. We can
also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate
outliers. Taking the natural log of a value reduces the variation caused by extreme
values. Binning is also a form of variable transformation. The Decision Tree algorithm
deals with outliers well because it bins the variables. We can also use the process
of assigning weights to different observations.
Imputing: Like imputation of missing values, we can also impute outliers. We can
use mean, median or mode imputation methods. Before imputing values, we should
analyse whether the outlier is natural or artificial. If it is artificial, we can go with imputing
values. We can also use a statistical model to predict the values of outlier observations
and then impute them with the predicted values.
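A minimal sketch of two of these options (deletion/trimming and capping), assuming pandas is available and using a made-up "value" column:

import pandas as pd

# Hypothetical data with one obvious outlier
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95, 11, 14]})

# Use the IQR rule to flag outliers
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: delete the outlier observations
trimmed = df[(df["value"] >= lower) & (df["value"] <= upper)]

# Option 2: cap (winsorize) outliers to the boundary values
capped = df["value"].clip(lower=lower, upper=upper)

print(trimmed)
print(capped)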
Missing data:
Missing data in the training data set can reduce the power / fit of a model or can lead
to a biased model, because we have not analysed the behavior and relationship with
other variables correctly. It can lead to wrong prediction or classification.
Why my data has missing values?
We looked at the importance of treatment of missing values in a dataset. Now,
let’s identify the reasons for occurrence of these missing values. They may occur
at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process.
In such cases, we should double-check for correct data with the data guardians. Some
hashing procedures can also be used to make sure data extraction is correct. Errors
at the data extraction stage are typically easy to find and can be corrected easily as
well.
2. Data collection: These errors occur at the time of data collection and are harder to
correct. They can be categorized into four types:
➢ Missing completely at random: This is a case when the probability of a
missing value is the same for all observations. For example:
respondents of a data collection process decide that they will declare
their earnings after tossing a fair coin. If heads occurs, the respondent
declares his / her earnings and vice versa. Here each observation has an
equal chance of a missing value.
➢ Missing at random: This is a case when the variable is missing at random
and the missing ratio varies for different values / levels of other input
variables. For example: we are collecting data for age, and females have a
higher missing-value rate compared to males.
➢ Missing that depends on unobserved predictors: This is a case when
the missing values are not random and are related to the unobserved
input variable. For example: In a medical study, if a particular
diagnostic causes discomfort, then there is higher chance of drop out
from the study. This missing value is not at random unless we have
included “discomfort” as an input variable for all patients.
➢ Missing that depends on the missing value itself: This is a case when
the probability of missing value is directly correlated with missing
value itself. For example: People with higher or lower income are likely
to provide non-response to their earning.
Which are the methods to treat missing values?
1. Deletion: It is of two types: list-wise deletion and pair-wise deletion (a pandas
sketch follows below).
➢ In list-wise deletion, we delete observations where any of the variables are
missing. Simplicity is one of the major advantages of this method, but it
reduces the power of the model because it reduces the sample size.
➢ In pair-wise deletion, we perform each analysis with all cases in which the
variables of interest are present. The advantage of this method is that it keeps
as many cases as possible available for analysis. One disadvantage is that it
uses different sample sizes for different variables.
➢ Deletion methods are used when the nature of missing data is “Missing
completely at random”; otherwise non-random missing values can bias the model
output.
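A minimal pandas sketch of the two deletion styles, using a made-up table with missing values:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 32, 41],
                   "income": [50, 60, np.nan, 80],
                   "gender": ["M", "F", "F", np.nan]})

# List-wise deletion: drop any row that has at least one missing value
listwise = df.dropna()

# Pair-wise style: each analysis uses only the rows where the
# variables of interest are present (sample size differs per pair)
age_income = df[["age", "income"]].dropna()
age_gender = df[["age", "gender"]].dropna()

print(len(listwise), len(age_income), len(age_gender))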
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. The objective is to employ known relationships that
can be identified in the valid values of the data set to assist in estimating the
missing values. Mean / Mode / Median imputation is one of the most frequently
used methods. It consists of replacing the missing data for a given attribute by
the mean or median (quantitative attribute) or mode (qualitative attribute) of all
known values of that variable.
It can be of two types:
➢ Generalized Imputation: In this case, we calculate the mean or median of all
non-missing values of that variable and then replace the missing values with
that mean or median. For example, if the variable “Manpower” has missing values,
we take the average of all non-missing values of “Manpower” (28.33) and replace
the missing values with it.
➢ Similar case Imputation: In this case, we calculate the average separately for
gender “Male” (29.75) and “Female” (25) from the non-missing values and then
replace the missing values based on gender. For “Male”, we replace missing
values of Manpower with 29.75 and for “Female” with 25. A small pandas sketch
of both styles follows below.
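A minimal sketch (assuming pandas; the Manpower/Gender table is reconstructed from the averages quoted above and is only illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Female", "Male"],
                   "Manpower": [30, 29.5, 25, np.nan, np.nan]})

# Generalized imputation: fill with the overall mean of the column
df["Manpower_gen"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar-case imputation: fill with the mean of the matching gender group
group_mean = df.groupby("Gender")["Manpower"].transform("mean")
df["Manpower_sim"] = df["Manpower"].fillna(group_mean)

print(df)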
Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled
in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided
into segments (bins) of equal size and then various methods are performed to complete
the task. Each segment is handled separately. One can replace all data in a segment by
its mean, or boundary values can be used to complete the task (see the sketch after
the list below).
➢ Smoothing by bin means: In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
➢ Smoothing by bin median: In this method each bin value is replaced by
its bin median value.
➢ Smoothing by bin boundary: In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
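A minimal sketch of smoothing by bin means and by bin boundaries, assuming NumPy and equal-sized bins on already-sorted data:

import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = data.reshape(4, 3)          # 4 bins of 3 values each

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 3, axis=1)

# Smoothing by bin boundaries: every value snaps to the nearest bin edge
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)
print(by_bounds)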
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering: This approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.
Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
Duplicate values: A dataset may include data objects which are duplicates of
one another. It may happen when say the same person submits a form more than
once. The term deduplication is often used to refer to the process of dealing with
duplicates. In most cases, the duplicates are removed so as to not give that
particular data object an advantage or bias, when running machine learning
algorithms.
Redundant data occurs while we merge data from multiple databases. If the
redundant data is not removed, incorrect results will be obtained during data
analysis. Redundant data typically occurs because the same attribute appears under
different names in different databases, or because an attribute can be derived from
another attribute or set of attributes (see the redundancy issue below).
There are mainly 2 major approaches for data integration – one is “Tight coupling
approach” and another is “Loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand and then sends the query directly to
the source databases to obtain the result.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration,
redundancy detection, and resolution of data value conflicts. These are explained
briefly below.
➢ Schema Integration: Integrate metadata from different sources. Matching up the
real-world entities from multiple sources is referred to as the entity
identification problem.
➢ Redundancy: An attribute may be redundant if it can be derived or obtained
from another attribute or set of attributes. Inconsistencies in attribute naming can
also cause redundancies in the resulting data set. Some redundancies can
be detected by correlation analysis.
➢ Detection and resolution of data value conflicts: This is the third critical
issue in data integration. Attribute values from different sources may differ
for the same real-world entity. An attribute in one system may be recorded
at a lower level of abstraction than the “same” attribute in another.
3. Data Transformation: This step is taken in order to transform the data into
forms suitable for the mining process. This involves the following ways:
➢ Normalization: It is done in order to scale the data values into a
specified range (-1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_F is
the minimum and max_F is the maximum of an attribute F, and
[new_min_F, new_max_F] is the new range. We have the formula:
v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F
where v is the value you want to map into the new range and v' is the new
value you get after normalizing the old value. A short code sketch follows.
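A minimal sketch of min-max normalization of an attribute to the range [0.0, 1.0], assuming NumPy and a made-up column of values:

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

new_min, new_max = 0.0, 1.0
min_f, max_f = values.min(), values.max()

# v' = ((v - min_F) / (max_F - min_F)) * (new_max - new_min) + new_min
normalized = (values - min_f) / (max_f - min_f) * (new_max - new_min) + new_min
print(normalized)   # values rescaled into [0.0, 1.0]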
UNIT – II
INTRODUCTION TO ANALYTICS
2.1 Introduction to Analytics
As an enormous amount of data gets generated, the need to extract useful insights is a
must for a business enterprise. Data Analytics has a key role in improving your business.
Here are 4 main factors which signify the need for Data Analytics:
Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to take further actions that grow the
business.
Perform Market Analysis – Market analysis can be performed to understand the
strengths and weaknesses of competitors.
Improve Business Requirements – Analysis of data allows improving the business
according to customer requirements and experience.
Data Analytics refers to the techniques to analyze data to enhance productivity and
business gain. Data is extracted from various sources and is cleaned and categorized to
analyze different behavioral patterns. The techniques and the tools used vary according to the
organization or individual.
Data analysts translate numbers into plain English. A Data Analyst delivers value to their
companies by taking information about specific topics and then interpreting, analyzing,
and presenting findings in comprehensive reports. So, if you have the capability to collect
data from various sources, analyze the data, gather hidden insights and generate reports, then
you can become a Data Analyst.
In general, data analytics also involves a degree of human knowledge, as discussed
in figure 2.2: under each type of analytics a certain amount of human input is required
for prediction. Descriptive analytics requires the highest human input, while predictive
analytics requires less. In the case of prescriptive analytics, little or no human input is
required since the predictions are driven by the data.
Data analytics brings together three main parts: subject (domain) knowledge, statistics,
and a person with computer knowledge who can work on a tool to give insight into the
business. The mainly used tools are R and Python, as shown in figure 2.3.
With the increasing demand for Data Analytics in the market, many tools have emerged
with various functionalities for this purpose. Either open-source or user-friendly, the top tools
in the data analytics market are as follows.
R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac
OS. It also provides tools to automatically install all packages as per user-requirement.
Python – Python is an open-source, object-oriented programming language which is easy
to read, write and maintain. It provides various machine learning and visualization
libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also be
integrated with platforms such as SQL Server, a MongoDB database, or JSON data.
Tableau Public – This is a free software that connects to any data source such as Excel,
corporate Data Warehouse etc. It then creates visualizations, maps, dashboards etc with
real-time updates on the web.
QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
SAS – A programming language and environment for data manipulation and analytics,
this tool is easily accessible and can analyze data from different sources.
Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients’ internal data, this tool analyzes the tasks that summarize the data
with a preview of pivot tables.
RapidMiner – A powerful, integrated platform that can integrate with any data source
types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc. This tool is
mostly used for predictive analytics, such as data mining, text analytics, machine
learning.
KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
OpenRefine – Also known as GoogleRefine, this data cleaning software will help you
clean up data for analysis. It is used for cleaning messy data, the transformation of data
and parsing data from websites.
Apache Spark – One of the largest large-scale data processing engines, this tool executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
This tool is also popular for data pipelines and machine learning model development.
Apart from the above-mentioned capabilities, a Data Analyst should also possess skills
such as Statistics, Data Cleaning, Exploratory Data Analysis, and Data Visualization. Also, if
you have knowledge of Machine Learning, then that would make you stand out from the
crowd.
Data analytics is mainly used in business in various areas; its purpose varies according
to business needs, as discussed below in detail. Nowadays the majority of businesses
deal with prediction from large amounts of data.
Using big data as a fundamental factor in decision making requires new capabilities, and most
firms are far from being able to access all their data resources. Companies in various sectors have
acquired crucial insight from the structured data collected from different enterprise systems
and analyzed by commercial database management systems. Eg:
1.) Facebook and Twitter are used to understand the instantaneous influence of a campaign and to
examine consumer opinion about their products.
2.) Some companies, like Amazon, eBay, and Google, considered early leaders,
examine the factors that control performance to determine what raises sales revenue and
user interactivity.
Hadoop is an open-source software platform that enables the processing of large data sets in a
distributed computing environment. The related literature discusses concepts around big data and
the rules for building, organizing and analyzing huge data-sets in the business environment; it
offers three architecture layers, indicates some graphical tools to explore and represent
unstructured data, and describes how famous companies could improve their business. Eg:
Google, Twitter and Facebook show their attention to processing big data within cloud
environments.
The Map() step: Each worker node applies the Map() function to the local data and writes the
output to a temporary storage space. The Map() code is run exactly once for each K1 key
value, generating output that is organized by key values K2. A master node arranges it so that
for redundant copies of input data only one is processed.
The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key
value that each processor should work on, and provide that processor with all of the map-
generated data associated with that key value, such that all data belonging to one key are
located on the same worker node.
The Reduce() step: Worker nodes process each group of output data (per key) in parallel,
executing the user-provided Reduce() code; each function is run exactly once for each K2 key
value produced by the map step.
Produce the final output: The MapReduce system collects all of the reduce outputs and sorts
them by K2 to produce the final outcome.
Fig. 2.4 shows the classical “word count problem” using the MapReduce paradigm. As shown
in Fig. 2.4, initially a process will split the data into a subset of chunks that will later be
processed by the mappers. Once the key/values are generated by the mappers, a shuffling process
is used to mix (combine) these key values (combining the same keys in the same worker
node). Finally, the reduce functions are used to count the words, generating a common
output as the result of the algorithm. As a result of the execution of the mappers/reducers, the
output will be a sorted list of word counts from the original text input.
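To make the paradigm concrete, here is a minimal single-machine sketch of the word count flow in Python (map, shuffle/group, reduce); it only imitates what a real MapReduce framework such as Hadoop distributes across worker nodes:

from collections import defaultdict

chunks = ["deer bear river", "car car river", "deer car bear"]

# Map step: emit (word, 1) pairs for each input chunk
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle step: group all values belonging to the same key (word)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce step: sum the counts for each key
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(sorted(word_counts.items()))   # sorted list of word counts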
IBM and Microsoft are prominent representatives. IBM offers many big data options
that enable users to store, manage, and analyze data through various resources; it has a
strong presence in business intelligence as well as healthcare. Compared with IBM,
Microsoft has shown powerful work in the area of cloud computing activities and techniques.
Another example is Facebook and Twitter, which collect various data from users'
profiles and use it to increase their revenue.
Big data analytics and business intelligence are closely related fields which have become widely
significant in business and academia; companies are permanently trying to gain
insight from the expanding three V's (variety, volume and velocity) to support decision
making.
2.4 Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by a
database management system (DBMS).
Databases can be divided into various categories such as text databases,
desktop database programs, relational database management systems (RDBMS), and NoSQL
and object-oriented databases.
A text database is a system that maintains a (usually large) text collection and
provides fast and accurate access to it. Eg: Text book, magazine, journals, manuals, etc..
NoSQL databases are non-tabular and store data differently than relational
tables. NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. Eg: JSON document stores such as
MongoDB and CouchDB, etc.
Object-oriented databases (OODB) are databases that represent data in the form
of objects and classes. In object-oriented terminology, an object is a real-world entity, and a
class is a collection of objects. Object-oriented databases follow the fundamental principles
of object-oriented programming (OOP). Eg: databases built around C++, Java, C#, Smalltalk,
LISP, etc.
In any database we will be working with data to perform analysis and
prediction. In a relational database management system we normally use rows to represent
records and columns to represent attributes.
In Nominal Data there is no natural ordering of the values of an attribute of the dataset.
Eg: color, gender, nouns (name, place, animal, thing). These categories cannot be given a
predefined order; for example, there is no specific way to arrange the gender of 50 students in
a class. The first student can be male or female, and similarly for all 50 students, so ordering
is not valid.
In Ordinal Data there is a natural ordering of the values of an attribute of the dataset. Eg:
size (S, M, L, XL, XXL), rating (excellent, good, better, worst). In the above example we can
quantify the data after ordering it, which gives valuable insights into the data.
A Discrete Attribute takes only a finite number of numerical values (integers). Eg:
number of buttons, number of days for product delivery, etc. Such data can be represented at
specific intervals in time series data mining or in ratio-based entries.
A Continuous Attribute takes fractional (real-valued) values. Eg: price,
discount, height, weight, length, temperature, speed, etc. Such data can be represented at
specific intervals in time series data mining or in ratio-based entries.
Data modelling is nothing but a process through which data is stored structurally in a
format in a database. Data modelling is important because it enables organizations to make
data-driven decisions and meet varied business goals.
The entire process of data modelling is not as easy as it seems, though. You are
required to have a deeper understanding of the structure of an organization and then propose
a solution that aligns with its end-goals and helps it achieve the desired objectives.
Data modeling can be achieved in various ways. However, the basic concept of each
of them remains the same. Let’s have a look at the commonly used data modeling methods:
Hierarchical model
As the name indicates, this data model makes use of hierarchy to structure the data in
a tree-like format as shown in figure 2.6. However, retrieving and accessing data is difficult
in a hierarchical database. This is why it is rarely used now.
Network model
The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships, as each record
can be linked with multiple parent records, as shown in figure 2.8. In this model data can be
shared easily and computation becomes easier.
Object-oriented model
This database model consists of a collection of objects, each with its own features and
methods. This type of database model is also called the post-relational database model, as
shown in figure 2.8.
Entity-relationship model
The entity relationship diagram explains the relations between entities along with their
primary keys and foreign keys, as shown in figure 2.10. It also shows multiple instances of
relationships between tables.
Now that we have a basic understanding of data modeling, let’s see why it is important.
You will agree with us that the main goal behind data modeling is to equip your business and
contribute to its functioning. As a data modeler, you can achieve this objective only when
you know the needs of your enterprise correctly.
It is essential to make yourself familiar with the varied needs of your business so that you can
prioritize and discard the data depending on the situation.
Key takeaway: Have a clear understanding of your organization’s requirements and organize
your data properly.
Things will be sweet initially, but they can become complex in no time. This is why it is
highly recommended to keep your data models small and simple, to begin with.
Once you are sure of your initial models in terms of accuracy, you can gradually introduce
more datasets. This helps you in two ways. First, you are able to spot any inconsistencies in
the initial stages. Second, you can eliminate them on the go.
Key takeaway: Keep your data models simple. The best data modeling practice here is to use
a tool which can start small and scale up as needed.
Organize your data based on facts, dimensions, filters, and order
You can find answers to most business questions by organizing your data in terms of four
elements – facts, dimensions, filters, and order.
Let’s understand this better with the help of an example. Let’s assume that you run four e-
commerce stores in four different locations of the world. It is the year-end, and you want to
analyze which e-commerce store made the most sales.
In such a scenario, you can organize your data over the last year. Facts will be the overall
sales data of last 1 year, the dimensions will be store location, the filter will be last 12
months, and the order will be the top stores in decreasing order.
This way, you can organize all your data properly and position yourself to answer an array
of business intelligence questions without breaking a sweat.
Key takeaway: It is highly recommended to organize your data properly using individual
tables for facts and dimensions to enable quick analysis.
While you might be tempted to keep all the data with you, do not ever fall for this trap!
Although storage is not a problem in this digital age, you might end up taking a toll on your
machines' performance.
More often than not, just a small yet useful amount of data is enough to answer all the business-
related questions. Spending heavily on hosting enormous amounts of data only leads to
performance issues, sooner or later.
Key takeaway: Have a clear opinion on how many datasets you want to keep. Maintaining
more than what is actually required wastes your data modeling effort and leads to performance
issues.
Data modeling is a big project, especially when you are dealing with huge amounts of data.
Thus, you need to be cautious enough. Keep checking your data model before continuing to
the next step.
For example, if you need to choose a primary key to identify each record in the dataset
properly, make sure that you are picking the right attribute. Product ID could be one such
attribute. Thus, even if two records otherwise match, their product ID can help you in
distinguishing each record. Keep checking if you are on the right track. Are the product IDs
the same too? In those cases, you will need to look for another dataset to establish the
relationship.
Key takeaway: It is the best practice to maintain one-to-one or one-to-many relationships.
The many-to-many relationship only introduces complexity in the system.
Key takeaway: Data models become outdated quicker than you expect. It is necessary that
you keep them updated from time to time.
The Wrap Up
Data modeling plays a crucial role in the growth of businesses, especially when it enables
organizations to base their decisions on facts and figures. To achieve the varied business
intelligence insights and goals, it is recommended to model your data correctly and use
appropriate tools to ensure the simplicity of the system.
In statistics, imputation is the process of replacing missing data with substituted values.
Because missing data can create problems for analyzing data, imputation is seen as a way
to avoid the pitfalls involved with list-wise deletion of cases that have missing values.
The advantages and disadvantages below refer to simple mean / median / mode imputation:
Advantages:
• Works well with numerical dataset.
• Very fast and reliable.
Disadvantage:
• Does not work with categorical attributes
• Does not correlate relation between columns
• Not very accurate.
• Does not account for any uncertainty in data
The k-nearest neighbours (KNN) algorithm is used for simple classification. The algorithm
uses ‘feature similarity’ to predict the values of any new data points. This means that the new
point is assigned a value based on how closely it resembles the points in the training set. This can
be very useful for imputing missing values: we find the k closest neighbours of the observation
with missing data and then impute the missing entries based on the non-missing values in the
neighbourhood.
Advantage:
• This method is more accurate than mean, median and mode imputation
Disadvantage:
• Sensitive to outliers
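A minimal sketch of both styles of imputation, assuming scikit-learn is available and using a tiny made-up numerical table:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Mean imputation: replace each missing entry with its column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: replace each missing entry using the k closest rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled)
print(knn_filled)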
UNIT-3
BLUE Property Assumptions
The Gauss Markov theorem tells us that if a certain set of assumptions are met, the
ordinary least squares estimate for regression coefficients gives you the Best Linear
Unbiased Estimate (BLUE) possible.
Linearity:
o The parameters we are estimating using the OLS method must be themselves
linear.
Random:
o Our data must have been randomly sampled from the population.
Non-Collinearity:
o The regressors being calculated aren’t perfectly correlated with each other.
Exogeneity:
o The regressors aren’t correlated with the error term.
Homoscedasticity:
o No matter what the values of our regressors might be, the variance of the error
term is constant.
Checking how well our data matches these assumptions is an important part of estimating
regression coefficients.
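As a small illustration (not part of the original notes), here is a sketch of fitting an OLS regression with statsmodels on made-up data; its summary output is what you would inspect while checking these Gauss-Markov assumptions (e.g., residual plots for homoscedasticity, condition number for collinearity):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)  # linear model plus random noise

X = sm.add_constant(x)          # add the intercept term
model = sm.OLS(y, X).fit()      # ordinary least squares estimate

print(model.params)             # estimated coefficients (approximately 2 and 3)
print(model.summary())          # diagnostics used when checking the assumptions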
DATA ANALYTICS UNIT 4
Object Segmentation
Object segmentation in data analytics involves dividing data into meaningful groups or
segments. Each segment shares common characteristics, which helps in analyzing patterns or
behaviors within a dataset. Object segmentation is widely used in various domains such as
marketing, finance, and healthcare to group data points that are similar, making it easier to
interpret and predict trends.
Diagram: Below is a sample diagram illustrating the difference between regression and
segmentation. In regression, there’s a continuous line representing predictions. In segmentation,
data points are grouped into distinct clusters.
Example: In retail, regression might be used to predict sales based on historical sales data, while
segmentation can help group customers into categories (e.g., high-spenders, occasional buyers)
based on purchasing behavior.
● In supervised learning, the model is trained on labeled data where both input and the
corresponding output are provided.
● The primary goal is to learn the mapping function that relates input to output.
Examples:
○ Predicting house prices based on features like area, number of rooms, and
location.
○ Email spam classification.
● Techniques:
○ Regression: Predict continuous values, e.g., predicting stock prices.
○ Classification: Predict discrete categories, e.g., classifying emails as spam or
non-spam.
○ Linear Regression: For continuous target variables.
○ Logistic Regression: For binary classification tasks.
● Key Features:
○ Requires labeled data for training.
○ The output can be continuous (regression) or categorical (classification).
● Applications:
○ Predictive Analytics: Forecasting sales, predicting customer churn.
○ Classification Problems: Identifying whether an email is spam or not.
Example: Predicting housing prices based on features like area, location, and the number of
bedrooms.
The model learns from this data to predict prices for new housing data.
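A minimal supervised learning sketch in Python (assuming scikit-learn; the feature values and prices are invented for illustration):

from sklearn.linear_model import LinearRegression

# Features: [area in sq. ft, number of bedrooms]; target: price in lakhs
X_train = [[1000, 2], [1500, 3], [1800, 3], [2400, 4]]
y_train = [50, 70, 85, 110]

model = LinearRegression().fit(X_train, y_train)   # learn the mapping X -> y

# Predict the price of a new, unseen house
print(model.predict([[2000, 3]]))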
Advantages:
Challenges:
Unsupervised Learning:
● In unsupervised learning, the model is trained on data without labeled outputs.
● It identifies patterns and relationships in the data.
● The goal is to identify underlying patterns, structures, or clusters within the data.
Examples:
○ Customer segmentation for marketing.
○ Identifying fraudulent transactions in financial data.
● Techniques:
○ Clustering: Grouping similar data points, e.g., K-means, hierarchical clustering.
○ Dimensionality Reduction: Reducing the number of features, e.g., PCA
(Principal Component Analysis).
● Key Features:
○ Does not rely on labeled outputs.
○ Focuses on exploring the dataset's hidden structures.
● Applications:
○ Customer Segmentation: Grouping customers based on purchasing behavior.
○ Anomaly Detection: Identifying fraudulent credit card transactions.
Example: Clustering shopping data to group customers based on their buying patterns.
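A minimal unsupervised learning sketch (assuming scikit-learn; the spending figures are made up) that clusters customers by annual spend and number of purchases:

from sklearn.cluster import KMeans

# Each row: [annual spend, number of purchases] for one customer
X = [[200, 5], [220, 6], [250, 7],        # occasional buyers
     [5000, 60], [5200, 65], [4800, 58]]  # high spenders

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centre of each discovered segment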
Advantages:
Challenges:
Example: A financial institution categorizes loan applicants as "low risk" or "high risk" based on
their credit history and income. The algorithm is trained on labeled data to classify new
applicants accordingly.
Unsupervised Learning: Here, the model is trained on unlabeled data, and it automatically
identifies patterns within the data. Clustering algorithms like K-means and DBSCAN are
commonly used for unsupervised segmentation.
Example: In marketing, clustering algorithms are applied to identify different customer groups
based on purchasing habits without predefined categories, enabling targeted marketing strategies.
Comparison Table:
Tree Building
Tree-building algorithms are widely used in supervised learning for both regression and
classification tasks.
Decision Trees
1. Root Node: Represents the entire dataset and the initial decision to be made.
2. Internal Nodes: Represent decisions or tests on attributes. Each internal node has one
or more branches.
3. Branches: Represent the outcome of a decision or test, leading to another node.
4. Leaf Nodes / Terminal Nodes: Represent the final decision or prediction. No further
splitting occurs at these nodes.
Building a decision tree involves the following steps:
1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or
information gain, the attribute that best splits the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the values of the
selected attribute.
3. Repeating the Process: The process is repeated recursively for each subset, creating
a new internal node or leaf node until a stopping criterion is met (e.g., all instances in
a node belong to the same class or a predefined depth is reached).
● Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is
split on an attribute.
● Simplicity and Interpretability: Decision trees are easy to understand and interpret.
The visual representation closely mirrors human decision-making processes.
● Versatility: Can be used for both classification and regression tasks.
● No Need for Feature Scaling: Decision trees do not require normalization or scaling
of the data.
● Handles Non-linear Relationships: Capable of capturing non-linear relationships
between features and target variables.
● Overfitting: Decision trees can easily overfit the training data, especially if they are
deep with many nodes.
● Instability: Small variations in the data can result in a completely different tree being
generated.
● Bias towards Features with More Levels: Features with more levels can dominate
the tree structure.
Key Concepts:
● Decision Trees:
○ A tree-like structure where each internal node represents a feature, each branch
represents a decision rule, and each leaf node represents an output.
○ Algorithms: CART (Classification and Regression Trees), ID3.
● Regression Trees:
○ Used for predicting continuous values.
○ Example: Predicting the sales of a product based on pricing and advertising.
● Classification Trees:
○ Used for predicting discrete categories.
○ Example: Determining whether a customer will buy a product based on age and
income.
Challenges:
● Overfitting: A tree model that is too complex and performs well on training data but
poorly on unseen data.
● Trees may become too complex, leading to poor performance on unseen data.
○ Solution: Use techniques like pruning (removing unnecessary nodes), limiting
tree depth, or ensembling methods like Random Forest.
● Pruning:
○ Reduces tree size by removing sections of the tree that provide little predictive
power.
○ Types: Pre-pruning (limit growth during training) and Post-pruning (simplify after
tree creation).
○ Pre-pruning: Stops tree growth early by limiting depth or the number of nodes.
○ Post-pruning: Simplifies a fully grown tree by removing redundant nodes.
● Random Forest: Builds multiple decision trees and aggregates their outputs for better
accuracy and robustness.
● Gradient Boosting Machines (GBM): Combines weak learners (small trees) iteratively
to improve overall performance.
Applications:
Decision Trees: Decision trees are widely used in segmentation, especially when the data has a
clear hierarchy. A decision tree divides the data based on the value of specific features
(variables), making a series of splits that result in a tree-like structure.
Regression Trees: Used when the outcome is continuous. For instance, predicting the price of a
house based on features like size and location.
Classification Trees: Used when the outcome is categorical. For example, classifying emails as
"spam" or "not spam."
Overfitting: A decision tree can become too complex by adding too many branches that fit the
training data very well but don’t generalize to new data. Overfitting makes the model less
effective in real-world scenarios.
Complexity: Large trees may become very complex and less interpretable. Simplifying the trees
with pruning can help make them easier to understand.
Pruning: To address overfitting, pruning techniques are applied. Pruning removes sections of the
tree that provide little predictive power, thereby reducing complexity and making the model
more generalizable.
To overcome overfitting, pruning techniques are used. Pruning reduces the size of the tree by
removing nodes that provide little power in classifying instances. There are two main types of
pruning:
● Pre-pruning (Early Stopping): Stops the tree from growing once it meets certain
criteria (e.g., maximum depth, minimum number of samples per leaf).
● Post-pruning: Removes branches from a fully grown tree that do not provide
significant power.
Example: In a medical diagnosis decision tree, some branches might only apply to specific cases
and are not representative of general patterns. Pruning these branches improves the model’s
accuracy on unseen data.
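A minimal sketch of limiting tree complexity with scikit-learn (max_depth as a pre-pruning control, ccp_alpha for cost-complexity post-pruning); the toy data is invented:

from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age, income] -> buys product (0/1)
X = [[22, 20], [25, 32], [47, 65], [52, 80], [46, 58], [56, 90]]
y = [0, 0, 1, 1, 1, 1]

# Pre-pruning: stop growth early by limiting the depth of the tree
shallow_tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Post-pruning: grow fully, then prune weak branches via cost-complexity pruning
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(shallow_tree.predict([[30, 40]]))
print(pruned_tree.get_depth())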
Applications of Decision Trees
Multiple Decision Trees involve leveraging multiple decision tree algorithms to improve
prediction accuracy and reduce the risk of overfitting. These techniques are commonly used in
ensemble learning methods, where a group of trees collaborates to make better predictions than a
single tree.
● Overfitting: Single trees may fit the training data too closely, resulting in poor
generalization to unseen data.
● Bias and Variance: A single tree may have high variance or high bias, depending on its
configuration.
● Stability: Small changes in the dataset can lead to significantly different trees.
1) Random Forest
● Concept:
○ Builds multiple decision trees on different subsets of the dataset (created through
bootstrapping) and features (randomly selected for each split).
○ The final prediction is obtained by:
■ Regression: Taking the average of predictions from all trees.
■ Classification: Using majority voting among the trees.
● Key Features:
○ Reduces overfitting by averaging multiple trees.
○ Handles missing data well.
○ Can measure feature importance.
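A minimal Random Forest sketch (assuming scikit-learn; the toy data is purely illustrative):

from sklearn.ensemble import RandomForestClassifier

# Toy training data: [age, income] -> buys product (0/1)
X = [[22, 20], [25, 32], [47, 65], [52, 80], [46, 58], [56, 90]]
y = [0, 0, 1, 1, 1, 1]

# Build many trees on bootstrapped samples and random feature subsets,
# then combine them by majority voting
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(forest.predict([[30, 40]]))    # majority vote of the trees
print(forest.feature_importances_)   # per-feature importance measure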
2) Gradient Boosting
● Concept:
○ Builds trees sequentially, where each tree attempts to correct the errors of the
previous trees.
○ A loss function (e.g., Mean Squared Error) guides how trees are built.
○ Models like XGBoost, LightGBM, and CatBoost are popular implementations.
● Key Features:
○ High predictive accuracy.
○ Can handle both regression and classification tasks.
○ Requires careful tuning of hyperparameters (e.g., learning rate, number of trees).
● Example: Predicting product sales:
○ First tree predicts 100, but the actual value is 120.
○ Second tree tries to predict the residual (20).
○ Final prediction is the sum of the predictions from all trees.
3) Bagging (Bootstrap Aggregating)
● Concept:
○ Trains multiple decision trees on different random subsets of the training data
(using bootstrapping).
○ Combines predictions by averaging (for regression) or voting (for classification).
● Key Features:
○ Reduces variance and avoids overfitting.
○ Often used as a base for Random Forest.
● Example: Predicting stock prices:
○ Each tree is trained on a different subset of the dataset.
○ Predictions are averaged to produce the final output.
4) Extra Trees (Extremely Randomized Trees)
● Concept:
○ A variant of Random Forest where splits are made randomly, rather than choosing
the best split.
○ Uses the entire dataset (no bootstrapping).
● Key Features:
○ Faster than Random Forest.
○ Adds additional randomness to reduce overfitting.
5) AdaBoost (Adaptive Boosting)
● Concept:
○ Focuses on improving weak learners (e.g., shallow decision trees).
○ Adjusts the weights of incorrectly predicted samples, so subsequent trees focus
more on them.
● Key Features:
○ Works well with imbalanced data.
○ Sensitive to outliers.
● Example: Classifying fraudulent transactions:
○ First tree classifies 90% of the data correctly.
○ Second tree focuses on the 10% misclassified cases, and so on.
Applications
Time series analysis focuses on understanding patterns and trends in data over time to make
forecasts. Time series analysis deals with data that is collected over time in sequential order. This
type of data can reveal trends, cycles, and seasonal patterns. Time series analysis is crucial in
fields like finance, retail, and meteorology, where forecasting future values based on historical
patterns is valuable. Time series analysis involves analyzing data points collected or recorded at
specific time intervals. It is widely used for forecasting trends and predicting future values.
Key Concepts: time series data can reveal trends, cycles, and seasonal patterns (detailed below).
Applications:
● Weather prediction.
● Sales forecasting.
● Anomaly detection in IoT data.
The ARIMA model is one of the most widely used time series forecasting models. ARIMA is a
statistical model used for time series forecasting. It combines three main
components—Auto-Regression (AR), Integration (I), and Moving Average (MA)—to predict
future values based on past observations.
AR (Auto-Regressive): Uses past values to predict future values. For example, the sales on a
particular day could depend on the sales from previous days.
I (Integrated): Differencing the data to make it stationary. Stationarity means that the data’s
statistical properties (mean, variance) are consistent over time.
MA (Moving Average): Incorporates the dependency between an observation and residual
errors from previous observations.
Components of ARIMA
1. Auto-Regression (AR):
● Refers to a model that uses the relationship between a variable and its past values.
● Example: Predicting the current sales of a product based on sales in the previous
months.
● Represented as p: The number of lagged observations to include in the model.
2. Integration (I):
● Involves differencing the data to make it stationary (removing trends or
seasonality).
● Represented as d: The number of differencing operations required.
● Example: If sales consistently increase by 10 units every month, differencing will
subtract one month’s sales from the next to stabilize the trend.
3. Moving Average (MA):
● Models the relationship between an observation and the residual errors from
previous forecasts.
● Represented as q: The size of the moving average window.
Example: Suppose a retailer wants to forecast monthly sales for the next year. Using ARIMA,
the model would learn from monthly sales data over the past few years, capturing trends and
seasonality, to predict future sales values.
Parameters of ARIMA
Each component in ARIMA functions as a parameter with a standard notation. For ARIMA
models, a standard notation would be ARIMA with p, d, and q, where integer values substitute
for the parameters to indicate the type of ARIMA model used. The parameters can be defined as:
● p: the number of lag observations in the model, also known as the lag order.
● d: the number of times the raw observations are differenced; also known as the degree of
differencing.
● q: the size of the moving average window, also known as the order of the moving
average.
For example, a linear regression model includes the number and type of terms. A value of zero
(0), which can be used as a parameter, would mean that a particular component should not be
used in the model. This way, the ARIMA model can be constructed to perform the function of an
ARMA model, or even simple AR, I, or MA models.
1. Check Stationarity:
○ Stationarity means that the statistical properties (mean, variance) of the time
series do not change over time.
○ Check whether the data is stationary. (ensure stationarity by testing)
○ If not stationary, apply differencing until the series becomes stationary.
2. Identify Parameters (p, d, q):
○ Use Autocorrelation Function (ACF) and Partial Autocorrelation Function
(PACF) plots to identify the values for p and q.
○ The differencing order d is determined by the number of times differencing was
applied.
3. Fit the Model:
○ Use the chosen p, d, and q values to fit the ARIMA model.
4. Validate the Model:
○ Check residual errors for randomness (using residual plots and statistical tests).
○ If residuals are not random, refine the model parameters.
5. Forecast:
○ Use the fitted ARIMA model to predict future values.
ARIMA Equation: The general ARIMA(p, d, q) model combines the AR, I, and MA components.
After differencing the series d times to obtain a stationary series y'_t, a standard form is:
y'_t = c + φ_1 y'_(t-1) + … + φ_p y'_(t-p) + θ_1 ε_(t-1) + … + θ_q ε_(t-q) + ε_t
where the φ terms are the auto-regressive coefficients, the θ terms are the moving-average
coefficients, and ε_t is the white-noise error.
Applications of ARIMA
Let’s assume you want to predict the daily sales of a product for the next week. You have daily
sales data for the past year, which shows both trend and seasonal patterns.
1. Step 1: Data Collection You collect daily sales data from your e-commerce platform for
the past year.
2. Step 2: Make the Data Stationary Before applying ARIMA, you check whether the
data is stationary. If not, you apply differencing to remove trends and seasonality.
3. Step 3: Choose ARIMA Model Parameters (p, d, q) You choose the order of the AR
(p), differencing (d), and MA (q) parts using statistical techniques like the ACF
(Auto-Correlation Function) and PACF (Partial Auto-Correlation Function).
4. Step 4: Train the Model You fit the ARIMA model using historical data.
5. Step 5: Make Predictions Once the model is trained, you can use it to make predictions
for the next 7 days.
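A minimal sketch of these steps with the statsmodels ARIMA implementation (assuming statsmodels and pandas are installed; the daily sales series is simulated, and order=(1, 1, 1) is just an illustrative choice of p, d, q):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Step 1: collect (here, simulate) one year of daily sales with a trend
rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", periods=365, freq="D")
sales = pd.Series(100 + 0.1 * np.arange(365) + rng.normal(0, 5, 365), index=dates)

# Steps 2-4: let ARIMA difference the data (d=1) and fit the AR/MA terms
model = ARIMA(sales, order=(1, 1, 1))   # (p, d, q) chosen for illustration
fitted = model.fit()

# Step 5: forecast the next 7 days
print(fitted.forecast(steps=7))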
Forecast accuracy is crucial in evaluating the performance of predictive models. It ensures that
forecasts generated by models align closely with actual values. The measures of forecast
accuracy help quantify the error between predicted and observed values, enabling analysts to
choose or improve forecasting methods.
a) Positive Error: the actual value is higher than the forecast (underestimation).
b) Negative Error: the actual value is lower than the forecast (overestimation).
1) Mean Absolute Error (MAE): Measures the average of the absolute errors between
predicted and actual values.
● Definition: The average of the absolute differences between observed and predicted
values.
● Characteristics:
○ Simple to calculate and interpret.
○ Treats all errors equally, irrespective of their magnitude.
2) Mean Squared Error (MSE): Measures the average of the squared errors between predicted
and actual values, emphasizing larger errors.
● Definition: The average of the squared differences between observed and predicted
values.
● Characteristics:
○ Penalizes larger errors more heavily.
○ Sensitive to outliers.
3) Root Mean Squared Error (RMSE): The square root of MSE, giving an indication of the
model’s prediction error in the original units.
● Definition: The square root of the MSE, providing error in the same units as the data.
● Characteristics:
○ Combines the advantages of MSE but is interpretable in the original scale of the
data.
Use Case: In weather forecasting, RMSE is commonly used to measure the accuracy of
temperature predictions.
4) Mean Absolute Percentage Error (MAPE)
● Definition: Measures the average percentage error between observed and predicted values.
● Characteristics:
○ Expresses errors as percentages, making it scale-independent.
○ May give misleading results if actual values are close to zero.
5) Symmetric Mean Absolute Percentage Error (sMAPE)
● Definition: A variant of MAPE that divides each error by the average of the observed and predicted values.
● Characteristics:
○ Addresses the issue of zero or near-zero actual values.
○ Useful for more balanced percentage error calculations.
6) Mean Forecast Error (MFE)
● Definition: Measures the average error between observed and predicted values (signed).
● Characteristics:
○ Indicates bias in forecasts (negative MFE shows overestimation, positive MFE
shows underestimation).
7) Tracking Signal
● Definition: Monitors the consistency of forecast errors over time.
● Use:
○ Helps detect bias or systematic error in forecasts.
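A minimal sketch of these accuracy measures, assuming errors are computed as actual minus predicted (so a negative mean error signals overestimation, matching the MFE convention above); the helper name forecast_accuracy and the sample numbers are illustrative.

```python
import numpy as np

def forecast_accuracy(actual, predicted):
    """Compute common forecast-accuracy measures (illustrative helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = actual - predicted                      # positive = underestimation, negative = overestimation
    mae = np.mean(np.abs(error))                    # Mean Absolute Error
    mse = np.mean(error ** 2)                       # Mean Squared Error
    rmse = np.sqrt(mse)                             # Root Mean Squared Error
    mape = np.mean(np.abs(error / actual)) * 100    # MAPE (undefined if actual contains zeros)
    mfe = np.mean(error)                            # Mean Forecast Error (bias)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "MFE": mfe}

# Example: actual vs. forecast sales for one week.
actual = [120, 130, 125, 140, 150, 145, 160]
forecast = [118, 135, 120, 138, 155, 150, 158]
print(forecast_accuracy(actual, forecast))
```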
Practical Examples
Other Applications:
● Weather forecasting.
● Stock market predictions.
● Predictive maintenance in manufacturing.
STL Decomposition
STL (Seasonal and Trend decomposition using Loess) is a robust method used to decompose a time series into three components:
1. Seasonal Component: Represents the periodic patterns (e.g., weekly, monthly, or yearly cycles).
2. Trend Component: Represents the long-term movement in the data (e.g., increasing sales over years).
3. Residual (Remainder) Component: Represents the irregular or random variations in the data.
Example: A retail company can use STL to decompose monthly sales data. This allows them to separate seasonal effects (e.g., holiday sales boosts) from the underlying trend in sales growth.
Key Features of STL
1. Input Data:
○ The time series data is provided as input.
○ Example: Monthly sales data over the past 3 years.
2. Seasonal Extraction:
○ The seasonal component is extracted using smoothing techniques.
○ This component captures repeating patterns (e.g., higher sales in December).
3. Trend Extraction:
○ The trend component is obtained by removing the seasonal component and
applying smoothing to capture the long-term movement.
4. Residual Calculation:
○ After removing the seasonal and trend components, the remainder (residuals) is
calculated, representing noise or unexplained variation.
Mathematical Representation
For an additive decomposition, the series is expressed as Y_t = T_t + S_t + R_t, where Y_t is the observed value, T_t the trend, S_t the seasonal component, and R_t the remainder at time t.
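The sketch below decomposes a synthetic monthly series with statsmodels' STL; the generated data, the 3-year length, and period=12 (yearly seasonality in monthly data) are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Illustrative monthly sales over 3 years: trend + yearly seasonality + noise.
rng = np.random.default_rng(1)
months = pd.date_range("2021-01-01", periods=36, freq="MS")
values = 200 + 2 * np.arange(36) + 20 * np.sin(2 * np.pi * np.arange(36) / 12) + rng.normal(0, 5, 36)
sales = pd.Series(values, index=months)

# Decompose: period=12 because the seasonal pattern repeats every 12 months.
result = STL(sales, period=12).fit()

print(result.trend.head())      # long-term movement
print(result.seasonal.head())   # repeating monthly pattern
print(result.resid.head())      # remainder / noise
```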
Serialization refers to saving time-ordered data in a format that can be easily transmitted or
stored for later analysis. Common serialization formats include JSON and CSV. Serialization
ensures data integrity and allows time series data to be analyzed across different systems and
applications.
Example: In financial applications, serialization is used to store historical stock prices in JSON
format for real-time analytics and forecasting.
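A small sketch of serializing a time series to JSON and CSV with pandas and reading it back; the file name prices.csv and the sample closing prices are illustrative.

```python
import pandas as pd
from io import StringIO

# Illustrative daily closing prices for a stock.
prices = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "close": [101.2, 102.5, 101.9, 103.1, 104.0],
})

# Serialize: JSON (ISO-formatted dates) for transmission, CSV for storage.
json_text = prices.to_json(orient="records", date_format="iso")
prices.to_csv("prices.csv", index=False)

# Deserialize the JSON on another system and continue the analysis there.
restored = pd.read_json(StringIO(json_text), orient="records")
print(restored)
```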
Data Extraction: Select key features or variables from the dataset. In time series analysis, this
might involve identifying important dates, events, or anomalies.
Data Analysis: Apply models like ARIMA or machine learning algorithms (e.g., RNNs or
LSTMs) to analyze time series data. These models can learn patterns over time, allowing for
more accurate predictions.
Example: A company might extract sales data during promotional events, analyze trends during
these periods, and predict future sales for upcoming promotions using ARIMA.
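A brief sketch of the extraction step, assuming a hypothetical sales table with a promo flag column; the column names and values are illustrative.

```python
import pandas as pd

# Hypothetical daily sales with a flag marking promotional events.
df = pd.DataFrame({
    "date": pd.date_range("2024-06-01", periods=10, freq="D"),
    "sales": [200, 210, 500, 520, 215, 205, 480, 510, 220, 230],
    "promo": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})

# Data extraction: keep only the promotional days and compare average sales.
promo_sales = df[df["promo"] == 1]
print("Average promo-day sales:", promo_sales["sales"].mean())
print("Average regular-day sales:", df[df["promo"] == 0]["sales"].mean())
```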
Extract Features from Generated Model as Height, Average Energy etc., and
Analyze for Prediction
Feature extraction is a crucial step in machine learning and data analysis, where meaningful
information is derived from raw data or model outputs to improve prediction accuracy. In the
context of analyzing a generated model, features like Height, Average Energy, and other
derived metrics can provide insightful information for predictive analytics.
Understanding Features
Feature extraction is a fundamental process in data analysis and machine learning. It involves
identifying and deriving significant attributes from raw data or a model's output that can be
utilized for prediction. Features such as Height, Average Energy, and other derived metrics
serve as inputs for predictive models, helping to uncover patterns and trends.
1) Height
● Definition: The peak value or maximum value in the dataset or model output.
● Purpose: Height often signifies the intensity or magnitude of a phenomenon, such as the
highest sales in a month, peak temperature in a year, or the maximum value in a
waveform.
● Significance:
○ Highlights extreme conditions or events.
○ Useful in trend detection and anomaly identification.
● Examples:
○ Stock Market: Height can represent the highest stock price in a time frame.
○ Energy Usage: The highest energy consumption during a day.
○ Waveform Analysis: The peak amplitude in signal processing.
○ Weather Analysis: The highest temperature recorded in a season.
2) Average Energy
● Definition: The mean of the energy values across the data, representing the overall
intensity or activity over time.
● Purpose: Helps in understanding the typical level of activity or variation in the data.
● Significance:
○ Provides a general measure of the dataset's activity over time.
○ Useful for understanding trends and deviations.
● Calculation: Average Energy = (1/N) Σ E_i, the mean of the N energy values; for a raw signal x_1, ..., x_N it is typically computed as (1/N) Σ x_i^2, the mean of the squared values (a code sketch follows the examples below).
● Examples:
○ Audio Signal Processing: Average energy can reflect the loudness or intensity of
a sound.
○ IoT Sensors: Average power consumption of a device over a period.
○ Time Series Data: Average sales per week for retail forecasting.
○ Signal Processing: Average amplitude of a waveform.
○ IoT Applications: Average sensor readings over a day.
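A minimal sketch of extracting Height and Average Energy from a 1-D series of readings; the helper name extract_features and the sample values are illustrative, and Average Energy is computed here as the mean of the squared values per the calculation above.

```python
import numpy as np

def extract_features(signal):
    """Derive simple features from a 1-D sequence of readings."""
    x = np.asarray(signal, dtype=float)
    height = np.max(x)              # Height: peak value in the data
    avg_energy = np.mean(x ** 2)    # Average Energy: mean of the squared values
    return {"height": height, "average_energy": avg_energy}

# Example: hourly energy readings for one day (illustrative values).
readings = [0.8, 0.7, 0.9, 1.2, 2.5, 3.1, 2.8, 1.5]
print(extract_features(readings))
```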
Other Features
After extracting features, the next step is to analyze them for their predictive capabilities.
a) Statistical Analysis
● Correlation: Measure how strongly a feature is associated with the target variable.
● Example: Correlate energy peaks (Height) with outdoor temperature.
b) Feature Selection
● Use algorithms like Recursive Feature Elimination (RFE) to select the most relevant features (see the sketch after this list).
● Example: Choose Height and Average Energy as key predictors for future energy
consumption.
c) Model Training
e) Correlation Analysis
● Check the relationship between extracted features and the target variable.
● Example:
○ Height of sales peaks correlating with promotional campaigns.
○ Average energy consumption correlating with seasonal changes.
f) Feature Engineering
g) Model Building
h) Visualization
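The sketch below illustrates the correlation check and RFE-based feature selection on a hypothetical smart-meter feature table; all column names, values, and the choice of LinearRegression as the RFE estimator are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical feature table: one row per day of smart-meter data.
df = pd.DataFrame({
    "height":         [3.1, 2.8, 3.5, 4.0, 3.9, 4.2],   # daily peak usage (kW)
    "average_energy": [1.2, 1.1, 1.4, 1.6, 1.5, 1.7],   # daily average energy
    "outdoor_temp":   [30, 29, 33, 36, 35, 37],          # daily temperature
    "next_day_usage": [26, 25, 29, 33, 32, 34],          # target variable
})

# a) / e) Correlation of each feature with the target.
print(df.corr()["next_day_usage"])

# b) Recursive Feature Elimination to keep the two most relevant features.
X, y = df.drop(columns="next_day_usage"), df["next_day_usage"]
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
print(dict(zip(X.columns, rfe.ranking_)))   # rank 1 = selected feature
```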
Scenario:
A smart grid collects data on daily energy usage. The goal is to predict future consumption
patterns.
Feature Extraction:
● Extract features like peak energy (Height) and average daily usage (Average Energy) to predict future consumption and identify energy-saving opportunities.
Visualization:
● Plot energy usage trends to highlight peaks (Height) and averages over time.
Predictive Model:
● Use decision trees, Random Forest, or LASSO regression to rank feature importance.
Related feature-extraction techniques:
a) Fourier Transform
Application areas:
a) Healthcare
b) Finance
e) Medical Diagnosis
f) Financial Forecasting
g) Predictive Maintenance
Dataset: daily power-consumption readings collected over one year.
Step-by-Step Analysis
1. Feature Extraction:
○ Height: Identify the day with the highest power consumption (e.g., during
summer months with heavy air conditioning usage).
○ Average Energy: Calculate the daily average power consumption over the year.
2. Visualization:
○ Plot a time series of daily consumption, marking the highest points (Height).
○ Create a bar chart of average monthly energy usage.
3. Predictive Analysis:
○ Use features to predict periods of high energy usage.
○ Example Model: Linear regression to predict future daily consumption.
4. Insights:
○ Height might indicate days of peak activity (e.g., holidays or extreme weather).
○ Average Energy provides a baseline for typical consumption, helping identify
anomalies.
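An end-to-end sketch of this step-by-step analysis on synthetic daily consumption data; the generated values, the kwh column name, and the 30-day linear-regression forecast horizon are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative daily power consumption for one year (kWh): seasonal pattern plus noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=366, freq="D")
usage = 20 + 5 * np.sin(2 * np.pi * dates.dayofyear / 366) + rng.normal(0, 1, len(dates))
df = pd.DataFrame({"date": dates, "kwh": usage})

# 1. Feature extraction: Height (peak day) and Average Energy (daily mean consumption).
peak_day = df.loc[df["kwh"].idxmax()]
print("Peak consumption:", peak_day["date"].date(), round(peak_day["kwh"], 2))
print("Average daily consumption:", round(df["kwh"].mean(), 2))

# 2. Visualization input: average consumption per calendar month (for a bar chart).
monthly_avg = df.groupby(df["date"].dt.month)["kwh"].mean()
print(monthly_avg)

# 3. Predictive analysis: linear regression on the day index to project future usage.
X = np.arange(len(df)).reshape(-1, 1)
model = LinearRegression().fit(X, df["kwh"])
future_X = np.arange(len(df), len(df) + 30).reshape(-1, 1)
print("Forecast for the next day:", round(model.predict(future_X)[0], 2))
```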
5.1 Standard Operating Procedures
SOPs are clear, step-by-step instructions that guide how to collect, clean, analyze, document,
and share data. They help ensure that everyone on the team works in a consistent, accurate,
and repeatable way.
Standard Operating Procedures (SOPs) for documentation and knowledge sharing ensure
consistency, reproducibility, and team collaboration.
2. Code Documentation
o Project description
o Setup instructions
o Data schema
o Include meaningful commit messages (e.g., "added EDA for customer churn").
5. Final Deliverables
Knowledge-sharing practices:
2. Meeting Routines
o Weekly stand-ups or data team syncs for updates and challenges.
3. Documentation of Learnings
4. Internal Wiki
o Data definitions
o Common queries
6. Cross-Training
Problem Statement:
Start with a clear understanding of the problem you are trying to solve with data analysis. What
are the business drivers, stakeholders, and the potential impact of the solution?
Value Proposition:
Identify the specific benefits the data analysis will provide. How will it improve decision-making,
optimize processes, or generate new insights?
SMART Goals:
Define Specific, Measurable, Achievable, Relevant, and Time-bound goals for the analysis.
Data Requirements:
Identify the specific data sources, types of data, and data quality standards needed for the
analysis. This helps determine the feasibility and complexity of the project.
Deliverables:
Clearly outline the specific products or outputs of the analysis, such as reports, dashboards, or
models. This helps manage stakeholder expectations.
Resources:
Determine the resources required for the project, including tools, software, and personnel.
Boundaries:
Define the limits of the project, including what is included and excluded from the scope. This
helps prevent scope creep and ensures that the project stays focused.
Timeline:
Establish a realistic timeline for the project, including milestones and deadlines.
Benefits of defining purpose and scope:
Clarity and Alignment:
Ensures that everyone involved understands the project's goals and objectives.
Focus:
Helps teams concentrate on the most relevant tasks and avoid unnecessary work.
Improved Communication:
Facilitates clear and consistent communication among team members and stakeholders.
Measurable Success:
Provides defined goals and deliverables against which progress and results can be evaluated.
In data analytics, defining a purpose and scope document is one of the first and most important steps when starting a project.
✅ What It Is:
It keeps everyone on the same page, from data analysts to business stakeholders.
✍️ What to Include:
🎯 Purpose: What problem are we solving? Why does it matter? Example: “To analyze customer churn to help improve retention strategies.”
📊 Data Sources: Where the data is coming from. Example: “Customer database, CRM export”
🛠 Tools: What tools/software will be used. Example: “Python, Power BI, SQL Server”
Intellectual Property Rights (IPR) refer to the legal protections granted to individuals or businesses over their intangible assets, preventing unauthorized use or exploitation. These rights ensure that creators maintain control over their work, including inventions, designs, brand names, and creative or literary works such as software.
IPR provides a temporary monopoly over the use of the protected property, and violations can
lead to strict legal penalties.
🔐 Why Is IPR Important?
1. Boosts Business Growth – Protects unique ideas from competitors, helping especially
small businesses maintain market share and grow.
2. Supports Marketing – Builds brand identity and prevents copying, making it easier to
connect with customers.
4. Attracts Funding – IPR assets can be sold, licensed, or used as collateral to raise money.
5. Expands Global Reach – Enables businesses to enter new markets and form international
partnerships via protected brands or patents.
2. Trademark
Identifies and distinguishes goods or services using names, symbols, or logos (e.g., Apple, Audi). Registration is not mandatory, but it is required in order to claim exclusive ownership of the mark.
4. Patent
Grants exclusive rights to inventors over their inventions (not discoveries), such as new
devices or processes. A patent prevents others from making, using, or selling the
invention without permission.
5. Design
Protects the aesthetic or visual aspects of products (e.g., car shapes, kitchen tools). It
ensures exclusive rights over commercial production and sale based on the protected
design.
6. Plant Variety Protection
Provides rights to breeders for developing new plant varieties. It ensures protection for
genetically developed or selectively bred plant species under laws like the Plant Variety
Protection Act.
5.4 Copyright
Copyright refers to the legal protection of original works created during the data analytics
process. While raw data itself is not copyrightable, many outputs and tools used or produced
during data analysis can be protected.
1. Code and Scripts
o Custom Python, R, SQL, or other language scripts written for data cleaning, analysis, or visualization are protected as literary works.
2. Data Visualizations
4. Software Tools
Best Practices:
Always use data and code with proper licenses (e.g., open-source tools under MIT/GPL).