DA_unit1_notes
NOTES
Faculty – Ms Priyanka (Assistant Professor)
Introduction to Data Analytics
Different Sources of Data for Data Analysis
Data collection is the process of acquiring, extracting, and storing large volumes of data, which may be structured or unstructured, such as text, video, audio, XML files, records, or image files, and which is used in the later stages of data analysis. In big data analysis, data collection is the first step, performed before the data is analyzed for patterns or useful information. The data to be analyzed must be collected from valid sources.
The collected data is known as raw data. Raw data is not directly useful; once it is cleaned of impurities and prepared for further analysis, it becomes information, and the insight derived from that information is known as knowledge. Knowledge can take many forms, such as business knowledge, knowledge about sales of enterprise products, or knowledge about disease treatment. The main goal of data collection is to gather information-rich data.
Data collection starts with questions such as: what type of data is to be collected, and what is the source of collection? Most collected data falls into two categories: qualitative data, which is non-numerical (words, sentences) and mostly focuses on the behavior and actions of a group, and quantitative data, which is numerical and can be measured and analyzed using scientific tools and sampling methods.
The actual data is then further divided mainly into two types known as:
•Primary data
•Secondary data
1. Primary Data:
Primary data is raw, original data extracted directly from official sources. It is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise, it becomes a burden during data processing.
A few methods of collecting primary data:
a. Interview method:
The data collected in this process comes from interviewing the target audience: the person who conducts the interview is called the interviewer, and the person who answers is the interviewee. Basic business- or product-related questions are asked, the responses are recorded as notes, audio, or video, and this data is stored for processing. Interviews can be structured or unstructured, for example personal interviews or formal interviews conducted by telephone, face to face, or by email.
b. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are recorded as text, audio, or video. Surveys can be conducted both online and offline, for example through website forms and email, and the responses are then stored for analysis. Examples include online surveys and polls on social media.
c. Observation method:
The observation method is a data collection method in which the researcher keenly observes the behavior and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or another raw format. In this method, data is gathered by directly observing the participants rather than by posing questions to them. For example, a researcher may observe a group of customers and their behavior towards certain products. The data obtained is then sent for processing.
d. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
• CRD – Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
• RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks, and results are drawn using a technique known as analysis of variance (ANOVA); see the sketch after this list.
• LSD – Latin Square Design is an experimental design similar to CRD and RBD but arranged in rows and columns. It is an arrangement of an N×N square with an equal number of rows and columns, in which each letter occurs only once in each row. Hence, differences can be found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square arrangement.
• FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with possible values; by performing trials, the different combinations of factors are compared.
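To make the ANOVA step concrete, here is a minimal Python sketch using scipy.stats.f_oneway on hypothetical yields from three blocks. It runs a one-way ANOVA as a simplified stand-in for the full block analysis; the data values are invented for illustration.

```python
# Minimal one-way ANOVA sketch (hypothetical data)
from scipy import stats

# Hypothetical yields observed in three blocks / treatment groups
block_a = [20.1, 21.5, 19.8, 22.0]
block_b = [23.2, 24.0, 22.8, 23.5]
block_c = [19.0, 18.5, 20.2, 19.7]

f_stat, p_value = stats.f_oneway(block_a, block_b, block_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the group means differ more than chance alone would explain.
```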
2. Secondary data:
Secondary data is data that has already been collected and is reused for another valid purpose. This type of data is derived from previously recorded primary data and comes from two kinds of sources: internal sources and external sources.
Internal source:
These data can easily be found within the organization, for example market records, sales records, transactions, customer data, accounting resources, etc. The cost and time required to obtain data from internal sources are low.
External source:
Data that cannot be found within the organization and must be obtained from external third-party resources is external source data. The cost and time required are greater because such sources contain huge amounts of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
• Sensor data: With the advancement of IoT devices, the sensors in these devices collect data that can be used for sensor data analytics to track the performance and usage of products.
• Satellite data: Satellites collect images and data amounting to terabytes on a daily basis, which can be used for analysis.
• Web traffic: With fast and cheap internet access, data in many formats uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data about the keywords and queries that are searched most often.
Types of data:
1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database, and covers all data that can be stored in a SQL database in tables with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today, it is the most processed and simplest form of data to manage. Example: relational data.
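As an illustration, the following minimal Python sketch stores relational rows and columns in an in-memory SQLite table (a stand-in for MySQL, PostgreSQL, etc.) and queries them with SQL; the table and values are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a relational store (hypothetical data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (product, amount) VALUES (?, ?)",
    [("pen", 12.5), ("notebook", 40.0), ("pen", 15.0)],
)

# Rows and columns with a fixed schema make querying straightforward
for row in conn.execute("SELECT product, SUM(amount) FROM sales GROUP BY product"):
    print(row)
conn.close()
```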
2. Semi-structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data). Examples: XML and JSON data.
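A short sketch of what those "organizational properties" look like in practice: hypothetical JSON-like records with nested and missing fields are flattened into a table using pandas.json_normalize.

```python
import pandas as pd

# Hypothetical semi-structured records: fields vary from record to record
records = [
    {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
    {"id": 2, "name": "Ravi", "contact": {"email": "ravi@example.com", "phone": "12345"}},
    {"id": 3, "name": "Meena"},  # missing the nested block entirely
]

# Flatten the nested structure into table-like columns for analysis
df = pd.json_normalize(records)
print(df)
```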
3. Unstructured data –
Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. For unstructured data there are alternative platforms for storing and managing it, and it is increasingly common in organizations that want to analyze it. Examples: text documents, PDFs, audio, video, and media logs.
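Because unstructured text has no predefined rows and columns, analysis typically begins by imposing some structure on it. A minimal sketch, assuming a few hypothetical free-text comments, counts word frequencies as a first step:

```python
from collections import Counter
import re

# Hypothetical free-text feedback with no fixed schema
documents = [
    "Delivery was late but the support team resolved it quickly.",
    "Great product, fast delivery, will order again.",
    "The app keeps crashing after the latest update.",
]

# Tokenize the raw text and count word frequencies
words = []
for doc in documents:
    words.extend(re.findall(r"[a-z']+", doc.lower()))

print(Counter(words).most_common(5))
```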
Characteristics of data quality:
Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot have
any erroneous elements and must convey the correct message without being misleading. This
accuracy and precision have a component that relates to its intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision could
be off-target or more costly than necessary. For example, accuracy in healthcare might
be more important than in another industry (which is to say, inaccurate data in
healthcare could have more serious consequences) and, therefore, justifiably worth
higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this
characteristic. For example, on surveys, items such as gender, ethnicity, and nationality are
typically limited to a set of options and open answers are not permitted. Any answers other
than these would not be considered valid or legitimate based on the survey’s requirement.
This is the case for most data and must be carefully considered when determining its quality.
The people in each department in an organization understand what data is valid or not to
them, so the requirements must be leveraged when evaluating data quality.
Reliability and Consistency: Many systems in today’s environments use and/or collect the
same source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must
be a stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected
too soon or too late could misrepresent a situation and drive inaccurate decisions.
Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate data.
Gaps in data collection lead to a partial view of the overall picture being displayed. Without a
complete picture of how operations are running, uninformed actions will occur. It’s important
to understand the complete set of requirements that constitute a comprehensive set of data
to determine whether or not the requirements are being fulfilled.
Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.
Granularity and Uniqueness: The level of detail at which data is collected is important,
because confusion and inaccurate decisions can otherwise occur. Aggregated, summarized
and manipulated collections of data could offer a different meaning than the data implied at
a lower level. An appropriate level of granularity must be defined to provide sufficient
uniqueness and distinctive properties to become visible. This is a requirement for operations
to function effectively.
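The checks described above can be partly automated. A minimal pandas sketch, assuming a hypothetical table of survey responses, illustrates completeness (missing values), validity (allowed values and plausible ranges), and uniqueness (duplicate identifiers):

```python
import pandas as pd

# Hypothetical survey responses used to illustrate a few quality checks
df = pd.DataFrame({
    "respondent_id": [1, 2, 2, 4],
    "gender": ["Female", "Male", "Male", "Unknown??"],
    "age": [34, None, 29, 131],
})

# Completeness: how many values are missing per column?
print(df.isna().sum())

# Validity: does each answer fall inside the allowed set / plausible range?
allowed_genders = {"Female", "Male", "Other", "Prefer not to say"}
print(~df["gender"].isin(allowed_genders))   # True marks invalid entries
print(~df["age"].between(0, 120))            # implausible or missing ages

# Uniqueness: duplicated identifiers break granularity assumptions
print(df["respondent_id"].duplicated().sum())
```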
Examples of Big Data:
2. Social media: Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
3. Jet engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Characteristics of Big Data:
• Volume
• Variety
• Velocity
• Variability
(i) Volume – The name 'Big Data' itself relates to a size that is enormous. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. During earlier days, spreadsheets and databases were the only sources of data
considered by most of the applications. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications.
This variety of unstructured data poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency that the data can show at times, which hampers the ability to handle and manage the data effectively.
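To illustrate velocity, the sketch below simulates records arriving continuously from different sources and keeps a running aggregate as they flow in; the sources, values, and arrival delay are hypothetical.

```python
import random
import time
from collections import Counter

def event_stream(n_events=20):
    """Simulate a continuous flow of events arriving from different sources."""
    sources = ["web", "mobile", "sensor"]
    for _ in range(n_events):
        yield {"source": random.choice(sources), "value": random.random()}
        time.sleep(0.05)  # small delay to mimic the arrival rate

counts = Counter()
for event in event_stream():
    counts[event["source"]] += 1   # update the running aggregate as data arrives

print(counts)
```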
1. Better Decision-Making
Explanation: Businesses use data analytics to make informed decisions rather than relying on
intuition. It helps in risk assessment and predicting future trends.
Fraud Detection and Prevention
Explanation: Banks and financial institutions use analytics to detect anomalies and prevent fraud.
A fraud detection process is sketched below:
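A minimal sketch of one possible approach, using scikit-learn's IsolationForest on synthetic transaction amounts. The data and contamination rate are hypothetical, and real systems combine many more signals than a single amount column.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical transaction amounts: mostly ordinary values plus a few extreme ones
normal = rng.normal(loc=50, scale=15, size=(500, 1))
suspicious = np.array([[900.0], [1200.0], [-300.0]])
amounts = np.vstack([normal, suspicious])

# Unsupervised anomaly detection: -1 marks likely outliers, 1 marks normal points
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(amounts)
print("Flagged transactions:", amounts[labels == -1].ravel())
```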
5. Competitive Advantage
Characteristics:
Characteristics:
• Explosion of unstructured and semi-structured data (social media, IoT, web logs).
• Technologies like Hadoop and NoSQL databases enabled large-scale data storage.
Characteristics:
• Need for real-time insights led to technologies like Apache Spark and Kafka.
Characteristics:
The analytic process involves a series of steps to extract insights from raw data. Below is a
structured approach:
What Happens?
• Data is gathered from multiple sources (databases, APIs, IoT devices, social media,
etc.).
What Happens?
What Happens?
What Happens?
What Happens?
1. What is Reporting?
Purpose:
Examples:
• Financial statements.
2. What is Analysis?
Examples:
4. Analogy
• Reporting is like the speedometer and fuel gauge – it tells you what’s happening.
• Analysis is like a mechanic diagnosing why the car is not running efficiently and
predicting maintenance needs.
Modern Data Analytics Tools
Modern data analytics tools help organizations collect, process, analyze, and visualize data
efficiently. These tools can be categorized based on their primary functions:
Popular Tools:
Data Sources (APIs, Databases, IoT, Social Media) ➡ Data Collection Tools ➡ Centralized
Storage
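A minimal sketch of this flow, assuming a hypothetical REST endpoint (https://api.example.com/v1/orders) that returns a JSON list of records, which are then landed in a simple CSV "store":

```python
import csv
import requests

# Hypothetical REST endpoint; replace with a real data source
API_URL = "https://api.example.com/v1/orders"

response = requests.get(API_URL, params={"limit": 100}, timeout=10)
response.raise_for_status()
records = response.json()   # assumes the API returns a non-empty JSON list of objects

# Land the raw records in a simple centralized store (here, a CSV file)
with open("orders_raw.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
```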
Popular Tools:
• SQL Databases (MySQL, PostgreSQL, Microsoft SQL Server) – For structured data.
Popular Tools:
• Apache Spark – Big data processing and machine learning.
• Python (Pandas, NumPy, Scikit-learn) – Widely used for statistical and machine
learning analysis.
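For example, a minimal sketch with hypothetical advertising-spend data shows pandas for descriptive statistics and scikit-learn for a simple predictive model:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical advertising-spend vs. sales data
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales":    [25, 41, 62, 79, 101],
})

# Descriptive statistics with pandas
print(df.describe())

# A simple predictive model with scikit-learn
model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print("Predicted sales at spend=60:",
      model.predict(pd.DataFrame({"ad_spend": [60]}))[0])
```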
These tools present insights in an understandable way through dashboards and reports.
Popular Tools:
These tools apply advanced analytics, predictive modeling, and AI-driven decision-making.
Popular Tools:
Popular Platforms:
• Example: Supermarkets place related items together (Diapers & Baby Wipes).
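The supermarket example is the classic market-basket idea: items that frequently appear in the same basket are candidates for joint placement or promotion. A minimal sketch over hypothetical baskets counts item-pair co-occurrences:

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets
baskets = [
    {"diapers", "baby wipes", "milk"},
    {"diapers", "baby wipes"},
    {"bread", "milk"},
    {"diapers", "baby wipes", "bread"},
]

# Count how often each pair of items is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # frequently co-purchased pairs suggest shelf placement
```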
c) Customer Segmentation
d) Sentiment Analysis
• Uses NLP to analyze customer opinions from social media, reviews, and surveys.
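A minimal sketch of the idea, using a tiny hand-made keyword lexicon rather than a real NLP library or trained model (which production systems would use):

```python
# Tiny illustrative lexicon; real systems use NLP libraries or trained classifiers
POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "terrible", "crash"}

def sentiment_score(text: str) -> int:
    """Positive score -> positive sentiment, negative score -> negative sentiment."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Great product and fast delivery",
    "Terrible support, the app is slow and broken",
]
for r in reviews:
    print(sentiment_score(r), "->", r)
```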
• Uses machine learning to detect diseases based on medical records and imaging data.
a) Fraud Detection
b) Risk Management
c) Algorithmic Trading
a) Predictive Maintenance
b) Inventory Optimization
• Uses analytics to determine the fastest and most cost-effective delivery routes.
• Uses real-time data to optimize traffic, waste management, and energy consumption.
d) Disaster Management
• Uses data analytics to predict and mitigate the impact of natural disasters.
a) Personalized Recommendations
b) Dynamic Pricing
d) Demand Forecasting
a) Performance Analytics
a) Personalized Learning
b) Dropout Prediction
a) Threat Detection
d) Identity Verification
a) Business Analyst
b) Data Engineer
c) Data Scientist
e) Data Analyst
f) Domain Expert
Having the right combination of these roles ensures an effective and successful analytics
implementation.
Phase 1: Discovery
Objective:
Key Tasks:
Deliverables:
• Project charter.
Objective:
• Collect, clean, and preprocess raw data to make it suitable for analysis.
Key Tasks:
Objective:
Key Tasks:
Deliverables:
Objective:
Key Tasks:
Objective:
Key Tasks:
Deliverables:
Phase 6: Operationalization
Objective:
Key Tasks:
• Implement the model into an operational environment (e.g., cloud services, APIs).
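A minimal sketch of exposing a model as an API, assuming Flask and a placeholder scoring function standing in for the trained model:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def score(features: dict) -> float:
    """Placeholder scoring logic standing in for a trained model."""
    return 0.7 * features.get("ad_spend", 0) + 5.0

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)      # e.g. {"ad_spend": 60}
    return jsonify({"prediction": score(payload)})

if __name__ == "__main__":
    app.run(port=5000)   # in production this would run behind a proper WSGI server
```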
Deliverables:
4. Summary
The Data Analytics Lifecycle provides a structured framework to develop data-driven
solutions effectively. Each phase plays a critical role in ensuring the accuracy, relevance, and
scalability of analytics projects. Successful execution depends on:
By following this lifecycle, organizations can extract valuable insights from data and make
informed decisions.