DA_Unit_1

The document provides an overview of data analytics, covering its definition, types, and lifecycle phases. It discusses structured, semi-structured, and unstructured data, along with their characteristics, tools, and applications across various sectors. Additionally, it emphasizes the importance of data analytics in enhancing decision-making and operational efficiency in businesses.

Data Analytics

4CS 1220
Module-I
Content
• Introduction to Data Analytics
• Sources and nature of data
• Classification of data (structured, semi-structured, unstructured)
• Characteristics of data
• Need of data analytics
• Applications of data analytics
• Phases of data analytics lifecycle – discovery, data preparation, model planning, model building, communicating results, operationalization
What is Data
• Data: Anything that is recorded is data. Observations and facts are data. Anecdotes and opinions are also data, of a different kind. Data can be numbers, like a record of daily weather or daily sales. Data can be alphanumeric, such as the names of employees and customers.
• Data can come from any number of sources and in many forms.
• Data can be an unordered collection of values.
• Data can be ordered categorical values, like small, medium, and large.
• Another type of data takes discrete numeric values in a defined range, with the assumption of equal distance between the values.
What is Data Analytics?
• Analytics: Analytics is the discovery, interpretation, and communication of meaningful patterns or summaries in data.
• Data analytics takes raw data and turns it into useful information. It uses various tools and methods to discover patterns and solve problems with data.
• Data analysis is defined as the process of cleaning, transforming, and modelling data to discover useful information for business decision-making.
• The purpose of data analysis is to extract useful information from data so that the organization can make better decisions and grow.
Types of Data Analytics
• Descriptive analytics: It tells you what has happened. It can be done using exploratory data analysis.
  Example: Studying the total units of chairs sold and the profit that was made in the past.
• Diagnostic analytics: It tells you why something happened. To answer a query or diagnose an issue, we rely on historical data, looking for patterns and dependencies related to the specific issue.
• Predictive analytics: It tells you what will happen. It can be achieved by building predictive models.
  Example: Predicting the total units of chairs that would sell and the profit we can expect in the future.
• Prescriptive analytics: It tells you how to make something happen. It can be done by deriving key insights and hidden patterns from the data.
  Example: Finding ways to improve the sales and profit of chairs.
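The descriptive-analytics example above (studying past chair sales) can be sketched with Python's standard library; the monthly figures below are invented for illustration:

```python
from statistics import mean

# Hypothetical monthly chair sales in units (illustrative data only)
monthly_sales = [120, 135, 110, 150, 160, 145]

total_units = sum(monthly_sales)      # what happened in total
average_units = mean(monthly_sales)   # a typical month
best_month = max(monthly_sales)       # peak demand

print(f"Total units sold: {total_units}")        # 820
print(f"Average per month: {average_units:.1f}") # 136.7
print(f"Best month: {best_month}")               # 160
```

Predictive and prescriptive analytics would build on such summaries with models rather than simple aggregates.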
Data Analytics Tools
• Python, R, Tableau, Power BI, QlikView, Apache Spark
Data Classification
• Structured data
• Semi-structured data
• Unstructured data
Data Classification: Structured Data
• Structured data is created using a fixed schema and is maintained in tabular format. The elements in structured data are addressable, which makes analysis effective. It includes all data that can be stored in a SQL database in tabular form. Today, much data is developed and processed this way because it is the simplest to manage.
• Examples of structured data include dates, names, addresses, and credit card numbers.
• Consider relational data as an example: a university maintains a record of students, including the name, ID, address, and email of each student. To store these records, it uses a relational schema and table like the following.
Example:

S_ID   S_Name   S_Address   S_Email
1001   A        Delhi       [email protected]
1002   B        Mumbai      [email protected]

Structured Data: Pros
• Easily used by machine learning (ML) algorithms: The specific and organized architecture of structured data eases the manipulation and querying of ML data.
• Easily used by business users: Structured data does not require an in-depth understanding of different data types and how they function. With a basic understanding of the topic, users can easily access and interpret the data.
• Accessible by more tools: Since structured data predates unstructured data, more tools are available for using and analyzing it.
Structured Data: Cons
• Limited usage: Data with a predefined structure can only be used for its intended purpose, which limits its flexibility and usability.
• Limited storage options: Structured data is usually stored in data storage systems with rigid schemas (for example, data warehouses). Changes in data requirements therefore necessitate an update of all structured data, which leads to a massive expenditure of time and resources.
Structured Data Tools
• OLAP: Performs high-speed, multidimensional data analysis on unified, centralized data stores.
• SQLite: Implements a self-contained, serverless, zero-configuration, transactional relational database engine.
• MySQL: Embeds data into mass-deployed software, particularly mission-critical, heavy-load production systems.
Use Cases for Structured Data
• Customer relationship management (CRM): CRM software runs structured data through analytical tools to create datasets that reveal customer behavior patterns and trends.
• Online booking: Hotel and ticket reservation data (for example, dates, prices, and destinations) fits the “rows and columns” format of a pre-defined data model.
• Accounting: Accounting firms and departments use structured data to process and record financial transactions.
Unstructured Data
• Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed with conventional data tools and methods. Since unstructured data does not have a predefined data model, it is best managed in non-relational (NoSQL) databases.
• Examples of unstructured data include text, mobile activity, social media posts, and Internet of Things (IoT) sensor data. Its benefits involve advantages in format, speed, and storage, while its liabilities revolve around expertise and available resources.
Unstructured Data: Pros
• Native format: Unstructured data, stored in its native format, remains undefined until needed. Its adaptability increases the file formats in the database, which widens the data pool and enables data scientists to prepare and analyze only the data they need.
• Fast accumulation rates: Since there is no need to predefine the data, it can be collected quickly and easily.
• Data lake storage: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases scalability.
Unstructured Data: Cons
• Requires expertise: Due to its undefined, non-formatted nature, data science expertise is required to prepare and analyze unstructured data. This is beneficial for data analysts but alienates unspecialized business users, who might not fully understand specialized data topics or how to utilize the data.
• Specialized tools: Specialized tools are required to manipulate unstructured data, which limits product choices for data managers.
Unstructured Data Tools
• MongoDB: Uses flexible documents to process data for cross-platform applications and services.
• DynamoDB: Delivers single-digit-millisecond performance at any scale through built-in security, in-memory caching, and backup and restore.
• Hadoop: Provides distributed processing of large data sets using simple programming models and no formatting requirements.
• Azure: Enables agile cloud computing for creating and managing apps through Microsoft’s data centers.
Use Cases for Unstructured Data
• Data mining: Enables businesses to use unstructured data to identify consumer behavior, product sentiment, and purchasing patterns to better accommodate their customer base.
• Predictive data analytics: Alerts businesses to important activity ahead of time so they can properly plan and adjust to significant market shifts.
• Chatbots: Perform text analysis to route customer questions to the appropriate answer sources.
Semi-Structured Data
• Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database, though this is very hard for some kinds of semi-structured data.
• Semi-structured data uses “metadata” (for example, tags and semantic markers) to identify specific data characteristics and scale data into records and preset fields. Metadata ultimately enables semi-structured data to be better cataloged, searched, and analyzed than unstructured data.
• Semi-structured data (for example, JSON, CSV, XML) is the “bridge” between structured and unstructured data.
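As a small illustration of this "bridge" role, the JSON below (fields and values invented) carries its own metadata as keys, so records can be searched without a fixed table schema, and individual records may differ in shape:

```python
import json

# Semi-structured records: tags name the fields, but records need not
# share an identical schema (the second student has an extra field)
raw = """
[
  {"s_id": 1001, "s_name": "A", "city": "Delhi"},
  {"s_id": 1002, "s_name": "B", "city": "Mumbai", "phone": "000"}
]
"""
students = json.loads(raw)

# The embedded metadata (keys) lets us filter without a rigid table layout
names = [s["s_name"] for s in students if s.get("city") == "Mumbai"]
print(names)  # ['B']
```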
Semi-Structured Data: Examples
• Example of metadata usage: An online article displays a headline, a snippet, a featured image, image alt-text, and a slug, which helps differentiate one piece of web content from similar pieces.
• Example of semi-structured data vs. structured data: A tab-delimited file containing customer data versus a database containing CRM tables.
• Example of semi-structured data vs. unstructured data: A tab-delimited file versus a list of comments from a customer’s Instagram.
Key Differences Between Structured and Unstructured Data
• Sources: Structured data is sourced from GPS sensors, online forms, network logs, web server logs, and OLTP systems, among others; unstructured data sources include email messages, word-processing documents, and PDF files, among others.
• Forms: Structured data consists of numbers and values, whereas unstructured data consists of sensor data, text files, and audio and video files, among others.
• Models: Structured data has a predefined data model and is formatted to a set data structure before being placed in data storage (schema-on-write), whereas unstructured data is stored in its native format and not processed until it is used (schema-on-read).
• Storage: Structured data is stored in tabular formats (for example, Excel sheets or SQL databases) that require less storage space. It can be stored in data warehouses, which makes it highly scalable. Unstructured data is stored as media files or in NoSQL databases, which require more space. It can be stored in data lakes, which makes it difficult to scale.
• Uses: Structured data is used in machine learning (ML) and drives its algorithms, whereas unstructured data is used in natural language processing (NLP) and text mining.
Characteristics of Data
• 1. Accuracy: Data should be precise and correct to ensure reliability in analysis and decision-making.
• 2. Completeness: Data should be whole and include all relevant details to avoid gaps or missing information.
• 3. Consistency: Data should remain uniform and coherent across different datasets and time periods.
• 4. Timeliness: Data must be up-to-date and relevant to the current situation or timeframe.
• 5. Relevance: The data collected should meet the specific requirements or objectives of its intended use.
• 6. Validity: The data should conform to specific rules or formats and should be logically sound.
• 7. Granularity: Refers to the level of detail in the data. High granularity means highly detailed data.
• 8. Volume: The size of the dataset can influence its usability. Large datasets require special tools and techniques to process.
• 9. Variety: Data can come in many forms, such as structured, unstructured, text, images, audio, or video.
• 10. Veracity: Refers to the quality and reliability of the data, addressing issues like bias, noise, or errors.
• 11. Accessibility: Data should be easy to retrieve, share, and use by authorized users.
• 12. Scalability: The data should be able to grow or shrink as per the needs of the application or system.
• 13. Interoperability: The ability of data to be used and integrated across different systems or platforms.
• 14. Actionability: Data should provide insights that can lead to meaningful actions or decisions.
• 15. Security: Data should be protected against unauthorized access or breaches to ensure confidentiality.
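Several of these characteristics (completeness, validity, timeliness) can be checked programmatically. A minimal sketch in Python; the field names, email rule, and staleness threshold are illustrative assumptions, not a standard:

```python
from datetime import date
import re

# Hypothetical record to validate (fields are invented for illustration)
record = {"s_id": 1001, "s_email": "a@example.com", "updated": date(2024, 1, 5)}

REQUIRED = {"s_id", "s_name", "s_email"}              # completeness rule
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # validity rule

issues = []
missing = REQUIRED - record.keys()
if missing:
    issues.append(f"incomplete: missing {sorted(missing)}")
if "s_email" in record and not EMAIL_RE.match(record["s_email"]):
    issues.append("invalid email format")
if "updated" in record and (date.today() - record["updated"]).days > 365:
    issues.append("stale: not updated in over a year")  # timeliness rule

print(issues)
```

In practice such rules live in data-quality tooling, but the pattern is the same: each characteristic becomes a testable predicate.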
Need of Data Analytics
• Data analytics is a critical component of a business's probability of success. Gathering, sorting, analyzing, and presenting information can significantly benefit society, particularly in fields such as healthcare and crime prevention.
• The uses of data analytics can be equally beneficial for small enterprises and startups looking for an edge over the business next door, albeit on a smaller scale.
• Data analysis also helps in the marketing and advertising of a business, making it better known so that more customers learn about it.
• The valuable information extracted from raw data can benefit the organization by examining present situations and predicting future outcomes.
• With data analytics, a business can improve by targeting the right audience and understanding customers' disposable income and spending habits, which helps the business set prices according to the interest and budget of customers.
Need in Business Analytics
• Gain greater insight into target markets
• Enhance decision-making capabilities
• Create targeted strategies and marketing campaigns
• Reduce operational inefficiencies and minimize risk
• Identify new product and service opportunities
Data Analytics Applications
Data analytics is used in almost every sector of business:
• Retail: Data analytics helps retailers understand their customers' needs and buying habits to predict trends, recommend new products, and boost their business. It optimizes the supply chain and retail operations at every step of the customer journey.
• Healthcare: Healthcare industries analyze patient data to provide lifesaving diagnoses and treatment options. Data analytics also helps in discovering new drug development methods.
• Manufacturing: Using data analytics, manufacturing sectors can discover new cost-saving opportunities and solve complex supply chain issues, labor constraints, and equipment breakdowns.
• Banking sector: Banking and financial institutions use analytics to identify probable loan defaulters and the customer churn rate. It also helps in detecting fraudulent transactions immediately.
• Logistics: Logistics companies use data analytics to develop new business models and optimize routes. This, in turn, ensures that deliveries arrive on time in a cost-efficient manner.
Phases of Data Analytics Lifecycle
Phase 1: Discovery
• Define the business problem or objective to be solved.
• Identify the scope of the data analytics project.
• Determine the relevant data sources (internal and external).
• Collaborate with stakeholders to clarify objectives and requirements.
• Assess available resources, including data, tools, and team expertise.
• The data science team learns the business domain and researches the issue to create context and gain understanding.
• Learn about the data sources that are needed and accessible to the project.
• The team comes up with an initial hypothesis, which can later be confirmed or rejected with evidence.
Phase 2: Data Preparation
• Collect data from various sources like databases, APIs, and spreadsheets.
• Clean the data by handling missing values, removing duplicates, and correcting inconsistencies.
• Transform and format data to make it suitable for analysis.
• Ensure data is standardized and normalized where necessary to maintain consistency.
• Conduct data profiling to understand data distributions, types, and patterns.
• Investigate the possibilities of pre-processing, analysing, and preparing data before analysis and modelling.
• An analytic sandbox is required: the team extracts, loads, and transforms data to bring it into the sandbox.
• Data preparation tasks can be repeated and are not performed in a predetermined sequence.
• Tools commonly used for this phase include Hadoop, Alpine Miner, and OpenRefine.
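The cleaning steps listed above (missing values, duplicates, inconsistencies) can be sketched in plain Python; real projects would typically use pandas or similar tooling, and the records below are invented:

```python
# Hypothetical raw records with a missing value, a duplicate, and an
# inconsistent city spelling -- typical targets of data preparation
raw = [
    {"id": 1, "city": "Delhi", "sales": 100},
    {"id": 2, "city": "delhi", "sales": None},  # missing value, bad casing
    {"id": 1, "city": "Delhi", "sales": 100},   # duplicate of record 1
]

seen, clean = set(), []
for rec in raw:
    if rec["id"] in seen:              # remove duplicates
        continue
    seen.add(rec["id"])
    rec = dict(rec)
    rec["city"] = rec["city"].title()  # correct inconsistencies
    if rec["sales"] is None:           # handle missing values
        rec["sales"] = 0               # simple imputation; domain-specific in practice
    clean.append(rec)

print(clean)  # two cleaned records
```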
Phase 3: Model Planning
• Select the variables (features) most relevant to solving the problem.
• Choose appropriate algorithms (e.g., regression, clustering, classification) based on the problem type.
• Develop a roadmap for the modeling phase, outlining how the data will be used.
• Create an initial hypothesis about how the model will behave with the chosen data.
• The team studies the data to discover connections between variables, then selects the most significant variables as well as the most effective models.
• In this phase, the data science team creates data sets that can be used for training, testing, and production goals.
• In the next phase, the team builds and implements models based on the work completed here.
• A tool commonly used for this phase is MATLAB.
Phase 4: Model Building
• Build models using the selected algorithms and the prepared dataset.
• Train models by feeding them the training dataset to allow them to learn patterns.
• Use a test dataset to validate model performance and avoid overfitting.
• Iterate and refine the model by adjusting parameters for better accuracy.
• The team creates datasets for training, testing, and production use.
• The team also evaluates whether its current tools are sufficient to run the models or whether a more robust environment is required.
• Free or open-source tools: R and PL/R, Octave, WEKA.
• Commercial tools: MATLAB.
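The train/test workflow above can be illustrated with a deliberately simple threshold "model" on synthetic data; this is a sketch of the evaluation pattern, not a realistic model:

```python
import random

# Synthetic labeled data: feature x, label 1 if x > 50, else 0
random.seed(0)
data = [(x, int(x > 50)) for x in random.sample(range(100), 60)]

# Hold out a test set so evaluation can detect overfitting
split = int(len(data) * 0.8)          # 80/20 train/test split
train, test = data[:split], data[split:]

# A deliberately simple "model": learn a decision threshold from training data
threshold = min(x for x, y in train if y == 1)

# Validate on data the model has never seen
correct = sum(int(x >= threshold) == y for x, y in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Refining the model here would mean adjusting how the threshold is learned and re-checking test accuracy, mirroring the iterate-and-refine step above.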
Phase 5: Communicating Results
• Visualize results through charts, graphs, dashboards, and other visuals to make insights understandable.
• Present key findings to stakeholders in a concise, actionable format.
• Summarize the insights derived from the data and explain how they align with business goals.
• Following the execution of the model, team members evaluate its outcomes to establish criteria for the success or failure of the model.
• The team considers how best to present findings and outcomes to team members and other stakeholders, taking into consideration caveats and assumptions.
• The team should determine the most important findings, quantify their value to the business, and create a narrative to present and summarize them for all stakeholders.
Phase 6: Operationalize
• Deploy the model into production environments, integrating it into business processes.
• Automate tasks like decision-making or predictions based on the model’s insights.
• Continuously monitor model performance to ensure it remains accurate and relevant as new data becomes available.
• The team delivers the benefits of the project to a wider audience. It sets up a pilot project that deploys the work in a controlled manner before expanding it to the entire enterprise of users.
• This approach allows the team to gain insight into the performance and constraints of the model in a production setting at a small scale, and to make necessary adjustments before full deployment.
• The team produces the final reports, presentations, and code.
• Open-source or free tools: WEKA, SQL, MADlib, and Octave.
Methods of data analytics
Cluster analysis
• The action of grouping a set of data elements so that those elements are more similar (in a particular sense) to each other than to those in other groups – hence the term “cluster.”
• Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.
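A minimal sketch of the clustering idea, using one-dimensional k-means with k = 2 on invented points (production work would use a library such as scikit-learn):

```python
# Minimal 1-D k-means sketch (k = 2); the points are made up so two
# groups are clearly visible
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centers = [points[0], points[-1]]  # naive initialization

for _ in range(10):                # fixed number of refinement passes
    clusters = [[], []]
    for p in points:               # assign each point to its nearest center
        clusters[abs(p - centers[0]) > abs(p - centers[1])].append(p)
    centers = [sum(c) / len(c) for c in clusters]  # recompute centers

print(sorted(centers))  # [1.5, 8.5]
```

Note there is no target variable anywhere: the groups emerge purely from similarity between points.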
Cohort analysis
• This type of data analysis method uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others that have similar characteristics.
• By using this data analysis methodology, it is possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.
• Cohort analysis can be very useful in marketing, as it allows you to understand the impact of your campaigns on specific groups of customers.
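A cohort analysis can be sketched as grouping users by signup month and comparing a behaviour metric across the cohorts; the user records below are invented:

```python
from collections import defaultdict

# Hypothetical user records: signup month and whether they purchased
# after a campaign (illustrative data only)
users = [
    {"signup": "2024-01", "purchased": True},
    {"signup": "2024-01", "purchased": False},
    {"signup": "2024-02", "purchased": True},
    {"signup": "2024-02", "purchased": True},
]

# Group users into cohorts by signup month
cohorts = defaultdict(list)
for u in users:
    cohorts[u["signup"]].append(u["purchased"])

# Compare purchase rates across cohorts -- e.g. did a campaign land
# better with newer signups?
rates = {month: sum(flags) / len(flags) for month, flags in cohorts.items()}
print(rates)  # {'2024-01': 0.5, '2024-02': 1.0}
```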
Regression analysis
• Regression analysis uses historical data to understand how a dependent variable's value is affected when one (linear regression) or more (multiple regression) independent variables change or stay the same.
• By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better business decisions in the future.
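Simple linear regression can be computed directly from the least-squares formulas; the (x, y) pairs below are fabricated so the true relationship is y = 2x + 1:

```python
# Least-squares simple linear regression from scratch
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # fabricated: exactly y = 2x + 1

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# slope = cov(x, y) / var(x); intercept passes through the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(slope, intercept)       # 2.0 1.0
print(slope * 6 + intercept)  # anticipate the outcome at x = 6 -> 13.0
```

The last line is the "anticipate possible outcomes" step: once the relationship is fitted from historical data, it can be extrapolated to new values of the independent variable.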
Factor analysis
• This entails taking a complex dataset with many variables and reducing them to a small number of factors. The goal is to discover hidden trends that would otherwise have been difficult to see.
Text analysis
• Text analysis, also known in the industry as text mining, is the process of taking large sets of textual data and arranging it in a way that makes it easier to manage.
• By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your business and use it to develop actionable insights that will propel you forward.
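A basic text-mining step is cleansing the text and counting terms; the customer comments and tiny stopword list below are invented for illustration:

```python
import re
from collections import Counter

# Tiny corpus of made-up customer comments
comments = [
    "Great chair, very comfortable!",
    "The chair arrived late.",
    "Comfortable and sturdy chair.",
]

STOPWORDS = {"the", "and", "very", "a"}  # minimal illustrative list

# Cleanse: lowercase, keep only words, drop stopwords
words = []
for text in comments:
    words += [w for w in re.findall(r"[a-z]+", text.lower())
              if w not in STOPWORDS]

print(Counter(words).most_common(2))  # [('chair', 3), ('comfortable', 2)]
```

Even this trivial frequency count surfaces what the comments are about, which is the point of arranging textual data into a manageable form.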
Data mining
• A method of analysis that is the umbrella term for engineering metrics and insights for additional value, direction, and context.
• By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, data patterns, and trends to generate advanced knowledge.
• When considering how to analyze data, adopting a data-mining mindset is essential to success; as such, it is an area worth exploring in greater detail.
