UNIT I Notes BA
Data Science is a field that deals with extracting meaningful information and insights by
applying various algorithms, preprocessing steps, and scientific methods to structured and
unstructured data. The field is closely related to Artificial Intelligence and is currently one of
the most in-demand skills. Data science draws on mathematics, computation, statistics,
programming, and more to gain meaningful insights from the large amounts of data provided in
various formats.
What is Data Analytics?
Data Analytics is used to draw conclusions by processing raw data. It is helpful in many
businesses because it helps a company make decisions based on the conclusions drawn from the
data. Basically, data analytics helps convert a large number of figures in the form of
data into plain-language conclusions, which are in turn helpful in making in-depth
decisions.
If you’re considering a career as a data scientist, and are wondering, “What does a data scientist do?”,
here are the six main steps in the data science process:
1. Goal definition. The data scientist works with business stakeholders to define goals and
objectives for the analysis. These goals can be defined specifically, such as optimizing an
advertising campaign, or broadly, such as improving overall production efficiency.
2. Data collection. If systems are not already in place to collect and store source data, the data
scientist establishes a systematic process to do so.
3. Data integration & management. The data scientist applies best practices of data
integration to transform raw data into clean information that’s ready for analysis. The data
integration and management process involves data replication, ingestion and transformation
to combine different types of data into standardized formats which are then stored in a
repository such as a data lake or data warehouse.
4. Data investigation & exploration. In this step, the data scientist performs an initial
investigation of the data and exploratory data analysis. This investigation and exploration are
typically performed using a data analytics platform or business intelligence tool.
5. Model development. Based on the business objective and the data exploration, the data
scientist chooses one or more potential analytical models and algorithms and then builds
these models using languages such as SQL, R or Python and applying data science
techniques, such as AutoML, machine learning, statistical modeling, and artificial
intelligence. The models are then “trained” via iterative testing until they operate as required (see the sketch after this list).
6. Model deployment and presentation. Once the model or models have been selected and refined,
they are run using the available data to produce insights. These insights are then shared with all
stakeholders using sophisticated data visualization and dashboards. Based on feedback from
stakeholders, the data scientist makes any necessary adjustments in the model.
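To make the model development and training step concrete, here is a minimal, illustrative sketch (not taken from the text): it trains a simple scikit-learn classifier on synthetic data and evaluates it. The feature and target names (ad_spend, converted) are hypothetical assumptions.

```python
# Minimal sketch of the model development step: train and evaluate a simple
# classifier on synthetic, illustrative data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
ad_spend = rng.uniform(0, 100, size=500)           # hypothetical feature: campaign spend
noise = rng.normal(0, 10, size=500)
converted = (ad_spend + noise > 60).astype(int)    # hypothetical target: did the lead convert?

X = ad_spend.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, converted, random_state=0)

model = LogisticRegression().fit(X_train, y_train)           # "train" the model
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

In practice the data would come from the repository built in the data integration step, and the "iterative testing" described above corresponds to repeating this train-and-evaluate loop with different features, algorithms, and parameters.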
Given the pace of change and the volume of data at hand in today’s business world, data scientists
play a critical role in helping an organization achieve its goals. A modern data scientist is expected to
solve highly complex problems, and the ideal data scientist is able to do the following:
Help define objectives and interpret results based on business domain expertise
Manage and optimize the organization’s data infrastructure
Utilize relevant programming languages, statistical techniques and software tools
Have the curiosity to explore and spot trends and patterns in data
Communicate and collaborate effectively across an organization
The Venn diagram below, adapted from Stephan Kolassa, shows how a data science consultant (at
the heart of the diagram) must combine their skills in communication, statistics and programming
with a deep understanding of the business.
The primary steps in the data analytics process involve defining requirements, integrating and
managing the data, analyzing the data and sharing the insights.
1. Project Requirements & Data Collection. Determine which question(s) you seek to answer
and ensure that you have collected the source data you need.
2. Data Integration & Management: Transform raw data into clean, business ready
information. This step includes data replication and ingestion to combine different types of
data into standardized formats which are stored in a repository such as a data warehouse or
data lake and governed by a set of specific rules.
3. Data Analysis, Collaboration and Sharing. Explore your data and collaborate with others
to develop insights using data analytics software. Then share your findings across the
organization in the form of compelling interactive dashboards and reports. Some modern
tools offer self-service analytics, which enables any user to analyze data without writing code
and let you use natural language to explore data. These capabilities increase data literacy so
that more users can work with and get value from their data.
Data Science is the application of tools, processes, and techniques towards combining,
preparing and examining large datasets and then using programming, statistics, machine
learning and algorithms to design and build new data models.
Data analytics is the use of tools and processes to combine, prepare and analyze datasets to
identify patterns and develop actionable insights.
The main difference in data science vs data analytics is highlighted in bold in the first process
diagram: data science involves data models.
The goal of both data science and data analytics is often to identify patterns and develop actionable
insights. But data science can also seek to produce broad insights by asking questions, finding the
right questions to ask and identifying areas to study.
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine
the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics
draws on a variety of statistical techniques from modeling, machine learning, data mining, and
game theory that analyze current and historical facts to make predictions about future events.
Techniques that are used for predictive analytics are (a brief sketch follows the lists below):
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic Cornerstones of Predictive Analytics
Predictive modeling
Decision Analysis and optimization
Transaction profiling
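As a hedged illustration of the linear regression technique listed above, the following sketch fits a trend to two years of synthetic monthly sales and forecasts the next quarter. The figures and variable names are assumptions, not data from the text.

```python
# Sketch of predictive analytics via linear regression: fit a trend to 24
# months of synthetic sales and forecast the next three months.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(24).reshape(-1, 1)                  # months 0..23
sales = 200 + 5 * months.ravel() + np.random.default_rng(0).normal(0, 8, 24)

model = LinearRegression().fit(months, sales)          # learn the historical trend
future = np.arange(24, 27).reshape(-1, 1)              # next three months
print("forecast:", model.predict(future).round(1))
```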
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insights into how to approach
future events. It examines past performance by mining historical data to understand the causes
of success or failure in the past. Almost all management reporting, such as sales, marketing,
operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting
the behavior of a single customer, descriptive analytics identifies many different
relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic
reviews (a short sketch follows this list), such as:
Data Queries
Reports
Descriptive Statistics
Data dashboard
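The sketch below illustrates descriptive analytics in the sense used above: computing descriptive statistics and a simple grouped report from a small, invented sales table (column names are assumptions).

```python
# Minimal descriptive analytics sketch: summary statistics and a grouped report.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "South", "North", "South", "East"],
    "revenue": [1200, 950, 1340, 1010, 870],
    "units":   [30, 25, 33, 27, 22],
})

print(sales.describe())                          # mean, std, min, max, quartiles
print(sales.groupby("region")["revenue"].sum())  # a simple management-style report
```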
Prescriptive Analytics
Prescriptive Analytics automatically synthesize big data, mathematical science, business
rule, and machine learning to make a prediction and then suggests a decision option to take
advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting action
benefits from the predictions and showing the decision maker the implication of each
decision option. Prescriptive Analytics not only anticipates what will happen and when it will
happen, but also why it will happen. Further, Prescriptive Analytics can suggest decision
options on how to take advantage of a future opportunity or mitigate a future risk and
illustrate the implication of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors such
as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data over other data to answer a question or
solve a problem. We try to find dependencies and patterns in the historical data of the
particular problem.
For example, companies go for this analysis because it gives great insight into a problem,
and it lets them keep detailed information at their disposal; otherwise, data might have to be
collected separately for every problem, which would be very time-consuming. Common
techniques used for Diagnostic Analytics are listed below, followed by a short sketch:
Data discovery
Data mining
Correlations
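As a brief, hypothetical sketch of diagnostic analytics, the code below checks which factors in an invented historical dataset correlate most strongly with sales; real diagnostic work would use the organization's own data.

```python
# Diagnostic analytics sketch: look for correlations that point toward the
# likely driver of a problem (here, a drop in sales).
import pandas as pd

history = pd.DataFrame({
    "sales":    [100, 120, 90, 80, 130, 70],
    "discount": [5, 10, 4, 2, 12, 1],
    "returns":  [8, 6, 9, 11, 5, 12],
})

# Pairwise correlations with sales, sorted from most negative to most positive.
print(history.corr(numeric_only=True)["sales"].sort_values())
```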
The Role of Data Analytics
Data analytics plays a pivotal role in enhancing operations, efficiency, and performance
across various industries by uncovering valuable patterns and insights. Implementing data
analytics techniques can provide companies with a competitive advantage. The process
typically involves four fundamental steps:
Data Mining : This step involves gathering data and information from diverse sources
and transforming them into a standardized format for subsequent analysis. Data mining
can be a time-intensive process compared to other steps but is crucial for obtaining a
comprehensive dataset.
Data Management : Once collected, data needs to be stored, managed, and made
accessible. Creating a database is essential for managing the vast amounts of
information collected during the mining process. SQL (Structured Query Language)
remains a widely used tool for database management, facilitating efficient querying and
analysis of relational databases.
Statistical Analysis : In this step, the gathered data is subjected to statistical analysis to
identify trends and patterns. Statistical modeling is used to interpret the data and make
predictions about future trends. Open-source programming languages like Python, as
well as specialized tools like R, are commonly used for statistical analysis and graphical
modeling (a brief sketch combining SQL storage with Python statistics follows this list).
Data Presentation : The insights derived from data analytics need to be effectively
communicated to stakeholders. This final step involves formatting the results in a
manner that is accessible and understandable to various stakeholders, including
decision-makers, analysts, and shareholders. Clear and concise data presentation is
essential for driving informed decision-making and business growth.
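The following sketch ties the data management and statistical analysis steps together: records are stored in an in-memory SQLite database, queried with SQL, and summarized with Python's statistics module. The table and column names are illustrative assumptions.

```python
# Illustrative sketch: store records with SQL (SQLite), query them, then
# compute summary statistics in Python.
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North", 120.0), ("North", 95.5), ("South", 80.0), ("South", 132.5)],
)

amounts = [row[0] for row in conn.execute("SELECT amount FROM orders")]
print("mean:", statistics.mean(amounts), "stdev:", round(statistics.stdev(amounts), 2))

# SQL can also do simple aggregation directly in the database.
for region, total in conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)
```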
Steps in Data Analysis
Define Data Requirements : This involves determining how the data will be grouped
or categorized. Data can be segmented based on various factors such as age,
demographic, income, or gender, and can consist of numerical values or categorical
data.
Data Collection : Data is gathered from different sources, including computers, online
platforms, cameras, environmental sensors, or through human personnel.
Data Organization : Once collected, the data needs to be organized in a structured
format to facilitate analysis. This could involve using spreadsheets or specialized
software designed for managing and analyzing statistical data.
Data Cleaning : Before analysis, the data undergoes a cleaning process to ensure
accuracy and reliability. This involves identifying and removing any duplicate or
erroneous entries, as well as addressing any missing or incomplete data. Cleaning the
data helps to mitigate potential biases and errors that could affect the analysis results
(see the sketch following this list).
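Here is a minimal data cleaning sketch in Python with pandas, covering the duplicate removal and missing-value handling described above; the DataFrame contents are invented for illustration.

```python
# Data cleaning sketch: drop duplicate rows and fill missing values before analysis.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, 29, 29, None, 41],
    "income":      [52000, 61000, 61000, 48000, None],
})

clean = (
    raw.drop_duplicates()                 # remove duplicate entries
       .assign(age=lambda d: d["age"].fillna(d["age"].median()),
               income=lambda d: d["income"].fillna(d["income"].median()))
)
print(clean)
```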
Usage of Data Analytics
There are some key domains and strategic planning techniques in which Data Analytics has
played a vital role:
Improved Decision-Making – If we have supporting data in favour of a decision, we can
implement it with a higher probability of success. For example, if a certain decision or plan
has led to better outcomes, there will be no doubt about implementing it again.
Better Customer Service – Churn modeling is the best example of this, in which we try
to predict or identify what leads to customer churn and change those things accordingly,
so that customer attrition is as low as possible, which is a very important factor in any
organization.
Efficient Operations – Data analytics can help us understand the demands of the
situation and what should be done to get better results, so that we can streamline our
processes, which in turn leads to efficient operations.
Effective Marketing – Market segmentation techniques are implemented to find the
marketing techniques that will help us increase our sales and lead to effective marketing
strategies.
Future Scope of Data Analytics
Retail : To study sales patterns, consumer behavior, and inventory management, data
analytics can be applied in the retail sector. Data analytics can be used by retailers to
make data-driven decisions regarding what products to stock, how to price them, and
how to best organize their stores.
Healthcare : Data analytics can be used to evaluate patient data, spot trends in patient
health, and create individualized treatment regimens. Data analytics can be used by
healthcare companies to enhance patient outcomes and lower healthcare expenditures.
Finance : In the field of finance, data analytics can be used to evaluate investment data,
spot trends in the financial markets, and make wise investment decisions. Data analytics
can be used by financial institutions to lower risk and boost the performance of
investment portfolios.
Marketing : By analyzing customer data, spotting trends in consumer behavior, and
creating customized marketing strategies, data analytics can be used in marketing. Data
analytics can be used by marketers to boost the efficiency of their campaigns and their
overall impact.
Manufacturing : Data analytics can be used to examine production data, spot trends in
production methods, and boost production efficiency in the manufacturing sector. Data
analytics can be used by manufacturers to cut costs and enhance product quality.
Transportation : The transportation sector can employ data analytics to evaluate logistics
data, spot trends, and improve transportation routes. Data
analytics can help transportation businesses cut expenses and speed up delivery times.
Conclusion
Data analytics acts as a tool that both organizations and individuals can use to harness the
power of data. As we progress in this data-driven age, data analytics will continue
to play a pivotal role in shaping industries and influencing the future.
To make things easy for you, we have explained the four key elements to help you write your
business problem statement. They include:
1. Define the problem
Defining the problem is the primary aspect of a business problem statement. Summarize your
problem in simple, layman's terms. It is highly recommended to avoid industry jargon
and buzzwords.
Support your summary with insights from both internal and external reports to add credibility
and context. Keep the summary to 3-5 sentences, and avoid writing more than that.
For example: “The manual auditing process is causing delays and errors in our finance
department, leading to increased workload and missed deadlines.”
2. Provide the problem analysis
Here, explain the background of the problem. Add relevant statistics and results from
surveys, industry trends, customer demographics, staffing reports, etc, to help the reader
understand the current situation. These references should describe your problem and its
effects on various attributes of your business.
Avoid adding too many stats in your problem statement, and include only the necessary ones.
It’s best to include no more than three significant stats.
3. Propose a solution
Your business problem statement should conclude with a solution to the previously described
problem. The solution should describe how the current state can be improved.
The solution must not exceed two sentences. Also, avoid including elaborate actions and
steps in a problem statement, because it will lead to the solution looking messy. These can be
further explained when you write a project plan.
When you start writing your business problem statement, or any formal document, it is
important to be aware of the reader. Write your problem statement considering the reader’s
knowledge about the situation, requirements, and expectations.
While your gut feeling can be helpful, focusing on facts and research will lead to better
solutions. If the readers are unfamiliar with the problem’s context, ensure you introduce it
thoroughly before presenting your proposed solutions.
What: What is the problem that needs to be solved? Include the root cause of the problem.
Mention other micro problems that are connected with the macro ones.
Why: Why is it a problem? Describe the reasons why it is a problem. Include supporting facts
and statistics to highlight the trouble.
Where: Where is the problem observed? Mention the location and the specifics of it. Include
the products or services in which the problem is seen.
Who: Who is impacted by this problem? Define and mention the target audience, staff,
departments, and businesses affected by the problem.
When: When was the problem first observed? Talk about the timeline. Explain how the
intensity of the problem has changed from the time it was first observed.
How: Describe how the problem is observed. Include signs or symptoms of the problem and
discuss the observations you made during your analysis.
How much: How often is the problem observed? If you have identified a trend during your
research, mention it. Comment on the error rate and the frequency and magnitude of the
problem.
The problem: The problem statement begins with mentioning and explaining the current
state.
Who it affects: Mention the people who are affected by the problem.
How it impacts: Explain the impacts of the problem.
The solution: Your problem statement ends with a proposed solution.
One technique that is extremely useful to gain a better understanding of the problems before
determining a solution is problem analysis.
Problem analysis is the process of understanding real-world problems and users' needs
and proposing solutions to meet those needs. The goal of problem analysis is to gain a
better understanding of the problem being solved before developing a solution.
There are five useful steps that can be taken to gain a better understanding of the problem
before developing a solution.
The problem of having to manually maintain an accurate single source of truth for finance
product data across the business affects the finance department. The impact of this is
duplicate data, workarounds, and difficulty maintaining finance product data across the
business and key channels. A successful solution would have the benefit of providing a single
source of truth for finance product data that can be used across the business and channels,
providing an audit trail of changes and stewardship, and maintaining data standards and best
practices.
Root cause analysis helps prevent the development of solutions that are focused on
symptoms alone.
To help identify the root cause, or the problem behind the problem, ask the people directly
involved.
The Ishikawa diagram (also called a fishbone diagram) is a good visual way of showing
potential root causes and sub-root-causes.
Another popular technique to complement and understand the problem behind a problem is
the 5 Whys method, which is part of the Toyota Production System and a technique that
became an integral part of the Lean philosophy.
The primary goal of the technique is to determine the root cause of a defect or problem
by repeating the question “Why?”. Each answer forms the basis of the next question.
The “five” in the name derives from an anecdotal observation on the number of
iterations needed to resolve the problem.
The problem statement format can be used in businesses and across industries.
Data collection is the process of collecting and evaluating information or data from multiple
sources to find answers to research problems, answer questions, evaluate outcomes, and
forecast trends and probabilities. It is an essential phase in all types of research, analysis, and
decision-making, including that done in the social sciences, business, and healthcare.
During data collection, researchers must identify the data types, the sources of data, and the
methods being used. We will soon see that there are many different data collection methods.
Data collection is heavily relied on in research, commercial, and government fields.
Before an analyst begins collecting data, they must answer three questions first:
What methods and procedures will be used to collect, store, and process the information?
Additionally, we can divide data into qualitative and quantitative types. Qualitative data
covers descriptions such as color, size, quality, and appearance. Unsurprisingly, quantitative
data deals with numbers, such as statistics, poll numbers, percentages, etc.
Before a judge makes a ruling in a court case or a general creates a plan of attack, they must
have as many relevant facts as possible. The best courses of action come from informed
decisions, and information and data are synonymous.
The concept of data collection isn’t new, as we’ll see later, but the world has changed. There
is far more data available today, and it exists in forms that were unheard of a century ago.
The data collection process has had to change and grow, keeping pace with technology.
Whether you’re in academia, trying to conduct research, or part of the commercial sector,
thinking of how to promote a new product, you need data collection to help you make better
choices.
Now that you know what data collection is and why we need it, let's look at the different
methods of data collection. Data collection could mean a telephone survey, a mail-in
comment card, or even some guy with a clipboard asking passersby some questions. But let’s
see if we can sort the different data collection methods into a semblance of organized
categories.
Primary and secondary methods of data collection are two approaches used to gather
information for research or analysis purposes. Let's explore each data collection method in
detail:
The first technique of data collection is Primary data collection, which involves the
collection of original data directly from the source or through direct interaction with the
respondents. This method allows researchers to obtain firsthand information tailored to their
research objectives. There are various techniques for primary data collection, including:
b. Interviews: Interviews involve direct interaction between the researcher and the
respondent. They can be conducted in person, over the phone, or through video conferencing.
Interviews can be structured (with predefined questions), semi-structured (allowing
flexibility), or unstructured (more conversational).
c. Observations: Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior, interactions, or
phenomena without direct intervention.
e. Focus Groups: Focus groups bring together a small group of individuals who discuss
specific topics in a moderated setting. This method helps in understanding the opinions,
perceptions, and experiences shared by the participants.
2. Secondary Data Collection
The next technique of data collection is Secondary data collection, which involves using
existing data collected by someone else for a purpose different from the original intent.
Researchers analyze and interpret this data to extract relevant information. Secondary data
can be obtained from various sources, including:
b. Online Databases: Numerous online databases provide access to a wide range of secondary
data, such as research articles, statistical information, economic data, and social surveys.
e. Past Research Studies: Previous research studies and their findings can serve as valuable
secondary data sources. Researchers can review and analyze the data to gain insights or build
upon existing knowledge.
Now that we’ve explained the various techniques let’s narrow our focus even further by
looking at some specific tools. For example, we mentioned interviews as a technique, but we
can further break that down into different interview types (or “tools”).
Word Association
The researcher gives the respondent a set of words and asks them what comes to mind when
they hear each word.
Sentence Completion
Researchers use sentence completion to understand the respondent's ideas. This tool involves
giving an incomplete sentence and seeing how the interviewee finishes it.
Role-Playing
Respondents are presented with an imaginary situation and asked how they would act or react
if it were real.
In-Person Surveys
Online/Web Surveys
These surveys are easy to accomplish, but some users may be unwilling to answer truthfully,
if at all.
Mobile Surveys
These surveys take advantage of the increasing proliferation of mobile technology. Mobile
collection surveys rely on mobile devices like tablets or smartphones to conduct surveys via
SMS or mobile apps.
Phone Surveys
No researcher can call thousands of people at once, so they need a third party to handle the
chore. However, many people have call screening and won’t answer.
Observation
Sometimes, the simplest method is the best. Researchers who make direct observations
collect data quickly and easily, with little intrusion or third-party bias. Naturally, this method
is only effective in small-scale situations.
Accurate data collecting is crucial to preserving the integrity of research, regardless of the
subject of study or preferred method for defining data (quantitative, qualitative). Errors are
less likely to occur when the right data gathering tools are used (whether they are brand-new
ones, updated versions of them, or already available).
The effects of incorrectly performed data collection can be serious.
When these study findings are used to support recommendations for public policy, there is
the potential to result in disproportionate harm, even if the degree of influence from flawed
data collecting may vary by discipline and the type of investigation.
Let us now look at the various issues that we might face while maintaining the integrity of
data collection.
Maintaining data integrity is the main justification for detecting errors in the data gathering
process, whether they were made purposefully (deliberate falsifications) or not (systematic or
random errors).
Quality assurance and quality control are two strategies that help protect data integrity and
guarantee the scientific validity of study results. Each strategy is used at various stages of the
research timeline:
Quality assurance - tasks that are performed before data collection begins
Quality control - tasks that are performed both during and after data collection
Quality Assurance
Because quality assurance precedes data collection, its primary goal is "prevention" (i.e.,
forestalling problems with data collection). Prevention is the best way to protect the accuracy
of data collection. The uniformity of protocol created in a thorough and exhaustive procedures
manual for data collection serves as the best example of this proactive step.
The likelihood of failing to spot issues and mistakes early in the research attempt increases
when guides are written poorly. There are several ways to show these shortcomings:
Failure to determine the precise subjects and methods for retraining or training staff
employees in data collecting
A partial list of the items to be collected
There isn't a system in place to track modifications to processes that may occur as the
investigation continues.
Uncertainty regarding the date, procedure, and identity of the person or people in charge
of examining the data
Incomprehensible guidelines for using, adjusting, and calibrating the data collection
equipment.
Quality Control
Despite the fact that quality control actions (detection/monitoring and intervention) take place
both after and during data collection, the specifics should be meticulously detailed in the
procedures manual. Establishing monitoring systems requires a specific communication
structure, which is a prerequisite. Following the discovery of data collection problems, there
should be no ambiguity regarding the information flow between the primary investigators and
staff personnel. A poorly designed communication system promotes slack oversight and
reduces opportunities for error detection.
Direct staff observation during site visits, conference calls, and frequent or routine assessments
of data reports to spot discrepancies, excessive numbers, or invalid codes can all be used as
forms of detection or monitoring. Site visits might not be appropriate for all disciplines. Still,
without routine auditing of records, whether qualitative or quantitative, it will be challenging
for investigators to confirm that data gathering is taking place in accordance with the
manual's defined methods. Additionally, quality control determines the appropriate solutions,
or "actions," to fix flawed data gathering procedures and reduce recurrences.
Problems with data collection, for instance, that call for immediate action include:
Fraud or misbehavior
In the social and behavioral sciences, where primary data collection entails the use of human
subjects, researchers are trained to include one or more secondary measures that can be used to
verify the quality of the information being obtained from the human subject.
For instance, a researcher conducting a survey would be interested in learning more about the
prevalence of risky behaviors among young adults as well as the social factors that influence
the propensity for and frequency of these risky behaviors. Let us now explore the common
challenges with regard to data collection.
Once you’ve gathered your data through various methods of data collection, here is what
happens next:
At this stage, you’ll use various methods to explore your data more thoroughly. This can
involve statistical methods to uncover patterns or qualitative techniques to understand the
broader context. The goal is to turn raw data into actionable insights that can guide decisions
and strategies moving forward.
After analyzing the data collected through methods of data collection in research, the next
step is to interpret and present your findings. The format and detail depend on your audience:
researchers might require academic papers, M&E teams need comprehensive reports, and
field teams often rely on real-time feedback. What’s key here is ensuring that the data is
communicated clearly, allowing everyone to make informed decisions.
Once your data has been analyzed, proper storage is essential. Cloud storage is a reliable
option, offering both security and accessibility. Regular backups are also important, as is
limiting access to ensure that only the right people are handling sensitive information. This
helps maintain the integrity and safety of your data throughout the project.
The main threat to the broad and successful application of machine learning is poor data
quality. Data quality must be your top priority if you want to make technologies like machine
learning work for you. Let's talk about some of the most prevalent data quality problems and
how to fix them.
Inconsistent Data
When working with various data sources, it's conceivable that the same information will have
discrepancies between sources. The differences could be in formats, units, or occasionally
spellings. The introduction of inconsistent data might also occur during firm mergers or
relocations. Inconsistencies in data tend to accumulate and reduce the value of data if they are
not continually resolved. Organizations that focus heavily on data consistency do so because
they only want reliable data to support their analytics.
Data Downtime
Data is the driving force behind the decisions and operations of data-driven businesses.
However, there may be brief periods when their data is unreliable or not prepared. Customer
complaints and subpar analytical outcomes are only two ways this data unavailability can
significantly impact businesses. A data engineer spends a significant amount of their time
updating, maintaining, and guaranteeing the integrity of the data pipeline. To ask the next
business question, there is a high marginal cost due to the lengthy operational lead time from
data capture to insight.
Schema modifications and migration problems are just two examples of the causes of data
downtime. Due to their size and complexity, data pipelines can be difficult to manage. Data
downtime must be continuously monitored and reduced through automation.
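One possible (assumed, simplified) way to automate data downtime monitoring is a freshness check that raises an alert when the latest load of a dataset is older than an agreed threshold, as sketched below. The threshold and timestamps are illustrative assumptions.

```python
# Hedged sketch of automated data-downtime monitoring: flag a dataset whose
# latest load is older than an agreed freshness threshold.
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(hours=6)   # assumed service-level agreement

def is_stale(last_loaded_at, now=None):
    """Return True when the pipeline has not delivered fresh data recently enough."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > FRESHNESS_THRESHOLD

# Illustrative check: pretend the last successful load was nine hours ago.
last_load = datetime.now(timezone.utc) - timedelta(hours=9)
if is_stale(last_load):
    print("ALERT: possible data downtime - last load at", last_load.isoformat())
```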
Ambiguous Data
Even with thorough oversight, some errors can still occur in massive databases or data lakes.
The issue becomes more overwhelming when data streams at a fast speed. Spelling mistakes
can go unnoticed, formatting difficulties can occur, and column heads might be deceptive.
This unclear data might cause several problems for reporting and analytics.
Duplicate Data
Streaming data, local databases, and cloud data lakes are just a few of the data sources that
modern enterprises must contend with. They might also have application and system silos.
These sources are likely to duplicate and overlap each other quite a bit. For instance,
duplicate contact information has a substantial impact on customer experience. Marketing
campaigns suffer if certain prospects are ignored while others are engaged repeatedly. The
likelihood of biased analytical outcomes increases when duplicate data are present. It can also
result in ML models with biased training data.
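A common first defence against duplicate contact data is to normalise key fields before de-duplicating, as in this hypothetical pandas sketch (the names, emails, and column names are invented).

```python
# Sketch of resolving duplicate contact records by normalising a key field
# (email) before de-duplicating.
import pandas as pd

contacts = pd.DataFrame({
    "name":  ["Asha Rao", "asha rao", "Vikram Shah"],
    "email": ["Asha.Rao@example.com ", "asha.rao@example.com", "vikram@example.com"],
})

contacts["email_norm"] = contacts["email"].str.strip().str.lower()
deduped = contacts.drop_duplicates(subset="email_norm", keep="first")
print(deduped[["name", "email"]])
```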
Abundance of Data
While we emphasize data-driven analytics and its advantages, a data quality problem with
excessive data exists. There is a risk of getting lost in abundant data when searching for
information pertinent to your analytical efforts. Data scientists, data analysts, and business
users devote 80% of their work to finding and organizing the appropriate data. With
increased data volume, other problems with data quality become more serious, mainly when
dealing with streaming data and significant files or databases.
Inaccurate Data
Data accuracy is crucial for highly regulated businesses like healthcare. Given the current
experience, it is more important than ever to increase the data quality for COVID-19 and later
pandemics. Inaccurate information does not provide a true picture of the situation and cannot
be used to plan the best course of action. Personalized customer experiences and marketing
strategies underperform if your customer data is inaccurate.
Data inaccuracies can be attributed to several things, including data degradation, human
mistakes, and data drift. Worldwide data decay occurs at a rate of about 3% per month, which
is quite concerning. Data integrity can be compromised while transferring between different
systems, and data quality might deteriorate with time.
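A quick compounding check of the 3%-per-month figure shows why it is concerning: over a year it amounts to roughly 30% of records going stale.

```python
# Rough arithmetic: ~3% decay per month compounds to ~30% per year.
monthly_decay = 0.03
annual_decay = 1 - (1 - monthly_decay) ** 12
print(f"annual decay: {annual_decay:.1%}")   # about 30.6%
```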
Hidden Data
The majority of businesses only utilize a portion of their data, with the remainder sometimes
being lost in data silos or discarded in data graveyards. For instance, the customer service
team might not receive client data from sales, missing an opportunity to build more precise
and comprehensive customer profiles. Hidden data causes organizations to miss out on
possibilities to develop novel products, enhance services, and streamline procedures.
Finding relevant data is not so easy. There are several factors that we need to consider while
trying to find relevant data, which include -
Relevant Domain
Relevant demographics
We need to consider Relevant Time periods and many more factors while trying to find
appropriate data.
If data is irrelevant to our study in any of these factors, it becomes obsolete, and we cannot
effectively proceed with its analysis. This could lead to incomplete research or analysis, re-
collecting data repeatedly, or shutting down the study.
Determining what data to collect is one of the most important factors when collecting data
and should be one of the first steps. We must choose the subjects the data will cover, the
sources we will use to gather it, and the information required. Our responses to these
questions will depend on our aims, or what we expect to achieve using the data. As an
illustration, we may choose to gather information on the categories of articles that website
visitors between the ages of 20 and 50 most frequently access. We can also decide to compile
data on the typical age of all the clients who purchased from our business over the previous
month.
Not addressing this could lead to double work, the collection of irrelevant data, or the ruin of
your study.
Big data refers to massive data sets with more intricate and diversified structures. These traits
typically result in increased challenges when storing and analyzing the data and when applying
additional methods of extracting results. Big data refers especially to data sets that are so
enormous or intricate that conventional data processing tools are insufficient: the overwhelming
amount of data, both unstructured and structured, that a business faces daily.
Poor design and low response rates were shown to be two issues with data collecting,
particularly in health surveys that used questionnaires. This might lead to an insufficient or
inadequate data supply for the study. Creating an incentivized data collection program might
be beneficial in this case to get more responses.
In the Data Collection Process, there are five key steps. They are explained briefly below:
1. Decide What Data You Want to Gather
The first thing that we need to do is decide what information we want to gather. We must
choose the subjects the data will cover, the sources we will use to collect it, and the quantity
of information that we will require. For instance, we may choose to gather information on the
categories of products that an average e-commerce website visitor between the ages of 30 and
45 most frequently searches for.
2. Establish a Deadline for Data Collection
The process of creating a strategy for data collection can now begin. We should set a deadline
for our data collection at the outset of our planning phase. Some forms of data we might want
to collect continuously. For instance, we might want to build up a technique for tracking
transactional data and website visitor statistics over the long term. However, we will track the
data throughout a certain time frame if we are tracking it for a particular campaign. In these
situations, we will have a schedule for beginning and finishing gathering data.
3. Select a Data Collection Method
At this stage, we will select the data collection technique to serve as the foundation of our
data-gathering plan. We must consider the type of information we wish to gather, the period
we will receive it, and the other factors we decide on when choosing the best gathering
strategy.
4. Gather Information
Once our plan is complete, we can implement our data collection plan and begin gathering
data. In our DMP (data management platform), we can store and arrange our data. We need to be careful to follow our
plan and keep an eye on how it's doing. Especially if we are collecting data regularly, setting
up a timetable for when we will be checking in on how our data gathering is going may be
helpful. As circumstances alter and we learn new details, we might need to amend our plan.
5. Analyze the Data and Implement Your Findings
It's time to examine our data and arrange our findings after gathering all our information. The
analysis stage is essential because it transforms unprocessed data into insightful knowledge
that can be applied to better our marketing plans, goods, and business judgments. The
analytics tools included in our DMP can assist with this phase. We can put the discoveries to
use to enhance our business once we have discovered the patterns and insights in our data.
Let us now look at some data collection considerations and best practices that one might
follow.
Data Collection Considerations and Best Practices
We must carefully plan before spending time and money traveling to the field to gather data.
While saving time and resources, effective data collection strategies can help us collect
richer, more accurate data.
Once we have decided on the data we want to gather, we need to consider the expense of
doing so. Our surveyors and respondents will incur additional costs for each additional data
point or survey question.
There is a dearth of freely accessible data. Sometimes the data is there, but we may not have
access to it. For instance, unless we have a compelling cause, we cannot openly view another
person's medical information. It could be challenging to measure several types of
information.
Consider how time-consuming and complex it will be to gather each piece of information
while deciding what data to acquire.
3. Think About Your Choices for Data Collecting Using Mobile Devices
IVRS (interactive voice response technology) - Will call the respondents and ask them
questions that have already been recorded.
SMS data collection - Will send a text message to the respondent, who can then respond to
questions by text on their phone.
Field surveyors - Can directly enter data into an interactive questionnaire while speaking
to each respondent, thanks to smartphone apps.
We need to select the appropriate tool for our survey and respondents because each has its
own disadvantages and advantages.
It's all too easy to get information about anything and everything, but it's crucial only to
gather the information we require.
Identifiers, or details describing the context and source of a survey response, are just as
crucial as the information about the subject or program that we are researching.
Adding more identifiers will enable us to pinpoint our program's successes and failures more
accurately, but moderation is the key.
Although collecting data on paper is still common, modern technology relies heavily on
mobile devices. They enable us to gather various data types at relatively lower prices and are
accurate and quick. With the boom of low-cost Android devices, there aren't many reasons
not to choose mobile-based data collecting.
Conclusion
To sum up, it is vital to master data collection for making decisions that are well-informed
and conducting effective research. Once you understand the different data collection
techniques and know about the right tools and best practices, you can gather meaningful and
accurate data. However, you must address the common challenges and concentrate on the
essential steps involved in the process to maintain your data's credibility and achieve good
results.
Hypothesis generation involves making informed guesses about various aspects of a business,
market, or problem that need further exploration and testing. It's a crucial step while applying
the scientific method to business analysis and decision-making.
A bicycle manufacturer noticed that their sales had dropped significantly in 2002 compared
to the previous year. The team investigating the reasons for this had many hypotheses. One of
them was: “many cycling enthusiasts have switched to walking with their iPods plugged
in.” The Apple iPod was launched in late 2001 and was an immediate hit among young
consumers. Data collected manually by the team seemed to show that the geographies around
Apple stores had indeed shown a sales decline.
LLM-based tools have also revolutionised experimentation by optimising test designs, reducing
resource-intensive processes, and delivering faster results. The role of LLMs in hypothesis
generation goes beyond mere assistance, bringing innovation and easy, data-driven decision-
making to businesses.
Hypotheses come in various types, such as simple, complex, null, alternative, logical,
statistical, or empirical. These categories are defined based on the relationships between the
variables involved and the type of evidence required for testing them. In this article, we aim
to demystify hypothesis generation. We will explore the role of LLMs in this process and
outline the general steps involved, highlighting why it is a valuable tool in your arsenal.
A hypothesis is born from a set of underlying assumptions and a prediction of how those
assumptions are anticipated to unfold in a given context. Essentially, it's an educated,
articulated guess that forms the basis for action and outcome assessment.
A hypothesis is a declarative statement that has not yet been proven true; that is how past
scholarship sums it up.
In a business setting, hypothesis generation becomes essential when people are made to
explain their assumptions. This clarity from hypothesis to expected outcome is crucial, as it
allows people to acknowledge a failed hypothesis if it does not provide the intended result.
Promoting such a culture of effective hypothesising can lead to more thoughtful actions and a
deeper understanding of outcomes. Failures become just another step on the way to success,
and success brings more success.
Hypothesis generation is a continuous process where you start with an educated guess and
refine it as you gather more information. You form a hypothesis based on what you know or
observe.
Say you're a pen maker whose sales are down. You look at what you know:
1. I can see that pen sales for my brand are down in May and June.
2. I also know that schools are closed in May and June and that schoolchildren use a lot
of pens.
3. I hypothesise that my sales are down because school children are not using pens in
May and June, and thus not buying newer ones.
The next step is to collect and analyse data to test this hypothesis, like tracking sales before
and after school vacations. As you gather more data and insights, your hypothesis may
evolve. You might discover that your hypothesis only holds in certain markets but not others,
leading to a more refined hypothesis.
Once your hypothesis is proven correct, there are many actions you may take - (a) reduce
supply in these months (b) reduce the price so that sales pick up (c) release a limited supply
of novelty pens, and so on.
Once you decide on your action, you will further monitor the data to see if your actions are
working. This iterative cycle of formulating, testing, and refining hypotheses - and using
insights in decision-making - is vital in making impactful decisions and solving complex
problems in various fields, from business to scientific research.
5. Hypothesis Testing
The default action is what you would naturally do, regardless of any hypothesis or in a case
where you get no further information. The alternative action is the opposite of your default
action.
The null hypothesis, or H0, is what brings about your default action. The alternative
hypothesis (H1) is essentially the negation of H0.
For example, suppose you are tasked with analysing highway tollgate data (timestamp,
vehicle number, toll amount) to see if a rise in tollgate rates will increase revenue or cause a
volume drop. Following the above steps, we can determine:
Alternative action: “I will keep my rates constant.”
H0: “A 10% increase in the toll rate will not cause a significant dip in traffic (say 3%).”
H1: “A 10% increase in the toll rate will cause a dip in traffic of greater than 3%.”
Now, we can start looking at past data of tollgate traffic in and around rate increases for
different tollgates. Some data might be irrelevant. For example, some tollgates might be
much cheaper so customers might not have cared about an increase. Or, some tollgates are
next to a large city, and customers have no choice but to pay.
Ultimately, you are looking for the level of significance between traffic and rates for
comparable tollgates. Significance is often noted as its P-value or probability value. P-value
is a way to measure how surprising your test results are, assuming that your H0 holds true.
The lower the p-value, the more convincing your data is to change your default action.
Usually, a p-value of less than 0.05 is considered statistically significant, meaning there is a
need to reject your null hypothesis and change your default action. In our example, a low
p-value would suggest that a 10% increase in the toll rate causes a significant dip in traffic
(>3%). Thus, it is better to keep our rates as they are if we want to maintain revenue.
In other examples, where one has to explore the significance of different variables, we might
find that some variables are not correlated at all. In general, hypothesis generation is an
iterative process - you keep looking for data and keep considering whether that data
convinces you to change your default action.
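To make the p-value idea concrete, here is a hedged sketch that tests H0 from the tollgate example with a two-sample t-test on synthetic daily traffic counts; the numbers are invented and the choice of test is an assumption for illustration.

```python
# Hedged sketch of testing H0 ("a 10% rate increase causes no significant dip
# in traffic") with a two-sample t-test on synthetic daily vehicle counts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
traffic_before = rng.normal(10_000, 400, size=30)   # daily vehicles, 30 days before
traffic_after = rng.normal(9_500, 400, size=30)     # 30 days after the rate increase

t_stat, p_value = stats.ttest_ind(traffic_before, traffic_after)
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the dip in traffic looks statistically significant.")
else:
    print("Fail to reject H0: no convincing evidence of a dip.")
```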
Hypothesis generation feeds on data. Data can be internal or external. In businesses, internal
data is produced by company owned systems (areas such as operations, maintenance,
personnel, finance, etc). External data comes from outside the company (customer data,
competitor data, and so on).
Multinational company Johnson & Johnson was looking to enhance employee performance
and retention.
Initially, they favoured experienced industry candidates for recruitment, assuming they'd stay
longer and contribute faster. However, HR and the people analytics team at J&J hypothesised
that recent college graduates outlast experienced hires and perform equally well.
They compiled data on 47,000 employees to test the hypothesis and, based on it,
Johnson & Johnson increased hires of new graduates by 20%, leading to reduced turnover
with consistent performance.
For an analyst (or an AI assistant), external data is often hard to source - it may not be
available as organised datasets (or reports), or it may be expensive to acquire. Teams might
have to collect new data from surveys, questionnaires, customer feedback and more.
Further, there is the problem of context. Suppose an analyst is looking at the dynamic pricing
of hotels offered on his company’s platform in a particular geography. Suppose further that
the analyst has no context of the geography, the reasons people visit the locality, or of local
alternatives; then the analyst will have to learn additional context to start making hypotheses
to test.
Internal data, of course, is internal, meaning access is already guaranteed. However, this
probably adds up to staggering volumes of data.
Looking Back, and Looking Forward
Data analysts often have to generate hypotheses retrospectively, where they formulate and
evaluate H0 and H1 based on past data. For the sake of this article, let's call it retrospective
hypothesis generation.
For example:
A pen seller has a hypothesis that during the lean periods of summer, when schools are
closed, a Buy One Get One (BOGO) campaign will lead to a 100% sales recovery because
customers will buy pens in advance. He then collects feedback from customers in the form of
a survey and also implements a BOGO campaign in a single territory to see whether his
hypothesis is correct, or not.
Or,
The HR head of a multi-office employer realises that some of the company’s offices have
been providing snacks at 4:30 PM in the common area, and the rest have not. He has a hunch
that these offices have higher productivity. The leader asks the company’s data science team
to look at employee productivity data and the employee location data. “Am I correct, and to
what extent?”, he asks.
These examples also reflect another nuance: the data is collected differently in each case.
Such data-backed insights are a valuable resource for businesses because they allow for more
informed decision-making, leading to the company's overall growth. Taking a data-driven
decision, from forming a hypothesis to updating and validating it across iterations, to taking
action based on your insights reduces guesswork, minimises risks, and guides businesses
towards strategies that are more likely to succeed.
Now, imagine an AI assistant helping you with hypothesis generation. LLMs are not born
with context. Instead, they are trained upon vast amounts of data, enabling them to develop
context in a completely unfamiliar environment. This skill is instrumental when adopting a
more exploratory approach to hypothesis generation. For example, the HR leader from earlier
could simply ask an LLM tool: “Can you look at this employee productivity data and find
cohorts of high-productivity and see if they correlate to any other employee data like
location, pedigree, years of service, marital status, etc?”
Together, these technologies empower data analysts to unravel hidden insights within their
data. For our pen maker, for example, an AI tool could aid data analytics. It can look through
historical data to track when sales peaked or go through sales data to identify the pens that
sold the most. It can refine a hypothesis across iterations, just as a human analyst would. It
can even be used to brainstorm other hypotheses. Consider the situation where you ask the
LLM, "Where do I sell the most pens?". It will go through all of the data you have made
available - places where you sell pens, the number of pens you sold - to return the answer.
Now, if we were to do this on our own, even if we were particularly meticulous about
keeping records, it would take us at least five to ten minutes, and that only if we know how to
query a database and extract the needed information. If we don't, there's the added effort
required to find and train such a person. An AI assistant, on the other hand, could share the
answer with us in mere seconds. Its finely-honed talents in sorting through data, identifying
patterns, refining hypotheses iteratively, and generating data-backed insights enhance
problem-solving and decision-making, supercharging our business model.
There is also the bottom-up method, where you start by going through your data and
figuring out if there are any interesting correlations that you could leverage better. This
method is usually not as focused as the earlier approach and, as a result, involves even more
data collection, processing, and analysis. AI is a stellar tool for Exploratory Data
Analysis (EDA). Wading through swathes of data to highlight trends, patterns, gaps,
opportunities, errors, and concerns is hardly a challenge for an AI tool equipped with NLP
and powered by LLMs.
An AI assistant performing EDA can help you review your data, remove redundant data
points, identify errors, note relationships, and more. All of this ensures ease, efficiency, and,
best of all, speed for your data analysts.
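The sketch below shows the kind of basic EDA pass described above, using pandas on an invented pen-sales table: shape, missing values, summary statistics, and correlations.

```python
# Minimal exploratory data analysis (EDA) sketch on an illustrative dataset.
import pandas as pd

df = pd.DataFrame({
    "pens_sold": [120, 80, 60, 150, 40],
    "price":     [10, 12, 12, 9, 14],
    "month":     ["Jan", "Feb", "Mar", "Apr", "May"],
})

print(df.shape)                        # rows and columns
print(df.isna().sum())                 # missing values per column
print(df.describe(include="all"))      # summary statistics
print(df.corr(numeric_only=True))      # relationships between numeric columns
```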
Good hypotheses are extremely difficult to generate. They are nuanced and, without
necessary context, almost impossible to ascertain in a top-down approach. On the other hand,
an AI tool adopting an exploratory approach is swift, easily running through available data -
internal and external.
If you want to rearrange how your LLM looks at your data, you can also do that. Changing
the weight you assign to the various events and categories in your data is a simple process.
That’s why LLMs are a great tool in hypothesis generation - analysts can tailor them to their
specific use cases.
There are numerous reasons why you should adopt AI tools into your hypothesis generation
process. But why are they still not as popular as they should be?
Some worry that AI tools can inadvertently pick up human biases through the data they are fed.
Others fear AI and raise privacy and trust concerns. Data quality and ability are also often
questioned. Since LLMs and Generative AI are developing technologies, such issues are
bound to arise, but these are all obstacles researchers are earnestly tackling.
One oft-raised complaint against LLM tools (like OpenAI's ChatGPT) is that they 'fill in'
gaps in knowledge, providing information where there is none, thus giving inaccurate,
embellished, or outright wrong answers; this tendency to "hallucinate" was a major cause for
concern. But, to combat this phenomenon, newer AI tools have started providing citations
with the insights they offer so that their answers become verifiable. Human validation is an essential step in interpreting AI-generated hypotheses and queries in general. This is why collaboration between the human mind and the artificially intelligent one is needed to ensure optimised performance.
Conclusion
In addition to shaping the scope and direction of your data analytics project,
hypothesis generation serves as a compass for informed decision-making. Beyond
formulating hypotheses, it prompts you to anticipate potential outcomes and identify
critical factors influencing your research question. This forward-thinking approach
allows you to proactively design experiments or data collection strategies that not
only validate the initial hypothesis but also unearth unexpected insights. In essence,
hypothesis generation is not just a starting point but a strategic tool that fosters
adaptability and a deeper understanding of the intricacies within your chosen
variables.
Use hypothesis generation in data analytics by:
Defining objectives and understanding the business context.
Exploring data for patterns and trends.
Formulating specific, testable hypotheses.
Prioritizing hypotheses based on significance.
Selecting appropriate analytical techniques.
Designing experiments to test hypotheses.
Collecting relevant data for analysis.
Iterating and refining hypotheses as needed.
Communicating findings and providing actionable insights.
Using validated hypotheses to guide strategic decision-making.
Hypothesis generation defines what to learn from data, focusing on analysis and
driving efficient resource use. By challenging assumptions and considering diverse
perspectives, generate testable propositions that lead to actionable insights and better
business decisions. It's like a road map for data exploration: asking the right questions
will guide the business to impactful discoveries using data.
Hypothesis generation is essential because it allows you to focus on the most pertinent and
influential aspects of your problem or question, instead of wasting time and resources on
irrelevant or misleading data or analysis. It also encourages multiple perspectives and
alternatives to explore and compare, as well as helps you communicate your assumptions,
expectations, and results clearly and effectively. Moreover, it provides an opportunity to learn
from your data and improve your knowledge and skills.
Hypothesis generation is important in shaping your data analytics strategy because it
provides a roadmap for your investigation. By formulating educated guesses about
potential outcomes beforehand, you gain focus and direction. This process helps you
identify what specific insights you aim to uncover and guides your data collection and
analysis efforts. It acts as a compass, steering your approach toward meaningful
results. Additionally, hypotheses serve as benchmarks for evaluating success,
allowing you to measure your findings against initial expectations. In essence,
hypothesis generation is the cornerstone of a purposeful and effective data analytics
strategy, ensuring a targeted and productive exploration of data.
Generating hypotheses for a data analytics project is not a one-size-fits-all process, but there
are some general steps to make it easier and more effective. Firstly, define the research
question or goal and review the background knowledge and literature. Secondly, identify the
data sources and methods, then brainstorm possible hypotheses. Finally, prioritize and select
the most relevant, feasible, and impactful hypotheses using criteria such as SMART or ICE.
To stimulate creativity and generate diverse hypotheses, you can use techniques such as mind
mapping, brainstorming, or SCAMPER (Substitute, Combine, Adapt, Modify, Put to another
use, Eliminate, Reverse).
With data analytics, hypothesis generation is crucial for guiding strategy, especially in digital marketing. It is essentially about forming educated guesses based on the data you have; to generate a hypothesis, you can also refer to public literature, best practices, user surveys, or user testing. For example, if you notice a surge in website traffic after specific social media posts, you might hypothesise that certain types of content are more effective. This serves as a starting point for more thoroughly analysing the root cause of the traffic surge. It also helps you test and refine your marketing efforts and tailor your digital marketing strategies to be data-driven and in line with your audience's preferences.
Hypotheses have to be tied to the answers you want to get from your questions. List all the most important questions about your subject and brainstorm possible hypotheses for each one of them. Break them down into smaller parts and you will arrive at the hypotheses. For example, suppose I want to know whether external ads increase sales. I can generate hypotheses such as "Do most viewers of my website come from external ads?" and "Is conversion higher among the group of viewers who arrive from external ads?"
How to test hypotheses?
Once you have selected your hypotheses, you need to design and conduct your data analysis
to test them. Depending on the complexity of your hypotheses, you might use descriptive,
exploratory, inferential, or predictive data analysis and utilize different methods of data
visualization, like charts, graphs, tables, or dashboards. The general steps for testing
hypotheses include defining variables and metrics, collecting and preparing data, analyzing
and interpreting results, and reporting and communicating findings. You must operationalize
and quantify the variables you want to measure or manipulate in your analysis. Additionally,
you must ensure that your data is accurate, complete, and consistent. Finally, you must
summarize and present your findings to your audience and address any feedback or questions
that may arise.
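As an illustration of the inferential step, the sketch below tests the earlier example hypothesis, "is conversion higher for viewers who arrive from external ads?", using a one-sided two-proportion z-test. The visitor and conversion counts are made up, and the statsmodels function is just one of several ways to run such a test.

# Illustrative test of the hypothesis "conversion is higher for visitors arriving
# from external ads", using made-up counts.
from statsmodels.stats.proportion import proportions_ztest

conversions = [90, 70]      # converted visitors: [external ads, other sources]
visitors    = [1000, 1200]  # total visitors:     [external ads, other sources]

# One-sided test: is the external-ads conversion rate larger than the other group's?
stat, p_value = proportions_ztest(conversions, visitors, alternative="larger")
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: external-ad visitors convert at a higher rate.")
else:
    print("No significant evidence that external-ad visitors convert more.")

If the p-value falls below the chosen significance level (0.05 here), the data support the hypothesis; otherwise the hypothesis may need to be refined.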
Testing hypotheses is not a one-time activity, but an iterative and learning process. You may
need to refine or revise your hypotheses based on your data analysis results, new information,
or changing conditions. It could also be necessary to generate new hypotheses to explore
further or deeper aspects of your problem or question. To refine your hypotheses, review the
data analysis results and feedback, identify the gaps or opportunities for improvement, then
generate and test new or modified hypotheses. Design and conduct the data analysis to test
them, then report and communicate the findings.
Refining hypotheses is the heartbeat of dynamic data analysis. Post-testing, delve into
results and feedback, pinpointing gaps or opportunities. Generate and test new or
modified hypotheses, conducting rigorous data analysis to iterate findings. This
iterative cycle not only hones precision but fosters adaptability, ensuring hypotheses
align with evolving insights. In the dynamic landscape of data analysis, refinement
isn't just a step—it's the continuous evolution that propels impactful decision-making.
Business Modelling
With the help of modelling techniques, we can create a complete description of existing and
proposed organizational structures, processes, and information used by the enterprise.
Business Model is a structured model, just like a blueprint for the final product to be
developed. It gives structure and dynamics for planning. It also provides the foundation for
the final product.
Business modelling is used to design the current and future state of an enterprise. This model is used by the Business Analyst and the stakeholders to ensure that they have an accurate understanding of the current "As-Is" model of the enterprise.
It is also used to verify that stakeholders have a shared understanding of the proposed "To-Be" state of the solution.
Analyzing requirements is a part of the business modelling process and forms its core focus area. Functional requirements are gathered during the "Current State". These requirements
are provided by the stakeholders regarding the business processes, data, and business rules
that describe the desired functionality which will be designed in the Future State.
After defining the business needs, the current state (e.g. current business processes, business
functions, features of a current system and services/products offered and events that the
system must respond to) must be identified to understand how people, processes and
technology, structure and architecture are supporting the business by seeking input from IT
staff and other related stakeholders including business owners.
A gap analysis is then performed, by comparing the identified current state with the desired outcomes, to assess whether any gap prevents the business needs from being achieved.
If there is no gap (i.e. the current state is adequate to meet the business needs and desired outcomes), it will probably not be necessary to launch the IT project. Otherwise, the problems and issues that need to be addressed in order to bridge the gap should be identified.
Techniques such as SWOT (Strengths, Weaknesses, Opportunities and Threats) Analysis and
document analysis can be used.
The BA should assist the IT project team in assessing the proposed IT system to ensure that it meets the business needs and maximizes the value delivered to stakeholders. The BA should also review the organization's readiness for supporting the transition to the proposed IT system to ensure a smooth system implementation.
The BA should help the IT project team determine whether the proposed system option and the high-level system design can meet the business needs and deliver enough business value to justify the investment. If there is more than one system option, the BA should work with the IT staff to identify the pros and cons of each option and select the option that delivers the greatest business value.
The primary role of business modelling lies in the inception and elaboration stages of a project, and it fades during the construction and transition stages. It is mostly concerned with the analytical aspects of the business combined with technical mapping of the application or software solution.
Domain and User variation − Developing a business model will frequently reveal areas of
disagreement or confusion between stakeholders. The Business Analyst will need to
document the following variations in the as-is model.
Multiple work units perform the same function − Document the variances in the AS-IS
model. This may be different divisions or geographies.
Multiple users perform the same work − Different stakeholders may do similar work differently. The variation may be the result of different skill sets and approaches of different business units, or the result of differing needs of external stakeholders serviced by the enterprise. Document the variances in the AS-IS model.
Resolution Mechanism − The Business Analyst should document whether the To-Be solution will accommodate the inconsistencies in the current business model or whether the solution will require standardization. Stakeholders need to determine which approach to follow. The To-Be model will reflect their decision.
A Business Analyst is supposed to define a standard business process and set it up in an ERP system, which is of key importance for efficient implementation. It is also the duty of the BA to translate the language of the developers into understandable business language before the implementation, and then to utilize best practices and map them to the system's capabilities.
A requirement for the system is the GAAP fit analysis, which has to balance between −
The need for technical changes, which are enhancements made in order to achieve identity with the existing practice.
Effective changes, which are related to re-engineering existing business processes to allow implementation of the standard functionality and application of process models.
Domain expertise is generally acquired over a period by being in the “business” of doing
things. For example,
A banking associate gains knowledge of various types of accounts that a customer
(individual and business) can operate along with detailed business process flow.
An insurance sales representative can understand the various stages involved in procuring an insurance policy.
A marketing analyst has more chances of understanding the key stakeholders and business
processes involved in a Customer Relationship Management system.
A Business Analyst involved in a capital markets project is supposed to have subject matter expertise and strong knowledge of Equities, Fixed Income and Derivatives. He or she is also expected to have handled back-office and front-office operations and to have practical exposure in applying risk management models.
A Healthcare Business Analyst is required to have a basic understanding of US healthcare financial and utilization metrics, technical experience with and understanding of EDI 837/835/834 and HIPAA guidelines, and knowledge of ICD-9/10 codification, CPT codes, LOINC and SNOMED.
Some business analysts acquire domain knowledge by testing business applications and working with the business users. They create a conducive learning environment through their interpersonal and analytical skills. In some cases, they supplement their domain knowledge with domain certifications offered by AICPCU/IIA and LOMA in the fields of insurance and financial services. Other institutes offer certifications in other domains.
Following a thorough examination of current business processes, you can offer highly professional assistance in identifying the optimal approach to modelling the system.
In the next section, we will discuss briefly about some of the popular Business Modelling
Tools used by large organizations in IT environments.
MS-Visio is a drawing and diagramming software that helps transform concepts into a visual
representation. Visio provides you with pre-defined shapes, symbols, backgrounds, and
borders. Just drag and drop elements into your diagram to create a professional
communication tool.
Step 1 − To open a new Visio drawing, go to the Start Menu and select Programs → Visio.
Step 2 − Move your cursor over “Business Process” and select “Basic Flowchart”.
A − The toolbars across the top of the screen are similar to those in other Microsoft programs such as Word and PowerPoint. If you have used these programs before, you may notice a few different functionalities, which we will explore later.
Selecting Help Diagram Gallery is a good way to become familiar with the types of drawings
and diagrams that can be created in Visio.
B − The left side of the screen shows the menus specific to the type of diagram you are
creating. In this case, we see −
Arrow Shapes
Backgrounds
Basic Flowchart Shapes
Borders and Titles
C − The center of the screen shows the diagram workspace, which includes the actual
diagram page as well as some blank space adjacent to the page.
D − The right side of the screen shows some help functions. Some people may choose to
close this window to increase the area for diagram workspace, and re-open the help functions
when necessary.
Enterprise Architect is a visual modeling and design tool based on UML. The platform supports the design and construction of software systems, the modeling of business processes, and the modeling of industry-based domains. It is used by businesses and organizations not only to model the architecture of their systems but also to process the implementation of these models across the full application development life cycle.
The intent of Enterprise architect is to determine how an organization can most effectively
achieve its current and future objectives.
Business perspective − The Business perspective defines the processes and standards by
which the business operates on day to day basis.
Application Perspective − The application perspective defines the interactions among the
processes and standards used by the organization.
Information Perspective − This defines and classifies the raw data, such as document files, databases, images, presentations and spreadsheets, that the organization requires in order to operate efficiently.
Technology Perspective − This defines the hardware, operating systems, programming and networking solutions used by the organization.
Requirements management is the process of eliciting, documenting, organizing, tracking and changing requirements, and communicating this information across the project teams to ensure that iterative and unanticipated changes are maintained throughout the project life cycle.
It also involves monitoring status and controlling changes to the requirement baseline. The primary elements are change control and traceability.
Requisite Pro is used for the above activities and for project administration purposes; the tool is used for querying and searching, and for viewing the discussions that were part of a requirement.
In Requisite Pro, the user can work on the requirement document. The document is an MS-Word file created in the Requisite Pro application and integrated with the project database. Requirements created outside Requisite Pro can be imported or copied into the document.
In Requisite Pro, we can also work with traceability, which here is a dependency relationship between two requirements. Traceability is a methodical approach to managing change by linking requirements that are related to each other.
Requisite Pro makes it easy to track changes to a requirement throughout the development
cycle, so it is not necessary to review all your documents individually to determine which
elements need updating. You can view and manage suspect relationships using a Traceability
Matrix or a Traceability Tree view.
Requisite Pro projects enable us to create a project framework in which the project artifacts
are organized and managed. In each project the following are included.
Requisite Pro allows multiple users to access the same project documents and database simultaneously, hence the project security aspect is very crucial. Security prevents misuse of the system, potential harm, or data loss arising from unauthorized user access to a project document.
It is recommended that security be enabled for all RequisitePro projects. Doing so ensures that all changes to the project are associated with the proper username of the individual who made the change, thereby giving you a complete audit trail for all changes.
Model Validation
Model validation is defined within regulatory guidance as "the set of processes and activities intended to verify that models are performing as expected, in line with their design objectives and business uses." It also identifies potential limitations and assumptions and assesses their possible impact. Validation should be independent of model development and use; models, therefore, should not be validated by their owners. Validation can be highly technical, and some institutions may find it difficult to assemble a model risk team that has sufficient functional and technical expertise to carry it out independently. When faced with this obstacle, institutions often outsource the validation task to third parties.
In statistics, model validation is the task of confirming that the outputs of a statistical model are acceptable with respect to the real data-generating process. In other words, model validation is the task of confirming that the outputs of a statistical model have enough fidelity to the outputs of the data-generating process that the objectives of the investigation can be achieved.
1. Conceptual Design
The foundation of any model validation is its conceptual design, which needs a documented coverage assessment that supports the model's ability to meet business and regulatory needs and to address the unique risks facing a bank.
The design and capabilities of a model can have a profound effect on the overall effectiveness
of a bank’s ability to identify and respond to risks. For example, a poorly designed risk
assessment model may result in a bank establishing relationships with clients that present a
risk that is greater than its risk appetite, thus exposing the bank to regulatory scrutiny and
reputation damage.
A validation should independently challenge the underlying conceptual design and ensure that documentation is appropriate to support the model's logic and the model's ability to achieve its intended business and regulatory objectives.
2. System Validation
All technology and automated systems implemented to support models have limitations. An
effective validation includes: firstly, evaluating the processes used to integrate the model’s
conceptual design and functionality into the organisation’s business setting; and, secondly,
examining the processes implemented to execute the model’s overall design. Where gaps or
limitations are observed, controls should be evaluated to enable the model to function
effectively.
3. Data Validation
Data errors or irregularities impair results and might lead to an organisation's failure to identify and respond to risks. Best practice indicates that institutions should apply a risk-based
data validation, which enables the reviewer to consider risks unique to the organisation and the
model.
To establish a robust framework for data validation, guidance indicates that the accuracy of
source data be assessed. This is a vital step because data can be derived from a variety of
sources, some of which might lack controls on data integrity, so the data might be incomplete
or inaccurate.
4. Process Validation
To verify that a model is operating effectively, it is important to prove that the established
processes for the model’s ongoing administration, including governance policies and
procedures, support the model’s sustainability. A review of the processes also determines
whether the models are producing output that is accurate and managed effectively, and whether the models remain aligned with the bank's business and regulatory expectations. By failing to validate models, banks increase the risk of regulatory criticism; institutions should therefore dedicate sufficient resources to validation. An independent validation team well versed in data management,
technology, and relevant financial products or services — for example, credit, capital
management, insurance, or financial crime compliance — is vital for success. Where shortfalls
in the validation process are identified, timely remedial actions should be taken to close the
gaps.
Model Evaluation
Model Evaluation is an integral part of the model development process. It helps to find
the best model that represents our data and how well the chosen model will work in the
future. Evaluating model performance with the data used for training is not acceptable in
data science because it can easily generate overoptimistic and overfitted models. There
are two methods of evaluating models in data science, Hold-Out and Cross-Validation.
To avoid overfitting, both methods use a test set (not seen by the model) to evaluate
model performance.
Hold-Out: In this method, the (usually large) dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build (train) the candidate models.
2. Validation set is a subset of the dataset used to assess the performance of model built in
the training phase. It provides a test platform for fine tuning model’s parameters and
selecting the best-performing model. Not all modelling algorithms need a validation set.
3. Test set, or unseen examples, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Cross-Validation: When only a limited amount of data is available, k-fold cross-validation gives a less biased estimate of model performance. In k-fold cross-validation, we divide the data into k subsets of equal size. We build models k times, each time leaving out one of the subsets from training and using it as the test set. If k equals the sample size, this is called leave-one-out cross-validation.
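The sketch below illustrates both methods on a small public dataset using scikit-learn; the dataset, model and value of k are arbitrary choices made only for demonstration.

# Illustrative hold-out split and k-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold-out: keep a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out test accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: each subset serves once as the test fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("CV accuracy per fold:", scores)
print("Mean CV accuracy:", scores.mean())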
Classification Evaluation
Regression Evaluation
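For these two families of evaluation, typical metrics include accuracy, precision, recall and F1 for classification, and mean squared error and R² for regression. The sketch below computes them with scikit-learn on made-up labels and values.

# Illustrative evaluation metrics for classification and regression (toy values).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score

# Classification evaluation on hypothetical true vs. predicted labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

# Regression evaluation on hypothetical true vs. predicted values.
y_true_r = [3.0, 5.0, 2.5, 7.0]
y_pred_r = [2.8, 5.4, 2.0, 7.3]
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
print("R^2:", r2_score(y_true_r, y_pred_r))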
In summary, data interpretation is essential for leveraging the power of data and
transforming it into actionable insights. It enables organizations and individuals to make
informed decisions, identify opportunities and risks, optimize performance, enhance
customer experience, predict future trends, and gain a competitive advantage in their
respective domains.
The Role of Data Interpretation in Decision-Making Processes
Data interpretation plays a crucial role in decision-making processes across organizations
and industries. It empowers decision-makers with valuable insights and helps guide their
actions. Here are some key roles that data interpretation fulfills in decision-making:
Understanding Data
Before delving into data interpretation, it’s essential to understand the fundamentals of
data. Data can be categorized into qualitative and quantitative types, each requiring
different analysis methods. Qualitative data represents non-numerical information, such
as opinions or descriptions, while quantitative data consists of measurable quantities.
Types of Data
Exploratory Data Analysis (EDA) is a vital step in data interpretation, helping you
understand the data’s characteristics and uncover initial insights. By employing various
graphical and statistical techniques, you can gain a deeper understanding of the data
patterns and relationships.
Univariate Analysis
Univariate analysis focuses on examining individual variables in isolation, revealing their
distribution and basic characteristics. Here are some common techniques used in
univariate analysis; a short code sketch follows the list:
Histograms: Graphical representations of the frequency distribution of a variable.
Histograms display data in bins or intervals, providing a visual depiction of the
data’s distribution.
Box plots: Box plots summarize the distribution of a variable by displaying its
quartiles, median, and any potential outliers. They offer a concise overview of the
data’s central tendency and spread.
Frequency distributions: Tabular representations that show the number of
occurrences or frequencies of different values or ranges of a variable.
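A brief sketch of these univariate techniques using pandas and Matplotlib is shown below; the ages series is invented purely for illustration.

# Illustrative univariate analysis of a single numeric variable (made-up data).
import pandas as pd
import matplotlib.pyplot as plt

ages = pd.Series([23, 25, 31, 35, 35, 40, 41, 45, 52, 52, 52, 60, 75])

print(ages.describe())                        # central tendency and spread
print(ages.value_counts(bins=5, sort=False))  # frequency distribution in 5 bins

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ages.plot.hist(bins=5, ax=ax1, title="Histogram")  # distribution shape
ages.plot.box(ax=ax2, title="Box plot")            # quartiles, median, outliers
plt.tight_layout()
plt.show()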
Bivariate Analysis
Bivariate analysis explores the relationship between two variables, examining how they
interact and influence each other. By visualizing and analyzing the connections between
variables, you can identify correlations and patterns. Some common techniques for
bivariate analysis include the following; a short code sketch follows the list:
Scatter plots: Graphical representations that display the relationship between two
continuous variables. Scatter plots help identify potential linear or nonlinear
associations between the variables.
Correlation analysis: Statistical measure of the strength and direction of the
relationship between two variables. Correlation coefficients, such as Pearson’s
correlation coefficient, range from -1 to 1, with higher absolute values indicating
stronger correlations.
Heatmaps: Visual representations that use color intensity to show the strength of
relationships between two categorical variables. Heatmaps help identify patterns
and associations between variables.
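The sketch below applies two of these bivariate techniques, a scatter plot and Pearson's correlation coefficient, to an invented table of advertising spend and units sold.

# Illustrative bivariate analysis: scatter plot and Pearson correlation (made-up data).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "ad_spend":   [10, 20, 30, 40, 50, 60, 70, 80],
    "units_sold": [12, 25, 29, 48, 52, 58, 75, 79],
})

r = df["ad_spend"].corr(df["units_sold"])   # Pearson correlation, between -1 and 1
print(f"Pearson correlation: {r:.2f}")

df.plot.scatter(x="ad_spend", y="units_sold", title="Ad spend vs. units sold")
plt.show()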
Multivariate Analysis
Multivariate analysis involves the examination of three or more variables simultaneously.
This analysis technique provides a deeper understanding of complex relationships and
interactions among multiple variables. Some common methods used in multivariate
analysis include:
Measures of Dispersion
Measures of dispersion quantify the spread or variability of the data points.
Understanding variability is essential for assessing the data’s reliability and drawing
meaningful conclusions. Common measures of dispersion include the following; a short code sketch follows the list:
Range: The difference between the maximum and minimum values in a dataset,
providing a simple measure of spread.
Variance: The average squared deviation from the mean, measuring the dispersion
of data points around the mean.
Standard Deviation: The square root of the variance, representing the average
distance between each data point and the mean.
Percentiles: Divisions of data into 100 equal parts, indicating the percentage of
values that fall below a given value. The median corresponds to the 50th
percentile.
Quartiles: Divisions of data into four equal parts, denoted as the first quartile
(Q1), median (Q2), and third quartile (Q3). The interquartile range (IQR)
measures the spread between Q1 and Q3.
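The sketch below computes each of these dispersion measures with NumPy on a small invented sample.

# Illustrative computation of the dispersion measures listed above (made-up data).
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])

print("Range    :", data.max() - data.min())
print("Variance :", data.var(ddof=1))           # sample variance
print("Std. dev.:", data.std(ddof=1))           # sample standard deviation
q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles (Q2 is the median)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR      :", q3 - q1)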
Confidence Intervals
Confidence intervals provide a range of values within which the population parameter is
likely to fall. They quantify the uncertainty associated with estimating population
parameters based on sample data. The construction of a confidence interval typically involves a point estimate, a margin of error derived from the sampling distribution, and a chosen confidence level (for example, 95%). Commonly used hypothesis tests include:
Parametric tests:
o t-tests: Compare means between two groups or assess differences in paired
observations.
o Analysis of Variance (ANOVA): Compare means among multiple groups.
o Chi-square test: Assess the association between categorical variables.
Non-parametric tests:
o Mann-Whitney U test: Compare medians between two independent groups.
o Kruskal-Wallis test: Compare medians among multiple independent groups.
o Spearman’s rank correlation: Measure the strength and direction of
monotonic relationships between variables.
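The sketch below constructs a 95% confidence interval and runs one parametric and one non-parametric test from the list above using SciPy; the two groups of measurements are invented.

# Illustrative confidence interval, t-test, and Mann-Whitney U test (made-up data).
import numpy as np
from scipy import stats

group_a = np.array([12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5])
group_b = np.array([10.9, 11.5, 12.0, 10.4, 11.8, 11.1, 10.7])

# 95% confidence interval for the mean of group_a (t distribution).
ci = stats.t.interval(0.95, len(group_a) - 1,
                      loc=group_a.mean(), scale=stats.sem(group_a))
print("95% CI for mean of group A:", ci)

# Parametric: independent two-sample t-test comparing the group means.
t_stat, p_t = stats.ttest_ind(group_a, group_b)
print(f"t-test: t = {t_stat:.2f}, p = {p_t:.4f}")

# Non-parametric: Mann-Whitney U test comparing the two groups.
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_u:.4f}")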
Data interpretation techniques enable you to extract actionable insights from your data,
empowering you to make informed decisions. We’ll explore key techniques that facilitate
pattern recognition, trend analysis, comparative analysis, predictive modeling, and causal
inference. A short code sketch of trend analysis follows the list of techniques below.
Time series analysis: Analyzes data points collected over time to identify
recurring patterns and trends.
Moving averages: Smooths out fluctuations in data, highlighting underlying
trends and patterns.
Seasonal decomposition: Separates a time series into its seasonal, trend,
and residual components.
Cluster analysis: Groups similar data points together, identifying patterns or
segments within the data.
Association rule mining: Discovers relationships and dependencies between
variables, uncovering valuable patterns and trends.
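As a small illustration of trend analysis, the sketch below computes a three-month moving average over an invented monthly sales series with pandas; seasonal decomposition and clustering would follow a similar pattern with libraries such as statsmodels or scikit-learn.

# Illustrative trend analysis: a 3-month moving average over made-up monthly sales.
import pandas as pd

sales = pd.Series(
    [100, 120, 90, 130, 160, 150, 170, 200, 180, 210, 240, 230],
    index=pd.period_range("2023-01", periods=12, freq="M"),
)

moving_avg = sales.rolling(window=3).mean()  # smooths fluctuations, exposes the trend
print(pd.DataFrame({"sales": sales, "3-month MA": moving_avg}))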
Comparative Analysis
Comparative analysis involves comparing different subsets of data or variables to identify
similarities, differences, or relationships. This analysis helps uncover insights into the
factors that contribute to variations in the data.
Data visualization is a powerful tool for presenting data in a visually appealing and
informative manner. Visual representations help simplify complex information, enabling
effective communication and understanding. Commonly used tools include the following; a short example follows the list.
Tableau: A powerful business intelligence and data visualization tool that allows
you to create interactive dashboards, charts, and maps.
Power BI: Microsoft’s business analytics tool that enables data visualization,
exploration, and collaboration.
Python libraries: Matplotlib, Seaborn, and Plotly are popular Python libraries for
creating static and interactive visualizations.
R programming: R offers a wide range of packages, such as ggplot2 and Shiny,
for creating visually appealing data visualizations.
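As a simple illustration with one of the Python libraries above, the sketch below draws a bar chart of hypothetical quarterly revenue with Matplotlib.

# Illustrative bar chart of hypothetical quarterly revenue.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 170]   # hypothetical figures, in thousands

plt.bar(quarters, revenue, color="steelblue")
plt.title("Quarterly revenue (hypothetical)")
plt.xlabel("Quarter")
plt.ylabel("Revenue (in thousands)")
plt.tight_layout()
plt.show()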
Data interpretation plays a vital role across various industries and domains. Let’s explore
how data interpretation is applied in specific fields, providing real-world examples and
applications.
Marketing and Consumer Behavior
In the marketing field, data interpretation helps businesses understand consumer behavior, market trends, and the effectiveness of marketing campaigns. In healthcare and medical research, key applications include:
Clinical trials: Analyzing clinical trial data to assess the safety and efficacy of
new treatments or interventions.
Epidemiological studies: Interpreting population-level data to identify disease
risk factors and patterns.
Healthcare analytics: Leveraging patient data to improve healthcare delivery,
optimize resource allocation, and enhance patient outcomes.
Spreadsheet Software
Spreadsheet software like Excel and Google Sheets offer a wide range of data analysis
and interpretation functionalities. These tools allow you to:
Statistical Software
Statistical software packages, such as R and Python, provide a more comprehensive and
powerful environment for data interpretation. These tools offer advanced statistical
analysis capabilities, including:
Data interpretation comes with its own set of challenges and potential pitfalls. Being
aware of these challenges can help you avoid common errors and ensure the accuracy and
validity of your interpretations.
Effective data interpretation relies on following best practices throughout the entire
process, from data collection to drawing conclusions. By adhering to these best practices,
you can enhance the accuracy and validity of your interpretations.
Perform sales trend analysis: Analyze sales data over time to identify seasonal
patterns, peak sales periods, and fluctuations in customer demand.
Conduct customer segmentation: Segment customers based on purchase
behavior, demographics, or preferences to personalize marketing campaigns and
offers.
Analyze product performance: Examine sales data for each product category to
identify top-selling items, underperforming products, and opportunities for cross-
selling or upselling.
Evaluate marketing campaigns: Analyze the impact of marketing initiatives on
sales by comparing promotional periods, advertising channels, or customer
responses.
Forecast future sales: Utilize historical sales data and predictive models to
forecast future sales trends, helping the company optimize inventory management
and resource allocation.
Analyze patient data: Extract insights from electronic health records, medical
history, and treatment outcomes to identify factors impacting patient outcomes.
Identify risk factors: Analyze patient populations to identify common risk factors
associated with specific medical conditions or adverse events.
Conduct comparative effectiveness research: Compare different treatment
methods or interventions to assess their impact on patient outcomes and inform
evidence-based treatment decisions.
Optimize resource allocation: Analyze healthcare utilization patterns to allocate
resources effectively, optimize staffing levels, and improve operational efficiency.
Evaluate intervention effectiveness: Analyze intervention programs to assess
their effectiveness in improving patient outcomes, such as reducing readmission
rates or hospital-acquired infections.
These examples illustrate how data interpretation techniques can be applied across
various industries and domains. By leveraging data effectively, organizations can unlock
valuable insights, optimize strategies, and make informed decisions that drive success.
Types Of Data Interpretation
Bar Graphs – by using bar graphs we can interpret the relationship between the variables in
the form of rectangular bars. These rectangular bars could be drawn either horizontally or
vertically. The different categories of data are represented by bars and the length of each bar
represents its value. Some types of bar graphs include grouped graphs, segmented graphs,
stacked graphs etc.
Pie Chart – the circular graph used to represent the percentage of a variable is called a pie
chart. The pie charts represent numbers as proportions or percentages. Some types of pie
charts are simple pie charts, doughnut pie charts, and 3D pie charts.
Tables – statistical data are represented by tables. The data are placed in rows and columns.
Types of tables include simple tables and complex tables.
Line Graph – charts or graphs that show information as a series of connected points are called line graphs. Line charts are very good for visualising continuous data or a sequence of values. Some types of line graphs are simple line graphs, stacked line graphs, etc.
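The sketch below draws two of the chart types described above, a pie chart and a line graph, with Matplotlib; the market-share and monthly-sales figures are invented.

# Illustrative pie chart and line graph for the interpretation types described above.
import matplotlib.pyplot as plt

# Hypothetical market-share percentages and monthly sales figures.
share_labels = ["Product A", "Product B", "Product C"]
share_values = [45, 35, 20]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
monthly_sales = [100, 110, 95, 130, 150, 160]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.pie(share_values, labels=share_labels, autopct="%1.0f%%")
ax1.set_title("Pie chart: market share")
ax2.plot(months, monthly_sales, marker="o")
ax2.set_title("Line graph: monthly sales")
plt.tight_layout()
plt.show()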
Iterative Development:
Definition:
Iterative development is a software development approach that breaks down projects into
smaller, manageable chunks called iterations.
Process:
Each iteration involves planning, analysis, design, development, testing, and deployment.
Focus:
Benefits:
Allows for early and frequent feedback, leading to a better final product.
Deployment:
In iterative development, deployment happens frequently after each iteration, allowing users
to interact with working versions of the product.
Feedback Loop:
Each deployment provides an opportunity to gather feedback and improve the next iteration.
Agile Methodologies:
Iterative development is commonly used in conjunction with Agile methodologies like Scrum
and Kanban.