DA CH1 SLQA
1. Define data science.
= 1) Data science is the field of applying advanced analytics techniques and scientific
principles to extract valuable information from data for business decision-making, strategic
planning and other uses. 2) The purpose of data science is to find patterns. Understanding
patterns means understanding the world. 3) Data science work can include developing
strategies for analyzing data, preparing data for analysis, exploring, analyzing, and
visualizing data, building models with data using programming languages, such as Python
and R, and deploying models into applications.
2. Define the term analytics.
= Analytics is the process of discovering, interpreting, and communicating significant
patterns in data. Quite simply, analytics helps us see insights and meaningful data that we
might not otherwise detect.
3. Enlist types of data analytics.
= Predictive analytics (forecasting), Descriptive analytics (business intelligence and data
mining), Prescriptive analytics (optimization and simulation), Diagnostic analytics, and
Cognitive analytics.
4. Define data analysis.
= Data Analysis is the process of systematically applying statistical and/or logical
techniques to describe and illustrate, condense and recap, and evaluate data.
5. Define mathematical model.
= A structural model of a system is a mathematical relationship between one or several
input variables and parameters and one or several output variables.
6. What is the purpose of diagnostic analytics?
= The purpose of diagnostic analytics is to determine the root cause of an occurrence or
trend.
7. Define class imbalance.
= A classification data set with skewed class proportions is called imbalanced. Classes that
make up a large proportion of the data set are called majority classes. Those that make up
a smaller proportion are minority classes.
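The imbalance can be seen directly from the class proportions. A minimal sketch (the fraud-detection labels below are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical labels for a fraud data set: 0 = legitimate, 1 = fraud
labels = pd.Series([0] * 950 + [1] * 50)

# Class proportions reveal the skew: 95% majority class vs 5% minority class
print(labels.value_counts(normalize=True))
# 0    0.95   <- majority class
# 1    0.05   <- minority class
```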
8. Differentiate between predictive analytics and prescriptive analytics. Any two
points.
= 1) Predictive analytics predicts what is most likely to happen in the future, while
prescriptive analytics recommends actions you can take to affect those outcomes.
2) Predictive analytics provides you with the raw material for making informed decisions,
while prescriptive analytics provides you with data-backed decision options that you can
weigh against one another.
9. Define exploratory analysis.
= Exploratory Data Analysis refers to the critical process of performing initial investigations
on data so as to discover patterns, to spot anomalies, to test hypotheses and to check
assumptions with the help of summary statistics and graphical representations.
10. Define linear model.
= Linear models describe a continuous response variable as a function of one or more
predictor variables. They can help you understand and predict the behavior of complex
systems or analyze experimental, financial, and biological data.
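As an illustration, a minimal sketch of fitting a linear model with scikit-learn (the synthetic data below is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a continuous response as a function of one predictor
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # estimated slope and intercept
print(model.predict([[6.0]]))         # predicted response for a new input
```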
11. What is model evaluation?
= Model evaluation is the process of using different evaluation metrics to understand a
machine learning model's performance, as well as its strengths and weaknesses.
12. Define predictive analytics.
= Predictive analytics is a branch of advanced analytics that makes predictions about future
outcomes using historical data combined with statistical modeling, data mining techniques
and machine learning.
13. What is the purpose of AUC and ROC curves?
= The AUC-ROC curve is a performance measurement for classification problems at
various threshold settings. ROC is a probability curve and AUC represents the degree or
measure of separability, i.e., how well the model can distinguish between classes.
14. Define baseline model.
= A baseline model is essentially a simple model that acts as a reference in a machine
learning project.
15. Define descriptive analytics.
= Descriptive analytics is the process of using current and historical data to identify trends
and relationships.
16. Define the terms metric and classifier.
= Metrics are the exact, quantitative numbers used to describe data and to measure a
model's performance. A classifier is a model that performs classification, i.e., predicting
the class labels given input data.
2. What is data analytics? Enlist its different roles. Also state its advantages and
disadvantages.
= 1) Data analytics (DA) is the process of examining data sets in order to find trends and
draw conclusions about the information they contain. 2) A business intelligence analyst's
primary job is to extract value from their company's data. At most companies, BI Analysts
need to be comfortable analyzing data, working with SQL, and creating data visualizations
and models.
Advantages: Data analytics helps businesses get real-time insights about sales,
marketing, finance, product development, and more. It allows teams within businesses to
collaborate and achieve better results. It is useful for businesses to analyze past business
performance and optimize future business processes. Disadvantages: Companies may
exchange these useful customer databases for their mutual benefit, which raises privacy
concerns. The cost of data analytics tools varies based on the applications and features
supported. Moreover, some data analytics tools are complex to use and require training.
5. Differentiate between data analysis and data analytics.
= Data Analysis: *It is described as a particularized form of data analytics. *It analyzes
data by focusing on insights into business data. *It uses tools such as Rapid Miner,
OpenRefine, NodeXL, KNIME, etc. *Descriptive analysis can be performed on it. *One
cannot find hidden (previously unknown) relationships with it. *It supports inferential
analysis.
Data Analytics: *It is described as the traditional or generic form of analytics. *It supports
decision making by analyzing enterprise data. *It uses tools to process data such as
Tableau, Python, Excel, etc. *Descriptive analysis cannot be performed on it. *One can find
hidden (previously unknown) relationships with it. *It does not deal with inferential
analysis.
9. Write a short note on: Mechanistic analytics.
= Mechanistic analysis measures the exact changes in variables that lead to changes in
other variables.
3. With the help of a diagram, describe the lifecycle of data analytics.
= Diagram: Discovery <-> Data Preparation <-> Model Planning <-> Model Building <->
Communicate Results -> Operationalize
1) Discovery: The data science team is trained and researches the issue. It creates context,
gains understanding, and learns about the data sources that are needed and accessible to
the project. The team comes up with an initial hypothesis, which can later be confirmed with
evidence.
2) Data Preparation: This phase explores the possibilities of pre-processing, analyzing, and
preparing data before analysis and modeling. It requires an analytic sandbox; the team
performs extract, load, and transform (ELT) steps to bring information into the sandbox.
Data preparation tasks can be repeated and need not follow a predetermined sequence.
Some of the tools commonly used for this process include Hadoop, Alpine Miner,
OpenRefine, etc.
3) Model Planning: The team studies the data to discover the connections between
variables, and then selects the most significant variables as well as the most effective
models. Some of the tools commonly used for this stage are MATLAB and STATISTICA.
4) Model Building: The team creates data sets for training, testing, and production use, and
builds and implements models based on the work completed in the model planning phase.
The team also evaluates whether its current tools are sufficient to run the models or
whether a more robust environment is required. Free or open-source tools: R and PL/R,
Octave, WEKA. Commercial tools: MATLAB, STATISTICA.
5) Communicate Results: Following the execution of the model, team members evaluate
the outcomes of the model to establish criteria for its success or failure. The team
considers how best to present findings and outcomes to the various members of the team
and other stakeholders, taking into consideration caveats and assumptions. The team
should determine the most important findings, quantify their value to the business, and
create a narrative to present and summarize the findings for all stakeholders.
6) Operationalize: The team distributes the benefits of the project to a wider audience. It
sets up a pilot project that deploys the work in a controlled manner before expanding to the
entire enterprise of users. This technique allows the team to gain insight into the
performance and constraints of the model within a production setting at a small scale and
make necessary adjustments before full deployment. The team produces the final reports,
presentations, and code. Open-source or free tools used include WEKA, SQL, MADlib, and
Octave.
4. Explain four layers in data analytics framework diagrammatically.
= Use Cases: A use case is the manner in which the business user leverages data and the
analytics system to derive insights that answer tangible business questions for decision
making.
Data Sets: A data set is a collection of related, discrete items of data that may be accessed
individually or in combination, or managed as a whole entity. A data set is organized into
some type of data structure.
Data Collection: The data collection framework (DCF) is an application built to collect data
from a number of source and inventory systems implemented across the enterprise.
Data Preparation: Data preparation is the process of cleaning and transforming raw data
prior to processing and analysis.
Learning and Intelligent Actions: A system that delivers trustworthy, reliable data, while
also providing intelligence about said data, or metadata.
6. What are the types of data analytics? Describe two of them in detail.
= There are four main types of analytics: descriptive, diagnostic, predictive, and
prescriptive.
Predictive analytics is a branch of advanced analytics that makes predictions about
future outcomes using historical data combined with statistical modeling, data mining
techniques and machine learning. Prescriptive analytics is the process of using data to
determine an optimal course of action. By considering all relevant factors, this type of
analysis yields recommendations for next steps. Prescriptive analytics can cut through the
clutter of immediate uncertainty and changing conditions. It can help prevent fraud, limit
risk, increase efficiency, meet business goals, and create more loyal customers.
8. What is exploratory analytics? What is its purpose? Explain with example.
= 1) Exploratory Data Analysis (EDA) is an approach to analyzing data using visual
techniques. 2) Its purpose is the critical process of performing initial investigations on data
so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions
with the help of summary statistics and graphical representations. 3) Example: You are
open to the fact that any
number of people might buy any number of different types of shoes. You visualize the data
using exploratory data analysis to find that most customers buy 1-3 different types of
shoes. Sneakers, dress shoes, and sandals seem to be the most popular ones.
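A minimal EDA sketch for the shoe example above (the purchase data and column names are hypothetical):

```python
import pandas as pd

# Hypothetical data: number of distinct shoe types bought per customer
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "shoe_types_bought": [1, 2, 3, 1, 2, 7],
})

# Summary statistics show the typical range (1-3) and expose the outlier
print(df["shoe_types_bought"].describe())

# A histogram gives the graphical view of the same distribution
df["shoe_types_bought"].hist(bins=7)
```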
10. What is a mathematical model? List its types. Explain two of them in detail.
= 1) A mathematical model is a mathematical relationship between one or several input
variables and parameters and one or several output variables; mathematical modeling is
used to analyze complex, large-scale systems and complex structured and unstructured
data sets, in particular financial data sets.
2) Types: Linear Algebra & Calculus, Statistics, Machine Learning/Statistical Models, and
Operations Research.
3) Linear algebra & calculus would be considered the most basic, especially given the
“Deep Learning” environment that we are in. Deep learning requires us to understand
linear algebra and calculus to understand how it works, for example forward propagation,
backward propagation, parameter setting, etc.
Statistics here means simple statistics such as measures of centrality, distributions and
different probability distributions (Weibull, Poisson, etc.), Bayes' Theorem (there is a strong
emphasis on it when it comes to learning about Artificial Intelligence later), hypothesis
testing, etc.
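As a worked example of the statistics listed above, a small sketch of Bayes' Theorem in Python (the spam-filter probabilities are hypothetical):

```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
# A = "email is spam", B = "email contains the word 'offer'"
p_a = 0.20              # P(spam)
p_b_given_a = 0.60      # P('offer' | spam)
p_b_given_not_a = 0.05  # P('offer' | not spam)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

print(p_b_given_a * p_a / p_b)  # P(spam | 'offer') = 0.75
```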
12. What is a baseline model? Enlist two of them in detail.
= 1) A baseline model is essentially a simple model that acts as a reference in a machine
learning project. Its main function is to contextualize the results of trained models. Baseline
models usually lack complexity and may have little predictive power. Regardless, their
inclusion is a necessity for many reasons.
There are three types of baseline models: random baseline models, ML baseline models,
and automated ML baseline models.
1) Random Baseline Models: In the real world, data cannot always be predicted. For such
problems, the best baseline model is a dummy classifier or dummy regressor. That
baseline shows you whether your ML model is learning or not (see the sketch below).
2) Automated ML Baseline Models: AutoML uses machine learning to analyze the structure
and meaning of text data. You can use AutoML to train an ML model to classify text data,
extract information, or understand the sentiment of authors.
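A minimal sketch of the random/dummy baseline using scikit-learn's DummyClassifier, with logistic regression standing in as the trained model (scikit-learn's built-in breast-cancer data set is used purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Dummy baseline: always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The trained model should clearly beat the baseline, otherwise it is not learning
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))
```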
19. Write a short note on: Evaluating value prediction models.
= To get the true value of a predictive model, you have to know how well your model fits
the data. Your model should also withstand changes in the data set, or being put through a
completely new data set. To start, you need to be clear about what business challenge the
model is helping to solve.
13. How to evaluate a model? Describe in detail.
= Model evaluation is the process of using different evaluation metrics to understand a
machine learning model's performance, as well as its strengths and weaknesses. Model
evaluation is important to assess the efficacy of a model during initial research phases, and
it also plays a role in model monitoring.
Evaluation is a process during development of the model to check whether the model is
the best fit for the given problem and the corresponding data. For example, Keras models
provide an evaluate() function that performs the evaluation of the model (see the sketch
below).
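A minimal sketch of evaluate() in Keras (the tiny model and the random data are hypothetical, only to show the call):

```python
import numpy as np
from tensorflow import keras

# Hypothetical data: 100 samples, 4 features, binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# A tiny model, just enough to demonstrate evaluate()
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)

# evaluate() returns the loss and the configured metrics for the given data
loss, accuracy = model.evaluate(X, y, verbose=0)
print(loss, accuracy)
```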
14. Write a short note on: Metrics for evaluating classifiers.
= Classification is about predicting the class labels given input data. In binary classification,
there are only two possible output classes (i.e., Dichotomy). In multiclass classification,
more than two possible classes can be present. Here we focus only on binary
classification. A very common example of binary classification is spam detection, where
the input data could
include the email text and metadata (sender, sending time), and the output label is
either “spam” or “not spam.” Sometimes, people use some other names also for the two
classes: “positive” and “negative,” or “class 1” and “class 0.”
17. What is ROC curve? How to implement it? Explain with example.
= 1) A Receiver Operator Characteristic (ROC) curve is a graphical plot used to show the
diagnostic ability of binary classifiers. It was first used in signal detection theory but is now
used in many other areas such as medicine, radiology, natural hazards and machine
learning.
2) Recipe (implementation steps): *Import the libraries (e.g., GridSearchCV). *Set up the
data. *Split the data and train the model. *Use the model on the test dataset. *Create false
and true positive rates and print the scores. *Plot the ROC curves (see the sketch below).
3) ROC curves are frequently used to show in a graphical way the connection/trade-off
between clinical sensitivity and specificity for every possible cut-off for a test or a
combination of tests. In addition, the area under the ROC curve gives an idea about the
benefit of using the test(s) in question.
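A minimal sketch of the recipe above (scikit-learn's built-in breast-cancer data set stands in for the data; the GridSearchCV tuning step is omitted for brevity):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Set up and split the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the model and score the test data set
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Create false and true positive rates at every threshold, and print the AUC
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))

# Plot the ROC curve against the chance diagonal
plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```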
16. Define accuracy, precision, recall and f-score.
= 1. Accuracy — Accuracy is the proportion of all predictions that the model gets right: the
number of correct predictions divided by the total number of predictions.
2. Precision — Precision explains how many of the cases predicted as positive actually
turned out to be positive. Precision is useful in cases where a False Positive is a higher
concern than False Negatives. The importance of Precision is in music or video
recommendation systems, e-commerce websites, etc. where wrong results could lead to
customer churn and this could be harmful to the business.
Precision for a label is defined as the number of true positives divided by the
number of predicted positives.
3. Recall (Sensitivity) — Recall explains how many of the actual positive cases we were
able to predict correctly with our model. It is a useful metric in cases where False Negative
is of higher concern than False Positive. It is important in medical cases where it doesn’t
matter whether we raise a false alarm but the actual positive cases should not go
undetected!
Recall for a label is defined as the number of true positives divided by the total
number of actual positives.
4. F1 Score — It gives a combined idea about the Precision and Recall metrics. It is
maximum when Precision is equal to Recall.
F1 Score is the harmonic mean of precision and recall.
The F1 score punishes extreme values more. F1 Score could be an effective evaluation
metric in the following cases: 1) When FP and FN are equally costly. 2) When adding more
data doesn't effectively change the outcome. 3) When True Negatives are high.
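In symbols, with TP, FP, FN, and TN denoting the true/false positive/negative counts from the confusion matrix, the metrics above are:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]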
5. AUC-ROC — The Receiver Operator Characteristic (ROC) is a probability curve that plots
the TPR (True Positive Rate) against the FPR (False Positive Rate) at various threshold
values and separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) is the
measure of the ability of a classifier to distinguish between classes. Graphically, the AUC
is the area enclosed between the ROC curve and the X-axis.