Sample Minor Project Report
WINE QUALITY ANALYSIS
A PROJECT REPORT
Submitted by
M.TECH (Integrated)
COMPUTER SCIENCE WITH SPECIALIZATION
IN DATA SCIENCE
NOVEMBER 2022
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR-603203
BONAFIDE CERTIFICATE
Certified that this project report titled “Wine Quality Analysis” is the bonafide work
of “Animesh Raj [Reg No: RA2112704010005]” who carried out the project work
under my supervision. Certified further, that to the best of my knowledge the work
reported herein does not form part of any other thesis or dissertation on the basis of
which a degree or award was conferred on an earlier occasion for this or any other
candidate.
ABSTRACT
This project addresses the problem of wine quality analysis. The dataset relates to the
red variants of the Portuguese "Vinho Verde" wine. It describes the amounts of various
chemicals present in the wine and their effect on its quality. The dataset can be treated
as either a classification or a regression task. The classes are ordered and not balanced
(e.g. there are many more normal wines than excellent or poor ones). The aim of this
project is to predict the quality of wine using the given data.
ACKNOWLEDGEMENTS
We extend our sincere thanks to Dean-CET, SRM Institute of Science and Technology, Dr
T.V.Gopal, for his invaluable support.
We register our immeasurable thanks to our Faculty Advisor, Shantha Kumari, Assistant
Professor, Department of Data Science and Business Systems, SRM Institute of Science and
Technology, for leading and helping us to complete our course.
We offer our inexpressible respect and thanks to our guide, Dr. A.V.Kalpana, Assistant Professor,
Department of Data Science and Business Systems, for providing us with an opportunity to
pursue our project under his mentorship. He provided us with the freedom and support to
explore the research topics of our interest. His passion for solving problems and making a
difference in the world has always been inspiring.
We sincerely thank the staff and students of the Department of Data Science and Business
Systems, SRM Institute of Science and Technology, for their help during our project. Finally,
we would like to thank our parents, family members, and friends for their unconditional love,
constant support, and encouragement.
Animesh Raj
TABLE OF CONTENTS
CHAPTER NO. TITLE
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF SYMBOLS, ABBREVIATIONS
1 INTRODUCTION
2 LITERATURE REVIEW
3 DATA WRANGLING AND UNDERSTANDING
4 MACHINE LEARNING
5 EXPLORATORY DATA ANALYSIS
6 MODEL DEVELOPMENT/ CODE
7 CONCLUSION
8 REFERENCES
LIST OF FIGURES
3.0 Distribution of data
5.1 Confusion Matrix
5.2 Confusion Matrix
5.3 SVM Diagram
5.3 Confusion Matrix
7.1 Confusion Matrix of LR
7.2 Confusion Matrix of SVM
7.3 Confusion Matrix of BernoulliNB
ABBREVIATIONS
AI Artificial Intelligence
IOT Internet Of Things
GUI Graphical User Interface
URL Uniform Resource Locator
NB Naïve Bayes
LIST OF SYMBOLS
^ Conjunction
CHAPTER 1
INTRODUCTION
This project on the analysis of wine quality data comes under the domains of “Pattern
Classification” and “Data Mining”. The two terms are closely related and intertwined, and
both can be formally defined as the process of discovering “useful” patterns in large sets
of data, either automatically (unsupervised) or semi-automatically (supervised). The project
relies on exploratory data analysis to extract significant patterns and features from the
wine quality data set, and on “Machine Learning” techniques to accurately classify individual
unlabelled data samples according to whichever pattern model best describes them.
We consider a set of observations on a number of red and white wine varieties, covering their
chemical properties and their ranking by tasters. The wine industry has shown a recent growth
spurt as social drinking is on the rise. The price of wine depends to some extent on the rather
abstract concept of wine appreciation by wine tasters, whose opinions may vary widely.
Another key factor in wine certification and quality assessment is physicochemical testing,
which is laboratory-based and takes into account factors such as acidity, pH level, the
presence of sugar and other chemical properties. For the wine market, it would be of interest
if the human perception of quality could be related to the chemical properties of wine, so that
the certification, quality assessment and quality assurance processes are more controlled.
Data mining covers the commonly used techniques and applications in this field, including
learning techniques developed in fields other than statistics, e.g. machine learning. Data
mining refers to a set of methods applicable to large and complex databases to eliminate
randomness and discover hidden patterns. Data mining methods are almost always
computationally intensive. Data mining is about tools, methodologies, and theories for
revealing patterns in data, which is a critical step in knowledge discovery. Several driving
forces explain why data mining has become such an important area of study.
Classification techniques can also be divided along two lines: supervised vs. unsupervised,
and non-adaptive vs. adaptive/reinforcement techniques. A supervised approach is one in
which pre-labelled data samples are available and are used to train the classifier. Training
the classifier means using the pre-labelled samples to extract features that best model the
patterns and differences between the individual classes, and then classifying an unlabelled
data sample according to whichever pattern best describes it. Unsupervised classification is
when no labelled data is available for training. In addition, adaptive classification
techniques deal with feedback from the environment. There are two further types of adaptive
techniques: passive and active. Passive techniques use the feedback only to learn about the
environment, without applying this improved knowledge to the current classification
algorithm, while active techniques continuously change the classification algorithm
according to what they learn in real time.
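The difference between the supervised and unsupervised approaches described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data points are synthetic, and the choice of a k-nearest-neighbours classifier and k-means clustering from scikit-learn is ours, not a method taken from this report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, made-up 2-D samples forming two well-separated groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 20 + [1] * 20)  # labels, used only by the supervised model

# Supervised: the classifier is trained on the pre-labelled samples ...
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# ... and then assigns a label to a new, unlabelled sample.
new_sample = np.array([[4.8, 5.1]])
predicted_label = clf.predict(new_sample)[0]

# Unsupervised: the clustering model never sees y; it only groups similar samples.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print("predicted label:", predicted_label)
print("cluster ids:", cluster_ids)
```

The cluster ids produced by the unsupervised model are arbitrary group numbers; unlike the supervised labels, they carry no meaning until someone interprets the groups.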
CHAPTER 2
LITERATURE REVIEW
Different kinds of wine have existed since the early days of mankind, and it has become
important to know the quality of a wine before consuming it. In the last few decades, the
food industry has grown enormously, and so have food quality analysis and its “rating”
process. Cases in which a consumer falls sick after consuming low-quality food are common,
so quality analysis of a product before it is sold has become a necessary evil; “evil” because
it adds extra cost to the production of the final product. Similarly, it is necessary to
analyse the quality of wine. Different methods have been used to determine wine quality,
but it is often unclear which method to rely on. This paper focuses on a comparative study
of different classification algorithms for wine quality analysis, namely SVM, random forest
and multilayer perceptron, to determine which of these algorithms gives the most accurate
result.
Certification and quality assessment are crucial issues within the wine industry. Currently,
wine quality is mostly assessed by physicochemical (e.g. alcohol levels) and sensory (e.g.
human expert evaluation) tests. In this paper, we propose a data mining approach to predict
wine preferences that is based on easily available analytical tests at the certification step. A
large dataset is considered with white vinho verde samples from the Minho region of
Portugal. Wine quality is modeled under a regression approach, which preserves the order of
the grades. Explanatory knowledge is given in terms of a sensitivity analysis, which measures
the response changes when a given input variable is varied through its domain. Three
regression techniques were applied, under a computationally efficient procedure that
performs simultaneous variable and model selection and that is guided by the sensitivity
analysis. The support vector machine achieved promising results, outperforming the multiple
regression and neural network methods. Such a model is useful for understanding how
physicochemical tests affect the sensory preferences. Moreover, it can support the wine
expert evaluations and ultimately improve the production.
[3] Selection of important features and predicting wine quality using machine learning
techniques
Nowadays, industries use product quality certifications to promote their products. This
is a time-consuming process that requires assessment by human experts, which makes it
very expensive. This paper explores the use of machine learning techniques such as linear
regression, neural networks and support vector machines for product quality in two ways:
first, to determine the dependency of the target variable on the independent variables, and
second, to predict the value of the target variable. Linear regression is used to determine
the dependency of the target variable on the independent variables. On the basis of the
computed dependency, the variables that have a significant impact on the dependent
variable are selected. Neural networks and support vector machines are then used to predict
the values of the dependent variable. All the experiments are performed on the Red Wine
and White Wine datasets. The paper shows that better predictions can be made when only
the selected features (variables) are considered rather than all the features.
The use of correlation coefficients, F-statistics and LSDs was described for measuring judge
performance in terms of agreement, reliability, discrimination, stability, and variability. The
technique was applied to a number of wine-quality evaluation experiments. It was shown that
a single analysis of variance of all judges' scores in an experiment will often be inappropriate.
Further, it was demonstrated that judges' performances varied over time. It was, therefore,
recommended that each judge's performance be monitored continually and that when judges
were unreliable and nondiscriminating, they would be ignored in drawing conclusions from
the experiment. The analysis for wine differences should be based on a separate analysis of
each set of homogeneous judges as determined from the measures for agreement, reliability,
and variability.
[5] Analysis of Italian Wine Quality Using Freely Available Meteorological Information
Meteorological conditions strongly affect viticultural activity, modifying grapevine (Vitis vinifera)
responses and determining the quality and quantity of production. The analysis of meteorological
information can provide viticulturists with operational and forecasting tools for improving the management
of vineyards. Meteorological information is presently available on Internet sites with different spatial and
temporal scales, allowing easy access and overcoming the necessity of installing costly weather station
networks. The present research was performed for the purpose of analyzing the relationship between
meteorological information freely available on the Internet and the average quality (defined by vintage ratings)
of Italian wine. Temperature and precipitation data were analyzed. The presence of teleconnections and
their effect on quality was investigated by considering 500 hPa geopotential height, sea surface
temperature, and meteorological indices such as North Atlantic Oscillation and Southern Oscillation.
Results highlight strong relationships between meteorological conditions and wine quality. Higher-quality
wines were obtained in the years characterized by a reduction in rainfall and high temperature patterns.
Teleconnections were also detected in different periods of the growing season, thus allowing for the
development of wine-quality forecasting tools. Results could aid in the evaluation of operations concerning
the analysis and forecasting of wine quality.
[6] The Effect of Weather on Wine Quality and Prices: An Australian Spatial Analysis
In the context of the important implications of climate change, this paper analyzes the impact
of weather on wine quality and prices for Australian premium wines. Motivated by a
recognition of consumers’ accessed information sets, the impact of temperature and rainfall
on retail wine prices is assessed through their relation with quality ratings from a high-profile
wine guide and then on prices. For a broad spectrum of different quality wines from a cross
section of wines available in 2014 and a separate analysis of eight wine varieties, the indirect
approach to modeling weather effects through wine quality is found to be superior to
assuming that weather impacts retail prices directly. The results also demonstrate the
importance of regional variations in weather conditions in influencing prices and identify the
optimal growing season temperatures for different grape varieties.
CHAPTER 3
DATA WRANGLING AND UNDERSTANDING
This stage of the process is where the data listed in the project resources is acquired. The
methods used to acquire the data are described, and any problems encountered are recorded
together with the resolutions achieved. The initial collection includes extraction details
and source details; the data is subsequently loaded into Python and analysed in an
environment such as Jupyter Notebook, Kaggle or Google Colab.
DATA EXTRACTION:
The acquired data is described in terms of its format, its quantity (for example, the number
of records and fields in each table), the identities of the fields and any other surface
features that have been discovered. The data is then evaluated to check whether it satisfies
the requirements.
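As a minimal sketch of this step, the snippet below loads a small table with pandas and reports its quantity (records and fields) and field names. The rows are made up for illustration and are not taken from the project's dataset; the column names and the `;` separator are assumptions about what a wine quality CSV file might look like.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
# Column names and the ';' separator are assumptions about the file layout.
sample_csv = io.StringIO(
    "fixed acidity;residual sugar;pH;alcohol;quality\n"
    "7.0;2.0;3.40;9.5;5\n"
    "8.1;1.8;3.20;10.2;6\n"
    "6.5;2.5;3.55;11.0;6\n"
    "7.8;2.1;3.30;9.8;5\n"
)

# In practice the in-memory buffer would be replaced by the path to the data file.
df = pd.read_csv(sample_csv, sep=";")

n_records, n_fields = df.shape
print(f"{n_records} records, {n_fields} fields")
print("fields:", list(df.columns))
print(df.head())
```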
CHAPTER 4
MACHINE LEARNING
Here we predict the quality of wine on the basis of the given features. We use a wine
quality dataset that is freely available on the Internet. This dataset contains the
fundamental features that affect the quality of the wine. Using several machine learning
models, we predict the quality of the wine.
Let’s explore the data type of each column in the dataset.
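One way to inspect the column types, assuming the data has been loaded into a pandas DataFrame, is sketched below. The small DataFrame is a made-up stand-in for the real dataset and its column names are assumptions.

```python
import pandas as pd

# Made-up stand-in for the wine dataset; column names are assumptions.
df = pd.DataFrame({
    "fixed acidity": [7.0, 8.1, 6.5],
    "pH": [3.40, 3.20, 3.55],
    "quality": [5, 6, 6],
})

# dtypes gives one data type per column.
column_types = df.dtypes
print(column_types)

# info() additionally prints non-null counts and memory usage.
df.info()
```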
CHAPTER 5
EXPLORATORY DATA ANALYSIS
EDA is an approach to analyzing the data using visual techniques. It is used to discover
trends, and patterns, or to check assumptions with the help of statistical summaries and
graphical representations.
Now let’s check the number of null values in the dataset, column-wise.
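A minimal sketch of the column-wise null check with pandas follows. The DataFrame is made up, with one missing value placed deliberately so that the count is visible; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data with one deliberately missing value; column names are assumptions.
df = pd.DataFrame({
    "residual sugar": [2.0, np.nan, 2.5, 1.8],
    "pH": [3.40, 3.20, 3.55, 3.30],
    "quality": [5, 6, 6, 5],
})

# isnull() marks each missing cell; sum() then counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```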
Let’s draw the histogram to visualise the distribution of the data with continuous values in the
columns of the dataset.
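Drawing the histogram itself needs a plotting library (pandas offers `DataFrame.hist`, which relies on matplotlib). As a dependency-light sketch, the bin counts that such a histogram would display can be computed with NumPy. The alcohol values below are made up and the bin range is chosen only for illustration.

```python
import numpy as np

# Made-up alcohol values for illustration only.
alcohol = np.array([9.4, 9.8, 9.8, 10.1, 10.5, 11.2, 12.1, 12.8])

# Four equal-width bins between 9 and 13: [9, 10), [10, 11), [11, 12), [12, 13].
counts, bin_edges = np.histogram(alcohol, bins=4, range=(9.0, 13.0))

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    # A crude text rendering of the bars a plotted histogram would show.
    print(f"{left:4.1f} to {right:4.1f}: {'#' * count}")
```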
Univariate Analysis: “Uni” means one and “variate” means variable; hence univariate analysis
means the analysis of one variable or one feature. Univariate analysis tells us how the data in
each feature is distributed and also tells us about central tendencies such as the mean,
median, and mode.
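The central tendencies mentioned above can be computed for a single feature as sketched here. The quality scores are made up for illustration.

```python
import pandas as pd

# Made-up quality scores for a single feature.
quality = pd.Series([5, 5, 6, 6, 6, 7, 4, 5, 6, 8], name="quality")

mean_quality = quality.mean()
median_quality = quality.median()
# mode() returns a Series because several values can tie; take the first.
mode_quality = quality.mode().iloc[0]

print("mean:", mean_quality)
print("median:", median_quality)
print("mode:", mode_quality)

# value_counts() shows how the samples are distributed over the scores.
print(quality.value_counts().sort_index())
```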
Bivariate Analysis: Bivariate analysis is used to find the relationship between two variables.
It can be performed for any combination of categorical and continuous variables. A scatter
plot is suitable for analyzing two continuous variables: it indicates whether the relationship
between the variables is linear or non-linear.
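Alongside a scatter plot, the strength of a linear relationship between two continuous variables can be summarised numerically with the Pearson correlation coefficient. The sketch below uses made-up values, and the column names are assumptions.

```python
import pandas as pd

# Made-up values for two continuous variables; column names are assumptions.
df = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
    "density": [0.9980, 0.9972, 0.9974, 0.9960, 0.9951],
})

# Pearson correlation: close to +1 or -1 suggests a strong linear relationship,
# close to 0 suggests little or no linear relationship.
r = df["alcohol"].corr(df["density"])
print("Pearson r:", round(r, 3))

# The full pairwise correlation matrix for all numeric columns.
print(df.corr())
```

A coefficient near zero does not rule out a non-linear relationship, which is why the scatter plot remains useful.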
Comparison of, or relationship between, two or more variables.
CHAPTER 6
MODEL DEVELOPMENT/ CODE
6.1 Algorithm
• Step 4: Preprocessing the Data using Stemming, Lemmatization and removing Stop
words
Let’s prepare our data for training by splitting it into training and validation sets, so that
we can select the model whose performance is best for the use case. We will train several
widely used machine learning classification models and then select the best of them using
the validation data.
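The split-train-compare procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic (generated with scikit-learn's `make_classification`), and the three models and their settings are our choice of examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features and a binary quality label.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

# Hold out 20% of the samples for validation, keeping the class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate models; scaling is added for the two scale-sensitive ones.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training set and score it on the validation set.
val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: validation accuracy = {val_accuracy[name]:.3f}")

# Select the model with the highest validation accuracy and inspect its errors.
best_name = max(val_accuracy, key=val_accuracy.get)
best_cm = confusion_matrix(y_val, models[best_name].predict(X_val))
print("best model:", best_name)
print(best_cm)
```

Because the wine quality classes are not balanced, accuracy alone can be misleading; the confusion matrix shows which classes the selected model confuses.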
CHAPTER 7
CONCLUSION
In this project we demonstrated a basic approach to classifying wine quality, one that can be
built on to produce better results. We found that the random forest model gives better
predictions than the other classification techniques considered.
CHAPTER 8
REFERENCES
[1] Ashenfelter, O. (2008). Predicting the quality and prices of Bordeaux wine. Economic
Journal, 118, F174–F184.
[2] Ashenfelter, O., and Storchmann, K. (2016). Climate change and wine: A review of the
economic implications. Journal of Wine Economics, 11(1), 105–138.
[3] Byron, R.P., and Ashenfelter, O. (1995). Predicting the quality of an unborn Grange.
Economic Record, 71(212), 400–414.
[4] Cardebat, J.-M., Figuet, J.-M., and Paroissien, E. (2014). Expert opinion and Bordeaux
wine prices: an attempt to correct biases in subjective judgments. Journal of Wine
Economics, 9(3), 282–303.
[5] Chevet, J.-M., Lecocq, S., and Visser, M. (2011). Climate, grapevine phenology, wine
production and prices: Pauillac (1800–2009). American Economic Review: Papers and
Proceedings, 101, 142–146.