Mining Educational Data To Analyze Students' Performance A Case Study of Mawuli School, Ho
ISSN No:-2456-2165
Analyzing data gathered from scholastic establishments through the use of DM systems to solve challenges associated with research in education is termed EDM. The purpose of undertaking EDM is to have a better knowledge of students and their place of learning. Therefore, EDM’s center of attention is the gathering, storing and interpretation of facts and figures from students’ studies and evaluation.

Various methods are applied in undertaking EDM, which include Naïve Bayes, Decision Trees, Nearest neighbor, Neural Networks, K- Regression, Correlation, etc.

Employing methods like classification, clustering and association rules can uncover a lot of information that would be useful in predicting student enrolment for a given course and in detecting electronic exam malpractices, errors in student reports, the performance of students and many more. From the perspective of learning, standardized methods for measuring the abilities and qualities of the human mind, which lack precision, can be replaced by EDM. At present, a good deal of EDM study is well accommodated in the environment of Intelligent Tutoring Systems (ITS) because the greater number of researchers in the EDM domain focus their attention on mining data originating from ITS.

EDM searches for a better understanding of the learning process and the participation of students in it, always seeking quality improvement and cost-effectiveness of the education system. Educational Data Mining has the following objectives (Romero and Ventura, 2007):
Pedagogical: to help in the design of didactic contents and the promotion of learners’ scholastic accomplishment;
Managerial: optimization of the establishment and preservation of structures of training;
Commercial: enrollment of students in institutes of learning that are owned by individuals or nongovernmental groups in particular.

A. Statement of Problem
In Ghana’s educational system, students’ performance in the senior high school is defined by summing up the marks obtained from an examination set by an independent body and the marks accumulated from terminal exams and class work. A student is successful when he or she excels above the minimum academic requirements both in school and in WASSCE.

In spite of this situation, stakeholders always expect high performance even though the stock of candidates presented for WASSCE may not be the best, and most often results of poor performance hit them by surprise. The perception among some stakeholders is that the stock of students being admitted through the Computerized School Selection and Placement System (CSSPS) is, to a large extent, affecting performance negatively.

B. Purpose of the Study
The study intended to gain insight into students’ performance through knowledge discovery in order to establish a predictive system capable of predicting students’ grades in WASSCE, so as to give an overview of the general performance of students and better inform management and other stakeholders with regard to their decisions and expectations. This will also help identify how weak or high students’ grades may be. These projected grades will help determine the amount of remedial effort that needs to be expended by teachers to help students improve their performance before their final examination and so reduce the failure rate.

C. Objectives of the Study
1. Analyze the performance of students in the subjects Integrated Science, Mathematics, English Language, and Social Studies, using linear regression.
2. Verify the extent to which the grade of students in Basic Education Certificate Examination (BECE) is connected to the grade of students in Senior High School (SHS).
3. Verify the level of connection between first and second year performances of students in all core subjects.
4. Construct a predictive model for projecting students’ final grade in all core subjects (Integrated Science, Mathematics, English Language, and Social Studies) using linear regression analysis.

D. Research Questions
1. Do all students perform above average (50%) in Integrated Science, Mathematics, English Language, and Social Studies?
2. Do students’ grades in junior high school exam (BECE) determine their performance in senior high school and to what extent?
3. Do students’ terminal performance at SHS determine their final WASSCE grades and to what extent?
Higher educational establishments seek to offer their students education of the highest quality and to advance their decision-making ability. This can be achieved by unearthing information from their academic data to evaluate the issues that influence learners’ scholastic accomplishment. The extracted knowledge is beneficial in contributing to decision making at the administrative level, as it is a productive reference for decision makers. Some benefits it provides include improving the academic output of students, reducing poor performance, a good comprehension of student conduct and improving the overall educational process (Chadha and Kumar, 2011). Educational data mining utilizes numerous methods in discovering data, employing techniques such as association rules, clustering and classification. These include decision trees, rule induction, neural networks and so on (Romero and Ventura, 2007).

DM has been employed in various organizations, in defense as well as in commerce; however, it has been underutilized in the area of education (Malik and Ranjan, 2007). The story is different today, however, as data mining has seen significant application in the field of education. This buttresses the point made by Chakrabarti and others that in recent times data mining is greatly considered in both education and business (Chakrabarti et al., 2009).

DM in education is an aspect of evaluating and applying DM in solving challenges pertaining to learners and the settings in which they learn (educationally-related problems). Modern techniques in unearthing designs as well as developments in diverse sets of information in education
The general means of discovering knowledge by the use of data mining techniques involves an interactive order as follows: i) a preprocessing step, which involves data cleaning and amalgamation; ii) a modeling step, which involves selecting data, transforming it, structuring, trend valuation, and information presentation (Han and Kamber, 2006). Data is analyzed for the purpose of removing noise, missing values and inconsistencies before the processing stage. The resulting preprocessed data is deposited in data storage bases. Several approaches, for example pattern recognition, ‘dig out’ and assess configurations in data during the modeling step. The relevant trends discovered could be delivered to customers through visualization and presentation methods (Chakrabarti et al., 2009).

Building predictive or inference models whose aim is to forecast or project future tendencies or behaviors, based on the examination of prearranged data, is a function of data mining (Han and Kamber, 2001). Prediction as used contextually means the building of structures that are used to evaluate the kind of an unidentified instance. Alternatively, it means to examine or analyze the estimate or value ranges of a character trait that a given instance probably has. The two principal forecasting techniques are classification, which is used to predict distinct values, and regression, which is used to predict ordered data. Classification predicts discrete or nominal values, whereas regression projects continuous values (Larose, 2006).
Educational data mining (also referred to as “EDM”) is defined as the area of scientific inquiry centered on the development of methods for making discoveries within the unique kinds of data that come from educational settings, and using those methods to better understand students and the settings in which they learn.

Forecasting focuses on building a prototype for making inference on a distinct portion of the data, referred to as the predicted variable, based on the integration of another portion of the data, known as the predictor variables. In prediction, it is required to have labels for the variable whose output is being determined for a restricted set of data. The

It is widely accepted that research methodology involves how to identify the outcome of a particular problem or issue under study. Researchers apply diverse measures in tackling research problems from varying sources. Therefore, how a solution is obtained, or the process of searching for a solution to a research problem, is termed methodology (Industrial Research Institute, 2010).

The methodology by which a research is undertaken is orderly and organized, and it is a discipline concerned with the ways research is undertaken. Importantly, how a researcher describes the work, explains and predicts trends
The main object of carrying out research is to arrive at an answer; hence the information gathered as well as the undetermined parameters of the problem at hand should coordinate, to increase the possibility of achieving a solution. With respect to this, there are 3 categories for grouping methods of research:

i. Methods associated with collecting information.
ii. Techniques that relate the available information with the undetermined factors, usually called statistical methods.
iii. Evaluating methods employed to analyze how accurate the results are.

B. Research Paradigms
Traditionally, approaches to research are divided into two main paradigms, namely, the Qualitative/Interpretivist and the Quantitative/Positivist approach.

Interpretive Philosophy: It is of the belief that the societal domain of management and commerce is too sophisticated to be captured by theories and laws in the way the natural sciences are. The aspect of positivism philosophy that deals with extensive thinking is interpretive philosophy. It is of the view that there exist numerous suitable solutions to every research issue from simple statistics (Johnson and Christensen, 2010).

Interpretive Philosophy has a vital function in obtaining outcomes from information gathered. With Interpretive Philosophy, researchers do not necessarily interrelate with their surroundings but try to understand them by making judgments and making sense out of them. An individual’s character can be influenced by differences in standards of living, different societal and traditional settings, character differences, family inclination, etc. (Saunders, 2003).
The interpretation of the relationship between the variables (BECE grades and SHS grades) was based on the premise that the closer the data points come, when plotted, to making a straight line, the higher the correlation between the two variables, or the stronger the relationship. If the data points make a straight line going from the origin out to high x- and y-values, then the variables are said to have a positive correlation. If the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation.

Data collected on students’ academic performance in all core subjects was summarized into average marks. Two sets of such average scores, one set for first year and another set for second year, were used as the independent and dependent variables respectively in the regression equation to build a model that would predict students’ final mark/grade.

6) Regression Analysis
Regression was used to fit an equation to the dataset. This data mining technique/statistical tool was used for the investigation of relationships between the variables. It was to establish a causal relationship between the dependent or outcome variable (final scores) and the predictors (second

For the purpose of this research, x is the marks obtained by students in first year, y is the marks obtained by students in second year, n is the sample size (180) and Y is the predicted mark. A prediction model was then formed for the four core subjects studied by students. The computation that went into the formation of each model is presented as follows:

9) Prediction Model for Students’ Final Marks in Mathematics
The y-intercept is given as 16.278 and the slope is given as 0.532 in table 4.2.

The slope and the y-intercept were substituted in the following linear equation to predict students’ final score in Mathematics: Y = aX + b. In this case the values of a, b, X, and Y were as follows:

a = 0.532
b = 16.278
X = First Year Score
Y = Second Year Score
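As an illustration of how the Mathematics model Y = aX + b is applied, the following minimal sketch substitutes the slope (0.532) and y-intercept (16.278) reported in the text. The function name `predict_mark` and the sample inputs are hypothetical and are not taken from the study's data.

```python
# Coefficients reported in the text for the Mathematics model (table 4.2).
SLOPE_A = 0.532        # a, the slope
INTERCEPT_B = 16.278   # b, the y-intercept


def predict_mark(first_year_score: float,
                 a: float = SLOPE_A,
                 b: float = INTERCEPT_B) -> float:
    """Apply the linear model Y = a*X + b to a single first-year score."""
    return a * first_year_score + b


if __name__ == "__main__":
    # Illustrative inputs only; these are not marks from the study.
    for score in (40, 60, 80):
        print(f"X = {score}: predicted Y = {predict_mark(score):.2f}")
```

Passing different values for `a` and `b` lets the same function serve the models for the other core subjects.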
The regression equation Final Year Score = 0.465 * Second Year Score + 30.267 was used to project students’ final WASSCE scores in English Language.

11) Prediction Model for Students’ Final Marks in Integrated Science
The y-intercept is given as 2.445 and the slope is given as 0.865 in table 4.5. The slope and the y-intercept were substituted in the following linear equation to predict students’ final score in Integrated Science: Y = aX + b. In this case the values of a, b, X, and Y will be as follows:

a = 0.865
b = 2.445
X = First Year Score
Y = Second Year Score

The computation of the slope coefficient (labelled b in this algorithm) is explained by the algorithm below:
1. Sum the values of x
2. Sum the values of y
3. Multiply the values of x and y and find the sum
4. Multiply the result in 3 by the sample size n
5. Multiply the result in 1 by the result in 2
6. Find the squares of the values of x and sum them
7. Multiply the result in 6 by the sample size n
8. Find the square of the result in 1
9. Subtract the result in 5 from the result in 4
10. Subtract the result in 8 from the result in 7
11. Divide the result in 9 by the result in 10
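The eleven-step coefficient algorithm can be sketched in Python as follows. This is a minimal illustration rather than the study's actual implementation. Step 10 here subtracts the result of step 8 from the result of step 7, giving n∑x² − (∑x)², the standard least-squares denominator. The `intercept` helper is an added assumption (the usual companion formula) and is not part of the listed steps.

```python
from typing import Sequence


def slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope, following the eleven listed steps."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x = sum(x)                                    # step 1
    sum_y = sum(y)                                    # step 2
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))     # step 3
    n_sum_xy = n * sum_xy                             # step 4
    sum_x_times_sum_y = sum_x * sum_y                 # step 5
    sum_x_squared = sum(xi * xi for xi in x)          # step 6
    n_sum_x_squared = n * sum_x_squared               # step 7
    square_of_sum_x = sum_x ** 2                      # step 8
    numerator = n_sum_xy - sum_x_times_sum_y          # step 9
    denominator = n_sum_x_squared - square_of_sum_x   # step 10
    if denominator == 0:
        raise ValueError("all x values are identical; slope is undefined")
    return numerator / denominator                    # step 11


def intercept(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed companion formula: (sum(y) - slope * sum(x)) / n."""
    return (sum(y) - slope(x, y) * sum(x)) / len(x)
```

For example, the points (1, 3), (2, 5), (3, 7), (4, 9) lie exactly on the line y = 2x + 1, so `slope` returns 2.0 and `intercept` returns 1.0.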
14) Correlation Analysis
The relationship between the marks obtained by students in Basic Education Certificate Examination (BECE) and the marks of students at the Senior High School (SHS) level was studied using correlation analysis. Correlation as a tool was used to study and measure the extent of the relationship between the two variables.

15) Coefficient of Correlation
One of the statistics used in the analysis of students’ data is the coefficient of correlation, ‘r’, which measures the degree of association between the two values of related variables given in the data set, that is, the degree of association between the marks obtained by students in BECE and the marks of students at the SHS. It takes values from +1 to -1. If two sets of data have r = +1, they are said to be perfectly correlated positively; if r = -1 they are said to be perfectly correlated negatively; and if r = 0 they are uncorrelated.
The coefficient of correlation, ‘r’, is given by the formula

$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^{2} - \left(\sum x\right)^{2}\right)\left(n \sum y^{2} - \left(\sum y\right)^{2}\right)}} $$

In the formula above, n is the sample size (180), x is the marks obtained by students in BECE, y is the marks obtained by students in SHS, and r is the coefficient that determines the degree of association between students’ performance in BECE and SHS.
The following algorithm explains how the coefficient (r) is calculated:
1. Multiply the values of x and y and sum them
2. Sum the values of x
3. Sum the values of y

All the methods used by a researcher during a research study are termed research methods. They are essentially planned, scientific and value-neutral. They include theoretical procedures, experimental studies, numerical schemes, statistical approaches, etc. Research methods help us collect sample data and find a solution to a problem.

According to Johnson and Christensen (2005), a research paradigm is a perspective that is based on a set of shared assumptions, values, concepts and practices. In other words, a paradigm can be defined as a function of how the researcher thinks about the development of knowledge. A research paradigm is a combination of two ideas that are related to the nature of the world and the function of the researcher. It helps the researcher to conduct the study in an effective manner. Traditionally, approaches to research are divided into two main paradigms, namely, the Qualitative/Interpretivist and the Quantitative/Positivist approach. This thesis therefore adopts the Quantitative/Positivism research paradigm, in which the researcher made use of a quantitative research approach by applying statistical analysis to numerical data (students’ marks) that resulted in various parameters, which were interpreted and generalized for the entire year group. This research paradigm is considered the most appropriate because the study is based on the measurement of quantity or amount. The research process is expressed in terms of one or more quantities, and hence the result of this research is a set of numbers.

A research design represents a plan, structure, and strategy of investigation conceived so as to obtain answers to research questions and to control variance. In order to achieve the primary objective of the study, which is to predict the performance of students in WASSCE, a cross
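As a sketch of the coefficient of correlation defined under Coefficient of Correlation, the function below computes r from paired marks using the sums that appear in the formula (∑x, ∑y, ∑xy, ∑x², ∑y²). The function name `pearson_r` and the sample lists are hypothetical and do not come from the study's data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Coefficient of correlation r computed from the raw sums."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    product = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    if product <= 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / math.sqrt(product)


if __name__ == "__main__":
    # Illustrative marks only; these are not BECE or SHS data.
    bece = [1, 2, 3, 4]
    shs = [2, 4, 6, 8]
    print(pearson_r(bece, shs))
```

A value near +1 or -1 indicates that the plotted points lie close to a straight line, and a value near 0 indicates no linear association.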
IV. DATA ANALYSIS AND RESULTS

The data was analyzed and the results presented in formulas, tables and charts. Various interpretations of the result follow every table and chart. Each result was examined in terms of the objectives and the research questions of the study.

A. Regression Analysis of Students' Marks in Mathematics
The following tables present results of simple regression. R Square (.311) indicates that this model accounts for 31% of total variation in the data; the proportion of the variation that is explained by the model. R (0.558) is the absolute value of correlation coefficient. Adjusted R Square is the value adjusted for the number of variables in the regression model.

TABLE 4.1 MODEL SUMMARY FOR MATHEMATICS
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .558   .311       .307                11.42813
a. Predictors: (Constant), First Year Scores
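The relationship between the values in Table 4.1 can be checked with a short calculation. The sketch below assumes the standard adjusted R Square formula, 1 − (1 − R²)(n − 1)/(n − p − 1), with n = 180 students and p = 1 predictor (First Year Scores). It starts from the rounded R of .558, so agreement is expected only to about three decimal places. The function names are hypothetical.

```python
def r_squared(r: float) -> float:
    """R Square for simple regression is the square of R."""
    return r * r


def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Standard adjustment for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


if __name__ == "__main__":
    r = 0.558            # R as reported in Table 4.1 (rounded)
    n, p = 180, 1        # sample size and number of predictors
    r2 = r_squared(r)
    print(f"R Square          = {r2:.3f}")
    print(f"Adjusted R Square = {adjusted_r_squared(r2, n, p):.3f}")
```

Rounded to three decimals, both results agree with the R Square (.311) and Adjusted R Square (.307) reported in the table.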
Std. Predicted Value   -4.378   2.737   .000   1.000   180
Std. Residual          -2.300   2.340   .000   .997    180
TABLE 5.3 DETAILS OF STUDENTS’ PERFORMANCE IN INTEGRATED SCIENCE.
Year One                                  Year Two
Grade   No.   Cum. Freq.   Percent        Grade   No.   Cum. Freq.   Percent
A1      30                 16.7           A1      14                 7.8
B2      18    48           10.0           B2      19    33           10.6
B3      29    77           16.1           B3      18    51           10.0
C4      30    107          16.7           C4      22    73           12.2
C5      33    140          18.3           C5      28    101          15.5
C6      23    163          12.8           C6      23    124          12.7
D7      8     171          4.4            D7      21    145          11.7
E8      3     174          1.7            E8      14    159          7.8
F9      6     180          3.3            F9      21    180          11.7
Total Percent              100.0          Total Percent              100.0

D. Performance of Students in Social Studies
Performance in Social Studies in the first year was averagely good, recording a class average of 69.54%. However, 6.1% fell below 50%, leaving 93.9% scoring above 50%. Remarkably, this performance improved in the second year, with 96.7% scoring above 50% and the remaining 3.3% below 50%, as indicated in table 5.4. More work needs to be done to push up the few students who scored below 50%.

TABLE 5.4 DETAILS OF STUDENTS’ PERFORMANCE IN SOCIAL STUDIES.
Year One                                  Year Two
Grade   No.   Cum. Freq.   Percent        Grade   No.   Cum. Freq.   Percent
A1      32                 17.8           A1      43                 23.9
B2      30    62           16.7           B2      33    76           18.3
B3      28    90           15.5           B3      42    118          23.3
C4      34    124          18.8           C4      26    144          14.4
C5      26    150          14.4           C5      12    156          6.7
C6      16    166          8.8            C6      15    171          8.3
D7      3     169          1.6            D7      3     174          1.6
E8      5     174          2.8            E8      4     178          2.2
F9      6     180          3.3            F9      2     180          1.1
Total Percent              100.0          Total Percent              100.0

E. The Degree of Association between Students’ Performance in BECE and SHS
There is a significant degree of correlation between BECE results and SHS performance. The correlation coefficient of 0.518 shows that students’ performance in BECE affects their performance at SHS moderately. What this means is that the poor performance of students in SHS cannot be blamed entirely on the grades they obtained in BECE. The fact that the correlation coefficient is not exactly 1 (perfect correlation) is an indication that there exist other factors that contribute to performance at the SHS level, most of which have been addressed in the literature.

This finding agrees with the literature. Bharadwaj and Pal (2011) undertook a study in which Bayesian classification was used to assess learner accomplishment. 300 learners were selected as the sample, of which 226 were males and 74 were females. A total of 17 characteristics were considered to ascertain which of them have an impact on learner accomplishment in the Bachelor of Computer Application program at an Indian higher institution of learning called Dr. R. M. L. Awadh University. A questionnaire and the institution’s database were used to gather statistics on learners’ scholastic, personal, social and financial characteristics. Learners’ scores were gathered from the institution’s assessment unit. The learner’s rating in the senior secondary exam, residence, medium of instruction, mother’s qualification, the learner’s other habits, family yearly earnings and the status of the learner’s family were discovered to be strongly linked with the scholastic accomplishment of the learner. The study concluded that the scholastic accomplishment of learners does not always depend on the learner’s own endeavor.

Also, Kotsiantis et al. (2004) employed several step-by-step methods of DM for predicting the accomplishment of computer science students in a distance learning course at a higher institution. For every individual learner, a number of personal characteristics (e.g. sex, age, marital status) and accomplishment qualities such as scores were used as parameters of a dual classification of pass or fail. The best result was achieved by a Naive Bayes method, which gave a precision of 74%. In addition, the discovery was made that grades obtained in a former institute of scholastic accomplishment have a much greater influence than personal characteristics.
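Returning to the Social Studies figures in Table 5.4, the shares of students below 50% quoted in the text (6.1% in year one, 3.3% in year two) can be reproduced from the grade counts. The sketch below assumes that grades E8 and F9 are the ones falling below 50%. This is an inference from the fact that their combined counts match the quoted percentages, not a grade boundary stated in the text. The names used are hypothetical.

```python
# Grade counts ("No." column) copied from Table 5.4.
YEAR_ONE = {"A1": 32, "B2": 30, "B3": 28, "C4": 34, "C5": 26,
            "C6": 16, "D7": 3, "E8": 5, "F9": 6}
YEAR_TWO = {"A1": 43, "B2": 33, "B3": 42, "C4": 26, "C5": 12,
            "C6": 15, "D7": 3, "E8": 4, "F9": 2}

# Assumption: these grades correspond to marks below 50%.
BELOW_50_GRADES = ("E8", "F9")


def share_below_50(counts: dict) -> float:
    """Percentage of students whose grade is in BELOW_50_GRADES."""
    total = sum(counts.values())
    below = sum(counts[grade] for grade in BELOW_50_GRADES)
    return 100 * below / total


if __name__ == "__main__":
    for label, counts in (("Year One", YEAR_ONE), ("Year Two", YEAR_TWO)):
        below = share_below_50(counts)
        print(f"{label}: {below:.1f}% below 50%, {100 - below:.1f}% at or above")
```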