Department of CSE
Kattankulathur. [email protected]
In this era of computerization, education
has also reinvented itself and is no longer limited to the old methods. The
quest to find new and advanced ways to make the educational system more efficient
and to develop students’ intellect has begun. These days, a lot of data is
collected in educational databases, but it remains unutilized. To
make proper use of such a large amount of data, powerful tools and
algorithms are required. It is very important to study and analyze educational
data in order to help students improve. Educational Data Mining (EDM) is an
emerging field that explores data in the educational context by applying different Data
Mining (DM) techniques and tools. It provides intrinsic knowledge of the teaching and
learning process for effective educational planning. This paper presents a
comprehensive survey, a travelogue towards educational data mining & its scope
Over the span of the last 10-20 years, the number of educational institutions has grown
rapidly. They produce a large number of graduates every year. Institutes
may follow the best teaching methods, but they still face the problem of
dropouts, low achievers and unemployed students.
Educational Data Mining (EDM) is an emerging field that explores data in the educational context by
applying different Data Mining (DM) techniques. EDM inherits properties from
areas like Learning Analytics, Artificial Intelligence, Information Technology,
Machine Learning, Statistics, Database Management Systems, Computing and Data
Mining. It can be considered an interdisciplinary research field which provides
intrinsic knowledge of the teaching and learning process for effective education.
Educational data mining is a new trend in the data mining and Knowledge Discovery in
Databases (KDD) field which focuses on mining useful patterns and discovering
useful knowledge from educational information systems, such as admissions
systems, registration systems, course management systems and any other systems
dealing with students, from schools to colleges and universities. Researchers
in this field focus on discovering useful knowledge either to help
educational institutes manage their students better, or to help
students improve their education and enhance their performance.
and analyzing the factors behind poor performance is a complex and continuous process
based on past and present information gathered from academic performance
and students’ behavior. Powerful techniques and algorithms are required to
analyze and predict the performance of students scientifically.
institutions collect a huge amount of student data, but this data
remains unutilized and does not help in any way to improve the performance of
could identify the factors for low performance earlier and predict
students’ behavior, this knowledge can help them take proactive action
to improve the performance of such students. It will be a win-win
situation for everyone involved, i.e. management, teachers, students and
parents. Students will be able to identify their weaknesses beforehand and
improve themselves. Teachers will be able to plan their lectures according to the
needs of students and provide better guidance to such students. Parents will
be reassured of their ward’s performance in such institutes. Eventually, this
will help in the proper growth of the nation.
and Pal conducted research on a group of 50 students enrolled in a specific
course program over a period of 4 years, with multiple performance
indicators, which includes
used the ID3 decision tree algorithm to construct a decision tree and if-then rules. This
application is intended to help instructors as well as students to
better understand and predict students’ performance at the end of the semester.
They defined the objective of this study as: “This study will also work to
identify those students which needed special attention to reduce fail ration
and taking appropriate action for the next semester examination”.
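ID3 grows a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below illustrates that selection step only. The attribute names (attendance, assignment) and the four records are hypothetical and are not taken from the study described above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the data on `attribute`."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical student records (illustrative only).
rows = [
    {"attendance": "good", "assignment": "yes"},
    {"attendance": "good", "assignment": "no"},
    {"attendance": "poor", "assignment": "yes"},
    {"attendance": "poor", "assignment": "no"},
]
labels = ["pass", "pass", "fail", "fail"]

gain_attendance = information_gain(rows, labels, "attendance")
gain_assignment = information_gain(rows, labels, "assignment")
best_attribute = max(["attendance", "assignment"],
                     key=lambda a: information_gain(rows, labels, a))
```

In this toy data, attendance separates the two classes perfectly, so it would be chosen as the root split. Each branch of the resulting tree can then be read as an if-then rule.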
and Elaraby conducted research that mainly focuses on generating
classification rules and predicting students’ performance in a selected course
program based on previously recorded student behavior and activities. They
processed and analyzed data on previously enrolled students in a specific course
program across 6 years, with multiple attributes collected from the university.
As a result, they were able to predict the students’ final grades in the
selected course program. They defined the objective of their study as: “Help the
students to improve the student’s performance, to identify those students which
needed special attention to reduce failing ration and taking appropriate action
at right time”.
and Pal conducted significant data mining research using the Naïve Bayes
classification method on a group of BCA students. A questionnaire was administered
to each student before the final examination. It contained
multiple personal and social questions, which were used in the study to identify the
relation between these factors and the students’ performance and grades. They stated
the main objectives of this study as:
of a data source of predictive variables
of different factors which affect a student’s learning behavior and performance during academic career
of a prediction model using classification data mining techniques on the basis of identified predictive variables
of the developed model for higher education students studying in Indian Universities or Institutions.
found that the most influential factor in a student’s performance is the grade
obtained in senior secondary school, i.e. students who performed well in
secondary school will definitely perform well in their bachelor’s studies. It was also
found that living location, medium of teaching, mother’s qualification,
the student’s other habits, family annual income, and family status contribute
significantly to students’ educational performance.
and Yacef describe the following as the four goals of EDM:
student’s future learning behavior
or improving domain models
the effects of educational support
scientific knowledge about learning and learners
student’s future learning behavior – This
goal can be achieved by creating student models that incorporate the learner’s
characteristics, including detailed information such as their knowledge,
behaviors and motivation to learn.
improving domain models – Through the
various methods and applications of EDM, both the discovery of new models and improvements to
existing models are possible.
effects of educational support – It can
be achieved through learning systems.
scientific knowledge about learning and learners
– By building and incorporating student models, the field of EDM research and
its technology can be improved to a great extent.
DATA MINING DEFINITION AND TECHNIQUES
Data mining refers to extracting or
“mining” knowledge from large amounts of data. Data mining techniques
operate on large amounts of data to find new and hidden patterns and relationships
which can be helpful in decision making.
The various techniques used in data mining are as follows.
Association analysis is the discovery of
association rules showing attribute-value conditions that occur frequently
together in a given set of data. Association analysis is widely used for
transaction data analysis.
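An association rule A → B is usually judged by two measures: support, the fraction of records containing both A and B, and confidence, the fraction of records containing A that also contain B. The sketch below computes both. The course-enrollment records are hypothetical and serve only as an illustration.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also hold `consequent`."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical enrollment records: each set lists the courses one student took.
transactions = [
    {"math", "physics"},
    {"math", "physics", "programming"},
    {"math", "programming"},
    {"physics"},
    {"math", "physics"},
]

rule_support = support(transactions, {"math", "physics"})          # 3 of 5 records
rule_confidence = confidence(transactions, {"math"}, {"physics"})  # 3 of 4 records
```

Here the rule math → physics has support 0.6 and confidence 0.75. A rule miner such as Apriori would report only those rules whose support and confidence exceed user-chosen thresholds.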
In prediction, the goal is to develop a
model which can infer a single aspect of the data from some combination of other
aspects of the data. Prediction can be divided into three types:
classification, regression and density estimation. In each type
of prediction, the input variables can be either categorical or continuous.
Classification is the process of
finding a set of models (or functions) which describe and distinguish data
classes or concepts, so that the model can be used to predict
the class of objects whose class label is unknown.
Clustering Analysis
Unlike classification and prediction,
which analyze class-labeled data objects, clustering analyzes data objects
without consulting a known class label. In general, the class labels are not
present in the training data simply because they are not known to begin with.
Clustering can be used to generate such labels. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
That is, clusters of objects are formed so
that objects within a cluster have high similarity in comparison to one
another, but are very dissimilar to objects in other clusters. Each cluster
that is formed can be viewed as a class of objects, from which rules can be derived.
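The text does not name a particular clustering algorithm. As one common choice, the sketch below uses k-means on one-dimensional data: each value is assigned to its nearest centroid, and each centroid is then moved to the mean of its cluster. The exam scores and the starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Minimal k-means for one-dimensional data (illustrative only)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical exam scores with no class labels attached.
scores = [35, 38, 42, 75, 78, 82, 85]
centroids, clusters = kmeans_1d(scores, centroids=[30, 90])
```

With these inputs the scores fall into a low-scoring group and a high-scoring group. The cluster index can then serve as a generated label, for example "needs support" versus "on track".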
Naive Bayes classifiers are a collection of classification
algorithms based on Bayes’ Theorem. It is not a
single algorithm but a family of algorithms that all share a common
principle: “every pair of features being classified is independent of each other”, given the class label.
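A minimal sketch of a Naive Bayes classifier for categorical features follows. It multiplies the class prior by one conditional probability per feature, which is exactly where the independence assumption is used. Add-one (Laplace) smoothing is an assumption added here to avoid zero probabilities. The features (attendance, prior grade) and the records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values

def predict(model, row):
    """Return the label with the highest (log) posterior score."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # log P(label) + sum over features of log P(value | label)
        score = math.log(n / total)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + 1   # add-one smoothing
            denominator = n + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical records: (attendance, prior grade) -> outcome.
rows = [("high", "good"), ("high", "good"), ("high", "poor"),
        ("low", "poor"), ("low", "poor"), ("low", "good")]
labels = ["pass", "pass", "pass", "fail", "fail", "fail"]

model = train_naive_bayes(rows, labels)
prediction_a = predict(model, ("high", "good"))
prediction_b = predict(model, ("low", "poor"))
```

Working in log space avoids numerical underflow when many features are multiplied together.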
model is the default model that predicts the class of every example in a
dataset as the mode (the most frequent class). For example,
consider a dataset of 100 records and 2 classes (Yes & No) in which “Yes”
occurs 70 times and “No” occurs 30 times. The default model for this dataset
will classify all objects as “Yes”; hence, its accuracy will be 70%. Although
it has no predictive value on its own, it is important as a baseline against which to evaluate the
accuracies produced by other classification models. This concept can be
generalized to all classes/labels in the data to produce an expectation of the
class recall as well.
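The 70/30 example above can be reproduced in a few lines. The function names are illustrative and not taken from any particular library.

```python
from collections import Counter

def default_model(labels):
    """Return the most frequent class (the mode) of the given labels."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# The dataset from the example: 70 "Yes" records and 30 "No" records.
labels = ["Yes"] * 70 + ["No"] * 30

majority = default_model(labels)
predictions = [majority] * len(labels)   # every record gets the same class
baseline_accuracy = accuracy(predictions, labels)
```

Any other classifier evaluated on this data should be compared against the baseline accuracy of 0.7 rather than against 0.5.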
input data records may or may not be the same length. For example, if you’re
working with sentences for sentiment analysis they’ll be of various lengths.
sentences are of different length, we pad our sentences with special
lengths of the two sentences equal, if documents are longer, they will be
So now we have our sentences modified as:
Sentence 1: the camera quality is very good
Sentence 2: the battery life is good
Now, both sentences are of the same length. We proceed to build the vocabulary index.
The vocabulary index is a mapping from each unique word in the corpus to an integer.
In our case, the size of the vocabulary index will be 9, since there are 9 unique
tokens. Vocabulary is as follows
tensorflow.contrib.learn.preprocessing.VocabularyProcessor is used for building the vocabulary.
The VocabularyProcessor maps your text
documents into vectors, and you need these vectors to be of a consistent
length. Each row in the raw_documents variable will be mapped to a vector of length max_document_length. You provide this parameter to
the VocabularyProcessor so that it can
adjust the length of the output vectors.
Each sentence is converted into a vector of integers.
Sentence 1: 1, 2, 3, 4, 5, 6
Sentence 2: 1, 7, 8, 4, 6, 0
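The tensorflow.contrib module was removed in TensorFlow 2.x, so the sketch below reproduces the padding and indexing steps in plain Python rather than calling VocabularyProcessor. The padding token name `<PAD>` and the helper names are assumptions. Index 0 is reserved for padding, which matches the trailing 0 in the vector for Sentence 2.

```python
PAD = "<PAD>"  # assumed name for the padding token; it receives index 0

def pad(tokens, length):
    """Truncate to `length` tokens, or right-pad with PAD up to `length`."""
    return tokens[:length] + [PAD] * max(0, length - len(tokens))

def build_vocabulary(documents):
    """Assign 1, 2, 3, ... to words in order of first appearance."""
    vocab = {PAD: 0}
    for tokens in documents:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

sentences = [
    "the camera quality is very good",
    "the battery life is good",
]
tokenized = [s.split() for s in sentences]
max_document_length = max(len(t) for t in tokenized)

padded = [pad(t, max_document_length) for t in tokenized]
vocab = build_vocabulary(padded)
vectors = [[vocab[t] for t in doc] for doc in padded]
```

The resulting vocabulary has 9 entries (8 distinct words plus the padding token), and `vectors` holds the two integer sequences listed above.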
CONCLUSION & FUTURE WORK
Data mining is a vast area
that includes employing different techniques and algorithms for pattern
finding. The algorithms discussed in this paper are the ones used in educational data
mining. These algorithms have shown a remarkable improvement in areas like
course outline formation, teacher-student understanding, and high output and
turnout ratio. The ICDM conference encourages the use and development of
algorithms helpful in data mining. A considerable amount of research is still being done
on various algorithms.
Prediction with data mining has yielded
benefits such as finding sets of weak students, determining students’
satisfaction with a particular course, faculty evaluation, comprehensive student
evaluation, classroom teaching language selection, predicting student
dropout, course registration planning, predicting the enrollment headcount,
evaluation of collaborative activities, etc.
One of the most recent and biggest challenges that higher education faces today is
making students skillfully employable. Many universities/institutes are not in
a position to guide their students because of a lack of information and assistance
from their teaching-learning systems. To better administer and serve the student
population, universities/institutions need better assessment, analysis, and