USING DATA MINING TO PREDICT SECONDARY SCHOOL STUDENT PERFORMANCE

Paulo Cortez and Alice Silva
Dep. Information Systems/Algoritmi R&D Centre
University of Minho
4800-058 Guimarães, PORTUGAL

KEYWORDS

Business Intelligence in Education, Classification and Regression, Decision Trees, Random Forest

ABSTRACT

Although the educational level of the Portuguese population has improved in the last decades, the statistics keep Portugal at Europe's tail end due to its high student failure rates. In particular, lack of success in the core classes of Mathematics and the Portuguese language is extremely serious. On the other hand, the fields of Business Intelligence (BI)/Data Mining (DM), which aim at extracting high-level knowledge from raw data, offer interesting automated tools that can aid the education domain. The present work intends to approach student achievement in secondary education using BI/DM techniques. Recent real-world data (e.g. student grades, demographic, social and school related features) was collected by using school reports and questionnaires. The two core classes (Mathematics and Portuguese) were modeled under binary/five-level classification and regression tasks. Also, four DM models (Decision Trees, Random Forest, Neural Networks and Support Vector Machines) and three input selections (e.g. with and without previous grades) were tested. The results show that a good predictive accuracy can be achieved, provided that the first and/or second school period grades are available. Although student achievement is highly influenced by past evaluations, an explanatory analysis has shown that there are also other relevant features (e.g. number of absences, parents' job and education, alcohol consumption). As a direct outcome of this research, more efficient student prediction tools can be developed, improving the quality of education and enhancing school resource management.

INTRODUCTION

Education is a key factor for achieving long-term economic progress. During the last decades, the Portuguese educational level has improved. However, the statistics keep Portugal at Europe's tail end due to its high student failure and dropping out rates. For example, in 2006 the early school leaving rate in Portugal was 40% for 18 to 24 year olds, while the European Union average value was just 15% (Eurostat 2007). In particular, failure in the core classes of Mathematics and Portuguese (the native language) is extremely serious, since they provide fundamental knowledge for success in the remaining school subjects (e.g. physics or history).

On the other hand, the interest in Business Intelligence (BI)/Data Mining (DM) (Turban et al. 2007) arose due to the advances of Information Technology, leading to an exponential growth of business and organizational databases. All this data holds valuable information, such as trends and patterns, which can be used to improve decision making and optimize success. Yet, human experts are limited and may overlook important details. Hence, the alternative is to use automated tools to analyze the raw data and extract interesting high-level information for the decision-maker.

The education arena offers a fertile ground for BI applications, since there are multiple sources of data (e.g. traditional databases, online web pages) and diverse interest groups (e.g. students, teachers, administrators or alumni) (Ma et al. 2000). For instance, there are several interesting questions for this domain that could be answered using BI/DM techniques (Luan 2002, Minaei-Bidgoli et al. 2003): Who are the students taking most credit hours? Who is likely to return for more classes? What type of courses can be offered to attract more students? What are the main reasons for student transfers? Is it possible to predict student performance? What are the factors that affect student achievement? This paper will focus on the last two questions. Modeling student performance is an important tool for both educators and students, since it can help a better understanding of this phenomenon and ultimately improve it. For instance, school professionals could perform corrective measures for weak students (e.g. remedial classes).

In effect, several studies have addressed similar topics. Ma et al. (2000) applied a DM approach based on Association Rules in order to select weak tertiary school students of Singapore for remedial classes. The input variables included demographic attributes (e.g. sex, region) and school performance over the past years, and the proposed solution outperformed the traditional allocation procedure. In 2003 (Minaei-Bidgoli et al. 2003), online student grades from the Michigan State University were modeled using three classification approaches (binary: pass/fail; 3-level: low, middle, high; and 9-level: from 1 - lowest grade to 9 - highest score). The database included 227 samples with online features (e.g. number of corrected answers or tries for homework) and the best results were obtained by a classifier ensemble (e.g. Decision Tree and Neural Network) with accuracy rates of 94% (binary), 72% (3-classes) and 62% (9-classes). Kotsiantis et al. (2004) applied several DM algorithms to predict the performance of computer science students from a university distance learning program. For each student, several demographic (e.g. sex, age, marital status) and performance attributes (e.g. mark in a given assignment) were used as inputs of a binary pass/fail classifier. The best solution was obtained by a Naive Bayes method with an accuracy of 74%. Also, it was found that past school grades have a much higher impact than demographic variables. More recently, Pardos et al. (2006) collected data from an online tutoring system regarding USA 8th grade Math tests. The authors adopted a regression approach, where the aim was to predict the math test score based on individual skills. The authors used Bayesian Networks and the best result was a predictive error of 15%.

In this work, we will analyze recent real-world data from two Portuguese secondary schools.

[...] followed by higher education. Most of the students join the public and free education system. There are several courses (e.g. Sciences and Technologies, Visual Arts) that share core subjects such as the Portuguese Language and Mathematics. Like several other countries (e.g. France or Venezuela), a 20-point grading scale is used, where 0 is the lowest grade and 20 is the perfect score. During the school year, students are evaluated in three periods and the last evaluation (G3 of Table 1) corresponds to the final grade.

This study will consider data collected during the 2005-2006 school year from two public schools, from the Alentejo region of Portugal. Although there has been a trend for an increase of Information Technology investment from the Government, the majority of the Portuguese public school information systems are very poor, relying mostly on paper sheets (which was the current case). Hence, the database was built from two sources: school reports, based on paper sheets and including few attributes (the three period grades and number of school absences); and questionnaires, used to complement the previous information. We designed the latter with closed questions (i.e. with predefined options) related to several demographic (e.g. mother's education, family income), social/emotional (e.g. alcohol consumption) (Pritchard and Wilson 2003) and school related