USING DATA MINING TO PREDICT SECONDARY SCHOOL …

USING DATA MINING TO PREDICT SECONDARY SCHOOL STUDENT PERFORMANCE

Paulo Cortez and Alice Silva
Dep. Information Systems/Algoritmi R&D Centre
University of Minho
4800-058 Guimarães, PORTUGAL
Email:

KEYWORDS

Business Intelligence in Education, Classification and Regression, Decision Trees, Random Forest

ABSTRACT

Although the educational level of the Portuguese population has improved in the last decades, the statistics keep Portugal at Europe's tail end due to its high student failure rates. In particular, lack of success in the core classes of Mathematics and the Portuguese language is extremely serious. On the other hand, the fields of Business Intelligence (BI)/Data Mining (DM), which aim at extracting high-level knowledge from raw data, offer interesting automated tools that can aid the education domain. The present work intends to approach student achievement in secondary education using BI/DM techniques. Recent real-world data (e.g. student grades, demographic, social and school related features) was collected by using school reports and questionnaires. The two core classes (Mathematics and Portuguese) were modeled under binary/five-level classification and regression tasks. Also, four DM models (Decision Trees, Random Forest, Neural Networks and Support Vector Machines) and three input selections (e.g. with and without previous grades) were tested. The results show that a good predictive accuracy can be achieved, provided that the first and/or second school period grades are available. Although student achievement is highly influenced by past evaluations, an explanatory analysis has shown that there are also other relevant features (e.g. number of absences, parent's job and education, alcohol consumption). As a direct outcome of this research, more efficient student prediction tools can be developed, improving the quality of education and enhancing school resource management.

INTRODUCTION

Education is a key factor for achieving long-term economic progress. During the last decades, the Portuguese educational level has improved. However, the statistics keep Portugal at Europe's tail end due to its high student failure and dropout rates. For example, in 2006 the early school leaving rate in Portugal was 40% for 18 to 24 year olds, while the European Union average value was just 15% (Eurostat 2007). In particular, failure in the core classes of Mathematics and Portuguese (the native language) is extremely serious, since they provide fundamental knowledge for success in the remaining school subjects (e.g. physics or history).

On the other hand, the interest in Business Intelligence (BI)/Data Mining (DM) (Turban et al. 2007) arose due to the advances of Information Technology, leading to an exponential growth of business and organizational databases. All this data holds valuable information, such as trends and patterns, which can be used to improve decision making and optimize success. Yet, human experts are limited and may overlook important details. Hence, the alternative is to use automated tools to analyze the raw data and extract interesting high-level information for the decision-maker.

The education arena offers a fertile ground for BI applications, since there are multiple sources of data (e.g. traditional databases, online web pages) and diverse interest groups (e.g. students, teachers, administrators or alumni) (Ma et al. 2000). For instance, there are several interesting questions for this domain that could be answered using BI/DM techniques (Luan 2002, Minaei-Bidgoli et al. 2003): Who are the students taking most credit hours? Who is likely to return for more classes? What type of courses can be offered to attract more students? What are the main reasons for student transfers? Is it possible to predict student performance? What are the factors that affect student achievement? This paper will focus on the last two questions. Modeling student performance is an important tool for both educators and students, since it can help a better understanding of this phenomenon and ultimately improve it. For instance, school professionals could perform corrective measures for weak students (e.g. remedial classes).

In effect, several studies have addressed similar topics. Ma et al. (2000) applied a DM approach based on Association Rules in order to select weak tertiary school students of Singapore for remedial classes. The input variables included demographic attributes (e.g. sex, region) and school performance over the past years, and the proposed solution outperformed the traditional allocation procedure. In 2003 (Minaei-Bidgoli et al. 2003), online student grades from the Michigan State University were modeled using three classification approaches (binary: pass/fail; 3-level: low, middle, high; and 9-level: from 1 - lowest grade to 9 - highest score). The database included 227 samples with online features (e.g. number of corrected answers or tries for homework) and the best results were obtained by a classifier ensemble (e.g. Decision Tree and Neural Network) with accuracy rates of 94% (binary), 72% (3-classes) and 62% (9-classes). Kotsiantis et al. (2004) applied several DM algorithms to predict the performance of computer science students from a university distance learning program. For each student, several demographic (e.g. sex, age, marital status) and performance attributes (e.g. mark in a given assignment) were used as inputs of a binary pass/fail classifier. The best solution was obtained by a Naive Bayes method with an accuracy of 74%. Also, it was found that past school grades have a much higher impact than demographic variables. More recently, Pardos et al. (2006) collected data from an online tutoring system regarding USA 8th grade math tests. The authors adopted a regression approach, where the aim was to predict the math test score based on individual skills. The authors used Bayesian Networks and the best result was a predictive error of 15%.

In this work, we will analyze recent real-world data from two Portuguese secondary schools. Two different sources were used: mark reports and questionnaires. Since the former contained scarce information (only the grades and number of absences were available), it was complemented with the latter, which allowed the collection of several demographic, social and school related attributes (e.g. student's age, alcohol consumption, mother's education). The aim is to predict student achievement and if possible to identify the key variables that affect educational success/failure. The two core classes (Mathematics and Portuguese) will be modeled under three DM goals:

i) binary classification (pass/fail);
ii) classification with five levels (from I - very good or excellent to V - insufficient); and
iii) regression, with a numeric output that ranges between zero (0%) and twenty (100%).

For each of these approaches, three input setups ( …

… followed by higher education. Most of the students join the public and free education system. There are several courses (e.g. Sciences and Technologies, Visual Arts) that share core subjects such as the Portuguese Language and Mathematics. Like several other countries (e.g. France or Venezuela), a 20-point grading scale is used, where 0 is the lowest grade and 20 is the perfect score. During the school year, students are evaluated in three periods and the last evaluation (G3 of Table 1) corresponds to the final grade.

This study will consider data collected during the 2005-2006 school year from two public schools, from the Alentejo region of Portugal. Although there has been a trend for an increase of Information Technology investment from the Government, the majority of the Portuguese public school information systems are very poor, relying mostly on paper sheets (which was the current case). Hence, the database was built from two sources: school reports, based on paper sheets and including few attributes (the three period grades and number of school absences); and questionnaires, used to complement the previous information. We designed the latter with closed questions (i.e. with predefined options) related to several demographic (e.g. mother's education, family income), social/emotional (e.g. alcohol consumption) (Pritchard and Wilson 2003) and school related (e.g. number of past class failures) variables that were expected to affect student performance. The questionnaire was reviewed by school professionals and tested on a small set of 15 students in order to get feedback. The final version contained 37 questions in a single A4 sheet and it was answered in class by 788 students. Later, 111 answers were discarded due to lack of identification details (necessary for merging with the school reports). Finally, the data was integrated into two datasets related to Mathematics (with 395 examples) and the Portuguese language (649 records) classes.

During the preprocessing stage, some features were discarded due to the lack of discriminative value. For instance, few respondents answered about their family income (probably due to privacy issues), while almost 100% of the students live with their parents and have a personal computer at home. The remaining attributes are shown in Table 1, where the last four rows denote the variables taken from the school reports.
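As an illustration of the three DM goals named in the introduction (binary, five-level and regression), the sketch below derives each target from a final grade G3 on the 0 to 20 scale. It is a minimal sketch under stated assumptions, not the authors' procedure: the pass mark of 10 and the five-level cut points are hypothetical values chosen for the example, since this excerpt does not state them, and all function names are invented for illustration.

```python
# Sketch: deriving the three modeling targets from a final grade G3 (0..20).
# ASSUMPTION: PASS_MARK and FIVE_LEVEL_FLOORS are illustrative values only;
# the excerpt does not give the actual thresholds.

GRADE_MIN, GRADE_MAX = 0, 20
PASS_MARK = 10  # assumed pass threshold

# (lowest grade in the level, level label), ordered from best (I) to worst (V)
FIVE_LEVEL_FLOORS = [(16, "I"), (14, "II"), (12, "III"), (10, "IV"), (0, "V")]


def _validate(g3: int) -> None:
    """Reject grades outside the 20-point scale described in the text."""
    if not GRADE_MIN <= g3 <= GRADE_MAX:
        raise ValueError(f"grade {g3} is outside {GRADE_MIN}..{GRADE_MAX}")


def binary_target(g3: int) -> str:
    """Goal i): pass/fail label."""
    _validate(g3)
    return "pass" if g3 >= PASS_MARK else "fail"


def five_level_target(g3: int) -> str:
    """Goal ii): level from I (very good or excellent) to V (insufficient)."""
    _validate(g3)
    for floor, label in FIVE_LEVEL_FLOORS:
        if g3 >= floor:
            return label
    raise AssertionError("unreachable: the last floor is GRADE_MIN")


def regression_target(g3: int) -> float:
    """Goal iii): the numeric grade itself, between 0 and 20."""
    _validate(g3)
    return float(g3)


def as_percent(g3: int) -> float:
    """Express a grade as a percentage, so that 0 -> 0% and 20 -> 100%."""
    _validate(g3)
    return 100.0 * g3 / GRADE_MAX


if __name__ == "__main__":
    for grade in (0, 9, 10, 13, 20):
        print(grade, binary_target(grade), five_level_target(grade),
              regression_target(grade), as_percent(grade))
```

Keeping the thresholds in named constants makes it easy to swap in the real cut points once they are known, without touching the encoding functions.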

