Autosoft Journal


Improving Performance Prediction On Education Data With Noise And Class Imbalance



Abstract

This paper applies machine learning techniques to predict students’ performance on two real-world educational data sets. The first data set is used to predict the responses of students with autism while they learn a specific task; the second is used to predict student failure at a secondary school. Both data sets suffer from two major problems that can impair a classification model's ability to predict the correct label: class imbalance and class noise. A series of experiments was carried out to improve the quality of the training data and hence the prediction results. We propose two noise-filter methods that eliminate noisy majority-class instances located inside the borderline area. Both methods combine the SMOTE over-sampling technique with thresholding to balance the training data and choose the best boundary between classes, and then apply a noise-detection approach to identify the noisy instances. We use the two data sets to assess the efficacy of existing class-imbalance approaches as well as both proposed methods. Results for different classifiers show that AUC scores improved significantly when the two proposed methods were combined with existing class-imbalance techniques.
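The abstract does not give the algorithmic details of the two proposed filters, so the following is only a minimal sketch of the general idea (over-sample with SMOTE, pick a decision threshold, then drop majority instances that fall on the minority side). It is not the authors' method. The k-nearest-neighbour score, the balanced-accuracy criterion for choosing the threshold, the toy data, and all function names are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda q: math.dist(base, q))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def minority_score(i, X, y, k=5):
    """Fraction of the k nearest other points that carry the minority label 1
    (an assumed stand-in for a classifier's predicted probability)."""
    ranked = sorted((math.dist(X[i], X[j]), y[j]) for j in range(len(X)) if j != i)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def best_threshold(scores, y):
    """Pick the cut-off that maximises balanced accuracy (assumed criterion)."""
    pos = sum(y)
    neg = len(y) - pos

    def balanced_accuracy(t):
        tp = sum(1 for s, label in zip(scores, y) if label == 1 and s >= t)
        tn = sum(1 for s, label in zip(scores, y) if label == 0 and s < t)
        return 0.5 * (tp / pos + tn / neg)

    return max(sorted(set(scores)), key=balanced_accuracy)


def filter_majority_noise(X, y, n_new, k=5):
    """Balance with SMOTE, choose a threshold, then drop original majority
    instances (label 0) whose score places them on the minority side."""
    minority = [x for x, label in zip(X, y) if label == 1]
    synthetic = smote(minority, n_new)
    Xb, yb = X + synthetic, y + [1] * len(synthetic)
    scores = [minority_score(i, Xb, yb, k) for i in range(len(Xb))]
    t = best_threshold(scores, yb)
    keep = [i for i in range(len(X)) if not (y[i] == 0 and scores[i] >= t)]
    return [X[i] for i in keep], [y[i] for i in keep], t


# Toy data (hypothetical): a majority cluster near the origin, a minority
# cluster near (5, 5), and one majority-labelled point inside the minority
# cluster that plays the role of class noise.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (0.5, 0.5), (0.2, 0.8), (0.8, 0.2), (1.2, 0.4)]
noisy_point = (5.0, 5.1)
minority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
X = majority + [noisy_point] + minority
y = [0] * 9 + [1] * 3

X_clean, y_clean, threshold = filter_majority_noise(X, y, n_new=6)
print(len(X), "->", len(X_clean), "instances; threshold =", threshold)
```

In this toy setting the mislabelled point is surrounded by real and synthetic minority instances, so its score exceeds the chosen threshold and it is removed, while the well-separated majority cluster is kept. A practical pipeline would more likely rely on an established SMOTE implementation and a trained classifier's probabilities rather than the hand-rolled neighbour score used here.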


Pages

Total Pages: 8

DOI
10.1080/10798587.2017.1337673


JOURNAL INFORMATION


ISSN PRINT: 1079-8587
ISSN ONLINE: 2326-005X
DOI PREFIX: 10.31209 (10.1080/10798587 with T&F)
IMPACT FACTOR: 0.652 (2017/2018)
Journal: 1995-Present

CONTACT INFORMATION


TSI Press
18015 Bullis Hill
San Antonio, TX 78258 USA
PH: 210 479 1022
FAX: 210 479 1048
EMAIL: tsiepress@gmail.com
WEB: http://www.wacong.org/tsi/