Evaluating AI-Generated Feedback for Formative Assessment in Higher Education

Xiaolei Li; David Chen; David Tuffley; Gary Scott; Gervase Tuxworth; Tsungcheng Yao; Geraldine Torrisi-Steele

doi:10.46787/ijaipil.v2026i1.6962

Vol. 2026 No. 1 (2026): 2026


David Chen, Griffith University, Australia


Published 2026-03-10; updated 2026-03-19


Keywords

AI-generated feedback; AI-assisted Assessment; Formative Assessment; Peer Review; Higher Education

How to Cite

Li, X., Chen, D., Tuffley, D., Scott, G., Tuxworth, G., Yao, T., & Torrisi-Steele, G. (2026). Evaluating AI-Generated Feedback for Formative Assessment in Higher Education. International Journal of AI in Pedagogy, Innovation, and Learning Futures, 2026(1). https://doi.org/10.46787/ijaipil.v2026i1.6962 (Original work published March 10, 2026)

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Providing effective formative feedback consistently is integral to learning, yet it is one of the most demanding aspects of teaching, especially at scale. It is therefore unsurprising that, amid the wider exploration of AI tools in teaching and learning, interest is growing in using Large Language Models (LLMs) to support formative feedback. Using student submissions from a course that included structured peer review activities, this study presents a comparative analysis of AI-generated feedback and scoring (GPT-4) against human marking. The findings show that AI-generated feedback provides structured, detailed comments for assessment components with clear, objective criteria, and that it agrees strongly with human marking practices. Limitations appeared in areas requiring subjective judgment, such as evaluating the quality of peer reviews and the depth of student reflection. In these cases, human educators provided more nuanced interpretations grounded in contextual and pedagogical understanding. The findings underscore the importance of human oversight for qualitative and interpretive evaluation and suggest that a balanced human-AI approach, grounded in pedagogical intent and careful integration, is essential for the effective use of AI-assisted feedback in higher education.


