The weekly column
Article 35, October 2000

Blind Marking or Calibrated Marking: How Should TEFL/TESL Teachers Grade Written Exams?
By Christine Canning-Wilson
Center for Excellence in Applied Research and Training (CERT), HCT Abu Dhabi

Abstract: This paper reviews the results of two different methods of scoring written exams, blind marking and calibrated marking, at two tertiary-level institutions operating under the same system. By blind marking, the author means scoring written exams without awareness of other raters' scores or identities, so that a grader is not visually biased by the numerical marks others have given. The term calibrated marking refers to the practice of reaching group consensus on what constitutes a given written exam score, regardless of whether the paper is blind-marked. Additionally, the paper suggests that grading is influenced both by rigorous standards that incorporate good testing practice and by an institution's ability to train its markers to grade with consistency and accuracy. The qualitative and semi-quantitative results of the study initiative are still being explored to examine the effects of gender bias in marking and the halo effect in scoring; therefore, at the current time, only the variables and their effect on the differences in scoring practices are examined in this report. Although the report highlights the results of one institution, it reflects concern for the practices instituted by the other. To preserve the anonymity of the two institutions, they will be referred to only as Institution ABC and Institution XYZ.

1. Literature Review

Many programs with high-stakes exams, whether administered on a large or a small scale, use either norm-referenced or criterion-referenced testing practices. Norm-referenced tests derive their pass mark from the performance of the students; it is not predetermined. Criterion-referenced tests are written to reflect a specific curriculum. Criterion-referenced tests may also use criteria from outside the given curriculum and/or student population to determine proficiency. Institution ABC uses norm-referenced testing in its one-year foundations program. Institution XYZ always uses criterion-referenced testing to measure student performance, proficiency and mastery of the English language. However, both Institution ABC and Institution XYZ incorporate a writing score into their exam practices. Institution ABC relies heavily on in-house calibrated writing practices for scoring papers, whilst Institution XYZ uses a combination of calibrated writing practice, blind-marking practices, and ongoing teacher-training workshops to help practitioners mark English written examinations based on international benchmarks and standards.

The first issue we must examine is how testing influences language teaching. Alderson and Wall, leading experts in testing, propose a washback hypothesis concerning the relationship between assessment and teaching and learning. In their cornerstone article, Alderson and Wall (1993) directly state:
Additional factors beyond those stated by Alderson and Wall affect learning, teaching and testing. Other influences and variables must be taken into account in the testing process in order to understand how a test impacts learning, teaching, curriculum and/or administrative practices. I believe these practices and beliefs include, but are not limited to:
How do these factors affect the practices of markers scoring English written exams? First, poor training or in-house training can lead to standards which are not based on the literature. It can standardize practices which should never have been adopted in the first place. It can potentially lead to abuses and a lack of consistency, which in turn can give a testing program insufficient or inaccurate results, give a teacher misleading washback about what material the student has learned, and give the learner an unfair benchmark or grade for what he/she has mastered in the language classroom.

2. Program Overview

Institution ABC is a one-year foundations program. The ABC curriculum serves as a transitional phase of instruction for students leaving high school and entering their respective faculties. The ABC provides up to three semester-long levels of instruction. Each level (English 1, English 2, and English 3) offers approximately nine hours per week of basic and intermediate instruction in English. These courses carry credit and lead to English for Special Purposes (ESP) classes in a separate ESP Program. ABC English 2 and 3 courses have dual tracks: one for students whose major requires extensive amounts of English (the ESL track), and a less intensive one for students in the EFL track. For example, most science majors are in the ESL track, whereas most Education and Arts majors are in the EFL track. As of Spring 2000, ABC students were assessed with a 10% teacher grade, a 30% mid-term grade and a 60% final exam grade based on in-house, teacher-produced examinations overseen by the testing committee.

In comparison to the ABC program, the XYZ institution offers solid programs which can lead to a Certificate, Diploma, Higher Diploma or Bachelor's Degree using English as the medium of instruction. The XYZ Colleges use both college-based assessment examinations and international benchmark exams such as the PET and the IELTS to measure students' proficiency.
Both institutions are dedicated to improving the English language skills of learners. Their commitments to furthering language abilities of young nationals can be found in their in-depth mission statements. The ABC On-Line Mission Statement is stated verbatim as follows:
It further states that "Because students will interact with English to different degrees and in varying contexts, individualization must be built into proficiency-oriented curricula." At the time of the study, no proficiency guidelines, standardized tests (PET/IELTS) or formal speaking courses were officially in use. The Mission Statement and goals of the XYZ institution (2000) differ from those of its sister institution. The XYZ Handbook and On-line Handbook 2000 state that:
The Aims of the English Program according to the XYZ Handbook are stated verbatim:
It is clear that the mission of the XYZ is to improve teaching excellence and to maximize opportunities, which keeps learning on the cutting edge of technological and pedagogical advances. Unlike ABC students, XYZ students are required to take English courses throughout their entire program of study until graduation. Teachers are rigorously trained to comply with the high standards set forth by the XYZ and XYZ Academic Services. The XYZ is dedicated to the delivery of technical and professional programs of the highest quality to the citizens of the XYZ country.

3. Testing Programs

There are other fundamental differences between the ABC and the XYZ programs. Firstly, Institution XYZ offers and tests a speaking component. To date, Institution ABC does not formally teach or test speaking. Secondly, the XYZ uses standardized tests. These regularly administered and highly recognizable examinations allow XYZ students to be benchmarked against international standards. ABC courses do not end with standardized exams that have been piloted and validated. Instead, level heads and teachers create exams based on the materials found in the course books, using similar exercises. Section 3 of the ABC Mission Statement On-line Program File states, "Students are first assessed through a placement test. The midterm and final examinations are developed by the Testing Committee working collaboratively with all the teachers." The ABC exams are produced in-house and are not calibrated against international benchmark tests. The tests are written to the level of the students and the course materials; this practice often results in an approximate 70/30 pass/fail rate on examinations. This is a significant achievement in comparison to the 1993 ABC goal of a proposed 60% pass rate on in-house tests. To obtain these new results during the seven-year period, self-assessment reviews and new testing practices had to be implemented.
Because the support that the XYZ's Academic Central Services (ACS) gives to student assessment is standardized and centralized, the ACS has not faced the same types of challenges as the ABC. Because the ABC does not follow all of the basic testing practices employed by the XYZ, ABC exams have, over the years, been known to contain questions with more than one answer, to contain typing errors, and to produce irregular pass/fail rates that led upper administration to call the testing process into question. An additional challenge the ABC's testing committee has faced is the need to produce criterion exams which are written to the level of the students. Because of the previously mentioned constraints, it is next to impossible to benchmark the results of these in-house exams against international standards. Furthermore, unlike standardized exams, which are piloted and have a regular system for validating questions, the tests produced in-house at ABC run the risk of problems with testing issues such as validity. The last challenge faced by the ABC, and to be fair by most testing programs worldwide, is test security. Because of the number of teachers on curriculum and testing teams who have potential access to exams, there remains a risk as far as security is concerned. Again, because the XYZ's Academic Services are standardized and centralized, the XYZ is not faced with the same types of challenges.

4. Grading Criteria for Marking Scripts

The data examined show that the XYZ has a more reliable system of examination writing, vetting and scoring than the ABC. Computerized statistical reports show that grading is more accurate and objective at the XYZ. Practices for grading are based on the XYZ's own in-house writing bands, which are calibrated with the internationally recognized ESU Framework.
Empirical evidence will show that the XYZ's system for banding is a more reliable way of scoring than the current ABC practice of in-house designed calibration. Each semester, the ACS at XYZ offers new and veteran faculty ongoing training seminars in grading writing scripts. Furthermore, it trains its teachers to meet international grading requirements to ensure quality and equity amongst markers. As stated previously, the XYZ in-house banding system is based on the ESU Framework. In other areas of evaluation and assessment, the XYZ sponsors prestigious international testing conferences such as Current Trends in English Language Testing (CTELT). It has hired internationally recognized testing specialists. It offers its employees handbooks and on-line self-access manuals for reference. Most recently, the XYZ encouraged the development of teacher and student sites for testing practice with generous support from its Quality/Teaching/Learning Grants. It would be more than fair and accurate to state that the XYZ is a forerunner in its commitment to furthering good testing practices.

As a result of the standards in place, XYZ students are the beneficiaries of quality control. Learners are regularly graded and assessed based on internally created and simulated international benchmarks as well as by formal standardized examinations of proficiency. The XYZ's stringent use of academic requirements such as the UCLES/IDP International English Language Testing System (IELTS) and the UCLES Preliminary English Test (PET) allows XYZ learners to compete in a global world using the English language. XYZ students are regularly encouraged to apply and learn the language for use in the workplace and to benchmark their proficiency against other language learners around the world. Moreover, passing these prestigious international examinations is required for students to graduate from an XYZ Program.
In comparison, Institution ABC does not incorporate standardized exams at the end of its courses to grade a student's proficiency. As stated earlier in the ABC Mission Statement, it must still try to build proficiency-based curricula. It therefore has no reason to test for proficiency in the language at the current time, since proficiency is not taught in its program. As a result, it relies solely on its teacher-generated tests.

XYZ college-based exams are graded with an in-house designed banding system based on the ESU Framework. Bands range from 1 to 10, with quarter-band intervals to distinguish ability and score differentials. Formal external assessments set by the University of Cambridge Local Examinations Syndicate (UCLES), such as IELTS and PET, use the established standardized criteria. A review of approximately 9,000 written exam scripts scored in June 2000 revealed that three independent markers using blind-marking practices gave the same score 64% of the time (see figure 1a). Statistical software showed a 96% marking accuracy rate within a half-band radius amongst the three independent markers. These results are phenomenal, and the XYZ practices should serve as a model for other institutions. Figure 1a:
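The two agreement figures reported above (identical scores, and scores within half a band) can be illustrated with a short calculation. The sketch below is not the XYZ software, whose workings are not described here. It assumes each script carries three band scores and reads "within a half-band radius" as a gap of no more than 0.5 between the highest and lowest of them; the function name, the tolerance parameter and the sample scores are all hypothetical.

```python
def agreement_rates(scripts, tolerance=0.5):
    """Return (exact_rate, within_tolerance_rate) for a list of scripts.

    Each script is a tuple of band scores, one per marker.
    Assumption: "within half a band" means the spread between the
    highest and lowest band on a script is at most `tolerance`.
    """
    if not scripts:
        raise ValueError("no scripts supplied")
    exact = 0
    within = 0
    for bands in scripts:
        spread = max(bands) - min(bands)
        if spread == 0:
            exact += 1   # all markers gave the same band
        if spread <= tolerance:
            within += 1  # markers agree to within the tolerance
    total = len(scripts)
    return exact / total, within / total


# Hypothetical quarter-band scores from three markers on four scripts.
sample = [
    (6.0, 6.0, 6.0),
    (5.5, 5.75, 5.5),
    (7.0, 6.5, 7.0),
    (4.0, 5.0, 4.25),
]
exact_rate, within_rate = agreement_rates(sample)
print(f"same score: {exact_rate:.0%}, within half a band: {within_rate:.0%}")
```

A stricter reading, such as each marker falling within half a band of the final averaged score, would give different percentages, so the definition needs to be fixed before results from different institutions are compared.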
The ABC writing calibration practices differ greatly from those in use at the XYZ. The calibration criteria used to determine student grades change from semester to semester and sometimes from exam to exam. I have in hand five different ABC calibration sheets that have been used on different exams, but for the purposes of this paper I will use the calibration sheet devised for the Spring 1998 exams, which was used to grade the 297 scripts examined in the study. The 1998 ESL 3 Calibration paper reads as follows:
The calibration guideline used to grade the 1998 Spring exams was an ill-designed tool to measure student performance. Therefore, the design of the instrument could have influenced the results of teacher assessments of student papers, in addition to the other fundamental problems in the testing process. The poor design of the instrument can be attributed to two main factors:
Sample scripts were used with teachers prior to grading to reach a consensus on what constituted a mark of "9" and what constituted a mark of "5". However, several factors resulted in inconsistent marking: the ambiguity of the instrument; the lack of inter-rater reliability between markers' scores; the directive from the Unit Coordinator to eliminate the third check that the past supervisor of testing had established to ensure consistency between all graders; and the blatant disregard for calibration policy by some markers. The data collected from 297 random written exam scripts indicated 26.7% of the scripts were in some form of grading violation. The 88 papers with marking inconsistencies had the potential to lead to unfair scoring practices (see figures 2a and 2b). Figure 2a:
5. Other Fundamental Grading Differences Between the ABC and the XYZ

The ACS of the XYZ is responsible for providing reliable and valid examinations for the entire system. Statistical data collected by Howell and Marsden (2000) and analyzed with in-house software packages indicate a high level of accuracy in teacher banding when grading written scripts. This success can be attributed to:
The software makes its determination by calculating at quarter-band differentials. If the three banding scores are as follows, the final grade will be calculated as:
Upon completion of the data entry, the inter-marker data is reviewed by in-house designed software packages. Each individual teacher's performance is scored against his/her colleagues' grading. If a teacher is marking too high or too low, the software will alert ACS that there is a potential inconsistency. The XYZ's use of blind marking, code numbers, and alphabetized lists leads to:
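The check on whether a teacher is marking too high or too low can be pictured with a small sketch. This is an illustration only, not the in-house XYZ software: it assumes each marker's band on a script is compared with the average of the other markers' bands on that script, and that a marker is flagged when the average difference across scripts exceeds a chosen threshold. The marker codes, the threshold of 0.75 and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def flag_markers(scripts, threshold=0.75):
    """Return {marker_code: average_difference} for markers to review.

    `scripts` is a list of dicts mapping a marker code to the band that
    marker gave one script. A positive difference means the marker scores
    higher than colleagues on average; a negative one means lower.
    Assumption: the comparison rule and threshold are illustrative only.
    """
    differences = defaultdict(list)
    for bands_by_marker in scripts:
        for marker, band in bands_by_marker.items():
            colleagues = [b for m, b in bands_by_marker.items() if m != marker]
            if colleagues:
                differences[marker].append(band - mean(colleagues))
    flagged = {}
    for marker, diffs in differences.items():
        average = mean(diffs)
        if abs(average) > threshold:
            flagged[marker] = average
    return flagged


# Hypothetical blind-marked scripts, identified by marker code only.
sample_scripts = [
    {"T01": 6.0, "T02": 6.0, "T03": 7.0},
    {"T01": 5.0, "T02": 5.25, "T03": 6.0},
    {"T01": 4.5, "T02": 4.5, "T03": 5.5},
]
for marker, difference in flag_markers(sample_scripts).items():
    print(f"{marker} differs from colleagues by {difference:+.2f} bands on average")
```

With these sample scores only marker T03 is flagged, as consistently scoring close to a full band above the other two. A real system would need far more scripts per marker before drawing any conclusion about an individual.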
The testing policies and practices up to Spring 2000 have been different in the ABC Unit for the following reasons:
6. Recommendations

It is recommended that the XYZ continue on its current path and further develop its already successful inter-rater reliability results in grading students' written scripts. Although the ABC testing program has made consistent strides over the years with limited manpower and resources, some changes need to be made to further improve the foundations program. The following suggestions are only recommendations offered by the author, who has worked in both systems. It would be advisable to have an outside testing consultant review the entire system using a top-down and bottom-up evaluation process for a more objective opinion. It would be equally advisable to have an internal evaluation by all faculty currently involved in teaching or testing courses in the ABC. It is highly recommended that the ABC Unit hire a testing supervisor with a Ph.D. in testing and assessment. The program should consider restructuring the current testing program's marking practices. The Unit Coordinator and assistant coordinators should not be actively involved in testing; it would be better if they served as a scanning team for inconsistencies, if software cannot be readily provided. It is further recommended that standardized proficiency exams such as the PET and IELTS be implemented at the 2nd and 3rd levels of the EFL and ESL tracks as a course requirement for graduation from the foundations program. A banding system of descriptors should be created, and consistent training by experts should be offered to train teachers in independent blind marking to increase inter-rater reliability. It would also be advisable not to involve classroom teachers in the review or inputting of their own students' exam papers.
It would be advantageous for the ABC Unit to use more advanced statistical programs for data analysis and to use independent mathematicians to report on the validity of test questions and exam results. It would be advantageous to reinstate the previous policy of a third independent reviewer. A third marker should grade each writing script, and a fourth arbitrator should review papers where the spread of scores proves to be inconsistent. Exams should be independently reviewed to make sure that markers are grading to task. Lastly, ABC supervisors and teachers should be held more accountable for how they mark and score student papers.

7. Future Directions

It would be interesting to see whether future writing scores would become more accurate and show greater inter-rater reliability if these suggestions were implemented. It would also be worthwhile to see whether the gender of the marker influences scoring practices when correcting the papers of same-sex and different-sex students. Studying whether or not handwriting influenced ABC grading of written exams would also be of research interest.

8. Conclusions

It would be unfair to categorize all calibration practices as inferior to the practices of blind marking. Testing practices outside of calibration scoring can affect and potentially influence score outcomes. The author would recommend a combination similar to the practices of Institution XYZ, which would include, but not be limited to, rigorous training in banding practices, ongoing calibration workshops, implementation of software to improve standards of inter-rater reliability, and a top-down/bottom-up system of self-evaluation and outside evaluation for feedback on teacher grading practices. It is recommended that teachers or institutions wishing to upgrade their current standards research the literature or join professional organizations in the area of testing.

Bibliography:
Other Secondary References:
Christine Canning-Wilson is an instructor and co-curriculum coordinator at the Center for Excellence in Research and Training (CERT) at the Higher Colleges of Technology in Abu Dhabi. Previously, she was a supervisor and lecturer in the University General Requirements Unit of UAE University where she also served as the Chairperson of the Media Graphics & Visual Arts Committee. She is the past Chairperson of the TESOL Arabia 2000 Conference and member of the TESOL Arabia Executive Council. Currently, she serves as the Chairperson of the Video Special Interest Group of TESOL Arabia. Christine Canning-Wilson has published in numerous journals, newsletters and proceedings. In addition, she has been invited to speak at many international conferences. She can be contacted at christine.canning@hct.ac.ae