
Computing Marks from Multiple Assessors Using Adaptive Averaging

2009


Aleksandar Ignjatovic, Chung Tong Lee, Cat Kutay, Hui Guo, Paul Compton
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Email: {ignjat, ctlee, ckutay, huig, compton}@cse.unsw.edu.au

Abstract

Consider a situation in which a group of assessors mark a collection of submissions; each assessor marks more than one submission and each submission is marked by more than one assessor. Typical scenarios include reviewing conference submissions and peer marking in a class. The problem is how to optimally assign a final mark to each submission. The mark assignment must be robust in the following sense. A small group of assessors might collude and give marks which significantly deviate from the marks given by other assessors. Another small group of assessors might give arbitrary marks, uncorrelated with the others' assessments. Some assessors might be excessively generous while some might be extremely stringent. In each of these cases, the impact of the marks by assessors from such groups has to be appropriately discounted. Based on the work in [2], we propose a method which produces marks meeting the above requirements. The final mark assigned to each submission is a weighted average of the marks by individual assessors; the weight given to each assessor's marks is inversely related to the total variance of all his marks from the final marks. Clearly, such a definition is circular, and the existence of a final mark assignment with this property is proved using the Brouwer Fixed Point Theorem for continuous maps on convex compact sets [1]. We provide a fast-converging iterative algorithm for computing such a fixed point and give results of empirical tests of the robustness and adequacy of the marks calculated by our algorithm.

1. Introduction

Assume that $M$ assessors $a_i$, $1 \le i \le M$, are marking $N$ submissions $s_j$, $1 \le j \le N$.
Each submission is marked by at least two (preferably more) assessors, and each assessor marks at least two (preferably more) submissions. A typical example is the common reviewing process of submissions for a large conference. The assessors might also be the students submitting the assignments, i.e., our method applies to peer marking as well. As is often the case in practice, assessors might not have uniform criteria: some might consistently be tougher, some more generous with their marks; some might mark erratically, allowing a large random component in their marks. Further, in the case of peer marking, smaller groups might collude, giving members of the colluding group higher marks than warranted, and everyone else low marks. The aim of this paper is to show how adaptive averages can be used to design a marking procedure which is robust with respect to:

- discrepancies in the strictness of marking criteria of individual assessors,
- influence of collusion within smaller groups (in the case of peer marking), and
- presence of assessors with somewhat arbitrary (random) marking practice.

The method should also:

- detect which assessors give anomalous marks, and indicate the nature of the anomaly;
- be reasonably efficient and allow a significant number of both assessors and submissions.

In our procedure, final marks are obtained as a fixed point of a weighted average operator, with the weight assigned to the marks given by each assessor reflecting the variance of the marks of that assessor from the finally assigned marks. Our procedure satisfies all of the above criteria, as our simulations and empirical testing on actual data show.

2. Basic Notations

For each assessor $a_i$, let $D_i$ be the set of indices of all submissions that $a_i$ has marked. For each submission $s_j$ such that $j \in D_i$, let $m(i, j)$ be the mark given by $a_i$ to $s_j$. Let also $G_j$ be the set of indices of all assessors that have marked $s_j$.
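The index sets $D_i$ and $G_j$ can be built directly from a table of marks. A minimal sketch, assuming marks are stored as a dict keyed by the pair (assessor index, submission index); this data layout is my own choice, not prescribed by the paper:

```python
from collections import defaultdict

def index_sets(marks):
    """Return D[i] (submissions marked by assessor i) and G[j] (assessors of s_j)."""
    D, G = defaultdict(set), defaultdict(set)
    for (i, j) in marks:
        D[i].add(j)
        G[j].add(i)
    return D, G

# Tiny example: assessor 1 marks submissions 1 and 2, assessor 2 marks submission 1.
marks = {(1, 1): 7, (1, 2): 5, (2, 1): 6}
D, G = index_sets(marks)
```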
All marks are non-negative real numbers in a bounded range, i.e., there exists $R > 0$ such that $0 \le m(i, j) \le R$ for all $i, j$ with $1 \le i \le M$, $1 \le j \le N$.

For the purpose of analysis, let us consider an unspecified marking method $\mathcal{M}$. Denoting by $\mu_j$ the mark assigned by $\mathcal{M}$ to the submission $s_j$, we define two metrics for the variance of an assessor $a_i$ from the assigned marks:

$$
v(i) = \frac{1}{|D_i|} \sum_{j \in D_i} \bigl| m(i, j) - \mu_j \bigr|^p, \qquad
sv(i) = \frac{1}{|D_i|} \sum_{j \in D_i} \operatorname{sgn}\bigl( m(i, j) - \mu_j \bigr) \cdot \bigl| m(i, j) - \mu_j \bigr|^p \tag{1}
$$

where $|D_i|$ denotes the number of submissions marked by $a_i$; $p \ge 1$ is a real parameter; and

$$
\operatorname{sgn}(x) = \begin{cases} -1, & x < 0 \\ \phantom{-}1, & x \ge 0. \end{cases}
$$

Thus, the variance metric $v(i)$ takes into account only the absolute values of the differences between $\mu_j$ (assigned by $\mathcal{M}$) and $m(i, j)$ (given by $a_i$), while the variance metric $sv(i)$ retains the sign of these differences. If $v(i)$ is large and the absolute value of $sv(i)$ is small, the assessor $a_i$ has a high degree of arbitrariness, because his marks are often, and equally likely, excessively high and excessively low. If both $v(i)$ and $sv(i)$ have high positive values, $a_i$ tends to be excessively generous compared to the marking method $\mathcal{M}$, while a large value of $v(i)$ and a large negative value of $sv(i)$ indicate that $a_i$ is a harsh assessor. The higher the value of the parameter $p$, the more the large differences contribute to the sum.

Let $\mu = (\mu_j : 1 \le j \le N)$ and let $q \ge 1$ be another real parameter. Consider a submission $s_j$; we define a weight function $w(i, j, \mu)$ for all $i \in G_j$ and then a weighted average $f_j(\mu)$ of the marks $m(i, j)$ given to $s_j$ by the assessors $a_i$:

$$
w(i, j, \mu) = \frac{\Bigl( 1 - \frac{v(i)}{\sum_{k \in G_j} v(k)} \Bigr)^{q}}{\sum_{l \in G_j} \Bigl( 1 - \frac{v(l)}{\sum_{k \in G_j} v(k)} \Bigr)^{q}}, \qquad
f_j(\mu) = \sum_{i \in G_j} w(i, j, \mu) \cdot m(i, j). \tag{2}
$$

If we interpret $\mu_j$ as the final mark assigned to $s_j$, then for a fixed $j$, $f_j(\mu)$ is a weighted average of the marks $m(i, j)$, $i \in G_j$. Since $0 \le v(i) \le \sum_{k \in G_j} v(k)$, we have $0 \le 1 - \frac{v(i)}{\sum_{k \in G_j} v(k)} \le 1$. Because all these values can be close to one, we need a "spreading function" to emphasize the differences among the values $1 - \frac{v(i)}{\sum_{k \in G_j} v(k)}$, and raising the term to the power $q$ serves this purpose.
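The variance metrics of equation (1) and the normalized weights of equation (2) can be sketched as follows. The dict-based data layout and the default parameter values $p = 2$, $q = 5$ are illustrative assumptions, not prescribed by the paper:

```python
from collections import defaultdict

def variances(marks, mu, p=2):
    """v(i) and sv(i) of equation (1): marks is {(i, j): mark}, mu is {j: mu_j}."""
    D = defaultdict(list)
    for (i, j) in marks:
        D[i].append(j)
    v, sv = {}, {}
    for i, subs in D.items():
        diffs = [marks[(i, j)] - mu[j] for j in subs]
        v[i] = sum(abs(d) ** p for d in diffs) / len(subs)
        # sgn(x) = 1 for x >= 0 and -1 for x < 0, as defined in the text
        sv[i] = sum((1 if d >= 0 else -1) * abs(d) ** p for d in diffs) / len(subs)
    return v, sv

def weights(v, G_j, q=5):
    """Normalized weights of equation (2) for the assessors G_j of one submission.
    Assumes not all v(k) are zero, which the text guarantees when marks differ."""
    total = sum(v[k] for k in G_j)
    raw = {i: (1 - v[i] / total) ** q for i in G_j}
    norm = sum(raw.values())
    return {i: raw[i] / norm for i in G_j}

# Example with mu = 5 for both submissions: assessor 1 is always 2 above,
# assessor 2 alternates by ±2, assessor 3 matches mu exactly.
marks = {(1, 1): 7, (1, 2): 7, (2, 1): 3, (2, 2): 7, (3, 1): 5, (3, 2): 5}
mu = {1: 5, 2: 5}
v, sv = variances(marks, mu, p=2)
w = weights(v, [1, 2, 3], q=2)
```

With $p = 2$, the consistently generous assessor 1 gets $v = sv = 4$, while the erratic assessor 2 gets $v = 4$ but $sv = 0$, illustrating how the signed metric separates bias from arbitrariness.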
In our experiments, taking $q$ with order of magnitude equal to the average size of the sets $G_j$ worked very well. Finally, to obtain a correct weighted average, we must normalize the weights so that for each $s_j$, $\sum_{i \in G_j} w(i, j, \mu) = 1$. This explains the form of the weight formulas (2). Note that for two different submissions $s_{j_1}$ and $s_{j_2}$, the corresponding weights $w(i, j_1, \mu)$ and $w(i, j_2, \mu)$ for the same assessor $a_i$ may be different, because the sets $G_{j_1}$ and $G_{j_2}$ might not be the same.

The sum $f_j(\mu)$ can be seen as an adaptive weighted average of all marks given to $s_j$, in the sense that the weight $w(i, j, \mu)$ assigned to the marks of an assessor $a_i$ is inversely related to his share $v(i)$ in the total variance of all assessors who marked $s_j$. If the values $\mu_j$ satisfy $f_j(\mu) = \mu_j$ for all $j$, $1 \le j \le N$, we will have precisely the desired properties of the marking system, namely that the impact on the final marks of the marks assigned by assessors with larger variance in their marking is appropriately diminished. The weights appropriately reflect the reliability of the corresponding assessor. Consequently, if we define the operator $F(\mu) = (f_j(\mu))_{1 \le j \le N}$, then $\mu$ should be a fixed point of the operator $F$ in the hypercube $[0, R]^N$, i.e., $F(\mu) = \mu$ with $0 \le \mu_j \le R$.

If all assessors of a submission $s_j$ give the same mark, this mark is assigned to $s_j$ as $\mu_j$. Thus, we can assume that for every $j$, $1 \le j \le N$, there are at least two assessors $a_{i_1}$ and $a_{i_2}$ such that $m(i_1, j) \ne m(i_2, j)$. Note that for all such $j$ the denominators in equations (2) are non-zero, regardless of the value of $\mu_j$; consequently, all weights given by (2) are well defined, and for all $j \le N$:

$$
\min_{i \in G_j} m(i, j) \;\le\; f_j(\mu) = \sum_{i \in G_j} w(i, j, \mu) \cdot m(i, j) \;\le\; \max_{i \in G_j} m(i, j). \tag{3}
$$

The operator $F$ maps a compact and convex subset of $\mathbb{R}^N$, namely the $N$-dimensional cube $[0, R]^N$, into itself, and this mapping is continuous.
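One application of the operator $F$, together with a numerical check of the bound (3), can be sketched as follows. The dict layout and the parameter values $p = 2$, $q = 5$ are illustrative choices:

```python
from collections import defaultdict

def apply_F(marks, mu, p=2, q=5):
    """One application of F: mu -> (f_j(mu))_j, per equations (1) and (2)."""
    D, G = defaultdict(list), defaultdict(list)
    for (i, j) in marks:
        D[i].append(j)
        G[j].append(i)
    # v(i): average p-th power deviation of assessor i from mu (equation (1))
    v = {i: sum(abs(marks[(i, j)] - mu[j]) ** p for j in subs) / len(subs)
         for i, subs in D.items()}
    out = {}
    for j, g in G.items():
        total = sum(v[k] for k in g)
        if total == 0:                 # every assessor agrees with mu exactly
            out[j] = sum(marks[(i, j)] for i in g) / len(g)
            continue
        w = {i: (1 - v[i] / total) ** q for i in g}
        norm = sum(w.values())
        out[j] = sum(w[i] / norm * marks[(i, j)] for i in g)
    return out

# Three assessors mark two submissions; assessor 3 deviates wildly from mu0.
marks = {(1, 1): 6, (2, 1): 4, (3, 1): 9,
         (1, 2): 5, (2, 2): 6, (3, 2): 1}
mu0 = {1: 5.0, 2: 5.0}
mu1 = apply_F(marks, mu0)
```

Because each $f_j(\mu)$ is a convex combination of the marks given to $s_j$, the output always stays within the range (3) of the submitted marks.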
Thus, by the Brouwer Fixed Point Theorem, $F$ has a fixed point in $[0, R]^N$, which produces an assignment of marks satisfying the appropriate weighted average equations.

3. Experiments

The system of equations $f_j(\mu) = \mu_j$, $1 \le j \le N$, is a set of polynomial equations, and it can be solved efficiently using standard iterative procedures, with the simple arithmetic mean of the given marks $m(i, j)$ as the starting point for the iterations. In our tests, we used Wolfram's Mathematica software package to obtain a fixed-point solution. The results of one of our tests are presented in Figure 1.

[Figure 1: Simulation result of assessments of 32 students. The plot shows the marks assigned to each student under three schemes: Adaptive Average, Simple Average, and Control Average.]

We simulated a class of 32 students. Each submission was marked by all students except the author. The range of marks is from 1 to 10. The students are arranged so that the higher the student number, the better the ability. The first eight students are the weakest, and they colluded by giving each other 10 points and everyone else 1 point. In addition to being the students with the highest ability, the last three were also lazy assessors and assigned marks using a random number generator. Thus, we argue that the fair marks should be the average of the marks by all assessors, excluding the colluding and lazy ones. However, in practice it is hard to ascertain who colluded and who was careless in marking in order to eliminate the marks given by such assessors. Note that our algorithm does not require making any decisions regarding the quality of assessors, but instead relies on an essentially self-adapting averaging method. We can see that, for example, the first eight colluding students would have managed to significantly increase their marks, while also significantly reducing the marks of the best students, if the final marks were determined as simple averages of all given marks.
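The iterative fixed-point computation described above can be sketched in Python (the paper used Mathematica). The parameters $p = 2$, $q = 5$, the convergence tolerance, and the toy colluder-style data below are illustrative choices, not the paper's 32-student data:

```python
from collections import defaultdict

def adaptive_marks(marks, p=2, q=5, tol=1e-9, max_iter=1000):
    """Iterate mu -> F(mu), starting from the simple per-submission averages."""
    D, G = defaultdict(list), defaultdict(list)
    for (i, j) in marks:
        D[i].append(j)
        G[j].append(i)
    # starting point: simple arithmetic mean of the marks given to each submission
    mu = {j: sum(marks[(i, j)] for i in g) / len(g) for j, g in G.items()}
    for _ in range(max_iter):
        v = {i: sum(abs(marks[(i, j)] - mu[j]) ** p for j in s) / len(s)
             for i, s in D.items()}
        new_mu = {}
        for j, g in G.items():
            total = sum(v[k] for k in g)
            if total == 0:
                new_mu[j] = mu[j]
                continue
            w = {i: (1 - v[i] / total) ** q for i in g}
            norm = sum(w.values())
            new_mu[j] = sum(w[i] / norm * marks[(i, j)] for i in g)
        if max(abs(new_mu[j] - mu[j]) for j in mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Toy scenario: assessors 1 and 2 mark consistently; assessor 3 marks erratically.
marks = {(1, 1): 5, (1, 2): 5, (1, 3): 5,
         (2, 1): 5, (2, 2): 5, (2, 3): 5,
         (3, 1): 9, (3, 2): 1, (3, 3): 9}
final = adaptive_marks(marks)
# The simple average of submission 1 is (5 + 5 + 9)/3 ≈ 6.33; the adaptive
# average discounts assessor 3 and settles near 5.
```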
However, as can be seen from Figure 1, the algorithm dramatically reduces the effect of unworthy marks, both in terms of reducing the benefit to the colluding students and in terms of reducing the impact on the marks of good submissions.

After the system has been solved and the marks $\mu_j$ have been obtained, one can evaluate $v(i)$ and $sv(i)$ for all assessors $a_i$. From the statistics of these values, one can find assessors with large variance as well as groups that might have colluded.

[Figure 2: Variance analysis. The values $v(i)$ and $sv(i)$ plotted per assessor, with the colluders and the random assessors annotated.]

For peer marking, the weight equation can be modified as follows:

$$
w(i, j, \mu) = \frac{\Bigl( 1 - \frac{v(i)}{\sum_{k \in G_j} v(k)} \Bigr)^{q} \cdot \mu_i^{\,r}}{\sum_{l \in G_j} \Bigl( 1 - \frac{v(l)}{\sum_{k \in G_j} v(k)} \Bigr)^{q} \cdot \mu_l^{\,r}} \tag{4}
$$

where $r$ is a positive real number. This way we give more weight to assessors who themselves receive higher marks, under the assumption that a higher mark reflects increased competence in the subject and thus more reliable marking.

4. Conclusions

The same procedure applies to various other scenarios that involve a data fusion process. For example, we might have a network of wireless temperature sensors that report their readings to a central processing unit for monitoring an operating environment. Sensors might have variable accuracy over time, due to battery life, occasional loss of individual packets, intermittent exposure to direct sun, etc. Our adaptive averaging method can be used to estimate the temperature of the environment at a given moment from the individual readings, while appropriately discounting data from sensors with large variance from the estimated values. Trust development can be regarded as a form of data fusion for experiences. Details about using adaptive averaging for trust and reputation evaluation can be found in [2].

The parameters $p$ and $q$ in equations (1) and (2) can be used to tune the method for different setups. For example, if each submission is marked by a small number of assessors, taking a larger $p$ improves the rejection of anomalous marks.
Similarly, for a larger number of assessors, or if each assessor has marked a small number of submissions, taking a larger $q$ improves the performance.

References

[1] D. H. Griffel, Applied Functional Analysis, Dover Publications Inc., 1981.
[2] A. Ignjatovic, N. Foo, C. T. Lee, "An Analytic Approach to Reputation Ranking of Participants in Online Transactions", IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2008.