Judging bias in the 2018-2019 season--statisticsss yay



Ok, making a separate thread for this because I would like to keep updating this and keep it easy to follow.

 

Hi fellow satellites! Because I'm mildly insane, I decided to compile all of the judges' scores from the Grand Prix events in order to make it easier for fans to examine judges' records throughout this season and determine whether, and which, judges exhibited evidence of bias in their scoring. I figured I would post it now in anticipation of the Grand Prix Final, especially as some of these judges will inevitably show up on the judging panels there.

 

First, I created event reports which detailed how much a judge over- or under-scored a skater relative to the other judges, for every single skater and every single judge on the Senior Grand Prix. In these reports, I entered an abbreviated version of the protocols from each event and had my spreadsheet calculate how much each judge differed from the average of the other judges on the panel in three measures: Total Score (abbreviated TS), Average Raw GOE per element (out of 5), and Average Raw PCS per component (out of 10).*
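For anyone who wants to check the arithmetic outside the spreadsheet, the core deviation calculation is basically the following (a rough Python sketch with made-up names and a hypothetical panel, not my actual sheet formulas):

# Deviation of one judge from the average of the other judges on the panel,
# for a single skater. The same idea is used for total score, average GOE per
# element, and average PCS component mark -- just pass in the relevant numbers.
def deviation(scores_by_judge, judge):
    others = [s for j, s in scores_by_judge.items() if j != judge]
    return scores_by_judge[judge] - sum(others) / len(others)

# Hypothetical nine-judge panel's total scores for one skater (not real protocol numbers)
panel = {"J1": 288.79, "J2": 280.10, "J3": 275.30, "J4": 279.00, "J5": 277.50,
         "J6": 281.20, "J7": 276.80, "J8": 278.90, "J9": 274.30}
print(round(deviation(panel, "J1"), 2))  # how much J1 over- or under-scored vs. the rest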

 

Here are the reports:**

Skate America

Skate Canada (thanks @WinForPooh for her help with this one!)

Cup of Finland

NHK Trophy

Rostelecom

IdF

Grand Prix Final

 

So, for instance, taking the first entry of the first event (Skate America Men’s) as an example, Patty Klein (CAN) scored Nathan Chen 10.03 points higher than the average of the other judges, gave him GOE scores that were 0.58/5 points higher on average per element, and PCS scores that were 0.12/10 higher per component, whereas Stanislava Smidova scored him 6.9 points below the average of the other judges, gave him GOEs 0.37 lower, and PCS 0.13 lower.

 

Have a question about whether a certain judge was being unfair? Compare their scores to the other judges!

 

Additionally, I calculated the average nationalistic bias of each judge. More specifically, I determined the average TS/GOE/PCS difference for skaters sharing the judge's nationality and the average TS/GOE/PCS difference for skaters of a different nationality, and took the difference between the two (which is represented by the somewhat obscurely named DELTA in the event reports). You can understand this as the average number of "bonus" points a skater receives from a certain judge when they share a nationality with that judge. Once I did all these calculations, I started to compile them into a database of judges.
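Written out as code instead of a spreadsheet formula, the DELTA calculation is roughly this (again just a Python sketch with made-up names):

# deviations: list of (skater_nation, score_deviation) pairs for one judge at one event
def delta(deviations, judge_nation):
    same = [d for nation, d in deviations if nation == judge_nation]
    other = [d for nation, d in deviations if nation != judge_nation]
    if not same or not other:
        return None  # judge had no same-nation (or no other-nation) skaters to score
    mean_same = sum(same) / len(same)      # MEANSAME in the event reports
    mean_other = sum(other) / len(other)   # average deviation for everyone else
    return mean_same - mean_other          # DELTA: the average "bonus" for compatriots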

 

You can find the database of judges here.

 

In this database, you can find nationalistic biases averaged across all the events each judge judged. So for instance, if you look at the first entry as an example, you will see that Rebecca Andrew (AUS) scored Australian skaters virtually the same as she scored non-Australian skaters on average across all events. You can also see the Total Score average broken down by discipline, how many skaters she scored, as well as whether she was a relatively lenient or strict judge relative to the panels she's been on (she was slightly on the stricter side). Then, on each of the specific judges' pages, you can see the deviations of their scores from the other judges on the same panel, for each skater and each event they've judged so far on the Grand Prix. So if you want to take a closer look at any particular judge, you can!

 

Unfortunately, this database is not complete, as I still need to input roughly 40 or so out of the 100ish judges who judged on the Grand Prix (yeah, it’s a lot of work). However, I did get almost all of the judges from the big Federations (Russia, USA, Canada + Japan) as well as most of the judges who judged more than one or two events. So there should be plenty of points of interest.

 

General (if tentative) conclusions: Canadian judges are a mixed bag (though Canada doesn't exactly have a lot of very competitive skaters this year), with some quite biased ones and some quite fair ones. American judges are mostly bad, as are Russian judges. Japanese judges largely avoid being too biased, although one or two are edging a little close. Rounding out the smaller feds, Israeli, Chinese, and Ukrainian judges seem pretty terrible, but most of the other small feds are all over the place as far as bias is concerned.

 

In terms of total score bias per discipline, the biggest biases so far are, in order, 1. Dance 2. Men 3. Ladies 4. Pairs. Men above Ladies and Pairs makes sense because there are more points available for the judges to distribute in Men, but Dance has the least available points...

 

Some caveats:

1. The data set for some judges is rather limited, so I wouldn’t necessarily draw hard conclusions about judges who’ve only scored one or two of their own skaters.

2. Similarly, past judging behavior is not necessarily predictive of future judging behavior. Some judges who've shown fair behavior before (e.g. Agita Abele (LAT), whom I analyzed in a previous post that I'm too lazy to find right now) now appear to exhibit biased behavior, and a judge who behaved in a manner consistent with bias in the past may not exhibit the same bias in the future.

3. I’m not too sure that averaging total score across disciplines is very helpful as anything more than a crude picture of the bias of a judge. The GOE and PCS averages should be more comparable across disciplines, however. The numbers also behave a bit strangely when a judge has scored skaters from their own nationality in one discipline but not the other. Overall, however, I think the most informative numbers are the average bias a judge exhibits in a given discipline.

4. There may be a few errors in the protocols, as I have not had the time to proofread them carefully. However, I do think I’ve caught most of the large errors, so any remaining errors should not alter final numbers too much. If you spot an error, please let me know!

5. This set of data only tells you how much a judge over- or under- scores skaters relative to other judges. It does not tell you whether a specific skater is “correctly” scored by the judges as a whole. So it can’t really speak to questions like, for instance, whether Alina Zagitova or Shoma Uno or whomever are overscored, only whether a specific judge scored them above or below other judges and whether that might be related to nationality.

 

* Note that this is not a comparison between the judges’ scores and the official score. The formula is not Judge’s score minus official score, it’s Judge’s score minus the average of the other judges’ scores.

 

** I've also done some of the Senior Bs, though those numbers are not included in the judges database. For those interested, here are Autumn Classic, Ondrej Nepela, and half of Lombardia (singles disciplines only). I do intend to include them at some point, but there's just too much work. If anyone is interested in helping or wants to request a specific event from this season, please let me know.


Thank you for doing all this work, I find it fascinating! Also, big thanks for the explainer, I was wondering about a few things. Could you explain what the red, green, or uncolored cells mean? Different levels of statistical certainty? Also, any chance of making the first row float (called Freeze in Google Spreadsheets, under View)?


47 minutes ago, cirelle said:

Thank you for doing all this work, I find it fascinating! Also, big thanks for the explainer, I was wondering about a few things. Could you explain what the red, green, or uncolored cells mean? Different levels of statistical certainty? Also, any chance of making the first row float (called Freeze in Google Spreadsheets, under View)?

Hm, the first row should already float, it doesn't do that for you?

 

The red and green are just visual indicators of bias levels: red = bad and green = good. I somewhat arbitrarily set the red threshold at 0.4 for GOE and PCS and 6.5 points for total score, and green at less than 0.1 points for GOE and PCS and less than 1 point for total score.


Elaboration on how to read the statistics:

The basic idea is this. First, I created an array representing the differences between each judge's scores and the average of the other judges. This is what the numbers in the summary page of the event reports mean. So for instance, Patty Klein gave Nathan Chen a total score of 288.79, whereas the other judges gave Nathan an average of 278.76. Therefore Patty Klein "overscored" Nathan by 288.79-278.76=10.03 ("overscored" here isn't a judgment about the quality of her scoring, it's just a description of her score being higher than the other judges' on average). That's what the 10.03 you see under her column in the Nathan Chen row represents, and the same idea applies to the other cells.

Now of course, as gkelly pointed out, that number by itself doesn't tell you a lot. If Patty Klein is a lenient judge in general, then 10.03 may not be particularly unusual. Therefore, in order to contextualize these score deviations, I have included a number representing the judge's average score deviation from the other judges across the competition in the same discipline, labeled "MEAN". If we look at Patty Klein's average, we see that it is 0.06, which tells us that she doesn't have a particular tendency to over- or under-score in general, in contrast to, say, Alexei Beletski, who has an average of -6.81, which means he underscores skaters by about 7 points on average in comparison to other judges. So 10.03 is a bit high for Patty Klein, but we can also look at her other scores to see whether large over- or underscores are common for her. We can see that while 10.03 is Klein's biggest deviation, there are other deviations of similar magnitude, so maybe she just particularly likes Nathan Chen. (The calculations are then repeated for average GOE out of 5 per element and average component score out of 10 for PCS.)

So that's how to interpret the arrays of numbers next to skaters' names. Now, when it comes to calculating national bias, I do two things. First, I find the average of the score deviations of skaters of the same nation as the judge, and then I find the average of the score deviations of skaters of different nations from the judge. Let's take US judge Wendy Enzmann as an example this time (still using SkAm Men's as our example competition). Enzmann judged 3 US skaters at this competition: Nathan Chen, Vincent Zhou, and Jimmy Ma. Her score deviations were, respectively, 4.65, 3.56, and 1.58. So the average of her same-nation score deviations was (4.65+3.56+1.58)/3=3.26, which is the number listed under MEANSAME (...maybe I should have switched the stats labels to more transparent ones). This means that on average, she scored US skaters 3.26 points higher than other judges. Her average score deviation for non-US skaters, on the other hand, was 0.27, which means she scored non-US skaters 0.27 points higher than other judges. In order to calculate her bias, I found the difference between these numbers (this is given as DELTA in the competition reports but was renamed bias in the judge database...hmm, yeah, I really should have relabeled it). Enzmann's bias was 3.26-0.27=3.00 (accounting for rounding), which is fairly modest, especially for men's. You can think of these 3 points as the apparent bonus Nathan, Vincent, and Jimmy got from Enzmann for being American.

In the judges database, this information is all recorded. The judges database additionally combines data from different competitions, in order to get a bigger picture of a judge's biases. I'm still thinking about exactly how I want combining competition data to work, but for now I think the most helpful thing to look at is the bias for each discipline (labeled "Men bias", "Ladies bias", etc). These are calculated by pooling all the score deviations from all the competitions a judge has judged in a specific discipline, re-calculating the average score deviation for both home and non-home skaters over the combined data set, and then finding the difference between those averages again (now reported as "bias" instead of "delta"). This represents the average "bonus points" skaters of the same nationality as a judge got across competitions in the same discipline. I also combined data across disciplines in the "GOE bias", "PCS bias", and "Score bias" columns, though I'm a little conflicted about how well that worked.

I also color coded it (somehow I forgot to mention this)--red means that the bias level was fairly high (dark red means even higher), green means the bias level was low. I used 0.1, 0.4, and 0.8 as cutoffs for GOE and PCS (<0.1 = green, between 0.1 and 0.4 = colorless, between 0.4 and 0.8 = light red, higher than 0.8 = dark red) and 1, 6.5, and 13 as cutoffs for total score, but those numbers are a little arbitrary (I don't have all the judges' data together yet, so it's a little difficult to make a non-arbitrary determination at this point) and it's mainly supposed to serve as a visual aid. This is a work in progress, so I'll probably wind up changing things once my judge data is all compiled into the database.
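To put the database logic in one place, the per-discipline pooling and the color cutoffs amount to roughly this (a Python sketch with made-up names and structures; the real thing is just spreadsheet formulas, and the cutoffs are the arbitrary ones above):

# records: (discipline, skater_nation, total_score_deviation) tuples for one judge,
# pooled over every competition they've judged
def discipline_bias(records, judge_nation, discipline):
    pool = [(nation, d) for disc, nation, d in records if disc == discipline]
    same = [d for nation, d in pool if nation == judge_nation]
    other = [d for nation, d in pool if nation != judge_nation]
    if not same or not other:
        return None
    return sum(same) / len(same) - sum(other) / len(other)

def color(bias, cutoffs=(1, 6.5, 13)):  # total score cutoffs; use (0.1, 0.4, 0.8) for GOE/PCS
    if bias < cutoffs[0]:
        return "green"
    if bias < cutoffs[1]:
        return "colorless"
    if bias < cutoffs[2]:
        return "light red"
    return "dark red"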


59 minutes ago, shanshani said:

Hm, the first row should already float, it doesn't do that for you?

 

The red and green are just visual indicators of bias levels: red = bad and green = good. I somewhat arbitrarily set the red threshold at 0.4 for GOE and PCS and 6.5 points for total score, and green at less than 0.1 points for GOE and PCS and less than 1 point for total score.

 

Hmm, I get the links to the judges floating, but not the first row. Might be bc I’m on mobile? 

 

Cool, good to know!


This is great! I've been looking for this kind of data in the earlier project with another satellite here.

Would you like to collaborate on this? I'd love to analyze this dataset.

Some questions:

a/ The judge spreadsheet that you uploaded was averaged over all skaters in all competitions that this judge participated in? Are there judges who appeared in multiple competitions? Did you make a distinction between seniors and juniors?

b/ Do you have the *raw* data (one judge delta for each performance, with a label as to which performance it is)? What's the raw format like? (Another spreadsheet? Or some database?)

 

I'd like to work with the raw data if possible. (I have a stats background and worked on some collaborative filtering problems before (ie: how Netflix recommends movies to us). This problem has that flavor, so probably the same math would give us interesting insights).

@shanshani Please feel free to PM me if you want to talk more about it -- or, we can discuss it openly here and hopefully other techie satellites can join us. I'm excited!


11 hours ago, wingman said:

This is great! I've been looking for this kind of data in the earlier project with another satellite here.

Would you like to collaborate on this? I'd love to analyze this dataset.

Some questions:

a/ The judge spreadsheet that you uploaded was averaged over all skaters in all competitions that this judge participated in? Are there judges who appeared in multiple competitions? Did you make a distinction between seniors and juniors?

b/ Do you have the *raw* data (one judge delta for each performance, with a label as to which performance it is)? What's the raw format like? (Another spreadsheet? Or some database?)

 

I'd like to work with the raw data if possible. (I have a stats background and worked on some collaborative filtering problems before (ie: how Netflix recommends movies to us). This problem has that flavor, so probably the same math would give us interesting insights).

@shanshani Please feel free to PM me if you want to talk more about it -- or, we can discuss it openly here and hopefully other techie satellites can join us. I'm excited!

a) Yes, judges' scores were averaged over multiple competitions, both within and across disciplines, if a judge judged more than one competition, although I'm still debating whether averaging over multiple disciplines provides any value. Only senior Grand Prix data was used for the spreadsheet--I might add juniors at some point, but I'll have to think about how I want to handle that data. Right now the focus is on finishing the judges database with all the Grand Prix data and then adding other senior international competitions like the Senior Bs.

 

b) If you click the judge names, the raw deviations for each skater in each competition are available on a separate sheet for each judge. You can also open the competition spreadsheets and find the same data there, in addition to the protocol data from which the deviations were computed.

 

Last thing: that's great! You'd probably do a better job than me, since I'm a mere amateur (like... I have to enter data by hand because I don't know how to program a scraper that could input protocols automatically). What qualifies me is the motivation to think of and do the project and the willingness to put in the work, more than any statistical or technical background, haha.


It's *amazing* that you entered the stuff by hand. :bow::2thumbsup:

Did you just get the data from the ISU PDF protocols? (And did you enter that raw data by hand, or did you have a script to read it off the PDF files?)

 

Let me make sure I understood the definitions of the variables:

- TES/GOE/PCS diff of judge A on skater B means:

* for each skate (short and free)

* take judge A's score on skater B's performance (eg: TES), then subtract off the *average* score of the other judges

?

That's your raw data, and afterwards the other cells are your calculations?

 

I'd like to get *raw* data (ie: no averaging, no extra statistics, as much info recorded as possible). So for example, ideally for me, I want to get the entire ISU protocols in raw format. We had this data before (for the old rules / pre-2018-2019 seasons), but we didn't have the judges' names, so we couldn't do the judge bias analysis.

 

Sorry for these questions - it's actually easier for me to work on the raw data directly, as opposed to the averaged stats. :) I like your analysis very much though, and will definitely check in with you to get intuition.

 

 

 


3 minutes ago, wingman said:

It's *amazing* that you entered the stuff by hand. :bow::2thumbsup:

Did you just get the data from the ISU PDF protocols? (And did you enter that raw data by hand, or did you have a script to read it off the PDF files?)

 

Let me make sure I understood the definitions of the variables:

- TES/GOE/PCS diff of judge A on skater B means:

* for each skate (short and free)

* take judge A's score on skater B's performance (eg: TES), then subtract off the *average* score of the other judges

?

That's your raw data, and afterwards the other cells are your calculations?

 

I'd like to get *raw* data (ie: no averaging, no extra statistics, as much info recorded as possible).

Sorry for the question - it's actually easier for me to work on the raw data directly, as opposed to the averaged stats. :) I like your analysis very much though, and will definitely check in with you to get intuition.

 

 

 

Data is from skatingscores.com (which I'm pretty sure just scrapes the protocols and then runs a formula to get each judge's total score). I'm currently working on importing GPF data, so I would be looking at pages like this.

 

I don't look at TES; TS is total score (i.e. the number of points the skater scored in the competition, except according to each judge instead of the official score). But you basically have it right. I take the total score for each judge and then subtract the average of the other judges to get the score differences. But I don't do that by hand--I record each total score exactly as skatingscores.com reports it, and the Excel sheet does the relevant calculations. For GOE, I write down all the raw GOEs for each judge, and then the Excel sheet calculates the average GOE the judge gave each skater first and then does the same set of calculations. For PCS, it first de-factors the scores and converts each judge's PCS for a skater into an average component mark out of 10. (The idea here is that total score differences aren't that comparable across disciplines, but raw GOE/PCS differences should be more comparable.)
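If it helps, the de-factoring step amounts to something like this (a rough sketch -- the factor values are the 2018-19 ones as I remember them, so double-check against the ISU communications before relying on it):

# Turn a judge's factored PCS total for one program back into an average
# component mark out of 10 (five components in 2018-19).
PCS_FACTORS = {  # program factors, from memory -- verify before use
    ("men", "SP"): 1.0, ("men", "FS"): 2.0,
    ("ladies", "SP"): 0.8, ("ladies", "FS"): 1.6,
    ("pairs", "SP"): 0.8, ("pairs", "FS"): 1.6,
    ("dance", "RD"): 0.8, ("dance", "FD"): 1.2,
}

def avg_component_mark(factored_pcs, discipline, segment, n_components=5):
    return factored_pcs / PCS_FACTORS[(discipline, segment)] / n_components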

 

Basically, in terms of input, I just feed in an abbreviated version of the protocols (total score per judge, all the GOEs, and then the total PCS). The sheet then processes that to produce score differences per skater per judge, the mean score difference across the competition for each judge, the mean score difference for same-nation skaters, and the mean score difference for different-nation skaters.


Thanks. Let me contact skatingscores.com directly and see what sort of existing data / analysis / model are there.

A model I can think of would be able to make statements like this:

* it is likely that judge X favors skaters (S1, S2, S3, ...) in TS, who all share the following common attributes (A1, A2, etc.)

That's probably the kind of thing that we want to look for, right?

 

With more detailed data we can even say

* it is likely that judge X gives high GOE on the Salchows of skaters (S1, S2, S3, ...), whose jumps all have the following common attributes (A1, A2, etc.)

 

So that's why I want to have as detailed data as possible. Though, given how noisy figure skating scores are, I think the bias has to be rather huge for a statistical model to pick it up.


40 minutes ago, wingman said:

Thanks. Let me contact skatingscores.com directly and see what sort of existing data / analysis / model are there.

A model I can think of would be able to make statements like this:

* it is likely that judge X favors skaters (S1, S2, S3, ...) in TS, who all share the following common attributes (A1, A2, etc.)

That's probably the kind of thing that we want to look for, right?

 

With more detailed data we can even say

* it is likely that judge X gives high GOE on the Salchows of skaters (S1, S2, S3, ...), whose jumps all have the following common attributes (A1, A2, etc.)

 

So that's why I want to have as detailed data as possible. Though, given how noisy figure skating scores are, I think the bias has to be rather huge for a statistical model to pick it up.

BTW, they once yelled at me for scraping their data. So lol. Pretty sure they scrape their data from the ISU anyway? Just FYI in case you mention scraping or anything to them.


@yuzuangel Interesting. Well, I just wrote to them directly asking for permission to use their data. (Not that they own the data, honestly, but I'm OK with giving them some credit for doing the scraping work.) We'll see. If they say no, I'll write to the ISU to ask for data. And if that fails, then I'll scrape.

The more I think about it, the more I'm convinced that a simple collaborative filtering model with a "judge bias" term, together with a low-rank factorization on the performances' attributes, would work well. I'm excited. :)
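Roughly the kind of thing I mean, as a sketch (this is just the standard biased matrix factorization setup with made-up parameter choices -- nothing figure-skating-specific yet; judge and performance indices are assumed to be 0-based integers):

import numpy as np

# Model: score(judge j, performance p) ~ mu + judge_bias[j] + perf_bias[p] + U[j].V[p]
# Fit by plain SGD over (judge, performance, score) triples.
def fit(triples, n_judges, n_perfs, k=5, lr=0.01, reg=0.05, epochs=50):
    rng = np.random.default_rng(0)
    mu = sum(s for _, _, s in triples) / len(triples)
    judge_bias, perf_bias = np.zeros(n_judges), np.zeros(n_perfs)
    U = rng.normal(0, 0.1, (n_judges, k))
    V = rng.normal(0, 0.1, (n_perfs, k))
    for _ in range(epochs):
        for j, p, s in triples:
            err = s - (mu + judge_bias[j] + perf_bias[p] + U[j] @ V[p])
            judge_bias[j] += lr * (err - reg * judge_bias[j])
            perf_bias[p] += lr * (err - reg * perf_bias[p])
            U[j], V[p] = U[j] + lr * (err * V[p] - reg * U[j]), V[p] + lr * (err * U[j] - reg * V[p])
    return mu, judge_bias, perf_bias, U, V

A national-bias term could then be bolted on top (e.g. an extra parameter that only switches on when judge and skater share a federation), which is the part we'd actually care about here.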


Yeah, once the data set starts to build up for an individual judge, it's pretty easy to ferret out bias. My method actually probably understates the magnitude of the problem, since cases where judges underscore skaters who are directly competing with their own kind of get drowned out because they're averaged in with cases where the judges don't have any incentive to underscore.

 

The data is all there and publicly available, it just has to be put together into one easily accessible and usable format. Once that happens, you just have to determine what you want to test for and what metric you're going to use, and apply it to the data set. Plus, a lot of the process should be able to be automated (I just don't have the technical chops to do that). You can look for stuff besides national bias--for instance, once I'm done putting together my judges database for this season, I think I'm also gonna take a look at bloc scoring. Based on my impressions, it should be pretty easy to establish that former Soviet bloc countries' judges' scores are correlated with Russian judges' scores (or something along those lines--it might be easier to look at whether former Soviet bloc judges overscore Russian skaters).
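For the bloc scoring idea, a first pass could be as simple as something like this (a Python sketch with a hypothetical data layout, not anything I've actually run):

import numpy as np

# Paired deviations for the same performances: one list for ex-Soviet-bloc judges,
# one for Russian judges (each entry = that judge's deviation from the rest of the panel).
def bloc_correlation(bloc_devs, russian_devs):
    return np.corrcoef(bloc_devs, russian_devs)[0, 1]

# Or, for the second framing: do ex-Soviet-bloc judges overscore Russian skaters?
# records: (judge_is_ex_soviet_bloc, skater_is_russian, score_deviation) tuples
def bloc_bonus_for_russians(records):
    rus = [d for bloc, is_rus, d in records if bloc and is_rus]
    non = [d for bloc, is_rus, d in records if bloc and not is_rus]
    return sum(rus) / len(rus) - sum(non) / len(non)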

