The Judging Process
After Season 1, we sent a survey to Buzzword players. Many respondents commented on the answer-judging process and its pace. While we understand that the judging process takes some time, we believe it is the best feasible way to evaluate answers. Here is an outline of the answer-judging process and the rationale behind it:
Before a Buzzword game opens for play, our software automatically generates likely acceptable and unacceptable answers. Judges review these, verify whether each is indeed correct or incorrect, and adjust the rulings accordingly. Judges also add any likely acceptable answers they can think of that weren’t automatically generated, e.g., WW1 for World War I.
When a player types an answer to a Buzzword question, the answer is automatically compared to every correct and incorrect answer that has already been ruled on for that question (including the automatically generated answers, answers manually added by judges, and answers already given by other players). If there is an existing ruling, the same judgment (i.e., correct or incorrect) is applied to the answer given. Otherwise, that answer is tentatively marked incorrect and flagged for a judge to review.
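A minimal sketch of this lookup step, assuming each question keeps a table of rulings keyed by answer text. The names `Question`, `normalize`, and `submit` are hypothetical, and the whitespace and case normalization is an assumption made for illustration; nothing here describes how Buzzword actually stores or compares answers.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    # Maps an answer string to its ruling: True = correct, False = incorrect.
    # Holds generated answers, judge-added answers, and ruled player answers.
    rulings: dict[str, bool] = field(default_factory=dict)
    # Answers that have no ruling yet and are waiting for a judge.
    flagged: set[str] = field(default_factory=set)


def normalize(answer: str) -> str:
    # Assumption: ignore surrounding/repeated whitespace and letter case.
    return " ".join(answer.split()).casefold()


def submit(question: Question, answer: str) -> bool:
    """Return the ruling for a player's answer (tentative if unreviewed)."""
    key = normalize(answer)
    if key in question.rulings:
        # An existing ruling is applied unchanged.
        return question.rulings[key]
    # No ruling yet: tentatively incorrect, and flagged for a judge.
    question.flagged.add(key)
    return False
```

With `Massachusetts` already ruled correct, `submit(q, "Massachusetts")` returns `True` immediately, while a first-time `Masachusets` returns `False` and lands in `q.flagged` for review.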
While a Buzzword game is open for play, judges review newly submitted answers several times per day, examining answers that have been flagged for review. If a flagged answer is correct (e.g., because it is within an acceptable typo of an expected answer), the ruling is changed and the scores of players who gave that answer are updated. If another player later gives the same answer, their answer will be automatically ruled correct too.
For example, if a question’s desired answer is Massachusetts and a player answers Masachusets, the player will initially be ruled incorrect. Soon, likely within a few hours, a judge will review this answer and mark it correct (as it is a phonetic equivalent of the expected answer). After that, all future answers of Masachusets will automatically be marked correct without the judge needing to review it again. Likewise, if someone answers that question with Iowa and the judge marks it incorrect after review, every other answer of (exactly) Iowa will be ruled incorrect without the judge needing to review it again.
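The review step in this example can be sketched as follows. The function `apply_judge_ruling`, the `pending` table of players waiting on each flagged answer, the player names, and the 10-point question value are all hypothetical; this is an illustration of the idea under those assumptions, not a description of Buzzword's actual code.

```python
def apply_judge_ruling(rulings, pending, scores, answer, is_correct, points=10):
    """Record a judge's ruling and update players who were tentatively marked wrong."""
    # Store the ruling so future identical answers are judged automatically.
    rulings[answer] = is_correct
    # Players who gave this answer before it was reviewed.
    waiting = pending.pop(answer, [])
    if is_correct:
        for player in waiting:
            # They were tentatively scored as incorrect, so scores only go up.
            scores[player] = scores.get(player, 0) + points
    return waiting


rulings = {"Massachusetts": True}
pending = {"Masachusets": ["alice", "bob"], "Iowa": ["carol"]}
scores = {"alice": 0, "bob": 20, "carol": 10}

apply_judge_ruling(rulings, pending, scores, "Masachusets", True)
apply_judge_ruling(rulings, pending, scores, "Iowa", False)

print(scores)   # {'alice': 10, 'bob': 30, 'carol': 10}
print(rulings)  # {'Massachusetts': True, 'Masachusets': True, 'Iowa': False}
```

Because both rulings are now stored, a later answer of Masachusets or Iowa needs no further judge attention.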
In practice, each Buzzword question often ends up with over 200 different answers submitted by players—some just plain wrong, some probable misspellings that are too far off to be acceptable, some misspellings that are acceptable, some alternate versions of the expected answer, etc. Because judges are dealing with so many answers, they sometimes make mistakes. We provide the protest system so players can bring possible errors to our attention.
When NAQT began developing Buzzword, we decided early on that we wanted this to be a serious competition. NAQT has always had fairly rigorous standards about what constitutes a correct answer. One of our goals in development was to craft a set of rules governing acceptable answers that allowed us to maintain these standards while acknowledging the inherent differences between oral and written answer submission. For example, the rules about typos have no direct parallel in face-to-face quiz bowl, but they are similar to the rule about accepting phonetic equivalents (e.g., a pronunciation of go-ETH-uh for Goethe).
The controlling decision for the entire process of judging is how to handle unexpected answers. Assuming we want human review (which we do), we thought of three plausible possibilities:
1. Tentatively rule all answers correct, then review them manually.
In many cases, this review process would result in players being ruled incorrect and their score and position in the rankings being lowered. We feel that is a worse experience than the option we chose.
2. Guess (via algorithm) whether answers are correct, make a tentative ruling, then review answers manually.
This is plausible and a possible area of future work, but it’s very challenging to do well. For example, Elizabeth I and Elizabeth II are only one inserted/deleted character away from each other, but neither would ever be acceptable when the other is desired. We can develop more sophisticated algorithms to account for issues like that, and at some point we may, but we have prioritized other feature development (such as the overall creation of Buzzword and the introduction of leagues in Season 2).
Furthermore, no matter how sophisticated the algorithm, some of its guesses will inevitably be mistaken. When the algorithm mistakenly rules an answer correct, this still leads to the aforementioned poor situation of a player’s score decreasing.
3. Tentatively rule answers incorrect, then review them manually.
We chose this solution. It results in the possibility of a player being ruled correct after their game is over and their score increasing, which we feel is a better experience for players than the reverse situation found in solution (1).
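To illustrate why the second option (algorithmic guessing) is hard to do well, here is a small sketch using Levenshtein edit distance, a standard measure of how many single-character insertions, deletions, or substitutions separate two strings. It is our own illustrative choice of metric, not an algorithm Buzzword is known to use, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Elizabeth I", "Elizabeth II"))    # 1
print(edit_distance("Massachusetts", "Masachusets"))   # 2
```

A simple rule such as "accept anything within two edits" would correctly accept Masachusets for Massachusetts, but it would also accept Elizabeth II for Elizabeth I, which is a different person. Distance alone cannot separate the two cases, which is why human review remains part of the process.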
That said, the written format of Buzzword is new to us too, and we are continually evaluating our rules and procedures in hopes of improving the player experience. We have introduced some changes to acceptability rules for Season 2 (becoming a bit more lenient about typos).
We want to continue making Buzzword a rigorous, fun competition and are always open to feedback at firstname.lastname@example.org. We hope you are enjoying the game!