People often use terms such as "maybe", "probably", "likely", or "possibly" when describing the probability of an event occurring. The problem is that people differ wildly in what they actually mean by these terms. To measure whether people are good or bad forecasters, we therefore have to quantify their predictions: all answers must be given as percentages. If participants are asked to give the likelihood of a specific event occurring, and the event does happen, then the person who put the probability at 90% will get a better score than the person who answered 60%.
The Brier score was originally used to quantify the accuracy of weather forecasts, but it can be used to describe the accuracy of any probabilistic forecast. The Brier score measures how far away from the truth a forecast was.
The Brier scale goes from 0 to 2, where a lower score is better. A score of 0 means you were completely right; it is achieved when you predict something to happen with 100% certainty and it occurs.
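For a yes/no question, the score described above can be sketched as follows, assuming the classic two-category formulation of the Brier score (which is what gives the 0 to 2 range); the function name and signature are illustrative, not Empiricast's actual code:

```python
def brier_score(forecast: float, outcome: bool) -> float:
    """Two-category Brier score on the 0-2 scale.

    `forecast` is the stated probability that the event occurs.
    The score sums the squared errors over both categories
    (event happens / event doesn't happen), so a 100% forecast
    of an event that occurs scores 0, and a 100% forecast of an
    event that doesn't occur scores the maximum of 2.
    """
    o_event = 1.0 if outcome else 0.0
    o_no_event = 1.0 - o_event
    p_event = forecast
    p_no_event = 1.0 - forecast
    return (p_event - o_event) ** 2 + (p_no_event - o_no_event) ** 2
```

This also illustrates the earlier example: if the event happens, a 90% forecast scores 0.02 while a 60% forecast scores 0.32, so the more confident correct forecaster gets the better (lower) score.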
Accuracy is calculated over the lifetime of a question, with a Brier score calculated for every day on which the forecaster had an active forecast.
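One way to read the daily scoring above is as an average of per-day Brier scores over the question's lifetime. The sketch below assumes the daily scores are combined by a simple mean and that the forecaster's active probability for each day is available as a list (both assumptions, since the text doesn't specify the combination rule):

```python
def brier_score(forecast: float, outcome: bool) -> float:
    # Two-category Brier score on the 0-2 scale.
    o = 1.0 if outcome else 0.0
    return (forecast - o) ** 2 + ((1.0 - forecast) - (1.0 - o)) ** 2

def question_brier(daily_forecasts: list[float], outcome: bool) -> float:
    """Combine the daily Brier scores over a question's lifetime.

    Each entry in `daily_forecasts` is the probability the
    forecaster had active on that day; days without an active
    forecast are simply absent from the list.
    """
    scores = [brier_score(p, outcome) for p in daily_forecasts]
    return sum(scores) / len(scores)
```

Note that this rewards updating early: a forecaster who moves toward the correct answer sooner accumulates lower daily scores.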
Even though a person's average Brier score is an indication of their general forecasting ability, the Brier score is affected by the difficulty of the questions, and is therefore not suitable for comparing accuracy between people. To determine a person's relative performance we calculate the accuracy score.
Accuracy score = Brier score - Median brier score
Where the median Brier score is the median of all participants' Brier scores on a specific question. A negative accuracy score therefore means the person predicted better than the median participant on that question.
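The formula above is a one-liner in code. This sketch uses the standard-library `statistics.median`; the function name is illustrative:

```python
from statistics import median

def accuracy_score(own_brier: float, all_briers: list[float]) -> float:
    """Accuracy score: own Brier score minus the median Brier
    score of all participants on the question. Negative means
    better than the median participant."""
    return own_brier - median(all_briers)
```

Subtracting the per-question median is what normalizes away question difficulty: on a hard question everyone's Brier scores are high, so the median is high too, and a merely decent forecast can still earn a negative (good) accuracy score.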
To calculate the total score, the accuracy scores from all questions are added together into a person's cumulative accuracy score. On questions where the person didn't make a forecast, the accuracy score is set to 0.
Since neither the Brier score nor the accuracy score is intuitive on its own, we mostly present scores as relative performance within a team, both for specific questions and in total.
Given how lacking most decision-making processes in organizations are, just using a system for collecting and aggregating predictions is in and of itself a big step towards improving them. Empiricast uses several methods for aggregating individual predictions in a way that produces a better forecast than the simple mean of the crowd. We build our algorithms on results from a range of relevant research, and on insights from forecasting competitions such as the Good Judgement Project.
Some of the methods we apply to create a better forecast are:
Factors such as the number of forecasters per question and the number of previous forecasts per participant will affect the degree to which algorithms can be applied to improve the forecast.
Our aggregation algorithms will be continuously developed and updated as we learn from the data over time.
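To make the idea of beating the crowd mean concrete, here is one aggregation technique studied in the forecasting literature (including work around the Good Judgement Project): extremizing the mean forecast in odds space, since crowd averages tend to be under-confident. This is a generic sketch of that technique, not Empiricast's actual algorithm, and the exponent is purely illustrative:

```python
def extremized_mean(probs: list[float], a: float = 2.5) -> float:
    """Aggregate probabilities by averaging, then pushing the
    result toward the nearer extreme (0 or 1).

    Works in odds space: odds are raised to the power `a` > 1
    and converted back to a probability. Assumes every input
    probability is strictly between 0 and 1.
    """
    p = sum(probs) / len(probs)
    odds = (p / (1.0 - p)) ** a
    return odds / (1.0 + odds)
```

For example, a crowd averaging 70% would be extremized to roughly 89% with this exponent, while a perfectly split crowd at 50% stays at 50%. In practice, factors like the ones mentioned above (number of forecasters, each forecaster's track record) would govern whether and how strongly such a transformation is applied.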