Problem setter’s handbook - Writing judges


System verifies the correctness of the submissions using components called judges. There are two types of judge programs:

  • test case judge – responsible for assessing the submission in the context of a single test case,
  • master judge – responsible for summarizing the results from individual test cases.

The problem setters have at their disposal built-in judge programs, which in many cases are sufficient. For the needs of other, less typical, problems there is the possibility of implementing judges on your own. This is possible in each programming language offered by the Sphere Engine system.

Implementation of a test case judge

The task of a test case judge is to evaluate the submission in the context of a single test case.

The assessment of the submission consists in determining whether it is correct (by establishing the so-called "status") and – optionally – in giving a score. The judge program makes a decision based on the test case data and the result of executing the user's program.
Dependencies are depicted in the diagram below:

Test case judge interface

The test case judge has access to several data streams. Each stream is accessible through a so-called file descriptor which is an integer.

Streams for reading (input streams)

Name File descriptor
Test case input data 0
Test case output data 4
User's program output data 3
Source code of the user's program 5

Streams for writing (output streams)

Name File descriptor
Score 1
Auxiliary/debug stream 6
User's program input data
(available only for interactive problems)
8

Validation of the submission

The correctness of the submission in the context of a test case is determined by the so-called status. Status information is determined by using the end signal of the program.

The value of the end signal is interpreted as follows:

  • 0 – the submission correctly solves the test case (accepted),
  • 1 – the submission is incorrect (wrong answer).

For example, in C, the end signal is the value returned by the main function of the program (i.e. the function named "main").

Note: The test case judge is only able to specify two of many different statuses (i.e. accepted or wrong answer). Other statuses (e.g. compilation error, runtime error, time limit exceeded) are set automatically by the Sphere Engine system.

Score

In the case of optimization problems or problems assessed partially, an important factor in addition to the correctness of the solution is its quality. In the case of problems of this type, the author of the judge program has the ability to determine the score which is a number (potentially a floating point number).

The score should be written to the appropriate stream (descriptor number 1). No other information should be written to the stream with the score.

In many cases, there is no need to determine the score. Very often verification of the submission is limited to determining whether it is correct or incorrect.

Examples

The sample source codes of test case judges are available on GitHub:

Example 1: Impossible model output file

Problem description:
For given function f find its root i.e. the argument x0 that f(x0) = 0.

In general there are many solutions to the problem, for example for polynomial x2 + x - 2 the numbers 1 and -2 are both correct answers.

You can see that it is hard to prepare model output file in test case. There are possibly infinitely many solutions for the certain functions thus it is impossible to keep all of them in the output file. It forces us to use different approach.

Judge description:
The test case judge should verify the condition from the problem task i.e. for the user's answer from the output file it should check if that answer is a root of the function. Test case judge uses his access to model input file to read the problem instance.

Example 2: Ambiguous model output file

Problem description:
For given graph G with n vertices 1, 2, ..., n determine if it has hamiltonian cycle (i.e. closed loop through a graph that visits each node exactly once). If the hamiltonian cycle exists print it as a sequence of vertices.

It is easy to see that 1-2-3-1 is the same cycle as 2-3-1-2. We could add the requirement to start with the smallest vertex number. Unfortunately it is possible that there exists many different hamiltonian cycles which are not cyclic shifts. We could again use the previous approach and verify if user's answer is really hamiltonian cycle. Alternatively we can build model output file with all possible hamiltonian cycles:

Judge description:
For user's answer the judge looks for that specific one on the list contained in model output file.

Caution

It can be problematic to keep all answers due to possible huge number of good solutions.

Implementation of the master judge

The master judge is responsible for summarizing the results of individual test cases. Based on partial results it determines the final verdict consisting of the following parameters:

  • the final status of the submission (e.g. accepted or wrong answer),
  • score,
  • execution time,
  • memory consumption.

The final verdict is determined based on the data from the test cases and (if requirements or restrictions are introduced) also based on the source code of the user's program.
Dependencies are depicted in the diagram below:

Master judge interface

The master judge (like the test case judge) has access to several data streams. Each stream is accessible through a so-called file descriptor which is an integer.

Streams for reading (input streams)

Name File descriptor
Results of test cases 0
Source code of the user's program 5

Streams for writing (output streams)

Name File descriptor
Final results 1
Auxiliary/debug stream 6

The format of results from individual test cases

The partial results obtained during the processing of consecutive test cases are recorded in the appropriate stream (descriptor number 0). The data of each test case is written in separate line of the stream and takes the following form:
N SCORE SIGNAL TIME MEMORY STATUS
For example:
0 AC 10 0 0.57 8096

The individual parameters have the following meaning and format:

Name Meaning Format
N test case number integer
the numbering starts with the number 0
STATUS verdict the following abbreviations are used:
  • AC​ - accepted
  • WA​ - wrong answer
  • TLE​ - time limit exceeded
  • RE​ - run time error
  • CE​ - compilation error
  • IE​ - internal error
SCORE score a number in the X.xx format (e.g. 2.00, 1345.00, 12)
SIGNAL the end signal returned by user's program an integer
TIME CPU time a number of seconds in the X.xx format (e.g. 2.00, 1345.00, 12)
MEMORY maximum memory consumption a number of kilobytes (integer)

Important: Obtaining a detailed report from processing a submission requires the partial results to be redirected to the auxiliary stream (descriptor number 6). The data must retain the following format:
test N - STATUS (score=SCORE, sig=SIGNAL, time=TIME, mem=MEMORY)
For example:
test 0 - AC (score=10, sig=0, time=0.57, mem=8096)

The final result and its format

The author of the master judge is free to determine the parameters of the final result. For example, the final submission execution time can be determined as the sum of the execution times of individual test cases. Alternatively, the final execution time can be determined as an average value.

Final result should be written to the appropriate stream (descriptor number 1). The data format is as follows:
STATUS SCORE SIGNAL TIME MEMORY
For example:
AC 10 0 0.57 8096

Individual parameters have meaning and a format similar to the parameters from the previous table. Nevertheless, all values ​​are calculated by the master judge in the manner specified by its author:

Name Meaning Format
STATUS final verdict the following abbreviations are used:
  • AC​ - accepted
  • WA​ - wrong answer
  • TLE​ - time limit exceeded
  • RE​ - run time error
  • CE​ - compilation error
  • IE​ - internal error
SCORE final score a number in the X.xx format (e.g. 2.00, 1345.00, 12)
SIGNAL the end signal an integer
TIME execution time a number of seconds in the X.xx format (e.g. 2.00, 1345.00, 12)
MEMORY memory consumption a number of kilobytes (integer)

Examples

Weighted % of correctly solved sets

Here we present the generalization of Score is % of correctly solved sets master judge. It was a little disadvantage that each test case is worth the same and to increase to influence of some submission's aspect you were forced to produce many test cases.

For example when your test cases verify three aspects A, B and C of the problem and you would like to put weights 20%, 30% and 50% respectively, you were able to do that by creating 10 test cases. Two of them responsible for an aspect A, three of them responsible for an aspect B and five of them responsible for an aspect C. However it is inconvenient and you can consider following idea:

Master judge description:
Master judge has the information about the number of test cases and weights which it should assign to each test case. The final score is the weighted sum of accepted test cases.

For example for three test cases a,b,c and weights 20%, 30%, 50% the submission gets one of the possible results depending on passed test cases:

  • no test case passed - 0%
  • a - 20%
  • b - 30%
  • a,b - 50%
  • c - 50%
  • a,c - 70%
  • b,c - 80%
  • a,b,c - 100%
Forbidden structures in source code

The problem setter may require that the solution cannot use some programming structures. For example he may want to allow to use language C++ but with no access to STL library to force users to implement efficient data structures manually. Another example is to restrict source codes to not use loop structures to support only solutions based on recursion.

Master judge description:
Master judge uses access to the user's source code to detect usages of forbidden keywords (for example loops: while, for, goto). When forbidden keyword is detected the final status is set to wrong answer in other case the master judge performs classical verification (for example the same as Generic masterjudge).
Used memory limitations

We cannot directly support memory limit due to the reasons explained in testing the memory complexity of algorithms appendix. To make possible to bound the amount of available memory one can implement master judge for that purpose.

Master judge description:
Master judge gathers the information of used memory from test case judges and takes the maximum value as the result (note that this is the behavior of default master judges). We verify the memory limit with respect to the user's solution programming language to adjust the master judge for all programming languages we allow to use.

Important

It's very important to adjust the memory limit to the programming language due to different memory needs of programs in different languages.