Acknowledgements: Updated 14/02/2001 with help and comments from Tim Mann, Peter Berger, Andreas Schwartmann, Severi Salminen and many others. Thanks guys.
Running a computer Chess tournament is a long and thankless task that will tie up your computer for days. Many people feel that such tournaments [referred to as "basement tournaments"] are merely a waste of computational resources. Despite this, many computer Chess enthusiasts still eagerly spend much of their free time running such tournaments, and their reasons vary. Some do it because they want to assist the authors of their favourite programs as beta testers; others have the need to find out once and for all which program is best, or to prove a point; and yet others do it because they enjoy watching the computers battle it out and enjoy playing through the resulting Chess games.
If you are game to try for whatever reason, do read on.
Choosing the hardware and software
The first decision you have to make is what software and hardware you will run the tournament on. This is related in many ways to what type of programs you wish to test. If the programs you decide to test all use the same interface, e.g. Winboard, or can communicate with each other through drivers such as auto232, then things are simple.
If the programs cannot communicate, things become trickier, as you must either transfer the moves manually or use the old Alt-Tab trick [if on the same computer]. This is not desirable, because some programs, like Chessmaster, can hog CPU resources. However, in the past this was the only way to test Chessmaster.
Another way would be to use task monitoring to ensure that one program is not using up all the resources.
Theoretically speaking, two computers are superior to playing two programs on one machine, since we can be sure that all the resources of each computer are used by one program. [A dual-CPU machine is not quite equivalent: even with two processors, the engines are still fighting for shared resources, e.g. access to EGTBs and access to main memory.] Using two computers also avoids the problem of deciding whether to turn pondering on. [See the section "Ponder ON or OFF?"]
Unfortunately, few people have two identical computers for a computer Chess tournament. To ensure that the tournament is fair, some have proposed increasing the time available to the program on the slower computer, based on the difference in processor speed. However, this overlooks the fact that if both programs ponder [and there is no reason why they shouldn't on two separate machines], the extra time given to the program on the slower machine also benefits its opponent, which ponders during that time as well. Connecting two computers together is also somewhat more complex and technically demanding. Perhaps a better way would be to switch computers every match, or to run matches over an ICS [if you can find someone with exactly the same computer configuration as you].
Ponder ON or OFF?
Pondering means that once the engine has moved, it continues to "think" [on the assumption that the opponent will play what it considered the best reply], just as humans do. Some feel that pondering should be set OFF when testing on one computer. This ensures that when it is program A's turn to move, we can be sure that all the computer's resources are used by A and A alone.
On the other hand, some argue that an engine crippled by turning off pondering is a totally different animal from the one the authors work with, since they usually test with ponder on, on one computer.
Crafty, for example, has time management that assumes pondering is on: if it accurately predicts the opponent's moves while pondering, it saves time later. If pondering is turned off, this never happens, and Crafty spends too much time on the early moves and gets into time trouble later. [Thanks to Tim Mann for explaining this.] It is also unknown whether setting ponder off hurts some engines, like Crafty, more than others. There have been attempts to show that ponder-off affects all programs equally [notably, tests by Volker Pittlik show that the results are similar whether ponder is on or off], but neither side remains convinced. [E.g. Volker tested Crafty against a series of strong freeware programs, but it is arguable that the weakness pondering-off induces in Crafty would only become apparent against stronger commercial programs.]
In some cases, you will have games where one program doesn't support pondering at all, playing against one that does. If you are running such a match on one computer, it is advisable to turn pondering off, since CPU usage would otherwise be extremely uneven. On two computers, the program that can ponder should be allowed to ponder: the lack of pondering in one engine shouldn't handicap the other, unless there are serious reasons against doing so. [See the discussion about book learning.] [Thanks to Severi Salminen for pointing this out.]
Choosing the tournament format

Round Robins tend to be the most popular format for tournaments of 5-10 programs. Generally, programs of about the same estimated strength are chosen; unfortunately, this makes statistically significant results more difficult to come by. [See interpreting chess results.] Swiss systems are usually used in large free-for-all tournaments with many programs of different standards. Knock-out tournaments tend to be less popular among testers. Testers can also let a new program run the "Gauntlet" by testing it against various programs of known strength to gauge how strong it is.
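If you want to generate the round-robin pairings yourself, the classic "circle method" is enough. The sketch below is a minimal illustration in Python; any engine names you feed it are placeholders:

```python
def round_robin(players):
    """Circle-method schedule: every player meets every other exactly
    once.  Returns a list of rounds, each a list of (white, black)
    pairings.  A sketch for illustration only."""
    players = list(players)
    if len(players) % 2:
        players.append(None)              # a bye for odd-sized fields
    n = len(players)
    rounds = []
    for r in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = players[i], players[n - 1 - i]
            if a is not None and b is not None:
                # alternate colours from round to round
                pairs.append((a, b) if r % 2 == 0 else (b, a))
        rounds.append(pairs)
        # keep the first player fixed, rotate the rest
        players = [players[0], players[-1]] + players[1:-1]
    return rounds
```

With 6 programs this yields 5 rounds of 3 games each; an odd number of entrants simply gives each program one bye per cycle.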
Choosing the time controls

Choosing the time controls is a tricky task. The first limitation, as mentioned above, is that some chess programs don't support certain time controls. There are some who feel that blitz time controls are not chess, and are less interesting than longer [read 40/90] time controls.
Another disadvantage of extremely short time controls is that many programs cannot handle time trouble, and many crash. On the other hand, blitz gives you the luxury of playing more games over a shorter period of time, and as everyone knows, the more games you play, the more certain you can be of your results.
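How much more certain? A rough way to quantify it is a normal-approximation confidence interval on the match score. The sketch below is illustrative only, and the approximation is shaky for very short matches:

```python
import math

def match_score_interval(wins, draws, losses, z=1.96):
    """Approximate 95% confidence interval for a match score
    (win = 1, draw = 0.5), using the normal approximation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # sample variance of the per-game results around the mean score
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    return score - z * stderr, score + z * stderr
```

A 10-10 result in a 20-game match gives an interval of roughly 28% to 72%, and quadrupling the number of games only halves the width of the interval, which is why convincing results take so many games.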
You should also keep in mind that programs are not equally strong at all time controls. Some, like Yace, used to be much better at blitz than at standard time controls, while others, like Francesa and Amy, are the opposite. Another thing to consider is your processor speed: G/30 on a slower computer might be equivalent to G/15 or less on a faster machine. Slower time controls [e.g. one move per day] allow us to get a glimpse of the strength of current-day programs on future hardware.
Choosing the chess programs

With so many chess programs around, the computer Chess tester has to be selective. In many ways this is the most significant step, since the limitations of each program directly affect the tournament. [E.g. some programs cannot handle X in Y, others have fixed hash tables.] There is no standard way to pick chess programs, but I think it's advisable to include at least one program of well-established strength as a benchmark.
Also, it may not be a good idea to include multiple versions of the same engine. Unless there is a really big difference between them, or your purpose is to test the difference in strength between the versions, the results will be less meaningful for the other Chess engines.
Some testers insist that participating Chess engines be able to recognise draws, be it by the 50-move rule, insufficient material or threefold repetition. Without such features, a lot of time is wasted as Chess engines mindlessly move back and forth in drawn positions! Here's a table by George Lyako which lists the draw-claiming features of various Winboard Chess engines.
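For the curious, the two draw rules engines most often omit are easy to state in code. The sketch below tracks draw claims over opaque position keys [think of a FEN string with the move counters stripped]; it is an illustration of the rules only, not part of any real engine:

```python
from collections import Counter

class DrawTracker:
    """Tracks threefold repetition and the 50-move rule."""

    def __init__(self):
        self.seen = Counter()     # how often each position has occurred
        self.halfmove_clock = 0   # plies since the last capture or pawn move

    def record(self, position_key, is_capture=False, is_pawn_move=False):
        self.seen[position_key] += 1
        if is_capture or is_pawn_move:
            self.halfmove_clock = 0   # irreversible move resets the count
        else:
            self.halfmove_clock += 1

    def can_claim_draw(self, position_key):
        # threefold repetition, or 100 plies [50 full moves] without
        # a capture or pawn move
        return self.seen[position_key] >= 3 or self.halfmove_clock >= 100
```

An engine shuffling a rook back and forth hits the repetition condition within a handful of moves, which is exactly the wasted time the testers above want to avoid.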
Opening books, Nunn tests and learning
Computer Chess results can be adversely affected by poor opening books. In fact, it has been suggested that some commercial programs have strong "killer" books which account for much of their strength, and that each new version improves because of a better book rather than through the strength of the engine. This is, of course, an extreme view, but we cannot deny that opening books play a big part in determining the results.
The Nunn test avoids the variability in quality of opening books by starting all the chess programs from the same set of [16] fixed opening positions. However, just as humans use openings that suit their playing style, opening books are designed to let each Chess engine play at its best, and they are an integral part of the Chess program. Another problem is that the Nunn test insists that a program is only stronger than another if it demonstrates its strength over its rival in a variety of opening positions [since the Nunn test positions cover a range of openings]. This is extremely artificial: replace the two programs with two human GMs and consider the consequences.
Many programs come with their own opening books, and some testers feel that those books should be used. On the other hand, self-made books have often been used as well.
There is the question of whether learning should be turned on, especially in a tournament with engines that don't have this feature. My own personal view is that learning should be turned on, since the lack of a learning feature in one engine shouldn't prevent its use in an engine that has it. Of course, you might get a lot of repeated losses by the non-learning engine if the learning engine has aggressive positive book learning.
Allocation of memory in Transposition tables
In general, if you are doing engine-versus-engine matches on one machine, the total memory allocated to both programs should be half of your system's total memory.
I must add that this advice applies only to people with low amounts of RAM [say, below 256 MB]. Given that the amount of memory Windows needs is somewhat fixed, if you have a large amount of RAM you do not need to follow the "50% rule" above. For example, Andreas Schwartmann's machine with 512 MB of RAM can support a total of 420 MB for the tournament without any problems. [Thanks to Andreas Schwartmann and Mogens Larsen for pointing this out.]
How much RAM you should allocate also depends on the time controls used. At lightning and blitz time controls, large amounts of RAM [64 MB and up?] allocated to transposition tables do not improve playing strength and may even hurt the engine. Large hash tables are only useful at long time controls, when the transposition tables actually fill up.
In the interest of fair competition, you should allocate the same amount of memory to both engines. However, depending on the program, you might not have full control over the allocation. Some programs [e.g. Francesa] have a single hash size fixed at compile time that cannot be adjusted at all, while others allow almost any allocation of memory. Programs like Crafty lie in between, letting you adjust the size of the hash tables, but only in discrete increments.
Therefore it may not be possible to be totally "fair" in the allocation of hash for engine versus engine matches on the same machine.
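As a back-of-the-envelope illustration of the advice above, here is a sketch that splits the RAM left over after the operating system between two engines, with an optional rounding step for engines that only accept discrete hash sizes. The default reserve figure is an assumption for illustration, not a measured Windows footprint:

```python
def split_hash(total_ram_mb, os_reserve_mb=128, granularity_mb=None):
    """Return the hash size (in MB) to give each of two engines.
    With the default 128 MB reserve, a 256 MB system gives each
    engine 64 MB -- i.e. the "50% rule" -- while larger systems
    leave proportionally more for the engines."""
    per_engine = (total_ram_mb - os_reserve_mb) // 2
    if granularity_mb:
        # round down to the engine's nearest supported increment
        per_engine -= per_engine % granularity_mb
    return per_engine
```

For example, on a 512 MB machine with a 92 MB reserve [matching the 420 MB tournament total quoted above] and an engine that takes hash in 64 MB steps, each side would get 192 MB.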
Another question arises when allocating memory between engines that use endgame tablebases and those that don't: should the total memory allocated to each engine include the memory given to the endgame caches?
Automated running of Round Robins and Nunn tests
Unfortunately, if you are using Winboard, there is no built-in feature to support the running of Round Robins or Nunn tests, unlike in Fritz. However, the /mg command helps you run matches of fixed length between two given engines. As for running Nunn tests, a post at the Winboard forum by Dieter Buerssner shows you how to do it. To automate a Round Robin tournament, you probably need to use a script or a batch file.
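As a sketch of what such a script might generate, the function below builds a WinBoard match-mode command line from the /mg option mentioned above plus the /fcp, /scp, /mps and /tc options. The engine file names are placeholders, and you should check the option syntax against your own WinBoard version before relying on it:

```python
def winboard_match_cmd(first, second, games, moves=40, minutes=90):
    """Build one WinBoard command line for a fixed-length match:
    /mg  = number of match games
    /fcp, /scp = first and second chess program
    /mps, /tc  = moves per session and session length in minutes."""
    return ('winboard /mg=%d /fcp="%s" /scp="%s" /mps=%d /tc=%d'
            % (games, first, second, moves, minutes))
```

A batch file for a Round Robin would then simply contain one such line per pairing, run one after another.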
Peter Berger has kindly emailed me a batch file to help automate tournaments in Winboard for people unfamiliar with batch files. You should also definitely refer to the excellent article on using batch files to run tournaments in Winboard that Peter Berger has contributed!
I would be interested to hear from you if you have a script or method for automating Round Robin tournaments that you would like to share.
If you need some simple software to generate tournament schedules for round robins or to keep track of the results, you can get Roundrobin.exe and Tourney.exe.
Handling defaults and crashes
During the course of the tournament, there will be times when engines crash for no particular reason. You have the discretion to decide whether to award a win/loss/draw or to allow a replay. In general, if the problem is due to a failure to recognise the 50-move rule or insufficient material, it is best to award a draw. It gets trickier when a program crashes while it is winning.

Some argue that when a program is clearly winning [say, a rook up] and it crashes, you should award the win to the crashing program. However, many [most?] would disagree. Firstly, it is arguable that an engine that crashes should be treated as having forfeited the game, just as a human GM would lose the game if he refused to move. Another problem is that if you strictly follow a rule of awarding wins when the computer crashes, it is conceivable that a programmer could make his engine crash whenever it reaches a "winning" position, to ensure the win!
Handling Upgrades and bug fixes during the tournament
A great debate arose in the Computer Chess Club [CCC] in March 2001 over whether it is a good idea to upgrade programs while a tournament is still ongoing.
It was argued that by mixing up engine versions, you invalidate the results of the tournament since it would be akin to replacing players in the middle of the tournament. Also, programmers could then pick and choose versions that could do especially well against specific engines.
On the other hand, in real official computer Chess tournaments like the WCCC, this is exactly what happens! Depending on the opposition, the programmer will select openings, tune parameters, change code, etc., in the hope of getting a version that can best handle the next opponent. Is this "scientific"? Probably not.

Also, Kasparov and other humans are not exactly the same between rounds anyway, since they adjust, learn and change depending on who they face over the board.
Even if the upgrade does not increase strength but merely fixes a bad bug that causes the program to crash, some still feel that the upgrade shouldn't be applied, since the bug is part of the program and should be evaluated as such. Again, some disagree. They argue that you learn nothing about the strength of a Chess program if a stupid bug causes it to crash again and again.
There is probably no right answer to this; it all depends on your objective. Whichever way you choose, it is best to state your policy up front and to apply it without bias.
Calculating and posting the results
If you need to calculate ELO ratings based on the results, I recommend the use of Elostat_13.zip by Frank Schubert.

Now that you have finished the tournament, you may wish to share the results. You can post in various places, including the Winboard Forum, the Computer Chess Club [CCC] and other forums, depending on the type of programs in your tournament.
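Tools like ELOStat are built around the standard Elo expectation formula; the sketch below shows the idea, with a simple bisection search for a performance rating. It illustrates the formula only — ELOStat's actual algorithm may differ:

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score for a player rated r_a
    against one rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def performance_rating(opponent_ratings, total_score):
    """Rating at which total_score would be exactly the expected
    score against the given opponents, found by bisection."""
    lo, hi = 0.0, 4000.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, r) for r in opponent_ratings)
        if expected < total_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Scoring 50% against 2400-rated opposition, for instance, corresponds to a 2400 performance rating.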
Before you post, do remember to provide the following information.
You might also wish to provide a brief commentary. For example, you may want to comment on whether certain programs performed better or worse than expected, and if so, why. Was it due to a poor opening book? Poor king safety? Constant crashing?
The last kind of information can be very useful in helping authors hunt down bugs.
My advice is that you shouldn't post all the games in the forum. Put them up on a webpage instead, or offer to email the games to people who are interested. If you really need to post games, pick one or two interesting ones and comment on them.
While the authors of the engines in your tournament will likely be very interested in the results and bugs you report, don't be disappointed if few or none comment on your postings. Unless your results are somewhat unexpected, people are unlikely to comment.
Happy Testing.
Aaron Tay
14-02-2001