Changelog for draw program ========================== Version 4.00 (17th October 2010) -------------------------------- - Added new "-T" parameter to set the 3-match prize between 2 and 20 pounds, with the default being 10 pounds if it isn't specified. It also made me check up the prize overflow rules i.e. if 10 pounds*total winners across all categories is greater than the total prize pool (45% of ticket sales - does *not* include any rollover or Super Draw figure - that's applied to the jackpot tier only later on), then every category winner gets total prize pool/number of winners (presumably rounded down to the nearest pound as usual). Note that any rollover or Super Draw is still only applied later on to the jackpot prize pool and then that's additionally shared amongst the jackpot winners only (or rolled over again if there's no jackpot winners). I have now implemented this convoluted set of rules, which has never remotely been close to happening in real life (note that -T will affect the overflow calculations as you'd expect). - If less than 7 valid numbers are supplied via -b parameters, instead of doing nothing when reading from the keyboard or a file for the remaining numbers, threads are now spawned to simulate "fake drawings" (and hence calculation of the matching results tables) for each of the balls remaining in the pot (e.g. 49 before the first number is drawn, 48 before the second and so on). This only applies when 2 or more CPU cores are used and whilst the file monitoring can use all the cores requested (by alternating the parent process between running some of the jobs with checking for file changes), keyboard input for the remaining balls will tie up the parent process, so one less core is used in that case. - The test suite has been changed to more accurately simulate the timing of the balls coming from the machine. It does this by backgrounding a script that, for each ball drawn, sleep 6 seconds and then appends the next ball drawn to a file, which is monitored by the foreground "draw" program (via the -f option). The 6 seconds between each ball allows all CPU cores to work on "fake" background jobs that compute Match 1-4 category totals as if each of the remaining balls in the pot were drawn. Although this means throwing away all but one of those fake balls' totals when the "real" ball is drawn, the jobs are ordered so that at least one of the match 1-4 categories should be done. This should mean that at least some work can be skipped when the "real" jobs are put on the queue to finish off whatever wasn't done by the fake jobs for that "real" ball. - Added the display of the minimum, maximum, total and average timings for the 7 tables generated for a draw. This is because the tables no longer increase their generation time in a linear fashion thanks to the new "fake" threads that short-cut a chunk of the calculations. - Don't spawn any match 1-4 jobs at all for the bonus ball table, because the match 1-4 figures are the same as the 6th main ball table (so those figures are now saved and re-used for the bonus ball table). Somewhat bemusing that it's taken me this long to notice such an obvious optimisation. - If any input is read from the keyboard or a file, an extra "0" should be input as a terminator (i.e. what would be a theoretical 8th number). It isn't needed if seven valid -b parameters are supplied, since it's safe to assume those are all correct. The reason for the terminator is that even after the 7th (usually bonus) ball is input, the user should still have the chance to correct that final ball using -1 as input to delete it. The new test suite does indeed supply the "0" terminator before you ask. - The revised test suite now has large chunks of predictive work being done in the 6 seconds prior to each ball coming out. Although the tables below only seem to show modest improvements on the 3.X results, this is for the *slowest* table display time (usually after the 6th or bonus ball). With enough predictive work, some tables will actually display "instantly" (i.e. in 0.001 seconds). Intel Q6600 @ 2.40 GHz cores Binary 1 2 3 4 ------ -------- -------- -------- -------- 32-bit Linux 1.352636 0.683688 0.327421 0.219935 64-bit Linux 1.688279 1.045291 0.696715 0.369326 32-bit Windows 2.121000 1.076000 0.733000 0.375000 Intel E5630 @ 2.53 GHz cores Binary 2 4 6 8 ------ -------- -------- -------- -------- 32-bit Linux 0.599034 0.192066 0.140819 0.111113 64-bit Linux 0.939752 0.329066 0.206559 0.154745 Note that although the E5630 server had 16 logical cores, the hyperthreading used to achieve this actually slowed the results beyond 8 cores. - Primary development platform is now 64-bit Fedora 13, but I'm ridiculously still using Fedora 8's gcc 4.1.2, because it still produces way better code for this program than Fedora 13's gcc 4.4.4. Windows binaries are run in a Windows 7 Ultimate 64-bit environment. - Had to use Fedora 13 with VMware Workstation 7.1.2 (VMware Server 2.0.2 is ridiculously almost a year old and doesn't work with recent kernels - pathetic from VMware!) to get a version that actually worked with Mac OS X Leopard. I really don't like VMware as a Linux host - it doesn't track recent kernels, has a rubbish interactive perl script that you have to manually run *every time you update a kernel* (and if the new kernel isn't supported, that's VMware broken unless you downgrade your kernel) and it runs Mac OS X like a total sick pig. Yes, I tried VirtualBox and KVM and neither can run Mac OS X yet. I used Xcode 3.1.4 in case you're wondering, which defaulted to gcc 4.0.1. - cygwin1.dll has been updated to the latest Cygwin release and now requires Unicode sequences to be sent to display a pound symbol properly, ho hum. It turns out Mac OS X's Terminal application also insists on Unicode sequences, so I trim the code $c2 (aka decimal 194) from any run's output before comparing with the expected output files (which just have $a3 - aka decimal 163 - as the pound symbol, since they were generated on Linux). Version 3.20 (19th February 2009) --------------------------------- - Added new "-f" option to monitor a file for drawn numbers, rather than reading them from the keyboard - this assumes that not enough valid -b parameters to finish the draw (i.e. 7 for the main Lotto) have been supplied either. The specified file is polled every 0.10 seconds for changes. - Added new "-g" option to provide GUI text input for ticket sales, rollover amount and the balls drawn. Requires an X11 display and zenity to work. - Corrected calculations where there's no bonus ball in the game (-u option). Note that the 6-match tier gets 68% of the total prize pool (52% from the original 6-match tier allocation when there's a bonus ball and 16% from the now "missing" 5+bonus tier) in this case. - On the Linux platform, used *setaffinity*() calls to force the parent process and each child thread onto separate CPU cores, avoiding having to "hammer" the parent/children to persuade the scheduler to put them on different CPU cores. - Raised maximum number of CPU cores supported from 8 to 16, though I can't test beyond 8 myself at the moment. - Minimum ticket sales are now 294 tickets (yes, 6*49 before you ask) - this is to avoid crazily low ticket sales that could potentially hang the slightly-skewed random ticket allocation. - The Mac OS X platform is finally supported, thanks to the words "VMWare", "Xcode" and an awful lot of fiddling. I can now create a Mac OS X .dmg file from the source code and this has been shipped simultaneously with the Linux and Windows binaries for the first time. Because of virtualisation, I have not included Mac OS X timing figures since they would be meaningless (and a lot slower). - My primary development platform has now switched from 64-bit Fedora 8 to 64-bit Fedora 10, but I'm unhappy with the bad performance I'm getting from F10's gcc 4.3.2 (I even tried F11 Alpha's gcc 4.4.0 and that was no better). Hence, I am running gcc 4.1.2 from 64-bit Fedora 8 to create the 32-bit and 64-bit Linux binary packages in Fedora 10 for the moment. - Abandoned any future support for other types of draws because the only lottery show shown live on UK TV with variable prizes is the main Lotto. It's also going to be painful to write code for other games, so I'd rather not :-) - There won't be any more final table generation speeds until I see an improvement in the quad core Linux figures. Version 3.10 (25th January 2009) -------------------------------- - This is mostly just a "clean-up" release, fixing minor issues like pluralisation, typos, function prototypes, out-of-date info in the README file and adding support in the Makefile for building on Mac OS X. Sorry, I still don't ship for Mac OS X yet though [I really don't like distributing binaries that I can't run myself!], but I have a contact who is checking out that he can build the binary .dmg OK on Mac OS X. If I do get Mac OS X (or Darwin) working on my Dell, I will start shipping binaries for that platform as well. - Added new "-w" option to display all individual prize amounts in all categories where they are calculable, even if they can change when later balls are drawn. The default is still not to display an individual prize amount until its final value is known - this is because Camelot wouldn't want to do this (e.g. 4-match individual prize drops when the 5th and 6th balls are drawn because there are more 4-match winners in total later on, but the prize pool they're sharing stays the same). Cue lots of people claiming Camelot have "stolen" parts of their prize money if they can see the 4-match individual prize shrink before their eyes...! Note that the test suite now uses -w, so the expected output (expected1.txt) is slightly different compared to previous versions. - Added new '-O" option which outputs minimal CSV info (numbers drawn so far, prize amounts/winners/pools, ticket sales) to stdout for processing by other programs (e.g. Camelot's caption generating machine, I wish). * Now that an alternative (CSV) output of the results is available, it seemed unfair to include the table output in the timing of the code, so the timing end point was moved to just before the full or CSV table is output. This removes any lag from fprintf(), scrolling etc. that would potentially influence the timings (i.e. the timed code now involves no terminal output at all). - Tried out the program on HP-UX 11.X (PA-RISC and Itanium) for a "laugh". We only have 2-CPU single-core servers at work, but using "-c 2" came out horribly slow compared to my Dell. Typical times hovered around 4-5 seconds, but I've left in the HP-UX Makefile/code changes for the few hardy souls still using HP-UX. Note that the test suite will fail because srand48()/drand48() are different on HP-UX compared to Linux/Mac OS X it seems. I may have to break out a custom RNG routine (which might speed up random ticket generation anyway). The best thing about HP-UX compilation, though, is that it's yet another slew of ultra-fussy compiler warnings (on the Itanium platform) that are useful to fix. - Code from Blender (http://www.blender.org/ - GPL 2 licensed) was used to determine the number of CPU cores on Linux and Mac OS X. For HP-UX, there was actually an example in "man pstat", so I used that. I stuck to $NUMBER_OF_PROCESSORS on Windows, because Cygwin doesn't have the GetSystemInfo() call that Blender uses. - A test suite template (testsuite.in) is now used so that its shell escape can be substituted by the Makefile. This allows /bin/ksh to be the shell for HP-UX and /bin/bash for other platforms. - Until jobs are on the queue, child threads used to sit in a usleep() loop waiting for some to appear. However, if all the 7 balls are supplied by -b, the jobs go on the queue very quickly after the threads are spawned (and certainly in less than one usleep() interval). In that situation, we tell the child threads never to usleep() waiting for the first job to appear (yes, this will use 100% CPU checking for the 1st job to turn up, but that's going to be in less than 1ms, so who cares). Also, if some balls have to be typed in manually (e.g. less than 7 valid -b params were supplied by the user), tell the child threads to usleep() before the keyboard input, but as soon as a valid number is typed in, tell the threads to avoid usleep(). This should speed up the threads start time by up to one usleep() interval (not much, but every little helps). - Changed the order of which match rows (0-7) are done first. Previous releases went from 7 (5+bonus) down to 0, now it's done in this order: 2,3,4,7,5,6,1,0. It actually saved a few hundredths execution time with this simple re-order, mainly because a large number of child thread jobs are put into the job queue much earlier. - Final table generation speeds (in seconds): Number of CPU cores (-c flag) Binary 1 2 3 4 | 6 8 ------ -------- -------- -------- -------- | -------- -------- 32-bit Linux 1.366219 0.678154 0.456354 0.341592 | 0.210546 0.163625 64-bit Linux 1.688124 0.828099 0.551967 0.431109 | 0.243988 0.182467 32-bit Windows 2.137000 1.076000 0.733000 0.546000 | N/A N/A Not much change to report here - managed to save a few hundredths on Linux on my Dell and lose a few hundredths on Windows and Linux on the HP server. Remember that 1-4 cores are on a Dell Vostro quad core, 6 and 8 cores are on an 8-core HP Xeon server (so they don't scale the same since the CPUs are different speeds). Version 3.00 (15th January 2009) -------------------------------- - A major change to the thread model has more than doubled the speed on a quad core machine. The previous model had just having 4 jobs scanning 14m combinations each (parent does match 1, child threads do matches 2-4), which not only limited the program to using 4 cores or less, but the match 2-4 jobs also took less time in total than the match 1 job, meaning that a dual-core run was just as fast as a quad-core run. In the new model, the 14m combinations at each match level are split into chunks of about 500,000 combinations each and those are then put on the job queue. It means more than 100 jobs have to be run, keeping all the cores busy all of the time. It also means that more than 4 CPU cores can join in the calculations for the first time. For example, I've run the code on an 8-core CPU blade (2*quad-core) at work and it's the fastest yet. Note that testing showed that using a variable-sized job chunk on 64-bit platforms slightly increased its speed, so Match 1 jobs chunks were made the smallest (since each combination takes longer) and Match 4 the largest. The variable-sized job chunk is switched on with -DVARIABLE_COMBO_CHUNK on 64-bit platforms only. - Downloaded the 32-bit and 64-bit versions of the free (for non-commercial use) Intel C/C++ compiler for Linux (aka "icc") from here: http://www.intel.com/cd/software/products/asmo-na/eng/219771.htm Apart from annoyances like being a 1GB (!) download, requiring a serial number and a Net connection for "activation" (why? it's free!), I eventually got it installed. Lots of buzz on the Net how it's produces noticeably faster binaries, but no luck here - icc-built 32-bit and 64-bit binaries were significantly slower than gcc-built ones. However, the compiler warnings were handy to add "proper" casts that gcc didn't moan about. Binaries will continue to built with gcc by default, unless someone can supply me with some "magic" icc flags that can beat gcc's figures. - Changed all function parameters to ANSI C style, since no-one uses old-style K&R parameters any more. Was encouraged to do so by icc's warnings, which is fair enough. - Fixed a major oversight on my part when compiling with Windows Cygwin and it explains why the previous two releases had "sucky" Windows performance. I'd assumed that since Cygwin is sadly 32-bit only, I'd not need to specify any 32-bit flags. The opposite is true - it appears that gcc compilation in Linux and Windows for the draw program seemingly produce the fastest binaries with the lowest common denominator of architecture flags - namely "-m32 -march=i386". Whilst I'd added those flags on Linux, I hadn't on Windows :-( Once I used them on Windows, performance improved massively - now it's only 25% (instead of 125%!) slower compared to Linux. - Final table generation speeds (in seconds): Number of CPU cores (-c flag) Binary 1 2 3 4 | 6 8 ------ -------- -------- -------- -------- | -------- -------- 32-bit Linux 1.514662 0.700743 0.476763 0.362027 | 0.208721 0.154751 64-bit Linux 1.703136 0.911918 0.591918 0.444385 | 0.246141 0.183338 32-bit Windows 2.091000 1.061000 0.717000 0.546000 | N/A N/A The 1-4 CPU core figures were obtained from a Dell Vostro 400 desktop with 4GB RAM and one quad-core Intel Q6600 CPU at 2.40Ghz running 64-bit Fedora 8 Linux or 64-bit Vista Ultimate SP1. Compilation was with gcc 4.1.2 on Linux and Cygwin 1.5.25-15 on Windows. It should be noted that the latest Vostro model (420) is now well under 400 pounds to buy its base unit from Dell, so I would consider this almost an entry level PC nowadays. It's why I've taken the 1-4 CPU core figures from that machine rather than the souped up blade server I mention below (plus I can get Windows figures from the Vostro to show how much faster Linux is :-) ), since it's within anyone's reach at home to reproduce the Vostro's results. The 6 and 8 CPU core figures were obtained from an HP ProLiant BL460c G1 blade server with 8GB RAM and two quad-core Intel Xeon E5440 CPUs at 2.83Ghz running 64-bit CentOS 5.2 Linux. The same binaries that were compiled on the Dell Vostro were also used on the blade server. Sorry, obviously there's no Windows equivalent for that blade server. Version 2.00 (11th January 2009) -------------------------------- - Added POSIX threads support, which will use all the CPU cores in your machine (up to 4, since that's all the work I can give it at the moment) or the number specified by the new -c flag. BTW, A useful POSIX threads tutorial is here: https://computing.llnl.gov/tutorials/pthreads/ - Saved 28MB of malloc()'ed memory when -v is specified, because the verification algorithm doesn't use the ticket combination memory array. - Added warning and prompt to getpercent script to make sure the user really wants to re-generate percent.h (if they do, it will "break" the test suite by generating a different set of random numbers). - Final table generation speeds (in seconds) are now (using -c 1 to -c 4): Binary 1 CPU core 2 CPU cores 3 CPU cores 4 CPU cores ------ ---------- ----------- ----------- ----------- 32-bit Linux 1.334858 0.835928 0.836347 0.846030 64-bit Linux 1.623586 0.996805 1.000646 0.996645 32-bit Windows 2.948000 1.778000 1.778000 1.778000 The fractionally worse performance of 3 and 4 cores compared to 2 is because there isn't enough work to keep 4 cores going all the time - a dual-core setup with a parent process and a single child thread actually covers it all (parent does the longest computation and the child serially does quicker ones, which it completes before the parent process does). - Fixed a few minor spelling mistakes in code comments and documentation. Version 1.00 (8th January 2009) ------------------------------- - Initial release, licensed with GPL 3. - Support for 32-bit Linux, 64-bit Linux and 32-bit Windows builds (the latter using Cygwin, so cygwin1.dll - GPL 2 licensed - is also bundled). - Final table generation speeds (i.e. after bonus ball is drawn) when testsuite on Linux or testsuite.bat on Windows is run: Binary Speed (secs) ------ ------------ 32-bit Linux 1.330660 64-bit Linux 1.666244 32-bit Windows 3.198000