Changelog for draw program
==========================

Version 4.00 (17th October 2010)
--------------------------------

- Added new "-T" parameter to set the 3-match prize between 2 and 20 pounds,
  with the default being 10 pounds if it isn't specified. It also made me
  check up the prize overflow rules i.e. if 10 pounds*total winners across
  all categories is greater than the total prize pool (45% of ticket sales -
  does *not* include any rollover or Super Draw figure - that's applied to
  the jackpot tier only later on), then every category winner gets total
  prize pool/number of winners (presumably rounded down to the nearest pound
  as usual). Note that any rollover or Super Draw is still only applied later
  on to the jackpot prize pool and then that's additionally shared amongst the
  jackpot winners only (or rolled over again if there's no jackpot winners).
  I have now implemented this convoluted set of rules, which has never remotely
  been close to happening in real life (note that -T will affect the overflow
  calculations as you'd expect).
  
- If less than 7 valid numbers are supplied via -b parameters, instead of
  doing nothing when reading from the keyboard or a file for the remaining
  numbers, threads are now spawned to simulate "fake drawings" (and hence
  calculation of the matching results tables) for each of the balls remaining
  in the pot (e.g. 49 before the first number is drawn, 48 before the second
  and so on). This only applies when 2 or more CPU cores are used and whilst
  the file monitoring can use all the cores requested (by alternating
  the parent process between running some of the jobs with checking for file
  changes), keyboard input for the remaining balls will tie up the parent
  process, so one less core is used in that case.

- The test suite has been changed to more accurately simulate the timing of the
  balls coming from the machine. It does this by backgrounding a script that,
  for each ball drawn, sleep 6 seconds and then appends the next ball drawn to
  a file, which is monitored by the foreground "draw" program (via the -f
  option). The 6 seconds between each ball allows all CPU cores to work on
  "fake" background jobs that compute Match 1-4 category totals as if each of
  the remaining balls in the pot were drawn. Although this means throwing away
  all but one of those fake balls' totals when the "real" ball is drawn, the
  jobs are ordered so that at least one of the match 1-4 categories should be
  done. This should mean that at least some work can be skipped when the "real"
  jobs are put on the queue to finish off whatever wasn't done by the fake
  jobs for that "real" ball.

- Added the display of the minimum, maximum, total and average timings for the
  7 tables generated for a draw. This is because the tables no longer increase
  their generation time in a linear fashion thanks to the new "fake" threads
  that short-cut a chunk of the calculations.

- Don't spawn any match 1-4 jobs at all for the bonus ball table, because the
  match 1-4 figures are the same as the 6th main ball table (so those figures
  are now saved and re-used for the bonus ball table). Somewhat bemusing that
  it's taken me this long to notice such an obvious optimisation.

- If any input is read from the keyboard or a file, an extra "0" should be
  input as a terminator (i.e. what would be a theoretical 8th number). It isn't
  needed if seven valid -b parameters are supplied, since it's safe to assume
  those are all correct. The reason for the terminator is that even after the
  7th (usually bonus) ball is input, the user should still have the chance to
  correct that final ball using -1 as input to delete it. The new test suite
  does indeed supply the "0" terminator before you ask.

- The revised test suite now has large chunks of predictive work being done
  in the 6 seconds prior to each ball coming out. Although the tables below
  only seem to show modest improvements on the 3.X results, this is for the
  *slowest* table display time (usually after the 6th or bonus ball). With
  enough predictive work, some tables will actually display "instantly"
  (i.e. in 0.001 seconds).

                       Intel Q6600 @ 2.40 GHz cores
  Binary             1         2         3         4    
  ------          --------  --------  --------  --------
  32-bit Linux    1.352636  0.683688  0.327421  0.219935
  64-bit Linux    1.688279  1.045291  0.696715  0.369326
  32-bit Windows  2.121000  1.076000  0.733000  0.375000

                       Intel E5630 @ 2.53 GHz cores
  Binary             2         4         6         8     
  ------          --------  --------  --------  --------
  32-bit Linux    0.599034  0.192066  0.140819  0.111113
  64-bit Linux    0.939752  0.329066  0.206559  0.154745

  Note that although the E5630 server had 16 logical cores, the hyperthreading
  used to achieve this actually slowed the results beyond 8 cores.

- Primary development platform is now 64-bit Fedora 13, but I'm ridiculously
  still using Fedora 8's gcc 4.1.2, because it still produces way better code
  for this program than Fedora 13's gcc 4.4.4. Windows binaries are run in
  a Windows 7 Ultimate 64-bit environment.

- Had to use Fedora 13 with VMware Workstation 7.1.2 (VMware Server 2.0.2
  is ridiculously almost a year old and doesn't work with recent kernels -
  pathetic from VMware!) to get a version that actually worked with Mac OS X
  Leopard. I really don't like VMware as a Linux host - it doesn't track recent
  kernels, has a rubbish interactive perl script that you have to manually run
  *every time you update a kernel* (and if the new kernel isn't supported,
  that's VMware broken unless you downgrade your kernel) and it runs Mac OS X
  like a total sick pig. Yes, I tried VirtualBox and KVM and neither can run
  Mac OS X yet. I used Xcode 3.1.4 in case you're wondering, which defaulted
  to gcc 4.0.1.

- cygwin1.dll has been updated to the latest Cygwin release and now requires
  Unicode sequences to be sent to display a pound symbol properly, ho hum.
  It turns out Mac OS X's Terminal application also insists on Unicode
  sequences, so I trim the code $c2 (aka decimal 194) from any run's output
  before comparing with the expected output files (which just have $a3 -
  aka decimal 163 - as the pound symbol, since they were generated on Linux).

Version 3.20 (19th February 2009)
---------------------------------

- Added new "-f" option to monitor a file for drawn numbers, rather than
  reading them from the keyboard - this assumes that not enough valid -b
  parameters to finish the draw (i.e. 7 for the main Lotto) have been supplied
  either. The specified file is polled every 0.10 seconds for changes.

- Added new "-g" option to provide GUI text input for ticket sales, rollover
  amount and the balls drawn. Requires an X11 display and zenity to work.

- Corrected calculations where there's no bonus ball in the game (-u option).
  Note that the 6-match tier gets 68% of the total prize pool (52% from the
  original 6-match tier allocation when there's a bonus ball and 16% from the
  now "missing" 5+bonus tier) in this case.

- On the Linux platform, used *setaffinity*() calls to force
  the parent process and each child thread onto separate CPU cores, avoiding
  having to "hammer" the parent/children to persuade the scheduler to put
  them on different CPU cores.

- Raised maximum number of CPU cores supported from 8 to 16, though I can't
  test beyond 8 myself at the moment.

- Minimum ticket sales are now 294 tickets (yes, 6*49 before you ask) -
  this is to avoid crazily low ticket sales that could potentially hang
  the slightly-skewed random ticket allocation.

- The Mac OS X platform is finally supported, thanks to the words "VMWare",
  "Xcode" and an awful lot of fiddling. I can now create a Mac OS X .dmg file
  from the source code and this has been shipped simultaneously with the Linux
  and Windows binaries for the first time. Because of virtualisation, I have
  not included Mac OS X timing figures since they would be meaningless (and
  a lot slower).

- My primary development platform has now switched from 64-bit Fedora 8 to
  64-bit Fedora 10, but I'm unhappy with the bad performance I'm getting from
  F10's gcc 4.3.2 (I even tried F11 Alpha's gcc 4.4.0 and that was no better).
  Hence, I am running gcc 4.1.2 from 64-bit Fedora 8 to create the 32-bit and
  64-bit Linux binary packages in Fedora 10 for the moment.

- Abandoned any future support for other types of draws because the only
  lottery show shown live on UK TV with variable prizes is the main Lotto.
  It's also going to be painful to write code for other games, so I'd rather
  not :-)

- There won't be any more final table generation speeds until I see an
  improvement in the quad core Linux figures.

Version 3.10 (25th January 2009)
--------------------------------

- This is mostly just a "clean-up" release, fixing minor issues like
  pluralisation, typos, function prototypes, out-of-date info in the README
  file and adding support in the Makefile for building on Mac OS X. Sorry, I
  still don't ship for Mac OS X yet though [I really don't like distributing
  binaries that I can't run myself!], but I have a contact who is checking out
  that he can build the binary .dmg OK on Mac OS X. If I do get Mac OS X (or
  Darwin) working on my Dell, I will start shipping binaries for that platform
  as well.

- Added new "-w" option to display all individual prize amounts in all
  categories where they are calculable, even if they can change when later
  balls are drawn. The default is still not to display an individual prize
  amount until its final value is known - this is because Camelot wouldn't
  want to do this (e.g. 4-match individual prize drops when the 5th and
  6th balls are drawn because there are more 4-match winners in total later on,
  but the prize pool they're sharing stays the same). Cue lots of people
  claiming Camelot have "stolen" parts of their prize money if they can see
  the 4-match individual prize shrink before their eyes...! Note that the
  test suite now uses -w, so the expected output (expected1.txt) is slightly
  different compared to previous versions.

- Added new '-O" option which outputs minimal CSV info (numbers drawn so
  far, prize amounts/winners/pools, ticket sales) to stdout for processing
  by other programs (e.g. Camelot's caption generating machine, I wish).

* Now that an alternative (CSV) output of the results is available, it
  seemed unfair to include the table output in the timing of the code,
  so the timing end point was moved to just before the full or CSV table
  is output. This removes any lag from fprintf(), scrolling etc. that would
  potentially influence the timings (i.e. the timed code now involves no
  terminal output at all).

- Tried out the program on HP-UX 11.X (PA-RISC and Itanium) for a "laugh".
  We only have 2-CPU single-core servers at work, but using "-c 2" came out
  horribly slow compared to my Dell. Typical times hovered around 4-5
  seconds, but I've left in the HP-UX Makefile/code changes for the few
  hardy souls still using HP-UX. Note that the test suite will fail because
  srand48()/drand48() are different on HP-UX compared to Linux/Mac OS X
  it seems. I may have to break out a custom RNG routine (which might speed
  up random ticket generation anyway). The best thing about HP-UX compilation,
  though, is that it's yet another slew of ultra-fussy compiler warnings
  (on the Itanium platform) that are useful to fix.

- Code from Blender (http://www.blender.org/ - GPL 2 licensed) was used to
  determine the number of CPU cores on Linux and Mac OS X. For HP-UX,
  there was actually an example in "man pstat", so I used that. I stuck
  to $NUMBER_OF_PROCESSORS on Windows, because Cygwin doesn't have the
  GetSystemInfo() call that Blender uses.

- A test suite template (testsuite.in) is now used so that its shell escape
  can be substituted by the Makefile. This allows /bin/ksh to be the shell
  for HP-UX and /bin/bash for other platforms.

- Until jobs are on the queue, child threads used to sit in a usleep() loop
  waiting for some to appear. However, if all the 7 balls are supplied by -b,
  the jobs go on the queue very quickly after the threads are spawned
  (and certainly in less than one usleep() interval). In that situation,
  we tell the child threads never to usleep() waiting for the first job
  to appear (yes, this will use 100% CPU checking for the 1st job to turn up,
  but that's going to be in less than 1ms, so who cares).

  Also, if some balls have to be typed in manually (e.g. less than 7 valid
  -b params were supplied by the user), tell the child threads to usleep()
  before the keyboard input, but as soon as a valid number is typed in,
  tell the threads to avoid usleep(). This should speed up the threads
  start time by up to one usleep() interval (not much, but every little
  helps).

- Changed the order of which match rows (0-7) are done first. Previous
  releases went from 7 (5+bonus) down to 0, now it's done in this order:
  2,3,4,7,5,6,1,0. It actually saved a few hundredths execution time with
  this simple re-order, mainly because a large number of child thread jobs
  are put into the job queue much earlier.

- Final table generation speeds (in seconds):

                                 Number of CPU cores (-c flag)
  Binary             1         2         3         4      |      6        8
  ------          --------  --------  --------  --------  |  --------  --------
  32-bit Linux    1.366219  0.678154  0.456354  0.341592  |  0.210546  0.163625
  64-bit Linux    1.688124  0.828099  0.551967  0.431109  |  0.243988  0.182467
  32-bit Windows  2.137000  1.076000  0.733000  0.546000  |     N/A      N/A

  Not much change to report here - managed to save a few hundredths on Linux
  on my Dell and lose a few hundredths on Windows and Linux on the HP server.
  Remember that 1-4 cores are on a Dell Vostro quad core, 6 and 8 cores are on
  an 8-core HP Xeon server (so they don't scale the same since the CPUs are
  different speeds).

Version 3.00 (15th January 2009)
--------------------------------

- A major change to the thread model has more than doubled the speed on a quad
  core machine. The previous model had just having 4 jobs scanning 14m
  combinations each (parent does match 1, child threads do matches 2-4), which
  not only limited the program to using 4 cores or less, but the match 2-4 jobs
  also took less time in total than the match 1 job, meaning that a dual-core
  run was just as fast as a quad-core run.

  In the new model, the 14m combinations at each match level are split into
  chunks of about 500,000 combinations each and those are then put on the job
  queue. It means more than 100 jobs have to be run, keeping all the cores
  busy all of the time. It also means that more than 4 CPU cores can join
  in the calculations for the first time. For example, I've run the code on
  an 8-core CPU blade (2*quad-core) at work and it's the fastest yet.

  Note that testing showed that using a variable-sized job chunk on 64-bit
  platforms slightly increased its speed, so Match 1 jobs chunks were made
  the smallest (since each combination takes longer) and Match 4 the largest.
  The variable-sized job chunk is switched on with -DVARIABLE_COMBO_CHUNK on
  64-bit platforms only.

- Downloaded the 32-bit and 64-bit versions of the free (for non-commercial
  use) Intel C/C++ compiler for Linux (aka "icc") from here:
  http://www.intel.com/cd/software/products/asmo-na/eng/219771.htm

  Apart from annoyances like being a 1GB (!) download, requiring a serial
  number and a Net connection for "activation" (why? it's free!), I eventually
  got it installed. Lots of buzz on the Net how it's produces noticeably
  faster binaries, but no luck here - icc-built 32-bit and 64-bit binaries
  were significantly slower than gcc-built ones. However, the compiler warnings
  were handy to add "proper" casts that gcc didn't moan about. Binaries
  will continue to built with gcc by default, unless someone can supply me
  with some "magic" icc flags that can beat gcc's figures.

- Changed all function parameters to ANSI C style, since no-one uses old-style
  K&R parameters any more. Was encouraged to do so by icc's warnings, which
  is fair enough.

- Fixed a major oversight on my part when compiling with Windows Cygwin and
  it explains why the previous two releases had "sucky" Windows performance.
  I'd assumed that since Cygwin is sadly 32-bit only, I'd not need to
  specify any 32-bit flags. The opposite is true - it appears that gcc
  compilation in Linux and Windows for the draw program seemingly produce
  the fastest binaries with the lowest common denominator of architecture
  flags - namely "-m32 -march=i386". Whilst I'd added those flags on Linux,
  I hadn't on Windows :-( Once I used them on Windows, performance improved
  massively - now it's only 25% (instead of 125%!) slower compared to Linux.

- Final table generation speeds (in seconds):

                                 Number of CPU cores (-c flag)
  Binary             1         2         3         4      |      6        8
  ------          --------  --------  --------  --------  |  --------  --------
  32-bit Linux    1.514662  0.700743  0.476763  0.362027  |  0.208721  0.154751
  64-bit Linux    1.703136  0.911918  0.591918  0.444385  |  0.246141  0.183338
  32-bit Windows  2.091000  1.061000  0.717000  0.546000  |     N/A      N/A

  The 1-4 CPU core figures were obtained from a Dell Vostro 400 desktop
  with 4GB RAM and one quad-core Intel Q6600 CPU at 2.40Ghz running 64-bit
  Fedora 8 Linux or 64-bit Vista Ultimate SP1. Compilation was with gcc 4.1.2
  on Linux and Cygwin 1.5.25-15 on Windows. It should be noted that the
  latest Vostro model (420) is now well under 400 pounds to buy its base unit
  from Dell, so I would consider this almost an entry level PC nowadays. It's
  why I've taken the 1-4 CPU core figures from that machine rather than the
  souped up blade server I mention below (plus I can get Windows figures from
  the Vostro to show how much faster Linux is :-) ), since it's within anyone's
  reach at home to reproduce the Vostro's results.
  
  The 6 and 8 CPU core figures were obtained from an HP ProLiant BL460c G1
  blade server with 8GB RAM and two quad-core Intel Xeon E5440 CPUs at 2.83Ghz
  running 64-bit CentOS 5.2 Linux. The same binaries that were compiled on the
  Dell Vostro were also used on the blade server. Sorry, obviously there's
  no Windows equivalent for that blade server.

Version 2.00 (11th January 2009)
--------------------------------

- Added POSIX threads support, which will use all the CPU cores
  in your machine (up to 4, since that's all the work I can give
  it at the moment) or the number specified by the new -c flag.
  BTW, A useful POSIX threads tutorial is here:
  https://computing.llnl.gov/tutorials/pthreads/

- Saved 28MB of malloc()'ed memory when -v is specified, because the
  verification algorithm doesn't use the ticket combination memory array.

- Added warning and prompt to getpercent script to make sure the
  user really wants to re-generate percent.h (if they do, it will
  "break" the test suite by generating a different set of random
  numbers).

- Final table generation speeds (in seconds) are now (using -c 1 to -c 4):

  Binary           1 CPU core   2 CPU cores   3 CPU cores   4 CPU cores
  ------           ----------   -----------   -----------   -----------
  32-bit Linux      1.334858     0.835928      0.836347      0.846030
  64-bit Linux      1.623586     0.996805      1.000646      0.996645
  32-bit Windows    2.948000     1.778000      1.778000      1.778000

  The fractionally worse performance of 3 and 4 cores compared to 2 is
  because there isn't enough work to keep 4 cores going all the time -
  a dual-core setup with a parent process and a single child thread
  actually covers it all (parent does the longest computation and the
  child serially does quicker ones, which it completes before the parent
  process does).

- Fixed a few minor spelling mistakes in code comments and documentation.

Version 1.00 (8th January 2009)
-------------------------------

- Initial release, licensed with GPL 3.

- Support for 32-bit Linux, 64-bit Linux and 32-bit Windows builds (the
  latter using Cygwin, so cygwin1.dll - GPL 2 licensed - is also bundled).

- Final table generation speeds (i.e. after bonus ball is drawn) when
  testsuite on Linux or testsuite.bat on Windows is run:

  Binary          Speed (secs)
  ------          ------------
  32-bit Linux      1.330660
  64-bit Linux      1.666244
  32-bit Windows    3.198000