mhz - calculate clock rate and megahertz

Usage: mhz [-c]

*****************************************************************

PA RISC timings (by Stan Sieler, sieler@gmail.com)

   3000 A Class (A440)    54 MHz, 18.36 nanosec clock  (#1)
      (equivalent 9000   440 MHz,  2.27 nanosec clock)
   3000 968               64 MHz, 15.73 nanosec clock
   3000 928               48 MHz, 21.03 nanosec clock
   3000 918               34 MHz, 29.26 nanosec clock  (#2)
   9000 rp2430           649 MHz,  1.54 nanosec clock  (32 or 64 bit!)
   9000 K220 (859)       120 MHz,  8.35 nanosec clock

   #1: first reports: mhz: should take approximately 297 seconds
       (A400 is SERIES e3000/A400-100-11, clock of 440 reduced
       to claimed 110, actual 55)

   #2: first reports: mhz: should take approximately 30 seconds
       (clock of 64 reduced to claimed 34)

*****************************************************************
*****************************************************************

mhz version: v 1.5 1997/06/14 03:27:23
mhz author:  Larry McVoy

*****************************************************************
Caveat emptor and other warnings

This code must be compiled using the optimizer!  If you don't
compile this using the optimizer, then many compilers don't
make good use of the registers and your inner loops end up
using stack variables, which is SLOW.

Also, it is sensitive to other processor load.  When running
mhz with "rtprio" (real-time priority), I have never had mhz
make a mistake on my machine.  At other times mhz has been
wrong about 10% of the time.

If there is too much noise/error in the data, then this program
will usually return a clock speed that is too high.

*****************************************************************
Constraints

mhz.c is meant to be platform independent ANSI/C code, and it
has as little platform dependent code as possible.

This version of mhz is designed to eliminate the variable
instruction counts used by different compilers on different
architectures and instruction sets.
It is also structured to be tightly interlocked so processors
with super-scalar elements or dynamic instruction reorder
buffers cannot overlap the execution of the expressions.

We have to try and make sure that the code in the various
inner loops does not fall out of the on-chip instruction cache
and that the inner loop variables fit inside the register set.
The i386 only has six addressable registers, so we had to make
sure that the inner loop procedures had fewer variables so they
would not spill onto the stack.

*****************************************************************
Algorithm

We can compute the CPU cycle time if we can get the compiler
to generate (at least) two instruction sequences inside loops
where the inner loop instruction counts are relatively prime.
We have several different loops to increase the chance that
two of them will be relatively prime on any given architecture.

This technique makes no assumptions about the cost of any single
instruction or the number of instructions used to implement a
given expression.  We just hope that the compiler gets at least
two inner loop instruction sequences with lengths that are
relatively prime.  The "relatively prime" makes the greatest
common divisor method work.  If all the instruction sequences
have a common factor (e.g. 2), then the apparent CPU speed will
be off by that common factor.  Also, if there is too much
variability in the data so that there is no apparent least
common multiple within the error bounds set in multiple_approx,
then we simply return the maximum clock rate found in the loops.

The processor's clock period is the greatest common divisor of
the execution times of the various loops.  Equivalently, the
clock frequency is the least common multiple of the loops'
execution frequencies.
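
The greatest common divisor step can be sketched in C as below.
This is a minimal illustration under stated assumptions, not the
multiple_approx routine mentioned above: the names approx_gcd and
period_ns_to_mhz and the noise tolerance parameter tol are
hypothetical, and the sketch simply runs Euclid's algorithm on
measured durations, treating any remainder no larger than tol as
measurement noise.

```c
#include <assert.h>

/*
 * Approximate GCD of two positive durations (for example,
 * nanoseconds per loop iteration).  Hypothetical sketch: a
 * remainder no larger than tol is treated as noise, i.e. as zero.
 */
double approx_gcd(double a, double b, double tol)
{
    assert(a > 0.0 && b > 0.0 && tol > 0.0);
    while (b > tol) {
        /* Nearest whole multiple of b (valid for positive values
         * whose quotient fits in a long long). */
        double q = (double)(long long)(a / b + 0.5);
        /* Distance from a to that multiple; at most b / 2. */
        double r = a - b * q;
        if (r < 0.0)
            r = -r;
        a = b;
        b = r;
    }
    return a;
}

/* Convert a clock period in nanoseconds to a rate in MHz. */
double period_ns_to_mhz(double period_ns)
{
    return 1000.0 / period_ns;
}
```

With exact inputs, approx_gcd(30.0, 20.0, 1.0) returns 10.0, which
period_ns_to_mhz converts to 100 MHz.  With slightly noisy inputs
such as 29.9 and 20.1, the result is about 9.8 ns, so the estimate
drifts a little above the true rate; the tolerance has to be chosen
to match the noise in the timings.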

For example, suppose we are trying to compute the clock speed
for a 120MHz processor, and we have two loops:

    SHR             --- two cycles to shift right
    SHR;ADD         --- three cycles to SHR and add

then the expression duration will be:

    SHR             16.7ns (2 cycles/SHR)
    SHR;ADD         25.0ns (3 cycles/SHR;ADD)

so the greatest common divisor is 8.33ns and the clock speed
is 120MHz.  Aside from extraneous variability added by poor
benchmarking hygiene, this method should always work when we
are able to get loops with cycle counts that are relatively
prime.

Suppose we are unlucky, and our two loops do not have
relatively prime instruction counts.  Suppose our two loops are:

    SHR             16.7ns (2 cycles/SHR)
    SHR;ADD;SUB     33.3ns (4 cycles/SHR;ADD;SUB)

then the greatest common divisor will be 16.7ns, so the clock
speed will appear to be 60MHz.

The loops provided so far should include at least two with
relatively prime cycle counts on all tested architectures.

*****************************************************************

Copyright (c) 1994 Larry McVoy.  Distributed under the FSF GPL with
additional restriction that results may be published only if
(1) the benchmark is unmodified, and
(2) the version in the sccsid below is included in the report.

Support for this development by Silicon Graphics is gratefully acknowledged.
Support for this development by Hewlett Packard is gratefully acknowledged.
Support for this development by Sun Microsystems is gratefully acknowledged.

*****************************************************************
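
The effect of a shared factor, as in the 120MHz example in the
Algorithm section, can be checked with a small C sketch.  It is an
illustration only and not code from mhz: gcd_u and apparent_mhz are
hypothetical helper names, and the cycle counts passed in are
assumed rather than measured.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor of two cycle counts (Euclid's algorithm). */
unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/*
 * Hypothetical helper: the clock rate the GCD method would appear
 * to report if every loop's cycle count shares a common factor.
 * Returns true_mhz unchanged when the counts are relatively prime
 * or when n is 0.
 */
double apparent_mhz(double true_mhz, const unsigned *cycles, size_t n)
{
    unsigned g = 0;   /* gcd_u(0, x) == x, so 0 is a neutral start */
    size_t i;

    assert(cycles != NULL || n == 0);
    for (i = 0; i < n; i++)
        g = gcd_u(g, cycles[i]);
    return g > 1 ? true_mhz / g : true_mhz;
}
```

For cycle counts 2 and 3 the shared factor is 1 and a 120MHz clock
is reported as 120MHz; for counts 2 and 4 the shared factor is 2
and the same clock appears to run at 60MHz.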