Measurement Tools

Computer systems are extremely complex, requiring comprehension at many levels. Measurements, too, must be made at different levels to provide a complete system characterization. For a particular metric, it is important to choose the best tool. The tools can be compared in terms of what they can measure, their ease of use, the overhead they incur, and their sampling frequencies. Table I compares the tools for each category.

Hardware measurement tools analyze signals from the machine under test. Most hardware tools are passive and do not disturb the system. They are fast enough to capture individual machine cycles and therefore are very good at collecting low-level traces and sequences. The short-term sampling frequency can be very high. Hardware tools cannot measure higher levels of the system, such as processes and programs. They require equipment attached to the system under test and extensive setup time.

Machines such as the HP 3000 Series 68 have writable control store, allowing the use of microcoded tools. The sampling rate is slower than that of hardware tools, and the extra microcode adds a small amount of system overhead. Microcode tools are best suited for obtaining statistics at the procedure and process level. They are also very good at sampling at fixed time intervals. Installation of the tools is straightforward but does require a cool start of the system.

The most abstract level of measurement requires software running on the system under test. There can be significant overhead associated with software tools, which may also perturb the system. However, software tools can track paths in system-level algorithms, monitor interactions between processes, and log a large number of software event counters. Software tools are generally very easy to install and use.

Table I
Comparison of Measurement Techniques

Monitor    Overhead  Ease of Use          Sampling Frequency  Type of Measurements
Hardware   0%        Hardware required    [illegible]         Signals, short traces
Microcode  0.1-1%    Cool start required  [illegible]         Fixed-time sampling, procedures, processes
Software   5-10%     Simple installation  [illegible]         Software event counters, system interactions

The Series 70 performance analysis made use of each type of tool. Besides the HPSnapshot and hardware monitor tools mentioned in the accompanying article, several microcode-based tools were used.

Microcode Tools

The Instruction Gatherer records the currently executing instruction at one-millisecond intervals. It also gives some information about the most frequent subcases of the instructions. The data from ten sites gave a time-weighted profile of which instructions were executed most often. This information led to the remicrocoding of two instructions for increased performance.

The Microsampler is a simple microcoded version of the Sampler software tool. It records program counter values in the operating system code. From this information, six high-frequency procedures were identified and rewritten in microcode.

The cache post microcode is special-purpose microcode used to investigate alternative solutions to cache posting. The microcode validated the use of the cache simulator for the posting circuitry investigation.

... the hardware monitor, provided invaluable measurement capability.

The monitor consists of an HP 1630 Logic Analyzer coupled to an HP Touchscreen Personal Computer via the HP-IB (IEEE 488), as shown in Fig. 2. The probes of the HP 1630 are attached to pins on the backplane of the system under test (HP 3000 Series 64, 68, or 70). The Touchscreen computer serves as controller, reduction tool, and data storage for the HP 1630. The monitor can automatically run a series of independent tests with a simple command file.

For each test, the Touchscreen computer begins by downloading configuration information to the HP 1630. The measurement is then started on the HP 1630. The Touchscreen waits for a specified time, then halts the measurement. The collected data is uploaded to the Touchscreen computer, where it is reduced and stored to disc. A side benefit of the automated process is that the uploaded data from the HP 1630 is more detailed than that available through manual operation. Careful analysis of the internal operation of the HP 1630 gave us confidence that it would satisfy statistical sampling demands.
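The per-test sequence described above (configure, start, wait, halt, upload, reduce, store) can be sketched as a small control loop. This is only an illustration: the `Analyzer` interface, `run_test`, and `run_series` are hypothetical names invented here, not HP 1630 or HP-IB commands, and the list of `(config, duration)` pairs merely stands in for the command file.

```python
import time
from typing import Callable, Iterable, List, Protocol, Tuple


class Analyzer(Protocol):
    """Hypothetical analyzer interface; these are not real HP 1630 commands."""

    def configure(self, config: str) -> None: ...
    def start(self) -> None: ...
    def halt(self) -> None: ...
    def upload(self) -> List[int]: ...


def run_test(
    analyzer: Analyzer,
    config: str,
    duration_s: float,
    reduce: Callable[[List[int]], object],
    store: Callable[[str, object], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run one measurement: configure, start, wait, halt, upload, reduce, store."""
    analyzer.configure(config)    # download configuration to the analyzer
    analyzer.start()              # begin the measurement
    sleep(duration_s)             # wait for the specified time
    analyzer.halt()               # stop the measurement
    raw = analyzer.upload()       # bring the collected data to the controller
    store(config, reduce(raw))    # reduce the data and save the result


def run_series(
    analyzer: Analyzer,
    tests: Iterable[Tuple[str, float]],
    reduce: Callable[[List[int]], object],
    store: Callable[[str, object], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run independent tests in order; `tests` plays the role of the command file."""
    for config, duration_s in tests:
        run_test(analyzer, config, duration_s, reduce, store, sleep)
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the sequence be exercised against a stand-in analyzer without actually waiting for a half-hour sample.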

Fig. 2. The hardware monitor performance measurement tool.

The Series 70: Not Just a Cache

The HP 3000 Series 70 Business Computer is a collection of performance enhancements. To be included in the Series 70 product, each enhancement had to qualify in terms of measurable performance, applicability across the customer base, and independence of the other enhancements. Each enhancement was the result of the methodology of measurements and analysis used in the Series 70 cache design and described in the accompanying article.

Microcode was written to find the most common instruction executions. Two instructions that were surprisingly prominent in the mix were rewritten to execute faster.

Microcode was also written to find the most often used MPE procedures. This information led to the selection of six procedures to be rewritten in microcode. The microcoded versions of these procedures execute up to ten times faster.

The expanded main memory offered on both the Series 68 and 70 was also subjected to measurement and analysis. The effect of additional memory on the multiprogramming level, the number of physical I/Os, and the CPU utilization were studied. Methods to identify when main memory bottlenecks occur were developed, which resulted in projections of performance improvements resulting from additional memory.

There were also several improvements in the MPE operating system software. The HPSnapshot tool revealed areas where changes could have a significant effect on system performance.

A collection of ten tests was run on four HP 3000 Series 68 systems. These systems were chosen based on how well they represented the customer base, as determined from the HPSnapshot data. The tests measured many variables, including instruction paths, I/O use, and cache statistics. The tests required a total of probes. It was calculated that half-hour samples would best meet the conflicting demands of aggregate and variance analysis. Each sample captured about half a million cycles, and a total of 144 samples were collected.

The data gave insight into the low-level activities of the CPU. As a result, a number of performance opportunities were identified. The biggest opportunity lay in improving the cache memory subsystem.

The Series 68 cache is a 4K-word, 2-set cache (see "How a Cache Works," page 42). The hit rate measured on the four systems was 92.5%. However, although a cache miss occurs only 7.5% of the time, almost 30% of the CPU time is spent waiting for cache misses to be resolved. During this time, the CPU is frozen and cannot proceed. A simple model of the cache was created and validated through the hardware monitor. The model suggested that if the hit rate could be improved by 5 or 6 percent, the CPU would freeze only about 10% of the time. This would result in a savings of almost 20% of the CPU cycles, which translates into an effective speedup of about 25%. The availability of denser RAMs and programmable array logic parts (PALs) indicated a strong possibility for just such an improvement.
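The arithmetic in this paragraph can be reproduced with a very small stall model. The sketch below is an assumption made for illustration, not the model used in the study: it supposes a fixed miss penalty, expressed in units of the non-stalled time per memory reference, and derives that penalty from the measured 7.5% miss rate and 30% frozen time. All function names are hypothetical.

```python
def penalty_from_measurement(miss_rate: float, frozen: float) -> float:
    """Miss penalty (in units of non-stalled time per reference) implied by
    a measured miss rate and a measured fraction of time spent frozen."""
    return frozen / ((1.0 - frozen) * miss_rate)


def frozen_fraction(miss_rate: float, penalty: float) -> float:
    """Fraction of total time the CPU is frozen waiting on cache misses."""
    stall = miss_rate * penalty
    return stall / (1.0 + stall)


def speedup_from_cycles_saved(saved: float) -> float:
    """Speedup when a fraction `saved` of the original cycles is eliminated."""
    return 1.0 / (1.0 - saved)


if __name__ == "__main__":
    measured_miss = 0.075   # Series 68 miss rate from the text
    measured_frozen = 0.30  # fraction of CPU time frozen, from the text
    penalty = penalty_from_measurement(measured_miss, measured_frozen)

    # Hit rate improved by 5 and by 6 percentage points.
    for gain in (0.05, 0.06):
        new_frozen = frozen_fraction(measured_miss - gain, penalty)
        print(f"hit rate +{gain:.0%}: frozen about {new_frozen:.1%} of the time")

    # Saving 20% of the cycles corresponds to a speedup of 1 / 0.8.
    print(f"speedup: {speedup_from_cycles_saved(0.20):.2f}x")
```

Under these assumptions the frozen fraction falls to roughly 8% to 12.5% for a 6-point or 5-point hit rate improvement, consistent with the "about 10%" quoted above, and a 20% cycle saving gives the quoted speedup of about 25%.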

Modeling

Modeling extracts the essential characteristics of a system and converts them into a form that is easily evaluated and modified. There are two major types of modeling. Analytic modeling computes steady-state values of a system according to laws of queuing theory.1 Models are convenient and can guarantee correct results if the input data is correct and complete. The disadvantages lie in restrictions placed on the type of environments that can be modeled. Typically, simplifications must be made to the environment to make the analysis tractable.

Analytic models helped estimate the relative merits of several proposed algorithmic changes in the memory manager, dispatcher, and disc driver. The most beneficial changes were implemented by the MPE software engineers. They were then run through many benchmarks to verify the performance improvements.

Although not part of the Series 70, the HP 7933/35XP cached disc drive was also a result of the performance engineering cycle described here. Important workload parameters were identified. A comparative study of the I/O subsystem performance with MPE disc caching, no caching, and the HP 7933/35XP was conducted. Engineers at HP's Disc Memory Division worked closely with the commercial systems performance engineers to model various design alternatives. Benchmarks were run on prototypes of the enhanced disc drives and the performance gains were verified. The system workload characteristics under which the cached discs performed best were explicitly identified to ensure customer satisfaction.

All of these components were evaluated with respect to each other and to the system as a whole. The performance increases of the components complement each other and keep the system balanced. The result of the Series 70 project is a product offering a 20% to 35% increase in system throughput over a Series 68. Expanded main memory and the HP 7933/35XP can provide additional performance improvements.

Simulation is also modeling, but at a lower level of abstraction. In a Monte Carlo simulation model, the key system variables are represented by their measured probability distributions. Random numbers are applied to the model to generate specific values for the key system variables. These specific values are then used to compute the output variables of interest.
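As a minimal sketch of the Monte Carlo idea, assume two key variables per transaction, CPU time and number of disc I/Os, each described by a measured discrete distribution, and one output variable, mean service time. The variables, distributions, and names here are hypothetical and chosen only to show the mechanism of drawing random values from measured distributions.

```python
import random
from typing import List, Tuple

Distribution = List[Tuple[float, float]]  # (value, probability) pairs


def sample(distribution: Distribution, rng: random.Random) -> float:
    """Draw one value from a measured discrete distribution."""
    values = [value for value, _ in distribution]
    weights = [prob for _, prob in distribution]
    return rng.choices(values, weights=weights, k=1)[0]


def mean_service_time(
    trials: int,
    cpu_ms: Distribution,
    io_count: Distribution,
    ms_per_io: float,
    seed: int = 1,
) -> float:
    """Estimate mean service time (ms) by repeated random sampling."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    total = 0.0
    for _ in range(trials):
        cpu = sample(cpu_ms, rng)    # random value for key variable 1
        ios = sample(io_count, rng)  # random value for key variable 2
        total += cpu + ios * ms_per_io
    return total / trials


if __name__ == "__main__":
    # Hypothetical "measured" distributions, for illustration only.
    cpu_ms = [(5.0, 0.5), (20.0, 0.3), (80.0, 0.2)]
    io_count = [(0.0, 0.2), (1.0, 0.5), (3.0, 0.3)]
    print(mean_service_time(20000, cpu_ms, io_count, ms_per_io=25.0))
```

The result is an estimate that depends on the random draws, which is why such a model yields a particular solution for its inputs and not a guaranteed steady-state value.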

Trace-driven simulations are at a still lower level of abstraction. Traces of values for each of the key system variables are collected and applied together to a deterministic model of the system. The output variables of interest are computed for the specific set of trace data.

Simulation models can be constructed and solved for virtually any system. However, although simulation models provide valuable information, their limitations must also be recognized and understood. For example, simulation models cannot guarantee a steady-state solution, but only a particular solution corresponding to the input data.

Any model runs the risk of ignoring some unknown, yet essential feature of the system. Additional dangers exist in the measurement of system variables and construction of the model. Careful modeling subjects the recorded variables to independent tests of correctness. The model construction can then be validated by modeling the existing system and comparing the results to measurements of the existing system.

After validation, the model is used in design analysis. Various changes can be introduced and the performance changes observed. The set of changes that maximizes performance (and satisfies constraints of cost, design time, etc.) is then chosen as the final design. The final design is then modeled and the performance for the product predicted.

There are currently no good system variables that will accurately predict how a cache will perform. This makes it impossible to construct analytic or Monte Carlo simulation models that accurately predict cache performance. Currently, the best way to model caches is through trace-driven simulators. The Series 70 cache design used both analytic and trace-driven simulation. A simple analytic model at the system level was constructed using the results of the trace-driven cache simulator.

The cache modeling process involved collecting traces of memory reference addresses. Software was written to simulate various cache organizations. The traces were then run through the simulator under various cache organizations.
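A minimal trace-driven cache simulator might look like the sketch below. It is not the simulator described in this article: it models only total size, associativity, and block size, it assumes least-recently-used replacement (the replacement strategy actually chosen is not stated here), and it assumes that a "2-set" cache in this article's terms has two candidate locations per index. The name `simulate_cache` is hypothetical.

```python
from collections import OrderedDict
from typing import Iterable


def simulate_cache(
    trace: Iterable[int],
    total_words: int,
    ways: int,
    block_words: int = 1,
) -> float:
    """Return the hit rate of an LRU cache over a trace of word addresses.

    total_words: cache capacity in words
    ways:        number of blocks that may share one index (associativity)
    block_words: words per cache block
    """
    n_rows = total_words // (block_words * ways)
    if n_rows < 1:
        raise ValueError("cache too small for this associativity and block size")
    # One ordered dict per row; insertion order tracks recency of use.
    rows = [OrderedDict() for _ in range(n_rows)]
    hits = 0
    references = 0
    for address in trace:
        references += 1
        block = address // block_words
        row = rows[block % n_rows]
        if block in row:
            hits += 1
            row.move_to_end(block)       # mark as most recently used
        else:
            if len(row) >= ways:
                row.popitem(last=False)  # evict the least recently used block
            row[block] = True
    return hits / references if references else 0.0


if __name__ == "__main__":
    # Two addresses that collide in a one-way cache but coexist in a two-way one.
    trace = [0, 2, 0, 2]
    print(simulate_cache(trace, total_words=2, ways=1))
    print(simulate_cache(trace, total_words=2, ways=2))
```

Running the same trace under several values of `total_words` and `ways` is the kind of parameter sweep the text describes, reduced here to a toy scale.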

Ideally, the traces would have been collected from a Series 68. However, the speed of the Series 68 prevents data collection without special high-speed hardware. Instead, a Series 37 was chosen. A special memory controller that reads the memory reference address and writes it to its own local memory was constructed quickly. The special controller is passive and traces all system activity, including the operating system. One million consecutive memory references can be collected. This number is sufficient to guarantee many calls to the operating system, and also includes many task switches. Three different customer environments were run to collect the data. Approximately nine measurements were collected for each environment, for a total of 30 million memory references (see "Realistic Cache Simulation," page 45). The traces were then subjected to a series of tests to confirm that the collected data was correct.

Cache Simulator

The cache simulator was developed in parallel with the collection process. The simulator takes the trace data and applies it to various models of the cache. The simulator development consisted of two phases. The first phase concentrated on completeness and correctness of the model implementation, the ease of use, and the choice of statistics to be kept. These goals led to a modular structure and very detailed statistics, and gave considerable freedom in adding and altering features. This flexibility has allowed the simulator to be leveraged to model caches for several other machines besides the Series 70.

It is extremely important that the simulator model the cache designs accurately. Artificial data was generated and simulated and then compared with the expected results to verify the accuracy of the cache simulator. Next, the Series 68 cache was modeled with the trace data. The same environment was then run on the Series 68 and the actual cache statistics were collected with the hardware monitor. The close correlation between the simulated and actual results gave a high level of confidence that the simulator would provide reasonable information on which to base the design of the Series 70 cache.

Phase two of the simulator development concentrated on the speed of the simulator by porting it to a mainframe computer. During the port, unnecessary code and structure were eliminated. The verification tests were then run again and compared to the original simulator output. The streamlining and port to the mainframe resulted in a functionally equivalent simulator that runs 80 times faster. A simulation of a single trace of one million memory references now takes about 45 seconds.

The simulator has the ability to vary seven different parameters: total size of the cache, associativity, block size, algorithms for handling writes, block replacement strategies, the handling of I/O requests, and tag indexing. Simulations were run varying each of the parameters. The effect of each parameter on cache performance and the sensitivity of performance to each parameter were determined.

Fig. 3 shows an example of the simulation results. The graph shows that the biggest contributor to cache performance is the size of the cache. The effect of diminishing returns with increasing cache size is clearly seen. Fig. 3 also shows the effect of associativity on different sizes of caches. The increased complexity of a multiple-set cache can be weighed against the performance gain it provides. This information was computed for all parameters, and the best combination of performance, complexity, and cost was determined. The final design was then chosen and simulated. The Series 70 cache size is 64K words, which is 16 times larger than the Series 68 cache. Like the Series 68, the Series 70 cache has 2 sets.

The simulator provides cache performance information such as hit rates. It does not, however, provide a higher-level view of system performance, such as throughput. In particular, the memory reference traces do not include any measurement of time between memory references. The hardware monitor provides such information for the Series 68. An analysis of the system showed that the timing information would be valid for the new cache. The simple analytical model uses both the simulator data and the hardware monitor data to produce estimates of the effect of the new cache on the system. Besides the expected values for cache statistics, the model also shows a saturation effect. The lower the hit rate a system has, the more the new cache will benefit it.

While the cache simulation traces were fairly stable, the data from real Series 68s had more variance. Statistical analysis of the data led to a range of values for the percentage of CPU cycles that could be recovered by the Series 70 cache. A 90% confidence interval was chosen for the range, which means that, given normal distributions and random sampling, the mean of the cycles recovered should lie within the range nine times out of ten. The range given by the analysis was 19.4% to 28.3%, with an estimated mean of 23.8% recovery of "lost" CPU cycles. Because it was known that the traces from the Series 37 would give optimistic results (since main memory for the Series 37 is several times smaller than that of the Series 68), the target for the Series 70 cache was set at 20%.
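For readers unfamiliar with the calculation, a 90% confidence interval for a mean can be sketched as follows. The sample values are hypothetical, since the per-system data behind the 19.4% to 28.3% range is not given here, and the sketch uses the normal approximation; with very few samples a Student t multiplier would give a somewhat wider range.

```python
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def confidence_interval(samples: Sequence[float], level: float = 0.90) -> Tuple[float, float]:
    """Two-sided confidence interval for the mean, normal approximation."""
    n = len(samples)
    center = mean(samples)
    # z multiplier: about 1.645 for a 90% two-sided interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * stdev(samples) / n ** 0.5
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical percentages of CPU cycles recovered, one per measurement.
    recovered = [20.0, 24.0, 28.0, 22.0, 26.0]
    low, high = confidence_interval(recovered)
    print(f"90% interval: {low:.1f}% to {high:.1f}%")
```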

The other components of the Series 70 underwent similar analysis. The analyses were then combined to give a prediction of the overall system performance gain. Care was taken to estimate the overlap effects exhibited when two or more components try to free the same resource. Predictions of the bottlenecks within the new system were also made.

At this point, the performance prediction was used by the marketing department to start working on the pricing and positioning strategy for the product. Since the normal approach is to wait for the actual hardware to become available to measure the performance gain, valuable time was saved in the introduction process through the use of the highly accurate simulation results.

Fig. 3. An example of simulation results, showing that cache size has the greatest effect on cache performance. [Plot: "Series 70 Cache Simulation: Hit Rate versus Total Cache Size versus Associativity." Hit rate is plotted against total cache size in K words for 1-set, 2-set, and 4-set caches, with the Series 68 and Series 70 caches marked.]

Design Tracking

Measurements and modeling require a lot of work that precedes product development. Not only do they help establish the correct design, but they are also often valuable during the development phase. Unexpected design changes
