Notes on System Failures, System Hangs and Memory Dumps for MPE/iX
by Stan Sieler, Allegro Consultants, Inc.
Copyright (c) 1995 Allegro Consultants, Inc. (updated 2004)
0. Introduction
Sometimes the computer "dies". This paper discusses System Failures, system hangs, and how to load a memory dump for analysis.
Sometimes the system is alive, so the free speedometer is also discussed.
There are two basic kinds of system failure that an MPE/iX (or MPE XL) user will encounter: a "System Failure" and a "system hang". The former is easily identified by the "System Failure" message that appears on the hardware console (usually LDEV 20). The latter is typified by users complaining that the machine is "hung" (non-responsive). Each will be discussed below.
1. System Failures
A System Failure reports the following information on the hardware console:
SYSTEM ABORT 504 FROM SUBSYSTEM 143
SECONDARY STATUS: INFO = -34, SUBSYS = 107
SYSTEM HALT 7, $01F8
Additionally, the hex display (enabled on the console by typing control-B) displays something like:
B007 0101 02F8 DEAD
Note that the "504" and "$01F8" above are the same value, shown in decimal and in hex. Further, the hex display shows "0101" and "02F8". These two numbers are reporting the following:
0101 02F8
In each 4-digit hex number, the first two hex digits (01 and 02 above) are “packet numbers”, and the last two hex digits (01 and F8) are packets 1 and 2 of the System Abort number (01F8).
Note: if the System Abort number is in the range 0 to $FF (decimal 255), only one "packet" will be needed to represent it, and no 2nd packet will be shown.
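The packet layout described above can be checked with a few lines of Python on any workstation. This is only a sketch: the function name is mine, and it assumes that packet 1 always carries the most significant byte, which is what the 0101 02F8 example shows.

```python
def decode_abort_number(words):
    """Rebuild a System Abort number from hex-display packet words.

    Each word is a 4-hex-digit string such as "0101": the high byte is the
    packet number and the low byte is that packet's share of the value.
    Assumption: packet 1 carries the most significant byte, as in the
    0101 02F8 -> $01F8 example.
    """
    packets = []
    for word in words:
        value = int(word, 16)
        packets.append((value >> 8, value & 0xFF))  # (packet number, data byte)

    abort = 0
    for _, data in sorted(packets):  # lowest packet number first
        abort = (abort << 8) | data
    return abort


abort = decode_abort_number(["0101", "02F8"])
print(f"SYSTEM ABORT {abort} (${abort:04X})")
```

Pass only the packet words, not the leading $Bnxx word or the trailing DEAD.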
1.1 Interpreting the System Failure Number
The System Failure number (504 in our example) can be converted to English by doing the following on a live MPE/iX machine while logged in as any user with PM capability:
:hello manager.sys
:debug
= errmsg (#504, #98)
'Prefetch of needed data for a READ/WRITE request could not be made.'
c
:
The above "= errmsg" command looks up the System Failure number (message #504) in the system error catalog (set #98…a magic number). This catalog is not complete, and some System Failure numbers are not in the catalog.
1.2 Interpreting the Subsystem Number
If the System Failure reported a subsystem (143 in our example), the following might convert it to a subsystem name:
:debug
= errmsg (#32765, #143)
c
Note: In all these errmsg examples the "#" signs are required, and the 32765 is a "magic" number.
Here are two examples, one which succeeds, and one which fails:
:debug
= errmsg (#32765, #143)
'File System'
= errmsg (#32765, #129)
'External error - subsys: #129 info: #32765'
If the above doesn’t produce a useful string, you can try two other approaches, both of which use the appropriate SYMOS file after loading the DAT macros.
The first uses a macro called “subsysstr”, which knows about 30 hand-coded subsystem numbers, and also knows how to use the “errmsg” function as shown above:
:debug
use datinit.dat.telesup
macstart , '1'
= subsysstr (#129)
'7978 Tape Device Mgr'
If that doesn’t work then you probably are looking at a relatively unusual subsystem number. The slowest, but most reliable, method of translating a subsystem number into a name is to search the SYMOS for a constant of the form SUBSYS_xxxxx. Here’s an example, using subsystem number #129.
:debug
use datinit.dat.telesup
macstart , '1'
env filter '129'
set dec /* important, because "129" is decimal, not hex */
symlist subsys@ ,,c
SUBSYS_7978_DM CONST INTEGER #129
SUBSYS_TAPE CONST INTEGER #129
env filter ''
In this example two lines matched the filter, showing that subsystem #129 is either “SUBSYS_TAPE” or “SUBSYS_7978_DM”. Since a 7978 is a tape drive, I’d suspect that SUBSYS_7978_DM is the most likely “answer” to the “what is subsystem 129” question. (I submitted a bug report to HP: no two *different* SUBSYS constants should ever have the same value!)
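If you capture unfiltered SYMLIST output to a file and move it to a workstation, a short script can group the constants by value and point out any value claimed by more than one name. The sketch below is a workstation-side helper of my own, not part of DAT or Debug, and it assumes every line of interest has the "NAME CONST INTEGER #value" shape seen in the sample output above.

```python
import re
from collections import defaultdict

# Line shape taken from the sample output: NAME CONST INTEGER #value
SYMLIST_LINE = re.compile(r"^\s*(SUBSYS_\w+)\s+CONST\s+INTEGER\s+#(\d+)\s*$")


def names_by_value(symlist_output):
    """Map each decimal subsystem number to the SUBSYS_ names that carry it."""
    table = defaultdict(list)
    for line in symlist_output.splitlines():
        match = SYMLIST_LINE.match(line)
        if match:  # lines in any other shape are skipped
            table[int(match.group(2))].append(match.group(1))
    return dict(table)


sample = """\
SUBSYS_7978_DM CONST INTEGER #129
SUBSYS_TAPE CONST INTEGER #129
"""

for value, names in names_by_value(sample).items():
    if len(names) > 1:
        print(f"#{value} is shared by: {', '.join(names)}")
```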
An optional step to dramatically improve the performance of the SYMLIST command is to prefetch the SYMOS file into memory. An example is:
:fetch symos.osb79.telesup
The SYMOS file you should fetch is the one that was opened by the MACSTART command above. You can see which one this is by doing a SYMINFO command:
symf
1.3 Interpreting the Secondary Status Number
The Secondary Status line may provide some additional information about the System Failure, if the INFO and SUBSYS values are not 0. Take the two numbers (in our example, INFO = -34, and SUBSYS = 107), and use the "errmsg" function as follows:
:debug
=errmsg (-#34, #107) /* as always, "#"s are necessary */
'The length specified was beyond the bounds of the specified object.'
c
Not all Secondary Status messages are in the catalog. If you had tried one that is not, you would see:
:debug
=errmsg (-#51, #107)
'External error - subsys: #107 info: #51'
Note: I recommend submitting a bug report to HP for any System Failure or Secondary Status values that are not in the catalog!
The most common type of system failure is a deliberate call to an internal MPE routine called system_abort. When this kind of system failure occurs, the three-line message shown at the top is printed to the console. Note the third line, which says "SYSTEM HALT 7, $01F8". The "7" means that system_abort was called. At least seven other kinds of system halts are defined (SYSTEM HALT 0 through SYSTEM HALT 6).
The SYSTEM HALTS 0..6 represent system failures for problems other than system_abort, and usually reflect a problem at a lower level in the operating system (e.g., in the interrupt handling code). SYSTEM HALTS 1..7 should produce a multi-line printout on the console, as shown above. SYSTEM HALT 0 does not.
If the console output is missing or corrupted you can determine the type of SYSTEM HALT that occurred by looking at the hex display. You can think of the hex display as presenting a series of 16-bit numbers (4 hex digits) in a sequence. The sequence is repeated over and over, with a pause of about 1/2 second between each number.
The last number in the sequence is usually $DEAD. The first number is usually of the form $Bnxx. The "xx" portion (the bottom two hex digits) reports the type of SYSTEM HALT that occurred.
In the example at the top the hex display shows:
B007 0101 02F8 DEAD
The "07" means: SYSTEM HALT 7 (system_abort was called).
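If you have written down the hex display sequence, the halt type can be pulled out of the first word mechanically. The following is a minimal sketch with a function name of my own; it assumes the sequence was copied as space-separated 4-digit words and returns nothing useful when the first word is not of the $Bnxx form.

```python
HALT_SYSTEM_ABORT = 7  # per the text: SYSTEM HALT 7 means system_abort was called


def halt_type(sequence):
    """Return the SYSTEM HALT number from a hex-display sequence, or None.

    `sequence` is the repeating display as text, e.g. "B007 0101 02F8 DEAD".
    The first word is expected to look like $Bnxx; "xx" is the halt type.
    """
    words = sequence.split()
    if not words:
        return None
    first = words[0].upper()
    if len(first) != 4 or not first.startswith("B"):
        return None  # the text only says the first word is *usually* $Bnxx
    try:
        return int(first[2:], 16)  # bottom two hex digits
    except ValueError:
        return None


halt = halt_type("B007 0101 02F8 DEAD")
if halt == HALT_SYSTEM_ABORT:
    print("SYSTEM HALT 7: system_abort was called")
else:
    print(f"SYSTEM HALT {halt}")
```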
2. System Hangs
Sometimes, the system seems to "hang" and little or no response is seen by the users. When this happens it is important to characterize what is hung and what isn’t. The following questions should be asked before stopping the machine and taking a memory dump:
- What does the hex display show? (See: Speedometer in section 4 below)
- Does any terminal get a response from the Command Interpreter?
(If a terminal is sitting with a ":" prompt, hit RETURN. Is another ":" prompt displayed?)
- Is the hardware console (LDEV 20) hung?
- If a terminal can be found that is working, does a :SHOWPROC command hang the terminal?
- Does a Control-A at the hardware console (LDEV 20) result in an "=" prompt?
- Are the disk drives active?
The answers to these questions will aid the person who analyzes the dump.
3. Loading a Memory Dump
Once a memory dump has been taken and the system rebooted you will probably want to load the dump for analysis. The following steps should be done:
- Logon as MGR.TELESUP, DUMPS
Note: if the DUMPS group does not exist, logon as MGR.TELESUP and do :NEWGROUP DUMPS, and then :CHGROUP DUMPS
- Enter: DAT.DAT
This will run the DAT (Dump Analysis Tool) program.
- Enter: GETDUMP FOO
"FOO" will be the name of the dump. This name must begin with a letter, and be 1 to 5 letters and/or digits long. One recommendation is to call the dump S#### where #### is the System Abort number (e.g., “S0504”).
DAT will request a tape whose formal name is DUMPTAPE (this may be file equated before running DAT, if necessary).
- REPLY to the tape request.
DAT will read the first few records of the tape and report how much disk storage will be required to hold the dump. DAT will then allocate all of the necessary disk storage "up front", before reading the rest of the tape.
If DAT is able to allocate enough disk space, and if the dump is on a single tape (or DDS or DLT), you can now walk away for a while.
*** Please do the next two steps even if you think you don’t want to analyze the dump yourself; it saves 5 to 15 minutes for the next person who analyzes the dump!
- Enter: MACSTART "FOO", "1"
This will tell DAT that you want to start analyzing the dump. The "FOO" (in quotes) is the name of the dump you used on the earlier GETDUMP command (which didn’t use quotes). The extra "1" tells DAT that you are only interested in "macros" for the operating system.
If this process encounters a few errors, please do a screen capture (e.g.: PSCREEN) so we can analyze them later.
- Enter: PROCESS_WAIT ; UI_SHOWJOB
These two commands may take up to 15 minutes to run.
- Enter: EXIT
You have now loaded a dump, FOO, and "prepared" it. If you want to send the dump to anyone for analysis, use STORE to store it as follows:
:STORE FOO@
The "@" is important, because the dump is actually stored on disk as FOOMEM and FOOVAR, where "FOO" is the name you picked for the dump. Some day dumps may be stored as even more files (e.g.: FOO001, FOOMEM, FOO002, FOOVAR), so the "@" will always be needed.
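The naming rules above (a letter first, 1 to 5 letters and/or digits, the S#### suggestion, and the trailing "@" on the STORE) are easy to get wrong when scripting around a dump. Here is a small sketch of those rules; the helper names are hypothetical and exist only for illustration.

```python
import re

# A letter first, then up to four more letters and/or digits (1 to 5 total).
DUMP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,4}")


def is_valid_dump_name(name):
    """True if `name` follows the dump naming rule given for GETDUMP."""
    return DUMP_NAME.fullmatch(name) is not None


def suggested_dump_name(abort_number):
    """S#### form, e.g. 504 -> "S0504" (decimal, zero padded to four digits)."""
    if not 0 <= abort_number <= 9999:
        raise ValueError("abort number does not fit in four decimal digits")
    return f"S{abort_number:04d}"


def store_command(name):
    """Build the STORE command; the "@" picks up every file of the dump."""
    if not is_valid_dump_name(name):
        raise ValueError(f"not a valid dump name: {name!r}")
    return f":STORE {name}@"


name = suggested_dump_name(504)
print(name, is_valid_dump_name(name), store_command(name))
```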
4. Speedometer
When an HP 3000 system running MPE/iX is alive, the hex display on the hardware console functions as a speedometer, reporting how busy the system is. (Remember: some machines have LED hex displays, and all have the ability to put the hex display on the status line of the hardware console when control-B is hit.)
The speedometer will typically cycle between two values:
FxFF and FFFF.
Ignore the FFFF value. The "x" digit in the FxFF value reports how busy the CPU is, in units of 10%: multiply the digit by 10 to obtain the percentage.
Examples:
- F4FF = 40% busy
- FAFF = 100% busy ("A" is the hex value for decimal 10).
- F0FF = idle (0% busy)
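The examples above follow a simple rule that can be written down as a small function. This is a sketch with a name of my own choosing; it assumes that only digits 0 through A are meaningful and treats anything else, including FFFF, as "no reading".

```python
def cpu_busy_percent(code):
    """Return the CPU busy percentage for an FxFF code, or None otherwise."""
    code = code.strip().upper()
    if len(code) != 4 or code[0] != "F" or code[2:] != "FF":
        return None
    try:
        digit = int(code[1], 16)
    except ValueError:
        return None
    if digit > 0xA:
        return None  # covers FFFF, which the text says to ignore
    return digit * 10


for sample in ("F4FF", "FAFF", "F0FF", "FFFF"):
    print(sample, cpu_busy_percent(sample))
```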
Note: on newer HP 3000s you will have to interact with the GSP (Guardian Service Processor) to see the speedometer. A typical scenario is:
- Connect to the GSP (press control-B at the hardware console, telnet to the GSP port, or use a browser and logon to the GSP via a Secure Web Console)
- Login to the GSP (often just by pressing <return> twice)
- Get to the Virtual Front Panel by entering: VFP <return>
- If asked, say “Yes” to the “Proceed with Live Mode of VFP? (Y/[N]) y” question
- Watch for a few updates:
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
The above says: F0FF, which is 0% busy.
- Exit the VFP by typing “q”: q
- (optional) exit the GSP by typing “co”: co
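If you log the GSP session to a file, the busy readings can be extracted from the VFP lines afterwards. The sketch below assumes each update ends with "chassis-code" followed by the 4-digit value, as in the lines shown above; real VFP output on other firmware may differ, and the function name is mine.

```python
import re

# Assumed line ending, taken from the sample above: "... chassis-code F0FF"
CHASSIS_CODE = re.compile(r"chassis-code\s+([0-9A-Fa-f]{4})\s*$")


def busy_readings(vfp_text):
    """Return CPU busy percentages found in logged VFP output, in order."""
    readings = []
    for line in vfp_text.splitlines():
        match = CHASSIS_CODE.search(line)
        if not match:
            continue
        code = match.group(1).upper()
        if code[0] != "F" or code[2:] != "FF":
            continue  # not a speedometer value
        digit = int(code[1], 16)
        if digit <= 0xA:  # skips FFFF, which carries no load information
            readings.append(digit * 10)
    return readings


log = """\
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
"""
print(busy_readings(log))
```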