Therac-25


The Therac-25 was a radiotherapy linear accelerator produced by AECL, the successor to the Therac-6 and Therac-20 models (the earlier units were produced in partnership with CGR). The device was involved in at least six accidents between June 1985 and January 1987 in which patients received massive radiation overdoses. Three of the patients died as a direct consequence, in what was at the time the most serious series of accidents in the 35-year history of medical linear accelerators. The accidents called into question the reliability of software control of safety-critical systems and became a case study in medical informatics and software engineering.

History

Linear accelerator.

Animation of the operation of a linear accelerator of medical use.

The French company CGR manufactured the Neptune and Sagittaire linear accelerators.

In the early 1970s CGR and the Canadian Crown corporation Atomic Energy of Canada Limited (AECL) collaborated on the construction of linear accelerators controlled by a DEC PDP-11 minicomputer: the Therac-6, which produced X-rays of up to 6 MeV, and the Therac-20, which could produce X-rays or electrons of up to 20 MeV. In these machines the computer only added convenience, because the accelerator could operate without it. CGR developed the software for the Therac-6 and reused some of its subroutines for the Therac-20.

In 1981 the two companies ended their collaboration agreement. AECL developed a new double-pass concept that accelerated the electrons in a smaller space, and changed the power source from a klystron to a magnetron. In some treatments the accelerated electrons are used directly, while in others they strike a target of high atomic number to produce an X-ray beam. This dual-mode concept was applied to the Therac-20 and the Therac-25, the latter being much more compact, versatile and easy to use. It was also cheaper for a hospital to own one dual-mode machine that could deliver both electron and X-ray treatments than to own two machines.

The Therac-25 was designed from the outset as a computer-controlled machine, and some safety mechanisms were moved from hardware to software. AECL decided not to duplicate those software mechanisms with hardware interlocks. AECL also reused code modules and routines from the Therac-20 in the Therac-25.

The first prototype of the Therac-25 was built in 1976. It began shipping in late 1982.

The software for the Therac-25 was developed by one person over several years in PDP-11 assembly language. It evolved from the Therac-6 software. The programmer left AECL in 1986. In one lawsuit, the lawyers were unable to identify the programmer or establish his qualifications and experience.

Five machines were installed in the United States and six in Canada. After the accidents, AECL dissolved its AECL Medical division in 1988, and the company Theratronics International Ltd took over maintenance of the installed Therac-25 machines.

Design

Turntable.

Rotation of the turntable.

The machine offered two modes of radiation therapy:

Direct electron beam therapy, which delivered small doses of high-energy electrons (5 to 25 MeV) for short periods of time.
Megavoltage X-ray therapy (photon therapy), which delivered X-rays produced by the collision of high-energy electrons (25 MeV) with a target and then passed them through a flattening (diffuser) filter and a collimator.

It also had a visible light mode (field-light mode) that allowed the patient's position to be adjusted by illuminating the treatment area with visible light. In electron beam mode, which used a low beam current, the beam was emitted directly from the machine and spread to a safe concentration by scanning magnets. In X-ray mode, the machine was designed to place four components in the path of the electron beam:

Target, which converted the electron beam into X-rays.
Flattening (diffuser) filter, which spread the beam over a wider area.
Set of movable blocks (collimator), which shaped the X-ray beam to fit the treatment area.
X-ray ion chamber, which measured the strength of the beam.

The patient lies on a fixed treatment table. Above it is a turntable that carries the components that modify the electron beam. The turntable has one position for X-ray (photon) mode, another for electron mode, and a third for field-light adjustments. In the field-light position no electron beam is expected, and light reflected off a stainless steel mirror simulates the beam. There is no ion chamber in this position to act as a dosimeter, because the radiation beam is never supposed to be on. Microswitches on the turntable report its position to the computer. When the turntable is in one of the three permitted positions, a plunger locks it in place. In machines of this type, electromechanical interlocks were traditionally used to ensure that the turntable was in the correct position before treatment could start. On the Therac-25 they were replaced by software checks.

Cases

Kennestone Regional Oncology Center, 1985

A Therac-25 had been operating for six months in Marietta, Georgia, United States, when on June 3, 1985 it was used to deliver radiation therapy to a 61-year-old woman following a lumpectomy. The prescribed treatment was a 10 MeV electron beam. The patient felt a burning heat in the treated area. Over the following days the area reddened, her shoulder locked up and she had spasms. The redness spread from her chest to her back, indicating a radiation burn, but the doctors could not explain it. The hospital physicist consulted AECL about the incident. He calculated that the delivered dose had been between 15,000 and 20,000 rad (radiation absorbed dose) when it should have been 200 rad. A whole-body dose of 1000 rad can be fatal. In October 1985 the patient sued the hospital and the manufacturer of the machine. In November 1985 AECL was notified of the lawsuit. In March 1986 AECL informed the FDA that it had received a complaint from the patient.

Because of the radiation overdose, her breast had to be surgically removed, she lost the use of an arm and shoulder, and she was in constant pain. The treatment printout function had not been enabled at the time of treatment, so there was no record of the radiation data that had been applied. The lawsuit was resolved through an out-of-court settlement.

Ontario Cancer Foundation, 1985

DEC VT-100 terminal.

The Therac-25 had been in operation at the clinic for six months when, on July 26, 1985, a 40-year-old patient came in for her 24th treatment for cervical cancer. The operator activated the treatment, but after five seconds the machine stopped with the error message "H-tilt", a treatment pause indication, and a dosimeter reading showing that no radiation had been delivered. The operator pressed P (Proceed). The machine stopped again. The operator repeated the process five times, until the machine suspended the treatment. A technician was called and found no problem. The machine treated six other patients that same day.

The patient complained of burning and swelling in the treated area and was hospitalized on July 30. A radiation overdose was suspected and the machine was taken out of service. On November 3, 1985 the patient died of her cancer, although the autopsy noted that, had she not died then, she would have needed a hip replacement because of the damage caused by the overdose. A technician estimated that she had received between 13,000 and 17,000 rad.

The incident was reported to the FDA and the Canadian Radiation Protection Bureau.

AECL suspected a fault in the three microswitches that reported the position of the turntable. It was unable to reproduce a microswitch failure, and the microswitch tests were inconclusive. AECL nevertheless changed the position-sensing method so that it tolerated a single microswitch failure, and modified the software to check whether the turntable was moving or was in the treatment position.

AECL claimed that the modifications represented a five-order-of-magnitude increase in safety.

Yakima Valley Memorial Hospital, 1985

In December 1985 a woman developed erythema in a pattern of parallel bands in the treated area. Hospital staff wrote to AECL about the incident on January 31, 1986. AECL replied with a two-page letter detailing the reasons why a radiation overdose was impossible on the Therac-25, whether through machine failure or operator error.

Six months later the patient developed chronic ulcers under the skin caused by tissue necrosis. She underwent surgery and received skin grafts. She survived with minor after-effects.

East Texas Cancer Center, Tyler, March 1986

The center had treated more than 500 patients with the Therac-25 over two years. On March 21, 1986, a patient came in for his ninth treatment session for a tumor on his back. The prescription was a 22 MeV electron beam delivering 180 rad over an area of 10x17 cm, as part of an accumulated dose of 6000 rad over 6 weeks.

The operator, who was experienced, entered the session data and noticed that she had typed an X instead of an E as the treatment type. She moved the cursor up, changed the X to an E and, since the remaining parameters were correct, pressed ↵ Enter repeatedly to move down to the command field. All parameters were marked "Verified" and the message "Beam ready" was displayed. She pressed B (Beam on). The machine stopped and displayed the message "Malfunction 54", together with "Treatment pause". The manual said only that "Malfunction 54" was a "dose input 2" error. A technician later testified that "dose input 2" meant that the delivered radiation was either too high or too low.

The radiation monitor (dosimeter) showed 6 units delivered when 202 units had been requested. The operator pressed P (Proceed). The machine stopped again with the message "Malfunction 54", and the dosimeter indicated that fewer units had been delivered than requested. The video camera in the treatment room was disconnected and the intercom was out of order that day.

On the first attempt the patient felt something like an electric shock and heard a crackle from the machine. Since it was his ninth session, he knew this was not normal. He started to get up from the table to ask for help. At that moment the operator pressed P to proceed. The patient felt a jolt of electricity through his arm, as if his hand were being torn off. He reached the door and pounded on it until the operator opened it. He was examined immediately by a doctor, who observed intense erythema in the area and suspected nothing more than an electric shock. He sent the patient home. The hospital physicist checked the machine and, because it was calibrated to specification, it continued to be used to treat patients for the rest of the day. What nobody knew was that the patient had received a massive dose of between 16,500 and 25,000 rad in less than a second over an area of one cm². The crackling had been produced by saturation of the ionization chambers, which as a consequence reported that the delivered dose had been very low.

Over the following weeks the patient experienced paralysis of the left arm, nausea, vomiting, and ended up being hospitalized for radiation-induced spinal cord myelitis. His legs, mid-diaphragm and vocal cords ended up paralyzed. He also had recurrent herpes simplex skin infections. He passed away five months after the overdose.

From the day after the accident, AECL technicians checked the machine and were unable to replicate error 54. They checked the grounding of the machine to rule out electric shock as the cause. The machine returned to operation on April 7, 1986.

East Texas Cancer Center, Tyler, April 1986

On April 11, 1986, a patient was to receive electron treatment for skin cancer on the face. The prescription was 10 MeV over an area of 7x10 cm. The operator was the same one as in the March incident three weeks earlier. After filling in all the treatment data she realized that she had to change the mode from X to E. She did so and pressed ↵ Enter to move down to the command field. When "Beam ready" was displayed, she pressed B (Beam on). The machine shut down with a loud noise, which was heard over the intercom. Error 54 was displayed. The operator entered the room, and the patient said he had felt fire on his face. The patient died on May 1, 1986. The autopsy showed radiation damage to the right temporal lobe and the brainstem.

The hospital physicist halted treatments on the machine and notified AECL. After considerable effort, the physicist and the operator managed to reproduce the error 54 message. They determined that the speed of editing during data entry was the key factor in producing it. After much practice, the physicist could reproduce error 54 at will. AECL could not reproduce the error at first, and succeeded only after following the physicist's instructions to make the data entry very fast.

AECL calculated the dose delivered in the accident to be 25,000 rad.

Yakima Valley Memorial Hospital, 1987

On January 17, 1987 a patient was to receive two verification film exposures of 4 and 3 rad followed by an electron treatment of 79 rad. Film was placed under the patient and 4 rad were delivered through a 22x18 cm opening. The machine was stopped, the aperture was opened to 35x35 cm and a dose of 3 rad was delivered. The machine stopped. The operator entered the room to remove the film plates and adjust the patient's position, using the hand control inside the room to rotate the turntable. He left the room, forgetting the film plates, and back at the console, after seeing the message "Beam ready", pressed B to turn the beam on. After 5 seconds the machine stopped and displayed a message that quickly disappeared. Since the machine was paused, the operator pressed P (Proceed). The machine stopped, showing "Flatness" as the reason. The operator heard the patient over the intercom and entered the room. The patient said he had felt a burning sensation. The screen showed that he had been given only 7 rad. A few hours later the patient showed skin burns in the area. Four days later the reddening had a banded pattern similar to the one seen in the incident the previous year, for which no cause had been found. AECL began an investigation but was unable to reproduce the failure.

The hospital physicist conducted tests with film plates. Two X-ray exposures made with the turntable in the field-light position produced images similar to those on the film the operator had left behind on the day of the accident. The patient had been exposed to between 8,000 and 10,000 rad instead of the prescribed 86 rad. He died in April 1987 of complications from the radiation overdose. Family members filed a lawsuit that ended in an out-of-court settlement.

Causes

The investigative commission concluded that the primary causes of the accidents were poor development practices, inadequate requirements analysis and poor software design, rather than isolated errors in the source code. In particular, the Therac-25 software was designed in such a way that it was nearly impossible to find and fix bugs through automated testing.

Researchers found other secondary causes:

AECL never had the source code reviewed by an independent party.
AECL did not consider the design of the software in its assessment of potential failures and its risk management.
The system reported an error and stopped the beam, but showed only the message "MALFUNCTION" followed by a number from 1 to 64 (corresponding to an analog/digital channel number). The machine manual did not explain what the error codes meant, so operators simply dismissed the warning and proceeded with the treatment. One operator reported seeing an average of 40 "MALFUNCTION" messages every day.
AECL staff and machine operators initially did not believe the patients' complaints, which can be explained by their high confidence in the machine. AECL's training courses for operators stressed that there were so many safety mechanisms that it was impossible to deliver a radiation overdose.

Engineering issues were also encountered:

One of the faults occurred only when a particular key sequence was typed quickly on the VT100 terminal that controlled the Therac-25's PDP-11 computer. The operator had filled in all the fields and reached the command field when she noticed that the beam type field contained an X (X-rays) when it should have contained an E (electron beam). To correct it, the operator moved the cursor up with ↑ to that field, typed an E, moved back down with ↓ to the command field, typed a B and pressed ↵ Enter. The complete sequence was ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ E ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ B ↵ Enter. If this sequence was completed in less than 8 seconds, the machine could deliver up to 1000 times the intended radiation dose. This happened very rarely, and nobody knew that such a fault existed.

The design had no mechanical interlock to prevent the electron beam from operating in high-energy mode without the target in position.
The engineers had reused code from older models that did have mechanical interlocks. Those interlocks gave no indication when they were triggered, so the software faults they masked had never been suspected.
The software could not verify that the sensors were working properly (open loop). The manufacturer later added redundant switches.
The equipment control task was not properly synchronized with the operator interface task, so race conditions occurred if the operator changed the configuration quickly. Fast data entry was unlikely during initial testing, because the operators did not yet have enough experience. The software allowed concurrent access to shared memory, with no real synchronization other than the shared variables themselves.
The program set a flag variable by incrementing it instead of assigning it a fixed value. Occasionally the variable overflowed, causing the safety checks to be skipped at that moment.
The program was written in assembly language, a low-level language that demands more effort in design, programming and testing. In addition, it ran on a custom operating system written in-house.

Software problems

Programme tasks.

The overall software design was unsafe. Among the software bugs were two race conditions.

A race condition arises when the output or state of a process depends on a sequence of events that can execute in arbitrary order while operating on the same shared resource: an error occurs when the events do not execute in the order the programmer expected. The term comes from the image of two processes racing each other, with the state and output of the system depending on which one arrives first. The result can be inconsistent and unpredictable behavior that is incompatible with a deterministic system.
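The definition above can be illustrated with a minimal sketch. It is not Therac-25 code: the two tasks, the `run` helper and the schedule strings are hypothetical names chosen for illustration. Each task performs a read followed by a write on one shared value, and the final result depends only on the order in which the steps are interleaved.

```python
def run(schedule):
    """Run two read-modify-write tasks in the step order given by `schedule`.

    `schedule` is a string such as "AABB"; each letter lets that task
    execute its next step (hypothetical helper, for illustration only).
    """
    shared = {"value": 0}

    def incrementer():
        local = shared["value"]      # step 1: read the shared value
        yield
        shared["value"] = local + 1  # step 2: write back the incremented copy
        yield

    tasks = {"A": incrementer(), "B": incrementer()}
    for name in schedule:
        next(tasks[name])            # advance the chosen task by one step
    return shared["value"]


# Task A finishes before task B starts: both increments are kept.
print(run("AABB"))
# The reads interleave before either write: one increment is lost.
print(run("ABAB"))
```

The same two tasks produce different results under the two schedules, which is the defining property of a race condition.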

Data entry routines

Data entry screen in a Therac-25.

Sequence of data editing that causes error 54.

Data editing sequence that causes error 54 in less than 8 seconds.

Subroutines of the control program of a Therac-25.

The Tyler accidents were caused by problems in the data entry routines that allowed the set-up test to run before all the prescription parameters had been entered and checked. This was a race condition.

The treatment monitoring task (Treat) controlled the phases of treatment by executing eight subroutines. The variable Tphase (treatment phase) indicated which subroutine should run next. After each subroutine finished, Treat rescheduled itself. The data entry subroutine (Datent) communicated with the keyboard handler task through a shared variable, the data-entry completion flag, to determine whether all the prescription data had been entered. The keyboard handler recognizes that data entry is complete and sets the flag. Once it is set, Datent detects the change and changes Tphase from 1 (Data Entry) to 3 (Set-Up Test). Datent then exits back to Treat, which reschedules itself and begins executing the Set-Up Test subroutine. If the data-entry completion flag has not been set, Datent leaves Tphase unchanged and exits to the main body of Treat, which reschedules itself and so, in effect, reschedules Datent.
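The control flow just described can be sketched roughly as follows. This is an illustrative model only, not the original PDP-11 assembly: the `Shared` class and the function names are assumptions, and only the two Tphase values given in the text (1 for Data Entry, 3 for Set-Up Test) are modeled.

```python
DATA_ENTRY = 1   # Tphase value for Datent, as given in the text
SETUP_TEST = 3   # Tphase value for Set-Up Test, as given in the text


class Shared:
    """Variables shared between Treat and the keyboard handler (assumed layout)."""

    def __init__(self):
        self.tphase = DATA_ENTRY
        self.data_entry_complete = False  # set by the keyboard handler


def datent(shared):
    # Datent only advances the phase once the completion flag has been set.
    if shared.data_entry_complete:
        shared.tphase = SETUP_TEST


def treat_pass(shared, trace):
    """One scheduling pass of Treat: run the subroutine selected by Tphase."""
    if shared.tphase == DATA_ENTRY:
        trace.append("Datent")
        datent(shared)
    elif shared.tphase == SETUP_TEST:
        trace.append("Set-Up Test")


def demo():
    shared, trace = Shared(), []
    treat_pass(shared, trace)           # flag not set: Datent is rescheduled
    treat_pass(shared, trace)
    shared.data_entry_complete = True   # keyboard handler signals completion
    treat_pass(shared, trace)           # Datent sees the flag and sets Tphase to 3
    treat_pass(shared, trace)           # Treat now dispatches Set-Up Test
    return trace


print(demo())
```

The only coupling between the two tasks is the shared flag, which is why the timing of keyboard events relative to Treat's passes matters.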

Once all the parameters have been entered, Datent calls the Magnet subroutine, which sets the bending magnets in a process that takes about 8 seconds because of magnetic hysteresis. Magnet calls a subroutine, Ptime, to introduce a delay. Since several magnets need to be set, Ptime is entered and exited several times. A flag indicating that the magnets are being set is initialized on entry to Magnet and cleared at the end of Ptime. Ptime also checks a shared variable, set by the keyboard handler, that indicates an edit request. If there is an edit, Ptime clears the magnet flag and exits to Magnet, which in turn exits to Datent. However, Ptime checks the edit variable only if the magnet flag is set. Since Ptime clears that flag on its first execution, any edit made during later passes through Ptime is not recognized. As a result, a change of mode or energy that appears on the screen and in the mode and energy variables is not noticed by Datent, which therefore does not adjust the machine parameters.

Part of the problem was fixed after the accidents by clearing the magnet flag at the end of the Magnet subroutine (once all the magnets had been set) instead of at the end of the Ptime subroutine.
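The effect of where the magnet flag is cleared can be shown with a simplified model. It is a sketch under stated assumptions, not the original code: the `magnet` function, its parameters and the idea of an edit "arriving on pass N" are hypothetical, and the real timing was driven by a concurrent keyboard task rather than a loop counter.

```python
def magnet(num_magnets, edit_arrives_on_pass, fixed):
    """Return True if an operator edit made during magnet setting is noticed.

    num_magnets: number of Ptime calls, one per magnet (assumed).
    edit_arrives_on_pass: 1-based Ptime call during which the edit is made.
    fixed: if True, clear the flag at the end of Magnet (the later correction);
           if False, clear it at the end of the first Ptime (the original bug).
    """
    setting_magnets = True                 # flag initialized on entry to Magnet
    for pass_no in range(1, num_magnets + 1):
        edit_pending = pass_no >= edit_arrives_on_pass
        # Ptime: the edit request is examined only while the flag is still set.
        if setting_magnets and edit_pending:
            return True                    # exit to Datent, edit is processed
        if not fixed:
            setting_magnets = False        # bug: cleared at the end of Ptime
    setting_magnets = False                # fix: cleared when Magnet finishes
    return False


print(magnet(3, edit_arrives_on_pass=1, fixed=False))  # edit during first pass
print(magnet(3, edit_arrives_on_pass=2, fixed=False))  # edit during a later pass
print(magnet(3, edit_arrives_on_pass=2, fixed=True))   # same edit, corrected logic
```

In this model an edit that lands during the first delay is caught either way, while an edit during any later delay is caught only when the flag stays set until Magnet finishes.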

Variable overflow

Overflow of the Class3 variable.

This error occurred in the 1987 Yakima accident. It involved a race condition in which a software safety check failed, allowing the machine to be activated in the wrong configuration. The field-light feature allows very precise positioning of the patient for treatment. Normally the operator enters the parameters on the screen, goes into the room and makes the final adjustments by hand. Until then the screen shows the status "unverified"; after the adjustments, all the parameters show "verified". The operator then presses the "Set" button on the hand control or types "Set" at the console, and the machine moves the collimator into the proper position for treatment.

In the program, after the prescription has been entered and verified by the Datent routine, the control variable Tphase is changed so that the Set-Up Test routine is started. Each pass through Set-Up Test increments the shared variable Class3 by 1. A nonzero value of Class3 indicates an inconsistency, meaning that treatment should not proceed. A value of 0 indicates that the parameters are consistent and the beam can be turned on. After setting Class3, Set-Up Test checks for system errors by testing whether the shared variable F$mal is nonzero, which would indicate that the machine is not ready for treatment. In that case Set-Up Test is rescheduled. When F$mal is 0 (everything is ready for treatment), Set-Up Test sets Tphase to 2, which causes the Set-Up Done subroutine to be scheduled, and the treatment continues.

The interlock checking is performed by a concurrent task, Housekeeper (Hkeper). The upper collimator check is performed by a Hkeper subroutine called Lmtchk. Lmtchk first checks the Class3 variable. If Class3 is nonzero, Lmtchk calls the Chkcol subroutine to check the collimator. If Class3 is 0, Chkcol is bypassed and the position of the upper collimator is not checked. Chkcol sets bit 9 of the shared variable F$mal according to the position of the upper collimator (F$mal is in turn checked by Treat's Set-Up Test subroutine to decide whether to reschedule itself or proceed to Set-Up Done).

While the machine is being set up for a treatment session, Set-Up Test runs hundreds of times, because it reschedules itself while waiting for other events to occur. In the code, Class3 is incremented by 1 on each pass through Set-Up Test. Since Class3 is 1 byte (8 bits) wide, it can hold values only up to 255 decimal. On every 256th pass through Set-Up Test the variable overflows to 0, the value that means the beam can be turned on. That means that on every 256th pass the upper collimator is not checked and a collimator fault goes undetected.
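The wrap-around arithmetic can be checked with a short sketch. Python integers do not overflow, so the one-byte variable is emulated with a mask; the function name and the starting value of 0 are assumptions made for illustration.

```python
def passes_with_zero(total_passes):
    """Return the pass numbers at which a one-byte counter reads 0."""
    class3 = 0                            # assumed initial value
    zero_passes = []
    for n in range(1, total_passes + 1):
        class3 = (class3 + 1) & 0xFF      # one-byte increment: 255 wraps to 0
        if class3 == 0:
            zero_passes.append(n)
    return zero_passes


print(passes_with_zero(1000))
```

With 1000 passes the counter reads 0 on passes 256, 512 and 768, so a window in which the check is skipped recurs at a fixed interval for as long as set-up continues.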

The overdose occurred when the operator pressed the Set button (or typed Set at the keyboard) at the precise instant that Class3 overflowed to 0. Chkcol was then not executed, and F$mal was not updated to indicate that the upper collimator was still in the field-light position. The software switched on the full 25 MeV beam with no target in place and no flattening filter in the beam path, producing a highly concentrated electron beam.

AECL corrected this problem by changing the Class3 variable to a fixed value other than 0 (instead of incrementing it) on each pass of the Set-Up Test routine.
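A simplified model shows why assigning a fixed nonzero value closes the gap. The sketch below is an assumption-laden illustration, not the original code: it recomputes F$mal on every pass, uses 1 as the fixed value (the text says only that it was nonzero), and the function names are hypothetical.

```python
COLLIMATOR_BIT = 1 << 9   # bit 9 of F$mal, as described in the text


def setup_may_proceed(class3, collimator_in_position):
    """Model one pass: Lmtchk gates Chkcol, then Set-Up Test reads F$mal."""
    f_mal = 0
    if class3 != 0:                        # Lmtchk calls Chkcol only if nonzero
        if not collimator_in_position:     # Chkcol flags a misplaced collimator
            f_mal |= COLLIMATOR_BIT
    return f_mal == 0                      # Set-Up Test proceeds when F$mal is 0


def first_unsafe_pass(total_passes, use_fix):
    """Return the first pass that proceeds with a misplaced collimator, or None."""
    class3 = 0
    for n in range(1, total_passes + 1):
        if use_fix:
            class3 = 1                     # fixed nonzero value (assumed to be 1)
        else:
            class3 = (class3 + 1) & 0xFF   # original one-byte increment
        if setup_may_proceed(class3, collimator_in_position=False):
            return n
    return None


print(first_unsafe_pass(1000, use_fix=False))
print(first_unsafe_pass(1000, use_fix=True))
```

With the increment, the model lets set-up proceed on pass 256 even though the collimator is out of position; with the fixed assignment no pass ever does.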

Lessons Learned

The importance of safety versus ease of use for the operator. Making the interface friendlier can conflict with safety goals.

Accidents are almost never simple. They normally involve interdependent events with technical, human and organizational factors. One of the most serious errors is the tendency to believe that the cause of an accident has been determined without adequate evidence to support that conclusion (e.g. the microswitch in the Hamilton accident) or without investigating all the contributing factors. Another mistake is to assume that fixing one particular bug will prevent all future accidents.

Accidents are often attributed to human error, and almost any contributing factor can be described that way. Even a failure caused by material wear could be called human error if adequate redundancy was not provided or staff were not trained to maintain and replace worn parts. Concluding that an accident was the result of human error is therefore neither meaningful nor helpful.

In accidents in complex systems, all contributing factors must be considered. On the Therac-25 they were:

Inadequate management and lack of procedures for tracking reported incidents.

Overconfidence in software, and the removal of physical interlocks, which made the software a single point of failure that could lead to an accident. Software should not bear sole responsibility for safety.

Engineers tend to overlook software. In the initial investigation of the Therac-25 incidents it was assumed that the cause lay in the hardware, and the investigation focused on the hardware.
In process control systems a software error can be mistaken for a hardware failure. Without a thorough investigation supported by detailed event logs, it is not possible to determine whether a sensor supplied wrong information, the software issued an incorrect command or the actuator had an intermittent fault.

There were no independent checks that the software was operating properly. The burden of catching every error cannot fall on the operators, least of all when they are not given the means to do so. The Therac-25 software misinformed the operators and did not detect an overdose. The system's ionization chambers could not handle the high ionization density of an unfiltered electron beam at high current; they saturated and indicated that a low dose had been delivered. Engineers should design for the worst possible case.

Any company that makes safety-critical equipment must have procedures for incident handling, risk management, traceability and quality control.

Improved software engineering practices. Documentation should not be written after the fact. Software quality control practices should be established. Designs should be kept simple. Logging that permits traceability and error auditing should be built in from the outset. Software must be tested extensively at the module level; performing all the testing on the complete system is not adequate. Reusing software modules in another system does not guarantee safety and can lead to dangerous designs. Software of unknown provenance (SOUP) is code that has no formal documentation, or that was developed by a third party with no evidence of the controls applied during its development. Such code must be presumed capable of producing faults. It is very important to carry out a risk analysis of any SOUP code that is to be introduced into a project and to justify the reasons for using it.

Unrealistic risk assessments and overconfidence in those assessments. AECL's claim that safety had increased by five orders of magnitude as a result of the microswitch change made after the Hamilton accident is difficult to justify.

Safety-critical projects should incorporate special procedures for safety and design analysis. Safety should be ensured even when there are software errors. The Therac-20 contained the same software error that was involved in the Tyler deaths on the Therac-25, but it had physical safety interlocks that mitigated the consequences. Protection against software errors can also be implemented in the software itself.
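As one illustration of protection implemented in software, a beam-enable check can be written so that it fails safe: the beam is permitted only when every condition is positively confirmed. The sketch below is hypothetical and is not drawn from any Therac code; the enum names, the `position_sensor_ok` input and the `beam_permitted` function are assumptions. A check like this complements hardware interlocks and does not replace them.

```python
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class Turntable(Enum):
    ELECTRON = "electron"
    XRAY = "xray"
    FIELD_LIGHT = "field_light"


# Turntable position required for each beam mode (the field-light position
# never permits a beam).
REQUIRED_POSITION = {
    Mode.ELECTRON: Turntable.ELECTRON,
    Mode.XRAY: Turntable.XRAY,
}


def beam_permitted(mode, turntable, position_sensor_ok):
    """Allow the beam only when the position is confirmed and matches the mode."""
    if not position_sensor_ok:
        return False                      # unknown position: refuse by default
    return REQUIRED_POSITION.get(mode) is turntable


print(beam_permitted(Mode.XRAY, Turntable.XRAY, True))
print(beam_permitted(Mode.XRAY, Turntable.FIELD_LIGHT, True))
print(beam_permitted(Mode.ELECTRON, Turntable.ELECTRON, False))
```

The default answer is refusal, so a missing sensor reading or an unrecognized mode blocks the beam instead of silently allowing it.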

Obligation to report incidents to regulatory agencies and other system users. AECL did not report the first incidents to regulators or other users. Users held at least three meetings to comment on incidents and propose improvements.

