Building a digital diagnostic assistant on shared experiences
Source: Nieke Roos, Bits & Chips
TNO-ESI has brought all its diagnostics projects together under one roof. This clustering is paving the way for high-tech companies like ASML and Canon Production Printing to benefit from each other in developing a digital diagnostic assistant.
“In designing system diagnostics functionality, we’re used to following the Pareto principle: we attack the 20 percent of the causes that are responsible for 80 percent of the problems. Unfortunately, with this, we cover the large and mostly trivial part of the issues – those that occur frequently and that can relatively easily be correlated using data science technology,” says ASML’s Martijn van Veelen. “It doesn’t help us with the many rare cases that are similar but not the same. That’s what we want to tackle with TNO-ESI.”
“This is a generic challenge with diagnosing complex machines,” observes Peter Kruizinga of Canon Production Printing. “When analyzing the different issues in the field, we also see a number of frequent problems alongside a very long tail of problems that crop up only very occasionally. Almost half of the issues encountered by our service technicians occur rarely. Together with TNO-ESI, we’re looking to develop a methodology for dealing with these types of problems.”
“High-tech equipment is getting more and more complex, but its diagnostic capabilities aren’t keeping up, resulting in availability and performance issues and even system failures that are very hard to troubleshoot. We’re seeing this at all our OEM partners,” notes TNO-ESI’s Masoud Dorosti. “As of last year, we have a project cluster specifically focused on diagnostics. By bringing all our projects in this domain under one roof, we aim to stimulate knowledge sharing, not only between the teams but also between the partners, and get the most out of our combined efforts.”
The Carefree project with Canon Production Printing started at the beginning of 2020. “Initially, we focused on hardware failures that prevent the printer from functioning altogether,” says Kruizinga, a lead technologist at the printing specialist. “In such so-called hard-down situations, a service engineer is dispatched to analyze the problem, replace the defective part and get the machine up and running again.”
Most service visits, however, are paid to systems that are still functioning, albeit suboptimally. “The machine is printing but the quality is below par, or the paper isn’t being handled or transported correctly,” Kruizinga illustrates. “As hard-down situations are relatively localized, the failing part can be found relatively easily. Performance issues are much bigger problems in the sense that they’re more system-level. They can have a variety of causes, occurring in a host of places. Subpar quality, for example, can have lots of reasons, including ink residues building up or dust collecting somewhere in the machine. Finding the root causes is much more difficult.”
After developing an approach to diagnosing hardware failures, based on so-called Bayesian networks, the focus of the Carefree project shifted to addressing this system-level type of problem. “In hard-down situations, the structure of the system tells us a lot about the causes and effects and we can use that to scale up toward a solution,” Kruizinga points out. “For print quality issues and other performance problems, the knowledge isn’t so well structured. We’re currently mapping out the different failing mechanisms to see if we can generalize them. With TNO-ESI, we’re looking into probabilistic programming as a very promising research avenue.”
“We’re developing reasoning models that produce hypotheses about what could be the cause of the symptoms we’re observing from the system,” adds Leonardo Barbini, a researcher involved in TNO-ESI’s diagnostics expertise team and diagnostics project cluster. “Unlike hardware failures, performance problems aren’t black and white – it’s not a matter of a component that’s working or not; we’re building a hypothesis generator for a gray area. From a mathematical point of view, this means moving from discrete random variables to continuous random variables, prompting the natural evolution from Bayesian networks with discrete states to a probabilistic programming paradigm.”
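The step Barbini describes, from discrete to continuous random variables, can be illustrated with a small sketch. It is not the Carefree model: the component names, probabilities and the linear density model below are made up for illustration, and a simple grid approximation stands in for what a probabilistic programming framework would do more generally. The first part ranks mutually exclusive "working or not" fault hypotheses with Bayes' rule; the second infers a continuous contamination level from a single noisy print-density measurement.

```python
import math

def discrete_posterior(priors, likelihoods):
    """Bayes' rule over mutually exclusive fault hypotheses (single-fault assumption)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}

def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution, used as the measurement-noise model."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def grid_posterior(observed, model, sigma, steps=100):
    """Posterior over a continuous level in [0, 1], uniform prior, evaluated on a grid."""
    grid = [i / steps for i in range(steps + 1)]
    weights = [gaussian_pdf(observed, model(level), sigma) for level in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

# Discrete case: hypothetical priors and P(symptom | fault) for a hard-down situation.
priors = {"motor": 0.2, "sensor": 0.3, "belt": 0.5}
likelihoods = {"motor": 0.9, "sensor": 0.1, "belt": 0.2}
post = discrete_posterior(priors, likelihoods)

# Continuous case: hypothetical model in which print density drops linearly
# with the contamination level (0 = clean, 1 = fully contaminated).
def density_model(level):
    return 1.0 - 0.5 * level

grid, weights = grid_posterior(observed=0.8, model=density_model, sigma=0.05)
mean_level = sum(g * w for g, w in zip(grid, weights))

print(max(post, key=post.get))   # most probable discrete fault
print(round(mean_level, 2))      # posterior mean of the contamination level
```

In the discrete case the answer is a ranking of faults; in the continuous case it is a distribution over how degraded a part is, which matches the "gray area" of performance problems.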
“Years ago, TNO-ESI did a project on performance diagnostics in our application domain, but for us, it turned out to be very difficult to get the right performance data from our customers, as that data is highly sensitive,” recalls Van Veelen, a systems engineer and technology researcher at ASML. “So we decided to move away from performance diagnostics and instead utilize our knowledge of the system design and the underlying physical principles to create a diagnostic model. This resulted in the Assisted Diagnostics in Action project, ADIA, and its successor, System Diagnostics in Action, or SD2Act, both with TNO-ESI.”
“In using the design information and the physics knowledge, we have two objectives,” Van Veelen explains. “The first is zero repeat. Since we have over a thousand systems in the field, all with a slightly different configuration, we can’t apply exactly the same reasoning model to each and every system. With some generalization, however, we can recognize that an incident may not be identical to something we’ve seen before but similar enough so that we can use the same type of reasoning although it may be a slightly different system. Data science and fingerprinting alone will never provide sufficient statistical confidence, but by feeding design information into the equation, we can achieve the right abstraction level.”
The second objective, shared with Canon Production Printing, is to create a hypothesis generator that enables a move from mere correlation to causation. “Despite all the effort being put into FMEAs, failure mode and effects analyses, it’s my experience that they only cover a small part of what can go wrong in practice,” ASML’s Van Veelen goes on to elaborate. “This is even worse when analyzing a system that doesn’t exist yet. We, as humans, are simply incapable of overseeing all the possible failure scenarios. By having these automatically generated, we can extend the fingerprint of an incident with the causal relations from the symptoms to the failures. That gives us a kind of explainability based on real evidence – the design documentation and the physical principles – and it relieves us from the burden of having to statistically prove the correctness of our diagnostics, which is impossible when you only have rare cases.”
Going from ADIA to SD2Act, ASML broadened its focus. “In ADIA, which started 6-7 years ago, we limited our scope to relatively straightforward parts of the system, like electrical connections and the flow of cooling water. These have very structured design diagrams that can basically be parsed by a script and then converted through a model-to-model transformation to obtain the causal relations,” Van Veelen points out. “Running from 2021-2024, SD2Act takes the diagnostics to the system level, using architectural information about the system decomposition and the functional deployment onto the hardware to derive as many causal relations as possible.”
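How a structured design diagram might be turned into causal relations can be sketched in a few lines. This is only an illustration under assumptions: the text format, the component names and the rule "anything upstream of a symptom is a candidate cause" are hypothetical and not taken from ADIA or SD2Act.

```python
from collections import defaultdict, deque

# Hypothetical export of a cooling-water diagram: "source -> target" per line.
DIAGRAM = """
chiller -> pump
pump -> manifold
manifold -> stage_cooler
manifold -> lens_cooler
"""

def parse_connections(text):
    """Map every component to the set of components directly upstream of it."""
    upstream = defaultdict(set)
    for line in text.strip().splitlines():
        source, target = (part.strip() for part in line.split("->"))
        upstream[target].add(source)
    return upstream

def candidate_causes(upstream, symptom_component):
    """Breadth-first walk against the flow direction, nearest components first.

    The 'seen' set ensures the walk also terminates on diagrams with loops.
    """
    seen = {symptom_component}
    queue = deque([symptom_component])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for parent in sorted(upstream.get(node, ())):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order

upstream = parse_connections(DIAGRAM)
causes = candidate_causes(upstream, "stage_cooler")
print(causes)  # ['stage_cooler', 'manifold', 'pump', 'chiller']
```

A symptom at the stage cooler yields the cooler itself and everything feeding it as hypotheses, while the lens cooler on the parallel branch is left out.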
By clustering its diagnostics projects, TNO-ESI has paved the way for all kinds of cross-fertilization. “A big diagnostic challenge is to determine a service engineer’s next action when getting a system up and running again,” observes Barbini. “This requires gathering additional system information or performing additional system tests. Both with ASML and Canon, we had discussions on this topic, which resulted in the same solution being implemented on both ends.”
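The article doesn't say which solution was implemented for choosing the next action. One common way to frame the problem, shown here purely as an assumption, is to pick the test with the highest expected information gain: the test whose outcome is expected to reduce the uncertainty over the fault hypotheses the most. All hypothesis names, test names and probabilities below are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {hypothesis: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_information_gain(belief, p_pass):
    """Expected entropy reduction from a pass/fail test.

    p_pass[h] is the probability that the test passes if hypothesis h is true.
    """
    p_fail = {h: 1.0 - p for h, p in p_pass.items()}
    expected_entropy = 0.0
    for likelihood in (p_pass, p_fail):
        joint = {h: belief[h] * likelihood[h] for h in belief}
        p_outcome = sum(joint.values())
        if p_outcome > 0:
            posterior = {h: v / p_outcome for h, v in joint.items()}
            expected_entropy += p_outcome * entropy(posterior)
    return entropy(belief) - expected_entropy

# Hypothetical current belief over root causes of a print-quality issue.
belief = {"clogged_nozzle": 0.5, "worn_belt": 0.3, "dusty_sensor": 0.2}

# Hypothetical tests with P(test passes | root cause).
tests = {
    "nozzle_check": {"clogged_nozzle": 0.05, "worn_belt": 0.95, "dusty_sensor": 0.95},
    "belt_tension": {"clogged_nozzle": 0.90, "worn_belt": 0.10, "dusty_sensor": 0.90},
    "reboot": {"clogged_nozzle": 0.50, "worn_belt": 0.50, "dusty_sensor": 0.50},
}

gains = {name: expected_information_gain(belief, p) for name, p in tests.items()}
best = max(gains, key=gains.get)
print(best)
```

A test whose outcome doesn't depend on the root cause, like the reboot here, scores zero and would never be recommended. A practical version would also weigh the time and cost of each action.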
Barbini gives some more examples: “In ADIA, we already did a lot of work on model-based FMEAs, parts of which are automatically filled in to reduce the effort it takes to make them. It turns out that they’re also very interesting for Canon. We’re now in talks with them to see if they can reuse the simulation approach and modeling language we developed at ASML. The other way around, as we’re moving toward performance diagnostics at Canon, we’re seeing some of the earlier interest resurface at ASML to meet the need that’s still there.”
Some of the (mathematical) frameworks developed under the roof of the cluster also lend themselves well to being shared, believes Barbini’s colleague Dorosti, who manages the Carefree and SD2Act projects for TNO-ESI. “At ASML, we came up with new ways to handle causal loops in a system. These ideas can easily be transferred to Canon. Conversely, our experiences with probabilistic programming at Canon could fertilize our discussions about performance diagnostics at ASML.”
Every 3-4 months, TNO-ESI organizes cluster meetings in which the diagnostics experts from the different partners take turns presenting their challenges. Kruizinga finds these meetings very insightful: “It really helps to hear that others are having similar problems and learn how they’re tackling them.” ASML’s Van Veelen concurs: “It’s highly instrumental to share experiences, not only about the technical side of things but also about how to bring the innovations we come up with to our engineers and service people – going from making FMEAs to collaborating with designers and doing modeling work is no small step.”
Machine reasoning
The ultimate goal is to create a digital diagnostic assistant. “We can’t keep on building ever more complex systems without dramatically improving our development efficiency,” asserts Dorosti. “Our engineers need to achieve better results in a shorter time. They can only do that with digital diagnostic assistance based on advanced reasoning models.”
“Scalability is the name of the game,” concludes Kruizinga. “We need a methodology for diagnostics, not a point solution. Whether a machine is in a hard-down situation or having performance issues, our main challenge is to efficiently find the cause of a wide range of potentially very rare problems. For us, this means having a digital tool that helps our service engineers perform their diagnostic tasks, for example by doing part of the reasoning in the printer itself. The next step is preparing our machines already in the design phase for a more efficient diagnosis in the field.”
“We envision a digital diagnostic assistant that makes the lives of our service technicians easier,” philosophizes ASML’s Van Veelen. “We want it to rule out trivial causes and automatically report a broken part, for example. We’d like it to point our service people to possible alternative causes elsewhere in the machine, come up with hypotheses much sooner than a team of experts can, and enable us to focus our diagnostic efforts on where we can innovate. Above all, it has to integrate seamlessly into our existing toolchain, as we don’t want to change our established way of working.”
“We want to reduce the amount of human reasoning by increasing the amount of machine reasoning,” summarizes TNO-ESI’s Barbini. “We want to automate what’s easy and assist in what’s difficult.”