Section: Overall Objectives
Scientific Objectives
Introduction
In the increasingly networked world, reliability of applications becomes ever more critical as the number of users of, e.g., communication systems, web services, transportation etc., grows steadily. Management of networked systems, in a very general sense of the term, therefore is a crucial task, but also a difficult one.
MExICo strives to take advantage of distribution by orchestrating cooperation between different agents that observe local subsystems, and interact in a localized fashion.
The need for applying formal methods in the analysis and management of complex systems has long been recognized. It is with much less unanimity that the scientific community embraces methods based on asynchronous and distributed models. Centralized and sequential modeling still prevails.
However, we observe that crucial applications have increasing numbers of users, that networks providing services grow fast both in the number of participants and the physical size and degree of spatial distribution. Moreover, traditional isolated and proprietary software products for local systems are no longer typical for emerging applications.
In contrast to traditional centralized and sequential machinery for which purely functional specifications are efficient, we have to account for applications being provided from diverse and non-coordinated sources. Their distribution (e.g. over the Web) must change the way we verify and manage them. In particular, one cannot ignore the impact of quantitative features such as delays or failure likelihoods on the functionalities of composite services in distributed systems.
We thus identify three main characteristics of complex distributed systems that constitute research challenges:
-
Concurrency of behavior;
-
Interaction of diverse and semi-transparent components; and
-
management of Quantitative aspects of behavior.
Concurrency
The increasing size and the networked nature of communication systems, controls, distributed services, etc. confront us with an ever higher degree of parallelism between local processes. This field of application for our work includes telecommunication systems and composite web services. The challenge is to provide sound theoretical foundations and efficient algorithms for management of such systems, ranging from controller synthesis and fault diagnosis to integration and adaptation. While these tasks have received considerable attention in the sequential setting, managing non-sequential behavior requires profound modifications for existing approaches, and often the development of new approaches altogether. We see concurrency in distributed systems as an opportunity rather than a nuisance. Our goal is to exploit asynchronicity and distribution as an advantage. Clever use of adequate models, in particular partial order semantics (ranging from Mazurkiewicz traces to event structures to MSCs) actually helps in practice. In fact, the partial order vision allows us to make causal precedence relations explicit, and to perform diagnosis and test for the dependency between events. This is a conceptual advantage that interleaving-based approaches cannot match. The two key features of our work will be (i) the exploitation of concurrency by using asynchronous models with partial order semantics, and (ii) distribution of the agents performing management tasks.
Interaction
Systems and services exhibit non-trivial interaction between specialized and heterogeneous components. A coordinated interplay of several components is required; this is challenging since each of them has only a limited, partial view of the system's configuration. We refer to this problem as distributed synthesis or distributed control. An aggravating factor is that the structure of a component might be semi-transparent, which requires a form of grey box management.
Quantitative Features
Besides the logical functionalities of programs, the quantitative aspects of component behavior and interaction play an increasingly important role.
-
Real-time properties cannot be neglected even if time is not an explicit functional issue, since transmission delays, parallelism, etc, can lead to time-outs striking, and thus change even the logical course of processes. Again, this phenomenon arises in telecommunications and web services, but also in transport systems.
-
In the same contexts, probabilities need to be taken into account, for many diverse reasons such as unpredictable functionalities, or because the outcome of a computation may be governed by race conditions.
-
Last but not least, constraints on cost cannot be ignored, be it in terms of money or any other limited resource, such as memory space or available CPU time.