Section:
Application Domains
Resource Allocation for High Performance and Cloud Computing
In the context of numerical simulations on high performance machines, optimizing data
locality and resource usage is crucial for reducing both execution times and energy
consumption. This optimization can be seen as a special case of a scheduling problem on
parallel resources, with several challenges. First, instances are typically large: a tiled
matrix factorization with n × n blocks generates on the order of n³ tasks, so realistic
instances easily involve millions of tasks. Second, HPC platforms consist of heterogeneous
and unrelated resources, which is known to make scheduling problems hard to approximate.
Finally, due to co-scheduling effects and shared communication resources, it is not realistic
to model the exact duration of tasks accurately. All these observations make it impossible to
rely on static optimal solutions, and HPC applications have moved from simple, generic static
allocations to dynamic runtime scheduling strategies that make their decisions based on the
current state of the platform (the location of input data), the expected transfer and running
times of the tasks, and affinity and priority information that may have been computed
offline. In this context, we are
strongly involved in the design of scheduling strategies for the StarPU runtime, with two
goals: proving that it is possible to design approximation algorithms whose complexity is
extremely small (typically sub-linear in the number of ready tasks), and showing that they
can be used in practice with good performance results. We are pursuing collaborations both
with the team developing the StarPU runtime system (Storm), by designing algorithms for the
generic scheduling problems [20], and with the team developing linear algebra algorithms on
top of the runtime (Hiepacs), by proposing specialized algorithms for specific cases. For
example, in the case of linear algebra applications on heterogeneous platforms, we have
considered the combinatorial optimization problem associated with matrix multiplication,
which amounts to partitioning the unit square into zones of prescribed areas while minimizing
the overall size of the boundaries between zones. We have improved the best known
approximation ratio to 1.15 in [40], and we have shown in [41] that the resulting
distribution schemes can indeed be used to design efficient implementations on top of StarPU.
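To make the square-partitioning objective concrete, the sketch below evaluates a naive
column-based layout: each zone is a rectangle whose area is proportional to the relative
speed of one processor, and the cost is the sum of half-perimeters, which models the
communication volume of a heterogeneous matrix product. This is only an illustration of the
objective and of its lower bound; the function names and the toy speed vector are ours, and
the sketch does not reproduce the 1.15-approximation algorithm of [40].

    import math

    def column_partition(areas, num_cols):
        """Assign the rectangles (one per processor, area = normalized speed) to
        num_cols vertical columns of the unit square and return the total
        half-perimeter sum(width_i + height_i) of the layout, which models the
        communication volume of a heterogeneous matrix product."""
        chunks = [areas[i::num_cols] for i in range(num_cols)]  # naive grouping
        cost = 0.0
        for chunk in chunks:
            width = sum(chunk)              # column width = total area of its rectangles
            for a in chunk:
                cost += width + a / width   # rectangle of size width x (a / width)
        return cost

    def lower_bound(areas):
        # a zone of area a has half-perimeter at least 2 * sqrt(a) (a square)
        return sum(2.0 * math.sqrt(a) for a in areas)

    # toy instance: relative speeds of five heterogeneous workers, normalized to areas
    speeds = [10, 8, 4, 2, 1]
    areas = sorted((s / sum(speeds) for s in speeds), reverse=True)
    for c in (1, 2, 3):
        print(c, "columns:", round(column_partition(areas, c), 3),
              " lower bound:", round(lower_bound(areas), 3))

On this toy instance the two-column layout already comes within roughly 10% of the lower
bound; the algorithm of [40] provides the 1.15 worst-case guarantee mentioned above.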
In the context of Cloud Computing platforms and Data Science, we are interested in resource
allocation and data placement strategies. For Big Data applications running on data center
platforms, data locality is the largest contributing factor to application performance. In
practice, allocation decisions are made at runtime based on strategies that tend to favor
local tasks (a toy version of this greedy rule is sketched at the end of this section). Our
goal is to assess and to improve the efficiency of these runtime strategies. In particular,
we have proven that the problem of maximizing locality when allocating map tasks can be cast
as a graph orientation and semi-matching problem, which enabled us to assess the efficiency
of the classical MapReduce allocation algorithm for the map phase in [34] and to propose a
low-cost algorithm to compute optimal allocation schemes. We also consider the more generic
problem of VM placement in large-scale data centers. This placement is often handled with
naive greedy rules, whereas efficient online or offline allocation algorithms can be
designed, for example to optimize the reliability of all applications on a platform subject
to faults [48] or to co-locate applications with compatible periodic load variations [47].
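As an illustration of the locality objective discussed above for the map phase, the sketch
below simulates the classical greedy rule in which a machine requesting work receives a task
whose input chunk it stores locally whenever one remains, and a remote task otherwise. The
function name, the replica layout, and the toy instance are ours and only serve to expose the
metric; the optimal, semi-matching-based allocation of [34] is not reproduced here.

    import random
    from collections import defaultdict

    def greedy_map_allocation(replicas, machines, seed=0):
        """Classical greedy rule: each time a machine requests work it receives
        a task whose input chunk it stores locally, if any remains; otherwise
        it is given a remote task.  Returns the number of non-local tasks,
        i.e., the number of input chunks that must be transferred."""
        rng = random.Random(seed)
        remaining = set(replicas)                  # tasks not yet allocated
        stored = defaultdict(set)                  # machine -> chunks it stores
        for task, hosts in replicas.items():
            for m in hosts:
                stored[m].add(task)
        non_local = 0
        while remaining:
            m = rng.choice(machines)               # next machine asking for work
            local_candidates = stored[m] & remaining
            if local_candidates:
                remaining.remove(local_candidates.pop())   # local task, no transfer
            else:
                remaining.pop()                            # remote task, chunk must move
                non_local += 1
        return non_local

    # toy instance: 8 chunks, each replicated on 2 of 4 machines (HDFS-like replication)
    machines = ["m0", "m1", "m2", "m3"]
    replicas = {f"chunk{i}": random.Random(i).sample(machines, 2) for i in range(8)}
    print("non-local map tasks:", greedy_map_allocation(replicas, machines))

The graph orientation and semi-matching formulation of [34] is what allows us both to bound
the number of non-local tasks produced by such a rule and to compute an optimal allocation at
low cost.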