Section: New Results
Data Analysis and Learning
Participants : Konstantin Avrachenkov, Hlib Mykhailenko, Giovanni Neglia, Dmytro Rubanov.
Unsupervised learning
In [21] K. Avrachenkov in collaboration with A. Kondratev and V. Mazalov (both from Petrozavodsk State Univ., Russia) apply cooperative game-theoretic methods for community detection in networks. The traditional methods for detecting community structure are based on selecting denser subgraphs inside the network. Their new approach is to use the methods of cooperative game theory that highlight not only the link density but also the mechanisms of cluster formation. Specifically, they suggest two approaches from cooperative game theory: the first approach is based on the Myerson value, whereas the second approach is based on hedonic games. Both approaches allow to detect clusters with various resolution. However, the tuning of the resolution parameter in the hedonic games approach is particularly intuitive. Furthermore, the modularity based approach and its generalizations can be viewed as particular cases of the hedonic games.
Kernels and, broadly speaking, similarity measures on graphs are extensively used in graph-based unsupervised and semi-supervised learning algorithms as well as in the link prediction problem. In [19] K. Avrachenkov and D. Rubanov in collaboration with P. Chebotarev (Trapeznikov Institute of Control Sciences, Russia) analytically study proximity and distance properties of various kernels and similarity measures on graphs. This can potentially be useful for recommending the adoption of one or another similarity measure in a machine learning method. Also, they numerically compare various similarity measures in the context of spectral clustering and observe that normalized heat-type similarity measures with log modification generally perform the best.
Semi-supervised learning
Graph-based semi-supervised learning for classification endorses a nice interpretation in terms of diffusive random walks, where the regularisation factor in the original optimisation formulation plays the role of a restarting probability. Recently, a new type of biased random walks for characterising certain dynamics on networks have been defined and rely on the -th power of the standard Laplacian matrix . In particular, these processes embed long range transitions, the Levy flights, that are capable of one-step jumps between far-distant states (nodes) of the graph. In [24] K. Avrachenkov in collaboration with E. Bautista, S. De Nigris, P. Abry and P. Gonçalves (from Dante Inria team and ENS Lyon) build upon these volatile random walks to propose a new version of graph based semi-supervised learning algorithms whose classification outcome could benefit from the dynamics induced by the fractional transition matrix. In [22] using the framework of Levy flights, they further improve the classification outcome, even in settings traditionally poorly performing such as unbalanced classes, and they derive a theoretical rule for classification decision.
In [6] K. Avrachenkov in collaboration with P. Chebotarev (Trapeznikov Institute of Control Sciences, Russia) and A. Mishenin (Saint Petersburg Univ., Russia) study a semi-supervised learning method based on the similarity graph and regularized Laplacian. They give convenient a optimization formulation of the regularized Laplacian method and establish its various properties. In particular, they show that the kernel of the method can be interpreted in terms of discrete and continuous-time random walks and possesses several important properties of proximity measures. Both optimization and linear algebra methods can be used for efficient computation of the classification functions. The authors demonstrate on numerical examples that the regularized Laplacian method is robust with respect to the choice of the regularization parameter and outperforms the Laplacian-based heat kernel methods.
Distributed computing
In distributed graph computation, graph partitioning is an important preliminary step, because the computation time can significantly depend on how the graph has been split among the different executors. In [30] H. Mykhailenko and G. Neglia, in collaboration with F. Huet (I3S) propose a framework for distributed edge partitioning based on simulated annealing. The framework can be used to optimize a large family of partitioning metrics. They provide sufficient conditions for convergence to the optimum as well as discuss which metrics can be efficiently optimized in a distributed way. They implemented these partitioners in Apache GraphX and performed a preliminary comparison with JA-BE-JA-VC, a state-of-the-art partitioner that inspired the new approach. They show that this approach can provide improvements, but further research is required to identify suitable metrics to optimize as well as to design a more efficient exploration phase for the algorithm without sacrificing convergence properties.
Because of the significant increase in size and complexity of the networks, the distributed computation of eigenvalues and eigenvectors of graph matrices has become very challenging and yet it remains as important as before. In [20] K. Avrachenkov in collaboration with P. Jacquet (Nokia Bell Labs) and J. Sreedharan (Purdue Univ., USA) develop efficient distributed algorithms to detect, with higher resolution, closely situated eigenvalues and corresponding eigenvectors of symmetric graph matrices. We model the system of graph spectral computation as physical systems with Lagrangian and Hamiltonian dynamics. The spectrum of Laplacian matrix, in particular, is framed as a classical spring-mass system with Lagrangian dynamics. The spectrum of any general symmetric graph matrix turns out to have a simple connection with quantum systems and it can be thus formulated as a solution to a Schrödinger-type differential equation. Taking into account the higher resolution requirement in the spectrum computation and the related stability issues in the numerical solution of the underlying differential equation, we propose the application of symplectic integrators to the calculation of eigenspectrum. The effectiveness of the proposed techniques is demonstrated with numerical simulations on real-world networks of different sizes and complexities.