2025Activity reportProject-TeamD-DAL
RNSR: 202524739L- Research center Inria Centre at the University of Lille
- In partnership with:Université de Lille
- Team name: Data : Dynamics Algorithmics and Logics
- In collaboration with:Centre de Recherche en Informatique, Signal et Automatique de Lille
Creation of the Project-Team: 2025 October 01
Each year, Inria research teams publish an Activity Report presenting their work and results over the reporting period. These reports follow a common structure, with some optional sections depending on the specific team. They typically begin by outlining the overall objectives and research programme, including the main research themes, goals, and methodological approaches. They also describe the application domains targeted by the team, highlighting the scientific or societal contexts in which their work is situated.
The reports then present the highlights of the year, covering major scientific achievements, software developments, or teaching contributions. When relevant, they include sections on software, platforms, and open data, detailing the tools developed and how they are shared. A substantial part is dedicated to new results, where scientific contributions are described in detail, often with subsections specifying participants and associated keywords.
Finally, the Activity Report addresses funding, contracts, partnerships, and collaborations at various levels, from industrial agreements to international cooperations. It also covers dissemination and teaching activities, such as participation in scientific events, outreach, and supervision. The document concludes with a presentation of scientific production, including major publications and those produced during the year.
Keywords
Computer Science and Digital Science
- A2.1. Programming Languages
- A2.1.1. Semantics of programming languages
- A2.1.4. Functional programming
- A2.1.6. Concurrent programming
- A3.1. Data
- A3.1.1. Modeling, representation
- A3.1.2. Data management, quering and storage
- A3.1.3. Distributed data
- A3.1.4. Uncertain data
- A3.1.5. Control access, privacy
- A3.1.6. Query optimization
- A3.1.7. Open data
- A3.1.8. Big data (production, storage, transfer)
- A3.1.9. Database
- A3.2.1. Knowledge bases
- A3.2.2. Knowledge extraction, cleaning
- A3.2.3. Inference
- A3.2.4. Semantic Web
- A4.5. Formal method for verification, reliability, certification
- A4.5.1. Static analysis
- A4.5.2. Model-checking
- A4.5.3. Program proof
- A4.7. Access control
- A4.8. Privacy-enhancing technologies
- A7. Theory of computation
- A7.2. Logic in Computer Science
- A9.1. Knowledge
- A9.2. Machine learning
- A9.7. AI algorithmics
- A9.8. Reasoning
Other Research Topics and Application Domains
- B6.1. Software industry
- B6.3.1. Web
- B6.3.4. Social Networks
- B6.5. Information systems
- B9.5.1. Computer science
- B9.5.6. Data science
- B9.10. Privacy
1 Team members, visitors, external collaborators
Research Scientists
- Antoine Amarilli [INRIA, Advanced Research Position, from Oct 2025, HDR]
- Mikael Monet [INRIA, Researcher, from Oct 2025]
Faculty Members
- Sylvain Salvati [Team leader, UNIV LILLE, Professor, from Oct 2025, HDR]
- Iovka Boneva [UNIV LILLE, Associate Professor, from Oct 2025]
- Aurélien Lemay [UNIV LILLE, Associate Professor, from Oct 2025, HDR]
- Charles Paperman [UNIV LILLE, Associate Professor, from Oct 2025, HDR]
- Sophie Tison [UNIV LILLE, Emeritus, from Oct 2025, HDR]
PhD Students
- Bastien Degardins [UNIV LILLE, from Oct 2025]
- Oliver Irwin [UNIV LILLE, from Oct 2025]
- Sebastien Labbe [UNIV LILLE, from Oct 2025]
- Loup Lobet [INRIA, from Oct 2025]
- Arthur Lombardo [ENS PARIS, from Oct 2025]
Administrative Assistant
- Aurore Dalle [INRIA]
External Collaborator
- Sara Riva [UNIV LILLE, from Oct 2025]
2 Overall objectives
A major and early success story of computer science, combining theoretical foundations and efficient implementations, is the development of relational databases. Relational databases have now been deployed in essentially all sectors of activity, from large multi-server clusters to phones and embedded devices. An important factor of this success is the abstraction lying at the core of relational databases: they combine a simple data model with a declarative language (SQL) to write user queries. These queries are then transparently translated to optimized execution plans taking advantage of the schema (the data model) and statistics of the data. This abstracts away all internal details that would be unpleasant to the user: how data is concretely stored, the shape of the actual execution plan, low level optimizations, the orchestration of data streams that compute intermediate query results, concurrent accesses, and so on.
Building on this foundation, the field of data management has evolved in two antagonistic directions over the past decade. One first direction is that the scope of data usage in applications has significantly expanded. Data is used to train statistical models; they are the basis of key decisions that have a strong impact on people and on society, and must therefore be explainable. The data that we store may be updated often, which makes it necessary to efficiently recompute the query answers. In particular, the data may arrive in streams, and we may have to compute query answers on the fly, even before we have read the entire data stream. The number of query answers to produce may be so large that we need to produce it in streaming, or to access it implicitly, instead of querying it in full. The data is heterogeneous: it does not necessarily follow a fixed schema like the relational model traditionally assumes, and it may feature non-relational data such as textual documents, tree-shaped data (such as XML and JSON records), and so on. In particular, data may be stored as a graph, without a fixed schema; it may then be interrogated with queries featuring recursion, e.g., asking for pairs of reachable nodes. There may be uncertainty on the data stored by such systems, making it difficult to compute accurate answers. Last, some queries need to be evaluated across datasets from different sources, e.g., large open datasets from multiple public organizations. In such settings, data is typically not stored within one single system, but integrated across different systems. Further, beyond traditional query evaluation, the data can be transformed, combined, and post-processed. This can be done with procedural languages within SQL databases (stored functions and procedures), or with domain specific languages such as LINQ that are used to querying data within the programming language C#. Thus, many data processing frameworks are actually libraries that can be seen as query languages embedded within a programming language; conversely, some query languages such as XPath 3.0 for XML documents or jq for JSON documents now embed limited programming languages so as to make them more expressive. These new developments are blurring the line between querying and programming, and are giving up on the idea that data manipulation needs should be expressed declaratively.
Faced with the growth of these new applications, the scope of the SQL standard has significantly expanded to accommodate some (but not all) of the corresponding needs. Recent versions of SQL support recursive queries, incremental view maintenance and triggers, stored procedures, queries over strings, embedded XML and JSON documents and query over them, along with naive forms of uncertainty and incompleteness (through NULLs). Most recently, the graph query language GQL has been published as part of SQL. These features are standardized and supported by relational engines, though the coverage and efficiency of these features may vary across implementations. Further, for some of these tasks, such as data integration and post-processing, an entire ecosystem of solutions has developed on top of relational databases.
The second direction is that the data has increased in volume. Some organizations now want to store much larger volumes of data, e.g., by extracting it from the Web, collecting scientific data, or running services for an unprecedented number of users. Traditional relational databases, with their support of the full SQL language, may not be suited to such large data volumes. For this reason, in the 2000s, new “NoSQL” database paradigms started to become popular. These systems compromise on some aspects of SQL, such as its expressiveness (i.e., supporting less expressive queries), or its formal guarantees (e.g., distributed consistency), to achieve better performance. Highly-optimized systems now exist for specific tasks, such as key-value stored, document-oriented databases, etc.
To summarize, new data management applications now ask for two conflicting aspects: the support for more expressive queries (going beyond the more arcane features of SQL); while offering more efficient querying over very large quantities of data. These features should be offered while retaining the guiding principle of data management systems, namely, going from a high-level declarative language that describes user needs, down to low-level execution plans that implement them over data representations. Parts of this translation can be data-independent, using, e.g., static analysis; or they can be data-dependent, using, e.g., logical constraints or statistics of the database.
The main goal of D-DAL (D-DAL is pronounced as Dédale, the French pronunciation of Daedalus) is to find paths to extract meaning from the maze of data. In other words, D-DAL's scientific objective consists in overcoming the limits of expressivity and efficiency of programs dealing with data. This global objective gives rise to two main challenges. A challenge with a large perspective of extracting value from data, and a practical and pragmatic challenge of processing large data efficiently.
Challenge 1: Extracting value from data
The first challenge that we will address is that of supporting more expressive ways to interrogate data, so as to make sense of complex data. We will address several aspects of this challenge. First, the data can be uncertain1, imprecise, or probabilistic. Second, it can be unstructured, or with an unknown structure that must be discovered or formalized. Third, we may need to extend query evaluation to compute complex information, in particular explanations, but also provenance annotations, complex aggregation values, or complex results, in a principled manner.
Dealing with uncertain data.
Data in real life is rarely certain: it can be subject to human error when manually collected, generated from physical sensors having imperfect precision, or extracted from sources of various quality from the Web via natural language processing (NLP) or machine learning techniques, that are often themselves of a probabilistic nature. The traditional way of dealing with this uncertainty is to simply ignore it, e.g., by removing all rows containing NULL values in a database before evaluating a query. This naive approach can however lead to incorrect or incomplete answers and is thus not always desirable. The challenge is then to design methods and algorithms that are able to take this uncertainty into account in a principled and efficient way, i.e., by reasoning over the possible completions of the data in a logical fashion, or by reasoning quantitatively over the possible worlds of probabilistic data to determine which query results are likely.
Finding structure in data.
Relational databases assume that the data is strictly structured by a fixed data schema: this assumption is central to the query language, and to efficient algorithms to select query execution plans. However, when data comes from various sources, or when it is updated with new kinds of information, then relational databases are inadequate: they require time-consuming engineering tasks to refactor data schema to reflect the new needs. By contrast, the model of graph databases is more flexible and has become increasingly popular for applications. As efficiency comes at a cost, flexibility also has a cost, and data may display irregularities and peculiarities that make it harder to process. We focus on two main research directions to explore on data graphs.
The first research direction is to develop ways to express restrictions on the shape of graph data, via flexible notions of schemas. Schemas make it possible to trade some of the flexibility of the graph model with the efficiency of the relational model: they make it possible to accommodate diverse data representations in a controlled manner, while disallowing specific kinds of irregularities, and while guiding query evaluation algorithms to optimize query evaluation plans with the information given by the schema.
The second approach consists in analyzing data in order to reveal its structure. Through comprehensive data examination and modelling techniques, we aim to automatically discover the underlying patterns and organization within the graph. These methods could lead to the inference of a schema that will be exploited by other algorithms to exploit directly the data, to transform it into a usable form, or to directly help inference of queries or transformation.
Explaining query results.
The increasing usage of computing in every aspect of modern society has naturally raised a need for transparency in algorithms. In some scenarios, having the answer to a query is not enough and one also wants some kind of explanation on how the result was derived, or wants to sort the possible results according to some importance score.
In particular, in contexts where decisions are made on the basis of data that can impact people's lives, like in medicine, or that can impact society, like in policy making, it is important not to just query data, but also to understand why the answers are obtained, or even understand how pieces of data weigh in the answers using certain aggregates like Shapley values. Indeed data is imperfect and query explanation is a way to deal with this imperfection. These concerns are not confined to the database community: the problem arises in all data-driven computations. Most notably, with the rise of AI, explainable AI has emerged as an important scientific domain.
From a technical point of view, query explainability can be understood more generally as building complex query answers that keep track of the operations performed during the evaluation, amounting to a form of user-defined aggregation; it can also be understood in the abstract setting of semiring provenance. All these new tasks can incur an increase in the time to answer queries, and thus call for the development of new notions of explanations and of new efficient algorithms.
Challenge 2: Processing large data efficiently
The standard of SQL is continuously growing. Now, it encompasses a larger number of data formats and ways to query these formats: recursive queries, XML and Xquery, JSON, and more recently property graphs. Most of these extensions of the standard are encoded in the relational model, sometimes in very simple manners. For example, JSON documents are in general2 stored as strings. Instead of attacking the problem of handling massive data by trying to treat all kinds of formats and queries at once, like the SQL standard does, we prefer to concentrate on particular fragments and propose efficient analyses, optimization methods and algorithms for these fragments. When the number of query results is very large, specific methods are needed to construct them. For instance, streaming the results one by one dilutes the computational effort over time. In situations when queries are repeated, another strategy may be to store the results of the query and update it along with the data. We shall, in some cases, develop high throughput prototypes for these specific query fragments. In the end, the composition of these tools should result in highly efficient data processing pipelines.
As data comes in many forms (text, JSON documents, graphs, relational databases, ...), query languages are highly diverse (regular expressions, JSONpath, Xpath, SQL, the numerous NOSQL querying APIs, ...). An important challenge is to understand what is computationally feasible and what is not. Knowing which fragments of the query languages can benefit from efficient evaluation will possibly allow us to develop specific tools to handle them. We are going to focus on particular aspects of expressive queries that go beyond the traditional relational model: recursion, uncertainty handling, and support for embedded programming. For these features, we aim at studying which aspects of queries make evaluation more or less efficient. This will result in the design of fast algorithms for some families of queries and proofs of computational hardness for others. Beyond these results, for practically useful families of queries, we shall develop domain specific languages and prototypes that take advantage of the algorithms that we will have designed.
Beyond this theoretical endeavour, we also wish to explore how efficient query evaluation algorithms can be articulated with modern hardware platforms. Indeed, hardware now offers increasingly parallel, concurrent or distributed primitives. We believe that one of the keys to efficient query evaluation on large data is to align novel algorithmic methods with modern hardware.
Last, in settings where efficient query evaluation remains out of reach, we intend to study approximation algorithms, to obtain approximate answers to queries with reasonable time bounds. Approximation can be understood in a logical sense (i.e., principled overapproximations or underapproximations of query results), in a quantitative sense (i.e., approximate numerical or probabilistic query answers), or in a practical sense (i.e., approximations that are validated by an experimental approach).
Enumeration and incremental algorithms.
Querying data is only one aspect of data processing. Indeed, when looking at entire data processing pipelines, we see that results are often combined with other results coming from other data sources, and can be further processed by programs. This poses new research questions: instead of optimizing query evaluation as a monolithic task, we can study how to globally optimize query evaluation pipelines by studying the interplay between the evaluation of several queries.
In this light, we intend to study two common algorithmic frameworks to deal with large amounts of data, which are well-suited to orchestrate the simultaneous evaluations of many queries in a pipeline. These are enumeration algorithms and incremental algorithms.
The principle of enumeration algorithms is the following: when the number of query results is huge, it is neither feasible nor realistic to produce the entire collection of results in answer to a user query. Instead, it is preferable to enumerate the results in streaming, one after the other, while minimizing the delay between two successive answers. Evaluating queries by enumerating results offers two potential advantages. First, downstream applications in the pipeline can access query results more quickly, as soon as they are computed. Second, outputting results one by one might drastically reduce the amount of memory needed to perform the computation.
The principle of incremental algorithms is that queries that are repeatedly executed can be drastically accelerated by memorizing previous results. The idea of incremental algorithms is to avoid the recomputation of all the results; this is usually done by designing data structures that can efficiently be maintained when the data is updated and that can serve to compute quickly the new query result.
Designing query execution engines.
Once query results from various data sources are enumerated either by an enumeration algorithm or from data structures that store them, one needs to orchestrate and synchronize the processing of these results. In this context results are abstracted as streams that are combined together.
Techniques pertaining to programming languages are well suited to work on streams. Streams of data and their combinations are well studied in the context of synchronous and reactive programming. Generalizations have been also studied by the community of functional programming languages. Bringing these methods to deal with queries and more broadly to data processing can have a strong impact on the efficiency of query execution engines.
The relation with programming languages and thus to compilation naturally leans towards trying to exploit hardware optimally to improve data throughput. In particular, exploiting all the levels of parallelization: SIMD3 instructions, the several cores that are inside CPUs, GPUs, or in a distributed context.
Finally another challenge is to develop methods at the frontier of enumeration and incremental algorithms, i.e., to develop data structures that support the enumeration of query results, and that can also be updated efficiently when the data itself is updated, so as to start again the enumeration of the (possibly new) query results.
3 Research program
Our research organisation addresses the two scientific challenges that motivate the project D-DAL, extracting value from data and processing large data efficiently, with three axes that correspond to different level of abstractions with respect to these challenges.
The first axis, the basement, is about circuits which are a transverse to the two challenges. Circuits form a handy abstraction for representing data and understanding its structure but also a proxy that allows to study the complexity of queries and also to understand how to make the most out of modern CPUs.
The second axis, the first floor, is about querying. It builds upon what is produced in the basement and is intended to develop efficient querying algorithms and eventually querying programs. This axis is concerned with explaining query results, studying fragments of querying languages, dynamics aspects of querying and streaming. The first floor builds upon the basement, circuits play an important role in explaining query results and various aggregation mechanisms as they provide the right structure to represent concisely query results and then execute efficiently computations on these results. Circuits also provide the frontier of the feasible and the efficient about queries and they will allow us to focus on the right query fragments. Finally, circuits are the most basic streaming algorithms we are going to study and implement. The study of streaming methods will allow us to compose efficiently these basic streaming algorithms and enable the creation of efficient querying pipelines.
The third axis, the roof, is about proposing and exploiting information about data. The roof has two slopes, one is about proposing and studying schemas for graphs and the other is about taking into account knowledge about data so as to accelerate querying. The first slope is concerned with the schema language ShEx which has been developed within Links (the team from which D-DAL originates) and D-DAL will pursue this effort as it gives us concrete problems in relation to data for query optimization exploiting knowledge about data. The second slope is precisely about taking into account knowledge about data when answering queries. This information should make algorithms more efficient and also to work with uncertain data more consistently. We will work on the specific information coming from ShEx schemas, but also with more general settings. Overall, the roof axis stands upon the second axis as it aims at refining the work done on querying and applying it to settings where we can take advantage from extra hypotheses in order to lower the complexity.
In a a nutshell, the three axes build a coherent architecture which aims to work out the maze of data. We here give a summary of their contributions to the two challenges:
-
Extracting value from data
-
BasementCircuits are natural and concise representations of data, be it raw data or answers to queries. With tools coming from knowledge compilation, they reveal the inherent structure of data, which can be used independently from the queries that we intend to evaluate over it.
-
First floorQuerying is the basic way to extract information from data, which can then be refined into value. We plan to explore various querying fragments that enrich the state of the art of querying, including expressive queries featuring recursion, aggregation, and uncertainty handling. It will enable the extraction of richer information.
-
RoofThe development of ShEx is a concrete proposal for describing data in ways that allow to extract value from graph data by taking logical constraints into account. Furthermore, taking advantage from extra information for querying, such as deduction rules and integrity constraints, will expand the capabilities of querying.
-
Basement
-
Efficient data processing
-
BasementCircuits can play the role of abstract proxy towards exploiting modern CPUs to process data efficiently. Furthermore, they also play an important role in the study of computational complexity, and in particular low complexity classes, or classes allowing parallel computation. Such classes will allow us to describe the landscape of queries that can be solved very efficiently on large data.
-
First floorQuery fragments, and algorithms to process them efficiently, are the central focus of this axis. In particular, the dynamic aspects of queries will be studied in two aspects. First by proposing enumeration and incremental algorithms for queries. Second by studying streaming and the composition of streams.
-
RoofTaking advantage of information about data and developing ShEx are central to exploiting real world data, which often comes with additional information such as schemas. We will refine querying algorithms to make them more efficient with extra hypotheses, in particular to choose query evaluation plans based on statistics of the data.
-
Basement
Our overall goal is to design highly efficient querying languages in various contexts (data formats, uncertainty, etc.) and the tools that allow to combine them into efficient data processing programs. D-DAL is on the ridge between expressiveness and efficiency. We have structured our project into tree axes that are detailed in the rest of this section. We summarize their objectives below according to the two challenges we attack.
-
Extracting value from data
-
Short termWe are going to extend queries and data representation in several directions. First we will work on enhancing circuit classes by negation, and we will study generalizations of structural constraints that allow for efficient counting algorithms and hence handle quantitative uncertainty. Second we will also incorporate probabilities in ShEx and work on queries on graphs with constraints. We will also study which logical constraints can be used to reason on data, and which counterfactual problems (e.g., resilience, Shapley values) can benefit from our existing expertise on related tasks. These short term objectives build on our previous work and work towards more expressive query fragments.
-
Medium termWe plan to take a more general view on negation within circuits, in particular work more thoroughly in connection with knowledge compilation. One goal is to obtain a better understanding of the probabilistic setting specifically. Another goal is to work on open-world query answering in the presence of uncertainty and of recursion. We also plan to study the lifting of constraints of data graphs to views on graphs. This medium-term objectives aim at having a better qualitative understanding of data.
-
Long termWe plan to build a unifying view of the relation between circuit representations and querying problems. This will provide us with an integrated view of how data can be represented. We also plan to work on relating aggregation and uncertainty problems with formalisms used for constraint satisfaction problems, to explore how such approaches can lead to new structural algorithms for query evaluation. We also hope that such approaches can help connect the multiple communities that propose constraint languages for data, in particular graph-shaped data. Finally, we plan to have ShEx be standardized by the IEEE and have a set of open software tools for it.
-
Short term
-
Efficient data processing
-
Short termWe are going to consolidate our team's first works which relate circuits to efficient implementations based on SIMD instructions. We shall also work on dynamic and incremental algorithms for simple data like texts and trees, again building on our existing research providing first results in these directions. We will also study efficient enumeration of query results that respect a specific order (the order is a good property for further processing), and connect this with efficient dynamic algorithms. All these works give basic building blocks for streaming algorithms, that we plan to orchestrate via work on simple programs called transducers.
-
Medium termWe shall build a circuit-based compilation chain, which goes from query fragments to efficient code, generalizing previous work to transducers. We will also study how the compilation can be doubled with efficient query plan selection at runtime, leveraging the efficient data structures that we will have designed by then. We will also explore how dynamic and incremental algorithms can accommodate expressive features such as recursion and aggregation without compromising the effectiveness of query evaluation, and identify tractability assumptions that can be leveraged to that end. In particular, in connection with enumeration algorithms, we shall work on how to efficiently compose data processes, in particular connecting enumeration algorithms that produce sequence of diffs with incremental algorithms that maintain a query result over evolving data.
-
Long termWe shall propose an end-to-end compilation framework for efficient query compilation and the orchestration of heterogeneous systems, including expressive tasks such as reasoning and optimization. We shall take the direction of probabilistic programming to sacrifice precision in favor of efficiency for the evaluation of queries, guided by a fine-grained theoretical understanding of the possible approximation tradeoffs. We will also undertake this precise complexity study in the dynamic and incremental settings as well as for enumeration problems, again aiming at a full understanding of the landscape of possible complexity regimes.
-
Short term
3.1 The Basement: Circuits for Data Manipulation
Large data poses challenges that are both theoretical and practical. D-DAL intends to use circuits as a common abstraction to address several of these problems. Circuits are an abstract object that can be approached from various angles, corresponding to different research communities. First, circuits are studied from the knoweldge compilation perspective, an AI field which studies classes of Boolean circuits and which has connections to data management 24. Second, circuits are studied as a compilation target, this time to translate programs to abstract representations which ensure that they can be evaluated in a massively parallel or vectorized way. Third, circuits can be studied from a fully theoretical angle, in the field of circuit complexity theory. We now present these axes.
Studying circuits from Knowledge Compilation.
Participants: Antoine Amarilli, Mikaël Monet, Charles Paperman, Sylvain Salvati.
Knowledge compilation 31 is a subfield of AI that studies different ways to represent Boolean functions, in particular restricted classes of Boolean circuits. It studies tradeoffs between conciseness and tractability: which representations can be used for which computational tasks (e.g., counting satisfying assignments), and when can we efficiently translate from one representation to another.
It turns out that knowledge compilation is a useful tool to solve many types of database tasks 24, such as factorized representations of query answers, enumeration algorithms, provenance computation, and uncertainty management (in particular on probabilistic data). We wish to pursue this study of knowledge compilation to databases and query evaluation, and also investigate knowledge compilation in itself to achieve a better understanding of these circuit classes.
Short term.
In the short term, we will explore how specific circuit classes can be used for efficient evaluation of particular queries, across various tasks such as enumeration and ranked access. The PhD thesis of Oliver Irwin is set in this context: it is co-directed by Florent Capelli from CRIL and Sylvain Salvati. It studies how to efficiently find query answers using circuits for queries that may feature negation, and how to reinterpret efficient query evaluation algorithms in a knowledge compilation perspective. We further intend to pursue our study of query evaluation on probabilistic databases via circuit classes, understanding the power and limitations of this method.
Medium term.
One longer-term challenge to be attacked by D-DAL is to understand the power of negation in circuit formalisms that allow tractable counting. Indeed, for the circuit classes commonly used in knowledge compilation 21, the use of negation is highly restricted. However, it is not well-understood whether the expressive power of such circuits is extended when we allow arbitrary negation. This question is purely about knowledge compilation circuit classes, but it also has consequences for questions in database theory about tractable query evaluation on probabilistic databases 27. Another challenge is whether knowledge compilation classes can also capture more tractable enumeration tasks, in particular those involving negations (e.g., enumerating objects which do not have a given property); these would be necessary for a knowledge-compilation-based proof of tractable enumeration for some formalisms such as first-order logic on restricted graphs 43.
Long term.
A far-reaching question in knowledge compilation and database theory concerns the relationship between two kinds of results: algorithms (namely, efficient query evaluation algorithms, and computational complexity hardness results), and tractable circuits (namely, efficient compilation algorithms, and lower bounds on the size of tractable circuits). One long-term research direction of the D-DAL team would be to unify these settings. For instance, we could try to understand which hardness results on query evaluation can imply circuit lower bounds, e.g., in the context of recursive queries 25; and when tractable algorithms can be translated to circuit classes, subsuming the inclusion-exclusion conjecture 27.
Another long-term goal of D-DAL could be to consolidate existing results about knowledge compilation. The current reference on the topic is the seminal work of Darwiche and Marquis 31, but it dates back to 2002. We have started a survey of knowledge compilation in 21, but far more efforts would be needed in order to survey all known results for knowledge compilation circuit classes.
Circuits as a compilation target for parallelism and vectorization.
Participants: Charles Paperman, Sylvain Salvati.
Most of the data exchanged on the web is now represented as JSON files which are text documents (other document formats are also in use such as XML or YAML documents). This is a consequence of the fact that many data engine now operate and communicate with these kinds of data. Extracting information and answering queries from textual documents has become a pervasive task when dealing with large amount of data. However with the growth of data that comes with larger and larger documents, processing textual documents has become an important bottleneck in data processing.
For the moment this bottleneck is addressed first by speeding up the translation of the documents into in memory data structures on which queries are then executed. This translation, also called parsing, is accelerated via low level optimizations. The quickest implementations have required expert programmers and complex costly developments. They leverage specific CPU instructions known as vectorial instructions. The diversity of CPU architectures require the production of highly specialized code for each of them. This work is worth doing since the resulting code is often accelerated by one or two orders of magnitude compared to the non-vectorial variant.
D-DAL aims at developing tools that automatize these optimization and reduce the cost of vectorizing programs that handle textual data. We will not only optimize parsing, but a combination of parsing with query resolution. The PhD thesis of Claire Soyez-Martin 46, 41 (conducted under the supervision of Charles Paperman and Sylvain Salvati) advocates that such automations can be done via a notion of vectorial circuits. Our plan is two fold: first study the compilation of expressive queries into vectorial circuits and then build dedicated compilation schemes for vectorial circuits that takes benefits of vectorial instructions.
Short term.
In continuation of the work engaged with the PhD thesis of Claire Soyez-Martin, we are going to develop further the model of vectorial circuits. The aim is to obtain a cost model for these objects that reflects the efficiency of their compilation to CPU and vectorial instructions. We will also identify application domains that can benefit from our methods. We have already identified several possibilities: biological sequence analysis, network monitoring, pervasive streaming tasks in relation with data (e.g. utf8 validation, JSON processing, etc.)
Medium term.
We plan to develop a compilation chain that will contain two steps. The first step will consist in sets of compilers from query fragments to vectorial circuits. The second will be a set compilers from vectorial circuits to efficient vectorized hardware code. At first we will aim at X86_64 architectures. This compilation pipeline will be the result of ground work conducted in algebraic automata theory.
Long term.
We plan to extend our work to other hardware components. One of them is the RiscV platform which has recently been endowed with an interesting set of vectorial instructions. Another one is the processing in memory units (PIM) that provide a highly parallel architecture that could benefit from the relationship between automata theory and circuits.
Foundations of circuits: circuit complexity, algebra and abstract machines.
Participants: Antoine Amarilli, Mikaël Monet, Charles Parperman, Sylvain Salvati.
Our work on circuits takes its root in deep connections between logic, formal language and algebra. These deep connections are at the heart of finite state verification techniques that have been flourishing in the last decades. We exploit these connections in a different way. For us logic is connected to queries as logical languages naturally express properties on object which in turn coincides with the use of declarative languages to extract information from data. Moreover, logical formulae on data of bounded size can be represented as circuits which extract information from data. Languages or language transformations are a natural counter part of logical formulae. In general, they can be represented as finite state machines. Finally, algebra (monoids, semigroups, forest algebras) naturally expresses the invariants of these machines. Classes of algebra, computation power of finite state machines, complexity of circuits are then deeply connected. The classification of algebras with respect to their properties often coincides with natural classes of machines and of circuits. Algebras form a useful mediation between machines and circuits, that is, between sequential devices and parallel ones. More generally, algebras formalize the structure of the information that needs to be captured when processing queries. It thus plays a central role in many of the applications D-DAL aims at developing. We plan to develop our understanding of these deep connections in relation with querying with applications in many objectives of D-DAL, most notably: vectorization, incremental maintenance, enumeration, transducers, etc.
Short term.
In connection to vectorization, we shall explore particular classes of circuits and describe the classes of languages they correspond to. We shall also develop algebraic tools for forest algebras to enhance the stream processing on tree structured documents. In particular, we wish to extend the expressiveness of the queries that our software rsonpath can solve. Concerning incremental maintenance and enumeration, we shall workout some algebraic invariants on trees that will allow us to extend the methods that we have developed for strings.
Medium term.
The algebraic work on data (strings, trees, graphs, ...) can be extended to more dynamic objects like transducers and even to programs via denotational semantics. In general, queries not only check properties of data, but also transform data. Extending algebraic methods to such dynamic setting will allow us to abstract data processing pipelines and open the way to further optimizations.
Long term.
In the long term, we aim at developing a coherent corpus of algebraic methods for trees and graphs that lend themselves to classify queries on structured documents and graphs. We wish also to attack difficult classification problems that remain open on circuit complexity for classes of regular languages. This ground work shall set the stage for future applications similar to rsonpath.
3.2 First Floor: Logic and Query Evaluation
We now present scientific objectives which operate at a higher level, and deal with query evaluation itself, rather than with the tools presented in the previous section which are ways to implement query evaluation.
Provenance, explanation, aggregation, counting, uncertainty, probabilistic data, Shapley, approximation algorithms.
Participants: Antoine Amarilli, Mikaël Monet.
As already mentioned in Section 2, querying data often goes beyond merely computing a set of answers and returning them. First, data is often uncertain by nature (e.g., probabilistic or missing values). Then, algorithms must be developed that take into account this uncertainty, by considering all possible states of the data; which turns out to often be intractable. In this context, approximate query answers (e.g., approximating the probability that a probabilistic database satisfies a query) can be employed to lower the complexity. Second, one sometimes wishes to explain the query results, using for instance the notion of provenance or using specific scoring mechanisms such as the Shapley value or the notion of resilience of a query result.
Short term.
It is known that when probabilistic data has bounded treewidth, then probabilistic query evaluation (PQE) can be solved efficiently for many queries. We shall explore restrictions that are less limiting than bounded treewidth but could still ensure tractable PQE. One idea would be to design width measures that could be parameterized by the queries. Ideally, we would also obtain lower bounds for specific width measures and queries as it was done in 28.
Medium term.
Another research direction would be to explore the tractability of diverse counterfactual scoring mechanisms, such as Shapley values as done in 32, 39, or the resilience 35 (intuitively representing how “robust” a query answer is, by determing how many input tuples should be removed from the database so that the specific answer no longer exists) for queries featuring recursion.
Long term.
An intriguing and potentially fruitful research direction would be to connect the kind of counting problems studied in database theory (such as PQE or Shapley values) to other well-established counting problems from other fields such as counting versions of constraint satisfaction problems (CSPs) 33, 38 or Holant problems and holographic reduction 30. We will also strive to provide a toolbox to use these algorithms in practice, leveraging for instance the existing implementation of ProvSQL 45, an open-source PostgreSQL extension that computes (between other things) the Boolean provenance of a query over a database.
Dynamic data: incremental and dynamic algorithms and connections with dynamic algorithms.
Participants: Antoine Amarilli, Mikaël Monet, Charles Paperman, Sylvain Salvati.
In many query evaluation scenarios, the data to interrogate is regularly updated. When re-evaluating queries multiple times over different versions of the same data, it is more efficient to build index data structures which can maintain the results of the query under updates, rather than re-evaluating the query every time from scratch. This problem of query evaluation over dynamic data has been studied in many different contexts. The D-DAL team will focus on two broad kinds of contexts: tree-shaped data on the one hand, which includes texts, trees, genomic sequences, etc.; and unrestricted data, namely graph databases and general relational databases.
Short term.
In the short term, we intend to focus on the first application setting of texts and trees, and pursue our ongoing work in these settings. While the main results in the area have already been proven, some work remains in expanding the frontiers of what can be done efficiently over dynamic text and trees: can we maintain data structures for in-order enumeration? in which orders? with which complexity? One important direction which remains open is that of showing improved bounds on query evaluation which depends on the specific query (rather than for a general query class). From our initial results for regular queries on words 26, one especially promising ongoing effort is to extend these results to trees. Another is to extend such results to enumeration data structures. In both cases, significant algebraic groundwork is required.
Medium term.
We then intend to focus on broader directions. One of them is to support more expressive updates: the bulk of the literature on dynamic data focuses on single-character or single-tuple changes, but real updates are often more expressive. On text, we want to be able to support copy-and-paste updates, which move an entire part of the text to a different point; or search-and-replace updates. On relational databases, we want to develop incremental algorithms that cope with updates which are themselves expressed in SQL, i.e., with UPDATE statements. Another direction is to move to the study of incremental maintenance for more expressive queries, in particular those featuring recursion or aggregation. For instance, can we characterize how efficiently we can maintain the results of Datalog queries, with fine bounds that depend on the specific Datalog query which is posed? Which assumptions on the underlying data can be used to achieve tractable algorithms?
Long term.
A more open direction for query evaluation over dynamic data would be to achieve precise bounds on the complexity of maintaining each individual query by charting the space of possible tradeoffs across various performance measures: what is the time needed to handle updates to the data; what is the delay between successive results? How much memory usage is required to store the data structure? Which updates are supported? In this light, one especially challenging obstacle is the need for suitable lower bound techniques, in particular unconditional lower bounds, which are sometimes achievable depending on the model studied. Another general question is to connect more broadly to the literature studying data structures, in which questions about mutable data have been investigated for decades. We could for instance leverage results on maintaining search trees balanced under updates, or algorithms on dynamic graphs 37 which have been studied in the algorithms community (especially algorithms to dynamically maintain the connectivity of directed graphs).
Efficient evaluation: streaming and parallelism.
Participants: Aurélien Lemay, Charles Paperman, Sylvain Salvati.
Efficiency in data processing mostly comes from streaming data. This means that data is passed from small processing units to others so as to produce an output. Certain operations cause serious synchronization problems (sorting data, eliminating duplicates, combining streams, etc.) they block the processing and cause sometimes important lags. Sticking to functional programming, D-DAL will develop streaming data processing programs by using static analysis, speculative computations (to prepare results when lags are unavoidable), develop stream fusion compilation methods so as to obtain efficient streaming algorithms. Techniques coming from synchronous programming will be considered, in particular in connection with the leveraging of fine-grained parallelism for low level streaming. Also, in the medium term, we shall explore how to parallelize query executions taking advantage of static analyses of the functional code.
Short term.
Part of the effort will concern very simple classes of functional programs known as transducers. D-DAL will implement libraries for manipulating transducers, minimizing them and composing them. Several extensions of the models of the literature are being proposed by the team and will benefit from these implementations. Of particular interest to us are higher-order transducers which form an elegant generalization of many transducers from the literature. We shall work on compiling them for streaming documents.
Medium term.
Orchestrating programs that stream data is challenging as they may work at different paces and as communications between them do not have a negligible cost. We will work on this orchestration problem in connection with programming languages theory, most notably with synchronous and functional programming languages. Indeed, when we have compiled programs with our method, we shall be able to efficiently compose them. Many technical methods for achieving this have been developed for these programming languages.
Long term.
In the long run, we will take stock on the work of the community of functional probabilistic programming. Indeed, when working with huge data, approximate querying becomes hardly avoidable and thus extending our methods to sampling data with various probabilistic distributions becomes necessary. In particular, we shall see how we implement probabilistic querying by means of transducers.
Enumerating the results of recursive queries.
Participants: Antoine Amarilli, Mikaël Monet, Charles Paperman, Sylvain Salvati.
One important challenge in modern database theory is efficient enumeration algorithms 44, i.e., efficiently producing query results one after the other. This is related to query evaluation in streaming, but with a key difference: here we are looking for algorithms which produce a stream of results, instead of consuming a stream of input data. Database theory has studied which queries can efficiently be evaluated, but an especially challenging topic is to understand under which assumptions it can be done for recursive queries, e.g., Datalog queries. This task can in particular be studied over restricted formalisms of input data, such as textual data. This is the focus of document spanners 34, which are a declarative way to specify how to extract matches of queries over textual documents. There, recursivity is needed to express long-distance dependencies (e.g., extract all phone numbers that are between two specific delimiters, no matter how far).
Short term.
One immediate research objective is to study which enumeration tasks for (recursive) queries can be efficiently solved, in particular when considering variations on the standard setting of enumeration algorithms. For instance, in many applications, we want to produce results in a specific order; to directly access individual results by their offset; or to efficiently test whether a candidate result is actually a solution, or finding out its position in the enumeration order. We intend to pursue our ongoing work in such directions 22.
Medium term.
A medium-term research question is the combination of enumeration algorithms with the previous axis of handling dynamic data. Previous work by members of the team has already studied how enumeration algorithms could be applied to dynamic data via knowledge compilation methods 23. However, we are still missing an understanding of how efficient such algorithms can be made, especially if we want query-dependent algorithms. This would probably require an algebraic theory of queries which can return results (to cover document spanners), in particular understanding their connection to transducers.
A related medium-term research question is the idea of enumerating query results via updates, i.e., producing each query results not from scratch but by editing a previous result. Team members have started investigating this question in a specific context 29, but this enumeration framework could be fruitful in many other settings. It also has an intriguing connection to the question of query evaluation over dynamic data, because the two problems are related: query evaluation on dynamic data takes a stream of changes as input and maintain query results, whereas enumerating query results via updates will produce a stream of changes to go over all results to a query.
Long term.
Most research about enumeration algorithms for recursive queries using circuit methods focus on the case of textual data (e.g., document spanners) and tree-shaped data. This leaves open the setting of more general data, in particular graph data. Of course, the task has already been studied, in particular in the comparatively simple setting of regular path queries 40. But we are still missing an understanding of those enumeration tasks that can be solved efficiently in more expressive query languages (e.g., Datalog queries) and depending on the various possible semantics (e.g., returning simple paths, trails, shortest paths, or arbitrary walks). As many of these tasks will be intractable, efficient algorithms will probably also need to identify conditions on the input data that are sufficient to ensure tractability.
Another wide open direction for long-term research in enumeration algorithms concerns lower bounds. Indeed, many lower bounds are conditional, and they usually do not apply to enumeration but to related problems (e.g., checking whether a solution exists, or materializing all solutions). An especially difficult challenge would be to develop lower bound methods that show the intractability of efficient enumeration per se.
3.3 The roof: Knowledge on Data
Optimization for schemas for graphs.
Participants: Antoine Amarilli, Iovka Boneva, Charles Paperman.
The role of schemas for graph data is a bit different than schemas for relational data. The latter are prescriptive (i.e. data is organized exactly like this) and are an essential ingredient for static optimization of queries. The former are de facto often descriptive (i.e. data should look more or less like this). Such looseness of schemas is unavoidable, and even desirable, for big data stores or knowledge bases which are constantly evolving, fed with data from numerous sources that can be incomplete, inconsistent and contain errors.
D-DAL aims at providing tools for increasing the usefulness of such approximate schemas in the context of dynamically evolving data. This line of research continues our work of the formalization, the standardization the development and the tooling of the shape expressions language (ShEx) for RDF graphs, and aims at extending on some of our previous results and approaches to property graphs and PG-Schemas.
Short term.
A first short term objective is to enrich graph schema languages such as ShEx with probabilistic information indicating the degree of compliance to the actual data. We will aim at a practical extension with limited expressiveness in order to keep feasible the associated computational tasks (data validation, schema extraction) while still helping experts in assessing the quality of a data set, detecting errors and monitoring the evolution of data. This approach will be built on our previous work in discovering schemas from existing data, while extending it to property graph schemas.
Medium term.
As all software, data-centric applications constantly evolve, either to match new use cases, or to exploit new data sources, or both. This most often requires to transform the data model in order to match the new processing needs. As a mid-term objective we plan to investigate algorithms for incremental and conjoint data and schema evolution, based on logic tools such as mappings and the chase.
Although there exist a number of native graph data stores, lots of data is effectively stored in relational or other data stores (e.g., Wikibase for wikidata) and is only exposed as a graph to the user. There might also be the case that data from streams is processed and integrated to a knowledge graph. The original data can contain schema information or satisfy some constraints. As a mid-term objective, we plan to investigate lifting such constraints from the underlying data to the graph view. We will focus on useful and feasible sub-classes.
Long term.
Finally, we will continue our participation in Shape Expressions Working group which defines the ShEx language, currently in the process of standardization at IEEE. We will also continue to participate in the implementation of a validator for ShEx and related tools as open software.
Static analysis with constraints (open-world query answering, query rewriting under constraints, consistency, certain answers).
Participants: Antoine Amarilli, Iovka Boneva, Aurélien Lemay, Sylvain Salvati, Sophie Tison.
Contrary to the closed world assumption of traditional database research, the absence of a piece of information often does not mean that this piece of information is false, but rather that we do not know its status; this is called the open world assumption. In this scenario, the data is typically enriched by constraints that it should satisfy, such as key, cardinality or equality constraints. Several questions arise when querying data under the open world assumption. Is the data consistent with the constraints we know about it? Taking into account the constraints and the fact that the information is incomplete, what degree of certainty can we have about the answer set of queries? Is it possible to take advantage of the constraints so as to execute simpler queries (e.g., using rewriting techniques)? Moreover, it may not be possible to know which constraints apply to the data at stake, in which case it can then be desirable to infer knowledge using machine learning methods. D-DAL will apply the querying methods it develops to that setting so as to produce added values from knowledge bases.
Short term.
We will pursue work we have engaged with the PhD of Lily Gallois 36 on regular path queries (RPQs) and word constraints on graph databases. Word constraints are the simplest kind of constraints that can be put on a graph database; they seem to naturally connect with rewrite systems. As we showed in 42, this relation is however more complicated that expected. We are going to study further the classes of constraints that enable reasoning about RPQs and their decidability. We will also explore the problems of certain query answering and query rewriting in this setting. The aim is to propose efficient query resolution algorithms from which we can build more complex queries, in particular queries with aggregation.
We are also going to work on schema and constraints inference for graph databases. We have skills in symbolic learning methods which have proved useful in extracting information from the web. Facing large graphs, these methods will have difficulties to scale, moreover they are often not robust to noise. Machine learning methods, typically neural networks, are more immune to this problem. We shall develop a hybrid approach: the method that we envision consists in learning structures with neural networks and then extracting the information that is useful in the context of data processing by means of symbolic techniques.
Medium term.
The D-DAL team intends to study several medium-term problems related to open-world query answering and reasoning under rules. One especially interesting question is how the notion of circuits can be leveraged to achieve tractable algorithms for these tasks. The hope is that circuits can be a versatile tool unlocking many results that would otherwise need to be shown independently: supporting uncertain facts (and uncertain rules); providing provenance information to capture how a specific fact is derived and explain to the user why a query result is true (given the data and the rules); giving factorized representations of a possibly large number of query answers. At a technical level, a broad question is how forward-chaining deduction techniques such as the chase can be extended to create a circuit representation of the derivations. This is challenging because the chase, unlike circuits, may represent some cyclic dependencies, which may cause convergence issues depending on the semiring in which the computation is performed. Another question is how this can be combined with backward-chaining methods such as query rewriting.
Long term.
In the long run, one unifying challenge of the ontology-mediated query answering community is the question of unifying different logical frameworks. Indeed, while query languages are relatively standard, rules can nowadays be specified in many different incompatible ways: description logics, existential rules, graph patterns, database constraint languages such as TGDs, etc. It is challenging to understand the connections between these settings and difficult to conceive systems that could orchestrate query evaluation across back-ends using different query formalisms. Another more algorithmic challenge concerns the complexity of ontology-mediated queries: while there are many results showing complexity upper and lower bounds for general classes (e.g., guarded TGDs), it is much more challenging to try to identify precise complexities of individual queries, in the spirit of what is done for less expressive queries such as conjunctive queries. Thus, the D-DAL team aims at making progress towards this very ambitious theoretical goal.
4 Application domains
The project of the D-DAL team aims at efficiently extracting information from large datasets. We aim to process data in all the diversity of its representations, and to take into account its quality, including questions of uncertainty and probability. Our approach is to develop theoretical results that can later be implemented into concrete programs. As such, our project relates to several Inria strategic axes:
-
Data Science and Artificial Intelligence
The work of our team will study how to efficiently produce data for other downstream applications that analyze it or aggregate it from various sources. This is key in analysing data where large chunks of data are processed to produce statistics and other kinds of aggregates. Apart from efficient data processing, the longer terms objectives of the project aim to incorporate randomized programming to support wider possible scenarios in data science. Similarly, machine learning algorithms are fed with data. These algorithms require efficient data manipulation and also a control over data quality. Our work on uncertainty and provenance is useful in this setting. Finally, the older sense of AI (symbolic AI) is connected to D-DAL through the study of circuits and knowledge compilation. We hope that these connections will reveal fruitful and that our contributions will also have impact in this field.
-
Software and Hardware
We intend to develop query execution engines, querying platform and schemas validators for graphs. Moreover, D-DAL will also have interests in query languages, their extensions, and their execution. These topics make us have close interest in compilation techniques and programming language methods. We adopt an integrating view of query processing that embeds computational capabilities. This leads to see data processing and its interactions with other software differently, delegating more and more computation to query engines.
D-DAL will develop compilation methods for various types of query, taking advantage of parallel hardware capabilities. In particular, we intend to develop tools to analyze queries so as to detect opportunities for parallel execution. Models of abstract circuits will allow us to harness the power of modern hardware, in particular making better use of SIMD instructions. We also plan to work on RISC V architectures and processing in memory units (PIM). Databases have always had a tradition of exploiting hardware capabilities. However, with open source architectures such as RISC V, the particular needs of databases may be addressed by particular instructions. We hope that our work can have influence on RISC V architectures and that specific hardware design for data processing can emerge from our research.
-
Databases
Our view of databases is different from the integrating view that SQL standardization gives. While SQL continues to integrate in a top-down manner more and more querying capabilities under the same software, the approach of D-DAL is to combine with optimized streaming techniques, in a bottom-up fashion, well-defined and limited query languages for which we have identified highly efficient algorithms. Though more difficult to approach than the SQL view, our view may be more adapted for data processing in contexts where data is scattered within files in different formats or is too large for database management systems.
5 Social and environmental responsibility
5.1 Footprint of research activities
-
Antoine Amarilli
is environmental chair in the steering committee of the informal conference “Highlights of Logic, Games, and Automata” (since 2024).
-
Iovka Boneva
is the leader of the Sustainable Development Commission of the CRIStAL research lab.
-
Antoine Amarilli
is a co-animator of the TCS4F pledge on sustainable research in theoretical computer science (since 2020).
-
Sophie Tison
is a member of the general assembly of the European Association for Theoretical Computer Science (EATCS) (elected in 2019).
5.2 Parity and equal opportunities in science
-
Sophie Tison
Member of the steering committee of the Programme de mentorat Femmes et Sciences de Lille.
-
Sylvain Salvati
Member of commission parité et égalité des chances at Inria.
5.3 Impact of research results
Databases and methods from Artificial Intelligence are used in virtually all aspects of the modern digitalized world, from companies' web services to governments' institutions.
6 Highlights of the year
6.1 Creation of the D-DAL team-project
The D-DAL team-project has been created in October 2025 It is the result of the collective work of the team members. We are greatful to INRIA for its trust and support. In particular, INRIA allowed us to recruit Antoine Amarilli this year has been very important to us both in term of team life and of scientific achievements. This further allowed us to recruit numerous students to work with us:
-
Trainees
Thibault Tisserant, Paul Raphaël, Rémi de Pretto, Vincent Peth, Harshul Sagar.
-
PhD students
Sébastien Labbe, Loup Lobet, Arthur Lombardo, Étienne Parent
7 Latest software developments, platforms, open data
7.1 Application of database techniques to genomic
Charles Paperman has developed an application axis in genomic in collaboration with the CRIStAL team Bonsai. Two PhDs have started to apply database methods to genomic. Very promising results have already been obtained with software that work better than the state of the art. The software Vizitig that allows users to query large genomic data is developing at a fast pace.
7.2 Latest software developments
7.2.1 helicase
-
Keywords:
SIMD, Parsing, Bioinformatics, Rust
-
Functional Description:
It is a rust crate (library) that allows to parse and extract relevant information classical format in bio-informatics. It servers as a demonstration of a methodology that take benefits of SIMD instructions through analysis from automata theory in the scope of the Shannon Meet Cray ANR project
- URL:
-
Contact:
Charles Paperman
7.2.2 NetworkDisk
-
Name:
NetworkDisk
-
Keywords:
Large graphs, Python, Databases
-
Functional Description:
NetworkDisk provides a way to manipulate graphs on disk. The goal is to be as much as possible compatible with (Di)Graph objects of the NetworkX Python package but lifting memory requirement and providing persistence of the Graph.
- URL:
-
Contact:
Charles Paperman
7.2.3 rsonpath
-
Keywords:
JSon, Streaming, SIMD, Rust
-
Functional Description:
The rsonpath crate provides a JSONPath parser and a query execution engine, which utilizes SIMD instructions to provide massive throughput improvements over conventional engines.
- URL:
-
Contact:
Charles Paperman
-
Partners:
Warsaw University, Technical University of Munich (TUM)
7.2.4 ShapeDesigner
-
Name:
ShapeDesigner
-
Keywords:
Validation, Data Exploration, Verification
-
Functional Description:
ShapeDesigner allows construct a ShEx or SHACL schema for an existing dataset. It combines algorithms to analyse the data and automatically extract shape constraints, and to edit and validate shape schemas.
- URL:
-
Contact:
Jeremie Dusart
-
Partner:
CRIStAL
7.2.5 ShEx validator
-
Name:
Validation of Shape Expression schemas
-
Keywords:
Data management, RDF
-
Functional Description:
Shape Expression schemas is a formalism for defining constraints on RDF graphs. This software allows to check whether a graph satisfies a Shape Expressions schema.
-
Release Contributions:
ShExJava now uses the Commons RDF API and so support RDF4J, Jena, JSON-LD-Java, OWL API and Apache Clerezza. It can parse ShEx schema in the ShEcC, ShEJ, ShExR formats and can serialize a schema in ShExJ.
To validate data against a ShExSchema using ShExJava, you have two different algorithms:
- the refine algorithm: compute once and for all the typing for the whole graph
- the recursive algorithm: compute only the typing required to answer a validate(node,ShapeLabel) call and forget the results.
- URL:
-
Contact:
Iovka Boneva
-
Partner:
CRIStAL
7.2.6 vizitig
-
Keywords:
Genomics, Transcriptomics, Databases, De Bruijn graphs, Data visualization
-
Functional Description:
Vizitig is a genomics and transcriptomics exploration software built around a de Bruijn graph that directly encodes k-mers from raw sequencing data, whether assembled or not. It enables querying very large sequencing datasets without prior alignment by combining sequence structure, biological annotations, and sample metadata within a single queryable framework. Users can search for regions of interest based on genes, loci, biomarkers, or sample traits (presence, abundance, etc.), and interactively explore only the relevant graph neighborhoods in a scalable way. Vizitig provides a human-readable query language, a full-featured web interface, as well as command-line and Python interfaces, facilitating integration with standard analysis tools and data sharing.
-
Contact:
Charles Paperman
8 New results
8.1 Circuits for Data Manipulation
Participants: Antoine Amarilli, Oliver Irwin, Mikael Monet, Sylvain Salvati.
8.1.1 Studying circuits from Knowledge Compilation
In 16, Florent Capelli, Oliver Irwin and Sylvain Salvati present an elementary branch and bound algorithm with a simple analysis of why it achieves worstcase optimality for join queries on classes of databases defined respectively by cardinality or acyclic degree constraints. They then show that if one is given a reasonable way for recursively estimating upper bounds on the number of answers of the join queries, the algorithm can be turned into an algorithm for uniformly sampling answers with expected running time where is the upper bound, is the actual number of answers and ignores polylogarithmic factors. Their approach recovers recent results on worstcase optimal join algorithm and sampling in a modular, clean and elementary way.
8.1.2 Foundations of circuits: circuit complexity, algebra and abstract machines
In 14, with Florin Manea, Tina Ringleb and Markus Schmid, Antoine Amarilli has studied linear time subsequence and supersequence regex matching. It is well-known that checking whether a given string matches a given regular expression can be done in quadratic time and that this cannot be improved to a truly subquadratic running time of assuming the strong exponential time hypothesis (SETH). They study a different matching paradigm where they ask instead whether has a subsequence that matches , and show that regex matching in this sense can be solved in linear time . Further, the same holds if asked for a supersequence. They show that the quantitative variants where one wants to compute a longest or shortest subsequence or supersequence of that matches can be solved in , i. e., asymptotically no worse than classical regex matching; and they show that is conditionally not possible for these problems. They also investigate these questions with respect to other natural string relations like the infix, prefix, left-extension or extension relation instead of the subsequence and supersequence relation. They further study the complexity of the universal problem where one asks if all subsequences (or supersequences, infixes, prefixes, left-extensions or extensions) of an input string satisfy a given regular expression.
In 19, Antoine Amarilli, Mikaël Monet and Rémi De Pretto study the class of local languages which is a well-known subclass of the regular languages that admits many equivalent characterizations. In this short note they establish the PSPACE-completeness of the problem of determining, given as input a nondeterministic finite automaton (NFA) , whether the language recognized by is local or not. This contrasts with the case of deterministic finite automata (DFA), for which the problem is known to be in PTIME.
In 18, Antoine Amarilli, Mikaël Monet and Rémi De Pretto study two rewrite rules on hypergraphs, called edge-domination and node-domination, and show that they are confluent. These rules are rather natural and commonly used before computing the minimum hitting sets of a hypergraph. Intuitively, edge-domination allows one to remove hyperedges that are supersets of another hyperedge, and node-domination allows one to remove nodes whose incident hyperedges are a subset of that of another node. They show that these rules are confluent up to isomorphism, i.e., if one applies any sequences of edge-domination and node-domination rules, then the resulting hypergraphs can be made isomorphic via more rule applications. This in particular implies the existence of a unique minimal hypergraph, up to isomorphism.
In 17, Antoine Amarilli with Benoît Groz study the cutwidth measure on graphs and ways to bound the cutwidth of a graph by partitioning its vertices. They consider bounds expressed as a function of two quantities: on the one hand, the maximal cutwidth of the subgraphs induced by the classes of the partition, and on the other hand, the cutwidth of the quotient multigraph obtained by merging each class to a single vertex. They consider in particular the decomposition of directed graphs into strongly connected components (SCCs): in this case, is the maximal cutwidth of an SCC, and is the cutwidth of the directed acyclic condensation multigraph. They show that the cutwidth of a graph is always in , specifically it can be upper bounded by . They also show a lower bound justifying that the constant cannot be improved in general.
8.2 Logic and Query Evaluation
Participants: Antoine Amarilli, Mikael Monet, Charles Paperman.
8.2.1 Provenance, explanation, aggregation, counting, uncertainty, probabilistic data, Shapley, approximation algorithms.
In 9, Antoine Amarilli et al. study approximation algorithms for queries in probabilistic graphs. Query evaluation over probabilistic databases is notoriously intractable – not only in combined complexity, but often in data complexity as well. This motivates the study of approximation algorithms, and particularly of combined FPRASes, with runtime polynomial in both the query and instance size. In this paper, they focus on tuple-independent probabilistic databases over binary signatures, i.e., probabilistic graphs, and study when they can devise combined FPRASes for probabilistic query evaluation. They settle the complexity of this problem for a variety of query and instance classes, by proving both approximability results and (conditional) inapproximability results doubled with (unconditional) DNNF provenance circuit size lower bounds. This allows them to deduce many corollaries of possible independent interest. For example, they show how the results of Arenas et al. on counting fixed-length strings accepted by an NFA imply the existence of an FPRAS for the two-terminal network reliability problem on directed acyclic graphs: this was an open problem until now. They also show that one cannot extend a recent result of van Bremen and Meel that gives a combined FPRAS for self-join-free conjunctive queries of bounded hypertree width on probabilistic databases: neither the bounded-hypertree-width condition nor the self-join-freeness hypothesis can be relaxed. They last show how their methods can give insights on the evaluation and approximability of regular path queries (RPQs) on probabilistic graphs in the data complexity perspective, showing in particular that some of them are (conditionally) inapproximable.
In 10, with Wolfgang Gatterbauer and Neha Makhija, Antoine Amarilli and Mikaël Monet have investigated the complexity classification of resilience for regular path queries. The resilience problem for a query and an input set or bag database is to compute the minimum number of facts to remove from the database to make the query false. In this paper, they study how to compute the resilience of Regular Path Queries (RPQs) over graph databases. Their goal is to characterize the regular languages for which it is tractable to compute the resilience of the existentially-quantified RPQ built from . They show that computing the resilience in this sense is tractable (even in combined complexity) for all RPQs defined from so-called local languages. By contrast, they show hardness in data complexity for RPQs defined from the following language classes (after reducing the languages to eliminate redundant words): all finite languages featuring a word containing a repeated letter, and all languages featuring a specific kind of counterexample to being local (which they call four-legged languages). The latter includes in particular all languages that are not star-free. Their results also imply hardness for all non-local languages with a so-called neutral letter. They also highlight some remaining obstacles towards a full dichotomy. In particular, for the RPQ , resilience is tractable but the only known PTIME algorithm uses submodular function optimization.
8.2.2 Dynamic data: incremental and dynamic algorithms and connections with dynamic algorithms
In 13, with Corentin Barloy and Louis Jachiet, Antoine Amarilli and Charles Paperman study the dynamic membership problem for regular tree languages under relabeling updates: we fix an alphabet and a regular tree language over (expressed, e.g., as a tree automaton), we are given a tree with labels in , and we must maintain the information of whether the tree belongs to while handling relabeling updates that change the labels of individual nodes in . Their first contribution is to show that this problem admits an algorithm for any fixed regular tree language, improving over known algorithms. This generalizes the known upper bound over words, and it matches the lower bound of from dynamic membership to some word languages and from the existential marked ancestor problem. Their second contribution is to introduce a class of regular languages, dubbed almost-commutative tree languages, and show that dynamic membership to such languages under relabeling updates can be decided in constant time per update. Almost-commutative languages generalize both commutative languages and finite languages: they are the analogue for trees of the ZG languages enjoying constant-time dynamic membership over words. Their main technical contribution is to show that this class is conditionally optimal when one assumes that the alphabet features a neutral letter, i.e., a letter that has no effect on membership to the language. More precisely, they show that any regular tree language with a neutral letter which is not almost-commutative cannot be maintained in constant time under the assumption that the prefix-U1 problem from 26 also does not admit a constant-time algorithm.
8.3 Knowledge on Data
Participants: Iovka Boneva, Sylvain Salvati, Sophie Tison.
8.3.1 Optimization for schemas for graphs
Iovka Boneva has started a particularly fruitful international collaboration with researchers from Poland, Austria and Italy which aims at bridging the gap between graph schema languages for RDF and for property graphs. This resulted in a common core definition of a schema language that is at the intersection of three schema languages for graphs of different types.
In 12, Iovka Boneva collaborated with several specialists in schema languages for graphs. They established a common model for three main graph schema languages: SHACL and ShEx for RDF graphs, and PG-Schema for property graphs. These languages are designed and used for formalising properties of data and ensuring data quality in diverse graph-shaped data driven applications such as Semantic Web and databases. Each language has its unique approach to defining constraints and validating graph data, leaving potential users in the dark about their commonalities and differences. The paper provides concise formal definitions of the core components of these languages, employs a uniform framework to facilitate a comprehensive comparison between them, and identifies a common set of functionalities, shedding light on both overlapping and distinctive features. This work is a first step towards interoperability of systems based on different schema languages.
In 15, Iovka Boneva with Jose E. Labra Gayo, Eric G. Prud'hommeaux, Katherine Thornton and Andra Waagmeester presented formal semantics for the ShEx schema language enriched with an inheritance mechanism. It is inspired by inheritance in object-oriented programming languages, and provides similar advantages such as reuse, modularity, and more flexible data modelling. Using an example, we explain the main features of the inheritance mechanism. We present its syntax and formal semantics. The semantics is an extension of the semantics of ShEx 2.1. It also directly yields a validation algorithm as an extension of the previous ShEx validation algorithms.
8.3.2 Static analysis with constraints (open-world query answering, query rewriting under constraints, consistency, certain answers)
With Pablo Barceló and Gaëlle Fontaine, Sylvain Salvati and Sophie Tison have corrected some proofs in 11. This paper is about handling graph databases with regular path constraints. Applications of graph databases are prone to inconsistency due to interoperability issues. This raises the need for studying query answering over inconsistent graph databases in a simple but general framework. They follow the approach of consistent query answering (CQA), and study its data complexity over graph databases for conjunctive regular-path queries (CRPQs) and conjunctive regular-path constraints (CRPCs). They deal with subset, superset and symmetric-difference repairs. Without restrictions, CQA is undecidable for the semantics of superset- and symmetric-difference repairs, and for subset-repairs. However, they identify restrictions on CRPCs and databases that lead to decidability, and even tractability of CQA.
9 Partnerships and cooperations
9.1 International research visitors
9.1.1 Visits of international scientists
Invitation of researchers to the team seminar
Our team often invites external researchers (from French institutions or international) to give research seminars. Participants sometimes stay a few days to discuss research with the team. Here is a non-exhaustive list of researchers that gave a presentation in the team in 2025-2026:
- Pieter Bonte [Ghent University]
- Alexandra Rogova [University of Warsaw]
- Harry Vinall-Smeeth [RWTH Aachen University]
- Sarah Kleest-Meißner [Hasselt University]
- Corto Mascle [MPI-SWS, Kaiserslautern]
- Lê Thành Dũng (Tito) Nguyen [Aix-Marseille University]
- Michaël Cadilhac [DePaul University]
- Howard Straubing [Boston College]
- Rémi Morvan [freelance software engineer (previously University of Bordeaux)]
9.1.2 Visits to international teams
Research stays abroad
Antoine Amarilli visited the Institute of Computer Science at the University of Göttingen (Germany), in the context of a collaboration with Florin Manea, Markus Schmid, and Tina Ringleb, from February 10 to February 16, 2025.
9.2 National initiatives
9.2.1 ANR JCJC KCODA
Participants: Florent Capelli [PI], Oliver Irwin, Charles Paperman, Sylvain Salvati.
-
Duration
2021-2026
-
Description
The aim of KCODA is to study how succinct representations can be used to efficiently solve modern optimization and AI problems that use a lot of data. We suggest using data structures from the field of compilation of knowledge that can represent large datasets succinctly by factoring certain parts while allowing efficient analysis of the represented data. The first goal of KCODA is to understand how one can efficiently solve optimization and training problems for data represented by these structures. The second goal of KCODA is to offer better integration of these techniques into the systems of database management by proposing new algorithms allowing to build factorized representations of the data responses to DB requests and by proposing encodings of these representations inside the DB.
9.2.2 ANR SxC
Participants: Loup Lobet, Charles Paperman [PI], Sylvain Salvati.
-
Duration
2025-2030
-
Coordinator
Charles Paperman
-
Scientific Partners
LIP (ENS Lyon), LCIS (Université Grenobles Alpes) and LABRI (Université de Bordeaux)
-
Objectives
Knowledge about handcrafted vectorization algorithms is scattered among various fields, system architectures, programming languages paradigms and industrial pieces of codes. The main objective of the project is to bring structure to this knowledge through novel conceptual and practical tools. The end goal is to facilitate the writing of SIMD programs and to pave the way to future auto-vectorization methods for stream processing. The main difficulty to achieve these goals is to understand the interplay of sequentiality and bit-level parallelism. Circuit complexity, as introduced by Shannon in the early years of computer science, offers a powerful framework to study such notion of parallelism. In this multidisciplinary project, we bring together methodologies from the fields of circuit complexity of automata, and programming language design for parallel architectures, with motivations from data processing and bioinformatics applications.
The project’s results will be implemented into a compilation tool-chain. At the highest level, we will design and develop a domain specific language, dubbed Vectoid. Its goal is to allow more programmers to harness SIMD instructions. Vectoid will be compiled to a Vectorial Intermediate Representation (VIR) dedicated to stream processing. In turn VIR will be compiled to low level code for various SIMD architectures. The design of this compilation tool-chain will be informed by a theoretical investigation on the expressivity of vectorial circuits. It will be evaluated on data processing and bioinformatics benchmarks.
9.2.3 ANR EXPAND
Participants: Antoine Amarilli.
-
Duration
2025-2030
-
Coordinator
Michaël Thomazo
-
Scientific Partners
LABRI (Bordeaux), LIMOS (Clermont-Ferrand), LIRMM (Montpellier), DI ENS (Paris), IRISA (Rennes), and CRIStAL (D-DAL and SPIRALS team)
-
Objectives
The goal of EXPAND is to expand the applicability of ontology-based query answering by allowing for enhanced query languages, and provide richer ways to manipulate and understand query answers.
9.3 Regional initiatives
In collaboration with Florent Capelli and Stefan Mengel of Centre de Recherche en Informatique de Lens (CRIL), Antoine Amarili and Mikaël Monet have obtained a PostDoc grant from région Nord via the CPER Cornelia. They will recruit a post-doctoral student who is expected to start in September 2026.
10 Dissemination
Participants: Antoine Amarilli, Iovka Boneva, Oliver Irwin, Sébastien Labbe, Aurélien Lemay, Loup Lobet, Arthur Lombardo, Mikael Monet, Charles Paperman, Sylvain Salvati, Sophie Tison.
10.1 Promoting scientific activities
Member of organizing committees
-
Mikael Monet
, Sophie Tison
member of the EDBT/ICDT'27 organizing committee. The conference will take place in Lille in 2027.
10.1.1 Scientific events: selection
Member of the conference program committees
-
Antoine Amarilli
was member of the program committee of UndoneCS 2026
-
Iovka Boneva
was member of the program committee of BDA 2025 (the French conference on data management)
Reviewer
-
Antoine Amarilli
has reviewed for various journals/conferences, including STACS'26, ICDT'25, LMCS
-
Mikael Monet
has reviewed for various journals/conferences, including PODS'26, JACM (2025), TODS (2025), TOCL (2026)
10.1.2 Journal
Member of the editorial boards
-
Mikael Monet
Managing editor for the TheoretiCS journal.
10.1.3 Invited talks
-
Antoine Amarilli
gave a talk at the invitation-only workshop FMT'25 (finite model theory) in Les Houches, France. He was also invited (without giving a talk) to Dagstuhl workshop 25081 "Semirings in Databases, Automata, and Logic"
10.1.4 Leadership within the scientific community
-
Antoine Amarilli
Vice-President (elected in 2024) of the EATCS (European Association for Theoretical Computer Sciences).
-
Charles Paperman
Elected member of EATCS (European Association for Theoretical Computer Sciences).
-
Sophie Tison
Member of the scientific committee of the GDR IM.
-
Sophie Tison
Member of the SSAAL Société des Sciences, de l'Agriculture et des Arts de Lille.
10.1.5 Scientific expertise
-
Sophie Tison
Expert for HCERES (vice-president of the DTIS-ONERA committee).
-
Sophie Tison
External expert for IUF.
-
Sophie Tison
Expert for the programme "Jeunes Talents France 2025 L’Oréal-UNESCO Pour les Femmes et la Science".
10.1.6 Research administration
-
Iovka Boneva
Member of Inria Lille CER (Commission des Emplois de Recherche).
-
Charles Paperman
Elected member of the research commission of FST (Université de Lille).
-
Sylvain Salvati
Elected member of CE (Commission d'évaluation INRIA).
-
Sophie Tison
Member of the Steering Committee of Highlights of Logic, Automata, and Games.
10.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
10.2.1 Supervision
-
Bastien Degardins
PhD project started in October 2024. On graph databases for DNA analysis. Supervised by Paperman and Marchet (CRIStAL Bonsai).
-
Oliver Irwin
PhD project started in 2022. On compilation and aggregation in databases. Co-supervised by Capelli and Salvati.
-
Sebastien Labbe
PhD student from Oct 2025, co-supervised by Antoine Amarilli and Charles Paperman
-
Loup Lobet
PhD student from Oct 2025, co-supervised by Laure Gonnord and Charles Paperman
-
Etienne Parent
PhD student from Oct 2025, co-supervised by Gabrien Radanne, Matthieu Moy and Charles Paperman
-
Arthur Lombardo
PhD student from Oct 2025, co-supervised by Antoine Amarilli, Mikaël Monet and Pierre Senellart
-
Vincent Peth
M2 intern, co-supervised by Antoine Amarilli and Mikaël Monet, summer 2025
-
Rémi de Pretto
M1 intern, co-supervised by Antoine Amarilli and Mikaël Monet, summer 2025
-
Paul Raphaël
L3 intern, co-supervised by Antoine Amarilli and Mikaël Monet and Sylvain Salvati, summer 2025
-
Harshul Sagar
PhD student at University of Illinois with whom Antoine Amarilli and Mikaël Monet have an ongoing research project that started during summer 2025
-
Thibault Tisserand
M1 intern, supervised by Sylvain Salvati, summer 2025
10.2.2 Juries
-
Sylvain Salvati
has been a reviewer of the PhD of Vincent Moreau A topological and fibrational approach to higher-order automata at Université Paris Cité.
-
Charles Paperman
has been in the PhD committee of Gabriel Bathie.
-
Antoine Amarilli
has been in the HDR committee of Tatiana Starikovskaya (reviewer), in the PhD committee of Rémi Morvan, in the PhD committee of Neha Makhija, in the PhD committee of Harry Vinall-Smeeth (reviewer), in the CSI committees of Javad Taheri, Oliver Irwin, Fatemeh Ghasemi, and Lucas Larroque, and in a hiring committee (COS) for a “Maître de conférences” position in LABRI (Bordeaux)
-
Sophie Tison
has been in the PhD committee of Rémi Morvan.
-
Iovka Boneva
has been in the hiring committee (COS) for two “Maître de conférences” positions at Université de Lille
10.2.3 Educational and pedagogical outreach
-
Aurélien Lemay
Responsible for computer science and numeric correspondent for his department.
-
Paperman
Responsible of alternance for the department of Computer Science.
-
Sylvain Salvati
Head of Master of Computer Science at the University of Lille.
-
Sylvain Salvati
Member of the board for Computer Science of the doctoral school MADIS (Mathematics and Computer Science).
-
Sylvain Salvati
Member of the collège doctoral (doctoral college) of University of Lille: PhD students education and professional training.
10.2.4 Teaching Activities
-
Iovka Boneva
Teaches computer science in BUT Informatique of Université de Lille.
-
Oliver Irwin
Taught 64 hours in Computer Science in the Computer Science Department of Université de Lille.
-
Aurélien Lemay
Teaches computer science in the LEA department of Université de Lille for around 200h per year (Licence and Master). He is also responsible for computer science and numeric correspondent for its department.
-
Mikael Monet
Teaches computer science as a temporary lecturer at Université de Lille and at Centrale Lille. He teaches two databases courses, as well as a course on "algorithms and complexity".
-
Charles Paperman
Teaches CS in master degrees of the computer science department, Miashs at bachelor level in the math department and in the data science master degree.
-
Sara Riva
Teaches CS both in the Bachelor and Master of the Department of Computer Science of the Université de Lille. She also teaches a Database course at Centrale Lille.
-
Sylvain Salvati
Teaches at the Bachelor and Master level in the computer science department of Université de Lille.
10.3 Popularization
10.3.1 Specific official responsibilities in science outreach structures
-
Sophie Tison
Member of the scientific committee of the Maison Pour la Science.
-
Charles Paperman
Co-organizer of the yearly thematic of GDR-IFM and GDR-BIMMM l'adn dans tout ses états.
10.3.2 Others science outreach relevant activities
-
Sara Riva
is involved in the internship action Fourmis which fosters the interest of highschool students towards Mathematics and Computer Science.
11 Scientific production
11.1 Major publications
- 1 inproceedingsCommon Foundations for SHACL, ShEx, and PG-Schema.WWW 2025 - The ACM Web Conference 2025Sydney, AustraliaACMApril 2025, 8-21HALDOI
- 2 proceedingsDynamic Membership for Regular Tree Languages.MFCSWarsaw, PolandSchloss Dagstuhl – Leibniz-Zentrum für Informatik2025HALDOI
- 3 articleApproximating Queries on Probabilistic Graphs.Logical Methods in Computer Science214December 2025HALDOI
- 4 proceedingsResilience for Regular Path Queries: Towards a Complexity Classification.PODS32Berlin, GermanyACM2025, 1-18HALDOI
- 5 proceedingsLinear Time Subsequence and Supersequence Regex Matching.MFCSWarsaw, PolandSchloss Dagstuhl – Leibniz-Zentrum für Informatik2025HALDOI
- 6 articleCorrections to “On the data complexity of consistent query answering over graph databases [Journal of Computer and System Sciences 88 (2017) 164–194]”.Journal of Computer and System Sciences155February 2026, 103694HALDOI
- 7 inproceedingsShape Expressions with Inheritance.22nd European Semantic Web Conference, ESWC 2025ESWC 2025 - 22nd European Semantic Web Conference15718Lecture Notes in Computer SciencePortoroz, SloveniaSpringer Nature SwitzerlandJune 2025, 481-499HALDOI
- 8 inproceedingsA Simple Algorithm for Worst Case Optimal Join and Sampling.International Conference on Database TheoryLeibniz International Proceedings in Informatics (LIPIcs)328Barcelona, SpainSchloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)March 2025, 23:1-23:19HAL
11.2 Publications of the year
International journals
International peer-reviewed conferences
Reports & preprints
11.3 Cited publications
- 21 articleA Circus of Circuits: Connections Between Decision Diagrams, Circuits, and Automata.arXiv preprint arXiv:2404.096742024back to textback to text
- 22 inproceedingsRanked Enumeration for MSO on Trees via Knowledge Compilation.ICDT2902024back to text
-
23
inproceedings
E numeration on trees with tractable combined complexity and efficient updates. 2019DOIback to textPODS - 24 articleTractable Circuits in Database Theory.ACM SIGMOD Record5322024back to textback to text
-
25
article
T he dichotomy of evaluating homomorphism-closed queries on probabilistic graphs.LMCS2022DOIback to text -
26
inproceedingsDynamic Membership for Regular Languages.
2021DOIback to textback to textICALP - 27 articleThe Non-Cancelling Intersections Conjecture.CoRRabs/2401.162102024back to textback to text
- 28 articleWeighted counting of matchings in unbounded-treewidth graph families.arXiv preprint arXiv:2205.008512022back to text
-
29
inproceedings
E numerating regular languages with bounded delay. 2023DOIback to textSTACS - 30 inproceedingsHolant problems and counting CSP.STOC2009back to text
- 31 articleA knowledge compilation map.Journal of Artificial Intelligence Research172002back to textback to text
- 32 inproceedingsComputing the Shapley Value of Facts in Query Answering.SIGMOD ConferenceACM2022back to text
- 33 articleAn effective dichotomy for the counting constraint satisfaction problem.SIAM Journal on Computing4232013back to text
- 34 articleDocument spanners: A formal approach to information extraction.JACM6222015back to text
- 35 articleThe Complexity of Resilience and Responsibility for Self-Join-Free Conjunctive Queries.PVLDB932015DOIback to text
- 36 phdthesisDialog between chase approach and string rewriting system approach.Université de LilleUniversité de LilleDecember 2019HALback to text
- 37 articleRecent advances in fully dynamic graph algorithms: A quick reference guide.JEA272022back to text
- 38 incollectionCounting Constraint Satisfaction Problems.The Constraint Satisfaction Problem7Dagstuhl Follow-Ups2017back to text
- 39 articleExpected Shapley-like scores of boolean functions: Complexity and applications to probabilistic databases.PACMMOD222024back to text
- 40 inproceedingsEvaluation and enumeration problems for regular path queries.ICDT2018back to text
- 41 inproceedingsAn algebraic approach to vectorial programs.STACS2023HALDOIback to text
- 42 inproceedingsContainment of Regular Path Queries Under Path Constraints.ICDT2024HALDOIback to text
-
43
article
Enumeration for FO queries over nowhere dense graphs .JACM6932022DOIback to text - 44 articleConstant delay enumeration for conjunctive queries.ACM SIGMOD Record4412015back to text
- 45 articleProvSQL: Provenance and probability management in PostgreSQL.PVLDB11122018back to text
- 46 phdthesisFrom semigroup theory to vectorization: recognizing regular languages.Université de LilleUniversité de LilleDecember 2023HALback to text