D-DAL

D-DAL - 2025

2025Activity‌ reportProject-TeamD-DAL

RNSR:‌ 202524739L

Research center Inria‌‌ Centre at the University of Lille
In partnership‌ with:Université de Lille‌
Team name: Data :‌‌ Dynamics Algorithmics and Logics
In collaboration with:Centre‌ de Recherche en Informatique,‌ Signal et Automatique de‌‌ Lille

Creation of the Project-Team: 2025 October 01‌

Each year, Inria research‌ teams publish an Activity‌‌ Report presenting their work and results over the‌ reporting period. These reports‌ follow a common structure,‌‌ with some optional sections depending on the specific‌ team. They typically begin‌ by outlining the overall‌‌ objectives and research programme, including the main research‌ themes, goals, and methodological‌ approaches. They also describe‌‌ the application domains targeted by the team, highlighting‌ the scientific or societal‌ contexts in which their‌‌ work is situated.

The reports then present the‌ highlights of the year,‌ covering major scientific achievements,‌‌ software developments, or teaching contributions. When relevant, they‌ include sections on software,‌ platforms, and open data,‌‌ detailing the tools developed‌ and how they are shared. A substantial part‌ is dedicated to new results, where scientific contributions‌ are described in detail, often with subsections specifying‌ participants and associated keywords.

Finally, the Activity Report‌ addresses funding, contracts, partnerships, and collaborations at various‌ levels, from industrial agreements to international cooperations. It‌ also covers dissemination and teaching activities, such as‌ participation in scientific events, outreach, and supervision. The‌ document concludes with a presentation of scientific production,‌ including major publications and those produced during the‌ year.

Keywords

Computer Science and Digital Science

A2.1.‌ Programming Languages
A2.1.1. Semantics of programming languages
A2.1.4.‌ Functional programming
A2.1.6. Concurrent programming
A3.1. Data
A3.1.1.‌ Modeling, representation
A3.1.2. Data management, quering and storage‌
A3.1.3. Distributed data
A3.1.4. Uncertain data
A3.1.5. Control‌ access, privacy
A3.1.6. Query optimization
A3.1.7. Open data‌
A3.1.8. Big data (production, storage, transfer)
A3.1.9. Database‌
A3.2.1. Knowledge bases
A3.2.2. Knowledge extraction, cleaning
A3.2.3.‌ Inference
A3.2.4. Semantic Web
A4.5. Formal method for‌ verification, reliability, certification
A4.5.1. Static analysis
A4.5.2. Model-checking‌
A4.5.3. Program proof
A4.7. Access control
A4.8. Privacy-enhancing‌ technologies
A7. Theory of computation
A7.2. Logic in‌ Computer Science
A9.1. Knowledge
A9.2. Machine learning
A9.7.‌ AI algorithmics
A9.8. Reasoning

1 Team members,‌ visitors, external collaborators

Research Scientists

Antoine Amarilli [‌INRIA, Advanced Research Position, from Oct‌ 2025, HDR]
Mikael Monet [INRIA‌, Researcher, from Oct 2025]

Faculty‌ Members

Sylvain Salvati [Team leader, UNIV‌ LILLE, Professor, from Oct 2025,‌ HDR]
Iovka Boneva [UNIV LILLE,‌ Associate Professor, from Oct 2025]
Aurélien‌ Lemay [UNIV LILLE, Associate Professor,‌ from Oct 2025, HDR]
Charles Paperman‌ [UNIV LILLE, Associate Professor, from‌ Oct 2025, HDR]
Sophie Tison [‌UNIV LILLE, Emeritus, from Oct 2025‌, HDR]

PhD Students

Bastien Degardins [‌UNIV LILLE, from Oct 2025]
Oliver‌ Irwin [UNIV LILLE, from Oct 2025‌]
Sebastien Labbe [UNIV LILLE, from‌ Oct 2025]
Loup Lobet [INRIA,‌ from Oct 2025]
Arthur Lombardo [ENS‌ PARIS, from Oct 2025]

Administrative Assistant‌

Aurore Dalle [INRIA]

External Collaborator

Sara‌ Riva [UNIV LILLE, from Oct 2025‌]

2 Overall objectives

A major and early‌ success story of computer science, combining theoretical foundations‌ and efficient implementations, is the development of relational‌ databases. Relational databases have now been deployed‌ in essentially all sectors of activity, from large‌ multi-server clusters to phones and embedded devices. An‌ important factor of this success is the abstraction‌ lying at the core of relational databases: they‌ combine a simple data model with a declarative‌ language (SQL) to write user queries. These queries‌ are then transparently translated to optimized execution plans taking advantage of the‌ schema (the data model)‌ and statistics of the‌‌ data. This abstracts away all internal details that‌ would be unpleasant to‌ the user: how data‌‌ is concretely stored, the shape of the actual‌ execution plan, low level‌ optimizations, the orchestration of‌‌ data streams that compute intermediate query results, concurrent‌ accesses, and so on.‌

Building on this foundation,‌‌ the field of data management has evolved in‌ two antagonistic directions over‌ the past decade. One‌‌ first direction is that the scope of data‌ usage in applications has‌ significantly expanded. Data is‌‌ used to train statistical models; they are the‌ basis of key decisions‌ that have a strong‌‌ impact on people and on society, and must‌ therefore be explainable.‌ The data that we‌‌ store may be updated often, which makes it‌ necessary to efficiently recompute‌ the query answers. In‌‌ particular, the data may arrive in streams,‌ and we may have‌ to compute query answers‌‌ on the fly, even before we have read‌ the entire data stream.‌ The number of query‌‌ answers to produce may be so large that‌ we need to produce‌ it in streaming, or‌‌ to access it implicitly, instead of querying it‌ in full. The data‌ is heterogeneous: it‌‌ does not necessarily follow a fixed schema like‌ the relational model traditionally‌ assumes, and it may‌‌ feature non-relational data such as textual documents, tree-shaped‌ data (such as XML‌ and JSON records), and‌‌ so on. In particular, data may be stored‌ as a graph, without‌ a fixed schema; it‌‌ may then be interrogated with queries featuring recursion‌, e.g., asking for‌ pairs of reachable nodes.‌‌ There may be uncertainty on the data stored‌ by such systems, making‌ it difficult to compute‌‌ accurate answers. Last, some queries need to be‌ evaluated across datasets from‌ different sources, e.g., large‌‌ open datasets from multiple public organizations. In such‌ settings, data is typically‌ not stored within one‌‌ single system, but integrated across different systems. Further,‌ beyond traditional query evaluation,‌ the data can be‌‌ transformed, combined, and post-processed. This can be done‌ with procedural languages within‌ SQL databases (stored functions‌‌ and procedures), or with domain specific languages such‌ as LINQ that are‌ used to querying data‌‌ within the programming language C#. Thus, many data‌ processing frameworks are actually‌ libraries that can be‌‌ seen as query languages embedded within a programming‌ language; conversely, some query‌ languages such as XPath‌‌ 3.0 for XML documents or jq for JSON‌ documents now embed limited‌ programming languages so as‌‌ to make them more expressive. These new developments‌ are blurring the line‌ between querying and programming,‌‌ and are giving up on the idea that‌ data manipulation needs should‌ be expressed declaratively.

Faced‌‌ with the growth of these new applications, the‌ scope of the SQL‌ standard has significantly expanded‌‌ to accommodate some (but not all) of the‌ corresponding needs. Recent versions‌ of SQL support recursive‌‌ queries, incremental view maintenance‌ and triggers, stored procedures, queries over strings, embedded‌ XML and JSON documents and query over them,‌ along with naive forms of uncertainty and incompleteness‌ (through NULLs). Most recently, the graph query language‌ GQL has been published as part of SQL.‌ These features are standardized and supported by relational‌ engines, though the coverage and efficiency of these‌ features may vary across implementations. Further, for some‌ of these tasks, such as data integration and‌ post-processing, an entire ecosystem of solutions has developed‌ on top of relational databases.

The second direction‌ is that the data has increased in volume.‌ Some organizations now want to store much larger‌ volumes of data, e.g., by extracting it from‌ the Web, collecting scientific data, or running services‌ for an unprecedented number of users. Traditional relational‌ databases, with their support of the full SQL‌ language, may not be suited to such large‌ data volumes. For this reason, in the 2000s,‌ new “NoSQL” database paradigms started to become popular.‌ These systems compromise on some aspects of SQL,‌ such as its expressiveness (i.e., supporting less expressive‌ queries), or its formal guarantees (e.g., distributed consistency),‌ to achieve better performance. Highly-optimized systems now exist‌ for specific tasks, such as key-value stored, document-oriented‌ databases, etc.

To summarize, new data management applications‌ now ask for two conflicting aspects: the support‌ for more expressive queries (going beyond the more‌ arcane features of SQL); while offering more efficient‌ querying over very large quantities of data. These‌ features should be offered while retaining the guiding‌ principle of data management systems, namely, going from‌ a high-level declarative language that describes user needs,‌ down to low-level execution plans that implement them‌ over data representations. Parts of this translation can‌ be data-independent, using, e.g., static analysis; or‌ they can be data-dependent, using, e.g., logical‌ constraints or statistics of the database.

The main‌ goal of D-DAL (D-DAL is pronounced as Dédale‌, the French pronunciation of Daedalus) is to‌ find paths to extract meaning from the maze‌ of data. In other words, D-DAL's scientific objective‌ consists in overcoming the limits of expressivity and‌ efficiency of programs dealing with data. This global‌ objective gives rise to two main challenges. A‌ challenge with a large perspective of extracting value‌ from data, and a practical and pragmatic‌ challenge of processing large data efficiently.

Challenge‌ 1: Extracting value from data

The first challenge‌ that we will address is that of supporting‌ more expressive ways to interrogate data, so as‌ to make sense of complex data. We will‌ address several aspects of this challenge. First, the‌ data can be uncertain1, imprecise, or‌ probabilistic. Second, it can be unstructured, or with‌ an unknown structure that must be discovered or‌ formalized. Third, we may need to extend query‌ evaluation to compute complex information, in particular explanations,‌ but also provenance annotations, complex aggregation values, or‌ complex results, in a principled manner.

Dealing with uncertain data.

Data in‌ real life is rarely‌ certain: it can be‌‌ subject to human error when manually collected, generated‌ from physical sensors having‌ imperfect precision, or extracted‌‌ from sources of various quality from the Web‌ via natural language processing‌ (NLP) or machine learning‌‌ techniques, that are often themselves of a probabilistic‌ nature. The traditional way‌ of dealing with this‌‌ uncertainty is to simply ignore it, e.g., by‌ removing all rows containing‌ NULL values in a‌‌ database before evaluating a query. This naive approach‌ can however lead to‌ incorrect or incomplete answers‌‌ and is thus not always desirable. The challenge‌ is then to design‌ methods and algorithms that‌‌ are able to take this uncertainty into account‌ in a principled and‌ efficient way, i.e., by‌‌ reasoning over the possible completions of the data‌ in a logical fashion,‌ or by reasoning quantitatively‌‌ over the possible worlds of probabilistic data to‌ determine which query results‌ are likely.

Finding structure‌‌ in data.

Relational databases assume that the data‌ is strictly structured by‌ a fixed data schema‌‌: this assumption is central to the query‌ language, and to efficient‌ algorithms to select query‌‌ execution plans. However, when data comes from various‌ sources, or when it‌ is updated with new‌‌ kinds of information, then relational databases are inadequate:‌ they require time-consuming engineering‌ tasks to refactor data‌‌ schema to reflect the new needs. By contrast,‌ the model of graph‌ databases is more flexible‌‌ and has become increasingly popular for applications. As‌ efficiency comes at a‌ cost, flexibility also has‌‌ a cost, and data may display irregularities and‌ peculiarities that make it‌ harder to process. We‌‌ focus on two main research directions to explore‌ on data graphs.

The‌ first research direction is‌‌ to develop ways to express restrictions on the‌ shape of graph data,‌ via flexible notions of‌‌ schemas. Schemas make it possible to trade some‌ of the flexibility of‌ the graph model with‌‌ the efficiency of the relational model: they make‌ it possible to accommodate‌ diverse data representations in‌‌ a controlled manner, while disallowing specific kinds of‌ irregularities, and while guiding‌ query evaluation algorithms to‌‌ optimize query evaluation plans with the information given‌ by the schema.

The‌ second approach consists in‌‌ analyzing data in order to reveal its structure.‌ Through comprehensive data examination‌ and modelling techniques, we‌‌ aim to automatically discover the underlying patterns and‌ organization within the graph.‌ These methods could lead‌‌ to the inference of a schema that will‌ be exploited by other‌ algorithms to exploit directly‌‌ the data, to transform it into a usable‌ form, or to directly‌ help inference of queries‌‌ or transformation.

Explaining query results.

The increasing usage‌ of computing in every‌ aspect of modern society‌‌ has naturally raised a need for transparency in‌ algorithms. In some scenarios,‌ having the answer to‌‌ a query is not enough and one also‌ wants some kind of‌ explanation on how the‌‌ result was derived, or‌ wants to sort the possible results according to‌ some importance score.

In particular, in contexts where‌ decisions are made on the basis of data‌ that can impact people's lives, like in medicine,‌ or that can impact society, like in policy‌ making, it is important not to just query‌ data, but also to understand why the answers‌ are obtained, or even understand how pieces of‌ data weigh in the answers using certain aggregates‌ like Shapley values. Indeed data is imperfect and‌ query explanation is a way to deal with‌ this imperfection. These concerns are not confined to‌ the database community: the problem arises in all‌ data-driven computations. Most notably, with the rise of‌ AI, explainable AI has emerged as an important‌ scientific domain.

From a technical point of view,‌ query explainability can be understood more generally as‌ building complex query answers that keep track of‌ the operations performed during the evaluation, amounting to‌ a form of user-defined aggregation; it can also‌ be understood in the abstract setting of semiring‌ provenance. All these new tasks can incur‌ an increase in the time to answer queries,‌ and thus call for the development of new‌ notions of explanations and of new efficient algorithms.‌

Challenge 2: Processing large data efficiently

The standard‌ of SQL is continuously growing. Now, it encompasses‌ a larger number of data formats and ways‌ to query these formats: recursive queries, XML and‌ Xquery, JSON, and more recently property graphs. Most‌ of these extensions of the standard are encoded‌ in the relational model, sometimes in very simple‌ manners. For example, JSON documents are in general‌2 stored as strings. Instead of attacking the‌ problem of handling massive data by trying to‌ treat all kinds of formats and queries at‌ once, like the SQL standard does, we prefer‌ to concentrate on particular fragments and propose efficient‌ analyses, optimization methods and algorithms for these fragments.‌ When the number of query results is very‌ large, specific methods are needed to construct them.‌ For instance, streaming the results one by one‌ dilutes the computational effort over time. In situations‌ when queries are repeated, another strategy may be‌ to store the results of the query and‌ update it along with the data. We shall,‌ in some cases, develop high throughput prototypes for‌ these specific query fragments. In the end, the‌ composition of these tools should result in highly‌ efficient data processing pipelines.

As data comes in‌ many forms (text, JSON documents, graphs, relational databases,‌ ...), query languages are highly diverse (regular expressions,‌ JSONpath, Xpath, SQL, the numerous NOSQL querying APIs,‌ ...). An important challenge is to understand what‌ is computationally feasible and what is not. Knowing‌ which fragments of the query languages can benefit‌ from efficient evaluation will possibly allow us to‌ develop specific tools to handle them. We are‌ going to focus on particular aspects of expressive‌ queries that go beyond the traditional relational model: recursion, uncertainty handling, and‌ support for embedded programming.‌ For these features, we‌‌ aim at studying which aspects of queries make‌ evaluation more or less‌ efficient. This will result‌‌ in the design of fast algorithms for some‌ families of queries and‌ proofs of computational hardness‌‌ for others. Beyond these results, for practically useful‌ families of queries, we‌ shall develop domain specific‌‌ languages and prototypes that take advantage of the‌ algorithms that we will‌ have designed.

Beyond this‌‌ theoretical endeavour, we also wish to explore how‌ efficient query evaluation algorithms‌ can be articulated with‌‌ modern hardware platforms. Indeed, hardware now offers increasingly‌ parallel, concurrent or distributed‌ primitives. We believe that‌‌ one of the keys to efficient query evaluation‌ on large data is‌ to align novel algorithmic‌‌ methods with modern hardware.

Last, in settings where‌ efficient query evaluation remains‌ out of reach, we‌‌ intend to study approximation algorithms, to obtain approximate‌ answers to queries with‌ reasonable time bounds. Approximation‌‌ can be understood in a logical sense (i.e.,‌ principled overapproximations or underapproximations‌ of query results), in‌‌ a quantitative sense (i.e., approximate numerical or probabilistic‌ query answers), or in‌ a practical sense (i.e.,‌‌ approximations that are validated by an experimental approach).‌

Enumeration and incremental algorithms.‌

Querying data is only‌‌ one aspect of data processing. Indeed, when looking‌ at entire data processing‌ pipelines, we see that‌‌ results are often combined with other results coming‌ from other data sources,‌ and can be further‌‌ processed by programs. This poses new research questions:‌ instead of optimizing query‌ evaluation as a monolithic‌‌ task, we can study how to globally optimize‌ query evaluation pipelines by‌ studying the interplay between‌‌ the evaluation of several queries.

In this light,‌ we intend to study‌ two common algorithmic frameworks‌‌ to deal with large amounts of data, which‌ are well-suited to orchestrate‌ the simultaneous evaluations of‌‌ many queries in a pipeline. These are enumeration‌ algorithms and incremental algorithms‌.

The principle of‌‌ enumeration algorithms is the following: when the number‌ of query results is‌ huge, it is neither‌‌ feasible nor realistic to produce the entire collection‌ of results in answer‌ to a user query.‌‌ Instead, it is preferable to enumerate the results‌ in streaming, one after‌ the other, while minimizing‌‌ the delay between two successive answers. Evaluating queries‌ by enumerating results offers‌ two potential advantages. First,‌‌ downstream applications in the pipeline can access query‌ results more quickly, as‌ soon as they are‌‌ computed. Second, outputting results one by one might‌ drastically reduce the amount‌ of memory needed to‌‌ perform the computation.

The principle of incremental algorithms‌ is that queries that‌ are repeatedly executed can‌‌ be drastically accelerated by memorizing previous results. The‌ idea of incremental algorithms‌ is to avoid the‌‌ recomputation of all the results; this is usually‌ done by designing data‌ structures that can efficiently‌‌ be maintained when the data is updated and‌ that can serve to‌ compute quickly the new‌‌ query result.

Designing query‌ execution engines.

Once query results from various data‌ sources are enumerated either by an enumeration algorithm‌ or from data structures that store them, one‌ needs to orchestrate and synchronize the processing of‌ these results. In this context results are abstracted‌ as streams that are combined together.

Techniques pertaining‌ to programming languages are well suited to work‌ on streams. Streams of data and their combinations‌ are well studied in the context of synchronous‌ and reactive programming. Generalizations have been also studied‌ by the community of functional programming languages. Bringing‌ these methods to deal with queries and more‌ broadly to data processing can have a strong‌ impact on the efficiency of query execution engines.‌

The relation with programming languages and thus to‌ compilation naturally leans towards trying to exploit hardware‌ optimally to improve data throughput. In particular, exploiting‌ all the levels of parallelization: SIMD3 instructions,‌ the several cores that are inside CPUs, GPUs,‌ or in a distributed context.

Finally another challenge‌ is to develop methods at the frontier of‌ enumeration and incremental algorithms, i.e., to develop data‌ structures that support the enumeration of query results,‌ and that can also be updated efficiently when‌ the data itself is updated, so as to‌ start again the enumeration of the (possibly new)‌ query results.

3 Research program

Our research organisation‌ addresses the two scientific challenges that motivate the‌ project D-DAL, extracting value from data and processing‌ large data efficiently, with three axes that‌ correspond to different level of abstractions with respect‌ to these challenges.

The first axis, the basement‌, is about circuits which are a transverse‌ to the two challenges. Circuits form a handy‌ abstraction for representing data and understanding its structure‌ but also a proxy that allows to study‌ the complexity of queries and also to understand‌ how to make the most out of modern‌ CPUs.

The second axis, the first floor,‌ is about querying. It builds upon what is‌ produced in the basement and is intended to‌ develop efficient querying algorithms and eventually querying programs.‌ This axis is concerned with explaining query results,‌ studying fragments of querying languages, dynamics aspects of‌ querying and streaming. The first floor builds upon‌ the basement, circuits play an important role in‌ explaining query results and various aggregation mechanisms as‌ they provide the right structure to represent concisely‌ query results and then execute efficiently computations on‌ these results. Circuits also provide the frontier of‌ the feasible and the efficient about queries and‌ they will allow us to focus on the‌ right query fragments. Finally, circuits are the most‌ basic streaming algorithms we are going to study‌ and implement. The study of streaming methods will‌ allow us to compose efficiently these basic streaming‌ algorithms and enable the creation of efficient querying‌ pipelines.

The third axis, the roof, is‌ about proposing and exploiting information about data. The‌ roof has two slopes, one is about proposing and studying schemas‌ for graphs and the‌ other is about taking‌‌ into account knowledge about data so as to‌ accelerate querying. The first‌ slope is concerned with‌‌ the schema language ShEx which has been developed‌ within Links (the team‌ from which D-DAL originates)‌‌ and D-DAL will pursue this effort as it‌ gives us concrete problems‌ in relation to data‌‌ for query optimization exploiting knowledge about data. The‌ second slope is precisely‌ about taking into account‌‌ knowledge about data when answering queries. This information‌ should make algorithms more‌ efficient and also to‌‌ work with uncertain data more consistently. We will‌ work on the specific‌ information coming from ShEx‌‌ schemas, but also with more general settings. Overall,‌ the roof axis stands‌ upon the second axis‌‌ as it aims at refining the work done‌ on querying and applying‌ it to settings where‌‌ we can take advantage from extra hypotheses in‌ order to lower the‌ complexity.

In a a‌‌ nutshell, the three axes build a coherent architecture‌ which aims to work‌ out the maze of‌‌ data. We here give a summary of their‌ contributions to the two‌ challenges:

Extracting value from‌‌ data
- Basement
  Circuits are natural and concise representations‌ of data, be it‌ raw data or answers‌‌ to queries. With tools coming from knowledge compilation,‌ they reveal the inherent‌ structure of data, which‌‌ can be used independently from the queries that‌ we intend to evaluate‌ over it.
- First floor‌‌
  Querying is the basic way to extract information‌ from data, which can‌ then be refined into‌‌ value. We plan to explore various querying fragments‌ that enrich the state‌ of the art of‌‌ querying, including expressive queries featuring recursion, aggregation, and‌ uncertainty handling. It will‌ enable the extraction of‌‌ richer information.
- Roof
  The development of ShEx is‌ a concrete proposal for‌ describing data in ways‌‌ that allow to extract value from graph data‌ by taking logical constraints‌ into account. Furthermore, taking‌‌ advantage from extra information for querying, such as‌ deduction rules and integrity‌ constraints, will expand the‌‌ capabilities of querying.
Efficient data processing
- Basement
  Circuits‌ can play the role‌ of abstract proxy towards‌‌ exploiting modern CPUs to process data efficiently. Furthermore,‌ they also play an‌ important role in the‌‌ study of computational complexity, and in particular low‌ complexity classes, or classes‌ allowing parallel computation. Such‌‌ classes will allow us to describe the landscape‌ of queries that can‌ be solved very efficiently‌‌ on large data.
- First floor
  Query fragments, and‌ algorithms to process them‌ efficiently, are the central‌‌ focus of this axis. In particular, the dynamic‌ aspects of queries will‌ be studied in two‌‌ aspects. First by proposing enumeration and incremental algorithms‌ for queries. Second by‌ studying streaming and the‌‌ composition of streams.
- Roof
  Taking advantage of information‌ about data and developing‌ ShEx are central to‌‌ exploiting real world data, which often comes with‌ additional information such as‌ schemas. We will refine‌‌ querying algorithms to make‌ them more efficient with extra hypotheses, in particular‌ to choose query evaluation plans based on statistics‌ of the data.

Our overall goal is to‌ design highly efficient querying languages in various contexts‌ (data formats, uncertainty, etc.) and the tools that‌ allow to combine them into efficient data processing‌ programs. D-DAL is on the ridge between expressiveness‌ and efficiency. We have structured our project into‌ tree axes that are detailed in the rest‌ of this section. We summarize their objectives below‌ according to the two challenges we attack.

Extracting‌ value from data
- Short term
  We are going‌ to extend queries and data representation in several‌ directions. First we will work on enhancing circuit‌ classes by negation, and we will study generalizations‌ of structural constraints that allow for efficient counting‌ algorithms and hence handle quantitative uncertainty. Second we‌ will also incorporate probabilities in ShEx and work‌ on queries on graphs with constraints. We will‌ also study which logical constraints can be used‌ to reason on data, and which counterfactual problems‌ (e.g., resilience, Shapley values) can benefit from our‌ existing expertise on related tasks. These short term‌ objectives build on our previous work and work‌ towards more expressive query fragments.
- Medium term
  We‌ plan to take a more general view on‌ negation within circuits, in particular work more thoroughly‌ in connection with knowledge compilation. One goal is‌ to obtain a better understanding of the probabilistic‌ setting specifically. Another goal is to work on‌ open-world query answering in the presence of uncertainty‌ and of recursion. We also plan to study‌ the lifting of constraints of data graphs to‌ views on graphs. This medium-term objectives aim at‌ having a better qualitative understanding of data.
- Long‌ term
  We plan to build a unifying view‌ of the relation between circuit representations and querying‌ problems. This will provide us with an integrated‌ view of how data can be represented. We‌ also plan to work on relating aggregation and‌ uncertainty problems with formalisms used for constraint satisfaction‌ problems, to explore how such approaches can lead‌ to new structural algorithms for query evaluation. We‌ also hope that such approaches can help connect‌ the multiple communities that propose constraint languages for‌ data, in particular graph-shaped data. Finally, we plan‌ to have ShEx be standardized by the IEEE‌ and have a set of open software tools‌ for it.
Efficient data processing
- Short term
  We‌ are going to consolidate our team's first works‌ which relate circuits to efficient implementations based on‌ SIMD instructions. We shall also work on dynamic‌ and incremental algorithms for simple data like texts‌ and trees, again building on our existing research‌ providing first results in these directions. We will‌ also study efficient enumeration of query results that‌ respect a specific order (the order is a‌ good property for further processing), and connect this‌ with efficient dynamic algorithms. All these works give‌ basic building blocks for streaming algorithms, that we plan to orchestrate via‌ work on simple programs‌ called transducers.
- Medium‌‌ term
  We shall build a circuit-based compilation chain,‌ which goes from query‌ fragments to efficient code,‌‌ generalizing previous work to transducers. We will also‌ study how the compilation‌ can be doubled with‌‌ efficient query plan selection at runtime, leveraging the‌ efficient data structures that‌ we will have designed‌‌ by then. We will also explore how dynamic‌ and incremental algorithms can‌ accommodate expressive features such‌‌ as recursion and aggregation without compromising the effectiveness‌ of query evaluation, and‌ identify tractability assumptions that‌‌ can be leveraged to that end. In particular,‌ in connection with enumeration‌ algorithms, we shall work‌‌ on how to efficiently compose data processes, in‌ particular connecting enumeration algorithms‌ that produce sequence of‌‌ diffs with incremental algorithms that maintain a query‌ result over evolving data.‌
- Long term
  We shall‌‌ propose an end-to-end compilation framework for efficient query‌ compilation and the orchestration‌ of heterogeneous systems, including‌‌ expressive tasks such as reasoning and optimization. We‌ shall take the direction‌ of probabilistic programming to‌‌ sacrifice precision in favor of efficiency for the‌ evaluation of queries, guided‌ by a fine-grained theoretical‌‌ understanding of the possible approximation tradeoffs. We will‌ also undertake this precise‌ complexity study in the‌‌ dynamic and incremental settings as well as for‌ enumeration problems, again aiming‌ at a full understanding‌‌ of the landscape of possible complexity regimes.

3.1‌ The Basement: Circuits for‌ Data Manipulation

Large data‌‌ poses challenges that are both theoretical and practical.‌ D-DAL intends to use‌ circuits as a common‌‌ abstraction to address several of these problems. Circuits‌ are an abstract object‌ that can be approached‌‌ from various angles, corresponding to different research communities.‌ First, circuits are studied‌ from the knoweldge compilation‌‌ perspective, an AI field which studies classes of‌ Boolean circuits and which‌ has connections to data‌‌ management 24. Second, circuits are studied as‌ a compilation target,‌ this time to translate‌‌ programs to abstract representations which ensure that they‌ can be evaluated in‌ a massively parallel or‌‌ vectorized way. Third, circuits can be studied from‌ a fully theoretical angle,‌ in the field of‌‌ circuit complexity theory. We now present these‌ axes.

Studying circuits from‌ Knowledge Compilation.

Participants: Antoine‌‌ Amarilli, Mikaël Monet, Charles Paperman,‌ Sylvain Salvati.

Knowledge‌ compilation 31 is a‌‌ subfield of AI that studies different ways to‌ represent Boolean functions, in‌ particular restricted classes of‌‌ Boolean circuits. It studies tradeoffs between conciseness‌ and tractability: which‌ representations can be used‌‌ for which computational tasks (e.g., counting satisfying assignments),‌ and when can we‌ efficiently translate from one‌‌ representation to another.

It turns out that knowledge‌ compilation is a useful‌ tool to solve many‌‌ types of database tasks 24, such as‌ factorized representations of query‌ answers, enumeration algorithms, provenance‌‌ computation, and uncertainty management (in particular on probabilistic‌ data). We wish to‌ pursue this study of‌‌ knowledge compilation to databases‌ and query evaluation, and also investigate knowledge compilation‌ in itself to achieve a better understanding of‌ these circuit classes.

Short term.

In the short‌ term, we will explore how specific circuit classes‌ can be used for efficient evaluation of particular‌ queries, across various tasks such as enumeration and‌ ranked access. The PhD thesis of Oliver Irwin‌ is set in this context: it is co-directed‌ by Florent Capelli from CRIL and Sylvain Salvati.‌ It studies how to efficiently find query answers‌ using circuits for queries that may feature negation,‌ and how to reinterpret efficient query evaluation algorithms‌ in a knowledge compilation perspective. We further intend‌ to pursue our study of query evaluation on‌ probabilistic databases via circuit classes, understanding the power‌ and limitations of this method.

Medium term.

One‌ longer-term challenge to be attacked by D-DAL is‌ to understand the power of negation in circuit‌ formalisms that allow tractable counting. Indeed, for the‌ circuit classes commonly used in knowledge compilation 21‌, the use of negation is highly restricted.‌ However, it is not well-understood whether the expressive‌ power of such circuits is extended when we‌ allow arbitrary negation. This question is purely about‌ knowledge compilation circuit classes, but it also has‌ consequences for questions in database theory about tractable‌ query evaluation on probabilistic databases 27. Another‌ challenge is whether knowledge compilation classes can also‌ capture more tractable enumeration tasks, in particular those‌ involving negations (e.g., enumerating objects which do not‌ have a given property); these would be necessary‌ for a knowledge-compilation-based proof of tractable enumeration for‌ some formalisms such as first-order logic on restricted‌ graphs 43.

Long term.

A far-reaching question‌ in knowledge compilation and database theory concerns the‌ relationship between two kinds of results: algorithms (namely,‌ efficient query evaluation algorithms, and computational complexity hardness‌ results), and tractable circuits (namely, efficient compilation algorithms,‌ and lower bounds on the size of tractable‌ circuits). One long-term research direction of the D-DAL‌ team would be to unify these settings. For‌ instance, we could try to understand which hardness‌ results on query evaluation can imply circuit lower‌ bounds, e.g., in the context of recursive queries‌ 25; and when tractable algorithms can be‌ translated to circuit classes, subsuming the inclusion-exclusion conjecture‌ 27.

Another long-term goal of D-DAL could‌ be to consolidate existing results about knowledge compilation.‌ The current reference on the topic is the‌ seminal work of Darwiche and Marquis 31,‌ but it dates back to 2002. We have‌ started a survey of knowledge compilation in 21‌, but far more efforts would be needed‌ in order to survey all known results for‌ knowledge compilation circuit classes.

Circuits as a compilation‌ target for parallelism and vectorization.

Participants: Charles Paperman‌, Sylvain Salvati.

Most of the data‌ exchanged on the web is now represented as‌ JSON files which are text documents (other document‌ formats are also in use such as XML or YAML documents). This‌ is a consequence of‌ the fact that many‌‌ data engine now operate and communicate with these‌ kinds of data. Extracting‌ information and answering queries‌‌ from textual documents has become a pervasive task‌ when dealing with large‌ amount of data. However‌‌ with the growth of data that comes with‌ larger and larger documents,‌ processing textual documents has‌‌ become an important bottleneck in data processing.

For‌ the moment this bottleneck‌ is addressed first by‌‌ speeding up the translation of the documents into‌ in memory data structures‌ on which queries are‌‌ then executed. This translation, also called parsing,‌ is accelerated via low‌ level optimizations. The quickest‌‌ implementations have required expert programmers and complex costly‌ developments. They leverage specific‌ CPU instructions known as‌‌ vectorial instructions. The diversity of CPU architectures‌ require the production of‌ highly specialized code for‌‌ each of them. This work is worth doing‌ since the resulting code‌ is often accelerated by‌‌ one or two orders of magnitude compared to‌ the non-vectorial variant.

D-DAL‌ aims at developing tools‌‌ that automatize these optimization and reduce the cost‌ of vectorizing programs that‌ handle textual data. We‌‌ will not only optimize parsing, but a combination‌ of parsing with query‌ resolution. The PhD thesis‌‌ of Claire Soyez-Martin 46, 41 (conducted under‌ the supervision of Charles‌ Paperman and Sylvain Salvati)‌‌ advocates that such automations can be done via‌ a notion of vectorial‌ circuits. Our plan‌‌ is two fold: first study the compilation of‌ expressive queries into vectorial‌ circuits and then build‌‌ dedicated compilation schemes for vectorial circuits that takes‌ benefits of vectorial instructions.‌

Short term.

In continuation‌‌ of the work engaged with the PhD thesis‌ of Claire Soyez-Martin, we‌ are going to develop‌‌ further the model of vectorial circuits. The aim‌ is to obtain a‌ cost model for these‌‌ objects that reflects the efficiency of their compilation‌ to CPU and vectorial‌ instructions. We will also‌‌ identify application domains that can benefit from our‌ methods. We have already‌ identified several possibilities: biological‌‌ sequence analysis, network monitoring, pervasive streaming tasks in‌ relation with data (e.g.‌ utf8 validation, JSON processing,‌‌ etc.)

Medium term.

We plan to develop a‌ compilation chain that will‌ contain two steps. The‌‌ first step will consist in sets of compilers‌ from query fragments to‌ vectorial circuits. The second‌‌ will be a set compilers from vectorial circuits‌ to efficient vectorized hardware‌ code. At first we‌‌ will aim at X86_64 architectures. This compilation pipeline‌ will be the result‌ of ground work conducted‌‌ in algebraic automata theory.

Long term.

We plan‌ to extend our work‌ to other hardware components.‌‌ One of them is the RiscV platform which‌ has recently been endowed‌ with an interesting set‌‌ of vectorial instructions. Another one is the processing‌ in memory units (PIM)‌ that provide a highly‌‌ parallel architecture that could benefit from the relationship‌ between automata theory and‌ circuits.

Foundations of circuits:‌‌ circuit complexity, algebra and‌ abstract machines.

Participants: Antoine Amarilli, Mikaël Monet‌, Charles Parperman, Sylvain Salvati.

Our‌ work on circuits takes its root in deep‌ connections between logic, formal language and algebra. These‌ deep connections are at the heart of finite‌ state verification techniques that have been flourishing in‌ the last decades. We exploit these connections in‌ a different way. For us logic is connected‌ to queries as logical languages naturally express properties‌ on object which in turn coincides with the‌ use of declarative languages to extract information from‌ data. Moreover, logical formulae on data of bounded‌ size can be represented as circuits which extract‌ information from data. Languages or language transformations are‌ a natural counter part of logical formulae. In‌ general, they can be represented as finite state‌ machines. Finally, algebra (monoids, semigroups, forest algebras) naturally‌ expresses the invariants of these machines. Classes of‌ algebra, computation power of finite state machines, complexity‌ of circuits are then deeply connected. The classification‌ of algebras with respect to their properties often‌ coincides with natural classes of machines and of‌ circuits. Algebras form a useful mediation between machines‌ and circuits, that is, between sequential devices and‌ parallel ones. More generally, algebras formalize the structure‌ of the information that needs to be captured‌ when processing queries. It thus plays a central‌ role in many of the applications D-DAL aims‌ at developing. We plan to develop our understanding‌ of these deep connections in relation with querying‌ with applications in many objectives of D-DAL, most‌ notably: vectorization, incremental maintenance, enumeration, transducers, etc.

Short‌ term.

In connection to vectorization, we shall explore‌ particular classes of circuits and describe the classes‌ of languages they correspond to. We shall also‌ develop algebraic tools for forest algebras to enhance‌ the stream processing on tree structured documents. In‌ particular, we wish to extend the expressiveness of‌ the queries that our software rsonpath can solve.‌ Concerning incremental maintenance and enumeration, we shall workout‌ some algebraic invariants on trees that will allow‌ us to extend the methods that we have‌ developed for strings.

Medium term.

The algebraic work‌ on data (strings, trees, graphs, ...) can be‌ extended to more dynamic objects like transducers and‌ even to programs via denotational semantics. In general,‌ queries not only check properties of data, but‌ also transform data. Extending algebraic methods to such‌ dynamic setting will allow us to abstract data‌ processing pipelines and open the way to further‌ optimizations.

Long term.

In the long term, we‌ aim at developing a coherent corpus of algebraic‌ methods for trees and graphs that lend themselves‌ to classify queries on structured documents and graphs.‌ We wish also to attack difficult classification problems‌ that remain open on circuit complexity for classes‌ of regular languages. This ground work shall set‌ the stage for future applications similar to rsonpath‌.

3.2 First Floor: Logic and Query Evaluation‌

We now present scientific objectives which operate at a higher level, and‌ deal with query evaluation‌ itself, rather than with‌‌ the tools presented in the previous section which‌ are ways to implement‌ query evaluation.

Provenance, explanation,‌‌ aggregation, counting, uncertainty, probabilistic data, Shapley, approximation algorithms.‌

Participants: Antoine Amarilli,‌ Mikaël Monet.

As‌‌ already mentioned in Section 2, querying data‌ often goes beyond merely‌ computing a set of‌‌ answers and returning them. First, data is often‌ uncertain by nature (e.g.,‌ probabilistic or missing values).‌‌ Then, algorithms must be developed that take into‌ account this uncertainty, by‌ considering all possible states‌‌ of the data; which turns out to often‌ be intractable. In this‌ context, approximate query answers‌‌ (e.g., approximating the probability that a probabilistic database‌ satisfies a query) can‌ be employed to lower‌‌ the complexity. Second, one sometimes wishes to explain‌ the query results, using‌ for instance the notion‌‌ of provenance or using specific scoring mechanisms such‌ as the Shapley value‌ or the notion of‌‌ resilience of a query result.

Short term.

It‌ is known that when‌ probabilistic data has bounded‌‌ treewidth, then probabilistic query evaluation (PQE) can‌ be solved efficiently for‌ many queries. We shall‌‌ explore restrictions that are less limiting than bounded‌ treewidth but could still‌ ensure tractable PQE. One‌‌ idea would be to design width measures that‌ could be parameterized by‌ the queries. Ideally, we‌‌ would also obtain lower bounds for specific width‌ measures and queries as‌ it was done in‌‌ 28.

Medium term.

Another research direction would‌ be to explore the‌ tractability of diverse counterfactual‌‌ scoring mechanisms, such as Shapley values as done‌ in 32, 39‌, or the resilience‌‌ 35 (intuitively representing how “robust” a query answer‌ is, by determing how‌ many input tuples should‌‌ be removed from the database so that the‌ specific answer no longer‌ exists) for queries featuring‌‌ recursion.

Long term.

An intriguing and potentially fruitful‌ research direction would be‌ to connect the kind‌‌ of counting problems studied in database theory (such‌ as PQE or Shapley‌ values) to other well-established‌‌ counting problems from other fields such as counting‌ versions of constraint satisfaction‌ problems (CSPs) 33,‌‌ 38 or Holant problems and holographic reduction 30‌. We will also‌ strive to provide a‌‌ toolbox to use these algorithms in practice, leveraging‌ for instance the existing‌ implementation of ProvSQL 45‌‌, an open-source PostgreSQL extension that computes (between‌ other things) the Boolean‌ provenance of a query‌‌ over a database.

Dynamic data: incremental and dynamic‌ algorithms and connections with‌ dynamic algorithms.

Participants: Antoine‌‌ Amarilli, Mikaël Monet, Charles Paperman,‌ Sylvain Salvati.

In‌ many query evaluation scenarios,‌‌ the data to interrogate is regularly updated. When‌ re-evaluating queries multiple times‌ over different versions of‌‌ the same data, it is more efficient to‌ build index data structures‌ which can maintain the‌‌ results of the query under updates, rather than‌ re-evaluating the query every‌ time from scratch. This‌‌ problem of query evaluation‌ over dynamic data has been studied in many‌ different contexts. The D-DAL team will focus on‌ two broad kinds of contexts: tree-shaped data on‌ the one hand, which includes texts, trees, genomic‌ sequences, etc.; and unrestricted data, namely graph‌ databases and general relational databases.

Short term.

In‌ the short term, we intend to focus on‌ the first application setting of texts and trees,‌ and pursue our ongoing work in these settings.‌ While the main results in the area have‌ already been proven, some work remains in expanding‌ the frontiers of what can be done efficiently‌ over dynamic text and trees: can we maintain‌ data structures for in-order enumeration? in which orders?‌ with which complexity? One important direction which remains‌ open is that of showing improved bounds on‌ query evaluation which depends on the specific query‌ (rather than for a general query class). From‌ our initial results for regular queries on words‌ 26, one especially promising ongoing effort is‌ to extend these results to trees. Another is‌ to extend such results to enumeration data structures.‌ In both cases, significant algebraic groundwork is required.‌

Medium term.

We then intend to focus on‌ broader directions. One of them is to support‌ more expressive updates: the bulk of the‌ literature on dynamic data focuses on single-character or‌ single-tuple changes, but real updates are often more‌ expressive. On text, we want to be able‌ to support copy-and-paste updates, which move an entire‌ part of the text to a different point;‌ or search-and-replace updates. On relational databases, we want‌ to develop incremental algorithms that cope with updates‌ which are themselves expressed in SQL, i.e., with‌ UPDATE statements. Another direction is to move to‌ the study of incremental maintenance for more expressive‌ queries, in particular those featuring recursion or‌ aggregation. For instance, can we characterize how efficiently‌ we can maintain the results of Datalog queries,‌ with fine bounds that depend on the specific‌ Datalog query which is posed? Which assumptions on‌ the underlying data can be used to achieve‌ tractable algorithms?

Long term.

A more open direction‌ for query evaluation over dynamic data would be‌ to achieve precise bounds on the complexity of‌ maintaining each individual query by charting the space‌ of possible tradeoffs across various performance measures: what‌ is the time needed to handle updates to‌ the data; what is the delay between successive‌ results? How much memory usage is required to‌ store the data structure? Which updates are supported?‌ In this light, one especially challenging obstacle is‌ the need for suitable lower bound techniques, in‌ particular unconditional lower bounds, which are sometimes achievable‌ depending on the model studied. Another general question‌ is to connect more broadly to the literature‌ studying data structures, in which questions about mutable‌ data have been investigated for decades. We could‌ for instance leverage results on maintaining search trees‌ balanced under updates, or algorithms on dynamic graphs 37 which have been‌ studied in the algorithms‌ community (especially algorithms to‌‌ dynamically maintain the connectivity of directed graphs).

Efficient‌ evaluation: streaming and parallelism.‌

Participants: Aurélien Lemay,‌‌ Charles Paperman, Sylvain Salvati.

Efficiency in‌ data processing mostly comes‌ from streaming data. This‌‌ means that data is passed from small processing‌ units to others so‌ as to produce an‌‌ output. Certain operations cause serious synchronization problems (sorting‌ data, eliminating duplicates, combining‌ streams, etc.) they block‌‌ the processing and cause sometimes important lags. Sticking‌ to functional programming, D-DAL‌ will develop streaming data‌‌ processing programs by using static analysis, speculative computations‌ (to prepare results when‌ lags are unavoidable), develop‌‌ stream fusion compilation methods so as to obtain‌ efficient streaming algorithms. Techniques‌ coming from synchronous programming‌‌ will be considered, in particular in connection with‌ the leveraging of fine-grained‌ parallelism for low level‌‌ streaming. Also, in the medium term, we shall‌ explore how to parallelize‌ query executions taking advantage‌‌ of static analyses of the functional code.

Short‌ term.

Part of the‌ effort will concern very‌‌ simple classes of functional programs known as transducers‌. D-DAL will implement‌ libraries for manipulating transducers,‌‌ minimizing them and composing them. Several extensions of‌ the models of the‌ literature are being proposed‌‌ by the team and will benefit from these‌ implementations. Of particular interest‌ to us are higher-order‌‌ transducers which form an elegant generalization of many‌ transducers from the literature.‌ We shall work on‌‌ compiling them for streaming documents.

Medium term.

Orchestrating‌ programs that stream data‌ is challenging as they‌‌ may work at different paces and as communications‌ between them do not‌ have a negligible cost.‌‌ We will work on this orchestration problem in‌ connection with programming languages‌ theory, most notably with‌‌ synchronous and functional programming languages. Indeed, when we‌ have compiled programs with‌ our method, we shall‌‌ be able to efficiently compose them. Many technical‌ methods for achieving this‌ have been developed for‌‌ these programming languages.

Long term.

In the long‌ run, we will take‌ stock on the work‌‌ of the community of functional probabilistic programming. Indeed,‌ when working with huge‌ data, approximate querying becomes‌‌ hardly avoidable and thus extending our methods to‌ sampling data with various‌ probabilistic distributions becomes necessary.‌‌ In particular, we shall see how we implement‌ probabilistic querying by means‌ of transducers.

Enumerating the‌‌ results of recursive queries.

Participants: Antoine Amarilli,‌ Mikaël Monet, Charles‌ Paperman, Sylvain Salvati‌‌.

One important challenge in modern database theory‌ is efficient enumeration algorithms‌ 44, i.e., efficiently‌‌ producing query results one after the other. This‌ is related to query‌ evaluation in streaming, but‌‌ with a key difference: here we are looking‌ for algorithms which produce‌ a stream of results,‌‌ instead of consuming a stream of input data.‌ Database theory has studied‌ which queries can efficiently‌‌ be evaluated, but an especially challenging topic is‌ to understand under which‌ assumptions it can be‌‌ done for recursive queries‌, e.g., Datalog queries. This task can in‌ particular be studied over restricted formalisms of input‌ data, such as textual data. This is the‌ focus of document spanners 34, which are‌ a declarative way to specify how to extract‌ matches of queries over textual documents. There, recursivity‌ is needed to express long-distance dependencies (e.g., extract‌ all phone numbers that are between two specific‌ delimiters, no matter how far).

Short term.

One‌ immediate research objective is to study which enumeration‌ tasks for (recursive) queries can be efficiently solved,‌ in particular when considering variations on the standard‌ setting of enumeration algorithms. For instance, in many‌ applications, we want to produce results in a‌ specific order; to directly access individual results‌ by their offset; or to efficiently test whether‌ a candidate result is actually a solution, or‌ finding out its position in the enumeration order.‌ We intend to pursue our ongoing work in‌ such directions 22.

Medium term.

A medium-term‌ research question is the combination of enumeration algorithms‌ with the previous axis of handling dynamic data‌. Previous work by members of the team‌ has already studied how enumeration algorithms could be‌ applied to dynamic data via knowledge compilation methods‌ 23. However, we are still missing an‌ understanding of how efficient such algorithms can be‌ made, especially if we want query-dependent algorithms. This‌ would probably require an algebraic theory of queries‌ which can return results (to cover document spanners),‌ in particular understanding their connection to transducers.

A‌ related medium-term research question is the idea of‌ enumerating query results via updates, i.e., producing‌ each query results not from scratch but by‌ editing a previous result. Team members have started‌ investigating this question in a specific context 29‌, but this enumeration framework could be fruitful‌ in many other settings. It also has an‌ intriguing connection to the question of query evaluation‌ over dynamic data, because the two problems are‌ related: query evaluation on dynamic data takes a‌ stream of changes as input and maintain query‌ results, whereas enumerating query results via updates will‌ produce a stream of changes to go over‌ all results to a query.

Long term.

Most‌ research about enumeration algorithms for recursive queries using‌ circuit methods focus on the case of textual‌ data (e.g., document spanners) and tree-shaped data. This‌ leaves open the setting of more general data,‌ in particular graph data. Of course, the task‌ has already been studied, in particular in the‌ comparatively simple setting of regular path queries 40‌. But we are still missing an understanding‌ of those enumeration tasks that can be solved‌ efficiently in more expressive query languages (e.g., Datalog‌ queries) and depending on the various possible semantics‌ (e.g., returning simple paths, trails, shortest paths, or‌ arbitrary walks). As many of these tasks will‌ be intractable, efficient algorithms will probably also need‌ to identify conditions on the input data that are sufficient to ensure‌ tractability.

Another wide open‌ direction for long-term research‌‌ in enumeration algorithms concerns lower bounds. Indeed, many‌ lower bounds are conditional,‌ and they usually do‌‌ not apply to enumeration but to related problems‌ (e.g., checking whether a‌ solution exists, or materializing‌‌ all solutions). An especially difficult challenge would be‌ to develop lower bound‌ methods that show the‌‌ intractability of efficient enumeration per se.

3.3 The‌ roof: Knowledge on Data‌

Optimization for schemas for‌‌ graphs.

Participants: Antoine Amarilli, Iovka Boneva,‌ Charles Paperman.

The‌ role of schemas for‌‌ graph data is a bit different than schemas‌ for relational data. The‌ latter are prescriptive (i.e.‌‌ data is organized exactly like this) and are‌ an essential ingredient for‌ static optimization of queries.‌‌ The former are de facto often descriptive (i.e.‌ data should look more‌ or less like this).‌‌ Such looseness of schemas is unavoidable, and even‌ desirable, for big data‌ stores or knowledge bases‌‌ which are constantly evolving, fed with data from‌ numerous sources that can‌ be incomplete, inconsistent and‌‌ contain errors.

D-DAL aims at providing tools for‌ increasing the usefulness of‌ such approximate schemas in‌‌ the context of dynamically evolving data. This line‌ of research continues our‌ work of the formalization,‌‌ the standardization the development and the tooling of‌ the shape expressions language‌ (ShEx) for RDF graphs,‌‌ and aims at extending on some of our‌ previous results and approaches‌ to property graphs and‌‌ PG-Schemas.

Short term.

A first short term objective‌ is to enrich graph‌ schema languages such as‌‌ ShEx with probabilistic information indicating the degree of‌ compliance to the actual‌ data. We will aim‌‌ at a practical extension with limited expressiveness in‌ order to keep feasible‌ the associated computational tasks‌‌ (data validation, schema extraction) while still helping experts‌ in assessing the quality‌ of a data set,‌‌ detecting errors and monitoring the evolution of data.‌ This approach will be‌ built on our previous‌‌ work in discovering schemas from existing data, while‌ extending it to property‌ graph schemas.

Medium term.‌‌

As all software, data-centric applications constantly evolve, either‌ to match new use‌ cases, or to exploit‌‌ new data sources, or both. This most often‌ requires to transform the‌ data model in order‌‌ to match the new processing needs. As a‌ mid-term objective we plan‌ to investigate algorithms for‌‌ incremental and conjoint data and schema evolution, based‌ on logic tools such‌ as mappings and the‌‌ chase.

Although there exist a number of native‌ graph data stores, lots‌ of data is effectively‌‌ stored in relational or other data stores (e.g.,‌ Wikibase for wikidata) and‌ is only exposed as‌‌ a graph to the user. There might also‌ be the case that‌ data from streams is‌‌ processed and integrated to a knowledge graph. The‌ original data can contain‌ schema information or satisfy‌‌ some constraints. As a mid-term objective, we plan‌ to investigate lifting such‌ constraints from the underlying‌‌ data to the graph‌ view. We will focus on useful and feasible‌ sub-classes.

Long term.

Finally, we will continue our‌ participation in Shape Expressions Working group which defines‌ the ShEx language, currently in the process of‌ standardization at IEEE. We will also continue to‌ participate in the implementation of a validator for‌ ShEx and related tools as open software.

Static‌ analysis with constraints (open-world query answering, query rewriting‌ under constraints, consistency, certain answers).

Participants: Antoine Amarilli‌, Iovka Boneva, Aurélien Lemay, Sylvain‌ Salvati, Sophie Tison.

Contrary to the‌ closed world assumption of traditional database research, the‌ absence of a piece of information often does‌ not mean that this piece of information is‌ false, but rather that we do not know‌ its status; this is called the open world‌ assumption. In this scenario, the data is‌ typically enriched by constraints that it should satisfy,‌ such as key, cardinality or equality constraints. Several‌ questions arise when querying data under the open‌ world assumption. Is the data consistent with the‌ constraints we know about it? Taking into account‌ the constraints and the fact that the information‌ is incomplete, what degree of certainty can we‌ have about the answer set of queries? Is‌ it possible to take advantage of the constraints‌ so as to execute simpler queries (e.g., using‌ rewriting techniques)? Moreover, it may not be possible‌ to know which constraints apply to the data‌ at stake, in which case it can then‌ be desirable to infer knowledge using machine learning‌ methods. D-DAL will apply the querying methods it‌ develops to that setting so as to produce‌ added values from knowledge bases.

Short term.

We‌ will pursue work we have engaged with the‌ PhD of Lily Gallois 36 on regular path‌ queries (RPQs) and word constraints on graph databases.‌ Word constraints are the simplest kind of constraints‌ that can be put on a graph database;‌ they seem to naturally connect with rewrite systems.‌ As we showed in 42, this relation‌ is however more complicated that expected. We are‌ going to study further the classes of constraints‌ that enable reasoning about RPQs and their decidability.‌ We will also explore the problems of certain‌ query answering and query rewriting in this setting.‌ The aim is to propose efficient query resolution‌ algorithms from which we can build more complex‌ queries, in particular queries with aggregation.

We are‌ also going to work on schema and constraints‌ inference for graph databases. We have skills in‌ symbolic learning methods which have proved useful in‌ extracting information from the web. Facing large graphs,‌ these methods will have difficulties to scale, moreover‌ they are often not robust to noise. Machine‌ learning methods, typically neural networks, are more immune‌ to this problem. We shall develop a hybrid‌ approach: the method that we envision consists in‌ learning structures with neural networks and then extracting‌ the information that is useful in the context of data processing by‌ means of symbolic techniques.‌

Medium term.

The D-DAL‌‌ team intends to study several medium-term problems related‌ to open-world query answering‌ and reasoning under rules.‌‌ One especially interesting question is how the notion‌ of circuits can be‌ leveraged to achieve tractable‌‌ algorithms for these tasks. The hope is that‌ circuits can be a‌ versatile tool unlocking many‌‌ results that would otherwise need to be shown‌ independently: supporting uncertain facts‌ (and uncertain rules); providing‌‌ provenance information to capture how a specific fact‌ is derived and explain‌ to the user why‌‌ a query result is true (given the data‌ and the rules); giving‌ factorized representations of a‌‌ possibly large number of query answers. At a‌ technical level, a broad‌ question is how forward-chaining‌‌ deduction techniques such as the chase can be‌ extended to create a‌ circuit representation of the‌‌ derivations. This is challenging because the chase, unlike‌ circuits, may represent some‌ cyclic dependencies, which may‌‌ cause convergence issues depending on the semiring in‌ which the computation is‌ performed. Another question is‌‌ how this can be combined with backward-chaining methods‌ such as query rewriting.‌

Long term.

In the‌‌ long run, one unifying challenge of the ontology-mediated‌ query answering community is‌ the question of unifying‌‌ different logical frameworks. Indeed, while query languages are‌ relatively standard, rules can‌ nowadays be specified in‌‌ many different incompatible ways: description logics, existential rules,‌ graph patterns, database constraint‌ languages such as TGDs,‌‌ etc. It is challenging to understand the connections‌ between these settings and‌ difficult to conceive systems‌‌ that could orchestrate query evaluation across back-ends using‌ different query formalisms. Another‌ more algorithmic challenge concerns‌‌ the complexity of ontology-mediated queries: while there are‌ many results showing complexity‌ upper and lower bounds‌‌ for general classes (e.g., guarded TGDs), it is‌ much more challenging to‌ try to identify precise‌‌ complexities of individual queries, in the spirit‌ of what is done‌ for less expressive queries‌‌ such as conjunctive queries. Thus, the D-DAL team‌ aims at making progress‌ towards this very ambitious‌‌ theoretical goal.

4 Application domains

The project of‌ the D-DAL team aims‌ at efficiently extracting information‌‌ from large datasets. We aim to process data‌ in all the diversity‌ of its representations, and‌‌ to take into account its quality, including questions‌ of uncertainty and probability.‌ Our approach is to‌‌ develop theoretical results that can later be implemented‌ into concrete programs. As‌ such, our project relates‌‌ to several Inria strategic axes:

Data Science and‌ Artificial Intelligence
The work‌ of our team will‌‌ study how to efficiently produce data for other‌ downstream applications that analyze‌ it or aggregate it‌‌ from various sources. This is key in analysing‌ data where large chunks‌ of data are processed‌‌ to produce statistics and other kinds of aggregates.‌ Apart from efficient data‌ processing, the longer terms‌‌ objectives of the project aim to incorporate randomized‌ programming to support wider‌ possible scenarios in data‌‌ science. Similarly, machine learning‌ algorithms are fed with data. These algorithms require‌ efficient data manipulation and also a control over‌ data quality. Our work on uncertainty and provenance‌ is useful in this setting. Finally, the older‌ sense of AI (symbolic AI) is connected to‌ D-DAL through the study of circuits and knowledge‌ compilation. We hope that these connections will reveal‌ fruitful and that our contributions will also have‌ impact in this field.
Software and Hardware

We‌ intend to develop query execution engines, querying platform‌ and schemas validators for graphs. Moreover, D-DAL will‌ also have interests in query languages, their extensions,‌ and their execution. These topics make us have‌ close interest in compilation techniques and programming language‌ methods. We adopt an integrating view of query‌ processing that embeds computational capabilities. This leads to‌ see data processing and its interactions with other‌ software differently, delegating more and more computation to‌ query engines.

D-DAL will develop compilation methods for‌ various types of query, taking advantage of parallel‌ hardware capabilities. In particular, we intend to develop‌ tools to analyze queries so as to detect‌ opportunities for parallel execution. Models of abstract circuits‌ will allow us to harness the power of‌ modern hardware, in particular making better use of‌ SIMD instructions. We also plan to work on‌ RISC V architectures and processing in memory units‌ (PIM). Databases have always had a tradition of‌ exploiting hardware capabilities. However, with open source architectures‌ such as RISC V, the particular needs of‌ databases may be addressed by particular instructions. We‌ hope that our work can have influence on‌ RISC V architectures and that specific hardware design‌ for data processing can emerge from our research.‌
Databases
Our view of databases is different from‌ the integrating view that SQL standardization gives. While‌ SQL continues to integrate in a top-down manner‌ more and more querying capabilities under the same‌ software, the approach of D-DAL is to combine‌ with optimized streaming techniques, in a bottom-up fashion,‌ well-defined and limited query languages for which we‌ have identified highly efficient algorithms. Though more difficult‌ to approach than the SQL view, our view‌ may be more adapted for data processing in‌ contexts where data is scattered within files in‌ different formats or is too large for database‌ management systems.

5 Social and environmental responsibility

5.1‌ Footprint of research activities

Antoine Amarilli
is environmental‌ chair in the steering committee of the informal‌ conference “Highlights of Logic, Games, and Automata” (since‌ 2024).
Iovka Boneva
is the leader of the‌ Sustainable Development Commission of the CRIStAL research lab.‌
Antoine Amarilli
is a co-animator of the TCS4F‌ pledge on sustainable research in theoretical computer science‌ (since 2020).
Sophie Tison
is a member of‌ the general assembly of the European Association for‌ Theoretical Computer Science (EATCS) (elected in 2019).

5.2‌ Parity and equal opportunities in science

Sophie Tison‌
Member of the steering committee of the Programme‌ de mentorat Femmes et Sciences de Lille.
Sylvain Salvati
Member of commission‌ parité et égalité des‌ chances at Inria.

5.3‌‌ Impact of research results

Databases and methods from‌ Artificial Intelligence are used‌ in virtually all aspects‌‌ of the modern digitalized world, from companies' web‌ services to governments' institutions.‌

6 Highlights of the‌‌ year

6.1 Creation of the D-DAL team-project

The‌ D-DAL team-project has been‌ created in October 2025‌‌ It is the result of the collective work‌ of the team members.‌ We are greatful to‌‌ INRIA for its trust and support. In particular,‌ INRIA allowed us to‌ recruit Antoine Amarilli this‌‌ year has been very important to us both‌ in term of team‌ life and of scientific‌‌ achievements. This further allowed us to recruit numerous‌ students to work with‌ us:

Trainees
Thibault Tisserant,‌‌ Paul Raphaël, Rémi de Pretto, Vincent Peth, Harshul‌ Sagar.
PhD students
Sébastien‌ Labbe, Loup Lobet, Arthur‌‌ Lombardo, Étienne Parent

7 Latest software developments, platforms,‌ open data

7.1 Application‌ of database techniques to‌‌ genomic

Charles Paperman has developed an application axis‌ in genomic in collaboration‌ with the CRIStAL team‌‌ Bonsai. Two PhDs have started to apply database‌ methods to genomic. Very‌ promising results have already‌‌ been obtained with software that work better than‌ the state of the‌ art. The software Vizitig‌‌ that allows users to query large genomic data‌ is developing at a‌ fast pace.

7.2 Latest‌‌ software developments

7.2.1 helicase

Keywords:
SIMD, Parsing, Bioinformatics,‌ Rust
Functional Description:
It‌ is a rust crate‌‌ (library) that allows to parse and extract relevant‌ information classical format in‌ bio-informatics. It servers as‌‌ a demonstration of a methodology that take benefits‌ of SIMD instructions through‌ analysis from automata theory‌‌ in the scope of the Shannon Meet Cray‌ ANR project
URL:
https://github.com/imartayan/helicase‌
Contact:
Charles Paperman

7.2.2‌‌ NetworkDisk

Name:
NetworkDisk
Keywords:
Large graphs, Python, Databases‌
Functional Description:
NetworkDisk provides‌ a way to manipulate‌‌ graphs on disk. The goal is to be‌ as much as possible‌ compatible with (Di)Graph objects‌‌ of the NetworkX Python package but lifting memory‌ requirement and providing persistence‌ of the Graph.
URL:‌‌
https://networkdisk.inria.fr/
Contact:
Charles Paperman

7.2.3 rsonpath

Keywords:
JSon,‌ Streaming, SIMD, Rust
Functional‌ Description:
The rsonpath crate‌‌ provides a JSONPath parser and a query execution‌ engine, which utilizes SIMD‌ instructions to provide massive‌‌ throughput improvements over conventional engines.
URL:
https://github.com/V0ldek/rsonpath
Contact:‌
Charles Paperman
Partners:
Warsaw‌ University, Technical University of‌‌ Munich (TUM)

7.2.4 ShapeDesigner

Name:
ShapeDesigner
Keywords:
Validation,‌ Data Exploration, Verification
Functional‌ Description:
ShapeDesigner allows construct‌‌ a ShEx or SHACL schema for an existing‌ dataset. It combines algorithms‌ to analyse the data‌‌ and automatically extract shape constraints, and to edit‌ and validate shape schemas.‌
URL:
https://gitlab.inria.fr/jdusart/shexjapp
Contact:
Jeremie‌‌ Dusart
Partner:
CRIStAL

7.2.5 ShEx validator

Name:
Validation‌ of Shape Expression schemas‌
Keywords:
Data management, RDF‌‌
Functional Description:
Shape Expression schemas is a formalism‌ for defining constraints on‌ RDF graphs. This software‌‌ allows to check whether a graph satisfies a‌ Shape Expressions schema.
Release‌ Contributions:

ShExJava now uses‌‌ the Commons RDF API‌ and so support RDF4J, Jena, JSON-LD-Java, OWL API‌ and Apache Clerezza. It can parse ShEx schema‌ in the ShEcC, ShEJ, ShExR formats and can‌ serialize a schema in ShExJ.

To validate data‌ against a ShExSchema using ShExJava, you have two‌ different algorithms:

- the refine algorithm: compute once‌ and for all the typing for the whole‌ graph

- the recursive algorithm: compute only the‌ typing required to answer a validate(node,ShapeLabel) call and‌ forget the results.
URL:
https://github.com/iovka/shex-java
Contact:
Iovka Boneva‌
Partner:
CRIStAL

7.2.6 vizitig

Keywords:
Genomics, Transcriptomics, Databases,‌ De Bruijn graphs, Data visualization
Functional Description:
Vizitig‌ is a genomics and transcriptomics exploration software built‌ around a de Bruijn graph that directly encodes‌ k-mers from raw sequencing data, whether assembled or‌ not. It enables querying very large sequencing datasets‌ without prior alignment by combining sequence structure, biological‌ annotations, and sample metadata within a single queryable‌ framework. Users can search for regions of interest‌ based on genes, loci, biomarkers, or sample traits‌ (presence, abundance, etc.), and interactively explore only the‌ relevant graph neighborhoods in a scalable way. Vizitig‌ provides a human-readable query language, a full-featured web‌ interface, as well as command-line and Python interfaces,‌ facilitating integration with standard analysis tools and data‌ sharing.
Contact:
Charles Paperman

8 New results

8.1‌ Circuits for Data Manipulation

Participants: Antoine Amarilli,‌ Oliver Irwin, Mikael Monet, Sylvain Salvati‌.

8.1.1 Studying circuits from Knowledge Compilation

In‌ 16, Florent Capelli, Oliver Irwin and Sylvain‌ Salvati present an elementary branch and bound algorithm‌ with a simple analysis of why it achieves‌ worstcase optimality for join queries on classes of‌ databases defined respectively by cardinality or acyclic degree‌ constraints. They then show that if one is‌ given a reasonable way for recursively estimating upper‌ bounds on the number of answers of the‌ join queries, the algorithm can be turned into‌ an algorithm for uniformly sampling answers with expected‌ running time $\tilde{𝒪 ( 𝚄𝙿 / 𝙾𝚄𝚃)‌}$ where $𝚄𝙿$ is the upper bound, $𝙾𝚄𝚃‌$ is the actual number of answers and $\overset{˜}{𝒪‌ (\cdot)}$ ignores polylogarithmic factors. Their‌ approach recovers recent results on worstcase optimal join‌ algorithm and sampling in a modular, clean and‌ elementary way.

8.1.2 Foundations of circuits: circuit complexity,‌ algebra and abstract machines

In 14, with‌ Florin Manea, Tina Ringleb and Markus Schmid, Antoine‌ Amarilli has studied linear time subsequence and supersequence‌ regex matching. It is well-known that checking whether‌ a given string $w$ matches a given regular‌ expression $r$ can be done in quadratic time‌ $𝒪 (| w | \times | r‌ |)$ and that this cannot be improved‌ to a truly subquadratic running time of $𝒪‌ ((| w | \times {| r‌ |)}^{1 - ε})$ assuming the‌ strong exponential time hypothesis (SETH). They study a‌ different matching paradigm where they ask instead whether‌ $w$ has a subsequence that matches $r$ , and show that regex‌ matching in this sense‌ can be solved in‌‌ linear time $𝒪 ( | w | +‌ | r |)‌$ . Further, the same‌‌ holds if asked for a supersequence. They show‌ that the quantitative variants‌ where one wants to‌‌ compute a longest or shortest subsequence or supersequence‌ of $w$ that matches‌ $r$ can be solved‌‌ in $𝒪 (| w | \times |‌ r |)$ ,‌ i. e., asymptotically no‌‌ worse than classical regex matching; and they show‌ that $𝒪 (|‌ w | + |‌‌ r |)$ is conditionally not possible for‌ these problems. They also‌ investigate these questions with‌‌ respect to other natural string relations like the‌ infix, prefix, left-extension or‌ extension relation instead of‌‌ the subsequence and supersequence relation. They further study‌ the complexity of the‌ universal problem where one‌‌ asks if all subsequences (or supersequences, infixes, prefixes,‌ left-extensions or extensions) of‌ an input string satisfy‌‌ a given regular expression.

In 19, Antoine‌ Amarilli, Mikaël Monet and‌ Rémi De Pretto study‌‌ the class of local languages which is a‌ well-known subclass of the‌ regular languages that admits‌‌ many equivalent characterizations. In this short note they‌ establish the PSPACE-completeness of‌ the problem of determining,‌‌ given as input a nondeterministic finite automaton (NFA)‌ $A$ , whether the‌ language recognized by $A‌‌$ is local or not. This contrasts with the‌ case of deterministic finite‌ automata (DFA), for which‌‌ the problem is known to be in PTIME.‌

In 18, Antoine‌ Amarilli, Mikaël Monet and‌‌ Rémi De Pretto study two rewrite rules on‌ hypergraphs, called edge-domination and‌ node-domination, and show that‌‌ they are confluent. These rules are rather natural‌ and commonly used before‌ computing the minimum hitting‌‌ sets of a hypergraph. Intuitively, edge-domination allows one‌ to remove hyperedges that‌ are supersets of another‌‌ hyperedge, and node-domination allows one to remove nodes‌ whose incident hyperedges are‌ a subset of that‌‌ of another node. They show that these rules‌ are confluent up to‌ isomorphism, i.e., if one‌‌ applies any sequences of edge-domination and node-domination rules,‌ then the resulting hypergraphs‌ can be made isomorphic‌‌ via more rule applications. This in particular implies‌ the existence of a‌ unique minimal hypergraph, up‌‌ to isomorphism.

In 17, Antoine Amarilli with‌ Benoît Groz study the‌ cutwidth measure on graphs‌‌ and ways to bound the cutwidth of a‌ graph by partitioning its‌ vertices. They consider bounds‌‌ expressed as a function of two quantities: on‌ the one hand, the‌ maximal cutwidth $y$ of‌‌ the subgraphs induced by the classes of the‌ partition, and on the‌ other hand, the cutwidth‌‌ $x$ of the quotient multigraph obtained by merging‌ each class to a‌ single vertex. They consider‌‌ in particular the decomposition of directed graphs into‌ strongly connected components (SCCs):‌ in this case, $y‌‌$ is the maximal cutwidth of an SCC, and‌ $x$ is the cutwidth‌ of the directed acyclic‌‌ condensation multigraph. They show‌ that the cutwidth of a graph is always‌ in $O (x + y)$ ,‌ specifically it can be upper bounded by $1‌ . 5 x + y$ . They also‌ show a lower bound justifying that the constant‌ $1.5$ cannot be improved in general.‌

8.2 Logic and Query Evaluation

Participants: Antoine Amarilli‌, Mikael Monet, Charles Paperman.

8.2.1‌ Provenance, explanation, aggregation, counting, uncertainty, probabilistic data, Shapley,‌ approximation algorithms.

In 9, Antoine Amarilli et‌ al. study approximation algorithms for queries in probabilistic‌ graphs. Query evaluation over probabilistic databases is notoriously‌ intractable – not only in combined complexity, but‌ often in data complexity as well. This motivates‌ the study of approximation algorithms, and particularly of‌ combined FPRASes, with runtime polynomial in both the‌ query and instance size. In this paper, they‌ focus on tuple-independent probabilistic databases over binary signatures,‌ i.e., probabilistic graphs, and study when they can‌ devise combined FPRASes for probabilistic query evaluation. They‌ settle the complexity of this problem for a‌ variety of query and instance classes, by proving‌ both approximability results and (conditional) inapproximability results doubled‌ with (unconditional) DNNF provenance circuit size lower bounds.‌ This allows them to deduce many corollaries of‌ possible independent interest. For example, they show how‌ the results of Arenas et al. on counting‌ fixed-length strings accepted by an NFA imply the‌ existence of an FPRAS for the two-terminal network‌ reliability problem on directed acyclic graphs: this was‌ an open problem until now. They also show‌ that one cannot extend a recent result of‌ van Bremen and Meel that gives a combined‌ FPRAS for self-join-free conjunctive queries of bounded hypertree‌ width on probabilistic databases: neither the bounded-hypertree-width condition‌ nor the self-join-freeness hypothesis can be relaxed. They‌ last show how their methods can give insights‌ on the evaluation and approximability of regular path‌ queries (RPQs) on probabilistic graphs in the data‌ complexity perspective, showing in particular that some of‌ them are (conditionally) inapproximable.

In 10, with‌ Wolfgang Gatterbauer and Neha Makhija, Antoine Amarilli and‌ Mikaël Monet have investigated the complexity classification of‌ resilience for regular path queries. The resilience problem‌ for a query and an input set or‌ bag database is to compute the minimum number‌ of facts to remove from the database to‌ make the query false. In this paper, they‌ study how to compute the resilience of Regular‌ Path Queries (RPQs) over graph databases. Their goal‌ is to characterize the regular languages $L$ for‌ which it is tractable to compute the resilience‌ of the existentially-quantified RPQ built from $L$ .‌ They show that computing the resilience in this‌ sense is tractable (even in combined complexity) for‌ all RPQs defined from so-called local languages. By‌ contrast, they show hardness in data complexity for‌ RPQs defined from the following language classes (after‌ reducing the languages to eliminate redundant words): all‌ finite languages featuring a word containing a repeated letter, and all languages‌ featuring a specific kind‌ of counterexample to being‌‌ local (which they call four-legged languages). The latter‌ includes in particular all‌ languages that are not‌‌ star-free. Their results also imply hardness for all‌ non-local languages with a‌ so-called neutral letter. They‌‌ also highlight some remaining obstacles towards a full‌ dichotomy. In particular, for‌ the RPQ $a b‌‌ c | b e$ , resilience is tractable‌ but the only known‌ PTIME algorithm uses submodular‌‌ function optimization.

8.2.2 Dynamic data: incremental and dynamic‌ algorithms and connections with‌ dynamic algorithms

In 13‌‌, with Corentin Barloy and Louis Jachiet, Antoine‌ Amarilli and Charles Paperman‌ study the dynamic membership‌‌ problem for regular tree languages under relabeling updates:‌ we fix an alphabet‌ $Σ$ and a regular‌‌ tree language $L$ over $Σ$ (expressed, e.g., as‌ a tree automaton), we‌ are given a tree‌‌ $T$ with labels in $Σ$ , and we‌ must maintain the information‌ of whether the tree‌‌ $T$ belongs to $L$ while handling relabeling updates‌ that change the labels‌ of individual nodes in‌‌ $T$ . Their first contribution is to show‌ that this problem admits‌ an $𝒪 (log‌‌ (n) / log (log (‌ n)))‌$ algorithm for any fixed‌‌ regular tree language, improving over known $𝒪 (‌ log (n)‌)$ algorithms. This generalizes‌‌ the known $𝒪 ( log (n)‌ / log (log‌ (n))‌‌)$ upper bound over words, and it matches‌ the lower bound of‌ $Ω (log (‌‌ n) / log (log (n‌)))$ from‌ dynamic membership to some‌‌ word languages and from the existential marked ancestor‌ problem. Their second contribution‌ is to introduce a‌‌ class of regular languages, dubbed almost-commutative tree languages,‌ and show that dynamic‌ membership to such languages‌‌ under relabeling updates can be decided in constant‌ time per update. Almost-commutative‌ languages generalize both commutative‌‌ languages and finite languages: they are the analogue‌ for trees of the‌ ZG languages enjoying constant-time‌‌ dynamic membership over words. Their main technical contribution‌ is to show that‌ this class is conditionally‌‌ optimal when one assumes that the alphabet features‌ a neutral letter, i.e.,‌ a letter that has‌‌ no effect on membership to the language. More‌ precisely, they show that‌ any regular tree language‌‌ with a neutral letter which is not almost-commutative‌ cannot be maintained in‌ constant time under the‌‌ assumption that the prefix-U1 problem from 26 also‌ does not admit a‌ constant-time algorithm.

8.3 Knowledge‌‌ on Data

Participants: Iovka Boneva, Sylvain Salvati‌, Sophie Tison.‌

8.3.1 Optimization for schemas‌‌ for graphs

Iovka Boneva has started a particularly‌ fruitful international collaboration with‌ researchers from Poland, Austria‌‌ and Italy which aims at bridging the gap‌ between graph schema languages‌ for RDF and for‌‌ property graphs. This resulted in a common core‌ definition of a schema‌ language that is at‌‌ the intersection of three‌ schema languages for graphs of different types.

In‌ 12, Iovka Boneva collaborated with several specialists‌ in schema languages for graphs. They established a‌ common model for three main graph schema languages:‌ SHACL and ShEx for RDF graphs, and PG-Schema‌ for property graphs. These languages are designed and‌ used for formalising properties of data and ensuring‌ data quality in diverse graph-shaped data driven applications‌ such as Semantic Web and databases. Each language‌ has its unique approach to defining constraints and‌ validating graph data, leaving potential users in the‌ dark about their commonalities and differences. The paper‌ provides concise formal definitions of the core components‌ of these languages, employs a uniform framework to‌ facilitate a comprehensive comparison between them, and identifies‌ a common set of functionalities, shedding light on‌ both overlapping and distinctive features. This work is‌ a first step towards interoperability of systems based‌ on different schema languages.

In 15, Iovka‌ Boneva with Jose E. Labra Gayo, Eric G.‌ Prud'hommeaux, Katherine Thornton and Andra Waagmeester presented formal‌ semantics for the ShEx schema language enriched with‌ an inheritance mechanism. It is inspired by inheritance‌ in object-oriented programming languages, and provides similar advantages‌ such as reuse, modularity, and more flexible data‌ modelling. Using an example, we explain the main‌ features of the inheritance mechanism. We present its‌ syntax and formal semantics. The semantics is an‌ extension of the semantics of ShEx 2.1. It‌ also directly yields a validation algorithm as an‌ extension of the previous ShEx validation algorithms.

8.3.2‌ Static analysis with constraints (open-world query answering, query‌ rewriting under constraints, consistency, certain answers)

With Pablo‌ Barceló and Gaëlle Fontaine, Sylvain Salvati and Sophie‌ Tison have corrected some proofs in 11.‌ This paper is about handling graph databases with‌ regular path constraints. Applications of graph databases are‌ prone to inconsistency due to interoperability issues. This‌ raises the need for studying query answering over‌ inconsistent graph databases in a simple but general‌ framework. They follow the approach of consistent query‌ answering (CQA), and study its data complexity over‌ graph databases for conjunctive regular-path queries (CRPQs) and‌ conjunctive regular-path constraints (CRPCs). They deal with subset,‌ superset and symmetric-difference repairs. Without restrictions, CQA is‌ undecidable for the semantics of superset- and symmetric-difference‌ repairs, and $Π_{2}^{P} - c o‌ m p l e t e$ for subset-repairs.‌ However, they identify restrictions on CRPCs and databases‌ that lead to decidability, and even tractability of‌ CQA.

9 Partnerships and cooperations

9.1 International research‌ visitors

9.1.1 Visits of international scientists

Invitation of‌ researchers to the team seminar

Our team often‌ invites external researchers (from French institutions or international)‌ to give research seminars. Participants sometimes stay a‌ few days to discuss research with the team.‌ Here is a non-exhaustive list of researchers that‌ gave a presentation in the team in 2025-2026:‌

Pieter Bonte [Ghent University]
Alexandra Rogova [University of‌ Warsaw]
Harry Vinall-Smeeth [RWTH Aachen University]
Sarah Kleest-Meißner [Hasselt University]
Corto Mascle‌ [MPI-SWS, Kaiserslautern]
Lê Thành‌ Dũng (Tito) Nguyen [Aix-Marseille‌‌ University]
Michaël Cadilhac [DePaul University]
Howard Straubing [Boston‌ College]
Rémi Morvan [freelance‌ software engineer (previously University‌‌ of Bordeaux)]

9.1.2 Visits to international teams

Research‌ stays abroad

Antoine Amarilli‌ visited the Institute of‌‌ Computer Science at the University of Göttingen (Germany),‌ in the context of‌ a collaboration with Florin‌‌ Manea, Markus Schmid, and Tina Ringleb, from February‌ 10 to February 16,‌ 2025.

9.2 National initiatives‌‌

9.2.1 ANR JCJC KCODA

Participants: Florent Capelli [PI]‌, Oliver Irwin,‌ Charles Paperman, Sylvain‌‌ Salvati.

Duration
2021-2026
Description
The aim of‌ KCODA is to study‌ how succinct representations can‌‌ be used to efficiently solve modern optimization and‌ AI problems that use‌ a lot of data.‌‌ We suggest using data structures from the field‌ of compilation of knowledge‌ that can represent large‌‌ datasets succinctly by factoring certain parts while allowing‌ efficient analysis of the‌ represented data. The first‌‌ goal of KCODA is to understand how one‌ can efficiently solve optimization‌ and training problems for‌‌ data represented by these structures. The second goal‌ of KCODA is to‌ offer better integration of‌‌ these techniques into the systems of database management‌ by proposing new algorithms‌ allowing to build factorized‌‌ representations of the data responses to DB requests‌ and by proposing encodings‌ of these representations inside‌‌ the DB.

9.2.2 ANR SxC

Participants: Loup Lobet‌, Charles Paperman [PI]‌, Sylvain Salvati.‌‌

Duration
2025-2030
Coordinator
Charles Paperman
Scientific Partners
LIP‌ (ENS Lyon), LCIS (Université‌ Grenobles Alpes) and LABRI‌‌ (Université de Bordeaux)
Objectives

Knowledge about handcrafted vectorization‌ algorithms is scattered among‌ various fields, system architectures,‌‌ programming languages paradigms and industrial pieces of codes.‌ The main objective of‌ the project is to‌‌ bring structure to this knowledge through novel conceptual‌ and practical tools. The‌ end goal is to‌‌ facilitate the writing of SIMD programs and to‌ pave the way to‌ future auto-vectorization methods for‌‌ stream processing. The main difficulty to achieve these‌ goals is to understand‌ the interplay of sequentiality‌‌ and bit-level parallelism. Circuit complexity, as introduced by‌ Shannon in the early‌ years of computer science,‌‌ offers a powerful framework to study such notion‌ of parallelism. In this‌ multidisciplinary project, we bring‌‌ together methodologies from the fields of circuit complexity‌ of automata, and programming‌ language design for parallel‌‌ architectures, with motivations from data processing and bioinformatics‌ applications.

The project’s results‌ will be implemented into‌‌ a compilation tool-chain. At the highest level, we‌ will design and develop‌ a domain specific language,‌‌ dubbed Vectoid. Its goal is to allow more‌ programmers to harness SIMD‌ instructions. Vectoid will be‌‌ compiled to a Vectorial Intermediate Representation (VIR) dedicated‌ to stream processing. In‌ turn VIR will be‌‌ compiled to low level code for various SIMD‌ architectures. The design of‌ this compilation tool-chain will‌‌ be informed by a theoretical investigation on the‌ expressivity of vectorial circuits.‌ It will be evaluated‌‌ on data processing and‌ bioinformatics benchmarks.

9.2.3 ANR EXPAND

Participants: Antoine Amarilli‌.

Duration
2025-2030
Coordinator
Michaël Thomazo
Scientific Partners‌
LABRI (Bordeaux), LIMOS (Clermont-Ferrand), LIRMM (Montpellier), DI ENS‌ (Paris), IRISA (Rennes), and CRIStAL (D-DAL and SPIRALS‌ team)
Objectives
The goal of EXPAND is to‌ expand the applicability of ontology-based query answering by‌ allowing for enhanced query languages, and provide richer‌ ways to manipulate and understand query answers.

9.3‌ Regional initiatives

In collaboration with Florent Capelli and‌ Stefan Mengel of Centre de Recherche en Informatique‌ de Lens (CRIL), Antoine Amarili and Mikaël Monet‌ have obtained a PostDoc grant from région Nord‌ via the CPER Cornelia. They will recruit a‌ post-doctoral student who is expected to start in‌ September 2026.

10 Dissemination

Participants: Antoine Amarilli,‌ Iovka Boneva, Oliver Irwin, Sébastien Labbe‌, Aurélien Lemay, Loup Lobet, Arthur‌ Lombardo, Mikael Monet, Charles Paperman,‌ Sylvain Salvati, Sophie Tison.

10.1 Promoting‌ scientific activities

Member of organizing committees

Mikael Monet‌ , Sophie Tison
member of the EDBT/ICDT'27 organizing‌ committee. The conference will take place in Lille‌ in 2027.

10.1.1 Scientific events: selection

Member of‌ the conference program committees

Antoine Amarilli
was member‌ of the program committee of UndoneCS 2026
Iovka‌ Boneva
was member of the program committee of‌ BDA 2025 (the French conference on data management)‌

Reviewer

Antoine Amarilli
has reviewed for various journals/conferences,‌ including STACS'26, ICDT'25, LMCS
Mikael Monet
has reviewed‌ for various journals/conferences, including PODS'26, JACM (2025), TODS‌ (2025), TOCL (2026)

10.1.2 Journal

Member of the‌ editorial boards

Mikael Monet
Managing editor for the‌ TheoretiCS journal.

10.1.3 Invited talks

Antoine Amarilli
gave‌ a talk at the invitation-only workshop FMT'25 (finite‌ model theory) in Les Houches, France. He was‌ also invited (without giving a talk) to Dagstuhl‌ workshop 25081 "Semirings in Databases, Automata, and Logic"‌

10.1.4 Leadership within the scientific community

Antoine Amarilli‌
Vice-President (elected in 2024) of the EATCS (European‌ Association for Theoretical Computer Sciences).
Charles Paperman
Elected‌ member of EATCS (European Association for Theoretical Computer‌ Sciences).
Sophie Tison
Member of the scientific committee‌ of the GDR IM.
Sophie Tison
Member of‌ the SSAAL Société des Sciences, de l'Agriculture et‌ des Arts de Lille.

10.1.5 Scientific expertise

Sophie‌ Tison
Expert for HCERES (vice-president of the DTIS-ONERA‌ committee).
Sophie Tison
External expert for IUF.
Sophie‌ Tison
Expert for the programme "Jeunes Talents France‌ 2025 L’Oréal-UNESCO Pour les Femmes et la Science".‌

10.1.6 Research administration

Iovka Boneva
Member of Inria‌ Lille CER (Commission des Emplois de Recherche).
Charles‌ Paperman
Elected member of the research commission of‌ FST (Université de Lille).
Sylvain Salvati
Elected member‌ of CE (Commission d'évaluation INRIA).
Sophie Tison
Member‌ of the Steering Committee of Highlights of Logic,‌ Automata, and Games.

10.2 Teaching - Supervision -‌ Juries - Educational and pedagogical outreach

10.2.1 Supervision‌

Bastien Degardins
PhD project started in October 2024.‌ On graph databases for DNA analysis. Supervised by‌ Paperman and Marchet (CRIStAL Bonsai).
Oliver Irwin
PhD project started in 2022.‌ On compilation and aggregation‌ in databases. Co-supervised by‌‌ Capelli and Salvati.
Sebastien Labbe
PhD student from‌ Oct 2025, co-supervised by‌ Antoine Amarilli and Charles‌‌ Paperman
Loup Lobet
PhD student from Oct 2025,‌ co-supervised by Laure Gonnord‌ and Charles Paperman
Etienne‌‌ Parent
PhD student from Oct 2025, co-supervised by‌ Gabrien Radanne, Matthieu Moy‌ and Charles Paperman
Arthur‌‌ Lombardo
PhD student from Oct 2025, co-supervised by‌ Antoine Amarilli, Mikaël Monet‌ and Pierre Senellart
Vincent‌‌ Peth
M2 intern, co-supervised by Antoine Amarilli and‌ Mikaël Monet, summer 2025‌
Rémi de Pretto
M1‌‌ intern, co-supervised by Antoine Amarilli and Mikaël Monet,‌ summer 2025
Paul Raphaël‌
L3 intern, co-supervised by‌‌ Antoine Amarilli and Mikaël Monet and Sylvain Salvati,‌ summer 2025
Harshul Sagar‌
PhD student at University‌‌ of Illinois with whom Antoine Amarilli and Mikaël‌ Monet have an ongoing‌ research project that started‌‌ during summer 2025
Thibault Tisserand
M1 intern, supervised‌ by Sylvain Salvati, summer‌ 2025

10.2.2 Juries

Sylvain‌‌ Salvati
has been a reviewer of the PhD‌ of Vincent Moreau A‌ topological and fibrational approach‌‌ to higher-order automata at Université Paris Cité.
Charles‌ Paperman
has been in‌ the PhD committee of‌‌ Gabriel Bathie.
Antoine Amarilli
has been in the‌ HDR committee of Tatiana‌ Starikovskaya (reviewer), in the‌‌ PhD committee of Rémi Morvan, in the PhD‌ committee of Neha Makhija,‌ in the PhD committee‌‌ of Harry Vinall-Smeeth (reviewer), in the CSI committees‌ of Javad Taheri, Oliver‌ Irwin, Fatemeh Ghasemi, and‌‌ Lucas Larroque, and in a hiring committee (‌COS) for a‌ “Maître de conférences‌‌” position in LABRI (Bordeaux)
Sophie Tison
has‌ been in the PhD‌ committee of Rémi Morvan.‌‌
Iovka Boneva
has been in the hiring committee‌ (COS) for‌ two “Maître de‌‌ conférences” positions at Université de Lille

10.2.3‌ Educational and pedagogical outreach‌

Aurélien Lemay
Responsible for‌‌ computer science and numeric correspondent for his department.‌
Paperman
Responsible of alternance‌ for the department of‌‌ Computer Science.
Sylvain Salvati
Head of Master of‌ Computer Science at the‌ University of Lille.
Sylvain‌‌ Salvati
Member of the board for Computer Science‌ of the doctoral school‌ MADIS (Mathematics and Computer‌‌ Science).
Sylvain Salvati
Member of the collège doctoral‌ (doctoral college) of University‌ of Lille: PhD students‌‌ education and professional training.

10.2.4 Teaching Activities

Iovka‌ Boneva
Teaches computer science‌ in BUT Informatique of‌‌ Université de Lille.
Oliver Irwin
Taught 64 hours‌ in Computer Science in‌ the Computer Science Department‌‌ of Université de Lille.
Aurélien Lemay
Teaches computer‌ science in the LEA‌ department of Université de‌‌ Lille for around 200h per year (Licence and‌ Master). He is also‌ responsible for computer science‌‌ and numeric correspondent for its department.
Mikael Monet‌
Teaches computer science as‌ a temporary lecturer at‌‌ Université de Lille and at Centrale Lille. He‌ teaches two databases courses,‌ as well as a‌‌ course on "algorithms and complexity".
Charles Paperman
Teaches‌ CS in master degrees‌ of the computer science‌‌ department, Miashs at bachelor‌ level in the math department and in the‌ data science master degree.
Sara Riva
Teaches CS‌ both in the Bachelor and Master of the‌ Department of Computer Science of the Université de‌ Lille. She also teaches a Database course at‌ Centrale Lille.
Sylvain Salvati
Teaches at the Bachelor‌ and Master level in the computer science department‌ of Université de Lille.

10.3 Popularization

10.3.1 Specific‌ official responsibilities in science outreach structures

Sophie Tison‌
Member of the scientific committee of the Maison‌ Pour la Science.
Charles Paperman
Co-organizer of the‌ yearly thematic of GDR-IFM and GDR-BIMMM l'adn dans‌ tout ses états.

10.3.2 Others science outreach‌ relevant activities

Sara Riva
is involved in the‌ internship action Fourmis which fosters the interest of‌ highschool students towards Mathematics and Computer Science.

11‌ Scientific production

11.1 Major publications

1 inproceedingsS.‌Shqiponja Ahmetaj, I.Iovka Boneva, J.‌Jan Hidders, K.Katja Hose, M.‌Maxime Jakubowski, J. E.Jose Emilio Labra‌ Gayo, W.Wim Martens, F.Fabio‌ Mogavero, F.Filip Murlak, C.Cem‌ Okulmus, A.Axel Polleres, O.Ognjen‌ Savković, M.Mantas Šimkus and D.Dominik‌ Tomaszuk. Common Foundations for SHACL, ShEx, and‌ PG-Schema.WWW 2025 - The ACM Web‌ Conference 2025Sydney, AustraliaACMApril 2025,‌ 8-21HAL DOI
2 proceedingsDynamic Membership for‌ Regular Tree Languages.MFCSWarsaw, PolandSchloss‌ Dagstuhl – Leibniz-Zentrum für Informatik2025HAL DOI‌
3 articleA.Antoine Amarilli, T.Timothy‌ van Bremen, O.Octave Gaspard and K.‌ S.Kuldeep S. Meel. Approximating Queries on‌ Probabilistic Graphs.Logical Methods in Computer Science‌214December 2025HAL DOI
4 proceedings‌Resilience for Regular Path Queries: Towards a Complexity‌ Classification.PODS32Berlin, GermanyACM‌2025, 1-18HALDOI
5 proceedingsLinear‌ Time Subsequence and Supersequence Regex Matching.MFCS‌Warsaw, PolandSchloss Dagstuhl – Leibniz-Zentrum für Informatik‌2025HAL DOI
6 articleP.Pablo Barceló‌, G.Gaëlle Fontaine, S.Sylvain Salvati‌ and S.Sophie Tison. Corrections to “On‌ the data complexity of consistent query answering over‌ graph databases [Journal of Computer and System Sciences‌ 88 (2017) 164–194]”.Journal of Computer and‌ System Sciences155February 2026, 103694HAL‌DOI
7 inproceedingsI.Iovka Boneva, J.‌ E.Jose Emilio Labra Gayo, E.Eric‌ Prud’hommeaux, K.Katherine Thornton and A.Andra‌ Waagmeester. Shape Expressions with Inheritance.22nd‌ European Semantic Web Conference, ESWC 2025ESWC 2025‌ - 22nd European Semantic Web Conference15718Lecture‌ Notes in Computer SciencePortoroz, SloveniaSpringer Nature‌ SwitzerlandJune 2025, 481-499HAL DOI
8‌ inproceedingsF.Florent Capelli, O.Oliver Irwin‌ and S.Sylvain Salvati. A Simple Algorithm‌ for Worst Case Optimal Join and Sampling.‌International Conference on Database TheoryLeibniz International Proceedings‌ in Informatics (LIPIcs)328Barcelona, SpainSchloss Dagstuhl – Leibniz-Zentrum für Informatik‌ (2025)March 2025,‌ 23:1-23:19HAL

11.2 Publications‌‌ of the year

International journals

9 articleA.‌Antoine Amarilli, T.‌Timothy van Bremen,‌‌ O.Octave Gaspard and K. S.Kuldeep S.‌ Meel. Approximating Queries‌ on Probabilistic Graphs.‌‌Logical Methods in Computer Science214December‌ 2025HAL DOI back‌ to text
10 article‌‌A.Antoine Amarilli, W.Wolfgang Gatterbauer,‌ N.Neha Makhija and‌ M.Mikaël Monet.‌‌ Resilience for Regular Path Queries: Towards a Complexity‌ Classification.Proceedings of‌ the ACM on Management‌‌ of Data322025, 1-18HAL‌DOI back to text‌
11 articleP.Pablo‌‌ Barceló, G.Gaëlle Fontaine, S.Sylvain‌ Salvati and S.Sophie‌ Tison. Corrections to‌‌ “On the data complexity of consistent query answering‌ over graph databases [Journal‌ of Computer and System‌‌ Sciences 88 (2017) 164–194]”.Journal of Computer‌ and System Sciences155‌February 2026, 103694‌‌HAL DOI back to text

International peer-reviewed conferences‌

12 inproceedingsS.Shqiponja‌ Ahmetaj, I.Iovka‌‌ Boneva, J.Jan Hidders, K.Katja‌ Hose, M.Maxime‌ Jakubowski, J. E.‌‌Jose Emilio Labra Gayo, W.Wim Martens‌, F.Fabio Mogavero‌, F.Filip Murlak‌‌, C.Cem Okulmus, A.Axel Polleres‌, O.Ognjen Savković‌, M.Mantas Šimkus‌‌ and D.Dominik Tomaszuk. Common Foundations for‌ SHACL, ShEx, and PG-Schema‌.WWW 2025 -‌‌ The ACM Web Conference 2025Sydney, AustraliaACM‌April 2025, 8-21‌HAL DOI back to‌‌ text
13 inproceedingsA.Antoine Amarilli, C.‌Corentin Barloy, L.‌Louis Jachiet and C.‌‌Charles Paperman. Dynamic Membership for Regular Tree‌ Languages.MFCSWarsaw,‌ PolandSchloss Dagstuhl –‌‌ Leibniz-Zentrum für Informatik2025HAL DOI back to‌ text
14 inproceedingsA.‌Antoine Amarilli, F.‌‌Florin Manea, T.Tina Ringleb and M.‌Markus Schmid. Linear‌ Time Subsequence and Supersequence‌‌ Regex Matching.MFCSWarsaw, PolandSchloss Dagstuhl‌ – Leibniz-Zentrum für Informatik‌2025HAL DOI back‌‌ to text
15 inproceedingsI.Iovka Boneva,‌ J. E.Jose Emilio‌ Labra Gayo, E.‌‌Eric Prud’hommeaux, K.Katherine Thornton and A.‌Andra Waagmeester. Shape‌ Expressions with Inheritance.‌‌22nd European Semantic Web Conference, ESWC 2025ESWC‌ 2025 - 22nd European‌ Semantic Web Conference15718‌‌Lecture Notes in Computer SciencePortoroz, SloveniaSpringer‌ Nature SwitzerlandJune 2025‌, 481-499HAL DOI‌‌back to text
16 inproceedingsF.Florent Capelli‌, O.Oliver Irwin‌ and S.Sylvain Salvati‌‌. A Simple Algorithm for Worst Case Optimal‌ Join and Sampling.‌International Conference on Database‌‌ TheoryLeibniz International Proceedings in Informatics (LIPIcs)328‌Barcelona, SpainSchloss Dagstuhl‌ – Leibniz-Zentrum für Informatik‌‌ (2025)March 2025, 23:1-23:19HAL back to‌ text

Reports & preprints‌

17 miscA.Antoine‌‌ Amarilli and B.Benoît Groz. Cutwidth Bounds‌ via Vertex Partitions.‌December 2025HAL back‌‌ to text
18 misc‌A.Antoine Amarilli, M.Mikaël Monet and‌ R.Rémi de Pretto. Confluence of the‌ Node-Domination and Edge-Domination Hypergraph Rewrite Rules.October‌ 2025HAL back to text
19 miscA.‌Antoine Amarilli, M.Mikaël Monet and R.‌Rémi de Pretto. Locality Testing for NFAs‌ is PSPACE-complete.November 2025HAL back to‌ text
20 miscB.Bastien Degardins, C.‌Charles Paperman and C.Camille Marchet. Vizitig:‌ a pangenome and pantranscriptome explorer.April 2025‌HAL DOI

11.3 Cited publications

21 articleA.‌Antoine Amarilli, M.Marcelo Arenas, Y.‌YooJung Choi, M.Mikaël Monet, G.‌ V.Guy Van den Broeck and B.Benjie‌ Wang. A Circus of Circuits: Connections Between‌ Decision Diagrams, Circuits, and Automata.arXiv preprint‌ arXiv:2404.096742024back to text back to text‌
22 inproceedingsA.Antoine Amarilli, P.Pierre‌ Bourhis, F.Florent Capelli and M.Mikaël‌ Monet. Ranked Enumeration for MSO on Trees‌ via Knowledge Compilation.ICDT2902024back‌ to text
23 inproceedingsA.Antoine Amarilli,‌ P.Pierre Bourhis, S.Stefan Mengel and‌ M.Matthias Niewerth. Enumeration on trees‌ with tractable combined complexity and efficient updates.‌PODS2019DOI back to text
24 article‌A.Antoine Amarilli and F.Florent Capelli.‌ Tractable Circuits in Database Theory.ACM SIGMOD‌ Record5322024back to text back‌ to text
25 articleA.Antoine Amarilli and‌ İ. İ.İsmail İlkan Ceylan. The‌ dichotomy of evaluating homomorphism-closed queries on probabilistic graphs‌.LMCS2022DOIback to text
26‌ inproceedingsA.Antoine Amarilli, L.Louis Jachiet‌ and C.Charles Paperman. Dynamic Membership for‌ Regular Languages.ICALP2021DOI back to‌ text back to text
27 articleA.Antoine‌ Amarilli, M.Mikaël Monet and D.Dan‌ Suciu. The Non-Cancelling Intersections Conjecture.CoRR‌abs/2401.162102024back to text back to text‌
28 articleA.Antoine Amarilli and M.Mikaël‌ Monet. Weighted counting of matchings in unbounded-treewidth‌ graph families.arXiv preprint arXiv:2205.008512022back‌ to text
29 inproceedingsA.Antoine Amarilli and‌ M.Mikaël Monet. Enumerating regular languages‌ with bounded delay.STACS2023DOI back‌ to text
30 inproceedingsJ.-Y.Jin-Yi Cai,‌ P.Pinyan Lu and M.Mingji Xia.‌ Holant problems and counting CSP.STOC2009‌back to text
31 articleA.Adnan Darwiche‌ and P.Pierre Marquis. A knowledge compilation‌ map.Journal of Artificial Intelligence Research17‌2002back to textback to text
32‌ inproceedingsD.Daniel Deutch, N.Nave Frost‌, B.Benny Kimelfeld and M.Mikaël Monet‌. Computing the Shapley Value of Facts in‌ Query Answering.SIGMOD ConferenceACM2022back‌ to text
33 articleM.Martin Dyer and‌ D.David Richerby. An effective dichotomy for‌ the counting constraint satisfaction problem.SIAM Journal on Computing423‌2013back to text‌
34 articleR.Ronald‌‌ Fagin, B.Benny Kimelfeld, F.Frederick‌ Reiss and S.Stijn‌ Vansummeren. Document spanners:‌‌ A formal approach to information extraction.JACM‌6222015back‌ to text
35 article‌‌C.Cibele Freire, W.Wolfgang Gatterbauer,‌ N.Neil Immerman and‌ A.Alexandra Meliou.‌‌ The Complexity of Resilience and Responsibility for Self-Join-Free‌ Conjunctive Queries.PVLDB‌932015DOI‌‌back to text
36 phdthesisL.Lily Gallois‌. Dialog between chase‌ approach and string rewriting‌‌ system approach.Université de LilleUniversité de‌ LilleDecember 2019HAL‌back to text
37‌‌ articleK.Kathrin Hanauer, M.Monika Henzinger‌ and C.Christian Schulz‌. Recent advances in‌‌ fully dynamic graph algorithms: A quick reference guide‌.JEA272022‌back to text
38‌‌ incollectionM.Mark Jerrum. Counting Constraint Satisfaction‌ Problems.The Constraint‌ Satisfaction Problem7Dagstuhl‌‌ Follow-Ups2017back to text
39 articleP.‌Pratik Karmakar, M.‌Mikaël Monet, P.‌‌Pierre Senellart and S.Stéphane Bressan. Expected‌ Shapley-like scores of boolean‌ functions: Complexity and applications‌‌ to probabilistic databases.PACMMOD222024‌back to text
40‌ inproceedingsW.Wim Martens‌‌ and T.Tina Trautner. Evaluation and enumeration‌ problems for regular path‌ queries.ICDT2018‌‌back to text
41 inproceedingsC.Charles Paperman‌, S.Sylvain Salvati‌ and C.Claire Soyez-Martin‌‌. An algebraic approach to vectorial programs.‌STACS2023HAL DOI‌back to text
42‌‌ inproceedingsS.Sylvain Salvati and S.Sophie Tison‌. Containment of Regular‌ Path Queries Under Path‌‌ Constraints.ICDT2024HAL DOI back to‌ text
43 articleN.‌Nicole Schweikardt, L.‌‌Luc Segoufin and A.Alexandre Vigny. Enumeration‌ for FO queries over‌ nowhere dense graphs.‌‌JACM6932022DOI back to text‌
44 articleL.Luc‌ Segoufin. Constant delay‌‌ enumeration for conjunctive queries.ACM SIGMOD Record‌4412015back‌ to text
45 article‌‌P.Pierre Senellart, L.Louis Jachiet,‌ S.Silviu Maniu and‌ Y.Yann Ramusat.‌‌ ProvSQL: Provenance and probability management in PostgreSQL.‌PVLDB11122018‌back to text
46‌‌ phdthesisC.Claire Soyez-Martin. From semigroup theory‌ to vectorization: recognizing regular‌ languages.Université de‌‌ LilleUniversité de LilleDecember 2023HAL back‌ to text

D-DAL - 2025

D-DAL - 2025

2025Activity​​​‌ reportProject-TeamD-DAL

Keywords

Computer Science﻿​﻿﻿ and Digital Science

Other Research Topics and​​​‌ Application Domains

1 Team members,​‌﻿﻿ visitors, external collaborators

Research​​﻿﻿ Scientists

Faculty​‌﻿﻿ Members

PhD﻿​﻿﻿ Students

Administrative Assistant​‌﻿﻿

External Collaborator

2 Overall objectives​​﻿﻿

Challenge​​​‌ 1: Extracting value from﻿​﻿﻿ data

Dealing with﻿​​﻿ uncertain data.

Finding structure﻿‌​‌ in data.

Explaining query﻿​​﻿ results.

Challenge 2: Processing large​​﻿﻿ data efficiently

Enumeration and incremental algorithms.﻿﻿﻿‌

Designing query​​​‌ execution engines.

3 Research﻿​﻿﻿ program

3.1​​​‌ The Basement: Circuits for﻿﻿﻿‌ Data Manipulation

Studying circuits from﻿﻿﻿‌ Knowledge Compilation.

Short﻿​﻿﻿ term.

Medium term.

Long​​﻿﻿ term.

Circuits as a compilation​‌﻿﻿ target for parallelism and​​﻿﻿ vectorization.

Short term.

Medium term.

Long term.

Foundations of circuits:﻿‌​‌ circuit complexity, algebra and​​​‌ abstract machines.

Short​​​‌ term.

Medium﻿​﻿﻿ term.

Long term.

3.2 First Floor:﻿​﻿﻿ Logic and Query Evaluation​‌﻿﻿

Provenance, explanation,﻿‌​‌ aggregation, counting, uncertainty, probabilistic﻿​​﻿ data, Shapley, approximation algorithms.​​​‌

Short term.

Medium term.﻿​​﻿

Long term.

Dynamic﻿​​﻿ data: incremental and dynamic​​​‌ algorithms and connections with﻿﻿﻿‌ dynamic algorithms.

Short term.

Medium term.

Long term.﻿​﻿﻿

Efficient​​​‌ evaluation: streaming and parallelism.﻿﻿﻿‌

Short​​​‌ term.

Medium term.

Long﻿​​﻿ term.

Enumerating the﻿‌​‌ results of recursive queries.﻿​​﻿

Short term.

Medium term.

Long term.

3.3 The​​​‌ roof: Knowledge on Data﻿﻿﻿‌

Optimization for schemas for﻿‌​‌ graphs.

Short term.

Medium term.﻿‌​‌

Long term.

Static​‌﻿﻿ analysis with constraints (open-world​​﻿﻿ query answering, query rewriting​​​‌ under constraints, consistency, certain﻿​﻿﻿ answers).

Short term.

Medium term.

Long term.

4 Application﻿​​﻿ domains

5 Social﻿​﻿﻿ and environmental responsibility

5.1​‌﻿﻿ Footprint of research activities​​﻿﻿

5.2​‌﻿﻿ Parity and equal opportunities​​﻿﻿ in science

5.3﻿‌​‌ Impact of research results﻿​​﻿

6 Highlights of the﻿‌​‌ year

6.1 Creation of﻿​​﻿ the D-DAL team-project

7﻿​​﻿ Latest software developments, platforms,​​​‌ open data

7.1 Application﻿﻿﻿‌ of database techniques to﻿‌​‌ genomic

7.2 Latest﻿‌​‌ software developments

7.2.1 helicase﻿​​﻿

7.2.2﻿‌​‌ NetworkDisk

7.2.3 rsonpath

7.2.4 ShapeDesigner﻿​​﻿

7.2.5﻿​​﻿ ShEx validator

7.2.6 vizitig﻿​﻿﻿

8 New results

8.1​​​‌ Circuits for Data Manipulation﻿​﻿﻿

8.1.1 Studying circuits﻿​﻿﻿ from Knowledge Compilation

8.1.2 Foundations​​﻿﻿ of circuits: circuit complexity,​​​‌ algebra and abstract machines﻿​﻿﻿

2025Activity‌ reportProject-TeamD-DAL

Computer Science and Digital Science

Other Research Topics and‌ Application Domains

1 Team members,‌ visitors, external collaborators

Research Scientists

Faculty‌ Members

PhD Students

Administrative Assistant‌

2 Overall objectives

Challenge‌ 1: Extracting value from data

Dealing with uncertain data.

Finding structure‌‌ in data.

Explaining query results.

Challenge 2: Processing large data efficiently

Enumeration and incremental algorithms.‌

Designing query‌ execution engines.

3 Research program

3.1‌ The Basement: Circuits for‌ Data Manipulation

Studying circuits from‌ Knowledge Compilation.

Short term.

Long term.

Circuits as a compilation‌ target for parallelism and vectorization.

Foundations of circuits:‌‌ circuit complexity, algebra and‌ abstract machines.

Short‌ term.

Medium term.

3.2 First Floor: Logic and Query Evaluation‌

Provenance, explanation,‌‌ aggregation, counting, uncertainty, probabilistic data, Shapley, approximation algorithms.‌

Medium term.

Dynamic data: incremental and dynamic‌ algorithms and connections with‌ dynamic algorithms.

Long term.

Efficient‌ evaluation: streaming and parallelism.‌

Short‌ term.

Long term.

Enumerating the‌‌ results of recursive queries.

3.3 The‌ roof: Knowledge on Data‌

Optimization for schemas for‌‌ graphs.

Medium term.‌‌

Static‌ analysis with constraints (open-world query answering, query rewriting‌ under constraints, consistency, certain answers).

4 Application domains

5 Social and environmental responsibility

5.1‌ Footprint of research activities

5.2‌ Parity and equal opportunities in science

5.3‌‌ Impact of research results

6 Highlights of the‌‌ year

6.1 Creation of the D-DAL team-project

7 Latest software developments, platforms,‌ open data

7.1 Application‌ of database techniques to‌‌ genomic

7.2 Latest‌‌ software developments

7.2.1 helicase

7.2.2‌‌ NetworkDisk

7.2.4 ShapeDesigner

7.2.5 ShEx validator

7.2.6 vizitig

8.1‌ Circuits for Data Manipulation

8.1.1 Studying circuits from Knowledge Compilation

8.1.2 Foundations of circuits: circuit complexity,‌ algebra and abstract machines

8.2 Logic and Query Evaluation

8.2.1‌ Provenance, explanation, aggregation, counting, uncertainty, probabilistic data, Shapley,‌ approximation algorithms.

8.2.2 Dynamic data: incremental and dynamic‌ algorithms and connections with‌ dynamic algorithms

8.3 Knowledge‌‌ on Data

8.3.1 Optimization for schemas‌‌ for graphs

8.3.2‌ Static analysis with constraints (open-world query answering, query‌ rewriting under constraints, consistency, certain answers)

9 Partnerships and cooperations

9.1 International research‌ visitors

9.1.1 Visits of international scientists

Invitation of‌ researchers to the team seminar

9.1.2 Visits to international teams

Research‌ stays abroad

9.2 National initiatives‌‌

9.2.1 ANR JCJC KCODA

9.2.2 ANR SxC

9.2.3 ANR EXPAND

9.3‌ Regional initiatives

10 Dissemination

10.1 Promoting‌ scientific activities

Member of organizing committees

10.1.1 Scientific events: selection

Member of‌ the conference program committees

10.1.2 Journal

Member of the‌ editorial boards