EN FR
EN FR
D-DAL - 2025

2025Activity​​​‌ reportProject-TeamD-DAL

RNSR:‌ 202524739L
  • Research center Inria‌​‌ Centre at the University​​ of Lille
  • In partnership​​​‌ with:Université de Lille‌
  • Team name: Data :‌​‌ Dynamics Algorithmics and Logics​​
  • In collaboration with:Centre​​​‌ de Recherche en Informatique,‌ Signal et Automatique de‌​‌ Lille

Creation of the​​ Project-Team: 2025 October 01​​​‌

Each year, Inria research‌ teams publish an Activity‌​‌ Report presenting their work​​ and results over the​​​‌ reporting period. These reports‌ follow a common structure,‌​‌ with some optional sections​​ depending on the specific​​​‌ team. They typically begin‌ by outlining the overall‌​‌ objectives and research programme,​​ including the main research​​​‌ themes, goals, and methodological‌ approaches. They also describe‌​‌ the application domains targeted​​ by the team, highlighting​​​‌ the scientific or societal‌ contexts in which their‌​‌ work is situated.

The​​ reports then present the​​​‌ highlights of the year,‌ covering major scientific achievements,‌​‌ software developments, or teaching​​ contributions. When relevant, they​​​‌ include sections on software,‌ platforms, and open data,‌​‌ detailing the tools developed​​​‌ and how they are​ shared. A substantial part​‌ is dedicated to new​​ results, where scientific contributions​​​‌ are described in detail,​ often with subsections specifying​‌ participants and associated keywords.​​

Finally, the Activity Report​​​‌ addresses funding, contracts, partnerships,​ and collaborations at various​‌ levels, from industrial agreements​​ to international cooperations. It​​​‌ also covers dissemination and​ teaching activities, such as​‌ participation in scientific events,​​ outreach, and supervision. The​​​‌ document concludes with a​ presentation of scientific production,​‌ including major publications and​​ those produced during the​​​‌ year.

Keywords

Computer Science​ and Digital Science

  • A2.1.​‌ Programming Languages
  • A2.1.1. Semantics​​ of programming languages
  • A2.1.4.​​​‌ Functional programming
  • A2.1.6. Concurrent​ programming
  • A3.1. Data
  • A3.1.1.​‌ Modeling, representation
  • A3.1.2. Data​​ management, quering and storage​​​‌
  • A3.1.3. Distributed data
  • A3.1.4.​ Uncertain data
  • A3.1.5. Control​‌ access, privacy
  • A3.1.6. Query​​ optimization
  • A3.1.7. Open data​​​‌
  • A3.1.8. Big data (production,​ storage, transfer)
  • A3.1.9. Database​‌
  • A3.2.1. Knowledge bases
  • A3.2.2.​​ Knowledge extraction, cleaning
  • A3.2.3.​​​‌ Inference
  • A3.2.4. Semantic Web​
  • A4.5. Formal method for​‌ verification, reliability, certification
  • A4.5.1.​​ Static analysis
  • A4.5.2. Model-checking​​​‌
  • A4.5.3. Program proof
  • A4.7.​ Access control
  • A4.8. Privacy-enhancing​‌ technologies
  • A7. Theory of​​ computation
  • A7.2. Logic in​​​‌ Computer Science
  • A9.1. Knowledge​
  • A9.2. Machine learning
  • A9.7.​‌ AI algorithmics
  • A9.8. Reasoning​​

Other Research Topics and​​​‌ Application Domains

  • B6.1. Software​ industry
  • B6.3.1. Web
  • B6.3.4.​‌ Social Networks
  • B6.5. Information​​ systems
  • B9.5.1. Computer science​​​‌
  • B9.5.6. Data science
  • B9.10.​ Privacy

1 Team members,​‌ visitors, external collaborators

Research​​ Scientists

  • Antoine Amarilli [​​​‌INRIA, Advanced Research​ Position, from Oct​‌ 2025, HDR]​​
  • Mikael Monet [INRIA​​​‌, Researcher, from​ Oct 2025]

Faculty​‌ Members

  • Sylvain Salvati [​​Team leader, UNIV​​​‌ LILLE, Professor,​ from Oct 2025,​‌ HDR]
  • Iovka Boneva​​ [UNIV LILLE,​​​‌ Associate Professor, from​ Oct 2025]
  • Aurélien​‌ Lemay [UNIV LILLE​​, Associate Professor,​​​‌ from Oct 2025,​ HDR]
  • Charles Paperman​‌ [UNIV LILLE,​​ Associate Professor, from​​​‌ Oct 2025, HDR​]
  • Sophie Tison [​‌UNIV LILLE, Emeritus​​, from Oct 2025​​​‌, HDR]

PhD​ Students

  • Bastien Degardins [​‌UNIV LILLE, from​​ Oct 2025]
  • Oliver​​​‌ Irwin [UNIV LILLE​, from Oct 2025​‌]
  • Sebastien Labbe [​​UNIV LILLE, from​​​‌ Oct 2025]
  • Loup​ Lobet [INRIA,​‌ from Oct 2025]​​
  • Arthur Lombardo [ENS​​​‌ PARIS, from Oct​ 2025]

Administrative Assistant​‌

  • Aurore Dalle [INRIA​​]

External Collaborator

  • Sara​​​‌ Riva [UNIV LILLE​, from Oct 2025​‌]

2 Overall objectives​​

A major and early​​​‌ success story of computer​ science, combining theoretical foundations​‌ and efficient implementations, is​​ the development of relational​​​‌ databases. Relational databases​ have now been deployed​‌ in essentially all sectors​​ of activity, from large​​​‌ multi-server clusters to phones​ and embedded devices. An​‌ important factor of this​​ success is the abstraction​​​‌ lying at the core​ of relational databases: they​‌ combine a simple data​​ model with a declarative​​​‌ language (SQL) to write​ user queries. These queries​‌ are then transparently translated​​ to optimized execution plans​​ taking advantage of the​​​‌ schema (the data model)‌ and statistics of the‌​‌ data. This abstracts away​​ all internal details that​​​‌ would be unpleasant to‌ the user: how data‌​‌ is concretely stored, the​​ shape of the actual​​​‌ execution plan, low level‌ optimizations, the orchestration of‌​‌ data streams that compute​​ intermediate query results, concurrent​​​‌ accesses, and so on.‌

Building on this foundation,‌​‌ the field of data​​ management has evolved in​​​‌ two antagonistic directions over‌ the past decade. One‌​‌ first direction is that​​ the scope of data​​​‌ usage in applications has‌ significantly expanded. Data is‌​‌ used to train statistical​​ models; they are the​​​‌ basis of key decisions‌ that have a strong‌​‌ impact on people and​​ on society, and must​​​‌ therefore be explainable.‌ The data that we‌​‌ store may be updated​​ often, which makes it​​​‌ necessary to efficiently recompute‌ the query answers. In‌​‌ particular, the data may​​ arrive in streams,​​​‌ and we may have‌ to compute query answers‌​‌ on the fly, even​​ before we have read​​​‌ the entire data stream.‌ The number of query‌​‌ answers to produce may​​ be so large that​​​‌ we need to produce‌ it in streaming, or‌​‌ to access it implicitly,​​ instead of querying it​​​‌ in full. The data‌ is heterogeneous: it‌​‌ does not necessarily follow​​ a fixed schema like​​​‌ the relational model traditionally‌ assumes, and it may‌​‌ feature non-relational data such​​ as textual documents, tree-shaped​​​‌ data (such as XML‌ and JSON records), and‌​‌ so on. In particular,​​ data may be stored​​​‌ as a graph, without‌ a fixed schema; it‌​‌ may then be interrogated​​ with queries featuring recursion​​​‌, e.g., asking for‌ pairs of reachable nodes.‌​‌ There may be uncertainty​​ on the data stored​​​‌ by such systems, making‌ it difficult to compute‌​‌ accurate answers. Last, some​​ queries need to be​​​‌ evaluated across datasets from‌ different sources, e.g., large‌​‌ open datasets from multiple​​ public organizations. In such​​​‌ settings, data is typically‌ not stored within one‌​‌ single system, but integrated​​ across different systems. Further,​​​‌ beyond traditional query evaluation,‌ the data can be‌​‌ transformed, combined, and post-processed.​​ This can be done​​​‌ with procedural languages within‌ SQL databases (stored functions‌​‌ and procedures), or with​​ domain specific languages such​​​‌ as LINQ that are‌ used to querying data‌​‌ within the programming language​​ C#. Thus, many data​​​‌ processing frameworks are actually‌ libraries that can be‌​‌ seen as query languages​​ embedded within a programming​​​‌ language; conversely, some query‌ languages such as XPath‌​‌ 3.0 for XML documents​​ or jq for JSON​​​‌ documents now embed limited‌ programming languages so as‌​‌ to make them more​​ expressive. These new developments​​​‌ are blurring the line‌ between querying and programming,‌​‌ and are giving up​​ on the idea that​​​‌ data manipulation needs should‌ be expressed declaratively.

Faced‌​‌ with the growth of​​ these new applications, the​​​‌ scope of the SQL‌ standard has significantly expanded‌​‌ to accommodate some (but​​ not all) of the​​​‌ corresponding needs. Recent versions‌ of SQL support recursive‌​‌ queries, incremental view maintenance​​​‌ and triggers, stored procedures,​ queries over strings, embedded​‌ XML and JSON documents​​ and query over them,​​​‌ along with naive forms​ of uncertainty and incompleteness​‌ (through NULLs). Most recently,​​ the graph query language​​​‌ GQL has been published​ as part of SQL.​‌ These features are standardized​​ and supported by relational​​​‌ engines, though the coverage​ and efficiency of these​‌ features may vary across​​ implementations. Further, for some​​​‌ of these tasks, such​ as data integration and​‌ post-processing, an entire ecosystem​​ of solutions has developed​​​‌ on top of relational​ databases.

The second direction​‌ is that the data​​ has increased in volume.​​​‌ Some organizations now want​ to store much larger​‌ volumes of data, e.g.,​​ by extracting it from​​​‌ the Web, collecting scientific​ data, or running services​‌ for an unprecedented number​​ of users. Traditional relational​​​‌ databases, with their support​ of the full SQL​‌ language, may not be​​ suited to such large​​​‌ data volumes. For this​ reason, in the 2000s,​‌ new “NoSQL” database paradigms​​ started to become popular.​​​‌ These systems compromise on​ some aspects of SQL,​‌ such as its expressiveness​​ (i.e., supporting less expressive​​​‌ queries), or its formal​ guarantees (e.g., distributed consistency),​‌ to achieve better performance.​​ Highly-optimized systems now exist​​​‌ for specific tasks, such​ as key-value stored, document-oriented​‌ databases, etc.

To summarize,​​ new data management applications​​​‌ now ask for two​ conflicting aspects: the support​‌ for more expressive queries​​ (going beyond the more​​​‌ arcane features of SQL);​ while offering more efficient​‌ querying over very large​​ quantities of data. These​​​‌ features should be offered​ while retaining the guiding​‌ principle of data management​​ systems, namely, going from​​​‌ a high-level declarative language​ that describes user needs,​‌ down to low-level execution​​ plans that implement them​​​‌ over data representations. Parts​ of this translation can​‌ be data-independent, using,​​ e.g., static analysis; or​​​‌ they can be data-dependent​, using, e.g., logical​‌ constraints or statistics of​​ the database.

The main​​​‌ goal of D-DAL (D-DAL​ is pronounced as Dédale​‌, the French pronunciation​​ of Daedalus) is to​​​‌ find paths to extract​ meaning from the maze​‌ of data. In other​​ words, D-DAL's scientific objective​​​‌ consists in overcoming the​ limits of expressivity and​‌ efficiency of programs dealing​​ with data. This global​​​‌ objective gives rise to​ two main challenges. A​‌ challenge with a large​​ perspective of extracting value​​​‌ from data, and​ a practical and pragmatic​‌ challenge of processing large​​ data efficiently.

Challenge​​​‌ 1: Extracting value from​ data

The first challenge​‌ that we will address​​ is that of supporting​​​‌ more expressive ways to​ interrogate data, so as​‌ to make sense of​​ complex data. We will​​​‌ address several aspects of​ this challenge. First, the​‌ data can be uncertain​​1, imprecise, or​​​‌ probabilistic. Second, it can​ be unstructured, or with​‌ an unknown structure that​​ must be discovered or​​​‌ formalized. Third, we may​ need to extend query​‌ evaluation to compute complex​​ information, in particular explanations,​​​‌ but also provenance annotations,​ complex aggregation values, or​‌ complex results, in a​​ principled manner.

Dealing with​​ uncertain data.

Data in​​​‌ real life is rarely‌ certain: it can be‌​‌ subject to human error​​ when manually collected, generated​​​‌ from physical sensors having‌ imperfect precision, or extracted‌​‌ from sources of various​​ quality from the Web​​​‌ via natural language processing‌ (NLP) or machine learning‌​‌ techniques, that are often​​ themselves of a probabilistic​​​‌ nature. The traditional way‌ of dealing with this‌​‌ uncertainty is to simply​​ ignore it, e.g., by​​​‌ removing all rows containing‌ NULL values in a‌​‌ database before evaluating a​​ query. This naive approach​​​‌ can however lead to‌ incorrect or incomplete answers‌​‌ and is thus not​​ always desirable. The challenge​​​‌ is then to design‌ methods and algorithms that‌​‌ are able to take​​ this uncertainty into account​​​‌ in a principled and‌ efficient way, i.e., by‌​‌ reasoning over the possible​​ completions of the data​​​‌ in a logical fashion,‌ or by reasoning quantitatively‌​‌ over the possible worlds​​ of probabilistic data to​​​‌ determine which query results‌ are likely.

Finding structure‌​‌ in data.

Relational databases​​ assume that the data​​​‌ is strictly structured by‌ a fixed data schema‌​‌: this assumption is​​ central to the query​​​‌ language, and to efficient‌ algorithms to select query‌​‌ execution plans. However, when​​ data comes from various​​​‌ sources, or when it‌ is updated with new‌​‌ kinds of information, then​​ relational databases are inadequate:​​​‌ they require time-consuming engineering‌ tasks to refactor data‌​‌ schema to reflect the​​ new needs. By contrast,​​​‌ the model of graph‌ databases is more flexible‌​‌ and has become increasingly​​ popular for applications. As​​​‌ efficiency comes at a‌ cost, flexibility also has‌​‌ a cost, and data​​ may display irregularities and​​​‌ peculiarities that make it‌ harder to process. We‌​‌ focus on two main​​ research directions to explore​​​‌ on data graphs.

The‌ first research direction is‌​‌ to develop ways to​​ express restrictions on the​​​‌ shape of graph data,‌ via flexible notions of‌​‌ schemas. Schemas make it​​ possible to trade some​​​‌ of the flexibility of‌ the graph model with‌​‌ the efficiency of the​​ relational model: they make​​​‌ it possible to accommodate‌ diverse data representations in‌​‌ a controlled manner, while​​ disallowing specific kinds of​​​‌ irregularities, and while guiding‌ query evaluation algorithms to‌​‌ optimize query evaluation plans​​ with the information given​​​‌ by the schema.

The‌ second approach consists in‌​‌ analyzing data in order​​ to reveal its structure.​​​‌ Through comprehensive data examination‌ and modelling techniques, we‌​‌ aim to automatically discover​​ the underlying patterns and​​​‌ organization within the graph.‌ These methods could lead‌​‌ to the inference of​​ a schema that will​​​‌ be exploited by other‌ algorithms to exploit directly‌​‌ the data, to transform​​ it into a usable​​​‌ form, or to directly‌ help inference of queries‌​‌ or transformation.

Explaining query​​ results.

The increasing usage​​​‌ of computing in every‌ aspect of modern society‌​‌ has naturally raised a​​ need for transparency in​​​‌ algorithms. In some scenarios,‌ having the answer to‌​‌ a query is not​​ enough and one also​​​‌ wants some kind of‌ explanation on how the‌​‌ result was derived, or​​​‌ wants to sort the​ possible results according to​‌ some importance score.

In​​ particular, in contexts where​​​‌ decisions are made on​ the basis of data​‌ that can impact people's​​ lives, like in medicine,​​​‌ or that can impact​ society, like in policy​‌ making, it is important​​ not to just query​​​‌ data, but also to​ understand why the answers​‌ are obtained, or even​​ understand how pieces of​​​‌ data weigh in the​ answers using certain aggregates​‌ like Shapley values. Indeed​​ data is imperfect and​​​‌ query explanation is a​ way to deal with​‌ this imperfection. These concerns​​ are not confined to​​​‌ the database community: the​ problem arises in all​‌ data-driven computations. Most notably,​​ with the rise of​​​‌ AI, explainable AI has​ emerged as an important​‌ scientific domain.

From a​​ technical point of view,​​​‌ query explainability can be​ understood more generally as​‌ building complex query answers​​ that keep track of​​​‌ the operations performed during​ the evaluation, amounting to​‌ a form of user-defined​​ aggregation; it can also​​​‌ be understood in the​ abstract setting of semiring​‌ provenance. All these​​ new tasks can incur​​​‌ an increase in the​ time to answer queries,​‌ and thus call for​​ the development of new​​​‌ notions of explanations and​ of new efficient algorithms.​‌

Challenge 2: Processing large​​ data efficiently

The standard​​​‌ of SQL is continuously​ growing. Now, it encompasses​‌ a larger number of​​ data formats and ways​​​‌ to query these formats:​ recursive queries, XML and​‌ Xquery, JSON, and more​​ recently property graphs. Most​​​‌ of these extensions of​ the standard are encoded​‌ in the relational model,​​ sometimes in very simple​​​‌ manners. For example, JSON​ documents are in general​‌2 stored as strings.​​ Instead of attacking the​​​‌ problem of handling massive​ data by trying to​‌ treat all kinds of​​ formats and queries at​​​‌ once, like the SQL​ standard does, we prefer​‌ to concentrate on particular​​ fragments and propose efficient​​​‌ analyses, optimization methods and​ algorithms for these fragments.​‌ When the number of​​ query results is very​​​‌ large, specific methods are​ needed to construct them.​‌ For instance, streaming the​​ results one by one​​​‌ dilutes the computational effort​ over time. In situations​‌ when queries are repeated,​​ another strategy may be​​​‌ to store the results​ of the query and​‌ update it along with​​ the data. We shall,​​​‌ in some cases, develop​ high throughput prototypes for​‌ these specific query fragments.​​ In the end, the​​​‌ composition of these tools​ should result in highly​‌ efficient data processing pipelines.​​

As data comes in​​​‌ many forms (text, JSON​ documents, graphs, relational databases,​‌ ...), query languages are​​ highly diverse (regular expressions,​​​‌ JSONpath, Xpath, SQL, the​ numerous NOSQL querying APIs,​‌ ...). An important challenge​​ is to understand what​​​‌ is computationally feasible and​ what is not. Knowing​‌ which fragments of the​​ query languages can benefit​​​‌ from efficient evaluation will​ possibly allow us to​‌ develop specific tools to​​ handle them. We are​​​‌ going to focus on​ particular aspects of expressive​‌ queries that go beyond​​ the traditional relational model:​​ recursion, uncertainty handling, and​​​‌ support for embedded programming.‌ For these features, we‌​‌ aim at studying which​​ aspects of queries make​​​‌ evaluation more or less‌ efficient. This will result‌​‌ in the design of​​ fast algorithms for some​​​‌ families of queries and‌ proofs of computational hardness‌​‌ for others. Beyond these​​ results, for practically useful​​​‌ families of queries, we‌ shall develop domain specific‌​‌ languages and prototypes that​​ take advantage of the​​​‌ algorithms that we will‌ have designed.

Beyond this‌​‌ theoretical endeavour, we also​​ wish to explore how​​​‌ efficient query evaluation algorithms‌ can be articulated with‌​‌ modern hardware platforms. Indeed,​​ hardware now offers increasingly​​​‌ parallel, concurrent or distributed‌ primitives. We believe that‌​‌ one of the keys​​ to efficient query evaluation​​​‌ on large data is‌ to align novel algorithmic‌​‌ methods with modern hardware.​​

Last, in settings where​​​‌ efficient query evaluation remains‌ out of reach, we‌​‌ intend to study approximation​​ algorithms, to obtain approximate​​​‌ answers to queries with‌ reasonable time bounds. Approximation‌​‌ can be understood in​​ a logical sense (i.e.,​​​‌ principled overapproximations or underapproximations‌ of query results), in‌​‌ a quantitative sense (i.e.,​​ approximate numerical or probabilistic​​​‌ query answers), or in‌ a practical sense (i.e.,‌​‌ approximations that are validated​​ by an experimental approach).​​​‌

Enumeration and incremental algorithms.‌

Querying data is only‌​‌ one aspect of data​​ processing. Indeed, when looking​​​‌ at entire data processing‌ pipelines, we see that‌​‌ results are often combined​​ with other results coming​​​‌ from other data sources,‌ and can be further‌​‌ processed by programs. This​​ poses new research questions:​​​‌ instead of optimizing query‌ evaluation as a monolithic‌​‌ task, we can study​​ how to globally optimize​​​‌ query evaluation pipelines by‌ studying the interplay between‌​‌ the evaluation of several​​ queries.

In this light,​​​‌ we intend to study‌ two common algorithmic frameworks‌​‌ to deal with large​​ amounts of data, which​​​‌ are well-suited to orchestrate‌ the simultaneous evaluations of‌​‌ many queries in a​​ pipeline. These are enumeration​​​‌ algorithms and incremental algorithms‌.

The principle of‌​‌ enumeration algorithms is the​​ following: when the number​​​‌ of query results is‌ huge, it is neither‌​‌ feasible nor realistic to​​ produce the entire collection​​​‌ of results in answer‌ to a user query.‌​‌ Instead, it is preferable​​ to enumerate the results​​​‌ in streaming, one after‌ the other, while minimizing‌​‌ the delay between two​​ successive answers. Evaluating queries​​​‌ by enumerating results offers‌ two potential advantages. First,‌​‌ downstream applications in the​​ pipeline can access query​​​‌ results more quickly, as‌ soon as they are‌​‌ computed. Second, outputting results​​ one by one might​​​‌ drastically reduce the amount‌ of memory needed to‌​‌ perform the computation.

The​​ principle of incremental algorithms​​​‌ is that queries that‌ are repeatedly executed can‌​‌ be drastically accelerated by​​ memorizing previous results. The​​​‌ idea of incremental algorithms‌ is to avoid the‌​‌ recomputation of all the​​ results; this is usually​​​‌ done by designing data‌ structures that can efficiently‌​‌ be maintained when the​​ data is updated and​​​‌ that can serve to‌ compute quickly the new‌​‌ query result.

Designing query​​​‌ execution engines.

Once query​ results from various data​‌ sources are enumerated either​​ by an enumeration algorithm​​​‌ or from data structures​ that store them, one​‌ needs to orchestrate and​​ synchronize the processing of​​​‌ these results. In this​ context results are abstracted​‌ as streams that are​​ combined together.

Techniques pertaining​​​‌ to programming languages are​ well suited to work​‌ on streams. Streams of​​ data and their combinations​​​‌ are well studied in​ the context of synchronous​‌ and reactive programming. Generalizations​​ have been also studied​​​‌ by the community of​ functional programming languages. Bringing​‌ these methods to deal​​ with queries and more​​​‌ broadly to data processing​ can have a strong​‌ impact on the efficiency​​ of query execution engines.​​​‌

The relation with programming​ languages and thus to​‌ compilation naturally leans towards​​ trying to exploit hardware​​​‌ optimally to improve data​ throughput. In particular, exploiting​‌ all the levels of​​ parallelization: SIMD3 instructions,​​​‌ the several cores that​ are inside CPUs, GPUs,​‌ or in a distributed​​ context.

Finally another challenge​​​‌ is to develop methods​ at the frontier of​‌ enumeration and incremental algorithms,​​ i.e., to develop data​​​‌ structures that support the​ enumeration of query results,​‌ and that can also​​ be updated efficiently when​​​‌ the data itself is​ updated, so as to​‌ start again the enumeration​​ of the (possibly new)​​​‌ query results.

3 Research​ program

Our research organisation​‌ addresses the two scientific​​ challenges that motivate the​​​‌ project D-DAL, extracting value​ from data and processing​‌ large data efficiently,​​ with three axes that​​​‌ correspond to different level​ of abstractions with respect​‌ to these challenges.

The​​ first axis, the basement​​​‌, is about circuits​ which are a transverse​‌ to the two challenges.​​ Circuits form a handy​​​‌ abstraction for representing data​ and understanding its structure​‌ but also a proxy​​ that allows to study​​​‌ the complexity of queries​ and also to understand​‌ how to make the​​ most out of modern​​​‌ CPUs.

The second axis,​ the first floor,​‌ is about querying. It​​ builds upon what is​​​‌ produced in the basement​ and is intended to​‌ develop efficient querying algorithms​​ and eventually querying programs.​​​‌ This axis is concerned​ with explaining query results,​‌ studying fragments of querying​​ languages, dynamics aspects of​​​‌ querying and streaming. The​ first floor builds upon​‌ the basement, circuits play​​ an important role in​​​‌ explaining query results and​ various aggregation mechanisms as​‌ they provide the right​​ structure to represent concisely​​​‌ query results and then​ execute efficiently computations on​‌ these results. Circuits also​​ provide the frontier of​​​‌ the feasible and the​ efficient about queries and​‌ they will allow us​​ to focus on the​​​‌ right query fragments. Finally,​ circuits are the most​‌ basic streaming algorithms we​​ are going to study​​​‌ and implement. The study​ of streaming methods will​‌ allow us to compose​​ efficiently these basic streaming​​​‌ algorithms and enable the​ creation of efficient querying​‌ pipelines.

The third axis,​​ the roof, is​​​‌ about proposing and exploiting​ information about data. The​‌ roof has two slopes​​, one is about​​ proposing and studying schemas​​​‌ for graphs and the‌ other is about taking‌​‌ into account knowledge about​​ data so as to​​​‌ accelerate querying. The first‌ slope is concerned with‌​‌ the schema language ShEx​​ which has been developed​​​‌ within Links (the team‌ from which D-DAL originates)‌​‌ and D-DAL will pursue​​ this effort as it​​​‌ gives us concrete problems‌ in relation to data‌​‌ for query optimization exploiting​​ knowledge about data. The​​​‌ second slope is precisely‌ about taking into account‌​‌ knowledge about data when​​ answering queries. This information​​​‌ should make algorithms more‌ efficient and also to‌​‌ work with uncertain data​​ more consistently. We will​​​‌ work on the specific‌ information coming from ShEx‌​‌ schemas, but also with​​ more general settings. Overall,​​​‌ the roof axis stands‌ upon the second axis‌​‌ as it aims at​​ refining the work done​​​‌ on querying and applying‌ it to settings where‌​‌ we can take advantage​​ from extra hypotheses in​​​‌ order to lower the‌ complexity.

In a a‌​‌ nutshell, the three axes​​ build a coherent architecture​​​‌ which aims to work‌ out the maze of‌​‌ data. We here give​​ a summary of their​​​‌ contributions to the two‌ challenges:

  • Extracting value from‌​‌ data
     
    • Basement
      Circuits are​​ natural and concise representations​​​‌ of data, be it‌ raw data or answers‌​‌ to queries. With tools​​ coming from knowledge compilation,​​​‌ they reveal the inherent‌ structure of data, which‌​‌ can be used independently​​ from the queries that​​​‌ we intend to evaluate‌ over it.
    • First floor‌​‌
      Querying is the basic​​ way to extract information​​​‌ from data, which can‌ then be refined into‌​‌ value. We plan to​​ explore various querying fragments​​​‌ that enrich the state‌ of the art of‌​‌ querying, including expressive queries​​ featuring recursion, aggregation, and​​​‌ uncertainty handling. It will‌ enable the extraction of‌​‌ richer information.
    • Roof
      The​​ development of ShEx is​​​‌ a concrete proposal for‌ describing data in ways‌​‌ that allow to extract​​ value from graph data​​​‌ by taking logical constraints‌ into account. Furthermore, taking‌​‌ advantage from extra information​​ for querying, such as​​​‌ deduction rules and integrity‌ constraints, will expand the‌​‌ capabilities of querying.
  • Efficient​​ data processing
     
    • Basement
      Circuits​​​‌ can play the role‌ of abstract proxy towards‌​‌ exploiting modern CPUs to​​ process data efficiently. Furthermore,​​​‌ they also play an‌ important role in the‌​‌ study of computational complexity,​​ and in particular low​​​‌ complexity classes, or classes‌ allowing parallel computation. Such‌​‌ classes will allow us​​ to describe the landscape​​​‌ of queries that can‌ be solved very efficiently‌​‌ on large data.
    • First​​ floor
      Query fragments, and​​​‌ algorithms to process them‌ efficiently, are the central‌​‌ focus of this axis.​​ In particular, the dynamic​​​‌ aspects of queries will‌ be studied in two‌​‌ aspects. First by proposing​​ enumeration and incremental algorithms​​​‌ for queries. Second by‌ studying streaming and the‌​‌ composition of streams.
    • Roof​​
      Taking advantage of information​​​‌ about data and developing‌ ShEx are central to‌​‌ exploiting real world data,​​ which often comes with​​​‌ additional information such as‌ schemas. We will refine‌​‌ querying algorithms to make​​​‌ them more efficient with​ extra hypotheses, in particular​‌ to choose query evaluation​​ plans based on statistics​​​‌ of the data.

Our​ overall goal is to​‌ design highly efficient querying​​ languages in various contexts​​​‌ (data formats, uncertainty, etc.)​ and the tools that​‌ allow to combine them​​ into efficient data processing​​​‌ programs. D-DAL is on​ the ridge between expressiveness​‌ and efficiency. We have​​ structured our project into​​​‌ tree axes that are​ detailed in the rest​‌ of this section. We​​ summarize their objectives below​​​‌ according to the two​ challenges we attack.

  • Extracting​‌ value from data
     
    • Short​​ term
      We are going​​​‌ to extend queries and​ data representation in several​‌ directions. First we will​​ work on enhancing circuit​​​‌ classes by negation, and​ we will study generalizations​‌ of structural constraints that​​ allow for efficient counting​​​‌ algorithms and hence handle​ quantitative uncertainty. Second we​‌ will also incorporate probabilities​​ in ShEx and work​​​‌ on queries on graphs​ with constraints. We will​‌ also study which logical​​ constraints can be used​​​‌ to reason on data,​ and which counterfactual problems​‌ (e.g., resilience, Shapley values)​​ can benefit from our​​​‌ existing expertise on related​ tasks. These short term​‌ objectives build on our​​ previous work and work​​​‌ towards more expressive query​ fragments.
    • Medium term
      We​‌ plan to take a​​ more general view on​​​‌ negation within circuits, in​ particular work more thoroughly​‌ in connection with knowledge​​ compilation. One goal is​​​‌ to obtain a better​ understanding of the probabilistic​‌ setting specifically. Another goal​​ is to work on​​​‌ open-world query answering in​ the presence of uncertainty​‌ and of recursion. We​​ also plan to study​​​‌ the lifting of constraints​ of data graphs to​‌ views on graphs. This​​ medium-term objectives aim at​​​‌ having a better qualitative​ understanding of data.
    • Long​‌ term
      We plan to​​ build a unifying view​​​‌ of the relation between​ circuit representations and querying​‌ problems. This will provide​​ us with an integrated​​​‌ view of how data​ can be represented. We​‌ also plan to work​​ on relating aggregation and​​​‌ uncertainty problems with formalisms​ used for constraint satisfaction​‌ problems, to explore how​​ such approaches can lead​​​‌ to new structural algorithms​ for query evaluation. We​‌ also hope that such​​ approaches can help connect​​​‌ the multiple communities that​ propose constraint languages for​‌ data, in particular graph-shaped​​ data. Finally, we plan​​​‌ to have ShEx be​ standardized by the IEEE​‌ and have a set​​ of open software tools​​​‌ for it.
  • Efficient data​ processing
     
    • Short term
      We​‌ are going to consolidate​​ our team's first works​​​‌ which relate circuits to​ efficient implementations based on​‌ SIMD instructions. We shall​​ also work on dynamic​​​‌ and incremental algorithms for​ simple data like texts​‌ and trees, again building​​ on our existing research​​​‌ providing first results in​ these directions. We will​‌ also study efficient enumeration​​ of query results that​​​‌ respect a specific order​ (the order is a​‌ good property for further​​ processing), and connect this​​​‌ with efficient dynamic algorithms.​ All these works give​‌ basic building blocks for​​ streaming algorithms, that we​​ plan to orchestrate via​​​‌ work on simple programs‌ called transducers.
    • Medium‌​‌ term
      We shall build​​ a circuit-based compilation chain,​​​‌ which goes from query‌ fragments to efficient code,‌​‌ generalizing previous work to​​ transducers. We will also​​​‌ study how the compilation‌ can be doubled with‌​‌ efficient query plan selection​​ at runtime, leveraging the​​​‌ efficient data structures that‌ we will have designed‌​‌ by then. We will​​ also explore how dynamic​​​‌ and incremental algorithms can‌ accommodate expressive features such‌​‌ as recursion and aggregation​​ without compromising the effectiveness​​​‌ of query evaluation, and‌ identify tractability assumptions that‌​‌ can be leveraged to​​ that end. In particular,​​​‌ in connection with enumeration‌ algorithms, we shall work‌​‌ on how to efficiently​​ compose data processes, in​​​‌ particular connecting enumeration algorithms‌ that produce sequence of‌​‌ diffs with incremental algorithms​​ that maintain a query​​​‌ result over evolving data.‌
    • Long term
      We shall‌​‌ propose an end-to-end compilation​​ framework for efficient query​​​‌ compilation and the orchestration‌ of heterogeneous systems, including‌​‌ expressive tasks such as​​ reasoning and optimization. We​​​‌ shall take the direction‌ of probabilistic programming to‌​‌ sacrifice precision in favor​​ of efficiency for the​​​‌ evaluation of queries, guided‌ by a fine-grained theoretical‌​‌ understanding of the possible​​ approximation tradeoffs. We will​​​‌ also undertake this precise‌ complexity study in the‌​‌ dynamic and incremental settings​​ as well as for​​​‌ enumeration problems, again aiming‌ at a full understanding‌​‌ of the landscape of​​ possible complexity regimes.

3.1​​​‌ The Basement: Circuits for‌ Data Manipulation

Large data‌​‌ poses challenges that are​​ both theoretical and practical.​​​‌ D-DAL intends to use‌ circuits as a common‌​‌ abstraction to address several​​ of these problems. Circuits​​​‌ are an abstract object‌ that can be approached‌​‌ from various angles, corresponding​​ to different research communities.​​​‌ First, circuits are studied‌ from the knoweldge compilation‌​‌ perspective, an AI field​​ which studies classes of​​​‌ Boolean circuits and which‌ has connections to data‌​‌ management 24. Second,​​ circuits are studied as​​​‌ a compilation target,‌ this time to translate‌​‌ programs to abstract representations​​ which ensure that they​​​‌ can be evaluated in‌ a massively parallel or‌​‌ vectorized way. Third, circuits​​ can be studied from​​​‌ a fully theoretical angle,‌ in the field of‌​‌ circuit complexity theory.​​ We now present these​​​‌ axes.

Studying circuits from‌ Knowledge Compilation.

Participants: Antoine‌​‌ Amarilli, Mikaël Monet​​, Charles Paperman,​​​‌ Sylvain Salvati.

Knowledge‌ compilation 31 is a‌​‌ subfield of AI that​​ studies different ways to​​​‌ represent Boolean functions, in‌ particular restricted classes of‌​‌ Boolean circuits. It​​ studies tradeoffs between conciseness​​​‌ and tractability: which‌ representations can be used‌​‌ for which computational tasks​​ (e.g., counting satisfying assignments),​​​‌ and when can we‌ efficiently translate from one‌​‌ representation to another.

It​​ turns out that knowledge​​​‌ compilation is a useful‌ tool to solve many‌​‌ types of database tasks​​ 24, such as​​​‌ factorized representations of query‌ answers, enumeration algorithms, provenance‌​‌ computation, and uncertainty management​​ (in particular on probabilistic​​​‌ data). We wish to‌ pursue this study of‌​‌ knowledge compilation to databases​​​‌ and query evaluation, and​ also investigate knowledge compilation​‌ in itself to achieve​​ a better understanding of​​​‌ these circuit classes.

Short​ term.

In the short​‌ term, we will explore​​ how specific circuit classes​​​‌ can be used for​ efficient evaluation of particular​‌ queries, across various tasks​​ such as enumeration and​​​‌ ranked access. The PhD​ thesis of Oliver Irwin​‌ is set in this​​ context: it is co-directed​​​‌ by Florent Capelli from​ CRIL and Sylvain Salvati.​‌ It studies how to​​ efficiently find query answers​​​‌ using circuits for queries​ that may feature negation,​‌ and how to reinterpret​​ efficient query evaluation algorithms​​​‌ in a knowledge compilation​ perspective. We further intend​‌ to pursue our study​​ of query evaluation on​​​‌ probabilistic databases via circuit​ classes, understanding the power​‌ and limitations of this​​ method.

Medium term.

One​​​‌ longer-term challenge to be​ attacked by D-DAL is​‌ to understand the power​​ of negation in circuit​​​‌ formalisms that allow tractable​ counting. Indeed, for the​‌ circuit classes commonly used​​ in knowledge compilation 21​​​‌, the use of​ negation is highly restricted.​‌ However, it is not​​ well-understood whether the expressive​​​‌ power of such circuits​ is extended when we​‌ allow arbitrary negation. This​​ question is purely about​​​‌ knowledge compilation circuit classes,​ but it also has​‌ consequences for questions in​​ database theory about tractable​​​‌ query evaluation on probabilistic​ databases 27. Another​‌ challenge is whether knowledge​​ compilation classes can also​​​‌ capture more tractable enumeration​ tasks, in particular those​‌ involving negations (e.g., enumerating​​ objects which do not​​​‌ have a given property);​ these would be necessary​‌ for a knowledge-compilation-based proof​​ of tractable enumeration for​​​‌ some formalisms such as​ first-order logic on restricted​‌ graphs 43.

Long​​ term.

A far-reaching question​​​‌ in knowledge compilation and​ database theory concerns the​‌ relationship between two kinds​​ of results: algorithms (namely,​​​‌ efficient query evaluation algorithms,​ and computational complexity hardness​‌ results), and tractable circuits​​ (namely, efficient compilation algorithms,​​​‌ and lower bounds on​ the size of tractable​‌ circuits). One long-term research​​ direction of the D-DAL​​​‌ team would be to​ unify these settings. For​‌ instance, we could try​​ to understand which hardness​​​‌ results on query evaluation​ can imply circuit lower​‌ bounds, e.g., in the​​ context of recursive queries​​​‌ 25; and when​ tractable algorithms can be​‌ translated to circuit classes,​​ subsuming the inclusion-exclusion conjecture​​​‌ 27.

Another long-term​ goal of D-DAL could​‌ be to consolidate existing​​ results about knowledge compilation.​​​‌ The current reference on​ the topic is the​‌ seminal work of Darwiche​​ and Marquis 31,​​​‌ but it dates back​ to 2002. We have​‌ started a survey of​​ knowledge compilation in 21​​​‌, but far more​ efforts would be needed​‌ in order to survey​​ all known results for​​​‌ knowledge compilation circuit classes.​

Circuits as a compilation​‌ target for parallelism and​​ vectorization.

 

Participants: Charles Paperman​​​‌, Sylvain Salvati.​

Most of the data​‌ exchanged on the web​​ is now represented as​​​‌ JSON files which are​ text documents (other document​‌ formats are also in​​ use such as XML​​ or YAML documents). This​​​‌ is a consequence of‌ the fact that many‌​‌ data engine now operate​​ and communicate with these​​​‌ kinds of data. Extracting‌ information and answering queries‌​‌ from textual documents has​​ become a pervasive task​​​‌ when dealing with large‌ amount of data. However‌​‌ with the growth of​​ data that comes with​​​‌ larger and larger documents,‌ processing textual documents has‌​‌ become an important bottleneck​​ in data processing.

For​​​‌ the moment this bottleneck‌ is addressed first by‌​‌ speeding up the translation​​ of the documents into​​​‌ in memory data structures‌ on which queries are‌​‌ then executed. This translation,​​ also called parsing,​​​‌ is accelerated via low‌ level optimizations. The quickest‌​‌ implementations have required expert​​ programmers and complex costly​​​‌ developments. They leverage specific‌ CPU instructions known as‌​‌ vectorial instructions. The​​ diversity of CPU architectures​​​‌ require the production of‌ highly specialized code for‌​‌ each of them. This​​ work is worth doing​​​‌ since the resulting code‌ is often accelerated by‌​‌ one or two orders​​ of magnitude compared to​​​‌ the non-vectorial variant.

D-DAL‌ aims at developing tools‌​‌ that automatize these optimization​​ and reduce the cost​​​‌ of vectorizing programs that‌ handle textual data. We‌​‌ will not only optimize​​ parsing, but a combination​​​‌ of parsing with query‌ resolution. The PhD thesis‌​‌ of Claire Soyez-Martin 46​​, 41 (conducted under​​​‌ the supervision of Charles‌ Paperman and Sylvain Salvati)‌​‌ advocates that such automations​​ can be done via​​​‌ a notion of vectorial‌ circuits. Our plan‌​‌ is two fold: first​​ study the compilation of​​​‌ expressive queries into vectorial‌ circuits and then build‌​‌ dedicated compilation schemes for​​ vectorial circuits that takes​​​‌ benefits of vectorial instructions.‌

Short term.

In continuation‌​‌ of the work engaged​​ with the PhD thesis​​​‌ of Claire Soyez-Martin, we‌ are going to develop‌​‌ further the model of​​ vectorial circuits. The aim​​​‌ is to obtain a‌ cost model for these‌​‌ objects that reflects the​​ efficiency of their compilation​​​‌ to CPU and vectorial‌ instructions. We will also‌​‌ identify application domains that​​ can benefit from our​​​‌ methods. We have already‌ identified several possibilities: biological‌​‌ sequence analysis, network monitoring,​​ pervasive streaming tasks in​​​‌ relation with data (e.g.‌ utf8 validation, JSON processing,‌​‌ etc.)

Medium term.

We​​ plan to develop a​​​‌ compilation chain that will‌ contain two steps. The‌​‌ first step will consist​​ in sets of compilers​​​‌ from query fragments to‌ vectorial circuits. The second‌​‌ will be a set​​ compilers from vectorial circuits​​​‌ to efficient vectorized hardware‌ code. At first we‌​‌ will aim at X86_64​​ architectures. This compilation pipeline​​​‌ will be the result‌ of ground work conducted‌​‌ in algebraic automata theory.​​

Long term.

We plan​​​‌ to extend our work‌ to other hardware components.‌​‌ One of them is​​ the RiscV platform which​​​‌ has recently been endowed‌ with an interesting set‌​‌ of vectorial instructions. Another​​ one is the processing​​​‌ in memory units (PIM)‌ that provide a highly‌​‌ parallel architecture that could​​ benefit from the relationship​​​‌ between automata theory and‌ circuits.

Foundations of circuits:‌​‌ circuit complexity, algebra and​​​‌ abstract machines.

 

Participants: Antoine​ Amarilli, Mikaël Monet​‌, Charles Parperman,​​ Sylvain Salvati.

Our​​​‌ work on circuits takes​ its root in deep​‌ connections between logic, formal​​ language and algebra. These​​​‌ deep connections are at​ the heart of finite​‌ state verification techniques that​​ have been flourishing in​​​‌ the last decades. We​ exploit these connections in​‌ a different way. For​​ us logic is connected​​​‌ to queries as logical​ languages naturally express properties​‌ on object which in​​ turn coincides with the​​​‌ use of declarative languages​ to extract information from​‌ data. Moreover, logical formulae​​ on data of bounded​​​‌ size can be represented​ as circuits which extract​‌ information from data. Languages​​ or language transformations are​​​‌ a natural counter part​ of logical formulae. In​‌ general, they can be​​ represented as finite state​​​‌ machines. Finally, algebra (monoids,​ semigroups, forest algebras) naturally​‌ expresses the invariants of​​ these machines. Classes of​​​‌ algebra, computation power of​ finite state machines, complexity​‌ of circuits are then​​ deeply connected. The classification​​​‌ of algebras with respect​ to their properties often​‌ coincides with natural classes​​ of machines and of​​​‌ circuits. Algebras form a​ useful mediation between machines​‌ and circuits, that is,​​ between sequential devices and​​​‌ parallel ones. More generally,​ algebras formalize the structure​‌ of the information that​​ needs to be captured​​​‌ when processing queries. It​ thus plays a central​‌ role in many of​​ the applications D-DAL aims​​​‌ at developing. We plan​ to develop our understanding​‌ of these deep connections​​ in relation with querying​​​‌ with applications in many​ objectives of D-DAL, most​‌ notably: vectorization, incremental maintenance,​​ enumeration, transducers, etc.

Short​​​‌ term.

In connection to​ vectorization, we shall explore​‌ particular classes of circuits​​ and describe the classes​​​‌ of languages they correspond​ to. We shall also​‌ develop algebraic tools for​​ forest algebras to enhance​​​‌ the stream processing on​ tree structured documents. In​‌ particular, we wish to​​ extend the expressiveness of​​​‌ the queries that our​ software rsonpath can solve.​‌ Concerning incremental maintenance and​​ enumeration, we shall workout​​​‌ some algebraic invariants on​ trees that will allow​‌ us to extend the​​ methods that we have​​​‌ developed for strings.

Medium​ term.

The algebraic work​‌ on data (strings, trees,​​ graphs, ...) can be​​​‌ extended to more dynamic​ objects like transducers and​‌ even to programs via​​ denotational semantics. In general,​​​‌ queries not only check​ properties of data, but​‌ also transform data. Extending​​ algebraic methods to such​​​‌ dynamic setting will allow​ us to abstract data​‌ processing pipelines and open​​ the way to further​​​‌ optimizations.

Long term.

In​ the long term, we​‌ aim at developing a​​ coherent corpus of algebraic​​​‌ methods for trees and​ graphs that lend themselves​‌ to classify queries on​​ structured documents and graphs.​​​‌ We wish also to​ attack difficult classification problems​‌ that remain open on​​ circuit complexity for classes​​​‌ of regular languages. This​ ground work shall set​‌ the stage for future​​ applications similar to rsonpath​​​‌.

3.2 First Floor:​ Logic and Query Evaluation​‌

We now present scientific​​ objectives which operate at​​ a higher level, and​​​‌ deal with query evaluation‌ itself, rather than with‌​‌ the tools presented in​​ the previous section which​​​‌ are ways to implement‌ query evaluation.

Provenance, explanation,‌​‌ aggregation, counting, uncertainty, probabilistic​​ data, Shapley, approximation algorithms.​​​‌

 

Participants: Antoine Amarilli,‌ Mikaël Monet.

As‌​‌ already mentioned in Section​​ 2, querying data​​​‌ often goes beyond merely‌ computing a set of‌​‌ answers and returning them.​​ First, data is often​​​‌ uncertain by nature (e.g.,‌ probabilistic or missing values).‌​‌ Then, algorithms must be​​ developed that take into​​​‌ account this uncertainty, by‌ considering all possible states‌​‌ of the data; which​​ turns out to often​​​‌ be intractable. In this‌ context, approximate query answers‌​‌ (e.g., approximating the probability​​ that a probabilistic database​​​‌ satisfies a query) can‌ be employed to lower‌​‌ the complexity. Second, one​​ sometimes wishes to explain​​​‌ the query results, using‌ for instance the notion‌​‌ of provenance or using​​ specific scoring mechanisms such​​​‌ as the Shapley value‌ or the notion of‌​‌ resilience of a query​​ result.

Short term.

It​​​‌ is known that when‌ probabilistic data has bounded‌​‌ treewidth, then probabilistic​​ query evaluation (PQE) can​​​‌ be solved efficiently for‌ many queries. We shall‌​‌ explore restrictions that are​​ less limiting than bounded​​​‌ treewidth but could still‌ ensure tractable PQE. One‌​‌ idea would be to​​ design width measures that​​​‌ could be parameterized by‌ the queries. Ideally, we‌​‌ would also obtain lower​​ bounds for specific width​​​‌ measures and queries as‌ it was done in‌​‌ 28.

Medium term.​​

Another research direction would​​​‌ be to explore the‌ tractability of diverse counterfactual‌​‌ scoring mechanisms, such as​​ Shapley values as done​​​‌ in 32, 39‌, or the resilience‌​‌ 35 (intuitively representing how​​ “robust” a query answer​​​‌ is, by determing how‌ many input tuples should‌​‌ be removed from the​​ database so that the​​​‌ specific answer no longer‌ exists) for queries featuring‌​‌ recursion.

Long term.

An​​ intriguing and potentially fruitful​​​‌ research direction would be‌ to connect the kind‌​‌ of counting problems studied​​ in database theory (such​​​‌ as PQE or Shapley‌ values) to other well-established‌​‌ counting problems from other​​ fields such as counting​​​‌ versions of constraint satisfaction‌ problems (CSPs) 33,‌​‌ 38 or Holant problems​​ and holographic reduction 30​​​‌. We will also‌ strive to provide a‌​‌ toolbox to use these​​ algorithms in practice, leveraging​​​‌ for instance the existing‌ implementation of ProvSQL 45‌​‌, an open-source PostgreSQL​​ extension that computes (between​​​‌ other things) the Boolean‌ provenance of a query‌​‌ over a database.

Dynamic​​ data: incremental and dynamic​​​‌ algorithms and connections with‌ dynamic algorithms.

 

Participants: Antoine‌​‌ Amarilli, Mikaël Monet​​, Charles Paperman,​​​‌ Sylvain Salvati.

In‌ many query evaluation scenarios,‌​‌ the data to interrogate​​ is regularly updated. When​​​‌ re-evaluating queries multiple times‌ over different versions of‌​‌ the same data, it​​ is more efficient to​​​‌ build index data structures‌ which can maintain the‌​‌ results of the query​​ under updates, rather than​​​‌ re-evaluating the query every‌ time from scratch. This‌​‌ problem of query evaluation​​​‌ over dynamic data has​ been studied in many​‌ different contexts. The D-DAL​​ team will focus on​​​‌ two broad kinds of​ contexts: tree-shaped data on​‌ the one hand, which​​ includes texts, trees, genomic​​​‌ sequences, etc.; and unrestricted​ data, namely graph​‌ databases and general relational​​ databases.

Short term.

In​​​‌ the short term, we​ intend to focus on​‌ the first application setting​​ of texts and trees,​​​‌ and pursue our ongoing​ work in these settings.​‌ While the main results​​ in the area have​​​‌ already been proven, some​ work remains in expanding​‌ the frontiers of what​​ can be done efficiently​​​‌ over dynamic text and​ trees: can we maintain​‌ data structures for in-order​​ enumeration? in which orders?​​​‌ with which complexity? One​ important direction which remains​‌ open is that of​​ showing improved bounds on​​​‌ query evaluation which depends​ on the specific query​‌ (rather than for a​​ general query class). From​​​‌ our initial results for​ regular queries on words​‌ 26, one especially​​ promising ongoing effort is​​​‌ to extend these results​ to trees. Another is​‌ to extend such results​​ to enumeration data structures.​​​‌ In both cases, significant​ algebraic groundwork is required.​‌

Medium term.

We then​​ intend to focus on​​​‌ broader directions. One of​ them is to support​‌ more expressive updates:​​ the bulk of the​​​‌ literature on dynamic data​ focuses on single-character or​‌ single-tuple changes, but real​​ updates are often more​​​‌ expressive. On text, we​ want to be able​‌ to support copy-and-paste updates,​​ which move an entire​​​‌ part of the text​ to a different point;​‌ or search-and-replace updates. On​​ relational databases, we want​​​‌ to develop incremental algorithms​ that cope with updates​‌ which are themselves expressed​​ in SQL, i.e., with​​​‌ UPDATE statements. Another direction​ is to move to​‌ the study of incremental​​ maintenance for more expressive​​​‌ queries, in particular​ those featuring recursion or​‌ aggregation. For instance, can​​ we characterize how efficiently​​​‌ we can maintain the​ results of Datalog queries,​‌ with fine bounds that​​ depend on the specific​​​‌ Datalog query which is​ posed? Which assumptions on​‌ the underlying data can​​ be used to achieve​​​‌ tractable algorithms?

Long term.​

A more open direction​‌ for query evaluation over​​ dynamic data would be​​​‌ to achieve precise bounds​ on the complexity of​‌ maintaining each individual query​​ by charting the space​​​‌ of possible tradeoffs across​ various performance measures: what​‌ is the time needed​​ to handle updates to​​​‌ the data; what is​ the delay between successive​‌ results? How much memory​​ usage is required to​​​‌ store the data structure?​ Which updates are supported?​‌ In this light, one​​ especially challenging obstacle is​​​‌ the need for suitable​ lower bound techniques, in​‌ particular unconditional lower bounds,​​ which are sometimes achievable​​​‌ depending on the model​ studied. Another general question​‌ is to connect more​​ broadly to the literature​​​‌ studying data structures, in​ which questions about mutable​‌ data have been investigated​​ for decades. We could​​​‌ for instance leverage results​ on maintaining search trees​‌ balanced under updates, or​​ algorithms on dynamic graphs​​ 37 which have been​​​‌ studied in the algorithms‌ community (especially algorithms to‌​‌ dynamically maintain the connectivity​​ of directed graphs).

Efficient​​​‌ evaluation: streaming and parallelism.‌

 

Participants: Aurélien Lemay,‌​‌ Charles Paperman, Sylvain​​ Salvati.

Efficiency in​​​‌ data processing mostly comes‌ from streaming data. This‌​‌ means that data is​​ passed from small processing​​​‌ units to others so‌ as to produce an‌​‌ output. Certain operations cause​​ serious synchronization problems (sorting​​​‌ data, eliminating duplicates, combining‌ streams, etc.) they block‌​‌ the processing and cause​​ sometimes important lags. Sticking​​​‌ to functional programming, D-DAL‌ will develop streaming data‌​‌ processing programs by using​​ static analysis, speculative computations​​​‌ (to prepare results when‌ lags are unavoidable), develop‌​‌ stream fusion compilation methods​​ so as to obtain​​​‌ efficient streaming algorithms. Techniques‌ coming from synchronous programming‌​‌ will be considered, in​​ particular in connection with​​​‌ the leveraging of fine-grained‌ parallelism for low level‌​‌ streaming. Also, in the​​ medium term, we shall​​​‌ explore how to parallelize‌ query executions taking advantage‌​‌ of static analyses of​​ the functional code.

Short​​​‌ term.

Part of the‌ effort will concern very‌​‌ simple classes of functional​​ programs known as transducers​​​‌. D-DAL will implement‌ libraries for manipulating transducers,‌​‌ minimizing them and composing​​ them. Several extensions of​​​‌ the models of the‌ literature are being proposed‌​‌ by the team and​​ will benefit from these​​​‌ implementations. Of particular interest‌ to us are higher-order‌​‌ transducers which form an​​ elegant generalization of many​​​‌ transducers from the literature.‌ We shall work on‌​‌ compiling them for streaming​​ documents.

Medium term.

Orchestrating​​​‌ programs that stream data‌ is challenging as they‌​‌ may work at different​​ paces and as communications​​​‌ between them do not‌ have a negligible cost.‌​‌ We will work on​​ this orchestration problem in​​​‌ connection with programming languages‌ theory, most notably with‌​‌ synchronous and functional programming​​ languages. Indeed, when we​​​‌ have compiled programs with‌ our method, we shall‌​‌ be able to efficiently​​ compose them. Many technical​​​‌ methods for achieving this‌ have been developed for‌​‌ these programming languages.

Long​​ term.

In the long​​​‌ run, we will take‌ stock on the work‌​‌ of the community of​​ functional probabilistic programming. Indeed,​​​‌ when working with huge‌ data, approximate querying becomes‌​‌ hardly avoidable and thus​​ extending our methods to​​​‌ sampling data with various‌ probabilistic distributions becomes necessary.‌​‌ In particular, we shall​​ see how we implement​​​‌ probabilistic querying by means‌ of transducers.

Enumerating the‌​‌ results of recursive queries.​​

 

Participants: Antoine Amarilli,​​​‌ Mikaël Monet, Charles‌ Paperman, Sylvain Salvati‌​‌.

One important challenge​​ in modern database theory​​​‌ is efficient enumeration algorithms‌ 44, i.e., efficiently‌​‌ producing query results one​​ after the other. This​​​‌ is related to query‌ evaluation in streaming, but‌​‌ with a key difference:​​ here we are looking​​​‌ for algorithms which produce‌ a stream of results,‌​‌ instead of consuming a​​ stream of input data.​​​‌ Database theory has studied‌ which queries can efficiently‌​‌ be evaluated, but an​​ especially challenging topic is​​​‌ to understand under which‌ assumptions it can be‌​‌ done for recursive queries​​​‌, e.g., Datalog queries.​ This task can in​‌ particular be studied over​​ restricted formalisms of input​​​‌ data, such as textual​ data. This is the​‌ focus of document spanners​​ 34, which are​​​‌ a declarative way to​ specify how to extract​‌ matches of queries over​​ textual documents. There, recursivity​​​‌ is needed to express​ long-distance dependencies (e.g., extract​‌ all phone numbers that​​ are between two specific​​​‌ delimiters, no matter how​ far).

Short term.

One​‌ immediate research objective is​​ to study which enumeration​​​‌ tasks for (recursive) queries​ can be efficiently solved,​‌ in particular when considering​​ variations on the standard​​​‌ setting of enumeration algorithms.​ For instance, in many​‌ applications, we want to​​ produce results in a​​​‌ specific order; to​ directly access individual results​‌ by their offset; or​​ to efficiently test whether​​​‌ a candidate result is​ actually a solution, or​‌ finding out its position​​ in the enumeration order.​​​‌ We intend to pursue​ our ongoing work in​‌ such directions 22.​​

Medium term.

A medium-term​​​‌ research question is the​ combination of enumeration algorithms​‌ with the previous axis​​ of handling dynamic data​​​‌. Previous work by​ members of the team​‌ has already studied how​​ enumeration algorithms could be​​​‌ applied to dynamic data​ via knowledge compilation methods​‌ 23. However, we​​ are still missing an​​​‌ understanding of how efficient​ such algorithms can be​‌ made, especially if we​​ want query-dependent algorithms. This​​​‌ would probably require an​ algebraic theory of queries​‌ which can return results​​ (to cover document spanners),​​​‌ in particular understanding their​ connection to transducers.

A​‌ related medium-term research question​​ is the idea of​​​‌ enumerating query results via​ updates, i.e., producing​‌ each query results not​​ from scratch but by​​​‌ editing a previous result.​ Team members have started​‌ investigating this question in​​ a specific context 29​​​‌, but this enumeration​ framework could be fruitful​‌ in many other settings.​​ It also has an​​​‌ intriguing connection to the​ question of query evaluation​‌ over dynamic data, because​​ the two problems are​​​‌ related: query evaluation on​ dynamic data takes a​‌ stream of changes as​​ input and maintain query​​​‌ results, whereas enumerating query​ results via updates will​‌ produce a stream of​​ changes to go over​​​‌ all results to a​ query.

Long term.

Most​‌ research about enumeration algorithms​​ for recursive queries using​​​‌ circuit methods focus on​ the case of textual​‌ data (e.g., document spanners)​​ and tree-shaped data. This​​​‌ leaves open the setting​ of more general data,​‌ in particular graph data.​​ Of course, the task​​​‌ has already been studied,​ in particular in the​‌ comparatively simple setting of​​ regular path queries 40​​​‌. But we are​ still missing an understanding​‌ of those enumeration tasks​​ that can be solved​​​‌ efficiently in more expressive​ query languages (e.g., Datalog​‌ queries) and depending on​​ the various possible semantics​​​‌ (e.g., returning simple paths,​ trails, shortest paths, or​‌ arbitrary walks). As many​​ of these tasks will​​​‌ be intractable, efficient algorithms​ will probably also need​‌ to identify conditions on​​ the input data that​​ are sufficient to ensure​​​‌ tractability.

Another wide open‌ direction for long-term research‌​‌ in enumeration algorithms concerns​​ lower bounds. Indeed, many​​​‌ lower bounds are conditional,‌ and they usually do‌​‌ not apply to enumeration​​ but to related problems​​​‌ (e.g., checking whether a‌ solution exists, or materializing‌​‌ all solutions). An especially​​ difficult challenge would be​​​‌ to develop lower bound‌ methods that show the‌​‌ intractability of efficient enumeration​​ per se.

3.3 The​​​‌ roof: Knowledge on Data‌

Optimization for schemas for‌​‌ graphs.

 

Participants: Antoine Amarilli​​, Iovka Boneva,​​​‌ Charles Paperman.

The‌ role of schemas for‌​‌ graph data is a​​ bit different than schemas​​​‌ for relational data. The‌ latter are prescriptive (i.e.‌​‌ data is organized exactly​​ like this) and are​​​‌ an essential ingredient for‌ static optimization of queries.‌​‌ The former are de​​ facto often descriptive (i.e.​​​‌ data should look more‌ or less like this).‌​‌ Such looseness of schemas​​ is unavoidable, and even​​​‌ desirable, for big data‌ stores or knowledge bases‌​‌ which are constantly evolving,​​ fed with data from​​​‌ numerous sources that can‌ be incomplete, inconsistent and‌​‌ contain errors.

D-DAL aims​​ at providing tools for​​​‌ increasing the usefulness of‌ such approximate schemas in‌​‌ the context of dynamically​​ evolving data. This line​​​‌ of research continues our‌ work of the formalization,‌​‌ the standardization the development​​ and the tooling of​​​‌ the shape expressions language‌ (ShEx) for RDF graphs,‌​‌ and aims at extending​​ on some of our​​​‌ previous results and approaches‌ to property graphs and‌​‌ PG-Schemas.

Short term.

A​​ first short term objective​​​‌ is to enrich graph‌ schema languages such as‌​‌ ShEx with probabilistic information​​ indicating the degree of​​​‌ compliance to the actual‌ data. We will aim‌​‌ at a practical extension​​ with limited expressiveness in​​​‌ order to keep feasible‌ the associated computational tasks‌​‌ (data validation, schema extraction)​​ while still helping experts​​​‌ in assessing the quality‌ of a data set,‌​‌ detecting errors and monitoring​​ the evolution of data.​​​‌ This approach will be‌ built on our previous‌​‌ work in discovering schemas​​ from existing data, while​​​‌ extending it to property‌ graph schemas.

Medium term.‌​‌

As all software, data-centric​​ applications constantly evolve, either​​​‌ to match new use‌ cases, or to exploit‌​‌ new data sources, or​​ both. This most often​​​‌ requires to transform the‌ data model in order‌​‌ to match the new​​ processing needs. As a​​​‌ mid-term objective we plan‌ to investigate algorithms for‌​‌ incremental and conjoint data​​ and schema evolution, based​​​‌ on logic tools such‌ as mappings and the‌​‌ chase.

Although there exist​​ a number of native​​​‌ graph data stores, lots‌ of data is effectively‌​‌ stored in relational or​​ other data stores (e.g.,​​​‌ Wikibase for wikidata) and‌ is only exposed as‌​‌ a graph to the​​ user. There might also​​​‌ be the case that‌ data from streams is‌​‌ processed and integrated to​​ a knowledge graph. The​​​‌ original data can contain‌ schema information or satisfy‌​‌ some constraints. As a​​ mid-term objective, we plan​​​‌ to investigate lifting such‌ constraints from the underlying‌​‌ data to the graph​​​‌ view. We will focus​ on useful and feasible​‌ sub-classes.

Long term.

Finally,​​ we will continue our​​​‌ participation in Shape Expressions​ Working group which defines​‌ the ShEx language, currently​​ in the process of​​​‌ standardization at IEEE. We​ will also continue to​‌ participate in the implementation​​ of a validator for​​​‌ ShEx and related tools​ as open software.

Static​‌ analysis with constraints (open-world​​ query answering, query rewriting​​​‌ under constraints, consistency, certain​ answers).

 

Participants: Antoine Amarilli​‌, Iovka Boneva,​​ Aurélien Lemay, Sylvain​​​‌ Salvati, Sophie Tison​.

Contrary to the​‌ closed world assumption of​​ traditional database research, the​​​‌ absence of a piece​ of information often does​‌ not mean that this​​ piece of information is​​​‌ false, but rather that​ we do not know​‌ its status; this is​​ called the open world​​​‌ assumption. In this​ scenario, the data is​‌ typically enriched by constraints​​ that it should satisfy,​​​‌ such as key, cardinality​ or equality constraints. Several​‌ questions arise when querying​​ data under the open​​​‌ world assumption. Is the​ data consistent with the​‌ constraints we know about​​ it? Taking into account​​​‌ the constraints and the​ fact that the information​‌ is incomplete, what degree​​ of certainty can we​​​‌ have about the answer​ set of queries? Is​‌ it possible to take​​ advantage of the constraints​​​‌ so as to execute​ simpler queries (e.g., using​‌ rewriting techniques)? Moreover, it​​ may not be possible​​​‌ to know which constraints​ apply to the data​‌ at stake, in which​​ case it can then​​​‌ be desirable to infer​ knowledge using machine learning​‌ methods. D-DAL will apply​​ the querying methods it​​​‌ develops to that setting​ so as to produce​‌ added values from knowledge​​ bases.

Short term.

We​​​‌ will pursue work we​ have engaged with the​‌ PhD of Lily Gallois​​ 36 on regular path​​​‌ queries (RPQs) and word​ constraints on graph databases.​‌ Word constraints are the​​ simplest kind of constraints​​​‌ that can be put​ on a graph database;​‌ they seem to naturally​​ connect with rewrite systems.​​​‌ As we showed in​ 42, this relation​‌ is however more complicated​​ that expected. We are​​​‌ going to study further​ the classes of constraints​‌ that enable reasoning about​​ RPQs and their decidability.​​​‌ We will also explore​ the problems of certain​‌ query answering and query​​ rewriting in this setting.​​​‌ The aim is to​ propose efficient query resolution​‌ algorithms from which we​​ can build more complex​​​‌ queries, in particular queries​ with aggregation.

We are​‌ also going to work​​ on schema and constraints​​​‌ inference for graph databases.​ We have skills in​‌ symbolic learning methods which​​ have proved useful in​​​‌ extracting information from the​ web. Facing large graphs,​‌ these methods will have​​ difficulties to scale, moreover​​​‌ they are often not​ robust to noise. Machine​‌ learning methods, typically neural​​ networks, are more immune​​​‌ to this problem. We​ shall develop a hybrid​‌ approach: the method that​​ we envision consists in​​​‌ learning structures with neural​ networks and then extracting​‌ the information that is​​ useful in the context​​ of data processing by​​​‌ means of symbolic techniques.‌

Medium term.

The D-DAL‌​‌ team intends to study​​ several medium-term problems related​​​‌ to open-world query answering‌ and reasoning under rules.‌​‌ One especially interesting question​​ is how the notion​​​‌ of circuits can be‌ leveraged to achieve tractable‌​‌ algorithms for these tasks.​​ The hope is that​​​‌ circuits can be a‌ versatile tool unlocking many‌​‌ results that would otherwise​​ need to be shown​​​‌ independently: supporting uncertain facts‌ (and uncertain rules); providing‌​‌ provenance information to capture​​ how a specific fact​​​‌ is derived and explain‌ to the user why‌​‌ a query result is​​ true (given the data​​​‌ and the rules); giving‌ factorized representations of a‌​‌ possibly large number of​​ query answers. At a​​​‌ technical level, a broad‌ question is how forward-chaining‌​‌ deduction techniques such as​​ the chase can be​​​‌ extended to create a‌ circuit representation of the‌​‌ derivations. This is challenging​​ because the chase, unlike​​​‌ circuits, may represent some‌ cyclic dependencies, which may‌​‌ cause convergence issues depending​​ on the semiring in​​​‌ which the computation is‌ performed. Another question is‌​‌ how this can be​​ combined with backward-chaining methods​​​‌ such as query rewriting.‌

Long term.

In the‌​‌ long run, one unifying​​ challenge of the ontology-mediated​​​‌ query answering community is‌ the question of unifying‌​‌ different logical frameworks. Indeed,​​ while query languages are​​​‌ relatively standard, rules can‌ nowadays be specified in‌​‌ many different incompatible ways:​​ description logics, existential rules,​​​‌ graph patterns, database constraint‌ languages such as TGDs,‌​‌ etc. It is challenging​​ to understand the connections​​​‌ between these settings and‌ difficult to conceive systems‌​‌ that could orchestrate query​​ evaluation across back-ends using​​​‌ different query formalisms. Another‌ more algorithmic challenge concerns‌​‌ the complexity of ontology-mediated​​ queries: while there are​​​‌ many results showing complexity‌ upper and lower bounds‌​‌ for general classes (e.g.,​​ guarded TGDs), it is​​​‌ much more challenging to‌ try to identify precise‌​‌ complexities of individual queries​​, in the spirit​​​‌ of what is done‌ for less expressive queries‌​‌ such as conjunctive queries.​​ Thus, the D-DAL team​​​‌ aims at making progress‌ towards this very ambitious‌​‌ theoretical goal.

4 Application​​ domains

The project of​​​‌ the D-DAL team aims‌ at efficiently extracting information‌​‌ from large datasets. We​​ aim to process data​​​‌ in all the diversity‌ of its representations, and‌​‌ to take into account​​ its quality, including questions​​​‌ of uncertainty and probability.‌ Our approach is to‌​‌ develop theoretical results that​​ can later be implemented​​​‌ into concrete programs. As‌ such, our project relates‌​‌ to several Inria strategic​​ axes:

  • Data Science and​​​‌ Artificial Intelligence
    The work‌ of our team will‌​‌ study how to efficiently​​ produce data for other​​​‌ downstream applications that analyze‌ it or aggregate it‌​‌ from various sources. This​​ is key in analysing​​​‌ data where large chunks‌ of data are processed‌​‌ to produce statistics and​​ other kinds of aggregates.​​​‌ Apart from efficient data‌ processing, the longer terms‌​‌ objectives of the project​​ aim to incorporate randomized​​​‌ programming to support wider‌ possible scenarios in data‌​‌ science. Similarly, machine learning​​​‌ algorithms are fed with​ data. These algorithms require​‌ efficient data manipulation and​​ also a control over​​​‌ data quality. Our work​ on uncertainty and provenance​‌ is useful in this​​ setting. Finally, the older​​​‌ sense of AI (symbolic​ AI) is connected to​‌ D-DAL through the study​​ of circuits and knowledge​​​‌ compilation. We hope that​ these connections will reveal​‌ fruitful and that our​​ contributions will also have​​​‌ impact in this field.​
  • Software and Hardware

    We​‌ intend to develop query​​ execution engines, querying platform​​​‌ and schemas validators for​ graphs. Moreover, D-DAL will​‌ also have interests in​​ query languages, their extensions,​​​‌ and their execution. These​ topics make us have​‌ close interest in compilation​​ techniques and programming language​​​‌ methods. We adopt an​ integrating view of query​‌ processing that embeds computational​​ capabilities. This leads to​​​‌ see data processing and​ its interactions with other​‌ software differently, delegating more​​ and more computation to​​​‌ query engines.

    D-DAL will​ develop compilation methods for​‌ various types of query,​​ taking advantage of parallel​​​‌ hardware capabilities. In particular,​ we intend to develop​‌ tools to analyze queries​​ so as to detect​​​‌ opportunities for parallel execution.​ Models of abstract circuits​‌ will allow us to​​ harness the power of​​​‌ modern hardware, in particular​ making better use of​‌ SIMD instructions. We also​​ plan to work on​​​‌ RISC V architectures and​ processing in memory units​‌ (PIM). Databases have always​​ had a tradition of​​​‌ exploiting hardware capabilities. However,​ with open source architectures​‌ such as RISC V,​​ the particular needs of​​​‌ databases may be addressed​ by particular instructions. We​‌ hope that our work​​ can have influence on​​​‌ RISC V architectures and​ that specific hardware design​‌ for data processing can​​ emerge from our research.​​​‌

  • Databases
    Our view of​ databases is different from​‌ the integrating view that​​ SQL standardization gives. While​​​‌ SQL continues to integrate​ in a top-down manner​‌ more and more querying​​ capabilities under the same​​​‌ software, the approach of​ D-DAL is to combine​‌ with optimized streaming techniques,​​ in a bottom-up fashion,​​​‌ well-defined and limited query​ languages for which we​‌ have identified highly efficient​​ algorithms. Though more difficult​​​‌ to approach than the​ SQL view, our view​‌ may be more adapted​​ for data processing in​​​‌ contexts where data is​ scattered within files in​‌ different formats or is​​ too large for database​​​‌ management systems.

5 Social​ and environmental responsibility

5.1​‌ Footprint of research activities​​

  • Antoine Amarilli
    is environmental​​​‌ chair in the steering​ committee of the informal​‌ conference “Highlights of Logic,​​ Games, and Automata” (since​​​‌ 2024).
  • Iovka Boneva
    is​ the leader of the​‌ Sustainable Development Commission of​​ the CRIStAL research lab.​​​‌
  • Antoine Amarilli
    is a​ co-animator of the TCS4F​‌ pledge on sustainable research​​ in theoretical computer science​​​‌ (since 2020).
  • Sophie Tison​
    is a member of​‌ the general assembly of​​ the European Association for​​​‌ Theoretical Computer Science (EATCS)​ (elected in 2019).

5.2​‌ Parity and equal opportunities​​ in science

  • Sophie Tison​​​‌
    Member of the steering​ committee of the Programme​‌ de mentorat Femmes et​​ Sciences de Lille.
  • Sylvain​​ Salvati
    Member of commission​​​‌ parité et égalité des‌ chances at Inria.

5.3‌​‌ Impact of research results​​

Databases and methods from​​​‌ Artificial Intelligence are used‌ in virtually all aspects‌​‌ of the modern digitalized​​ world, from companies' web​​​‌ services to governments' institutions.‌

6 Highlights of the‌​‌ year

6.1 Creation of​​ the D-DAL team-project

The​​​‌ D-DAL team-project has been‌ created in October 2025‌​‌ It is the result​​ of the collective work​​​‌ of the team members.‌ We are greatful to‌​‌ INRIA for its trust​​ and support. In particular,​​​‌ INRIA allowed us to‌ recruit Antoine Amarilli this‌​‌ year has been very​​ important to us both​​​‌ in term of team‌ life and of scientific‌​‌ achievements. This further allowed​​ us to recruit numerous​​​‌ students to work with‌ us:

  • Trainees
    Thibault Tisserant,‌​‌ Paul Raphaël, Rémi de​​ Pretto, Vincent Peth, Harshul​​​‌ Sagar.
  • PhD students
    Sébastien‌ Labbe, Loup Lobet, Arthur‌​‌ Lombardo, Étienne Parent

7​​ Latest software developments, platforms,​​​‌ open data

7.1 Application‌ of database techniques to‌​‌ genomic

Charles Paperman has​​ developed an application axis​​​‌ in genomic in collaboration‌ with the CRIStAL team‌​‌ Bonsai. Two PhDs have​​ started to apply database​​​‌ methods to genomic. Very‌ promising results have already‌​‌ been obtained with software​​ that work better than​​​‌ the state of the‌ art. The software Vizitig‌​‌ that allows users to​​ query large genomic data​​​‌ is developing at a‌ fast pace.

7.2 Latest‌​‌ software developments

7.2.1 helicase​​

  • Keywords:
    SIMD, Parsing, Bioinformatics,​​​‌ Rust
  • Functional Description:
    It‌ is a rust crate‌​‌ (library) that allows to​​ parse and extract relevant​​​‌ information classical format in‌ bio-informatics. It servers as‌​‌ a demonstration of a​​ methodology that take benefits​​​‌ of SIMD instructions through‌ analysis from automata theory‌​‌ in the scope of​​ the Shannon Meet Cray​​​‌ ANR project
  • URL:
  • Contact:
    Charles Paperman

7.2.2‌​‌ NetworkDisk

  • Name:
    NetworkDisk
  • Keywords:​​
    Large graphs, Python, Databases​​​‌
  • Functional Description:
    NetworkDisk provides‌ a way to manipulate‌​‌ graphs on disk. The​​ goal is to be​​​‌ as much as possible‌ compatible with (Di)Graph objects‌​‌ of the NetworkX Python​​ package but lifting memory​​​‌ requirement and providing persistence‌ of the Graph.
  • URL:‌​‌
  • Contact:
    Charles Paperman​​

7.2.3 rsonpath

  • Keywords:
    JSon,​​​‌ Streaming, SIMD, Rust
  • Functional‌ Description:
    The rsonpath crate‌​‌ provides a JSONPath parser​​ and a query execution​​​‌ engine, which utilizes SIMD‌ instructions to provide massive‌​‌ throughput improvements over conventional​​ engines.
  • URL:
  • Contact:​​​‌
    Charles Paperman
  • Partners:
    Warsaw‌ University, Technical University of‌​‌ Munich (TUM)

7.2.4 ShapeDesigner​​

  • Name:
    ShapeDesigner
  • Keywords:
    Validation,​​​‌ Data Exploration, Verification
  • Functional‌ Description:
    ShapeDesigner allows construct‌​‌ a ShEx or SHACL​​ schema for an existing​​​‌ dataset. It combines algorithms‌ to analyse the data‌​‌ and automatically extract shape​​ constraints, and to edit​​​‌ and validate shape schemas.‌
  • URL:
  • Contact:
    Jeremie‌​‌ Dusart
  • Partner:
    CRIStAL

7.2.5​​ ShEx validator

  • Name:
    Validation​​​‌ of Shape Expression schemas‌
  • Keywords:
    Data management, RDF‌​‌
  • Functional Description:
    Shape Expression​​ schemas is a formalism​​​‌ for defining constraints on‌ RDF graphs. This software‌​‌ allows to check whether​​ a graph satisfies a​​​‌ Shape Expressions schema.
  • Release‌ Contributions:

    ShExJava now uses‌​‌ the Commons RDF API​​​‌ and so support RDF4J,​ Jena, JSON-LD-Java, OWL API​‌ and Apache Clerezza. It​​ can parse ShEx schema​​​‌ in the ShEcC, ShEJ,​ ShExR formats and can​‌ serialize a schema in​​ ShExJ.

    To validate data​​​‌ against a ShExSchema using​ ShExJava, you have two​‌ different algorithms:

    - the​​ refine algorithm: compute once​​​‌ and for all the​ typing for the whole​‌ graph

    - the recursive​​ algorithm: compute only the​​​‌ typing required to answer​ a validate(node,ShapeLabel) call and​‌ forget the results.

  • URL:​​
  • Contact:
    Iovka Boneva​​​‌
  • Partner:
    CRIStAL

7.2.6 vizitig​

  • Keywords:
    Genomics, Transcriptomics, Databases,​‌ De Bruijn graphs, Data​​ visualization
  • Functional Description:
    Vizitig​​​‌ is a genomics and​ transcriptomics exploration software built​‌ around a de Bruijn​​ graph that directly encodes​​​‌ k-mers from raw sequencing​ data, whether assembled or​‌ not. It enables querying​​ very large sequencing datasets​​​‌ without prior alignment by​ combining sequence structure, biological​‌ annotations, and sample metadata​​ within a single queryable​​​‌ framework. Users can search​ for regions of interest​‌ based on genes, loci,​​ biomarkers, or sample traits​​​‌ (presence, abundance, etc.), and​ interactively explore only the​‌ relevant graph neighborhoods in​​ a scalable way. Vizitig​​​‌ provides a human-readable query​ language, a full-featured web​‌ interface, as well as​​ command-line and Python interfaces,​​​‌ facilitating integration with standard​ analysis tools and data​‌ sharing.
  • Contact:
    Charles Paperman​​

8 New results

8.1​​​‌ Circuits for Data Manipulation​

Participants: Antoine Amarilli,​‌ Oliver Irwin, Mikael​​ Monet, Sylvain Salvati​​​‌.

8.1.1 Studying circuits​ from Knowledge Compilation

In​‌ 16, Florent Capelli,​​ Oliver Irwin and Sylvain​​​‌ Salvati present an elementary​ branch and bound algorithm​‌ with a simple analysis​​ of why it achieves​​​‌ worstcase optimality for join​ queries on classes of​‌ databases defined respectively by​​ cardinality or acyclic degree​​​‌ constraints. They then show​ that if one is​‌ given a reasonable way​​ for recursively estimating upper​​​‌ bounds on the number​ of answers of the​‌ join queries, the algorithm​​ can be turned into​​​‌ an algorithm for uniformly​ sampling answers with expected​‌ running time 𝒪(​​𝚄𝙿/𝙾𝚄𝚃)​​​‌˜ where 𝚄𝙿 is​ the upper bound, 𝙾𝚄𝚃​‌ is the actual number​​ of answers and 𝒪​​​‌(·)˜​ ignores polylogarithmic factors. Their​‌ approach recovers recent results​​ on worstcase optimal join​​​‌ algorithm and sampling in​ a modular, clean and​‌ elementary way.

8.1.2 Foundations​​ of circuits: circuit complexity,​​​‌ algebra and abstract machines​

In 14, with​‌ Florin Manea, Tina Ringleb​​ and Markus Schmid, Antoine​​​‌ Amarilli has studied linear​ time subsequence and supersequence​‌ regex matching. It is​​ well-known that checking whether​​​‌ a given string w​ matches a given regular​‌ expression r can be​​ done in quadratic time​​​‌ 𝒪(|w​|×|r​‌|) and that​​ this cannot be improved​​​‌ to a truly subquadratic​ running time of 𝒪​‌((|w​​|×|r​​​‌|)1-​ε) assuming the​‌ strong exponential time hypothesis​​ (SETH). They study a​​​‌ different matching paradigm where​ they ask instead whether​‌ w has a subsequence​​ that matches r,​​ and show that regex​​​‌ matching in this sense‌ can be solved in‌​‌ linear time 𝒪(​​|w|+​​​‌|r|)‌. Further, the same‌​‌ holds if asked for​​ a supersequence. They show​​​‌ that the quantitative variants‌ where one wants to‌​‌ compute a longest or​​ shortest subsequence or supersequence​​​‌ of w that matches‌ r can be solved‌​‌ in 𝒪(|​​w|×|​​​‌r|),‌ i. e., asymptotically no‌​‌ worse than classical regex​​ matching; and they show​​​‌ that 𝒪(|‌w|+|‌​‌r|) is​​ conditionally not possible for​​​‌ these problems. They also‌ investigate these questions with‌​‌ respect to other natural​​ string relations like the​​​‌ infix, prefix, left-extension or‌ extension relation instead of‌​‌ the subsequence and supersequence​​ relation. They further study​​​‌ the complexity of the‌ universal problem where one‌​‌ asks if all subsequences​​ (or supersequences, infixes, prefixes,​​​‌ left-extensions or extensions) of‌ an input string satisfy‌​‌ a given regular expression.​​

In 19, Antoine​​​‌ Amarilli, Mikaël Monet and‌ Rémi De Pretto study‌​‌ the class of local​​ languages which is a​​​‌ well-known subclass of the‌ regular languages that admits‌​‌ many equivalent characterizations. In​​ this short note they​​​‌ establish the PSPACE-completeness of‌ the problem of determining,‌​‌ given as input a​​ nondeterministic finite automaton (NFA)​​​‌ A, whether the‌ language recognized by A‌​‌ is local or not.​​ This contrasts with the​​​‌ case of deterministic finite‌ automata (DFA), for which‌​‌ the problem is known​​ to be in PTIME.​​​‌

In 18, Antoine‌ Amarilli, Mikaël Monet and‌​‌ Rémi De Pretto study​​ two rewrite rules on​​​‌ hypergraphs, called edge-domination and‌ node-domination, and show that‌​‌ they are confluent. These​​ rules are rather natural​​​‌ and commonly used before‌ computing the minimum hitting‌​‌ sets of a hypergraph.​​ Intuitively, edge-domination allows one​​​‌ to remove hyperedges that‌ are supersets of another‌​‌ hyperedge, and node-domination allows​​ one to remove nodes​​​‌ whose incident hyperedges are‌ a subset of that‌​‌ of another node. They​​ show that these rules​​​‌ are confluent up to‌ isomorphism, i.e., if one‌​‌ applies any sequences of​​ edge-domination and node-domination rules,​​​‌ then the resulting hypergraphs‌ can be made isomorphic‌​‌ via more rule applications.​​ This in particular implies​​​‌ the existence of a‌ unique minimal hypergraph, up‌​‌ to isomorphism.

In 17​​, Antoine Amarilli with​​​‌ Benoît Groz study the‌ cutwidth measure on graphs‌​‌ and ways to bound​​ the cutwidth of a​​​‌ graph by partitioning its‌ vertices. They consider bounds‌​‌ expressed as a function​​ of two quantities: on​​​‌ the one hand, the‌ maximal cutwidth y of‌​‌ the subgraphs induced by​​ the classes of the​​​‌ partition, and on the‌ other hand, the cutwidth‌​‌ x of the quotient​​ multigraph obtained by merging​​​‌ each class to a‌ single vertex. They consider‌​‌ in particular the decomposition​​ of directed graphs into​​​‌ strongly connected components (SCCs):‌ in this case, y‌​‌ is the maximal cutwidth​​ of an SCC, and​​​‌ x is the cutwidth‌ of the directed acyclic‌​‌ condensation multigraph. They show​​​‌ that the cutwidth of​ a graph is always​‌ in O(x​​+y),​​​‌ specifically it can be​ upper bounded by 1​‌.5x+​​y. They also​​​‌ show a lower bound​ justifying that the constant​‌ 1.5 cannot​​ be improved in general.​​​‌

8.2 Logic and Query​ Evaluation

Participants: Antoine Amarilli​‌, Mikael Monet,​​ Charles Paperman.

8.2.1​​​‌ Provenance, explanation, aggregation, counting,​ uncertainty, probabilistic data, Shapley,​‌ approximation algorithms.

In 9​​, Antoine Amarilli et​​​‌ al. study approximation algorithms​ for queries in probabilistic​‌ graphs. Query evaluation over​​ probabilistic databases is notoriously​​​‌ intractable – not only​ in combined complexity, but​‌ often in data complexity​​ as well. This motivates​​​‌ the study of approximation​ algorithms, and particularly of​‌ combined FPRASes, with runtime​​ polynomial in both the​​​‌ query and instance size.​ In this paper, they​‌ focus on tuple-independent probabilistic​​ databases over binary signatures,​​​‌ i.e., probabilistic graphs, and​ study when they can​‌ devise combined FPRASes for​​ probabilistic query evaluation. They​​​‌ settle the complexity of​ this problem for a​‌ variety of query and​​ instance classes, by proving​​​‌ both approximability results and​ (conditional) inapproximability results doubled​‌ with (unconditional) DNNF provenance​​ circuit size lower bounds.​​​‌ This allows them to​ deduce many corollaries of​‌ possible independent interest. For​​ example, they show how​​​‌ the results of Arenas​ et al. on counting​‌ fixed-length strings accepted by​​ an NFA imply the​​​‌ existence of an FPRAS​ for the two-terminal network​‌ reliability problem on directed​​ acyclic graphs: this was​​​‌ an open problem until​ now. They also show​‌ that one cannot extend​​ a recent result of​​​‌ van Bremen and Meel​ that gives a combined​‌ FPRAS for self-join-free conjunctive​​ queries of bounded hypertree​​​‌ width on probabilistic databases:​ neither the bounded-hypertree-width condition​‌ nor the self-join-freeness hypothesis​​ can be relaxed. They​​​‌ last show how their​ methods can give insights​‌ on the evaluation and​​ approximability of regular path​​​‌ queries (RPQs) on probabilistic​ graphs in the data​‌ complexity perspective, showing in​​ particular that some of​​​‌ them are (conditionally) inapproximable.​

In 10, with​‌ Wolfgang Gatterbauer and Neha​​ Makhija, Antoine Amarilli and​​​‌ Mikaël Monet have investigated​ the complexity classification of​‌ resilience for regular path​​ queries. The resilience problem​​​‌ for a query and​ an input set or​‌ bag database is to​​ compute the minimum number​​​‌ of facts to remove​ from the database to​‌ make the query false.​​ In this paper, they​​​‌ study how to compute​ the resilience of Regular​‌ Path Queries (RPQs) over​​ graph databases. Their goal​​​‌ is to characterize the​ regular languages L for​‌ which it is tractable​​ to compute the resilience​​​‌ of the existentially-quantified RPQ​ built from L.​‌ They show that computing​​ the resilience in this​​​‌ sense is tractable (even​ in combined complexity) for​‌ all RPQs defined from​​ so-called local languages. By​​​‌ contrast, they show hardness​ in data complexity for​‌ RPQs defined from the​​ following language classes (after​​​‌ reducing the languages to​ eliminate redundant words): all​‌ finite languages featuring a​​ word containing a repeated​​ letter, and all languages​​​‌ featuring a specific kind‌ of counterexample to being‌​‌ local (which they call​​ four-legged languages). The latter​​​‌ includes in particular all‌ languages that are not‌​‌ star-free. Their results also​​ imply hardness for all​​​‌ non-local languages with a‌ so-called neutral letter. They‌​‌ also highlight some remaining​​ obstacles towards a full​​​‌ dichotomy. In particular, for‌ the RPQ ab‌​‌c|be​​, resilience is tractable​​​‌ but the only known‌ PTIME algorithm uses submodular‌​‌ function optimization.

8.2.2 Dynamic​​ data: incremental and dynamic​​​‌ algorithms and connections with‌ dynamic algorithms

In 13‌​‌, with Corentin Barloy​​ and Louis Jachiet, Antoine​​​‌ Amarilli and Charles Paperman‌ study the dynamic membership‌​‌ problem for regular tree​​ languages under relabeling updates:​​​‌ we fix an alphabet‌ Σ and a regular‌​‌ tree language L over​​ Σ (expressed, e.g., as​​​‌ a tree automaton), we‌ are given a tree‌​‌ T with labels in​​ Σ, and we​​​‌ must maintain the information‌ of whether the tree‌​‌ T belongs to L​​ while handling relabeling updates​​​‌ that change the labels‌ of individual nodes in‌​‌ T. Their first​​ contribution is to show​​​‌ that this problem admits‌ an 𝒪(log‌​‌(n)/​​log(log(​​​‌n)))‌ algorithm for any fixed‌​‌ regular tree language, improving​​ over known 𝒪(​​​‌log(n)‌) algorithms. This generalizes‌​‌ the known 𝒪(​​log(n)​​​‌/log(log‌(n))‌​‌) upper bound over​​ words, and it matches​​​‌ the lower bound of‌ Ω(log(‌​‌n)/log​​(log(n​​​‌))) from‌ dynamic membership to some‌​‌ word languages and from​​ the existential marked ancestor​​​‌ problem. Their second contribution‌ is to introduce a‌​‌ class of regular languages,​​ dubbed almost-commutative tree languages,​​​‌ and show that dynamic‌ membership to such languages‌​‌ under relabeling updates can​​ be decided in constant​​​‌ time per update. Almost-commutative‌ languages generalize both commutative‌​‌ languages and finite languages:​​ they are the analogue​​​‌ for trees of the‌ ZG languages enjoying constant-time‌​‌ dynamic membership over words.​​ Their main technical contribution​​​‌ is to show that‌ this class is conditionally‌​‌ optimal when one assumes​​ that the alphabet features​​​‌ a neutral letter, i.e.,‌ a letter that has‌​‌ no effect on membership​​ to the language. More​​​‌ precisely, they show that‌ any regular tree language‌​‌ with a neutral letter​​ which is not almost-commutative​​​‌ cannot be maintained in‌ constant time under the‌​‌ assumption that the prefix-U1​​ problem from 26 also​​​‌ does not admit a‌ constant-time algorithm.

8.3 Knowledge‌​‌ on Data

Participants: Iovka​​ Boneva, Sylvain Salvati​​​‌, Sophie Tison.‌

8.3.1 Optimization for schemas‌​‌ for graphs

Iovka Boneva​​ has started a particularly​​​‌ fruitful international collaboration with‌ researchers from Poland, Austria‌​‌ and Italy which aims​​ at bridging the gap​​​‌ between graph schema languages‌ for RDF and for‌​‌ property graphs. This resulted​​ in a common core​​​‌ definition of a schema‌ language that is at‌​‌ the intersection of three​​​‌ schema languages for graphs​ of different types.

In​‌ 12, Iovka Boneva​​ collaborated with several specialists​​​‌ in schema languages for​ graphs. They established a​‌ common model for three​​ main graph schema languages:​​​‌ SHACL and ShEx for​ RDF graphs, and PG-Schema​‌ for property graphs. These​​ languages are designed and​​​‌ used for formalising properties​ of data and ensuring​‌ data quality in diverse​​ graph-shaped data driven applications​​​‌ such as Semantic Web​ and databases. Each language​‌ has its unique approach​​ to defining constraints and​​​‌ validating graph data, leaving​ potential users in the​‌ dark about their commonalities​​ and differences. The paper​​​‌ provides concise formal definitions​ of the core components​‌ of these languages, employs​​ a uniform framework to​​​‌ facilitate a comprehensive comparison​ between them, and identifies​‌ a common set of​​ functionalities, shedding light on​​​‌ both overlapping and distinctive​ features. This work is​‌ a first step towards​​ interoperability of systems based​​​‌ on different schema languages.​

In 15, Iovka​‌ Boneva with Jose E.​​ Labra Gayo, Eric G.​​​‌ Prud'hommeaux, Katherine Thornton and​ Andra Waagmeester presented formal​‌ semantics for the ShEx​​ schema language enriched with​​​‌ an inheritance mechanism. It​ is inspired by inheritance​‌ in object-oriented programming languages,​​ and provides similar advantages​​​‌ such as reuse, modularity,​ and more flexible data​‌ modelling. Using an example,​​ we explain the main​​​‌ features of the inheritance​ mechanism. We present its​‌ syntax and formal semantics.​​ The semantics is an​​​‌ extension of the semantics​ of ShEx 2.1. It​‌ also directly yields a​​ validation algorithm as an​​​‌ extension of the previous​ ShEx validation algorithms.

8.3.2​‌ Static analysis with constraints​​ (open-world query answering, query​​​‌ rewriting under constraints, consistency,​ certain answers)

With Pablo​‌ Barceló and Gaëlle Fontaine,​​ Sylvain Salvati and Sophie​​​‌ Tison have corrected some​ proofs in 11.​‌ This paper is about​​ handling graph databases with​​​‌ regular path constraints. Applications​ of graph databases are​‌ prone to inconsistency due​​ to interoperability issues. This​​​‌ raises the need for​ studying query answering over​‌ inconsistent graph databases in​​ a simple but general​​​‌ framework. They follow the​ approach of consistent query​‌ answering (CQA), and study​​ its data complexity over​​​‌ graph databases for conjunctive​ regular-path queries (CRPQs) and​‌ conjunctive regular-path constraints (CRPCs).​​ They deal with subset,​​​‌ superset and symmetric-difference repairs.​ Without restrictions, CQA is​‌ undecidable for the semantics​​ of superset- and symmetric-difference​​​‌ repairs, and Π2​P-co​‌mple​​te for subset-repairs.​​​‌ However, they identify restrictions​ on CRPCs and databases​‌ that lead to decidability,​​ and even tractability of​​​‌ CQA.

9 Partnerships and​ cooperations

9.1 International research​‌ visitors

9.1.1 Visits of​​ international scientists

Invitation of​​​‌ researchers to the team​ seminar

Our team often​‌ invites external researchers (from​​ French institutions or international)​​​‌ to give research seminars.​ Participants sometimes stay a​‌ few days to discuss​​ research with the team.​​​‌ Here is a non-exhaustive​ list of researchers that​‌ gave a presentation in​​ the team in 2025-2026:​​​‌

  • Pieter Bonte [Ghent University]​
  • Alexandra Rogova [University of​‌ Warsaw]
  • Harry Vinall-Smeeth [RWTH​​ Aachen University]
  • Sarah Kleest-Meißner​​ [Hasselt University]
  • Corto Mascle​​​‌ [MPI-SWS, Kaiserslautern]
  • Lê Thành‌ Dũng (Tito) Nguyen [Aix-Marseille‌​‌ University]
  • Michaël Cadilhac [DePaul​​ University]
  • Howard Straubing [Boston​​​‌ College]
  • Rémi Morvan [freelance‌ software engineer (previously University‌​‌ of Bordeaux)]

9.1.2 Visits​​ to international teams

Research​​​‌ stays abroad

Antoine Amarilli‌ visited the Institute of‌​‌ Computer Science at the​​ University of Göttingen (Germany),​​​‌ in the context of‌ a collaboration with Florin‌​‌ Manea, Markus Schmid, and​​ Tina Ringleb, from February​​​‌ 10 to February 16,‌ 2025.

9.2 National initiatives‌​‌

9.2.1 ANR JCJC KCODA​​

Participants: Florent Capelli [PI]​​​‌, Oliver Irwin,‌ Charles Paperman, Sylvain‌​‌ Salvati.

  • Duration
    2021-2026​​
  • Description
    The aim of​​​‌ KCODA is to study‌ how succinct representations can‌​‌ be used to efficiently​​ solve modern optimization and​​​‌ AI problems that use‌ a lot of data.‌​‌ We suggest using data​​ structures from the field​​​‌ of compilation of knowledge‌ that can represent large‌​‌ datasets succinctly by factoring​​ certain parts while allowing​​​‌ efficient analysis of the‌ represented data. The first‌​‌ goal of KCODA is​​ to understand how one​​​‌ can efficiently solve optimization‌ and training problems for‌​‌ data represented by these​​ structures. The second goal​​​‌ of KCODA is to‌ offer better integration of‌​‌ these techniques into the​​ systems of database management​​​‌ by proposing new algorithms‌ allowing to build factorized‌​‌ representations of the data​​ responses to DB requests​​​‌ and by proposing encodings‌ of these representations inside‌​‌ the DB.

9.2.2 ANR​​ SxC

Participants: Loup Lobet​​​‌, Charles Paperman [PI]‌, Sylvain Salvati.‌​‌

  • Duration
    2025-2030
  • Coordinator
    Charles​​ Paperman
  • Scientific Partners
    LIP​​​‌ (ENS Lyon), LCIS (Université‌ Grenobles Alpes) and LABRI‌​‌ (Université de Bordeaux)
  • Objectives​​

    Knowledge about handcrafted vectorization​​​‌ algorithms is scattered among‌ various fields, system architectures,‌​‌ programming languages paradigms and​​ industrial pieces of codes.​​​‌ The main objective of‌ the project is to‌​‌ bring structure to this​​ knowledge through novel conceptual​​​‌ and practical tools. The‌ end goal is to‌​‌ facilitate the writing of​​ SIMD programs and to​​​‌ pave the way to‌ future auto-vectorization methods for‌​‌ stream processing. The main​​ difficulty to achieve these​​​‌ goals is to understand‌ the interplay of sequentiality‌​‌ and bit-level parallelism. Circuit​​ complexity, as introduced by​​​‌ Shannon in the early‌ years of computer science,‌​‌ offers a powerful framework​​ to study such notion​​​‌ of parallelism. In this‌ multidisciplinary project, we bring‌​‌ together methodologies from the​​ fields of circuit complexity​​​‌ of automata, and programming‌ language design for parallel‌​‌ architectures, with motivations from​​ data processing and bioinformatics​​​‌ applications.

    The project’s results‌ will be implemented into‌​‌ a compilation tool-chain. At​​ the highest level, we​​​‌ will design and develop‌ a domain specific language,‌​‌ dubbed Vectoid. Its goal​​ is to allow more​​​‌ programmers to harness SIMD‌ instructions. Vectoid will be‌​‌ compiled to a Vectorial​​ Intermediate Representation (VIR) dedicated​​​‌ to stream processing. In‌ turn VIR will be‌​‌ compiled to low level​​ code for various SIMD​​​‌ architectures. The design of‌ this compilation tool-chain will‌​‌ be informed by a​​ theoretical investigation on the​​​‌ expressivity of vectorial circuits.‌ It will be evaluated‌​‌ on data processing and​​​‌ bioinformatics benchmarks.

9.2.3 ANR​ EXPAND

Participants: Antoine Amarilli​‌.

  • Duration
    2025-2030
  • Coordinator​​
    Michaël Thomazo
  • Scientific Partners​​​‌
    LABRI (Bordeaux), LIMOS (Clermont-Ferrand),​ LIRMM (Montpellier), DI ENS​‌ (Paris), IRISA (Rennes), and​​ CRIStAL (D-DAL and SPIRALS​​​‌ team)
  • Objectives
    The goal​ of EXPAND is to​‌ expand the applicability of​​ ontology-based query answering by​​​‌ allowing for enhanced query​ languages, and provide richer​‌ ways to manipulate and​​ understand query answers.

9.3​​​‌ Regional initiatives

In collaboration​ with Florent Capelli and​‌ Stefan Mengel of Centre​​ de Recherche en Informatique​​​‌ de Lens (CRIL), Antoine​ Amarili and Mikaël Monet​‌ have obtained a PostDoc​​ grant from région Nord​​​‌ via the CPER Cornelia.​ They will recruit a​‌ post-doctoral student who is​​ expected to start in​​​‌ September 2026.

10 Dissemination​

Participants: Antoine Amarilli,​‌ Iovka Boneva, Oliver​​ Irwin, Sébastien Labbe​​​‌, Aurélien Lemay,​ Loup Lobet, Arthur​‌ Lombardo, Mikael Monet​​, Charles Paperman,​​​‌ Sylvain Salvati, Sophie​ Tison.

10.1 Promoting​‌ scientific activities

Member of​​ organizing committees
  • Mikael Monet​​​‌ , Sophie Tison
    member​ of the EDBT/ICDT'27 organizing​‌ committee. The conference will​​ take place in Lille​​​‌ in 2027.

10.1.1 Scientific​ events: selection

Member of​‌ the conference program committees​​
  • Antoine Amarilli
    was member​​​‌ of the program committee​ of UndoneCS 2026
  • Iovka​‌ Boneva
    was member of​​ the program committee of​​​‌ BDA 2025 (the French​ conference on data management)​‌
Reviewer
  • Antoine Amarilli
    has​​ reviewed for various journals/conferences,​​​‌ including STACS'26, ICDT'25, LMCS​
  • Mikael Monet
    has reviewed​‌ for various journals/conferences, including​​ PODS'26, JACM (2025), TODS​​​‌ (2025), TOCL (2026)

10.1.2​ Journal

Member of the​‌ editorial boards
  • Mikael Monet​​
    Managing editor for the​​​‌ TheoretiCS journal.

10.1.3 Invited​ talks

  • Antoine Amarilli
    gave​‌ a talk at the​​ invitation-only workshop FMT'25 (finite​​​‌ model theory) in Les​ Houches, France. He was​‌ also invited (without giving​​ a talk) to Dagstuhl​​​‌ workshop 25081 "Semirings in​ Databases, Automata, and Logic"​‌

10.1.4 Leadership within the​​ scientific community

  • Antoine Amarilli​​​‌
    Vice-President (elected in 2024)​ of the EATCS (European​‌ Association for Theoretical Computer​​ Sciences).
  • Charles Paperman
    Elected​​​‌ member of EATCS (European​ Association for Theoretical Computer​‌ Sciences).
  • Sophie Tison
    Member​​ of the scientific committee​​​‌ of the GDR IM.​
  • Sophie Tison
    Member of​‌ the SSAAL Société des​​ Sciences, de l'Agriculture et​​​‌ des Arts de Lille.​

10.1.5 Scientific expertise

  • Sophie​‌ Tison
    Expert for HCERES​​ (vice-president of the DTIS-ONERA​​​‌ committee).
  • Sophie Tison
    External​ expert for IUF.
  • Sophie​‌ Tison
    Expert for the​​ programme "Jeunes Talents France​​​‌ 2025 L’Oréal-UNESCO Pour les​ Femmes et la Science".​‌

10.1.6 Research administration

  • Iovka​​ Boneva
    Member of Inria​​​‌ Lille CER (Commission des​ Emplois de Recherche).
  • Charles​‌ Paperman
    Elected member of​​ the research commission of​​​‌ FST (Université de Lille).​
  • Sylvain Salvati
    Elected member​‌ of CE (Commission d'évaluation​​ INRIA).
  • Sophie Tison
    Member​​​‌ of the Steering Committee​ of Highlights of Logic,​‌ Automata, and Games.

10.2​​ Teaching - Supervision -​​​‌ Juries - Educational and​ pedagogical outreach

10.2.1 Supervision​‌

  • Bastien Degardins
    PhD project​​ started in October 2024.​​​‌ On graph databases for​ DNA analysis. Supervised by​‌ Paperman and Marchet (CRIStAL​​ Bonsai).
  • Oliver Irwin
    PhD​​ project started in 2022.​​​‌ On compilation and aggregation‌ in databases. Co-supervised by‌​‌ Capelli and Salvati.
  • Sebastien​​ Labbe
    PhD student from​​​‌ Oct 2025, co-supervised by‌ Antoine Amarilli and Charles‌​‌ Paperman
  • Loup Lobet
    PhD​​ student from Oct 2025,​​​‌ co-supervised by Laure Gonnord‌ and Charles Paperman
  • Etienne‌​‌ Parent
    PhD student from​​ Oct 2025, co-supervised by​​​‌ Gabrien Radanne, Matthieu Moy‌ and Charles Paperman
  • Arthur‌​‌ Lombardo
    PhD student from​​ Oct 2025, co-supervised by​​​‌ Antoine Amarilli, Mikaël Monet‌ and Pierre Senellart
  • Vincent‌​‌ Peth
    M2 intern, co-supervised​​ by Antoine Amarilli and​​​‌ Mikaël Monet, summer 2025‌
  • Rémi de Pretto
    M1‌​‌ intern, co-supervised by Antoine​​ Amarilli and Mikaël Monet,​​​‌ summer 2025
  • Paul Raphaël‌
    L3 intern, co-supervised by‌​‌ Antoine Amarilli and Mikaël​​ Monet and Sylvain Salvati,​​​‌ summer 2025
  • Harshul Sagar‌
    PhD student at University‌​‌ of Illinois with whom​​ Antoine Amarilli and Mikaël​​​‌ Monet have an ongoing‌ research project that started‌​‌ during summer 2025
  • Thibault​​ Tisserand
    M1 intern, supervised​​​‌ by Sylvain Salvati, summer‌ 2025

10.2.2 Juries

  • Sylvain‌​‌ Salvati
    has been a​​ reviewer of the PhD​​​‌ of Vincent Moreau A‌ topological and fibrational approach‌​‌ to higher-order automata at​​ Université Paris Cité.
  • Charles​​​‌ Paperman
    has been in‌ the PhD committee of‌​‌ Gabriel Bathie.
  • Antoine Amarilli​​
    has been in the​​​‌ HDR committee of Tatiana‌ Starikovskaya (reviewer), in the‌​‌ PhD committee of Rémi​​ Morvan, in the PhD​​​‌ committee of Neha Makhija,‌ in the PhD committee‌​‌ of Harry Vinall-Smeeth (reviewer),​​ in the CSI committees​​​‌ of Javad Taheri, Oliver‌ Irwin, Fatemeh Ghasemi, and‌​‌ Lucas Larroque, and in​​ a hiring committee (​​​‌COS) for a‌ “Maître de conférences‌​‌” position in LABRI​​ (Bordeaux)
  • Sophie Tison
    has​​​‌ been in the PhD‌ committee of Rémi Morvan.‌​‌
  • Iovka Boneva
    has been​​ in the hiring committee​​​‌ (COS) for‌ two “Maître de‌​‌ conférences” positions at​​ Université de Lille

10.2.3​​​‌ Educational and pedagogical outreach‌

  • Aurélien Lemay
    Responsible for‌​‌ computer science and numeric​​ correspondent for his department.​​​‌
  • Paperman
    Responsible of alternance‌ for the department of‌​‌ Computer Science.
  • Sylvain Salvati​​
    Head of Master of​​​‌ Computer Science at the‌ University of Lille.
  • Sylvain‌​‌ Salvati
    Member of the​​ board for Computer Science​​​‌ of the doctoral school‌ MADIS (Mathematics and Computer‌​‌ Science).
  • Sylvain Salvati
    Member​​ of the collège doctoral​​​‌ (doctoral college) of University‌ of Lille: PhD students‌​‌ education and professional training.​​

10.2.4 Teaching Activities

  • Iovka​​​‌ Boneva
    Teaches computer science‌ in BUT Informatique of‌​‌ Université de Lille.
  • Oliver​​ Irwin
    Taught 64 hours​​​‌ in Computer Science in‌ the Computer Science Department‌​‌ of Université de Lille.​​
  • Aurélien Lemay
    Teaches computer​​​‌ science in the LEA‌ department of Université de‌​‌ Lille for around 200h​​ per year (Licence and​​​‌ Master). He is also‌ responsible for computer science‌​‌ and numeric correspondent for​​ its department.
  • Mikael Monet​​​‌
    Teaches computer science as‌ a temporary lecturer at‌​‌ Université de Lille and​​ at Centrale Lille. He​​​‌ teaches two databases courses,‌ as well as a‌​‌ course on "algorithms and​​ complexity".
  • Charles Paperman
    Teaches​​​‌ CS in master degrees‌ of the computer science‌​‌ department, Miashs at bachelor​​​‌ level in the math​ department and in the​‌ data science master degree.​​
  • Sara Riva
    Teaches CS​​​‌ both in the Bachelor​ and Master of the​‌ Department of Computer Science​​ of the Université de​​​‌ Lille. She also teaches​ a Database course at​‌ Centrale Lille.
  • Sylvain Salvati​​
    Teaches at the Bachelor​​​‌ and Master level in​ the computer science department​‌ of Université de Lille.​​

10.3 Popularization

10.3.1 Specific​​​‌ official responsibilities in science​ outreach structures

  • Sophie Tison​‌
    Member of the scientific​​ committee of the Maison​​​‌ Pour la Science.
  • Charles​ Paperman
    Co-organizer of the​‌ yearly thematic of GDR-IFM​​ and GDR-BIMMM l'adn dans​​​‌ tout ses états.​

10.3.2 Others science outreach​‌ relevant activities

  • Sara Riva​​
    is involved in the​​​‌ internship action Fourmis which​ fosters the interest of​‌ highschool students towards Mathematics​​ and Computer Science.

11​​​‌ Scientific production

11.1 Major​ publications

11.2 Publications‌​‌ of the year

International​​ journals

International peer-reviewed conferences​​​‌

Reports & preprints‌

11.3 Cited​ publications

  • 21 articleA.​‌Antoine Amarilli, M.​​Marcelo Arenas, Y.​​​‌YooJung Choi, M.​Mikaël Monet, G.​‌ V.Guy Van den​​ Broeck and B.Benjie​​​‌ Wang. A Circus​ of Circuits: Connections Between​‌ Decision Diagrams, Circuits, and​​ Automata.arXiv preprint​​​‌ arXiv:2404.096742024back to​ textback to text​‌
  • 22 inproceedingsA.Antoine​​ Amarilli, P.Pierre​​​‌ Bourhis, F.Florent​ Capelli and M.Mikaël​‌ Monet. Ranked Enumeration​​ for MSO on Trees​​​‌ via Knowledge Compilation.​ICDT2902024back​‌ to text
  • 23 inproceedings​​A.Antoine Amarilli,​​​‌ P.Pierre Bourhis,​ S.Stefan Mengel and​‌ M.Matthias Niewerth.​​ Enumeration on trees​​​‌ with tractable combined complexity​ and efficient updates.​‌PODS2019DOIback​​ to text
  • 24 article​​​‌A.Antoine Amarilli and​ F.Florent Capelli.​‌ Tractable Circuits in Database​​ Theory.ACM SIGMOD​​​‌ Record5322024​back to textback​‌ to text
  • 25 article​​A.Antoine Amarilli and​​​‌ İ. İ.İsmail İlkan​ Ceylan. The​‌ dichotomy of evaluating homomorphism-closed​​ queries on probabilistic graphs​​​‌.LMCS2022DOI​back to text
  • 26​‌ inproceedingsA.Antoine Amarilli​​, L.Louis Jachiet​​​‌ and C.Charles Paperman​. Dynamic Membership for​‌ Regular Languages.ICALP​​2021DOIback to​​​‌ textback to text​
  • 27 articleA.Antoine​‌ Amarilli, M.Mikaël​​ Monet and D.Dan​​​‌ Suciu. The Non-Cancelling​ Intersections Conjecture.CoRR​‌abs/2401.162102024back to​​ textback to text​​​‌
  • 28 articleA.Antoine​ Amarilli and M.Mikaël​‌ Monet. Weighted counting​​ of matchings in unbounded-treewidth​​​‌ graph families.arXiv​ preprint arXiv:2205.008512022back​‌ to text
  • 29 inproceedings​​A.Antoine Amarilli and​​​‌ M.Mikaël Monet.​ Enumerating regular languages​‌ with bounded delay.​​STACS2023DOIback​​​‌ to text
  • 30 inproceedings​J.-Y.Jin-Yi Cai,​‌ P.Pinyan Lu and​​ M.Mingji Xia.​​​‌ Holant problems and counting​ CSP.STOC2009​‌back to text
  • 31​​ articleA.Adnan Darwiche​​​‌ and P.Pierre Marquis​. A knowledge compilation​‌ map.Journal of​​ Artificial Intelligence Research17​​​‌2002back to text​back to text
  • 32​‌ inproceedingsD.Daniel Deutch​​, N.Nave Frost​​​‌, B.Benny Kimelfeld​ and M.Mikaël Monet​‌. Computing the Shapley​​ Value of Facts in​​​‌ Query Answering.SIGMOD​ ConferenceACM2022back​‌ to text
  • 33 article​​M.Martin Dyer and​​​‌ D.David Richerby.​ An effective dichotomy for​‌ the counting constraint satisfaction​​ problem.SIAM Journal​​ on Computing423​​​‌2013back to text‌
  • 34 articleR.Ronald‌​‌ Fagin, B.Benny​​ Kimelfeld, F.Frederick​​​‌ Reiss and S.Stijn‌ Vansummeren. Document spanners:‌​‌ A formal approach to​​ information extraction.JACM​​​‌6222015back‌ to text
  • 35 article‌​‌C.Cibele Freire,​​ W.Wolfgang Gatterbauer,​​​‌ N.Neil Immerman and‌ A.Alexandra Meliou.‌​‌ The Complexity of Resilience​​ and Responsibility for Self-Join-Free​​​‌ Conjunctive Queries.PVLDB‌932015DOI‌​‌back to text
  • 36​​ phdthesisL.Lily Gallois​​​‌. Dialog between chase‌ approach and string rewriting‌​‌ system approach.Université​​ de LilleUniversité de​​​‌ LilleDecember 2019HAL‌back to text
  • 37‌​‌ articleK.Kathrin Hanauer​​, M.Monika Henzinger​​​‌ and C.Christian Schulz‌. Recent advances in‌​‌ fully dynamic graph algorithms:​​ A quick reference guide​​​‌.JEA272022‌back to text
  • 38‌​‌ incollectionM.Mark Jerrum​​. Counting Constraint Satisfaction​​​‌ Problems.The Constraint‌ Satisfaction Problem7Dagstuhl‌​‌ Follow-Ups2017back to​​ text
  • 39 articleP.​​​‌Pratik Karmakar, M.‌Mikaël Monet, P.‌​‌Pierre Senellart and S.​​Stéphane Bressan. Expected​​​‌ Shapley-like scores of boolean‌ functions: Complexity and applications‌​‌ to probabilistic databases.​​PACMMOD222024​​​‌back to text
  • 40‌ inproceedingsW.Wim Martens‌​‌ and T.Tina Trautner​​. Evaluation and enumeration​​​‌ problems for regular path‌ queries.ICDT2018‌​‌back to text
  • 41​​ inproceedingsC.Charles Paperman​​​‌, S.Sylvain Salvati‌ and C.Claire Soyez-Martin‌​‌. An algebraic approach​​ to vectorial programs.​​​‌STACS2023HALDOI‌back to text
  • 42‌​‌ inproceedingsS.Sylvain Salvati​​ and S.Sophie Tison​​​‌. Containment of Regular‌ Path Queries Under Path‌​‌ Constraints.ICDT2024​​HALDOIback to​​​‌ text
  • 43 articleN.‌Nicole Schweikardt, L.‌​‌Luc Segoufin and A.​​Alexandre Vigny. Enumeration​​​‌ for FO queries over‌ nowhere dense graphs.‌​‌JACM6932022​​DOIback to text​​​‌
  • 44 articleL.Luc‌ Segoufin. Constant delay‌​‌ enumeration for conjunctive queries​​.ACM SIGMOD Record​​​‌4412015back‌ to text
  • 45 article‌​‌P.Pierre Senellart,​​ L.Louis Jachiet,​​​‌ S.Silviu Maniu and‌ Y.Yann Ramusat.‌​‌ ProvSQL: Provenance and probability​​ management in PostgreSQL.​​​‌PVLDB11122018‌back to text
  • 46‌​‌ phdthesisC.Claire Soyez-Martin​​. From semigroup theory​​​‌ to vectorization: recognizing regular‌ languages.Université de‌​‌ LilleUniversité de Lille​​December 2023HALback​​​‌ to text
  1. 1We‌ use the term uncertain‌​‌ data as it is​​ used in the field​​​‌ of databases. It refers‌ to data which is‌​‌ either unknown or that​​ contains some probabilistic aspects.​​​‌
  2. 2A notable exception‌ is PostgreSQL, which proposes‌​‌ the type JSONB that​​ allows for better indexation​​​‌ and avoids the cost‌ of parsing.
  3. 3SIMD‌​‌ stands for Single Instruction​​ Multiple Data and are​​​‌ CPU instructions that allow‌ to operate at the‌​‌ same time on several​​ machine words with the​​​‌ same CPU instruction like‌ bitwise operations, additions, substractions,‌​‌ etc.