Tarantool Summer of Code 2022 ideas

Please find below the preliminary list of ideas we would like to explore during this season of Tarantool Summer of Code. Each idea has a corresponding difficulty level, required/optional skill sets, and mentor names.

Not all ideas have the same amount of details in explanation, but we believe they might be of a good start in an interesting direction.

Database engines

Add column storage engine

Tarantool storage engines MemTx and Vinyl are row (tuple) based, which is working fine for OLTP activity but is suboptimal for OLAP queries. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with column-based storage. As a bonus, for cluster mode, column storage may store different column families on different nodes for a better locality. Such an approach may provide (or maybe not) better scalability for OLAP queries in a wide-column cluster.

We need to experiment with a local column storage scenario
And wide-column scenario for column families spread over cluster nodes;
Measure benefits and overhead;

Difficulty: Hard

Expected size: 350 hours

Required skills: C/C++

Mentor: Kirill Yukhin

Add B+ Tree storage engine

Persistent, LSM-based Tarantool Vinyl storage engine provides good performance for write-only activities, but is suboptimal for reading queries, especially if one query cold data. On the other hand, B+ tree-based storage engines better fit the scenarios when we not only write data (across clusters) but also need to query cold data for simple analytic needs. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with B+ tree-based storage.

One needs to create this engine, and plug it into the generic query flow in Tarantool (making sure that Lua and SQL requests still work for new engine);
One needs to compare its performance against Vinyl and measure memory benefits or losses.

Difficulty: Hard

Expected size: 350 hours

Required skills: C/C++

Mentor: Timur Safin, Cyrill Gorkunov

Calculate bloom filter and page size automatically

Vinyl is disk storage based on the LSM tree data structure. To speed up data access (which may take a while for disk storage) we use a probabilistic data structure - bloom filter. It allows us to eliminate excess data lookups - we can skip some files on disk if the bloom filter application returns a negative result. We should automatically adjust bloom filters and page size to fit a pre-defined page index size. We have the following automatic adjustment strategies, which we should all employ:

avoid having bloom filters on lower levels; the deeper the LSM level, the less value per byte used they provide
increase page size (again, starting from the bottom-most pages) - the larger the page size, the smaller the page index.

Related issues :

Difficulty: Medium

Expected size: 175 hours

Required skills: C

Mentor: Nikita Pettik, Anastasiy Belyaev

Implement persistent tuple cache

To speed up data access in Vinyl we also maintain a cache of tuples (to be more precise - chains of tuples) in RAM. However, it is not persistent, i.e. in instance reload its content is erased. It would be nice to dump the tuple cache during the checkpoint and recover it during vinyl recovery.

Related issues :

vinyl: persistent tuple cache #2065

Difficulty: Hard

Expected size: 350 hours

Required skills: C

Mentor: Nikita Pettik, Anastasiy Belyaev

Make vinyl memory level dumping independent from memtx

Currently, the only way of forcing LSM-tree based vinyl engine's memory level dump is by calling box.snapshot(). However, this call can be slow, because it also dumps the memtx engine's memory and it has some side effects, such as creating a checkpoint and invoking xlog garbage collection. It would be nice to have a separate handle for that, e.g. box.ctl.vinyl.dump().

Related issues:

Allow to trigger dump of vinyl memory without checkpoint

Difficulty: Medium

Expected size: 175 hours

Required skills: C

Mentors: Nikita Pettik, Georgiy Lebedev, Anastasiy Belyaev

Add massive end-to-end testing for concurrent operations in vinyl

In #4821 we found that Tarantool may crash or get stuck if a failed memory dump happens concurrently with a DDL operation. However, this is not the only case where dump failure occurring in parallel with another operation may lead to bugs. Thus, we need to provide test scenarios covering a combination of dump/compact fails and concurrent index/space drop/alter/creation operations.

This project involves researching various test scenarios and test means (e.g., generation of DDL operations, possibly even fuzzing), and also investigating and fixing the bugs discovered during research.

Related issues:

Difficulty: Medium

Expected size: 175 hours

Required skills: C

Mentors: Nikita Pettik, Georgiy Lebedev

Adding the HyperLogLog algorithm to Tarantool

As part of this task, it is necessary to implement the functionality described in the issue. You need to implement the HyperLogLog algorithm or its improved version as a separate module. We also need to figure out how to incorporate this into the current architecture - one idea is to make a new kind of index that supports this kind of operation.

Related issues:

Introduce index:dcount() function that uses HyperLogLog algorithm

Difficulty: Medium

Expected size: 175 hours

Required skills: C

Mentors: Nikita Pettik, Anastasiy Belyaev

LuaJIT

Fuzzing Lua (LuaJIT) in Tarantool

LuaJIT is the heart of Tarantool and its stability is closely connected to the stability of LuaJIT. But statistics of issues said that we catch bugs with LuaJIT [1] using Tarantool on production. We need to decrease the probability of such bugs and introduce more generative testing for LuaJIT in the Tarantool release cycle.

Some ideas for testing are:

Metamorphic testing using an existing corpus of Lua scripts.
Automated generation of scripts using Lua grammar and introspection of Tarantool embedded functions.
Fault injection in memory allocation using embedded error injection engine (see src/lib/core/errinj.c) or using an external library with LD_PRELOAD

References

Difficulty: Moderate

Expected size: 175 hours

Required skills: C/C++, Lua

Desired Skills: LuaJIT internals, metamorphic testing, fuzzing testing, LibFuzzer

Mentors: Sergey Bronnikov, Igor Munkin

SQL

Grammar-based fuzzing of SQL engine

Tarantool has an SQL engine inherited from SQLite. We are interested in good testing of SQL engine using fuzzing testing. The main idea here is to generate SQL queries valid for Tarantool SQL grammar (see SQL reference), execute a query on Tarantool and make sure is everything good after that.

There are some ideas:

Implement a LibFuzzer-based fuzzer, as the Chromium team did for SQLite, see https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/sqlite/fuzz
Create a port of Sqllogictest for Tarantool

References

Difficulty: Medium

Expected size: 175 hours

Required skills: C/C++

Optional skills: SQL, Lua

Mentor: Sergey Bronnikov

Integration with SQLancer

SQLancer is a tool for finding logic bugs in SQL engines, written in Java. It uses Pivoted Query Synthesis (PQS), Non-optimizing Reference Engine Construction (NoREC), and Ternary Logic Partitioning (TLP) for building correct from syntax point of view queries and executing in the database. The efficiency of SQLances has proven with widely-used DBMSs like PostgreSQL, MySQL, SQLite, etc. where SQLancer found over 450 unique, previously unknown bugs.

References

Difficulty: Medium

Expected size: 175 hours

Required skills: Java

Optional skills: SQL

Mentor: Sergey Bronnikov

LLVM JIT for Tarantool SQL engine (stage 2)

There are multiple virtual machines used at the moment by Tarantool:

LuaJIT within its 2 operating modes (Lua bytecode interpreter, and JIT traces with native code generated);
And SQLite-derivative VM code for SQL engine known as VDBE.

Vdbe is not providing the best performance and looks alien to the rest of Tarantool code base. One of GSoC 2021 projects explored LLVM JIT facilities for SQL engine speed improvements - LLVM JIT engine for Tarantool's DQL

One should pick this project up, make it feature-complete, benchmark performance, and fine-tune for the general case.

Difficulty: Hard

Expected size: 350 hours

Required skills: C/C++, JIT

Optional skills: LLVM, SQLite

Mentor: Georgiy Lebedev, Timur Safin, Nikita Pettik

Ecosystem

Implement Tarantool/LuaJIT backend for Compiler Explorer

Code Explorer also known as Godbolt (godbolt.org) became the greatest community tool used by multiple languages community for codeshare and performance investigations. It started with C++ compilers, but now it includes Ada, Go, Haskell, and many others (and more to come).

Unfortunately, there is no LuaJIT/Tarantool support and there is no a good substitution for missing Godbolt inside of the LuaJIT community.

There is a nice LuaJIT web inspector by @mejedi though, which shows bytecode IR and generated x86 assembly for the given Lua script.

But there is no Tarantool version of LuaJIT integrated;
There is no way to generate assembly listing for non-x86 targets;
and there is no convenient way to share links as we used to in Godbolt;

It looks like that godbolt.org is the best infrastructure that should be used here:

It has script persistence and short-cutter activated;
It has facilities to run any compiler inside of Docker VM;
It has builtin facilities to parse compiler assembly and for intercepting and redirecting it to disassembly (or IR) panes inside of Compiler Explorer.

So Lua/LuaJITs should be integrated into the Godbolt infrastructure similarly to many other languages are already presented there.

Difficulty: Medium

Expected size: 175 hours

Required skills: TypeScript/JavaScript, Docker

Optional skills: C/C++, LuaJIT

Mentor: Timur Safin, Igor Munkin

Introduce dynamic trace probes to Tarantool kernel for SystemTap/dtrace

Dynamic tracing probes as introduced to DTrace in Solaris, have been ported to BSD*, Linux, and even Windows lately. Similarly, Linux has SystemTap probing facilities that support tracing functionality but with fewer ergonomics.

It's a common approach to use trace probes in database engines to make easier collecting of system internals on production sites.

Please see examples of dynamic tracing developed for database engines:

PostgreSQL dynamic tracing https://www.postgresql.org/docs/current/dynamic-trace.html
Tracing Tracing mysqld Using DTrace
Adding User Space Probing to an Application https://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps

One should add tracing probes, evaluate overhead, and provide tutorial guidelines for using DTrace/SystemTap for Tarantool performance investigations.

Difficulty: Medium

Expected size: 175 hours

Required skills: C/C++, Linux kernel

Mentor: Cyrill Gorkunov, Timur Safin

Deterministic simulator

Testing of distributed systems is hard because any non-deterministic behavior (changed timings in networking, differences in CPUs scheduler quotas, or even CPU shortages) may lead to flaky test failures.

Injection of errors (box.error.injection) used by Tarantool is considered a reliable approach to deterministically generate events during testing. But there is no way to make multi-mode networking or CPU timings behave deterministically in the current architecture.

One may try to virtualize various layers of API (e.g. monotonic clock or networking) via framework similar to whirl-framework

Another approach similar to error injection - is to generate single-threaded system simulator with all IO and scheduling being fully virtualized, using FoundationDB approach.

One should explore possible approaches to gain full determinism of testing for the "distributed" system, select the least invasive, and try to apply the selected approach to some of the currently available replication tests.

Difficulty: Hard

Expected size: 350 hours

Required skills: C/C++, Linux

Mentor: Timur Safin

Full Tarantool/Lua support for IntelliJ IDEA platform

At the moment there is rather a decent Lua syntax support available for the IntelliJ IDEA platform. There is no integrated Tarantool box API yet, so there is no native code complete available for built-in modules. And there is not currently possible to debug remotely Tarantool Lua scripts. Or run specific tests natively from inside of IDE.

One should add Tarantool box API definitions to syntax highlighter;
Integrate LuaTest framework support;
and modifying mobdebug by Paul Kluchenko for Tarantool flavor support, one should create an IntelliJ IDEA extension which would allow to remotely attach to Tarantool Lua debugger, set breakpoints, watch variables, step into and out.

Difficulty: Medium

Expected size: 175 hours

Required skills: Java, Lua

Optional skills: C/C++

Mentor: Timur Safin

Developer Guidelines ↗

Architecture

Server architecture
R tree index quick start and usage
LuaJIT
Vinyl
- Vinyl Architecture
- Vinyl Disk Layout
- Vinyl math
- Vinyl Cookbook
  - Bullet1
SQL
Appserver modules
- Datetime-internals
- Datetime‐RFC
Testing
Performance
Privileges and Access control

How To ...?

Recipes

Upgrade instructions

Useful links

Old discussions

Personal pages

Tarantool Summer of Code 2022 ideas

Table of Contents

Database engines

Add column storage engine

Add B+ Tree storage engine

Calculate bloom filter and page size automatically

Implement persistent tuple cache

Make vinyl memory level dumping independent from memtx

Add massive end-to-end testing for concurrent operations in vinyl

Adding the HyperLogLog algorithm to Tarantool

LuaJIT

Fuzzing Lua (LuaJIT) in Tarantool

SQL

Grammar-based fuzzing of SQL engine

Integration with SQLancer

LLVM JIT for Tarantool SQL engine (stage 2)

Ecosystem

Implement Tarantool/LuaJIT backend for Compiler Explorer

Introduce dynamic trace probes to Tarantool kernel for SystemTap/dtrace

Deterministic simulator

Full Tarantool/Lua support for IntelliJ IDEA platform

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!