-
Notifications
You must be signed in to change notification settings - Fork 387
Tarantool Summer of Code 2022 ideas
Please find below the preliminary list of ideas we would like to explore during this season of Tarantool Summer of Code. Each idea has a corresponding difficulty level, required/optional skill sets, and mentor names.
Not all ideas have the same amount of details in explanation, but we believe they might be of a good start in an interesting direction.
-
Database engines
- Add column storage engine
- Add B+ Tree storage engine
- Calculate bloom filter and page size automatically
- Implement persistent tuple cache
- Make vinyl memory level dumping independent from memtx
- Add massive end-to-end testing for concurrent operations in vinyl
- Adding the HyperLogLog algorithm to Tarantool
- LuaJIT
- SQL
- Ecosystem
Tarantool storage engines MemTx and Vinyl are row (tuple) based, which is working fine for OLTP activity but is suboptimal for OLAP queries. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with column-based storage. As a bonus, for cluster mode, column storage may store different column families on different nodes for a better locality. Such an approach may provide (or maybe not) better scalability for OLAP queries in a wide-column cluster.
- We need to experiment with a local column storage scenario
- And wide-column scenario for column families spread over cluster nodes;
- Measure benefits and overhead;
Difficulty: Hard
Expected size: 350 hours
Required skills: C/C++
Mentor: Kirill Yukhin
Persistent, LSM-based Tarantool Vinyl storage engine provides good performance for write-only activities, but is suboptimal for reading queries, especially if one query cold data. On the other hand, B+ tree-based storage engines better fit the scenarios when we not only write data (across clusters) but also need to query cold data for simple analytic needs. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with B+ tree-based storage.
- One needs to create this engine, and plug it into the generic query flow in Tarantool (making sure that Lua and SQL requests still work for new engine);
- One needs to compare its performance against Vinyl and measure memory benefits or losses.
Difficulty: Hard
Expected size: 350 hours
Required skills: C/C++
Mentor: Timur Safin, Cyrill Gorkunov
Vinyl is disk storage based on the LSM tree data structure. To speed up data access (which may take a while for disk storage) we use a probabilistic data structure - bloom filter. It allows us to eliminate excess data lookups - we can skip some files on disk if the bloom filter application returns a negative result. We should automatically adjust bloom filters and page size to fit a pre-defined page index size. We have the following automatic adjustment strategies, which we should all employ:
- avoid having bloom filters on lower levels; the deeper the LSM level, the less value per byte used they provide
- increase page size (again, starting from the bottom-most pages) - the larger the page size, the smaller the page index.
Related issues :
Difficulty: Medium
Expected size: 175 hours
Required skills: C
Mentor: Nikita Pettik, Anastasiy Belyaev
To speed up data access in Vinyl we also maintain a cache of tuples (to be more precise - chains of tuples) in RAM. However, it is not persistent, i.e. in instance reload its content is erased. It would be nice to dump the tuple cache during the checkpoint and recover it during vinyl recovery.
Related issues :
vinyl: persistent tuple cache #2065
Difficulty: Hard
Expected size: 350 hours
Required skills: C
Mentor: Nikita Pettik, Anastasiy Belyaev
Currently, the only way of forcing LSM-tree based vinyl engine's memory level dump is by calling box.snapshot()
. However, this call can be slow, because it also dumps the memtx engine's memory and it has some side effects, such as creating a checkpoint and invoking xlog garbage collection. It would be nice to have a separate handle for that, e.g. box.ctl.vinyl.dump()
.
Related issues:
Difficulty: Medium
Expected size: 175 hours
Required skills: C
Mentors: Nikita Pettik, Georgiy Lebedev, Anastasiy Belyaev
In #4821 we found that Tarantool may crash or get stuck if a failed memory dump happens concurrently with a DDL operation. However, this is not the only case where dump failure occurring in parallel with another operation may lead to bugs. Thus, we need to provide test scenarios covering a combination of dump/compact fails and concurrent index/space drop/alter/creation operations.
This project involves researching various test scenarios and test means (e.g., generation of DDL operations, possibly even fuzzing), and also investigating and fixing the bugs discovered during research.
Related issues:
- vinyl: provide long run stress-like testing
- vinyl: assertion fault in vy_scheduler_peek_dump()
- Improve test checking secondary index consistency
Difficulty: Medium
Expected size: 175 hours
Required skills: C
Mentors: Nikita Pettik, Georgiy Lebedev
As part of this task, it is necessary to implement the functionality described in the issue. You need to implement the HyperLogLog algorithm or its improved version as a separate module. We also need to figure out how to incorporate this into the current architecture - one idea is to make a new kind of index that supports this kind of operation.
Related issues:
Difficulty: Medium
Expected size: 175 hours
Required skills: C
Mentors: Nikita Pettik, Anastasiy Belyaev
LuaJIT is the heart of Tarantool and its stability is closely connected to the stability of LuaJIT. But statistics of issues said that we catch bugs with LuaJIT [1] using Tarantool on production. We need to decrease the probability of such bugs and introduce more generative testing for LuaJIT in the Tarantool release cycle.
Some ideas for testing are:
- Metamorphic testing using an existing corpus of Lua scripts.
- Automated generation of scripts using Lua grammar and introspection of Tarantool embedded functions.
- Fault injection in memory allocation using embedded error injection engine (see
src/lib/core/errinj.c
) or using an external library with LD_PRELOAD
References
- LuaJIT/LuaJIT#568
- https://github.com/tarantool/tarantool/issues?q=is%3Aissue+is%3Aopen+label%3Acrash+label%3Aluajit
- https://github.com/tarantool/tarantool/issues/4823
Difficulty: Moderate
Expected size: 175 hours
Required skills: C/C++, Lua
Desired Skills: LuaJIT internals, metamorphic testing, fuzzing testing, LibFuzzer
Mentors: Sergey Bronnikov, Igor Munkin
Tarantool has an SQL engine inherited from SQLite. We are interested in good testing of SQL engine using fuzzing testing. The main idea here is to generate SQL queries valid for Tarantool SQL grammar (see SQL reference), execute a query on Tarantool and make sure is everything good after that.
There are some ideas:
- Implement a LibFuzzer-based fuzzer, as the Chromium team did for SQLite, see https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/sqlite/fuzz
- Create a port of Sqllogictest for Tarantool
References
- Fuzzing Book: Fuzzing with Grammars
- LibFuzzer Tutorial
- libFuzzer – a library for coverage-guided fuzz testing
- An introduction to LLVM libFuzzer
- Structure-aware fuzzing for Clang and LLVM with libprotobuf-mutator
- libprotobuf-mutator
Difficulty: Medium
Expected size: 175 hours
Required skills: C/C++
Optional skills: SQL, Lua
Mentor: Sergey Bronnikov
SQLancer is a tool for finding logic bugs in SQL engines, written in Java. It uses Pivoted Query Synthesis (PQS), Non-optimizing Reference Engine Construction (NoREC), and Ternary Logic Partitioning (TLP) for building correct from syntax point of view queries and executing in the database. The efficiency of SQLances has proven with widely-used DBMSs like PostgreSQL, MySQL, SQLite, etc. where SQLancer found over 450 unique, previously unknown bugs.
References
Difficulty: Medium
Expected size: 175 hours
Required skills: Java
Optional skills: SQL
Mentor: Sergey Bronnikov
There are multiple virtual machines used at the moment by Tarantool:
- LuaJIT within its 2 operating modes (Lua bytecode interpreter, and JIT traces with native code generated);
- And SQLite-derivative VM code for SQL engine known as VDBE.
Vdbe is not providing the best performance and looks alien to the rest of Tarantool code base. One of GSoC 2021 projects explored LLVM JIT facilities for SQL engine speed improvements - LLVM JIT engine for Tarantool's DQL
One should pick this project up, make it feature-complete, benchmark performance, and fine-tune for the general case.
Difficulty: Hard
Expected size: 350 hours
Required skills: C/C++, JIT
Optional skills: LLVM, SQLite
Mentor: Georgiy Lebedev, Timur Safin, Nikita Pettik
Code Explorer also known as Godbolt (godbolt.org) became the greatest community tool used by multiple languages community for codeshare and performance investigations. It started with C++ compilers, but now it includes Ada, Go, Haskell, and many others (and more to come).
Unfortunately, there is no LuaJIT/Tarantool support and there is no a good substitution for missing Godbolt inside of the LuaJIT community.
There is a nice LuaJIT web inspector by @mejedi though, which shows bytecode IR and generated x86 assembly for the given Lua script.
- But there is no Tarantool version of LuaJIT integrated;
- There is no way to generate assembly listing for non-x86 targets;
- and there is no convenient way to share links as we used to in Godbolt;
It looks like that godbolt.org is the best infrastructure that should be used here:
- It has script persistence and short-cutter activated;
- It has facilities to run any compiler inside of Docker VM;
- It has builtin facilities to parse compiler assembly and for intercepting and redirecting it to disassembly (or IR) panes inside of Compiler Explorer.
So Lua/LuaJITs should be integrated into the Godbolt infrastructure similarly to many other languages are already presented there.
Difficulty: Medium
Expected size: 175 hours
Required skills: TypeScript/JavaScript, Docker
Optional skills: C/C++, LuaJIT
Mentor: Timur Safin, Igor Munkin
Dynamic tracing probes as introduced to DTrace in Solaris, have been ported to BSD*, Linux, and even Windows lately. Similarly, Linux has SystemTap probing facilities that support tracing functionality but with fewer ergonomics.
It's a common approach to use trace probes in database engines to make easier collecting of system internals on production sites.
Please see examples of dynamic tracing developed for database engines:
- PostgreSQL dynamic tracing https://www.postgresql.org/docs/current/dynamic-trace.html
- Tracing Tracing mysqld Using DTrace
- Adding User Space Probing to an Application https://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps
One should add tracing probes, evaluate overhead, and provide tutorial guidelines for using DTrace/SystemTap for Tarantool performance investigations.
Difficulty: Medium
Expected size: 175 hours
Required skills: C/C++, Linux kernel
Mentor: Cyrill Gorkunov, Timur Safin
Testing of distributed systems is hard because any non-deterministic behavior (changed timings in networking, differences in CPUs scheduler quotas, or even CPU shortages) may lead to flaky test failures.
Injection of errors (box.error.injection
) used by Tarantool is considered a
reliable approach to deterministically generate events during testing. But
there is no way to make multi-mode networking or CPU timings behave
deterministically in the current architecture.
One may try to virtualize various layers of API (e.g. monotonic clock or networking) via framework similar to whirl-framework
Another approach similar to error injection - is to generate single-threaded system simulator with all IO and scheduling being fully virtualized, using FoundationDB approach.
- “Testing Distributed Systems w/ Deterministic Simulation” by Will Wilson
- Simulation and Testing
- FoundationDB or: How I Learned to Stop Worrying and Trust the Database by Markus Pilman
One should explore possible approaches to gain full determinism of testing for the "distributed" system, select the least invasive, and try to apply the selected approach to some of the currently available replication tests.
Difficulty: Hard
Expected size: 350 hours
Required skills: C/C++, Linux
Mentor: Timur Safin
At the moment there is rather a decent Lua syntax support available for the IntelliJ IDEA platform. There is no integrated Tarantool box API yet, so there is no native code complete available for built-in modules. And there is not currently possible to debug remotely Tarantool Lua scripts. Or run specific tests natively from inside of IDE.
- One should add Tarantool box API definitions to syntax highlighter;
- Integrate LuaTest framework support;
- and modifying mobdebug by Paul Kluchenko for Tarantool flavor support, one should create an IntelliJ IDEA extension which would allow to remotely attach to Tarantool Lua debugger, set breakpoints, watch variables, step into and out.
Difficulty: Medium
Expected size: 175 hours
Required skills: Java, Lua
Optional skills: C/C++
Mentor: Timur Safin
- C coding guidelines ↗
- Lua coding guidelines ↗
- Python coding guidelines ↗
- Maintainer's guide
- Debugging
Architecture
- Server architecture
- R tree index quick start and usage
- LuaJIT
- Vinyl
- Vinyl Architecture
- Vinyl Disk Layout
- Vinyl math
- Vinyl Cookbook
- Bullet1
- SQL
- Appserver modules
- Testing
- Performance
- Privileges and Access control
How To ...?
- ... configure build system
- ... add new fuzzers
- ... build RPM or Deb package using packpack
- ... calculate memory size
- ... debug core dump of stripped tarantool
- ... debug core from different OS
- ... debug fuzzer
- ... generate new bootstrap snapshot
- ... use Address Sanitizer
- ... collect a coredump
- ... generate luacov report for builtin module
- ... verify modified lua files via luacheck
- ... verify Lua files in third_party?
- ... rerun failed jobs
- ... update a third party repository
- Fix wrong decimal indexing after upgrade to 2.10.1
- Caveats when upgrading a cluster on Tarantool 1.6
- Fix illegal field type in a space format when upgrading to 2.10.4
Useful links