This is a draft of a design document describing the capabilities and implementation of lakeFS 0.1.
For full documentation, including in-depth architecture, design, API reference and deployment, see https://docs.lakefs.io
lakeFS is a data lake management solution, offering the following high-level capabilities:
- Cross-lake ACID operations - change several objects/collections as one atomic operation to avoid inconsistencies during complex migrations/recalculations
- Reproducibility - travel backwards in time and match versions of data to the code that generated it
- Deduping by default - no more copying input/sample data to side directories that are later a nightmare to manage and track (see #2)
- Collaboration - allow teams to share data and approve changes to it, including review and validation steps
- Production safety - accidentally deleted/overwritten/corrupted a critical collection? Revert instantly.
- Format agnostic - use Parquet, image files, CSVs, or all of the above; lakeFS works with structured and unstructured data alike
To achieve this, we require 4 main capabilities:
- Git-like semantics that can scale to many petabytes of data (or terabytes of metadata):
  - Committing and rolling back versions
  - Snapshot isolation, such that one branch's changes are completely isolated from other branches
  - Branching and merging that is (relatively) cheap to perform and can be done often
  - Transaction support (i.e. a throw-away branch that gets merged on commit and discarded on rollback; see the example after the lists below)
- Sitting between processing and data, allowing us to observe who is reading and writing data, and how:
  - Strongly consistent writes and reads, including list-after-write and read-after-write (removing the need for S3Guard/EMRFS/etc.)
- API compatibility with common Object Stores, starting with the S3 API (see the API subset below)
- Metadata journaling to allow view materialization, corruption recovery and migration out of the service

Non-goals:
- Improving performance/durability/availability
- Compute/orchestration management: lakeFS should be a solution added to existing systems without migration costs or upfront investment
- S3 API compatibility beyond what Spark/Hadoop/ML tooling currently uses (e.g. object-level versioning, static website hosting, torrents, etc.)
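As a sketch of how the Git-like semantics and the S3 gateway combine in practice, the following assumes a repository named my-repo with a master branch, the gateway address from the example configuration below, and present-day lakectl syntax; the exact flags and URI format in 0.1 may differ, and all repository, branch and path names here are made up:

# create an isolated branch off master - a cheap, metadata-only operation
$ lakectl branch create lakefs://my-repo/nightly-etl --source lakefs://my-repo/master
# write through the S3-compatible gateway; other branches never see these changes
$ aws s3 cp ./events.parquet s3://my-repo/nightly-etl/collections/events/ --endpoint-url http://s3.example.com:8000
# commit and merge atomically into master; deleting the branch instead acts as a rollback
$ lakectl commit lakefs://my-repo/nightly-etl -m "nightly ETL output"
$ lakectl merge lakefs://my-repo/nightly-etl lakefs://my-repo/master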
When running the lakefs binary, you can pass a YAML configuration file:
$ lakefs --config /path/to/configuration.yaml
Here's an example configuration file:
---
logging:
  format: text # or json
  level: DEBUG # or INFO, WARN, ERROR, NONE
  output: "-" # for stdout, or a path to a log file

metadata:
  db:
    # Make sure the DB connection string includes search_path (no need to create this schema beforehand)
    uri: "postgres://localhost:5432/postgres?search_path=lakefs_index&sslmode=disable"

auth:
  db:
    # Make sure the DB connection string includes search_path (no need to create this schema beforehand)
    uri: "postgres://localhost:5432/postgres?search_path=lakefs_auth&sslmode=disable"
  encrypt:
    # This value must be set by the user.
    # In production it's recommended to read this value from a safe place like AWS KMS or Hashicorp Vault
    secret_key: "10a718b3f285d89c36e9864494cdd1507f3bc85b342df24736ea81f9a1134bcc09e90b6641393a0a89d1a645dcf990fbd5f48cae092a5eee7b804e45c7d6a20e6b840e8124334312e01dde9a087228485512feb0780f4589d01fd2cc825dbb1925c3968c95083c2fca5ac07d61a10d15fdb6f43236dc5347dddfa3e7852f1654410ef53082b0007f33387dcdfd735c5b48e61991ceef3e8bba7267af4f0383a73af07b0c767ddd78b9a771ccb8be3d6662191f1b76d0e725ac59f1a63d110b018c2d0a727097ed9363fcb3f822d8dc7f12584bda25182cd74fece779977ca24caf774a3d5e3579228b27bbac99a5b7384367a5a6f3da629d00159edec45bc8fa"

blockstore:
  type: s3 # or ["local", "mem"]
  s3:
    region: us-east-1
    profile: default # optional, implies using a credentials file
    credentials_file: /path/to/.aws/credentials # optional, will use the default AWS path if not specified
    credentials: # optional, will use these hard coded credentials if supplied
      access_key_id: "AKIA..."
      access_secret_key: "..."
      session_token: "..."
  # if instead of S3 you'd like to write the data itself locally (for testing only!)
  local:
    path: ~/lakefs/data

gateways:
  s3:
    listen_address: "0.0.0.0:8000"
    domain_name: s3.example.com
    region: us-east-1

api:
  listen_address: "0.0.0.0:8001"
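The secret_key shown above is only a placeholder. Any sufficiently long random value should work (an assumption - check the docs for exact requirements); one way to generate a hex string of similar length, assuming OpenSSL is installed:

$ openssl rand -hex 256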
To build lakeFS from source:

- Install dependencies (varies based on OS):
  - Docker
  - PostgreSQL (>= 11)
  - Node (10+) and npm
  - GNU Make
- Generate static assets:
  $ make gen
- Run tests (full suite, including Go's race detector):
  $ make test
- Build a static binary:
  $ make build
You should end up with two binaries in your working directory: lakefs and lakectl. See the docs on how to run them.
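To try them out locally, here is a minimal sketch: start PostgreSQL in Docker (matching the dependency list above) and point lakefs at the example configuration. The container name and password are placeholders, and the db uri values in the configuration must be adjusted to match:

$ docker run --name lakefs-pg -e POSTGRES_PASSWORD=lakefs -p 5432:5432 -d postgres:11
# update metadata.db.uri and auth.db.uri accordingly, e.g.:
#   postgres://postgres:lakefs@localhost:5432/postgres?search_path=lakefs_index&sslmode=disable
$ ./lakefs --config /path/to/configuration.yaml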
After building as described above, you should also have a third binary called lakefs-loadtest. Run lakefs-loadtest --help for details on how to run load tests on your lakeFS instance. Alternatively, there is a unit test in local_load_test.go, which will run the tests on a dedicated test server.
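For example, via Go's test runner (the package path and test name here are guesses based on the file name - adjust them to the actual tree):

$ go test -v -run TestLocalLoad ./loadtest/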