## Files

The implementation of leveldb is similar in spirit to the representation of a
single [Bigtable tablet (section 5.3)](https://research.google/pubs/pub27898/).
However the organization of the files that make up the representation is
somewhat different and is explained below.

Each database is represented by a set of files stored in a directory. There are
several different types of files as documented below:

### Log files

A log file (*.log) stores a sequence of recent updates. Each update is appended
to the current log file. When the log file reaches a pre-determined size
(approximately 4MB by default), it is converted to a sorted table (see below)
and a new log file is created for future updates.

A copy of the current log file is kept in an in-memory structure (the
`memtable`). This copy is consulted on every read so that read operations
reflect all logged updates.

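The coupling between the log file and the memtable can be sketched as follows. This is a simplified illustration with hypothetical types, not leveldb's actual classes: the real write path adds batching, sequence numbers, a skiplist memtable, and a checksummed log record format.

```cpp
// Sketch of the write path described above: every update is appended to
// the log file first (for durability), then applied to the in-memory
// memtable so that reads immediately reflect all logged updates.
// SketchDB is a hypothetical toy class, not part of leveldb.
#include <fstream>
#include <map>
#include <string>

class SketchDB {
 public:
  explicit SketchDB(const std::string& log_path)
      : log_(log_path, std::ios::app) {}

  void Put(const std::string& key, const std::string& value) {
    log_ << key << '\t' << value << '\n';  // 1. append to the current log file
    log_.flush();
    memtable_[key] = value;                // 2. apply to the memtable copy
  }

  // Reads consult the memtable first; a miss here would fall through to
  // the sorted tables on disk (omitted in this sketch).
  bool Get(const std::string& key, std::string* value) const {
    auto it = memtable_.find(key);
    if (it == memtable_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::ofstream log_;
  std::map<std::string, std::string> memtable_;  // sorted by key
};
```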
### Sorted tables

A sorted table (*.ldb) stores a sequence of entries sorted by key. Each entry is
either a value for the key, or a deletion marker for the key. (Deletion markers
are kept around to hide obsolete values present in older sorted tables.)

The set of sorted tables is organized into a sequence of levels. The sorted
table generated from a log file is placed in a special **young** level (also
called level-0). When the number of young files exceeds a certain threshold
(currently four), all of the young files are merged together with all of the
overlapping level-1 files to produce a sequence of new level-1 files (we create
a new level-1 file for every 2MB of data).

Files in the young level may contain overlapping keys. However files in other
levels have distinct non-overlapping key ranges. Consider level number L where
L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB
for level-1, 100MB for level-2, ...), one file in level-L, and all of the
overlapping files in level-(L+1), are merged to form a set of new files for
level-(L+1). These merges have the effect of gradually migrating new updates
from the young level to the largest level using only bulk reads and writes
(i.e., minimizing expensive seeks).

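The size rule above (10^L MB for level L >= 1) can be written down directly. A minimal sketch, consistent with the rule as stated; level-0 is excluded because it is triggered by file count (currently four), not by total bytes:

```cpp
// Maximum combined file size for a level, per the rule above:
// (10^L) MB for level L >= 1, i.e. 10MB for level-1, 100MB for
// level-2, and so on.
#include <cassert>

double MaxBytesForLevel(int level) {
  assert(level >= 1);  // level-0 is governed by file count, not bytes
  double result = 10.0 * 1048576.0;  // 10MB for level-1
  while (level > 1) {
    result *= 10.0;
    --level;
  }
  return result;
}
```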
### Manifest

A MANIFEST file lists the set of sorted tables that make up each level, the
corresponding key ranges, and other important metadata. A new MANIFEST file
(with a new number embedded in the file name) is created whenever the database
is reopened. The MANIFEST file is formatted as a log, and changes made to the
serving state (as files are added or removed) are appended to this log.

### Current

CURRENT is a simple text file that contains the name of the latest MANIFEST
file.

### Info logs

Informational messages are printed to files named LOG and LOG.old.

### Others

Other files used for miscellaneous purposes may also be present (LOCK, *.dbtmp).

## Level 0

When the log file grows above a certain size (4MB by default), create a brand
new memtable and log file and direct future updates here.

In the background:

1. Write the contents of the previous memtable to an sstable.
2. Discard the memtable.
3. Delete the old log file and the old memtable.
4. Add the new sstable to the young (level-0) level.

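The trigger for this switch-over is a simple size check. A minimal sketch; the 4MB default corresponds to leveldb's `write_buffer_size` option, and `ShouldRotateLog` is a hypothetical helper name, not leveldb's API:

```cpp
// Decide whether to rotate to a fresh memtable and log file, per the
// rule above: rotate once the current log exceeds the write buffer
// size (4MB by default). ShouldRotateLog is a hypothetical helper.
#include <cstddef>

const std::size_t kDefaultWriteBufferSize = 4 * 1048576;  // 4MB

bool ShouldRotateLog(std::size_t current_log_bytes,
                     std::size_t write_buffer_size = kDefaultWriteBufferSize) {
  return current_log_bytes > write_buffer_size;
}
```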
## Compactions

When the size of level L exceeds its limit, we compact it in a background
thread. The compaction picks a file from level L and all overlapping files from
the next level L+1. Note that if a level-L file overlaps only part of a
level-(L+1) file, the entire file at level-(L+1) is used as an input to the
compaction and will be discarded after the compaction. Aside: because level-0
is special (files in it may overlap each other), we treat compactions from
level-0 to level-1 specially: a level-0 compaction may pick more than one
level-0 file in case some of these files overlap each other.

A compaction merges the contents of the picked files to produce a sequence of
level-(L+1) files. We switch to producing a new level-(L+1) file after the
current output file has reached the target file size (2MB). We also switch to a
new output file when the key range of the current output file has grown enough
to overlap more than ten level-(L+2) files. This last rule ensures that a later
compaction of a level-(L+1) file will not pick up too much data from
level-(L+2).

The old files are discarded and the new files are added to the serving state.

Compactions for a particular level rotate through the key space. In more detail,
for each level L, we remember the ending key of the last compaction at level L.
The next compaction for level L will pick the first file that starts after this
key (wrapping around to the beginning of the key space if there is no such
file).

Compactions drop overwritten values. They also drop deletion markers if there
are no higher numbered levels that contain a file whose range overlaps the
current key.

### Timing

Level-0 compactions will read up to four 1MB files from level-0, and at worst
all the level-1 files (10MB). I.e., we will read 14MB and write 14MB.

Other than the special level-0 compactions, we will pick one 2MB file from level
L. In the worst case, this will overlap ~12 files from level L+1 (10 because
level-(L+1) is ten times the size of level-L, and another two at the boundaries
since the file ranges at level-L will usually not be aligned with the file
ranges at level-(L+1)). The compaction will therefore read 26MB and write 26MB.
Assuming a disk IO rate of 100MB/s (ballpark range for modern drives), the worst
compaction cost will be approximately 0.5 second.

If we throttle the background writing to something small, say 10% of the full
100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at
10MB/s, we might build up lots of level-0 files (~50 to hold the 5*10MB). This
may significantly increase the cost of reads due to the overhead of merging more
files together on every read.

Solution 1: To reduce this problem, we might want to increase the log switching
threshold when the number of level-0 files is large. The downside is that
the larger this threshold, the more memory we will need to hold the
corresponding memtable.

Solution 2: We might want to decrease the write rate artificially when the
number of level-0 files goes up.

Solution 3: We work on reducing the cost of very wide merges. Perhaps most of
the level-0 files will have their blocks sitting uncompressed in the cache and
we will only need to worry about the O(N) complexity in the merging iterator.

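The worst-case arithmetic above can be checked with a short calculation. A sketch only; the 100MB/s figure is the document's ballpark, not a measured number, and `WorstCaseCompactionSeconds` is a hypothetical helper:

```cpp
// Worst-case IO for a non-level-0 compaction, per the estimate above:
// one 2MB input file from level-L plus ~12 overlapping 2MB files from
// level-(L+1) gives 26MB of reads, and the merged output is another
// 26MB of writes.
double WorstCaseCompactionSeconds(double disk_mb_per_sec) {
  const double read_mb = 2.0 + 12.0 * 2.0;  // 26MB read
  const double write_mb = read_mb;          // 26MB written
  return (read_mb + write_mb) / disk_mb_per_sec;
}
```

At the full 100MB/s this gives roughly half a second; throttled to 10MB/s it gives roughly five seconds, matching the numbers in the text.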
### Number of files

Instead of always making 2MB files, we could make larger files for larger levels
to reduce the total file count, though at the expense of more bursty
compactions. Alternatively, we could shard the set of files into multiple
directories.

An experiment on an ext3 filesystem on Feb 04, 2011 shows the following timings
to do 100K file opens in directories with varying number of files:

| Files in directory | Microseconds to open a file |
|-------------------:|----------------------------:|
|               1000 |                           9 |
|              10000 |                          10 |
|             100000 |                          16 |

So maybe even the sharding is not necessary on modern filesystems?

## Recovery
182
62
183
- * Read CURRENT to find name of the latest committed MANIFEST
184
- * Read the named MANIFEST file
185
- * Clean up stale files
186
- * We could open all sstables here, but it is probably better to be lazy...
187
- * Convert log chunk to a new level-0 sstable
188
- * Start directing new writes to a new log file with recovered sequence#
63
+
189
64
190
65
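The first step above relies on CURRENT being plain text: it holds just the name of the latest MANIFEST, terminated by a newline. A hypothetical toy sketch of that step (the manifest name shown is an example; leveldb's real recovery logic lives in `VersionSet::Recover`):

```cpp
// Sketch of the first recovery step above: read the CURRENT file in
// the database directory to learn the name of the latest committed
// MANIFEST. ReadCurrentFile is a hypothetical helper, not leveldb API.
#include <fstream>
#include <string>

std::string ReadCurrentFile(const std::string& dbname) {
  std::ifstream in(dbname + "/CURRENT");
  std::string manifest_name;
  std::getline(in, manifest_name);  // e.g. "MANIFEST-000123"
  return manifest_name;             // empty if CURRENT is missing
}
```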
## Garbage collection of files
191
66
192
- `RemoveObsoleteFiles()` is called at the end of every compaction and at the end
193
- of recovery. It finds the names of all files in the database. It deletes all log
194
- files that are not the current log file. It deletes all table files that are not
195
- referenced from some level and are not the output of an active compaction.
67
+
196
68
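The selection rule can be sketched over file-number sets rather than a real directory. A simplified illustration with hypothetical helper names, not the body of `RemoveObsoleteFiles()` itself:

```cpp
// Sketch of the garbage-collection rule above: keep the current log
// file, and keep every table file that is live (referenced from some
// level, or the output of an active compaction); everything else is
// obsolete. Hypothetical helpers operating on file numbers.
#include <cstdint>
#include <set>
#include <vector>

std::vector<uint64_t> ObsoleteLogFiles(const std::set<uint64_t>& log_files,
                                       uint64_t current_log) {
  std::vector<uint64_t> obsolete;
  for (uint64_t f : log_files)
    if (f != current_log) obsolete.push_back(f);  // not the current log
  return obsolete;
}

std::vector<uint64_t> ObsoleteTableFiles(const std::set<uint64_t>& table_files,
                                         const std::set<uint64_t>& live) {
  std::vector<uint64_t> obsolete;
  for (uint64_t f : table_files)
    if (live.count(f) == 0)  // not referenced, not a compaction output
      obsolete.push_back(f);
  return obsolete;
}
```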