Commit bd71cad

added chapters on the packfile and how git stores objects
1 parent ab04d88 commit bd71cad

7 files changed

+146
-6
lines changed


script/html.rb

Lines changed: 4 additions & 5 deletions
@@ -8,11 +8,10 @@
 def do_replacements(html, type = :html)
 
   # highlight code
-  #html = html.gsub /<pre><code>.*?<\/code><\/pre>/m do |code|
-  # code = code.gsub('<pre><code>', '').gsub('</code></pre>', '').gsub('&lt;', '<').gsub('&gt;', '>').gsub('&amp;', '&')
-  # Uv.parse(code, "xhtml", "ruby", false, "mac_classic")
-  #end
-
+  html = html.gsub /<pre><code>ruby.*?<\/code><\/pre>/m do |code|
+    code = code.gsub('<pre><code>ruby', '').gsub('</code></pre>', '').gsub('&lt;', '<').gsub('&gt;', '>').gsub('&amp;', '&')
+    Uv.parse(code, "xhtml", "ruby", false, "mac_classic")
+  end
 
   # replace gitlinks
   html.gsub! /linkgit:(.*?)\[\d\]/ do |code, waa|
Lines changed: 75 additions & 0 deletions
@@ -1,2 +1,77 @@
11
## How Git Stores Objects ##
22

3+
This chapter goes into detail about how Git physically stores objects.
4+
5+
All objects are stored as compressed data, addressed by their sha values. Each
6+
contains the object type, size and contents, compressed with zlib.
7+
8+
There are two formats that Git keeps objects in - loose objects and
9+
packed objects.
10+
11+
### Loose Objects ###
12+
13+
Loose objects are the simpler format: the compressed data for each object is
14+
stored in a single file on disk. Every object is written to a separate file.
15+
16+
If the sha of your object is <code>ab04d884140f7b0cf8bbf86d6883869f16a46f65</code>,
17+
then the file will be stored in the following path:
18+
19+
GIT_DIR/objects/ab/04d884140f7b0cf8bbf86d6883869f16a46f65
20+
21+
Git takes the first two characters as the subdirectory name, so that
22+
there are never too many objects in one directory. The actual file name is
23+
the remaining 38 characters.
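
The path rule above can be sketched in a few lines of Ruby. The helper name `loose_object_path` and the `.git` directory argument are hypothetical, chosen only for illustration; the sha is the example value used in the text.

```ruby
# Hypothetical helper: maps an object's sha to the path of its loose file.
# The first two hex characters become the subdirectory, the other 38 the file name.
def loose_object_path(git_dir, sha)
  File.join(git_dir, 'objects', sha[0, 2], sha[2..-1])
end

path = loose_object_path('.git', 'ab04d884140f7b0cf8bbf86d6883869f16a46f65')
puts path  # prints .git/objects/ab/04d884140f7b0cf8bbf86d6883869f16a46f65
```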
24+
25+
The easiest way to describe exactly how the object data is stored is this Ruby
26+
implementation of object storage:
27+
    ruby
    require 'digest/sha1'
    require 'fileutils'
    require 'zlib'

    # Stores content as a loose object of the given type and returns its sha.
    # Assumes @git_dir holds the path to the repository's GIT_DIR.
    def put_raw_object(content, type)
      size = content.bytesize.to_s      # size in bytes, not characters

      header = "#{type} #{size}\0"      # for example "blob 12\0"
      store = header + content

      # The sha is computed over the uncompressed header plus content.
      sha1 = Digest::SHA1.hexdigest(store)
      path = File.join(@git_dir, 'objects', sha1[0...2], sha1[2..-1])

      unless File.exist?(path)
        FileUtils.mkdir_p(File.dirname(path))
        File.open(path, 'wb') do |f|
          f.write Zlib::Deflate.deflate(store)
        end
      end
      sha1
    end
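
Reading a loose object back is the same process in reverse. The following is a minimal sketch, not Git's own code: `read_loose_object` and `demo_round_trip` are hypothetical names, and the demo writes a single blob into a temporary directory so that it can run without a real repository.

```ruby
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'
require 'zlib'

# Hypothetical reader: inflates a loose object file and splits the
# "type size\0" header from the content that follows it.
def read_loose_object(git_dir, sha)
  path = File.join(git_dir, 'objects', sha[0, 2], sha[2..-1])
  store = Zlib::Inflate.inflate(File.binread(path))
  header, content = store.split("\0", 2)
  type, size = header.split(' ')
  raise "size mismatch for #{sha}" unless content.bytesize == size.to_i
  [type, content]
end

# Writes one blob by hand into a scratch directory, then reads it back.
def demo_round_trip
  Dir.mktmpdir do |git_dir|
    content = "hello packfile\n"
    store = "blob #{content.bytesize}\0#{content}"
    sha = Digest::SHA1.hexdigest(store)
    path = File.join(git_dir, 'objects', sha[0, 2], sha[2..-1])
    FileUtils.mkdir_p(File.dirname(path))
    File.binwrite(path, Zlib::Deflate.deflate(store))
    [sha, *read_loose_object(git_dir, sha)]
  end
end

sha, type, content = demo_round_trip
puts "#{sha} #{type} #{content.bytesize}"
```

The size check is there because the header records the uncompressed content length, which gives a cheap sanity test after inflating.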
48+
49+
### Packed Objects ###
50+
51+
The other format for object storage is the packfile. Since Git stores each
52+
version of each file as a separate object, it can get pretty inefficient.
53+
Imagine having a file several thousand lines long and changing a single line.
54+
Git will store the second file in its entirety, which is a great big waste
55+
of space.
56+
57+
In order to save that space, Git utilizes the packfile. This is a format
58+
where Git will only save the part that has changed in the second file, with
59+
a pointer to the file it is similar to.
60+
61+
When objects are written to disk, it is often in the loose format, since
62+
that format is less expensive to access. However, eventually you'll want
63+
to save the space by packing up the objects - this is done with the
64+
linkgit:git-gc[1] command. It will use a rather complicated heuristic to
65+
determine which files are likely most similar and base the deltas off that
66+
analysis. There can be multiple packfiles; they can be repacked if necessary
67+
(linkgit:git-repack[1]) or unpacked back into loose files
68+
(linkgit:git-unpack-objects[1]) relatively easily.
69+
70+
Git will also write out an index file for each packfile that is much smaller
71+
and contains offsets into the packfile to more quickly find specific objects
72+
by sha.
73+
74+
The actual details of the packfile implementation are found in the Packfile
75+
chapter a little later on.
76+
77+
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
## The Packfile ##
2+
3+
This chapter explains in detail, down to the bits, how the packfile and
4+
pack index files are formatted.
5+
6+
### The Packfile Index ###
7+
8+
First off, we have the packfile index, which is basically just a series of
9+
bookmarks into a packfile.
10+
11+
There are two versions of the packfile index - version one, which is the default
12+
in versions of Git earlier than 1.6, and version two, which is the default
13+
from 1.6 forward, but which can be read by Git versions going back to 1.5.2.
14+
15+
Version 2 also includes a CRC checksum of each object so compressed data
16+
can be copied directly from pack to pack during repacking without
17+
undetected data corruption. Version 2 indexes can also handle packfiles
18+
larger than 4 GB.
19+
20+
[fig:packfile-index]
21+
22+
In both formats, the fanout table is simply a way to find the offset of a
23+
particular sha faster within the index file. In version 1, the offsets and
24+
shas are in the same space, whereas in version two there are separate tables
25+
for the shas, CRC checksums and offsets. At the end of both files are
26+
checksum shas for both the index file and the packfile it references.
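
To make the role of the fanout table concrete, here is a rough in-memory sketch. It assumes the layout described in Git's pack-format documentation: 256 entries, where entry N holds the number of objects whose first sha byte is less than or equal to N. The names `build_fanout` and `find_sha` are hypothetical, and the shas are made-up test values held in an array instead of being read from a real index file.

```ruby
require 'digest/sha1'

# Builds the 256-entry fanout table for a sorted list of hex shas.
# Entry i holds the number of shas whose first byte is <= i.
def build_fanout(sorted_shas)
  counts = Array.new(256, 0)
  sorted_shas.each { |sha| counts[sha[0, 2].to_i(16)] += 1 }
  total = 0
  counts.map { |count| total += count }
end

# Returns the position of sha in sorted_shas, or nil if it is absent.
# The fanout table narrows the binary search to shas sharing the first byte.
def find_sha(sorted_shas, fanout, sha)
  first = sha[0, 2].to_i(16)
  lo = first.zero? ? 0 : fanout[first - 1]
  hi = fanout[first]
  bucket = sorted_shas[lo...hi]
  idx = bucket.bsearch_index { |candidate| candidate >= sha }
  idx && bucket[idx] == sha ? lo + idx : nil
end

shas = %w[alpha beta gamma delta].map { |s| Digest::SHA1.hexdigest(s) }.sort
fanout = build_fanout(shas)
puts find_sha(shas, fanout, shas[2]).inspect
```

In a real index the position found this way is then used to look up the object's offset into the packfile.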
27+
28+
Importantly, packfile indexes are *not* necessary to extract objects from
29+
a packfile; they are simply used to *quickly* retrieve individual objects from
30+
a pack. The packfile format is used in the upload-pack and receive-pack programs
31+
(fetch and push protocols) to transfer objects and there is no index used then
32+
- it can be built after the fact by scanning the packfile.
33+
34+
### The Packfile Format ###
35+
36+
The packfile itself is a very simple format. The first four bytes are the
37+
string 'PACK', which lets you check that you are reading from the start of a
38+
packfile, followed by a 4-byte version number and a 4-byte object count. After
39+
that, you get a series of packed objects, which each consist of an object
40+
header and object contents. At the end of the packfile is a 20-byte SHA1
41+
checksum of everything that precedes it in the packfile.
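
Checking the start of a packfile takes only a few lines. This sketch assumes, following Git's pack-format documentation, that the 'PACK' signature is followed by a 4-byte version number and a 4-byte object count, both in network byte order. `PackHeader` and `parse_pack_header` are hypothetical names, and the sample bytes are built by hand, not read from a real pack.

```ruby
# Hypothetical container for the two numeric header fields.
PackHeader = Struct.new(:version, :object_count)

# Parses the first 12 bytes of a packfile: signature, version, object count.
def parse_pack_header(bytes)
  signature, version, count = bytes.unpack('a4NN')
  raise ArgumentError, 'not a packfile' unless signature == 'PACK'
  PackHeader.new(version, count)
end

sample = 'PACK' + [2, 3].pack('NN')  # hand-built header: version 2, 3 objects
header = parse_pack_header(sample)
puts "version #{header.version}, #{header.object_count} objects"
```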
42+
43+
[fig:packfile-format]
44+
45+
The object header is a series of one or more 1 byte (8 bit) hunks that
46+
specify the type of object the following data is, and the size of the data
47+
when expanded. Each byte is really 7 bits of data, with the first bit being
48+
used to say if that hunk is the last one or not before the data starts. If
49+
the first bit is a 1, you will read another byte; otherwise the data starts
50+
next. The first 3 data bits in the first byte specify the type of data,
51+
according to the table below.
52+
53+
(Currently, of the 8 values that can be expressed
54+
with 3 bits (0-7), 0 (000) is 'undefined' and 5 (101) is not yet used.)
55+
56+
Here, we can see an example of a header of two bytes, where the first
57+
specifies that the following data is a commit, and the remainder of the first
58+
and the last 7 bits of the second specify that the data will be 144 bytes
59+
when expanded.
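
The decoding rule can be sketched as follows. Two things here are assumptions not stated in the text: the type numbers, which are taken from Git's pack-format documentation, and the rule that each additional byte supplies the next more significant 7 bits of the size. The input bytes 0x90 and 0x09 were constructed for this sketch to match the description of a commit that expands to 144 bytes, and `decode_object_header` is a hypothetical name.

```ruby
# Type numbers as listed in Git's pack-format documentation (an assumption
# here); 0 and 5 are deliberately absent.
OBJECT_TYPES = { 1 => :commit, 2 => :tree, 3 => :blob, 4 => :tag,
                 6 => :ofs_delta, 7 => :ref_delta }.freeze

# Decodes a packed-object header from an array of byte values.
# Returns [type, expanded_size, header_length_in_bytes].
def decode_object_header(bytes)
  byte = bytes[0]
  type = (byte >> 4) & 0x07        # 3 type bits after the continuation bit
  size = byte & 0x0f               # low 4 bits start the size
  shift = 4
  used = 1
  while (byte & 0x80) != 0         # continuation bit set: read another byte
    byte = bytes[used]
    size |= (byte & 0x7f) << shift # each extra byte adds 7 more size bits
    shift += 7
    used += 1
  end
  [OBJECT_TYPES.fetch(type), size, used]
end

p decode_object_header([0x90, 0x09])  # prints [:commit, 144, 2]
```

Returning the header length alongside the type and size lets a caller know where the compressed data begins.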
60+
61+
[fig:packfile-logic]
62+
63+
It is important to note that the size specified in the header data is not
64+
the size of the data that actually follows, but the size of that data *when
65+
expanded*. This is why the offsets in the packfile index are so useful;
66+
otherwise you have to expand every object just to tell when the next header
67+
starts.

text/52_Working_With_Packfiles/0_ Working_With_Packfiles.markdown

Lines changed: 0 additions & 1 deletion
This file was deleted.
