Internals/PatchIndex

Why do we need a patch index?
Creating and inspecting the patch index
Patch Index internal representation
Files and corresponding Data Structures
A complete example
Automatic update
Automatic Creation
Algorithm for creation of patch index
Algorithm for patch index update
Performance

Patch index is an internal optimization in darcs. darcs uses patch index to quickly identify the patches that modified a given file. This is especially useful for the commands log and annotate.

Why do we need a patch index?

A Darcs repository is roughly of type [Patch]. This view is required for merging, conflict handling, cherry-picking, etc.

Some operations are OK using this representation, while others are very slow (mostly query operations):

annotate "Makefile": has to apply all patches (backwards from the current state) back to the creation of the “Makefile”
Same problem:
- log "Makefile"
- show contents Makefile -p "Some old patch"

The solution is to create an additional data structure, that contains some data that speeds up these operations. Then when one runs annotate "Makefile", Darcs quickly finds out which patches touch “Makefile” and ignore all other patches.

Creating and inspecting the patch index

Patch index is created when running log and annotate, and can be created explicitely by passing the flag --with-patch-index to the commands init, clone and convert. It can also be created explicitely by running darcs optimize enable-patch-index.

To disable patch index creation when using log or annotate, pass the flag --no-patch-index.

Every time patch index is created, the user can hit Ctrl-C, which will interrupt its creation and disable its automatic creation in the future (unless optimize enable-patch-index is explicitely run).

If patch index is enabled, it is automatically updated and used.

Patch index tries to be as invisible as possible to the user. To this end, creating/using/modifying patch index is almost invisible to the end user.

There is however darcs show patch-index which lets the user glean the state of patch index. It is primarily meant for debugging, and outputs whether patch index is not created/out of date/in sync; tests the patch index in the repository with some logical consistency properties; dumps the patch index; displays the files tracked by patch index (this has to be same as the files tracked by darcs).

Patch Index internal representation

Patch index involves two datatypes.

##PatchId

It is a hash on the metadata of a patch, and is preserved under commutation, see patch hash. It is the hash we see when running darcs log.

example:

29561b8cad0e1c33e44e7c135cb7753667a67856

##FileId

This is a unique identifier for a given file. It does not change if the file is moved or renamed. It stores the name with which the file is created, along with the count on the number of times files had this name.

example:

2#./tools/upload.cgi
(identifies the second file created with the FileName ./tools/upload.cgi)

Files and corresponding Data Structures

Path Index is stored in the directory _darcs/patch_index/, and is made of the following files:

`repo_state`

Type: (Int8,String)
Stores:
- the patch index on-disk format version
- the inventory hash
Use:
- the inventory hash is used to check that patch index is in sync with the repository
Note:
- two repositories with same patches but different ordering, have different inventory hashes.

example:

Version: 1
Inventory hash: a6e4db850b1174d8dd782c52068c6917af9b62b6e8e4d51f5955e9959359ea9a

`patch_ids`

Type: [PatchId]
Stores:
- the patch hashes of the repository patches(order is preserved)
Use:
- the patches hashes explicitly specify the ordering used in fp_map, fid_map.
Note:
- two repositories with same patches but different ordering, have different patch indexes.

example:

Patches: 29561b8cad0e1c33e44e7c135cb7753667a67856, adf7a83591a9d9217ae6a982ee76065798cb5105, f230974d0257048c330c29bd91307d2fe90dbc6e

`touch_map`

Type: Map FileId (Bool, Set Word32)
Stores:
- identifies the path to be a directory or a file
- the set of patches that modified a file. We only store the first Word32 of the hashes, which Makes this Map smaller but it can have false positives and we can’t remove patches from it (it works like a Bloom filter).
Use:
- Serves the main purpose of patch index. It stores the correlation between a file and the set of patches that modified it.
- Stores whether a FileId is a file or directory because log and annotate need different selection of patches based on whether a path is a directory or file.

example:

1#./src -> False
1#./src -> 819275be
1#./src -> d20e6633

1#./Map.hs -> True
1#./Map.hs -> 3c861fbb
1#./Map.hs -> 770ef6f3

`fid_map`

Type: Map FileName [(FileId, PatchId, Maybe PatchId)]
file name fn corresponds to file id fid from patch id pid1(inclusive) to patch id pid2(exclusive)
If a file has a span ending in Nothing (ie. pid2 is “-”), it means that the fn-fid mapping is still correct even after applying the last patch, in other words that it exists in pristine
the patches are ordered according to the order stored in repo_state
fid_map is used to identify the FileId with the given FileName, at any point in the repository state.

example:

./A -> 2#./A from pid1 to pid2
./A -> 1#./A from pid3 to pid1

(./A no longer exists in pristine) (recall each pidN is a patch hash like 29561b8cad0e1c33e44e7c135cb7753667a67856#)

`fp_map`

Type: Map FileId [(FileName, PatchId, Maybe PatchId)]
file id fid has file name fn from patch id pid1(inclusive) to patch id pid2(exclusive)
If a file has a span ending in Nothing (ie. pid2 is “-”), it means that the fn-fid mapping is still correct even after applying the last patch, in other words that it exists in pristine
the patches are ordered by the ordering in repo_state
fp_map is used to identify the FileName with the given FileId, at any point in the repository state.

example:

1#./A -> ./D/B from pid1 to -
1#./A -> ./B from pid2 to pid1
1#./A -> ./A from pid3 to pid2

(./A exists in pristine)

A complete example

Creating a repository

    mkdir R
    darcs init R
    cd R

Patch index:

   repo_state : (inventory_hash1, [])

   fid_spans : Map [ ]

   fp_spans: Map [ ]

   touch_map: Map [ ]

Adding directories and a file

    mkdir d e
    echo "f1" > d/f
    darcs record -lam 'p1'

Patch index:

   repo_state : (inventory_hash2, [p1])

   fid_spans : Map [ ./d   -> [(1#./d,   pid1, Nothing)],
                     ./d/f -> [(1#./d/f, pid1, Nothing)],
                     ./e   -> [(1#./e,   pid1, Nothing)] ]

   fp_spans: Map [ 1#./d   -> [(./d,   pid1, Nothing)],
                   1#./d/f -> [(./d/f, pid1, Nothing)],
                   1#./e   -> [(./e,   pid1, Nothing)] ]

   touch_map: Map [ 1#./d  -> (False, {pid1}),
                    1#./d/f -> (True,  {pid1}),
                    1#./e  -> (False, {pid1}) ]

Moving a directory

    darcs move d d2
    mkdir d
    echo "f2" > d/f
    darcs record -lam 'p2'

Patch index:

   repo_state : (inventory_hash3, [p1, p2])

   fid_spans : Map [ ./d   -> [(2#./d,   pid2, Nothing), (1#./d,   pid1, Just pid2)],
                     ./d/f -> [(2#./d/f, pid2, Nothing), (1#./d/f, pid1, Just pid2)],
                     ./e   -> [(1#./e,   pid1, Nothing)],
                     ./d2  -> [(1#./d,   pid2, Nothing)],
                     ./d2/f -> [(1#./d/f, pid2, Nothing)] ]

   fp_spans: Map [ 1#./d   -> [(./d2, pid2, Nothing), (./d,   pid1, Just pid2)],
                   2#./d   -> [(./d,  pid2, Nothing)],
                   1#./d/f -> [(./d2/f, pid2, Nothing), (./d/f, pid1, Just pid2)],
                   2#./d/f -> [(./d/f, pid2, Nothing)],
                   1#./e   -> [(./e,   pid1, Nothing)] ]

   touch_map: Map [ 1#./d  -> (False, {pid1}),
                    2#./d  -> (False, {pid2}),
                    1#./d/f -> (True,  {pid1, pid2}),
                    2#./d/f -> (True,  {pid2}),
                    1#./e  -> (False, {pid1}) ]

#How patch index identifies patches that modified a file

the user enters a command, say darcs log FILE
the file id of FILE, fid1 is identified using fid_map.
the patches corresponding to fid1 are identified using touch_map

Automatic update

Patch Index is kept in sync with the repository.

The synchronization of patch index is tested by computing the hash of inventory, and comparing it with the hash of inventory stored in repo_state. This test is run whenever patch index is used (by some command that modifies the repository). If the patch index is not in sync, patch index is updated.

Patch index is automatically updated whenever the repository state changes. Automatic update was tested to be working upon apply, record, push, pull, amend-record, obliterate, unrecord, tag.

Automatic Creation

Patch index is automatically created when running darcs log FILE or when running darcs annotate FILE.

If the user interrupts patch index building with Ctrl-C, the repository is marked as having its patch index disabled. The only way to build it then is manually, running darcs optimize enable-patch-index.

Algorithm for creation of patch index

The algorithm first looks through each patch in the repository, and associates every patch to “changes” made to files. The change could be Touch, CreateFile, CreateDir, Remove, Rename, Invalid, DuplicateTouch.

For example, a patch p with primitives “addfile f”, “hunk f” and “move g h” has changes CreateFile f, Touch f, Rename g h.

The conversion from a primitive-patch to a change is mostly a straightforward one-to-one mapping. However a directory move primitive-patch can have multiple Rename changes. A directory move will have renames for the directory and all files in that directory.

The end result of this pass will be a list of patch id’s, each patch id being associated with some changes.

The patch index is recursively build using this result. Let there be a patch index pi { repo_state, fid_map, fp_map, touch_map } build with all patches until a patch with id pid. pid has some changes associated with it:

Touch f: Let the file id of file f, be fid. touch_map will be updated, such that fid will have pid associated with it.
CreateFile f: New spans will be started on fid_map, and fp_map. pid will be associated with fid in touch_map. touch map will indicate that f is a file.
CreateDir f: New spans will be started on fid_map, and fp_map. pid will be associated with fid in touch_map. touch map will indicate that f is a directory.
Remove f: The span associated with fp in fid_map, and fid in fp_map will end. pid will be associated with fid in touch map.
Rename f1 f2: The old spans in fid_map and fp_map will end, and new spans will be started. pid will be associated with fid in touch map.
Invalid : is ignored
DuplicateTouch f: Is treated the same way as Touch

Algorithm for patch index update

The algorithm first finds the common prefix and the dissimilar suffixes between the patches in the repository and the patches tracked by patch index. The patch index of the common prefix is computed by undoing the dissimilar suffix of the patch index. The suffix of the repository is applied on the intermediate patch index, giving the updated patch index.

The undo works as follows: The spans in fp_map and fid_map with the start patch and the end patch both in the prefix will not be changed. The spans which have the start patch and the end patch in the suffix will be removed. The span with start patch in prefix and end patch in suffix will be modified to not have an end patch. In touch_map, the patches in the suffix will be removed. If the patches corresponding to a fid become empty, the fid entry is removed.

Performance

Patch index greatly improves the performance of commands darcs log and darcs annotate.

However it makes commands like record, amend, obliterate and pull a little slower, because of patch index synchronization.

The commands darcs optimize enable-patch-index and darcs optimize disable-patch-index can be used to check the performance changes.