Patch Index in Darcs (formerly known as file cache)
Why do we need a patch index?
The following is simplified for presentation!
PrimitivePatch = Move FilePath FilePath | Hunk Line String String | AddFile FilePath | RmFile FilePath | .. – e.g. Move oldName newName, Hunk l old new, …
Patch = Patch PatchInfo [PrimitivePatch]
In Darcs, the repository is (roughly) of type [Patch]. This view is required for merging, conflict handling, cherry-picking, … .
Some operations are OK using this representation:
*Alice*: [p1,..,pn] <-- pull --> *Bob*: [p1,..,pn,pn+1,..,pn+k]
*Alice*: [p1,..,pn] unpull pk
Others are very slow (mostly query operations):
annotate "Makefile"=> has to apply all patches (backwards from the current state) back to the creation of the “Makefile”Same problem:
changes "Makefile"show contents Makefile -p "Some old patch"
How to solve this?
Keep [Patch] representation, extract some data that speeds up these operations
annotate "Makefile"- find out which patches touch “Makefile” => ignore all other patches
Need function: 'touchingPatches :: DatedFilePath -> (Set PatchId)'
Have to handle renames, directory moves:
p01 Add A/Foo {A/Foo:"",..}
p02 Hunk A/Foo .. {A/Foo:"some text",..}
p03 MV A B {B/Foo:"some text",..}
...
p10 Add A/Foo {B/Foo:"some text",A/Foo:"",..}
p11 RM A/Foo {B/Foo:"some text",..}
p12 MV B/Foo A/Foo {A/Foo:"some text",..}
Unique (local) identifier for each file. Identifier of file should be constant, name and content can change.
We use: creation name and counter as unique identifier for a file:
FileId = FileId {creationName :: ByteString, counter :: Int}
For example, there are 2 files in the repo above A/Foo#1 and A/Foo#2.
Can refer to “old filenames” that are interpreted relative to an old revision of the repository.
FileRelativeRevision = FileRelativeRevision {filePath :: FilePath, revision::PatchId}
A span starting with pstart to (maybe) pend where the file identified by fid had a certain name.
FileIdSpan = FileIdSpan {fid::FileID, pstart::PatchId, pend::Maybe PatchId}
The map that returns all spans for a certain name, we can use it to look up which FileId the name refers to at a certain revision.
SpanMap = Map FilePath [FileIdSpan]
The we just store all patches that touch a file.
PatchesMap = Map FileId (Set PatchId)
CreationNameMap = Map FileId (FilePath)
can use SpanMap to lookup FileId for FileRelativeRevision then use PatchesMap
Patch Index Structure
PatchId
- It is a unique identifier for a patch.
- Look at Patch Hash on this page http://wiki.darcs.net/Internals/Hashes.
- The name consists of three segments:
- timestamp (ISO8601-compatible yyyymmmddHHMMSS, UTC)
- SHA1 hash of the author
- SHA1 hash of the patch name, author, date, log, and "inverted" flag.
- Patch Index makes an ordering of the patches based on order of application on the repo.
- if patch p1 depends on patch p2 then p1 > p2
- if p1 < p2 then p1 cannot depend on p2
FileId
- The filename is insufficient to track a file as it can change
- FileId uniquely identifies each file with it’s creation name, and a counter
- Patch-index carefully tracks the FileName of each FileId, and FileId of each FileName.
Map FileId (Set PatchId)
- Used to ascertain the patches that modified a file.
Map FileName [(FileId, PatchId, PatchId)]
- Let (fn, (fi, ps, px)) be an entry
- Applying all the patches < than ps in the repo, you will have fn /= fi
- Applying all the patches ≥ ps and ≤ to px in the repo, you will have fn == fi
Results
You can try out changes using patch-index with this http://bugs.darcs.net/file3197/sync.dpatch. The command is darcs show patch-index-fileid ./Path_to_file_wrt_repo. Patch-index changes seems to be 10x faster that the existing changes.
Old Results
Seems to make annotate bearable for large repos (GHC-sized).
Best case for patch-index: README in the ghc repo - touched by 29 files of 21345
$ time darcs changes --patch-index README > /dev/null
real 0m0.669s
user 0m0.581s
sys 0m0.083s
without “–patch-index”: ran out of patience after 30 minutes
annotate still takes about 15 seconds with “–patch-index” => bottleneck is now reading inventories
