GSoC/2009-Hashed
This is a timeline for Petr Rockai’s hashed-storage GSoC project.
Overview
Darcs supports ‘hashed’ repositories in which each file in the pristine cache is associated with a cryptographic hash. Hashed repositories help Darcs to resist some forms of corruption and also allow for nice features such as a global patch cache and lazy patch fetching. Unfortunately, our implementation can be rather slow.
Petr has some nice ideas for making these repositories work a lot faster. He will be producing a library he calls ‘hashed-storage’, which generalises the idea of storing files keyed by a cryptographic hash of their content and improves on the current implementation used by Darcs. The hashed-storage library is general purpose and may find a use in other applications that need to manage a large number of files.
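To make the idea of hash-keyed storage concrete, here is a minimal sketch in Python. It is not the hashed-storage API (that library is written in Haskell). The function names `store_blob` and `load_blob`, the flat directory layout, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os


def store_blob(store_dir: str, data: bytes) -> str:
    """Write data under a name derived from its SHA-256 hash; return that name."""
    name = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, name)
    if not os.path.exists(path):  # identical content is stored only once
        with open(path, "wb") as handle:
            handle.write(data)
    return name


def load_blob(store_dir: str, name: str) -> bytes:
    """Read a blob back and check that its content still matches its name."""
    with open(os.path.join(store_dir, name), "rb") as handle:
        data = handle.read()
    if hashlib.sha256(data).hexdigest() != name:
        raise ValueError("corrupt blob: " + name)
    return data
```

Because the file name is the hash of the content, corruption is detectable on read, and two repositories holding the same content can share a single copy, which is what makes a global cache possible.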
For more information, see http://mornfall.net/blog/summer_of_code.html
This project was completed and passed. :-)
Timeline
Note that Petr started a week earlier than the official dates, so everything below is shifted one week earlier.
(ending 25 May)
- progress: fill in quick blurb
(1 Jun)
- progress: fill in quick blurb
(8 Jun)
- Darcs format mechanism extended to deal with hashed-storage versions? [do we need this?]
- Future-proofing strategy elucidated - Petr has already written an email about this, so maybe just links
(15 Jun) - format documentation published (at least for index)
- TODO hashed-storage:
- without bytestring-mmap dependency
- the Diff module made optional
- TODO darcs-hs:
- darcs wh filename spends 80% of time in announce_files, since it slurps pristine to check if the files given are in the repository (this can be avoided using hashed-storage)
- context: Darcs 2.3 freeze
(22 Jun) - rough benchmarking infrastructure in place (status?)
- done: API stabilising for darcs 2.3
- thought about: packing, and requirements for packed repository
- done: post darcs 2.3; all pristine -> working diffing using hashed-storage
(29 Jun)
- hashed-storage 0.3.4
- done: endianness
- done: magic word in index (index upgrade)
- Arbitrary Tree (QuickCheck)
(6 Jul)
(13 Jul)
- context: Darcs 2.3 release!
- context: Mid-term evaluations deadline
(20 Jul)
(27 Jul) - future work roadmap complete
(3 Aug)
(10 Aug)
(17 Aug) - nothing [ends at 12]
- context: GSoC pencils down deadline
Bigger picture neighbourhood
- Hacking sprint: 2009-09
- Darcs 2.4: 2010-01
- Darcs 2.5: 2010-07
- Summer of Code 2010
Deliverables
hashed-storage
documentation on formats used by hashed-storage (e.g. camp repo format)
API docs
unit tests
benchmarking infrastructure
- ./go.sh script or similar
- ./publish-benchmarks.sh or similar
published benchmarks
Darcs 2.3 integration!
- extension to Darcs format mechanism (numerical versions?)
Darcs 2.4 integration
future work roadmap and hints (e.g. ideas for how packs might work)
Design goals
Future proofing
- design in such a way that lets old hashed-storage read (and write to?) stores created by new hashed-storage
- versioning mechanism - hopefully one that we can avoid using
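One common way to approach the future-proofing points above is a short header holding a magic word and a format version, which a reader checks before touching the rest of the file. The sketch below is an assumption-laden illustration, not the actual hashed-storage index layout: the magic bytes `HSI1`, the 16-bit version field, and the names `write_header` and `check_header` are all hypothetical.

```python
import struct

MAGIC = b"HSI1"                 # hypothetical magic word, not the real index signature
HEADER = struct.Struct("<4sH")  # little-endian: 4-byte magic, unsigned 16-bit version
SUPPORTED_VERSION = 2           # newest format this (imaginary) reader understands


def write_header(version: int = SUPPORTED_VERSION) -> bytes:
    """Serialise the header for a store of the given format version."""
    return HEADER.pack(MAGIC, version)


def check_header(raw: bytes) -> str:
    """Classify a header as 'ok', 'upgrade', 'too-new' or 'not-a-store'."""
    if len(raw) < HEADER.size:
        return "not-a-store"
    magic, version = HEADER.unpack_from(raw)
    if magic != MAGIC:
        return "not-a-store"
    if version > SUPPORTED_VERSION:
        return "too-new"  # written by a newer release; refuse rather than guess
    return "ok" if version == SUPPORTED_VERSION else "upgrade"
```

A reader that distinguishes "too-new" from "not-a-store" can give a useful error message, and the "upgrade" case is where something like an index upgrade would hook in.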
Facilitation of future work
- how would hashed-storage development work if you had contributors?
- is there anything you could spin off, e.g. into undergrad student projects?
Portability
- endianness issues?
- can the same store be read from and written to by different kinds of systems?
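The usual answer to the endianness question is to fix the byte order in the on-disk format rather than use whatever the host machine prefers. A minimal sketch follows, assuming a hypothetical index entry made of two signed 64-bit fields (`size` and `mtime`). The field choice and function names are illustrative only.

```python
import struct

# "<" forces little-endian with standard sizes, regardless of the host CPU.
ENTRY = struct.Struct("<qq")


def encode_entry(size: int, mtime: int) -> bytes:
    """Serialise one entry in a fixed, host-independent byte order."""
    return ENTRY.pack(size, mtime)


def decode_entry(raw: bytes) -> tuple:
    """Inverse of encode_entry; returns (size, mtime)."""
    return ENTRY.unpack(raw)
```

With a native-order format (`"@qq"` or `"=qq"` in `struct` terms), a store written on a big-endian machine would be misread on a little-endian one. Pinning the order costs a byte swap on some hosts but keeps the store portable.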
Safety
- atomicity of operations
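For atomicity, a widely used technique is to write to a temporary file in the same directory and then rename it over the target, so readers see either the old file or the new one, never a half-written one. The sketch below illustrates that general technique and does not claim to be how hashed-storage does it. `atomic_write` is a hypothetical name.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data so that no partial file is ever visible."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)     # atomic swap on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)        # do not leave stray temp files behind
        raise
```

In a store where files are named by their hash, this matters less for the blobs themselves (a given name always has the same content) and more for mutable files such as an index or a pointer to the current root.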
Robustness
- tolerant IO in working directories?
Good interactions with other systems
- patch application
- network code
- cache mechanism
Technical details
- index file (timestamps)
- can we combine small files into one? (see the offset mechanism in the camp repo format, http://projects.haskell.org/camp/repository)
- avoiding creating large directories
- could hashed-storage easily implement other formats, e.g. camp? e.g. git [relevant?]
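The "index file (timestamps)" item above can be read as a cache that remembers each file's hash together with its size and modification time, so that unchanged files need not be re-read. The following is a sketch under that assumption. The in-memory `index` dictionary and the name `file_hash` are hypothetical and say nothing about the real on-disk index.

```python
import hashlib
import os


def file_hash(path: str, index: dict) -> str:
    """Return the SHA-256 of path, reusing the cached value if size and mtime match."""
    info = os.stat(path)
    stamp = (info.st_size, info.st_mtime_ns)
    cached = index.get(path)
    if cached is not None and cached[0] == stamp:
        return cached[1]  # file looks unchanged: skip reading and hashing it
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    index[path] = (stamp, digest)
    return digest
```

The trade-off is the usual one for timestamp caches: a file modified without changing its size or mtime would be missed, which is why tools of this kind often treat very recent timestamps as untrustworthy.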
Worries and questions for the roadmap
Note: for general design questions, see the hashed-storage page.
Darcs 2.4 integration sounds tricky - hard to do in an incremental fashion.
We already do this incrementally (see darcs 2.3 integration patches). – Petr
Do we implement HashedStorageIO akin to DarcsIO and HashedIO?
See Storage.Hashed.Monad – I think that may cover what you mean? – Petr
What is the minimum set of features hashed-storage needs to work for darcs?
Why does hashed-storage need a diff mechanism?
It’s proof of concept, really. TODO make it optional. – Petr