I'm writing a simple parser of .git/* files. I've covered almost everything: objects, refs, pack files, etc. But I have a problem. Let's say I have a big 300 MB repository (in a pack file) and I want to find all the commits that changed the file /some/deep/inside/file. What I'm doing now is:
- fetching the last commit
- finding the file in it by:
  - fetching the commit's root tree
  - finding the subtree for the next path component
  - repeating recursively until I reach the file
- additionally, I check the hash of each subfolder on my way to the file; if one of them is the same as in the previous commit, I assume the file was not changed (because its parent dir didn't change)
- then I store the hash of the file and fetch the parent commit
- finding the file again and checking whether its hash changed
- if it did, then the original commit (i.e. the one before the parent) changed the file
And I repeat this over and over until I reach the very first commit.
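Roughly, the loop looks like the sketch below (Python pseudocode; read_object and the dict shapes it returns are just stand-ins for my own parser's object lookup, not a real library API, and the sketch only follows first parents and leaves out the subfolder-hash shortcut):

```python
# Minimal sketch of the walk described above. read_object(sha) is assumed to
# return already-parsed objects shaped like:
#   commit: {'tree': tree_sha, 'parents': [parent_sha, ...]}
#   tree:   {'entries': {name: child_sha, ...}}

def resolve_path(read_object, tree_sha, parts):
    """Descend through nested trees and return the file's blob sha, or None."""
    sha = tree_sha
    for part in parts:
        entries = read_object(sha)['entries']
        sha = entries.get(part)
        if sha is None:                     # path does not exist in this commit
            return None
    return sha

def commits_touching(read_object, head_sha, path):
    """First-parent walk: collect commits whose version of `path`
    differs from their parent's version."""
    parts = path.strip('/').split('/')
    touching = []
    child_sha, child_blob = None, None
    commit_sha = head_sha
    while commit_sha is not None:
        commit = read_object(commit_sha)
        blob = resolve_path(read_object, commit['tree'], parts)
        if child_sha is not None and blob != child_blob:
            touching.append(child_sha)      # the child commit changed the file
        child_sha, child_blob = commit_sha, blob
        parents = commit.get('parents', [])
        commit_sha = parents[0] if parents else None
    if child_blob is not None:              # the root commit introduced the file
        touching.append(child_sha)
    return touching
```

In the real code I also compare the subfolder hashes along the path first, as described above, so I can skip re-resolving the whole path when a parent directory's tree hash hasn't changed.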
This solution works, but it sucks. In the worst case, the first search can take as long as 3 minutes (for a 300 MB pack).
Is there any way to speed it up? I've tried to avoid loading such large objects into memory, but right now I don't see any other way. And even then, the initial load into memory would take forever :(
Greets and thanks for any help!