Instead of using git as a database, what if you used database as a git?
-
@andrewnez does forgejo really fork out to run the git binary? Why would it not just use libgit 😨
-
@andrewnez
This reminds me of https://m.youtube.com/watch?v=wN6IwNriwHc
Although this (unlike that video) has practical use.
-
@equinox https://github.com/go-gitea/gitea/issues/5142 i think it's to avoid CGo
-
@andrewnez I guess it's an unverified assumption but it feels like that's wasting a whole bunch of CPU cycles for no reason 😕
-
I knew Homebrew was bad, but thanks for explaining some other reasons I hadn't paid attention to such as:
"homebrew-core has one Ruby file per package formula, and every brew update used to clone or fetch the whole repository until it got large enough that GitHub explicitly asked them to stop."
I hadn't realized it was that bad! I'll add that to the list of reasons to continue avoiding Homebrew, as if the spyware by default and the founder turning into a cryptocoin grifter weren't bad enough already.
"Git packfiles use delta compression, storing only the diff when a 10MB file changes by one line, while the objects table stores each version in full. A file modified 100 times takes about 1GB in Postgres versus maybe 50MB in a packfile."
A 20X overhead seems kind of horrifying to me.
No, thank you.
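For what it's worth, the arithmetic in that quote is easy to reproduce. A rough sketch in Python, using `difflib` unified diffs plus zlib as a stand-in for git's actual delta encoding (git uses its own binary delta format inside packfiles, so the exact numbers differ, but the shape of the result is the same):

```python
import difflib
import zlib

# Build 100 versions of a "file": each version changes one line.
doc = [f"line {i}: some payload text\n" for i in range(1000)]
versions = []
for v in range(100):
    doc = list(doc)
    doc[v % len(doc)] = f"line {v}: edited in version {v}\n"
    versions.append("".join(doc))

# Full storage: compress each version independently (objects-table style).
full = sum(len(zlib.compress(v.encode())) for v in versions)

# Delta storage: first version in full, then one diff per change (packfile style).
delta = len(zlib.compress(versions[0].encode()))
for prev, cur in zip(versions, versions[1:]):
    diff = "".join(difflib.unified_diff(prev.splitlines(True),
                                        cur.splitlines(True)))
    delta += len(zlib.compress(diff.encode()))

print(f"full copies: {full} bytes, deltas: {delta} bytes")
```

The delta chain comes out a small fraction of the size of the independent copies, which is the whole point of the comparison in the post.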
"storing three full uncompressed copies of every repository across data centres because redundancy and operational simplicity beat storage efficiency even at hundreds of exabytes."
I can't even begin to wrap my head around "beat storage efficiency" at "hundreds of exabytes", but then, I have been witness to corporate largess on scales that defy rational explanation. Some companies can afford to light stacks of money on fire, apparently, but I don't think taking inspiration from Heath Ledger's portrayal of The Joker in 2008's The Dark Knight should be a guiding light for any sane sorts.
IMHO, SQL is an anti-pattern that Steve Jobs wasn't smart enough to avoid when he bundled Sybase with NeXT, and we've been suffering from that oversight ever since. SQL should have died, or at least stayed with IBM. There are so many better database paradigms in existence which are not SQL.
Anyway, I dislike Git and I dislike SQL, and you have somehow managed to create what is, to me, like the opposite of the Reese's peanut butter cup commercials of the 20th century.
But y'know, you're probably not entirely off the mark? Fossil-scm is a DVCS (with limited Git interoperability) and issue tracking system and wiki and such, which is presumably by virtue of being developed by the author of SQLite, also wrapped around SQLite.
The concluding sentence, "there's no filesystem of bare repos to manage alongside the database," hearkens back to an old Slashdot "Rob Pike Responds" Q&A about databases and filesystems: https://interviews.slashdot.org/story/04/10/18/1153211/rob-pike-responds — but it seems to ignore the reality: databases exist on filesystems, always have, and presumably always will. So figuring out how to not overly abstract that and get down to brass tacks is vital.
I suppose, since elsewhere you write about S3, maybe you're too lost in the clouds and too far removed from bare metal and hardware implementations? That's not a good thing. Pretty much the opposite of good.
-
@andrewnez actually I would argue storage efficiency is important for making it accessible to self-host.
If you don't have a bunch of disposable income to put towards a homelab, storage really isn't very cheap. Especially now that prices of everything have doubled or tripled due to AI datacenter demand.
Renting a VPS with more than 20-50GB of storage gets expensive very fast. Too expensive for many people.
Using an old PC for hosting can sometimes be an option, but that will have reliability issues, and you probably won't have more than about 1TB of space available then anyway. It's also dependent on your ISP giving you public IPs.
-
@andrewnez @stsp wasn't this done a few years back? I think maybe by AWS? They also leaned on libgit2 to interpose a SQL storage layer into git operations. They ran across the same storage size trade-offs of lacking delta-pack optimisations, but if I recall correctly their big lesson learned was that standard CLI git operations slowed down by a few thousand times because of an unexpectedly high number of round trips to the database.
Is this ringing any bells? I'll see if I can find it.
-
@lunareclipse @andrewnez I'd be interested to know if anybody's explored how effective it is to run Postgres on a deduplicating filesystem like ZFS or btrfs.
Hopefully, it might be that this is a non-issue in practice without having to build deduplication or delta compression into Postgres directly.
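Block-level dedup only helps if identical data lands on identical block boundaries, so the answer probably hinges on matching the filesystem's record size to Postgres's fixed 8 kB pages. A toy Python check of that intuition (this models dedup by hashing fixed-size chunks; it is an illustration of the alignment question, not a measurement of ZFS itself):

```python
import hashlib

BLOCK = 8192  # filesystem recordsize matched to Postgres's 8 kB page size

def block_hashes(data: bytes) -> set:
    """Hash each fixed-size block, the way block-level dedup would."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

def page(i: int, tag: str = "") -> bytes:
    """A fake 8 kB database page with distinct content per page."""
    return f"page {i} {tag}".encode().ljust(BLOCK, b".")

v1 = b"".join(page(i) for i in range(100))                # original table file
v2 = b"".join(page(i, "edited" if i == 50 else "")        # one page rewritten
              for i in range(100))

shared = block_hashes(v1) & block_hashes(v2)
print(f"{len(shared)} of {len(block_hashes(v2))} blocks dedupable")
```

Because Postgres rewrites pages in place at fixed offsets, the unchanged pages stay block-aligned and dedupable here; whether that holds up against WAL traffic, VACUUM, and TOAST in a real cluster is exactly the empirical question being asked.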
-
@andrewnez @stsp ah, here we go:
And
https://github.com/libgit2/libgit2-backends/pull/4#issuecomment-36115322
2013 and 2014 respectively. (And upon re-reading it I see that libgit2 has several SQL pluggable back-ends, so now I'm sure you've already seen this. Sorry for pointing out the obvious.)
-
@gnomon @andrewnez We have the same latency issue in game of trees where objects and pack files are always parsed in a sub-process (via fork+exec and passing data back to the main process via a file descriptor or a small buffer copied across Unix pipes). To obtain reasonable performance we try to perform object graph traversal operations in batches inside the pack file reader if possible. Still slower than Git but the good news is that Git is so incredibly fast that programs running 10 or even 100 times slower can still be perfectly usable.
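The batching trade-off described here is easy to see in a toy model: if every round trip to the pack-file reader carries a fixed cost (context switch plus pipe write/read), the win comes purely from sending fewer, larger requests. A Python sketch (the class and batch size are mine, purely illustrative — Game of Trees's actual protocol is the fd/pipe mechanism described above):

```python
class PackReader:
    """Stand-in for a pack-file reader reached over a pipe:
    every fetch() is one round trip, however many objects it carries."""

    def __init__(self, objects):
        self.objects = objects
        self.round_trips = 0

    def fetch(self, ids):
        self.round_trips += 1
        return [self.objects[i] for i in ids]

objects = {i: f"commit {i}" for i in range(1000)}

# Naive traversal: one round trip per object.
naive = PackReader(objects)
for i in objects:
    naive.fetch([i])

# Batched traversal: walk the object graph in chunks of 64.
batched = PackReader(objects)
ids = list(objects)
for k in range(0, len(ids), 64):
    batched.fetch(ids[k:k + 64])

print(naive.round_trips, "round trips vs", batched.round_trips)
```

With a per-trip cost dominating per-object cost, cutting 1000 round trips to 16 is most of the available speedup, which matches the "traverse in batches inside the reader" strategy.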
-
@andrewnez internally we use PG to version data. We have commits, branches, diffs, merges, cherry-picks, ... Internally, data is kept in tables with branch_id, commit_id_from, commit_id_to, data_pk, with a range-based exclusion constraint. Branching is expensive, as it needs to copy the whole data set (~150 tables); diffing has some hints so it reads only the touched tables rather than all of them.
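For anyone unfamiliar with that commit-range pattern, a minimal sketch of how reads-at-a-commit work with such a schema. The table and column contents here are my guess at the shape described above, using Python's built-in sqlite3 as a stand-in for Postgres (sqlite can't express the range exclusion constraint that would enforce non-overlapping validity ranges):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE items (
        branch_id      INTEGER,
        commit_id_from INTEGER,   -- row visible starting at this commit...
        commit_id_to   INTEGER,   -- ...until (not including) this commit; NULL = still live
        data_pk        INTEGER,
        value          TEXT
    )""")

# commit 1: insert pk=1; commit 3: update pk=1 by closing the old
# row's range and opening a new one.
con.execute("INSERT INTO items VALUES (1, 1, NULL, 1, 'v1')")
con.execute("""UPDATE items SET commit_id_to = 3
               WHERE data_pk = 1 AND commit_id_to IS NULL""")
con.execute("INSERT INTO items VALUES (1, 3, NULL, 1, 'v2')")

def read_at(commit):
    """Snapshot of pk=1 on branch 1 as of a given commit."""
    row = con.execute("""
        SELECT value FROM items
        WHERE branch_id = 1 AND data_pk = 1
          AND commit_id_from <= ?
          AND (commit_id_to IS NULL OR commit_id_to > ?)""",
        (commit, commit)).fetchone()
    return row[0]

print(read_at(2), read_at(3))
```

In Postgres the `(commit_id_from, commit_id_to)` pair would typically be an `int4range` with an `EXCLUDE USING gist` constraint per `(branch_id, data_pk)`, so two versions of the same row can never claim overlapping commit ranges.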
-
hongminhee@hollo.social shared this topic.