Monorepo Tools Learnings
Howdy!
Most of my work life is spent in monorepos. Oddly, we have multiple monorepos 🤣.
A lot of what we do is in TypeScript. For monorepos we use things like lerna and lage. Both these tools leverage yarn workspaces.
One gripe I have is that lerna and lage handle running commands differently. With lage, you don't have to specify the whole package name; I can do `lage build --to app` instead of `lage build --to @org/app`.
With Lerna (at least how we have it configured), we need to specify the whole package name.
Now… this wouldn't be as bad as you think, except we have scripts that wrap things, and we mix lerna and lage within the same monorepo. So we might have `yarn build app`, but for some reason, I might need to type `yarn other-thing @org/app`.
Don't ask me why it's using both lerna and lage… It exists that way, and I must live in it.
So, the good news is, even with lage, I can specify the whole package name… to get some consistency in life.
Why is all this important?
Well… more background… I'm a big fan of fzf.fish. I wrote a zsh version inspired by it, but it didn't have monorepo helpers for fuzzy matching. Implementing custom yarn completions for monorepo awareness is possible… but complex. So I started working on a monorepo.fish that leverages fzf.fish. I also keep the fzf.zsh up-to-date, but prefer fish (sadly, fish isn't a first-class citizen/shell in devcontainers, so I have a few fish plugins to make it work well (nvm and ADO)).
The main monorepo functionality is just… package name completion 🤣. Originally, I was looking at implementing custom completions, but that seemed more complicated… So I ended up writing just some keybindings that wrap `yarn info workspaces` and `cargo metadata` to populate workspace info (I don't do Rust at work, just for side projects, and I don't even know if fuzzy finding package names is useful there…).
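The keybinding side is tiny. Here's a minimal sketch of the idea, not the exact plugin code; the yarn invocation and output shape vary by yarn version, so treat the `jq` step as an assumption:

```fish
# Sketch: fuzzy-pick a workspace package name and insert it at the cursor.
# Assumes the yarn output is a JSON map keyed by package name; adjust the
# jq program to match what your yarn version actually prints.
function _fuzzy_workspace_insert
    set -l pkg (yarn --json info workspaces 2>/dev/null | jq -r 'keys[]' | fzf)
    and commandline -i $pkg
end

# Bind it to e.g. Ctrl+O
bind \co _fuzzy_workspace_insert
```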
Now the actual post…
Running `yarn --json info workspaces` is, oddly, crazy expensive! One of the repos we work in has roughly 3,000 packages (they would tell you it has more, but it's really multiple monorepos within… a monorepo, so ~3K).
Running `time yarn --json info workspaces`, I get:
```
________________________________________________________
Executed in    1.28 secs    fish           external
   usr time    0.97 secs    1.01 millis    0.97 secs
   sys time    1.29 secs    0.28 millis    1.29 secs
```
Then, when piping it through `jq` to do all the processing…
```
________________________________________________________
Executed in    3.83 secs      fish             external
   usr time    2.03 secs    240.00 micros    2.03 secs
   sys time    0.91 secs    510.00 micros    0.91 secs
```
Almost 4 seconds! I refuse to wait that long every time I want to fuzzy find a package name 😱.
Dumb Caching Ideas
So, to avoid waiting so long, I thought it best to come up with some caching (obviously 😄).
If you look at the repo history, there's some fun, bad caching in there. Even the current hashing might be bad (who knows) 🤷‍♂️. I made some bad assumptions, and maybe made some worse ones 👻.
PWD Modified Time
The first cache used `stat` to get the last modified time of the current directory. It worked great… until… you wanted to build… pull… add a file… run tests… Uh… it was bad.
Did you also know Linux `stat` and BSD `stat` ARE DIFFERENT!?!? (Don't get me started on `sed` on macOS…)
This was just something like:

```fish
stat -f %m .          # BSD stat (macOS)
```

or

```fish
stat --format=%Y .    # GNU stat (Linux)
```
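A tiny wrapper can hide the GNU/BSD split. A sketch (GNU stat is the one that understands `--version`, so that's a cheap way to detect it):

```fish
# mtime: print the last-modified epoch of a path on both Linux and macOS
function mtime
    if stat --version >/dev/null 2>&1
        stat --format=%Y $argv    # GNU coreutils stat (Linux)
    else
        stat -f %m $argv          # BSD stat (macOS)
    end
end
```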
I was really excited about this idea, but then my cache invalidated too frequently.
Next dumb idea…
Hashing package.json modified times
Now, the next thought, still hooked on modified times (did I mention Linux vs BSD `stat`???), was that we could hash the modified times of the individual package.json files. As long as we don't frequently modify package.json files, this should work, right?
The script was something like:
```fish
# use git ls-files since we need something fast to find package.json files
git ls-files '*package.json' | xargs stat ... | sha256sum | awk '{print $1}' # Copilot suggests I switch to `cut`...
```
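For illustration, a filled-in (hypothetical) version of that pipeline might look like this (GNU `stat` shown; BSD would need `stat -f %m`):

```fish
# hash the mtime of every package.json, then reduce to a single hash
git ls-files '*package.json' | xargs stat --format=%Y | sha256sum | awk '{print $1}'
```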
I thought… hey, maybe everything should mostly be reads, right? Right??? Right???
WRONG! For some reason, some of our build/test/install scripts/commands update the modified time on package.json files even when the contents do not change.
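It's easy to reproduce the effect: `touch` bumps the mtime without changing a byte, so an mtime-based hash changes while a content hash does not:

```fish
stat --format=%Y package.json    # GNU stat; BSD is `stat -f %m`
sha256sum package.json
touch package.json               # bumps mtime, content untouched
stat --format=%Y package.json    # different
sha256sum package.json           # same
```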
So… my cache kept getting invalidated (again)…
Next…
Just hashing the package.json list
Oddly, I didn't implement this one, but it came to mind.
The issue: what happens if a package name changes? The cache would not be invalidated, since it's based only on the list of package.json files in the repo.
I think it would have worked for the most part, but serving a possibly stale cache felt misleading, and without a way to force-invalidate it, users would have a hard time figuring out what's going wrong.
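For reference, the idea was just a one-liner like this, hashing paths rather than contents:

```fish
# hash only the list of package.json paths; renaming a package inside
# an existing package.json would NOT invalidate this cache
git ls-files '*package.json' | sha256sum | awk '{print $1}'
```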
Hash all the package.json files and hash the hashes
This is the current solution. I hash all the package.json files, and then hash the hashes.
I honestly thought this would be slow (I actually wrote it a very slow way at first (`-n1` 😅) 🤦‍♂️, but then saw `sha256sum` supports multiple files at once).
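The difference is just how many files each `sha256sum` invocation receives:

```fish
# slow: -n1 spawns one sha256sum process per package.json
git ls-files '*package.json' | xargs -n1 sha256sum

# fast: xargs batches as many paths as fit into each invocation
git ls-files '*package.json' | xargs sha256sum
```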
```fish
time git ls-files '*package.json' | xargs sha256sum | sha256sum | awk '{print $1}'
```
```
________________________________________________________
Executed in  147.51 millis    fish           external
   usr time   86.47 millis    0.00 millis    86.47 millis
   sys time  120.17 millis    1.59 millis   118.58 millis
```
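Putting it together, the shape of the current approach is roughly this. It's a sketch: the cache path is hypothetical, and the yarn/jq step assumes a JSON map keyed by package name (which varies by yarn version), so the real monorepo.fish differs in the details:

```fish
# Print workspace package names, cached by a content hash of all package.json files.
# Run from the repo root so git ls-files sees every package.json.
function _workspace_names
    set -l cache_dir ~/.cache/monorepo
    mkdir -p $cache_dir
    # the cheap part (~150ms on ~3K packages): hash the files, then hash the hashes
    set -l hash (git ls-files '*package.json' | xargs sha256sum | sha256sum | awk '{print $1}')
    set -l cache_file $cache_dir/$hash
    if not test -f $cache_file
        # cache miss: pay the ~4s yarn + jq cost once per real content change
        yarn --json info workspaces 2>/dev/null | jq -r 'keys[]' >$cache_file
    end
    cat $cache_file
end
```

Keying the file name by the hash means any content change naturally misses the old entry; stale entries just accumulate until the directory is cleaned.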
Conclusion(s)
I think this was a fun side project for some minor life improvements.
It was also eye-opening to see what was expensive vs. cheap.
Originally, I was leaning towards `rg --files` to get a quick list of package.json files, but `git ls-files` turned out to be faster (and also… just built in). I didn't think it would be so quick.
Calculating hashes of ~3K files was cheap 🤣.
Modified times aren't great, since files can "be modified" without content changes.
Everyone always says "measure, measure, measure", but I feel I still tend to make assumptions about what's worth measuring first.
I may change the hashing again in the future as I play around more and learn more.