Howdy!

Most of my work life is spent in monorepos. Oddly, we have multiple monorepos 🤣.

A lot of what we do is in TypeScript. For monorepos we use things like lerna and lage. Both these tools leverage yarn workspaces.

One gripe I have is that lerna and lage handle running commands differently. With lage, you don’t have to specify the whole package name: I can do lage build --to app instead of lage build --to @org/app. With lerna (at least how we have it configured), we need to specify the whole package name.

Now… this wouldn’t be as bad as you think, except we have scripts that wrap things, and we mix lerna and lage within the same monorepo. So we might have yarn build app, but for some reason, I might need to type yarn other-thing @org/app.

Don’t ask me why it’s using both lerna and lage… It exists that way, and I must live in it.

So, the good news is that even with lage, I can specify the whole package name… to get some consistency in life.

Why is all this important?

Well… more background… I’m a big fan of fzf.fish. I wrote a zsh-inspired version. But it didn’t have monorepo helpers for fuzzy matching. Implementing custom yarn completions for monorepo awareness is possible… but complex. So I started working on a monorepo.fish that leverages fzf.fish. I also keep the fzf.zsh up-to-date, but prefer fish (sadly, fish isn’t a first-class citizen/shell in devcontainers, so I have a few fish plugins to make it work well (nvm and ADO)).

The main monorepo functionality is just… package name completion 🤣. Originally, I was looking at implementing custom completions, but that seemed more complicated… So I ended up writing just some keybindings that wrap yarn info workspaces and cargo metadata to populate workspace info (I don’t do Rust at work, just for side projects, and I don’t even know if fuzzy finding package names is useful…).
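To give a flavor of the keybinding idea, here’s a minimal sketch. It is not the plugin’s actual code: the function name and the Alt-w binding are made up, and the jq filter assumes the command prints a single JSON object keyed by package name, which may not match your yarn version.

function _monorepo_insert_workspace --description "Fuzzy-pick a workspace name and insert it at the cursor"
    # assumption: the output is a JSON object keyed by package name
    set -l pkg (yarn --json info workspaces 2>/dev/null | jq -r 'keys[]' | fzf)
    if test -n "$pkg"
        commandline --insert $pkg
    end
    commandline -f repaint
end

bind \ew _monorepo_insert_workspace # made-up binding (Alt-w)

The problem, as you’ll see below, is how long that yarn call takes.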

Now the actual post…

Running yarn --json info workspaces is surprisingly expensive! One of the repos we work in has roughly 3,000 packages (they would tell you it has more, but it’s really multiple monorepos within… a monorepo, so ~3K).

Running time yarn --json info workspaces, I get:

________________________________________________________
Executed in    1.28 secs    fish           external
   usr time    0.97 secs    1.01 millis    0.97 secs
   sys time    1.29 secs    0.28 millis    1.29 secs

Then when piping it through jq to do all the processing…
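My actual jq filter does a bit more, but it’s in the spirit of this (same caveat as above: I’m assuming an object keyed by package name):

time yarn --json info workspaces | jq -r 'keys[]'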

________________________________________________________
Executed in    3.83 secs    fish           external
   usr time    2.03 secs  240.00 micros    2.03 secs
   sys time    0.91 secs  510.00 micros    0.91 secs

Almost 4 seconds! I refuse to wait that long every time I want to fuzzy find a package name 😱.

Dumb Caching Ideas

So, in order to not have to wait so long, I thought it best to come up with some caching (obviously 🙄).

If you look at the repo history, there’s some fun bad caching in there. Even the current hashing might be bad (who knows) 🤷‍♂️. I made some bad assumptions, and maybe made some worse ones 🍻.

PWD Modified Time

The first cache was using stat to get the last modified time of the current directory. It worked great… until… you wanted to build… pull… add a file… run tests… Uh… it was bad.

Did you also know Linux stat and BSD stat ARE DIFFERENT!?!? (don’t get me started on sed on macOS…)

(gif: panda smashing keyboard)

This was just something like:

stat -f %m .

or

stat --format=%Y .
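In fish, you can paper over the difference with a fallback. A small sketch (GNU form first on purpose: GNU stat also accepts -f, but there it means --file-system, so trying the BSD flags first would silently give you the wrong value on Linux):

# try GNU stat first; if that fails (macOS/BSD), fall back to the BSD flags
set -l mtime (stat --format=%Y . 2>/dev/null; or stat -f %m .)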

I was really excited about this idea, but then my cache got invalidated too frequently.

Next dumb idea…

Hashing package.json modified times

Now, the next thought, still hooked on modified times (did I mention Linux stat vs BSD stat???), was that we could hash the modified times of the individual package.json files. As long as we don’t frequently modify package.json files, this should work, right?

The script was something like:

# use git ls-files since we need something fast to find package.json files
git ls-files '*package.json' | xargs stat ... | sha256sum | awk '{print $1}' # Copilot suggests I switch to `cut`...
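On a Linux box, that would plausibly look something like this (just an illustration; I’m guessing at the exact stat flags based on the format strings above, and the real script had more to it):

git ls-files '*package.json' | xargs stat --format=%Y | sha256sum | awk '{print $1}'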

I thought… hey, maybe everything should mostly be reads, right? Right??? Right???

WRONG! For some reason, some of our build/test/install scripts/commands update the modified time on package.json files even if the contents do not change.

So… my cache kept getting invalidated (again)…

Next…

Just hashing the package.json list

Oddly, I never implemented this one, but it came to mind.
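It would have been something like this (a sketch of the idea, not code from the repo):

# hash only the list of package.json paths, not their contents
git ls-files '*package.json' | sha256sum | awk '{print $1}'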

The issue with this is: what happens if a package name changes? My cache would not be invalidated, since it’s just based on the list of package.json files in the repo.

I think for the most part it would have worked, but serving a stale cache felt misleading, and without a way to force-invalidate it, it would be harder for users to figure out what’s going wrong.

Hash all the package.json files and hash the hashes

This is the current solution. I hash all the package.json files, and then hash the hashes.

I honestly thought this would be slow (I actually wrote it a very slow way at first with xargs -n1 🐌 🤦‍♂️, but then saw that sha256sum supports multiple files at once).

time git ls-files '*package.json' | xargs sha256sum | sha256sum | awk '{print $1}'

________________________________________________________
Executed in  147.51 millis    fish           external
   usr time   86.47 millis    0.00 millis   86.47 millis
   sys time  120.17 millis    1.59 millis  118.58 millis
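For completeness, here’s roughly how the hash gets used as a cache key. This is a sketch, not the plugin’s actual code: the function name and cache location are made up, and the jq filter carries the same output-shape caveat as earlier.

function _monorepo_workspace_info
    set -l cache_dir ~/.cache/monorepo.fish # made-up location
    mkdir -p $cache_dir
    # key the cache on the hash-of-hashes of every package.json
    set -l hash (git ls-files '*package.json' | xargs sha256sum | sha256sum | awk '{print $1}')
    if not test -f $cache_dir/$hash
        # cache miss: pay the ~4 second cost once, then reuse until a package.json changes
        yarn --json info workspaces | jq -r 'keys[]' > $cache_dir/$hash
    end
    cat $cache_dir/$hash
end

(Stale cache files for old hashes would pile up with this sketch; pruning them is left out.)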

Conclusion(s)

I think this was a fun side project for some minor life improvements.

It was also eye-opening to see what was expensive vs. cheap.

Originally, I was leaning towards rg --files to get a quick list of package.json files, but git ls-files turned out to be faster (and also… just built in). I didn’t think it would be so quick.

Calculating hashes of 3K files was cheap 🤣.

Modified times aren’t great since files can “be modified” without content changes.

Everyone always says “measure, measure, measure”, but I feel I still tend to make assumptions about what’s worth measuring first.

I may change the hashing again in the future as I play around more and learn more.