Node JS, PM2, and running apps in cluster mode - FIXED!

We’ve been receiving a growing number of reports about cluster mode not running properly on servers with multiple CPUs/cores. The symptoms are high CPU usage and constant restarts, which you can see in the NodeJS heartbeat in the services section.

This mainly impacts Strapi and Nuxt apps running on Node v16, along with any other app types that start via the npm start command rather than pointing to a specific file (i.e., server.js, main.js, etc.).

If you are using Node v16 and a server with 1 CPU, then you shouldn’t experience this issue.

Workarounds

While we look further into the issue and how we might resolve it, there are a couple of workarounds you can try.

  1. From the services section, uninstall Node 16 and install Node 14; then you’ll need to click the ‘repair’ button on the ellipsis menu after Node 14 installs. This resets the apps and PM2 so they run correctly.

  2. If you need to keep Node 16, you can update your web app to point directly to the start file. However, this solution won’t work with the GHA integration at the moment. In web app > settings > build:
    2.1 If you have anything listed in the ‘Artifact Path’ box, remove it and deploy the app before moving to the next step (this will stop the site from rendering, so please only do this during maintenance downtime)
    2.2 Replace Entry File from npm with ./node_modules/nuxt/bin/nuxt.js (note: if using Strapi, replace nuxt with strapi)


We’ve resolved the PM2 issues around cluster-mode for Node v16. Here are the release notes.


As we’ve been continuing our efforts to harden Cleavr, some reports started to come in about Strapi and Nuxt apps not working properly when utilizing cluster-mode on Node v16. We’ve tackled that issue and added some new :candy: along the way!

Strapi

  • Split Strapi out to Strapi 3 and Strapi 4 app-types to provide improved support by version
  • Resolved cluster-mode issue on Node v16
  • Resolved persistent file storage issue during deployments

For existing Strapi apps

Unfortunately, we were not able to make the cluster-mode improvements backwards compatible for existing apps. But…! You can make some quick updates to take advantage of the new improvements.

Strapi v3 apps

Add the following file and its contents to the root of your project and push to your code repo:

.cleavr.runner.js

Add the following contents to the file:

const strapi = require("strapi");
strapi().start();

After pushing the above file to your code repo, see PM2 Ecosystem Updates section below to complete setup.

Strapi v4 apps

Add the following file and its contents to the root of your project and push to your code repo:

.cleavr.runner.js

Add the following contents to the file:

const strapi = require("@strapi/strapi");
strapi().start();

After pushing the above file to your code repo, see PM2 Ecosystem Updates section below to complete setup.

Nuxt

  • Split Nuxt SSR out to Nuxt SSR 2 and Nuxt SSR 3 app-types
  • Nitro server-engine support for Nuxt SSR 3 :zap::zap::zap:
  • Resolved cluster-mode issue on Node v16

For existing Nuxt apps

See PM2 Ecosystem Updates section below if you’d like to take advantage of new updates for cluster-mode with Node v16.

Directus

  • 1-click install! :classical_building:
  • Resolved cluster-mode issue on Node v16
  • Resolved persistent file storage issue during deployments

Now when you add a new Directus site, you can enter admin login credentials and install the Directus bootstrap with just 1-click! Need to deploy your Directus site from your code repository? No problem! A web app is still created when adding a new Directus site, so you can deploy from your code repo just as easily.

For existing Directus apps

See PM2 Ecosystem Updates section below if you’d like to take advantage of new updates for cluster-mode with Node v16.

PM2 Ecosystem Updates

Navigate to your web app > settings > build tab, then set script to ".cleavr.runner.js" and args to "". The exception is Nuxt SSR 2 apps, where args must be set to start.

Also, make sure instances is set to max and exec_mode is set to cluster_mode.

After making the changes, deploy your project for the changes to take effect.
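
For reference, here is a rough sketch of how the updated settings might look once entered in the PM2 ecosystem. The name value is a placeholder, and your generated config will likely contain other keys (cwd, env, and so on) that should be left as they are; only script, args, instances, and exec_mode come from the steps above.

```javascript
// Hypothetical sketch of a PM2 ecosystem entry after the update.
// "my-app" is a placeholder; keep whatever name your config already has.
const ecosystem = {
  name: "my-app",
  script: ".cleavr.runner.js", // the runner file added to the project root
  args: "",                    // set to "start" for Nuxt SSR 2 apps
  instances: "max",            // one worker per available CPU core
  exec_mode: "cluster_mode",
};

module.exports = ecosystem;
```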

Note for monorepos

If you have a monorepo setup, we recommend you do not make these changes quite yet. We’ll be working on a solution to better handle monorepos.

Read more about the new updates in our Cleavr Slice blog.

Oh, no!

I’m running Quasar and I now also get

had too many unstable restarts (16). Stopped. "errored"

Please advise, my customer is waiting

Hello @peterc,

I think there is another topic that is related to the issue that you’re facing. Please refer to this thread

Do let us know whether that works for you or not.


I was going to call my customer in 5 minutes, you saved me from a lot of trouble @anish

Thank you so much!


Hello, I’m running Node v16 with Strapi v4 on a 2-CPU Vultr server and added the .cleavr.runner.js file at the root. Then I deployed the app, but the server still shows 100% CPU utilization.
I’m running a trial to test Cleavr. Any idea how to fix this?
Thanks

Hello @BureauBerg,

First of all, welcome to the Cleavr forum.

We’ll look into the issue, but in the meantime, can you please check whether the site is throwing any errors, such as a 502 error? You can also check the PM2 log from the deployments page by clicking the Load PM2 Logs button, or by going to Server > Logs. We’ve also noticed that CPU utilization can reach its maximum when there are certain errors at the app level.

You can also view NodeJS logs from the services section and resolve issues if there are any.

You can follow these links to troubleshoot 502 errors for NodeJS based applications:

Hi Anish, thanks for your reply.

It indeed throws a 502 error so I checked the PM2 log which shows:
PM2 log: App name: xyz disconnected
PM2 log: App [xyz] exited with code [1] via signal [SIGINT]
PM2 log: App [xyz] starting in -cluster mode-
PM2 log: App [xyz] online

and the Nginx log is as follows:
The “connect() failed (111: Connection refused) while connecting to upstream, client”

I’m a newbie, but it looks to me like Nginx doesn’t have permission to access the app. Could some folder settings not be correct yet?
I created a new system user and assigned the app to that user when setting it up in Cleavr, and I thought these permission settings would be added automatically.
Could anyone shed light on this?

Thanks a lot!
Jacco

Hello @BureauBerg,

There are two possible causes for your issue:

It looks like you’re building your app on GitHub. If your project requires a database connection during the build, you need to provide database credentials in the PM2 Ecosystem. To do so, go to Webapp > Settings > Build > PM2 Ecosystem and add your credentials in the env section. It may look something like this:

  env: {
    "PORT": 3333,
    "CI": 1,
    "NUXT_TELEMETRY_DISABLED": 1,
    "DATABASE_HOST": "localhost",
    "DATABASE_PORT": "3306",
    "DATABASE_NAME": "database_name",
    "DATABASE_USERNAME": "database_username",
    "DATABASE_PASSWORD": "database_password"
  }

The other possibility is that you haven’t updated your environment variables yet. Strapi requires some secrets to run, such as APP_KEY. If you haven’t updated your environment variables from Webapp > Environment, check the .env file in your local project and update them accordingly.

Make sure to re-deploy after performing the steps above.
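
If it helps to confirm which variables are missing, a small check like the one below could be run on the server. This is only a sketch: missingEnvVars is a made-up helper name, and the variable names in the example are illustrative, so swap in whatever your own .env file defines.

```javascript
// Returns the names from `required` that are absent or blank in `env`.
// `missingEnvVars` is a hypothetical helper, not part of Strapi or Cleavr.
function missingEnvVars(required, env = process.env) {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || String(value).trim() === "";
  });
}

// Example with a sample object instead of the real process.env.
const sample = { APP_KEY: "abc", DATABASE_HOST: "" };
console.log(missingEnvVars(["APP_KEY", "DATABASE_HOST", "DATABASE_NAME"], sample));
// prints [ 'DATABASE_HOST', 'DATABASE_NAME' ]
```

Calling missingEnvVars with only the list of names checks the real environment of the current process.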

I hope it helps. Let us know if that doesn’t resolve your issue.

I’m running into this issue with a remix app—is there a tutorial somewhere to help me set this up properly with remix instead of nuxt?

Hi @sheffield - we have a Remix tutorial that you can find here: Deploy Remix JS - Cleavr docs

Thanks. I followed that tutorial to set up my site initially—it’s running in cluster mode, and running with high CPU because the site keeps stopping and restarting with the Error: listen EADDRINUSE: address already in use error.

For this error, you can try to restart the app in the deployment workflow > app > app status section. If restarting from there doesn’t work, you can try doing a harder restart from server > services and select the repair option under Node JS service.

Or, it could just be that there is another app running on the same port number… If that’s the case, check the site nginx configs and see if more than one site is on the same port.
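
As a quick way to test the port theory, something like the following sketch can be run on the server while your app is stopped. It uses only Node's built-in net module; isPortFree is a made-up helper name, and the port number you pass in should be the one from your own PM2 ecosystem.

```javascript
const net = require("net");

// Tries to bind `port` briefly. Resolves to true if the bind succeeds,
// false if it fails with EADDRINUSE, and rejects on any other error.
// `isPortFree` is a hypothetical helper name for this sketch.
function isPortFree(port) {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once("error", (err) => {
      if (err.code === "EADDRINUSE") {
        resolve(false);
      } else {
        reject(err);
      }
    });
    probe.once("listening", () => {
      // Release the port again before reporting success.
      probe.close(() => resolve(true));
    });
    probe.listen(port);
  });
}
```

If isPortFree(yourPort) resolves to false while your app is stopped, some other process is holding the port.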

I don’t have any other apps running on the same port. I tried restarting the app and ran into the same problem. It looks like the problem is that it’s restarting over and over (see screenshot). I’m digging into dotenv a little to see if that’s related.

Gotcha! There are some additional details typically in the Server > NodeJS heartbeat which may pinpoint what PM2 is seeing as the failure.

Thanks! Taking a look.

Looks like it’s hit this error a couple of times (I swapped out my site name):

28|open.in | Error: ENOENT: no such file or directory, chdir '/' -> '/home/cleavr/<SITE_NAME>/artifact'
28|open.in |     at wrappedChdir (node:internal/bootstrap/switches/does_own_process_state:112:14)
28|open.in |     at process.chdir (node:internal/worker:99:5)
28|open.in |     at /usr/lib/node_modules/pm2/lib/ProcessContainer.js:298:13
28|open.in |     at wrapper (/usr/lib/node_modules/pm2/node_modules/async/internal/once.js:12:16)
28|open.in |     at next (/usr/lib/node_modules/pm2/node_modules/async/waterfall.js:96:20)
28|open.in |     at /usr/lib/node_modules/pm2/node_modules/async/internal/onlyOnce.js:12:16
28|open.in |     at WriteStream.<anonymous> (/usr/lib/node_modules/pm2/lib/Utility.js:186:13)
28|open.in |     at WriteStream.emit (node:events:513:28)
28|open.in |     at node:internal/fs/streams:75:16
28|open.in |     at FSReqCallback.oncomplete (node:fs:198:23)

This error from PM2 is related to the restart loop:

PM2        | Error: listen EADDRINUSE: address already in use :::9718
PM2        |     at Server.setupListenHandle [as _listen2] (node:net:1740:16)
PM2        |     at listenInCluster (node:net:1788:12)
PM2        |     at Server.listen (node:net:1876:7)
PM2        |     at Function.listen (/home/cleavr/<SITE_NAME>/releases/20230411071909064/client/node_modules/express/lib/application.js:635:24)
PM2        |     at Object.<anonymous> (/home/cleavr/<SITE_NAME>/releases/20230411071909064/client/node_modules/@remix-run/serve/dist/cli.js:44:84)
PM2        |     at Module._compile (node:internal/modules/cjs/loader:1254:14)
PM2        |     at Object.Module._extensions..js (node:internal/modules/cjs/loader:1308:10)
PM2        |     at Module.load (node:internal/modules/cjs/loader:1117:32)
PM2        |     at Function.Module._load (node:internal/modules/cjs/loader:958:12)
PM2        |     at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:81:12)
PM2        | 2023-04-13T18:21:19: PM2 log: App name:<SITE_NAME> id:29 disconnected
PM2        | 2023-04-13T18:21:19: PM2 log: App [<SITE_NAME>:29] exited with code [1] via signal [SIGINT]
PM2        | 2023-04-13T18:21:19: PM2 log: App [<SITE_NAME>:29] starting in -cluster mode-
PM2        | 2023-04-13T18:21:19: PM2 log: App [<SITE_NAME>:29] online

I dug into this for a while today and haven’t been able to find the root cause yet. From the PM2 monitor, it looks like the issue is that in cluster mode, it’s trying to run multiples of the same app and one of them is in a restart loop because the address is already in use. This is only happening on the remix apps I have running.

Hmmm… I don’t see anything that says Remix doesn’t support cluster_mode on their docs for any particular reason.

You could try SSH’ing in, cd to the project path, and run pm2 status to see the running processes. Then kill the processes using pm2 kill <process number>, run pm2 start .cleavr.config.js, and see if that clears up the issue.

An alternative would be to set instances to 1 in the deployment workflow > settings > build > PM2 ecosystem. Update to:

...
 instances : "1",

...

You’d then need to redeploy the app for the change to take effect.

Did some digging on the Remix Discord and it led to the answer. The short version is that pm2 and npm sometimes don’t work well together, so I had to create an express server in my project root and run the app that way. Deployment build settings look like this now:

module.exports = {
  name: "...",
  script: "./server.js", // this is a basic express server. check remix docs.
  log_type: "json",
  cwd: "/home/cleavr/staging.open.ink/current", // had to update this to current to find the server file
  instances : "max", 
  exec_mode : "cluster_mode",
  env: {
    // currently have to include all env vars here, because it's not loading dotenv correctly
    // likely will switch to using something like 1password for env vars in the future, so not a big deal
  }
}