For most of the past three years, we’ve been building a people intelligence layer on top of the GFoundry platform – trying to predict which employees are at risk of leaving before they actually leave. The technical challenge is real but tractable. The harder question turned out to be architectural: do you build one model that works for every client, or do you build a separate model for each?
We tried both. Here’s what we found.
The universal model is tempting for obvious reasons
When you’re building a SaaS product with multiple clients, the default impulse is to build one thing that works for everyone. It’s cleaner. It trains on more data. It’s cheaper to maintain. And on paper, it looks good – our first multi-client model, trained across five companies with roughly 18,000 employee records, reached an AUC of 0.72 on a temporal validation split. That’s a reasonable number.
Then we ran a backtest. We took the model’s predictions, applied a 0.50 probability threshold – the textbook “flag this person as at-risk” cutoff – and checked how many employees who actually left in the following ten days had been correctly flagged.
Zero. Zero out of 36 confirmed departures.
The model wasn’t completely useless. If you dropped the threshold down to 0.05, it caught 72% of them. But at that threshold it was also flagging nearly every low-tenure employee in the database. A model that tells you “almost everyone is at risk” isn’t a model. It’s noise with a probability score attached.
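To make the backtest concrete, here is a minimal sketch of the check described above: apply a cutoff, then count how many real leavers were flagged and how many people were flagged overall. The function name and the toy numbers are made up for illustration; this is not the production pipeline.

```python
from typing import Sequence

def backtest_at_threshold(probs: Sequence[float], left: Sequence[bool],
                          threshold: float) -> dict:
    """Count flagged employees and caught leavers at one probability cutoff."""
    flagged = [p >= threshold for p in probs]
    n_leavers = sum(1 for y in left if y)
    caught = sum(1 for f, y in zip(flagged, left) if f and y)
    return {
        "threshold": threshold,
        "flagged": sum(flagged),
        "flag_rate": sum(flagged) / len(probs),
        "caught": caught,
        "recall": caught / n_leavers if n_leavers else float("nan"),
    }

# Toy data (hypothetical): scores are low across the board, as ours were.
probs = [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.20, 0.30]
left = [False, False, True, False, True, False, True, False]

for cutoff in (0.50, 0.05):
    print(backtest_at_threshold(probs, left, cutoff))
# At 0.50 nobody is flagged, so no leaver is caught.
# At 0.05 all 3 leavers are caught, but 6 of 8 employees are flagged.
```

Reporting the flag rate next to recall is the point of the exercise: a low cutoff can always buy recall, and the flag rate shows what it costs.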
Why People Analytics resists generalisation
In theory, employee churn should have universal signals. Low engagement, infrequent platform activity, stagnating performance scores – these patterns probably transcend any single organisation. And to some extent they do.
But the devil is in the specifics.
One of our clients had disabled the learning module on the platform entirely. That meant features like content consumption, quiz performance, and learning frequency – which were informative for other clients – were zero across the board for this organisation. Not “this person doesn’t learn.” Just: “this module isn’t active here.” We were feeding noise into the model and calling it signal.
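One way to avoid that confusion, sketched here under assumptions, is to mark features from inactive modules as missing rather than zero, and to record module activity as its own flag. The module-to-feature mapping and all feature names below are hypothetical.

```python
# Hypothetical mapping from platform module to the features it produces.
MODULE_FEATURES = {
    "learning": ["content_consumed", "quiz_score", "learning_frequency"],
}

def mask_inactive_features(row: dict, active_modules: set) -> dict:
    """Return a copy of `row` where features of inactive modules are None.

    A zero means "this person did nothing"; None means "this module is not
    switched on for this client". An explicit <module>_active flag is added.
    """
    masked = dict(row)  # leave the caller's row untouched
    for module, features in MODULE_FEATURES.items():
        is_active = module in active_modules
        masked[f"{module}_active"] = int(is_active)
        if not is_active:
            for name in features:
                if name in masked:
                    masked[name] = None
    return masked

row = {"content_consumed": 0, "quiz_score": 0, "logins_30d": 12}
print(mask_inactive_features(row, active_modules=set()))
```

Whether missing values are then imputed, dropped, or handled natively depends on the model family; the sketch only separates the two meanings of zero.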
More broadly, different clients use different parts of the platform differently. They have different HR policies, different workforce compositions, different training cycles, different performance review cadences. A model trained on a logistics company with 6,000 employees has probably learned things that don’t transfer cleanly to a 200-person tech firm.
There’s also the data heterogeneity problem. When you pool clients together, you’re not just pooling signal – you’re pooling idiosyncratic noise from different organisations. The model ends up learning a kind of average behaviour that doesn’t describe any individual client particularly well.
This is a known tension in machine learning. The bias-variance tradeoff at the model level mirrors a similar tradeoff at the data level: more data reduces variance, but heterogeneous data introduces bias. When your clients are structurally different enough, pooling them together may do more harm than good.
What changed when we went per-client
For one of our larger clients – a retail organisation with around 15,000 employee records, of which 6,200 are currently active – we built dedicated models. We iterated through a series of versions: behavioural models using platform activity data, and a separate model trained entirely on annual performance review data specific to that organisation. Eventually we combined both using a logistic stacking approach that calibrates each signal against the other.
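For readers unfamiliar with logistic stacking, the general idea can be sketched as a small logistic regression fitted on the outputs of the two base models. This is an illustration of the technique, not the exact model described here: using logits as inputs, plain gradient descent, and every name and number below are assumptions.

```python
import math

def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def _logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))

def fit_stacker(p_behaviour, p_review, y, lr=0.1, epochs=2000):
    """Fit weights [bias, w_behaviour, w_review] by batch gradient descent.

    The base scores should be out-of-fold predictions; otherwise the
    stacker learns from scores that have already seen the labels.
    """
    X = [(_logit(a), _logit(b)) for a, b in zip(p_behaviour, p_review)]
    w = [0.0, 0.0, 0.0]
    n = len(y)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), target in zip(X, y):
            err = _sigmoid(w[0] + w[1] * x1 + w[2] * x2) - target
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w

def predict_stacked(w, p_behaviour: float, p_review: float) -> float:
    return _sigmoid(w[0] + w[1] * _logit(p_behaviour) + w[2] * _logit(p_review))

# Toy scores (hypothetical) from a behavioural and a review-based model.
p_behaviour = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.5]
p_review = [0.2, 0.1, 0.4, 0.7, 0.6, 0.9, 0.6, 0.3]
left = [0, 0, 0, 1, 1, 1, 1, 0]

weights = fit_stacker(p_behaviour, p_review, left)
print(predict_stacked(weights, 0.7, 0.8))
```

The learned weights say how much to trust each base model given the other, which is one reading of “calibrating each signal against the other”.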
The improvement was not marginal.
The stacked model reached a PR-AUC 2.3 times higher than the generic multi-client model. At the top 10% of predicted risk, it correctly identified 41.6% of employees who actually left, compared to 20.2% with the universal model. Calibration was four times better by Brier score – meaning the probability estimates were actually useful as probability estimates, not just rankings.
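The three numbers above answer different questions: ranking quality across all cutoffs, capture in the riskiest slice, and the quality of the probabilities themselves. A minimal sketch of each follows. Average precision is used here as a common estimator of PR-AUC, which is an assumption about the exact variant; the toy data is invented.

```python
def _ranked(probs, left):
    # Highest predicted risk first.
    return sorted(zip(probs, left), key=lambda pair: pair[0], reverse=True)

def recall_at_top_fraction(probs, left, fraction=0.10):
    """Share of actual leavers found in the top `fraction` of predicted risk."""
    n_top = max(1, int(len(probs) * fraction))
    caught = sum(1 for _, y in _ranked(probs, left)[:n_top] if y)
    total = sum(1 for y in left if y)
    return caught / total if total else float("nan")

def average_precision(probs, left):
    """Mean of the precision measured at the rank of each actual leaver."""
    hits, running = 0, 0.0
    for rank, (_, y) in enumerate(_ranked(probs, left), start=1):
        if y:
            hits += 1
            running += hits / rank
    return running / hits if hits else float("nan")

def brier_score(probs, left):
    """Mean squared error of the probabilities; lower is better."""
    return sum((p - (1.0 if y else 0.0)) ** 2
               for p, y in zip(probs, left)) / len(probs)

# Toy data (hypothetical): 10 employees, 2 leavers.
probs = [0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]
left = [True, False, True, False, False, False, False, False, False, False]

print(recall_at_top_fraction(probs, left))  # 0.5: one of two leavers in the top 10%
print(average_precision(probs, left))
print(brier_score(probs, left))
```

A model can rank well and still score badly on Brier if its probabilities are systematically too high or too low, which is why ranking and calibration are reported separately.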
The reasons aren’t mysterious. The performance review model could incorporate features specific to this client’s evaluation framework – things like how a manager’s leniency bias correlated with attrition on their team in prior years, or how a gap between self-assessment and manager assessment tracked with future departure. That data didn’t exist – and couldn’t exist – in the universal model. It was specific to one organisation’s way of working.
Some of the most predictive features in the per-client model were things you’d never see in a cross-industry dataset: whether an employee’s preparation for their annual review was marked as unfinished (a quiet proxy for disengagement), or whether their direct manager had a historically high team churn rate. These are organisational signals, not individual ones. And they only become visible when you’re looking at one organisation at a time.
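As a rough illustration of what such organisation-specific features look like in code, the sketch below derives three of them. The field names (`self_score`, `manager_score`, `prep_completed`, `manager_id`) are hypothetical, not the client’s actual schema.

```python
from collections import defaultdict

def manager_prior_churn_rate(history):
    """Churn rate per manager from (manager_id, left) pairs.

    `history` must contain prior periods only; including the period being
    predicted would leak the label into the feature.
    """
    counts = defaultdict(lambda: [0, 0])  # manager_id -> [team size, leavers]
    for manager_id, left in history:
        counts[manager_id][0] += 1
        counts[manager_id][1] += int(left)
    return {m: leavers / total for m, (total, leavers) in counts.items()}

def review_features(review: dict, manager_rates: dict) -> dict:
    """Build per-employee features from one annual review record."""
    return {
        # Positive gap: the employee rates themselves above their manager's view.
        "self_manager_gap": review["self_score"] - review["manager_score"],
        "prep_unfinished": int(not review["prep_completed"]),
        # None when the manager has no prior history.
        "manager_prior_churn": manager_rates.get(review["manager_id"]),
    }

rates = manager_prior_churn_rate([("m1", True), ("m1", False), ("m2", False)])
review = {"self_score": 4.5, "manager_score": 3.0,
          "prep_completed": False, "manager_id": "m1"}
print(review_features(review, rates))
```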
The cost you don’t see on the slide deck
Here’s the part that doesn’t get celebrated: every new client is now, in principle, a new modelling pipeline.
That means you need enough historical data per client to train reliably – probably somewhere north of a few thousand employees with a meaningful period of churn history. Smaller clients fall below that threshold, and a bespoke model trained on insufficient data is probably worse than a universal one.
It means you need to audit features per client. Which modules are active? Which data exists? What’s the definition of “churned” for this organisation – is a transfer between subsidiaries a departure? These questions don’t have generic answers.
And it means you’re maintaining N models, not one. That’s N training pipelines, N validation processes, N monitoring setups. As the client list grows, so does the operational surface. This is manageable now. At ten times the client count, it probably isn’t.
We’re likely looking at a hybrid architecture eventually – a shared base model that captures cross-client signal, combined with a per-client adaptation layer trained on local data. Something closer to transfer learning or federated fine-tuning than full retraining from scratch. We haven’t built that yet. What we have is good enough to be useful, and honest enough to know its limits.
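To give a feel for the shape such a hybrid could take, here is one deliberately simple possibility: blend the universal and per-client scores in logit space, trusting the local model more as the client’s history grows. This is a shrinkage-style sketch, not a design we have built or validated, and the constant `k` is a placeholder rather than a measured value.

```python
import math

def _logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))

def _sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def blend(p_base: float, p_local: float, n_local: int, k: int = 3000) -> float:
    """Weighted average of two scores in logit space.

    The local weight n_local / (n_local + k) is 0 with no local data and
    approaches 1 as local history grows. `k` is a hypothetical constant.
    """
    w = n_local / (n_local + k)
    return _sigmoid((1.0 - w) * _logit(p_base) + w * _logit(p_local))

print(blend(0.2, 0.8, n_local=0))       # falls back to the base score
print(blend(0.2, 0.8, n_local=3000))    # equal weight on both
print(blend(0.2, 0.8, n_local=300000))  # dominated by the local score
```

A real adaptation layer would more likely fine-tune on local features than average two finished scores, but the fallback behaviour for small clients is the property worth preserving.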
When does per-client make sense?
Honestly, probably not always. Our working hypothesis – still unverified at scale – is something like:
- If a client has fewer than a few thousand employees with multi-year history, the universal model is probably better, or at least no worse.
- If a client has distinctive data sources (a specific evaluation framework, unusual platform usage patterns, or module configurations that differ significantly from the norm), the per-client model will likely capture things the universal one misses.
- If the stakes are high – if someone is actually making retention decisions based on the output – calibration and precision matter more. If it’s just a heuristic for surfacing names to review, a well-ranked universal model is probably fine.
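Written down as a rule, the working hypothesis above might look like the sketch below. The dataclass fields are hypothetical and `MIN_EMPLOYEES` is a placeholder for “a few thousand”; none of these cutoffs has been verified at scale.

```python
from dataclasses import dataclass

MIN_EMPLOYEES = 3000  # placeholder for "a few thousand", not a tested value

@dataclass
class ClientProfile:
    employees_with_history: int       # employees with multi-year history
    has_distinctive_data: bool        # e.g. own evaluation framework
    drives_retention_decisions: bool  # output feeds real decisions

def choose_model(profile: ClientProfile) -> str:
    """Return "universal" or "per_client" following the three bullets above."""
    if profile.employees_with_history < MIN_EMPLOYEES:
        return "universal"  # too little data for a reliable bespoke model
    if profile.has_distinctive_data or profile.drives_retention_decisions:
        return "per_client"
    return "universal"  # large client, standard data, low stakes

print(choose_model(ClientProfile(200, True, True)))     # universal
print(choose_model(ClientProfile(6200, True, False)))   # per_client
print(choose_model(ClientProfile(6200, False, False)))  # universal
```

The data-volume check comes first on purpose: distinctive data and high stakes argue for a per-client model, but neither helps if there is too little history to train one.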
What we learned, more than anything, is that the metric on the validation report is not the thing that matters. The thing that matters is whether the model flags the people who actually leave. That requires a backtest. And backtests tend to be humbling.
We ran ours early enough that we could change course. That probably saved us a year of shipping something that looked impressive and wasn’t working.
João Carvalho is co-founder of Ubbin Labs, where the team builds people intelligence infrastructure for organisations that want to make better decisions about their workforce.
