Dmitry Ryzhenkov

Modeling the Unobservable

Sun, 08 Mar 2026 00:00:00 GMT

In a standard data science project, we are usually tasked with modeling a relationship between observed features and an observable target variable. When building a churn model, for instance, we usually group users into two buckets: those who left, and those who are still using the product. We know the ground truth; it's right in front of us. The same could be said about fraud, or stock market fluctuations, or whatever else you may have worked on.

However, what happens when the target variable is an abstract construct? There is no sensor that can output a measurement of a user's brand affinity. There is no direct way of measuring a candidate's programming skill. These are latent variables: traits we cannot directly observe, but must instead infer from secondary behavior.

When faced with these kinds of tasks, the industry standard is often a naive proxy. In the brand affinity example, we might design a short quiz and administer it to some of our users, then use the results as if they were the latent variable we are trying to measure. In the programming skill example, we could administer a coding challenge and count how many test cases pass. Notice that, in both scenarios, the instrument becomes a substitute for the latent variable itself. If we ask a programmer to solve an exercise and measure how well they performed, we are not measuring programming skill, but how good the programmer is at that specific exercise. If we send out a short quiz to our users, we are measuring not their brand affinity, but how they answer those particular questions. These measurements might correlate with the latent traits we are trying to extract, but how can we be sure if we still cannot see the traits? How do we know whether the correlation is weak or strong, and how do we improve it?

This is one of the foundational problems in psychometrics, the branch of psychology that studies how we can effectively measure these latent variables. Psychometrics is a wide field, but the modern approach is founded upon Item Response Theory (IRT). Under this paradigm, we model a specific interaction (e.g., a correct answer, a 5-star rating) as a probability function of both the subject's latent trait and the item's inherent parameters. IRT places both subjects and items on the same scale, so every measurement has to be understood relative to the item that produced it. It is effectively a regression in which the regressors themselves are unobserved.

In this article we will explore how to recover these latent signals from raw interaction matrices using Multidimensional Item Response Theory (MIRT) and Bayesian inference.

Unidimensional IRT

Before we can even begin to understand MIRT, we must first be able to conceptualize IRT itself. The simplest place to start is a model with a single dimension.

The fundamental unit of analysis in IRT is a single interaction between a subject and an item. We typically represent this data as a sparse matrix of Users ($N$) by Items ($J$). In an e-commerce setting, this might be shoppers and products, and in a gaming matchmaking context we could be talking about players and other players. In an educational context, which we will be focusing on for simplicity's sake, this can be students and test questions. The values in this matrix are the observed responses, usually binary (pass/fail) or ordinal (partial credit).

If we were to use naive aggregate metrics, we would collapse this matrix row-wise to calculate a student's total score on the test. In IRT, however, we do not aggregate. Instead, as mentioned, we model the probability of each specific interaction in that matrix.

We assume that a student's behavior is driven by a latent variable, $\theta$. This variable represents their unobservable trait on a continuous scale, typically normalized to a standard normal distribution $N(0, 1)$. The relationship between this latent trait and the probability of a positive interaction (e.g., answering correctly) is non-linear. It follows a logistic curve known in psychometrics as the Item Characteristic Curve (ICC), which is basically an S-shaped curve, like a sigmoid.

The 2PL Model (Two-Parameter Logistic)

This is the most common starting point. This model characterizes an item using two distinct parameters that determine the shape of the ICC we just mentioned.

Difficulty ($b$): This parameter represents the location of the curve on the x-axis. It is the point on the latent scale $\theta$ where the probability of a positive interaction is exactly $0.5$. It functions similarly to a bias or intercept. If a test question has a high $b$ value, the logistic curve shifts to the right, requiring a higher latent ability to achieve the same probability of success.
Discrimination ($a$): This parameter controls the slope of the curve at its steepest point ($\theta = b$). It represents how well an item differentiates between users of varying latent traits. It can be thought of as a feature weight or importance. For an item with a high discrimination parameter $a$, the probability rises rapidly from near $0$ to near $1$ over a short range of $\theta$. This indicates a high information density.

Mathematically, the probability of a positive interaction for user $u$ on item $j$ is given by the logistic function:

$$ P(X_{ju} = 1|\theta_u,a_j,b_j) = \frac{1}{1 + e^{-a_j(\theta_u-b_j)}} $$

If you stare at this long enough, you'll notice it's just a logistic regression where the input feature is the latent trait $\theta$.
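As a minimal sketch of the formula above, the ICC can be written in a few lines. The function name `icc_2pl` and the parameter values are illustrative assumptions, not part of any library:

```python
import math


def icc_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a positive response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, chosen only for illustration.
a, b = 1.2, 0.0
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta={theta:+.1f}  P(correct)={icc_2pl(theta, a, b):.3f}")
```

At $\theta = b$ the function returns exactly $0.5$, which is the definition of difficulty given earlier.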

For a better visualization, consider the following graph.

The blue ICC is a test question that has a medium difficulty, but fairly low discrimination. A person with $\theta = -3$ could still answer the blue item correctly about 5% of the time. Remember, $\theta$ is standardized, so a student at $\theta = -3$ can be considered an outlier, and they still can answer an average item correctly some of the time. So, this item is not that great.
The orange ICC is a harder test question than the blue one, but with the same discrimination. You'd need a $\theta$ of $2$ to answer this correctly half of the time, but someone below average, at $\theta = -1$, could still get it right.
The green ICC has the same difficulty as the blue one, but its discrimination is far better. That same person with $\theta = -3$ has basically no hope of getting this question right, and another person with $\theta = -1$ might get lucky only about 10% of the time.

By now, you might be already thinking that if a test consists of only easy items, we'd gain very little information about high-skill students, who will answer almost everything correctly. And you'd be right about that. Conversely, if a test is hard, low-skill students will score really badly, and that also fails at telling us how good they are in reality. These are formally known as ceiling and floor effects, respectively, and a good test avoids them by featuring items all across the latent variable dimension.

The 3PL Model

So, hard items help with preventing the ceiling effect, easy items help with preventing the floor effect, and high discrimination items give more information near their difficulty level. A student at $\theta = -3$ can hardly give a correct answer to a highly discriminating item with difficulty $b = 2$, or can they? Couldn't they just get it right if they guessed? Well, it turns out they absolutely can do that, and we always assume that they will. To account for this, we introduce a third parameter to the equation.

Baseline ($c$): Formally known in the literature as guessing. For a four-option multiple-choice question on SQL, what is the probability that a student who did not study answers correctly? Is it 0? No. Instead, it hovers around $\frac{1}{4}$. If a student doesn't know the answer, they might just select one of the four options at random, making it a $\frac{1}{4}$ chance of success. This suggests a lower bound higher than $0$ must exist in our ICC. For a simple four-option test, we could just set a lower asymptote at $0.25$ for every item and call it a day, and that could be a sound decision.

However, it could be the case that the test was not designed correctly, and one of the three alternatives (aka distractors) is so obviously wrong, that even a low-skill student wouldn't consider it, increasing the guessing chance to $\frac{1}{3}$. But, of course, how can we know that for sure? The answer is that we don't just set the lower asymptote at $0.25$, but we also let it be fine-tuned during training. A 2PL model might overcorrect the discrimination parameter to account for this guessing effect (and the ICC would be underfit as a result), but the neat thing about the 3PL model is that it bakes in this complexity directly. The formula would now look like this:

$$ P(X_{ju} = 1|\theta_u,a_j,b_j,c_j) = c_j + (1 - c_j) \frac{1}{1 + e^{-a_j(\theta_u - b_j)}} $$
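A small sketch of this formula, assuming a fixed guessing floor of $0.25$ for a four-option item (the function name `icc_3pl` and the item parameters are hypothetical):

```python
import math


def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL ICC: a logistic curve rescaled to live between c and 1."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical hard, highly discriminating four-option item.
a, b, c = 2.0, 2.0, 0.25
print(f"very low ability: {icc_3pl(-3.0, a, b, c):.3f}")  # sits on the guessing floor
print(f"ability equal to b: {icc_3pl(2.0, a, b, c):.3f}")  # halfway between c and 1
```

One side effect worth noticing: with $c > 0$, the probability at $\theta = b$ is $c + \frac{1 - c}{2}$ rather than $0.5$, so $b$ now marks the midpoint between the floor and $1$.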

If we decide to let the $c$ parameter be inferred from training data, it will require more observations. Tuning three parameters per item, on top of each student's $\theta$, from a small dataset without overfitting can be really challenging, often impossible. In any case, whether we learn this parameter or just assume a completely random guess, we can now expect ICCs like these:

The 4PL Model

Sometimes, a floor is not enough. For instance, a high-skill student might still get a question wrong because they've been careless, or the question might be ambiguous.

Friction ($d$): Formally known in the literature as carelessness. We are talking about a baseline of false negatives now. This happens when a high-$\theta$ student fails to trigger the expected interaction due to external friction. In the coding assessment example, this can be a simple syntax error or a misinterpreted edge case that causes a failure regardless of the student's algorithmic competence. To factor this into our existing ICC, we make $d_j$ the upper asymptote, so the scaling term $(1 - c_j)$ becomes $(d_j - c_j)$:

$$ P(X_{ju} = 1|\theta_u,a_j,b_j,c_j,d_j) = c_j + (d_j - c_j) \frac{1}{1 + e^{-a_j(\theta_u - b_j)}} $$

This time, our ICC is squeezed between the $c$ and $d$ parameters, effectively accounting for baseline noise. This model, however, needs way more observations in order to not overfit.
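The nesting of the three models can be sketched with a single function whose defaults recover the simpler cases. The name `icc_4pl` and the example values are assumptions made for illustration:

```python
import math


def icc_4pl(theta: float, a: float, b: float, c: float = 0.0, d: float = 1.0) -> float:
    """4PL ICC. With d=1 it reduces to the 3PL; with c=0 and d=1, to the 2PL."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (d - c) * logistic


# Hypothetical item: 25% guessing floor, 5% chance of a careless failure.
a, b, c, d = 1.5, 0.0, 0.25, 0.95
for theta in (-4.0, 0.0, 4.0):
    print(f"theta={theta:+.1f}  P(correct)={icc_4pl(theta, a, b, c, d):.3f}")
```

However extreme $\theta$ gets, the returned probability never leaves the interval $[c, d]$.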

Dimensionality

The models discussed so far rely on a strong assumption known as unidimensionality. This is the premise that a single latent trait, $\theta$, is sufficient to explain all the variance in our $N \times J$ interaction matrix. In a perfectly controlled theoretical environment, this might hold true. A pure arithmetic test measures arithmetic ability and nothing else. A perfectly isolated pricing experiment measures price sensitivity and nothing else, too.

However, in any realistic behavioral setting, this assumption is wrong. A question asking a student to optimize a SQL query involves SQL syntax knowledge, database logic, reading comprehension, and business understanding. A student might fail the item not because their database skills are lacking, but because they misunderstood the prompt. The response data contains overlapping signals from multiple distinct sources.

One way to attempt to disentangle these overlapping signals, which you are probably already thinking of, would be to apply PCA. This is, however, a mistake.

Applying standard PCA to binary or ordinal data does not work as intended. Because PCA is performed on a standard (Pearson) covariance matrix, we begin our analysis on a wrong assumption: that the variables are continuous and normally distributed. When applied to boolean vectors (like pass/fail), the correlation coefficient becomes heavily dependent on the difficulty of the items. Easy items correlate with other easy items, and difficult items correlate with other difficult items, simply due to the skew in their distributions. If you plot this PCA, your items will cluster by difficulty rather than by semantic content.
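A quick simulation makes the difficulty artifact visible. The sketch below assumes two latent normal variables with a fixed correlation of $0.6$ and cuts them at different thresholds; the sample size, seed, and thresholds are arbitrary choices:

```python
import math
import random

RHO = 0.6  # assumed latent correlation, fixed for every pair below
rng = random.Random(0)

# Draw correlated standard normal pairs (z1, z2).
latent = []
for _ in range(50_000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = RHO * z1 + math.sqrt(1.0 - RHO**2) * rng.gauss(0.0, 1.0)
    latent.append((z1, z2))


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def phi(tau_a: float, tau_b: float) -> float:
    """Pearson correlation of the two pass/fail items obtained by thresholding."""
    a = [1 if z1 > tau_a else 0 for z1, _ in latent]
    b = [1 if z2 > tau_b else 0 for _, z2 in latent]
    return pearson(a, b)


print(f"two medium items:   {phi(0.0, 0.0):.3f}")
print(f"two easy items:     {phi(-1.5, -1.5):.3f}")
print(f"easy vs. hard item: {phi(-1.5, 1.5):.3f}")
```

The latent correlation is identical in all three cases, yet the observed Pearson coefficient shrinks whenever the two items differ in difficulty, and it never reaches $0.6$.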

So, PCA failed us, now what? In order to correctly map the latent space, we have to think of the observed binary responses as discretized manifestations of an underlying continuous distribution. Put simply, we assume that for every binary item, there is a latent continuous variable with a threshold. If the student's propensity exceeds this threshold, they answer correctly.

Introducing: the polychoric and the tetrachoric correlation coefficients. These estimators calculate the correlation between the underlying latent variables rather than the observed integers. The tetrachoric correlation is the one we would use for a binary matrix, and the polychoric, which is a generalization of the tetrachoric, can be used for any ordinal matrix.

Tetrachoric Correlation

The tetrachoric correlation relies on a fundamental assumption: latent normality. Even though we only observe a binary outcome ($0$ or $1$), we assume that the underlying trait driving that response is continuous and normally distributed.

Imagine two test items. We don't see the continuous propensity for either one, we only see whether the student crossed a threshold (pass) or didn't (fail). This creates a $2 \times 2$ contingency table of observed counts:

$$ \begin{array}{r|cc} & \text{B: 1} & \text{B: 0} \\ \hline \text{A: 1} & (1,1) & (1,0) \\ \text{A: 0} & (0,1) & (0,0) \end{array} $$

The tetrachoric correlation ($\rho$) asks this question: If we assume these two binary variables are actually thresholded versions of two continuous standard normal variables $X$ and $Y$, what correlation coefficient $\rho$ between $X$ and $Y$ would best reproduce our observed contingency table?

That is a really hard question to answer, but thankfully there is a way. Mathematically, we are looking for the $\rho$ that satisfies this double integral for the probability of observing a $(1,1)$ response:

$$ P(X > \tau_1, Y > \tau_2) = \int_{\tau_1}^{\infty} \int_{\tau_2}^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}} e^{-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)}} \,dy\,dx $$

Where $\tau_1$ and $\tau_2$ are the difficulty thresholds.

Neat, right? Yeah, I thought so.

Anyway, unlike Pearson, which is a simple arithmetic calculation, the tetrachoric correlation is an inferred parameter estimated via maximum likelihood. It reverses the discretization, giving us an estimate of the correlation between the latent signals rather than between the coarse binary observations, provided that the latent traits are indeed normally distributed.
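To make the idea concrete, here is a rough standard-library sketch. It takes the thresholds from the marginal pass rates and then bisects on $\rho$ until the model reproduces the observed $(1,1)$ proportion; for a single $2 \times 2$ table with no empty cells this should coincide with the maximum-likelihood solution, since the model has as many free parameters as the table has free proportions. The function names, integration settings, and example counts are assumptions for illustration, not a production estimator:

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal


def upper_orthant(tau1: float, tau2: float, rho: float, steps: int = 2000) -> float:
    """P(X > tau1, Y > tau2) for a standard bivariate normal with correlation rho.

    Integrates phi(x) * P(Y > tau2 | X = x) over x with the trapezoid rule.
    """
    upper = max(tau1, 0.0) + 8.0  # the density is negligible beyond this point
    h = (upper - tau1) / steps
    s = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(steps + 1):
        x = tau1 + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * STD.pdf(x) * (1.0 - STD.cdf((tau2 - rho * x) / s))
    return total * h


def tetrachoric(n11: int, n10: int, n01: int, n00: int, tol: float = 1e-6) -> float:
    """Tetrachoric correlation from the four cell counts (A=1,B=1), (1,0), (0,1), (0,0)."""
    n = n11 + n10 + n01 + n00
    tau1 = STD.inv_cdf(1.0 - (n11 + n10) / n)  # threshold for item A
    tau2 = STD.inv_cdf(1.0 - (n11 + n01) / n)  # threshold for item B
    target = n11 / n
    lo, hi = -0.999, 0.999
    while hi - lo > tol:  # the orthant probability increases with rho
        mid = (lo + hi) / 2.0
        if upper_orthant(tau1, tau2, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Counts constructed for a latent correlation of 0.6 with both thresholds at 0,
# using the closed form P(1,1) = 1/4 + arcsin(rho) / (2 * pi).
print(f"estimated rho: {tetrachoric(35242, 14758, 14758, 35242):.3f}")
```

Note that the sketch does not handle tables with empty cells or marginals of exactly 0 or 1, where the thresholds are undefined.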

This visualization might help understanding this correlation coefficient if you've never seen it before.

Once the covariance structure of the latent traits is estimated, we could actually perform PCA on this matrix, take as many dimensions as we see fit, and use the loadings to describe the items. However, the standard approach is to process it using a Bifactor Model. I won't be getting into much detail because it's a pretty deep rabbit hole, but this decomposition is particularly well-suited for technical assessments. It posits a general factor ($G$) that influences all items, representing the core competency, such as "Python skills". Simultaneously, it models specific factors ($S$) that capture variance specific to a subdomain, such as "Python syntax", "algorithmic thinking" or "code modularity". The distinctive feature of this model is that the specific factors are all orthogonal to each other, as well as to the general factor. In addition, each item loads on the general factor and on at most one specific factor, so if one specific factor covers eight items in the test, we know that no other specific factor will cover any of those same eight items.

This bifactor approach is widely used, as it helps us to separate the signal we care about from the noise of specific domain knowledge. But, as said, the model is fairly intricate, and many different variations exist, each one with its own caveats. That said, the very thing that makes this model so good at separating the noise from the signal, is also its greatest weakness: orthogonality.

Multidimensional IRT (MIRT)

Whether we decide to use PCA or a bifactor model, we almost inevitably rely on the assumption of orthogonality. Both presuppose independence among factors, that is, that the dimensions don't correlate with each other. In practice, this is usually not a good assumption. Skills are inherently correlated. A student with strong algorithmic reasoning often possesses strong mathematical intuition. One could argue that the point of decomposition is to find a unique lens through which the data looks orthogonal. However, this is not always possible, especially as the number of dimensions grows larger. Ultimately, modeling these traits as independent axes is a mathematical convenience, and it distorts and obfuscates the reality of the data.

Consider another SQL assessment example. A complex query challenge does not just test syntax. It requires a confluence of distinct latent abilities. First, the student must understand the relational schema, which is arguably a core component of data modeling (1). Second, they must translate a vague requirement like "calculate retention" into specific logic, which requires general (2) and specific (3) business understanding, as well as logical thinking (4). Finally, they must execute the command using correct keywords, which is their SQL syntax knowledge (5). I just proposed not one, not two, not three, not four, but FIVE distinct factors. And if we really stop and think about this, maybe we could come up with even more.

If we model this assuming a single "SQL ability" scalar, or even orthogonal factors, we fail to capture the interaction. A student might be exceptional at data modeling and SQL, but poor at business logic, resulting in a query that is syntactically perfect and logically sound, but functionally useless.

MIRT addresses this by promoting $\theta$ from a scalar to a vector $\vec{\theta} \in \mathbb{R}^k$. The probability of a correct response becomes a function of the inner product between the student's ability vector and the item's discrimination vector. Since we now treat ability and discrimination as vectors, the exponent becomes a dot product, with $b$ entering as an offset subtracted from it:

$$ P(X_{ju} = 1|\vec{\theta}_u,\vec{a}_j,b_j) = \sigma(\vec{a}_j^T\vec{\theta}_u - b_j) $$

It can help to think about this as if we modeled embeddings for the latent trait. We are now operating in a higher-dimensional space that can capture as much nuance about our instrument and, by extension, the latent trait, as we'd like it to.

This geometry also enables us to model the compensatory nature of some items. In a compensatory model, a deficiency in one dimension can be offset by a surplus in another. Since this new model relies on a dot product, a large magnitude in one dimension of $\vec{\theta}$ can offset a small magnitude in another, provided they align with the discrimination vector $\vec{a}$.

To really conceptualize the dot product, imagine a 2-dimensional model measuring $[\text{SQL Syntax}, \text{Debugging}]$.

Let's say we have an item that requires a bit of syntax knowledge, but is mostly a test of debugging. Its discrimination vector ($\vec{a}$) and difficulty ($b$) might look like this:

$$ \vec{a} = \begin{bmatrix} 1.0 \\ 2.5 \end{bmatrix}, \quad b = 0.5 $$

Now, consider a student who has terrible SQL syntax ability, but is exceptionally good at debugging:

$$ \vec{\theta} = \begin{bmatrix} -1.5 \\ 1.2 \end{bmatrix} $$

If we take the dot product $\vec{a}^T \vec{\theta}$, the student's strong debugging skill compensates for their poor syntax:

$$ \vec{a}^T \vec{\theta} = (1.0 \times -1.5) + (2.5 \times 1.2) = -1.5 + 3.0 = 1.5 $$

Subtracting our difficulty ($b = 0.5$), the final logit is $1.0$. If we pass this through our sigmoid function $\sigma(1.0)$, the student has a ~73% probability of getting the item right. Despite lacking a fundamental skill, the weighted sum pushed the total above the threshold. The skills compensated for each other.
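The same arithmetic in code, as a minimal sketch (the function names are illustrative; the second student is a hypothetical mirror image added for contrast):

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mirt_compensatory(theta, a, b: float) -> float:
    """Compensatory MIRT: sigmoid of the dot product a . theta minus b."""
    logit = sum(a_k * t_k for a_k, t_k in zip(a, theta)) - b
    return sigmoid(logit)


a, b = [1.0, 2.5], 0.5           # [SQL Syntax, Debugging] item from the text
weak_syntax = [-1.5, 1.2]        # the student from the text
weak_debugging = [1.2, -1.5]     # hypothetical student with the skills swapped

p_text = mirt_compensatory(weak_syntax, a, b)
p_swapped = mirt_compensatory(weak_debugging, a, b)
print(round(p_text, 3))  # 0.731
print(f"swapped profile: {p_swapped:.3f}")
```

Swapping the two skill levels gives a much lower probability, because the item weighs debugging far more heavily than syntax. Compensation is always relative to the item's discrimination vector.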

If we think of the opposite extreme, some tasks can be non-compensatory (or conjunctive). In these cases, you need all required skills to succeed. Being a genius in one area doesn't help if you lack the fundamental prerequisite in another. If you don't know any SQL, it doesn't matter how good your business understanding is; you are not writing that query.

This is yet another rabbit hole, but in reality this compensatory nature is a spectrum. Some skills compensate for each other easily, while others only compensate up to a certain point before a baseline requirement enforces a hard stop. Some skills do not compensate at all. Capturing this spectrum is highly advanced, and it also requires extremely large datasets that are not usually obtainable. In any case, I think it's nice to keep this nuance in mind, as it paints a bigger picture.
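For contrast with the dot-product model, one common way to formalize the fully conjunctive end of that spectrum is to multiply per-dimension probabilities, so that a single weak skill caps the result. The sketch below assumes one difficulty per dimension, which is an addition of mine and not something defined earlier in this article:

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mirt_noncompensatory(theta, a, b) -> float:
    """Conjunctive sketch: product of per-dimension 2PL probabilities.

    theta, a and b are equal-length sequences; b holds one (assumed)
    difficulty per dimension.
    """
    p = 1.0
    for t_k, a_k, b_k in zip(theta, a, b):
        p *= sigmoid(a_k * (t_k - b_k))
    return p


theta = [-1.5, 1.2]   # weak syntax, strong debugging
a = [1.0, 2.5]
b = [0.0, 0.0]        # hypothetical per-dimension difficulties

p_conjunctive = mirt_noncompensatory(theta, a, b)
weakest_link = min(sigmoid(a_k * (t_k - b_k)) for t_k, a_k, b_k in zip(theta, a, b))
print(f"conjunctive probability: {p_conjunctive:.3f}")
print(f"weakest single skill:    {weakest_link:.3f}")
```

Here the strong debugging skill cannot lift the probability above what the weak syntax skill allows, which is the opposite of what the compensatory model does.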

Estimation

So, we haven't talked about how to train these models yet. If we take a simple, unidimensional 2PL model, the parameter estimation is fairly simple. We treat $\theta$ as a random nuisance variable, and integrate it out of the likelihood function to get the marginal likelihood:

$$ L(\text{parameters} | \text{data}) = \int P(\text{data} | \theta, \text{parameters}) P(\theta) d\theta $$

In one dimension, this integral is cheap to approximate using Gaussian quadrature (evaluating the integrand at a fixed set of weighted points). However, if we try to do this in a MIRT model, it becomes an integral over the whole latent space. As the number of dimensions $k$ increases, the number of quadrature points required grows exponentially: $q^k$ for $q$ points per dimension. For a $5$-dimensional model, this calculation is already quite expensive, and for larger numbers of dimensions it becomes intractable. So, for MIRT, this approach is off the table.
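The one-dimensional case can be sketched with a plain evenly spaced grid and the trapezoid rule, used here as a simple stand-in for proper Gauss-Hermite quadrature. The item parameters, grid size, and function names are assumptions for illustration:

```python
import math
from itertools import product


def icc_2pl(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def marginal_likelihood(responses, items, n_points: int = 61, lim: float = 6.0) -> float:
    """Integrate theta out of the 2PL likelihood of one response pattern.

    responses: list of 0/1; items: list of (a, b) pairs.
    """
    h = 2.0 * lim / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        theta = -lim + i * h
        like = 1.0
        for x, (a, b) in zip(responses, items):
            p = icc_2pl(theta, a, b)
            like *= p if x == 1 else 1.0 - p
        weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += weight * like * normal_pdf(theta)
    return total * h


items = [(1.0, -0.5), (1.5, 0.0), (0.8, 1.0)]  # hypothetical (a, b) pairs

# Sanity check: the marginal probabilities of all 2^3 patterns should sum to ~1.
total_mass = sum(marginal_likelihood(list(pat), items) for pat in product([0, 1], repeat=3))
print(f"total probability over all patterns: {total_mass:.6f}")

# The same grid resolution in k dimensions needs 61**k evaluations per pattern.
for k in (1, 2, 5, 10):
    print(f"k={k:2d}  grid points={61**k:,}")
```

The last loop is the whole argument against quadrature for MIRT: the per-pattern cost is multiplied by the grid size for every extra dimension.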

To circumvent this limitation, we turn to Bayesian inference. Now, we are no longer searching for a single set of point estimates that maximize a likelihood function. Instead, we aim to characterize the full posterior distribution of the parameters given the observed data.

This offers two immediate advantages. First, it allows us to incorporate priors. We know, for instance, that discrimination parameters $a$ must be positive. We can enforce this with a half-normal or a log-normal prior, effectively regularizing the model and preventing the estimation from diverging to infinity on sparse data. Second, it handles the integration problem through simulation rather than deterministic approximation, which scales way better for high-dimensional models.

Gibbs Sampling and Metropolis-Hastings are the classical algorithms we've seen in many MCMC simulations. They explore the parameter space by iteratively sampling conditionally or proposing jumps to new states. While theoretically sound, in practice they often struggle with the high curvature of IRT models, leading to slow convergence or getting stuck in local modes.
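To show what "proposing jumps to new states" means in practice, here is a toy random-walk Metropolis sampler. It is deliberately limited: it samples only one student's $\theta$ under a 2PL model, and it assumes the item parameters are already known, whereas a real fit would sample items and students jointly. The step size, seed, and item values are arbitrary:

```python
import math
import random


def log_posterior(theta: float, responses, items) -> float:
    """Unnormalized log posterior of theta: N(0, 1) prior plus 2PL log-likelihood."""
    lp = -0.5 * theta * theta
    for x, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        lp += math.log(p if x == 1 else 1.0 - p)
    return lp


def metropolis(responses, items, n_samples: int = 5000, step: float = 0.8, seed: int = 0):
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, responses, items)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)           # random-walk jump
        candidate = log_posterior(proposal, responses, items)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:                    # accept or stay put
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


items = [(1.0, -1.0), (1.2, -0.5), (1.5, 0.0), (1.2, 0.5), (1.0, 1.0)]  # hypothetical
draws = metropolis([1, 1, 1, 0, 0], items)
kept = draws[1000:]  # discard burn-in
post_mean = sum(kept) / len(kept)
print(f"posterior mean of theta: {post_mean:.2f}")
```

In one dimension this works fine. The trouble described above appears when hundreds of correlated item and person parameters have to be explored with blind jumps like these.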

Hamiltonian Monte Carlo (HMC) is the modern standard. HMC uses the gradients of the log-posterior to guide the sampling. It simulates the trajectory of a particle moving across the landscape of the distribution. By utilizing the geometry of the target distribution, HMC explores the parameter space far more efficiently than random-walk behavior, making it viable for complex, hierarchical MIRT models.

One big flaw of MCMC methods is that, despite being faster than brute-force integration in higher dimensions, they are still really slow on large datasets. Scaling IRT training to that size requires Variational Inference (VI).

VI treats the integration problem as an optimization problem. Instead of sampling from the true posterior, we posit a simpler family of distributions and optimize its parameters to minimize the Kullback-Leibler (KL) divergence from the true posterior. Because the KL divergence effectively acts as a loss function, this is much, much faster, as it allows us to train on batches using gradient descent. However, because we are trying to shoehorn a simple distribution onto a posterior that might not be so simple, this approach is also less accurate.
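For reference, the standard identity behind this optimization can be written as follows, where $x$ is the observed data, $z$ stands for all latent quantities (abilities and item parameters), and $q_\phi$ is the approximating family. This notation is introduced here only for illustration:

$$ \log p(x) = \underbrace{\mathbb{E}_{q_\phi(z)}\left[\log p(x, z) - \log q_\phi(z)\right]}_{\text{ELBO}(\phi)} + \mathrm{KL}\left(q_\phi(z) \,\|\, p(z \mid x)\right) $$

Since $\log p(x)$ does not depend on $\phi$, maximizing the first term (the evidence lower bound, or ELBO) is equivalent to minimizing the KL term, and the ELBO only involves quantities we can actually evaluate.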

Differential Item Functioning (DIF)

One main assumption of any measurement system is invariance. The instrument should function identically regardless of who is being measured, given that their underlying latent trait is the same. A smart scale that reads different body fat percentages for people with different levels of hydration, despite them actually having the exact same body fat percentage, is a broken instrument. In psychometrics, the violation of this principle is known as Differential Item Functioning.

DIF is a form of bias. It occurs when individuals from different subgroups have a different probability of answering an item correctly, even after controlling for their actual ability level $\theta$. The item's parameters are not stable across populations. This indicates that the item is measuring not only the intended latent trait but also some other construct correlated with group membership. This is hardly avoidable in its entirety, but we can take steps during test design to mitigate it.

For example, consider a coding challenge that uses a complex, culturally specific sports analogy to explain the problem. A student from a different cultural background may struggle to parse the prompt, not because their algorithmic thinking or programming skill is weak, but because they lack the contextual knowledge. If we compare two students with the exact same underlying programming ability, the one who understands the analogy has a higher probability of success. The item is contaminated, and the signal-to-noise ratio is low, because it measures both coding ability and cultural context.

DIF can manifest in two distinct ways:

Uniform DIF: The advantage one group has over the other is consistent across all levels of ability. Using that same coding challenge example, we would find uniform DIF if we noticed that the ICCs for students that understand the sports analogy and for those who don't have very distinct $b$ parameters, but the curves themselves look very similar. Statistically, this means that there is an effect of group membership, but no interaction between group membership and skill.

Non-uniform DIF: Here, the advantage changes depending on the test-taker's ability level. The difference in the probability of answering correctly might be huge for low-ability students, but non-existent for high-ability students. In some cases, the advantage can even flip: group A has the advantage at the low-ability end, but group B has the advantage at the high-ability end. To notice this effect, we would check if the discrimination ($a$) of the item is different across groups. Visually, we will notice that the ICCs intersect at some point.

Detecting DIF involves ICC visualizations like the ones above, and statistical tests that check for parameter invariance across groups, which can range from something as simple as a logistic regression, to more complex approaches like the IRT likelihood ratio test.
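The visual check can be mimicked numerically. The sketch below assumes that 2PL parameters have already been estimated separately for a reference and a focal group (the values are made up) and simply looks at the sign of the gap between the two ICCs. It is a heuristic illustration, not a substitute for the statistical tests just mentioned:

```python
import math


def icc_2pl(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def dif_gaps(ref, focal, grid):
    """P_ref - P_focal at each ability level; ref and focal are (a, b) pairs."""
    return [icc_2pl(t, *ref) - icc_2pl(t, *focal) for t in grid]


def classify(gaps, tol: float = 1e-9) -> str:
    favours_ref = any(g > tol for g in gaps)
    favours_focal = any(g < -tol for g in gaps)
    if favours_ref and favours_focal:
        return "non-uniform (curves cross)"
    if favours_ref or favours_focal:
        return "uniform (one group favoured throughout)"
    return "no DIF"


grid = [x / 2 for x in range(-6, 7)]  # theta from -3 to 3

# Hypothetical group-specific estimates for the same item.
shifted_b = dif_gaps(ref=(1.2, 0.0), focal=(1.2, 0.7), grid=grid)    # same a, different b
different_a = dif_gaps(ref=(1.8, 0.0), focal=(0.7, 0.0), grid=grid)  # same b, different a

print("shifted difficulty:      ", classify(shifted_b))
print("different discrimination:", classify(different_a))
```

With real data the group-specific estimates are noisy, so a gap of a given size only matters if a test says it is unlikely to be sampling error.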

Computerized Adaptive Testing (CAT)

Traditional fixed-form tests are inherently inefficient. They administer the same sequence of items to every student, regardless of their performance. This means that a senior developer must waste time on trivial questions, while a junior developer becomes demoralized by a series of almost impossibly difficult challenges. In both cases, a good portion of items administered to each individual provide very little information about their true ability.

IRT models enable us to perform Computerized Adaptive Testing, which resolves this inefficiency. CATs treat the assessment as a real-time inference problem. The goal is to administer items one by one, and with each item-person interaction, determine which item should be administered next. This is possible because each interaction tells us something about the true skill level of the person, so with each step we can select items that would be more appropriate (those that decrease uncertainty of our estimate the most) for that person's skill level. It is a direct application of the Bayesian approach, but in real time.

A CAT operates as a continuous four-step loop:

Initialization: The test begins with a prior belief about the student's ability. Typically, this is a standard normal distribution, $N(0, 1)$, representing the average ability of the population. The initial $\theta$ estimate is set to the mean of this distribution.
Item Selection: This is the most important step. To select the best next item, we use Fisher information. The Item Information Function (IIF) is a measure of how much an item contributes to our knowledge of $\theta$. It is mathematically related to the slope of the ICC (for a binary item, the squared first derivative divided by $P(1 - P)$) and visually appears as a bell-shaped curve that peaks at the item's difficulty ($b$). An item provides maximum information for students whose ability estimate is close to the item's difficulty. The selection algorithm, therefore, chooses the available item that has the highest information value at the student's current estimated ability level. Naturally, an item with high discrimination will provide way more information near the difficulty level than an item with lower discrimination. For the 2PL model, Fisher information is defined as:

$$ I_j(\theta) = a_j^2 P_j(\theta)(1 - P_j(\theta)) $$

Notice the $P(1 - P)$ term. Information is maximized when $P = 0.5$ (maximum uncertainty). If a student has a 99% chance of getting an item right, this term approaches zero. The item provides no gradient for the model to update. This simple term explains why extremely easy and extremely hard questions are a waste.

Scoring and Updating: After the student responds, we perform the Bayesian update. The response (correct or incorrect) provides a new likelihood. We multiply this likelihood by our current prior distribution and renormalize to obtain a new, more precise posterior distribution for $\theta$, which in turn serves as the prior for the next item. The student's ability estimate is then updated to the mean or mode of this new posterior.
Termination: The loop continues until a stopping rule is met. This can be a fixed test length, but a more effective method is to use a standard error threshold. The standard error is the standard deviation of the posterior distribution of $\theta$. We terminate the test when this value falls below a predefined level of precision, for example, when the standard error of the estimate is less than $0.3$. This ensures that every student is measured with the same degree of statistical certainty.
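The four steps above can be sketched as a small simulation. Everything here is an assumption made for illustration: the item bank is randomly generated, the posterior is tracked on a fixed grid of $\theta$ values, the estimate is the posterior mean, and the student's answers are simulated from a known "true" ability:

```python
import math
import random


def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def information(theta, a, b):
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


GRID = [-4.0 + 0.05 * i for i in range(161)]  # discretized theta axis


def summarize(post):
    mean = sum(t * w for t, w in zip(GRID, post))
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post))
    return mean, math.sqrt(var)


def run_cat(bank, answer, se_target=0.3, max_items=30):
    # 1. Initialization: N(0, 1) prior on the grid.
    post = [math.exp(-0.5 * t * t) for t in GRID]
    z = sum(post)
    post = [w / z for w in post]
    theta_hat, se = summarize(post)
    remaining, used = dict(bank), []
    # 4. Termination: stop on precision, test length, or an empty bank.
    while remaining and len(used) < max_items and se > se_target:
        # 2. Item selection: maximum Fisher information at the current estimate.
        item = max(remaining, key=lambda j: information(theta_hat, *remaining[j]))
        a, b = remaining.pop(item)
        x = answer(item)
        # 3. Scoring and updating: prior times likelihood, then renormalize.
        post = [w * (icc_2pl(t, a, b) if x else 1.0 - icc_2pl(t, a, b))
                for t, w in zip(GRID, post)]
        z = sum(post)
        post = [w / z for w in post]
        theta_hat, se = summarize(post)
        used.append(item)
    return theta_hat, se, used


rng = random.Random(42)
bank = {j: (rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for j in range(200)}
TRUE_THETA = 1.0  # known only to the simulation


def simulated_answer(item):
    a, b = bank[item]
    return 1 if rng.random() < icc_2pl(TRUE_THETA, a, b) else 0


theta_hat, se, used = run_cat(bank, simulated_answer)
print(f"estimate={theta_hat:.2f}  se={se:.2f}  items used={len(used)}")
```

The number of items administered is an output of the procedure rather than an input, which is exactly the efficiency gain over a fixed-form test.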

Shadow Testing

The maximum-information selection algorithm described above is elegant, but it fails in most real world cases because it's greedy. Without constraints, it will create bizarre and invalid tests. For instance, it might select ten difficult algorithm questions in a row and completely ignore a student's knowledge of SQL or system design. It also tends to repeatedly select the few highest-information items from the bank (aka item dataset), compromising their security because students could leak them.

The solution to this is shadow testing. It is a constrained optimization framework that balances the goal of maximizing information with the practical requirements of a valid assessment.

The mechanism is very straightforward, although implementation complexity can definitely vary. At each step of the test, before an item is selected, the algorithm runs a background simulation. It attempts to build a full-length, valid test (the shadow test) that contains every item already administered and fills the remaining slots from the pool of items that have not yet been administered. This test must satisfy all content constraints, such as "must contain 3 Python questions", "must contain 2 SQL questions", or "total expected time to completion must be under 60 minutes"; whatever those may be. Only the items that could be selected as the next item while still allowing for the construction of a valid test are considered. From this reduced set, the algorithm then actually selects the item with the maximum information at the student's current $\theta$ estimate.

More precisely, assembling each shadow test is a Mixed Integer Programming (MIP) problem. Because the next item is always drawn from a feasible full-length test, every item administered is part of at least one possible valid test. This prevents the algorithm from backing itself into a corner where it cannot meet content constraints later in the test, and ensures that every student receives a test that is not only adapted to their ability but also balanced and fair in its content coverage.
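On a toy bank, the idea can be sketched by brute force. The code below enumerates every way of completing the test, which stands in for a real MIP solver and would not scale beyond a handful of items. The bank, the topic labels, and the single "minimum items per topic" constraint are hypothetical:

```python
import math
from itertools import combinations


def information(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def shadow_next_item(bank, administered, theta_hat, test_length, min_per_topic):
    """Pick the next item from the best feasible full-length test.

    bank maps item id -> (a, b, topic). Returns None if no valid test exists.
    """
    free = [j for j in bank if j not in administered]
    slots = test_length - len(administered)
    best_value, best_extra = -1.0, None
    for extra in combinations(free, slots):
        test = list(administered) + list(extra)
        counts = {}
        for j in test:
            topic = bank[j][2]
            counts[topic] = counts.get(topic, 0) + 1
        if any(counts.get(t, 0) < need for t, need in min_per_topic.items()):
            continue  # this completion violates a content constraint
        value = sum(information(theta_hat, *bank[j][:2]) for j in test)
        if value > best_value:
            best_value, best_extra = value, extra
    if best_extra is None:
        return None
    # Administer the most informative not-yet-used item of the shadow test.
    return max(best_extra, key=lambda j: information(theta_hat, *bank[j][:2]))


bank = {
    "py1": (2.0, 0.0, "python"), "py2": (1.9, 0.1, "python"),
    "py3": (1.8, -0.1, "python"), "py4": (1.7, 0.2, "python"),
    "sql1": (0.9, 0.0, "sql"), "sql2": (0.8, 0.3, "sql"), "sql3": (0.7, -0.4, "sql"),
}
administered = ["py1", "py2"]
constraints = {"python": 2, "sql": 2}

greedy_pick = max((j for j in bank if j not in administered),
                  key=lambda j: information(0.0, *bank[j][:2]))
shadow_pick = shadow_next_item(bank, administered, 0.0, 4, constraints)
print("greedy pick:", greedy_pick)
print("shadow pick:", shadow_pick)
```

With two Python items already used and only two slots left, the greedy rule still reaches for another Python item, while the shadow test forces an SQL item so that the content constraint can still be met.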

Implementation

For decades, the implementation of advanced psychometric models has been confined to two ecosystems: academic R packages like mirt and expensive commercial software like Winsteps. While these tools are statistically robust and widely used by researchers, they are unfit for modern, scalable production pipelines. The R language lacks the robust tooling for environment management, dependency resolution, and microservice deployment. It is a language that has a solid place in research, but that's about it.

So, this leaves us with Python. The language is considerably more mature than R when it comes to general-purpose programming, and it is widely used in software development. Generally, Python's third-party ecosystem is great for data science applications too. However, psychometrics is arguably a niche within a niche. Several attempts at psychometric libraries have been made over the years, like py-irt or girth, but they still lack the maturity necessary for production applications. Not only that, most of them are abandonware at the time of writing.

This leaves a gap for data science teams aiming to deploy these models at scale.

One solution right now is to directly bypass specialized dependencies and build the models from scratch. Well, not really, we will rely on libraries because Python is practically unusable in and of itself, but we will need to implement the models from basic building blocks, similarly to how we use pytorch as compared to something with a higher level API like keras.

Depending on the modeling approach (MLE integration, Bayesian modeling, VI), the implementation details will vary. However, in this instance, I will be using the Bayesian approach as an example, since it is what I'm most comfortable with.

PyMC

This library is ideal for MCMC-based estimation. The implementation involves specifying the model's generative process directly in code.

Let's imagine that we are building a multiple-choice test with four response alternatives. A 3PL model would look like this:

Define Item Parameter Priors: We set prior distributions for the item parameters. These act as a form of regularization, but the neat thing is that we can also incorporate our domain knowledge.
- a ~ LogNormal(0, 0.5): Discrimination must be positive, and we expect most values to be reasonably close to $1.0$ (the prior median), so we use a fairly opinionated prior. We could also use a half-normal prior such as HalfNormal(0.5) for this, although the log-normal distribution converges better in most cases.
- b ~ Normal(0, 1.5): Difficulty is centered around the mean ability, with a reasonable standard deviation.
- c ~ 0.25: Baseline (guessing) is about $\frac{1}{4}$, assuming our items are well designed and that there are no obviously wrong options which the students could easily discard. However, if we wanted to infer this parameter too, we'd use something like Beta(5, 15), which is a fairly strong prior centered around our $\frac{1}{4}$ estimate, and also bounds the parameter ($0 \le c \le 1$).
```
with pm.Model() as irt_3pl:
    a = pm.LogNormal("a", mu=0, sigma=0.5, shape=num_items)
    b = pm.Normal("b", mu=0, sigma=1.5, shape=num_items)
    c = pm.Beta("c", alpha=5, beta=15, shape=num_items)

    # c = 0.25 # Or we make it constant if we don't have enough data
```
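As a quick sanity check on what these priors actually imply, their summary statistics can be computed in closed form. The snippet below is only a back-of-the-envelope aid using the standard log-normal and beta formulas. The variable names are illustrative.

```python
from math import exp, sqrt

# LogNormal(mu=0, sigma=0.5) prior on the discrimination parameter a
mu, sigma = 0.0, 0.5
a_median = exp(mu)                      # 1.0
a_mean = exp(mu + sigma ** 2 / 2)       # about 1.13
a_low = exp(mu - 1.96 * sigma)          # about 0.38 (lower end of the central 95%)
a_high = exp(mu + 1.96 * sigma)         # about 2.66 (upper end of the central 95%)

# Beta(5, 15) prior on the guessing parameter c
alpha, beta = 5, 15
c_mean = alpha / (alpha + beta)         # 0.25
c_sd = sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))  # about 0.09

print(f"a: median={a_median:.2f}, mean={a_mean:.2f}, 95% in [{a_low:.2f}, {a_high:.2f}]")
print(f"c: mean={c_mean:.2f}, sd={c_sd:.3f}")
```

The discrimination prior is centered on $1.0$ but still leaves room for items that discriminate noticeably better or worse, while the guessing prior concentrates tightly around $\frac{1}{4}$.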

Define the Latent Variable Prior: The student ability parameter is also a random variable, which could be defined as theta ~ Normal(0, 1).

```
with irt_3pl:
    # We give it a shape of (num_students, 1) to allow for matrix broadcasting
    # later when we subtract the item difficulty "b".
    theta = pm.Normal('theta', mu=0, sigma=1, shape=(num_students, 1))
```

Define the Likelihood: This is the logistic function that links the parameters to the observed data.
- p = c + (1 - c) * invlogit(a * (theta - b)): This is the ICC formula we've seen before. Here, the only piece that might stand out if you've never worked with PyMC is the invlogit, which is actually just the sigmoid function. We use it to squash the unbounded $a(\theta - b)$ term into the $(0, 1)$ interval, because the result has to be a probability. You can visualize this better if you take another look at the 3PL formula:
$$ P(X_{ju} = 1|\theta_u,a_j,b_j,c_j) = c_j + (1 - c_j) \frac{1}{1 + e^{-a_j(\theta_u - b_j)}} $$
- observed_responses ~ Bernoulli(p=p): This is the likelihood distribution itself. Remember, we are dealing with success rates (1s and 0s), which is why Bernoulli is used here.
```
with irt_3pl:
    p = c + (1 - c) * pm.math.invlogit(a * (theta - b))
    
    # "interactions" is our matrix containing 1s (correct) and 0s (incorrect) for each item-person pairing
    observed_responses = pm.Bernoulli(
        'observed_responses', 
        p=p, 
        observed=interactions
    )
```

The PyMC engine, typically a NUTS sampler, then explores the parameter space to generate the full posterior distributions for every $a$, $b$, $c$ (optionally) and $\theta$ parameter in the model. The entire thing will look something like this:

```
import pymc as pm
import numpy as np
import pandas as pd

# Rows are students, columns are items; every cell is 1 (correct) or 0 (incorrect)
interactions = pd.read_csv("response_matrix.csv").to_numpy()

num_students, num_items = interactions.shape

with pm.Model() as irt_3pl:
    # Item parameter priors
    a = pm.LogNormal('a', mu=0, sigma=0.5, shape=num_items)
    b = pm.Normal('b', mu=0, sigma=1.5, shape=num_items)
    c = pm.Beta("c", alpha=5, beta=15, shape=num_items)

    # Latent ability prior, shaped (num_students, 1) so it broadcasts against the items
    theta = pm.Normal('theta', mu=0, sigma=1, shape=(num_students, 1))

    # 3PL likelihood
    p = c + (1 - c) * pm.math.invlogit(a * (theta - b))
    observed_responses = pm.Bernoulli('observed_responses', p=p, observed=interactions)

    # PyMC will automatically select the NUTS sampler for continuous variables
    idata = pm.sample(draws=1000, tune=1000, chains=4, target_accept=0.9)

    # We could use HMC by passing it as an argument, although NUTS is usually better
    # hmc_step = pm.HamiltonianMC()
    # idata = pm.sample(draws=1000, tune=1000, chains=4, step=hmc_step)
```

A couple of technical notes regarding the code above:

- For the sake of simplicity, the code assumes a dense matrix where every student answered every question. This, however, is not always the case. If you're reading this far into the article, hopefully you've already gained some insight into why multiple-choice tests penalize students when they answer incorrectly; it's because we need to counteract the baseline guessing ($c$). This, however, usually means that some students don't risk it when they are uncertain, and leave some questions blank. To deal with this kind of sparse data, we would need to mask the input or pass coordinate-format arrays, which makes the implementation a little more complex.
- The difference between HMC and NUTS is a deeper topic and beyond the scope of this article; but in standard HMC, the sampler simulates the physics of a particle gliding across the parameter space. To do this, it needs to know how long to let the particle glide before stopping and taking a sample. This is a hyperparameter we must tune (actually, it's two hyperparameters), and it functions similarly to a learning rate. The NUTS sampler, however, does this for us, in a way. It is able to decide this gliding length automatically during the simulation, which is why it is usually preferred.
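For the sparse-data note above, the coordinate-format idea can be sketched without any modeling library. The snippet assumes blanks are stored as `None`; the helper names are hypothetical. It flattens the response matrix into three parallel arrays and evaluates the 3PL log-likelihood only over the answers that actually exist.

```python
from math import exp, log

def to_long_format(matrix):
    """Flatten a response matrix (None = blank) into coordinate-format arrays."""
    student_idx, item_idx, y = [], [], []
    for u, row in enumerate(matrix):
        for j, x in enumerate(row):
            if x is not None:
                student_idx.append(u)
                item_idx.append(j)
                y.append(x)
    return student_idx, item_idx, y

def log_likelihood(theta, a, b, c, student_idx, item_idx, y):
    """3PL log-likelihood summed over observed (student, item) pairs only."""
    total = 0.0
    for u, j, x in zip(student_idx, item_idx, y):
        p = c[j] + (1 - c[j]) / (1 + exp(-a[j] * (theta[u] - b[j])))
        total += log(p) if x == 1 else log(1 - p)
    return total

# Two students, three items; each student left one question blank.
matrix = [
    [1, None, 0],
    [None, 1, 1],
]
student_idx, item_idx, y = to_long_format(matrix)
ll = log_likelihood(
    theta=[0.0, 1.0],
    a=[1.0, 1.0, 1.0], b=[0.0, 0.0, 0.0], c=[0.25, 0.25, 0.25],
    student_idx=student_idx, item_idx=item_idx, y=y,
)
```

In a probabilistic model, the same index arrays would typically be used to select one $\theta$ and one set of item parameters per observed response, so the likelihood becomes a flat vector instead of a students-by-items matrix.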

Practical Applications

We have covered the math, the code, and the optimization strategies. But why go through all this trouble? Why not just sum up the correct answers and call it a day? The answer was actually introduced at the very beginning of this article. The value of IRT is that it acknowledges the interaction between the instrument and the very thing it is trying to measure. It turns these interactions into precise, continuous signals. So, if that idea is finally beginning to make sense, here is how that capability translates across different use cases.

EdTech

To my knowledge, this is the context in which psychometrics is usually taught in psychology bachelor's degrees. At least, it is certainly where I was first introduced to the field. In modern adaptive learning platforms, IRT does more than just grade students. It is actually what powers the recommendation engines that are so valuable to the userbase.

By using CAT techniques and estimating a student’s $\theta$ in real-time, platforms can keep learners in what's called the Zone of Proximal Development. This is a psychology term/idea attributed to Vygotsky, but it is basically the sweet spot where a task is challenging enough to be engaging, but not so difficult that it causes frustration. If a student fails a high-discrimination item on quadratic equations, the system infers a drop in that specific latent dimension and immediately routes the user to related study or practice content. Or it just asks a simpler question. This turns assessment into a continuous feedback loop, and it's the main idea behind learning apps like Duolingo.

Hiring

Technical hiring is notoriously broken. It is often plagued by false negatives (rejecting good candidates because they failed a trivia question or missed some obscure language quirk) and false positives (hiring candidates who memorized LeetCode patterns without actually understanding DSA).

IRT allows hiring platforms to calibrate their item banks. By analyzing the discrimination parameter ($a$) of interview questions and coding tasks, companies can identify which challenges actually correlate with engineering talent and which ones are just noise. Moreover, using MIRT allows us to disentangle traits like general coding skill from specific language proficiency. A non-native English speaker might fail a wordy system design prompt not because they can't design a system, but because the item has a high friction parameter ($d$) related to reading comprehension. Modeling this explicitly makes the hiring funnel more predictive and explicable, and it can also inform technical test design itself.

Gaming

This may be a little outside of my comfort zone, because I've never worked in the gaming industry and there's so much more complexity than what I can possibly realize as an outsider. That said, most of the games that I personally enjoy are competitive multiplayer games. From what I understand, matchmaking algorithms are perhaps the largest scale deployment of IRT in the world. Systems like TrueSkill are effectively Bayesian IRT models where the item is just another opponent.

When a player enters a match, the system predicts the outcome based on the difference between their skill distributions ($\theta_1$ vs $\theta_2$). If a high-skill player defeats a low-skill player, the information gain is near zero (the outcome was expected), so the posterior distributions barely move. However, if a low-ranking player defeats someone a few ranks above them, the model registers a massive surprise (high gradient) and drastically updates the $\theta$ estimates for both players. This dynamic updating is basically a two-way CAT, and it allows games to rapidly converge on a player's true skill, minimizing the number of unbalanced matches they have to play.
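The "surprise drives the update" idea can be illustrated with a much simpler rating rule. The sketch below is a plain Elo-style logistic update on point estimates, not TrueSkill itself (which also tracks an uncertainty for every player). The step size `k` and the function names are assumptions.

```python
from math import exp

def expected_win(theta_1, theta_2):
    """Probability that player 1 beats player 2 under a logistic model."""
    return 1 / (1 + exp(-(theta_1 - theta_2)))

def update(theta_1, theta_2, player_1_won, k=0.5):
    """Shift both ratings in proportion to how surprising the outcome was."""
    surprise = (1.0 if player_1_won else 0.0) - expected_win(theta_1, theta_2)
    return theta_1 + k * surprise, theta_2 - k * surprise

# A strong player (theta = 2.0) faces a weak one (theta = 0.0).
expected = update(2.0, 0.0, player_1_won=True)   # favorite wins: small shift
upset = update(2.0, 0.0, player_1_won=False)     # underdog wins: large shift
```

When the favorite wins, the observed outcome is close to the prediction and both ratings barely move. When the underdog wins, the gap between outcome and prediction is large, so both estimates shift substantially.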

Brand Affinity

Marketing teams often rely on Net Promoter Score (NPS), which is a single, noisy data point. But trust and affinity are latent variables that manifest through dozens of small behaviors: opening a newsletter, rating a purchase, returning an item, or clicking an ad, just to name a few.

We can model these actions as items in an IRT framework. A click on a generic discount email might be an easy item (low difficulty $b$), while clicking through a company tech blog might be a hard item (high difficulty $b$). A user who engages with the high-difficulty content is demonstrating a much higher level of the latent brand affinity trait than someone who only engages with the low-hanging fruit. By scoring users on this latent scale, companies can segment their audiences with far more nuance than simple behavioral aggregates would allow.

Case Study: Team Partitioning

Mon, 12 Jan 2026 00:00:00 GMT

In our high-intensity technical bootcamps, week-long projects serve as a critical active break from the fast-paced curriculum. These sprints have a dual pedagogical purpose: they allow students to consolidate recently taught material and build core competencies through hands-on application. While the primary goal is learning and skill acquisition, these projects also build the portfolios students later use to secure employment. During this week, direct instruction is paused, and students work with considerable autonomy. In this environment, success depends less on individual brilliance and more on the effectiveness of group dynamics.

Historically, cohorts of 15-25 students were partitioned into teams manually by the teaching staff. This process was guided entirely by experience, familiarity with students, and professional intuition. However, this manual approach had two fundamental weaknesses. First, it was unscalable; forming teams for a single cohort could consume several hours of an educator's time. Balancing out all the different variables, from skill level to availability and thematic preference, was not an easy task, and in some more complex cohorts it could even take up an entire day. Second, it was susceptible to bias. While a teacher's assessment of competency is valuable, the method relied too heavily on subjective beliefs about what constitutes an effective team, often leading to suboptimal groupings.

These limitations frequently resulted in preventable interpersonal conflicts. While some issues stemmed from personality clashes, my observations indicated that the root cause was rarely personal. It was structural.

The Friction

Through observation, I identified that the primary driver of team friction was the magnitude of the skill gap between the strongest and the weakest member. As this gap widened, the peer-to-peer dynamic weakened. This triggered a compounding cycle of dysfunction:

Carrying: Advanced students often worked at a significantly faster pace. Driven by their own personal standards and strict project deadlines, they felt compelled to complete the complex architectural work alone. This created a dual failure mode. It led to burnout for the lead student, since they were effectively carrying the team's weight on their shoulders. Simultaneously, it benched the rest of the team, denying them the opportunity to interact with the core codebase and preventing them from gaining the hands-on experience the project was meant to provide.
Disengagement: The least experienced students in high-gap groups often felt paralyzed. The speed at which advanced members solved problems was overwhelming. Moreover, the complexity of the challenges they self-imposed was often beyond the reach of less experienced peers, which contributed to a sense of imposter syndrome. Ultimately, instead of asking questions and trying to keep pace, the gap was so large that these students disengaged to avoid slowing down the team. The result was that they effectively learned nothing during the project.

Both effects are symptoms of a breakdown in collaboration. Beyond impeding skill acquisition, in extreme scenarios, this dynamic led to students dropping out altogether.

Status Quo

The prevalence of these issues was not accidental. It was a direct result of trying to map a corporate solution onto an educational problem.

In a corporate setting, professional teams follow a division of labor model. Tech companies organize employees into small, agile groups where each member is responsible for a specific part of the product. These teams are effective because they are generally balanced in terms of professional competence and are assembled through hiring filters designed to find specific talent.

Historically, the teaching staff tried to mirror this dynamic. They deliberately constructed heterogeneous groups, pairing stronger lead students with less experienced peers, expecting them to work similarly to how a professional team would. The rationale was that this natural role differentiation would allow teams to build more complex applications, producing better portfolio pieces for their job hunt and our marketing campaigns. It was also assumed that this grouping of students would serve as a direct practice for workflows found in professional environments.

However, a corporate environment is fundamentally different from a pedagogical one. A business can use its hiring process to assemble a team with somewhat comparable skill levels. Furthermore, businesses often work in a highly hierarchical structure as well, which enables a junior-senior kind of dynamic. An educational institution, by contrast, admits students based on broader criteria, creating a wider spectrum of initial skills. More importantly, student teams lack the hierarchical structure required to absorb those disparities. In a student team, the structure is flat. Even the most advanced student is still a student. Their primary goal is to be challenged and learn, not to manage a junior developer. They lack the authority to direct their peers and the experience to mentor them effectively.

Applying this model to this flat and constrained system introduced critical failure points. The goal of a professional team is to deliver a product, while the goal of a student team is to learn. When students in these mixed-skill groups attempted to split the work to mimic a corporate structure, they struggled to integrate their contributions. This created a dependency on expert mediation that was fundamentally broken. The issue was not just the volume of tutoring required during project weeks, but also the very nature of student behavior during a crisis. Crucially, when students hit a wall, they rarely sought help proactively. Instead, they tended to freeze and stay silent, rendering timely intervention impossible. Moreover, the argument that role diversity produces better portfolio projects proved misleading. That benefit assumes a long-term project where teams have time to build chemistry and workflows. In a single-week sprint, there is no runway for that cohesion to develop.

Homogeneity

This realization led to a critical decision: to prioritize the learning process over the final deliverable. I chose to abandon the corporate simulation. Instead, the focus was placed strictly on maximizing intra-group skill homogeneity.

The core logic is straightforward. When students are at a comparable skill level, they are forced to confront challenges together. Because they operate with a similar mental model of the problem, communication increases. They cannot rely on a senior team member to solve the problem for them; they have to work together, learn together. While a corporate team collaborates to deliver, these student teams collaborate to survive the challenge. This approach prevents the carrying effect and ensures every team member remains an active participant.

The Engineering Goal

The objective was to operationalize these psychological observations into a reproducible system. My efforts focused on replacing manual intuition with an optimization engine designed primarily for risk reduction in social dynamics.

By strictly bounding the skill range within a team, the system enforces a baseline for effective technical communication. The priority was to minimize the probability of interpersonal conflict and isolation, ensuring that every student had a peer within their immediate zone of development.

Crucially, the system design relied on a foundational hypothesis: that team health is the leading indicator for all other success metrics. I operated under the assumption that a psychologically safe environment would trigger a cascade of downstream effects, driving deeper peer-to-peer learning and higher student satisfaction (NPS). I also hypothesized that the tangible success of delivering a high-quality project, built through genuine collaboration, would bolster student self-efficacy. By proving to themselves what they were capable of, students would approach the complexity of subsequent material with increased confidence and momentum.

Mathematical Formalization

To solve this, I proposed to transition from subjective intuition to a formal optimization model. The grouping challenge was framed as a Multi-Objective Set Partitioning Problem (SPP).

Given a set of students $S = \{s_1, s_2, \dots, s_n\}$, the goal is to find a partition $P = \{T_1, T_2, \dots, T_k\}$ such that every student belongs to exactly one subset (team) $T$, satisfying specific size constraints while maximizing a global utility function.

The Search Space

The complexity of this problem precludes brute-force solutions. For a standard cohort of $N=25$ partitioned into groups of sizes 3 to 5, the search space is discrete, non-convex, and combinatorially explosive. This landscape justifies the use of metaheuristic approaches over deterministic solvers, as finding the global optimum is less critical than finding a robust, "good enough" local optimum within a reasonable timeframe.
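To get a feel for the size of this search space, the number of valid partitions can be counted exactly with a short recurrence: fix the team of the first student, choose its remaining members, and recurse on everyone else. The function name is illustrative.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_partitions(n, min_size=3, max_size=5):
    """Number of ways to split n labeled students into teams of allowed sizes."""
    if n == 0:
        return 1  # one way to partition nobody: the empty partition
    total = 0
    for size in range(min_size, max_size + 1):
        if size <= n:
            # Pick the teammates of the first student, then partition the rest.
            total += comb(n - 1, size - 1) * count_partitions(n - size, min_size, max_size)
    return total

print(f"Valid partitions for a cohort of 25: {count_partitions(25):,}")
```

Partitions into five teams of five alone already number in the trillions for a cohort of 25, and the total over all allowed size combinations is larger still, which is why exhaustive evaluation is not an option.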

Feature Engineering

A core engineering challenge was dimensionality reduction. To make the problem tractable, I engineered a composite scalar metric called Workforce ($W$).

I defined $W$ for a student $i$ as the product of their composite skill level and their dedicated effort:

$$ W_i = \text{Skill}_i \times \text{Effort}_i $$

Where:

Effort: Total hours the student committed to dedicating to the project.
Skill: An unweighted product of grades, tutor assessment, and the student's self-efficacy (measured by an item in a survey).

The Abstraction Trade-off: This product formula introduces a deliberate abstraction. A score of $W=1000$ could result from a high-skill student with limited hours, from a novice student with massive dedication, or even from a low-skilled student with inflated self-efficacy. From a resource allocation perspective, I treated technical talent and time as fungible assets; either can be used to "pay" for the project's completion. Empirical testing during the PoC phase showed that the model remained robust with this simplified engineering feature. While this abstraction theoretically permits teams with high skill variances, the bootcamp's admissions process effectively neutralizes this edge case. Since all students are pre-filtered for high commitment, $\text{Effort}$ variance is minimal in practice, leaving $\text{Skill}$ as the principal component of $W$.
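The trade-off is easy to see with two made-up students. The sketch below follows the definitions above, but the `Student` class, the field names, and the numeric scales are all hypothetical, since the actual scales of each component are not specified here.

```python
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    grades: float            # hypothetical scale
    tutor_assessment: float  # hypothetical scale
    self_efficacy: float     # hypothetical scale, from a survey item
    effort_hours: float      # hours committed to the project

    @property
    def skill(self) -> float:
        # Unweighted product of the three skill components
        return self.grades * self.tutor_assessment * self.self_efficacy

    @property
    def workforce(self) -> float:
        # W = Skill x Effort
        return self.skill * self.effort_hours

expert = Student("expert", grades=5.0, tutor_assessment=5.0, self_efficacy=2.0, effort_hours=20)
novice = Student("novice", grades=2.5, tutor_assessment=2.0, self_efficacy=5.0, effort_hours=40)
```

Both students end up with $W = 1000$: the first through high skill and limited hours, the second through lower grades, high self-efficacy, and twice the time commitment.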

The Fitness Function

I defined a parametrized fitness function $F(P)$ to evaluate the quality of a partition. The function is a weighted sum of five objectives, where each weight can be manually calibrated for alignment with business goals.

$$ F(P) = \sum_{j=1}^{5} w_j \cdot O_j $$

I. Intra-Group Homogeneity ($O_1$) — Primary Priority

To minimize the carrying effect, we minimize the range between the strongest and weakest member of each team. For a team $T$, let $W_{max}$ and $W_{min}$ be the maximum and minimum Workforce scores. The homogeneity score calculates the normalized tightness of this range:

$$ O_1 = \frac{1}{|P|}\sum_{T \in P} \left( 1 - \frac{W_{max} - W_{min}}{W_{max}} \right) $$
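A direct translation of this objective, as a minimal sketch: each team is represented as a plain list of Workforce scores (an assumed representation), and all scores are assumed to be strictly positive.

```python
def homogeneity(partition):
    """O_1: mean normalized tightness of the Workforce range across teams.

    partition: list of teams, each a list of positive Workforce scores.
    """
    scores = []
    for team in partition:
        w_max, w_min = max(team), min(team)
        scores.append(1 - (w_max - w_min) / w_max)
    return sum(scores) / len(scores)

balanced = [[100, 100, 100], [80, 80, 80]]
mixed = [[100, 100, 100], [50, 100, 75]]
o1_balanced = homogeneity(balanced)
o1_mixed = homogeneity(mixed)
```

Note that the term inside the sum simplifies to $W_{min} / W_{max}$, so a team scores $1$ when all members have the same Workforce, and the score approaches $0$ as the weakest member falls far behind the strongest.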

II. Temporal Synchronization ($O_2$) — Secondary Priority

We mathematically model student availability as a discrete set of time slots. To ensure collaboration is logistically possible, we maximize the global Jaccard Index of the team. Crucially, this is calculated as the intersection of availability for all members against their union (not pairwise averages), ensuring strictly common slots for the entire group:

$$ O_2 = \frac{1}{|P|}\sum_{T \in P} \left( \frac{| \bigcap_{s \in T} A_s |}{| \bigcup_{s \in T} A_s |} \right) $$
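The difference between the group-wide index and a pairwise average is easiest to see in code. In this sketch, availability is a set of slot labels per student (an assumed encoding), and the pairwise variant is included only for contrast.

```python
from itertools import combinations

def group_jaccard(availabilities):
    """Slots shared by ALL members divided by slots available to ANY member."""
    common = set.intersection(*availabilities)
    union = set.union(*availabilities)
    return len(common) / len(union) if union else 0.0

def pairwise_jaccard(availabilities):
    """Average Jaccard index over all pairs of members (shown for contrast)."""
    pairs = list(combinations(availabilities, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

def temporal_sync(partition):
    """O_2: mean group-wide Jaccard index across teams."""
    return sum(group_jaccard(team) for team in partition) / len(partition)

# Every pair overlaps on one slot, but no slot works for all three members.
rotating = [{"mon", "tue"}, {"tue", "wed"}, {"wed", "mon"}]
group_score = group_jaccard(rotating)
pair_score = pairwise_jaccard(rotating)
```

Here the pairwise average looks acceptable even though the team can never meet as a whole, while the group-wide index correctly scores it as zero.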

III. Inter-Group Balance ($O_3$) — Tertiary Priority

To ensure fairness, we minimize the deviation of each team's total capacity from the cohort target ($\tau$). This was modeled as a ratio to ensure a normalized score between 0 and 1:

$$ O_3 = \frac{1}{|P|}\sum_{T \in P} \min \left( \frac{\sum W_s}{\tau}, \frac{\tau}{\sum W_s} \right) $$
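A small sketch of this objective follows. How $\tau$ is derived is not spelled out above, so the code assumes it defaults to the mean team total across the partition; passing an explicit `target` overrides that assumption.

```python
def balance(partition, target=None):
    """O_3: mean closeness of each team's total Workforce to the target tau.

    partition: list of teams, each a list of positive Workforce scores.
    target: cohort target tau; defaults to the mean team total (an assumption).
    """
    totals = [sum(team) for team in partition]
    tau = target if target is not None else sum(totals) / len(totals)
    return sum(min(t / tau, tau / t) for t in totals) / len(totals)

even = [[100, 100, 100], [100, 100, 100]]
uneven = [[100, 100, 100], [50, 50, 50]]
o3_even = balance(even)
o3_uneven = balance(uneven)
```

Taking the minimum of the ratio and its inverse penalizes teams that are over the target just as it penalizes teams that are under it, and keeps each term between 0 and 1.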

IV & V. Social Boosters ($O_4, O_5$) — Low Priority

Finally, the algorithm considers Affinity (shared interests/hobbies) and Geography (location matches) as first-class citizens in the optimization loop, albeit with significantly lower weights. These act as soft guides for the solver when the primary mathematical constraints are equally met by multiple candidates. Both were modeled as a slight modification of the global Jaccard Index used for temporal synchronization. Specifically, each was modeled as a sloped all-or-nothing fitness metric. To prevent the formation of teams that can potentially marginalize a minority member, the metric penalizes partial matches. This prioritizes scenarios where all members share an attribute, rather than just a subset. The $\lambda$ parameter was used to control the slope, and it was set to $0.7$.

$$ O_{4,5} = \frac{1}{|P|}\sum_{T \in P} \left( \frac{| \bigcap_{s \in T} B_s | + \lambda (| \bigcup_{s \in T} B_s | - 1)}{| \bigcup_{s \in T} B_s |} \right) $$

Constraints

The optimization engine operates within strict boundaries:

Topology Constraint: Team sizes must be strictly bounded between 3 and 5 members ($3 \le |T| \le 5$).
Inclusivity Constraint: $\bigcup_i T_i = S$ and $T_i \cap T_j = \emptyset$ for all $i \neq j$. Every student must appear exactly once.
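Both constraints can be checked with a few lines of code. This is a minimal sketch in which teams are lists of student identifiers; the function name is hypothetical.

```python
def is_valid_partition(partition, students, min_size=3, max_size=5):
    """Check the topology and inclusivity constraints for a candidate partition."""
    assigned = [s for team in partition for s in team]
    sizes_ok = all(min_size <= len(team) <= max_size for team in partition)
    no_duplicates = len(assigned) == len(set(assigned))  # teams are disjoint
    everyone_placed = set(assigned) == set(students)     # union covers the cohort
    return sizes_ok and no_duplicates and everyone_placed

students = list(range(10))
valid = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
too_small = [[0, 1], [2, 3, 4, 5], [6, 7, 8, 9]]
duplicated = [[0, 1, 2], [2, 3, 4, 5], [6, 7, 8, 9]]
```

A check like this is mostly useful as a guard in tests, since operators that preserve validity by construction never need to repair a candidate.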

The Algorithmic Strategy

With the mathematical objective defined, I now needed a solver capable of traversing the discrete, non-convex search space. Traditional gradient-based methods were inapplicable, as no gradients exist over set partitions, and brute force was too computationally expensive to be an option.

I selected an Evolutionary Strategy approach. Unlike deterministic algorithms, an ES embraces stochasticity to escape local optima, iteratively refining a population of candidate solutions toward the global maximum.

Constraint Preservation

Standard Genetic Algorithms typically rely on Crossover (Sexual Reproduction), combining parts of Parent A and Parent B to create an offspring.

In the context of Set Partitioning, Crossover is structurally destructive. Merging half of the teams from Partition A with half from Partition B almost invariably results in an invalid state:

Duplication: Student $X$ appears in both halves.
Omission: Student $Y$ appears in neither.

Repairing these invalid chromosomes is computationally expensive and, more importantly, it biases the search. Therefore, I engineered an Asexual Evolutionary Engine. Instead of mating, the system relies on Mitosis (cloning) followed by high-chance ($p \ge 0.9$) mutations. The intelligence of the search is not in the combination of solutions, but in the specific design of the mutation operators themselves.

Lifecycle

The engine operates on a strict generational loop designed to balance stability (exploiting good solutions) with pressure (exploring new ones):

Evaluation: Every candidate partition is scored using the Fitness Function $F(P)$.
Selection: To prevent population explosion (or extinction), we enforce a strict survival rate of 50%.
- Elitism: The top 1% of solutions survive automatically and unconditionally. This ensures that the best-known configuration is never lost due to random chance.
- Rank-Biased Probabilistic Survival: The remaining slots are filled stochastically based on rank. While higher-fitness candidates have a higher probability of survival, lower-fitness candidates still retain a non-zero chance of passing to the next generation. This mechanism is critical to maintain genetic diversity, preventing the algorithm from converging prematurely on a local optimum that is good but not great.
Mitosis: Surviving candidates clone themselves to replenish the population back to capacity.
Mutation: Clones undergo stochastic modification detailed below.
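One generation of this lifecycle might look roughly like the sketch below. It is an illustration under stated assumptions rather than the original implementation: the linear rank weighting is one possible way to bias survival by rank, and the `fitness` and `mutate` callables are placeholders supplied by the caller.

```python
import copy
import random

def next_generation(population, fitness, mutate, rng,
                    survival_rate=0.5, elite_rate=0.01, mutation_prob=0.9):
    """Run evaluation, selection, mitosis, and mutation once."""
    capacity = len(population)
    ranked = sorted(population, key=fitness, reverse=True)  # evaluation
    n_survivors = max(1, int(capacity * survival_rate))
    n_elite = max(1, int(capacity * elite_rate))

    survivors = ranked[:n_elite]  # elitism: the best always survive
    pool = ranked[n_elite:]
    while len(survivors) < n_survivors and pool:
        # Rank-biased survival: better ranks get larger weights (assumed linear),
        # but every remaining candidate keeps a non-zero chance.
        weights = [len(pool) - i for i in range(len(pool))]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        survivors.append(pool.pop(idx))

    offspring = []
    while len(survivors) + len(offspring) < capacity:
        parent = survivors[len(offspring) % len(survivors)]
        clone = copy.deepcopy(parent)  # mitosis
        if rng.random() < mutation_prob:
            clone = mutate(clone, rng)  # mutation
        offspring.append(clone)
    return survivors + offspring

# Toy run: candidates are plain integers and fitness is the value itself.
rng = random.Random(0)
toy_population = list(range(100))
toy_next = next_generation(toy_population, fitness=lambda c: c,
                           mutate=lambda c, r: c - 1000, rng=rng, mutation_prob=1.0)
```

With a 50% survival rate, each survivor produces exactly one clone, so the population returns to capacity after every generation.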

Mutation

A critical feature of this system is that the number of teams ($k$) is not fixed; it is a variable to be optimized within the bounds of group size ($3 \le |T| \le 5$). This means that the genome can have an arbitrary number of chromosomes (teams).

I implemented four distinct mutation operators, and to explore the dynamic topology I had to be a little creative with some of them. When a candidate is selected for mutation, only one of these operations is applied probabilistically:

Swap Genes (High Probability): Two students from different teams exchange places. This is the primary mechanism for fine-tuning. It allows the system to optimize across all fitness objectives (homogeneity, availability, balance) without disrupting the structural topology of the groups. It is a low-volatility move designed to climb local gradients.
Move Gene (Medium Probability): A student moves from Team $A$ to Team $B$. This acts as a load balancer. It alters the size distribution, allowing the system to fix under-filled or over-filled teams. It allows the algorithm to migrate members from a group of 5 to a group of 3, refining constraint satisfaction.
Dissolve Chromosome (Low Probability): A specific team is destroyed. Its members are distributed into other existing teams. This operator reduces $k$, effectively compacting the partition. It forces the system to explore denser configurations (larger average group sizes) and eliminates fragmented or low-fitness outlier groups.
Nucleate Chromosome (Low Probability): The inverse of dissolve. The algorithm scavenges single members from varying teams to form a new, valid team. This operator increases $k$, allowing the system to relieve pressure from large groups. It expands the topology, creating new space to resolve conflicts where students might not fit well in any existing group.

By balancing these operators, the algorithm naturally converges not just on the right people for each team, but on the optimal number of teams for the specific cohort.
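The two most frequent operators can be sketched so that they never produce an invalid partition, which is what makes repair steps unnecessary. Teams are lists of student identifiers, and the selection weights in `mutate` are placeholders, not the values used in the real engine.

```python
import random

def swap_genes(partition, rng):
    """Two students from different teams exchange places (team sizes unchanged)."""
    a, b = rng.sample(range(len(partition)), 2)
    i = rng.randrange(len(partition[a]))
    j = rng.randrange(len(partition[b]))
    partition[a][i], partition[b][j] = partition[b][j], partition[a][i]
    return partition

def move_gene(partition, rng, min_size=3, max_size=5):
    """One student moves to another team, only if both teams stay within bounds."""
    donors = [t for t, team in enumerate(partition) if len(team) > min_size]
    receivers = [t for t, team in enumerate(partition) if len(team) < max_size]
    pairs = [(d, r) for d in donors for r in receivers if d != r]
    if not pairs:
        return partition  # no legal move exists; leave the candidate untouched
    d, r = rng.choice(pairs)
    student = partition[d].pop(rng.randrange(len(partition[d])))
    partition[r].append(student)
    return partition

def mutate(partition, rng):
    """Apply exactly one operator, chosen with hypothetical weights."""
    operator = rng.choices([swap_genes, move_gene], weights=[0.7, 0.3], k=1)[0]
    return operator(partition, rng)

rng = random.Random(42)
candidate = [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11]]
for _ in range(200):
    candidate = mutate(candidate, rng)
```

Dissolve and nucleate follow the same pattern: check that the resulting sizes stay within bounds before applying the change, and fall back to a no-op otherwise.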

The Proof of Concept

Before committing to a high-performance solution, I needed to validate the core hypothesis: Could the mathematical model actually produce psychologically viable teams?

I chose Python for the initial implementation to prioritize development velocity. Beyond the algorithm itself, I architected an end-to-end data pipeline, including a custom ingestion module that extracted assessment grades directly from the company's internal API. This ensured the model was fed with fresh, high-fidelity data rather than static exports.

Validation

Defining "success" in a live educational environment presented an ethical dilemma. Rigorous A/B testing would involve providing a potentially inferior grouping service to half the cohort, which was deemed unacceptable.

Instead, I established a conservative heuristic benchmark. I compared the algorithm's output against manual assignments performed by experienced educators. To ensure independence, these educators designed their partitions without seeing the algorithmic proposal, following the core principles we had discussed previously. The metric was "Perceived Fitness": when shown both options side-by-side, which one did they prefer?

This evaluation method contains an inherent bias. Educators are naturally inclined to prefer solutions they invested time in creating, since admitting that a machine outperformed their professional intuition induces cognitive dissonance. Consequently, this acted as a high-confidence threshold. The fact that the algorithm consistently matched or exceeded human preference, despite this adverse bias, provided strong validation that the model was producing psychologically viable teams.

Architectural Constraints

While the logic was sound, the runtime characteristics posed a challenge. To ensure convergence within the non-convex search space, the Evolutionary Strategy required a population of nearly a thousand candidates. Consequently, running a standard cohort ($N=25$) through the necessary generation cycles took between 2 and 3 minutes on a standard machine.

For a manual CLI tool, this latency is acceptable. However, the long-term vision was to integrate this engine into a fully automated pipeline triggered by calendar events. In this context, potentially running in resource-constrained environments or serverless functions, a multi-minute runtime introduces fragility and unnecessary cost.

I identified three theoretical bottlenecks inherent to the Python runtime for this specific workload, although no profiling was performed:

Object Overhead: In an Evolutionary Strategy, thousands of candidate solutions are generated and discarded per second. In Python, every Team instance is a heap-allocated PyObject with significant metadata overhead.
Garbage Collection: The massive churn of short-lived objects (due to the 50% generation cull rate) triggers constant GC cycles, pausing execution repeatedly.
Pointer Chasing: Since Python lists store references to objects scattered across the heap, the CPU is unable to leverage cache locality. Therefore, the fitness loop was dominated by memory lookups rather than arithmetic processing.
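To make the object-overhead point concrete, here is a minimal sketch comparing a team stored as a list of objects with the same membership packed into one 64-bit word. The Student class and both helper functions are hypothetical stand-ins, not the prototype's code, and the reported sizes are shallow and CPython-specific, so they should be read as a rough illustration rather than a measurement.

```python
import sys
from array import array


class Student:
    """Hypothetical stand-in for the prototype's student objects."""

    def __init__(self, student_id: int, skill: float) -> None:
        self.student_id = student_id
        self.skill = skill


def object_team_bytes(team: list) -> int:
    """Shallow size of the list plus each instance it references.

    This underestimates the real cost: attribute dictionaries and the
    attribute values themselves are not counted.
    """
    return sys.getsizeof(team) + sum(sys.getsizeof(s) for s in team)


def packed_team_bytes(member_ids: list) -> int:
    """Size of the same membership packed into one unsigned 64-bit word."""
    mask = 0
    for i in member_ids:
        mask |= 1 << i
    packed = array("Q", [mask])  # "Q" = unsigned 64-bit integer
    return packed.itemsize * len(packed)


team = [Student(i, 0.5) for i in (2, 7, 11, 19)]
print(object_team_bytes(team), "bytes as objects (shallow, CPython-specific)")
print(packed_team_bytes([2, 7, 11, 19]), "bytes as a packed bitmask")
```

The object version also scatters its instances across the heap, which is the pointer-chasing cost described above; the packed version is a single contiguous value.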

Why not Numba? I also considered using a JIT compiler such as Numba to patch these performance issues. However, the domain logic relied heavily on set operations and graph-like relationships rather than simple array arithmetic. That reliance both contributed to the excessive runtime and precluded effective JIT optimization. Ultimately, I decided against forcing Python to act like a low-level language. The first iteration was meant to be a throwaway proof of concept anyway, so I re-architected the optimization core in a language natively designed for memory control, ensuring a lightweight, portable binary without heavy runtime dependencies.

The Rewrite

The primary goal of this phase was to eliminate the runtime bottleneck inherent in the Python prototype. While the logic was sound, the execution needed to be orders of magnitude faster to become usable in a fully automated production environment.

So, I selected Zig for the rewrite. While Rust and C++ are the industry standards for this domain, I prioritized developer velocity once again. As the sole maintainer, the most pragmatic choice was the systems language I had the most experience with.

I was fully aware that introducing a niche language creates technical debt for future maintenance. However, I considered this an acceptable trade-off, almost a no-brainer, I dare say. The codebase is small, self-contained, well documented, and logically explicit; a future maintainer could read the Zig source almost as pseudo-code or port it to C++ with minimal effort. I had also considered C for this rewrite. It would have been a great choice: it is far more widely known, which lowers the bus-factor risk, and I was fairly comfortable with it. Zig, however, offered the low-level control of C combined with modern developer ergonomics and safety features, such as checked arithmetic and bounds-checked memory access (spatial memory safety). It allowed me to write safe, performant code while keeping all the freedom and explicitness C has to offer, if not more.

Data Oriented Design

The initial Python prototype relied heavily on abstractions that proved costly at scale. For the rewrite, I avoided simply porting the logic syntax-for-syntax and I took my time to think through the implementation and re-architect the entire thing.

I simplified the data structures to their bare minimum. Zig's focus on memory layout guided me toward primitives rather than objects, leading to a design that was significantly leaner and more cache-friendly by default.

The Team: In Python, a team was a list of Student references. In Zig, I redesigned the Team as a thin wrapper struct over a single u64 bitmask. Since the cohort size ($N \approx 25$) comfortably fits within a 64-bit integer, a team is represented simply by toggling bits.
The Partition: The collection of teams (the genome) was implemented as a contiguous dynamic array.
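The bitmask representation can be sketched in a few lines. The version below is written in Python for readability and is not the Zig implementation; the function names are illustrative assumptions, and a Python int stands in for the u64.

```python
COHORT_SIZE = 25  # from the text: N is roughly 25, so 64 bits are plenty


def make_team(member_indices):
    """Build a team mask by setting one bit per student index."""
    mask = 0
    for index in member_indices:
        if not 0 <= index < 64:
            raise ValueError("student index must fit in a 64-bit mask")
        mask |= 1 << index
    return mask


def contains(team, index):
    """Membership check: is the student's bit set?"""
    return ((team >> index) & 1) == 1


def team_size(team):
    """Population count: the number of set bits is the team size."""
    return bin(team).count("1")


def members(team):
    """Yield member indices by repeatedly clearing the lowest set bit."""
    while team:
        lowest = team & -team          # isolates the lowest set bit
        yield lowest.bit_length() - 1  # its position is the student index
        team &= team - 1               # clear it and continue


def is_valid_partition(partition, cohort_size=COHORT_SIZE):
    """Teams must be pairwise disjoint and together cover every student."""
    seen = 0
    for team in partition:
        if seen & team:  # some student already belongs to another team
            return False
        seen |= team
    return seen == (1 << cohort_size) - 1


# Five teams of five students each: a valid partition of a 25-student cohort.
partition = [make_team(range(start, start + 5)) for start in range(0, 25, 5)]
print(is_valid_partition(partition), [team_size(t) for t in partition])
```

Validity checks that would need nested loops over lists reduce to one AND and one OR per team.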

Conceptually, each solution became a matrix of bits, sparse in the sense that each row has only a few bits set: each row was a u64 team representation, and each column was a single bit encoding one student's membership. As with most matrix representations, this one also had a contiguous memory layout.

This architectural shift yielded a massive performance dividend. By converting heavy heap objects into simple arrays of primitives, I eliminated the pointer chasing overhead. This also drastically reduced the entire memory footprint of the data, which improved cache utilization by reducing evictions caused by metadata overhead. Crucially, the Team entity transformed into a value small enough to fit entirely within a CPU register, which opened the door to some interesting performance optimizations.

The bitwise strategy extended naturally to the temporal availability checks. In Python, I relied on set operations, which involved hashing and iterating over collection objects. In Zig, I mapped the weekly schedule (21 slots) to a u32. This allowed me to replace loops with bitwise operators. To count the set bits I used @popCount, a compiler builtin (not part of the standard library) that typically compiles down to a single hardware instruction (such as POPCNT on x86), making the intersection logic exceptionally fast.

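// NOTE: Not the actual implementation

const std = @import("std");

// Minimal illustrative definitions so the sketch is self-contained.
const ScheduleMask = u32; // 21 weekly slots, one bit per slot
const TeamMask = struct { students: u64 }; // one bit per student
const Cohort = struct { availability: []const ScheduleMask };

fn jaccard(cohort: Cohort, team: TeamMask) f32 {
    // Start from "all slots" so the first AND yields the first schedule.
    var intersection: ScheduleMask = std.math.maxInt(ScheduleMask);
    var union_mask: ScheduleMask = 0;

    // Visit only the set bits, i.e. the students who belong to this team.
    var bits: u64 = team.students;
    while (bits != 0) {
        const index = @ctz(bits); // position of the lowest set bit
        const schedule = cohort.availability[index];

        intersection &= schedule;
        union_mask |= schedule;

        bits &= (bits - 1); // clear the lowest set bit
    }

    // An empty team, or one with no availability at all, scores zero.
    const total_slots = @popCount(union_mask);
    if (total_slots == 0) return 0.0;

    const common_slots = @popCount(intersection);

    return @as(f32, @floatFromInt(common_slots)) / @as(f32, @floatFromInt(total_slots));
}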

This same logic applied to the social boosters ($O_4, O_5$). Shared interests and geographic locations were similarly encoded as bitmasks, reducing complex intersection calculations to a handful of hardware instructions.

Since the students themselves were modeled as bit positions in a u64, operations like membership checks or metadata access also became simple bitwise operations. Student metadata (availability, workforce, etc.) was stored in a custom data structure following a Struct of Arrays pattern: the entire cohort lived in a single struct where each field was an array holding one attribute for all students (e.g. workforce: []u16, availability: []u32). Reading every attribute of one specific student was comparatively expensive, since each field lives in a different array and therefore in a different cache line, but iterating over one attribute at a time was cheap. That is exactly the access pattern the fitness function required, as it computed each fitness metric one at a time. This design was made possible by Zig's comptime capabilities. While something similar could be modeled with macros or templates in C, C++ or Rust, I found Zig's metaprogramming much cleaner and easier to work with.
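As a rough illustration of the Struct of Arrays layout, here is a Python sketch using the standard array module. The Cohort class, the helper names, and the idea of summing workforce per team are assumptions made for the example; only the field names and element widths come from the text above.

```python
from array import array


class Cohort:
    """Struct-of-arrays layout: one contiguous typed array per attribute."""

    def __init__(self, workforce, availability):
        if len(workforce) != len(availability):
            raise ValueError("every attribute needs one entry per student")
        self.workforce = array("H", workforce)        # 16-bit unsigned values
        self.availability = array("I", availability)  # typically 32-bit unsigned


def team_indices(team_mask):
    """Yield the student indices whose bits are set in the team mask."""
    while team_mask:
        yield (team_mask & -team_mask).bit_length() - 1
        team_mask &= team_mask - 1


def total_workforce(cohort, team_mask):
    """Reads only the workforce column; other attributes are never touched."""
    return sum(cohort.workforce[i] for i in team_indices(team_mask))


cohort = Cohort(
    workforce=[10, 20, 30, 40],
    availability=[0b0001, 0b0011, 0b0111, 0b1111],
)
print(total_workforce(cohort, 0b1010))  # students 1 and 3
```

Each metric walks a single array indexed by student position, which matches a fitness function that scores one attribute at a time.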

The Custom Parser

Another key component worthy of a brief mention was the data ingestion. Since the Zig ecosystem lacked (and still lacks at the time of writing) a maintained generic CSV library, I got to write a custom parser that tokenized the input stream. This allowed the engine to load and validate the cohort data with minimal overhead, strictly parsing only what was necessary for the internal representation.

Memory Management

In the Python version, the GC was a primary bottleneck. In Zig, the shift to manual management provided stable performance.

For the population storage, I utilized a standard GeneralPurposeAllocator. While I did not implement more advanced allocation strategies at this stage (a decision revisited in the Retrospective), simply moving to manual memory management removed the GC pauses. The performance gains were a compound effect of the bitwise data structures and the removal of runtime overhead.

The Interop

While the algorithm needed to be fast, the data loading did not. Writing C-bindings to link Python and Zig directly in memory felt like unnecessary complexity for this use case.

Instead, I opted for a loosely coupled architecture. The Python ETL pipeline dumps the processed student data into a sanitized CSV. Then, it spawns the Zig program as a subprocess, which reads this CSV, runs the optimization, and streams the result to stdout. Python captures this stream and parses the result. This kept the architecture modular and allowed me to focus on the optimization logic without fighting build systems or complex linking. More importantly, it made it that much easier to reason about and debug.
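The Python side of this hand-off could look roughly like the sketch below. Everything specific here is an assumption for illustration: the CSV columns, the output format (one team per line, space-separated ids), and the function names. So that the example runs anywhere, a small Python one-liner stands in for the compiled engine.

```python
import csv
import os
import subprocess
import sys
import tempfile


def write_cohort_csv(rows, path):
    """Dump the already-sanitized rows to the CSV the engine will read."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["student_id", "skill"])  # hypothetical columns
        writer.writerows(rows)


def run_engine(command, csv_path):
    """Spawn the engine, wait for it, and return its stdout as text."""
    completed = subprocess.run(
        [*command, csv_path],
        capture_output=True,
        text=True,
        check=True,  # raise if the engine exits with a non-zero status
    )
    return completed.stdout


def parse_partition(stdout_text):
    """Assumed output format: one team per line, space-separated ids."""
    return [line.split() for line in stdout_text.splitlines() if line.strip()]


# Stand-in for the compiled binary: reads the CSV path from argv and
# prints all student ids as a single team.
FAKE_ENGINE = [
    sys.executable,
    "-c",
    "import csv, sys; rows = list(csv.reader(open(sys.argv[1])))[1:]; "
    "print(' '.join(r[0] for r in rows))",
]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "cohort.csv")
    write_cohort_csv([("s1", 0.4), ("s2", 0.9)], path)
    teams = parse_partition(run_engine(FAKE_ENGINE, path))

print(teams)
```

Because the only contract between the two programs is a file and a text stream, either side can be run, tested, or replaced in isolation.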

The Result

The performance improvement was over 200x. The runtime dropped from ~3 minutes to under a second.

Impact & Retrospective

Since rigorous A/B testing was not possible in a live educational environment, I gauged the system's impact by observing the long-term stability of the cohorts.

Business Impact

The most immediate difference was that the last-day team crisis, or worse, the post-deadline complaints, simply ceased to happen. Previously, I could rely on at least one group per cohort imploding due to personality clashes or unmanageable skill gaps, requiring staff mediation. The algorithm didn't necessarily produce a dream team every time, but it reliably prevented these disaster scenarios. This peace was the strongest validation that the core hypothesis (risk reduction via homogeneity) was correct.

This stability created downstream effects. Fewer initial conflicts meant fewer lingering grievances. I also noticed a significant increase in a specific phenomenon: teams approached me with requests to remain together for subsequent projects far more frequently. This was a clear signal that the groupings were not just functional, but psychologically safe and effectively balanced.

That said, let's address the elephant in the room: these results are observational. While the correlation between the system's deployment and the stability of the cohorts is strong, external factors in the curriculum or student selection could also have played a role.

Finally, one notable impact was the operational time saved. Previously, teaching staff would spend half a day building teams that often proved to be dysfunctional. This was a minimum 4-hour cost per teacher per month. Now, the entire process takes a few minutes. It's fast, it doesn't disrupt anyone's workflow, and it yields better results. That is a huge win in my book.

Performance

The transition from the Python prototype (2+ minutes) to the Zig engine (<1 second) did more than just save time; it fundamentally changed how I operated as an engineer.

With the prototype script, the tool was a black box. I would run it once, maybe twice, and we had to work with whatever it produced. The high latency discouraged experimentation. The Zig rewrite transformed it into an interactive exploration tool. I could generate a dozen distinct partitions in a minute, allowing me to apply professional judgment to a set of machine-vetted, high-quality choices. I could see concrete trade-offs: one partition might offer perfect homogeneity but sideline a student with a tricky schedule; another might widen the skill gap slightly to keep a local group of students together. This was ultimately what allowed me to balance out the default hyperparameter configuration for the engine itself.

Crucially, this speed acted as a diagnostic tool. Because I was now generating hundreds of variations, I began to notice statistical patterns that were invisible when I was running single batches.

Mathematical Flaws

The interactive nature of the new engine revealed that the algorithm had a persistent bias: it consistently favored solutions composed of many small teams (size 3) over larger ones (size 5).

The root cause is the fitness function's response to topology. In hindsight this seems really obvious, but it absolutely was not at the time. Basically, minimizing the skill variance ($O_1$) in a group of 3 is statistically easier than in a group of 5: with fewer members, the spread between the weakest and the strongest student tends to be narrower. Because the fitness function treated a tight range as an absolute value regardless of team size, selection pressure constantly pulled the topology toward smaller groups. This phenomenon extended to most other fitness metrics. A global Jaccard Index is also easier to maximize when teams are small, since each additional member can only shrink the intersection and grow the union, so $O_2$, $O_4$ and $O_5$ were culprits as well.
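The size bias can be reproduced with a small simulation. The sketch below does not use the engine's fitness function; it assumes uniformly random skills in [0, 1] and availability masks where each of the 21 slots is free with probability 0.5, both of which are simplifications chosen for the demonstration.

```python
import random


def mean_skill_range(team_size, trials, rng):
    """Average (max - min) skill over randomly drawn teams of one size."""
    total = 0.0
    for _ in range(trials):
        skills = [rng.random() for _ in range(team_size)]
        total += max(skills) - min(skills)
    return total / trials


def mean_jaccard(team_size, trials, rng, slots=21, p_free=0.5):
    """Average |intersection| / |union| of random availability masks."""
    total = 0.0
    for _ in range(trials):
        inter, union = (1 << slots) - 1, 0
        for _ in range(team_size):
            mask = sum(1 << s for s in range(slots) if rng.random() < p_free)
            inter &= mask
            union |= mask
        if union:  # an all-empty union contributes a score of zero
            total += bin(inter).count("1") / bin(union).count("1")
    return total / trials


rng = random.Random(0)
for size in (3, 5):
    print(
        size,
        round(mean_skill_range(size, 5000, rng), 3),
        round(mean_jaccard(size, 5000, rng), 3),
    )
```

For uniform skills the expected range of $n$ draws is $(n-1)/(n+1)$, so teams of 3 average about 0.5 while teams of 5 average about 0.67 before any optimization happens. A metric that rewards raw tightness therefore favors small teams by construction.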

Furthermore, analyzing the outputs revealed a misalignment in the Inter-Group Balance ($O_3$) objective. My formula minimized the deviation from the cohort target, effectively pulling all groups toward a specific skill level. In hindsight, this was an over-correction. The pedagogical goal was strictly to prevent weak teams (raising the floor), not to suppress strong ones (capping the ceiling). By penalizing positive outliers, I was artificially preventing high-performing groups from emerging simply to satisfy a symmetry constraint that existed only in the math (and in my mind), not in the requirements.
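The difference between the two readings of the objective can be shown with two toy penalty functions. Neither is the actual $O_3$ formula; both are hypothetical simplifications that take per-team mean skills and a cohort target.

```python
def symmetric_penalty(team_means, target):
    """Penalizes any deviation from the target, above or below."""
    return sum(abs(m - target) for m in team_means) / len(team_means)


def floor_only_penalty(team_means, target):
    """Penalizes only the shortfall of teams that fall below the target."""
    return sum(max(0.0, target - m) for m in team_means) / len(team_means)


strong_outlier = [0.5, 0.5, 0.8]  # one team well above the target
weak_outlier = [0.5, 0.5, 0.2]    # one team well below the target

for means in (strong_outlier, weak_outlier):
    print(
        means,
        round(symmetric_penalty(means, 0.5), 3),
        round(floor_only_penalty(means, 0.5), 3),
    )
```

Under the symmetric version both partitions are penalized equally. Under the floor-only version the partition with a strong team costs nothing, while the one with a weak team is still penalized, which is closer to the stated pedagogical goal.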

In short, these defects were not implementation errors, but natural consequences of the model's axioms. Regardless of the specific mathematical fix, likely involving size-normalized weights and changes to the $O_3$ formula to remove the artificial ceiling, there was a valuable lesson in the engineering side of things: latency hides bugs. Had the tool remained slow, I likely never would have generated enough samples to spot these biases. Or, at the very least, it would've taken me a long time to do so. The rewrite didn't just buy me time; it bought me the bandwidth to be wrong, and the speed to eventually get it right.

Systems Engineering Lessons

On the implementation side, this project was my introduction to manual memory management and data oriented design in production. Looking back, my strategy was functional but naive.

I utilized a standard GeneralPurposeAllocator. At the time, this felt like a victory because it eliminated the GC pauses that plagued the Python version. Moreover, the initial performance gains were so large that I didn't even consider that there was much more room for improvement. However, for an Evolutionary Strategy where thousands of short-lived Team structs are created and destroyed every second, this approach causes heap fragmentation, especially for long-running simulations. The CPU is forced to chase pointers across non-contiguous memory, causing cache misses that leave performance gains on the table.

Today, I would implement this very differently. Since the lifecycle of a generation is predictable and the population size is fixed, I could pre-allocate one contiguous memory block (an Arena or Pool) for everything. Crucially, not only is the population size fixed, but there is a computable upper bound for the partition size (@divFloor(N, min_team_size)), which could greatly simplify the partition representation to a simple bounded array that does not need reallocations and memcopies like a dynamic one. This would ensure near-perfect data locality and reduce the cost of allocation and deallocation, each to a single operation at the very beginning and the very end of the runtime, respectively.
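The layout described above can be sketched as a fixed-capacity pool. This Python version only illustrates the indexing scheme; the class and method names are assumptions, and Python itself still allocates temporary objects, so it says nothing about real allocator behavior.

```python
from array import array


class PartitionPool:
    """Fixed-capacity storage: every partition gets the same bounded slot."""

    def __init__(self, population_size, cohort_size, min_team_size):
        # Upper bound on teams per partition, as in @divFloor(N, min_team_size).
        self.max_teams = cohort_size // min_team_size
        self.population_size = population_size
        # Allocated once up front and never resized afterwards.
        self.teams = array("Q", [0]) * (population_size * self.max_teams)
        self.lengths = array("B", [0]) * population_size  # assumes <= 255 teams

    def write(self, slot, team_masks):
        """Overwrite one partition in place within its reserved slot."""
        if len(team_masks) > self.max_teams:
            raise ValueError("partition exceeds the computed upper bound")
        start = slot * self.max_teams
        for offset, mask in enumerate(team_masks):
            self.teams[start + offset] = mask
        self.lengths[slot] = len(team_masks)

    def read(self, slot):
        """Return the team masks currently stored in one slot."""
        start = slot * self.max_teams
        return list(self.teams[start:start + self.lengths[slot]])


pool = PartitionPool(population_size=1000, cohort_size=25, min_team_size=3)
pool.write(0, [0b00111, 0b11000])
print(pool.max_teams, len(pool.teams), pool.read(0))
```

With $N = 25$ and a minimum team size of 3, each partition needs at most 8 words, so a population of 1000 fits in 8000 contiguous u64 values, roughly 64 KB.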

Similarly, I missed an opportunity to leverage concurrency. I kept the engine single-threaded for simplicity during initial development. When I saw the performance results, just like with the memory handling approach, adding multi-threading felt unnecessary. However, parallel execution could have significantly increased the population size without impacting runtime. Granted, a larger population size usually takes longer to converge, but it would ensure a wider coverage of the search space.

Furthermore, in relation to concurrency, I overlooked the algorithmic benefits of some more advanced strategies, such as the Island Model. Running isolated populations on separate threads with occasional migration of top solutions would have maintained higher genetic diversity and prevented the premature convergence I sometimes observed. I could've also modeled a better balance between exploration and exploitation by configuring each island's hyperparameters independently. Of course, this system would've been far more complex, and maybe the potential benefits would not justify such complexity (especially since the current model already works so well and solves the problem), but I have to admit that something inside me yearns for these kinds of engineering challenges.
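For readers unfamiliar with the Island Model, here is a minimal sketch on a toy problem (maximizing the number of set bits in a 32-bit genome). It is not the team-formation engine: the fitness function, mutation operator, ring migration policy, and all parameter values are assumptions, and the islands run sequentially here, whereas a real implementation would give each island its own thread.

```python
import random

GENOME_BITS = 32


def fitness(genome):
    """Toy objective (OneMax): count the set bits in the genome."""
    return bin(genome).count("1")


def mutate(genome, rng):
    """Flip one randomly chosen bit."""
    return genome ^ (1 << rng.randrange(GENOME_BITS))


def step(population, rng):
    """One generation: keep the better half, refill with mutated survivors."""
    population.sort(key=fitness, reverse=True)
    survivors = population[: len(population) // 2]
    children = [mutate(rng.choice(survivors), rng) for _ in survivors]
    return survivors + children


def migrate(islands):
    """Ring migration: each island's best replaces the next island's worst."""
    bests = [max(island, key=fitness) for island in islands]
    for i, island in enumerate(islands):
        worst = min(range(len(island)), key=lambda j: fitness(island[j]))
        island[worst] = bests[i - 1]  # index -1 wraps around, closing the ring


def run_islands(n_islands=4, island_size=20, generations=60, interval=10, seed=0):
    """Evolve isolated populations, exchanging top solutions periodically."""
    rng = random.Random(seed)
    islands = [
        [rng.getrandbits(GENOME_BITS) for _ in range(island_size)]
        for _ in range(n_islands)
    ]
    for generation in range(1, generations + 1):
        islands = [step(island, rng) for island in islands]
        if generation % interval == 0:
            migrate(islands)
    return max(fitness(g) for island in islands for g in island)


print(run_islands())
```

Between migrations each island explores on its own, which is what preserves diversity. Giving each island different hyperparameters would only require passing per-island settings into step.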

Conclusion

This project was a success. It solved the business problem, operationalized the team-building process, and significantly improved the student experience.

But for me, the lasting value lies in the technical retrospective. It taught me that a mathematical model is only as good as the feedback loop that validates it. It also demonstrated that performance is not a luxury. Nor is it, as most have heard, merely "the root of all evil" when applied prematurely (also, what even is "prematurely"?). Instead, to me, performance is the lens through which we understand the behavior of our software. Just like security, it must never be an afterthought.

To wrap this up, here's a quote from someone dear to my heart. You might even know them from their work on the Linux kernel:

To some degree, people say you should not micro-optimize. But if what you love is micro-optimization, that's what you should do.