A Practical Guide to Choosing the Value of k for Age Smooths in Generalized Additive (Mixed) Models
Background
GAM smooth functions and splines
When using GAMs (or GAMMs) to model neurodevelopmental trajectories, we model age as a smooth term:
model <- gam(brain ~ s(age), method = "REML")
The smooth term for age produces an age spline. Here, the spline for age is derived from a combination of (simpler) basis functions, which can take on different functional forms – from linear, to polynomial, to very squiggly. Hence, splines are functions defined by the sum of simpler functions called basis functions. As an example, in the image below, the black line is a fitted spline and the colored lines are basis functions that comprise the resultant fitted spline.
The black line is the fitted trajectory we would see when visualizing a smooth term via plot.gam or model fitted values
plot.gam(model) model.fitted <- fitted_values(model, term = "s(age") ggplot(model.fitted, aes(x = age, y = fitted)) + geom_line()
The basis functions that comprise the smooth term we can visualize with e.g.
model.basis <- basis(model, term = "s(age)") ggplot(model.basis, aes(x = age, y = value, group = bf, color = bf)) + geom_line()
Spline wiggliness
When we fit our smooth term for age in developmental models, we need to determine how wiggly, or how non-linear, the resultant age spline can be. To control this wiggliness, we can choose different values of k in our smooth term:
model <- gam(brain ~ s(age, k = 4)) #in mgcv, the default value of k is 10
Ultimately, higher values of k allow our age spline (our developmental trajectory) to be more wiggly or non-linear. Lower values of k result in less wiggly and less complex fits.
Changing the value of k accomplishes this by changing the number of basis functions that we can use to create our fitted spline! We allow the model to use k-1 basis functions.
Note here, that the “k” syntax technically stands for “knots”. However, in many applications of GAM/GAMM smooth functions – including the default use of thin plate regression splines – we do not actually set a specific number (or location) of knots. In fact, with thin plate regression splines (tprs), there is a knot at every unique data value involved in the smooth (and as many basis functions as data), and then a “low rank smoother” is obtained by doing an eigendecomposition on the full basis (with all the knots) and picking the first k eigenvectors that explain most of the variance in the data.
Ultimately, we want to choose a value of k that allows our basis dimension to be high enough to capture the true representation/complexity of the data (the true developmental trajectory) without overfitting our data or producing developmental fits that are unrealistically non-linear.
Penalized splines
In addition to changing the value of k, we can change whether our spline is penalized, and this choice interacts to an extent with k.
model.penalized <- gam(brain ~ s(age, k = 4, fx = F)) #penalized spline. default in mgcv model.unpenalized <- gam(brain ~ s(age, k = 4, fx = T)) #unpenalized model
If your age smooth function is unpenalized (fx = T), it necessitates that your age spline is created from k - 1 basis functions. If your smooth is penalized (fx = F), then a larger number of k is penalized to balance (in theory) flexibility and overfitting, and any number of basis functions equal to or less than k - 1 can be used. This is accomplished by introducing a penalty term that controls the wiggliness (or, more formally). If using a high k, you should penalize the spline to allow the model to select an appropriate k value.
Choosing k
Choosing different values of k can affect your fitted age spline and thus the shape of your developmental trajectory and the derivatives derived from it (and thus derivative-based inferences, e.g. the age of change offset). This is therefore an important modeling decision! But it is also a tricky one: there is no universally correct value of k to use in developmental (or lifespan) trajectory modeling. The best value will depend on your study sample, measure, age range, N, etc. Luckily, there are tools available to help you pick a reasonable/appropriate value of k.
When beginning to model age-related changes in a new measure, consider the heuristic below:
1. Start by harnessing the inherent power of penalized splines
The beauty of penalized splines is that the penalty should balance model fit with smoothness/wiggliness. If k is large enough to recover the true functional form of your age smooth function, then penalization should allow this to occur while preventing overfitting. Thus, we can see what a purely data-driven approach to k gives us by fitting:
model.penalized <- gam(brain ~ s(age, fx = F))
And we can use this as our model!
In practice, however, overfitting often occurs with an “unlimited” (but penalized) k; this is probably more true in datasets with a smaller N or an uneven representation over the full distribution of ages. If your age spline appears overfit to the data (Is it wiggly in a biologically unrealistic way? Are the derivatives switching from positive to negative to positive and back in short age ranges?), move on to 2.
2. Limit your k, check your k, and compare models
You can fit models with a smaller, set number of k (e.g., k = 3-5) and with/without penalization (especially if using a very small k), and investigate the following to help you choose your final model:
- Fitted splines/edfs: Is the shape of your spline and associated derivatives changing considerably with higher values of k? Is your edf (reported in summary(model)) considerably increasing as you increase k? This could indicate you need to lean towards a higher k value. If your edf remains low and your fit remains stable despite your exact k, then a lower k is likely sufficient.
- k.check: Does k.check indicate that there is unmodeled patterns in your derivatives? You can check whether your chosen value of k is appropriately high enough via k.check(model), which runs diagnostic tests of whether the basis dimension choice is adequate by examining residual variance. A low p-value and a k-index < 1 may indicate that your chosen k is too low.
- Model comparison: Do multiple values of k seem appropriate for your data based on the above, and you're not sure which to choose? You can fit models with different values of k (and with/without penalization) and compare model fit with e.g. model AIC.
3. Be conservative when fitting many models
If you're fitting hundreds to thousands of models, it may be safest to choose a conservative approach with (k = 4, fx = F) or (k = 3, fx = T). There may also be certain situations where constraining the edf to be identical across models by setting fx = T facilitates comparisons of across regions.
Is my model really non-linear?
If your fitted splines look largely linear, you may be wondering to yourself: should I really use a gam here? Is my near-linear or subtly non-linear fit obtained with gams and a reasonable number of k a better description of the data than a purely linear fit? We can test this! Specifically, we can test if our smooth function is significantly non-linear by removing the intercept and the linear fit from our spline basis set, as below:
model.testlinearity <- gam(brain ~ age + s(age, m = c(2,0))
This s() function syntax removes the two non-penalized fits (intercept and linear fit) from the smooth. If the smooth is no longer significant, then the fit is not significantly non-linear and you can consider a linear model.