Generalized linear mixed model
- statistical.glmm.run(data, label_name, factor_type, formula, contrasts, data_type='gaussian', return_covariance=False)
Applies a generalized linear mixed model (GLMM). Since the formula has to be parsed in order to align significances with model coefficients, it is highly recommended to verify once before execution that the formula is understood correctly by running check_formula(formula); see the sketch after the formula-writing notes below. The formula is parsed (scrubbed and unpacked), and fixed and random factors are extracted.
- Parameters:
data – Input data as a single matrix of shape (samples x values), i.e. axis 0 holds the samples and axis 1 the values per sample.
label_name – Names of the individual columns; these have to match the names specified in the formula.
factor_type – Type of each factor, either "categorical" or "continuous"; one entry per column.
formula – Formula to be applied by the GLMM. Factor names must not contain digits.
contrasts – Contrast values for the GLMM. Contrast names have to align with the names specified in the formula.
data_type – Model type for the GLMM, default is gaussian.
return_covariance – Whether to additionally return the covariance matrix. Optional, since the covariance matrix does not share the dimensions of the other return values.
- Returns:
(scores, df, p-values, coefficients, std_error, factor names)
Be aware:
- Categorical variables: These get ordered and the lowest value becomes the reference. The returned model coefficients are the average coefficients of all levels vs. the reference coefficient; with the levels ‘0’ and ‘1’, ‘0’ determines the point of reference whereas ‘1’ determines the target value. Any coefficient value corresponds to the effect of changing said factor from ‘0’ to ‘1’, not the other way around; e.g. a coefficient of 0.82 means that switching the factor from ‘0’ to ‘1’ shifts the predicted value by 0.82.
- Continuous variables: Effect sizes for this kind of variable are scaled to a one-unit change of the factor. Therefore the coefficient value corresponds to the effect of changing said factor from ‘0’ to ‘1’.
How to write lmer models:
- Variables are continuous by default; Factor(var) turns them into categorical variables.
- var0 ~ var1 + var2 | Evaluate the effect of var1 and var2 on var0.
- var0 ~ var1 * var2 == var0 ~ var1 + var2 + var1:var2 | Evaluate the effect of var1, var2, and the interaction of var1 and var2.
- var0 ~ var1 + (1|var2) | Model var2 as a random effect.
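Before fitting, it can help to confirm that a formula is parsed as intended via check_formula, as referenced above. A minimal sketch, assuming check_formula is exposed on the same finn.statistical.glmm module as run:

import finn.statistical.glmm as glmm

# Hypothetical formula: two fixed effects, their interaction ('*' expands to
# var1 + var2 + var1:var2), and a random effect of var3.
formula = "var0 ~ var1 * var2 + (1|var3)"

# check_formula (module path assumed) reports how the formula is scrubbed and
# unpacked, and which fixed and random factors were extracted.
glmm.check_formula(formula)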
The following code example shows how to apply the GLMM for data evaluation.
import numpy as np
np.random.seed(0)
import finn.statistical.glmm as glmm

data_size = 100000
random_factor_count = 20
nested_random_factor_count = 2

# Measured variable: two halves drawn from normal distributions with different means.
data_0 = np.expand_dims(np.concatenate((np.random.normal(0, 3, data_size // 2), np.random.normal(1, 2, data_size // 2))), axis=0)
# Categorical factor A: tracks the mean shift of the measured variable (informative).
data_1 = np.expand_dims(np.concatenate((np.random.binomial(1, 0.1, data_size // 2), np.random.binomial(1, 0.9, data_size // 2))), axis=0)
# Categorical factor B: balanced across both halves (uninformative).
data_2 = np.expand_dims(np.concatenate((np.random.binomial(1, 0.5, data_size // 2) * 2 - 2, np.random.binomial(1, 0.5, data_size // 2) * 2 - 2)), axis=0)
# Continuous factor A: small mean shift between the two halves.
data_3 = np.expand_dims(np.concatenate((np.random.normal(0, 1, data_size // 2), np.random.normal(0.25, 1, data_size // 2))), axis=0)

# Random effect and nested random effect: shuffled group assignments.
# np.repeat requires an integer repeat count, hence the floor division.
data_4 = np.repeat(np.arange(0, random_factor_count), data_size // random_factor_count)
np.random.shuffle(data_4)
data_4 = np.expand_dims(data_4, axis=0)
data_5 = np.repeat(np.arange(0, nested_random_factor_count), data_size // nested_random_factor_count)
np.random.shuffle(data_5)
data_5 = np.expand_dims(data_5, axis=0)

# Assemble the (samples x values) matrix expected by glmm.run.
data = np.concatenate((data_0, data_1, data_2, data_3, data_4, data_5), axis=0).transpose()
data_label = ["measured_variable", "categorical_factor_A", "categorical_factor_B", "continous_factor_A", "random_effect_A", "nested_random_effect_A"]

glm_formula = "measured_variable ~ categorical_factor_A + categorical_factor_B + continous_factor_A + categorical_factor_A:continous_factor_A + (1|random_effect_A) + (1|random_effect_A:nested_random_effect_A)"
glm_factor_types = ["continuous", "categorical", "categorical", "continuous", "categorical", "categorical"]
glm_contrasts = "list(categorical_factor_A = contr.sum, categorical_factor_B = contr.sum, continous_factor_A = contr.sum, random_effect_A = contr.sum, nested_random_effect_A = contr.sum)"
glm_model_type = "gaussian"

stat_results = glmm.run(data=data, label_name=data_label, factor_type=glm_factor_types, formula=glm_formula, contrasts=glm_contrasts, data_type=glm_model_type)

print("Demo may return a singular fit since the naive data generation of this example\n does not guarantee sufficient observations for any random factor/nested random factor.\n")

for (factor_idx, factor_name) in enumerate(stat_results[5]):
    print("factor: %s | p-value: %2.2f | effect size: %2.2f | std error %2.2f" % (factor_name, stat_results[2][factor_idx], stat_results[3][factor_idx], stat_results[4][factor_idx]))
Applying the generalized linear mixed model identifies categorical_factor_A as significant with a large effect size. continous_factor_A is also significant, but with a much smaller effect size, and the intercept is the third significant factor with a relatively small effect size. Neither categorical_factor_B nor the interaction between categorical_factor_A and continous_factor_A is statistically significant or exhibits a large effect size (especially relative to the std error).
| Name | p-value | effect-size | std-error |
| --- | --- | --- | --- |
| categorical_factor_A | 0.00 | 0.82 | 0.02 |
| categorical_factor_B | 0.17 | -0.02 | 0.02 |
| continous_factor_A | 0.00 | 0.04 | 0.01 |
| categorical_factor_A:continous_factor_A | 0.36 | -0.01 | 0.02 |
| (Intercept) | 0.00 | 0.10 | 0.01 |
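Continuing the demo above, the returned tuple can also be consumed programmatically. A minimal sketch that unpacks stat_results in the documented order and collects the factors significant at an assumed 0.05 threshold:

# Unpack in the documented order: (scores, df, p-values, coefficients, std_error, factor names).
scores, dfs, p_values, coefficients, std_errors, factor_names = stat_results

# Keep only the factors whose p-value falls below the assumed 0.05 threshold.
significant = [(name, coef) for (name, p_value, coef) in zip(factor_names, p_values, coefficients) if p_value < 0.05]
print(significant)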
- Note: This demo code may return a singular fit since the naive data generation of this example does not guarantee sufficient observations for every random factor/nested random factor.