Building Robust Credit Scoring Models: A Stability-Focused Approach to Variable Selection

The effectiveness of predictive models, particularly in sensitive domains like credit scoring, hinges critically on the quality of variable selection. A common pitfall leading to model failure in production is the selection of variables that perform exceptionally well on training data but falter on new, unseen data. This phenomenon, known as overfitting (and often compounded by data leakage during development), produces models that appear robust in development but prove unreliable in real-world application. To address this challenge, the methodology presented here prioritizes stability and interpretability in variable selection, so that models remain effective regardless of how the data is split.
This article details a rigorous variable selection process that moves beyond traditional performance metrics, focusing instead on the inherent stability and robustness of variables across diverse data subsets. By employing a multi-stage filter method, the approach systematically evaluates variables for their consistent relevance to the target outcome, ensuring that selected features are not only statistically significant but also interpretable and less prone to degradation in performance when deployed. The core principle is that a variable’s true value lies in its consistent predictive power across different segments of the data, not merely its peak performance on a single, fixed dataset.
The Core Principle: Prioritizing Stability Over Peak Performance
The fundamental tenet of this robust variable selection approach is the emphasis on stability. A variable is deemed robust if its predictive relevance is consistently demonstrated across various subsets of the data, rather than being confined to the full training dataset. This principle directly combats the issue of variables that exhibit strong correlations or predictive power in a specific training set but lack generalizability.
To operationalize this, the methodology employs a stratified cross-validation technique. The training data is systematically divided into four distinct folds. This stratification is crucial: it ensures that each fold is representative of the overall population with respect to the distribution of the target variable (default status) and a key temporal characteristic, such as the year of loan issuance, captured here in a single def_year column. Stratifying on def_year ensures that each subset mirrors the characteristics of the complete dataset, thereby providing a more reliable testbed for variable stability.
The process involves an iterative splitting of the training data into four pairs of training and testing sets. Each pair comprises three folds designated for training the model and one fold reserved for testing. Critically, all variable selection criteria are applied exclusively to the training sets of each fold. This strict adherence prevents data leakage, a common cause of inflated performance metrics during development that do not translate to production.
from sklearn.model_selection import StratifiedKFold
# Assuming train_imputed is your pandas DataFrame with data and 'def_year' column
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
train_imputed["fold"] = -1
for fold, (_, test_idx) in enumerate(skf.split(train_imputed, train_imputed["def_year"])):
train_imputed.loc[test_idx, "fold"] = fold
This code snippet initializes a stratified k-fold cross-validation object with four splits, ensuring shuffling and a fixed random state for reproducibility. It then assigns a fold number to each record in the train_imputed DataFrame. The subsequent step, represented by build_and_save_folds(train_imputed, fold_col="fold", save_dir="folds/"), would typically involve creating and saving these data splits for subsequent processing.
A variable successfully navigates this selection process only if it meets the defined criteria across all four folds. The failure of a variable to meet the criteria in even a single fold is sufficient cause for its elimination. This stringent requirement guarantees that the selected variables are consistently relevant and predictive, forming a robust foundation for any credit scoring model.
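This all-folds gate can be expressed very compactly. The sketch below is illustrative rather than taken from the article's codebase; it assumes a hypothetical `fold_results` mapping from each variable name to one pass/fail flag per fold:

```python
def passes_all_folds(fold_results):
    """Keep only variables whose selection criterion holds in every fold.

    fold_results: dict mapping variable name -> list of booleans,
    one per fold (True = criterion met in that fold).
    """
    return [var for var, checks in fold_results.items() if all(checks)]

# A variable failing in even one fold is eliminated.
results = {
    "loan_int": [True, True, True, True],
    "loan_amnt": [True, False, True, True],  # fails fold 1 -> dropped
}
print(passes_all_folds(results))  # ['loan_int']
```

Every rule described below applies this same logic: the per-fold statistic changes, but survival always requires passing in all four folds.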

Figure 1: Illustrative visualization of data splitting into multiple folds for cross-validation, a key step in ensuring variable stability.
The Credit Scoring Dataset: A Foundation for Analysis
To illustrate this methodology, the study leverages the publicly available Credit Scoring Dataset from Kaggle. This dataset comprises 32,581 loan records for individual borrowers, covering a wide array of loan purposes including medical, personal, educational, and professional needs, as well as debt consolidation. Loan amounts vary significantly, ranging from $500 to $35,000.
The dataset is characterized by two primary types of variables: continuous and categorical. The target variable of interest is default, a binary indicator where 1 signifies a borrower default and 0 indicates repayment. Prior to this variable selection phase, missing values and outliers were meticulously handled in a preceding analytical stage, as detailed in a related publication. This current focus is exclusively on identifying the most stable and predictive variables.
The continuous variables identified for potential inclusion are:
- loan_amnt: The total amount of the loan.
- loan_int: The annual interest rate of the loan.
- loan_percent_income: The ratio of the loan amount to the borrower’s annual income.
- cb_person_cred_hist_length: The length of the borrower’s credit history in years.
- cb_person_default_on_file_loan: A binary indicator of whether the borrower has a history of defaulting on loans.
- loan_income_ratio: The ratio of the loan amount to the borrower’s income.
- loan_amnt_requested: The amount of money requested for the loan.
The categorical variables identified are:
- loan_grade: An indicator of the loan’s risk grade assigned by the lender.
- loan_status: The current status of the loan (e.g., fully paid, charged off).
- payment_plan: The repayment plan for the loan.
- home_ownership: The borrower’s home ownership status.
The Filter Method: A Four-Rule Framework for Variable Selection
The filter method is chosen for its efficiency, auditability, and ease of explanation to non-technical stakeholders. Unlike wrapper or embedded methods, filter methods assess the intrinsic properties of variables using statistical measures of association, independent of any specific predictive model. This makes them computationally less intensive and more transparent. The process involves applying four sequential rules, with the output of each rule serving as the input for the subsequent one.
Rule 1: Eliminating Continuous Variables Unrelated to Default
The first rule focuses on identifying continuous variables that exhibit a statistically significant association with the loan default outcome. A Kruskal-Wallis test is employed for each continuous variable against the default target across each of the four folds. This non-parametric test is suitable for comparing distributions across multiple groups.
If a continuous variable yields a p-value exceeding a 5% significance threshold in any of the four folds, it is deemed not reliably linked to default and is consequently dropped. This ensures that only variables demonstrating a consistent, statistically significant relationship with default across all data subsets are retained.
# Assuming 'folds' is a list of per-fold data splits, 'continuous_vars' is a
# list of continuous variable names, and 'default' is the binary target column
rule1_vars = filter_uncorrelated_with_target(
    folds=folds,
    variables=continuous_vars,
    target="default",
    pvalue_threshold=0.05,
)
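The helper `filter_uncorrelated_with_target` is referenced but not defined in the article. A minimal self-contained sketch of the same per-fold Kruskal-Wallis screen, assuming each fold is a pandas DataFrame with a binary `default` column, could look like this:

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def kruskal_filter(folds, variables, target="default", pvalue_threshold=0.05):
    """Keep variables whose Kruskal-Wallis p-value against the target
    stays below the threshold in every fold; one failing fold drops them."""
    kept = []
    for var in variables:
        pvalues = []
        for df in folds:
            # Compare the variable's distribution between defaulters
            # and non-defaulters within this fold.
            groups = [g[var].dropna() for _, g in df.groupby(target)]
            pvalues.append(kruskal(*groups).pvalue)
        if all(p <= pvalue_threshold for p in pvalues):
            kept.append(var)
    return kept

# Synthetic illustration: 'signal' shifts with default, 'noise' does not.
rng = np.random.default_rng(0)
folds = []
for _ in range(4):
    y = rng.integers(0, 2, 200)
    folds.append(pd.DataFrame({
        "default": y,
        "signal": y * 5 + rng.normal(0, 1, 200),
        "noise": rng.normal(0, 1, 200),
    }))
print(kruskal_filter(folds, ["signal", "noise"]))
```

Because the p-value must clear the threshold in every fold, a spuriously significant result in one split is not enough to keep a variable.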
Result of Rule 1: All identified continuous variables passed this initial scrutiny. This indicates that, within the context of this dataset and the applied cross-validation splits, every continuous variable demonstrated a statistically significant association with the loan default status across all four folds. This suggests a general relevance of these continuous features to the credit risk prediction task.

Rule 2: Identifying Weakly Associated Categorical Variables
The second rule addresses categorical variables, assessing their association with the default target using Cramér’s V. Cramér’s V is a measure of association between two categorical variables, ranging from 0 (no association) to 1 (perfect association). It is particularly useful for evaluating the strength of the relationship between a categorical predictor and a categorical outcome.
A categorical variable is eliminated if its Cramér’s V value falls below a 10% threshold in at least one fold; this threshold identifies variables with a demonstrably weak link to default. Conversely, a Cramér’s V above 50% is generally taken to indicate a strong association.
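Cramér's V itself is not implemented in the article's snippets. One standard way to compute it, shown here purely as an illustrative sketch, derives it from the chi-squared statistic of the contingency table:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series:
    sqrt(chi2 / (n * (min(rows, cols) - 1))), computed from their
    contingency table (no Yates correction, so a perfectly
    associated pair scores exactly 1)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Perfect association -> V == 1; Rule 2 would drop a predictor whose
# V against 'default' dips below 0.10 in any fold.
grade = pd.Series(["A", "A", "B", "B", "A", "B"])
default = pd.Series([0, 0, 1, 1, 0, 1])
print(round(cramers_v(grade, default), 3))  # 1.0
```

The `grade`/`default` toy data is invented for the example; in the article's pipeline the comparison would run per fold against the actual default column.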
# Assuming 'categorical_vars' is a list of categorical variable names
# and 'default' is the binary target column
rule2_vars = filter_categorical_variables(
    folds=folds,
    cat_variables=categorical_vars,
    target="default",
    low_threshold=0.10,
    high_threshold=0.50,
)
Result of Rule 2: Out of the four initial categorical variables, three were retained. The dropped variable’s association with default was too weak in at least one fold, underscoring that even among categorical features, a consistently strong association is required. (The source names the dropped variable as loan_int, although that field appears among the continuous variables above, so the label may reflect a categorical encoding of the interest rate rather than the raw rate itself.)
Rule 3: Removing Redundant Continuous Variables
Multicollinearity, the presence of highly correlated predictor variables, can destabilize predictive models by inflating standard errors and making coefficient estimates unreliable. Rule 3 addresses this by identifying and removing redundant continuous variables.
The process involves computing the Spearman correlation coefficient between every pair of continuous variables that survived Rule 1. If the absolute correlation between any two variables reaches or exceeds 60% in at least one fold, one of the variables is flagged for removal. The criterion for deciding which variable to drop is its relative importance in predicting the default outcome, measured by the lowest Kruskal-Wallis p-value (i.e., the variable more strongly associated with default is retained).
# 'rule1_vars' contains the continuous variables that passed Rule 1
selected_continuous = filter_correlated_variables_kfold(
    folds=folds,
    variables=rule1_vars,
    target="default",
    threshold=0.60,
)
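`filter_correlated_variables_kfold` is likewise not shown in the article. A single-fold sketch of the Rule 3 logic, combining a Spearman redundancy check with the Kruskal-Wallis tie-break described above (hypothetical helper, not the author's implementation), might be:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import kruskal, spearmanr

def prune_correlated(df, variables, target="default", threshold=0.60):
    """Single-fold sketch of Rule 3: when two continuous variables have
    |Spearman rho| >= threshold, drop the one with the weaker link to
    the target (i.e. the larger Kruskal-Wallis p-value)."""
    def kw_pvalue(var):
        groups = [g[var].dropna() for _, g in df.groupby(target)]
        return kruskal(*groups).pvalue

    dropped = set()
    for a, b in combinations(variables, 2):
        if a in dropped or b in dropped:
            continue
        rho, _ = spearmanr(df[a], df[b])
        if abs(rho) >= threshold:
            dropped.add(a if kw_pvalue(a) > kw_pvalue(b) else b)
    return [v for v in variables if v not in dropped]

# Synthetic fold: x2 is a noisy copy of x1, x3 is independent.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 300)
x1 = y * 3 + rng.normal(0, 1, 300)
x2 = x1 + rng.normal(0, 0.5, 300)
x3 = rng.normal(0, 1, 300)
df = pd.DataFrame({"default": y, "x1": x1, "x2": x2, "x3": x3})
print(prune_correlated(df, ["x1", "x2", "x3"]))
```

The k-fold version would apply this pairwise check in every fold and flag a pair as redundant if the 60% threshold is reached in any of them.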
Result of Rule 3: Five continuous variables were kept after this step. The variables loan_amnt and cb_person_cred_hist_length were identified as being highly correlated with other retained variables and were subsequently dropped. This finding aligns with previous analyses that identified similar redundancies within this dataset, underscoring the importance of explicitly addressing multicollinearity in a stability-focused selection process.
Rule 4: Eliminating Redundant Categorical Variables
A similar logic is applied to categorical variables to mitigate redundancy. Rule 4 assesses pairs of categorical variables that were retained after Rule 2. Cramér’s V is calculated between each pair. If the V statistic reaches or exceeds 50% in any fold, indicating a strong association between the two categorical variables, the variable with the weaker link to the default target is removed.
# 'rule2_vars' contains the categorical variables that passed Rule 2
selected_categorical = filter_correlated_categorical_variables(
    folds=folds,
    cat_variables=rule2_vars,
    target="default",
    high_threshold=0.50,
)
Result of Rule 4: Two categorical variables were retained. The variable loan_grade was dropped because it exhibited a strong correlation with another retained categorical variable and demonstrated a comparatively weaker association with the default outcome. This step ensures that the final set of categorical predictors are not only individually predictive but also contribute unique information to the model.
The Final Selection: A Robust Set of Seven Variables
Following the application of these four sequential rules, the filter method yields a final selection of seven variables: five continuous and two categorical. Each of these variables has been rigorously tested and validated for its consistent and significant association with loan default across all data folds. Furthermore, the selection process has actively worked to eliminate redundancy, ensuring that the retained variables offer unique predictive power.
This multi-stage selection process offers significant advantages beyond just predictive accuracy. It provides a high degree of auditability. Every decision made – whether to keep or drop a variable – can be traced back to specific statistical tests and thresholds applied consistently across data subsets. This transparency is invaluable in regulated industries like finance, where explanations for model decisions are often required by regulators and business stakeholders. The ability to articulate precisely why each variable was included or excluded fosters trust and facilitates easier model validation and understanding.
The robustness of this selection is underpinned by the principle that a variable must perform well on every fold. If a variable fails to meet the criteria in even one fold, it is discarded. This approach directly addresses the problem of models that perform well in development but fail in production due to reliance on spurious correlations or unstable relationships present only in the specific training data used.
Looking Ahead: Monotonicity and Temporal Stability
While this filter method ensures statistical robustness and lack of redundancy, it represents a crucial first step in building truly resilient credit scoring models. The next logical steps in model development involve examining other critical properties of the selected variables.
In subsequent analyses, the monotonicity and temporal stability of these seven selected variables will be investigated. Monotonicity refers to the property where the probability of default consistently increases (or decreases) as the value of a predictor variable changes in a specific direction. For instance, as a borrower’s debt-to-income ratio increases, their likelihood of default should ideally also increase monotonically.
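As a preview of that monotonicity check, a simple quantile-binning sketch (a hypothetical helper, not taken from the article) can surface whether the empirical default rate moves in one direction as a predictor grows:

```python
import numpy as np
import pandas as pd

def bin_default_rates(df, var, target="default", n_bins=5):
    """Default rate per quantile bin of a continuous predictor. A
    monotone sequence of bin rates is informal evidence that the
    variable's relationship with default is monotonic."""
    bins = pd.qcut(df[var], q=n_bins, duplicates="drop")
    return df.groupby(bins, observed=True)[target].mean()

# Synthetic example: default probability rises with the predictor.
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 1000)
y = (rng.random(1000) < 1 / (1 + np.exp(-2 * x))).astype(int)
df = pd.DataFrame({"loan_percent_income": x, "default": y})
rates = bin_default_rates(df, "loan_percent_income")
print(rates.is_monotonic_increasing)
```

A production check would be stricter, for example testing monotonicity of binned rates within every fold and every time period, but the intuition is the same.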
Temporal stability, on the other hand, addresses how the predictive power of a variable might change over time. A variable that is highly predictive today might become less so in the future due to shifts in economic conditions, consumer behavior, or lending practices. Ensuring temporal stability is vital for models that need to perform reliably over extended periods in a dynamic financial landscape.
Key Takeaways from the Variable Selection Process:
- Stability as a Primary Metric: The methodology prioritizes variable stability across data subsets over peak performance on a single training set.
- Stratified Cross-Validation: Employing stratified k-fold cross-validation ensures that each data subset used for testing variable relevance is representative of the entire dataset.
- Multi-Stage Filter Method: A sequential application of four distinct rules systematically eliminates irrelevant and redundant variables.
- Auditability and Interpretability: The filter method offers a transparent and easily explainable process, crucial for stakeholder buy-in and regulatory compliance.
- Robustness Against Data Shifts: Variables selected through this process are more likely to maintain their predictive power when deployed in production, mitigating the risk of model failure.
- Foundation for Further Analysis: This rigorous selection lays the groundwork for evaluating other essential model properties like monotonicity and temporal stability.
This robust variable selection framework provides a strong foundation for building credit scoring models that are not only accurate but also reliable, interpretable, and resilient in the face of evolving data and market conditions. The commitment to stability ensures that the insights derived from the data translate into actionable and dependable predictions in real-world applications.
Data & Licensing:
The credit scoring dataset utilized in this analysis is made available on Kaggle under the CC0: Public Domain dedication, which permits sharing and adaptation of the dataset for any purpose, including commercial use.
Disclaimer:
Any errors or inaccuracies remaining in this analysis are the sole responsibility of the author. Feedback and corrections are welcomed to enhance the accuracy and completeness of the presented findings.