The Art of Credit Scorecard Development: Part 2 — Scorecard Development Using the Toad Package
This part of the series will guide you through the essential steps and best practices for developing a credit scorecard using Toad. In the first part, we covered the foundational concepts and the business context. In Part 2, we shift our focus to practical implementation, specifically using the Toad package; it covers all the preliminary procedures that precede scorecard creation. Part 3 will conclude with model and scorecard development, and the GitHub link for the final notebook and dataset will be shared there.
Employing the Toad package streamlines many of the laborious and intricate tasks inherent in scorecard development, including data cleansing, variable selection, and transformation. Nevertheless, it’s noteworthy that these calculations can still be performed manually in Python. This serves two vital purposes: validating the package outputs and fostering a deeper comprehension of the individual concepts.
Introduction
Unlike most data science use cases, the development of scorecards has its own unique considerations. It requires a holistic, iterative process that integrates domain and technical expertise, regulatory compliance, and ethical considerations to build reliable and transparent models that support responsible lending practices.
As a result, most companies adhere to the following guidelines when building scorecards:
- Decisions based on credit scoring should be explainable, transparent, and ethical, e.g. features such as religion or race should not be used in the decision process.
- Data privacy and consumer protection are upheld throughout the decision process.
- A governance framework is followed, covering regulatory requirements and industry best practices.
This article will go through step-by-step industry best practices in the process of credit scorecard development.
Data Sources
Scorecards can be built from one rich data source or from a combination of two or more reliable data sources, such as payment behavior, credit history, and demographic data.
In this article, I’ll be using the publicly available Lending Club Data. There are several different versions of this dataset.
Data Exploration & Cleaning
Thorough data exploration is essential to evaluate data quality, completeness, and imbalance, as well as to identify trends crucial for informing model and scorecard development. Successful data cleaning relies on a deep understanding of the business context to rectify anomalies and inconsistencies accurately.
Data cleaning code
def clean_data(df):
    ## Columns to be dropped:
    # 'Unnamed: 0' -> index, not needed
    # 'id', 'member_id' -> not relevant in creating the scorecard
    # 'funded_amnt', 'funded_amnt_inv' -> same as loan amount; removed to avoid multicollinearity
    # 'grade' -> dropped as 'sub_grade' already captures it
    # 'title' -> dropped as the 'purpose' column will suffice
    # 'url' -> free-text attribute
    # 'policy_code' and 'application_type' -> only one unique value
    # 'last_pymnt_d', 'last_pymnt_amnt' -> not available at application time
    # 'total_pymnt', 'total_pymnt_inv' -> the same; 'total_pymnt_inv' removed
    # 'emp_title', 'pymnt_plan' -> not useful for scoring
    df.drop(['Unnamed: 0', 'id', 'member_id', 'funded_amnt', 'funded_amnt_inv',
             'grade', 'url', 'application_type', 'title', 'policy_code', 'emp_title',
             'pymnt_plan', 'last_pymnt_d', 'last_pymnt_amnt', 'total_pymnt_inv'],
            axis=1, inplace=True)
    # Drop columns with more than 30% missing values (keep columns that are at least 70% populated)
    df.dropna(axis='columns', thresh=0.7 * len(df), inplace=True)
    # Convert employment length to numeric years
    df['emp_length'] = df['emp_length'].replace({'10+ years': 10, '< 1 year': 0, '1 year': 1, '2 years': 2,
                                                 '3 years': 3, '4 years': 4, '5 years': 5, '6 years': 6,
                                                 '7 years': 7, '8 years': 8, '9 years': 9})
    # Replace missing employment length with the mode of the column
    mode_value = df['emp_length'].mode()[0]
    df['emp_length'] = df['emp_length'].fillna(mode_value)
    # Replace remaining missing numerical values with -9999
    num_cols = df.select_dtypes(include=['int64', 'float64']).columns
    df[num_cols] = df[num_cols].fillna(-9999)
    return df

df = clean_data(df)
Feature Generation
Feature generation is done to enhance the predictive power of the model. This is essentially the same process as most data science use cases.
def feat_eng(df):
    # Parse the date columns
    df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%y')
    df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%y')
    df['last_credit_pull_d'] = pd.to_datetime(df['last_credit_pull_d'], format='%b-%y')
    # Credit history tenure at the time of issue, in years and in months
    df['credit_history_tenure_years'] = (df['issue_d'] - df['earliest_cr_line']).dt.days / 365.25
    df['credit_history_tenure_months'] = (df['issue_d'].dt.to_period('M') - df['earliest_cr_line'].dt.to_period('M'))
    df['credit_history_tenure_months'] = df['credit_history_tenure_months'].apply(lambda x: x.n if x is not pd.NaT else 0)
    df['earliest_cr_line_month'] = df['earliest_cr_line'].dt.month
    # Months between the loan issue date and the last credit pull
    df['credit_pull_months'] = (df['issue_d'].dt.to_period('M') - df['last_credit_pull_d'].dt.to_period('M'))
    df['credit_pull_months'] = df['credit_pull_months'].apply(lambda x: x.n if x is not pd.NaT else 0)
    # Loan amount relative to annual income
    df['loan_to_income_ratio'] = df['loan_amnt'] / df['annual_inc']
    # Drop the raw date columns that are no longer needed ('issue_d' is kept for the sample window split)
    df.drop(['earliest_cr_line', 'last_credit_pull_d'], axis=1, inplace=True)
    return df

df = feat_eng(df)
Definition of Good and Bad Loans
Defining good and bad loans is a critical step in scorecard development, serving as the target variable for model evaluation and training. In real-world scenarios, loans may have varied statuses like current, defaulted, paid, in grace period etc. Collaborating with domain experts is crucial to accurately categorize loans as good or bad based on their status.
Dataset Loan status
The function below categorizes the statuses above into good and bad loans:
def good_bad(df):
    # Assign 0 to bad loans (default) and 1 to good loans
    # Bad loans
    bad_statuses = ['Charged Off', 'Late (31-120 days)', 'Late (16-30 days)', 'Default',
                    'Does not meet the credit policy. Status:Charged Off']
    df['loan_status'] = np.where(np.isin(df['loan_status'], bad_statuses), '0', df['loan_status'])
    # Good loans
    good_statuses = ['Fully Paid', 'Does not meet the credit policy. Status:Fully Paid']
    df['loan_status'] = np.where(np.isin(df['loan_status'], good_statuses), '1', df['loan_status'])
    ## Drop 'In Grace Period' and 'Current' because we are not sure whether these loans will be repaid or not
    values_to_drop = ['In Grace Period', 'Current']
    df = df[~df['loan_status'].isin(values_to_drop)]
    return df

df = good_bad(df)
Sample Window
The next crucial step is identifying the sample window: the time frame from which you will gather the data sample for constructing the scorecard. It needs to encompass a substantial number of both good and bad loans, and it should be neither too outdated nor too recent.
train_start_date = '2010-01-01'
train_end_date = '2013-06-30'
test_start_date = '2013-07-01'
test_end_date = '2013-12-31'
# Split the data into (training data) and (test data)
train = df[(df['issue_d'] >= train_start_date) & (df['issue_d'] <= train_end_date)]
test = df[(df['issue_d'] >= test_start_date) & (df['issue_d'] <= test_end_date)]
X_train=train.drop('loan_status',axis=1)
y_train=train['loan_status']
X_test=test.drop('loan_status',axis=1)
y_test=test['loan_status']
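As a quick sanity check on the window choice, it helps to confirm that both splits contain a substantial number of good and bad loans. A minimal sketch using the splits created above:
# Count good ('1') and bad ('0') loans in each split to confirm the window
# contains enough of both classes (a quick sanity check)
print("train:\n", y_train.value_counts())
print("test:\n", y_test.value_counts())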
Binning
Binning is the process of grouping a continuous variable into distinct bins or categories. A primary goal in scorecard development is to maximize information summarization. This is achieved by binning continuous variables and subsequently constructing a model that removes weak variables or those not aligned with sound business logic. Variables that are already categorical or binary do not require binning because they naturally represent distinct groups or states. Binning categorical variables may lead to loss of information or introduce unnecessary complexity into the modeling process.
Some common binning techniques include ChiMerge Algorithm, Decision Tree-Based Binning, Equal-Width binning, Equal-Size binning, Manual Binning, etc.
It is usually a very detailed process that will need to be iterated. Regardless of the binning technique used, a good bin should have the following characteristics:
- Each bin should contain at least 5% of the observations.
- Missing values are binned separately.
- There are no groups with 0 counts for good or bad.
- The Weight of Evidence (WoE) for non-missing values follows a coherent, monotonic pattern, progressing consistently from negative to positive values (or vice versa) without reversals, and the pattern makes sense from a business perspective.
- The bad rate and WoE differ noticeably from one group to the next, indicating that the grouping maximizes the distinction between good and bad outcomes.
import toad

#specify binning requirements
combiner = toad.transform.Combiner()
combiner.fit(X_train, y_train, method='chi', min_samples=0.05)
bins = combiner.export()

#apply the binning on both the train and test datasets
X_train = combiner.transform(X_train)
X_test = combiner.transform(X_test[X_train.columns])

#The code above uses the Combiner class from the toad library to bin the variables.
# The fit method fits the binning strategy to the training data (X_train and y_train).
# method='chi' specifies ChiMerge binning; other options include decision tree-based,
# quantile, and k-means binning.
# min_samples=0.05 specifies the minimum proportion of samples required in each bin.
# The transform method then converts the variables in the training dataset (X_train) into bins
# based on the fitted binning strategy, and the test dataset (X_test) is transformed in the same way.
Sample output
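Beyond eyeballing the sample output, the bin characteristics listed earlier (at least 5% of observations per bin, no empty good/bad groups, clearly different bad rates) can be checked directly with pandas. A minimal sketch, assuming the binned X_train and y_train from the step above; the column name is purely illustrative:
import pandas as pd

def bin_report(binned_col, target):
    """Summarise one binned feature: observation share and bad rate per bin."""
    tmp = pd.DataFrame({'bin': binned_col, 'target': target.astype(int)})
    grouped = tmp.groupby('bin')['target'].agg(total='count', goods='sum')
    grouped['bads'] = grouped['total'] - grouped['goods']
    grouped['pct_obs'] = grouped['total'] / grouped['total'].sum()   # should be >= 0.05 per bin
    grouped['bad_rate'] = grouped['bads'] / grouped['total']         # should differ clearly across bins
    return grouped

# Inspect the bins of a single feature (column name is illustrative)
bin_report(X_train['dti'], y_train)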
Benefits of Binning
- Binning simplifies the representation of continuous variables by converting them into bins, making the data easier to understand and interpret.
- Binning helps neutralize the effects of outliers as outliers are binned with the nearest group.
- Missing values are also handled by being assigned their own bin.
- Binning also helps to understand the non-linear relationships between variables and the target outcome.
WOE(Weight of Evidence)
The WOE is a statistical technique that measures the strength of each attribute, or grouped attributes, in separating good and bad accounts.
In credit scoring, the WoE indicates the predictive power of a particular bin of a variable. Negative WoE values signify that the proportion of “bads” (e.g., defaulters) outweighs the proportion of “goods” (e.g., non-defaulters) within a category or bin. Conversely, positive WoE values indicate that the proportion of “goods” outweighs the proportion of “bads.” Categories with high WoE values are considered to have strong predictive power, and vice versa.
The Weight of Evidence (WoE) should exhibit a monotonic trend, either increasing or decreasing across the bins. This characteristic enhances model stability and facilitates interpretability, strengthening the relationship between the target and predictor variables.
The Weight of Evidence (WoE) plays a crucial role in shaping the final scorecard outcome. When WoE values are closely clustered together, the corresponding points assigned in the scorecard will also exhibit minimal variation. Where the trend of the WoE is illogical from a business perspective, the bins can be manually adjusted to reflect actual behavior, as illustrated below.
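To illustrate that manual adjustment: if the automatically learned bins for a feature produce an illogical WoE trend, the cut points can be chosen by hand and re-checked. A minimal sketch using plain pd.cut; the column name and edges are purely illustrative (toad's Combiner also exposes a way to set split points manually, though the exact method may vary by version):
import numpy as np
import pandas as pd

# Manually chosen cut points for a feature whose automatic bins looked illogical
# (the column name and the edges are purely illustrative)
manual_edges = [-np.inf, 10, 20, 30, np.inf]
manual_bins = pd.cut(train['dti'], bins=manual_edges, labels=False)

# Re-check the distribution and bad rate of the hand-made bins using the bin_report helper above
bin_report(manual_bins, y_train)
With the bins settled, the WoE transformation is applied with toad's WOETransformer: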
t = toad.transform.WOETransformer()
#transform the training set
train_woe = t.fit_transform(X=X_train, y=y_train)
#transform the testing set
test_woe = t.transform(X_test[X_train.columns])
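As noted at the start of the article, it is worth validating the package output by computing WoE manually. For each bin, WoE is the natural log of the proportion of goods falling in that bin divided by the proportion of bads falling in it (this matches the sign convention described above; some references use the inverse ratio, which only flips the sign). A minimal sketch, assuming the binned X_train and y_train from the binning step; the column name is illustrative:
import numpy as np
import pandas as pd

def manual_woe(binned_col, target):
    """WoE per bin = ln(distribution of goods / distribution of bads)."""
    tmp = pd.DataFrame({'bin': binned_col, 'target': target.astype(int)})
    grouped = tmp.groupby('bin')['target'].agg(goods='sum', total='count')
    grouped['bads'] = grouped['total'] - grouped['goods']
    grouped['dist_good'] = grouped['goods'] / grouped['goods'].sum()
    grouped['dist_bad'] = grouped['bads'] / grouped['bads'].sum()
    grouped['woe'] = np.log(grouped['dist_good'] / grouped['dist_bad'])
    return grouped

# Compare against the values produced by the WOETransformer for the same feature
manual_woe(X_train['dti'], y_train)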
Information Value (IV)
This statistical technique measures a variable's predictive power in distinguishing between good and bad credit risk. It helps in selecting and ranking the most significant variables for use in a credit scoring model.
The IV value can be interpreted using the following general guidelines:
- IV < 0.02: Not Predictive
- 0.02 ≤ IV < 0.1: Weak Predictive Power
- 0.1 ≤ IV < 0.3: Medium Predictive Power
- 0.3 ≤ IV < 0.5: Strong Predictive Power
- IV ≥ 0.5: Suspicious or Overfitting
Calculation of IV
- Binning: Divide the predictor variable into bins or categories.
- Calculate Distribution: For each bin, calculate the proportion of good and bad borrowers.
- Weight of Evidence (WoE): Calculate the WoE for each bin.
- IV Calculation: Sum up the contributions of each bin to get the total IV.
IV = 0.138 + 0.102 + 0.308 = 0.548
This IV indicates very strong predictive power in distinguishing between good and bad borrowers; note, however, that by the guidelines above a value of 0.5 or more is also flagged as suspicious and worth reviewing for leakage or overfitting.
quality=toad.quality(train_woe, target=y_train,iv_only=False)
quality
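The IV reported by toad.quality can likewise be cross-checked by hand, since IV is simply the sum over bins of (distribution of goods − distribution of bads) × WoE. A minimal sketch reusing the manual_woe helper from the previous section (the column name is illustrative):
def manual_iv(binned_col, target):
    """IV = sum over bins of (dist_good - dist_bad) * WoE."""
    grouped = manual_woe(binned_col, target)
    return ((grouped['dist_good'] - grouped['dist_bad']) * grouped['woe']).sum()

# Compare with the 'iv' column reported by toad.quality
manual_iv(X_train['dti'], y_train)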
Variance Inflation Factor(VIF)
Variance Inflation Factor (VIF) is a statistical measure used to detect multicollinearity among the predictor variables in a regression model. Multicollinearity occurs when two or more predictors are highly correlated, meaning they provide redundant information about the response variable.
The higher the VIF, the higher the possibility that multicollinearity exists and the need for more investigation. When VIF is greater than 10, this is an indication of significant multicollinearity that needs to be corrected.
# VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def calculate_vif(X):
    # Add an intercept column if the DataFrame does not already contain one
    if 'const' not in X.columns:
        X_with_const = add_constant(X)
    else:
        X_with_const = X.copy()  # copy to avoid modifying the original DataFrame
    vif_values = [variance_inflation_factor(X_with_const.values, i) for i in range(X_with_const.shape[1])]
    return pd.DataFrame({'Feature': X_with_const.columns, 'VIF': vif_values})

# train_woe contains the WoE-transformed predictor variables
vif_df = calculate_vif(train_woe)
vif_df
Feature Selection based on IV and VIF
Following the IV and VIF rules above, we will remove any features with a VIF greater than 10, an IV less than 0.02, or an IV greater than 0.5.
#toad.quality returns the features as the index, so expose them as a 'Feature' column before merging with the VIF table
quality = quality.reset_index().rename(columns={'index': 'Feature'})
combined = pd.merge(quality, vif_df, on='Feature')
combined.sort_values(by='iv', ascending=True)
#features violating any of the thresholds above
filtered_columns = combined[(combined['VIF'] > 10) | (combined['iv'] < 0.02) | (combined['iv'] > 0.5)]
columns_to_drop = filtered_columns['Feature'].tolist()
X_train.drop(columns=columns_to_drop,inplace=True)
X_test.drop(columns=columns_to_drop,inplace=True)
Recompute the WOE
We then need to recompute the WoE so that the remaining features are properly adjusted to reflect their updated distributions and relationships with the target variable.
## Recompute the woe
t=toad.transform.WOETransformer()
#transform training set
train_woe = t.fit_transform(X=X_train,y=y_train)
#transform testing set
test_woe = t.transform(X_test[X_train.columns])
Recompute the IV and VIF
quality=toad.quality(train_woe, target=y_train,iv_only=False)
quality.reset_index(inplace=True)
quality.rename(columns={'index': 'Feature'}, inplace=True)
# Recompute VIF, reusing the calculate_vif function defined earlier
vif_df = calculate_vif(train_woe)
combined=pd.merge(quality, vif_df, on='Feature')
combined.sort_values(by='iv', ascending=True)
PSI (Population Stability Index)
Population Stability Index (PSI) is typically used in scorecard development and monitoring to measure the stability of the score distribution over time. That is, it answers the question: if the base year is 2020 and the comparison year is 2021, how did the distribution of credit scores behave? Did it change or remain the same?
PSI is primarily used after the scorecard has been deployed to monitor its performance over time. It is calculated periodically (e.g., monthly, quarterly) to compare the distribution of scores in the current period with the distribution in a baseline period (usually the development or initial validation period). However, it can be useful to calculate PSI between the training and validation datasets to ensure they are similar in distribution. This helps to check if the model built on the training data is applicable to the validation data.
PSI Calculation
To calculate PSI, divide the scores (or a variable) into bands, compute the proportion of observations falling in each band for both the baseline (expected) and comparison (actual) periods, and sum (actual% − expected%) × ln(actual% / expected%) across the bands. For example, if a band (say, $20,000–$30,000) contains 100 of 1,000 observations, its proportion is 100/1000 = 0.1.
Interpretation:
- PSI < 0.1: No significant change.
- 0.1 ≤ PSI < 0.25: Moderate change, potential need for review.
- PSI ≥ 0.25: Significant change in the population distribution, likely need for model recalibration or redevelopment.
psi_df = toad.metrics.PSI(train_woe, test_woe).sort_values()
psi_df = psi_df.reset_index()
psi_df = psi_df.rename(columns = {'index': 'feature', 0: 'psi'})
psi_df
#A common industry practice is to drop features with a PSI greater than 0.2
col_keep = list(set(psi_df[psi_df.psi < 0.2].feature))
train_psi = train_woe[col_keep]
print("keep:", train_psi.shape[1])
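To cross-check the package output, the PSI of a single feature can also be computed by hand from the formula above. A minimal sketch, assuming the train_woe and test_woe frames from the previous steps; the column name and number of bands are illustrative:
import numpy as np

def manual_psi(expected, actual, n_bins=10):
    """PSI = sum over bands of (actual% - expected%) * ln(actual% / expected%)."""
    # Band edges are taken from the expected (baseline) distribution
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # avoid division by zero / log(0) for empty bands
    return np.sum((actual_pct - expected_pct) * np.log((actual_pct + eps) / (expected_pct + eps)))

# Compare with the value reported by toad.metrics.PSI for the same feature
manual_psi(train_woe['dti'], test_woe['dti'])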
Final check to remove highly correlated features
train_psi, drop_lst = toad.selection.select(train_psi,
                                            y_train,
                                            #empty=0.7,
                                            #iv=0.02,
                                            corr=0.7,
                                            return_drop=True)
print("keep:", train_psi.shape[1],
      "drop corr:", len(drop_lst['corr']))
This is the end of Part 2. We ensured that the data is properly prepared and the right features are selected, setting up model development for more accurate results and an interpretable scorecard. As previously mentioned, the final notebook will be shared in Part 3 of this series.