Tuning / regularizing common linear regressions and classifiers in scikit-learn

If you read my article in January about my personal development goals, you might have seen that I’m working to achieve the DP-200 certification. During the learning process for DP-200, I learned that I lacked certain basic knowledge about how to do data engineering / data science in Python.

For that reason, I decided to pursue a course on Coursera to understand the basics of doing data science in Python. The course is really well laid out: it has weekly lectures and assignments. I’ve learnt a ton already during the lectures and the assignments.

This week’s course introduced multiple supervised machine learning models. Some of those models have tuning / regularization parameters to avoid overfitting. Overfitting is a phenomenon in machine learning that occurs when a trained model fits the training data very well, but doesn’t generalize well to data the model hasn’t seen before. The factors/regularization parameters in this blog post can help avoid overfitting models to the data.

For my own benefit, I decided to make some notes about these factors, and share these here. If you want more background on any of the models, please refer to the course on Coursera.

While writing this blog post, I created a Jupyter notebook to work on my understanding of the different models/factors/parameters, with some examples. You can access it on Azure Notebooks and clone it for yourself, or download it from GitHub.

Note: If you see anything that you think isn’t 100% accurate, please let me know. I’m a true novice in all of this, and I appreciate all the feedback I can get.

Models and factors

In this post, we’ll discuss the following models and their factors:

KNN

A KNN model has a single tuning factor, n_neighbors: the number of nearest neighbors used to make a prediction (see the sketch after this list).

  • If 1: perfect fit for the training data, doesn’t generalize well
  • If greater than 1 and smaller than n_trainingset: imperfect fit for the training data, generalizes better, likely better results on the test dataset
  • If close or equal to n_trainingset: underfit; every prediction becomes the most common class in the training dataset (or the average, for regression)
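
To make this concrete, here’s a minimal sketch of the effect of n_neighbors, assuming scikit-learn and a small synthetic dataset from make_classification (not the data from the course notebook):

    # Minimal sketch: effect of n_neighbors on training vs. test accuracy
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k=1 memorizes the training data; k close to the training set size underfits
    for k in [1, 5, 25, len(X_train)]:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))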

Ridge regression

Explanation: A ridge regression is a linear regression that adds a penalty for large w parameters. The penalty term is the sum of the squared w factors, scaled by α, which makes α an L2 regularization factor (a short sketch follows the list below).

Regularization factor: α

  • If α = 0: regular least squares linear regression
  • Low values of α: the model overfits. Good results for the training data, worse for the test data.
  • Growing α: simpler model. The higher α gets, the more regularization occurs, bringing the w factors closer to zero. There is likely an optimum point for the best test dataset performance.
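
A minimal sketch of this behaviour, assuming scikit-learn and a synthetic dataset from make_regression (the exact numbers are illustrative only): the total magnitude of the w coefficients shrinks as α grows.

    # Minimal sketch: Ridge coefficients shrink towards zero as alpha grows
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=0)

    for alpha in [0.01, 1, 10, 100]:
        ridge = Ridge(alpha=alpha).fit(X, y)
        print(alpha, np.abs(ridge.coef_).sum())  # sum of |w| drops as alpha grows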

Lasso regression

Explanation: A lasso regression is a linear regression that adds a penalty for large w parameters. Compared to Ridge, Lasso uses the absolute value of the coefficients. This has the effect that a lot of the w factors will become exactly zero. α in Lasso is an L1 regularization factor (see the sketch after the list below).

Regularization factor: α

  • If α = 0: regular least squares linear regression
  • Low values of α: the model overfits. Good results for the training data, worse for the test data.
  • Growing α: simpler model. The higher α gets, the more regularization occurs, making more w factors become zero. There is likely an optimum point for the best test dataset performance.
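
A minimal sketch of the L1 effect, again assuming scikit-learn and a synthetic dataset from make_regression: the number of non-zero w coefficients drops as α grows.

    # Minimal sketch: Lasso drives more coefficients to exactly zero as alpha grows
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=10, random_state=0)

    for alpha in [0.01, 1, 10]:
        lasso = Lasso(alpha=alpha).fit(X, y)
        print(alpha, np.sum(lasso.coef_ != 0))  # count of non-zero coefficients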

Polynomial regression

A polynomial regression transforms x into x′, with x′ containing all polynomial combinations of the original features up to a given degree. For example, with two features and degree 2:

x = (x₁, x₂) ⇒ x′ = (x₁, x₂, x₁², x₁x₂, x₂²)

ŷ = ŵ₁x₁ + ŵ₂x₂ + ŵ₁₁x₁² + ŵ₁₂x₁x₂ + ŵ₂₂x₂²

The tuning factor of a polynomial regression is the degree. The degree controls how many original variables (xᵢ) can be multiplied together into a new feature. A polynomial regression isn’t different from any other regression, since you update the feature space, not the regression algorithm itself. This means you can still apply Ridge/Lasso regularization after applying the polynomial transformation, as shown in the sketch below.
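
Here’s a minimal sketch of that idea, assuming scikit-learn and a synthetic dataset: PolynomialFeatures does the feature expansion, and a Ridge regression (with its own α) is fitted on the expanded features.

    # Minimal sketch: polynomial feature expansion followed by a regularized (Ridge) fit
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=0)

    # degree controls which polynomial combinations of the original features are added
    model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
    model.fit(X, y)
    print(model.score(X, y))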

Logistic regression

First up: logistic regression isn’t actually a regression model, but a classification model. It takes input features and outputs a number between 0 and 1 that can be interpreted as the probability of a certain outcome.

Regularization factor = C.

By default, this factor applies an L2 penalty, like a Ridge model (meaning it uses the squared factors wᵢ). Note that in scikit-learn, C is the inverse of the regularization strength, so it works the opposite way of α (see the sketch after the list below).

  • If C is very close to zero: regularization is very strong, and the w factors are pushed towards zero. This can make the model too simple, so it behaves poorly on both the training and the test dataset.
  • Increasing C weakens the regularization and lets the w factors grow, which can improve the fit but might eventually lead to overfitting. An optimum likely exists between 0 and ∞.
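
A minimal sketch of the effect of C, assuming scikit-learn and a synthetic dataset from make_classification: the coefficients grow as C increases.

    # Minimal sketch: larger C = weaker regularization = larger logistic regression coefficients
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    for C in [0.01, 1, 100]:
        clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
        print(C, np.abs(clf.coef_).sum())  # sum of |w| grows with C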

Linear support vector machine

An LSVM is a classifier. It fits a straight line (a hyperplane) through the dataset to separate it into two groups. A key property of an LSVM is the margin it uses when calculating the best-fitting classifier: how much distance we tolerate between the decision boundary and the points closest to it. A larger margin means that points close to the boundary can be ‘ignored’ in favor of better generalization.

Regularization factor: C.

Counter-intuitive: a larger C means less regularization, meaning more chance of overfitting. A smaller C means more regularization.

C is correlated with the margin. A larger value of C tries to fit the training data as well as possible, meaning the margin will be smaller. If C is smaller, the model will try to generalize better, meaning a larger margin is tolerated.
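
A minimal sketch of an LSVM with different values of C, assuming scikit-learn’s LinearSVC and a synthetic dataset:

    # Minimal sketch: LinearSVC with different C values (larger C = less regularization)
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for C in [0.01, 1, 100]:
        svm = LinearSVC(C=C, max_iter=10000).fit(X_train, y_train)
        print(C, svm.score(X_train, y_train), svm.score(X_test, y_test))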

Radial Basis Function Kernel Support Vector Machine

A kernel SVM is an SVM that applies a transformation to the input data. One of those transformations is the Radial Basis Function (RBF).

RBF(x, x′) = exp(−γ · d(x, x′)²), with d(x, x′) being the Euclidean distance between x and x′.

Parameters: γ and C.

C is the same as above, namely the regularization factor.

  • A bigger C means less regularization.
  • A smaller C means more regularization.

γ is the parameter of the RBF that determines the effect of a single training point on the points around it.

  • A higher gamma means less influence on farther points.
  • A lower gamma means more influence on farther points.

In other words (see also the sketch below):

  • A higher gamma means tighter decision boundaries.
  • A lower gamma means broader decision boundaries.
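
A minimal sketch that varies γ and C together, assuming scikit-learn’s SVC with the RBF kernel and a synthetic dataset:

    # Minimal sketch: RBF-kernel SVM, varying gamma and C together
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for gamma in [0.01, 1, 10]:
        for C in [0.1, 1, 100]:
            svc = SVC(kernel='rbf', gamma=gamma, C=C).fit(X_train, y_train)
            print(gamma, C, svc.score(X_test, y_test))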

A good example of the relationship between C and γ can be found here: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

Decision trees

A decision tree is another classifier model. It tries to learn repeatable steps (rules) to navigate through a dataset.

There are three main parameters to tune a decision tree (see the sketch after this list):

  • max_depth: Controls the maximum depth of the tree.
  • min_samples_leaf: The minimum number of samples required in each leaf.
  • max_leaf_nodes: The maximum number of leaf nodes in the tree.
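
A minimal sketch of a restricted versus an unrestricted tree, assuming scikit-learn and a synthetic dataset:

    # Minimal sketch: restricting a decision tree with max_depth, min_samples_leaf, max_leaf_nodes
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    restricted = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                        max_leaf_nodes=10, random_state=0).fit(X_train, y_train)

    print('unrestricted:', unrestricted.score(X_train, y_train), unrestricted.score(X_test, y_test))
    print('restricted:  ', restricted.score(X_train, y_train), restricted.score(X_test, y_test))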

Summary

This concludes some of the factors/regularization parameters of common classification and regression models. Let this article be a reference when you’re building ML models: tune these factors to make sure your models do not overfit the training data.
