Underfitting and Overfitting in Machine Learning
When a model fits the input dataset properly, the resulting machine learning application performs well and predicts relevant output with good accuracy. In practice, however, many machine learning applications do not perform well, and there are two main causes of this: underfitting and overfitting. We will look at both situations in detail in this post.
Overfitting
Let us understand overfitting from a supervised machine learning algorithm's perspective. A supervised algorithm's sole purpose is to generalize well on never-before-seen data: the ability of the model to produce relevant output for inputs it was not trained on.
Consider the below set of points, to which we want to fit a Linear Regression model:
The aim of Linear Regression is to find a straight line that fits, or captures, most of the data points present in the dataset.
It looks like the model has captured all the data points and has learnt them well. But now consider a new point being exposed to this model. Since the model has learnt the training data too well, it is unable to capture this new data point and generalize to it.
With respect to a Linear Regression algorithm, when the algorithm is fed the input dataset, the general idea is to reduce the overall cost, which is the distance between the fitted straight line and the input data points. The cost keeps shrinking as the number of training iterations increases. If the number of iterations is too high, though, the model learns the training data too well: it ends up learning the noise present in the dataset (which in reality should be skipped) and therefore cannot generalize.
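To make this concrete, here is a minimal sketch of the effect (assuming numpy and scikit-learn are available, which the post itself does not require). A degree-15 polynomial has enough freedom to chase every training point, so its training error collapses while its error on held-out points blows up; a straight line behaves far more sensibly:
<pre>import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying curve
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

# Hold out every second point; the model never sees these during training
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")</pre>
Typically the degree-15 model reports a near-zero training error but a far larger test error than the straight line: it has memorised the noise.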
Note: Training can be stopped at a certain point in time, before the model starts fitting the noise, depending on certain conditions being met, for example when the error on a held-out validation set stops improving. A sketch of this idea follows.
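As a rough, hedged illustration of that note (again assuming numpy and scikit-learn, which the post does not name), the loop below trains a model one epoch at a time and stops once the validation error has failed to improve for a few consecutive epochs:
<pre>import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

model = SGDRegressor(learning_rate='constant', eta0=0.01)
best_loss, stale, patience = float('inf'), 0, 5

for epoch in range(1000):
    model.partial_fit(X_train, y_train)  # one pass over the training data
    loss = mean_squared_error(y_val, model.predict(X_val))
    if loss < best_loss - 1e-6:
        best_loss, stale = loss, 0       # still improving; keep going
    else:
        stale += 1
        if stale >= patience:
            break                        # validation loss has plateaued
print(f"stopped after {epoch + 1} epochs, best validation MSE={best_loss:.4f}")</pre>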
This failure mode is known as 'overfitting'. The model overfits the training data and hence doesn't generalize well on newly encountered data.
Underfitting
This is the opposite of overfitting. The aim of a machine learning algorithm is to generalize well without learning the training data too closely, but it is equally essential that the model does not learn too little, in which case it fails to capture the essential patterns in the data and cannot predict or produce sensible output for new data points.
Note: If model training is stopped prematurely, the model may not be trained on the data sufficiently, so it cannot capture the vital patterns in the data; this leads to underfitting, and the model will not produce satisfactory results.
Consider the below image, which shows what underfitting looks like visually:
The dashed blue line is the model that underfits the data. The black parabola is the curve that fits the data points well.
The consequence of underfitting is that the model cannot generalize to newly seen data, which leads to unreliable predictions.
Underfitting and overfitting are equally bad: the model needs to fit the data just right. In practice that sweet spot is often located by comparing models of different complexity on held-out data, as sketched below.
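As an illustrative sketch of finding the 'just right' fit (same assumed libraries as above), the loop below fits polynomials of increasing degree to noisy quadratic data and scores each on a validation split. Too low a degree underfits, too high a degree overfits, and the validation error typically points at the middle ground:
<pre>import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# The true relationship is quadratic, plus noise
X = np.sort(rng.uniform(-1, 1, 60)).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() ** 2 + rng.normal(0, 0.1, 60)

X_train, X_val = X[::2], X[1::2]
y_train, y_val = y[::2], y[1::2]

errors = {}
for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    errors[degree] = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  validation MSE={errors[degree]:.4f}")

print("best degree:", min(errors, key=errors.get))  # typically 2 here</pre>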
Data Loading for ML Projects
The input data to a learning algorithm usually has a row × column structure and is usually a CSV file. CSV stands for comma-separated values, a simple file format for storing tabular data. A CSV file can be easily loaded into a Pandas DataFrame with the help of the read_csv function, and it can be loaded using other libraries as well; we will look at a few approaches in this post.
Let us now load CSV files using a few different methods:
Using the Python standard library
The built-in 'csv' module provides a reader function that can be used to read the data present in a CSV file: open the file in read mode and pass the file object to csv.reader. Below is an example demonstrating this:
<pre>import csv

import numpy as np

# Replace with the path to your CSV file
path = 'path/to/data.csv'

with open(path, 'r') as infile:
    reader = csv.reader(infile, delimiter=',')
    headers = next(reader)  # the first row holds the column names
    data = list(reader)     # the remaining rows, as lists of strings

# Convert the string values into a numeric NumPy array
data = np.array(data).astype(float)</pre>
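Using Pandas
As mentioned above, Pandas can load the same file in a single call. This is a minimal sketch, assuming the pandas library is installed and using a placeholder path:
<pre>import pandas as pd

# Replace with the path to your CSV file
path = 'path/to/data.csv'

# read_csv infers the header row and the column types automatically
df = pd.read_csv(path)

print(df.head())      # inspect the first five rows
data = df.to_numpy()  # convert to a NumPy array if needed</pre>
Using NumPy
NumPy can also read numeric CSV data directly, which is another of the approaches alluded to above (again a sketch with a placeholder path; loadtxt expects every value after the header to be numeric):
<pre>import numpy as np

# Replace with the path to your CSV file
path = 'path/to/data.csv'

# skiprows=1 skips the header row; delimiter=',' splits on commas
data = np.loadtxt(path, delimiter=',', skiprows=1)</pre>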