Machine Learning Tutorial Part 3: Under & Overfitting + Data Intro


Underfitting and Overfitting in Machine Learning

When a model fits the input dataset properly, the machine learning application performs well and predicts relevant output with good accuracy. In practice, however, many machine learning applications do not perform well, and there are two main causes of this: underfitting and overfitting. We will look at both situations in detail in this post.

Overfitting 

Let us understand overfitting from a supervised machine learning algorithm's perspective. A supervised algorithm's sole purpose is to generalize well on never-before-seen data, i.e. to produce relevant output for inputs it was not trained on.

Consider the below set of points to which a Linear Regression model is to be fit:

The aim of Linear Regression is to find a straight line that fits, or captures, most of the data points present in the dataset.

It looks like the model has been able to capture all the data points and has learnt well. But now consider a new point being exposed to this model. Because the model has learnt the training data too well, it wouldn't be able to capture this new data point and generalize to it.

With respect to a Linear Regression algorithm, when the algorithm is fed the input dataset, the general idea is to reduce the overall cost (a measure of the distance between the generated straight line and the input data points). The cost keeps dropping as the number of training iterations increases. If the number of iterations is too high, though, the model learns the training data too well, including the noise present in the dataset (which in reality should be skipped), and as a result it can't generalize.
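To make this concrete, below is a minimal sketch of how iterative training drives the cost down; the synthetic data and the learning rate are made up purely for illustration:

<pre>import numpy as np

# Synthetic, noisy line y = 3x + 2 (values chosen purely for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, 50)

w, b, lr = 0.0, 0.0, 0.01            # slope, intercept, learning rate
for i in range(1001):
    pred = w * x + b
    cost = np.mean((pred - y) ** 2)  # mean squared error
    if i % 200 == 0:
        print(f"iteration {i}: cost {cost:.3f}")
    # one gradient descent step on the MSE cost
    w -= lr * np.mean(2 * (pred - y) * x)
    b -= lr * np.mean(2 * (pred - y))</pre>

Run long enough on a flexible enough model, this same mechanism starts fitting the noise as well, which is exactly the overfitting problem described above.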

Note: Model training can be stopped at a certain point in time, depending on certain conditions being met (a practice known as early stopping).

This phenomenon is known as 'overfitting': the model overfits the data and hence doesn't generalize well on newly encountered data.

Underfitting 

This is the opposite of overfitting. The aim of a machine learning algorithm is to generalize well without learning too much, but it is equally essential that the model doesn't learn too little, because then it would fail to capture the essential patterns in the data and wouldn't be able to predict or produce output for new data points.

Note: If model training is stopped prematurely, or the model is not trained on sufficient data, it could lead to underfitting: the model fails to capture the vital patterns in the data and cannot produce satisfactory results.

Consider the below image which shows how underfitting looks visually: 

The dashed blue line is the model that underfits the data, while the black parabola represents a model that fits the data points well.

The consequence of underfitting is that the model is unable to generalize on newly seen data, which leads to unreliable predictions.

Underfitting and overfitting are equally bad: the model needs to fit the data just right.
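As a concrete illustration, the sketch below (using scikit-learn, with synthetic cosine-shaped data and arbitrarily chosen polynomial degrees) compares an underfit, a reasonable fit and an overfit by varying model complexity; the cross-validated error is worst at the two extremes:

<pre>import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a cosine curve plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30).reshape(-1, 1)
y = np.cos(1.5 * np.pi * X).ravel() + rng.normal(0, 0.1, 30)

for degree in (1, 4, 15):   # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: cross-validated MSE {mse:.4f}")</pre>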


Data Loading for ML Projects

The input data to a learning algorithm usually has a row x column structure and is often stored as a CSV file. CSV stands for comma-separated values, a simple file format for storing tabular data. A CSV file can easily be loaded into a Pandas dataframe with the help of the read_csv function, but it can be loaded using other libraries as well, and we will look at a few approaches in this post.

Let us now look at a few different methods of loading CSV files:

Using Python standard library 

Python's standard library includes a 'csv' module that contains a reader function, which can be used to read the data present in a CSV file: the file is opened in read mode and the file object is passed to reader. Below is an example demonstrating this:

<pre>import numpy as np
import csv

path = "path to csv file"  # placeholder path

with open(path, 'r') as infile:
    reader = csv.reader(infile, delimiter=',')
    headers = next(reader)   # the first row holds the column names
    data = list(reader)

data = np.array(data).astype(float)</pre>

 

The headers or the column names can be printed using the following line of code:

<pre>print(headers)</pre>

 

The dimensions of the dataset can be determined using the shape attribute as shown in the following line of code:

<pre>print(data.shape)</pre>

Output:

<pre>(250, 302)</pre>

 

The nature of the data can be determined by examining the first few rows of the dataset using the below line of code:

<pre>data[:2]</pre>

 

Using numpy package

The numpy package has a function named 'loadtxt' that can be used to read CSV-style data. Below is an example demonstrating it on in-memory data using StringIO.

 

<pre>from numpy import loadtxt
from io import StringIO

c = StringIO("0 1 2\n3 4 5")  # in-memory, whitespace-separated data
data = loadtxt(c)
print(data.shape)</pre>

 

Output:

<pre>(2, 3)</pre>
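Note that loadtxt splits on whitespace by default, which is why the example above works without a delimiter. For a file that is genuinely comma-separated, the delimiter must be passed explicitly; a minimal sketch (the file path is a placeholder):

<pre>data = loadtxt("path to csv file", delimiter=",", skiprows=1)  # skiprows=1 skips a header row</pre>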

 

Using pandas package 

There are a few things to keep in mind while dealing with CSV files using the Pandas package (the relevant options are sketched after the example below):

  • The file header is the row of column names, describing the type of data each column holds. If the file already has a header row, read_csv picks the column names up from it automatically; otherwise, every column needs to be named manually. 
  • In either case, it is good practice to state explicitly, via read_csv's header parameter, whether or not the CSV file contains a header. 
  • Comment lines in a CSV file are often marked with the # symbol, and read_csv can be told to skip them via its comment parameter. 

Let us look at an example to understand how the CSV file is read as a dataframe. 

<pre>import numpy as np
import pandas as pd

# Obtain the dataset
df = pd.read_csv("path to csv file", sep=",")
df[:5]</pre>

Output:

<pre>   target      0      1      2  ...    295    296    297    298    299
0     1.0 -0.098  2.165  0.681  ... -2.097  1.051 -0.414  1.038 -1.065
1     0.0  1.081 -0.973 -0.383  ... -1.624 -0.458 -1.099 -0.936  0.973
2     1.0 -0.523 -0.089 -0.348  ... -1.165 -1.544  0.004  0.800 -1.211
3     1.0  0.067 -0.021  0.392  ...  0.467 -0.562 -0.254 -0.533  0.238
4     1.0  2.347 -0.831  0.511  ...  1.378  1.246  1.478  0.428  0.253

[5 rows x 302 columns]</pre>
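For completeness, here is a sketch of the header- and comment-related options mentioned above, for a hypothetical file that has no header row and uses '#' for comment lines (the path and column names are placeholders):

<pre>df = pd.read_csv("path to csv file",
                 header=None,                    # the file has no header row
                 names=["target", "f1", "f2"],   # assumed column names
                 comment="#")                    # skip the rest of any line after '#'</pre>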


Introduction to Data in Machine Learning

What is data? 

It is the unprocessed, raw facts that can be extracted from various sources. Data is generated every millisecond, and most of it is unstructured, meaning it doesn't have a specific format. This is why many machine learning algorithms don't give great results even when a large amount of data is fed as input: the data is not in the right format, and unstructured data is difficult to process and consume.

What is information? 

It is the processed form of data, i.e. data that has been cleaned and made sense of. Information gives users meaningful insights about specific aspects of a problem.

Data in machine learning 

Data in machine learning often arrives in the form of text that needs to be converted to numbers, since it is difficult for machines to infer from raw text. Input data to learning algorithms usually has a tabular structure consisting of rows and columns: the columns carry the feature names, and each row holds one observation of every feature.
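As a small illustration of converting text to numbers, the sketch below (with a made-up 'color' feature) shows two common approaches in pandas, integer category codes and one-hot encoding:

<pre>import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Option 1: one integer code per category
df["color_code"] = df["color"].astype("category").cat.codes

# Option 2: one-hot encoding, one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(df.join(one_hot))</pre>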

Data is split into different sets so that one part of the dataset can be trained on, one part used for validation and one part used for testing; a typical split is sketched after the list below.

  • Training data: This is the input dataset fed to the learning algorithm, once it has been pre-processed and cleaned. Predefined datasets are readily available on many websites and can be downloaded and used; some of them still need to be cleaned and verified, while others come cleaned beforehand. The machine learning model learns from this data and tries to fit itself to it. 
  • Validation data: This is similar to the test set, but it is used on the model frequently during development to gauge how well the model performs on never-before-seen data. Based on the results obtained on the validation set, decisions can be made about how to make the algorithm learn better: the hyperparameters can be tweaked so that the model gives better results on the validation set in the next run, or features can be combined or new features created that better describe the data, thereby yielding better results. 
  • Test data: This is the data on which the model's performance, i.e. its ability to generalize, is judged. In the end, the model's performance is determined by how well it reacts to never-before-seen data; the test set is how we know whether the model actually understood and learnt the patterns, or merely overfit or underfit the data. 
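A typical way to produce these three sets with scikit-learn is sketched below; X and y stand for the features and labels, and the 60/20/20 ratio is a common choice rather than a rule:

<pre>from sklearn.model_selection import train_test_split

# First carve out the test set, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%</pre>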

It is important to understand that good-quality data (little to no noise, redundancy or discrepancies) in large amounts yields great results when the right learning algorithm is applied to it.


The LenovoPRO Community has partnered with KnowledgeHut to provide valuable learning resources to our members, including the following tutorial series on Machine Learning!

These chapters were originally featured on Knowledgehut at the following URL:

If you’re looking to upskill to advance your career, visit Knowledgehut today – plus, LenovoPRO Community members get 20% off your first course on the site (use promo code LENOVO)


For more chapters of the LenovoPRO Community Machine Learning Tutorial provided by Knowledgehut, click here.

What are your thoughts on machine learning? Do you have any familiarity with it when it comes to your business practices?

Leave your comments below to kick off the conversation!
