Skip to main content

Linear Regression from Scratch

Hello World, This is Saumya, and I am here to help you understand and implement Linear Regression from scratch without any libraries. Here, I will implement this code in Python, but you can implement the algorithm in any other programming language of your choice just by basically developing 4-5 simple functions.

So now, first of all, what exactly is Linear Regression?
As discussed in my blog on Machine Learning, Linear Regression is used to identify linear relationships between the input features x(i) and the output labels of the training set y(i) and thus form a function F(x(i),θ), which would help in predicting future values.

This function, is called hypothesis and is usually denoted by h(x(i),θ). Note that, x(lowercase) is used to denote a single training example as a whole, where as we use X (i,j) is used to point the jth feature for the ith training example. But confusing?? Let's simplify it!!

As shown, to show the whole feature set for a single example, we use x(1). We can also point to the first feature in third example in training set as X(3,1) or x1(3).

For simplicity of understanding, let's assume, there is only one feature in our dataset of 10 examples. Let's plot it on a graph.

So, let's say, the equation of the line we are supposed to fit the curve is h(x(i),θ)= θ0 + x1θ1.
Or for simplicity, let's say we have one more feature x0 which is always equal to 1 so that we can rewrite hθ (x(i))= θ0 x01 x1.
Or for multiple features, we can write it as h(x(i),θ)= θ0 x0+ θ1 x1+ θ2 x2+ θ3x3…… θn xn for n number of features.
Therefore, in short, hθ(x(i))= i=1n∑  (θi*xi).

Now, the code for our hypothesis will be as follows:
def HypoThe(Theta,xi):
        for i in range(len(xi)):
        return sum
        return False

Now, since we have defined a mathematical model for prediction, the next problem we face will be finding the rest set of parameters Theta(θ).

To find Theta, let's break it into basic steps:
1. Assume a pair of Theta.
2. Compute the prediction error.
3. Assume again so that it minimizes the Prediction Error.

Sounds complicated? Well it's quite simple actually. Suppose we have selected some particular values for parameter θ. Now, we will devise a new cost function, which basically calculates the prediction error for θ.

J(θ)= i=1m( hθ(x(i))-y(i))2/ (2*m)

What is means is that, we predict the output using the hθ(x(i)) function, and subtract it from actual label y(i), and then square it and add it for all training examples(1….m). We average it out using (2*m). Why did we use 2*m instead of m, we'll find it out soon. Let’s just say, it's for the simplicity, and also because we are dealing with a mean-squared computed error function which we are supposed to minimize.

Although, cost function is useful only for debugging purposes, and not in actual implementation, we'll still code it out.

def RegCostFunc(Theta,X,Y):
    for i in range(len(X)):
        sum1+=((HypoThe(Theta, X[i])-Y[i])**2)
    return J/(2*len(X))

Now, what we are supposed to do next is find those value for Theta for which J(θ) reaches it minimum. If you are familiar with calculus, what we're supposed to do is to find a the minima for the J(θ) function. To solve this optimization problem, we'll be using the gradient descent algorithm.

So what exactly does it do?
Let's write down the algorithm first.
repeat until convergence: {

for all values of θ­ simultaneously update,
 θj= θj - α * (Ə J(θ)/ Əθj)

Let's understand this equation first.
(Ə J(θ)/ Əθj) means that take partial derivative of J(θ) w.r.t θj while rest of the θ are constant.
Ə J(θ)/ Əθj= Ə (  i=1m∑( hθ(x(i))-y(i))2/ (2*m) )  / Əθj.

If you solve this using calculus, we find that,
Ə J(θ)/ Əθj=  i=1m∑( hθ(x(i))-y(i))2 * xj(i)/ m

This basically means, while keeping the rest constant, decrease the value of θj in the direction of the slope of the function J(θ).

α is the learning rate by which, we reduce the value of the θj. If α is too large, the function will never converge actually.

Note that, do not update θj continuously, but update the whole θ simultaneously.
i.e. after one complete loop.

Rewriting the Algorithm, we get.
repeat until convergence: {

for all values of θ­ simultaneously update,
 θj= θj - α * ( i=1m∑( hθ(x(i))-y(i))2 * xj(i)/ m )

Let's code this now,
def GradTerm(X,Y,Theta,i):
    for j in range(len(X)):
    return sum1

Note that, here, I have divided the Gradient Descent Algorithm from the Gradient Term above, for the simplicity of the understanding of our code.

def GradDesc(Theta,alpha,Xfeature,Ylabels):
    for i in range(0,len(Theta)):
        Theta_.append(Theta(i)-(alpha* GradTerm(Xfeature,Ylabels,Theta,i)/len(Xfeature)))
    return Theta_

Now we are done with everything required. So how to Run all these functions for proper functioning?

Basically, here is how we'll proceed:

1. We'll call linear regression and provide it with a Features and Labels.
2. We'll add feature x0 = 1 to every example in data set.
3. We'll initialize Theta=[0]*len(Features[0]). (i.e.  number of parameters should be equal to number of features )
4. We'll select a particular value of alpha and no. of  iterations, and run gradient descent algorithm to minimize the parameters θ.
Make sure to keep alpha as small as possible in the order of 0.000001 or less and the number of iterations to be around 100,000-1,000,000 for better results.

def LinearRegression(Xfeature,Ylabels,alpha,iterations):
    if len(Xfeature)!=len(Ylabels):
        print("Missing Data");
        return False
        for i in range(len(Xfeature)):
        for i in range(iterations):
            print("\nIteration Number ",i)
            Theta=GradDesc(Theta, alpha, Xfeature, Ylabels)
        return Theta            

Now, I know you can easily implement it using scikit-learn library, but then, it's just a black-box module for machine learning. You'll never understand what is happening inside of it. However, if you have studied machine learning in detail, once, you'll have better understanding in using the sci-kit library and tweak the values accordingly for better results.

Below is the Complete code for this program. make sure to import math and sys for it. I am not a professional at coding, you might find some unwanted piece of lines in the code, which you can optimize yourself though.

That's it from this blog, if there are any suggestions, or corrections, feel free to mention in the comment section. Also if you have any doubts, feel free to ask.

- Machine Learning by Andrew Ng, (among the best MOOCs).


Popular posts from this blog

K-Means Clustering for Image Compression, from scratch.

Hello World, This is Saumya, and I am here to help you understand and implement K-Means Clustering Algorithm from scratch without using any Machine Learning libraries. We will further use this algorithm to compress an image. Here, I will implement this code in Python, but you can implement the algorithm in any other programming language of your choice just by basically developing 4-5 simple functions.

So now, first of all, what exactly is Clustering and in particular K-Means? As discussed in my blog on Machine Learning, Clustering is a type of unsupervised machine learning problem in which, we find clusters of similar data. K-means is the most widely used clustering algorithm. So basically, our task is to find those centers for the clusters around which our data points are associated.

These centres of the Clusters are called centroids(K). Note that, these cluster centroids, may or may not belong to our dataset itself. Since our problem is to choose these points, let's first of al…

Machine Learning

Hello World, This is Saumya, and I am here to help you understand the basics of Machine Learning, what exactly does it mean, what are its types, and how powerful of a tool it can be.

We have all been hearing recently about the term "Artificial Intelligence" recently, and how it will shape our future. Well, Machine Learning is nothing but a minor subfield of the vast field of A.I. Some of you might feel they both are basically the same thing, but in reality, they are not. A.I. is basically a cluster of interconnected fields, which makes it difficult for us to sometimes visualize the difference between them all.
Now then, what is the difference? By definition, A.I. is basically trying to create a machine that is capable to think the way we humans do and specifically learn from our experiences. On the other hand, M.L. is computer's way of learning from data and henceforth make decision from the information obtained.
Again, We can say that ML is basically a part of AI, but AI…

Linear Regression - Performance Checking & Diagnosis

Hello World, This is Saumya, and I am here to help you understand and implement Linear Regression in more detail and will discuss various problems we may encounter while training our model along with some techniques to solve those problems. There won't be any more programming done in this post, although, you can try it out yourself, whatever is discussed in this blog. 

So now, first of all, Let's recall what we studied about Linear Regression in our previous blog. So, we first discussed about certain notations regarding to machine learning in general, then the cost function, hθ(x(i))= θ0 x0+θ1 x1. Further we discussed about training the model using the training set by running the gradient descent algorithm over it. We also discussed about the Cost Function.
Now, before we begin, I want to talk about the Cost Function in brief. Cost function, as we defined, is, J(θ)= i=1m( hθ(x(i))-y(i))2/ (2*m). If we define cost function, we can define it as the function, whose value is penali…