Linear Regression

Four things to take into consideration:

  1. Dataset:

    $(x, y)$, with $x$ being the feature and $y$ being the target.

  2. Model:

    Represented by the function $f(x) = ax + b$, $a$ and $b$ being the parameters.

  3. Cost Function:

    A function to compute the error between the predicted values and the targets.

    In our case, $J(a,b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f(x^{(i)}) - y^{(i)}\right)^2$, $m$ being the number of rows.

  4. Minimisation Algorithm:

    In our case, we are going to use gradient descent.

    Basically, for each parameter $a_i$, we compute $a_i = a_i - \alpha \frac{\partial J}{\partial a_i}(a)$,

    with $\alpha$ being the learning rate (a minimal one-parameter sketch of this update loop follows the list).
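
To make the update rule concrete, here is a minimal one-parameter sketch, assuming the toy cost $J(a) = (a - 3)^2$; the toy cost and the values below are illustrative only, not the dataset used later:

a = 0.0      # initial guess
alpha = 0.1  # learning rate
for _ in range(50):
    # dJ/da = 2 * (a - 3) for the toy cost J(a) = (a - 3)^2
    a = a - alpha * 2 * (a - 3)
a  # converges towards the minimum at a = 3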

Transforming this into a matrix problem

Let $n$ be the number of features and $m$ the number of rows.

$$x = \begin{pmatrix} x_{1}^{(1)} & x_{2}^{(1)} & \dots & x_{n}^{(1)} \\ x_{1}^{(2)} & x_{2}^{(2)} & \dots & x_{n}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1}^{(m)} & x_{2}^{(m)} & \dots & x_{n}^{(m)} \end{pmatrix}, \quad \text{thus} \quad X = \begin{pmatrix} x_{1}^{(1)} & x_{2}^{(1)} & \dots & x_{n}^{(1)} & 1 \\ x_{1}^{(2)} & x_{2}^{(2)} & \dots & x_{n}^{(2)} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{1}^{(m)} & x_{2}^{(m)} & \dots & x_{n}^{(m)} & 1 \end{pmatrix} \in \mathbb{R}^{m \times (n+1)}$$

$$y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix} \in \mathbb{R}^{m \times 1}$$

And the parameters $\theta = \begin{pmatrix} a_{1} \\ a_{2} \\ \vdots \\ a_{n} \\ b \end{pmatrix} \in \mathbb{R}^{(n+1) \times 1}$

Finally, we get the model $F = X\theta \in \mathbb{R}^{m \times 1}$.

As for the cost function, $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (F - y)_i^{\,2} = \frac{1}{2m} \sum_{i=1}^{m} (X\theta - y)_i^{\,2} \in \mathbb{R}$, the sum running over the $m$ components of the residual vector.

As for the gradient, $\frac{\partial J}{\partial \theta}(\theta) = \frac{1}{m} X^T(X\theta - y) \in \mathbb{R}^{(n+1) \times 1}$.
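
This expression follows from differentiating the cost; a short derivation, using the vector form of $J$ (which is equivalent to the sum above):

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y), \qquad \frac{\partial J}{\partial \theta} = \frac{1}{2m} \cdot 2\, X^T (X\theta - y) = \frac{1}{m} X^T (X\theta - y)$$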

Thus the update rule is $\theta = \theta - \alpha \frac{\partial J}{\partial \theta}$.

import numpy as np
from sklearn.datasets import make_regression

x, y = make_regression(n_samples=100, n_features=1, noise=10) # Here m=100 and n=1
y = y.reshape(y.shape[0], 1)  # reshape y into an (m, 1) column vector
x.shape, y.shape

((100, 1), (100, 1))


import matplotlib.pyplot as plt

plt.scatter(x, y)

[Plot: scatter of the generated data]


X = np.hstack((x, np.ones((x.shape[0], 1))))  # append a column of ones for the bias term b
X.shape

(100, 2)


theta = np.random.randn(X.shape[1], 1)  # random initialisation of the parameters
theta

array([[-1.11140959],
       [ 0.15718568]])

def model(X, theta):
    # F = X . theta, the vectorised linear model
    return X.dot(theta)

plt.scatter(x, y, c='g')
plt.plot(x, model(X, theta), c='orange')

[Plot: data with the initial (random) model line]

def cost_function(theta, X, y):
    # J(theta) = 1/(2m) * sum((X.theta - y)^2)
    return 1 / (2 * len(y)) * np.sum((X.dot(theta) - y) ** 2)

cost_function(theta, X, y)

2466.3199193827004
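
Equivalently, the cost is the squared norm of the residual vector divided by $2m$; a quick sanity check (the use of np.linalg.norm here is just an alternative formulation, not something the notebook relies on):

residual = X.dot(theta) - y                       # residual vector X.theta - y
alt_cost = np.linalg.norm(residual) ** 2 / (2 * len(y))
np.isclose(alt_cost, cost_function(theta, X, y))  # expected: True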

def gradients(theta, X, y):
    # dJ/dtheta = 1/m * X^T (X.theta - y)
    return 1 / len(y) * X.T.dot(X.dot(theta) - y)

gradients(theta, X, y)

array([[-76.34230311],
       [ -4.87819021]])
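
To gain confidence in the analytic gradient, we can compare it against a finite-difference approximation; this is an extra sanity check, not part of the original derivation:

# Central finite-difference approximation of dJ/dtheta, compared with gradients()
eps = 1e-6
approx = np.zeros_like(theta)
for j in range(theta.shape[0]):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[j] += eps
    t_minus[j] -= eps
    approx[j] = (cost_function(t_plus, X, y) - cost_function(t_minus, X, y)) / (2 * eps)
np.allclose(approx, gradients(theta, X, y))  # expected: True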

def gradient_descent(theta, X, y, learning_rate, n):
    # Repeat the update theta := theta - learning_rate * dJ/dtheta for n iterations
    for i in range(n):
        theta = theta - learning_rate * gradients(theta, X, y)
    return theta

final_theta = gradient_descent(theta, X, y, 0.01, 1000)
plt.scatter(x, y, c='g')
plt.plot(x, model(X, final_theta), c='orange')

[Plot: data with the fitted model line]

And there you have it: we have trained a linear model that fits our data!
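
To quantify how well the line fits, we could compute the coefficient of determination $R^2$; a minimal sketch, assuming the final_theta obtained above (the $R^2$ metric is an addition here, not something the notebook computes):

# R^2 = 1 - SS_res / SS_tot
predictions = model(X, final_theta)
ss_res = np.sum((y - predictions) ** 2)  # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
1 - ss_res / ss_tot                      # close to 1 means a good fit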

Now, let's study the learning process.

Let's visualize how the cost function and the linear model evolve during training!

def gradient_descent_v2(theta, X, y, learning_rate, n):
    # Same as gradient_descent, but also records the cost and the parameters at every iteration
    costs = []
    thetas = []
    for i in range(n):
        theta = theta - learning_rate * gradients(theta, X, y)
        costs.append(cost_function(theta, X, y))
        thetas.append(theta)
    return {"theta": theta, "costs": costs, "thetas": thetas}

result = gradient_descent_v2(theta, X, y, 0.01, 1000)
plt.plot(range(1000), result["costs"])

[Plot: cost as a function of iteration (1000 iterations)]

plt.scatter(x, y, c='g')
color = iter(plt.cm.rainbow(np.linspace(0, 1, 10)))
for i in range(10):
    c = next(color)
    plt.plot(x, model(X, result["thetas"][i * 100]), c=c)  # model line every 100 iterations

[Plot: model lines sampled every 100 iterations]

We can conclude that $n = 200$ iterations are sufficient to train the model. Let's test it!

result = gradient_descent_v2(theta, X, y, 0.01, 200)
plt.plot(range(200), result["costs"])

[Plot: cost as a function of iteration (200 iterations)]

plt.scatter(x, y, c='g')
color = iter(plt.cm.rainbow(np.linspace(0, 1, 10)))
for i in range(10):
    c = next(color)
    plt.plot(x, model(X, result["thetas"][i * 20 + 19]), c=c)  # model line every 20 iterations

[Plot: model lines sampled every 20 iterations]
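
Rather than choosing the number of iterations by reading the cost curve, we could stop automatically once the cost stops improving. A sketch of such an early-stopping variant, where the function name and the tolerance value are illustrative choices, not part of the original notebook:

def gradient_descent_tol(theta, X, y, learning_rate, max_n, tol=1e-3):
    # Stop once the decrease in cost between two iterations falls below tol
    previous_cost = cost_function(theta, X, y)
    for i in range(max_n):
        theta = theta - learning_rate * gradients(theta, X, y)
        current_cost = cost_function(theta, X, y)
        if previous_cost - current_cost < tol:
            break
        previous_cost = current_cost
    return theta, i + 1

theta_tol, n_used = gradient_descent_tol(theta, X, y, 0.01, 1000)
n_used  # number of iterations actually performed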