diff --git a/Notes/Semester 4/MLDM - Machine Learning and Data Mining/Labs/L03_Linear_Regression_LAB_ASSIGNMENT.ipynb b/Notes/Semester 4/MLDM - Machine Learning and Data Mining/Labs/L03_Linear_Regression_LAB_ASSIGNMENT.ipynb new file mode 100644 index 0000000..da67427 --- /dev/null +++ b/Notes/Semester 4/MLDM - Machine Learning and Data Mining/Labs/L03_Linear_Regression_LAB_ASSIGNMENT.ipynb @@ -0,0 +1,633 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FZEco2HK6D57" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from matplotlib import pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "a3nCUqopXHwv" + }, + "outputs": [], + "source": [ + "RANDOM_SEED = 0x0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jjTkUw7BWulH" + }, + "source": [ + "# Lab 03: Linear Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gNnZUk36Xz7_" + }, + "source": [ + "For the first few Tasks, we will work with synthetic univariate data.\n", + "We generate $100$ features $x_i \\in [-1, 1]$ as `x` and two different\n", + "regression targets `y1` and `y2`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ojta777H2ulb" + }, + "outputs": [], + "source": [ + "data_rng = np.random.default_rng(RANDOM_SEED)\n", + "n = 100\n", + "x = 2 * data_rng.random(n) - 1 # create n points between -1 and 1\n", + "\n", + "# setup synthetic linear data\n", + "true_offset = 0.5\n", + "true_slope = 1.25\n", + "noise = data_rng.normal(loc=0., scale=0.25, size=(n,))\n", + "\n", + "y1 = true_offset + true_slope * x + noise\n", + "\n", + "\n", + "# setup synthetic non-linear data\n", + "y2 = true_offset + np.sin(np.pi * x) + noise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ntdpTWzqZqAU" + }, + "source": [ + "# Task 1 (1 Point): Pearson Correlation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JbNJ7WhzbAtm" + }, + "source": [ + "### Task 1a\n", + "\n", + "Plot `x` against the target variable `y1`.\n", + "\n", + "* use `plt.scatter`\n", + "\n", + "\n", + "Do you think there is a linear relationship between `x` and the target?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MxYMdhfxyYAd" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Ak0nQ0PDGpm" + }, + "source": [ + "Plot `x` against the target variable `y2`.\n", + "\n", + "Do you think there is a linear relationship between `x` and the target?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HpzwoBdQDd-d" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HycYQm3tbvyf" + }, + "source": [ + "### Task 1b\n", + "\n", + "In class you have seen the formula for the Pearson Correlation:\n", + "$\\rho(a, b) = \\frac{\\sum_{i = 1}^{m} (a_i - \\bar{a})(b_i - \\bar{b})}{\\sqrt{\\sum_{i=1}^{m} (a_i - \\bar{a})^2\\sum_{i = 1}^{m}(b_i - \\bar{b})^2}} $, where $\\bar{a} = \\frac{1}{m}\\sum_{i=1}^{m} a_i$ and $\\bar{b} = \\frac{1}{m}\\sum_{i=1}^{m} b_i$.\n", + "\n", + "* Compute the Pearson Correlation $\\rho$ between `x` and the target `y1`.\n", + "* Compute the Pearson Correlation between `x` and `y2`.\n", + "* Check that you get the same result as the reference implementation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EUoJXIrCy0p6" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L_NesuDQddHS" + }, + "outputs": [], + "source": [ + "# Refer to the output of this cell to check whether your implementation of rho\n", + "# is correct.\n", + "\n", + "from scipy.stats import pearsonr\n", + "\n", + "print(f\"rho(x, y1): {pearsonr(x, y1)[0]:.4f}\")\n", + "print(f\"rho(x, y2): {pearsonr(x, y2)[0]:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kr9OWmCilrAv" + }, + "source": [ + "## 📢 **HAND-IN** 📢: Report in Moodle whether you solved this task." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rbjhdwFceHlL" + }, + "source": [ + "# Task 2 (2 Points): Univariate Linear Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ucnYGKbmecz_" + }, + "source": [ + "### Task 2a\n", + "\n", + "You will now implement Linear Regression with a single variable. 
In class you have seen that the underlying model is: $y = \\theta_0 + \\theta_1x$.\n", + "You also derived the maximum likelihood estimates for $\\theta_0$ and $\\theta_1$:\n", + "\n", + "* $\\hat{\\theta}_1 = \\frac{\\sum_{i=1}^{m} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{m}(x_i - \\bar{x})^2}$ with $\\bar{x} = \\frac{1}{m}\\sum_{i=1}^{m} x_i$ and $\\bar{y} = \\frac{1}{m}\\sum_{i=1}^{m} y_i$.\n", + "* $\\hat{\\theta}_0 = \\bar{y} - \\hat{\\theta}_1\\bar{x}$\n", + "\n", + "In the following cell, implement the `.fit` and `.predict` methods:\n", + "* In the `.predict` method you will have to apply the model to the input `x`\n", + "* In the `.fit` method you will have to compute $\\hat{\\theta}_0$ and $\\hat{\\theta}_1$." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qS0Oa5Btgk74" + }, + "outputs": [], + "source": [ + "class UnivariateLinearRegression:\n", + "\n", + " def __init__(self):\n", + " self.theta_0: float = 0.\n", + " self.theta_1: float = 0.\n", + "\n", + " def predict(self, x):\n", + " # y = theta_0 + theta_1 * x\n", + " return None # TODO\n", + "\n", + " def fit(self, x, y):\n", + "\n", + " self.theta_1 = ... # TODO\n", + " self.theta_0 = ... 
# TODO\n", + "\n", + " return self" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9LzenH1UhLOs" + }, + "source": [ + "### Task 2b\n", + "\n", + "Fit your linear model to `x` and the target `y1`.\n", + "\n", + "* Create an instance of the class `UnivariateLinearRegression`\n", + "* fit the model using its `.fit` method\n", + "* get the predicted values, using `.predict`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UHGuDWAntd8R" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "elE3OfjHjBRO" + }, + "source": [ + "* implement the function `plot_model`\n", + "* use `plot_model` to plot your linear regression model given the true datapoints" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "T0eKDuRt1YOF" + }, + "outputs": [], + "source": [ + "def plot_model(x, y_pred, y_true, title):\n", + " # TODO\n", + " ...\n", + " plt.show()\n", + "\n", + "plot_model(...)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tt2RnAwAG1n9" + }, + "source": [ + "* Fit another linear model to `x` and `y2`\n", + "* get the predicted values\n", + "* plot the model with `plot_model`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ccq3GI17Ga2x" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E0i3gWvIl7nY" + }, + "source": [ + "## 📢 **HAND-IN** 📢: A PDF document containing the following:\n", + "\n", + "* both plots containing the linear regression model and true datapoints\n", + "* a short (2-3 sentences) interpretation of the curves: why do you think they look the way\n", + "they do? 
Can you draw any conclusions?\n", + "\n", + "**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0TK0Pi4ClphY" + }, + "source": [ + "# Task 3 (4 Points): Univariate Linear Regression using Stochastic Gradient Descent" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YL31gChVqLpC" + }, + "source": [ + "### Task 3a\n", + "\n", + "In class you have seen an alternative way to estimate the parameters $\\theta_i$ of the linear regression model: Gradient Descent.\n", + "\n", + "For the univariate linear regression model, the stochastic gradient descent updates look like this:\n", + "* $\\theta_{0}^{(t+1)} = \\theta_{0}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t)$\n", + "* $\\theta_{1}^{(t+1)} = \\theta_{1}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t) x_t$\n", + "\n", + "Here $\\alpha$ is the learning rate, and $(x_t, y_t)$ is the data point sampled\n", + "at time $t$.\n", + "\n", + "\n", + "In the following cell, implement the `.fit` and `.predict` methods:\n", + "* In the `.predict` method you will have to apply the model to the input `x`.\n", + "* In the `.fit` method you will have to implement the update equations for\n", + "$\\theta_0$ and $\\theta_1$."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wJMHvQmXmVKr" + }, + "outputs": [], + "source": [ + "class SGDUnivariateLinearRegression:\n", + "\n", + " def __init__(self):\n", + " self.theta_0: float = 0.\n", + " self.theta_1: float = 0.\n", + " self.rng = np.random.default_rng(RANDOM_SEED)\n", + "\n", + " def predict(self, x):\n", + " # y = theta_0 + theta_1 * x\n", + " return None # TODO\n", + "\n", + " def fit(self, x, y, n_iter: int = 100, learning_rate: float = 1.0):\n", + " for t in range(n_iter):\n", + " sample_ix = self.rng.integers(0, len(x))\n", + "\n", + " xt = x[sample_ix]\n", + " yt = y[sample_ix]\n", + "\n", + " # TODO: update self.theta_0 and self.theta_1 SIMULTANEOUSLY (!!!) according to their update equations\n", + "\n", + " return self" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MHLBmTm4vK9p" + }, + "source": [ + "### Task 3b\n", + "\n", + "Run SGD for `x` and the target `y1` and compute the mean squared error (MSE).\n", + "The MSE is defined as: $\\frac{1}{n}\\sum_{i=1}^{n} (\\hat{y}_i - y_i)^2$, where\n", + "$\\hat{y}$ are the model predictions.\n", + "\n", + "* Create an instance of the class `SGDUnivariateLinearRegression`\n", + "* fit the model using its `.fit` method\n", + "* get the predicted values, using `.predict`\n", + "* implement the `mse` function\n", + "* compute the MSE of your predictions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CZ1szyQhK9so" + }, + "outputs": [], + "source": [ + "def mse(y_pred, y_true):\n", + " # TODO\n", + " return 0." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "V35vBU5Yti8Z" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hSsE1o6GwA3K" + }, + "source": [ + "### Task 3c\n", + "\n", + "You will now plot the learning curves for different learning rates $\\alpha$.\n", + "A learning curve shows how a model's performance changes with an increasing number of update steps.\n", + "In our case we will plot the model's MSE as a function of the number of update\n", + "steps `n_iter` for different values of `learning_rate`.\n", + "\n", + "In the following cell we set up most of the scaffold to create this plot. Follow\n", + "the instructions in the comments to finish the plots." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4Rr5ix7LNISB" + }, + "outputs": [], + "source": [ + "n_iters = [50, 100, 200, 500, 1000, 2000]\n", + "learning_rates = [1., .1, .01]\n", + "\n", + "# we plot the MSE achieved by the closed form model as a reference\n", + "closed_form = UnivariateLinearRegression()\n", + "closed_form.fit(x, y1)\n", + "mse_base = mse(y_pred=closed_form.predict(x), y_true=y1)\n", + "plt.plot(n_iters, np.ones_like(n_iters) * mse_base, label=\"closed form\", linestyle='--', c='b')\n", + "\n", + "for alpha in learning_rates:\n", + " mses = []\n", + " for n_iter in n_iters:\n", + " # fit a SGDUnivariateLinearRegression model using n_iter=n_iter and\n", + " # learning_rate=alpha\n", + " # compute its mse and append the mse value to the mses list\n", + "\n", + " mse_ = 1.
# replace with mse calculation\n", + " mses.append(mse_)\n", + " plt.plot(n_iters, mses, label=f\"alpha = {alpha:.2f}\")\n", + "\n", + "plt.xlabel(\"n_iter\")\n", + "plt.ylabel(\"MSE\")\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SmCkMMJyEEgV" + }, + "source": [ + "## 📢 **HAND-IN** 📢: A PDF document containing the following:\n", + "\n", + "* the final plot containing learning curves\n", + "* a short (2-3 sentences) interpretation of the curves: why do you think they look the way\n", + "they do? Can you draw any conclusions?\n", + "\n", + "In case you were not able to arrive at the final plot:\n", + "\n", + "* include screenshots of the code you wrote so we can assign partial credit\n", + "\n", + "**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dgrNtwsPyigH" + }, + "source": [ + "# Task 4 (3 Points): Multivariate Linear Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_sPWegXCg2y1" + }, + "source": [ + "In this task we will apply linear regression to non-synthetic data.\n", + "The variable `X` is a `pandas` `DataFrame` containing features and `y` contains\n", + "the target. Read through the description to get an idea of the different variables."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "djGUQ3kVx9ob" + }, + "outputs": [], + "source": [ + "from sklearn.datasets import load_diabetes\n", + "\n", + "data = load_diabetes(as_frame=True)\n", + "\n", + "X = data['data']\n", + "y = data['target']\n", + "description = data['DESCR']\n", + "\n", + "print(description)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "byOVt9t9_2c7" + }, + "source": [ + "### Task 4a\n", + "\n", + "Implement linear regression using `sklearn`.\n", + "\n", + "* create an instance of the class `sklearn.linear_model.LinearRegression`. Refer to the documentation at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\n", + "* call its `.fit` method\n", + "* get the predicted values with `.predict`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eyiU4nCQBovr" + }, + "outputs": [], + "source": [ + "from sklearn.linear_model import LinearRegression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "G4AktC189PAc" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qQUdYHOXpeLd" + }, + "source": [ + "### Task 4b\n", + "\n", + "The estimated parameters $\\theta$ of the linear model can be found in the `.coef_` member variable. The feature names can be found in the `.feature_names_in_` member variable. They are the same as the names of the columns of `X` and should be in the same order.\n", + "\n", + "Visualize the estimated parameters and the feature names in a bar plot.\n", + "\n", + "Using these, answer the following questions:\n", + "\n", + "* Which are the 3 most influential features?\n", + "* How do you interpret the sign of the coefficients?\n", + "* If you had to exclude 1 feature, which one would you select and why?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "odXnubfHqrfc" + }, + "outputs": [], + "source": [ + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xa_HDxFeolBj" + }, + "source": [ + "## 📢 **HAND-IN** 📢: A PDF document containing the following:\n", + "\n", + "* the bar plot\n", + "* your answers to the questions in Task 4b\n", + "\n", + "**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n" + ] + } + ], + "metadata": { + "colab": { + "private_outputs": true, + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.16" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file