Add MLDLM Lab03

2023-06-17 13:47:27 +02:00 · 2023-06-17 13:47:27 +02:00 · d32539c2b2
commit d32539c2b2
parent bdb2e6547d
1 changed file with 633 additions and 0 deletions
--- a/Mining/Labs/L03_Linear_Regression_LAB_ASSIGNMENT.ipynb
+++ b/Mining/Labs/L03_Linear_Regression_LAB_ASSIGNMENT.ipynb
@@ -0,0 +1,633 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "FZEco2HK6D57"
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "from matplotlib import pyplot as plt"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "a3nCUqopXHwv"
+      },
+      "outputs": [],
+      "source": [
+        "RANDOM_SEED = 0x0"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "jjTkUw7BWulH"
+      },
+      "source": [
+        "# Lab 03: Linear Regression"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "gNnZUk36Xz7_"
+      },
+      "source": [
+        "For the first few tasks, we will work with synthetic univariate data.\n",
+        "We generate $100$ samples $x_i \\in [-1, 1)$ of a single feature as `x` and two different\n",
+        "regression targets `y1` and `y2`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "Ojta777H2ulb"
+      },
+      "outputs": [],
+      "source": [
+        "data_rng = np.random.default_rng(RANDOM_SEED)\n",
+        "n = 100\n",
+        "x = 2 * data_rng.random(n) - 1  # create n points between -1 and 1\n",
+        "\n",
+        "# set up synthetic linear data\n",
+        "true_offset = 0.5\n",
+        "true_slope = 1.25\n",
+        "noise = data_rng.normal(loc=0., scale=0.25, size=(n,))\n",
+        "\n",
+        "y1 = true_offset + true_slope * x + noise\n",
+        "\n",
+        "\n",
+        "# set up synthetic non-linear data\n",
+        "y2 = true_offset + np.sin(np.pi * x) + noise"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ntdpTWzqZqAU"
+      },
+      "source": [
+        "# Task 1 (1 Point): Pearson Correlation"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "JbNJ7WhzbAtm"
+      },
+      "source": [
+        "### Task 1a\n",
+        "\n",
+        "Plot `x` against the target variable `y1`.\n",
+        "\n",
+        "* use `plt.scatter`\n",
+        "\n",
+        "\n",
+        "Do you think there is a linear relationship between `x` and the target?"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "MxYMdhfxyYAd"
+      },
+      "outputs": [],
+      "source": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "6Ak0nQ0PDGpm"
+      },
+      "source": [
+        "Plot `x` against the target variable `y2`.\n",
+        "\n",
+        "Do you think there is a linear relationship between `x` and the target?"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "HpzwoBdQDd-d"
+      },
+      "outputs": [],
+      "source": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "HycYQm3tbvyf"
+      },
+      "source": [
+        "### Task 1b\n",
+        "\n",
+        "In class you have seen the formula for the Pearson Correlation:\n",
+        "$\\rho(a, b) = \\frac{\\sum_{i = 1}^{m} (a_i - \\bar{a})(b_i - \\bar{b})}{\\sqrt{\\sum_{i=1}^{m} (a_i - \\bar{a})^2\\sum_{i = 1}^{m}(b_i - \\bar{b})^2}} $, where $\\bar{a} = \\frac{1}{m}\\sum_{i=1}^{m} a_i$ and $\\bar{b} = \\frac{1}{m}\\sum_{i=1}^{m} b_i$.\n",
+        "\n",
+        "* Compute the Pearson Correlation $\\rho$ between `x` and the target `y1`.\n",
+        "* Compute the Pearson Correlation between `x` and `y2`.\n",
+        "* Check that you get the same result as the reference implementation (`scipy.stats.pearsonr`) further below."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "EUoJXIrCy0p6"
+      },
+      "outputs": [],
+      "source": []
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "L_NesuDQddHS"
+      },
+      "outputs": [],
+      "source": [
+        "# Refer to the output of this cell to check whether your implementation of rho\n",
+        "# is correct.\n",
+        "\n",
+        "from scipy.stats import pearsonr\n",
+        "\n",
+        "print(f\"rho(x, y1): {pearsonr(x, y1)[0]:.4f}\")\n",
+        "print(f\"rho(x, y2): {pearsonr(x, y2)[0]:.4f}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Kr9OWmCilrAv"
+      },
+      "source": [
+        "## 📢 **HAND-IN** 📢: Report in Moodle whether you solved this task."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rbjhdwFceHlL"
+      },
+      "source": [
+        "# Task 2 (2 Points): Univariate Linear Regression"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ucnYGKbmecz_"
+      },
+      "source": [
+        "### Task 2a\n",
+        "\n",
+        "You will now implement Linear Regression with a single variable. In class you have seen that the underlying model is: $y = \\theta_0 + \\theta_1x$.\n",
+        "You also derived the maximum likelihood estimates for $\\theta_0$ and $\\theta_1$:\n",
+        "\n",
+        "* $\\hat{\\theta}_1 = \\frac{\\sum_{i=1}^{m} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{m}(x_i - \\bar{x})^2}$ with $\\bar{x} = \\frac{1}{m}\\sum_{i=1}^{m} x_i$ and $\\bar{y} = \\frac{1}{m}\\sum_{i=1}^{m} y_i$.\n",
+        "* $\\hat{\\theta}_0 = \\bar{y} - \\hat{\\theta}_1\\bar{x}$\n",
+        "\n",
+        "In the following cell, implement the `.fit` and `.predict` methods:\n",
+        "* In the `.predict` method you will have to apply the model to the input `x`\n",
+        "* In the `.fit` method you will have to compute $\\hat{\\theta}_0$ and $\\hat{\\theta}_1$."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "qS0Oa5Btgk74"
+      },
+      "outputs": [],
+      "source": [
+        "class UnivariateLinearRegression:\n",
+        "\n",
+        "  def __init__(self):\n",
+        "    self.theta_0: float = 0.\n",
+        "    self.theta_1: float = 0.\n",
+        "\n",
+        "  def predict(self, x):\n",
+        "    # y = theta_0 + theta_1 * x\n",
+        "    return None  # TODO\n",
+        "\n",
+        "  def fit(self, x, y):\n",
+        "\n",
+        "    self.theta_1 = ...  # TODO\n",
+        "    self.theta_0 = ...  # TODO\n",
+        "\n",
+        "    return self"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "9LzenH1UhLOs"
+      },
+      "source": [
+        "### Task 2b\n",
+        "\n",
+        "Fit your linear model to `x` and the target `y1`.\n",
+        "\n",
+        "* Create an instance of the class `UnivariateLinearRegression`\n",
+        "* fit the model using its `.fit` method\n",
+        "* get the predicted values, using `.predict`\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "UHGuDWAntd8R"
+      },
+      "outputs": [],
+      "source": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "elE3OfjHjBRO"
+      },
+      "source": [
+        "* implement the function `plot_model`\n",
+        "* use `plot_model` to plot your linear regression model's predictions together with the true data points"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "T0eKDuRt1YOF"
+      },
+      "outputs": [],
+      "source": [
+        "def plot_model(x, y_pred, y_true, title):\n",
+        "  # TODO\n",
+        "  ...\n",
+        "  plt.show()\n",
+        "\n",
+        "plot_model(...)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "tt2RnAwAG1n9"
+      },
+      "source": [
+        "* Fit another linear model to `x` and `y2`\n",
+        "* get the predicted values\n",
+        "* plot the model with `plot_model`"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "Ccq3GI17Ga2x"
+      },
+      "outputs": [],
+      "source": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "E0i3gWvIl7nY"
+      },
+      "source": [
+        "## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
+        "\n",
+        "* both plots containing the linear regression model and true datapoints\n",
+        "* a short (2-3 sentences) interpretation of the curves: Why do you think they look the way\n",
+        "they do? Can you draw any conclusions?\n",
+        "\n",
+        "**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "0TK0Pi4ClphY"
+      },
+      "source": [
+        "# Task 3 (4 Points): Univariate Linear Regression using Stochastic Gradient Descent"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "YL31gChVqLpC"
+      },
+      "source": [
+        "### Task 3a\n",
+        "\n",
+        "In class you have seen an alternative way to estimate the parameters $\\theta_i$ of a linear regression model: gradient descent.\n",
+        "\n",
+        "For the univariate linear regression model, the stochastic gradient descent updates look like this:\n",
+        "* $\\theta_{0}^{(t+1)} = \\theta_{0}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t)$\n",
+        "* $\\theta_{1}^{(t+1)} = \\theta_{1}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t) x_t$\n",
+        "\n",
+        "Here $\\alpha$ is the learning rate, and $(x_t, y_t)$ is the data point sampled\n",
+        "at time $t$.\n",
+        "\n",
+        "\n",
+        "In the following cell, implement the `.fit` and `.predict` methods:\n",
+        "* In the `.predict` method you will have to apply the model to the input `x`.\n",
+        "* In the `.fit` method you will have to implement the update equations for\n",
+        "$\\theta_0$ and $\\theta_1$."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "wJMHvQmXmVKr"
+      },
+      "outputs": [],
+      "source": [
+        "class SGDUnivariateLinearRegression:\n",
+        "\n",
+        "  def __init__(self):\n",
+        "    self.theta_0: float = 0.\n",
+        "    self.theta_1: float = 0.\n",
+        "    self.rng = np.random.default_rng(RANDOM_SEED)\n",
+        "\n",
+        "  def predict(self, x):\n",
+        "    # y = theta_0 + theta_1 * x\n",
+        "    return None  # TODO\n",
+        "\n",
+        "  def fit(self, x, y, n_iter: int = 100, learning_rate: float = 1.0):\n",
+        "    for t in range(n_iter):\n",
+        "      sample_ix = self.rng.integers(0, len(x))\n",
+        "\n",
+        "      xt = x[sample_ix]\n",
+        "      yt = y[sample_ix]\n",
+        "\n",
+        "      # TODO: update self.theta_0 and self.theta_1 SIMULTANEOUSLY (!!!) according to their update equations\n",
+        "\n",
+        "    return self"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "MHLBmTm4vK9p"
+      },
+      "source": [
+        "### Task 3b\n",
+        "\n",
+        "Run SGD for `x` and the target `y1` and compute the mean squared error (MSE).\n",
+        "The MSE is defined as: $\\frac{1}{n}\\sum_{i=1}^{n} (\\hat{y}_i - y_i)^2$, where\n",
+        "$\\hat{y}_i$ is the model's prediction for the $i$-th data point.\n",
+        "\n",
+        "* Create an instance of the class `SGDUnivariateLinearRegression`\n",
+        "* fit the model using its `.fit` method\n",
+        "* get the predicted values, using `.predict`\n",
+        "* implement the `mse` function\n",
+        "* compute the MSE of your predictions"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "CZ1szyQhK9so"
+      },
+      "outputs": [],
+      "source": [
+        "def mse(y_pred, y_true):\n",
+        "  # TODO\n",
+        "  return 0."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "V35vBU5Yti8Z"
+      },
+      "outputs": [],
+      "source": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "hSsE1o6GwA3K"
+      },
+      "source": [
+        "### Task 3c\n",
+        "\n",
+        "You will now plot the learning curves for different learning rates $\\alpha$.\n",
+        "A learning curve shows how a model's performance changes with an increasing number of update steps.\n",
+        "In our case we will plot the model's MSE as a function of the number of update\n",
+        "steps `n_iter` for different values of `learning_rate`.\n",
+        "\n",
+        "In the following cell we set up most of the scaffolding to create this plot. Follow\n",
+        "the instructions in the comments to finish the plots."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4Rr5ix7LNISB"
+      },
+      "outputs": [],
+      "source": [
+        "n_iters = [50, 100, 200, 500, 1000, 2000]\n",
+        "learning_rates = [1., .1, .01]\n",
+        "\n",
+        "# we plot the MSE achieved by the closed form model as a reference\n",
+        "closed_form = UnivariateLinearRegression()\n",
+        "closed_form.fit(x, y1)\n",
+        "mse_base = mse(y_pred=closed_form.predict(x), y_true=y1)\n",
+        "plt.plot(n_iters, np.ones_like(n_iters) * mse_base, label=\"closed form\", linestyle='--', c='b')\n",
+        "\n",
+        "for alpha in learning_rates:\n",
+        "  mses = []\n",
+        "  for n_iter in n_iters:\n",
+        "    # fit a SGDUnivariateLinearRegression model using n_iter=n_iter and\n",
+        "    # learning_rate=alpha\n",
+        "    # compute its mse and append the mse value to the mses list\n",
+        "\n",
+        "    mse_ = 1.  # replace with mse calculation\n",
+        "    mses.append(mse_)\n",
+        "  plt.plot(n_iters, mses, label=f\"alpha = {alpha:.2f}\")\n",
+        "\n",
+        "plt.xlabel(\"n_iter\")\n",
+        "plt.ylabel(\"MSE\")\n",
+        "plt.legend()\n",
+        "plt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "SmCkMMJyEEgV"
+      },
+      "source": [
+        "## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
+        "\n",
+        "* the final plot containing learning curves\n",
+        "* a short (2-3 sentences) interpretation of the curves: Why do you think they look the way\n",
+        "they do? Can you draw any conclusions?\n",
+        "\n",
+        "In case you were not able to arrive at the final plot:\n",
+        "\n",
+        "* include screenshots of the code you wrote so we can assign partial credit\n",
+        "\n",
+        "**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "dgrNtwsPyigH"
+      },
+      "source": [
+        "# Task 4 (3 Points): Multivariate Linear Regression"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_sPWegXCg2y1"
+      },
+      "source": [
+        "In this task we will apply linear regression to non-synthetic data.\n",
+        "The variable `X` is a `pandas` `DataFrame` containing the features and `y` contains\n",
+        "the target. Read through the description to get an idea of the different variables."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "djGUQ3kVx9ob"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.datasets import load_diabetes\n",
+        "\n",
+        "data = load_diabetes(as_frame=True)\n",
+        "\n",
+        "X = data['data']\n",
+        "y = data['target']\n",
+        "description = data['DESCR']\n",
+        "\n",
+        "print(description)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "byOVt9t9_2c7"
+      },
+      "source": [
+        "### Task 4a\n",
+        "\n",
+        "Implement linear regression using `sklearn`.\n",
+        "\n",
+        "* create an instance of the class `sklearn.linear_model.LinearRegression`. Refer to the documentation at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\n",
+        "* call its `.fit` method\n",
+        "* get the predicted values with `.predict`"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "eyiU4nCQBovr"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.linear_model import LinearRegression"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "G4AktC189PAc"
+      },
+      "outputs": [],
+      "source": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qQUdYHOXpeLd"
+      },
+      "source": [
+        "### Task 4b\n",
+        "\n",
+        "The estimated coefficients $\\theta$ of the linear model (excluding the intercept, which is stored separately in `.intercept_`) can be found in the `.coef_` attribute. The feature names can be found in the `.feature_names_in_` attribute. They are the same as the column names of `X` and appear in the same order.\n",
+        "\n",
+        "Visualize the estimated parameters and the feature names in a bar plot.\n",
+        "\n",
+        "Using these, answer the following questions:\n",
+        "\n",
+        "* Which are the 3 most influential features?\n",
+        "* How do you interpret the sign of the coefficients?\n",
+        "* If you had to exclude 1 feature, which one would you select and why?"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "odXnubfHqrfc"
+      },
+      "outputs": [],
+      "source": [
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xa_HDxFeolBj"
+      },
+      "source": [
+        "## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
+        "\n",
+        "* the bar plot\n",
+        "* your answers to the questions in Task 4b\n",
+        "\n",
+        "**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "private_outputs": true,
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.8.16"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}