Add MLDLM Lab03

Manuel Thalmann 2023-06-17 13:47:27 +02:00
parent bdb2e6547d
commit d32539c2b2

@ -0,0 +1,633 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FZEco2HK6D57"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a3nCUqopXHwv"
},
"outputs": [],
"source": [
"RANDOM_SEED = 0x0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jjTkUw7BWulH"
},
"source": [
"# Lab 03: Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gNnZUk36Xz7_"
},
"source": [
"For the first few tasks, we will work with synthetic univariate data.\n",
"We generate $100$ feature values $x_i \\in [-1, 1)$ as `x` and two different\n",
"regression targets `y1` and `y2`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ojta777H2ulb"
},
"outputs": [],
"source": [
"data_rng = np.random.default_rng(RANDOM_SEED)\n",
"n = 100\n",
"x = 2 * data_rng.random(n) - 1 # create n points in [-1, 1)\n",
"\n",
"# set up synthetic linear data\n",
"true_offset = 0.5\n",
"true_slope = 1.25\n",
"noise = data_rng.normal(loc=0., scale=0.25, size=(n,))\n",
"\n",
"y1 = true_offset + true_slope * x + noise\n",
"\n",
"\n",
"# set up synthetic non-linear data\n",
"y2 = true_offset + np.sin(np.pi * x) + noise"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ntdpTWzqZqAU"
},
"source": [
"# Task 1 (1 Point): Pearson Correlation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JbNJ7WhzbAtm"
},
"source": [
"### Task 1a\n",
"\n",
"Plot `x` against the target variable `y1`.\n",
"\n",
"* use `plt.scatter`\n",
"\n",
"\n",
"Do you think there is a linear relationship between `x` and the target?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MxYMdhfxyYAd"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "6Ak0nQ0PDGpm"
},
"source": [
"Plot `x` against the target variable `y2`.\n",
"\n",
"Do you think there is a linear relationship between `x` and the target?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HpzwoBdQDd-d"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "HycYQm3tbvyf"
},
"source": [
"### Task 1b\n",
"\n",
"In class you have seen the formula for the Pearson Correlation:\n",
"$\\rho(a, b) = \\frac{\\sum_{i = 1}^{m} (a_i - \\bar{a})(b_i - \\bar{b})}{\\sqrt{\\sum_{i=1}^{m} (a_i - \\bar{a})^2\\sum_{i = 1}^{m}(b_i - \\bar{b})^2}}$, where $\\bar{a} = \\frac{1}{m}\\sum_{i=1}^{m} a_i$ and $\\bar{b} = \\frac{1}{m}\\sum_{i=1}^{m} b_i$.\n",
"\n",
"* Compute the Pearson Correlation $\\rho$ between `x` and the target `y1`.\n",
"* Compute the Pearson Correlation between `x` and `y2`.\n",
"* Check that your results match the reference implementation (`scipy.stats.pearsonr`) in the cell after your solution."
]
},
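{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check for the formula above, here is a small worked example on made-up values (these three points are an assumption for illustration and are not part of the lab data). Take $a = (1, 2, 3)$ and $b = (1, 3, 2)$, so $\\bar{a} = \\bar{b} = 2$:\n",
"\n",
"$$\\sum_{i=1}^{3} (a_i - \\bar{a})(b_i - \\bar{b}) = (-1)(-1) + (0)(1) + (1)(0) = 1$$\n",
"\n",
"$$\\sum_{i=1}^{3} (a_i - \\bar{a})^2 = 2, \\qquad \\sum_{i=1}^{3} (b_i - \\bar{b})^2 = 2$$\n",
"\n",
"$$\\rho(a, b) = \\frac{1}{\\sqrt{2 \\cdot 2}} = 0.5$$\n",
"\n",
"An implementation of $\\rho$ should reproduce this value when given these two arrays instead of `x` and `y1`."
]
},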
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EUoJXIrCy0p6"
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "L_NesuDQddHS"
},
"outputs": [],
"source": [
"# Refer to the output of this cell to check whether your implementation of rho\n",
"# is correct.\n",
"\n",
"from scipy.stats import pearsonr\n",
"\n",
"print(f\"rho(x, y1): {pearsonr(x, y1)[0]:.4f}\")\n",
"print(f\"rho(x, y2): {pearsonr(x, y2)[0]:.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Kr9OWmCilrAv"
},
"source": [
"## 📢 **HAND-IN** 📢: Report in Moodle whether you solved this task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rbjhdwFceHlL"
},
"source": [
"# Task 2 (2 Points): Univariate Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ucnYGKbmecz_"
},
"source": [
"### Task 2a\n",
"\n",
"You will now implement Linear Regression with a single variable. In class you have seen that the underlying model is: $y = \\theta_0 + \\theta_1x$.\n",
"You also derived the maximum likelihood estimates for $\\theta_0$ and $\\theta_1$:\n",
"\n",
"* $\\hat{\\theta}_1 = \\frac{\\sum_{i=1}^{m} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{m}(x_i - \\bar{x})^2}$ with $\\bar{x} = \\frac{1}{m}\\sum_{i=1}^{m} x_i$ and $\\bar{y} = \\frac{1}{m}\\sum_{i=1}^{m} y_i$.\n",
"* $\\hat{\\theta}_0 = \\bar{y} - \\hat{\\theta}_1\\bar{x}$\n",
"\n",
"In the following cell, implement the `.fit` and `.predict` methods:\n",
"* In the `.predict` method you will have to apply the model to the input `x`\n",
"* In the `.fit` method you will have to compute $\\hat{\\theta}_0$ and $\\hat{\\theta}_1$."
]
},
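{
"cell_type": "markdown",
"metadata": {},
"source": [
"The slope estimate is closely related to the Pearson correlation from Task 1. As a sketch, write $S_{xy} = \\sum_{i=1}^{m} (x_i - \\bar{x})(y_i - \\bar{y})$, $S_{xx} = \\sum_{i=1}^{m} (x_i - \\bar{x})^2$ and $S_{yy} = \\sum_{i=1}^{m} (y_i - \\bar{y})^2$ (this shorthand is introduced here for illustration only and is not used elsewhere in the lab). Then:\n",
"\n",
"$$\\hat{\\theta}_1 = \\frac{S_{xy}}{S_{xx}} = \\frac{S_{xy}}{\\sqrt{S_{xx} S_{yy}}} \\sqrt{\\frac{S_{yy}}{S_{xx}}} = \\rho(x, y) \\sqrt{\\frac{S_{yy}}{S_{xx}}}$$\n",
"\n",
"For a made-up toy data set $x = (1, 2, 3)$, $y = (1, 3, 2)$ we have $\\bar{x} = \\bar{y} = 2$, $S_{xy} = 1$ and $S_{xx} = S_{yy} = 2$, so:\n",
"\n",
"$$\\hat{\\theta}_1 = \\frac{1}{2} = 0.5, \\qquad \\hat{\\theta}_0 = \\bar{y} - \\hat{\\theta}_1 \\bar{x} = 2 - 0.5 \\cdot 2 = 1$$\n",
"\n",
"In particular, the slope and the correlation always have the same sign, and a correlation of $0$ yields a horizontal line at $\\bar{y}$."
]
},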
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qS0Oa5Btgk74"
},
"outputs": [],
"source": [
"class UnivariateLinearRegression:\n",
"\n",
"    def __init__(self):\n",
"        self.theta_0: float = 0.\n",
"        self.theta_1: float = 0.\n",
"\n",
"    def predict(self, x):\n",
"        # y = theta_0 + theta_1 * x\n",
"        return None  # TODO\n",
"\n",
"    def fit(self, x, y):\n",
"\n",
"        self.theta_1 = ...  # TODO\n",
"        self.theta_0 = ...  # TODO\n",
"\n",
"        return self"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9LzenH1UhLOs"
},
"source": [
"### Task 2b\n",
"\n",
"Fit your linear model to `x` and the target `y1`.\n",
"\n",
"* Create an instance of the class `UnivariateLinearRegression`\n",
"* fit the model using its `.fit` method\n",
"* get the predicted values, using `.predict`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UHGuDWAntd8R"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "elE3OfjHjBRO"
},
"source": [
"* implement the function `plot_model`\n",
"* use `plot_model` to plot your linear regression model together with the true data points"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T0eKDuRt1YOF"
},
"outputs": [],
"source": [
"def plot_model(x, y_pred, y_true, title):\n",
"    # TODO\n",
"    ...\n",
"    plt.show()\n",
"\n",
"plot_model(...)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tt2RnAwAG1n9"
},
"source": [
"* Fit another linear model to `x` and `y2`\n",
"* get the predicted values\n",
"* plot the model with `plot_model`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ccq3GI17Ga2x"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "E0i3gWvIl7nY"
},
"source": [
"## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
"\n",
"* both plots containing the linear regression model and the true data points\n",
"* a short (2-3 sentences) interpretation of the fitted lines: Why do you think they look the way\n",
"they do? Can you draw any conclusions?\n",
"\n",
"**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0TK0Pi4ClphY"
},
"source": [
"# Task 3 (4 Points): Univariate Linear Regression using Stochastic Gradient Descent"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YL31gChVqLpC"
},
"source": [
"### Task 3a\n",
"\n",
"In class you have seen an alternative way to estimate the parameters $\\theta_i$ of the linear regression model: Gradient Descent.\n",
"\n",
"For the univariate linear regression model, the stochastic gradient descent updates look like this:\n",
"* $\\theta_{0}^{(t+1)} = \\theta_{0}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t)$\n",
"* $\\theta_{1}^{(t+1)} = \\theta_{1}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t) x_t$\n",
"\n",
"Here $\\alpha$ is the learning rate, and $(x_t, y_t)$ is the data point sampled\n",
"at time $t$.\n",
"\n",
"\n",
"In the following cell, implement the `.fit` and `.predict` methods:\n",
"* In the `.predict` method you will have to apply the model to the input `x`.\n",
"* In the `.fit` method you will have to implement the update equations for\n",
"$\\theta_0$ and $\\theta_1$."
]
},
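{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see where these updates come from is to differentiate a per-sample loss. The following is a sketch under the assumption that the loss for the sampled point is the halved squared error (the factor $\\frac{1}{2}$ is an assumption made here so that the gradients match the update equations; the notebook does not state the loss explicitly):\n",
"\n",
"$$\\ell_t(\\theta_0, \\theta_1) = \\frac{1}{2} (\\theta_0 + \\theta_1 x_t - y_t)^2$$\n",
"\n",
"$$\\frac{\\partial \\ell_t}{\\partial \\theta_0} = \\theta_0 + \\theta_1 x_t - y_t, \\qquad \\frac{\\partial \\ell_t}{\\partial \\theta_1} = (\\theta_0 + \\theta_1 x_t - y_t)\\, x_t$$\n",
"\n",
"Stepping by $\\alpha$ against each partial derivative, with both derivatives evaluated at the current $(\\theta_0^{(t)}, \\theta_1^{(t)})$, gives the two update equations. As a made-up numerical example, start from $\\theta_0^{(0)} = \\theta_1^{(0)} = 0$ with $\\alpha = 0.1$ and the sampled point $(x_0, y_0) = (1, 2)$. The residual is $0 + 0 \\cdot 1 - 2 = -2$, so:\n",
"\n",
"$$\\theta_0^{(1)} = 0 - 0.1 \\cdot (-2) = 0.2, \\qquad \\theta_1^{(1)} = 0 - 0.1 \\cdot (-2) \\cdot 1 = 0.2$$\n",
"\n",
"Both new values use the same residual, computed from the old parameters. Updating $\\theta_0$ first and then recomputing the residual for $\\theta_1$ would be a different algorithm."
]
},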
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wJMHvQmXmVKr"
},
"outputs": [],
"source": [
"class SGDUnivariateLinearRegression:\n",
"\n",
"    def __init__(self):\n",
"        self.theta_0: float = 0.\n",
"        self.theta_1: float = 0.\n",
"        self.rng = np.random.default_rng(RANDOM_SEED)\n",
"\n",
"    def predict(self, x):\n",
"        # y = theta_0 + theta_1 * x\n",
"        return None  # TODO\n",
"\n",
"    def fit(self, x, y, n_iter: int = 100, learning_rate: float = 1.0):\n",
"        for t in range(n_iter):\n",
"            sample_ix = self.rng.integers(0, len(x))\n",
"\n",
"            xt = x[sample_ix]\n",
"            yt = y[sample_ix]\n",
"\n",
"            # TODO: update self.theta_0 and self.theta_1 SIMULTANEOUSLY (!!!) according to their update equations\n",
"\n",
"        return self"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MHLBmTm4vK9p"
},
"source": [
"### Task 3b\n",
"\n",
"Run SGD for `x` and the target `y1` and compute the mean squared error (MSE).\n",
"The MSE is defined as: $\\frac{1}{n}\\sum_{i=1}^{n} (\\hat{y}_i - y_i)^2$, where\n",
"$\\hat{y}_i$ are the model predictions.\n",
"\n",
"* Create an instance of the class `SGDUnivariateLinearRegression`\n",
"* fit the model using its `.fit` method\n",
"* get the predicted values, using `.predict`\n",
"* implement the `mse` function\n",
"* compute the MSE of your predictions"
]
},
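{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the MSE definition above on made-up numbers (an illustrative assumption, not lab data): for predictions $\\hat{y} = (1, 2, 3)$ and targets $y = (1, 3, 2)$,\n",
"\n",
"$$\\mathrm{MSE} = \\frac{1}{3} \\left( (1 - 1)^2 + (2 - 3)^2 + (3 - 2)^2 \\right) = \\frac{2}{3} \\approx 0.667$$\n",
"\n",
"An `mse` implementation should return this value for these two arrays, and $0$ whenever the predictions equal the targets."
]
},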
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CZ1szyQhK9so"
},
"outputs": [],
"source": [
"def mse(y_pred, y_true):\n",
"    # TODO\n",
"    return 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V35vBU5Yti8Z"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hSsE1o6GwA3K"
},
"source": [
"### Task 3c\n",
"\n",
"You will now plot the learning curves for different learning rates $\\alpha$.\n",
"A learning curve shows how a model's performance changes with an increasing number of update steps.\n",
"In our case we will plot the model's MSE as a function of the number of update\n",
"steps `n_iter` for different values of `learning_rate`.\n",
"\n",
"In the following cell we set up most of the scaffold to create this plot. Follow\n",
"the instructions in the comments to finish the plots."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4Rr5ix7LNISB"
},
"outputs": [],
"source": [
"n_iters = [50, 100, 200, 500, 1000, 2000]\n",
"learning_rates = [1., .1, .01]\n",
"\n",
"# we plot the MSE achieved by the closed form model as a reference\n",
"closed_form = UnivariateLinearRegression()\n",
"closed_form.fit(x, y1)\n",
"mse_base = mse(y_pred=closed_form.predict(x), y_true=y1)\n",
"plt.plot(n_iters, np.ones_like(n_iters) * mse_base, label=\"closed form\", linestyle='--', c='b')\n",
"\n",
"for alpha in learning_rates:\n",
"    mses = []\n",
"    for n_iter in n_iters:\n",
"        # fit a SGDUnivariateLinearRegression model using n_iter=n_iter and\n",
"        # learning_rate=alpha\n",
"        # compute its mse and append the mse value to the mses list\n",
"\n",
"        mse_ = 1.  # replace with mse calculation\n",
"        mses.append(mse_)\n",
"    plt.plot(n_iters, mses, label=f\"alpha = {alpha:.2f}\")\n",
"\n",
"plt.xlabel(\"n_iter\")\n",
"plt.ylabel(\"MSE\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SmCkMMJyEEgV"
},
"source": [
"## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
"\n",
"* the final plot containing the learning curves\n",
"* a short (2-3 sentences) interpretation of the curves: Why do you think they look the way\n",
"they do? Can you draw any conclusions?\n",
"\n",
"In case you were not able to arrive at the final plot:\n",
"\n",
"* include screenshots of the code you wrote so we can assign partial credit\n",
"\n",
"**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dgrNtwsPyigH"
},
"source": [
"# Task 4 (3 Points): Multivariate Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_sPWegXCg2y1"
},
"source": [
"In this task we will apply linear regression to non-synthetic data.\n",
"The variable `X` is a `pandas` `DataFrame` containing the features and `y` contains\n",
"the target. Read through the printed description to get an idea of the different variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "djGUQ3kVx9ob"
},
"outputs": [],
"source": [
"from sklearn.datasets import load_diabetes\n",
"\n",
"data = load_diabetes(as_frame=True)\n",
"\n",
"X = data['data']\n",
"y = data['target']\n",
"description = data['DESCR']\n",
"\n",
"print(description)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "byOVt9t9_2c7"
},
"source": [
"### Task 4a\n",
"\n",
"Implement linear regression using `sklearn`.\n",
"\n",
"* create an instance of the class `sklearn.linear_model.LinearRegression`. Refer to the documentation at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\n",
"* call its `.fit` method\n",
"* get the predicted values with `.predict`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eyiU4nCQBovr"
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G4AktC189PAc"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "qQUdYHOXpeLd"
},
"source": [
"### Task 4b\n",
"\n",
"The estimated feature weights $\\theta$ of the linear model can be found in the `.coef_` member variable (the intercept is stored separately in `.intercept_`). The feature names can be found in the `.feature_names_in_` member variable. They are the same as the names of the columns of `X` and should be in the same order.\n",
"\n",
"Visualize the estimated parameters and the feature names in a bar plot.\n",
"\n",
"Using these, answer the following questions:\n",
"\n",
"* Which are the 3 most influential features?\n",
"* How do you interpret the sign of the coefficients?\n",
"* If you had to exclude 1 feature, which one would you select and why?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "odXnubfHqrfc"
},
"outputs": [],
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xa_HDxFeolBj"
},
"source": [
"## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
"\n",
"* the bar plot\n",
"* your answers to the questions in Task 4b\n",
"\n",
"**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n"
]
}
],
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 0
}