{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "FZEco2HK6D57"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "a3nCUqopXHwv"
},
"outputs": [],
"source": [
"RANDOM_SEED = 0x0"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "jjTkUw7BWulH"
},
"source": [
"# Lab 03: Linear Regression"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "gNnZUk36Xz7_"
},
"source": [
"For the first few Tasks, we will work with synthetic univariate data.\n",
"We generate $100$ features $x_i \\in [-1, 1]$ as `x` and two different\n",
"regression targets `y1` and `y2`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "Ojta777H2ulb"
},
"outputs": [],
"source": [
"data_rng = np.random.default_rng(RANDOM_SEED)\n",
"n = 100\n",
"x = 2 * data_rng.random(n) - 1 # create n points between -1 and 1\n",
"\n",
"# setup synthetic linear data\n",
"true_offset = 0.5\n",
"true_slope = 1.25\n",
"noise = data_rng.normal(loc=0., scale=0.25, size=(n,))\n",
"\n",
"y1 = true_offset + true_slope * x + noise\n",
"\n",
"\n",
"# setup synthetic non-linear data\n",
"y2 = true_offset + np.sin(np.pi * x) + noise"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ntdpTWzqZqAU"
},
"source": [
"# Task 1 (1 Point): Pearson Correlation"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "JbNJ7WhzbAtm"
},
"source": [
"### Task 1a\n",
"\n",
"Plot `x` against the target variable `y1`.\n",
"\n",
"* use `plt.scatter`\n",
"\n",
"\n",
"Do you think there is a linear relationship between `x` and the target?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "MxYMdhfxyYAd"
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7fbd80918090>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(x, y1)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"_Yes, definitely!_"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "6Ak0nQ0PDGpm"
},
"source": [
"Plot `x` against the target variable `y2`.\n",
"\n",
"Do you think there is a linear relationship between `x` and the target?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "HpzwoBdQDd-d"
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7f2336ae7550>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(x, y2)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"_No, there is not_"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "HycYQm3tbvyf"
},
"source": [
"### Task 1b\n",
"\n",
"In class you have seen the formula for the Pearson Correlation:\n",
"$\\rho(a, b) = \\frac{\\sum_{i = 1}^{m} (a_i - \\bar{a})(b_i - \\bar{b})}{\\sqrt{\\sum_{i=1}^{m} (a_i - \\bar{a})^2\\sum_{i = 1}^{m}(b_i - \\bar{b})^2}} $, where $\\bar{a} = \\frac{1}{m}\\sum_{i=1}^{m} a_i$ and $\\bar{b} = \\frac{1}{m}\\sum_{i=1}^{m} b_i$.\n",
"\n",
"* Compute the Pearson Correlation $\\rho$ between `x` and the target `y1`.\n",
"* Compute the Pearson Correlation between `x` and `y2`.\n",
"* Check that you get the same result as the reference implementation"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "EUoJXIrCy0p6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rho(x, y1): 0.9513\n",
"rho(x, y2): 0.7052\n"
]
}
],
"source": [
"\n",
"def pearson(a, b):\n",
" return sum([(a[i] - a.mean()) * (b[i] - b.mean()) for i in range(0, len(a))]) / \\\n",
" np.sqrt(sum([(a[i] - a.mean()) ** 2 for i in range(0, len(a))]) * sum([(b[i] - b.mean()) ** 2 for i in range(0, len(b))]))\n",
"\n",
"print(f\"rho(x, y1): {pearson(x, y1):.4f}\")\n",
"print(f\"rho(x, y2): {pearson(x, y2):.4f}\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "L_NesuDQddHS"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rho(x, y1): 0.9513\n",
"rho(x, y2): 0.7052\n"
]
}
],
"source": [
"# Refer to the output of this cell to check whether your implementation of rho\n",
"# is correct.\n",
"\n",
"from scipy.stats import pearsonr\n",
"\n",
"print(f\"rho(x, y1): {pearsonr(x, y1)[0]:.4f}\")\n",
"print(f\"rho(x, y2): {pearsonr(x, y2)[0]:.4f}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Kr9OWmCilrAv"
},
"source": [
"## 📢 **HAND-IN** 📢: Report in Moodle whether you solved this task."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "rbjhdwFceHlL"
},
"source": [
"# Task 2 (2 Points): Univariate Linear Regression"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ucnYGKbmecz_"
},
"source": [
"### Task 2a\n",
"\n",
"You will now implement Linear Regression with a single variable. In class you have seen that the underlying model is: $y = \\theta_0 + \\theta_1x$.\n",
"You also derived the maximum likelihood estimates for $\\theta_0$ and $\\theta_1$:\n",
"\n",
"* $\\hat{\\theta}_1 = \\frac{\\sum_{i=1}^{m} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{m}(x_i - \\bar{x})^2}$ with $\\bar{x} = \\frac{1}{m}\\sum_{i=1}^{m} x_i$ and $\\bar{y} = \\frac{1}{m}\\sum_{i=1}^{m} y_i$.\n",
"* $\\hat{\\theta}_0 = \\bar{y} - \\hat{\\theta}_1\\bar{x}$\n",
"\n",
"In the following cell, implement the `.fit` and `.predict` methods:\n",
"* In the `.predict` method you will have to apply the model to the input `x`\n",
"* In the `.fit` method you will have to compute $\\hat{\\theta}_0$ and $\\hat{\\theta}_1$."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "qS0Oa5Btgk74"
},
"outputs": [],
"source": [
"class UnivariateLinearRegression:\n",
"\n",
" def __init__(self):\n",
" self.theta_0: float = 0.\n",
" self.theta_1: float = 0.\n",
"\n",
" def predict(self, x):\n",
" # y = theta_0 + theta_1 * x\n",
" return self.theta_0 + self.theta_1 * x\n",
"\n",
" def fit(self, x, y):\n",
"\n",
" self.theta_1 = sum([(x[i] - x.mean()) * (y[i] - y.mean()) for i in range(0, len(x))]) / sum([(x[i] - x.mean()) ** 2 for i in range(0, len(x))])\n",
" self.theta_0 = y.mean() - self.theta_1 * x.mean()\n",
"\n",
" return self"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "9LzenH1UhLOs"
},
"source": [
"### Task 2b\n",
"\n",
"Fit your linear model to `x` and the target `y1`.\n",
"\n",
"* Create an instance of the class `UnivariateLinearRegression`\n",
"* fit the model using its `.fit` method\n",
"* get the predicted values, using `.predict`\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "UHGuDWAntd8R"
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.81972252, -0.08621518, -0.65077025, -0.71108605, 1.25473192,\n",
" 1.50019396, 0.74489883, 1.04803555, 0.58943114, 1.55525673,\n",
" 1.26110578, -0.74510824, 1.36362466, -0.66899868, 1.04842757,\n",
" -0.31846659, 1.37787255, 0.58409243, -0.01238023, 0.29103911,\n",
" -0.68199134, -0.44521854, 0.9027792 , 0.84495782, 0.76648623,\n",
" 0.19478983, 1.70856978, 1.66816843, 0.9395856 , 0.85302538,\n",
" 0.94675253, 0.20772813, -0.41853885, 1.02827672, 0.54435158,\n",
" 0.0136006 , 0.4468457 , 1.44278502, 1.55271809, 0.1309298 ,\n",
" 0.65828127, 0.04228939, 0.71446262, 0.08186971, 0.21438391,\n",
" 1.44472561, -0.1913948 , 0.78573633, -0.54457236, 1.30253353,\n",
" 1.19015742, -0.16126428, 1.41070099, -0.60735898, 0.07744293,\n",
" -0.38107765, 0.35926577, 1.21292081, -0.18279715, -0.62351186,\n",
" 0.24629335, -0.26207004, -0.5279483 , 0.67999998, -0.01488642,\n",
" 0.90616057, -0.2595968 , 1.57262835, 0.14897817, -0.49157451,\n",
" 0.80034535, 1.53572082, 0.33468582, 1.60341403, 0.48153732,\n",
" 0.29730957, 0.77839929, 1.70335527, 1.58948153, 0.383213 ,\n",
" 1.1176936 , 0.47543535, 0.55411682, 1.1869188 , 0.27122316,\n",
" 1.0603401 , 1.00275116, 1.54782335, -0.46828955, 1.04684768,\n",
" 1.53638546, 1.63631745, -0.71557985, 1.3790104 , 1.66905593,\n",
" 1.60987763, -0.38481676, 1.64792032, 1.44388969, 1.27719337])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = UnivariateLinearRegression()\n",
"model.fit(x, y1)\n",
"model.predict(x)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "elE3OfjHjBRO"
},
"source": [
"* implement the function `plot_model`\n",
"* use `plot_model` to plot your linear regression model given the true datapoints"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "T0eKDuRt1YOF"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def plot_model(x, y_pred, y_true, title):\n",
" plt.title(title)\n",
" plt.plot(x, y_pred, label=\"prediction\", c=\"black\")\n",
" plt.scatter(x, y_true, label=\"true\")\n",
" plt.legend()\n",
" plt.show()\n",
"\n",
"plot_model(x, model.predict(x), y1, \"Prediction of y_1\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "tt2RnAwAG1n9"
},
"source": [
"* Fit another linear model to `x` and `y2`\n",
"* get the predicted values\n",
"* plot the model with `plot_model`"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "Ccq3GI17Ga2x"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"model = UnivariateLinearRegression()\n",
"model.fit(x, y2)\n",
"plot_model(x, model.predict(x), y2, \"Prediction of y_2\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "E0i3gWvIl7nY"
},
"source": [
"## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
"\n",
"* both plots containing the linear regression model and true datapoints\n",
"* a short (2-3 sentences) interpretation of the curves: why do you think they look the way\n",
"they do? can you draw any conclusions?\n",
"\n",
"**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
" - _The prediction of $y_1$ is very accurate because there seems to be a linear relation between $x$ and $y_1$._\n",
" - _The prediction of $y_2$ looks very inaccurate - though, there seems to be a non-linear relation between_\n",
" - _The prediction of $y_2$ probably has an $R^2$ value nearby $0$._"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "0TK0Pi4ClphY"
},
"source": [
"# Task 3 (4 Points): Univariate Linear Regression using Stochastic Gradient Descent"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "YL31gChVqLpC"
},
"source": [
"### Task 3a\n",
"\n",
"In class you have seen an alternative version to estimate the parameters $\\theta_i$ of the linear regression models by using Gradient Descent.\n",
"\n",
"For the univariate linear regression model, the stochastic gradient descent updates look like this:\n",
"* $\\theta_{0}^{(t+1)} = \\theta_{0}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t)$\n",
"* $\\theta_{1}^{(t+1)} = \\theta_{1}^{(t)} - \\alpha (\\theta_{0}^{(t)} + \\theta_{1}^{(t)} x_t - y_t) x_t$\n",
"\n",
"Here $\\alpha$ is the learning rate, and $(x_t, y_t)$ is the data point sampled\n",
"at time $t$.\n",
"\n",
"\n",
"In the following cell, implement the `.fit` and `.predict` methods:\n",
"* In the `.predict` method you will have to apply the model to the input `x`.\n",
"* In the `.fit` method you will have to implement the update equations for\n",
"$\\theta_0$ and $\\theta_1$."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "wJMHvQmXmVKr"
},
"outputs": [],
"source": [
"class SGDUnivariateLinearRegression:\n",
"\n",
" def __init__(self):\n",
" self.theta_0: float = 0.\n",
" self.theta_1: float = 0.\n",
" self.rng = np.random.default_rng(RANDOM_SEED)\n",
"\n",
" def predict(self, x):\n",
" # y = theta_0 + theta_1 * x\n",
" return self.theta_0 + self.theta_1 * x\n",
"\n",
" def fit(self, x, y, n_iter: int = 100, learning_rate: float = 1.0):\n",
" for t in range(n_iter):\n",
" sample_ix = self.rng.integers(0, len(x))\n",
"\n",
" xt = x[sample_ix]\n",
" yt = y[sample_ix]\n",
"\n",
" # TODO: update self.theta_0 and self.theta_1 SIMULTANEOUSLY (!!!) according to their update equations\n",
" theta_0 = self.theta_0 - learning_rate * (self.theta_0 + self.theta_1 * xt - yt)\n",
" self.theta_1 = self.theta_1 - learning_rate * (self.theta_0 + self.theta_1 * xt - yt) * xt\n",
" self.theta_0 = theta_0\n",
"\n",
" return self"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "MHLBmTm4vK9p"
},
"source": [
"### Task 3b\n",
"\n",
"Run SGD for `x` and the target `y1` and compute the mean squared error (MSE).\n",
"The MSE is defined as: $\\frac{1}{n}\\sum_{i=1}^{n} (\\hat{y}_i - y_i)^2$, where\n",
"$\\hat{y}$ are the model predictions.\n",
"\n",
"* Create an instance of the class `SGDUnivariateLinearRegression`\n",
"* fit the model using its `.fit` method\n",
"* get the predicted values, using `.predict`\n",
"* implement the `mse` function\n",
"* compute the MSE of your predictions"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "CZ1szyQhK9so"
},
"outputs": [],
"source": [
"def mse(y_pred, y_true):\n",
" return 1 / len(y_pred) * sum([(y_pred[i] - y_true[i]) ** 2 for i in range(0, len(y_pred))])"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"id": "V35vBU5Yti8Z"
},
"outputs": [
{
"data": {
"text/plain": [
"0.2393998747452736"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = SGDUnivariateLinearRegression()\n",
"model.fit(x, y1)\n",
"mse(y1, model.predict(x))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "hSsE1o6GwA3K"
},
"source": [
"### Task 3c\n",
"\n",
"You will now plot the learning curves for different learning rates $\\alpha$.\n",
"A learning curves shows how a model's performance changes with increasing number of update steps.\n",
"In our case we will plot the model's MSE as a function of the number of update\n",
"steps `n_iter` for different values of `learning_rate`.\n",
"\n",
"In the following cell we setup most of the scaffold to create this plot. Follow\n",
"the instructions in the comments to finish the plots."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "4Rr5ix7LNISB"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"n_iters = [50, 100, 200, 500, 1000, 2000]\n",
"learning_rates = [1., .1, .01]\n",
"\n",
"# we plot the MSE achieved by the closed form model as a reference\n",
"closed_form = UnivariateLinearRegression()\n",
"closed_form.fit(x, y1)\n",
"mse_base = mse(y_pred=closed_form.predict(x), y_true=y1)\n",
"plt.plot(n_iters, np.ones_like(n_iters) * mse_base, label=\"closed form\", linestyle='--', c='b')\n",
"\n",
"for alpha in learning_rates:\n",
" mses = []\n",
" for n_iter in n_iters:\n",
" # fit a SGDUnivariateLinearRegression model using n_iter=n_iter and\n",
" # learning_rate=alpha\n",
" # compute its mse and append the mse value to the mses list\n",
" model = SGDUnivariateLinearRegression()\n",
" model.fit(x, y1, n_iter=n_iter, learning_rate=alpha)\n",
"\n",
" mse_ = mse(model.predict(x), y1)\n",
" mses.append(mse_)\n",
" plt.plot(n_iters, mses, label=f\"alpha = {alpha:.2f}\")\n",
"\n",
"plt.xlabel(\"n_iter\")\n",
"plt.ylabel(\"MSE\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "SmCkMMJyEEgV"
},
"source": [
"## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
"\n",
"* the final plot containing learning curves\n",
"* a short (2-3 sentences) interpretation of the curves: why do you think they look the way\n",
"they do? can you draw any conclusions?\n",
"\n",
"In case you were not able to arrive at the final plot:\n",
"\n",
"* include screenshots of the code you wrote so we can assign partial credit\n",
"\n",
"**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"- _The learning rate $1.0$ is too high causing the function to diverge from the actual results_\n",
"- _The best learning rate seems to be $0.01$._"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "dgrNtwsPyigH"
},
"source": [
"# Task 4 (3 Points): Multivariate Linear Regression"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "_sPWegXCg2y1"
},
"source": [
"In this task we will apply linear regression to non-synthetic data.\n",
"The variable `X` is a `pandas` `Dataframe` containing features and `y` contains\n",
"the target. Read through the description to get an idea of the different variables."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"id": "djGUQ3kVx9ob"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
".. _diabetes_dataset:\n",
"\n",
"Diabetes dataset\n",
"----------------\n",
"\n",
"Ten baseline variables, age, sex, body mass index, average blood\n",
"pressure, and six blood serum measurements were obtained for each of n =\n",
"442 diabetes patients, as well as the response of interest, a\n",
"quantitative measure of disease progression one year after baseline.\n",
"\n",
"**Data Set Characteristics:**\n",
"\n",
" :Number of Instances: 442\n",
"\n",
" :Number of Attributes: First 10 columns are numeric predictive values\n",
"\n",
" :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n",
"\n",
" :Attribute Information:\n",
" - age age in years\n",
" - sex\n",
" - bmi body mass index\n",
" - bp average blood pressure\n",
" - s1 tc, total serum cholesterol\n",
" - s2 ldl, low-density lipoproteins\n",
" - s3 hdl, high-density lipoproteins\n",
" - s4 tch, total cholesterol / HDL\n",
" - s5 ltg, possibly log of serum triglycerides level\n",
" - s6 glu, blood sugar level\n",
"\n",
"Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).\n",
"\n",
"Source URL:\n",
"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html\n",
"\n",
"For more information see:\n",
"Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) \"Least Angle Regression,\" Annals of Statistics (with discussion), 407-499.\n",
"(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)\n",
"\n"
]
}
],
"source": [
"from sklearn.datasets import load_diabetes\n",
"\n",
"data = load_diabetes(as_frame=True)\n",
"\n",
"X = data['data']\n",
"y = data['target']\n",
"description = data['DESCR']\n",
"\n",
"print(description)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "byOVt9t9_2c7"
},
"source": [
"### Task 4a\n",
"\n",
"Implement linear regression using `sklearn`.\n",
"\n",
"* create an instance of the class `sklearn.linear_model.LinearRegression`. Refer to the documentation at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\n",
"* call its `.fit` method\n",
"* get the predicted values with `.predict`"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"id": "eyiU4nCQBovr"
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"id": "G4AktC189PAc"
},
"outputs": [
{
"data": {
"text/plain": [
"array([206.11667725, 68.07103297, 176.88279035, 166.91445843,\n",
" 128.46225834, 106.35191443, 73.89134662, 118.85423042,\n",
" 158.80889721, 213.58462442, 97.07481511, 95.10108423,\n",
" 115.06915952, 164.67656842, 103.07814257, 177.17487964,\n",
" 211.7570922 , 182.84134823, 148.00326937, 124.01754066,\n",
" 120.33362197, 85.80068961, 113.1134589 , 252.45225837,\n",
" 165.48779206, 147.71997564, 97.12871541, 179.09358468,\n",
" 129.05345958, 184.7811403 , 158.71516713, 69.47575778,\n",
" 261.50385365, 112.82234716, 78.37318279, 87.66360785,\n",
" 207.92114668, 157.87641942, 240.84708073, 136.93257456,\n",
" 153.48044608, 74.15426666, 145.62742227, 77.82978811,\n",
" 221.07832768, 125.21957584, 142.6029986 , 109.49562511,\n",
" 73.14181818, 189.87117754, 157.9350104 , 169.55699526,\n",
" 134.1851441 , 157.72539008, 139.11104979, 72.73116856,\n",
" 207.82676612, 80.11171342, 104.08335958, 134.57871054,\n",
" 114.23552012, 180.67628279, 61.12935368, 98.72404613,\n",
" 113.79577026, 189.95771575, 148.98351571, 124.34152283,\n",
" 114.8395504 , 121.99957578, 73.91017087, 236.71054289,\n",
" 142.31126791, 124.51672384, 150.84073896, 127.75230658,\n",
" 191.16896496, 77.05671154, 166.82164929, 91.00591229,\n",
" 174.75156797, 122.83451589, 63.27231315, 151.99867317,\n",
" 53.72959077, 166.0050229 , 42.6491333 , 153.04229493,\n",
" 80.54701716, 106.90148495, 79.93968011, 187.1672654 ,\n",
" 192.5989033 , 61.07398313, 107.4076912 , 125.04307496,\n",
" 207.72402726, 214.21248827, 123.47464895, 139.16439034,\n",
" 168.21372017, 106.92902558, 150.64748328, 157.92364009,\n",
" 152.75958287, 116.22381927, 73.03167734, 155.67052006,\n",
" 230.1417777 , 143.49797317, 38.09587272, 121.8593267 ,\n",
" 152.79404663, 207.99702587, 291.23106133, 189.17571129,\n",
" 214.02877593, 235.18106509, 165.38480498, 151.2469168 ,\n",
" 156.57659557, 200.44066818, 219.35193167, 174.78830391,\n",
" 169.23118221, 187.87537099, 57.49340026, 108.54836058,\n",
" 92.68731024, 210.87347343, 245.47097701, 69.84285129,\n",
" 113.03485904, 68.42650654, 141.69639374, 239.46240737,\n",
" 58.37858726, 235.47123197, 254.92309543, 253.30708899,\n",
" 155.51063293, 230.55961445, 170.44330954, 117.9953395 ,\n",
" 178.55406527, 240.07119308, 190.33892524, 228.66470581,\n",
" 114.24456339, 178.36552308, 209.091817 , 144.85615197,\n",
" 200.65926745, 121.34295733, 150.50993019, 199.01879825,\n",
" 146.27926469, 124.02163345, 85.25913019, 235.16173729,\n",
" 82.1730808 , 231.29474031, 144.36940116, 197.04628448,\n",
" 146.99841953, 77.18813284, 59.37368356, 262.68557988,\n",
" 225.12900796, 220.20301952, 46.59651844, 88.10194612,\n",
" 221.77450036, 97.25199783, 164.48838425, 119.90096817,\n",
" 157.80220788, 223.08012207, 99.59081773, 165.84386951,\n",
" 179.47680741, 89.83353846, 171.82590335, 158.36419935,\n",
" 201.48185539, 186.39194958, 197.47424761, 66.57371647,\n",
" 154.59985312, 116.18319159, 195.91755793, 128.04834496,\n",
" 91.20395862, 140.57223765, 155.22669143, 169.70326581,\n",
" 98.7573858 , 190.14568824, 142.51704894, 177.27157771,\n",
" 95.30812216, 69.06191507, 164.16391317, 198.0659024 ,\n",
" 178.25996632, 228.58539684, 160.67104137, 212.28734795,\n",
" 222.4833913 , 172.85421282, 125.27946793, 174.72103207,\n",
" 152.38094643, 98.58135665, 99.73771331, 262.29507095,\n",
" 223.74033222, 221.33976142, 133.61470602, 145.42828204,\n",
" 53.04569008, 141.82052358, 153.68617582, 125.22290891,\n",
" 77.25168449, 230.26180811, 78.9090807 , 105.2051755 ,\n",
" 117.99622779, 99.06233889, 166.55796947, 159.34137227,\n",
" 158.27448255, 143.05684078, 231.55890118, 176.64724258,\n",
" 187.23580712, 65.39099908, 190.66218796, 179.75181691,\n",
" 234.9080532 , 119.15669025, 85.63551834, 100.8597527 ,\n",
" 140.41937377, 101.83524022, 120.66560385, 83.0664276 ,\n",
" 234.58488012, 245.15862773, 263.26954282, 274.87127261,\n",
" 180.67257769, 203.05642297, 254.21625849, 118.44300922,\n",
" 268.45369506, 104.83843473, 115.86820464, 140.45857194,\n",
" 58.46948192, 129.83145265, 263.78607272, 45.00934573,\n",
" 123.28890007, 131.0856888 , 34.89181681, 138.35467112,\n",
" 244.30103923, 89.95923929, 192.07096194, 164.33017386,\n",
" 147.74779723, 191.89092557, 176.44360299, 158.3490221 ,\n",
" 189.19166962, 116.58117777, 111.449754 , 117.45232726,\n",
" 165.79598354, 97.80405886, 139.54451791, 84.17319946,\n",
" 159.93677518, 202.39971737, 80.48131518, 146.64558568,\n",
" 79.05314048, 191.33777472, 220.67516721, 203.75017281,\n",
" 92.86459928, 179.15576252, 81.79874055, 152.8290929 ,\n",
" 76.80052219, 97.79590831, 106.8371012 , 123.83461591,\n",
" 218.13908293, 126.01937664, 206.7587966 , 230.5767944 ,\n",
" 122.05921633, 135.67824405, 126.37042532, 148.49374458,\n",
" 88.07147107, 138.95823614, 203.8691938 , 172.55288732,\n",
" 122.95701477, 213.92310163, 174.89158814, 110.07294222,\n",
" 198.36584973, 173.25229067, 162.64748776, 193.31578983,\n",
" 191.53493643, 284.13932209, 279.31133207, 216.00823829,\n",
" 210.08668656, 216.21612991, 157.01450004, 224.06431372,\n",
" 189.06103154, 103.56515315, 178.70270016, 111.81862434,\n",
" 291.00196609, 182.64651752, 79.33315426, 86.33029851,\n",
" 249.1510082 , 174.51537682, 122.10291074, 146.2718871 ,\n",
" 170.65483847, 183.497196 , 163.36806262, 157.03297709,\n",
" 144.42614949, 125.30053093, 177.50251197, 104.57681546,\n",
" 132.17560518, 95.06210623, 249.89755705, 86.23824126,\n",
" 61.99847009, 156.81295053, 192.32218372, 133.85525804,\n",
" 93.67249793, 202.49572354, 52.54148927, 174.82799914,\n",
" 196.91468873, 118.06336979, 235.29941812, 165.09438096,\n",
" 160.41761959, 162.37786753, 254.05587268, 257.23492156,\n",
" 197.5039462 , 184.06877122, 58.62131994, 194.39216636,\n",
" 110.775815 , 142.20991224, 128.82520996, 180.13082199,\n",
" 211.26488624, 169.59494046, 164.33851796, 136.23374077,\n",
" 174.51001028, 74.67587343, 246.29432383, 114.14494406,\n",
" 111.54552901, 140.0224376 , 109.99895704, 91.37283987,\n",
" 163.01540596, 75.16804478, 254.06119047, 53.47338214,\n",
" 98.48397565, 100.66315554, 258.58683032, 170.67256752,\n",
" 61.91771186, 182.31148421, 171.26948629, 189.19505093,\n",
" 187.18494664, 87.12170524, 148.37964317, 251.35815403,\n",
" 199.69656904, 283.63576862, 50.85911237, 172.14766276,\n",
" 204.05976093, 174.16540137, 157.93182911, 150.50028158,\n",
" 232.97445368, 121.5814873 , 164.54245461, 172.67625919,\n",
" 226.7768891 , 149.46832104, 99.13924946, 80.43418456,\n",
" 140.16148637, 191.90710484, 199.28001608, 153.63277325,\n",
" 171.80344337, 112.11054883, 162.60002916, 129.84290324,\n",
" 258.03100468, 100.70810916, 115.87608197, 122.53559675,\n",
" 218.1797988 , 60.94350929, 131.09296884, 119.48376601,\n",
" 52.60911672, 193.01756549, 101.05581371, 121.22668124,\n",
" 211.85894518, 53.44727472])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = LinearRegression()\n",
"model.fit(X, y)\n",
"model.predict(X)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "qQUdYHOXpeLd"
},
"source": [
"### Task 4b\n",
"\n",
"The estimated parameters $\\theta$ of the linear model can be found in the `.coef_` member variable. The feature names can be found in the `.feature_names_in_` member variable. They are the same as the names of the columns of `X` and should be in the same order.\n",
"\n",
"Visualize the estimated parameters and the feature names in a bar plot.\n",
"\n",
"Using these, answer the following questions:\n",
"\n",
"* Which are the 3 most influential features?\n",
"* How do you interpret the sign of the coefficients?\n",
"* If you had to exclude 1 feature, which one would you select and why?"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"id": "odXnubfHqrfc"
},
"outputs": [
{
"data": {
"text/plain": [
"<BarContainer object of 10 artists>"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.bar(model.feature_names_in_, height=model.coef_)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"- _`bmi`, `s1`, `s5`_\n",
"- Negative coefficients such as the one in `s1` indicate a negative correlation\n",
"- `age`, since it has nearly no influence"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "xa_HDxFeolBj"
},
"source": [
"## 📢 **HAND-IN** 📢: A PDF document containing the following:\n",
"\n",
"* the bar plot\n",
"* your answers to the questions in Task 4b\n",
"\n",
"**Solutions for Tasks 2, 3 and 4 should be in the same document: you will only upload 1 document with your solutions for all 3 tasks!**\n"
]
}
],
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}