Add MLDM Lab04 course material

2023-06-17 16:22:28 +02:00 · 2023-06-17 16:22:28 +02:00 · 50ec2fe9ca
commit 50ec2fe9ca
parent d1264e158c
2 changed files with 592 additions and 0 deletions
--- a/Mining/Labs/L04_Polynomial_and_Logistic_Regression_LAB_ASSIGNMENT.ipynb
+++ b/Mining/Labs/L04_Polynomial_and_Logistic_Regression_LAB_ASSIGNMENT.ipynb
@@ -0,0 +1,592 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "pAJRKdv9QA4C"
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "from matplotlib import pyplot as plt"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "D1m4qcrpXdTt"
+      },
+      "outputs": [],
+      "source": [
+        "RANDOM_SEED = 0x0"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Ia9s_Q-KXf0T"
+      },
+      "source": [
+        "# Task 1: Polynomial Regression (5 Points)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "VV0Z3OdeXpha"
+      },
+      "source": [
+        "Let's create and explore the data."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "nb5WsezldFla"
+      },
+      "outputs": [],
+      "source": [
+        "# set the random seed to RANDOM_SEED so that everyone works with the same data\n",
+        "np.random.seed(seed=RANDOM_SEED)\n",
+        "# create the predictor variable from a standard normal distribution and reshape it into a column vector, as required for model training\n",
+        "x = np.random.normal(0, 1, 100).reshape(-1, 1)\n",
+        "# create the target variable: a cubic polynomial in x plus Gaussian noise (standard deviation 10)\n",
+        "y = 3*x**3 + 2*x**2 + x + np.random.normal(0, 10, 100).reshape(-1, 1)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "E65IxT1Bwpmk"
+      },
+      "source": [
+        "Visualize the data."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "nCZgTYP3fZe7"
+      },
+      "outputs": [],
+      "source": [
+        "plt.scatter(x, y)\n",
+        "plt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Cx8aSpUnJCI7"
+      },
+      "source": [
+        "## Task 1a\n",
+        "Apply Linear Regression to the data.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "nvRxguOTJnjS"
+      },
+      "source": [
+        "1. Split the data into a training set and a test set (80/20); set `random_state` to `RANDOM_SEED`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ti7myWk7KS8Z"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.model_selection import train_test_split\n",
+        "X_train, X_test, y_train, y_test = ..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "RFSaakYLJuK7"
+      },
+      "source": [
+        "2. Apply Linear Regression to the data and predict `y` values for the training as well as the test data."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "Ez6t4Q4P82Qo"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.linear_model import LinearRegression\n",
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "P1k6VBGk8yI6"
+      },
+      "source": [
+        "3. Calculate MSE for training as well as for test data."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "qJGjXK8aKD8q"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.metrics import mean_squared_error\n",
+        "\n",
+        "...\n",
+        "\n",
+        "print(f\"MSE of training data: {mse_train}\")\n",
+        "print(f\"MSE of test data: {mse_test}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "T0VOmhngKEQL"
+      },
+      "source": [
+        "4. Visualize the model's artefacts: Plot all the data as well as Linear Regression predictions for training and test data in a scatter plot. Don't forget a legend to differentiate the data."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "94LPydyRD4Nr"
+      },
+      "outputs": [],
+      "source": [
+        "def plot_artefacts(...):\n",
+        "    plt.figure()\n",
+        "    ...\n",
+        "    plt.show()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "yLwMEWirLLBA"
+      },
+      "outputs": [],
+      "source": [
+        "plot_artefacts(...)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "zwsmpB3oMJZf"
+      },
+      "source": [
+        "## Task 1b\n",
+        "Investigate how well polynomial regression of degree 2 can solve the task. To do so, follow these steps:\n",
+        "1. Transform the training and test data into polynomial features of degree 2\n",
+        "2. Train a Linear Regression model on polynomial data\n",
+        "3. Make predictions for training data\n",
+        "4. Make predictions for test data\n",
+        "5. Calculate MSE for training as well as test data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "oxni0o041MYH"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.preprocessing import PolynomialFeatures\n",
+        "\n",
+        "def poly_regression(...):\n",
+        "    ...\n",
+        "    return y_pred_train_poly, y_pred_test_poly, mse_train_poly, mse_test_poly"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "mTiIynAqD4Nr"
+      },
+      "outputs": [],
+      "source": [
+        "...\n",
+        "print(f\"MSE of training data: {mse_train_poly}\")\n",
+        "print(f\"MSE of test data: {mse_test_poly}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "nhbIv-toOFoV"
+      },
+      "source": [
+        "6. Did it perform better than Linear Regression? Visualize the results as in **Task 1a) 4**."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "yFNrIwDuOUXo"
+      },
+      "outputs": [],
+      "source": [
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "lR_v9mTWOVNj"
+      },
+      "source": [
+        "## Task 1c\n",
+        "Investigate the influence of the polynomial degree on the results. Consider degrees in `range(0, 11)`. Visualize the results as in **Task 1a) 4** and plot the MSE (on training as well as test data) as a function of the polynomial degree."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "4YFTQqWZO_Jx"
+      },
+      "outputs": [],
+      "source": [
+        "mses_test_poly = []\n",
+        "mses_train_poly = []\n",
+        "\n",
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qY6QK6OhGBVp"
+      },
+      "source": [
+        "## 📢 **HAND-IN** 📢: Answer the following questions in Moodle:\n",
+        "\n",
+        "What is the optimal polynomial degree? Do the training MSE and the test MSE behave similarly? How do the models behave for polynomial degrees >= 8?"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "lhOvMhs_V4cY"
+      },
+      "source": [
+        "# Task 2: Polynomial Data Transformation (1 Point)\n",
+        "\n",
+        "As we have seen in the lecture, Polynomial Regression is simply a generalization of Linear Regression: every Polynomial Regression can be expressed as a Multivariate Linear Regression once the initial data has been transformed.\n",
+        "\n",
+        " $h_\\theta(a) = \\theta_0 a_0 + \\theta_1 a_1 + \\theta_2 a_2$, where\n",
+        " $a_0 = v^0, a_1 = v^1, a_2 = v^2$\n",
+        "\n",
+        "In Task 1, `sklearn.preprocessing.PolynomialFeatures` transformed the X data for us. To understand exactly what it does to the data, in this task we transform an initial data array $v$ by hand (without using `sklearn.preprocessing.PolynomialFeatures`) into\n",
+        "the form $(a_1...a_n)$ that can be used to build a Polynomial Regression model of degree 2. Please transform the array $v$ and enter your answer in Moodle."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Zz4WDaq436-y"
+      },
+      "source": [
+        "\\begin{align}\n",
+        "v=\n",
+        "\\begin{bmatrix}\n",
+        "3 \\\\\n",
+        "2 \\\\\n",
+        "0 \\\\\n",
+        "\\end{bmatrix}\n",
+        "\\end{align}"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DUDUkf1uJFWf"
+      },
+      "source": [
+        "## 📢 **HAND-IN** 📢: Write your answer in Moodle"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "XO800wc-WhyJ"
+      },
+      "source": [
+        "# Task 3: Logistic Regression (4 Points)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "prescribed-lawyer"
+      },
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "import numpy as np\n",
+        "import matplotlib.pyplot as plt\n",
+        "from random import randrange\n",
+        "import seaborn as sns\n",
+        "sns.set()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_lLze_K1g0ZA"
+      },
+      "source": [
+        "## Task 3a. Data Exploration and Preprocessing\n",
+        "\n",
+        "We use the Fashion MNIST dataset from Zalando.\n",
+        "First, we load and explore the dataset.\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "wR0ijS0VZ6dk"
+      },
+      "outputs": [],
+      "source": [
+        "from keras.datasets import fashion_mnist\n",
+        "(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()\n",
+        "print(X_train.shape)\n",
+        "print(y_train.shape)\n",
+        "print(X_train.dtype)\n",
+        "print(y_train.dtype)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "1fd78190-5445-4c53-9b7f-4a0f7aeaf87c"
+      },
+      "outputs": [],
+      "source": [
+        "label_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',\n",
+        "               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_9FX76IifOik"
+      },
+      "source": [
+        "In the following task, we will only use the training part of the dataset."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fDgGTbTMHxwN"
+      },
+      "source": [
+        "#### Prepare data\n",
+        "1. Assign the following data types to the arrays:\n",
+        "   - X_train -> 'float32'\n",
+        "   - y_train -> 'int64'\n",
+        "2. Reshape X_train to a 2-dimensional array.\n",
+        "Note:\n",
+        "   - it should keep the same number of samples/rows.\n",
+        "3. Split the training data into (X_train, y_train) and (X_valid, y_valid); set the size of the validation set to 20% of the training data and set `random_state` to 42."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "d3b1f9ef-da3e-445e-b8b0-8adfa6871f14"
+      },
+      "outputs": [],
+      "source": [
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "26YTJ4uQYE46"
+      },
+      "source": [
+        "#### Visualize some data\n",
+        "Plot 25 images (hint: use ``imshow`` and ``subplots`` from the matplotlib library) and show each image's label name as its title (e.g. Sneaker)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "1ee386c1-8502-4422-8a1c-8c852e67ac40"
+      },
+      "outputs": [],
+      "source": [
+        "plt.figure(figsize=(10,10))\n",
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "b61c7d24-8e54-4827-a15f-b0ae553d5743"
+      },
+      "source": [
+        "#### Normalize the Images\n",
+        "Normalize the images using the mean and standard deviation."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "cdf0d1e7-0466-4a49-a349-6b27a9aa3662"
+      },
+      "outputs": [],
+      "source": [
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "1ca1ada5-1203-4e44-a5aa-38b512d522c6"
+      },
+      "source": [
+        "## Task 3b. Logistic Regression\n",
+        "1. Fit a `LogisticRegression` model from `scikit-learn`. Set the `random_state` for reproducibility.\n",
+        "2. Try different parameters (either by hand or by using `GridSearchCV`).\n",
+        "\n",
+        "\n",
+        "**Accuracy should be >= 0.84**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5Sk55fOkLhgm"
+      },
+      "source": [
+        "Please check the documentation for:\n",
+        "GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n",
+        "\n",
+        "PredefinedSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html\n",
+        "\n",
+        "You can ignore the warning \"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\" as long as GridSearchCV continues with the next hyperparameter and you reach the required accuracy."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "U5TiVcHmLtAy"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.model_selection import GridSearchCV\n",
+        "from sklearn.model_selection import PredefinedSplit\n",
+        "# We use a predefined split to ensure that no training samples are used in the validation step\n",
+        "# In PredefinedSplit, -1 keeps a sample out of every validation fold; 0 assigns it to validation fold 0\n",
+        "\n",
+        "train_indices = np.full((X_train.shape[0],), -1, dtype=int)\n",
+        "test_indices = np.full((X_valid.shape[0],), 0, dtype=int)\n",
+        "\n",
+        "ps = PredefinedSplit(np.append(train_indices, test_indices))\n",
+        "\n",
+        "...\n",
+        "\n",
+        "clf = LogisticRegression(...)\n",
+        "opt = GridSearchCV(clf, cv=ps, ...)\n",
+        "\n",
+        "# when fitting the grid search, pass both training and validation samples, concatenated in the same order as the indices above\n",
+        "\n",
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "X5JzFGLlMvk8"
+      },
+      "source": [
+        "Use the best found parameters for the next steps. `GridSearchCV` provides them in the `best_params_` attribute."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "JFqGPd65aM_l"
+      },
+      "source": [
+        "3. Create a new `LogisticRegression` instance with the best found parameters.\n",
+        "4. Fit it on the training set.\n",
+        "5. Calculate the accuracy on the validation set."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "d34138f1-d52a-4cb1-8b49-732f4d711d7a"
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.metrics import accuracy_score\n",
+        "\n",
+        "..."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "n9lT2RiDNPD2"
+      },
+      "source": [
+        "## 📢 **HAND-IN** 📢: Report in Moodle the accuracy you got in this task."
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python",
+      "version": "3.8.15"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
--- a/Mining/PreClassReading/L04
+++ b/Mining/PreClassReading/L04