diff --git a/Notes/Semester 4/MLDM - Machine Learning and Data Mining/Labs/L04_Polynomial_and_Logistic_Regression_LAB_ASSIGNMENT.ipynb b/Notes/Semester 4/MLDM - Machine Learning and Data Mining/Labs/L04_Polynomial_and_Logistic_Regression_LAB_ASSIGNMENT.ipynb new file mode 100644 index 0000000..1926799 --- /dev/null +++ b/Notes/Semester 4/MLDM - Machine Learning and Data Mining/Labs/L04_Polynomial_and_Logistic_Regression_LAB_ASSIGNMENT.ipynb @@ -0,0 +1,592 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pAJRKdv9QA4C" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from matplotlib import pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "D1m4qcrpXdTt" + }, + "outputs": [], + "source": [ + "RANDOM_SEED = 0x0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ia9s_Q-KXf0T" + }, + "source": [ + "# TASK 1: Polynomial Regression (5 Points):" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VV0Z3OdeXpha" + }, + "source": [ + "Let's create and explore the data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nb5WsezldFla" + }, + "outputs": [], + "source": [ + "# set the random seed to RANDOM_SEED so that everyone has the same data to work with\n", + "np.random.seed(seed=RANDOM_SEED)\n", + "# create the predictor variable, which follows a standard normal distribution, and reshape it for model training\n", + "x = np.random.normal(0, 1, 100).reshape(-1, 1)\n", + "# create target variable\n", + "y = 3*x**3 + 2*x**2 + x + np.random.normal(0, 10, 100).reshape(-1, 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E65IxT1Bwpmk" + }, + "source": [ + "Visualise the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nCZgTYP3fZe7" + }, + "outputs": [], + "source": [ + "plt.scatter(x, y)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cx8aSpUnJCI7" + }, + "source": [ + "## Task 1a\n", + "Apply Linear Regression on the data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nvRxguOTJnjS" + }, + "source": [ + "1. Split the data into a train and a test set (80/20), set `random_state` to `RANDOM_SEED`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ti7myWk7KS8Z" + }, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_train, X_test, y_train, y_test = ..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RFSaakYLJuK7" + }, + "source": [ + "2. Apply Linear Regression on the data and predict `y` values for the training as well as the test data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ez6t4Q4P82Qo" + }, + "outputs": [], + "source": [ + "from sklearn.linear_model import LinearRegression\n", + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P1k6VBGk8yI6" + }, + "source": [ + "3. 
Calculate MSE for training as well as for test data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qJGjXK8aKD8q" + }, + "outputs": [], + "source": [ + "from sklearn.metrics import mean_squared_error\n", + "\n", + "...\n", + "\n", + "print(f\"MSE of training data: {mse_train}\")\n", + "print(f\"MSE of test data: {mse_test}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T0VOmhngKEQL" + }, + "source": [ + "4. Visualize the model's artefacts: Plot all the data as well as Linear Regression predictions for training and test data in a scatter plot. Don't forget a legend to differentiate the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "94LPydyRD4Nr" + }, + "outputs": [], + "source": [ + "def plot_artefacts(...):\n", + " plt.figure()\n", + " ...\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yLwMEWirLLBA" + }, + "outputs": [], + "source": [ + "plot_artefacts(...)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zwsmpB3oMJZf" + }, + "source": [ + "## Task 1b\n", + "Investigate how well polynomial regression with polynomial degrees = 2 can solve the task. In order to do so, follow these steps:\n", + "1. Transform the training and test data accordingly to describe polynomial distribution of degree=2\n", + "2. Train a Linear Regression model on polynomial data\n", + "3. Make predictions for training data\n", + "4. Make predictions for test data\n", + "5. 
Calculate MSE for training as well as test data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oxni0o041MYH" + }, + "outputs": [], + "source": [ + "from sklearn.preprocessing import PolynomialFeatures\n", + "\n", + "def poly_regression(...):\n", + " ...\n", + " return y_pred_train_poly, y_pred_test_poly, mse_train_poly, mse_test_poly" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mTiIynAqD4Nr" + }, + "outputs": [], + "source": [ + "...\n", + "print(f\"MSE of training data: {mse_train_poly}\")\n", + "print(f\"MSE of test data: {mse_test_poly}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nhbIv-toOFoV" + }, + "source": [ + "6. Did it perform better than Linear Regression? Visualize the results similar to **Task 1a) 4**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yFNrIwDuOUXo" + }, + "outputs": [], + "source": [ + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lR_v9mTWOVNj" + }, + "source": [ + "## Task 1c\n", + "Investigate the influence of polynomial degrees on the results. Consider degrees in `range(0, 11)`. Visualize the results similar to **Task 1a) 4** and plot MSE (on training as well as test data) as a function of the number of the polynomial degrees." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4YFTQqWZO_Jx" + }, + "outputs": [], + "source": [ + "mses_test_poly = []\n", + "mses_train_poly = []\n", + "\n", + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qY6QK6OhGBVp" + }, + "source": [ + "## 📢 **HAND-IN** 📢: Answer following questions in Moodle:\n", + "\n", + "What is the optimal value of the polynomial degrees? Do the values of MSE training and MSE test behave similarly? How do the models behave with polynomial degrees >= 8?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhOvMhs_V4cY" + }, + "source": [ + "# Task 2: Polynomial Data Transformation (1 Point)\n", + "\n", + "As we have seen in the lecture, Polynomial Regression is nothing other than a generalization of Linear Regression. Every Polynomial Regression can be expressed as a Multivariate Linear Regression. Only a transformation of the initial data has to be done.\n", + "\n", + " $h_\\theta(a) = \\theta_0a_0 + \\theta_1a_1 + \\theta_2a_2$, where\n", + " $ a_0 = v^0, a_1 = v^1, a_2 = v^2 $\n", + "\n", + "In Task 1 `sklearn.preprocessing.PolynomialFeatures` transformed the X data for us. But in order to understand what exactly is done to the data, in this task we transform an initial data array $v$ to\n", + "the form $(a_1...a_n)$ that can be used to build a Polynomial Regression model with polynomial degrees=2 by hand (without using `sklearn.preprocessing.PolynomialFeatures`). Please transform the array $v$ and insert your answer in Moodle." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zz4WDaq436-y" + }, + "source": [ + "\\begin{align}\n", + "v=\n", + "\\begin{bmatrix}\n", + "3 \\\\\n", + "2 \\\\\n", + "0 \\\\\n", + "\\end{bmatrix}\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DUDUkf1uJFWf" + }, + "source": [ + "## 📢 **HAND-IN** 📢: Write your answer in Moodle" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XO800wc-WhyJ" + }, + "source": [ + "# Task 3: Logistic Regression (4 Points)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "prescribed-lawyer" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from random import randrange\n", + "import seaborn as sns\n", + "sns.set()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_lLze_K1g0ZA" + }, + "source": [ + "## Task 3a. 
Data Exploration and Preprocessing\n", + "\n", + "We are using the Fashion MNIST Dataset from Zalando.\n", + "Firstly, we load and explore the dataset.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wR0ijS0VZ6dk" + }, + "outputs": [], + "source": [ + "from keras.datasets import fashion_mnist\n", + "(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()\n", + "print(X_train.shape)\n", + "print(y_train.shape)\n", + "print(X_train.dtype)\n", + "print(y_train.dtype)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1fd78190-5445-4c53-9b7f-4a0f7aeaf87c" + }, + "outputs": [], + "source": [ + "label_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',\n", + " 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_9FX76IifOik" + }, + "source": [ + "In the following task we will only use the training part of the dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fDgGTbTMHxwN" + }, + "source": [ + "#### Prepare data\n", + "1. assign the following datatypes to the arrays:\n", + " - X_train -> 'float32'\n", + " - y_train -> 'int64'\n", + "2. reshape X_train to a 2-dimensional array.\n", + "Note:\n", + " - it should have the same number of samples/rows.\n", + "3. split the training data into (X_train, y_train) and (X_valid, y_valid), set the size of the validation dataset to 20% of the training data and set random state = 42." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d3b1f9ef-da3e-445e-b8b0-8adfa6871f14" + }, + "outputs": [], + "source": [ + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "26YTJ4uQYE46" + }, + "source": [ + "#### Visualize some data\n", + "Plot 25 images (hint: use ``imshow`` and ``subplots`` from the matplotlib library), plot the label as title (e.g. Trouser)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1ee386c1-8502-4422-8a1c-8c852e67ac40" + }, + "outputs": [], + "source": [ + "plt.figure(figsize=(10,10))\n", + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b61c7d24-8e54-4827-a15f-b0ae553d5743" + }, + "source": [ + "#### Normalize the Images\n", + "With mean and standard deviation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cdf0d1e7-0466-4a49-a349-6b27a9aa3662" + }, + "outputs": [], + "source": [ + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1ca1ada5-1203-4e44-a5aa-38b512d522c6" + }, + "source": [ + "## Task 3b. Logistic Regression\n", + "1. Fit the `LogisticRegression` from `scikit-learn`. Set the `random_state` for reproducibility.\n", + "2. Try different parameters (either by hand or by using `GridSearchCV`)\n", + "\n", + "\n", + "**Accuracy should be >= 0.84**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5Sk55fOkLhgm" + }, + "source": [ + "Please, check the documentation on:\n", + "GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n", + "\n", + "PredefinedSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html\n", + "\n", + "You can ignore a warning \"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\" as long as GridSearchCV continues with the next hyperparameter and you reach the necessary accuracy." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "U5TiVcHmLtAy" + }, + "outputs": [], + "source": [ + "from sklearn.model_selection import GridSearchCV\n", + "from sklearn.model_selection import PredefinedSplit\n", + "# We use a predefined split to ensure that no training samples are used in the validation step\n", + "\n", + "\n", + "train_indices = np.full((X_train.shape[0],), -1, dtype=int)\n", + "test_indices = np.full((X_valid.shape[0],), 0, dtype=int)\n", + "\n", + "ps = PredefinedSplit(np.append(train_indices, test_indices))\n", + "\n", + "...\n", + "\n", + "clf = LogisticRegression(...)\n", + "opt = GridSearchCV(clf, cv=ps, ...)\n", + "\n", + "# when we fit the model, we should use both training and validation samples\n", + "\n", + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X5JzFGLlMvk8" + }, + "source": [ + "Use the best found parameters for the next steps. `GridSearchCV` provides them in the `best_params_` attribute." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JFqGPd65aM_l" + }, + "source": [ + "3. Create a new `LogisticRegression` instance with the best found parameters.\n", + "4. Fit it on the training set.\n", + "5. Calculate the accuracy on the validation set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d34138f1-d52a-4cb1-8b49-732f4d711d7a" + }, + "outputs": [], + "source": [ + "from sklearn.metrics import accuracy_score\n", + "\n", + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n9lT2RiDNPD2" + }, + "source": [ + "## 📢 **HAND-IN** 📢: Report in Moodle the accuracy you got in this task." 
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/Notes/Semester 4/MLDM - Machine Learning and Data Mining/PreClassReading/L04 Pre-Class Reading.pdf b/Notes/Semester 4/MLDM - Machine Learning and Data Mining/PreClassReading/L04 Pre-Class Reading.pdf new file mode 100644 index 0000000..39d27f6 Binary files /dev/null and b/Notes/Semester 4/MLDM - Machine Learning and Data Mining/PreClassReading/L04 Pre-Class Reading.pdf differ