Add MLDM Lab04 course material

Manuel Thalmann 2023-06-17 16:22:28 +02:00
parent d1264e158c
commit 50ec2fe9ca
2 changed files with 592 additions and 0 deletions

@ -0,0 +1,592 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pAJRKdv9QA4C"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D1m4qcrpXdTt"
},
"outputs": [],
"source": [
"RANDOM_SEED = 0x0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ia9s_Q-KXf0T"
},
"source": [
"# Task 1: Polynomial Regression (5 Points)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VV0Z3OdeXpha"
},
"source": [
"Let's create and explore the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nb5WsezldFla"
},
"outputs": [],
"source": [
"# set the random seed to RANDOM_SEED so that everyone has the same data to work with\n",
"np.random.seed(seed=RANDOM_SEED)\n",
"# create the predictor variable from a standard normal distribution and reshape it into a column vector for model training\n",
"x = np.random.normal(0, 1, 100).reshape(-1, 1)\n",
"# create target variable\n",
"y = 3*x**3 + 2*x**2 + x + np.random.normal(0, 10, 100).reshape(-1, 1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E65IxT1Bwpmk"
},
"source": [
"Visualise the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nCZgTYP3fZe7"
},
"outputs": [],
"source": [
"plt.scatter(x, y)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Cx8aSpUnJCI7"
},
"source": [
"## Task 1a\n",
"Apply Linear Regression on the data\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nvRxguOTJnjS"
},
"source": [
"1. Split the data into a training and a test set (80/20) and set `random_state` to `RANDOM_SEED`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ti7myWk7KS8Z"
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = ..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RFSaakYLJuK7"
},
"source": [
"2. Apply Linear Regression to the data and predict the `y` values for the training as well as the test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ez6t4Q4P82Qo"
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P1k6VBGk8yI6"
},
"source": [
"3. Calculate MSE for training as well as for test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qJGjXK8aKD8q"
},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"\n",
"...\n",
"\n",
"print(f\"MSE of training data: {mse_train}\")\n",
"print(f\"MSE of test data: {mse_test}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T0VOmhngKEQL"
},
"source": [
"4. Visualize the model's artefacts: Plot all the data as well as Linear Regression predictions for training and test data in a scatter plot. Don't forget a legend to differentiate the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "94LPydyRD4Nr"
},
"outputs": [],
"source": [
"def plot_artefacts(...):\n",
" plt.figure()\n",
" ...\n",
" plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yLwMEWirLLBA"
},
"outputs": [],
"source": [
"plot_artefacts(...)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zwsmpB3oMJZf"
},
"source": [
"## Task 1b\n",
"Investigate how well polynomial regression of degree 2 can solve the task. To do so, follow these steps:\n",
"1. Transform the training and test data into polynomial features of degree 2\n",
"2. Train a Linear Regression model on polynomial data\n",
"3. Make predictions for training data\n",
"4. Make predictions for test data\n",
"5. Calculate MSE for training as well as test data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oxni0o041MYH"
},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"def poly_regression(...):\n",
" ...\n",
" return y_pred_train_poly, y_pred_test_poly, mse_train_poly, mse_test_poly"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mTiIynAqD4Nr"
},
"outputs": [],
"source": [
"...\n",
"print(f\"MSE of training data: {mse_train_poly}\")\n",
"print(f\"MSE of test data: {mse_test_poly}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nhbIv-toOFoV"
},
"source": [
"6. Did it perform better than Linear Regression? Visualize the results similar to **Task 1a) 4**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yFNrIwDuOUXo"
},
"outputs": [],
"source": [
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lR_v9mTWOVNj"
},
"source": [
"## Task 1c\n",
"Investigate the influence of the polynomial degree on the results. Consider degrees in `range(0, 11)`. Visualize the results as in **Task 1a) 4** and plot the MSE (on the training as well as the test data) as a function of the polynomial degree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4YFTQqWZO_Jx"
},
"outputs": [],
"source": [
"mses_test_poly = []\n",
"mses_train_poly = []\n",
"\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qY6QK6OhGBVp"
},
"source": [
"## 📢 **HAND-IN** 📢: Answer the following questions in Moodle:\n",
"\n",
"What is the optimal polynomial degree? Do the training MSE and the test MSE behave similarly? How do the models behave with polynomial degrees >= 8?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lhOvMhs_V4cY"
},
"source": [
"# Task 2: Polynomial Data Transformation (1 Point)\n",
"\n",
"As we have seen in the lecture, Polynomial Regression is a generalization of Linear Regression. Every Polynomial Regression can be expressed as a Multivariate Linear Regression; only the initial data has to be transformed.\n",
"\n",
" $h_\\theta(a) = \\theta_0a_0 + \\theta_1a_1 + \\theta_2a_2$, where\n",
" $a_0 = v^0, a_1 = v^1, a_2 = v^2$\n",
"\n",
"In Task 1, `sklearn.preprocessing.PolynomialFeatures` transformed the X data for us. To understand exactly what it does to the data, in this task we transform an initial data array $v$ by hand (without using `sklearn.preprocessing.PolynomialFeatures`) into\n",
"the form $(a_1...a_n)$ that can be used to build a Polynomial Regression model of degree 2. Please transform the array $v$ and insert your answer in Moodle."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Zz4WDaq436-y"
},
"source": [
"\\begin{align}\n",
"v=\n",
"\\begin{bmatrix}\n",
"3 \\\\\n",
"2 \\\\\n",
"0 \\\\\n",
"\\end{bmatrix}\n",
"\\end{align}"
]
},
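{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration only, consider a hypothetical vector $u$ (it is not part of the task). Assuming the same column convention as above, $a_0 = u^0, a_1 = u^1, a_2 = u^2$, a degree-2 transformation raises each entry to the powers 0, 1 and 2, so every sample becomes one row of features:\n",
"\n",
"\\begin{align}\n",
"u=\n",
"\\begin{bmatrix}\n",
"1 \\\\\n",
"4 \\\\\n",
"\\end{bmatrix}\n",
"\\quad\\Rightarrow\\quad\n",
"\\begin{bmatrix}\n",
"u^0 & u^1 & u^2\n",
"\\end{bmatrix}\n",
"=\n",
"\\begin{bmatrix}\n",
"1 & 1 & 1 \\\\\n",
"1 & 4 & 16 \\\\\n",
"\\end{bmatrix}\n",
"\\end{align}\n",
"\n",
"The constant column $a_0$ corresponds to the intercept $\\theta_0$; whether it is listed explicitly depends on the convention the task asks for."
]
},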
{
"cell_type": "markdown",
"metadata": {
"id": "DUDUkf1uJFWf"
},
"source": [
"## 📢 **HAND-IN** 📢: Write your answer in Moodle"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XO800wc-WhyJ"
},
"source": [
"# Task 3: Logistic Regression (4 Points)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "prescribed-lawyer"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from random import randrange\n",
"import seaborn as sns\n",
"sns.set()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_lLze_K1g0ZA"
},
"source": [
"## Task 3a. Data Exploration and Preprocessing\n",
"\n",
"We are using the Fashion MNIST Dataset from Zalando.\n",
"Firstly, we load and explore the dataset.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wR0ijS0VZ6dk"
},
"outputs": [],
"source": [
"from keras.datasets import fashion_mnist\n",
"(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()\n",
"print(X_train.shape)\n",
"print(y_train.shape)\n",
"print(X_train.dtype)\n",
"print(y_train.dtype)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1fd78190-5445-4c53-9b7f-4a0f7aeaf87c"
},
"outputs": [],
"source": [
"label_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',\n",
" 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_9FX76IifOik"
},
"source": [
"In the following task, we will only use the training part of the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fDgGTbTMHxwN"
},
"source": [
"#### Prepare data\n",
"1. Assign the following data types to the arrays:\n",
" - X_train -> 'float32'\n",
" - y_train -> 'int64'\n",
"2. Reshape X_train to a 2-dimensional array.\n",
"Note:\n",
" - It should keep the same number of samples/rows.\n",
"3. Split the training data into (X_train, y_train) and (X_valid, y_valid). Set the size of the validation set to 20% of the training data and set the random state to 42."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d3b1f9ef-da3e-445e-b8b0-8adfa6871f14"
},
"outputs": [],
"source": [
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "26YTJ4uQYE46"
},
"source": [
"#### Visualize some data\n",
"Plot 25 images (hint: use ``imshow`` and ``subplots`` from the matplotlib library) and use each image's label as its title (e.g. Trouser)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1ee386c1-8502-4422-8a1c-8c852e67ac40"
},
"outputs": [],
"source": [
"plt.figure(figsize=(10,10))\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b61c7d24-8e54-4827-a15f-b0ae553d5743"
},
"source": [
"#### Normalize the Images\n",
"Normalize the images using the mean and standard deviation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cdf0d1e7-0466-4a49-a349-6b27a9aa3662"
},
"outputs": [],
"source": [
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1ca1ada5-1203-4e44-a5aa-38b512d522c6"
},
"source": [
"## Task 3b. Logistic Regression\n",
"1. Fit the `LogisticRegression` from `scikit-learn`. Set the `random_state` for reproducibility.\n",
"2. Try different parameters (either by hand or by using `GridSearchCV`)\n",
"\n",
"\n",
"**Accuracy should be >= 0.84**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5Sk55fOkLhgm"
},
"source": [
"Please check the documentation for:\n",
"\n",
"GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n",
"\n",
"PredefinedSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html\n",
"\n",
"You can ignore the warning \"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\" as long as GridSearchCV continues with the next hyperparameter combination and you reach the required accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "U5TiVcHmLtAy"
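},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.model_selection import PredefinedSplit\n",
"# We use a predefined split to ensure that no training samples are used in the validation step\n",
"# (-1 keeps a sample in the training fold, 0 assigns it to the single validation fold)\n",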
"\n",
"\n",
"train_indices = np.full((X_train.shape[0],), -1, dtype=int)\n",
"test_indices = np.full((X_valid.shape[0],), 0, dtype=int)\n",
"\n",
"ps = PredefinedSplit(np.append(train_indices, test_indices))\n",
"\n",
"...\n",
"\n",
"clf = LogisticRegression(...)\n",
"opt = GridSearchCV(clf, cv=ps, ...)\n",
"\n",
"# when we fit the model, we should use both training and validation samples\n",
"\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X5JzFGLlMvk8"
},
"source": [
"Use the best found parameters for the next steps. `GridSearchCV` provides them in the `best_params_` attribute."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JFqGPd65aM_l"
},
"source": [
"3. Create a new `LogisticRegression` instance with the best found parameters.\n",
"4. Fit it on the training set.\n",
"5. Calculate the accuracy on the validation set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d34138f1-d52a-4cb1-8b49-732f4d711d7a"
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n9lT2RiDNPD2"
},
"source": [
"## 📢 **HAND-IN** 📢: Report in Moodle the accuracy you got in this task."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.8.15"
}
},
"nbformat": 4,
"nbformat_minor": 0
}