{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "8bPV9aEwTKC8"
},
"outputs": [],
"source": [
"import numpy as np\n",
"from matplotlib import pyplot as plt\n",
"import sklearn\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "jFHJbjkfeepf"
},
"outputs": [],
"source": [
"RANDOM_SEED = 0x0"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ykbI8UnR6PsU"
},
"source": [
"# TASK 1 (2 Points): \n",
"\n",
"We work with the \"Wine Recognition\" dataset. You can read more about this dataset at [https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset).\n",
"\n",
"The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators.\n",
"The data is loaded below and split into `data` and `target`. `data` is a `Dataframe` that contains the result of the chemical analysis while `target` contains an integer representing the wine cultivator."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "em6VCOuE6MRU"
},
"outputs": [],
"source": [
"from sklearn.datasets import load_wine\n",
"(data, target) = load_wine(return_X_y=True, as_frame=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "HJoAuMNR6MgM"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>alcohol</th>\n",
" <th>malic_acid</th>\n",
" <th>ash</th>\n",
" <th>alcalinity_of_ash</th>\n",
" <th>magnesium</th>\n",
" <th>total_phenols</th>\n",
" <th>flavanoids</th>\n",
" <th>nonflavanoid_phenols</th>\n",
" <th>proanthocyanins</th>\n",
" <th>color_intensity</th>\n",
" <th>hue</th>\n",
" <th>od280/od315_of_diluted_wines</th>\n",
" <th>proline</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>14.23</td>\n",
" <td>1.71</td>\n",
" <td>2.43</td>\n",
" <td>15.6</td>\n",
" <td>127.0</td>\n",
" <td>2.80</td>\n",
" <td>3.06</td>\n",
" <td>0.28</td>\n",
" <td>2.29</td>\n",
" <td>5.64</td>\n",
" <td>1.04</td>\n",
" <td>3.92</td>\n",
" <td>1065.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>13.20</td>\n",
" <td>1.78</td>\n",
" <td>2.14</td>\n",
" <td>11.2</td>\n",
" <td>100.0</td>\n",
" <td>2.65</td>\n",
" <td>2.76</td>\n",
" <td>0.26</td>\n",
" <td>1.28</td>\n",
" <td>4.38</td>\n",
" <td>1.05</td>\n",
" <td>3.40</td>\n",
" <td>1050.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>13.16</td>\n",
" <td>2.36</td>\n",
" <td>2.67</td>\n",
" <td>18.6</td>\n",
" <td>101.0</td>\n",
" <td>2.80</td>\n",
" <td>3.24</td>\n",
" <td>0.30</td>\n",
" <td>2.81</td>\n",
" <td>5.68</td>\n",
" <td>1.03</td>\n",
" <td>3.17</td>\n",
" <td>1185.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>14.37</td>\n",
" <td>1.95</td>\n",
" <td>2.50</td>\n",
" <td>16.8</td>\n",
" <td>113.0</td>\n",
" <td>3.85</td>\n",
" <td>3.49</td>\n",
" <td>0.24</td>\n",
" <td>2.18</td>\n",
" <td>7.80</td>\n",
" <td>0.86</td>\n",
" <td>3.45</td>\n",
" <td>1480.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13.24</td>\n",
" <td>2.59</td>\n",
" <td>2.87</td>\n",
" <td>21.0</td>\n",
" <td>118.0</td>\n",
" <td>2.80</td>\n",
" <td>2.69</td>\n",
" <td>0.39</td>\n",
" <td>1.82</td>\n",
" <td>4.32</td>\n",
" <td>1.04</td>\n",
" <td>2.93</td>\n",
" <td>735.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>173</th>\n",
" <td>13.71</td>\n",
" <td>5.65</td>\n",
" <td>2.45</td>\n",
" <td>20.5</td>\n",
" <td>95.0</td>\n",
" <td>1.68</td>\n",
" <td>0.61</td>\n",
" <td>0.52</td>\n",
" <td>1.06</td>\n",
" <td>7.70</td>\n",
" <td>0.64</td>\n",
" <td>1.74</td>\n",
" <td>740.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>174</th>\n",
" <td>13.40</td>\n",
" <td>3.91</td>\n",
" <td>2.48</td>\n",
" <td>23.0</td>\n",
" <td>102.0</td>\n",
" <td>1.80</td>\n",
" <td>0.75</td>\n",
" <td>0.43</td>\n",
" <td>1.41</td>\n",
" <td>7.30</td>\n",
" <td>0.70</td>\n",
" <td>1.56</td>\n",
" <td>750.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>175</th>\n",
" <td>13.27</td>\n",
" <td>4.28</td>\n",
" <td>2.26</td>\n",
" <td>20.0</td>\n",
" <td>120.0</td>\n",
" <td>1.59</td>\n",
" <td>0.69</td>\n",
" <td>0.43</td>\n",
" <td>1.35</td>\n",
" <td>10.20</td>\n",
" <td>0.59</td>\n",
" <td>1.56</td>\n",
" <td>835.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>176</th>\n",
" <td>13.17</td>\n",
" <td>2.59</td>\n",
" <td>2.37</td>\n",
" <td>20.0</td>\n",
" <td>120.0</td>\n",
" <td>1.65</td>\n",
" <td>0.68</td>\n",
" <td>0.53</td>\n",
" <td>1.46</td>\n",
" <td>9.30</td>\n",
" <td>0.60</td>\n",
" <td>1.62</td>\n",
" <td>840.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>177</th>\n",
" <td>14.13</td>\n",
" <td>4.10</td>\n",
" <td>2.74</td>\n",
" <td>24.5</td>\n",
" <td>96.0</td>\n",
" <td>2.05</td>\n",
" <td>0.76</td>\n",
" <td>0.56</td>\n",
" <td>1.35</td>\n",
" <td>9.20</td>\n",
" <td>0.61</td>\n",
" <td>1.60</td>\n",
" <td>560.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>178 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \\\n",
"0 14.23 1.71 2.43 15.6 127.0 2.80 \n",
"1 13.20 1.78 2.14 11.2 100.0 2.65 \n",
"2 13.16 2.36 2.67 18.6 101.0 2.80 \n",
"3 14.37 1.95 2.50 16.8 113.0 3.85 \n",
"4 13.24 2.59 2.87 21.0 118.0 2.80 \n",
".. ... ... ... ... ... ... \n",
"173 13.71 5.65 2.45 20.5 95.0 1.68 \n",
"174 13.40 3.91 2.48 23.0 102.0 1.80 \n",
"175 13.27 4.28 2.26 20.0 120.0 1.59 \n",
"176 13.17 2.59 2.37 20.0 120.0 1.65 \n",
"177 14.13 4.10 2.74 24.5 96.0 2.05 \n",
"\n",
" flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \\\n",
"0 3.06 0.28 2.29 5.64 1.04 \n",
"1 2.76 0.26 1.28 4.38 1.05 \n",
"2 3.24 0.30 2.81 5.68 1.03 \n",
"3 3.49 0.24 2.18 7.80 0.86 \n",
"4 2.69 0.39 1.82 4.32 1.04 \n",
".. ... ... ... ... ... \n",
"173 0.61 0.52 1.06 7.70 0.64 \n",
"174 0.75 0.43 1.41 7.30 0.70 \n",
"175 0.69 0.43 1.35 10.20 0.59 \n",
"176 0.68 0.53 1.46 9.30 0.60 \n",
"177 0.76 0.56 1.35 9.20 0.61 \n",
"\n",
" od280/od315_of_diluted_wines proline \n",
"0 3.92 1065.0 \n",
"1 3.40 1050.0 \n",
"2 3.17 1185.0 \n",
"3 3.45 1480.0 \n",
"4 2.93 735.0 \n",
".. ... ... \n",
"173 1.74 740.0 \n",
"174 1.56 750.0 \n",
"175 1.56 835.0 \n",
"176 1.62 840.0 \n",
"177 1.60 560.0 \n",
"\n",
"[178 rows x 13 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "xrsPKm3w6Mi-"
},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
" ..\n",
"173 2\n",
"174 2\n",
"175 2\n",
"176 2\n",
"177 2\n",
"Name: target, Length: 178, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "B3W5r6Se8kXW"
},
"source": [
"Next, the data is split into training data and testing data.\n",
"The training data is used to train the model while the testing data is used to evaluate the model on different data than it was trained for. You will learn later in the course why this is necessary."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "m1w8dDgw6MoO"
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "J_eeYvZc-f_n"
},
"source": [
"\n",
"In the following, we define functions to classify the data. We use a [Decision Tree Classifier](https://scikit-learn.org/stable/modules/tree.html#tree) and a [Support Vector Classifier](https://scikit-learn.org/stable/modules/svm.html#svm-classification). You will learn later in the course how these classifiers work."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "pvm_zBOe-e_X"
},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.svm import SVC\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"def run_classifier(clf, X_train, y_train, X_test, y_test):\n",
" clf.fit(X_train, y_train)\n",
" y_test_predicted = clf.predict(X_test)\n",
" return accuracy_score(y_test, y_test_predicted)\n",
"\n",
"\n",
"def run_decision_tree(X_train, y_train, X_test, y_test):\n",
" clf = DecisionTreeClassifier(random_state=0)\n",
" accuracy = run_classifier(clf, X_train, y_train, X_test, y_test)\n",
" print(\"The accuracy of the Decision Tree classifier is\", accuracy)\n",
"\n",
"def run_svc(X_train, y_train, X_test, y_test):\n",
" clf = SVC(random_state=0)\n",
" accuracy = run_classifier(clf, X_train, y_train, X_test, y_test)\n",
" print(\"The accuracy of the Support Vector classifier is\", accuracy)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "s1MS2D8LAMpD"
},
"source": [
"### Task 1a: Classify the data\n",
"\n",
"Classify the data by calling the two functions `run_decision_tree` and `run_svc`.\n",
"Which classifier works better (i.e. achieves the higher accuracy)?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "5ToW8fx4ANZ8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The accuracy of the Decision Tree classifier is 0.9661016949152542\n",
"The accuracy of the Support Vector classifier is 0.711864406779661\n"
]
}
],
"source": [
"run_decision_tree(X_train, y_train, X_test, y_test)\n",
"run_svc(X_train, y_train, X_test, y_test)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"_The Decision Tree Classifier seems to be working better._"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "BbM8OUZFBRGH"
},
"source": [
"### Task 1b: Normalize the data with mean and standard deviation\n",
"\n",
"Normalize the training and testing data using the following formula:\n",
"\n",
"$$X_{normalized} = \\frac{X-\\mu_X}{\\sigma_X}$$\n",
"\n",
"Calculate the mean and standard deviation __on the training data__ only (also when you normalize the testing dataset).\n",
"\n",
"`Pandas` provides built-in functions to calculate the average and the standard deviation. For example, `X_train.mean()` returns the average value per feature in the training dataset while `X_train.std()` returns the standard deviation per feature."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "K0qkP9TqBRft"
},
"outputs": [],
"source": [
"def std_norm(df):\n",
" # Normalize with the mean and standard deviation of the *training* data,\n",
" # no matter which split is being transformed\n",
" return (df - X_train.mean()) / X_train.std()\n",
"\n",
"X_train_std_norm = std_norm(X_train)\n",
"X_test_std_norm = std_norm(X_test)"
]
},
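{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note: scikit-learn ships an equivalent transformer, `sklearn.preprocessing.StandardScaler`. The cell below is a minimal sketch of that alternative, not part of the required solution. One caveat: `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `X_train.std()` uses the sample standard deviation (`ddof=1`), so the two normalizations differ by a small constant factor per feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Fit the scaler on the training data only, then apply it to both splits\n",
"scaler = StandardScaler().fit(X_train)\n",
"X_train_scaled = scaler.transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)"
]
},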
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "_fNuBgC6BSFt"
},
"source": [
"Call the two classification functions again with the normalized data and report the changes in accuracy. What do you notice?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "TFg6WbmgBShk"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The accuracy of the Support Vector classifier is 0.9830508474576272\n",
"The accuracy of the Decision Tree classifier is 0.9661016949152542\n"
]
}
],
"source": [
"run_svc(X_train_std_norm, y_train, X_test_std_norm, y_test)\n",
"run_decision_tree(X_train_std_norm, y_train, X_test_std_norm, y_test)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"_Now the Support Vector Classifier is more accurate - however, both have a very high accuracy._"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "1_1EVF-TBS7v"
},
"source": [
"### Task 1c: Repeat Task 1b with min-max Normalization\n",
"\n",
"Repeat the task 1b but use the following formula to normalize tha data:\n",
"\n",
"$$X_{normalized} = \\frac{X-X_{min}}{X_{max} - X_{min}}$$\n",
"\n",
"Again, calculate the maximum and minimum __on the training data__ only (also when you normalize the testing dataset) and use the built-in function `X_train.min()` resp. `X_train.max()`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "i25XenppJ7gf"
},
"outputs": [],
"source": [
"def min_max_norm(df):\n",
" # Use the minimum and maximum of the *training* data for both splits\n",
" return (df - X_train.min()) / (X_train.max() - X_train.min())\n",
"\n",
"X_train_min_max_norm = min_max_norm(X_train)\n",
"X_test_min_max_norm = min_max_norm(X_test)"
]
},
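{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The scikit-learn counterpart of this formula is `sklearn.preprocessing.MinMaxScaler`. The cell below is again only a sketch of that alternative, not part of the required solution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"# Fit on the training data only so the test split reuses the training min/max\n",
"scaler = MinMaxScaler().fit(X_train)\n",
"X_train_mm = scaler.transform(X_train)\n",
"X_test_mm = scaler.transform(X_test)"
]
},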
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "NIy0ECbTJ7gq"
},
"source": [
"Call the two classification functions again with the normalized data and report the changes in accuracy. What do you notice?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "99uuR7ngJ7gr"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The accuracy of the Support Vector classifier is 0.9830508474576272\n",
"The accuracy of the Decision Tree classifier is 0.9661016949152542\n"
]
}
],
"source": [
"run_svc(X_train_min_max_norm, y_train, X_test_min_max_norm, y_test)\n",
"run_decision_tree(X_train_min_max_norm, y_train, X_test_min_max_norm, y_test)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"_The accuracy does not change._"
]
2023-06-08 12:55:11 +00:00
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "c_i1aBh6KnWw"
},
"source": [
"## 📢 **HAND-IN** 📢: Report on Moodle whether you solved this task."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "m7I1RBjQK7Ly"
},
"source": [
"---\n",
"# TASK 2 (2 Points): \n",
"\n",
"In Task 1 we clearly saw that normalization improves the result for Support Vector Classifiers but not for Decision Trees. You will learn later in the course why Decision Trees don't need normalization.\n",
"\n",
"However, to better understand the influence of normalization, we will plot the data with and without normalization.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "w9qp3e4nBTPK"
},
"outputs": [],
"source": [
"import seaborn as sns\n",
"sns.set_theme(style=\"ticks\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "tnF26SbCNCRS"
},
"source": [
"### Task 2a: Plot the unnormalized data\n",
"\n",
"For simplicity, we only consider only the columns `alcohol` and `malic_acid` from the training dataset.\n",
"\n",
"Create a [Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) from the corresponding training data mentioned below with the attribute `alcohol` on the `x`-axis and `malic_acid` on the `y`-axis.\n",
"\n",
"Plot the un-normalized data `X_train` as well as the two normalized versions from Exercise 1 in the same plot and describe what happens.\n",
"\n",
"__Hint:__ To visualize the data distributions in the same plot just call `sns.scatterplot` three times within the same code-cell. Add a 'label' argument to differentiate the data distributions."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "-lc07hbiOvYu"
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAG1CAYAAAAFuNXgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAACJhUlEQVR4nO3dd3iTZdsG8DNNk+4FHUyBQhfQBaWICMhQ9hTxe6UgCgKKoKgsFVQUQRGQoYyXoQIqIKDIEAFfBVSW7NGWUZDZvZsmaZvvj5qYNLNp0oyev+PgOCDPkzx3ngZycd/Xdd0ChUKhABEREZGTcLH1AIiIiIgsicENERERORUGN0RERORUGNwQERGRU2FwQ0RERE6FwQ0RERE5FQY3RERE5FQY3BAREZFTcbX1AGpbQkICZDIZgoKCbD0UIiIiMlFmZibEYjFOnTpl9Nw6F9xIpVKUl5fbehhERERUDWVlZTB1U4U6F9wEBwcDAA4dOmTjkRAREZGpevbsafK5DpNz8/3336Nfv36Ijo5G//79sW/fPlsPiYiIiOyQQwQ3P/zwA9566y2MHDkSe/bswYABA/Daa6/hzJkzth4aERER2Rm7D24UCgWWLl2K0aNHY+TIkXjooYfw4osv4pFHHsGJEydsPTwiIiKyM3afc5OWloa7d+9i4MCBGo+vW7fORiMiIiIie+YQwQ0AlJSUYOzYsbh8+TKaNGmCF198ET169ND5HENJR/fv30fDhg1NunZ5eTnkcnn1B03kwEQiEYRCoa2HQURkNrsPboqKigAAM2bMwMsvv4w33ngD+/fvx0svvYQNGzagU6dOFr+mQqHAgwcPkJeXZ/HXJnIE/v7+aNCgAQQCga2HQkRUbXYf3IhEIgDA2LFjMXToUABAVFQULl++rDe4MVTmbUopmTKwCQ4OhqenJ/+BpzpDoVCgpKQEGRkZAGDyLCcRkT2x++AmJCQEABAeHq7xeKtWrfDrr79a/Hrl5eWqwKZ+/foWf30ie+fh4QEAyMjIQHBwMJeoiMjh2H21VJs2beDl5YVz585pPJ6amoqHHnrI4tdT5th4enpa/LWJHIXy88+cMyJyRHY/c+Pu7o5x48bhs88+Q0hICGJiYrBnzx78/vvv+OKLL6x2XS5FUV3Gzz8ROTK7D24A4KWXXoKHhweWLFmC9PR0tGzZEsuXL0fHjh1tPTQiIiKLKiyRIb9IimKJHF4eIvh5u8HHU2zrYTkUhwhuAOC5557Dc889Z+thEBERWU1mngTLt57BmZRM1WPxEUGYPCIeQf4eNhyZY7H7nBsy36hRozBz5kydx2bOnIlRo0aZ/drHjx9HREQEFixYoPN4REQEduzYYfbrW4v6+75z5w4iIiJw/PjxWrkeEZEhhSUyrcAGAM6kZGL51jMoLJHZaGSOh8EN1ciXX36J06dP23oYZmnYsCGOHj2K+Ph4Ww+FiAj5RVKtwEbpTEom8ouktTwix8XgxsoKS2S4k1GIlFs5uJNR6HSRd+PGjTFr1iyUlpbaeijVJhQKERQUBLGYa9lEZHvFEsPVicaO078cJufGETnK2mlERATmzZuH3bt34/Tp0/D19cX//d//4eWXXzb63HfffReTJk3C4sWL8eabb+o978yZM1iyZAkuXboEV1dX9OjRA9OnT0dAQAAAoEePHujduzd+++03ZGdnY/ny5Vi+fDnatm2LzMxMHDp0CF5eXpg0aRLCw8Mxd+5c3Lx5E1FRUViwYAGaN28OADh16hSWLVuGixcvQiaToWnTppg4cSIGDx6sNaY7d+6gZ8+e+OqrrwAAo0eP1jn2jRs3IjExEenp6ViwYAGOHDkCoVCI+Ph4zJw5U3VthUKBlStX4ttvv0VBQQH69u0LqZT/0yIi03h5iGp0nP7FmRsrcbS1048++ghDhw7Fnj17kJSUhOXLl+PkyZNGn9e8eXNMnToVGzduxKlTp3Sec/78eYwaNQphYWHYunUrli5dinPnzmHs2LEoLy9Xnbdp0ya8/fbbWLt2LeLi4gBUBhZRUVHYtWsXevbsiQ8++ADvvvsu3nzzTWzatAkZGRlYtGgRACA9PR1jx45FdHQ0du7cie+//x4xMTF46623kJWVZfB9xMfH4+jRo6pfBw8eRIsWLfDII4+gXbt2KCkpUeXObNq0CRs3bkRAQABGjBiB9PR0AMCaNWuwdu1aTJ8+HTt27ICvry/27t1r9B4SEQGAn7cb4iOCdB6LjwiCn7dbLY/IcTG4sRJHWzsdMmQIBg8erJrp8PX1NTmXZvTo0YiLi8Obb74JiUSidXz9+vWIiIjA7Nmz0bJlSzz88MNYvHgxLl26hKNHj6rO69atGx555BFER0erloqioqIwduxYNG3aFElJSSgrK8OoUaPQsWNHREdHo2/fvkhNTQUASKVSTJ48GW+88QaaNWuGVq1aYfz48ZDL5bh586bB9yAWixEUFISgoCAEBgbio48+gkAgwNKlS+Hq6oo9e/agoKAACxcuRGRkJMLDwzFv3jx4e3tj69atUCgU2LhxI0aPHo0BAwYgNDQUs2bNQlRUlIk/ASKq63w8xZg8Il4rwImPCMKUEfHVKgd39pQIY7gsZSX2sHbq6uqKiooKnccqKirg6vrvj79ly5Yax318fFTdaasm3O7Zs0fjzy4uLpg/fz4GDx6MxYsX46233tI4npqais6dO2s8FhkZCR8fH6SkpKBbt24AgGbNmmmNU70LtXJbgKZNm6oec3d3V43zoYcewrBhw/DVV18hNTUVf//9N5KTkwFAY4bImIULF+LkyZPYtm0bfH19AQCXL19Gfn4+OnTooHGuVCrF9evXkZubi8zMTERHR2scj4uLw/Xr102+NhHVbUH+HpiWlFCjPjeOkhJhTQxurMQe1k59fX1RUFCg81h+fj78/PxUf9aVVKtQKAAA33//vcbjwcHBuH37tsZjyuWpBQsWoHfv3jpfR9frKzdGBSoDlarUjyu5uOiecLx27RqeeeYZtGnTBo888gieeOIJBAQE4KmnntJ5vi7btm3DV199hfXr12sEVhUVFWjRogVWrlyp9Rz1zVWrvlf1AJKIyBQ+nmKzm/YZS4mYlpRQJxoCclnKSuxh7bRNmzaqxFp1MpkM58+f15pl0KdZs2Yav/R9YT/77LNo3749Zs2apfF4REQE/vrrL43HkpOTUVRUpDVjVBPffvst6tevjw0bNuCFF15At27dVLk2+gIsdX/88Qfee+89vPvuu0hMTNQ4Fh4ejnv37sHHx0d1Hxo1aoRFixbh5MmTCAgIQMOGDbXe58WLFy32/oiIjHG0lAhrYXBjJZZcOzXX8OHDUVFRgZdffhlnzpzB3bt3ceLECbz00ktwdXXF8OHDLXo9gUCADz/8EJmZmn+xnnvuOaSkpOD999/H9evXcfz4cbzxxhto3bo1OnXqZLHrN2jQAA8ePMBvv/2Gu3fv4ueff8a7774LAFoBXlXXr1/HlClT8J///Ac9evRAZmam6ldxcTEGDRoEPz8/TJkyBefOncP169cxc+ZMH
D58GBEREQCAF154AZs3b8a2bduQlpaGTz/9FOfPn7fY+yMiMsYeUiLsAefMrcgSa6c1Ua9ePWzZsgVLly7F5MmTkZeXB39/fzz66KN4//33NZalLKVZs2Z47bXXMG/ePNVjsbGxWLt2LT799FMMGTIE3t7e6NWrF15//XWdy07mGj16NG7cuIHp06dDJpOhefPmeO2117Bs2TJcuHABXbt21fvcvXv3orCwEF999ZWqNFzp5ZdfxuTJk7Fp0yZ8/PHHqiqvNm3aYP369arZp5EjR6KiogIrV65EVlYWunTpguHDhyMtLc1i75GIyBB7SImwBwKFKfP1TqRnz54AgEOHDuk8XlpairS0NLRo0UJnDghRXcC/B0SOqbBEhoWbTulcmoqPCHLonBtj39/quCxFRETkJOwhJcIecFmKiIjIidg6JcIeMLghIiJyMjUpJ3cGXJYiIiIip8LghoiIiJwKgxsiIiJyKgxuiIiIyKkwuCEiIiKnwuCGiIiInAqDGye3a9cujBgxAnFxc
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for data in [\n",
" [X_train, \"Un-Normalized\"],\n",
" [X_train_std_norm, \"Standard Deviation\"],\n",
" [X_train_min_max_norm, \"Min/Max Normalized\"]]:\n",
" sns.scatterplot(data[0], x=\"alcohol\", y=\"malic_acid\", label=data[1])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "pJ5Ncd5cN-9z"
},
"source": [
"We will now have a closer look at the data. Calculate for the un-normalized training data as well as for the two normalized versions of the training data\n",
"\n",
"- The average value in the column `avg(alcohol)`\n",
"- The standard deviation in the column `std(alcohol)`\n",
"- The minimum value in the column `min(alcohol)`\n",
"- The maxmium value in the column `max(alcohol)`\n",
"- The range in the column by subtracting the minimum of the maximum in the column `max(alcohol) - min(alcohol)`\n",
"\n",
"Compare the properties of the un-normalized training data with the normalized training data. What do you notice?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "J3D06pyKQjGq"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Average</th>\n",
" <th>Standard Deviation</th>\n",
" <th>Minimum</th>\n",
" <th>Maximum</th>\n",
" <th>Range</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Un-Normalized</th>\n",
" <td>1.297101e+01</td>\n",
" <td>0.851975</td>\n",
" <td>11.030000</td>\n",
" <td>14.830000</td>\n",
" <td>3.800000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Standard Deviation</th>\n",
" <td>-1.160603e-15</td>\n",
" <td>1.000000</td>\n",
" <td>-2.278245</td>\n",
" <td>2.181979</td>\n",
" <td>4.460224</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Min/Max Normalized</th>\n",
" <td>5.107917e-01</td>\n",
" <td>0.224204</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Average Standard Deviation Minimum Maximum \\\n",
"Un-Normalized 1.297101e+01 0.851975 11.030000 14.830000 \n",
"Standard Deviation -1.160603e-15 1.000000 -2.278245 2.181979 \n",
"Min/Max Normalized 5.107917e-01 0.224204 0.000000 1.000000 \n",
"\n",
" Range \n",
"Un-Normalized 3.800000 \n",
"Standard Deviation 4.460224 \n",
"Min/Max Normalized 1.000000 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame(\n",
" [\n",
" [\n",
" selector(dataset[\"alcohol\"])\n",
" for selector in [\n",
" lambda column : column.mean(),\n",
" lambda column : column.std(),\n",
" lambda column : column.min(),\n",
" lambda column : column.max(),\n",
" lambda column : column.max() - column.min()]\n",
" ]\n",
" for dataset in [X_train, X_train_std_norm, X_train_min_max_norm]\n",
" ],\n",
" columns=[\"Average\", \"Standard Deviation\", \"Minimum\", \"Maximum\", \"Range\"],\n",
" index=[\"Un-Normalized\", \"Standard Deviation\", \"Min/Max Normalized\"])\n",
"\n",
"data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the _Standard Deviation Normalization_ makes the standard deviation equal $1$. The _Min/Max Normalization_ maps all values into the interval $[0, 1]$, so the range equals $1$."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "AH7H07ZcSniv"
},
"source": [
"## 📢 **HAND-IN** 📢: Report on Moodle whether you solved this task.\n",
"\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "UT3_BLJDl-0o"
},
"source": [
"# TASK 3 (6 Points): Binning\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "q7K4Cikz4aZE"
},
"source": [
"The following list consists of the age of several people: \n",
"```python\n",
"[13, 15, 16, 18, 19, 20, 20, 21, 22, 22, 25, 25, 26, 26, 30, 33, 34, 35, 35, 35, 36, 37, 40, 42, 46, 53, 70]\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "lsHmNGlW4aZE"
},
"source": [
"### Task 3a: Equal-Width Binning\n",
"Apply binning to the dataset using 3 equal-width bins. Smooth the data using the mean of the bins.\n",
"\n",
"Tips:\n",
"1. Calculate the size of the bins\n",
"2. Assign each value to the corresponding bin\n",
"3. Calculate the mean per bin\n",
"4. Replace each value by the mean of its bin\n",
"\n",
"__Solve this exercise by hand without using Python__"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "eukBUnVs4aZE"
},
"source": [
" 1. $\\frac{70 - 13}{3} = 19$ $\\rightarrow$ `[[13-32], [32-51], [51-70]]`\n",
" 2. `[[13, 15, 16, 18, 19, 20, 20, 21, 22, 22, 25, 25, 26, 26, 30], [33, 34, 35, 35, 35, 36, 37, 40, 42, 46], [53, 70]]`\n",
" 3. `[21.2, 37.3, 61.5]`\n",
" 4. `[21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 21.2, 37.3, 37.3, 37.3, 37.3, 37.3, 37.3, 37.3, 37.3, 37.3, 37.3, 61.5, 61.5]`"
]
},
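{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the task asks for a hand calculation, the result can be cross-checked with a short NumPy sketch (optional, not part of the hand-in); the `ages` array below is copied from the task statement."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"ages = np.array([13, 15, 16, 18, 19, 20, 20, 21, 22, 22, 25, 25, 26, 26, 30, 33, 34, 35, 35, 35, 36, 37, 40, 42, 46, 53, 70])\n",
"\n",
"# Three equal-width bins over [13, 70]: edges at 13, 32, 51, 70\n",
"edges = np.linspace(ages.min(), ages.max(), 4)\n",
"# digitize returns indices 1..4; shift to 0-based and clip so the maximum (70) lands in the last bin\n",
"bin_idx = np.clip(np.digitize(ages, edges) - 1, 0, 2)\n",
"bin_means = np.array([ages[bin_idx == i].mean() for i in range(3)])\n",
"print(bin_means)          # [21.2 37.3 61.5]\n",
"print(bin_means[bin_idx]) # each value replaced by the mean of its bin"
]
},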
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "8UL9OUG44aZF"
},
"source": [
"### Task 3b: Equal-Depth Binning\n",
"\n",
"Apply binning to the dataset using 3 equal-depth bins. Smooth the data using the mean of the bins. Explain the steps of your approach and give the final result.\n",
"\n",
"Tips:\n",
"1. Calculate the number of elements per bin\n",
"2. Assign each value to the corresponding bin\n",
"3. Calculate the mean per bin\n",
"4. Replace each value by the mean of its bin\n",
"\n",
"__Please solve this exercise by hand without using Python__ "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Vhf3wkSm4aZF"
},
"source": [
" 1. $\\frac{27}{3} = 9$ elements per bin\n",
" 2. `[[13, 15, 16, 18, 19, 20, 20, 21, 22], [22, 25, 25, 26, 26, 30, 33, 34, 35], [35, 35, 36, 37, 40, 42, 46, 53, 70]]`\n",
" 3. `[18.222, 28.444, 43.778]`\n",
" 4. `[18.2, 18.2, 18.2, 18.2, 18.2, 18.2, 18.2, 18.2, 18.2, 28.4, 28.4, 28.4, 28.4, 28.4, 28.4, 28.4, 28.4, 28.4, 43.8, 43.8, 43.8, 43.8, 43.8, 43.8, 43.8, 43.8, 43.8]`"
]
},
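{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The equal-depth variant can be cross-checked the same way (optional sketch, reusing the `ages` array from the cell above): sort the values and split them into three bins of nine elements each."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 27 sorted values split into 3 bins of 9 elements each\n",
"bins = np.array_split(np.sort(ages), 3)\n",
"print([b.mean() for b in bins]) # [18.222..., 28.444..., 43.777...]\n",
"# Replace each value by the (rounded) mean of its bin\n",
"print(np.concatenate([np.full(len(b), round(b.mean(), 1)) for b in bins]))"
]
},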
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ex21HuPTl_Qx"
},
"source": [
"## 📢 **HAND-IN** 📢: Describe on Moodle the results of Exercise 3: \n",
"\n",
"* Copy the results of Exercise 3a and 3b to Moodle\n",
"* Describe the differences between task 3a and task 3b\n",
"* Describe situations when binning should be used and give a concrete example. Are there also circumstances in which binning should not be applied?"
]
}
],
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"vscode": {
"interpreter": {
"hash": "558914ba93c675d10f9462c68c84c8e4fd6bd2548e0b0f568325ba1dc72cba58"
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}