{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploratory Data Analysis\n",
"\n",
"The goal of this notebook is to explore the data and understand it well enough to build the best possible penguin classifier.\n",
"\n",
"## Steps\n",
"\n",
"- 1 Understanding the dataset\n",
"- 2 Load the dataset\n",
"- 3 Inspect the dataset\n",
"- 4 Data type conversion\n",
"- 5 Identify missing values\n",
"- 6 Detect and handle outliers\n",
"- 7 Analyze relationships between variables\n",
"- 8 Explore categorical variables\n",
"- 9 Impute missing values\n",
"- 10 Normalize and scale data\n",
"- 11 Pandas Profiling\n",
"- 12 Clustering, dimension reduction with PCA\n",
"- 13 Model selection and evaluation\n",
"- 14 Report and communicate findings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Understanding the dataset\n",
"\n",
"Start by getting familiar with the dataset: understand the context, the purpose of the data, and the problem we are trying to solve. Look for documentation or metadata that describes the dataset, its variables, and its source.\n",
"\n",
"### Columns description\n",
"\n",
"- studyName: The name of the study that collected the data. (Categorical)\n",
"- Sample Number: A running number assigned to each sample. It is not unique across the whole dataset (152 distinct values for 344 rows). (Categorical)\n",
"- Species: The species of the penguin (e.g., Adelie Penguin). (Categorical)\n",
"- Region: The region where the penguin was observed. (Categorical)\n",
"- Island: The island where the penguin was observed. (Categorical)\n",
"- Stage: The stage of the penguin's life cycle at the time of observation (e.g., Adult, 1 Egg Stage). (Categorical)\n",
"- Individual ID: A unique identifier for each individual penguin. (Categorical)\n",
"- Clutch Completion: Whether the penguin has completed its clutch (a set of eggs laid by a bird). (Categorical - Binary)\n",
"- Date Egg: The date the egg was observed. (Datetime)\n",
"- Culmen Length (mm): The length of the penguin's culmen (the upper ridge of a bird's beak) in millimeters. (Numerical - Continuous)\n",
"- Culmen Depth (mm): The depth of the penguin's culmen in millimeters. (Numerical - Continuous)\n",
"- Flipper Length (mm): The length of the penguin's flipper (wing) in millimeters. (Numerical - Discrete)\n",
"- Body Mass (g): The body mass of the penguin in grams. (Numerical - Discrete)\n",
"- Sex: The sex of the penguin (Male, Female). (Categorical - Binary)\n",
"- Delta 15 N (o/oo): The ratio of stable isotopes of nitrogen (15N/14N) in the penguin's blood, indicating the penguin's trophic level. (Numerical - Continuous)\n",
"- Delta 13 C (o/oo): The ratio of stable isotopes of carbon (13C/12C) in the penguin's blood, indicating the penguin's foraging habitat. (Numerical - Continuous)\n",
"- Comments: Additional comments related to the penguin or the sample. (Text)\n",
"\n",
"### First impressions\n",
"\n",
"The dataset contains several variables, such as:\n",
"\n",
"- Species: The species of the penguin (e.g., Adelie, Gentoo, Chinstrap).\n",
"- Island: The island where the penguin was observed.\n",
"- Culmen Length and Culmen Depth: Measurements of the penguin's beak, which can be useful for differentiating between species and understanding feeding habits.\n",
"- Flipper Length and Body Mass: Measurements of the penguin's body size, which can be indicators of their overall health, age, and ability to swim or dive.\n",
"- Sex: The sex of the penguin.\n",
"- Delta 15 N and Delta 13 C: Isotopic ratios that can provide information about the penguin's diet, foraging ecology, and trophic position within the food web.\n",
"\n",
"I can already see that some data are missing, and I will need to deal with them later.\n",
"\n",
"The \"Species\" column contains both the common name and the scientific name (in parentheses). Keeping both doesn't add any value to our analysis.\n",
"\n",
"The \"Sample Number\" is a sample identifier rather than a measurement, so it will not be useful for our analysis.\n",
"\n",
"The \"Individual ID\" may be useful to identify the penguins, but I will need to check whether it is unique.\n",
"\n",
"The \"Stage\" column contains only a single value, so it will not be useful for our analysis.\n",
"\n",
"The \"Sex\" feature is meant to be binary, but some values are missing and some are invalid.\n",
"\n",
"The \"Date Egg\" column is a string and needs to be converted to a datetime format.\n",
"\n",
"Some samples are missing many values, and we will need to decide whether to keep them.\n",
"\n",
"Overall, the dataset comes from a study on penguins. It seems the researchers were studying the morphological characteristics and the isotopic signatures of different species across geographic areas. They also kept track of the egg date and the clutch completion.\n",
"\n",
"## 2) Load the dataset\n",
"\n",
"Import the data into the notebook using pandas and check that it was interpreted correctly."
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
"# Import libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import warnings\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.manifold import TSNE\n",
"from ydata_profiling import ProfileReport\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.tree import DecisionTreeClassifier"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Sample Number</th>\n",
" <th>Species</th>\n",
" <th>Region</th>\n",
" <th>Island</th>\n",
" <th>Stage</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" <th>Comments</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PAL0708</td>\n",
" <td>1</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Anvers</td>\n",
" <td>Torgersen</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" <td>N1A1</td>\n",
" <td>Yes</td>\n",
" <td>11/11/07</td>\n",
" <td>39.1</td>\n",
" <td>18.7</td>\n",
" <td>181.0</td>\n",
" <td>3750.0</td>\n",
" <td>MALE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Not enough blood for isotopes.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>PAL0708</td>\n",
" <td>2</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Anvers</td>\n",
" <td>Torgersen</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" <td>N1A2</td>\n",
" <td>Yes</td>\n",
" <td>11/11/07</td>\n",
" <td>39.5</td>\n",
" <td>17.4</td>\n",
" <td>186.0</td>\n",
" <td>3800.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.94956</td>\n",
" <td>-24.69454</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>PAL0708</td>\n",
" <td>3</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Anvers</td>\n",
" <td>Torgersen</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" <td>N2A1</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>40.3</td>\n",
" <td>18.0</td>\n",
" <td>195.0</td>\n",
" <td>3250.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.36821</td>\n",
" <td>-25.33302</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PAL0708</td>\n",
" <td>4</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Anvers</td>\n",
" <td>Torgersen</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" <td>N2A2</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Adult not sampled.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PAL0708</td>\n",
" <td>5</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Anvers</td>\n",
" <td>Torgersen</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" <td>N3A1</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>36.7</td>\n",
" <td>19.3</td>\n",
" <td>193.0</td>\n",
" <td>3450.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.76651</td>\n",
" <td>-25.32426</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" studyName Sample Number Species Region \\\n",
"0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers \n",
"1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers \n",
"2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers \n",
"3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers \n",
"4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers \n",
"\n",
" Island Stage Individual ID Clutch Completion Date Egg \\\n",
"0 Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 \n",
"1 Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 \n",
"2 Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 \n",
"3 Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 \n",
"4 Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 \n",
"\n",
" Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) \\\n",
"0 39.1 18.7 181.0 3750.0 \n",
"1 39.5 17.4 186.0 3800.0 \n",
"2 40.3 18.0 195.0 3250.0 \n",
"3 NaN NaN NaN NaN \n",
"4 36.7 19.3 193.0 3450.0 \n",
"\n",
" Sex Delta 15 N (o/oo) Delta 13 C (o/oo) \\\n",
"0 MALE NaN NaN \n",
"1 FEMALE 8.94956 -24.69454 \n",
"2 FEMALE 8.36821 -25.33302 \n",
"3 NaN NaN NaN \n",
"4 FEMALE 8.76651 -25.32426 \n",
"\n",
" Comments \n",
"0 Not enough blood for isotopes. \n",
"1 NaN \n",
"2 NaN \n",
"3 Adult not sampled. \n",
"4 NaN "
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the dataset\n",
"df = pd.read_csv('data/penguins.csv')\n",
"df.head()"
]
},
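{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the interpretation check could be made explicit, assuming a tiny in-memory CSV in place of `data/penguins.csv`. The two sample rows and the names `sample_csv` and `sample_df` are illustrative only and are not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical two-row extract that mimics a few of the real columns\n",
"sample_csv = io.StringIO('''Species,Date Egg,Culmen Length (mm),Sex\n",
"Adelie Penguin (Pygoscelis adeliae),11/11/07,39.1,MALE\n",
"Adelie Penguin (Pygoscelis adeliae),11/16/07,,\n",
"''')\n",
"sample_df = pd.read_csv(sample_csv)\n",
"\n",
"# Measurements should be parsed as floats, with empty fields read as NaN\n",
"assert pd.api.types.is_float_dtype(sample_df['Culmen Length (mm)'])\n",
"assert sample_df['Culmen Length (mm)'].isna().sum() == 1\n",
"assert sample_df['Sex'].isna().sum() == 1\n",
"\n",
"# read_csv does not parse dates unless asked to, so this column still needs a conversion\n",
"assert not pd.api.types.is_datetime64_any_dtype(sample_df['Date Egg'])"
]
},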
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset seems correctly loaded.\n",
"\n",
"## 3) Inspect the dataset\n",
"\n",
"Use functions like head(), tail(), info() or describe() to get an overview of the dataset, its shape, and basic summary statistics to identify any immediate issues or patterns in the data."
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(344, 17)"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 344 entries, 0 to 343\n",
"Data columns (total 17 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 studyName 344 non-null object \n",
" 1 Sample Number 344 non-null int64 \n",
" 2 Species 344 non-null object \n",
" 3 Region 344 non-null object \n",
" 4 Island 344 non-null object \n",
" 5 Stage 344 non-null object \n",
" 6 Individual ID 344 non-null object \n",
" 7 Clutch Completion 344 non-null object \n",
" 8 Date Egg 344 non-null object \n",
" 9 Culmen Length (mm) 342 non-null float64\n",
" 10 Culmen Depth (mm) 342 non-null float64\n",
" 11 Flipper Length (mm) 342 non-null float64\n",
" 12 Body Mass (g) 342 non-null float64\n",
" 13 Sex 334 non-null object \n",
" 14 Delta 15 N (o/oo) 330 non-null float64\n",
" 15 Delta 13 C (o/oo) 331 non-null float64\n",
" 16 Comments 26 non-null object \n",
"dtypes: float64(6), int64(1), object(10)\n",
"memory usage: 45.8+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sample Number</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>344.000000</td>\n",
" <td>342.000000</td>\n",
" <td>342.000000</td>\n",
" <td>342.000000</td>\n",
" <td>342.000000</td>\n",
" <td>330.000000</td>\n",
" <td>331.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>63.151163</td>\n",
" <td>43.921930</td>\n",
" <td>17.151170</td>\n",
" <td>200.915205</td>\n",
" <td>4201.754386</td>\n",
" <td>8.733382</td>\n",
" <td>-25.686292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>40.430199</td>\n",
" <td>5.459584</td>\n",
" <td>1.974793</td>\n",
" <td>14.061714</td>\n",
" <td>801.954536</td>\n",
" <td>0.551770</td>\n",
" <td>0.793961</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" <td>32.100000</td>\n",
" <td>13.100000</td>\n",
" <td>172.000000</td>\n",
" <td>2700.000000</td>\n",
" <td>7.632200</td>\n",
" <td>-27.018540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>29.000000</td>\n",
" <td>39.225000</td>\n",
" <td>15.600000</td>\n",
" <td>190.000000</td>\n",
" <td>3550.000000</td>\n",
" <td>8.299890</td>\n",
" <td>-26.320305</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>58.000000</td>\n",
" <td>44.450000</td>\n",
" <td>17.300000</td>\n",
" <td>197.000000</td>\n",
" <td>4050.000000</td>\n",
" <td>8.652405</td>\n",
" <td>-25.833520</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>95.250000</td>\n",
" <td>48.500000</td>\n",
" <td>18.700000</td>\n",
" <td>213.000000</td>\n",
" <td>4750.000000</td>\n",
" <td>9.172123</td>\n",
" <td>-25.062050</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>152.000000</td>\n",
" <td>59.600000</td>\n",
" <td>21.500000</td>\n",
" <td>231.000000</td>\n",
" <td>6300.000000</td>\n",
" <td>10.025440</td>\n",
" <td>-23.787670</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Sample Number Culmen Length (mm) Culmen Depth (mm) \\\n",
"count 344.000000 342.000000 342.000000 \n",
"mean 63.151163 43.921930 17.151170 \n",
"std 40.430199 5.459584 1.974793 \n",
"min 1.000000 32.100000 13.100000 \n",
"25% 29.000000 39.225000 15.600000 \n",
"50% 58.000000 44.450000 17.300000 \n",
"75% 95.250000 48.500000 18.700000 \n",
"max 152.000000 59.600000 21.500000 \n",
"\n",
" Flipper Length (mm) Body Mass (g) Delta 15 N (o/oo) \\\n",
"count 342.000000 342.000000 330.000000 \n",
"mean 200.915205 4201.754386 8.733382 \n",
"std 14.061714 801.954536 0.551770 \n",
"min 172.000000 2700.000000 7.632200 \n",
"25% 190.000000 3550.000000 8.299890 \n",
"50% 197.000000 4050.000000 8.652405 \n",
"75% 213.000000 4750.000000 9.172123 \n",
"max 231.000000 6300.000000 10.025440 \n",
"\n",
" Delta 13 C (o/oo) \n",
"count 331.000000 \n",
"mean -25.686292 \n",
"std 0.793961 \n",
"min -27.018540 \n",
"25% -26.320305 \n",
"50% -25.833520 \n",
"75% -25.062050 \n",
"max -23.787670 "
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"studyName 3\n",
"Sample Number 152\n",
"Species 3\n",
"Region 1\n",
"Island 3\n",
"Stage 1\n",
"Individual ID 190\n",
"Clutch Completion 2\n",
"Date Egg 50\n",
"Culmen Length (mm) 164\n",
"Culmen Depth (mm) 80\n",
"Flipper Length (mm) 55\n",
"Body Mass (g) 94\n",
"Sex 3\n",
"Delta 15 N (o/oo) 330\n",
"Delta 13 C (o/oo) 331\n",
"Comments 7\n",
"dtype: int64"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get the number of unique values for each column\n",
"df.nunique()"
]
},
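{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts above show fewer distinct `Individual ID` values than rows, so the ID cannot be unique per row. A minimal sketch of how repeated IDs could be listed, using a toy frame (`toy_ids` and its values are made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data: 'N1A1' appears twice, the other IDs appear once\n",
"toy_ids = pd.DataFrame({\n",
"    'Individual ID': ['N1A1', 'N1A2', 'N1A1', 'N2A1'],\n",
"    'Body Mass (g)': [3750.0, 3800.0, 3900.0, 3250.0],\n",
"})\n",
"\n",
"# is_unique is False as soon as any ID repeats\n",
"print(toy_ids['Individual ID'].is_unique)\n",
"\n",
"# keep=False flags every occurrence of a repeated ID, not only the later ones\n",
"repeated = toy_ids[toy_ids['Individual ID'].duplicated(keep=False)]\n",
"print(repeated.sort_values('Individual ID'))"
]
},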
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Stage</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Anvers</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Anvers</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Anvers</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Anvers</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Anvers</td>\n",
" <td>Adult, 1 Egg Stage</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Region Stage\n",
"0 Anvers Adult, 1 Egg Stage\n",
"1 Anvers Adult, 1 Egg Stage\n",
"2 Anvers Adult, 1 Egg Stage\n",
"3 Anvers Adult, 1 Egg Stage\n",
"4 Anvers Adult, 1 Egg Stage"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Columns with a single unique value carry no information for the analysis,\n",
"# so set them aside in a separate dataframe before dropping them\n",
"df_unique = df.loc[:, df.nunique() == 1]\n",
"df_unique.head()"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [],
"source": [
"# drop the columns with only one unique value\n",
"df.drop(df_unique.columns, axis=1, inplace=True)"
]
},
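{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two cells above could also be written as a small helper that returns new frames instead of modifying `df` in place. This is only a sketch: `split_constant_columns`, `toy_cols` and the toy values are hypothetical names introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def split_constant_columns(frame):\n",
"    '''Return (frame without constant columns, frame with only the constant columns).'''\n",
"    # nunique() ignores NaN by default, so one value plus NaN still counts as constant\n",
"    is_constant = frame.nunique() == 1\n",
"    return frame.loc[:, ~is_constant], frame.loc[:, is_constant]\n",
"\n",
"\n",
"toy_cols = pd.DataFrame({\n",
"    'Region': ['Anvers', 'Anvers', 'Anvers'],\n",
"    'Island': ['Torgersen', 'Biscoe', 'Dream'],\n",
"})\n",
"kept, constant = split_constant_columns(toy_cols)\n",
"print(list(kept.columns))      # ['Island']\n",
"print(list(constant.columns))  # ['Region']"
]
},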
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Not enough blood for isotopes.', nan, 'Adult not sampled.',\n",
" 'Nest never observed with full clutch.',\n",
" 'No blood sample obtained.',\n",
" 'No blood sample obtained for sexing.',\n",
" 'Nest never observed with full clutch. Not enough blood for isotopes.',\n",
" 'Sexing primers did not amplify. Not enough blood for isotopes.'],\n",
" dtype=object)"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# list all comments\n",
"df['Comments'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" <th>Comments</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>MALE</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Not enough blood for isotopes.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Adult not sampled.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>FEMALE</td>\n",
" <td>No</td>\n",
" <td>9.18718</td>\n",
" <td>-25.21799</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>MALE</td>\n",
" <td>No</td>\n",
" <td>9.46060</td>\n",
" <td>-24.89958</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>No blood sample obtained.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>9.13362</td>\n",
" <td>-25.09368</td>\n",
" <td>No blood sample obtained for sexing.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>8.63243</td>\n",
" <td>-25.21315</td>\n",
" <td>No blood sample obtained for sexing.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>No blood sample obtained.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>FEMALE</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Not enough blood for isotopes.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>MALE</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Not enough blood for isotopes.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>FEMALE</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Not enough blood for isotopes.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>FEMALE</td>\n",
" <td>No</td>\n",
" <td>8.38404</td>\n",
" <td>-25.19837</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>MALE</td>\n",
" <td>No</td>\n",
" <td>8.90027</td>\n",
" <td>-25.11609</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>FEMALE</td>\n",
" <td>No</td>\n",
" <td>9.41131</td>\n",
" <td>-25.04169</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>MALE</td>\n",
" <td>No</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Nest never observed with full clutch. Not enou...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>MALE</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Not enough blood for isotopes.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>MALE</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Not enough blood for isotopes.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Sexing primers did not amplify. Not enough blo...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>FEMALE</td>\n",
" <td>No</td>\n",
" <td>8.47781</td>\n",
" <td>-26.07821</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>MALE</td>\n",
" <td>No</td>\n",
" <td>8.86853</td>\n",
" <td>-26.06209</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>120</th>\n",
" <td>FEMALE</td>\n",
" <td>No</td>\n",
" <td>9.04296</td>\n",
" <td>-26.19444</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121</th>\n",
" <td>MALE</td>\n",
" <td>No</td>\n",
" <td>9.11066</td>\n",
" <td>-26.42563</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>130</th>\n",
" <td>FEMALE</td>\n",
" <td>No</td>\n",
" <td>8.98460</td>\n",
" <td>-25.57956</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131</th>\n",
" <td>MALE</td>\n",
" <td>No</td>\n",
" <td>8.86495</td>\n",
" <td>-26.13960</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>138</th>\n",
" <td>FEMALE</td>\n",
" <td>No</td>\n",
" <td>8.61651</td>\n",
" <td>-26.07021</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139</th>\n",
" <td>MALE</td>\n",
" <td>No</td>\n",
" <td>9.25769</td>\n",
" <td>-25.88798</td>\n",
" <td>Nest never observed with full clutch.</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Sex Clutch Completion Delta 15 N (o/oo) Delta 13 C (o/oo) \\\n",
"0 MALE Yes NaN NaN \n",
"3 NaN Yes NaN NaN \n",
"6 FEMALE No 9.18718 -25.21799 \n",
"7 MALE No 9.46060 -24.89958 \n",
"8 NaN Yes NaN NaN \n",
"9 NaN Yes 9.13362 -25.09368 \n",
"10 NaN Yes 8.63243 -25.21315 \n",
"11 NaN Yes NaN NaN \n",
"12 FEMALE Yes NaN NaN \n",
"13 MALE Yes NaN NaN \n",
"15 FEMALE Yes NaN NaN \n",
"28 FEMALE No 8.38404 -25.19837 \n",
"29 MALE No 8.90027 -25.11609 \n",
"38 FEMALE No 9.41131 -25.04169 \n",
"39 MALE No NaN NaN \n",
"41 MALE Yes NaN NaN \n",
"46 MALE Yes NaN NaN \n",
"47 NaN Yes NaN NaN \n",
"68 FEMALE No 8.47781 -26.07821 \n",
"69 MALE No 8.86853 -26.06209 \n",
"120 FEMALE No 9.04296 -26.19444 \n",
"121 MALE No 9.11066 -26.42563 \n",
"130 FEMALE No 8.98460 -25.57956 \n",
"131 MALE No 8.86495 -26.13960 \n",
"138 FEMALE No 8.61651 -26.07021 \n",
"139 MALE No 9.25769 -25.88798 \n",
"\n",
" Comments \n",
"0 Not enough blood for isotopes. \n",
"3 Adult not sampled. \n",
"6 Nest never observed with full clutch. \n",
"7 Nest never observed with full clutch. \n",
"8 No blood sample obtained. \n",
"9 No blood sample obtained for sexing. \n",
"10 No blood sample obtained for sexing. \n",
"11 No blood sample obtained. \n",
"12 Not enough blood for isotopes. \n",
"13 Not enough blood for isotopes. \n",
"15 Not enough blood for isotopes. \n",
"28 Nest never observed with full clutch. \n",
"29 Nest never observed with full clutch. \n",
"38 Nest never observed with full clutch. \n",
"39 Nest never observed with full clutch. Not enou... \n",
"41 Not enough blood for isotopes. \n",
"46 Not enough blood for isotopes. \n",
"47 Sexing primers did not amplify. Not enough blo... \n",
"68 Nest never observed with full clutch. \n",
"69 Nest never observed with full clutch. \n",
"120 Nest never observed with full clutch. \n",
"121 Nest never observed with full clutch. \n",
"130 Nest never observed with full clutch. \n",
"131 Nest never observed with full clutch. \n",
"138 Nest never observed with full clutch. \n",
"139 Nest never observed with full clutch. "
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show the sex, clutch and isotope columns for the rows that have a comment\n",
"df[df['Comments'].notnull()][['Sex', 'Clutch Completion', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments']]"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Sample Number</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PAL0708</td>\n",
" <td>1</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A1</td>\n",
" <td>Yes</td>\n",
" <td>11/11/07</td>\n",
" <td>39.1</td>\n",
" <td>18.7</td>\n",
" <td>181.0</td>\n",
" <td>3750.0</td>\n",
" <td>MALE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>PAL0708</td>\n",
" <td>2</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A2</td>\n",
" <td>Yes</td>\n",
" <td>11/11/07</td>\n",
" <td>39.5</td>\n",
" <td>17.4</td>\n",
" <td>186.0</td>\n",
" <td>3800.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.94956</td>\n",
" <td>-24.69454</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>PAL0708</td>\n",
" <td>3</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A1</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>40.3</td>\n",
" <td>18.0</td>\n",
" <td>195.0</td>\n",
" <td>3250.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.36821</td>\n",
" <td>-25.33302</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PAL0708</td>\n",
" <td>4</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A2</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PAL0708</td>\n",
" <td>5</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A1</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>36.7</td>\n",
" <td>19.3</td>\n",
" <td>193.0</td>\n",
" <td>3450.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.76651</td>\n",
" <td>-25.32426</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" studyName Sample Number Species Island \\\n",
"0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Torgersen \n",
"1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Torgersen \n",
"2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Torgersen \n",
"3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Torgersen \n",
"4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Torgersen \n",
"\n",
" Individual ID Clutch Completion Date Egg Culmen Length (mm) \\\n",
"0 N1A1 Yes 11/11/07 39.1 \n",
"1 N1A2 Yes 11/11/07 39.5 \n",
"2 N2A1 Yes 11/16/07 40.3 \n",
"3 N2A2 Yes 11/16/07 NaN \n",
"4 N3A1 Yes 11/16/07 36.7 \n",
"\n",
" Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex \\\n",
"0 18.7 181.0 3750.0 MALE \n",
"1 17.4 186.0 3800.0 FEMALE \n",
"2 18.0 195.0 3250.0 FEMALE \n",
"3 NaN NaN NaN NaN \n",
"4 19.3 193.0 3450.0 FEMALE \n",
"\n",
" Delta 15 N (o/oo) Delta 13 C (o/oo) \n",
"0 NaN NaN \n",
"1 8.94956 -24.69454 \n",
"2 8.36821 -25.33302 \n",
"3 NaN NaN \n",
"4 8.76651 -25.32426 "
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop the comments useless column\n",
"df.drop('Comments', axis=1, inplace=True)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 3\n",
"45 3\n",
"52 3\n",
"51 3\n",
"50 3\n",
" ..\n",
"129 1\n",
"128 1\n",
"127 1\n",
"126 1\n",
"152 1\n",
"Name: Sample Number, Length: 152, dtype: int64"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# count the number of \"sample number\" unique repetitions\n",
"df['Sample Number'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A1</td>\n",
" <td>Yes</td>\n",
" <td>11/11/07</td>\n",
" <td>39.1</td>\n",
" <td>18.7</td>\n",
" <td>181.0</td>\n",
" <td>3750.0</td>\n",
" <td>MALE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A2</td>\n",
" <td>Yes</td>\n",
" <td>11/11/07</td>\n",
" <td>39.5</td>\n",
" <td>17.4</td>\n",
" <td>186.0</td>\n",
" <td>3800.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.94956</td>\n",
" <td>-24.69454</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A1</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>40.3</td>\n",
" <td>18.0</td>\n",
" <td>195.0</td>\n",
" <td>3250.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.36821</td>\n",
" <td>-25.33302</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A2</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A1</td>\n",
" <td>Yes</td>\n",
" <td>11/16/07</td>\n",
" <td>36.7</td>\n",
" <td>19.3</td>\n",
" <td>193.0</td>\n",
" <td>3450.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.76651</td>\n",
" <td>-25.32426</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" studyName Species Island Individual ID \\\n",
"0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 \n",
"1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 \n",
"2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 \n",
"3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A2 \n",
"4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 \n",
"\n",
" Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) \\\n",
"0 Yes 11/11/07 39.1 18.7 \n",
"1 Yes 11/11/07 39.5 17.4 \n",
"2 Yes 11/16/07 40.3 18.0 \n",
"3 Yes 11/16/07 NaN NaN \n",
"4 Yes 11/16/07 36.7 19.3 \n",
"\n",
" Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) \\\n",
"0 181.0 3750.0 MALE NaN \n",
"1 186.0 3800.0 FEMALE 8.94956 \n",
"2 195.0 3250.0 FEMALE 8.36821 \n",
"3 NaN NaN NaN NaN \n",
"4 193.0 3450.0 FEMALE 8.76651 \n",
"\n",
" Delta 13 C (o/oo) \n",
"0 NaN \n",
"1 -24.69454 \n",
"2 -25.33302 \n",
"3 NaN \n",
"4 -25.32426 "
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop the sample number column\n",
"df.drop('Sample Number', axis=1, inplace=True)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset consist in 17 features and 344 samples. The dataset contains 7 pure numerical features. Many values are missing and have been replaced by NaN. The dataset has 1 date feature. Some columns are clearly useless.\n",
"\n",
"The numerical features have different scales, and could be useful to apply some scaling later. Also, these values are not too big, therefore should not be a problem to manipulate them.\n",
"\n",
"Also, the columns \"Region\" and \"Stage\" have been dropped because they contain only one value.\n",
"\n",
"The \"comments\" columns contains 26 non-null values where only 6 are comments, this comments basically says that they couldn't get the sample since the blood wasn't enough or that \"Nest never observed with full clutch\". We can say that because the blood information are missing in the first case, or because the column \"Clutch completion\" is \"No\", the comments are redundant and can be dropped since they add no information.\n",
"\n",
"The comment \"No blood sample obtained for sexing\" indicate that these researchers used the blood sample to determinate the sex, so if\n",
"they failed to amplify the DNA or didn't take enough blood, the sex of the penguin will be empty.\n",
"\n",
"The \"Sample Number\" column contains 152 unique values, but the column \"Individual ID\" contains 191 unique values. This means that some samples are from the same penguin. This is may not a problem, but I need to keep it in mind. The total number of samples is 344, but the total number of unique \"Sample Number\" is 152, this because some penguins have been sampled more than once with the same \"Individual ID\". And a sample get more individuals in one go. This means I can drop the sample ID because is an arbitrary number that add no information other than a chronological order, which is already in the date feature, but I will keep the Individual ID.\n",
"\n",
"The \"Study number\" may also be dropped since there is logic correlation and seems less important at this point, but I will keep it for now.\n",
"\n",
"## 4 Data type conversion\n",
"\n",
"Ensure that each variable is of the appropriate data type (e.g., numeric, categorical, datetime).\n",
"Convert any variables with incorrect data types.\n",
"Convert with LabelEncoder or OrdinalEncoder.\n",
"\n",
"In the step 3 I found out many columns that need to be converted to the correct data type.\n",
"\n",
"For this step I looked at the values of the non-numerical columns, if they are non-ordinal and only 2 values are possible (eg: sex)\n",
"I will use the LabelEncoder, if more than 2 values are possible I will use the OneHotEncoder (eg: species).\n",
"If the values are ordinal I will use the OrdinalEncoder.\n",
"\n",
"Since the \"Individual ID\" is non-ordinal but contains a lot of unique values, I will encode is with the label encoder\n",
"because the one-hot encoder could generate too many columns. I could then reduce the number of columns with a PCA, but this\n",
"may start to be too complicated for a TP.\n",
"\n",
"Also, I will convert the \"Date Egg\" column to a datetime format.\n",
"\n",
"To simplify the coding and encoding I wrote a convenience class that will do all the work for me."
]
},
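  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The encoder choice described above can be illustrated with a minimal sketch. It uses the scikit-learn encoders directly on a made-up toy frame; it is not the `EncoderManager` implementation, and the toy values and the ordinal `Size` column are assumptions added only for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder\n",
    "\n",
    "# Toy data, made up for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'Sex': ['MALE', 'FEMALE', 'FEMALE'],\n",
    "    'Island': ['Torgersen', 'Biscoe', 'Dream'],\n",
    "    'Size': ['small', 'large', 'medium'],  # hypothetical ordinal column\n",
    "})\n",
    "\n",
    "# Binary, non-ordinal column: LabelEncoder maps the sorted classes to 0..n-1\n",
    "sex_codes = LabelEncoder().fit_transform(toy['Sex'])\n",
    "\n",
    "# Non-ordinal column with more than 2 values: one 0/1 column per category\n",
    "onehot = OneHotEncoder()\n",
    "island_dummies = onehot.fit_transform(toy[['Island']]).toarray()\n",
    "\n",
    "# Ordinal column: pass the category order explicitly\n",
    "ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])\n",
    "size_codes = ordinal.fit_transform(toy[['Size']])\n",
    "```\n",
    "\n",
    "Note that scikit-learn documents `LabelEncoder` as intended for target labels rather than input features; it appears here on a feature column only because the rule above uses it that way."
   ]
  },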
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The autoreload extension is already loaded. To reload it, use:\n",
" %reload_ext autoreload\n"
]
}
],
"source": [
"# Import the custom EncoderManager class\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"from lib.EncoderManager import EncoderManager"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"studyName 3\n",
"Species 3\n",
"Island 3\n",
"Individual ID 190\n",
"Clutch Completion 2\n",
"Date Egg 50\n",
"Sex 3\n",
"dtype: int64"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# list of non-numerical columns and the count of their unique values\n",
"df.select_dtypes(exclude='number').nunique()"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"# Convert to date with pandas.to_datetime\n",
"df['Date Egg'] = pd.to_datetime(df['Date Egg'])\n",
"\n",
"# Define the columns to encode with LabelEncoder\n",
"columns_to_label_encode = ['Clutch Completion', 'Sex', 'Individual ID']\n",
"\n",
"# Define the columns to encode with OrdinalEncoder\n",
"# since the date type gives me many problem il convert it to ordinal\n",
"# however, this column is not very interesting for the prediction\n",
"columns_to_ordinal_encode = [\"Date Egg\"]\n",
"\n",
"# Define the columns to encode with OneHotEncoder\n",
"columns_to_onehot_encode = ['studyName', 'Species', 'Island']"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" <th>studyName_PAL0708</th>\n",
" <th>studyName_PAL0809</th>\n",
" <th>studyName_PAL0910</th>\n",
" <th>Species_Adelie Penguin (Pygoscelis adeliae)</th>\n",
" <th>Species_Chinstrap penguin (Pygoscelis antarctica)</th>\n",
" <th>Species_Gentoo penguin (Pygoscelis papua)</th>\n",
" <th>Island_Biscoe</th>\n",
" <th>Island_Dream</th>\n",
" <th>Island_Torgersen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>22</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>39.1</td>\n",
" <td>18.7</td>\n",
" <td>181.0</td>\n",
" <td>3750.0</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>39.5</td>\n",
" <td>17.4</td>\n",
" <td>186.0</td>\n",
" <td>3800.0</td>\n",
" <td>1</td>\n",
" <td>8.94956</td>\n",
" <td>-24.69454</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>44</td>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>40.3</td>\n",
" <td>18.0</td>\n",
" <td>195.0</td>\n",
" <td>3250.0</td>\n",
" <td>1</td>\n",
" <td>8.36821</td>\n",
" <td>-25.33302</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>45</td>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>66</td>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>36.7</td>\n",
" <td>19.3</td>\n",
" <td>193.0</td>\n",
" <td>3450.0</td>\n",
" <td>1</td>\n",
" <td>8.76651</td>\n",
" <td>-25.32426</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>339</th>\n",
" <td>63</td>\n",
" <td>0</td>\n",
" <td>49.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>340</th>\n",
" <td>64</td>\n",
" <td>1</td>\n",
" <td>45.0</td>\n",
" <td>46.8</td>\n",
" <td>14.3</td>\n",
" <td>215.0</td>\n",
" <td>4850.0</td>\n",
" <td>1</td>\n",
" <td>8.41151</td>\n",
" <td>-26.13832</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>341</th>\n",
" <td>65</td>\n",
" <td>1</td>\n",
" <td>45.0</td>\n",
" <td>50.4</td>\n",
" <td>15.7</td>\n",
" <td>222.0</td>\n",
" <td>5750.0</td>\n",
" <td>2</td>\n",
" <td>8.30166</td>\n",
" <td>-26.04117</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>342</th>\n",
" <td>74</td>\n",
" <td>1</td>\n",
" <td>45.0</td>\n",
" <td>45.2</td>\n",
" <td>14.8</td>\n",
" <td>212.0</td>\n",
" <td>5200.0</td>\n",
" <td>1</td>\n",
" <td>8.24246</td>\n",
" <td>-26.11969</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>343</th>\n",
" <td>75</td>\n",
" <td>1</td>\n",
" <td>45.0</td>\n",
" <td>49.9</td>\n",
" <td>16.1</td>\n",
" <td>213.0</td>\n",
" <td>5400.0</td>\n",
" <td>2</td>\n",
" <td>8.36390</td>\n",
" <td>-26.15531</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>344 rows × 19 columns</p>\n",
"</div>"
],
"text/plain": [
" Individual ID Clutch Completion Date Egg Culmen Length (mm) \\\n",
"0 22 1 2.0 39.1 \n",
"1 23 1 2.0 39.5 \n",
"2 44 1 6.0 40.3 \n",
"3 45 1 6.0 NaN \n",
"4 66 1 6.0 36.7 \n",
".. ... ... ... ... \n",
"339 63 0 49.0 NaN \n",
"340 64 1 45.0 46.8 \n",
"341 65 1 45.0 50.4 \n",
"342 74 1 45.0 45.2 \n",
"343 75 1 45.0 49.9 \n",
"\n",
" Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex \\\n",
"0 18.7 181.0 3750.0 2 \n",
"1 17.4 186.0 3800.0 1 \n",
"2 18.0 195.0 3250.0 1 \n",
"3 NaN NaN NaN 3 \n",
"4 19.3 193.0 3450.0 1 \n",
".. ... ... ... ... \n",
"339 NaN NaN NaN 3 \n",
"340 14.3 215.0 4850.0 1 \n",
"341 15.7 222.0 5750.0 2 \n",
"342 14.8 212.0 5200.0 1 \n",
"343 16.1 213.0 5400.0 2 \n",
"\n",
" Delta 15 N (o/oo) Delta 13 C (o/oo) studyName_PAL0708 \\\n",
"0 NaN NaN 1.0 \n",
"1 8.94956 -24.69454 1.0 \n",
"2 8.36821 -25.33302 1.0 \n",
"3 NaN NaN 1.0 \n",
"4 8.76651 -25.32426 1.0 \n",
".. ... ... ... \n",
"339 NaN NaN 0.0 \n",
"340 8.41151 -26.13832 0.0 \n",
"341 8.30166 -26.04117 0.0 \n",
"342 8.24246 -26.11969 0.0 \n",
"343 8.36390 -26.15531 0.0 \n",
"\n",
" studyName_PAL0809 studyName_PAL0910 \\\n",
"0 0.0 0.0 \n",
"1 0.0 0.0 \n",
"2 0.0 0.0 \n",
"3 0.0 0.0 \n",
"4 0.0 0.0 \n",
".. ... ... \n",
"339 0.0 1.0 \n",
"340 0.0 1.0 \n",
"341 0.0 1.0 \n",
"342 0.0 1.0 \n",
"343 0.0 1.0 \n",
"\n",
" Species_Adelie Penguin (Pygoscelis adeliae) \\\n",
"0 1.0 \n",
"1 1.0 \n",
"2 1.0 \n",
"3 1.0 \n",
"4 1.0 \n",
".. ... \n",
"339 0.0 \n",
"340 0.0 \n",
"341 0.0 \n",
"342 0.0 \n",
"343 0.0 \n",
"\n",
" Species_Chinstrap penguin (Pygoscelis antarctica) \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
".. ... \n",
"339 0.0 \n",
"340 0.0 \n",
"341 0.0 \n",
"342 0.0 \n",
"343 0.0 \n",
"\n",
" Species_Gentoo penguin (Pygoscelis papua) Island_Biscoe Island_Dream \\\n",
"0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 \n",
".. ... ... ... \n",
"339 1.0 1.0 0.0 \n",
"340 1.0 1.0 0.0 \n",
"341 1.0 1.0 0.0 \n",
"342 1.0 1.0 0.0 \n",
"343 1.0 1.0 0.0 \n",
"\n",
" Island_Torgersen \n",
"0 1.0 \n",
"1 1.0 \n",
"2 1.0 \n",
"3 1.0 \n",
"4 1.0 \n",
".. ... \n",
"339 0.0 \n",
"340 0.0 \n",
"341 0.0 \n",
"342 0.0 \n",
"343 0.0 \n",
"\n",
"[344 rows x 19 columns]"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguin_manager = EncoderManager(df, columns_to_label_encode, columns_to_ordinal_encode, columns_to_onehot_encode)\n",
"\n",
"penguin_manager.encode(inplace=True).get_df()"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Adelie Penguin (Pygoscelis adeliae)',\n",
" 'Chinstrap penguin (Pygoscelis antarctica)',\n",
" 'Gentoo penguin (Pygoscelis papua)'], dtype=object)"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguin_manager = EncoderManager(df, columns_to_label_encode, columns_to_ordinal_encode, columns_to_onehot_encode)\n",
"df = penguin_manager.get_df()\n",
"df[\"Species\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = penguin_manager.get_df()\n",
"df[\"Island\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['PAL0708', 'PAL0809', 'PAL0910'], dtype=object)"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = penguin_manager.get_df()\n",
"df[\"studyName\"].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now I have a correctly encoded dataset, and I can easily decode it back to the original format with the penguin_manager.decode() method."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5 Identify missing values\n",
"\n",
"Determine the extent of missing values in the dataset by calculating the percentage of missing values for each variable.\n",
"Decide whether to impute, drop, or interpolate missing values based on the context and the importance of the variable.\n",
"\n",
"For this step I will use the library missingno to visualize the missing values. Missingno is a library specialized in visualize the *missing values* in a dataset in a very intuitive way."
]
},
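  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The percentage of missing values per variable mentioned above can be computed with plain pandas. The sketch below runs on a made-up toy frame (an assumption for illustration); on the real data the same expression would be applied to the notebook's dataframe.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame, made up for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'Body Mass (g)': [3750.0, np.nan, 3250.0, 3450.0],\n",
    "    'Sex': ['MALE', None, None, 'FEMALE'],\n",
    "})\n",
    "\n",
    "# isna() gives a boolean mask; its column mean is the share of missing values\n",
    "missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)\n",
    "print(missing_pct)\n",
    "```\n",
    "\n",
    "Sorting in descending order puts the columns that need the most attention first."
   ]
  },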
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
"# if uncommented, install missingno if not already installed\n",
"# !pip install missingno\n",
"# note: to works it need matplotlib=3.5.0\n",
"import missingno as msno"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB/MAAAPQCAYAAADKHVi5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1RU1/738ffA0KsoEBVsiIC9G8Wa2EvU2HvBbtTYuzFq7IoooCBgw16jJvbeC3aNNSLYCyjSGWaeP3zmXEZNbnJ/KoLf11p3Gaece/aa7Tn77M8uKp1Op0MIIYQQQgghhBBCCCGEEEIIIYQQnw2jzD4BIYQQQgghhBBCCCGEEEIIIYQQQhiSMF8IIYQQQgghhBBCCCGEEEIIIYT4zEiYL4QQQgghhBBCCCGEEEIIIYQQQnxmJMwXQgghhBBCCCGEEEIIIYQQQgghPjMS5gshhBBCCCGEEEIIIYQQQgghhBCfGQnzhRBCCCGEEEIIIYQQQgghhBBCiM+MhPlCCCGEEEIIIYQQQgghhBBCCCHEZ0bCfCGEEEIIIYQQQgghhBBCCCGEEOIzI2G+EEIIIYQQQgghhBBCCCGEEEII8ZmRMF8IIYQQQgghhBBCCCGEEEIIIYT4zEiYL4QQQgghhBBCCCGEEEIIIYQQQnxmJMwXQgghhBBCCCFEtqbVajP7FIQQQgghhBBCiH9NndknIIQQQgghhBBCCPEhXbhwgZs3b+Lm5oaHhwfm5uYYGcl8BiGEEEIIIYQQWYuE+UIIIYQQQgghhMg2Xr16RVhYGEePHkWtVmNqakrJkiVp3LgxZcqUIXfu3ADodDpUKlUmn60QQgghhBBCCPHXVDqdTpfZJyGEEEIIIYQQQgjxoaSlpZGSksKFCxf4/fffOXDgALGxsXh5edGoUSN69OiR2acohBBCCCGEEEL8VxLmCyGEEEIIIYQQIlvQz7ZPT0/H2NhYef3SpUscPnyYRYsWodFoqFWrFoMHD8bNzc3gc0IIIYQQQgghxOdEwnwhhBBCCCGEEEJkS+8L9SdMmMD169cpVaoU/fr1o3LlypiammbiWQohhBBCCCGEEO8nYb4QQgghhBBCCCGyPa1Wi5GREZGRkfj5+bFjxw68vLwYMWIElStXzuzTEx+J/nfXr9ogBEi9EEIIIYQQWYc6s09ACCGEEEIIIYQQ4v9KH879FX1wV6BAAYYNG4axsTHbt29n4cKFeHh44ODgIMFeNnHjxg2sra2xsrLC3t4eQPld5Tf+cp0/f54nT55QoEABcuXKRa5cuaReCCGEEEKIz57MzBdCCCGEEEIIIUSWlnE5/WvXrpGeno5Op6NkyZLvfFYf2t27d49x48Zx5swZ2rVrx08//fSpT1t8BM+fP6dt27bExsaiUqkoX7485cqVo06dOuTNmxcTExMJbr9Aqamp+Pj4cP/+fR49eoSjoyNff/013t7efPvtt9jY2GT2KQohhBBCCPFeEuYLIYQQQgghhBAiy8oYzAYHB7N8+XISEhJISkpiyJAhtGzZEgcHh/d+58aNG3To0IH4+HiCg4OpXr16ZhRBfECpqanEx8dz69YtLly4wNatW7lz5w52dnZUrFiRESNGkCdPHmXwh/hyaDQakpKSOHv2LHv37mXfvn28fPmS/Pnz88MPP1CmTBlcXFwy+zSFEEIIIYQwIGG+EEIIIYQQQgghsrxFixYxb9487O3t8fLy4sSJEwB06tSJHj164OzsbPB5faC/efNmxowZQ8+ePRkyZEhmnLr4iB49esTNmzeZN28ef/zxB46OjnTq1ImGDRtKcPsFeXs1hvT0dF6+fMm8efPYvXs3ycnJVKhQgb59+1KuXLlMPFMhhBBCCCEMSZgvhBBCCCGEEEKILO3Vq1d0794da2trRo0ahZeXF1u3bmXu3Lk8fvz4LwN9gPv37zN06FAePnzI+vXr+eqrrzKhBOJDezu8TU5Oxt/fn23btvHy5UsaNW
pEly5d8PDwyMSzFJlFq9ViZGREWloahw4dYt26dRw+fJi8efMyYcIEatSokdmnKIQQQgghBABGmX0CQgghhBBCCCGEEP+GVqs1+HtCQgK3bt2iWbNmeHl5AfDdd98xevRo8uTJw4oVKwgJCeHJkyfKd/RzG1xcXKhduzbPnj3j/v37Bu+JrOvtWdjm5uYMGjSIkSNH4unpyaZNmwgICOD27duZeJYisxgZGaHVajExMeHbb79l4sSJtG7dmgcPHjBixAiOHTuW2acohBBCCCEEIGG+EEIIIYQQQgghspD09HSMjN50Zxw4cID169ezevVq8uXLR+HChYE3+6YD1KtXj5EjR7430FepVKSnpwPQrl07cufOza5duzKhROJDeN8ADP2gD2NjYyW4rV+/PoMGDaJs2bLs27ePtWvXEhMT86lPV3wibw/80dPpdMp1RKVSkSdPHkaNGkX79u159eoVM2fO5I8//viUpyqEEEIIIcR7qTP7BIQQQgghhBBCCCH+CZ1Oh7GxMQDz588nMDDQ4P1t27ZRokQJTE1N0Wg0qNVq6tWrB8CMGTNYsWIFRkZGdO/eHWdnZ+VYpqamFC5cmLt37wKGs7rF5y89PV35LV+9ekVCQgJ58uRRwlp9cKv/8+uvvyY+Pp558+bx66+/UrVqVWrUqKEsvS6yh4z1IioqipcvX5IrVy4cHBwwNzd/5/e2tLSkT58+xMbGsmPHDrZu3UqhQoUwNTWVa4IQQgghhMg08oQihBBCCCGEEEKILEEfqC1dupTAwEDy589P+/btKVSoEGZmZuzatYstW7ag0+lQq9VoNBrgPzP0XV1dWbZsGQsWLCAxMRF4E/SamppStWpVnjx5QmJioiyzn4VotVolsF2yZAldu3alTp069OzZkx07dpCUlIRKpUKn0yl/GhkZ8c0339C6dWvi4uLw8/MjOTlZgvxsJOPAn8DAQNq3b0/r1q3p0qULM2fOJCYmRllqPyMnJye6dOlC4cKF+fXXX3n06JFSb4QQQgghhMgM8pQihBBCCCGEEEKIz5p+OXx4s4T+mTNnKFq0KH5+fkyYMIE5c+bQvHlzYmJiWLJkCb///vt7A/0hQ4ZgZWWFm5sblpaWwH8GCJQpU4bAwEAsLS1lFm4Wog/g58+fz4wZM4iMjMTS0pIjR44wd+5cNm7cSGJi4juBvlqtpl27dlSpUoVr166xbds24K+XZRdZi/7fcHBwMPPnz0elUlGsWDESExNZtWoVkydP/stAv3jx4tSpU4eYmBiCg4PRarVyTRBCCCGEEJlGpZOhpUIIIYQQQgghhMgC9Mvk+/r60rNnT3r37q28d/fuXZYtW8bGjRspWLAgvXv3pmHDhqhUKmXJfYB79+6RP39+ACXcFVlbZGQknTt3xt3dnWHDhmFtbU14eDibNm3CysqKHj168P3332Npaan85vol2O/evUu7du2oWrUqs2fPzuyiiA9Eq9Xy8uVLunbtip2dHRMnTsTFxYWjR48yb948bt26Rf369ZkwYQIODg7Kkvv6+pGYmEjbtm1JT09n8+bNmJqaZnaRhBBCCCHEF0pm5gshhBBCCCGEEOKzd+HCBX755RcWL16Mubk5OXPmBN7M1AcoWLAgXbt2pUWLFty9e5egoCCDGfppaWkASpAvs22zrrdnUr948YKnT5/Sr18/vLy8cHV1xcfHh86dO5OQkEBISAibNm0ymKFvbGyMTqfD0dGRChUqsH//fm7dupVJJRIfQsZ6YWRkRGJiIo8fP6Zbt264ublhZmZG9erVGTduHO7u7uzcuZNJkyYZzNDXD/SwtLSkffv23Llzh/3792diqYQQQgghxJdOwnwhhBBCCCGEEEJ89goUKMD48eMxMjLi+fPnbN68meTkZExNTZWl9AsUKGAQ6IeGhrJ9+3Z0Oh0mJiYGx5P90bOm9PR05be7fv06J0+e5MGDB+TNm5f8+fOj0WhIT0/HycmJdu3a/W2gr1KpsLa2pm
nTpiQmJvLw4cNMLp34X2WsF0eOHGHjxo3s2rWL9PR0LCwsAJTrQPny5f820Dc2NgagdOnSWFpaEhkZmVnFEkIIIYQ
"text/plain": [
"<Figure size 2500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# visualize a matrix of the missing values\n",
"warnings.filterwarnings('ignore')\n",
"msno.matrix(penguin_manager.get_df())\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this matrix plot in particular we can see which column has missing value (axis x) in function of the sample index (axis y). Also, on the right side there is a line that shows the number of non-missing in that \"zone\" of the matrix.\n",
"\n",
"With this we can also see if some zone contains the missing values are missing in a specific zone, in more columns together, or if they are missing randomly.\n",
"\n",
"In this dataset we see that the 2 blood measures are missing together. Which il logic if we think that those aren't taken if they could get enough blood from the penguin. The measures of the anatomical parts also tend to miss together. Also, we see some missing values in the \"sex\" column. Later in this step i want to see if this value could be inferred by another sample of the same penguin \"Individual ID\". Or maybe, since usually sex differences are quite evident, I could try to predict it with a machine learning model.\n",
"\n",
"We can also see that in the firsts and lasts samples some row are missing many columns, in this case could be simpler to drop those rows because they are missing too many values."
]
},
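  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea of inferring a missing \"Sex\" from another sample with the same \"Individual ID\" could be sketched as follows. This is only a sketch on a made-up toy frame; the helper `unique_known_sex` is a hypothetical name, and the approach assumes that an \"Individual ID\" really identifies a single penguin across samples, which should be checked first.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame, made up for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'Individual ID': ['N1A1', 'N1A1', 'N2A2', 'N3A1', 'N3A1'],\n",
    "    'Sex': ['MALE', None, None, 'FEMALE', 'MALE'],\n",
    "})\n",
    "\n",
    "def unique_known_sex(sex):\n",
    "    # Return the single known value of a group, or None if the group\n",
    "    # has no known value or contradictory ones\n",
    "    known = sex.dropna().unique()\n",
    "    return known[0] if len(known) == 1 else None\n",
    "\n",
    "# Look up one inferred value per Individual ID, then fill only the gaps\n",
    "per_id = toy.groupby('Individual ID')['Sex'].agg(unique_known_sex)\n",
    "inferred = toy['Individual ID'].map(per_id)\n",
    "toy['Sex'] = toy['Sex'].fillna(inferred)\n",
    "```\n",
    "\n",
    "Groups with contradictory known values are deliberately left untouched, since they suggest that the same ID covers more than one animal."
   ]
  },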
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"studyName 0\n",
"Species 0\n",
"Island 0\n",
"Individual ID 0\n",
"Clutch Completion 0\n",
"Date Egg 0\n",
"Culmen Length (mm) 2\n",
"Culmen Depth (mm) 2\n",
"Flipper Length (mm) 2\n",
"Body Mass (g) 2\n",
"Sex 10\n",
"Delta 15 N (o/oo) 14\n",
"Delta 13 C (o/oo) 13\n",
"dtype: int64"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get an idea of what to expect in terms of missing values\n",
"penguin_manager.get_df().isna().sum()"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:>"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB/wAAAPzCAYAAAC6LnSsAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdebxVdb0//tfhwJFJZsXZEhRnr1M0iVTorUxN8zqUs2Wl5tecNbUsNL3OionzQGpeUym9WWA5pTjkhBOgmAYqY4IiMp2zf3/441wJ8OyNCw8rn8/Ho0e612et/V744rAfvNZau65SqVQCAAAAAAAAAJRKm9YeAAAAAAAAAAConcIfAAAAAAAAAEpI4Q8AAAAAAAAAJaTwBwAAAAAAAIASUvgDAAAAAAAAQAkp/AEAAAAAAACghBT+AAAAAAAAAFBCCn8AAAAAAAAAKCGFPwAAAAAAAACUkMIfAAAAAAAAAEpI4Q8AAAAAAAAAJaTwBwAAAGpSqVTS1NTU/M/wUckURZMpiiZTAMCKSuFPoaZMmZIkzR9+4aOSKYomUxRNpiiaTFE0mWJ5qKury4IFCxZ5Tcb4KGSKoskURZMpiuZzOlAGfkaVQ13F5YgU5PTTT8/vfve7/PrXv87GG2+cpqamtGnjmhKWnUxRNJmiaDJF0WSKoskUy8Of/vSnPPHEE3nqqaeyzjrrZJNNNsnuu++ebt26yRjLRKYomkxRNJmiaD6nU7Snn346EyZMyNtvv53u3btn4MCBadeuXdq1aydfLJP/+Z//yZ577pkkMlQCCn8Ks9dee+WZZ57JuuuumwsuuMAHFT4ymaJoMkXRZIqiyRRFkymKdv755+eKK65Y7PXPfe5zOe+889KjR49UKpXU1dW1wnSUkUxRNJmiaDLF8uBzOkW65JJLcu2112b27NnNr22yySb50pe+lP322y9du3aVL2ry6KOP5oADDsjAgQMzdOjQJEr/FZ3/MnxkCx/n0atXryTJa6+9liOPPDJjxoxJmzZtPO6DmskURZMpiiZTFE2mKJpMsTxceumlueKKK7LVVlvlqquuyl133ZUhQ4Zko402yqhRo3LRRRdl7ty5Cg+qJlMUTaYomkxRNJ/TKdrQoUNz6aWXpm/fvvn5z3+ec845J7vttlumTJmSIUOG5Mc//nGmT58uX9Rk1VVXTefOnXPffffl+9//fpLI0ApO4c9HtvADbf/+/dOhQ4esu+66mThxYn70ox/5oMIykSmKJlMUTaYomkxRNJmiaI899liuv/769OnTJ6eddlq++MUvpm/fvhk0aFBOPvnkdO/ePc8888widxXBh5EpiiZTFE2mWB58TqdI48ePzy233JINNtggZ5xxRvbcc8/svPPOOeWUU3LhhRdm4403zsMPP5wDDzwwU6dOlS+q1qFDh3Ts2DH19fW5//77c/jhhydR+q/IFP58ZAs/pPTq1SvvvfdeTjrppHzzm9/MhAkTFvmg8sFvj/ADgQ8jUxRNpiiaTFE0maJoMkXRJkyYkFmzZuXb3/52Ntxww1Qqleb89OnTJ926dcuYMWPy1FNPtfKklIVMUTSZomgyxfLgczpF+uc//5lJkyblG9/4RtZff/00NTWlUqmkU6dO2XrrrXP11Vdnm222yUsvvZT9998/06ZNU9hSlXfffTfvvvtu+vTpk65du+bPf/5zjjjiiCRK/xWVwp/CbLbZZkmS119/PSeeeGK+/OUvZ8KECTniiCMyZsyY1NXV5bnnnss///nPxT60wJLIFEWTKYomUxRNpiiaTFGUF198MU1NTWnfvn2SZMGCBamrq0ulUkmPHj2y7bbbNr8O1ZApiiZTFE2mWJ58TqcIU6dOTaVSydSpU5vL/oUXlTQ1NaV79+655pprsu222+bvf/97jjjiiLz11lsyRYumTJmSd999N/vuu2/uvPPOdOvWLffcc4/SfwWm8KcwvXv3Tvfu3TNu3Lh069Ytv/zlLz
No0KBMnDgxRx55ZH7/+9/npz/9aXbaaae8++67rT0uJSBTFE2mKJpMUTSZomgyRVH69u2bJPnb3/6WJGnXrt0i21daaaUkyZw5cz7ewSgtmaJoMkXRZIrlyed0irDZZpulR48ezU+GqK+vby5h27Rpk8bGxjQ0NOSyyy7LFltskaeffjqXXHJJ5syZ03xhACzJCy+8kCSpr6/PqquumhtvvDHdu3dX+q/AFP4UYsGCBamvr896662XsWPHprGxMV26dMkZZ5yRr371q/nHP/6R0047Lc8//3y+//3vp1OnTv5A4UPJFEWTKYomUxRNpiiaTFGkbbbZJp06dcqdd96Z0aNHN7/e2NiY5P/Kjg/e5fivf/njL4P4IJmiaDJF0WSKj2pp//19TmdZ/Wumunbtmj59+uTxxx/Peeedl2TREra+vj6NjY3p3LlzTj/99Kyzzjq577778uqrryaJu/xZ6s+p9957Lz179sx2222X5P2vslH6r9gU/lTtoYceypQpU5a4rW3btqmvr8/mm2+esWPHZubMmamrq0vXrl2zxx57pHv37pk3b166deuWr33ta0mS+fPnf5zjswKSKYomUxRNpiiaTFE0mWJ5+GCuKpVKmpqa0rdv35x77rk588wzs/nmmzevbdPm/b9WmDt3bpJk5ZVXTvJ+GbJw27333tv8OFo+mWSKoskURZMpPg4LSzGf0ylKly5dcvTRR6dt27a54YYb8pvf/CbJ4qV/kqy33nrZdddd88Ybb2TEiBFJ4kISFrPwIrYjjjgixx9/fLp165bk/T/j1ltvPaX/CswnDqpy7bXX5pBDDsnNN9+cqVOnLnXdmmuumffeey8zZ85Mkjz55JO54oor8tZbb6VPnz6ZMWNG9t1337z44otp166dHwKfYDJF0WSKoskURZMpiiZTLA//mqu6urrmvwgcOHBgdtlllyRZ5FGhSTJv3rzU19enZ8+eSf7vLxZ/+9vf5phjjsmxxx6bxsZGdxF9AskURZMpiiZTFO3ee+/NpZdemv322y/nnHNO/vd//zfJ4qWYz+lUa2mZSpItt9wyxx9/fBobG3PFFVdk+PDhSRbPW0NDQ7bbbru0adMmr7322sd9Cqxglpaptm3bZt68eUmSXXfdNQ0NDalUKs1Pi1D6r7gU/rSoUqlk+vTpSZLf//73H/oXiguvdh07dmyee+65XHDBBXnsscdy4okn5s4778yOO+6YCRMmZP/998+4ceNc5foJJVMUTaYomkxRNJmiaDLF8rC0XC0sPT54B9DCnCz8S50333wzjY2NzY85TpJbb701559/ftq2bZsTTjgh9fX17iL6hJEpiiZTFE2mKNoFF1yQH/3oR7n00kvzxBNP5Oqrr84xxxyTK6+8MkkW+aztczrVaClTSfL1r389Bx10UCZPnpwhQ4bkt7/9bZL389bY2Nh813bv3r3Ttm1bT4z4hGspUw0NDYusX/jn2AdL/5tuuqm59D/yyCOTxM+oVuZXnxbV1dVl2223TfL+B9kbb7wxN998c6ZNm7bY2u7du6dz5875wx/+kIsuuiiPP/54jj/++Bx44IFJkp/97Gf5/Oc/n3feeScrrbTSx3karEBkiqLJFEWTKYomUxRNplgeasnVB/dZsGBB3n777dTV1aVHjx5J3i88LrjggsyfPz+//vWv069fv4/lHFixyBRFkymKJlMUaejQobn88suz1VZb5eqrr84tt9ySU089NW3atMlvf/vb/OMf/1hkvc/ptKSlTC28U79Xr17Zc889c9BBB2XSpEk599xzM3To0CTvl7Rt27ZNktxxxx2ZN29eNt1001Y7J1pXrT+n/tXC0v/Tn/50brrppqyyyioZMWJEjj322I/pDFiatq09AOWw8LuoPvOZz2Ty5Mm5/vrrkyTf/va306tXr+YrW7t375611lqr+Ttgjj322Bx88MFJ3v+Ojx49euTCCy/M7Nmzs9pqq7XCmbCikCmKJlMUTaYomk
xRNJlieWgpV5VKZZE7Fevq6tK2bds0NDSkZ8+eadeuXe66665ccMEFmTdvXm666aZssMEGrXIurBhkiqLJFEWTKYr
"text/plain": [
"<Figure size 2500x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# visualize a bar chart of the missing values\n",
"msno.bar(penguin_manager.get_df())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this bar chart we can see that even if some values are missing and at which percentage, overall the dataset is quite complete and quite balanced. Some data is missing in the columns as we seen in the matrix plot."
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABm8AAARgCAYAAAAPacLlAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeZhe890/8PdkXzVEJEIIaUwtiURCqKUVW2ljiVjbWGKLnaKWWlpq65OqSoQ8trRoKSWWJxRFLFWxRBMqhFhCLEkIssgkMr8//GZqMhMkmZgT83pd13095pzvfc7nzNx6Xc/99vl+SsrLy8sDAAAAAABAITSo6wIAAAAAAAD4L+ENAAAAAABAgQhvAAAAAAAACkR4AwAAAAAAUCDCGwAAAAAAgAIR3gAAAAAAABSI8AYAAAAAAKBAhDcAAAAAAAAFIrwBAAAAAAAoEOENAAAAAABQb3zwwQfZYYcd8uSTTy52zZgxY9KvX7/06NEjO++8cx566KEq56+66qpss8026dGjRwYOHJjJkyfXao3CGwAAAAAAoF545plnss8+++TNN99c7JrXX389xx57bI4//vg8/fTTOfbYY3PCCSfkvffeS5Lcfvvtuf7663PNNdfkySefzIYbbpjjjjsu5eXltVan8AYAAAAAAPjWu/3223PyySfnxBNP/Mp1vXv3zvbbb59GjRpll112yaabbpqbb745SfLXv/41+++/f7p27ZqmTZvmpJNOytSpU7+0k2dJCW8AAAAAAIAVUllZWWbNmlXlVVZWVuParbbaKvfff3922WWXL73mK6+8kvXWW6/Kse9+97uZOHFijecbN26czp07V56vDY1q7UoAAAAAAMAKp0nPQXVdwlL73aCeGTZsWJVjxxxzTI499thqa9u1a/e1rjl79uw0b968yrFmzZplzpw5X+t8bRDeAAAAAAAAK6QjjjgiBx98cJVjTZo0WaZrNm/ePJ9++mmVY59++mlatmz5tc7XBtumAQAAAAAAK6QmTZqkVatWVV7LGt6st956mTRpUpVjr7zySrp27Zok6dq1a5Xz8+fPz+uvv15tq7VlIbwBAAAAAAD4/3bdddeMHTs2o0ePzoIFCzJ69OiMHTs2u+22W5Jkzz33zA033JCJEydm3rx5+d3vfpdVV101vXv3rrUabJsGAAAAAAD1WEmDhnVdQp3r2bNnfv3rX2fXXXdNly5dcvnll2fIkCH55S9/mTXWWCNDhw7NOuuskyQZMGBAPvnkkxx99NH54IMP0q1bt4wYMSKNGzeutXpKysvLy2vtagAAAAAAwAqlaa/D6rqEpTbvmavquoTlwrZpAAAAAAAABWLbNAAAAAAAqMdsm1Y8Om8AAAAAAAAKRHgDAAAAAABQIMIbAAAAAACAAjHzBgAAAAAA6jEzb4pH5w0AAAAAAECBCG8AAAAAAAAKRHgDAAAAAABQIGbeAAAAAABAPWbmTfHovAEAAAAAACgQ4Q0AAAAAAECB2DYNAAAAAADqsZKGtk0rGp03AAAAAAAABSK8AQAAAAAAKBDhDQAAAAAAQIGYeQMAAAAAAPVYgwZm3hSNzhsAAAAAAIACEd4AAAAAAAAUiG3TAAAAAACgHiuxbVrh6LwBAAAAAAAoEOENAAAAAABAgQhvAAAAAAAACsTMGwAAAAAAqMfMvCkenTcAAAAAAAAFIrwBAAAAAAAoEOENAAAAAABAgZh5AwAAAAAA9VhJA30eReMvAgAAAAAAUCDCGwAAAAAAgAKxbRoAAAAAANRjJQ0a1nUJLELnDQAAAAAAQIEIbwAAAAAAAApEeAMAAAAAAFAgZt4AAAAAAEA9ZuZN8ei8AQAAAAAAKBDhDQAAAAAAQIEIbwAAAAAAAArEzBsAAAAAAKjHzLwpHp03AAAAAAAABSK8AQAAAAAAKBDbpgEAAAAAQD1W0tC2aUWj8wYAAAAAAKBAhDcAAAAAAAAFIrwBAAAAAAAoEDNvAA
AAAACgHitpYOZN0ei8AQAAAAAAKBDhDQAAAAAAQIHYNg0AAAAAAOox26YVj84bAAAAAACAAhHeUCvKy8vrugQAAAAAAPhWWOrwpqysLKNHj84RRxyR7bbbLt26dcumm26a/fffP9dff33KysqWubgnn3wypaWl2W+//Zb5WkUzdOjQlJaW5rTTTqvrUpbJa6+9lsMOOyxvvvlmleN9+/ZNaWlp3njjjVq717Rp0/L9738/1113Xa1dc3k499xzs9NOO2XOnDl1XQoAAAAAACugpZp588orr+SEE07IpEmT0rx585SWlmbDDTfM+++/nwkTJuSZZ57JzTffnJEjR2bVVVet7ZopkEMPPTRvvfXWN3KvM844I9/5znfys5/97Bu539I64YQTMnr06Fx00UU599xz67ocAAAAAIAv1cDMm8JZ4vDmjTfeyN57753Zs2dn4MCBOfroo7PyyitXnn/33Xdzxhln5PHHH8+BBx6Yv/3tb2nWrFmtFk1xfFPbpY0ePTqPPPJILr/88jRu3PgbuefSWmmllTJ48OBcdNFF2W233dKrV6+6LgkAAAAAgBXIEm2bVl5enpNOOimzZ8/O4MGDc+aZZ1YJbpKkQ4cOGTZsWDp37pxXXnklt9xyS60WTP2zYMGC/P73v88666yT7bffvq7L+Vr23XfftG7dOkOGDKnrUgAAAAAAWMEsUXjzzDPPZMKECWnXrl2OPPLIxa5r0aJFjjjiiPTu3bvK8dNOOy2lpaU1BjoV820GDhz4pTW89dZbKS0tzXHHHZepU6fm5JNPzuabb54ePXpkr732ypgxY5IkkyZNyuDBg9O7d+9sscUWOfroozNlypQar/n444/n0EMPTZ8+fdKtW7fsvPPOGTp0aLWZJRX3Puqoo/Lee+/l9NNPz5Zbbplu3brlxz/+ca677rp89tlnX1r/snrttddy6qmnZuutt85GG22UbbbZJr/85S/z9ttvV1vbt2/f9O7dO2VlZRk6dGh22GGHbLTRRvnBD36Q3/zmN/nwww+rvWfBggX54x//mH79+mXjjTfO1ltvnQsuuCCzZs3KBhtskL59+yb579+r4r477rhjSktLq22hNm/evAwbNiw77rhjunXrVnnvTz755Gs/87333ps333wze+21V7VzpaWl6d+/f2bOnJlf/epX2WqrrdK9e/f069cvo0aNSvJ5N9hJJ52UPn36ZNNNN81BBx2UF198scp1brvttpSWluaGG27I008/nQMPPDA9e/bMpptumsGDB+e1115LkjzwwAPZa6+90qNHj/Tt2ze/+c1vMnv27Gp1NWvWLLvuumueffbZPP3001/7WQEAAAAAYInCm9GjRydJtt9++6/cCq1///658cYbvzKMWVpTp07NgAED8thjj6VXr15Za621Mn78+Bx55JG55ZZbsvfee2fSpEnp06dPmjZtmgceeCD77bdftUDmiiuuyKBBg/LEE09knXXWyQ9/+MPMmjUrw4YNy/7775+PPvqo2r2nTZuWvfbaK/fff3822GCD9OzZM5MnT85FF12UCy64YLk8b5I89thj2WOPPTJq1Ki0adMm2267bb7zne/k1ltvTf/+/fP8889Xe8/ChQtzxBFH5Morr8xqq62WrbfeOh9//HGuv/76HHzwwVmwYEGVtccff3wuuOCCTJ06NVtuuWU6d+6c66+/PgceeGCVLdJWXXXV9OvXLy1atEiSbLfddlV+rnD88cfniiuuyOqrr54tttii8t4HHXRQlXt/mdtvvz1JFtt1M2vWrOyzzz658847061bt6y//vp5+eWXc+qpp+ZPf/pTBgwYkCeffDK9evXKKquskieeeCL7779/3nnnnWrXevjhhzNw4MC8++672XLLLdOiRYs89NBDOfDAAzNy5MgcffTR+eyzz/L9738/H374Ya6//vqcdNJJNdZVEXTddtttX+s5AQAAAADqQkmDhivs69tqiWbeTJ48OUmy8cYbL5dils
SECROy2Wab5corr0zLli1TXl6eY445Jg888EDOPPPM9O/fP+edd14aNWqU2bNnp3///nn99dfz0EMP5cc//nGS5Ik
"text/plain": [
"<Figure size 2000x1200 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# heatmap of the missing values\n",
"warnings.filterwarnings('ignore')\n",
"msno.heatmap(penguin_manager.get_df())\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nullity correlation heatmap measures how strongly the presence or absence of one variable affects the presence of another. As we noticed before, the blood measures are missing together, and also the anatomical measures are missing together. This is why we see a strong correlation between those columns. For the other columns, we see that there is still a correlation of 0.4, this means that 40% of the values are missing in the same sample which something be aware of."
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:>"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAACBEAAAPBCAYAAACieKANAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdZ1RUV/v38e/A0DsKRAUbImDvRrEm9hJ7b9i7xt6NUWNXRAEFARv2GjX23ht2jTUi2Aso0svM88JnThg1uZP8E0G5Pmvdy8iZOZ6z2Pcpe//2tVVarVaLEEIIIYQQQgghhBBCCCGEEEIIIbI9g8w+ACGEEEIIIYQQQgghhBBCCCGEEEJkDRIiEEIIIYQQQgghhBBCCCGEEEIIIQQgIQIhhBBCCCGEEEIIIYQQQgghhBBC/H8SIhBCCCGEEEIIIYQQQgghhBBCCCEEICECIYQQQgghhBBCCCGEEEIIIYQQQvx/EiIQQgghhBBCCCGEEEIIIYQQQgghBCAhAiGEEEIIIYQQQgghhBBCCCGEEEL8fxIiEEIIIYQQQgghhBBCCCGEEEIIIQQgIQIhhBBCCCGEEEIIIYQQQgghhBBC/H8SIhBCCCGEEEIIIYQQQgghhBBCCCEEICECIYQQQgghhBBCCCGEEEIIIYQQQvx/EiIQQgjx2dBoNJl9CEIIIYQQQgghhBBCCCGEEF80dWYfgBBCCPFHLl26xO3bt3F1dcXd3R1TU1MMDCT/JoQQQgghhBBCCCGEEEII8V+REIEQQogs6c2bN4SGhnL8+HHUajXGxsaUKFGCRo0aUbp0aXLlygWAVqtFpVJl8tEKIYQQQgghhBBCCCGEEEJ8GVRarVab2QchhBBCfExqairJyclcunSJnTt3cujQIWJiYvD09KRhw4b06NEjsw9RCCGEEEIIIYQQQgghhBDiiyIhAiGEEFmOrrpAeno6hoaGys+vXLnC0aNHWbx4MWlpadSsWZMhQ4bg6uqq9zkhhBBCCCGEEEIIIYQQQgjxz0iIQAghRJb3sTDBxIkTuXnzJiVLlqRfv35UqlQJY2PjTDxKIYQQQgghhBBCCCGEEEKIz5+ECIQQ/ycajQYDAwNZl158Mro2FxERga+vL7t27cLT05ORI0dSqVKlzD48kQXJdUoIIYQQQgghhBBCCCGE+OvUmX0AQojPz61bt7C0tMTCwgJbW1sAZWBOBunE/5VuwPeP6AaD8+fPz/DhwzE0NGTHjh0sWrQId3d37O3tpR0KLl68yLNnz8ifPz85c+YkZ86ccp0SQgghhBBCCCGEEEIIIf4CqUQghPhbXr58Sdu2bYmJiUGlUlGuXDnKli1L7dq1yZMnD0ZGRjJAJ/6xjMsW3Lhxg/T0dLRaLSVKlPjgs7p29uDBA8aPH8+5c+do164dP/zww6c+bJHFpKSk0L17dx4+fMiTJ09wcHDg66+/xsvLi2+//RYrK6vMPkQhhBBCCCGEEEIIIYQQIsuSEIEQ4m9JSUkhLi6OO3fucOnSJbZt28a9e/ewsbGhQoUKjBw5kty5c+utXy/EX5ExfBIUFMSKFSuIj48nMTGRoUOH0rJlS+zt7T/6nVu3btGhQwfi4uIICgqiWrVqmXEKIgtJS0sjMTGR8+fPs3//fg4cOMDr16/Jly8fAwYMoHTp0jg7O2f2YQohhBBCCCGEEEIIIYQQWY6ECIQQ/ydPnjzh9u3bzJ8/n19//RUHBwc6depEgwYNZIBO/COLFy9m/vz52Nra4unpyalTpwDo1KkTPXr0wMnJSe/zuiDBli1bGDt2LD179mTo0KGZcegii3i/Gkp6ejqvX79m/vz57N27l6SkJMqXL0/fvn0pW7ZsJh6pEEIIIYQQQgghhBBCCJH1SIhACPGPvD9Il5SUhJ+fH9u3b+f169c0bNiQLl264O7unolHKT43b968oVu3blhaWjJ69Gg8PT3Ztm0b8+bN4+nTp38YJAB4+PAhw4YN4/
Hjx2zYsIGvvvoqE85AZEUajQYDAwNSU1M5cuQI69ev5+jRo+TJk4eJEydSvXr1zD5EIYQQQgghhBBCCCGEECLLMMjsAxBCfJ7en+VramrK4MGDGTVqFB4eHmzevBl/f3/u3r2biUcpsjqNRqP39/j4eO7cuUPTpk3x9PQE4LvvvmPMmDHkzp2blStXEhwczLNnz5Tv6LJwzs7O1KpVixcvXvDw4UO9bSJ7MzAwQKPRYGRkxLfffsukSZNo3bo1jx49YuTIkZw4cSKzD1EIIYQQQgghhBBCCCGEyDIkRCCE+Es+NhirGwA2NDRUBujq1avH4MGDKVOmDAcOHGDdunVER0d/6sMVn4H09HQMDN7dhg4dOsSGDRtYs2YNefPmpVChQgCkpKQAULduXUaNGvXRIIFKpSI9PR2Adu3akStXLvbs2ZMJZyQy2/uhFB2tVqu0NZVKRe7cuRk9ejTt27fnzZs3zJo1i19//fVTHqoQQgghhBBCCCGEEEIIkWWpM/sAhBBZX3p6OoaGhsC7cvPx8fHkzp1bGZTTDdDp/vz666+Ji4tj/vz5/Pzzz1SpUoXq1asrJcWF0Gq1SptasGABAQEBetu3b99O8eLFMTY2Ji0tDbVaTd26dQGYOXMmK1euxMDAgG7duuHk5KTsy9jYmEKFCnH//n1Av2KG+LJlvE5FRkby+vVrcubMib29Paamph9cf8zNzenTpw8xMTHs2rWLbdu2UbBgQYyNjaXdCCGEEEIIIYQQQgghhMjWZDRPCPGnNBqNMjC3dOlSvL29qV27Nj179mTXrl0kJiaiUqnQarXKnwYGBnzzzTe0bt2a2NhYfH19SUpKkgCBUOgGaZctW0ZAQAD58uWjffv2FCxYEBMTE/bs2cPWrVvRarWo1WrS0tKA3ysSuLi4sHz5chYuXEhCQgLwLphgbGxMlSpVePbsGQkJCbKcQTaRMZQSEBBA+/btad26NV26dGHWrFlER0crSxpk5OjoSJcuXShUqBA///wzT548Ua5jQgghhBBCCCGEEEIIIUR2JSN6Qog/pRv4X7BgATNnziQiIgJzc3OOHTvGvHnz2LRpEwkJCR8ECdRqNe3ataNy5crcuHGD7du3A39cblxkD7plB+DdUgXnzp2jSJEi+Pr6MnHiRObOnUuzZs2Ijo5m6dKl7Ny586NBgqFDh2JhYYGrqyvm5ubA78GE0qVLExAQgLm5ucwozyZ0v+egoCAWLFiASqWiaNGiJCQksHr1aqZMmfKHQYJixYpRu3ZtoqOjCQoKQqPRSLsRQgghhBBCCCGEEEIIka2ptDLdTgjxP0RERNC5c2fc3NwYPnw4lpaWhIWFsXnzZiwsLOjRowfNmzfH3NxcCRLoSovfv3+fdu3aUaVKFebMmZPZpyKyCN1yBD4+PvTs2ZPevXsr2+7fv8/y5cvZtGkTBQoUoHfv3jRo0ACVSqUsbQDw4MED8uXLB6C0O5E9aTQaXr9+jbe3NzY2NkyaNAlnZ2eOHz/O/PnzuXPnDvXq1WPixInY29srSxvo2k1CQgJt27YlPT2dLVu2YGxsnNmnJIQQQgghhBBCCCGEEEJkGqlEIIT4wPszdV+9esXz58/p168fnp6euLi40L17dzp37kx8fDzBwcFs3rxZryKBoaEhWq0WBwcHypcvz8GDB7lz504mnZHISi5dusRPP/3EkiVLMDU1JUeOHMC7ygQABQoUwNvbmxYtWnD//n0CAwP1KhKkpqYCKAECmTmePWW8ThkYGJCQkMDTp0/p2rUrrq6umJiYUK1aNcaPH4+bmxu7d+9m8uTJehUJdIEnc3Nz2rdvz7179zh48GAmnpUQQgghhBBCCCGEEEIIkfkkRCCE0JOenq4sYXDz5k1Onz7No0ePyJMnD/ny5SMtLY309HQcHR1p167dnwYJVCoVlpaWNGnShISEBB4/fpzJZyeygvz58zNhwgQMDAx4+fIlW7ZsISkpCWNjY2XJgvz58+sFCUJCQt
ixYwdarRYjIyO9/enaq8g+Ml6njh07xqZNm9izZw/p6emYmZkBKG2lXLlyfxokMDQ0BKBUqVKYm5sTERGRWaclhBB
"text/plain": [
"<Figure size 2500x1000 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# dendrogram of the missing values\n",
"msno.dendrogram(penguin_manager.get_df())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last plot is the dendrogram, which groups together the columns that have missing values in the same samples. We can see that the blood measures are coupled first as expected, the second couple is specie and flipper (and in general all anatomical measures), may be worth to look further into that.\n",
"\n",
"At this point the first thing to do is to drop the rows that have too many missing values, and then decide what to do with the other missing values. I will drop this values because there is too little information to infer them."
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A2</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>339</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N38A2</td>\n",
" <td>No</td>\n",
" <td>2009-12-01</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" studyName Species Island Individual ID \\\n",
"3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A2 \n",
"339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A2 \n",
"\n",
" Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) \\\n",
"3 Yes 2007-11-16 NaN NaN \n",
"339 No 2009-12-01 NaN NaN \n",
"\n",
" Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) \\\n",
"3 NaN NaN NaN NaN \n",
"339 NaN NaN NaN NaN \n",
"\n",
" Delta 13 C (o/oo) \n",
"3 NaN \n",
"339 NaN "
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# return only the cols with at least 5 missing values\n",
"penguin_manager.get_df()[penguin_manager.get_df().isna().sum(axis=1) > 5]"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [studyName, Species, Island, Individual ID, Clutch Completion, Date Egg, Culmen Length (mm), Culmen Depth (mm), Flipper Length (mm), Body Mass (g), Sex, Delta 15 N (o/oo), Delta 13 C (o/oo)]\n",
"Index: []"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop the rows with at least 5 missing values\n",
"# Note: a faster solution would be to use dropna() with the threshold parameter, but\n",
"# after some try I found that it doesn't work and I don't know why\n",
"# so this is an alternative solution\n",
"missing_vals = penguin_manager.get_df().isna().sum(axis=1)\n",
"\n",
"drop_rows = missing_vals[missing_vals > 5].index\n",
"\n",
"penguin_manager.get_df().drop(drop_rows, inplace=True)\n",
"penguin_manager.get_df().reset_index(drop=True, inplace=True)\n",
"\n",
"penguin_manager.get_df()[penguin_manager.get_df().isna().sum(axis=1) > 5]"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(342, 13)"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the shape of the dataset\n",
"penguin_manager.get_df().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dropped 2 rows, which is not a big deal, and now we have a dataset with 334 rows but with way less missing values."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"studyName 0\n",
"Species 0\n",
"Island 0\n",
"Individual ID 0\n",
"Clutch Completion 0\n",
"Date Egg 0\n",
"Culmen Length (mm) 0\n",
"Culmen Depth (mm) 0\n",
"Flipper Length (mm) 0\n",
"Body Mass (g) 0\n",
"Sex 8\n",
"Delta 15 N (o/oo) 12\n",
"Delta 13 C (o/oo) 11\n",
"dtype: int64"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the missing values\n",
"penguin_manager.get_df().isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is still a problem with the columns \"sex\", \"Delta 15 N (o/oo)\", \"Delta 13 C (o/oo)\".\n",
"\n",
"First I want to see if the sex of some penguin is in another sample."
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Individual ID \n",
"N1A1 True 1\n",
" False 1\n",
"N24A1 False 2\n",
" True 1\n",
"N25A2 True 1\n",
" False 1\n",
"N26A2 True 1\n",
"N29A1 False 2\n",
" True 1\n",
"N29A2 False 2\n",
" True 1\n",
"N46A1 False 1\n",
" True 1\n",
"N50A1 False 1\n",
" True 1\n",
"N51A1 False 1\n",
" True 1\n",
"N5A1 True 1\n",
" False 1\n",
"N5A2 True 1\n",
" False 1\n",
"N6A1 False 2\n",
" True 1\n",
"N6A2 False 2\n",
" True 1\n",
"N7A1 True 1\n",
" False 1\n",
"N7A2 True 1\n",
" False 1\n",
"N8A2 False 2\n",
" True 1\n",
"N96A1 True 1\n",
"dtype: int64"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get rows with empty values and extract its \"Individual ID\"\n",
"empty_ind_id = penguin_manager.get_df()[penguin_manager.get_df().isna().any(axis=1)][\"Individual ID\"].unique()\n",
"# then get the rows with the same \"Individual ID\"\n",
"row_ind_id = penguin_manager.get_df()[penguin_manager.get_df()[\"Individual ID\"].isin(empty_ind_id)]\n",
"# now groups by individual id and see if there are samples with NaNs and not NaNs\n",
"row_ind_id_groups = row_ind_id.groupby(\"Individual ID\").apply(lambda x: x.isna().any(axis=1).value_counts())\n",
"# True means that there is at least one NaN in the sample\n",
"row_ind_id_groups"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Individual ID \n",
"N1A1 False 2\n",
"N24A1 False 2\n",
" True 1\n",
"N25A2 False 2\n",
"N26A2 False 1\n",
"N29A1 False 3\n",
"N29A2 False 2\n",
" True 1\n",
"N46A1 False 1\n",
" True 1\n",
"N50A1 False 2\n",
"N51A1 False 1\n",
" True 1\n",
"N5A1 True 1\n",
" False 1\n",
"N5A2 True 1\n",
" False 1\n",
"N6A1 False 2\n",
" True 1\n",
"N6A2 False 2\n",
" True 1\n",
"N7A1 False 2\n",
"N7A2 False 2\n",
"N8A2 False 3\n",
"N96A1 False 1\n",
"Name: Sex, dtype: int64"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now groups by individual id and see if there are samples with NaNs and not NaNs in the Sex column\n",
"row_ind_id_sex = row_ind_id.groupby(\"Individual ID\").apply(lambda x: x[\"Sex\"].isna().value_counts())\n",
"\n",
"# True means that there is at least one NaN in the Sex column\n",
"row_ind_id_sex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is interesting to see that some individuals have NaNs in some sample but not in others. This means we may know the truth for each sample of these individuals.\n",
"\n",
"In the case of \"Sex\" I can directly address that."
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get the samples with nans in the column \"Sex\"\n",
"samples_with_nans = penguin_manager.get_df()[penguin_manager.get_df().isna().any(axis=1)]\n",
"samples_with_nans = samples_with_nans[samples_with_nans[\"Sex\"].isna()]\n",
"\n",
"# now for each individual id that have rows with nans get its rows\n",
"for individual_id in empty_ind_id:\n",
" # get rows with the same individual id\n",
" rows = penguin_manager.get_df()[penguin_manager.get_df()[\"Individual ID\"] == individual_id]\n",
"\n",
" # check if Sex column has a non-null value\n",
" sex_values = rows[\"Sex\"].dropna().unique()\n",
" if len(sex_values) == 1:\n",
" sex = sex_values[0]\n",
" # assign sex to all rows with the same individual id\n",
" penguin_manager.get_df().loc[penguin_manager.get_df()[\"Individual ID\"] == individual_id, \"Sex\"] = sex\n",
"\n",
"# check the empty rows\n",
"count_nans = penguin_manager.get_df()[\"Sex\"].isna().sum()\n",
"count_nans"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code is correct, but it didn't edit the added the sex to the rows. I don't know why, I need to investigate further.\n",
"\n",
"Let's see the rows that still have NaNs:"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Individual ID</th>\n",
" <th>Sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>N1A1</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>N5A1</td>\n",
" <td>FEMALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>N6A2</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>N7A1</td>\n",
" <td>FEMALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>N7A2</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>N8A2</td>\n",
" <td>FEMALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>N25A2</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>N26A2</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>N29A1</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>N29A2</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211</th>\n",
" <td>N96A1</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>249</th>\n",
" <td>N50A1</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>323</th>\n",
" <td>N24A1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Individual ID Sex\n",
"0 N1A1 MALE\n",
"7 N5A1 FEMALE\n",
"10 N6A2 MALE\n",
"11 N7A1 FEMALE\n",
"12 N7A2 MALE\n",
"14 N8A2 FEMALE\n",
"38 N25A2 MALE\n",
"40 N26A2 MALE\n",
"45 N29A1 MALE\n",
"46 N29A2 MALE\n",
"211 N96A1 MALE\n",
"249 N50A1 MALE\n",
"323 N24A1 NaN"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the columns with empty values\n",
"penguin_manager.get_df().loc[penguin_manager.get_df().isna().any(axis=1)][[\"Individual ID\", \"Sex\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apparently the individual ID is not the id of a single penguin, OR the sample have wrong values. OR penguins have the ability to change sex...\n",
"\n",
"*Note: After some research it turned out that Adélie penguin (Pygoscelis Adelia) e il Chinstrap penguin (Pygoscelis antarctica) and Gentoo penguin (Pygoscelis papua) actually have the ability to change sex...* this is actually something important to keep in mind, and it explains why there are\n",
"Individuals with different sex in different samples. This means that would be useless to inference the sex based on the anatomical measures, because they cannot be related, at least not in these 3 species.\n",
"\n",
"I also mean that in out dataset the only way to know the sex is through blood isotopes, but they are missing for these rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6 Detect and handle outliers\n",
"\n",
"Use techniques like box plots, Z-scores, or IQR to identify potential outliers in the data. Depending on the context, decide whether to remove, transform, or keep outliers.\n",
"\n",
"Using to describe() before, I had the idea there wasn't anything strange in the dataset, but I want to be sure. So I will look further into it."
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:>"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjgAAAGbCAYAAADJFeorAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAABEsklEQVR4nO3deVxVdeL/8TeyXRAVDFMzrZlAXEYDcccsKWIaRBFQp9QyS02ddpcaFyxSYZrMzJExzchGpwbNHBxLJ3PKJcwFFW0waaZEHQVUCJD1cn9/+ON8vYHJptjx9Xw8eDzgfs7yOR/uOed9P5/PvdfBZrPZBAAAYCJNGrsCAAAADY2AAwAATIeAAwAATIeAAwAATIeAAwAATIeAAwAATIeAAwAATMepsSvQUCoqKlReXq4mTZrIwcGhsasDAABqwGazqaKiQk5OTmrSpOH6XUwTcMrLy5WWltbY1QAAAHXQrVs3ubi4NNj2TBNwKlNft27d5Ojo2Mi1AQAANWG1WpWWltagvTeSiQJO5bCUo6MjAQcAgJ+Zhp5ewiRjAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOgQcAABgOqb5sk0AqA+bzabi4uJ6rS/V/wsDLRZLg3/pIHAjIuAAuOHZbDZNmTJFhw8fbuyqqFu3blqyZAkhB6gnhqgAQPXveQFwfaEHB8ANz8HBQUuWLKnzEFVxcbGGDh0qSdqwYYMsFkud68IQFdAwCDgAoIshx83Nrd7bsVgsDbIdAPXDEBUAADAdAg4AADAdAg4AADAdAg4AADAdAg4AADAdAg4AADAdAg4AADAdAg4AADAdAg4AADCdWgec3NxcTZ8+XX369FGvXr00efJkZWVlSZIOHjyo4cOHKyAgQMHBwUpKSrJbd/369QoJCZG/v78iIyOVmppqlFmtVsXHx6t///4KCAjQpEmTjO0CAADURq0DzpNPPqkLFy7on//8p7Zt2yZHR0fNnj1beXl5mjBhgiIiIrRnzx7NmzdPCxYs0KFDhyRJu3fvVmxsrOLi4rRnzx4NGTJEkyZNUlFRkSQpISFBO3fu1Lp167R9+3ZZLBbNmjWrYY8WAADcEGoVcA4fPqyDBw8qLi5OzZs3l4eHh2JjYzV16lRt2bJFnp6eGjVqlJycnNSvXz+Fh4dr9erVkqSkpCSFhYUpMDBQzs7OGjt2rLy8vLRp0yajfPz48Wrbtq08PDw0c+ZMffHFF8rMzGz4owYAAKZWqy/bPHTokHx8fPS3v/1Nf/3rX1VUVKS77rpLM2bM0LFjx9SxY0e75X18fLR27VpJUkZGhqKioqqUp6enKz8/X6dPn7Zb39vbWy1atNDRo0fVvn37GtfRarXW5pAAoN4uve5YrVauQ0AtXK3zpVYBJy8vT0ePHtWvfvUrrV+/XsXFxZo+fbpmzJghb2/vKt+ga7FYdOHCBUlSYWHhZcsLCwslSe7u7lXKK8tqKi0trVbLA0B9lZSUGL8fOnRIrq6ujVgbAFItA46Li4skaebMmXJ1dZWHh4eeeeYZjRgxQpGRkSouLrZbvri4WE2bNpUkubm5VVvu5eVlBJ/K+TjVrV9T3bp1k6OjY63WAYD6uPTa1b179yov5gBcntVqvSqdE7UKOD4+PqqoqFBZWZnxCqWiokKS1LlzZ61Zs8Zu+YyMDPn6+kqSfH19dezYsSrlAwcOVIsWLdS6dWtlZGQYw1TZ2dnKzc2tMux1JY6OjgQcANfUpdccrkHA9aFWk4z79++v9u3b6/e//70KCwt17tw5vf7667rvvvs0ePBg5eTkKDExUWVlZUpJSVFycrIx7yY6OlrJyclKSUlRWVmZEhMTdfbsWYWEhEiSIiMjlZCQoMzMTBUUFGj+/Pnq3bu3OnTo0PBHDQAATK1WPTjOzs567733FBcXp9DQUJWUlCg4OFgzZ8
5U8+bNtXLlSs2bN0+LFy9Wy5YtNWvWLPXt21eS1K9fP8XExGju3Lk6c+aMfHx8tHz5cnl6ekqSpkyZovLyco0aNUqFhYXq06ePFi1a1NDHCwAAbgAONpvN1tiVaAhWq1UHDhyQv78/3cMArqmioiKFhoZKkjZv3swcHKAWrtb9m69qAAAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAApkPAAQAAplPrgLNp0yZ16dJFAQEBxs+0adMkSQcPHtTw4cMVEBCg4OBgJSUl2a27fv16hYSEyN/fX5GRkUpNTTXKrFar4uPj1b9/fwUEBGjSpEnKysqq5+EBAIAbUa0DTlpamoYOHarU1FTj59VXX1VeXp4mTJigiIgI7dmzR/PmzdOCBQt06NAhSdLu3bsVGxuruLg47dmzR0OGDNGkSZNUVFQkSUpISNDOnTu1bt06bd++XRaLRbNmzWrYowUAADeEOgWcX/3qV1Ue37Jlizw9PTVq1Cg5OTmpX79+Cg8P1+rVqyVJSUlJCgsLU2BgoJydnTV27Fh5eXlp06ZNRvn48ePVtm1beXh4aObMmfriiy+UmZlZz0MEAAA3GqfaLFxRUaEjR47Izc1NK1askNVq1d13362pU6fq2LFj6tixo93yPj4+Wrt2rSQpIyNDUVFRVcrT09OVn5+v06dP263v7e2tFi1a6OjRo2rfvn2N62i1WmtzSABQb5ded6xWK9choBau1vlSq4Bz7tw5denSRaGhoVq8eLHOnz+vGTNmaNq0aWrVqpXc3NzslrdYLLpw4YIkqbCw8LLlhYWFkiR3d/cq5ZVlNZWWllar5QGgvkpKSozfDx06JFdX10asDQCplgHH29vbGHKSJDc3N02bNk0jRoxQZGSkiouL7ZYvLi5W06ZNjWWrK/fy8jKCT+V8nOrWr6lu3brJ0dGxVusAQH1ceu3q3r17lRdzAC7ParVelc6JWgWc9PR0bdy4Uc8//7wcHBwkSaWlpWrSpIm6d++ud9991275jIwM+fr6SpJ8fX117NixKuUDBw5UixYt1Lp1a2VkZBjDVNnZ2crNza0y7HUljo6OBBwA19Sl1xyuQcD1oVaTjD09PbV69WqtWLFC5eXlOnXqlF599VUNGzZMoaGhysnJUWJiosrKypSSkqLk5GRj3k10dLSSk5OVkpKisrIyJSYm6uzZswoJCZEkRUZGKiEhQZmZmSooKND8+fPVu3dvdejQoeGPGgAAmFqtenDatGmjZcuWaeHChUpISJCrq6vCwsI0bdo0ubq6auXKlZo3b54WL16sli1batasWerbt68kqV+/foqJidHcuXN15swZ+fj4aPny5fL09JQkTZkyReXl5Ro1apQKCwvVp08fLVq0qKGPFwAA3AAcbDabrbEr0RCsVqsOHDggf39/uocBXFNFRUUKDQ2VJG3evJk5OEAtXK37N1/VAAAATIeAAwAATIeAAwAATIeAAwAATIeAAwAATKdWbxMHgOuNzWar8inp19ql+2/sukgXv+am8sNYgRsVAQfAz1pxcbHxFu3rwdChQxu7CrxVHRBDVAAAwITowQFgGjMkuTTSvis/MbWxBoZKJcU30r6B6xEBB4BpuEhyabSI0dhM8aH0QINhiAoAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQ
cAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQcAAJgOAQc
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.boxplot(data=penguin_manager.get_df().select_dtypes(include=['float64', 'int64']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The body mass values are in a different range than the other values."
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAgQAAAHBCAYAAAAWz6MMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAhJ0lEQVR4nO3deVhVdeLH8Q8gCOZKOBplWSq4sSniKI4BatiAW6llits4aZpOZSRl6TNTihXujqmk9piU1rjUNFZmDekkomkhNplmmUuuKK6s935/fzjcnyQqy4UL+H49j88D5557z5cvR+77nnO4OBljjAAAwC3N2dEDAAAAjkcQAAAAggAAABAEAABABAEAABBBAAAARBAAAAARBAAAQAQBgArA+58BlR9BgGorPT1dsbGxCgsLk7+/v7p166YXX3xRhw8fLvFjrV27Vr6+vjpy5Eg5jNQ+KuMYjx8/rtGjR+vo0aO2ZREREYqLiyv1Y7788suaPXu2PYZXYl999ZX69u2rvLw8h2wfKE8EAaqlpKQkPfroo8rIyNDEiROVmJioMWPGaMeOHXr44Yf13XffOXqIt4StW7cqOTnZbo+3bds2bdy4UaNHj7bbY5ZEaGioGjdurDfeeMMh2wfKE0GAamfnzp2aNm2aHnvsMS1btky9evVSx44dNWDAAL377ruqVauWnn/+eUcPE6UQHx+voUOHqlatWg4bw9ixY5WYmKiTJ086bAxAeSAIUO0sXbpUderU0TPPPHPNbZ6enoqLi9MDDzygixcvSpJiYmIUExNTaL3U1FT5+voqNTW1yG3ExcXpT3/6k9577z11795d/v7+evTRR/Xzzz/r3//+t3r16qWAgAANGDBA33//faH7fv311xoyZIgCAgIUEhKiSZMm6cyZM7bb165dq9atWystLU2PPPKI/Pz8FBYWpsTExLJOjSQpMzNTU6ZMUefOneXn56eBAwcqJSWl0Dq+vr5KSkrS5MmTFRISoqCgIE2YMEGnT58utN7SpUvVrVs329f/xRdf2OZt7dq1tvDq1q1bodMEeXl5eu211xQaGqrAwECNHDlSv/zyyw3HnZycrB9++EHR0dG2ZfPnz1fPnj21adMmRUdHy8/PT3369NE333yjb7/9VgMGDJC/v7+io6MLfY2lvZ8k+fv7y9vbW2+99VaJ5h2o7AgCVCvGGP3nP/9Rp06d5OHhUeQ6PXv21JNPPqnatWuXaVvffvut3n77bcXFxWn69On68ccf9fjjjys+Pl6jR49WfHy8jh07pmeffdZ2nx07dmj48OFyd3fXnDlz9MILL2j79u0aOnSosrOzbetZrVY99dRT+uMf/6glS5aoffv2SkhI0JYtW8o05pycHA0bNkyff/65nn76aS1YsECNGzfWqFGjrnnimz17tqxWq2bNmqXnnntOycnJmj59uu32BQsWKCEhQQ8++KAWLlyogIAAPf3007bbw8LC9MQTT9jWHTt2rO22DRs2aP/+/ZoxY4amTJmi9PT0QvctyocffqjAwEDdcccdhZYfP35c8fHxGjNmjObMmaNz585pwoQJeuaZZzRw4EDNmjVLVqtVTz/9dKE5Lu39pCv70IcffljMWQeqhhqOHgBgT2fPnlVOTo7uuuuuct/WxYsXNWfOHDVr1kyStH37dq1evVpvvfWWOnXqJOnKk86rr76q8+fPq27dupo5c6buvfdeLV68WC4uLpKkgIAARUVFac2aNRo8eLCkK2EzduxYDRgwQJLUvn17ffbZZ0pOTtYf/vCHUo/5gw8+0N69e/Xee+8pICBAktS1a1fFxMQoISFBa9assa3r4+Oj+Ph42+e7d+/WJ598Ikm6fPmyEhMTNXjwYFvwdOnSRVlZWVq9erWkK0dj7r77bklSq1atCn1PGjVqpIULF8rV1VWS9Msvv2jRokW6ePHidUNt27ZtioqKumZ5VlaWpk6dqq5du0qSDhw4oJkzZ2ratGnq37
+/JMlisWjChAn6+eef1apVqzLdT5L8/Py0aNEiHThwwPb9B6o6jhCgWnF2vrJLWyyWct9WvXr1Cj0ZNGzYUJIUGBhoW1a/fn1J0vnz55WVlaW0tDTdf//9MsYoPz9f+fn5atKkiZo1a6avvvqq0OMHBQXZPnZzc5Onp6cuX75cpjGnpKSoYcOGatOmjW37FotF4eHh2rNnj86dO2db9+qvQ5IaN26srKwsSVeOjmRnZ6tnz56F1rn6cP6N+Pv722JAkpo0aSLpyjwVJSsrSxkZGdcNvXbt2tk+9vLyumb8V38f7HG/gnFUpt/oAMqKIwSoVurXr6/bbrtNv/7663XXuXz5snJzc20/7Evreq9kr3eq4vz587JarUpMTCzyeoCaNWsW+tzd3b3Q587OzmX+ff7MzEydOnVKbdq0KfL2U6dOqV69epKu/Tqu3n7BNQ+enp6F1il4Ur2Z314UWBByVqu1yPULnpCvdzFhUd+L386fPe9XMDcXLly46bpAVUEQoNrp0qWLUlNTlZOTc82TrHTlor1p06bpnXfesb0K/+0RhbK+Ei/KbbfdJicnJw0fPrzIQ9/XCwl7qlOnjpo2baqEhIQiby/uqZbGjRtLuhIG9913n2351RdH2lODBg0kXf8IQkUrOJJSMC6gOuCUAaqdkSNHKjMzs8g3r8nIyNCbb76pe+65x3ZouHbt2jp+/Hih9Xbt2mX3cdWuXVutW7fWTz/9JD8/P9u/Fi1aaMGCBdf9jQZ7CgkJ0bFjx3T77bcXGkNKSorefPNN23UNN9OyZUvVqVNHGzduLLT8008/LfR5wSv/snJzc1PDhg117NgxuzxeWRXsL97e3g4eCWA/HCFAtRMYGKi//OUvmjNnjg4cOKB+/fqpQYMG2r9/v5YtW6ZLly5pyZIlcnJykiSFh4friy++0LRp09S9e3ft3LlT69evL5exPfPMM3r88cc1ceJE9e7dWxaLRcuWLVNaWprtivyyWrNmje2w/9WGDx+uhx56SCtXrtSIESM0ZswY3XHHHdq6dasSExM1ZMiQQuf1b6R27doaNWqU5s2bJw8PD4WEhGj79u169913Jf1/CNStW1eS9Nlnn6lr165lugAvNDS0XEKtNHbu3Km77rpL9957r6OHAtgNQYBq6YknnlDr1q2VlJSk+Ph4ZWZmqnHjxuratavGjBlT6JXdww8/rEOHDmndunVavXq1QkJCNHfuXA0aNMju4+rSpYuWLl2qBQsWaMKECXJ1dVWbNm20fPnyay7iK62FCxcWuXz48OGqVauWkpKSNHPmTL3++uu6cOGC7rzzTk2cOFEjR44s0XZGjx4tq9Wq1atXa+nSpQoICNCzzz6r+Ph427n+jh07qnPnzpo5c6ZSUlK0ZMmSUn9dkZGR+uc//6mTJ0/qd7/7Xakfxx62bNlyzQWVQFXnZPirIwBKKD8/Xx999JE6duxY6H0BkpKS9Morryg1NdV2dMBejDHq06ePIiMjNW7cOLs+dkls375do0aN0qZNmxweJoA9EQQASiUqKkpubm564okn1KBBA+3du1dz585Vjx49Cr1/gT1t3rxZL7zwgj755JMyv7FUaf35z39Wy5YtNXHiRIdsHygvBAGAUjl8+LBmzZql1NRUnT9/Xt7e3urdu7dGjx5d7GsRSmPq1KmqW7euQ56Qt2zZotdee01r1qyRm5tbhW8fKE8EAQAA4NcOAQAAQQAAAEQQAAAAFfN9CKxWq/Lz8+Xs7Gx7MxcAAFC5GWNktVpVo0aNm75zaLGCID8/X+np6XYZHAAAqFh+fn43/c2YYgVBQVX4+fkV+73ObwUWi0Xp6enMSxkxj/bBPNoH82gfzKN9lHUeC+5fnL8rUqwgKDhN4OLiwje2CMyLfTCP9sE82gfzaB/Mo32UdR6Lc7qfiwoBAABBAAAACAIAACCCAAAAiCAAAAAiCAAAgAgCAAAgggAAAIggAAAAIggAAIAIAgAAIIIAAACIIAAAAC
IIAACACAIAACCCAAAAiCAAAAAiCAAAgAgCAAAgggAAAIggAAAAIggAAIAIAgAAIIIAAACIIAAAACIIAACACAIAACC
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAgQAAAHBCAYAAAAWz6MMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAgWUlEQVR4nO3deXhVhZ3H4W8gQKLUAdSKoNWWEZBNQARFbAFxRVxGEVBxnYq1rVbriGJdRkWKg48blkFQqAIWQbBatWoH6eiA4FbU1l07qLiCShHZwp0/GFIWsYDABXzf5+F5ktxzz/nlnJDzyT03uSWFQqEQAOAbrUqxBwAAik8QAACCAAAQBABABAEAEEEAAEQQAAARBABABAGwEfh7Z7DlEQR8I7zwwgv5t3/7t3Ts2DEtWrTIgQcemF/84hd5++2313ldEyZMSKNGjfLOO+9shEk3jOUzrvivefPm6dy5cy699NJ88MEHG23b48aNy8CBA1ebZX331zvvvJOOHTtmzpw5G2rEddKrV6889NBDRdk2bEqlxR4ANrbRo0fnmmuuSbt27fLzn/883/72tzNz5swMHz48jzzySEaMGJGmTZsWe8yNYvDgwdlxxx2TJF988UVee+21DB06NJMmTcpvfvOb7Lrrrht8m0OGDEnbtm03yLoKhUL69euXU045JXXq1Nkg61xXl1xySc4888y0bds222+/fVFmgE3BIwRs1Z555pn0798/J5xwQm6//fZ069Yt7dq1S/fu3XPXXXdlm222ycUXX1zsMTeaPffcMy1btkzLli2z33775eSTT85vfvObLFq0KJdddlmxx/uHHn300bz88ss54YQTijZDs2bN0rRp0wwZMqRoM8CmIAjYqt1222351re+lfPPP3+12+rUqZOLLrooBx98cObNm5ck6d27d3r37r3SctOmTUujRo0ybdq0L93GRRddlDPOOCN33313unTpkhYtWqRnz55566238thjj6Vbt27Za6+90r1797z00ksr3ffpp5/OSSedlL322itt27ZN3759V3pofMKECWnSpElmzJiRHj16pHnz5unYsWOGDRu23vtk1113zfHHH58pU6Zk5syZlR9/9dVX06dPn7Ru3TqtW7fOj3/845UuqSzfD0888UROPPHEtGjRIgcddFBGjRpVuUznzp3z7rvvZuLEiatdJpgxY0Z69uxZ+Tncdttt/3DWoUOH5uCDD06NGjUqP9aoUaPcddddueiii7L33nunbdu2ufrqq7NgwYIMHDgw++67b9q1a5dLLrkkCxcu/Nr3S5Ijjzwy48ePL9plC9gUBAFbrUKhkCeeeCL77bdfysvLv3SZQw89ND/5yU9Ss2bNr7WtP/3pT7nzzjtz0UUX5Zprrsnrr7+eM888MwMGDEifPn0yYMCAvPfee7ngggsq7/PUU0/l1FNPTVlZWW644Yb069cv06dPz8knn5wFCxZULrd06dL87Gc/y+GHH55bb701e++9dwYNGpTHH398veft0KFDkmWPoCTJW2+9lZ49e2b27Nn55S9/mf79++ftt99Or169Mnv27JXue95556VJkya55ZZbsv/+++eqq67KnXfemeTvlyh+8IMfZOzYsfn2t79deb8rrrgiRxxxRIYOHZoWLVrk2muvzWOPPbbGGd988828+OKLOfTQQ1e7bdCgQalevXoGDx6co446KnfeeWeOPvrovPfee/mP//iP9OzZM+PHj6+c6+ve78ADD0xFRUUeffTRddjLsGXxHAK2Wp988kkWLlyYXXbZZaNva968ebnhhhvSoEGDJMn06dMzduzYjBw5Mvvtt1+S5P3338/AgQMzd+7cbLfddrnuuuvy3e9+N0OHDk3VqlWTJHvttVe6du2ae+65JyeeeGKSZWFz9tlnp3v37kmSvffeO48++mgmT56cAw44YL3mXf68go8++ijJshN5WVlZRo4cWRlH++23X7p06ZLhw4enb9
++lfft0qVLLrnkkiTJAQcckA8//DBDhgzJiSeemCZNmqR69eqpU6dOWrZsudI2zz///PTq1StJ0rJly0yaNClPPvlkOnXq9KUzPvnkk0mSFi1arHZbgwYNcuWVVyZJ9tlnn4wfPz6LFy/OoEGDUlpamgMOOCCTJk3Ks88+u0Hut80226RBgwaZOnVqevTo8Q/2LmyZPELAVqtKlWVf3hUVFRt9W//0T/9UGQPJ30+4K54Ua9WqlSSZO3duvvjii8yYMSM/+MEPUigUsmTJkixZsiS77rprGjRokP/5n/9Zaf2tWrWqfHv5CXf+/Plfe+6SkpIky06+7dq1S1lZWeUsNWvWTJs2bTJlypSV7nPUUUet9P7BBx+c2bNn56233vrKbbVp06by7W222SY77LBD5s6du8bl33777Wy33XbZbrvtVrttxf1RWlqa2rVrp1mzZikt/fvPOLVq1crf/va3DXK/JKlfv/5m/Zsl8HV5hICtVq1atbLttttm1qxZa1xm/vz5WbRoUeXJen2t6ZLDmi5VzJ07N0uXLs2wYcO+9PkAK14zT5KysrKV3q9SpcrX+l3/5b92WLdu3STJp59+mgcffDAPPvjgasuu+uz+FS8DJKl85v1XndyT1ffFP/oc5s2bt8b992X7e03Lboj7LV/uy0IBthaCgK1ahw4dMm3atCxcuHC1k2yy7El7/fv3z5gxYyp/elz1EYUN8ZP4qrbddtuUlJTk1FNPTdeuXVe7fW1PUutrypQpKSkpqfyp/Vvf+lbat2+f0047bbVlV/zpOVkWDyta/hyDDf0rebVr196sTsBz585N7dq1iz0GbDQuGbBVO/300/Ppp5/m+uuvX+222bNnZ/jw4dltt90qH9qvWbNm3n///ZWWW/V68oZQs2bNNGnSJG+++WaaN29e+W+PPfbI4MGD1/gbDRvC+++/n3HjxqVjx47ZeeedkyRt27bN66+/nj333LNylmbNmmXkyJGrPZFu0qRJK73/+9//PvXr1893vvOdJH+/VPN11atXL/Pnz89nn322Qdb3db333nupX79+sceAjcYjBGzVWrZsmXPPPTc33HBD3njjjRxzzDGpXbt2Xnvttdx+++35/PPPc+utt1ZeS+/UqVMmTZqU/v37p0uXLnnmmWdy7733bpTZzj///Jx55pn5+c9/niOPPDIVFRW5/fbbM2PGjPzoRz/aINt46aWX8vHHHydZ9oeJXnnllYwcOTI1atRY6e8QnH322enZs2f69OmTXr16pUaNGhk7dmz+8Ic/5KabblppnSNHjkxZWVlatmyZRx55JI899liuu+66ytu32267/OUvf8n06dO/9AmBa2v//fdPsizI1vTEw03lb3/7W15//fWcccYZRZ0DNiZBwFbvRz/6UZo0aZLRo0dnwIAB+fTTT1O3bt18//vfz1lnnZV69epVLnvsscdm5syZmThxYsaOHZu2bdvmxhtvrHx2/IbUoUOH3HbbbRk8eHDOOeecVKtWLU2bNs2IESNWe4b++vrJT35S+XbNmjWz884756ijjkrv3r2zww47VN7WuHHjjB49Otdff30uvPDCFAqFNGzYMLfccksOPPDAldbZr1+/TJw4MUOHDs33vve93HTTTTnkkEMqbz/99NNzzTXX5IwzzsiIESPWe/Zdd901TZs2zR//+MeiB8Hjjz+eatWqpWPHjkWdAzamkoJXIQHWwrRp03LyySfnjjvuSLt27TbJNh9++OH069cvjz/+eLbZZptNss0v07t37zRu3Ljy1y1ha+Q5BMBm6+CDD84ee+yRMWPGFG2GGTNm5JVXXsmZZ55ZtBlgUxAEwGarpKQk1157be64446i/dngAQMG5LLLLqv82xKwtXLJAADwCAEAIAgAgAgCACBr+XcIli5dmiVLlqRKlSqVf8AFANi8FQqFLF26NKWlpf/wr4iuVRAsWbIkL7zwwgYZDgDYtJo3b57q1at/5TJrFQTLq6J58+aVr9
vOMhUVFXnhhRfsmy2AY7Vlcby2HI7V5mv5sVmb1xhZqyBYfpmgatWqDvYa2DdbDsdqy+J4bTkcq83X2lzu96RCAEA
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAggAAAHBCAYAAAAM80OCAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAkE0lEQVR4nO3de3SU9Z3H8c/kRoKoIAURtLWAyUIuZCIXS2CFQIgVgQpYQYwVaZUiFxG2IK5YSivWeoMAAkcUES8xVpdbXdSgKBJYoATQqgurVrkI1UAF5JaZ3/7ByZThS0ImJEzCvF/n5Jxk5nme+eVrzLwzz0Picc45AQAAnCQq3AsAAAC1D4EAAAAMAgEAABgEAgAAMAgEAABgEAgAAMAgEAAAgEEgAAAAg0AAUOP4fWxA3UMgIKJMnDhRSUlJ5b4tXrxYkpSbm6vc3NzAfklJScrLywvXsitlx44dSkpK0muvvRbupQQcO3ZM06ZN09KlSwO3TZw4UVlZWVU+5ooVKzR48ODqWF7ISkpKdO211+qrr74Ky+MD51JMuBcAnGtNmjTRzJkzT3vfD3/4w9Penp+fr2bNmtXkss5Le/fu1YIFCzRt2rRqOV5JSYmmTJmiefPmVcvxQnXJJZfo9ttv16RJk7Rw4UJ5PJ6wrAM4FwgERJy4uDilp6eHtE+o26NmzJ49W8nJyUpJSQnbGm655RbNmTNHb7/9trKzs8O2DqCmcYoBqISTTzGsW7dOSUlJWr16tYYMGaK0tDRlZ2dr0aJFZp9FixZpwoQJ8nq96ty5s37/+9/ryJEjQdu9/fbb6t+/v1JTU5WZmanf//73+v777wP35+XlKTs7WzNnzlSnTp3Us2dP7du3r8qfi9/v17x585Sdna2UlBTl5OTo+eefD9omNzdX999/v+bNm6du3bopNTVVgwYN0ubNm4O2e/fdd9W/f3+lpaUpJydHy5YtU3Z2tvLy8rRjxw716NFDknTfffeZ0wqvvfaacnJylJqaqr59++q9996rcN0lJSV69dVX1adPn8BtZf8tioqKlJubq7S0NHXr1k0FBQXau3evRo4cKa/Xq2uvvVYLFiw46/0kqV69eurVq5fmzp1b2ZEDdRKBgIhUWlpq3kK9kG7s2LFq27atZs2apczMTE2dOtU80U6fPl3ffvutnnzySf3yl7/UK6+8ov/4j/8I3L906VLdfffdatmypWbNmqWRI0dqyZIlGjFiRNB6du3apbfeekuPP/647rnnHjVq1KjKn/tvf/tbzZgxQ3379tWcOXN03XXX6aGHHtKsWbOCtluxYoUKCwv1n//5n3r88cf1zTffaPTo0fL5fJKktWvXasSIEbrsssuUl5enIUOG6MEHH9Tu3bslSU2bNg2cyvn1r38ddFpn9+7dmjdvnsaMGaMZM2bIOadRo0bp22+/LXfdb775pkpLSwPRcbJ7771XWVlZmjNnjq688ko9+OCDuu2225SYmKgZM2YoOTlZ06ZN05YtW6plv5/+9KfaunWrPv/88xAmD9QtnGJAxNm5c6eSk5PN7WPGjNGIESMqfZyePXvq/vvvlyR17dpVe/fu1VNPPaUhQ4YoKupEe19yySWaM2eOYmJidO211yoqKkrTpk3Ttm3b1Lp1az366KPq2rWrHn300cBxr7zySt1+++1atWqVunXrJulE0EyYMEGdO3c+i89c+vzzz/XKK6/o3nvv1Z133ilJ6tKlizwej+bOnatbbrklEB+lpaWaP3++GjRoIEk6dOiQJkyYoI8//lgpKSnKy8tT69atNXPmzMC5+MaNG+vee++VdOJUTps2bSSduLajbdu2gXX4/X7NmjVLrVq1knTip/KhQ4equLj4tAEgnQiSVq1a6YILLjD3DRgwQEOHDpUk1a9fXzfffLPS0tI0evRoSVJKSooKCwv117/+VWlpaWe9X2pqqiSpqKhIP/7xjys5faBu4RUERJwmTZro1VdfNW8DBw
4M6Tj9+vUL+rhXr1769ttvg36q7N27t2Ji/tXhOTk5kqQNGzbos88+09dff62srKygVzI6dOigBg0a6IMPPgg6fmJiYqifqrF27Vo558xjZmVl6ejRo9q4cWNg29atWwfiQJIuvfRSSdLhw4d17Ngxbdq0STk5OUEX6uXk5AR9vuVp1KhRIA4k6YorrpAkHThwoNx9vvrqK11++eWnvc/r9Qbe/8EPfiBJateuXdDjne74Vd3vwgsv1EUXXaQdO3aUu16gruMVBEScuLi4wE+AZ6Np06ZBHzdu3FiS9N1331Vqm/3790uSpkyZoilTppjj7927N+jjsiews1H2mL179z7t/Xv27Am8n5CQEHRf2asifr9f+/fvl8/nC3w+ZWJiYip1+qN+/fpBH5dFht/vL3efgwcPmjWVOTlkylt/de5Xtt3BgwcrtS1QFxEIQBWVPdmWKTt/fvKT5qnbfPPNN5JOnHq46KKLJEm/+c1v1LFjR3P8iy++uBpXe0LZYz733HOnfam+efPmlTpO48aNFRsba64Z8Pv9Z3UBZUUaNWpU4SsM59p33313VteCALUdpxiAKlq5cmXQx//93/+tFi1aBP0uhVO3WbFihTwej6655hq1bNlSjRs31o4dO5Samhp4a9asmR577DH97W9/q/Y1d+jQQZK0b9++oMfcv3+/nnzySRM05YmOjlZGRobefvvtoNtXrlyp0tLSoO2qS/PmzQMXQIbb/v37dfjw4UoHFVAX8QoCUEULFixQfHy80tPT9eabb+qdd97RY489FrTNli1bNH78ePXr10+ffvqpZsyYoZ///OeBc+5jx47V5MmTFR0dre7du+u7777T7NmztWfPntNeSFkZH3zwQdBpjjLXXXedEhMT1bdvXz3wwAPauXOnUlJS9Pnnn+uJJ57Q5ZdfriuvvLLSjzN69Gjl5uZq9OjRGjhwoHbt2qXp06dL+tcpgwsvvFDSiYv5WrVqFXR+P1SZmZl64403dODAgcBxw6XsWo0uXbqEdR1ATSIQgCqaNGmSXn/9dc2dO1ctW7bUjBkzAhchlvnFL36hPXv2aOTIkWrUqJGGDx+uu+66K3D/TTfdpAsuuEBPP/208vPzVb9+fWVkZOjRRx8NRESoli1bpmXLlpnb27Rpo2bNmmnatGmaO3euXn75ZX399ddq3Lixrr/+et1zzz0h/cTfvn175eXlafr06RoxYoRatGihBx54QGPHjg2cvmjQoIGGDh2q/Px8vfvuu+bCy1B0795dMTExev/993X99ddX+TjV4b333lNaWppatGgR1nUANcnj+CsqQEjWrVun2267TQsXLlSnTp3K3S4pKUkjR47UqFGjzuHqzp3CwkI1a9Ys6JWObdu26YYbbtDs2bPL/eeKZ2Pq1Knavn27nnvuuWo/dmUdOnRIXbt21SOPPKKePXuGbR1ATeMaBABVsnr1at1xxx0qKCjQhg0btGzZMt1zzz1q2bJljb30Pnz4cH388cfmFxedSy+++KISExNrJICA2oRTDACqZMKECYqPj9dTTz2lvXv3qmHDhuratavGjRunevXq1chjNmnSRL/97W/10EMP6eWXX66Rx6hISUmJFi5cqEWLFvGHmnDe4xQDAAAwOMUAAAAMAgEAABgEAgAAMKp0kaLf71dpaamioqK4UAcAgDrCOSe/36+YmJjA31cpT5UCobS0VFu3bq3S4gAAQHilpqYqLi6uwm2qFAhl1ZGamnra37zm8/m0devWcu9HMOYVGuYVOmYWGuYVOmYWmnDNq+xxz/TqgVTFQCg7rRAdHV3hJ3am+xGMeYWGeYWOmYWGeYWOmYUmXPOqzOUBXKQIAAAMAgEAABgEAgAAMAgEAABgEAgAAMAgEAAAgEEgAAAAg0AAAAAGgQAAAAwCAQAAGAQCAAAwCAQAAGAQCAAAwCAQAACAQSAAAACDQAAAAAaBAAAADAIBAAAYBAIAADAIBAAAYBAIAADAIBAAAIBBIA
AAAINAAAAABoEAAAAMAgEAABgEAgAAMAgEAABgEAgAAMAgEAAAgEEgAAAAg0AAAABGTLgXAITCOafDhw/r6NGjOnz
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAgMAAAHBCAYAAAD0E7h1AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAlcUlEQVR4nO3de3hU5b238W8SEhIIGhVEQKWXjQYEcsAQlAQ5acKxlYDFgqAtIiBCiTRIt5tqi3VT0bABlUOVAkIVOaMFZNMKaBMREORQOYgiRiAiYAiFMCTzvH/wZsqYENAaBuZ3f67LS1hrTeZ58kDmzlqLTIhzzgkAAJgVGugBAACAwCIGAAAwjhgAAMA4YgAAAOOIAQAAjCMGAAAwjhgAAMA4YgAAAOOIAQAAjCMGgO+ob9++iouL8/svOTlZ/fr10wcffPCDPU9cXJwmTZr0H3+MuLg45eTkVLjf6/WqdevWiouL08KFC/+j5/ohTZ8+Xb/+9a8v+HiPx6OMjAxt3ry56gYFBDFiAPgebr31Vs2dO1dz587VX/7yF40dO1bh4eHq37+/du/eHejh+QkNDdWKFSsq3Ld+/Xp99dVXF3lElduzZ4+mTJmi7OzsC35MRESERowYoVGjRunUqVNVODogOBEDwPcQHR2txMREJSYm6rbbbtNdd92lSZMmKTQ09JL6DluSmjdvrs8//1zbt28vt++vf/2rGjduHIBRndu4cePUuXNn1a1b9zs9Lj09XWFhYXrttdeqaGRA8CIGgB9IVFSUqlevrpCQEL/ty5YtU2ZmppKSkpSamqrf/va3Kiws9Dvmgw8+UK9evZSQkKCMjAzl5ub67e/Ro4fuu+++cs/Zv39/9e3bt9JxpaSkqHbt2lq+fLnf9pKSEq1cuVJdunQp95gdO3bo0Ucf1e23364mTZqodevWevrpp1VcXOw7Jjc3V7169VJSUpJatGihRx55RJ9++qlv/xdffKHBgwerZcuWSkhIUK9evbRmzZpKx7pr1y6tXr1a3bp189u+adMm9enTR4mJiWrbtq1mzpypBx98UKNGjfI7rlu3bpo+fbo8Hk+lzwPAHzEAfA/OOZWUlKikpESnT5/WoUOHlJOTI4/Hox49eviOe+mll5SVlaWEhARNnDhRQ4YM0dtvv62+ffv6Xli3b9+uX/7yl4qOjtaECRP0wAMP6LHHHvN7vp49e2rTpk36/PPPfdsKCgqUl5fn93wVCQ0NVUZGRrlLBXl5eTp16pTatWvnt/2rr75Snz59dPLkSY0dO1Z/+tOf1KlTJ7366quaMWOGpH+/0Ddp0kSTJ0/W008/rU8//VQPP/ywvF6vvF6vBg4cqBMnTujZZ5/VSy+9pJiYGD3yyCN+c/i2N998U3Xq1FHz5s192/bs2aMHH3xQkpSTk6OhQ4dq2rRp2rhxY7nHd+rUSQUFBT/ovRuABdUCPQDgcrR+/Xo1adKk3PbHHntMP/7xjyVJhYWFmjx5su699149+eSTvmNuueUW9enTRwsXLlTv3r01depUXX311Zo8ebIiIiIkSTExMcrKyvI9pmvXrho7dqyWLFmiYcOGSZKWLl2qyMhIpaenn3e8nTt31pw5c7Rt2zY1bdpU0pkzFh06dFBkZKTfsbt27VLjxo01YcIERUdHS5JatWqlvLw8rV+/XoMGDdKWLVtUXFysgQMH+k7n16tXT3/729904sQJnTx5Unv27NGgQYPUpk0bSVJ8fLxeeOGFSq/pv//++2rWrJnf2ZWpU6cqOjpaL7/8sqKioiRJN910U4VnSho2bKgrr7xSeXl5SktLO+/nBcAZxADwPTRp0kS/+93vJJ05S3Ds2DGtXbtW48eP14kTJ5SVlaXNmzfL4/GUO+WdnJysBg0aaN26derdu7c2btyotm3b+kJA+vf17zK1atVSenq6li5d6ouBxYsXq2PHjqpRo8Z5x3vbbbepbt26Wr58uZo2bSqPx6NVq1Zp3L
hx5Y5NS0tTWlqaTp8+rc8++0x79+7Vzp07deTIEcXExEiSEhISVL16dfXs2VOdO3dWmzZtlJycrPj4eElSzZo1FRsbq9GjRys3N1d33nmn0tLS9Jvf/KbScX7xxRdKSkry2/b++++rTZs2vhCQpKSkJDVo0KDCj1G/fn3l5+ef93MC4N+IAeB7qFmzppo1a+a3LS0tTSdOnNDLL7+sfv36+e4LqF27drnH165dW0VFRZLOnEG4+uqr/fZXq1ZNV111ld+2nj17aunSpdqwYYMiIiL0ySef+ILkfEJCQtSxY0etWLFC2dnZevfddxUaGqrU1FQVFBT4Hev1epWTk6M5c+boxIkTqlevnuLj41W9enXfMddff71mz56tadOm6Y033tCMGTN0xRVXqHfv3vrVr36l0NBQTZ8+XZMnT9b//d//adGiRQoPD9ddd92lp556yhcV33b8+HG/F31JOnLkiK655ppyx9apU6fCjxEVFaXjx49f0OcFwBncMwD8gBo3bqySkhLl5+fryiuvlCR9/fXX5Y47dOiQ78U+Jiam3DHOuXI3GaakpOjGG2/UihUrtHz5cjVs2FDJyckXPLbOnTsrPz9fW7du1bJly5Senq7w8PByx02bNk0zZszQE088oQ0bNmj16tWaOHFiuWApO+2/bt06zZgxQ6mpqZoyZYrv3oS6devqqaee0nvvvafFixerf//+WrlypcaPH3/OMcbExPgiqcx1112nw4cPlzu2om2SdOzYsXPGBoCKEQPAD2jTpk0KCwvTDTfcoISEBEVEROjNN9/0O2bDhg3av3+/7ya5O+64Q2vXrtXJkyd9x7z77rs6ffq03+NCQkKUmZmpVatWadWqVerevft3GltiYqIaNGigN998U3//+98r/FcEkrRx40bFxsaqZ8+eqlWrlqQzNyvu2rVLXq9XkjRjxgy1b99eHo9HERERuuOOOzRmzBhJ0oEDB7Rp0ya1atVKW7ZsUUhIiBo3bqysrCzdcsstOnjw4DnH2KBBAx04cMBvW4sWLbR27Vq/ew0+/vjjCi8FOOdUUFBwzksIACrGZQLgezh+/LjfT7s7ffq0/va3v+nNN99Ur169fN9FP/zww3rhhRcUHh6uDh06KD8/XxMmTFBsbKwyMzMlSUOGDNGqVavUv39/PfTQQzp69KjGjx9f4XftmZmZmjRpkpxzuueee77zuDt27KhZs2YpJiZGKSkpFR4THx+vl156SdOmTVNiYqI+//xzTZ06VR6Pxxcst99+u5577jkNGTJE999/v8LCwvT6668rIiJC7dq1U4MGDRQZGamRI0dq6NChql27tnJzc/Xxxx+rX79+5xxfamqq/vKXv8g557uJcNCgQVq2bJkeeugh/fKXv9SxY8c0YcIEhYSElPtnnDt37lRRUZFat279nT83gGXEAPA9/POf/1SvXr18v69evbpuvPFGZWVlqX///r7tZS+Es2fP1rx58xQTE6OOHTtq+PDhvmvjP/rRjzR79myNHTtWWVlZuuaaa/T4449r7Nix5Z63bt26atSoka666irVq1fvO4+7c+fOeuWVV9SpUyeFhlZ8YnDgwIE6evSoZs2apRdffFH16tXTT3/6U4WEhGjq1KkqLCxUo0aNNGXKFL344ot67LHHVFpaqqZNm2r69Om66aabJJ35kcLPP/+8/vCHP+jYsWP60Y9+pN///ve+CKpIenq6XnzxRW3dutV3M2LDhg31yiuv6Nlnn9WwYcN0zTXXaODAgZo8ebJq1qzp9/i1a9eW+6eJAM4vxDnnAj0IABemoKBA7du3V05OjjIyMgI9nCoxaNAgXX311XrmmWcknfl5COHh4X73RxQWFio1NVUjR470nWlwzik9PV19+vTx/VwCABeGewaAy8DHH3+sF154QQ899JCuv/563XXXXYEeUpXJysrS22+/rf3790v69w9lmjFjhtavX6+VK1dq4MCBqlWrlrp27ep73PLly+X1eiv8+Q
MAKseZAeAysHnzZvXv319169bV888/f8m9n8APbdq0adqxY4dycnLk9Xo1ZcoULVmyRAcOHFCNGjWUkpKiESNGqGH
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAgQAAAHBCAYAAAAWz6MMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAimklEQVR4nO3deXSM9+LH8c9knaQaCXJPK3rQi2hIJLVfiSVS1NqiR0vT0lLtrd5ru8VRwi1FW1xrEYpQtKhrqahaSmtJWlcJrSptCUpiS4WsM8/vD8f8jAkyRCbk/TrHOTzzPM985+vJzDvPM5OYDMMwBAAASjU3Vw8AAAC4HkEAAAAIAgAAQBAAAAARBAAAQAQBAAAQQQAAAEQQAAAAEQSAS/DzwACUNAQBcJ3Y2FgFBwfb/tSsWVMRERHq3LmzFi1aJIvF4vQ+hw4dqujoaNu/N2/erCFDhhTlsCVJ48aNU2xsrMPy7du32z2ma39effXVm+7rxIkTCg4OVufOnZWfn+9we1JSkoKDg5WUlHTbce3fv1+tW7dWbm6ucw9I0htvvKFZs2Y5vd2tTJ48WaNHjy7SfQIPAg9XDwAoaUJCQhQXFydJslgsysjI0LZt2/Tee+9pz549mjx5skwm0x3vf8GCBUU00v83Z84cLViwQA0aNHC47dChQypbtqzmzJljt/zhhx++7X4PHjyo+Ph4vfHGG3c0rpycHA0ZMkSDBg2Sl5eXU9vm5uZq9+7deuutt+7ovm+mb9++at26tVq1aqXGjRsX6b6B+xlBANygTJkyCg8Pt1sWHR2tqlWraty4cYqOjlbHjh1dM7gbpKamavz48dq6detNX+B/+uknBQcHOzymwvDz89OMGTMUExOj6tWrO739kiVLZDKZ1KpVK6e3/e677/TQQw/piSeecHrbW/H19dVLL72k8ePHa/Xq1UW6b+B+xiUDoJBiY2P1l7/8RcuWLbNbvnz5crVr1061a9dW8+bNNW3atAJPs1/bR3JyspKTk+1OuR86dEj9+vVTo0aNVKtWLUVFRWnMmDHKzs6+5ZjGjRun48ePa+HChTd94fzpp5/u+EW1b9++KlOmjIYOHer05ZLc3FzNnz9fHTp0sFt+6dIljRs3TjExMQoNDVX79u21YsUKh+23b9+uqKgo29mYtLQ0DRs2TM2aNVNYWJi6du2qzZs3222Tk5OjGTNmqE2bNgoNDVWrVq00Z84cWa1Wu/U6dOign3/+Wdu2bXPqMQEPMoIAKCR3d3c1btxY+/fvt73gz549WyNGjFDjxo01a9Ys9ejRQ/Hx8Ro5cmSB+4iLi1NISIhCQkL06aefqlatWkpLS1OPHj2UlZWl8ePHKz4+Xk8//bQWLVp028sL/fv315o1a1S/fv0Cb8/KytKxY8d0/PhxdezYUbVr11aLFi00b968Qr2xsVy5cho5cqQOHDiguXPn3nb96yUlJenMmTNq06aNbVl2dra6d++uNWvW6JVXXtHMmTNVt25dDR8+3OG9Atu2bVPTpk0lSWfPnlXXrl2VnJysAQMGaNq0aQoKCtKbb76pNWvWSLr6Rs3XX39dc+fOVdeuXTVr1iy1adNG//nPf2yXgK555JFHFBERYdsWAJcMAKdUqFBBeXl5unjxory9vfXRRx+pW7dueueddyRJkZGR8vf31zvvvKNevXo5nGavVq2aypQpI0m2U/g//PCDnnjiCU2ZMsV229/+9jft2rVL3333nV5//fWbjqdGjRq3HO/PP/8sq9WqY8eO6Z///KfKli2rzZs364MPPtCff/6pAQMG3PYxt23bVhs2bND06dMVHR1d6EsHu3fvlp+fn6pWrWpb9vnnn+vw4cNasmSJ6tatK0mKiopSfn6+Zs6cqeeff17+/v5KTU1VamqqmjRpIkmaP3++zp8/r8TERD322GOSpGbNmqlnz556//331b59e33zzTfauXOnPvjgA9slnSZNmshsNmvKlCl6+e
WXVa1aNdtYQkNDtW7dukI9FqA04AwBcAdMJpP27t2rrKwsRUdHKz8/3/bn2icKduzYUah9RUZGavHixfL29tZvv/2mrVu3atasWTp//vwdvTP/eo8//rji4+O1ZMkStWnTRo0bN9Y777yjrl27at68ebp06VKh9hMXFydfX18NGzas0JcOUlNTFRQUZLcsOTlZQUFBthi4pmPHjsrJydG+ffskXb1cEB4eLj8/P9t2ERERthi4frv09HT9+uuvSk5Olru7u9q2beuwjiSHT0QEBQXp3LlzysrKKtTjAR50nCEAnHDmzBmZzWb5+/vr4sWLkqTXXnutwHXT0tIKtU+r1apJkybpk08+0ZUrV/Too48qLCxM3t7edz1ePz8/22n36zVv3lzLly/X0aNHC/Vmw/Lly2vEiBEaNGiQ5s2bpzp16tx2m8zMTPn4+Ngty8jIUIUKFRzWvbbszz//lGR/ueDadpUqVbrldhkZGQoICJCHh/3TWmBgoCQ5xI+vr69t+Y3jBEojggAoJIvFouTkZD355JNyd3e3fff64YcfqkqVKg7rF/TCV5BrHxkcNWqUWrdubfu0QNeuXe96zAcOHFBKSoqef/55u49KXnuzYkBAQKH31b59e23YsEHTpk3T0KFDb7t+QECAQxSVLVtWx44dc1g3PT3dtk1OTo7tvQLXb3f27Nlbble2bFlduHBB+fn5dlFwbQw3PtaMjAyZTCb5+/vf9rEApQGXDIBCWrZsmdLS0vTCCy9IkurUqSNPT0+dOXNGoaGhtj+enp6aOHGiTpw4UeB+3Nzsv+z27NmjatWqqWvXrrYYOHPmjA4fPuzw7nhnHTp0SKNGjdLu3bvtlq9fv14VK1Ys8LvuWxk1apR8fX01efLk265bsWJFnT592u7Ni/Xr19fJkye1Z88eu3XXrFkjT09PhYWFKSkpSWXKlLH7ZET9+vW1d+9epaamOmwXGBioypUrq0GDBrJYLFq/fr3DOpIcLlOcPn1aFSpUcPrnIwAPKs4QADfIzMzUDz/8IOnq6fwLFy7o22+/1aeffqqOHTvaPlMfEBCg3r17a8qUKcrMzFTDhg115swZTZkyRSaTSTVr1ixw/35+ftq7d6927dqlkJAQhYWFaebMmZozZ47Cw8N17NgxzZ49W7m5uXd9fbtt27aaN2+e3n77bfXv31+BgYFau3attmzZosmTJ8vd3d2p/VWoUEHDhw/Xv/71r9uu26RJE82ZM0e//PKL7c2PnTt31pIlS9SvXz/94x//0GOPPaYtW7Zo5cqV6tevn/z8/LR9+3aHyxy9evXSmjVr1KtXL/Xr108BAQH673//q927d+u9996Tm5ubmjZtqoYNGyouLk5paWkKCQlRcnKy4uPj9eyzz9q9oVC6GmJRUVFOPX7gQUYQADf48ccf1a1bN0lXv5svX768qlatqvHjxzt8pv7ai+ySJUs0d+5clS1bVo0bN9bAgQNv+oOCevTooQMHDqhPnz4aN26c+vbtqwsXLighIUEzZszQo48+qk6dOslkMmn27NnKyMhQ2bJl7+ix+Pr6auHChZo8ebKmTJmiCxcuqHr16po+fbpiYmLuaJ8dO3bUhg0bHH4GwI3q1aun8uXLa9u2bbYg8PHx0aJFizRx4kRNnTpVmZmZevzxxzV27FjbJZLt27dr4MCBdvsKDAzU0qVLNXHiRI0dO1Z5eXmqWbOmZs6cqZYtW0qSbb6mTp2qhIQEnT9/XpUqVdKAAQPUq1cvu/2dOXNGhw4dUv/+/e9oDoAHkcngt6wAuEc+/vhjLVu2TF9++eVd/bjnojZ9+nRt2rRJq1atKlHjAlyJ9xAAuGe6d+8ui8WiDRs2uHooNpmZmVq6dKkGDhxIDADXIQgA3DNms1kffPCBJk+efNc/U6GozJ49Wy1btizw45hAacYlAwAAwBkCAABAEAAAABEEAABAhfw5BFarVfn5+XJzc+NduQAA3CcMw5DVapWHh4fDT0m9Ua
GCID8/XykpKUUyOAAAULxCQ0Nv+2O6CxUE16oiNDTU6R91CudZLBalpKQw38WMeXcN5t01mHfXKc65v3Zftzs7IBU
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAgQAAAHBCAYAAAAWz6MMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAk70lEQVR4nO3deVxU9eL/8ffINpi5pbZg1xaXIkFwQxMtrSDXrootrlndLJVCrbyWhWYmLWqKVmK55Fai2dXE9JZlViqpuHXzYXpvipqIYgbKIjOf3x/+mK8jA6IOjMLr+Xj4eOCZM5/5zMcj8+KcASzGGCMAAFChVfL0BAAAgOcRBAAAgCAAAAAEAQAAEEEAAABEEAAAABEEAABABAEAABBBAHgEPw8MwJWGIADO0a9fPzVq1Mjx54477lBoaKh69OihefPmyWazXfSY//znP9WhQwfH37/55huNHDnSndOWJE2YMEH9+vUrtD0jI0OjR49WeHi4QkNDNWDAAO3YsaNEY9rtdiUmJqpPnz4KCwtT06ZN1b17d33yySfKy8sr0RgHDx7Uvffeq4yMjIt6PpL0xhtv6NVXX73o+xVn8eLFGjRokFvHBMoDggA4T2BgoD777DN99tlnWrBggSZOnKigoCC9+eabGjFixGV/dT9nzhz98ccfbprtWQkJCZozZ06h7Xa7XYMHD9a6dev0wgsvaOrUqfL29taAAQP0+++/Fztmdna2Bg4cqPHjxys4OFhxcXGaOnWqwsPD9e677+rZZ5+9YBQYY/Tyyy9rwIABqlmz5kU/r++//17t2rW76PsVJyoqSmlpaVq6dKlbxwWudt6engBwpalSpYpCQkKctnXo0EG33nqrJkyYoA4dOqhbt26emdx5UlNTFRcXp2+//VbXXnttods3bdqklJQUJSQk6J577pEkNW/eXK1atdLSpUs1YsSIIseeMGGCtm7dqnnz5jmtR3h4uAIDAxUTE6MFCxZo4MCBRY7x73//W7t379bMmTMv+rnt379fhw8fVuvWrS/6vsWpVKmSnn76aY0fP15dunSRn5+fW8cHrlacIQBKqF+/fqpTp44+/fRTp+2JiYnq3LmzGjdurHvvvVfx8fHKz88vcozk5GQlJyerUaNG2rRpkyRp9+7dGjp0qFq1aqW77rpLbdu21RtvvKGcnJxi5zRhwgQdOHBAc+fO1Z133lno9iZNmmjRokVq06aNY5uPj48sFkuxX91nZGRo6dKl6tmzZ6E4kqSOHTvqySef1A033FDs/GbMmKGIiAinF93c3FxNnz5dDz74oIKCghQREaGEhATZ7Xan+65bt05NmzZVlSpVJEmZmZmaMGGC7r//fgUFBalLly5asmSJ031sNpsWLFigrl27Kjg4WPfee6/effdd5ebmOu133333KScnp9D9gYqMMwRACXl5eal169ZKSkpSfn6+vL29NWPGDE2ePFl9+/bVqFGj9Ouvvyo+Pl5//PGH3nzzzUJjxMbG6sUXX3R8XL9+fR09elR9+vRRSEiI4uLi5Ovrq++++05z585VrVq19MwzzxQ5p5iYGDVo0EAWi8Xl7ZUrV1bTpk0lSfn5+UpNTdW0adNkjFGPHj2KHHfDhg3Kz89X+/bti9znpZdeKvI2Sfrvf/+rXbt2adiwYY5txhg988wz2rZtm4YMGaI777xTmzZt0nvvvafU1FSNGzfOse+6desclwtycnLUu3dvHTt2TNHR0br55pv19ddf65VXXtGxY8cca/Taa6/piy++0FNPPaWWLVvqP//5j6ZPn65ff/1VH330kWOd/Pz81L59e61YsUJ9+vQp9nkAFQVBAFyEWrVq6cyZM/rzzz/l5+enDz74QI888ohGjx4t6ezp9OrVq2v06NEaOHCgGjRo4HT/+vXrO77iLfjKe9u2bbrzzjs1ZcoUx2133323NmzYoJ9//rnYIGjYsGGJ5z5mzBglJiZKkoYMGVLsfY8cOSJJql
u3bonHP9/GjRslScHBwY5t33//vX766Se98847jssubdq0kdVq1ZQpUzRgwADVr19f2dnZ+vnnnx1vvvz888+1Z88eLVy4UM2aNZMktW3bVvn5+Xr//ff16KOP6tixY1qyZIliYmL07LPPOsauU6eOXnrpJX3//feOyyaSFBQUpKSkJGVlZTnWHajIuGQAXAKLxaKUlBRlZ2erQ4cOys/Pd/wp+I6CH3/8sURjhYeHa/78+fLz89P//vc/ffvtt/rwww+VkZFR4nfyl8TDDz+sefPmaciQIUpISNBrr71W5L6VKp391HD+afyLkZqaqqpVq6pq1aqObcnJyfLy8lKnTp2c9i2Ig4JLKJs2bVLNmjUd0ZKcnKyAgABHDJx7v9zcXG3fvl3JycmSpK5duzrt07lzZ3l5eTnGLhAQECCbzeaIH6Ci4wwBcBHS0tJktVpVvXp1/fnnn5Kkp59+2uW+R48eLdGYdrtdkyZN0oIFC3T69GndeOONCg4Odvub3Qq+Um/ZsqWMMfrggw80ZMgQl+8DCAgIkCQdPny40FmOAunp6apRo4a8vV1/GsnKypK/v7/TtpMnT7q8T+3atSWdfZ+A5Hy5oOB+tWrVKvQYBdv++usvnTx50mmsAt7e3qpRo4Zj7AKVK1d2ekygoiMIgBKy2WxKTk5W06ZN5eXl5fjK991339Utt9xSaH9XL2CuFHzL4JgxYxQZGen4boGoqKjLnvNvv/2mHTt2qGfPnk7bg4KCZIzRkSNHXAZBq1at5OPjo3Xr1jmdZj/XoEGDlJ2drVWrVrm83dWLcLVq1XTixAnHezAKFMRTjRo1JJ29tDBq1Cin++3fv7/QY6Snpxd6rPT0dKdLHWfOnNGJEyccYxcoCIjztwMVFZcMgBL69NNPdfToUT322GOSzr6D38fHR2lpaQoKCnL88fHx0cSJE3Xw4EGX4xScji+wZcsW1a9fX1FRUY4YSEtL0549ey7rlL0kbd++XS+//LK2bt3qtH39+vXy9fXVbbfd5vJ+VatWVVRUlBYvXuzyhxh9+eWX+uWXX/TQQw8V+dg33XSTTp8+7Xjhlc6enbDZbEpKSnLad/ny5ZKkZs2aad++fUpLS3P6dsMWLVro0KFD2rJlS6H7+fj4KDg4WC1btpQkrVixwmmflStXymazFbrccOTIEXl5een6668v8jkAFQlnCIDzZGVladu2bZLOns4/ceKEfvjhB3322Wfq1q2bIiIiJJ39yvKpp57SlClTlJWVpbCwMKWlpWnKlCmyWCy64447XI5ftWpVpaSkaMOGDQoMDFRwcLDef/99JSQkKCQkRPv379eMGTOUl5en7Ozsy3ounTp10qxZszRixAg999xzql27tr755hstWrRIw4YNc7q+f77hw4dr586dGjBggOMnFebn52v9+vVavHix2rVrp6eeeqrI+xd8q+PWrVsd363Qrl07hYWFKTY2VkePHlVgYKCSk5M1c+ZMde/eXfXr19fs2bPVrFkzXXPNNY6xevTooYULF2ro0KF67rnndPPNN2vt2rVaunSphg4d6nivQvfu3TVt2jTl5OQoLCxMv/76q6ZNm6awsDC1bdvWaX5btmxR8+bNC13WACoqi+GHqgMOBT8noEClSpV03XXX6dZbb1WvXr3UtWvXQt/it2DBAi1cuFD79+9XtWrV1Lp1aw0fPlw33XSTpLM/ujg5OVlr166VdPbd96NGjVJ6eromTJigyMhIxcXFac2aNcrMzNSNN96ozp07y2KxaMaMGfrhhx9UrVq1Es1dkubNm+e0PT09XZMnT9YPP/ygEydO6Pbbb9fjjz+uv//97xcc8/Tp05o/f76SkpJ08OBBGWNUr1499ezZU7169ZKvr2+x9+/Ro4eCg4M1ZswYx7bs7GxNnTpVK1euVEZGhurWrauoqCgNHDhQXl5eGjhwoMLDw/Xkk086jZWRkaGJEydq7dq1ysrK0m233aZ+/fo5XVqx2WxKSEjQ0q
VLdeTIEdWpU0ddunTRkCFDCv0shLZt2yomJka9e/e+4DoAFQFBAKDUrF69Wi+//LLWr1/veBPflWDZsmWaOHGivv7
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"numerical_cols = penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])\n",
"\n",
"# Creare un boxplot per ogni colonna numerica\n",
"for col in numerical_cols:\n",
" sns.boxplot(x=penguin_manager.get_df()[col])\n",
" plt.title(col)\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are evident no outliers in the dataset seen with the box plot.\n",
"\n",
"I will calculate using the standard deviation method."
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" <th>feature</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [studyName, Species, Island, Individual ID, Clutch Completion, Date Egg, Culmen Length (mm), Culmen Depth (mm), Flipper Length (mm), Body Mass (g), Sex, Delta 15 N (o/oo), Delta 13 C (o/oo), feature]\n",
"Index: []"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numerical_cols = penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])\n",
"\n",
"means = numerical_cols.mean()\n",
"stds = numerical_cols.std()\n",
"\n",
"# set the threshold to 3 standard deviations\n",
"threshold = 3\n",
"\n",
"outliers = pd.DataFrame()\n",
"\n",
"for col in numerical_cols.columns:\n",
" col_outliers = penguin_manager.get_df()[(penguin_manager.get_df()[col] < means[col] - threshold * stds[col]) |\n",
" (penguin_manager.get_df()[col] > means[col] + threshold * stds[col])]\n",
" col_outliers['feature'] = col\n",
" outliers = pd.concat([outliers, col_outliers])\n",
"\n",
"outliers.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even this method doesn't find any outliers for 3 standard deviations.\n",
"\n",
"Calculating the z-score (how many st dev from the mean):"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: scipy in c:\\users\\stefa\\miniconda3\\envs\\mchinelearning\\lib\\site-packages (1.9.3)\n",
"Requirement already satisfied: numpy<1.26.0,>=1.18.5 in c:\\users\\stefa\\miniconda3\\envs\\mchinelearning\\lib\\site-packages (from scipy) (1.23.5)\n"
]
}
],
"source": [
"warnings.filterwarnings('ignore')\n",
"!pip install scipy\n",
"from scipy.stats import zscore\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [studyName, Species, Island, Individual ID, Clutch Completion, Date Egg, Culmen Length (mm), Culmen Depth (mm), Flipper Length (mm), Body Mass (g), Sex, Delta 15 N (o/oo), Delta 13 C (o/oo)]\n",
"Index: []"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z_scores = np.abs(zscore(penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])))\n",
"\n",
"# print the rows with z-score > 3 or < -3\n",
"penguin_manager.get_df()[(z_scores > 3).any(axis=1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overall, no outliers have been found.\n",
"\n",
"At this point I can say that there are no important outliers in the dataset, so I can move on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7 Analyze relationships between variables\n",
"\n",
"Using scatter plots, correlation matrices, and cross-tabulations to explore relationships between pairs of variables. This will help identify potential multicollinearity, interactions, or confounding variables that may need to be addressed in your analysis.\n",
"\n",
"I will create some scatter plot and a bubble plot (no need for both, it's for future reference, where I could add a third numerical dimension) to see if there are some correlations between the features:"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABdIAAAm6CAYAAAA/8frkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gU5fo38O/spvdCGiEk1EAIvRelCIoK9i6CWA+ix4Oe31HPsWE9+tobx4YIKCpKVYlKEekBkkAKnfTe+ya7O/P+sW5Mz5bZmu/nurwku7PPPFN39p5n7luQJEkCERERERERERERERF1SmHrDhARERERERERERER2TMG0omIiIiIiIiIiIiIusFAOhERERERERERERFRNxhIJyIiIiIiIiIiIiLqBgPpRERERERERERERETdYCCdiIiIiIiIiIiIiKgbDKQTEREREREREREREXWDgXQiIiIiIiIiIiIiom4wkE5ERERERERE5CAkSbJ1F4iIeiUG0omIiIiIiIjIau666y7Exsa2+W/ChAlYvHgxEhMTjW5v06ZNiI2NRV5eXpfT5OXlITY2Fps2beq2rTlz5uDJJ580ug+miI2Nxfvvv2/UZzZu3IjXXntNlvm///77HbbDmDFjcO211+Kbb76RZR6AbnvfddddZrcRGxuL2267rctpVqxYgdjYWKttPyLqfVxs3QEiIiIiIiIi6l3i4uLw3HPPAQC0Wi0qKyuxYcMG3Hvvvdi0aROGDBli4x7ap1WrVmHSpEmytvntt98CAERRRF1dHf744w8899xzUCqVuPnmm2WdlzkUCgVSUlJQWFiIiIiINu81Njbi999/t03HiKjXYCCdiIiIiIiIiKzKx8cHY8aMafPatGnTMHXqVGzatAlPPPGEbTrWC7XfDpdeeilOnz6Nb775xq4C6XFxcTh//jwSEhKwdOnSNu/t3r0b7u7u8PX1tVHviKg3YGoXIiIiIiIiIrI5T09PuLu7QxCEltc6Swty5MgRxMbG4siRI21eT0pKwnXXXYeRI0di4cKF+PnnnzvMo7i4GA8++CBGjRqFmTNn4r333oNWq+2yT01NTXj99dcxc+ZMxMfHd9luZ/3bv38/7rzzTowaNQrz5s3D+vXru/1cSUkJnnrqKcycOROjRo3CTTfdhF27drW8P2fOHOTn52Pz5s09prIxl7+/f5vtAACpqam49957MXnyZIwbNw5/+9vfcO7cuTbTFBQU4OGHH8b48eMxffp0fPHFF23ef+211zBq1CjU1ta2ef2TTz7B2LFj0dDQ0GWfvLy8MHPmTOzYsaPDez///DPmz58PF5e240UrKiqwcuVKzJ49G/Hx8Zg0aRKWL1/eZt3l5uZi2bJlmDx5MkaPHo1bb70Ve/fubXm/qakJK1euxKWXXor4+HjMnz8fq1ev7rKfROS8GEgnIiIiIiIiIquSJAkajQYajQZqtRqlpaV466230NzcjBtvvNGkNp955hnMnz8fH374IQYPHowVK1Zg//79baZ5//33ERQUhA8//BA33ngj/ve//+G9997rso/Lly/HN998g6VLl2LVqlUYO3YsVqxYgS1btvTYnxUrViAuLg4ffvghpk+fjhdffBHr1q3rdNqysjLcdNNNSExMxIoVK/D+++8jMjISy5cvx7Zt2wAAH3zwAUJCQjBz5kx8++23CA0NNW4FdUG/HTQaDWpqavDjjz/ijz/+wKJFi1qmOXz4MG6//XaIooiXX34ZL730EgoLC3HbbbfhwoULAICGhgYsWrQIp0+fxgsvvIBnn30WGzduRHJycks7N910E5qampCQkNCmD1u2bMH8+fPh5eXVbV+vuuoqnDhxAgUFBS2v6dPRLFiwoM20kiThwQcfxIEDB/D444/j888/x0MPPYSDBw/i2WefBaBLZ/Pggw+ioaEBr7/+Oj766CMEBATgoYceQnZ2NgDg5Zdfxt69e/HEE0/g888/x2WXXY
bXXnutx3z7ROR8mNqFiIiIiIiIiKzq6NGjGDFiRIfXH3vsMQwaNMikNpcvX44HHngAgC49SVZWFj744APMmDGjZZqpU6fi1VdfBQBccsklqKurw9q1a3HPPffA39+/TXsHDx7Evn378Pbbb+Oqq65q+UxjYyPeeOMNLFiwoMMI6Nbmzp2L//znPy2fKykpwapVq3DnnXdCoWg7rvGLL75ARUUFduzYgaioKADAzJkzcffdd+P111/HggULEBcXBzc3NwQFBXVIx2KOzrbDnDlzWpYZAN58801ERUXhs88+g1KpBADMmDED8+bNw/vvv4933nkHmzdvRkFBAbZu3YrY2FgAaBmNrzdo0CCMHTsWW7dubUkbc/LkSVy4cAEvvPBCj32dNWsWvLy8kJCQgHvuuQcA8NtvvyEoKAjjx49vM21JSQk8PT3xxBNPYMKECQCAyZMnIy8vr6WYanl5OS5cuIC//e1vmDlzZkufP/jgAzQ1NQEAEhMTMW3aNFx99dUtbXh5eSEwMLDH/hKRc+GIdCIiIiIiIiKyqhEjRuD777/H999/j40bN+Lzzz/HkiVL8Pbbb+Ptt982qc0rr7yyzd9z585FSkoK6uvrW15rHRwGgMsvvxwNDQ1ISUnp0N6hQ4cgCAJmzpzZZtT2nDlzUFpa2iGtSXvXXntth3mVl5cjMzOzw7SJiYkYO3ZsSxBd75prrkFpaSkuXrzY7bz0Wo/01//XE/12+P7777Fu3Tr861//wrFjx3DvvfdCq9WioaEBqampuOqqq1qC6ADg5+eH2bNnt6TYOXbsGKKiolqC6AAQERHRIeh/44034tixYy3pVTZt2oT+/fu3BLu74+HhgTlz5rRJ7/LTTz/hqquu6pCKJiwsDGvXrsWECRNQUFCAQ4cOYf369UhKSoJarQYA9OnTB4MHD8YzzzyDJ598Ej///DMkScJTTz2FoUOHAtAFzjdu3Ij7778fX3/9NfLz87F8+XLMnj27x/4SkXPhiHQiIiIiIiIisipvb2+MHDmyzWszZsxAQ0MDPvvsMyxevBjBwcFGtRkSEtLm7+DgYEiShLq6upbX+vTp02aaoKAgAEB1dXWH9qqqqiBJEsaNG9fp/EpKSjB8+PAu+9M+9Yp+eWpqajpMW11djX79+nV4Xd/fzj7Tmc2bN+Opp55q89quXbs6bVuv/XaYNGkSQkJC8H//93/YtWsXRo8eDUmSOqw7ff/0+c6rq6tb1mdrISEhKCsra/n7qquuwiuvvIJt27bhvvvuw44dO7BkyRKDlg/Q3TDR5zn39vbGoUOH8I9//KPTabdt24a33noLhYWFCAgIwLBhw+Dh4dHyviAIWL16NVatWoXffvsNmzdvhqurK+bOnYvnn38eAQEB+M9//oPw8HBs27YNK1euBACMHTsWzz77LOLi4gzuNxE5PgbSiYiIiIiIiMguDB8+HBs3bkReXl5L4Ll9MdCuClJWV1e3CZKWlZVBqVTC39+/JZDbPiCtf72zoL2vry+8vLywdu3aTucXHR3d7bJUVVW1+bu8vLzLebXuY2ulpaUAYHAakdmzZ+P7779v85opudT1NwiysrIwY8YMCILQZf8CAgJa+qjPK95a+/Xg7e2N+fPnY8eOHRg+fDhqampw3XXXGdy3Sy+9FL6+vvjll1/g6+uLfv36IT4+vsN0x44dwxNPPIFFixbh3nvvRXh4OADg9ddfx/Hjx1umCwsLw/PPP4/nnnsOp0+fRkJCAj799FP4+/tj5cqVcHNzw7Jly7Bs2TIUFBRgz549+Oijj/D44493WviUiJwXU7sQERERERERkV1ITk6GUqlsSXHi4+ODoqKiNtMkJSV1+tl9+/a1/FsURSQkJGD06NFtguutpwF0aUE8PT0xevToDu1NmjQJDQ0NkCQJI0eObPnv3Llz+PDDD3tMm7J79+42fyckJCAyMhL9+/fvMO3EiRORnJyM3NzcNq9v27YNISEhLUH79rnV2wsMDG
zT15EjR8LNza3bz3RGn+omJiYGXl5eiI+Px88//9zmpkZtbS1+//33ltzkU6ZMQV5eHlJTU1umqaio6DRtzk033YS
"text/plain": [
"<Figure size 1500x2500 with 10 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"numerical_cols = penguin_manager.decode().select_dtypes(include=['float64', 'int64'])\n",
"columns = numerical_cols.columns\n",
"\n",
"nrows = len(columns) - 1\n",
"ncols = 2\n",
"\n",
"fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 5 * nrows))\n",
"\n",
"for i, col1 in enumerate(columns[:-1]):\n",
" for j, col2 in enumerate(columns[i + 1:]):\n",
" if j * 2 < ncols:\n",
" axs[i, j * 2].scatter(penguin_manager.get_df()[col1], penguin_manager.get_df()[col2])\n",
" axs[i, j * 2].set_xlabel(col1)\n",
" axs[i, j * 2].set_ylabel(col2)\n",
" if j * 2 + 1 < ncols:\n",
" axs[i, j * 2 + 1].scatter(penguin_manager.get_df()[col1], penguin_manager.get_df()[col2], s=penguin_manager.get_df()['Body Mass (g)'] / 20)\n",
" axs[i, j * 2 + 1].set_xlabel(col1)\n",
" axs[i, j * 2 + 1].set_ylabel(col2)\n",
" axs[i, j * 2 + 1].set_title('Bubble plot - Body Mass')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some couple also have a good correlation, some of them appear to be grouped in a cluster, it may be the \"Sex\", the \"Species\" or the \"Island\". But I will check that.\n",
"\n",
"I will do the same with sns.pairplot, but with all features and I will add the sex, island and species as hue."
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<seaborn.axisgrid.PairGrid at 0x1e733195730>"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABUMAAATPCAYAAAAroOSBAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3hUVfrA8e/0Pum9kZDQe+/YUbF3RVHsXXftu6697OrPuq69gaKIiAXErhSll0DoSYD03qb3+f0xEAiZCQHSOZ/n2Yd17sydc2/unXvve855X4nf7/cjCIIgCIIgCIIgCIIgCILQw0k7uwGCIAiCIAiCIAiCIAiCIAgdQQRDBUEQBEEQBEEQBEEQBEE4IYhgqCAIgiAIgiAIgiAIgiAIJwQRDBUEQRAEQRAEQRAEQRAE4YQggqGCIAiCIAiCIAiCIAiCIJwQRDBUEARBEARBEARBEARBEIQTggiGCoIgCIIgCIIgCIIgCIJwQhDBUEEQBEEQBEEQBEEQBEEQTggiGHoIv9+P1+vF7/d3dlMEQWiBOFcFoesT56kgdA/iXBWErk+cp4IgCG1LBEMP4fP5yM7Oxufztfv3bNmypd2/p72J7ehaesp2tEZHnavdzYl0DLQlsd/ax7Gcp93tbyHa2766U3u7U1sP19pztTtvY0cT+6r1xL5qnSOdpz11P4rt6l566nYJPZMIhnYCv9+P2+3u9j17Yju6lp6yHcKxE8fAsRH7revobn8L0d721Z3a253aeqxOhG1sK2JftZ7YV22jp+5HsV3dS0/dLqFnEsFQQRAEQRAEQRAEQRAEQRBOCN0yGFpfX8+DDz7I2LFjGT16NLfffjuVlZUAbN68mUsvvZThw4dzyimn8OWXX3ZyawVBEARBEARBEARBEARB6Aq6ZTD0rrvuwmaz8csvv/DHH38gk8n417/+RUNDAzfffDMXXHAB69at49lnn+X5559ny5Ytnd1kQRAEQRAEQRAEQRAEQRA6mbyzG3C0tm7dyubNm1m5ciV6vR6Ap59+mqqqKn7++WfCw8OZMWMGAOPHj+fcc89l7ty5DBkypDObLQiC0KP5fIHcQFKppJNbIgiC0DKPz4dc2i3HAwhCj6TVavH6/cg6uyFCuxO/v4IgdBXdLhi6ZcsWMjMzmT9/Pp9//jl2u53Jkyfz0EMPkZubS58+fZq8PzMzkwULFhzVd3i93rZscsj1t/f3tDexHV1LV90Omaz9bm272rZ2ts44BqotLnaWm/lyQzEKmZQrxqSQEa0jUqvosDYcr6567nS09jpXj2a/dre/hWhv+2rL9vp8fspMTn7eXsG6fbX0iTNw4fAk4o0qVPLjfzDvqH3bmdfU7nb8dCaxr1rH7/dT2uBkVbWCV9dnkx6t4+KRySQa1agV3Ttg1tHX1K58zLk8PspMTr7eVMLuCjOjekUybUAcCUbVETvRu/J2HQ+xXV1De15Tha5P4u9mpb7eeust3njjDS6++GIefPBBHA4HDz74IAqFgujoaNxuNy+88ELj+7/88kveffddfvnllyOu2+v1kp2d3Y6tF4QTz8iRI9t8neJc7RrC4lN55Ltc1hXUNXn9nMHx3D05EXNVaSe1TDgWbX2uivNU6CqUSiVeQzxXf7QRi9PT+LpcKuGtq4aSKDPjsFk7sYWtJ66pQk+hUCiQhidx1UfrMdkPnpdSCbx++RDS1XYcVnMntvD4iGtqgFqrpcwXxq1zs/H4DoYddEoZn8waicJSjsvl6sQWCiey9rimCt1HtxsZqlQqAfjnP/+JSqVCr9dz7733ctlll3HRRRfhcDiavN/hcKDT6Y7qOwYPHtzuPe85OTnt/j3tTWxH19JTtuNonEjb2hodfQzMX1/SLBAKsDinnMtHpzJh2LB2b0NbOBHPnY50NPu1u/0tRHvbV1u1t9bq5or31jQJhAJ4fH7+9uVWfrp3MvFGVZdoa2c6Utt7wjZ2FLGvjqze7ua6ww
KhAD4/3LdgK7/+fSr9wo7vvOyJQh1TXfWYKzc5ueLVFU0CoQBWl5cHFm7j85vGEaULPZuoq27X8RLbJQidr9sFQzMzM/H5fLjdblSqwAXS5/MB0L9/fz777LMm78/LyyMrK+uovkMmk3XIydtR39PexHZ0LT1lO1rjRNrWo9ER+6XG4mT2qn0hl3+8ah+j0yNQK7rPZUYcT+3jWPZrd/tbiPa2r+Ntb53dRn6VJegyi9NDSb2dpAjtMa//UN1t3x6qtW3vztvY0cS+Cq3BbienpCHoMqfHR36VhZTItjkve5IjHVNd7Zgra3BgPqwj6oD8Kiv1djexRvUR19PVtqutiO0ShM7T7ZKxTJgwgZSUFP7xj39gtVqpra3llVde4bTTTuOcc86hurqajz/+GLfbzerVq1m0aBEXX3xxZzdbaCs+H2yeB5u/AF/3yEUiCD2Rzw92d+hz0ObyNhsFIAiC0Bk8Xl+Ly53ulpcLgtD2PN6W7xHsLnGf3xM4PC3/vh7p91kQBKG9dLtgqEKh4JNPPkEmkzFt2jSmTZtGfHw8zz33HBEREXz44Yf8+OOPjB07lkcffZRHH32UcePGdXazhbay6B74+hb4+mZY/PfObo0gnLDCtXLOGhQfcvmFw5PQq7pPESVBEHqucK2SSJ0y6DKZVEJqlBh9JggdLUyjICEs+IhAiQT6xRs6uEVCe0iJ0CALUSQpQqsgXBv8t1kQBKG9dZ/5i4eIi4vjlVdeCbps8ODBzJs3r4NbJHSI4g2waQ6MvRX8Plj7Hoy+ARKGdHbLBOGEo5DJmDE2jfnri6m1Nk18nxalZXJmdCe1TBAEoak4o5qnzhvInZ9varbsjpN6E60XeQkFoaPFhal59sLB3DB7HYeX8501oZc4L3uIaL2Ku07O5NXfcpste/L8QcS1Yoq8IAhCe+h2I0OFE9jK1yA8FfqcBX3PBkM8rH6rs1slCCeslEgt39wxkWvGpRKhVRBjUHHHSb35/KZxJIRrOrt5giAIQGD059S+Mcy/ZTyje0VgUMnpn2Dg7atHMGtiL3Sqbjk2QBC6vbHpkSy8bQLjMyIxqORkxep5/Yph3HFyJgaNmF3SE+hUcmZO6MU714xkQIIRg0rOqF4RzL9lHCf3jQk5alQQBKG9ibs/oXtwNMCuH2DY1SDdn4w5fQrs/B48LpCLKRaC0BlSI7U8es4A7jg5C4kEInVKFDLRzyYIQtdiUCsYkx7JezNHYXd7UcqkRImRZ4LQqXQqOUOSjDxxejKGiCiUMhnRBnFe9jSROiXTBsYzKi0Cl9eHRiET0+MFQeh04olV6B52LAavOxAAPaDXZHA2wJ6lndYsQRBAJZcRH6YmzqgWgVBBELq0cK2ShDCNCIQKQhdiqa0kzqASgdAeLkqvIiFMIwKhgiB0CeKpVege8n6BmL6gOyQPYXga6OMh//fjXn21zUthre241yMIgiAIgiAIgiAIgiB0XSIYKnR9Pl9g9GfC0KavSyQQNwAK/jqu1X+6upBbvq/ilJeW89mawuNalyAIreR1B9JfeN2d3RJBOC5yuRyJxwEOE82qgAiC0DN4PYFz3OPs7JYIgtDT+XyB3xu3vbNbIgg9msgZKnR9FTlgr4OEYc2XxQ6E1f8LXDDUxqNe9c5yE08t3sHoRBVGo5F/fbOViZlRpEXpjr/dgiA057ZDXQGsfTdwbscNhjE3Q0QaKETRJaF7kdhr6a+uRPLVc4Hgfv/zoP85gWJ/giB0fx43NBTCho+haA1EZsC42yEyHVSGzm6dIAg9TX0hbPsadi0BTRSMvx1i+oMuqrNbJgg9jgiGCl3fvj9BpoSYfs2XxQ0Evw+K10LmaUe96neW7SFCp+Cs3hrSM1LZUFDPm3/k859LhrRBwwVBaMLnDZzPn18e+P8ARWth48dw5RfQ+5SDBdIEoauz1SL541mUGz46+FrhKlj5Glz/E0T06rSmCYLQRso3w8fTweMI/HfRGtj8OVzwFgy8CB
Tqzm2fIAg9R+0e+OB0sFYffG3X9zD+Dpj8AGgjOq9tgtADiWnyQtdXvA6is0CmaL7MmARKPZRmH/Vqq8xOFm0u5cy
"text/plain": [
"<Figure size 1355.25x1250 with 30 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot the pairwise relationships between the columns\n",
"sns.pairplot(penguin_manager.get_df(), vars=numerical_cols.columns[:-1], hue=\"Sex\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No important correlation with sex found in the scatter plot, which is expected since now I know they can change sex."
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<seaborn.axisgrid.PairGrid at 0x1e737990340>"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABUoAAATPCAYAAADXtc9LAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3xTVRvA8V9Wk3TvxSh7j7L3UhAEUXCggqgoTtyKAxEVVBQHvLhxgCjIEgeIOEA2skfZm7Z07zZpmvn+cWhLaFJm08H5fj583td7kpuTNDe5ee5znkfhcDgcSJIkSZIkSZIkSZIkSZIkXcOUlT0BSZIkSZIkSZIkSZIkSZKkyiYDpZIkSZIkSZIkSZIkSZIkXfNkoFSSJEmSJEmSJEmSJEmSpGueDJRKkiRJkiRJkiRJkiRJknTNk4FSSZIkSZIkSZIkSZIkSZKueTJQKkmSJEmSJEmSJEmSJEnSNU8GSiVJkiRJkiRJkiRJkiRJuubJQKkkSZIkSZIkSZIkSZIkSdc8GSg9h8PhwGaz4XA4KnsqkiSVQx6rklQ9yGNVkqoHeaxKUvUgj1VJkqSKJwOl57Db7ezevRu73V6pc9i7d2+lzuFKVPf5g3wO1UFVOFarqpr+t68o8nWrGFd6rFanv4uca8WQc/WMizlWq/Pz8zT5Wl08+VpdGnfHak1+HWvyc4Oa/fxq8nOTajYZKK1iHA4HFoul2l4lrO7zB/kcpOpN/u0vj3zdqqbq9HeRc60Ycq5VR01/fleTfK0unnytro6a/DrW5OcGNfv51eTnJtVsMlAqSZIkSZIkSZIkSZIkSdI1r1oGSnNycnjxxRfp0qULnTp14vHHHyctLQ2APXv2cMcdd9CuXTuuu+46Fi9eXMmzlSRJkiRJkiRJkiRJkiSpqquWgdInn3wSo9HI33//zb///otKpeK1114jNzeXhx9+mGHDhrFt2zbefvttpk6dyt69eyt7ypIkSZIkSZIkSZIkSZIkVWHqyp7Apdq3bx979uxh06ZN+Pr6AjBlyhTS09P566+/CAwMZNSoUQB069aNoUOHMm/ePNq0aVOZ05YkSar5igu1K6vlNThJkq4ldhsolKBQVPZMJOma5+Pjg8JhA1SVPRXJE+xWUFa7MIQkSdeQavcJtXfvXho1asSiRYv48ccfKSwspFevXrz00kscPXqUJk2aON2+UaNGLFmy5JIew2azXc0pX9ZjV+YcrkR1nz/I53A1qVQVe8Jb2c+vKqqMv73CmIUi6xjsmANWE7S7B0dEKxw+4R6bw5WqKsdMZajo4xQu/3WtTn8XOdeKcbXnqsxPgvjNKA7+hsM3AtrfiyOgLg6t/xXvu6Jf18o+VqvT+6ayydfqIjlsKHITaZS2AuXuHTgiWkHrO7D71wKVV2XP7rJVxrFaHd5zyrwzcGI1iqN/4wiMgfb34PCvg0PjXe79qsNzuxI1+flV9efmiWNVqp4UjmrWguzzzz/nk08+4bbbbuPFF1/EZDLx4osvotFoCA0NxWKxMG3atJLbL168mFmzZvH3339fcN82m43du3dX4Owl6drSoUOHCtmvPFarjnphvgRu+wBV3CKn7Y7anSgY/DlHkvMqaWbSxaqo4xTksSpVHQqFguZRPugX3g458U5j1n6TSK51I2m5hZU0u4sjj1WpJtFqtTTxNeA1fziYDaUDKg2WET9yzBaNsdBUeRO8AvJYdaZSqWgepkE7/2YwZJQOKBRYb/qEeN92ZBdUz7+1VL1V5LEqVW/VLqPUy0tcXXz11VfRarX4+vryzDPPMGLECG699VZMJucPWZPJhI+PzyU9RuvWrSvt6oLNZiMuLq5S53Alqvv8QT6H6qSmP7/L4em/vfL0BhTnBUkBFInb8E1YQ2yHMdViaeu1csxUlst9XavT30XOtWJcrbkqrCYUK8aXCZICqP+dTO1xNxFdP/YKZlq9Xld3ypt7TXh+niJfqwtTGNJQzr7HOU
gKYLOg+eUhmj6yHodfs8qZXDVw/nurKr/nFEV5KBbf5xwkBXA4UP/+NPWf2E5MI/d/66r83K6Gmvz8avJzk2q2ahcobdSoEXa7HYvFglarBcB+ti5e8+bNmT9/vtPtjx07RuPGjS/pMVQqVaUfyFVhDleius8f5HOoDmr687sSHnltrEWwdZbbYcW2WahaDgPfsIqdx1Uk31MV40pf1+r0d5FzrRhXPNf8LNi32O2w4sgfqHo8c/n7P0d1el3PdzFzr87Pz9Pka1WOwizIPulmLBtlQSoE1vHsnKoRd++tKvmeM2XDybWux+xWFEk7UQXXv+BuquRzu4pq8vOryc9NqpmqXceN7t27U6dOHSZMmIDBYCArK4vp06fTv39/brrpJjIyMpgzZw4Wi4X//vuPZcuWcdttt1X2tCVPyomHde/Dmvcg83hlz0aSai67DSxG9+OWQnBUzZpEkiRdYxwOsJndj5sKPDcXSZJEQ5/yWIs8Mw+p4tkvcC54flaxJElSJat2gVKNRsP333+PSqVi4MCBDBw4kMjISN555x2CgoL49ttvWblyJV26dGHixIlMnDiRrl27Vva0JU85vRk+7wHrP4JNM+HTLrBnYWXPSpJqJi9vaHOn+/HmN4M+2HPzkSRJckfnBzE93Y83HeS5uUiSBN7BoAt0PabSQEAtj05HqkBafwgtZ4Vn7c6em4skSdJFqHZL7wEiIiKYPn26y7HWrVuzYMECD89IqhIK0mDRaAisC9e9Bko1bPkcfn4EfEKgUf/KnqEk1Tz1ekFII8g85rzdOxg6PwTq6tu1VpKkGkQfBIOmwtfXl80sbXg9BMZUzrwk6VrlGwWD3oVfHi071ncC+IR7fk5SxfCLgCEfwdxbwGF3Hou9R4xLkiRVIdUuo1SS3Fo9RSzT6fMSePmAWgvdn4Ja7WHpw2DMquwZSlLNE1AL7v0Ver8IfpEiGNHxARi7GoLqVfbsJEmSSoU3g0fWQfNbQBcgPqNunAbDPqtWtZQlqUZQqaHZEBz3LcNRu7PIOoxsA3cvgA73i1UrUs1RqyM89K+4MKULEBmmwz6H/m+Ic0dJkqQqpFpmlEpSGVknYNc86DDG+ctWoYQez8DPj8KqyTB0RmXNUJJqroDa4gJFxwcAh1hur9FV9qwkSZKcqbwgvLkIjJpyxcoT33BQKCp7ZpJ0bdL5Y6/bg4z+HxMW5ItSrQOf0MqelVQRvLwhOhbumC1qkhZ//kqSJFVBMqNUqhm2fyuySJveWHZMHwRt74ad38nmTpJUUVRq8I8C/2gZJJUkqWrT+opseL8IGSSVpCogMdOAwzdKBkmvBboAca4og6SSJFVhMlAqVX/WIpFN2vA6sdzelWaDRcB03TTPzk2SJEmSJEmSJEmSJEmqFmSgVKr+jq+GwixoNMD9bVRe0GI4xP0Eecmem5skSW7Z7DbyzfkUWYsqeyqSdNUUv6/N5zcMkiSp2pLfV5IkVQarzSrPKSSpEsgapVL1d+BX0ek+6AIdaxsPgD3zYPs3cN1Ez8xNkqQy7A47SQVJ/Hb8NzYnbSbCJ4J7W9xLPf96+Gv9K3t6knRZLHYLvnV8mb5rOvsy9lE/oD73NL+H2n618dbIpiSSVB3Z7DaSDEn8euxXtiRvIdInkntb3EuMf4z8vpIkqcKYbWbOFJxh4eGF7M/YL84pWpw9p1DLcwpJqmgyUCpVbzYLHFoBTQdd+LZePtCgH+z8Hvq8LGoqSpLkccdzjnPvH/dSYCkQG9Lhz1N/8lyH5xjRdAQ+Gp/KnaAkXYb9Wft56K+HsNgtAOxO380vx37h/T7v069OP7xUXpU8Q0mSLtXxnOOM/mM0RqtRbEiHladW8lKnl7i18a3yIogkSRVib/peHvr7Iax2K1B6TvFhnw/pW6cvGpWmkmcoSTWbXHovVW+J26EoF2p3vrjbN74BClLg+KqKnZckSS7lmnKZsnlKaZD0HNN3TCezMLMSZiVJVybNmMYr618pCZ
IWc+DgtY2vkVGYUUkzkyTpcmWbsnlz85ulQdJzvL/9ffl9JUlShUgzpvHKhldKgqTFHDiYuHGiPKeQJA+QgVKpeju
"text/plain": [
"<Figure size 1361.88x1250 with 30 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.pairplot(penguin_manager.get_df(), vars=numerical_cols.columns[:-1], hue=\"Island\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation with the island may be a little more evident, but it's not very important."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<seaborn.axisgrid.PairGrid at 0x1e72eca57f0>"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABgsAAATPCAYAAAA/L8XUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3hT1RvA8W9W06Tp3oNd9t5LhgwRBURQVPgJiLhxskTAjThRcW8cIIILB7L33lA2ZZXuvZs26/fHoSM0Kaub83mePuI9yc25aW/uzXnPeV+FzWazIUmSJEmSJEmSJEmSJEmSJEnSDUtZ1R2QJEmSJEmSJEmSJEmSJEmSJKlqyWCBJEmSJEmSJEmSJEmSJEmSJN3gZLBAkiRJkiRJkiRJkiRJkiRJkm5wMlggSZIkSZIkSZIkSZIkSZIkSTc4GSyQJEmSJEmSJEmSJEmSJEmSpBucDBZIkiRJkiRJkiRJkiRJkiRJ0g1OBgskSZIkSZIkSZIkSZIkSZIk6QYngwWSJEmSJEmSJEmSJEmSJEmSdIOTwYISbDYbFosFm81W1V2RJMkJeZ5KUvUmz1FJqv7keSpJ1Z88TyVJkiRJqgoyWFCC1WrlwIEDWK3Wa3ruoUOHrum5NUFtPz6Qx1hTXM956mhfNen9qGn9hZrX55rWX6h+fS7Pc7QqVbf39XrVtuOB2ndMlXk8hedpbXr/qlJt+1usKvJ9tFdbrqeO3Ei/a3mskiRJUk0jgwXlxGazYTKZau3Mj9p+fCCP8UZU096PmtZfqHl9rmn9hZrZ55qgtr2vte14oPYdU1Ucj9lsrjXvX1WqbX+LVUW+jzeOG+l3LY9VkiRJqmlksECSJEmSJEmSJEmSJEmSJEmSbnA1MliQnp7OtGnT6Nq1K507d+bxxx8nMTERgIMHD3L33XfTvn17+vXrx9KlS6u4t5IkSZIkSZIkSZIkSZIkSZJUvdXIYMGTTz5Jbm4uq1evZv369ahUKmbPnk1GRgYPP/www4cPZ/fu3cyZM4e5c+dy6NChqu6yJEmSJEmSJEmSJEmSJEmSJFVb6qruwNU6fPgwBw8eZNu2bRgMBgBee+01kpKSWLVqFV5eXowZMwaA7t27M3ToUBYuXEibNm2qstuSJEm1jsVqQalQolAoqrorkiRJFcZqs4INlMoaOcdGkmoks9WMWlnjvqrekOTvSpIkSZJqlxp3VT906BDh4eEsWbKEn3/+mby8PHr16sX06dM5deoUTZo0sXt8eHg4v/7661W9hsViuep+FT7nWp5bE9T24wN5jBVFpVJVyH7L4xhq2u+8uvQ3KS+JA0kHWHFuBT5aH0Y2GUmoWyjuLu6lHltd+nylalp/4fr7XJ3P0apUE/8WylLbjgcq/pjS8tM4m3mW307+hhUrwxsNp7F3Y3y0PhXyemUdT0Wdp85eT7o6tfH8qgomswmPuh78cOwHDiQdoKl3U25rcBtBuiA0Kk1Vd69MN9o5WmAtICE3gX/O/MOp9FO0929P/3r9CdIFoVRcPrB6I50z8lirj4o8TyVJkmoTha2Glar/7LPP+Pjjjxk5ciTTpk3DaDQybdo0NBoNfn5+mEwm3n777aLHL126lC+//JLVq1dfdt8Wi4UDBw5UYO8l6cbTsWPHct2fPE+rjkKhwKuuF09uepKorCi7todbPcxAn4HkpOZUUe+kayXPUUkqzTPYk4+Pfcza6LV227sEdeGFti+QHpNeqf2R56lU22m1Wky+Jh5Z9wh55ryi7Wqlmvl95uOT44Mx11iFPSxbeZ+jUH3PU1e9K8n6ZJ7Z+Axmm7lou16t54t+X6BKVlFQUFCFPZQkxyriPJUkSaqNatzKAhcXFwBmzpyJVqvFYDDwzDPPMGrUKEaMGIHRaH8TaTQacXNzu6rXaN269VVHnS0WCxEREdf03Jqgth8fyGOsacrjGGra+1HV/TVZTby3771SgQKALw9/yaChg2hXt53d9qru89
Wqaf2F6tvn6tafq1Vd39drVduOByr2mDbHbi4VKADYFb+Lo+FHua3dbeX6elB1v6Pa9DdRVWrj+VXZUowpjFs5zi5QACLFzfNbn+fXob8SoAuoot5Vrer2d5WQl8BDfz1kFygAyDXnMnvHbL4d9O1lV2DdSOeMPFZJkiSppqlxwYLw8HCsVismkwmtVguA1WoFoHnz5ixatMju8ZGRkTRu3PiqXkOlUl3zxe16nlsT1PbjA3mMNUV5HkNNez+qqr9JxiSWRS5z2r7y3Eqe7PCkwzb5Hle86tbn6tafa1VbjqNQbTseKP9jyjXlsujYIqftC48tpHdobzxdPcvtNUuq7N9RbfybqCryvbx26QXpRGdFO2zLLMgkKS+JYENwJfeqeqhuf1cJuQlkmbIctp3LPEdGfgb+ev8r2ld1O7aKJI9VkiRJqilqXKW2Hj16UKdOHV544QVycnJITU3l/fffZ8CAAQwZMoTk5GQWLFiAyWRix44d/P3334wcObKqu12tGU0Wluy+wKt/H+XLTaeJSc+7/JMkSbrh2LCRb8l32p5tyq7E3kiSJFUMi9VSanZzSUaLsdSMWkmSro/FWnaO8wKLTGtTXVzudyE/HyVJkiSpZqtxwQKNRsOPP/6ISqVi0KBBDBo0iKCgIN544w28vb359ttvWbFiBV27dmXWrFnMmjWLbt26VXW3q60LqbkM/nAzz/9+iJVH4nlv1Un6vL2e91adwGyxVnX3JEmqRgwaAz1DejptH1hvYCX2RpIkqWIYXAzc3vB2p+231LsFL61X5XVIkm4AXlovPFw8HLaplWqC3IIquUeSMyGGENQKxwkKvLRe8vNRkiRJkmq4GpeGCCAwMJD333/fYVvr1q1ZvHhxJfeoZsrON3P/tzsxFlh5e2RbQr11GE0W/jkUyyfrIzl4IZ3P7++I3qVG/plIklTO3F3cebbjs+yK31VqhUHHwI7U96hfNR2TJEkqRwqFgj51+rDgyAJismPs2vx0ftwRfgcqpUyvIEnlyV/vz/TO05m5dWaptsfbPo6vzrcKeiU54uvqyyNtH+GTA5+UapvRZQb+uitLQSRJkiRJUvVU41YWSOXnnRXHic8wMu3WpoR66wBw1ai4q2Mdnh/cnN3n0piwYA/55rKXBUuSdONo4NmAJUOWcGv9W/Fw8SDELYQpnabwTu938NP7VXX3JEmSykWwWzDfDvqWCS0n4Ovqi7fWm9HNRvPj4B8JNYRWdfckqdZRK9X0DevL5zd/Tlv/thg0Bpr5NOOjfh9xd9O70al1Vd1F6SK9Rs+9Te/lw5s/pKl3UwwaA+392/PdoO/oHdZbBlMlSZIkqYaTU8ZvUGeSsvlpRxT3dqlDsGfpm+/WoZ5MHdSUuf8dY8ZvEbw9slUV9FKSpOpGrVTT0Kshr/R4hayCLJQKJb46X5QKGXuWJKl2CTGEMKn9JMa0GIMNG95ab1xULlXdLUmqtdw0brinu/PRzR9RYC3AReWCt6t3VXdLcsDL1Yt+dfvRzr8dJqsJV7UrntqKKfouSZIkSVLlksGCG9TnG0/joVNzSwvn+T+bB3vwSO9GfLw+kg51vWguvx9LknSRXqNHr9FXdTckSZIqlEalIUAfUNXdkKQbhslkwsPFA5VKzk6vCXx0PlXdBUmSJEmSypmcCnoDSs0p4M/9sQxqGYSLuuw/gZ7hfvRvFsDry4+RkGOupB5KkiRJkiRJkiRJkiRJkiRJlUkGC25Af+6PwWqzcXPTK5spN6ZrPQxaNd/sz6zgnkmSdEPIzxY/klQTWM1gzARzQVX3RJKkymAxgTFD/FeSJKmimfPFfYZFTsyTJEmSqgeZhugG9Mf+GNrX9cJDp7mix+tcVNzXuQ4frT/NjjMp9Gwsl+NLknQNsuIgaifs+RZsVug4Hur1BI/gqu6ZJJVmMdEq2BXl2lcgZg/4NoZuj4J3A9Aaqrp3kiSVN1MepJ2DnV9C4mEIbANdHgLveqCRxXUlSSpnxk
xIPQPbP4H081C3B3QcC551QSWHaSRJkqSqI69CN5iolFwiYjJ4ql/jq3pe1wbe/LZHxYdrI2WwQJKkq6bISYClD8C
"text/plain": [
"<Figure size 1556.62x1250 with 30 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.pairplot(penguin_manager.get_df(), vars=numerical_cols.columns[:-1], hue=\"Species\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation with the species seems to be more important since the clusters and normal distributions are more evident."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Graphically we can see a good correlation between many couples, for instance the features body mass and flipper length is very correlated.\n",
"\n",
"But also compared to the 3 nominal category, we can see that some clusters are formed, and some of this cluster also have a linear correlation. Also, this separation clearly shows a that there is a tendency in the normal distributions of the features that doesn't appear with the sex separation. The most evident is the species separation.\n",
"\n",
"Overall, species and island seems to be more informative than sex. Again, these penguins can change sex and it seems that the only way to find the sex is though blood isotopes.\n",
"\n",
"Now I will check it with the linear correlation matrix."
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAwgAAAO2CAYAAABMxbASAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAADs20lEQVR4nOzdd1xT5x4G8OckIRA2AsoQRXGhKOCsqHXvUbSuSrVaV922dqi1jjqrVesuWhfuqnXvumvdWidOiqAoKENkZt4/uGJTQHNsQgCf7+eTz715z5vwHKhwfnnHEXQ6nQ5EREREREQAJOYOQEREREREBQcLBCIiIiIiysYCgYiIiIiIsrFAICIiIiKibCwQiIiIiIgoGwsEIiIiIiLKxgKBiIiIiIiysUAgIiIiIqJsLBCIiAqIonzfyqJ8bkRERQ0LBKJ3VM+ePdGzZ8/X9hk9ejSaNGmST4kMl5ycjIULF6J9+/YIDAxE3bp10atXLxw+fNjc0QC83fft3r17+Oijj/TaKlasiAULFhgzWp4qVqyIihUrYs6cObke12q1aNCgASpWrIjffvtN1Htv3rwZP/zwwxv7GfLfJBERmZ7M3AGIqOAaPHgwevXqZe4Yeu7fv4/+/ftDq9WiV69eqFSpEtLS0rBnzx4MHjwYQ4YMwfDhw80dU7R9+/bh8uXLem2bNm2Cm5tbvmWQSCTYv38/vvjiixzHzp8/j7i4uLd63yVLlqB27dpv7DdhwoS3en8iIjIuFghElKdSpUqZO4IelUqFkSNHwsLCAuvXr4ezs3P2sWbNmsHR0RGLFi1C06ZNUaVKFTMmNY6AgIB8/XrVq1fHhQsXcOPGjRzfvz179sDX1xfh4eEm+/rlypUz2XsTEZHhOMWIiPL076kyTZo0wfz58/HDDz8gKCgI1apVQ9++ffH333/rve7ChQv4+OOP4e/vj9q1a+Obb75BQkKCXp/z58+jb9++qFWrFvz8/NCkSRMsWLAAWq02zzzHjx/HnTt3MGLECL3i4KWhQ4ciJCQEGo0muy0yMhLDhw9HvXr1EBAQgJ49e+LixYvZxx8+fIiKFSti5cqVaN26NWrXro3ffvsNCxYsQPPmzbFw4ULUqVMHzZo1Q2JiIoCsKTNt27aFn58fGjVqhAULFkCtVueZOyMjA7Nnz0aLFi3g5+eH6tWro0+fPtkX2wsWLMDChQsB6E8r+vcUo7i4OIwZMwYNGzZEtWrV0Llz5xzTqipWrIh169bh22+/Re3atREYGIjhw4fj2bNneeZ7qXbt2nBxccG+ffv02tVqNQ4ePIi2bdvmeM2tW7cwdOhQvPfee6hSpQoaNGiAKVOmICMjA0DWfzOPHj3Ctm3bULFiRTx8+BC//fYbKleujM2bN6N+/fp4//33cffuXb0pRmFhYTmmM50/fx6+vr6YP3/+G8+FiIjeHgsEIhIlLCwMERERmD59OqZMmYLr169j9OjR2cfPnz+P3r17w8rKCj/99BPGjh2Lc+fOoVevXtkXjbdu3ULv3r3h6OiIuXPnYsmSJahevToWLlyIPXv25Pm1T5w4AalUioYNG+Z63NnZGePHj0e1atUAZM3r79SpE6KjozFu3Dj8+OOPEAQBn3zyCc6dO6f32rlz56Jv376YMmUK3nvvPQBATEwMDh06hDlz5mDkyJFwcnJCaGgovvvuO9StWxc///wzQkJCsGzZMowfPz7P3F9//TW2bNmCAQMGYMWKFRg9ejTu3LmDzz//HDqdDl26dEHnzp0BZE0r6tKlS473ePbsGTp37oxz587h888/x4IFC+Dp6YkhQ4Zg586dOc5Fq9Vizpw5+Prrr3Hs2DFMmzYtz3wvSSQStGzZEvv379drP336NDIzM9G4cWO99ri4OISEhCA9PR0zZszAsmXL0Lp1a6xZswarVq0CACxcuBCurq5o2LAhNm3ahOLFiwMANBoNfv75Z0yZMgUjR47MMXrQs2dP1K5dGz
/88AMSEhKQmpqK0aNHw8/PD4MHD37juRAR0dvjFCMiEsXe3h6LFy+GVCoFAERFRWHBggVITEyEk5MTZs+ejTJlyiA0NDS7j7+/P9q2bYutW7ciJCQEt27dQlBQEGbNmgWJJOtzinr16uHYsWM4f/482rdvn+vXjo2NhZOTE2xsbAzKunDhQlhYWCAsLAx2dnYAgEaNGqFdu3aYNWsWNm/enN23RYsW2RfpL6nVanzzzTcICgoCALx48QJLlixBt27dMG7cOABA/fr14ejoiHHjxqFPnz4oX7683nsolUqkpqbiu+++Q5s2bQBkfVKfmpqKGTNm4OnTp3Bzc8tea5DXtKKVK1ciISEB+/btg5eXFwCgYcOG6N27N2bOnIl27dplfy8rVKiA6dOnZ7/26tWrOS7689KmTRusW7cO169fh5+fHwBg7969aNq0KaysrPT63rlzB76+vpg3bx5sbW0BAEFBQTh9+jTOnz+Pzz77DJUrV4ZcLkexYsVynNtnn32GRo0a5ZpDEARMmzYNHTp0wKxZsyCXy5GQkIAVK1ZAJuOfLiIiU+JvWSISpWrVqtkX/gCyL2zT09NhZWWFK1euoG/fvtDpdNnTbry8vODj44NTp04hJCQEwcHBCA4ORmZmJqKiovDgwQPcuHEDGo0GKpUqz68tCILe9KE3OXfuHBo3bpxdHACATCZD27ZtsWjRIqSmpma3V6hQIdf3+Gf75cuXkZ6ejiZNmuhNKXo5DevUqVM5CgS5XI7ly5cDyPrE/cGDB4iIiMDRo0cB4LXn++9zCQwMzC4OXurQoQPGjBmDiIiI7E/h/30h7ubmhvT0dIO+To0aNVCiRAns27cPfn5+UCqV+P333zFr1qwcfevXr4/69etDpVLh77//RmRkJG7fvo2EhAQ4Ojq+8Wvl9T1/ycvLC9988w0mTpwIAJg8eTJKly5t0HkQEdHbY4FARKIoFAq95y8/tdZqtUhOToZWq8WyZcuwbNmyHK+1tLQEkDUnf/LkydixYwfUajVKliyJwMBAyGSy1+6XX7JkSRw/fhypqal5jiI8fvwY7u7uAIDnz5/DxcUlRx8XFxfodDqkpKToteXmn+1JSUkAgAEDBuTaN69dfk6ePIlp06YhIiICNjY2qFixYnZ+Q+8P8Pz5c5QsWTLPfMnJydltuf2MDP06giCgVatW2L9/P7766iucPHkSEokE9erVQ2xsrF7fl9OY1q1bh7S0NLi7u6NatWrZP+c3yW0dyb+1bt0a06dPh0ajQf369Q16XyIi+m9YIBCR0djY2EAQBPTu3TvXBa0vL1ynTp2KAwcO4KeffkJQUBCsra0BAHXr1n3t+9evXx9r1qzByZMn0apVqxzHk5KS0Lx5c3Tq1Anff/89HBwccl2c+/TpUwCAk5OTqK077e3tAQA//vgjvL29cxzPrciIiorCkCFD0LRpU4SGhmbvDLVu3TqcPHnS4K9tyLkYS5s2bbB69Wpcu3YNe/fuRYsWLWBhYZGj39KlS7Fq1SpMnDgRLVu2zB6p+fdUrf9iypQpsLKygkKhwLhx47JHY4iIyHS4SJmIjMbW1haVK1dGREQEqlatmv0oX748Fi5ciLNnzwIALl68mL0z0Mvi4Pr160hISHjtLkb169dHhQoVMHfu3By7IgHAnDlzoFKpEBwcDACoVasWjh49ihcvXmT30Wg02LNnD6pWrQq5XC7q/Pz9/WFhYYHY2Fi987OwsMDs2bPx8OHDHK+5fv06MjMzMXDgQL1tY18WBy8/2X85EpOXWrVq4fLly4iOjtZr37lzJ1xdXY069SYgIACenp7YtWsXjhw5kmuxB2T9HMuVK4fOnTtnFwexsbG4c+eO3s/xTeeWl99//x07d+7E6NGjMWHCBPzxxx/YuHHjW70XEREZjiMIRO+wJ0+eZO8280/lypV76+kcX3zxBQYMGIBRo0ahQ4cO0Gg0WLFiBa5cuYJBgwYBAKpVq4Z9+/Zhw4YN8PHxwa
1bt7BkyRIIgvDaufIymQwzZ87Ep59+ig8//BCffPIJKlasiMTERGzfvh3Hjx/HyJEjUb16dQBZ256eOHECvXr1woA
"text/plain": [
"<Figure size 1000x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Get the numerical columns\n",
"numerical_cols = penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])\n",
"columns = numerical_cols.columns\n",
"\n",
"# Create a new dataframe without NaN values\n",
"noNan = penguin_manager.get_df().dropna()\n",
"\n",
"corr_matrix = noNan.corr(numeric_only=True)\n",
"\n",
"fig, ax = plt.subplots(figsize=(10, 10))\n",
"\n",
"warnings.filterwarnings('ignore')\n",
"sns.heatmap(corr_matrix, annot=True, cmap=\"coolwarm\", ax=ax)\n",
"warnings.filterwarnings('default')\n",
"\n",
"ax.set_title(\"Linea Correlation Matrix\")\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This also shows a correlation between the features body mass and flipper length. As well as some others important correlations of couples.\n",
"\n",
"Now I will check the correlation with the Spearman method which is used to find non-linear correlations. There is also the Kendall method, but I will use only the Spearman method."
]
},
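{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the difference between the two coefficients can be sketched on synthetic data. This is only an illustration under stated assumptions: the `toy` frame and its `x`/`y` columns are made up and are not part of the penguin dataset. It relies on the same `DataFrame.corr` call used in this notebook, once with `method='pearson'` and once with `method='spearman'`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic data (not the penguin dataset): y grows monotonically but non-linearly with x\n",
"x = np.linspace(0.0, 5.0, 50)\n",
"toy = pd.DataFrame({'x': x, 'y': np.exp(x)})\n",
"\n",
"# Pearson measures linear association, Spearman measures rank-based (monotonic) association\n",
"pearson_r = toy.corr(method='pearson').loc['x', 'y']\n",
"spearman_r = toy.corr(method='spearman').loc['x', 'y']\n",
"\n",
"print(f'pearson={pearson_r:.3f} spearman={spearman_r:.3f}')\n",
"```\n",
"\n",
"Because `y` increases monotonically with `x`, the Spearman coefficient should come out as 1 (up to rounding), while the Pearson coefficient stays below 1 because the relationship is not linear."
]
},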
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:>"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAn0AAAIWCAYAAAAiQDtUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAADCkklEQVR4nOzdd1hT5xfA8W8IBMISFRT3wL0RFBUnYq171rq34h5117q3ddQtat1b696tq05wD9w4ASdOZiDk9wc/YyPYOsI+nz73eZp7z733fSNJTt7z3huFTqfTIYQQQgghUjWTpG6AEEIIIYRIeJL0CSGEEEKkAZL0CSGEEEKkAZL0CSGEEEKkAZL0CSGEEEKkAZL0CSGEEEKkAZL0CSGEEEKkAZL0CSGEEEKkAZL0CSGEEEKkAZL0CSGEEEIksJcvX1KjRg18fHw+GXP06FHq1atHqVKlqFWrFocPHzZqGyTpE0IIIYRIQOfOnePHH3/k4cOHn4y5f/8+vXv3pm/fvpw9e5bevXvTr18/nj59arR2SNInhBBCCJFAtm7dysCBA+nfv/9/xrm6uuLp6YmpqSm1a9emTJkybNiwwWhtkaRPCCGEEOIzaTQaQkJCDBaNRvPJ+IoVK/Lnn39Su3btfz3unTt3KFCggMG6fPnycePGDaO0G8DUaEcS4h+iXtxN6iYkCy1d/v2bXVqxYmjOpG5CsqF7/Dypm5AsjFqpSOomJAue4bqkbkKy8P3T9Ql+DmN9Lnmv283cuXMN1vXq1YvevXvHG+/g4PBZxw0NDUWtVhuss7CwICws7OsaGg9J+oQQQgghPpOXlxcdOnQwWKdSqb75uGq1moiICIN1ERERWFlZffOx35OkTwghhBCpX4zWKIdRqVRGSfI+VqBAAfz8/AzW3blzh2LFihntHDKnTwghhBCpny7GOEsCqV+/Pr6+vuzZs4fo6Gj27NmDr68vDRo0MNo5JOkTQgghhEgCzs7O7NixAwAnJyfmzZuHt7c3ZcqUYf78+cyZM4c8efIY7XxS3hVCCCFE6heTcKN0n+vmzZsGjy9cuGDwuFKlSlSqVCnBzi9JnxBCCCFSPV0ClmZTCkn6hBBCCJH6JYORvqQmc/qEEEIIIdIAGekTQgghROon5V1J+oQQQgiRBhjpPn0pmZR3hRBCCCHSABnpE0IIIUTqJ+VdSfqEEEIIkQbI1btS3hVCCCGESAtkpE8IIYQQqZ7cnFmSPiGEEEKkBVLelfKuEEIIIURaICN9QgghhEj9pLwrSZ8QQggh0gC5ObMkfUIIIYRIA2SkT+b0CSGEEEKkBTLSJ4QQQojUT67elaRPCCGEEGmAlHelvCuEEEIIkRbISF8SefbsGdbW1lhaWiZ1U1Ktl69e08rrJ8YM7UfZ0iWSujlG51zNhdZD25IppyMvgp6zasJyzh86G2+stZ0N7X7pQKkqpTE1N+PeVX9Wjl/G/Wv3DOJsM9gyYetUFgyZy7XTVxOjG9/kZVgk4/66xtmAV5iaKKhdKAv9KxfA1CTu99mzAS+ZdewW/i9DsTU35YcSOehUNi8AFeYdNIjV6XRERMcw8fvi1CqUJVH68k2sbDGv3wVl7iIQoyX68nE0+1fHW84ybzP0/3EftkVumIn2ziUwNUNVqy3KQmVQmJoS8/g+kXtXonv6MDF788UKVS1FnaEtyJgzE6+Cgtk1cQ3XD12IN1ZhoqDOkBa4NK6MmVrFnVN+/PHz77x7/hoAB6esNBzVlpyl8hHxLpzTaw9yaP52dDodAMW/L4tnn0ZkzJGJsDehnNl0lL9mb9FvTw5U9rYUndaFDBWKoIvWEvTHcW6OXo1O++mRrsx1ylJwVCv+Ltv3w0qFAk//ZaAA/tG9w8W80IZFJlwHEoqUd798pO/evXsMGTKEypUr4+zsjKenJ9OmTSM0NPSz9vfx8aFgwYJf3FBjCwgIoGDBggQEBCT6uV+8eEHNmjV5+fIlAHPmzKFNmz
ZfdIyoqChatGjBo0ePEqKJem/fvqVJkya8ffs2Qc9jbOcv+9HK6yceBT5O6qYkCMfcWRi4cAjrp6+lXbEWbJyxjp/mDyZD5gzxxnef2gubDLb0r9GbLi7tuHn2BsNXjMJcba6PKehaiAlbp+KYOwUkOf83ZM9lLFWmHOhShVUt3PB5FMya8w/ixN17GUrvbef5oWQOTvTwYHaD0qw+/4A/bz8B4GTP6gZL9fyZqZArIzUKZE7sLn0Vi2Z9QRNB2LTuhC/6BWXe4piVrx1vrDJrXiJWTSJsQnv9or1zCQCzaj9gkjEL4XMHEDbVi5gnD7BoMSAxu/LF7HM70m5hf/bN2MQvxTtxYOZm2szri23m9PHGe/ZuTIHKJfit/s+MK9eTqAgNzaZ0BUBlaU7XFUN5FRjMOLeezG82hpJ1y+PZpzEA2YrlocXMHuybtpERJTqzpN1kyjStTOVO8T/XSaXkor5oQyM4XLI7p2r9QsbKxcnlFX8bFaZK8vSsR0nvPigUhimBdcFsmJgqOVigE3/lba9fUmTCB+h0WqMsKdkXJX3nz5+nUaNGZMuWjW3btnHhwgUWL17MpUuX6NixI1ptyn4yEktERARhYWHfdIx58+ZRpkwZcuTIYaRWxc/W1pbmzZszfvz4BD2PMW3f8ydDRk+lT9d2Sd2UBFO1qQfXfa9x5oAPMdoYTu0+wTWfq3i2rBn/DjpYP30NIa/fER0VzY5FW7HLlJ4sebMCUKVJNfrOGsC6X1cnYi++zcPXYZwNeEXfivlRmynJns6SLmWdWH8p7hehDZceUs0pE/WLZEOhUFDAwYblP5bFOWvcxGCHXyA+D4KZ8H3xeEcMkxtFhswo8xRFc2AtRGnQvXqG5ugWTN3i/i0o7BxAbU1M0L14jgQmDllBYULs0I4CXUwMRCXvD3jXJpW563sDvwNnidHGcGn3ae76XKdcy+rxxrv9WI3DC3bw5vFLIkPC2T5mJQWrliRDjkzkKVMIa/t0bB25FE14JK8CX3Bw3lbKt/YEIEN2B06t+Yvrhy6g0+l45h/Elf1nyONWKDG7/K8sc2cmo3tRbo5dS0y4hvAHz/CfsYVcHeN/b3Dd+DMZ3Ityd86OONvSlXLi3bWH6KLksz21+KJ3tJEjR9KwYUP69OlDhgyxIwp58uRh5syZZMyYUT/qVLBgQXx8fPT7bdmyBQ8PjzjHez/atm3bNqpVq0apUqUYNmwYZ8+epX79+jg7O9OuXTv9iJhOp2PlypXUrFkTV1dXWrZsydWrH0pQHh4eeHt707BhQ5ydnWnYsCGnT5/+8mfFCOe6du0aLVq0wNnZmQYNGrBgwQI8PDzQarXUrVsXgLp167Jnzx4AQkND+eWXX6hYsSJubm7MnDnzk217+fIlK1eupFWrVkDs6KmHhwdLlizB3d0dFxcXZsyYwcGDB6lZsybOzs707t0bjUYDQJs2bZg9ezYtWrSgVKlS1K9fn8uXLzNgwABKly6Nh4cHR44c0Z+vQYMGHDlyhFu3bn3Vc5nY3N1c2LtxKbU8qyR1UxJMjvw5eXjTcEQr4PYjchXOHW/8r16TuO/34YO+XO0KRISGE+QfCMClvy/Qq7IXJ3cdT7A2G5t/cAjpLMzIZG2hX5c3oxVP3kXwLiLKINbvyRuy2qoZuucy1RYepvGKE5wNeIW9lblB3LvIKGYcu8XAKoWwU6sSpR/fyiRTdnRh79C9e6VfF/MsABM7B7AwnD5iks0JIsMxb9YXyyGLUPf8FVPnqvrtUSd3Y5IpO1bDlmD5ywpMS1YiYsNvidORr5S5QHae3DRM9J/eDiRr4ZxxYi1s1Nhlzcjjf8SHvHhD+JtQshTOiYmJCdGaaLT/SHJ0MTpsHexQ21pxZZ8vO8d/+GJkam5G4WrOBF6JP4lOCtaFsqN5+Y7Ipx/+HkJuBaDO4YCpbdzpRJd7zuNcy8mE338aZ1u6Uk6YqFWU3zcBD79FlN02CjvXAgna/gSlizHOkoJ9dt
L38OFDbt++rU9Y/sne3p758+eTO3fur2rE0aNH2bNnDxs3bmT79u2MGzeOxYsXc/DgQR4/fszatWsBWLt2LcuWLWP
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create a new dataframe without NaN values\n",
"noNan = penguin_manager.get_df().dropna()\n",
"\n",
"# Create the correlation matrix with the spearman method\n",
"spearmanCorrMatrix = noNan.corr(method='spearman', numeric_only=True)\n",
"\n",
"sns.heatmap(spearmanCorrMatrix, annot=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This method shows a correlation based on the monotone relationship between the features. This means tha the also linear correlation is taken into account. The result are similar to the linear correlation matrix.\n",
"\n",
"However, to see a better correlation I should check after grouping by the nominal features which I won't do."
]
},
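{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a grouped correlation could look roughly like the sketch below. It is a minimal illustration, not something run on the penguin data: the `toy` frame, the `group` column and the values are hypothetical, chosen so that the pooled coefficient and the per-group coefficients disagree. In this notebook, the grouping column would presumably be `Species`, `Island` or `Sex`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: both groups have a rising trend of y with x,\n",
"# but group 'B' is shifted so that the pooled trend points the other way\n",
"toy = pd.DataFrame({\n",
"    'group': ['A'] * 4 + ['B'] * 4,\n",
"    'x': [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0],\n",
"    'y': [10.0, 11.0, 12.0, 13.0, 1.0, 2.0, 3.0, 4.0],\n",
"})\n",
"\n",
"# Spearman coefficient over all rows, ignoring the groups\n",
"pooled = toy[['x', 'y']].corr(method='spearman').loc['x', 'y']\n",
"\n",
"# One Spearman coefficient per group\n",
"within = {\n",
"    name: part[['x', 'y']].corr(method='spearman').loc['x', 'y']\n",
"    for name, part in toy.groupby('group')\n",
"}\n",
"\n",
"print('pooled:', round(pooled, 3))\n",
"print('within groups:', within)\n",
"```\n",
"\n",
"Here the pooled coefficient is negative while each group on its own is perfectly monotonic, which is why a pooled matrix alone can hide or even reverse within-group relationships."
]
},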
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-11</td>\n",
" <td>39.1</td>\n",
" <td>18.7</td>\n",
" <td>181.0</td>\n",
" <td>3750.0</td>\n",
" <td>MALE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A2</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-11</td>\n",
" <td>39.5</td>\n",
" <td>17.4</td>\n",
" <td>186.0</td>\n",
" <td>3800.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.94956</td>\n",
" <td>-24.69454</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>40.3</td>\n",
" <td>18.0</td>\n",
" <td>195.0</td>\n",
" <td>3250.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.36821</td>\n",
" <td>-25.33302</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>36.7</td>\n",
" <td>19.3</td>\n",
" <td>193.0</td>\n",
" <td>3450.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.76651</td>\n",
" <td>-25.32426</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A2</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>39.3</td>\n",
" <td>20.6</td>\n",
" <td>190.0</td>\n",
" <td>3650.0</td>\n",
" <td>MALE</td>\n",
" <td>8.66496</td>\n",
" <td>-25.29805</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>337</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N38A1</td>\n",
" <td>No</td>\n",
" <td>2009-12-01</td>\n",
" <td>47.2</td>\n",
" <td>13.7</td>\n",
" <td>214.0</td>\n",
" <td>4925.0</td>\n",
" <td>FEMALE</td>\n",
" <td>7.99184</td>\n",
" <td>-26.20538</td>\n",
" </tr>\n",
" <tr>\n",
" <th>338</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N39A1</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>46.8</td>\n",
" <td>14.3</td>\n",
" <td>215.0</td>\n",
" <td>4850.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.41151</td>\n",
" <td>-26.13832</td>\n",
" </tr>\n",
" <tr>\n",
" <th>339</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N39A2</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>50.4</td>\n",
" <td>15.7</td>\n",
" <td>222.0</td>\n",
" <td>5750.0</td>\n",
" <td>MALE</td>\n",
" <td>8.30166</td>\n",
" <td>-26.04117</td>\n",
" </tr>\n",
" <tr>\n",
" <th>340</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N43A1</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>45.2</td>\n",
" <td>14.8</td>\n",
" <td>212.0</td>\n",
" <td>5200.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.24246</td>\n",
" <td>-26.11969</td>\n",
" </tr>\n",
" <tr>\n",
" <th>341</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N43A2</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>49.9</td>\n",
" <td>16.1</td>\n",
" <td>213.0</td>\n",
" <td>5400.0</td>\n",
" <td>MALE</td>\n",
" <td>8.36390</td>\n",
" <td>-26.15531</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>342 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" studyName Species Island Individual ID \\\n",
"0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 \n",
"1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 \n",
"2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 \n",
"3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 \n",
"4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A2 \n",
".. ... ... ... ... \n",
"337 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A1 \n",
"338 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A1 \n",
"339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A2 \n",
"340 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A1 \n",
"341 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A2 \n",
"\n",
" Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) \\\n",
"0 Yes 2007-11-11 39.1 18.7 \n",
"1 Yes 2007-11-11 39.5 17.4 \n",
"2 Yes 2007-11-16 40.3 18.0 \n",
"3 Yes 2007-11-16 36.7 19.3 \n",
"4 Yes 2007-11-16 39.3 20.6 \n",
".. ... ... ... ... \n",
"337 No 2009-12-01 47.2 13.7 \n",
"338 Yes 2009-11-22 46.8 14.3 \n",
"339 Yes 2009-11-22 50.4 15.7 \n",
"340 Yes 2009-11-22 45.2 14.8 \n",
"341 Yes 2009-11-22 49.9 16.1 \n",
"\n",
" Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) \\\n",
"0 181.0 3750.0 MALE NaN \n",
"1 186.0 3800.0 FEMALE 8.94956 \n",
"2 195.0 3250.0 FEMALE 8.36821 \n",
"3 193.0 3450.0 FEMALE 8.76651 \n",
"4 190.0 3650.0 MALE 8.66496 \n",
".. ... ... ... ... \n",
"337 214.0 4925.0 FEMALE 7.99184 \n",
"338 215.0 4850.0 FEMALE 8.41151 \n",
"339 222.0 5750.0 MALE 8.30166 \n",
"340 212.0 5200.0 FEMALE 8.24246 \n",
"341 213.0 5400.0 MALE 8.36390 \n",
"\n",
" Delta 13 C (o/oo) \n",
"0 NaN \n",
"1 -24.69454 \n",
"2 -25.33302 \n",
"3 -25.32426 \n",
"4 -25.29805 \n",
".. ... \n",
"337 -26.20538 \n",
"338 -26.13832 \n",
"339 -26.04117 \n",
"340 -26.11969 \n",
"341 -26.15531 \n",
"\n",
"[342 rows x 13 columns]"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguin_manager.get_df()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8 Explore categorical variables\n",
"\n",
"Investigate the distribution of categorical variables using frequency tables, bar plots, or pie charts. Look for patterns, imbalances, or rare categories that may require special handling."
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>342</td>\n",
" <td>342</td>\n",
" <td>342</td>\n",
" <td>342</td>\n",
" <td>342</td>\n",
" <td>341</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>190</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>PAL0910</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Biscoe</td>\n",
" <td>N69A1</td>\n",
" <td>Yes</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>119</td>\n",
" <td>151</td>\n",
" <td>167</td>\n",
" <td>3</td>\n",
" <td>307</td>\n",
" <td>171</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" studyName Species Island Individual ID \\\n",
"count 342 342 342 342 \n",
"unique 3 3 3 190 \n",
"top PAL0910 Adelie Penguin (Pygoscelis adeliae) Biscoe N69A1 \n",
"freq 119 151 167 3 \n",
"\n",
" Clutch Completion Sex \n",
"count 342 341 \n",
"unique 2 3 \n",
"top Yes MALE \n",
"freq 307 171 "
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get the categorical columns\n",
"categorical_cols = penguin_manager.get_df().select_dtypes(include=['object'])\n",
"df_categorical = penguin_manager.get_df()[categorical_cols.columns]\n",
"df_categorical.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Species"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Species distribution')"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAlIAAAHBCAYAAACixVUDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAABPV0lEQVR4nO3de3zP9f//8fuOtjbaMKcoMaZpZcwxxBglc5jDRPpIRQ6FDkOWQ8yhk/Mpp4XJeaKEElJYDguRZaUcpnZijM3svffvD7+9v94N2cvYG7fr5eJie52ej/f7/Xy/dn8/X8/3+21nNpvNAgAAQL7ZF3YBAAAAdyuCFAAAgEEEKQAAAIMIUgAAAAYRpAAAAAwiSAEAABhEkAIAADCIIAUAAGAQQQoAJNnCZxPbQg0A8ocgBdyFfvvtNw0aNEhPPfWUHn/8cTVs2FADBw7U4cOHC60mHx8fTZ06tVDaHjJkiAIDAy2/d+/eXd27d7/p/ffu3avevXv/53ZTp06Vj4+P4XZuZObMmZo3b9512wJgmxwLuwAA+XP06FGFhobqiSee0LBhw1SyZEn9/fffWrx4sUJDQ7Vo0SLVqFHjjte1bNkylSlT5o63ey0jRozI1/YrVqxQfHz8f27XqVMnNWrUyGhZNzRp0iT179//jrQFoOAQpIC7zIIFC+Th4aG5c+fKycnJsrx58+Z69tlnNWPGDH366ad3vK7CCG/X4+3tfVuOW6ZMmTsWFu9kWwCM49IecJdJTk6WlHc+zQMPPKChQ4fq2WeftSzr3r27hgwZotmzZ+upp55SzZo11adPH504ccJq399++029e/dWzZo1VbNmTfXr1y/PNikpKXr33XfVoEED+fv7q1u3btq7d69l/b8v7Z09e1bDhw9XgwYN5Ofnp86dO2vnzp1Wx9yxY4dCQ0Pl7++v2rVrq2/fvvrjjz9uePvT0tI0dOhQ1a1bV7Vr19aHH36onJwcq23+fcntRu0MGTJE0dHROnXqlHx8fLR69WqdPHlSPj4+WrBggZ599lnVqVNHq1evvu7ltunTp1vul759+1rdd9fb5+r7K3f9tGnTLD9fa7/169crJCRE/v7+euqppzR8+HClpaVZtRUUFKStW7cqODhYjz/+uFq2bKno6Ogb3qcAjCNIAXeZJk2aKCEhQV26dFFUVJR+//13S6h65pln1L59e6vtN2/erFWrVmnYsGF6//33deTIEb344ou6ePGiJOnYsWPq0qWLUlJSNH78eEVEROjEiRN6/vnnlZKSIkm6ePGiunTpoh07duitt97StGnT5ObmpldeeUW///57nhovXbqk//3vf9q8ebMGDRqkadOmqUyZMnrllVcsYerEiRPq06ePqlevrpkzZ2rMmDH6448/1KtXrzzBKFdOTo5eeeUVbd26VW+//bYmTJig2NhYrV+//rr313+107dvXz399NPy8vLSsmXL1KRJE8u+EydO1Msvv6wxY8aoXr161zz+3r17tW7dOg0fPlxjxozRkSNH1KNHD2VlZV23pn9btmyZJKljx46Wn/9txowZGjRokJ588klNmTJF/fr108aNG9W9e3dlZmZatktKStL777+vF198UZ9++qnKly+vIUOGXPNxAnDruLQH3GW6du2qpKQkzZs3T++//74kydPTUw0bNlT37t315JNPWm1/8eJFrVq1Sg8//LAkqVKlSmrfvr2io6PVrVs3TZs2TS4uLoqMjJS7u7skqX79+mrevLnmzp2rwYMHKzo6WidOnNCaNWtUrVo1SVJAQIDatWun3bt3q3LlylZtfvHFFzpy5IiWL19uqadx48bq3r27PvroI61atUoHDhxQZmamevfurdKlS0uSypYtq82bN+vixYuWWq72/fff68CBA5o9e7Yl8NSrV89qovm//Vc7Dz/8sIoXLy5nZ2fL5cnckNmiRQt17Njxho+Hvb295s2bp4ceekiSVLlyZbVr107R0dEKDQ
294b65ctstU6bMNS+RpqWlaebMmerUqZPV/K+qVauqW7duWr16tbp27SpJysjIUEREhOrXry9Jqlixopo2bapt27bleZwA3DpGpIC70IABA7R9+3Z9/PHH6tixo9zd3bVu3TqFhobqs88+s9rW39/fEqIkydfXVxUqVNCePXskSbt27VLdunXl4uKi7OxsZWdny93dXQEBAdqxY4ckac+ePSpfvrwlRElSkSJF9PXXX6tLly556tu5c6e8vLxUvXp1yzFNJpOaNm2qX375RWlpaXryySdVpEgRdezYUePGjdOOHTtUrVo1DRo06JohKrcOJycnNW7c2LLsgQce0NNPP33d+8pIO7mqVq16w/XSlRCUG6IkqVq1aipfvrzlvisIP//8s7KyshQcHGy1PCAgQA899JBiYmLy1JQrd55VbjgEULAYkQLuUg8++KBat26t1q1bS5IOHz6ssLAwffTRR2rTpo08PT0lSaVKlcqzb4kSJXTu3DlJV+YyrV+//pqXx4oXL27ZpkSJEjdd29mzZ5WUlKTq1atfc31SUpK8vb21ePFiffrpp1q+fLkiIyNVrFgxde3aVQMGDJC9fd7XeWlpafLw8MizzsvL67q1lC9fPt/t5CpZsuR/3tZrbXP1/VsQcudBXautkiVL6vz581bLXF1dLT/n3j4+owq4PQhSwF3kn3/+UYcOHTRgwAB16tTJap2vr68GDhxomSieG6TOnj2b5zjJycmWUaqiRYuqQYMGeumll/Js5+joaNnm5MmTedbHxsbK3d1dVapUsVpetGhRVaxYUR999NE1b0f58uUlSU888YSmTZumrKws7d27V8uWLdOsWbPk4+OjVq1a5dnP09NTZ86ckclkkoODg2X5tW7j1fLbTn5cKzAlJSXJ399fkmRnZydJVjVfuHAhX208+OCDkq48bv++PJeUlKQKFSrku24ABYNLe8BdpGTJknJ0dNSSJUt06dKlPOv/+OMPFSlSRI888ohlWWxsrFJTUy2/Hzp0SCdPnrTMoalTp47i4+P12GOPyc/PT35+fnr88ccVGRmpb775RtKVS0gnTpxQXFyc5ThZWVl6/fXXtXz58jx11KlTR6dPn1aJEiUsx/Tz89POnTs1d+5cOTg4KDIyUoGBgcrKypKzs7Pq16+v0aNHS5JOnz59zdtfv359ZWdn69tvv7Wq48cff7zufXYz7dxoVOq/xMbGWo0IHThwQKdOnbJMTs+9fHj1bdq3b1+e49yohieffFLOzs5at26d1fI9e/YoISFBNWvWNFw/gFtDkALuIg4ODho5cqR+++03dejQQZ9//rl++uknbdu2TWPHjtXkyZPVv39/ywiGdGXy8auvvqpvv/1WX3zxhfr166eqVataLgn27dtXx48fV+/evfXtt99q+/btev311/XVV19Z5kSFhISoQoUK6tOnj7744gtt375db7zxhjIzM6/5yd4hISEqV66cXnrpJUVHR2vXrl365JNPNHHiRJUqVUpOTk6qV6+eEhMT1a9fP23btk0//PCDhg4dKmdnZzVt2vSat79+/fpq2LChwsPDtWTJEm3btk19+vSxCor/djPtFCtWTMnJydq2bZsSExPz9Zjk5OSoV69e2rZtm9asWWO5f9u0aSNJlvlb7733nnbs2KHVq1drxIgRcnNzszpOsWLFFBsbq927d+e5DOfh4aFevXppxYoVGjVqlH744QctXbpUr7/+ury9vRUSEpKvmgEUHC7tAXeZJk2aaPny5Zo3b55mzZql1NRUOTs7y9fXVxMnTlSLFi2stg8ICFC9evU0bNgwSVJgYKDCwsLk7Ows6crk6KioKE2cOFFhYWEym82qWrWqpk+frmbNmkm6MqqyePFiffDBB4qIiFB2draefPJJLVq0yGoie64HHnhAUVFR+vjjj/Xhhx/q/Pnzeuihh/TWW2+pZ8+elnZnzZql6dOn680335TJZNLjjz+u+fPnq1KlSte9/dOmTdNHH3
2kKVOm6NKlS2rVqpU6d+6szZs3X3P7m2knJCRE27ZtU79+/fTGG2/k63Jf06ZN9fDDD+udd95Rdna2mjZtqmHDhql
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# shows the species distribution\n",
"sns.countplot(x='Species', data=df_categorical)\n",
"plt.title(\"Species distribution\")"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjYAAAHBCAYAAAB6yfEJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB8/ElEQVR4nO3deXgN5//G8Xf2hCBB7GpPWgQhxL6kSC1B7UVa+05rqa2WWEKooqL2rYpSSm1tae0tUtTeololBAmJEBJJTs7vj/ycb09tiS1xer+uy3XJzDMzn5kzOXNn5pkZK6PRaERERETEAlindwEiIiIiz4uCjYiIiFgMBRsRERGxGAo2IiIiYjEUbERERMRiKNiIiIiIxVCwEREREYuhYCMiIiIWQ8FG5BWk52qKiDycgo280g4fPky/fv2oVq0anp6evPnmm4wcOZI///zzgbYeHh6EhISkQ5XP15o1a5g8ebLp53Xr1uHh4cGlS5fSsar0c+nSJTw8PFi3bl2qpwkICCAgIOAFVmUuvfe9kJAQPDw8ntju8OHD9OzZEx8fH0qXLk3t2rUZPnw4Fy9efAlVPuhpPlsRBRt5Zc2fP5/27dtz9+5dhg8fzqJFi+jZsye//fYbb7/9Nlu2bEnvEl+IOXPmcPPmTdPPtWvXZvXq1eTKlSv9ipJX3v79+3n33Xext7dnwoQJLFq0iD59+nD06FFatWqVLuEmV65crF69mtq1a7/0Zcuryza9CxB5Gjt37uSTTz6hd+/evP/++6bhlSpVolmzZgwaNIhhw4bh7u5OiRIl0rHSFy979uxkz549vcuQV9zcuXPx9PRk5syZpmE+Pj7UqlWLevXqsWTJEsaMGfNSa7K3t6dcuXIvdZny6tMZG3klzZo1iyJFitC/f/8HxtnZ2TF27FhsbGxYsGCB2bjY2FgGDx6Ml5cXVapUYcKECcTFxZnGh4WF0atXL3x8fChbtixt2rRh9+7dZvM4e/YsPXr0oHz58pQvX54+ffoQFhZmGh8aGoqHhwerVq2iTp06VK1alW+++QYPDw9Onz5tNq/du3fj4eHB8ePHATh9+jR9+/alcuXKlCpViho1ajBhwgTi4+MB8PX15fLly6xfv950+elhl6J+/vln2rVrR4UKFfDx8WHQoEFcuXLFNH7dunWULFmSY8eO0aZNGzw9Paldu/YD2+vbb7+lSZMmlClThsqVKzN48GAiIiIe+bncX/effvqJ9u3bU6ZMGerVq8fy5cvN2iUnJzN//nzq1atH6dKl8fPz44svvjBrExAQwODBg+nfvz/ly5ene/fuj1zuv+3bt482bdrg5eVFxYoV6d27N3/99dcj20dFRTF27Fjq1KlD6dKlqVSpEn369DHbpgEBAXz00UfMnz+f2rVr4+npSdu2bTl27JjZvH755RfatGlD2bJl8fPzY9++famq+ccff6Rdu3Z4eXlRunRp3nrrLbPtdn/b7t+/n86dO1O2bFmqVq3K5MmTSUpKMrW7d+8ekyZNolq1anh5eTF8+HDu3bv3xOVfv379ocNz5crFyJEjqVatmmmYr68v06dPZ9KkSVSqVIlKlSrx4YcfEh0dbTbtoUOH6NChA2XLlqVSpUoMHTqUqKgoszYXL16kf//+VKpUiYoVK9KtWzf++OMP4OGXosLDwxk4cCCVKlWibNmyvPfee/z2229m80zrfiuWRcFGXjlRUVGcPHmSOnXqYGVl9dA2rq6uVK1ale3bt5sN/+KLL4iNjWXGjBn06NGDNWvWMHLkSCDlYNujRw/u3r3LlClTmD17Ni4uLvTu3ZsLFy4AcP78edq2bcuNGzcIDg4mKCiIsLAw3nnnHW7cuGG2rOnTpzN06FCGDh1K3bp1yZw58wOXxzZv3kyRIkUoU6YMERERtG/fnri4OIKDg1mwYAENGjTgiy++YOnSpUBKoHNzc6NWrVqPvPy0YcMGOnfuTO7cuZk2bRrDhw/nyJEjtG
nTxqzG5ORkPvjgAxo2bMj8+fOpUKECU6dOZe/evUBKf4vBgwdTv359FixYwPDhwzlw4ACDBg164mc0YMAASpYsyWeffUa1atUYP368WXAJDAxk5syZNGnShLlz5/LWW28xceJEPvvsM7P5fPfdd9jZ2fHZZ5/x7rvvPnG58L9wWqpUKebMmcOECRP466+/6N69O8nJyQ+0NxqN9OjRg59//plBgwaxaNEievfuzb59+xg9erRZ261bt7J9+3ZGjhzJtGnTuH79Ov3798dgMABw6tQpOnfujLOzM59++invvfceAwcOfGLNu3btok+fPpQqVYrZs2cTEhJC/vz5GT9+PL/++qtZ28GDB1OhQgXmzp2Lv78/ixcvZu3atabxH374IatXr6Zbt27MmDGDmJgY0/7zOLVr1+bIkSMEBASwdu1as7DeqlUr6tata9Z+5cqVHD58mIkTJzJ48GD27NlD165dTdv44MGDdOzYEUdHR2bMmMGIESP45ZdfePfdd01BPSIiglatWvHXX38xZswYpk6dSkxMDB07dnwgAEHK737btm05deoUo0aN4pNPPiE5OZn27dub+tU9y34rFsIo8oo5fvy40d3d3bh8+fLHtgsODja6u7sbb968aTQajUZ3d3djw4YNjQaDwdRm6dKlRg8PD+O5c+eMERERRnd3d+OGDRtM42/dumWcOHGi8cyZM0aj0WgcOHCgsUqVKsbbt2+b2kRHRxsrVKhgDA4ONhqNRuOBAweM7u7uxmnTppnVM2zYMKOvr6/p57i4OKOXl5dx9uzZRqPRaNy7d6+xffv2ZvM2Go3Gxo0bGzt37mz6uU6dOsahQ4eafv7666+N7u7uxrCwMKPBYDBWq1bN2LFjR7N5XLhwwViqVCnjlClTzKb56quvTG3u3btn9PT0NI4bN85oNBqN8+bNM5YrV84YHx9varNr1y5jSEiIMTk5+aHb/P66Dxs2zGx4r169jFWqVDEaDAbjX3/9ZfTw8DDOmzfPrM306dONnp6exqioKKPRaDR26NDBWLp0aeOdO3ceuqz7wsLCjO7u7savv/7aaDQajZs3bza6u7sbr169ampz7Ngx47Rp00zbtkOHDsYOHToYjUaj8erVq8aAgADjwYMHzeY7fvx4Y6lSpUw/d+jQwVi2bFmzz2f9+vVGd3d344kTJ4xGo9HYr18/Y40aNYz37t0ztdmyZYvR3d3dOHPmzEeuw4IFC4xDhgwxGxYdHW10d3c3zp0712g0/m/bTp8+3aydr6+vsUePHkaj0Wg8e/bsA78bBoPB2LBhQ6O7u/sjl280pnz+o0aNMpYsWdLo7u5udHd3N9aoUcM4atQo47lz58za1qlTx1ixYkXjrVu3TMN++OEHo7u7u3Hnzp1Go9FobNOmjbFx48bGpKQkU5u//vrL+MYbb5jqCw4ONpYpU8YYERFhanPt2jVj7dq1jdu3b3/gs502bZrR09PTeOnSJbO633zzTWO/fv2MRuPT7bdiWXTGRl45xv+/1dnOzu6x7WxsbMzaA/j5+WFt/b/dvn79+hiNRg4cOEDOnDkpXrw4o0aNYtiwYXz77bcYjUaGDx+Ou7s7AAcOHMDHxwdHR0eSkpJISkrC2dkZb2/vBy45/PsulCZNmnDp0iXTpYsdO3Zw9+5d/P39AahevTrLly/HwcGB8+fPs3PnTubOnUtUVBQJCQmp2jbnz58nMjLSNM/7XnvtNby8vAgNDTUb7uXlZfq/vb092bNn5+7duwBUrFiR+Ph4/P39mT59OocPH6Z69er07dv3kWfK7mvatKnZz/Xr1+fGjRucP3+eAwcOYDQa8fX1NW3DpKQkfH19uXfvHocPHzZNV6BAATJlypSqdb+vbNmyODg40LJlSyZNmsS+fft4/fXXGTBgAM7Ozg+0z507N8uWLcPb25vw8HD279/P8uXL+fXXX0lMTDRrW7x4cbN55M6dG8B0OfPw4cPUqFEDe3t7s3W/vy8+SteuXZ
k8eTJ3797l9OnTfPfdd8yfPx/ggRr++ZkB5MmTx/SZHTp0CIA333zTNN7a2ho/P7/HLh9SPv9x48axa9cugoKC8Pf
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# shows on which island species live on a count plot\n",
"sns.set_style(\"whitegrid\")\n",
"\n",
"sns.countplot(x=\"Island\", hue=\"Species\", data=df_categorical)\n",
"\n",
"plt.title(\"Observations per Island and Species\")\n",
"plt.xlabel(\"Island\")\n",
"plt.ylabel(\"Observations\")\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([<AxesSubplot:ylabel='Adelie Penguin (Pygoscelis adeliae)'>,\n",
" <AxesSubplot:ylabel='Chinstrap penguin (Pygoscelis antarctica)'>,\n",
" <AxesSubplot:ylabel='Gentoo penguin (Pygoscelis papua)'>],\n",
" dtype=object)"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABjMAAAHbCAYAAACOdGgyAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3zT1f7H8VdGB6VsCmUPURwgG1SGiIoKuABx8vM6UNTr4IoTBygqKuIWUIYDwYkIIoKMskehzDJLmYUuOtOdfL+/PwJVFJRq22+avJ+PRy80SZN36iU5OZ9zPsdmmqaJiIiIiIiIiIiIiIiIj7JbHUBEREREREREREREROSvqJghIiIiIiIiIiIiIiI+TcUMERERERERERERERHxaSpmiIiIiIiIiIiIiIiIT1MxQ0REREREREREREREfJqKGSIiIiIiIiIiIiIi4tNUzBAREREREREREREREZ+mYoaIiIiIiIiIiIiIiPg0FTNERERERERERERERMSnqZghIiIiIiIiIiIiIiI+TcUMERERERERERERERHxaSpmiIiIiIiIiIiIiIiIT1MxQ0REREREREREREREfJqKGSIiIiIiIiIiIiIi4tOcVgcQERERERGRwObxeCgqKrI6ht8KCgrC4XBYHUNERMRnaSxStkprLKJihoiIiIiIiFjCNE0SExPJyMiwOorfq169OpGRkdhsNqujiIiI+AyNRcpPaYxFVMwQERERERERS5yYPKhTpw5hYWGaaC8DpmmSm5tLcnIyAPXq1bM4kYiIiO/QWKTsleZYRMUMERERERERKXcej6d48qBWrVpWx/FrlSpVAiA5OZk6deqo5ZSIiAgai5Sn0hqL6ABwERERERERKXcn+lKHhYVZnCQwnPg9qx+4iIiIl8Yi5as0xiIqZoiIiIiIiIhl1M6hfOj3LCIicmp6jywfpfF7VjFDRERERERERERERER8mooZIiIiIiIi4jsMj38/noiIiPg8j2H65WNVdDoAXERERERERHyH3QHf3wupu8v+sWqfAwMmlehHevXqRUpKCk6n9+O0aZo0adKEO+64g5tuuqksUoqIiEg5c9htPPrVRuKSXWX6OC3qhPPuLe1K9DMtW7YkJCQEh8OBaZoEBQXRsWNHXnjhBerVqwdA3759uf/++7nuuuvKIrZlVMwQERERERER35K6G45utjrFaY0aNYr+/fsDUFhYSFRUFM888wzp6encd999FqcTERGR0hCX7CL2SJbVMU7pk08+oUuXLgC4XC6GDx/OE088wbRp0wCYO3eulfHKjNpMiYiIiIiIiPxDwcHB9O7dm6eeeooPPvgAl8tFy5YtGT16NF26dGHo0KEArFq1ioEDB9KxY0f69u3L7Nmzi+/D5XLx3HPP0bt3b9q2bUv37t2ZMGFC8fW9evVi6tSpXHfddbRp04Zbb72V2NhYhgwZQrt27ejTpw9btmwp9+cuIiIi1gsPD2fQoEFs27at+LJevXoxc+ZMAKKjo+nfvz8dO3bkyiuv5JVXXsHtdgOQlpbG8OHD6dSpE126dGHYsGFkZmYCkJCQwGOPPcbFF19M165defzxx0lOTi5+jNjYWAYPHkynTp3o3bs3n376KaZZti2zVMwQERERERER+Zd69uxJQUEBMTExABw8eJCoqCjeeOMNdu7cyQMPPMB9993H2rVrefnll3n11VdZvnw5AGPHjuXw4cN89913bNy4keeee463336bAwcOFN//t99+y8cff8zKlStJS0tj8ODBPPjgg6xdu5ZzzjmHsWPHWvK8RURExFqZmZnMnTuX3r17n/L6J598ksGDB7N+/XqmTp3KL7/8wqJFiwB49NFHcblcLFiwgEWLFpGVlcWoUaMoKiri7rvvxuFwsGDBAubNmwfA0KFDcbvdJCUlceedd3L11VezatUqPvroI6ZPn87XX39dps9VbaZERERERERE/qUaNWoAkJGRAUC/fv
2oVKkSlSpVYty4cVx++eXFkwzt27dn0KBBfPnll3Tv3p2HH34Yh8NBeHg4iYmJhISEAJCcnEyTJk0AGDBgAJGRkQBceOGFuFwu2rXz9tju1q0b48ePL8+nKyIiIhYaOnQoDocDwzDIycmhSpUqTJw48ZS3DQkJYd68eVSvXp1OnTqxdOlS7HY7CQkJrFu3jl9++aV4HDNmzBgyMjJYv349hw4d4vvvvyc8PBzwttns3Lkz27ZtIzo6mrPOOovbb78dgBYtWnDPPfcwbdo0brnlljJ73ipmiIiIiIiIiPxLaWlpANSqVQuAOnXqFF+XkJDAmjVr6NixY/FlHo+Hxo0bA3Ds2DFeeeUVtm/fTsOGDWnVqhUAhmEU37569erFf3c4HFSrVq34e7vdXuZtHURERMR3TJgwofjMjPz8fL788kvuvPNOvv76ay644IKTbvvZZ5/x/vvvM2rUKFJSUujevTsjR44kJSUFgAYNGhTfNiIigoiICHbt2kWNGjWKCxngbWdVvXp1EhISSEhIIDY29qSxjWEYOByOsnzaKmaIiIiIiIiI/FuLFy8mLCyMNm3aAGCz2Yqvi4yM5MYbb+Sll14qviw5Obm4APHoo4/Sq1cvJk+ejNPpJD09nW+++eak+//9/YmIiIicEBoayj333MPHH3/MqlWrTipmFBQUEBcXx8iRI3E6nezbt4/nnnuOV199lREjRgBw5MgRmjZtCkBcXBw//fQTl156Kenp6bhcruKCRnZ2Nunp6URERBAZGUmXLl2YPHly8WOlp6eTk5NTps9VxQwRERERERHxLbXPqTCPU1hYyMKFCxk3bhzDhg07aQXjCQMHDuSuu+6id+/eXHLJJRw8eJD77ruPyy67jGeeeYbs7GxCQ0NxOBykpaXxyiuvAFBUVPSv84mIiMg/06LOn9/TffEx3G43P/74I1lZWXTo0OGk62w2G//73/+45557uPvuu4mIiMDpdFKjRg3q1q1L165deeONNxgzZgx2u50333yT8PBwWrduTYsWLXjxxRcZOXIkACNHjqRx48a0b9+eBg0a8MknnzB79mz69OlDWloaDz/8MBEREXzwwQf/+jmdjooZIiIiIiIi4jsMDwyYVL6PZy9ZS4QXX3yRl19+GfD2oW7evDmjRo2iT58+p7x9mzZtGDduHOPGjePRRx+lUqVK9OvXj//9738AvPbaa7z66qtMmTKFatWq0adPH84//3x2795Nt27d/t3zExERkRLzGCbv3tKu3B7LYS/ZDswhQ4YUt3Sy2Ww0bdqUcePG0b59+5NuFxwczPjx43n99deZOHEiDoeDHj16MHz4cADGjh3LmDFjuOaaa3C73fTq1YsRI0bgdDqZOHEiY8aM4aqrrqKwsJBLLrmEqVOn4nQ6adCgAZMmTWLs2LGMHj0ah8NBz549i3d7lBWbqcaaIiIiIiIiUs7y8/PZt28fzZo1IzQ01Oo4fk+/bxERkZPpvbF8lcbv217KmUREREREREREREREREqVihkiIiIiIiIiIiIiIuLTVMwQERERERERERERERGfpmKGiIiIiIiIiIiIiIj4NBUzRERERERExDKmaVodISDo9ywiInJqeo8sH6Xxe1YxQ0RERERERMpdUFAQALm5uRYnCQwnfs8nfu8iIiKBTmOR8lUaYxFnaYUREREREREROVMOh4Pq1auTnJwMQFhYGDabzeJU/sc0TXJzc0lOTqZ69eo4HA6rI4mIiPgEjUXKR2mORWym9tGIiIiIiIiIBUzTJDExkYyMDKuj+L3q1asTGRmpSRoREZHf0Vik/JTGWETFDBEREREREbGUx+OhqKjI6hh+KygoSDsyRERE/oLGImWrtMYiKmaIiIiIiIiIiIiIiIhP0wHgIiIiIiIiIiIiIiLi01TMEBERERERERERERERn6ZihoiIiIiIiIiIiIiI+DQVM0RERERERERERERExKepmCEiIiIiIiIiIiIiIj5NxQwRERERER
EREREREfFpKmaIiIiIiIiIiIiIiIhPUzFDRERERERERERERER8mooZIiIiIiIiIiIiIiLi01TMEBERERERERERERE
"text/plain": [
"<Figure size 2000x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# shows on which island species live on a pie plot\n",
"df_categorical.groupby(['Island', 'Species']).size().unstack().plot(kind='pie', subplots=True, figsize=(20, 10), autopct='%1.1f%%')"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([<AxesSubplot:ylabel='Adelie Penguin (Pygoscelis adeliae)'>,\n",
" <AxesSubplot:ylabel='Chinstrap penguin (Pygoscelis antarctica)'>,\n",
" <AxesSubplot:ylabel='Gentoo penguin (Pygoscelis papua)'>],\n",
" dtype=object)"
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABj4AAAHbCAYAAAB7iuOCAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3xU1brG8d+emRQg9BZAqjQLVUCRHgUriqhYUI8F+7GLRw94FRWwi1jwSBEUBUWRYkFEeg+9JfQSSnoC6ZnZe98/AmgEBDTJnsw8389nLjJl7ydzD5k1+11rvYZt2zYiIiIiIiIiIiIiIiIBwOV0ABERERERERERERERkaKiwoeIiIiIiIiIiIiIiAQMFT5ERERERERERERERCRgqPAhIiIiIiIiIiIiIiIBQ4UPEREREREREREREREJGCp8iIiIiIiIiIiIiIhIwFDhQ0REREREREREREREAoYKHyIiIiIiIiIiIiIiEjBU+BARERERERERERERkYChwoeIiIiIiIiIiIiIiAQMFT5ERERERERERERERCRgqPAhIiIiIiIiIiIiIiIBQ4UPEREREREREREREREJGCp8iIiIiIiIiIiIiIhIwFDhQ0REREREREREREREAoYKHyIiIiIiIiIiIiIiEjBU+BARERERERERERERkYChwoeIiIiIiIiIiIiIiAQMFT5ERERERERERERERCRgqPAhIiIiIiIiIiIiIiIBQ4UPEREREREREREREREJGCp8iIiIiIiIiIiIiIhIwFDhQ0REREREREREREREAoYKHyIiIiIiIiIiIiIiEjBU+BARERERERERERERkYChwoeIiIiIiIiIiIiIiAQMFT5ERERERERERERERCRgqPAhIiIiIiIiIiIiIiIBQ4UPEREREREREREREREJGCp8iIiIiIiIiIiIiIhIwFDhQ0REREREREREREREAoYKHyIiIiIiIiIiIiIiEjBU+BARERERERERERERkYChwoeIiIiIiIiIiIiIiAQMFT5ERERERERERERERCRgqPAhIiIiIiIiIiIiIiIBQ4UPEREREREREREREREJGCp8iIiIiIiIiIiIiIhIwFDhQ0REREREREREREREAoYKHyIiIiIiIiIiIiIiEjBU+BARERERERERERERkYChwoeIiIiIiIiIiIiIiAQMFT5ERERERERERERERCRgqPAhIiIiIiIiIiIiIiIBQ4UPERERERGRABAVFUWLFi1o06YNbdq0oXXr1nTu3Jk33ngDy7IASEtLo1WrVlx//fUnvH7q1KlERUWd8vimafLGG29w6aWX0qZNGx5++GESExMBmDFjxvHzHrtdeOGFXHjhhcdfv379em6++WbatGlDVFQUU6ZMOf6YZVm89957dO3alYsuuoh+/fqxcuXKonprRERERCTIqPAhIiIiIiISIIYMGcLatWtZu3Yt69atY+zYsUybNo0PP/wQgClTptC1a1eSkpJYsmTJWR171KhRLFmyhO+++45FixYRHh7O4MGDAbjuuuuOn3ft2rXMmjWLSpUqMXToUAAOHz7MAw88QJ8+fYiOjmbo0KEMHz6cDRs2ADB58mTmzJnDlClTiI6O5uqrr+bBBx8kLy+vCN8dEREREQkWKnyIiIiIiIgEqGbNmtG+fXu2bNmCZVlMnjyZ3r17c/PNNzNu3LizOtaUKVO4//77qVWrFhEREQwaNIiFCxcSFxdX6Hm2bTNw4EC6d+9+fGXJ7NmzqVSpEv3798fj8dCxY0d69+7Nl19+CcCuXbuwLAvLsrBtG8MwCA8PL5o3QURERESCjgofIiIiIiIiAcjr9bJixQqWL19Op06dmDt3LqZpEhUVxW233caKFSvYunXrGR0rIyOD+Ph4mjZtevy+atWqUbFixROOMX36dHbs2MHzzz9//L7t27cXei1A48aNiY2NBeDWW28lNzeX7t2706JFC0aMGMHIkSMJCwv7uz
++iIiIiAQxFT5EREREREQCxJAhQ2jXrh3t2rWjY8eOvPrqq9xzzz3ccccdTJw48fiKi8jISHr27Mn48ePP6LhZWVkAlC1bttD94eHhxx+Dgl4do0aN4qGHHiIiIqLQ68uUKXPCa7Ozs4GCIk2HDh34+eefWbNmDQMGDODxxx8nKSnp77wNIiIiIhLkPE4HEBERERERkaLx0ksv0bdv3xPu37lzJ8uWLWPTpk2MHTsWgPz8fLxeL0899RQ1atT4y+MeK1rk5OQUuj83N5dy5cod//uKFStITEzkpptuOuH1GRkZp3ztc889x0MPPUSjRo0AePTRR5k+fTqzZs3izjvvPJMfXURERETkOK34EBERERERCXATJ06kW7du/PDDD0yfPp3p06fz888/U7duXSZOnHja11esWJGaNWuyY8eO4/clJSWRnp5eaAurX375hZ49e56wMqRp06Zs37690H07duygSZMmABw8eJD8/PxCj3s8HkJCQs76ZxURERERUeFDREREREQkgGVmZjJt2jT69etHZGRkoVu/fv2YPHny8S2nTNMkPj6+0C01NRWAvn37MmrUKOLi4sjMzGTYsGF06NCBevXqHT/X6tWrad++/QkZevbsSXJyMuPHj8fr9bJ8+XJmzpzJjTfeCEBUVNTxY3u9XiZMmEBSUhI9evQogXdIRERERAKNtroSEREREREJYFOnTiU8PJxu3bqd8FifPn149913+fbbb4mIiCA+Pv6E57Vq1YpvvvmGRx99FJ/PR//+/cnKyuLiiy9mxIgRhZ67f//+k26bVblyZcaNG8fQoUMZOXIkVapUYfDgwVxyySUAvPzyy7z33nv079+fnJwcmjVrxtixY6lZs2bRvREiIiIiEjQM27Ztp0OIiDNMy8I6+hvgj78JjKP/xwAMw8AwwGUYAFiWjYWNbf/+GuPoc90uA+Po80REREROx7ZtTMs+Ph6Bo+OQQv9R2J/vtuyCsYjbMHC5NA4RERGRs1MwFim4znHs+sZxxp//8/c7bPv3Mcyx6yZujUVE/IZWfIgEENu28Vk22AVFiD9/+Tctm8M5XlIy80jMyCMlM49cn4XXZ5FvWnhNG59p4TUt8k/y3zYQ6nYRFuIizOMmzOMizOOibKiHcmFuIsI8VCgTQoVwD+XDQ6hSLpRyYYV/zfhMC/sU+URERKR0Oz4W4dRf/rPzfaRne0nNyicxI4/UrDxSs7x4Tev4RQfbtrEpmGRh/eG/7aOPH7vP4zKoEB5CxbIhVCobQuWyoVQpF0qlMiFUKBNCeIj7pDl9pgWAx62df0VERAKNaRVMrHC7ThyL5HpNsvJ8ZOb5OJLjJT3Hy5EcHxl5XjJyfWTkesnzWr+PPY5O/Dzmj2OUUI+LiDAPEWEeyoUVXBcpf/S6SET40ftDPUSEe45PJj1dPhEpOip8iJRCf/6ybtk2CYdz2Z6Yyf60bJIy80nJzCMlM5+UrDySMgr+PJzjpaTXeJUJcRNZMZzICuHUrBBOZMWwo3+Gc06lMkRWLEOVcqHHP+xNy8a2bV2IEBER8WPHVo2G/OHzOv5wLtsSMtiXmk1KVj5pWfmk/uGWkpVHeraXPJ9VYjlD3AYVy4RQsUwolY4WRyqVCaF2pTI0qFaOxjUiaFi1HBXK/N5A22taGAZ4XBqLiIiI+CuvaRVa7WnZNqlZ+RxKz2Ffag7xR3I4mJ7LocM5xB/O5WB6LsmZeccnaJQkj8ugRvkwIiuGH78eEnn0z9qVylC7Yhmqlw8j1FMw9ji2AiVE10VE/hFtdSXix7ymhecP20dl5HrZlZTF9sQMdiVlsTMpi93JmexNyS7RiwhFzWVAzQrhNKkRQdPI8jSrWZ7za1fg3OoRx2dq+kwLdBFCRESkRHlNq9DKjXyfxZ6ULLbGZ7AzKZOdiZnsTMpiV3Imud7SOxapUMZDw6rlaFCtHA2P3hrXiKBB1XLHV68eW82iixAiIiIlyPQd3Ufq91Wc2xMyWL4rlZ
1JmexIzGR3chYJR3IdKWoUFcOA2hXL0KRGBOfWiKBxjQiaRZanSY0IyocXTNCwjm4RqrGIyJlR4UPET5hHP6DdLgP
"text/plain": [
"<Figure size 2000x1000 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# shows the distribution of species in the study\n",
"df_categorical.groupby(['studyName', 'Species']).size().unstack().plot(kind='pie', subplots=True, figsize=(20, 10), autopct='%1.1f%%')"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAloAAAGtCAYAAADOJNrWAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAADyEUlEQVR4nOzdd3iTVfvA8W+S7j3pLquljBZa9gbZiIAUBARxvK4XcQtOfEVRAf2JW0RQUEQZAoKIDAUEAdmFMlpaCm2B7tK90/z+eCAlJMEW6QDuz3X1usg5J0/O04bm7hn3Uel0Oh1CCCGEEOKGU9d3B4QQQgghblUSaAkhhBBC1BIJtIQQQgghaokEWkIIIYQQtUQCLSGEEEKIWiKBlhBCCCFELZFASwghhBCilkigJYQQQghRSyzquwMNSWVlJRUVFajValQqVX13RwghhBDVoNPpqKysxMLCArW6YY0hSaB1hYqKCqKjo+u7G0IIIYS4DmFhYVhZWdV3NwxIoHWFy1FwWFgYGo2mnnsjhBBCiOrQarVER0c3uNEskEDLwOXpQo1GI4GWEEIIcZNpiMt+Gl7oJ4QQQghxi5BASwghhBCilkigJYQQQghRS2SNVg3pdDoqKirQarX13RVxFUtLS1lbJ4QQokGRQKsGysrKSElJoaioqL67IkxQqVT4+/vj4OBQ310RQgghAAm0qq2yspIzZ86g0Wjw9fXFysqqQe5uuF3pdDoyMjI4d+4cwcHBMrIlhBCiQZBAq5rKysqorKwkICAAOzu7+u6OMMHT05OzZ89SXl4ugZYQQogGQRbD11BDTIYmFDLCKIQQoqGRqEEIIYQQopZIoCWEEEIIUUtkjdYtKDc3lw8//JBt27aRm5uLg4MDPXr04LnnnsPb27u+uyeEELe9/Wez+fj3OI6cyyHA1Y5HejUlsr1/fXdL1AIZ0boFPffcc1y8eJGffvqJqKgofv75Z8rKynjooYeoqKio7+4JIcRtLSo5h4kL9vJXfCb5JRWcSMnj+RVH+P7vxPrumqgFEmjdgg4ePMjAgQPx9PQEwMPDg1dffZV27dqRl5dHQUEBb731Fn369KFbt24899xzZGZmAvDrr78SGhpKTEwMACdOnKBt27bs2LGj3u5HCCFuJfP/PE2ZttKo/Itt8VRW6uqhR6I2SaB1Cxo2bBhvvPEGM2bMYMOGDZw/fx5PT09mz56Nm5sbr776KomJiaxevZrff/8dBwcHnnzySXQ6HcOGDWP48OG8+OKL5Obm8txzz/Hggw/Su3fv+r4tIYS4JcSm5Zssv5BbQn6pzDrcaiTQugW9/fbb/O9//yMlJYX//e9/9OvXj4EDB7Ju3TqysrLYtGkTr732Gu7u7tjb2/Pqq68SHR3N8ePHAXj99dcpKytj1KhReHp68swzz9TzHQkhxK0jyNP06RVeTtY4WsvS6VuN/ERvQWq1mpEjRzJy5Eh0Oh2nT59m7dq1vPjiizz//PMAjB071uA5Go2Gc+fOERoaip2dHaNHj+b//u//mDJliiT/FEKIG+jxPs3YGpNOxVXThP/t0xy1WvIB3mpkROsWs3PnTiIiIsjJyQGUJJ5BQUG88MILtG7dmrKyMgB+++03Dhw4oP9avXo1d9xxBwBJSUnMmzePe+65h/fee4/U1NT6uh0hhLjldGjsxrf/6UynJq5YWahp7mnPrMgwHurRtL67JmqBBFq3mE6dOuHu7s4rr7xCbGws5eXlFBQUsG7dOs6ePcvQoUPp27cv77zzDhcvXqS8vJx58+YxZswY8vLyKC8v5/nnn2fYsGG8/fbbdOrUiWnTplFZabxwUwghxPXpEeTByv9259TbQ/njhb7c2zmwvrskaokEWrcYGxsbfvjhBzw9PZk8eTIdO3akb9++rFu3jkWLFtG8eXPee+89nJycuPvuu+natSt//vknCxcuxNPTk48//piLFy/y8ssvA/DWW28RHx/P/Pnz6/nOhB
BCiJuPSqfTyV7SS7RaLVFRUYSHhxutSyopKeHMmTM0bdoUGxubeuqhuBb5GQkhxO3pWp/f9U1GtIQQQgghaokEWkIIIYQQtUQCLSGEEEKIWiKBlhBCCCFELZFASwghhBCilkigJYQQQghRSyTQEkIIIYSoJRJoCSGEEELUEgm0hBBCCCFqiQRa9UBbqWPP6SzWRp1nz+kstJW1m5w/JCSEkJAQEhISjOoWLVpESEgIn376qUF5RUUFvXv3pnv37pSWlhrU7d27l5CQELOv9+mnn9KqVSsiIiKMvv73v//dmJsSQgghbgIW9d2B283GYym8+csJUnJL9GU+zja8Mbw1Q0J9au11XV1dWbNmDS+88IJB+erVq3FwcDBqv3nzZry8vNBqtaxdu5axY8fW6PU6duzIkiVL/lWfhRBCiJudjGjVoY3HUpj8/SGDIAsgNbeEyd8fYuOxlFp77eHDh7N27VoqKyv1ZUePHqWsrIzWrVsbtf/++++58847GT9+PIsWLUKOxBRCCCFqTgKtOqKt1PHmLycwFa5cLnvzlxO1No3Yt29fysvL2b17t77sp59+YsyYMUZtY2JiOHHiBJGRkQwfPpzs7Gy2b99eK/0SQgghbmUSaNWRfWeyjUayrqQDUnJL2Hcmu1Ze38LCguHDh7NmzRoASkpK2LRpE3fffbdR2yVLljBy5EicnZ2xtbXlnnvu4ZtvvqnR6x08eJCOHTsafR04cOBG3I4QQtzUisoq+GFvEi+vOsrn2+JJzzP/+SBubrJGq46k51fvP1F1212PyMhIxo0bR0FBAb///jvt27fH09PToE1OTg7r169HrVazceNGQFkYX1BQwLFjxwgNDa3Wa3Xo0EHWaAkhhAnZhWWMnb+H+PQCfdn8P0+z9JGuhPk712PPRG2QEa060sjR5oa2ux4tW7akWbNm/Pbbb6xevdrktOFPP/1EYGAgv/32G2vXrmXt2rX8+uuv9OjRg0WLFtVa34QQ4nbx5Z+nDYIsgLySCt5af7yeeiRqkwRadaRzUzd8nG1QmalXoew+7NzUrVb7ERkZyeLFizlz5gx9+vQxqKusrOSHH35g9OjReHt7G3yNHz+ejRs3cuHCBX371NRUg6/09PRa7bsQQtwKtsea/l25/+xF8kvK67g3orbVS6CVk5PDiy++SJcuXejUqRNPPPGE/kP6yJEj3HPPPURERNCvXz9Wrlxp8Nw1a9YwcOBAwsPDiYyM5PDhw/o6rVbLnDlz6N69OxEREUyePLnBfPhr1CreGK7s7rs62Lr8+I3hrdGozYViN8Zdd91FYmIiI0aMwMLCcOZ427ZtpKenM2LECKPn9evXD1dXV7799lt9WZ8+fQy+Ro0apa87cOCAyTxapq4thBC3Ezsr06t2rCzUWFnI+MetRqWrh337kyZNwtnZmXfffRe1Ws0rr7xCWVkZ7733HoMGDeLpp59m3Lhx7N+/nylTprB48WLatm3L3r17mTx5MgsWLKBt27YsXbqUL7/8km3btmFra8tnn33G5s2bmT9/Po6Ojrz++usUFhby1VdfVatfWq2WqKgowsPD0Wg0BnUlJSWcOXOGpk2bYmNz/dN79ZVH63Zwo35GQghRE+n5JXyw6RRbTqZhpVEzMsKXZ/u3wNZKY7L9938nMv3nY0blI9r54mZvxa/RKeh0MCzMm+cHhuBsZ1nbt3DTu9bnd32r88Xwx44d48iRI+zevVufKHPmzJlkZGSwefNmXFxcmDhxIgDdunVj+PDhLF26lLZt27Jy5UqGDRtGhw4dAHjwwQdZvnw5GzZsYPTo0axcuZKpU6fi46MELK+99ho9e/YkOTmZgICAur5Vk4aE+jCwtTf7zmSTnl9CI0dlurC2R7KEEELceCXlWsZ/9TcJGYX6svl/JhCTks+3/+ls8jkTuwQSl5bP93uT9Cl9ejR350xmIeuOVC3P+HZPIoeSclg7pQdq+Yy4adV5oHX06FGCgoJYsWIFP/74I8
XFxfTq1YuXXnqJuLg4WrRoYdA+KCiIn376CYD4+HhGjx5tVB8TE0N+fj6pqakGz/fw8MDZ2ZnY2NgGE2iBMo3Yrbl
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# shows the correlation between \"Body Mass (g)\" and \"Species\" in a swarm plot\n",
"warnings.filterwarnings('ignore')\n",
"sns.swarmplot(x=\"Species\", y=\"Body Mass (g)\", data=penguin_manager.get_df(), hue=\"Sex\")\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAksAAAGtCAYAAAAGSDAAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAADDc0lEQVR4nOzdd3hU1brH8e9MJj0hvUBIgFRaIPTeO4ggIIgoCIiIFbDrsR69okexYAFFUIqCggUFFaV36T0VQmgppJJG2tw/NiCTvSdmMCHt/TzPPPey1p6ZNckc95u91/otndFoNCKEEEIIITTpq3oAQgghhBDVmRRLQgghhBBlkGJJCCGEEKIMUiwJIYQQQpRBiiUhhBBCiDJIsSSEEEIIUQYploQQQgghyiDFkhBCCCFEGQxVPYCKVlJSQlFREXq9Hp1OV9XDEUIIIUQ5GI1GSkpKMBgM6PXV61pOrSuWioqKOHr0aFUPQwghhBA3ITw8HBsbm6oeholaVyxdq0bDw8OxsrKq4tEIIYQQojyKi4s5evRotbuqBLWwWLp2683KykqKJSGEEKKGqY5TaKpf+SaEEEIIUY1IsSSEEEIIUQYploQQQgghylDr5iwJIYSo24xGI0VFRRQXF1f1UMQNrKysMBgM1XJO0j+RYkkIIUStUVBQwMWLF8nNza3qoQgNDg4O1K9fv9pFA/wTKZaEEELUCiUlJZw+fRorKysaNGiAjY1NjbyKURsZjUYKCgpISUnh9OnThISEVMuIAHOkWBJCCFErFBQUUFJSgr+/Pw4ODlU9HFGKvb091tbWnDlzhoKCAuzs7Kp6SOVWc8o6IYQQohxq0hWLuqam/m5q5qiFEEIIIW4RKZaEEEIIIcogc5aEyrn0XN5dH82mqGQcbQyMadeQh/sEY2OQ2loIISpCZmYm7733Hps2bSIzMxMnJye6devGrFmz8PX1rerhiVLk7CdMZOYVMnb+Ln44eJ6M3ELOZ+TxwYYYnlp1uKqHJoQQtcasWbNIT09n1apVHDp0iB9//JGCggImT55MUVFRVQ9PlCLFkjCxev85LmTmq9rXHL5A/KWcKhiREELUPvv372fAgAF4eXkB4OnpyfPPP0/r1q3JysoiOzub1157jV69etGlSxdmzZrFpUuXAFi7di0tW7YkMjISgBMnTtCqVSu2bt1aZZ+ntpNiSZiISb6s2W40Qkxy9i0ejRBC1E7Dhg3j5Zdf5pVXXmHdunWcP38eLy8v5syZg7u7O88//zxnzpzh+++/588//8TJyYlHHnkEo9HIsGHDGD58OE8//TSZmZnMmjWL++67j549e1b1x6q1pFgSJoK8nMroc7yFIxFCiNrr9ddf56WXXuLixYu89NJL9O3blwEDBrBmzRpSU1P5/fffeeGFF/Dw8MDR0ZHnn3+eo0ePcvz4cQBefPFFCgoKuOOOO/Dy8uLxxx+v4k9Uu8kEb2FiTLuGfLb1FMmXr5i0Dw33JbCMQkoIIUT56fV6RowYwYgRIzAajcTFxfHTTz/x9NNPM3v2bADGjh1r8hwrKyvOnTtHy5YtcXBwYPTo0bzzzjs8/PDDWFlZVcXHqDPkypIw4epgw7fTuzAsvD721lZ4OtnyYK8g5o6NqOqhCSFErbBt2zbatGlDRkYGADqdjuDgYJ544gmaN29OQUEBAL/++iv79u27/vj+++/p06cPAAkJCXz66afceeedvP322yQmJlbVx6kTpFgSKo09Hfl4QltO/ncw+/7Tn2eHNMXOWv5qEUKIitChQwc8PDx47rnniIqKorCwkOzsbNasWUN8fDxDhgyhd+/evPHGG6Snp1NYWMinn37KmDFjyMrKorCwkNmzZzNs2DBef/11OnTowFNPPUVJSUlVf7RaS4olIYQQ4hays7Pj66+/xsvLixkzZtC+fXt69+7NmjVrWLx4MUFBQbz99tvUq1ePkSNH0rlzZ7Zs2cLChQvx8vLigw8+ID09nW
effRaA1157jdjYWBYsWFDFn6z20hmNRmNVD6IiFRcXc+jQISIiIuQerhBC1CH5+fmcPn2aJk2a1KhNWuuSsn5H1fn8LVeWhBBCCCHKIMWSEEIIIUQZJDqglttzKpUPNsRw9HwmAe4OTOsRyMg2fhX+PvmFxXy4IYYfDp7nSlEJ/Zp68+SgMHzqyaVwIYQQNZsUS7XYgYR07vliD4XFyrS04xeymLnyEHmFxYzvGFCh7/XI1wf582TS9X9/t/8cf8Wn8dvjPbG3qV73noUQQghLyG24Wmz+5rjrhdKNPtoYS0XO6z95McukULrmTGouaw6fr7D3EUIIIaqCFEu1WHSS9j5v5zPyyCkorvT3UfpkPzkhhBA1m9yGq8WCvZ2IT81VtTdwseNQQjrzNsZy4mIWTTwdmd4ziGGt6gNwLj2Xd9dHsykqGUcbA2PaNeThPsHYGPQUFJXw0aZYVu8/R05BEX3DvLmtdX2zYyhrrzkhhBCiJpBiqRab3iuIzVEpFJWY3nIb1NKXSYv3Uny1/ci5TB7++gCFxRH0aerN2Pm7uJCZD0BGbiEfbIghPjWHD+5qwxPfHebnwxeuv9b3B8+z53Qa3YI82BGXavI+fq72jIhoUMmfUgghhKhcchuuFuvQ2J3FkzvQvpEbNgY9wd5OvDU6nNjk7OuF0o3mbYxh9f5z1wulG605fIHtMSn8cuSCqu98Rh69wryY2r0JHo42ONpYMSKiAd8+2AVHW6nHhRBC1GxyJqvleoR40SPEy6Rt7h/RmsfGpeQQlZSl2Wc0ws64VMzNC49PzeX/7gjnxdua/6vxCiFEVSsuMfLX6TSSL+fj7WxHxybuWOl1lfZ+YWFhgLJxbmBgoEnf4sWLmTNnDo888giPPvro9faioiL69u1LUVERmzZtwtbW9nrfnj17mDhxIlFRUZrvN2/ePD755BPNlPPhw4fz2muvVcTHqlWkWKqDgr2dSMq6ompv4ulIiLez2ed1bOzOJ8Rp9sncJCFEbfDbsYu8+vMJLt5whb2+ix0vD2/O4Jbm52f+W25ubvzwww888cQTJu3ff/89Tk7q/76uX78eHx8fiouL+emnnxg7dqxF79e+fXuWLl36r8Zcl8htuDrowV5BaP2RNKN3EGPaNcTb2VbVNzTcl95NvRncwlfV51PPljFtG1bGUIUQ4pb57dhFZiw7YFIoASRm5jNj2QF+O3ax0t57+PDh/PTTT5SUlFxvO3LkCAUFBTRvrr5iv2zZMoYOHcpdd93F4sWLKzQORqhJsVQH9QjxYuGk9kT4u2Jj0BPm48zcsa0Z294fVwcbvp3ehWHh9bG3tsLTyZYHewUxd2wEAO/fFcH0XoF4OtniYGPFsFb1+XZ6F1wcrKv2QwkhxL9QXGLk1Z9PoFVyXGt79ecTmvM9K0Lv3r0pLCxk586d19tWrVrFmDFjVMdGRkZy4sQJRo0axfDhw0lLS2Pz5s2VMi6hkNtwdVTfpj70beqj2dfY05GPJ7TV7LOztuK5Ic14bkizyhyeEELcUn+dTlNdUbqREbiYmc9fp9PoEuRR4e9vMBgYPnw4P/zwA927dyc/P5/ff/+dX375ha1bt5ocu3TpUkaMGIGLiwsAd955J4sWLaJPnz7lfr/9+/fTvn17Vfv8+fM12+s6KZaEEELUecmXzRdKN3PczRg1ahTjxo0jOzubP//8k7Zt2+LlZbpAJyMjg19++QW9Xs9vv/0GKJO9s7OzOXbsGC1btizXe7Vr107mLFlAiiUhhBB1nrdz+Tb9Lu9xN6Np06YEBgby66+/8vPPPzNp0iTVMatWrSIgIIDPP//cpP35559n8eLFvPvuu5U2vrpMiiUhhBB1Xscm7tR3sSMxM19z3pIO8HVRYgQq06hRo/jyyy/JysqiV69eJn0lJSV8/fXXTJw4EV9f08U2d911F7NmzTJZTZeYmGhyjF6vx9vbu/IGX4tJsSSEEKLOs9LreHl4c2YsO4AOTAqma4uHXx
7evFLzlgBuu+023nrrLSZNmoTBYHqK3rRpE8nJydx+++2q5/Xt2xc3Nze++uor+vbtC6Aqtjw9PdmxYwcA+/bto02
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# shows the correlation between \"Culmen Depth (mm)\" and \"Species\" in a swarm plot\n",
"warnings.filterwarnings('ignore')\n",
"sns.swarmplot(x=\"Species\", y=\"Culmen Depth (mm)\", data=penguin_manager.get_df(), hue=\"Sex\")\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Swarm plot of \"Delta 15 N (o/oo)\" per species, coloured by sex.\n",
"# Warnings are suppressed only while the plot is drawn.\n",
"with warnings.catch_warnings():\n",
"    warnings.simplefilter('ignore')\n",
"    sns.swarmplot(x=\"Species\", y=\"Delta 15 N (o/oo)\", data=penguin_manager.get_df(), hue=\"Sex\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The species are imbalanced: there are roughly twice as many Adelie as Chinstrap penguins.\n",
"\n",
"Moreover, the Gentoo lives only on Biscoe island, the Chinstrap only on Dream island, and the Adelie lives on all three islands. So if we want to predict the island, the species feature will be very informative for the first two species and almost useless for the last.\n",
"\n",
"The studies seem to have a reasonable balance of species.\n",
"\n",
"For comparison with the numerical features, I have chosen the body mass, the culmen depth and the delta 15 N. The Gentoo seems to sit in specific ranges that differ from those of the other two species. Also, using \"Sex\" as hue, it seems that this feature may be correlated with body mass (higher for males) and culmen depth (higher for males). So, even if these penguins could change sex, it would be a rare event that would also influence body mass and culmen depth. Therefore, sex could be used for the model.\n",
"\n",
"### Island"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='Island', ylabel='count'>"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(x='Island', data=df_categorical)"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='Island', ylabel='count'>"
]
},
"execution_count": 148,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(x='Island', hue='studyName', data=df_categorical)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The islands are also strongly imbalanced: Biscoe is the most represented, and it is also the only island where the Gentoo lives. Torgersen island is underrepresented.\n",
"\n",
"Studies are slightly imbalanced if grouped by island.\n",
"\n",
"### Clutch Completion\n"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='Clutch Completion', ylabel='count'>"
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(x='Clutch Completion', data=df_categorical)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### Sex"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='Sex', ylabel='count'>"
]
},
"execution_count": 150,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(x='Sex', data=penguin_manager.get_df())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overall, the dataset is imbalanced in the most important categorical features, which is something to keep in mind while training the model.\n",
"\n",
"Sex is balanced, but it is a less important category, and the other features are not balanced. In particular, I would not use the \"Clutch Completion\" feature because it is strongly imbalanced.\n",
"Also, Torgersen island is underrepresented."
]
},
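{
"cell_type": "markdown",
"metadata": {},
"source": [
"The imbalance noted above can be quantified rather than read off the plots. The cell below is a minimal sketch on a small made-up frame (the `toy_counts` data and its values are illustrative assumptions, not the penguin dataset): `value_counts(normalize=True)` gives the class shares, and `pd.crosstab` shows which categories never occur together."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, only meant to illustrate the two calls\n",
"toy_counts = pd.DataFrame({\n",
"    'Species': ['Adelie'] * 4 + ['Gentoo'] * 3 + ['Chinstrap'] * 2,\n",
"    'Island': ['Torgersen', 'Biscoe', 'Dream', 'Dream',\n",
"               'Biscoe', 'Biscoe', 'Biscoe',\n",
"               'Dream', 'Dream'],\n",
"})\n",
"\n",
"# Share of each class: a quick numeric check of imbalance\n",
"print(toy_counts['Species'].value_counts(normalize=True))\n",
"\n",
"# Co-occurrence table: zeros mark species/island pairs that never appear together\n",
"print(pd.crosstab(toy_counts['Species'], toy_counts['Island']))"
]
},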
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9) Impute missing values\n",
"\n",
"At this point I know there is a correlation between species, anatomical measurements and sex / isotope values, so I will impute the missing values.\n",
"This is not meant to be a perfect imputation, just a way to get a complete dataset to work with and a reference for the future.\n",
"\n",
"There are various approaches:\n",
"\n",
"- Impute with the mean\n",
"- Impute with the median\n",
"- Impute with the mode\n",
"- Impute with a constant value\n",
"- Impute using forward filling or back-filling\n",
"- Impute with a value estimated by another predictive model\n",
"\n",
"I will use the average for the 2 deltas."
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-11</td>\n",
" <td>39.1</td>\n",
" <td>18.7</td>\n",
" <td>181.0</td>\n",
" <td>3750.0</td>\n",
" <td>MALE</td>\n",
" <td>8.733382</td>\n",
" <td>-25.686292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A2</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-11</td>\n",
" <td>39.5</td>\n",
" <td>17.4</td>\n",
" <td>186.0</td>\n",
" <td>3800.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.949560</td>\n",
" <td>-24.694540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>40.3</td>\n",
" <td>18.0</td>\n",
" <td>195.0</td>\n",
" <td>3250.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.368210</td>\n",
" <td>-25.333020</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>36.7</td>\n",
" <td>19.3</td>\n",
" <td>193.0</td>\n",
" <td>3450.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.766510</td>\n",
" <td>-25.324260</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A2</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>39.3</td>\n",
" <td>20.6</td>\n",
" <td>190.0</td>\n",
" <td>3650.0</td>\n",
" <td>MALE</td>\n",
" <td>8.664960</td>\n",
" <td>-25.298050</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>337</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N38A1</td>\n",
" <td>No</td>\n",
" <td>2009-12-01</td>\n",
" <td>47.2</td>\n",
" <td>13.7</td>\n",
" <td>214.0</td>\n",
" <td>4925.0</td>\n",
" <td>FEMALE</td>\n",
" <td>7.991840</td>\n",
" <td>-26.205380</td>\n",
" </tr>\n",
" <tr>\n",
" <th>338</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N39A1</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>46.8</td>\n",
" <td>14.3</td>\n",
" <td>215.0</td>\n",
" <td>4850.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.411510</td>\n",
" <td>-26.138320</td>\n",
" </tr>\n",
" <tr>\n",
" <th>339</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N39A2</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>50.4</td>\n",
" <td>15.7</td>\n",
" <td>222.0</td>\n",
" <td>5750.0</td>\n",
" <td>MALE</td>\n",
" <td>8.301660</td>\n",
" <td>-26.041170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>340</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N43A1</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>45.2</td>\n",
" <td>14.8</td>\n",
" <td>212.0</td>\n",
" <td>5200.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.242460</td>\n",
" <td>-26.119690</td>\n",
" </tr>\n",
" <tr>\n",
" <th>341</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N43A2</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>49.9</td>\n",
" <td>16.1</td>\n",
" <td>213.0</td>\n",
" <td>5400.0</td>\n",
" <td>MALE</td>\n",
" <td>8.363900</td>\n",
" <td>-26.155310</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>342 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" studyName Species Island Individual ID \\\n",
"0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 \n",
"1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 \n",
"2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 \n",
"3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 \n",
"4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A2 \n",
".. ... ... ... ... \n",
"337 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A1 \n",
"338 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A1 \n",
"339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A2 \n",
"340 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A1 \n",
"341 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A2 \n",
"\n",
" Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) \\\n",
"0 Yes 2007-11-11 39.1 18.7 \n",
"1 Yes 2007-11-11 39.5 17.4 \n",
"2 Yes 2007-11-16 40.3 18.0 \n",
"3 Yes 2007-11-16 36.7 19.3 \n",
"4 Yes 2007-11-16 39.3 20.6 \n",
".. ... ... ... ... \n",
"337 No 2009-12-01 47.2 13.7 \n",
"338 Yes 2009-11-22 46.8 14.3 \n",
"339 Yes 2009-11-22 50.4 15.7 \n",
"340 Yes 2009-11-22 45.2 14.8 \n",
"341 Yes 2009-11-22 49.9 16.1 \n",
"\n",
" Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) \\\n",
"0 181.0 3750.0 MALE 8.733382 \n",
"1 186.0 3800.0 FEMALE 8.949560 \n",
"2 195.0 3250.0 FEMALE 8.368210 \n",
"3 193.0 3450.0 FEMALE 8.766510 \n",
"4 190.0 3650.0 MALE 8.664960 \n",
".. ... ... ... ... \n",
"337 214.0 4925.0 FEMALE 7.991840 \n",
"338 215.0 4850.0 FEMALE 8.411510 \n",
"339 222.0 5750.0 MALE 8.301660 \n",
"340 212.0 5200.0 FEMALE 8.242460 \n",
"341 213.0 5400.0 MALE 8.363900 \n",
"\n",
" Delta 13 C (o/oo) \n",
"0 -25.686292 \n",
"1 -24.694540 \n",
"2 -25.333020 \n",
"3 -25.324260 \n",
"4 -25.298050 \n",
".. ... \n",
"337 -26.205380 \n",
"338 -26.138320 \n",
"339 -26.041170 \n",
"340 -26.119690 \n",
"341 -26.155310 \n",
"\n",
"[342 rows x 13 columns]"
]
},
"execution_count": 151,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fill the two delta columns with their column means\n",
"df_filling = penguin_manager.get_df(as_copy=True)\n",
"\n",
"delta_15n_mean = df_filling['Delta 15 N (o/oo)'].mean()\n",
"delta_13c_mean = df_filling['Delta 13 C (o/oo)'].mean()\n",
"\n",
"# Assign the result back instead of calling fillna(inplace=True) on a column\n",
"# selection, which may not update the frame in recent pandas versions\n",
"df_filling['Delta 15 N (o/oo)'] = df_filling['Delta 15 N (o/oo)'].fillna(delta_15n_mean)\n",
"df_filling['Delta 13 C (o/oo)'] = df_filling['Delta 13 C (o/oo)'].fillna(delta_13c_mean)\n",
"\n",
"# Display the modified dataframe\n",
"df_filling"
]
},
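{
"cell_type": "markdown",
"metadata": {},
"source": [
"For future reference, a few of the other strategies listed above can be sketched on a toy frame. This is a minimal illustration under assumptions: the `toy_gaps` frame, its `Delta` column and its values are made up, and the group-wise mean (filling each gap with the mean of its own species) is an alternative suggested by the species correlation, not the method used in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with two gaps in the numeric column\n",
"toy_gaps = pd.DataFrame({\n",
"    'Species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],\n",
"    'Delta': [8.0, 9.0, np.nan, 7.0, np.nan, 7.5],\n",
"})\n",
"\n",
"# Global mean and median: the same value is used for every gap\n",
"filled_mean = toy_gaps['Delta'].fillna(toy_gaps['Delta'].mean())\n",
"filled_median = toy_gaps['Delta'].fillna(toy_gaps['Delta'].median())\n",
"\n",
"# Group-wise mean: each gap gets the mean of its own species\n",
"species_mean = toy_gaps.groupby('Species')['Delta'].transform('mean')\n",
"filled_by_species = toy_gaps['Delta'].fillna(species_mean)\n",
"\n",
"# Side-by-side comparison of the three strategies\n",
"comparison = pd.DataFrame({\n",
"    'mean': filled_mean,\n",
"    'median': filled_median,\n",
"    'by_species': filled_by_species,\n",
"})\n",
"print(comparison)"
]
},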
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB/MAAAPQCAYAAADKHVi5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1RU1/738ffA0KsoEBVsiIC9G8Wa2EvU2HvBbtTYuzFq7IoooCBgw16jJvbeC3aNNSLYCyjSGWaeP3zmXEZNbnJ/KoLf11p3Gaece/aa7Tn77M8uKp1Op0MIIYQQQgghhBBCCCGEEEIIIYQQnw2jzD4BIYQQQgghhBBCCCGEEEIIIYQQQhiSMF8IIYQQQgghhBBCCCGEEEIIIYT4zEiYL4QQQgghhBBCCCGEEEIIIYQQQnxmJMwXQgghhBBCCCGEEEIIIYQQQgghPjMS5gshhBBCCCGEEEIIIYQQQgghhBCfGQnzhRBCCCGEEEIIIYQQQgghhBBCiM+MhPlCCCGEEEIIIYQQQgghhBBCCCHEZ0bCfCGEEEIIIYQQQgghhBBCCCGEEOIzI2G+EEIIIYQQQgghhBBCCCGEEEII8ZmRMF8IIYQQQgghhBBCCCGEEEIIIYT4zEiYL4QQQgghhBBCCCGEEEIIIYQQQnxmJMwXQgghhBBCCCFEtqbVajP7FIQQQgghhBBCiH9NndknIIQQQgghhBBCCPEhXbhwgZs3b+Lm5oaHhwfm5uYYGcl8BiGEEEIIIYQQWYuE+UIIIYQQQgghhMg2Xr16RVhYGEePHkWtVmNqakrJkiVp3LgxZcqUIXfu3ADodDpUKlUmn60QQgghhBBCCPHXVDqdTpfZJyGEEEIIIYQQQgjxoaSlpZGSksKFCxf4/fffOXDgALGxsXh5edGoUSN69OiR2acohBBCCCGEEEL8VxLmCyGEEEIIIYQQIlvQz7ZPT0/H2NhYef3SpUscPnyYRYsWodFoqFWrFoMHD8bNzc3gc0IIIYQQQgghxOdEwnwhhBBCCCGEEEJkS+8L9SdMmMD169cpVaoU/fr1o3LlypiammbiWQohhBBCCCGEEO8nYb4QQgghhBBCCCGyPa1Wi5GREZGRkfj5+bFjxw68vLwYMWIElStXzuzTEx+J/nfXr9ogBEi9EEIIIYQQWYc6s09ACCGEEEIIIYQQ4v9KH879FX1wV6BAAYYNG4axsTHbt29n4cKFeHh44ODgIMFeNnHjxg2sra2xsrLC3t4eQPld5Tf+cp0/f54nT55QoEABcuXKRa5cuaReCCGEEEKIz57MzBdCCCGEEEIIIUSWlnE5/WvXrpGeno5Op6NkyZLvfFYf2t27d49x48Zx5swZ2rVrx08//fSpT1t8BM+fP6dt27bExsaiUqkoX7485cqVo06dOuTNmxcTExMJbr9Aqamp+Pj4cP/+fR49eoSjoyNff/013t7efPvtt9jY2GT2KQohhBBCCPFeEuYLIYQQQgghhBAiy8oYzAYHB7N8+XISEhJISkpiyJAhtGzZEgcHh/d+58aNG3To0IH4+HiCg4OpXr16ZhRBfECpqanEx8dz69YtLly4wNatW7lz5w52dnZUrFiRESNGkCdPHmXwh/hyaDQakpKSOHv2LHv37mXfvn28fPmS/Pnz88MPP1CmTBlcXFwy+zSFEEIIIYQwIGG+EEIIIYQQQgghsrxFixYxb9487O3t8fLy4sSJEwB06tSJHj164OzsbPB5faC/efNmxowZQ8+ePRkyZEhmnLr4iB49esTNmzeZN28ef/zxB46OjnTq1ImGDRtKcPsFeXs1hvT0dF6+fMm8efPYvXs3ycnJVKhQgb59+1KuXLlMPFMhhBBCCCEMSZgvhBBCCCGEEEKILO3Vq1d0794da2trRo0ahZeXF1u3bmXu3Lk8fvz4LwN9gPv37zN06FAePnzI+vXr+eqrrzKhBOJDezu8TU5Oxt/fn23btvHy5UsaNW
pEly5d8PDwyMSzFJlFq9ViZGREWloahw4dYt26dRw+fJi8efMyYcIEatSokdmnKIQQQgghBABGmX0CQgghhBBCCCGEEP+GVqs1+HtCQgK3bt2iWbNmeHl5AfDdd98xevRo8uTJw4oVKwgJCeHJkyfKd/RzG1xcXKhduzbPnj3j/v37Bu+JrOvtWdjm5uYMGjSIkSNH4unpyaZNmwgICOD27duZeJYisxgZGaHVajExMeHbb79l4sSJtG7dmgcPHjBixAiOHTuW2acohBBCCCEEIGG+EEIIIYQQQgghspD09HSMjN50Zxw4cID169ezevVq8uXLR+HChYE3+6YD1KtXj5EjR7430FepVKSnpwPQrl07cufOza5duzKhROJDeN8ADP2gD2NjYyW4rV+/PoMGDaJs2bLs27ePtWvXEhMT86lPV3wibw/80dPpdMp1RKVSkSdPHkaNGkX79u159eoVM2fO5I8//viUpyqEEEIIIcR7qTP7BIQQQgghhBBCCCH+CZ1Oh7GxMQDz588nMDDQ4P1t27ZRokQJTE1N0Wg0qNVq6tWrB8CMGTNYsWIFRkZGdO/eHWdnZ+VYpqamFC5cmLt37wKGs7rF5y89PV35LV+9ekVCQgJ58uRRwlp9cKv/8+uvvyY+Pp558+bx66+/UrVqVWrUqKEsvS6yh4z1IioqipcvX5IrVy4cHBwwNzd/5/e2tLSkT58+xMbGsmPHDrZu3UqhQoUwNTWVa4IQQgghhMg08oQihBBCCCGEEEKILEEfqC1dupTAwEDy589P+/btKVSoEGZmZuzatYstW7ag0+lQq9VoNBrgPzP0XV1dWbZsGQsWLCAxMRF4E/SamppStWpVnjx5QmJioiyzn4VotVolsF2yZAldu3alTp069OzZkx07dpCUlIRKpUKn0yl/GhkZ8c0339C6dWvi4uLw8/MjOTlZgvxsJOPAn8DAQNq3b0/r1q3p0qULM2fOJCYmRllqPyMnJye6dOlC4cKF+fXXX3n06JFSb4QQQgghhMgM8pQihBBCCCGEEEKIz5p+OXx4s4T+mTNnKFq0KH5+fkyYMIE5c+bQvHlzYmJiWLJkCb///vt7A/0hQ4ZgZWWFm5sblpaWwH8GCJQpU4bAwEAsLS1lFm4Wog/g58+fz4wZM4iMjMTS0pIjR44wd+5cNm7cSGJi4juBvlqtpl27dlSpUoVr166xbds24K+XZRdZi/7fcHBwMPPnz0elUlGsWDESExNZtWoVkydP/stAv3jx4tSpU4eYmBiCg4PRarVyTRBCCCGEEJlGpZOhpUIIIYQQQgghhMgC9Mvk+/r60rNnT3r37q28d/fuXZYtW8bGjRspWLAgvXv3pmHDhqhUKmXJfYB79+6RP39+ACXcFVlbZGQknTt3xt3dnWHDhmFtbU14eDibNm3CysqKHj168P3332Npaan85vol2O/evUu7du2oWrUqs2fPzuyiiA9Eq9Xy8uVLunbtip2dHRMnTsTFxYWjR48yb948bt26Rf369ZkwYQIODg7Kkvv6+pGYmEjbtm1JT09n8+bNmJqaZnaRhBBCCCHEF0pm5gshhBBCCCGEEOKzd+HCBX755RcWL16Mubk5OXPmBN7M1AcoWLAgXbt2pUWLFty9e5egoCCDGfppaWkASpAvs22zrrdnUr948YKnT5/Sr18/vLy8cHV1xcfHh86dO5OQkEBISAibNm0ymKFvbGyMTqfD0dGRChUqsH//fm7dupVJJRIfQsZ6YWRkRGJiIo8fP6Zbt264ublhZmZG9erVGTduHO7u7uzcuZNJkyYZzNDXD/SwtLSkffv23Llzh/3792diqYQQQgghxJdOwnwhhBBCCCGEEEJ89goUKMD48eMxMjLi+fPnbN68meTkZExNTZWl9AsUKGAQ6IeGhrJ9+3Z0Oh0mJiYGx5P90bOm9PR05be7fv06J0+e5MGDB+TNm5f8+fOj0WhIT0/HycmJdu3a/W2gr1KpsLa2pm
nTpiQmJvLw4cNMLp34X2WsF0eOHGHjxo3s2rWL9PR0LCwsAJTrQPny5f820Dc2NgagdOnSWFpaEhkZmVnFEkIIIYQ
"text/plain": [
"<Figure size 2500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# check for empties with missingno\n",
"warnings.filterwarnings('ignore')\n",
"msno.matrix(df_filling)\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"studyName 0\n",
"Species 0\n",
"Island 0\n",
"Individual ID 0\n",
"Clutch Completion 0\n",
"Date Egg 0\n",
"Culmen Length (mm) 0\n",
"Culmen Depth (mm) 0\n",
"Flipper Length (mm) 0\n",
"Body Mass (g) 0\n",
"Sex 1\n",
"Delta 15 N (o/oo) 0\n",
"Delta 13 C (o/oo) 0\n",
"dtype: int64"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# count the missing values\n",
"df_filling.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset is almost full, now only one row with the value sex is missing. I will use a classifier to predict it.\n",
"\n",
"Since it's not worth to impute with a classifier for one row, I will drop it."
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"studyName 0\n",
"Species 0\n",
"Island 0\n",
"Individual ID 0\n",
"Clutch Completion 0\n",
"Date Egg 0\n",
"Culmen Length (mm) 0\n",
"Culmen Depth (mm) 0\n",
"Flipper Length (mm) 0\n",
"Body Mass (g) 0\n",
"Sex 0\n",
"Delta 15 N (o/oo) 0\n",
"Delta 13 C (o/oo) 0\n",
"dtype: int64"
]
},
"execution_count": 154,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_filling = df_filling.dropna(subset=['Sex'])\n",
"df_filling.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB/MAAAPQCAYAAADKHVi5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1RU1/738ffA0KsoEBVsiIC9G8Wa2EvU2HvBbtTYuzFq7IoooCBgw16jJvbeC3aNNSLYCyjSGWaeP3zmXEZNbnJ/KoLf11p3Gaece/aa7Tn77M8uKp1Op0MIIYQQQgghhBBCCCGEEEIIIYQQnw2jzD4BIYQQQgghhBBCCCGEEEIIIYQQQhiSMF8IIYQQQgghhBBCCCGEEEIIIYT4zEiYL4QQQgghhBBCCCGEEEIIIYQQQnxmJMwXQgghhBBCCCGEEEIIIYQQQgghPjMS5gshhBBCCCGEEEIIIYQQQgghhBCfGQnzhRBCCCGEEEIIIYQQQgghhBBCiM+MhPlCCCGEEEIIIYQQQgghhBBCCCHEZ0bCfCGEEEIIIYQQQgghhBBCCCGEEOIzI2G+EEIIIYQQQgghhBBCCCGEEEII8ZmRMF8IIYQQQgghhBBCCCGEEEIIIYT4zEiYL4QQQgghhBBCCCGEEEIIIYQQQnxmJMwXQgghhBBCCCFEtqbVajP7FIQQQgghhBBCiH9NndknIIQQQgghhBBCCPEhXbhwgZs3b+Lm5oaHhwfm5uYYGcl8BiGEEEIIIYQQWYuE+UIIIYQQQgghhMg2Xr16RVhYGEePHkWtVmNqakrJkiVp3LgxZcqUIXfu3ADodDpUKlUmn60QQgghhBBCCPHXVDqdTpfZJyGEEEIIIYQQQgjxoaSlpZGSksKFCxf4/fffOXDgALGxsXh5edGoUSN69OiR2acohBBCCCGEEEL8VxLmCyGEEEIIIYQQIlvQz7ZPT0/H2NhYef3SpUscPnyYRYsWodFoqFWrFoMHD8bNzc3gc0IIIYQQQgghxOdEwnwhhBBCCCGEEEJkS+8L9SdMmMD169cpVaoU/fr1o3LlypiammbiWQohhBBCCCGEEO8nYb4QQgghhBBCCCGyPa1Wi5GREZGRkfj5+bFjxw68vLwYMWIElStXzuzTEx+J/nfXr9ogBEi9EEIIIYQQWYc6s09ACCGEEEIIIYQQ4v9KH879FX1wV6BAAYYNG4axsTHbt29n4cKFeHh44ODgIMFeNnHjxg2sra2xsrLC3t4eQPld5Tf+cp0/f54nT55QoEABcuXKRa5cuaReCCGEEEKIz57MzBdCCCGEEEIIIUSWlnE5/WvXrpGeno5Op6NkyZLvfFYf2t27d49x48Zx5swZ2rVrx08//fSpT1t8BM+fP6dt27bExsaiUqkoX7485cqVo06dOuTNmxcTExMJbr9Aqamp+Pj4cP/+fR49eoSjoyNff/013t7efPvtt9jY2GT2KQohhBBCCPFeEuYLIYQQQgghhBAiy8oYzAYHB7N8+XISEhJISkpiyJAhtGzZEgcHh/d+58aNG3To0IH4+HiCg4OpXr16ZhRBfECpqanEx8dz69YtLly4wNatW7lz5w52dnZUrFiRESNGkCdPHmXwh/hyaDQakpKSOHv2LHv37mXfvn28fPmS/Pnz88MPP1CmTBlcXFwy+zSFEEIIIYQwIGG+EEIIIYQQQgghsrxFixYxb9487O3t8fLy4sSJEwB06tSJHj164OzsbPB5faC/efNmxowZQ8+ePRkyZEhmnLr4iB49esTNmzeZN28ef/zxB46OjnTq1ImGDRtKcPsFeXs1hvT0dF6+fMm8efPYvXs3ycnJVKhQgb59+1KuXLlMPFMhhBBCCCEMSZgvhBBCCCGEEEKILO3Vq1d0794da2trRo0ahZeXF1u3bmXu3Lk8fvz4LwN9gPv37zN06FAePnzI+vXr+eqrrzKhBOJDezu8TU5Oxt/fn23btvHy5UsaNW
pEly5d8PDwyMSzFJlFq9ViZGREWloahw4dYt26dRw+fJi8efMyYcIEatSokdmnKIQQQgghBABGmX0CQgghhBBCCCGEEP+GVqs1+HtCQgK3bt2iWbNmeHl5AfDdd98xevRo8uTJw4oVKwgJCeHJkyfKd/RzG1xcXKhduzbPnj3j/v37Bu+JrOvtWdjm5uYMGjSIkSNH4unpyaZNmwgICOD27duZeJYisxgZGaHVajExMeHbb79l4sSJtG7dmgcPHjBixAiOHTuW2acohBBCCCEEIGG+EEIIIYQQQgghspD09HSMjN50Zxw4cID169ezevVq8uXLR+HChYE3+6YD1KtXj5EjR7430FepVKSnpwPQrl07cufOza5duzKhROJDeN8ADP2gD2NjYyW4rV+/PoMGDaJs2bLs27ePtWvXEhMT86lPV3wibw/80dPpdMp1RKVSkSdPHkaNGkX79u159eoVM2fO5I8//viUpyqEEEIIIcR7qTP7BIQQQgghhBBCCCH+CZ1Oh7GxMQDz588nMDDQ4P1t27ZRokQJTE1N0Wg0qNVq6tWrB8CMGTNYsWIFRkZGdO/eHWdnZ+VYpqamFC5cmLt37wKGs7rF5y89PV35LV+9ekVCQgJ58uRRwlp9cKv/8+uvvyY+Pp558+bx66+/UrVqVWrUqKEsvS6yh4z1IioqipcvX5IrVy4cHBwwNzd/5/e2tLSkT58+xMbGsmPHDrZu3UqhQoUwNTWVa4IQQgghhMg08oQihBBCCCGEEEKILEEfqC1dupTAwEDy589P+/btKVSoEGZmZuzatYstW7ag0+lQq9VoNBrgPzP0XV1dWbZsGQsWLCAxMRF4E/SamppStWpVnjx5QmJioiyzn4VotVolsF2yZAldu3alTp069OzZkx07dpCUlIRKpUKn0yl/GhkZ8c0339C6dWvi4uLw8/MjOTlZgvxsJOPAn8DAQNq3b0/r1q3p0qULM2fOJCYmRllqPyMnJye6dOlC4cKF+fXXX3n06JFSb4QQQgghhMgM8pQihBBCCCGEEEKIz5p+OXx4s4T+mTNnKFq0KH5+fkyYMIE5c+bQvHlzYmJiWLJkCb///vt7A/0hQ4ZgZWWFm5sblpaWwH8GCJQpU4bAwEAsLS1lFm4Wog/g58+fz4wZM4iMjMTS0pIjR44wd+5cNm7cSGJi4juBvlqtpl27dlSpUoVr166xbds24K+XZRdZi/7fcHBwMPPnz0elUlGsWDESExNZtWoVkydP/stAv3jx4tSpU4eYmBiCg4PRarVyTRBCCCGEEJlGpZOhpUIIIYQQQgghhMgC9Mvk+/r60rNnT3r37q28d/fuXZYtW8bGjRspWLAgvXv3pmHDhqhUKmXJfYB79+6RP39+ACXcFVlbZGQknTt3xt3dnWHDhmFtbU14eDibNm3CysqKHj168P3332Npaan85vol2O/evUu7du2oWrUqs2fPzuyiiA9Eq9Xy8uVLunbtip2dHRMnTsTFxYWjR48yb948bt26Rf369ZkwYQIODg7Kkvv6+pGYmEjbtm1JT09n8+bNmJqaZnaRhBBCCCHEF0pm5gshhBBCCCGEEOKzd+HCBX755RcWL16Mubk5OXPmBN7M1AcoWLAgXbt2pUWLFty9e5egoCCDGfppaWkASpAvs22zrrdnUr948YKnT5/Sr18/vLy8cHV1xcfHh86dO5OQkEBISAibNm0ymKFvbGyMTqfD0dGRChUqsH//fm7dupVJJRIfQsZ6YWRkRGJiIo8fP6Zbt264ublhZmZG9erVGTduHO7u7uzcuZNJkyYZzNDXD/SwtLSkffv23Llzh/3792diqYQQQgghxJdOwnwhhBBCCCGEEEJ89goUKMD48eMxMjLi+fPnbN68meTkZExNTZWl9AsUKGAQ6IeGhrJ9+3Z0Oh0mJiYGx5P90bOm9PR05be7fv06J0+e5MGDB+TNm5f8+fOj0WhIT0/HycmJdu3a/W2gr1KpsLa2pm
nTpiQmJvLw4cNMLp34X2WsF0eOHGHjxo3s2rWL9PR0LCwsAJTrQPny5f820Dc2NgagdOnSWFpaEhkZmVnFEkIIIYQ
"text/plain": [
"<Figure size 2500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"warnings.filterwarnings('ignore')\n",
"msno.matrix(df_filling)\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In total, I dropped 3 rows and imputed the others with the mean. Now I have a dataset withouth missing values."
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>studyName</th>\n",
" <th>Species</th>\n",
" <th>Island</th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-11</td>\n",
" <td>39.1</td>\n",
" <td>18.7</td>\n",
" <td>181.0</td>\n",
" <td>3750.0</td>\n",
" <td>MALE</td>\n",
" <td>8.733382</td>\n",
" <td>-25.686292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N1A2</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-11</td>\n",
" <td>39.5</td>\n",
" <td>17.4</td>\n",
" <td>186.0</td>\n",
" <td>3800.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.949560</td>\n",
" <td>-24.694540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N2A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>40.3</td>\n",
" <td>18.0</td>\n",
" <td>195.0</td>\n",
" <td>3250.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.368210</td>\n",
" <td>-25.333020</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A1</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>36.7</td>\n",
" <td>19.3</td>\n",
" <td>193.0</td>\n",
" <td>3450.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.766510</td>\n",
" <td>-25.324260</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>PAL0708</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>N3A2</td>\n",
" <td>Yes</td>\n",
" <td>2007-11-16</td>\n",
" <td>39.3</td>\n",
" <td>20.6</td>\n",
" <td>190.0</td>\n",
" <td>3650.0</td>\n",
" <td>MALE</td>\n",
" <td>8.664960</td>\n",
" <td>-25.298050</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>337</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N38A1</td>\n",
" <td>No</td>\n",
" <td>2009-12-01</td>\n",
" <td>47.2</td>\n",
" <td>13.7</td>\n",
" <td>214.0</td>\n",
" <td>4925.0</td>\n",
" <td>FEMALE</td>\n",
" <td>7.991840</td>\n",
" <td>-26.205380</td>\n",
" </tr>\n",
" <tr>\n",
" <th>338</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N39A1</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>46.8</td>\n",
" <td>14.3</td>\n",
" <td>215.0</td>\n",
" <td>4850.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.411510</td>\n",
" <td>-26.138320</td>\n",
" </tr>\n",
" <tr>\n",
" <th>339</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N39A2</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>50.4</td>\n",
" <td>15.7</td>\n",
" <td>222.0</td>\n",
" <td>5750.0</td>\n",
" <td>MALE</td>\n",
" <td>8.301660</td>\n",
" <td>-26.041170</td>\n",
" </tr>\n",
" <tr>\n",
" <th>340</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N43A1</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>45.2</td>\n",
" <td>14.8</td>\n",
" <td>212.0</td>\n",
" <td>5200.0</td>\n",
" <td>FEMALE</td>\n",
" <td>8.242460</td>\n",
" <td>-26.119690</td>\n",
" </tr>\n",
" <tr>\n",
" <th>341</th>\n",
" <td>PAL0910</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>N43A2</td>\n",
" <td>Yes</td>\n",
" <td>2009-11-22</td>\n",
" <td>49.9</td>\n",
" <td>16.1</td>\n",
" <td>213.0</td>\n",
" <td>5400.0</td>\n",
" <td>MALE</td>\n",
" <td>8.363900</td>\n",
" <td>-26.155310</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>341 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" studyName Species Island Individual ID \\\n",
"0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 \n",
"1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 \n",
"2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 \n",
"3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 \n",
"4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A2 \n",
".. ... ... ... ... \n",
"337 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A1 \n",
"338 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A1 \n",
"339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A2 \n",
"340 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A1 \n",
"341 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A2 \n",
"\n",
" Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) \\\n",
"0 Yes 2007-11-11 39.1 18.7 \n",
"1 Yes 2007-11-11 39.5 17.4 \n",
"2 Yes 2007-11-16 40.3 18.0 \n",
"3 Yes 2007-11-16 36.7 19.3 \n",
"4 Yes 2007-11-16 39.3 20.6 \n",
".. ... ... ... ... \n",
"337 No 2009-12-01 47.2 13.7 \n",
"338 Yes 2009-11-22 46.8 14.3 \n",
"339 Yes 2009-11-22 50.4 15.7 \n",
"340 Yes 2009-11-22 45.2 14.8 \n",
"341 Yes 2009-11-22 49.9 16.1 \n",
"\n",
" Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) \\\n",
"0 181.0 3750.0 MALE 8.733382 \n",
"1 186.0 3800.0 FEMALE 8.949560 \n",
"2 195.0 3250.0 FEMALE 8.368210 \n",
"3 193.0 3450.0 FEMALE 8.766510 \n",
"4 190.0 3650.0 MALE 8.664960 \n",
".. ... ... ... ... \n",
"337 214.0 4925.0 FEMALE 7.991840 \n",
"338 215.0 4850.0 FEMALE 8.411510 \n",
"339 222.0 5750.0 MALE 8.301660 \n",
"340 212.0 5200.0 FEMALE 8.242460 \n",
"341 213.0 5400.0 MALE 8.363900 \n",
"\n",
" Delta 13 C (o/oo) \n",
"0 -25.686292 \n",
"1 -24.694540 \n",
"2 -25.333020 \n",
"3 -25.324260 \n",
"4 -25.298050 \n",
".. ... \n",
"337 -26.205380 \n",
"338 -26.138320 \n",
"339 -26.041170 \n",
"340 -26.119690 \n",
"341 -26.155310 \n",
"\n",
"[341 rows x 13 columns]"
]
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# recreate the penguin manager\n",
"penguin_manager = EncoderManager(df_filling, columns_to_label_encode, columns_to_ordinal_encode, columns_to_onehot_encode)\n",
"penguin_manager.get_df()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 10 Normalize numeric features\n",
"\n",
"Normalization is a common preprocessing step in machine learning that aims to scale features to have similar ranges and avoid that one feature dominates the others during training. By normalizing the data, we can ensure that each feature contributes equally to the model's training, leading to more accurate and robust models. Normalization can also improve the convergence of some machine learning algorithms, such as gradient descent, and reduce the computational complexity.\n",
"\n",
"References: TP02 - Data Preprocessing\n",
"\n",
"The most commons are:\n",
"\n",
"#### Min-max normalization\n",
"\n",
"Used to scale all values in a range between 0 and 1. Function to use: sklearn.preprocessing.MinMaxScaler.\n",
"\n",
"In general, the minmax normalization:\n",
"\n",
"- It's sensitive to outliers. Outliers could outscale the other values.\n",
"- It may not preserve the original shape of the data.\n",
"- It's good for distance based algorithms (Ex. k-means clustering, hierarchical clustering, and principal component analysis (PCA)).\n",
"- It's not good for non-linear data.\n",
"Could produce too small values.\n",
"\n",
"#### Z-score normalization\n",
"\n",
"Used to standardize values by scaling them to have a mean of 0 and a standard deviation of 1. Function to use: sklearn.preprocessing.StandardScaler.\n",
"\n",
"In general, the zscore normalization:\n",
"\n",
"- Helps remove outliers influence.\n",
"- Works well with normally distributed data.\n",
"- Sensitive to EXTREME outliers.\n",
"- It may not preserve the original shape of the data.\n",
"\n",
"#### Clipping\n",
"\n",
"This normalization method is used to clip the data between a min and a max value. It's used to remove outliers and to reduce the influence of extreme values. The data is not deleted, but it's replaced with the chosen min or max value if the data is outside the range.\n",
"\n",
"- Helps remove outliers influence.\n",
"- It can influence the model.\n",
"- Could remove important information that could contain useful information to modeling data\n",
"\n",
"#### Binning\n",
"\n",
"This normalization method is used to bin the data into a fixed number of bins. It's used to remove outliers and to reduce the influence of extreme values.\n",
"\n",
"Binning is a method of transforming continuous variables into discrete variables. It reduces the effects of minor observation errors.\n",
"\n",
"More info on binning: https://en.wikipedia.org/wiki/Data_binning\n",
"\n",
"And: TP02 - Data Preprocessing\n",
"\n",
"#### L1 normalization\n",
"\n",
"Used to normalize the data based on the sum of the absolute values of each observation. Function to use: sklearn.preprocessing.normalize with norm='l1'.\n",
"\n",
"#### L2 normalization\n",
"\n",
"Used to normalize the data based on the sum of the squared values of each observation. Function to use: sklearn.preprocessing.normalize with norm='l2'.\n",
"\n",
"#### Log transformation\n",
"\n",
"Used to scale the data and improve the linearity of relationships. Function to use: numpy.log or numpy.log1p.\n",
"\n",
"### Normalization\n",
"\n",
"For this step I will normalize with Min-Max since I checked before there are no important outliers and these normalizations, and it works well for the PCA which I will use later. The distributions tend to be normal, but the Min-Max normalization should be fine."
]
},
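{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of min-max scaling, z-score standardization, and clipping in plain Python, to make the formulas above concrete. The `body_mass` values are made up for illustration (including the 9000 g outlier), and `min_max`, `z_score`, and `clip` are hypothetical helpers rather than the code used in this notebook; for a single column, the first two should match what `MinMaxScaler` and `StandardScaler` compute with their default settings.\n",
"\n",
"```python\n",
"from statistics import fmean, pstdev\n",
"\n",
"\n",
"def min_max(values):\n",
"    # Rescale linearly so the smallest value maps to 0 and the largest to 1.\n",
"    lo, hi = min(values), max(values)\n",
"    return [(v - lo) / (hi - lo) for v in values]\n",
"\n",
"\n",
"def z_score(values):\n",
"    # Centre on the mean and divide by the population standard deviation.\n",
"    mu, sigma = fmean(values), pstdev(values)\n",
"    return [(v - mu) / sigma for v in values]\n",
"\n",
"\n",
"def clip(values, lo, hi):\n",
"    # Replace anything outside [lo, hi] with the nearest bound.\n",
"    return [min(max(v, lo), hi) for v in values]\n",
"\n",
"\n",
"# Hypothetical body masses in grams, with one extreme value at the end.\n",
"body_mass = [3250.0, 3450.0, 3650.0, 3750.0, 3800.0, 9000.0]\n",
"\n",
"scaled = min_max(body_mass)\n",
"standardized = z_score(body_mass)\n",
"clipped = clip(body_mass, 3000.0, 6000.0)\n",
"\n",
"print([round(v, 3) for v in scaled])\n",
"print([round(v, 3) for v in standardized])\n",
"print(clipped)\n",
"```\n",
"\n",
"With the outlier present, min-max squeezes the five ordinary values into roughly the lowest tenth of the range, which is why checking for outliers first matters."
]
},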
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[<AxesSubplot:title={'center':'Culmen Length (mm)'}>,\n",
" <AxesSubplot:title={'center':'Culmen Depth (mm)'}>],\n",
" [<AxesSubplot:title={'center':'Flipper Length (mm)'}>,\n",
" <AxesSubplot:title={'center':'Body Mass (g)'}>],\n",
" [<AxesSubplot:title={'center':'Delta 15 N (o/oo)'}>,\n",
" <AxesSubplot:title={'center':'Delta 13 C (o/oo)'}>]], dtype=object)"
]
},
"execution_count": 157,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAh0AAAGvCAYAAAD/imcEAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAABwgUlEQVR4nO3deVyN6f8/8FedVioVDcOYMaQoJ2WpyJoljH0f2TLGljG27DPMkBh7iGyTJbsxg7GNYTBDwjSJjyJjMBNSKdJy6pz790ffc36O1lOns9Tr+Xj0oHt93/e5z9X7vq7rvm4DQRAEEBEREZUzQ20HQERERJUDkw4iIiLSCCYdREREpBFMOoiIiEgjmHQQERGRRjDpICIiIo1g0kFEREQawaSDiIiINIJJB+ksjltHROrC8kQ3MOl4R0xMDAICAtChQwe4uLigU6dOWLBgAZ48eaLytn744Qc4Ojri33//LYdI1UMXY3z27BnGjx+P//77TzHN29sbc+bMKfU2Fy9ejDVr1qgjPJX98ccf6Nu3L3JycrSyf9JNlbWseftHLBbD29sbX331FZ4/f15u+z506BCWL1+eL5bSnq9///0XHTp0QEpKirpCVMmnn36KU6dOaWXfZcWk4y3h4eEYOnQokpOTMWPGDGzduhUTJkzA9evXMWDAANy5c0fbIVYKV65cwW+//aa27UVERODs2bMYP3682rapCi8vL9SqVQubNm3Syv5J91TmsmbDhg04cOAADhw4gC1btmD06NE4f/48+vfvX6qEqyQ2bdqE1NRUtWxLEATMmzcPo0aNgq2trVq2qar58+dj8eLFSE5O1sr+y4JJx/+5efMmAgMDMWzYMOzYsQO9evWCh4cHBg0ahH379qFKlSqYO3eutsOkUggKCsLIkSNRpUoVrcUwadIkbN26FYmJiVqLgXRDZS9rGjduDFdXV7i6uqJVq1YYOXIk9u/fD4lEgq+//lrb4RXrl19+QWxsLIYNG6a1GJo0aQJnZ2e9vJFh0vF/tm/fDktLS0yfPj3fPFtbW8yZMwddu3ZFeno6AGDEiBEYMWKE0nLXrl2Do6Mjrl27VuA+5syZg88++wwHDx5E586d4eLigqFDh+Lhw4e4cOECevXqhaZNm2LQoEG4e/eu0ro3btzA8OHD0bRpU7i7u2P27NlKVXs//PADnJycEB0djSFDhkAsFqNDhw7YunVrWU8NACA1NRVff/01WrduDbFYjMGDB+Pq1atKyzg6OiI8PBzz58+Hu7s73NzcMGXKFCQlJSktt337dnTq1Elx/OfPn1ectx9++EFR4Hbq1EmpSSUnJwffffcdvLy84OrqijFjxuDRo0dFxv3bb78hLi4OPXv2VExbv349unXrhnPnzqFnz54Qi8Xo06cPoqKi8Ndff2HQoEFwcXFBz549lY6xtOsBgIuLC2rXro2wsDCVzjtVPCxr8qtbty4GDx6MK1eu4PHjx4rp9+7dw/jx49GsWTM0a9YM/v7+SrUh8vPw+++/w9fXFy4uLujSpQv27NmjWMbb2xv//fcfjh49mq9JJTo6GkOHDlUcw/bt24uNNTQ0FF27doWpqalimqOjI/bt24c5c+agefPmcHd3x5IlS5CVlYXly5fD09MTHh4emD9/PrKzs8u8HgD07t0bhw8f1loTT2kx6UBeddnvv/+OVq1awdzcvMBlunXrhsmTJ8PCwqJM+/rrr7+we/duzJkzB0uXLkV8fDzGjRuHoKAgjB8/HkFBQXj69ClmzpypWOf69esYPXo0zMzMsHbtWsybNw+RkZEYOXIksrKyFMvJZDJMnToVPXr0wJYtW9C8eXOsXLkSly9fLlPM2dnZGDVqFH799VdMmzYNGzZsQK1atTB27Nh8f1zXrFkDmUyG1atXY9asWfjtt9+wdOlSxfwNGzZg5cqV6N69O0JCQtC0aVNMmzZNMb9Dhw6YOHGiYtlJkyYp5p08eRL379/HsmXL8P
XXXyMmJkZp3YIcO3YMrq6ueP/995WmP3v2DEFBQZgwYQLWrl2LtLQ0TJkyBdOnT8fgwYOxevVqyGQyTJs2Tekcl3Y9IO8aOnbsWAnPOlVELGsK16ZNGwB5NUEA8PDhQ0UT1LJlyxAYGIgnT57g008/zdesMG3aNDg5OWHjxo3w8vLC4sWLsXv3bgB55YidnR3at2+PAwcO4L333lOst2jRIvTs2ROhoaFwcXHBd999hwsXLhQa499//43bt2+jW7du+eatXLkSJiYm2LBhA/r06YPdu3ejb9++ePr0KVasWIGhQ4fi8OHDirjKul6nTp0glUrxyy+/qHCWtc9I2wHogpcvXyI7OxsffPBBue8rPT0da9euRYMGDQAAkZGROHDgAMLCwtCqVSsAeX/Yli9fjlevXsHKygqrVq3Cxx9/jNDQUIhEIgBA06ZN8cknn+DIkSPw9fUFkFegTZo0CYMGDQIANG/eHL/88gt+++03tG3bttQx//TTT4iNjcXBgwfRtGlTAEC7du0wYsQIrFy5EkeOHFEs6+DggKCgIMXvt27dwunTpwEAGRkZ2Lp1K3x9fRUFXZs2bZCZmYkDBw4AyLvT+/DDDwHkVcO+/ZnUrFkTISEhMDY2BgA8evQImzdvRnp6eqEFdEREBD755JN80zMzM7Fw4UK0a9cOAPDgwQOsWrUKgYGBGDhwIABAKpViypQpePjwIRo3blym9QBALBZj8+bNePDggeLzp8qFZU3h7OzsAAAvXrwAkJcsmJmZISwsTPH9btWqFTp37oxt27Zh9uzZinU7d+6M+fPnAwDatm2LxMREbNq0Cb6+vnBycoKJiQlsbW3h6uqqtM/p06fj008/BQC4urri/PnziIiIQMeOHQuMMSIiAkBezeW7GjRogG+//RYA0LJlSxw+fBg5OTlYuXIljIyM0LZtW5w/fx5//vmnWtarUqUKGjRogKtXr2LIkCHFnF3dwZoOAIaGeadBKpWW+76qVaum9AdH/kV7+8tgbW0NAHj16hUyMzMRHR2N9u3bQxAE5ObmIjc3F3Xr1kWDBg3wxx9/KG3fzc1N8X/5Fy0jI6NMMV+9ehV2dnZwdnZW7F8qlaJjx464ffs20tLSFMu++6WuVasWMjMzAeTdeWVlZeW7S3i76aMoLi4uioQDyKuSBfLOU0EyMzORnJxcaAHfrFkzxf9r1KiRL/63Pwd1rCePQ5efMKDyxbKmeAYGBgDy/sB7eHjAzMxMEYuFhQVatGiBK1euKK3Tp08fpd+7du2K5ORkPHz4sMh9tWjRQvH/KlWqoEaNGoWWJwDw5MkTWFlZwcrKKt+8t8+HkZERbGxs0KRJExgZ/f97e2tra7x+/Vot6wFAnTp19K48YU0H8j7QqlWrIiEhodBlMjIyIJFIFF/S0irsjrywqtZXr15BJpNh69atBbaZvt2uCABmZmZKvxsaGpb5+fTU1FS8ePECzs7OBc5/8eIFqlWrBiD/cby9f3nb47s9vuV/uIvzbkdQeQEuk8kKXF5eeBTWgbSgz+Ld86fO9eTnpqDCgyoHljWFkz8yW6tWLQB55c7Jkydx8uTJfMu+W4a83WQCANWrVwdQ+A2JXFHlVUHS09MLPX8Fne/CllXHevLl9K08YdLxf9q0aYNr164hOzs735cLyOs8FRgYiL179yoy03fvVtSR5b+ratWqMDAwwOjRowtsJijpxVkWlpaWqFevHlauXFng/JJWFcsLk5SUFNSvX18xvbw6QtnY2AAovuDRFHmNkDwuqpxY1hTsypUrMDAwUNQ+WFpaonXr1vDz88u37Nu1AADyPQ4r7/MhTz7UxcbGRqf+yL969UrvyhM2r/yfMWPGIDU1tcABpJKTk7Ft2zZ89NFHiqpJCwsLPHv2TGm5d9vc1MHCwgJOTk74+++/IRaLFT8NGzbEhg0bCu29rk7u7u54+vQpqlevrhTD1atXsW3bNkXbb3EaNWoES0tLnD17Vmn6mTNnlH6X12
CUlYmJCezs7PD06VO1bK+s5NdL7dq1tRwJaRPLmvyePXuGQ4cOoUOHDopO3+7u7oiPj0fjxo0VsTRp0gRhYWH5Ok+
"text/plain": [
"<Figure size 640x480 with 6 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_numeric_ = penguin_manager.get_df().select_dtypes(include=[np.number])\n",
"\n",
"df_numeric_.hist()"
]
},
{
"cell_type": "code",
"execution_count": 158,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[<AxesSubplot:title={'center':'Culmen Length (mm)'}>,\n",
" <AxesSubplot:title={'center':'Culmen Depth (mm)'}>],\n",
" [<AxesSubplot:title={'center':'Flipper Length (mm)'}>,\n",
" <AxesSubplot:title={'center':'Body Mass (g)'}>],\n",
" [<AxesSubplot:title={'center':'Delta 15 N (o/oo)'}>,\n",
" <AxesSubplot:title={'center':'Delta 13 C (o/oo)'}>]], dtype=object)"
]
},
"execution_count": 158,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAiAAAAGvCAYAAABih26MAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAABtAklEQVR4nO3deVhU1f8H8DcMq1tuqGl+M5dBgUFIBBT3DffdtNAU5asp7isuZWWIZm6oGG6RYWmSlfu+lrijTiYqZmqZGu6IbDPn9we/uV9Gthm8zAzwfj0PzwP3nrlzzty5Hz73nHPvtRJCCBARERGZkLW5K0BEREQlDxMQIiIiMjkmIERERGRyTECIiIjI5JiAEBERkckxASEiIiKTYwJCREREJscEhIiIiEyOCQhZLN4jj4jkwnhieZiAvEStVmPKlClo1aoV3N3d0bZtW8yaNQu3b982eltbtmyBs7Mz/vrrr0KoqTwssY53797FiBEj8Pfff0vL2rRpg5CQkAJvc86cOVi8eLEc1TPar7/+ip49eyI9Pd0s70+WqaTGmqw/KpUKbdq0wYcffoh79+4V2ntv3rwZ8+fPz1aXgn5ef/31F1q1aoWHDx/KVUWjvPvuu9i1a5dZ3ltOTECy2LBhAwYMGIAHDx5g0qRJWL16NT744AOcPn0affr0waVLl8xdxRLh+PHjOHz4sGzbO3HiBPbu3YsRI0bItk1j+Pn5oVq1ali5cqVZ3p8sT0mONcuXL8emTZuwadMmrFq1CkOGDMHBgwfRu3fvAiVfhli5ciUeP34sy7aEEJgxYwYGDx6MihUryrJNY82cORNz5szBgwcPzPL+cmEC8v/Onj2L0NBQvPfee1i3bh26desGHx8f9OvXD9999x1KlSqF6dOnm7uaVABhYWF4//33UapUKbPVYdSoUVi9ejXu379vtjqQZSjpsaZBgwbw8PCAh4cHmjRpgvfffx8bN25EWloaPvroI3NXL1/79u1DfHw83nvvPbPVwc3NDa6urkX+pIYJyP9bu3YtypYti4kTJ2ZbV7FiRYSEhKBDhw5ISkoCAAwaNAiDBg3SK3fy5Ek4Ozvj5MmTOb5HSEgIhg0bhu+//x7t2rWDu7s7BgwYgBs3buDQoUPo1q0bGjZsiH79+uHy5ct6rz1z5gwGDhyIhg0bwtvbG9OmTdPr/tuyZQtcXFxw4cIF9O/fHyqVCq1atcLq1atf9aMBADx+/BgfffQRmjZtCpVKhXfeeQexsbF6ZZydnbFhwwbMnDkT3t7e8PT0xNixY5GYmKhXbu3atWjbtq3U/oMHD0qf25YtW6Tg27ZtW71hl/T0dHz++efw8/ODh4cHhg4dips3b+ZZ78OHD+PKlSvo2rWrtGzZsmXo2LEj9u/fj65du0KlUqFHjx6Ii4vD+fPn0a9fP7i7u6Nr1656bSzo6wDA3d0d1atXR1RUlFGfOxU/jDXZ1axZE++88w6OHz+OW7duScuvXr2KESNG4O2338bbb7+N4OBgvV4S3efwyy+/ICAgAO7u7mjfvj2io6OlMm3atMHff/+NH3/8Mduwy4ULFzBgwACpDWvXrs23rpGRkejQoQPs7e2lZc7Ozvjuu+8QEhKCRo0awdvbG5999hlSUlIwf/58+Pr6wsfHBzNnzkRqauorvw4AunfvjpiYGLMNA8mBCQgyu9R++eUXNGnSBI6OjjmW6dixI0aPHo0yZcq80nudP38e33zzDUJCQjB37lwkJCRg+PDhCAsLw4gRIxAWFoZ//vkHkydPll5z+vRpDBkyBA4ODliyZAlmzJiBU6dO4f3330dKSopUTqvVYvz48ejcuTNWrVqFRo0a4YsvvsCxY8deqc6pqakYPHgwDhw4gAkTJmD58uWoVq0agoKCsv2jXbx4MbRaLRYtWoSpU6fi8OHDmDt3rrR++fLl+OKLL9CpUydERESgYcOGmDBhgrS+VatWGDlypFR21KhR0rqdO3fi2r
VrmDdvHj766COo1Wq91+Zk69at8PDwwOuvv663/O7duwgLC8MHH3yAJUuW4MmTJxg7diwmTpyId955B4sWLYJWq8WECRP0PuOCvg7I/A5t3brVwE+diiPGmtw1a9YMQGYPEQDcuHFDGqaaN28eQkNDcfv2bbz77rvZhh4mTJgAFxcXrFixAn5+fpgzZw6++eYbAJlxxMnJCS1btsSmTZtQpUoV6XUff/wxunbtisjISLi7u+Pzzz/HoUOHcq3jH3/8gd9++w0dO3bMtu6LL76AnZ0dli9fjh49euCbb75Bz5498c8//2DBggUYMGAAYmJipHq96uvatm0LjUaDffv2GfEpWxYbc1fAEjx69Aipqal44403Cv29kpKSsGTJEtSpUwcAcOrUKWzatAlRUVFo0qQJgMx/cvPnz8fTp09Rrlw5LFy4EG+99RYiIyOhUCgAAA0bNkSXLl3www8/ICAgAEBmcBs1ahT69esHAGjUqBH27duHw4cPo3nz5gWu888//4z4+Hh8//33aNiwIQCgRYsWGDRoEL744gv88MMPUlmlUomwsDDp74sXL2L37t0AgOTkZKxevRoBAQFS0GvWrBlevHiBTZs2Acg8A/zPf/4DILOrNus+qVq1KiIiImBrawsAuHnzJr788kskJSXlGqxPnDiBLl26ZFv+4sULzJ49Gy1atAAAXL9+HQsXLkRoaCj69u0LANBoNBg7dixu3LiBBg0avNLrAEClUuHLL7/E9evXpf1PJQtjTe6cnJwAAP/++y+AzMTBwcEBUVFR0vHdpEkTtGvXDmvWrMG0adOk17Zr1w4zZ84EADRv3hz379/HypUrERAQABcXF9jZ2aFixYrw8PDQe8+JEyfi3XffBQB4eHjg4MGDOHHiBFq3bp1jHU+cOAEgs0fzZXXq1MGnn34KAGjcuDFiYmKQnp6OL774AjY2NmjevDkOHjyIc+fOyfK6UqVKoU6dOoiNjUX//v3z+XQtE3tAAFhbZ34MGo2m0N/rtdde0/vnozvosh4Y5cuXBwA8ffoUL168wIULF9CyZUsIIZCRkYGMjAzUrFkTderUwa+//qq3fU9PT+l33UGXnJz8SnWOjY2Fk5MTXF1dpffXaDRo3bo1fvvtNzx58kQq+/IBXq1aNbx48QJA5hlZSkpKtrOHrMMjeXF3d5eSDyCz2xbI/Jxy8uLFCzx48CDXYP/2229Lv1euXDlb/bPuBzlep6uHJV+pQIWLsSZ/VlZWADL/2fv4+MDBwUGqS5kyZeDl5YXjx4/rvaZHjx56f3fo0AEPHjzAjRs38nwvLy8v6fdSpUqhcuXKucYTALh9+zbKlSuHcuXKZVuX9fOwsbFBhQoV4ObmBhub/53nly9fHs+ePZPldQBQo0aNIh1P2AOCzJ1bunRp3LlzJ9cyycnJSEtLkw7YgsrtTD237tinT59Cq9Vi9erVOY6xZh2HBAAHBwe9v62trV/5+vfHjx/j33//haura47r//33X7z22msAsrcj6/vrxipfnjmu+yeen5cnkeqCuVarzbG8LpDkNvk0p33x8ucn5+t0n01OgYRKBsaa3Okuw61WrRqAzLizc+dO7Ny5M1vZl2NI1mEVAKhUqRKA3E9OdPKKVzlJSkrK9fPL6fPOrawcr9OVK8rxhAnI/2vWrBlOnjyJ1NTUbAcakDnxKjQ0FN9++62Usb58FiNH9v+y0qVLw8rKCkOGDMlxKMHQL+qrKFu2LGrVqoUvvvgix/WGdifrAsvDhw9Ru3ZtaXlhTaKqUKECgPyDkKnoeop09aKSibEmZ8ePH4eVlZXUK1G2bFk0bdoUgYGB2cpm7R0AkO0SW90cEV0iIpcKFSpY1D/8p0+fFul4wiGY/zd06FA8fvw4x5tVPXjwAGvWrMGbb74pdV+WKVMGd+/e1Sv38hidHMqUKQMXFxf88ccfUKlU0k+9evWwfPnyXGfBy8nb2xv//PMPKlWqpFeH2NhYrFmzRhorzk/9+v
VRtmxZ7N27V2/5nj179P7W9Wy8Kjs7Ozg5OeGff/6RZXuvSvd9qV69uplrQubEWJPd3bt3sXnzZrRq1UqaMO7t7Y2
"text/plain": [
"<Figure size 640x480 with 6 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# normalize with Min-Max\n",
"df_normalized = penguin_manager.normalize(\"min-max\", inplace=True).get_df()\n",
"df_normalized.select_dtypes(include=[\"float\", \"int\"]).hist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result seems pretty good, the distributions are similar to the original ones, but the values are scaled between 0 and 1."
]
},
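{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shape is preserved because min-max scaling is a linear transformation. A minimal sketch of that check in plain Python, assuming equal-width bins; `min_max` and `histogram` are hypothetical helpers, and the ten flipper lengths are the values from the rows displayed earlier.\n",
"\n",
"```python\n",
"def min_max(values):\n",
"    # Rescale linearly so the smallest value maps to 0 and the largest to 1.\n",
"    lo, hi = min(values), max(values)\n",
"    return [(v - lo) / (hi - lo) for v in values]\n",
"\n",
"\n",
"def histogram(values, bins):\n",
"    # Count values in equal-width bins spanning [min, max];\n",
"    # the maximum itself is placed in the last bin.\n",
"    lo, hi = min(values), max(values)\n",
"    counts = [0] * bins\n",
"    for v in values:\n",
"        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)\n",
"        counts[idx] += 1\n",
"    return counts\n",
"\n",
"\n",
"# Flipper lengths (mm) of the ten rows shown in the table output.\n",
"flipper = [181.0, 186.0, 195.0, 193.0, 190.0, 214.0, 215.0, 222.0, 212.0, 213.0]\n",
"\n",
"before = histogram(flipper, bins=4)\n",
"after = histogram(min_max(flipper), bins=4)\n",
"print(before, after)\n",
"```\n",
"\n",
"Both histograms have identical bin counts; only the axis range changes."
]
},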
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11 Pandas Profiling\n",
"\n",
"Before going further, I will use the pandas profiling to get a quick overview of the data at this point."
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7d15661e0ad04ca99ed51d50ac562dad",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "63610006d2c14ad9b6c1bcf633a6aa4a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d9656c5bbed848df8b563be649eb7fbf",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Render HTML: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8bf81e12745d487284dfe47a83c72fa3",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Export report to file: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# pandas profiling\n",
"profile = ProfileReport(penguin_manager.get_df(), title=\"Pandas Profiling Report\", explorative=True)\n",
"warnings.filterwarnings('ignore')\n",
"profile.to_file(\"profiling/profiling.html\")\n",
"warnings.filterwarnings('default')"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"This function is really interesting, but it doesn't add any information now."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12 Clustering, dimension reduction with PCA and t-SNE\n",
"\n",
"Using clustering algorithms, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE) to further analyze and visualize the data. Clustering helps identify patterns and similarities within the data, while PCA and t-SNE help reduce the dimensionality of the data, making it easier to visualize and interpret. The results of these techniques can provide insights into the underlying structure of the data and help identify any outliers or anomalies.\n",
"\n",
"### Reducing and visualizing using PCA\n"
]
},
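{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying PCA to the penguin features, here is a minimal sketch of what it computes, restricted to two features so that the eigenvalues have a closed form. The `xs` and `ys` values are made up, and `pca_2d` is a hypothetical helper, not the implementation used below; in practice a library such as scikit-learn handles any number of features.\n",
"\n",
"```python\n",
"from math import sqrt\n",
"from statistics import fmean\n",
"\n",
"\n",
"def pca_2d(xs, ys):\n",
"    # Centre both features on their means.\n",
"    mx, my = fmean(xs), fmean(ys)\n",
"    dx = [x - mx for x in xs]\n",
"    dy = [y - my for y in ys]\n",
"    n = len(xs)\n",
"    # Sample covariance matrix [[a, b], [b, c]].\n",
"    a = sum(u * u for u in dx) / (n - 1)\n",
"    c = sum(v * v for v in dy) / (n - 1)\n",
"    b = sum(u * v for u, v in zip(dx, dy)) / (n - 1)\n",
"    # Eigenvalues of a symmetric 2x2 matrix in closed form.\n",
"    half_trace = (a + c) / 2\n",
"    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)\n",
"    lam1, lam2 = half_trace + spread, half_trace - spread\n",
"    # Eigenvector of the largest eigenvalue: the first principal component.\n",
"    if b != 0:\n",
"        vx, vy = b, lam1 - a\n",
"    else:\n",
"        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)\n",
"    norm = sqrt(vx * vx + vy * vy)\n",
"    vx, vy = vx / norm, vy / norm\n",
"    # Project the centred points onto the first component.\n",
"    scores = [u * vx + v * vy for u, v in zip(dx, dy)]\n",
"    explained = lam1 / (lam1 + lam2)\n",
"    return scores, explained\n",
"\n",
"\n",
"# Hypothetical, already scaled values of two correlated features.\n",
"xs = [0.1, 0.2, 0.3, 0.4, 0.5]\n",
"ys = [0.15, 0.22, 0.33, 0.38, 0.52]\n",
"\n",
"scores, explained = pca_2d(xs, ys)\n",
"print([round(s, 3) for s in scores], round(explained, 3))\n",
"```\n",
"\n",
"For these strongly correlated points, the first component keeps almost all of the variance, so one dimension is enough to describe them."
]
},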
{
"cell_type": "code",
"execution_count": 160,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.291667</td>\n",
" <td>0.152542</td>\n",
" <td>0.254545</td>\n",
" <td>0.666667</td>\n",
" <td>0.460122</td>\n",
" <td>0.412350</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.305556</td>\n",
" <td>0.237288</td>\n",
" <td>0.269091</td>\n",
" <td>0.511905</td>\n",
" <td>0.550450</td>\n",
" <td>0.719311</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.152778</td>\n",
" <td>0.389831</td>\n",
" <td>0.298182</td>\n",
" <td>0.583333</td>\n",
" <td>0.307537</td>\n",
" <td>0.521692</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.208333</td>\n",
" <td>0.355932</td>\n",
" <td>0.167273</td>\n",
" <td>0.738095</td>\n",
" <td>0.473964</td>\n",
" <td>0.524404</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.263889</td>\n",
" <td>0.305085</td>\n",
" <td>0.261818</td>\n",
" <td>0.892857</td>\n",
" <td>0.431532</td>\n",
" <td>0.532516</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Body Mass (g) Flipper Length (mm) Culmen Length (mm) Culmen Depth (mm) \\\n",
"0 0.291667 0.152542 0.254545 0.666667 \n",
"1 0.305556 0.237288 0.269091 0.511905 \n",
"2 0.152778 0.389831 0.298182 0.583333 \n",
"3 0.208333 0.355932 0.167273 0.738095 \n",
"4 0.263889 0.305085 0.261818 0.892857 \n",
"\n",
" Delta 15 N (o/oo) Delta 13 C (o/oo) Species \n",
"0 0.460122 0.412350 0 \n",
"1 0.550450 0.719311 0 \n",
"2 0.307537 0.521692 0 \n",
"3 0.473964 0.524404 0 \n",
"4 0.431532 0.532516 0 "
]
},
"execution_count": 160,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# for this step I will use the encoded and normalized dataset with only the most interesting features\n",
"columns_to_use = ['Body Mass (g)', 'Flipper Length (mm)', 'Culmen Length (mm)', 'Culmen Depth (mm)', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Species']\n",
"\n",
"df_decoded_pca = penguin_manager.get_df(as_copy=True)\n",
"\n",
"cols_to_label_encode_knn = ['Clutch Completion', 'Sex', 'Individual ID']\n",
"cols_to_ordinal_encode_knn = [\"Date Egg\"]\n",
"cols_to_onehot_encode_knn = ['studyName', 'Island']\n",
"\n",
"# for this i will label encode the species\n",
"lb = LabelEncoder()\n",
"df_decoded_pca['Species'] = lb.fit_transform(df_decoded_pca['Species'])\n",
"\n",
"penguin_manager_knn = EncoderManager(df_decoded_pca, cols_to_label_encode_knn, cols_to_ordinal_encode_knn, cols_to_onehot_encode_knn)\n",
"\n",
"df_encoded_pca = penguin_manager_knn.encode(inplace=False)[columns_to_use]\n",
"df_encoded_pca.head()"
]
},
{
"cell_type": "code",
"execution_count": 161,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(341, 7)"
]
},
"execution_count": 161,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded_pca.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I have 341 rows and 7 columns. I will use the elbow method to find the best number of clusters."
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjMAAAGtCAYAAADj1vVsAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAABKwElEQVR4nO3dd3xUZb4G8Gdaek8goQRS6D0kMBGpiQu7AqISwVVZdUXQuCgWpF0UVBAbaBRWWXWxsCyKoICNFRVQSEJTQ5GSRiAQkpnUSZl27h/JDETanGROzkzyfD8fPnfmzJlzfvPKvTz3nN95X4UgCAKIiIiI3JRS7gKIiIiImoNhhoiIiNwawwwRERG5NYYZIiIicmsMM0REROTWGGaIiIjIrTHMEBERkVtjmCEiIiK3ppa7AKlZrVaYzWYolUooFAq5yyEiIiIHCIIAq9UKtVoNpfLa115afZgxm83IysqSuwwiIiJqgv79+8PDw+Oa+7T6MGNLc/3794dKpXLqsS0WC7KysiQ5dmvDsXIcx8pxHCvHcawcx7ESR6rxsh33eldlgDYQZmy3llQqlWR/KaU8dmvDsXIcx8pxHCvHcawcx7ESR6rxcqRFhA3ARERE5NYYZoiIiMitMcwQERGRW2OYISIiIrfGMENERERujWGGiIiI3BrDDBEREbk1hhkiIiJyawwzRERE5NYYZoiIiMitMcwQERGRW2OYISIiIrfW6healEqN0YL/Zuajk2CRuxQiIqI2jVdmmujbI+exZNsxrP21Uu5SiIiI2jSGmSbqHOwNADh8oQ6CIMhcDRERUdvFMNNEAzoHwUujRIVRwMkLVXKXQ0RE1GYxzDSRh1qJwV2CAQCZuaUyV0NERNR2Mcw0w9Co+jCTkaeXuRIiIqK2i2GmGbTRIQCAzFw9+2aIiIhkwjDTDAM7B0KjBEqqjMguNshdDhERUZvEMNMMnhoVeoRqAAAZuTqZqyEiImqbGGaaqU87DwBARg77ZoiIiOTAMNNMfW1hJlfHvhkiIiIZMMw0U49QD3ioFCiqqEO+rlrucoiIiNochplm8lQpMKBzEAD2zRAREcmBYcYJhkY3zDfDvhkiIqIWxzDjBLb5ZjI43wwREVGLky3M6HQ6pKamIiEhAVqtFkuXLoXZbL7ivh988AGSkpIwePBgTJw4Ed9++20LV3ttg7sEQa1U4GxZDc6U1shdDhERUZsiW5iZPXs2fHx8sHv3bmzcuBF79+7F2rVrL9tv586deOedd/Duu+/i4MGD+Mc//oHZs2fjzJkzLV/0Vfh4qNG/cyAAID2HfTNEREQtSZYwk5+fj8zMTMyZMwfe3t6IjIxEamoq1q1bd9m+OTk5EATB/kelUkGj0UCtVstQ+dVpo0MB1N9qIiIiopYjSyI4efIkgoKCEB4ebt8WGxuLwsJCVFRUICAgwL59/Pjx2LRpE26++WaoVCooFAq88soriIiIEHVOi8XitPr/eEyLxYKhUUF4eyeQkaOT5Fzu7tKxomvjWDmOY+U4jpXjOFbiSDVeYo4nS5gxGAzw9vZutM32vrq6ulGYMZlM6NWrF5YuXYpevXph69atWLhwIWJjY9GzZ0+Hz5mVleWc4q9ybA+TFUoFUFBag+/2HECYj0qy87kzKf87tDYcK8dxrBzHsXIcx0ocOcdLljDj4+ODmprGjbK2976+vo22P//88xg8eDAGDBgAAJg8eTK2bduGzZs3Y968eQ6fs3///lCpnBswLBYLsrKy7Mfut28vfjtbjiqfDrhpUEennsvd/XGs6Oo4Vo7jWDmOY+U4jpU4Uo2X7biOkCXMdO/eHWVlZSgpKUFYWBgAIDs7GxEREfD392+0b2FhIfr169dom1qthkajEXVOlUol2V9K27ETY0Px29ly7MsrxeT4SEnO5e6k/O/Q2nCsHMexchzHynEcK3HkHC9ZGoCjoqIQHx+PZcuWoaqqCgUFBVi9ejVSUlIu2zcpKQ
kff/wxjhw5AqvVim+++QYZGRm4+eabZaj82mzzzfCJJiIiopYj26PZaWlpMJvNSE5OxpQpUzBixAikpqYCAOLi4rBlyxYAwD/+8Q/cfffdmDVrFoYMGYI1a9Zg1apV6N27t1ylX1VCVAgUCiBPV42iilq5yyEiImoTZHu+OSwsDGlpaVf87NChQ/bXarUas2bNwqxZs1qqtCYL9NagT4cAHCmsQHqODpMGdZK7JCIiolaPyxk4WWIM55shIiJqSQwzTmZfp4l9M0RERC2CYcbJhkbX981kFxtQXFkndzlEREStHsOMkwX5eKBneP3j5Zm81URERCQ5hhkJXOyb4a0mIiIiqTHMSIDzzRAREbUchhkJDG0IMyeKqqA3GGWuhoiIqHVjmJFAqJ8nurf3AwBk8lYTERGRpBhmJGLrm0nPYRMwERGRlBhmJKKNaZhvhk80ERERSYphRiK2vpnfz1egvNokczVEREStF8OMRNr7eyGmnS8EAcjM49UZIiIiqTDMSEgb3TDfDB/RJiIikgzDjIQSG/pm0vlEExERkWQYZiRkuzJztLACFbXsmyEiIpICw4yEIgK9EBXqA6sA7GffDBERkSQYZiR2sW+GYYaIiEgKDDMS09r7ZhhmiIiIpMAwIzFtw0zAh8+Wo6rOLHM1RERErQ/DjMQ6BXmjc7A3LFYBB/JL5S6HiIio1WGYaQG2vpl0zjdDRETkdAwzLcC+ThPDDBERkdMxzLSAGxr6Zn47U45qI/tmiIiInIlhpgV0DvZGx0AvmK0CDuaXyV0OERFRq8Iw0wIUCoX9qaYMLm1ARETkVAwzLUQbbeub4XwzREREzsQw00JsV2Z+KShDrckiczVEREStB8NMC4kK9UF7f08YLVYcOl0mdzlEREStBsNMC7m0b4bzzRARETkPw0wLsvfNsAmYiIjIaRhmWlBiw5WZQ6fLUGdm3wwREZEzMMy0oNh2vgjz80Sd2YpfC8rlLoeIiKhVYJhpQQqF4pJHtHmriYiIyBkYZlqYfZ2mXM43Q0RE5AwMMy3MtoL2gfxSGM1WmashIiJyfwwzLax7ez8E+2hQY7Ig62yZ3OUQERG5PYaZFqZUKjC0oW8mnUsbEBERNRvDjAwS7YtOMswQERE1F8OMDOx9M3l6mC3smyEiImoOhhkZ9IrwR6C3BgajBYcLK+Quh4iIyK0xzMhAqVRgSBTnmyEiInIGhhmZJHK+GSIiIqdgmJGJrW9mX64eFqsgczVERETui2FGJn06BsDfU43KOjOOsm+GiIioyRhmZKJSKpAQFQwAyMhl3wwREVFTMczIyDbfDCfPIyIiajqGGRlpG8LMvjw9rOybISIiahKGGRn16xgAXw8VymtM+P18pdzlEBERuSWGGRmpVUrE2+abYd8MERFRkzDMyEwbbZs8j30zRERETcEwIzPb5HmZ7JshIiJqEoYZmfXvFAQvjRJ6gxEnL1TJXQ4REZHbYZiRmYdaiYSu7JshIiJqKoYZF8C+GSIioqZjmHEBtvlmMnJ1EAT2zRAREYnBMOMCBkYGwlOtREmVEdnFBrnLISIicisMMy7AU61CXJcgAOybISIiEothxkVooxtuNbFvhoiISBSGGRehbZhvJj2HfTNERERiMMy4iMFdguGhUuJCZR3ydNVyl0NEROQ2GGZchJdGhUGRQQCAjBz2zRARETmKYcaF2G41ZeSyb4aIiMhRDDMu5GITMPtmiIiIHMUw40IGdw2CWqlAYXktzpTWyF0OERGRW2CYcSE+HmoM6BwIoP6pJiIiIro+hhkXc3FpA/bNEBEROYJhxsXYFp3klRkiIiLHMMy4mISoEKiUCpwprcHZMvbNEBERXY/oMPPdd9/hwQcfxM0334x7770XW7dulaKuNsvPU41+ner7ZjjfDBER0fWJCjNbt27FvHnz0KNHD0ybNg19+vTB4sWL8emnn0pVX5uU2HCries0ERERXZ9azM7/+te/8NZbbyExMdG+bdSoUX
juuedwxx13iDqxTqfDokWLkJmZCZVKhVtuuQVz586FWn15SZmZmXjllVdw6tQpBAQE4K677sLMmTNFnc+daGNC8M6
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjoAAAHBCAYAAABg9RGHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC5ZUlEQVR4nOzdd3QU1dvA8e9sb0k2vRFaqFJDFQFFUMEG/LC9il1sqIgVsTewYQPFhr13xYIdERUBAaV3CCWk1+1t3j9CFtYkYCDN8HzOyTnsnTszd26W7LO3KqqqqgghhBBCtECapi6AEEIIIURDkUBHCCGEEC2WBDpCCCGEaLEk0BFCCCFEiyWBjhBCCCFaLAl0hBBCCNFiSaAjhBBCiBZLAh0hhBBCtFgS6AjRCGRdTiGEaBoS6IgWb9WqVdxyyy0MGzaMnj17MmLECO6880527twZke+CCy7gggsuqNd7l5eXM2XKFP788896ud7w4cO57bbb6uVah2LWrFl07ty5ye6/v8WLF9O5c2cWL17c1EWp1SeffELnzp3ZtWtXUxelSd1222107tw54qdbt24MGTKEW265hT179lQ7Jy8vj0cffZRRo0bRq1cvhgwZwpVXXsnSpUtrvc+iRYvo3Lkzp5xySkM+jviP0TV1AYRoSG+//TbTp09n4MCB3HTTTSQlJbFjxw7mzJnDd999x6uvvkq3bt0a7P7r1q3js88+Y9y4cfVyvWeeeQabzVYv1/qv69atG++//z4dOnRo6qLUatiwYbz//vskJSU1dVGaXGJiIs8880z4dSAQYNu2bcyYMYMVK1bw5ZdfYjKZAFi2bBnXXHMNsbGxXHjhhbRr146ysjI++OADLrjgAh588EHOPPPMavf4+OOP6dSpExs3bmTJkiUMGDCg0Z5PNF8S6IgWa9myZUybNo3x48dzxx13hNMHDhzIiBEjGDduHFOnTmXu3LlNWMq6Oeqoo5q6CM2GzWajd+/eTV2MA4qLiyMuLq6pi9EsGAyGar+vfv36odfrmTJlCj/++COnnnoqpaWlTJ48mbZt2/Lqq69iNpvD+U866SQmTpzIfffdx7HHHhsRQFZUVPD9999z55138tprr/Hee+9JoCMA6boSLdjLL79MVFQUN954Y7VjcXFx3HbbbZx00kk4HI4az+/cuTOzZs2KSPtn101xcTE333wzgwcPpkePHowZM4bPPvsMqOxaufDCCwG48MILI7rFfvjhB8aNG0ePHj0YPHgwDz74IC6XK+I+J554Is888wwDBw7khBNOoKSkJKLrateuXXTu3Jl58+YxadIksrKy6N+/P3fccQdOpzN8Lb/fz4wZMzj22GPp2bMnl112GZ999tlBu1S8Xi8PPfQQgwcPJisri6lTp+L1eqvl+/PPPzn//PPp1asXAwYMYMqUKRQXF4ePh0Ihnn76aYYPH0737t0ZPnw4TzzxBH6/v9Z7z5o1i+HDhzN//vxw18VZZ53FokWLwnn+2XVVVWc///wzp59+Ot27d2fkyJF8+umnEdfesmULl19+OX369OGYY47hySefZOrUqRG/H6/Xy7PPPsuoUaPo0aMHJ510Ei+++CKhUCicp6auzn+W6Z9dV7fddhsXX3wxH3/8MSNHjqR79+6MHj2aBQsWHFZ9QeV76rzzziMrK4vu3bszatQo3nrrrfDz9OvXj+nTp0ecEwqFGDJkCPfdd1847cMPP+TUU0+le/fuDBs2jFmzZhEIBCLO++233xg/fjxZWVkMGTKEu+++m7KysgOWrzY9evQAYPfu3QB89tln5Ofnc/vtt0cEOQAajYabbrqJ8ePHV/t/++WXX+Lz+Tj22GMZPXo03333XcT7UBy5JNARLZKqqvz6668MGjSo2h/LKqNGjeLaa689rK6gW265hc2bN3Pffffx4osvctRRRzFlyhQWL15Mt27duPvuuwG4++67ueeeewD44osvuOaaa2jfvj3PPvss11
57LXPnzmXixIkRg5ZzcnL4/vvveeKJJ5g8eTKxsbE1luGee+4hPT2d2bNnM2HCBD7++GOef/758PG7776b119/nfPPP59nn32WhIQE7rrrrn/1bO+//z6XX345Tz31FGVlZbz22msReZYuXcrFF1+MyWTiqaee4vbbb2fJkiVceOGFeDweAF566SXefvttrrnmGl555RXOPfdc5syZE1HGmhQXFzNlyhTOO+88nn76acxmM5dffjmrV6+u9ZyCggLuv/9+LrzwQl588UVatWrFbbfdxpYtW8LXPP/889mzZw8PPfQQd955J9988w1ffvll+BqqqnLVVVcxZ84czjzzTJ5//nlGjRrFU089Ff4dHo7Vq1fz8ssvM2nSJJ599ll0Oh2TJk0KBwqHUl8///wz11xzDd26dWP27NnMmjWL9PR0HnjgAZYvX47RaGTkyJHMmzcvIlhbvHgxBQUFjBkzBoAXXniBu+66i0GDBvH8888zfvx4XnrppfD7GGDBggVMmDABu93Ok08+yS233MJPP/3EpEmTDqk+tm3bBkDr1q0BWLhwIfHx8fTs2bPG/B07duS2226jffv2Eekff/wxxxxzDMnJyYwdO5ZQKMRHH310SGUSLYt0XYkWqaSkBK/XS6tWrRr0PkuWLGHixImccMIJQGW3mN1uR6vVYrPZwuNHOnToQIcOHVBVlRkzZjB06FBmzJgRvk7btm25+OKLWbBgAcOGDQMqxzBMmTKFY4455oBlOO6445gyZQoAgwYN4rfffuPnn3/mpptuYseOHXz66adMmTKFSy65BIChQ4dSWFjIr7/+Wus1N23axLfffsvdd9/N+PHjw+edfvrpbN68OZzv8ccfp127drzwwgtotVoAevXqxamnnsrHH3/M+PHjWbJkCd26deOMM84AYMCAAZjN5oMGmG63m3vvvZexY8cCcPTRR3PCCSfw4osvMnPmzFrPmTZtGoMGDQIq6/X4449nwYIFZGZm8uabb+J0Ovnss89ITk4Ol3fkyJHha/zyyy/8/vvvPPbYY4wePRqAwYMHYzKZePrpp7nooosOa1xQRUUFn3zySfiD3WKxcP755/PHH38wcuTIQ6qvzZs3M3bs2Igu2qysLAYOHMjSpUvp06cPY8aM4aOPPuLPP/8Md+l88cUXtGnTht69e1NRUcFzzz3HOeecw5133gnAkCFDsNvt3HnnnVxyySV07NiRmTNn0qVLF5599tnwvUwmE0888QR5eXnheq3J/i1DDoeDVatW8dBDD5Gens5xxx0HVA5Cruv/202bNrFq1SqefPJJAJKTkxk8eDAffPABl19+OYqi1Ol6omWRFh3RImk0lW/tYDDYoPcZOHAgs2bN4vrrr+eTTz4Jt0L069evxvxbt24lNzeX4cOHEwgEwj/9+/fHZrPx22+/ReTv1KnTQcvwz3EPKSkp4W6wxYsXo6oqo0aNishz2mmnHfCaVbPERowYEU7TaDQRAYHb7ebvv//muOOOQ1XV8LNkZGSQmZkZfpaBAwfy+++/c9555/Hqq6+yZcsWzj///HAAUxutVsupp54afm0ymTj22GNZtmzZAc/bvz5SUlIAwvXxxx9/kJWVFfFhnJ6eTlZWVvj1kiVL0Gq11WbuVAU9hzvLKy4uLhzk7F9Gt9sNHFp9TZgwgUceeQSXy8X69euZN28eL774IkC4y6t///6kp6fz1VdfAeDz+fj+++/Dz7VixQrcbne19+bw4cOByu4qj8fDmjVrwoF9lZEjR/Ltt98eMMjZvXs33bp1C/8MHDiQCRMmEB8fz+zZs8Mtr4qi1Pn/7UcffYTVamXAgAGUl5dTXl7OqFGj2Llz5wEDenFkkBYd0SLZ7XasVis5OTm15nG5XPh8Pux2+yHf58knn+T5559n3rx5fPPNN2g0Go455hjuvfdeMjIyquUvLS0F4L777osYF1ElPz8/4nVCQsJBy1DTOIaqLrCqMQrx8fF1um5VN8o/B9ImJi
aG/11eXk4oFOKll17ipZdeqnYNo9EIVH4IW61WPv74Yx555BEefvhhOnXqxO233x5uealJXFwcer0+Ii0+Pv6gY0H
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pca = PCA()\n",
"data_pca = pca.fit_transform(df_encoded_pca)\n",
"df_encoded_pca['pca1'] = data_pca[:, 0]\n",
"df_encoded_pca['pca2'] = data_pca[:, 1]\n",
"\n",
"# Plot explained variance ratio\n",
"plt.plot(pca.explained_variance_ratio_)\n",
"plt.xlabel('Principal Component')\n",
"plt.ylabel('Explained Variance Ratio')\n",
"plt.show()\n",
"\n",
"# Plot first two principal components\n",
"sns.scatterplot(x='pca1', y='pca2', hue='Species', data=df_encoded_pca)\n",
"plt.title('Clustering des pingouins avec PCA')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am not very confident with the PCA, but I think the best number components is 2 because the explained variance ratio is low and the number of component minimal. As we see with the PCA after all the scaling and encoding, it turned out that the result cluster are very clear."
]
},
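{
"cell_type": "markdown",
"metadata": {},
"source": [
"A cumulative view of the explained variance can back up the choice of components. The snippet below is only a minimal sketch, not the code that produced the plots in this notebook: `features` is a hypothetical random stand-in for the six scaled measurement columns (without `Species`), and the 0.90 threshold is an arbitrary assumption.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Hypothetical stand-in for the six scaled feature columns (Species excluded).\n",
"rng = np.random.default_rng(42)\n",
"features = pd.DataFrame(rng.random((341, 6)), columns=[f'f{i}' for i in range(6)])\n",
"\n",
"pca = PCA().fit(features)\n",
"\n",
"# Cumulative share of variance captured by the first k components.\n",
"cumulative = np.cumsum(pca.explained_variance_ratio_)\n",
"\n",
"# Smallest k whose cumulative share reaches the assumed 90% threshold.\n",
"n_components = int(np.searchsorted(cumulative, 0.90) + 1)\n",
"print(cumulative.round(3), n_components)\n",
"```\n",
"\n",
"On the real data, `features` would be replaced by the feature columns of `df_encoded_pca`, leaving out `Species`."
]
},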
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjMAAAGtCAYAAADj1vVsAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAABf40lEQVR4nO3deXhM9/4H8Pcs2WURIbGEECGWkElCLEmRqCWxi6VKaWvrILa6tmrRUtqiYqulraX22Aluq1UJ2RDEHglJCCG7bJJMzu+Pe8XNj5IhMyeTvF/P43k6J2fOefs8U97O98yMRBAEAUREREQ6Sip2ACIiIqJ3wTJDREREOo1lhoiIiHQaywwRERHpNJYZIiIi0mksM0RERKTTWGaIiIhIp7HMEBERkU6Tix1A04qLi1FUVASpVAqJRCJ2HCIiIioDQRBQXFwMuVwOqfT1114qfZkpKipCdHS02DGIiIjoLTg5OUFfX/+1+1T6MvO8zTk5OUEmk5XrsVUqFaKjozVybHqBc9YOzlk7OGft4Jy1R1Ozfn7cN12VAUQsM6mpqZg3bx4iIiIgk8nQp08fzJw5E3L5y5H279+PDRs2IDk5GU2aNMHnn3+ONm3alOk8z5eWZDKZxl7Qmjw2vcA5awfnrB2cs3ZwztqjqVmX5RYR0W4AnjJlCoyNjREcHIzAwECEhoZi8+bNL+136tQpfPXVV5g5cybOnz+PTz/9FGPGjEFcXJz2QxMREVGFI0qZiY+PR0REBGbMmAEjIyPY2tpCqVRi+/btL+179OhR9OrVC126dIFMJkO3bt3g5uaGffv2iZCciIiIKhpRlpliYmJgYWEBa2vrkm329vZISkpCVlYWzMzMSrarVCoYGxuXer5UKlX7yoxKpXq30K85piaOTS9wztrBOWsH56wdnLP2aGrW6hxPlDKTk5MDIyOjUtueP87NzS1VZrp3744vv/wS3bt3h4uLC06fPo3Q0NAy3zPznCbf0cR3S2kH56wdnLN2cM7awTlrj5izFqXMGBsbIy8vr9S2549NTExKbff19UVaWhrmzZuHzMxMdOrUCb169Xrp+W/CdzPpLs5ZOzhn7eCctYNz1h5Nv5upLEQpMw4ODsjIyEBKSgqsrKwAALGxsbCxsYGpqWmpfZ88eQJPT0+MGDGiZNvgwYPRrVs3tc7JdzPpPs5ZOzhn7eCctYNz1h4xZy3KDcB2dnZwdXXF4sWLkZ2djcTERKxduxZ+fn4v7RsZGYkRI0bgwYMHePbsGTZv3oy7d++if//+IiQnIiKiika0t2YHBASgqKgI3t7eGDx4MDw9PaFUKgEACoUChw8fBgD4+PhgyJAhGDJkCNq3b49Tp05hy5YtqFGjhljRiYiIqAIR7UPzrKysEBAQ8MqfRUVFlXo8ceJETJw4URuxiIiISMfwW7OJiIhIp7HMEBERkU5jmSEiIiKdxjLzlnILirAtLB5JT4vEjkJERFSliXYDsK47deMx5h+5AQOZBLkm9zGkTQOxIxEREVVJvDLzlrwca6GjfQ08UwmYue8qpu25hJxnvEpDRESkbSwzb8nEQI5fR7nhgxbVIJUA+y8+QJ/VIbj5KEvsaERERFUKy8w7kEkl8GteDb992hbWZgaIfZKDvqvPYldEAgRBEDseERFRlcAyUw7cG1oiyN8TnZrUxLOiYszaH43Juy4hm8tOREREGscyU05qVDPAr6PaYGYPR8ikEhy+nITeq0JwLSlT7GhERESVGstMOZJKJfissz12j22H2uaGuJuSg/5rz2FbWDyXnYiIiDSEZUYD3Oz+s+zk7VgLBUXFmHfwKibuiEJWfqHY0YiIiCodlhkNqW6ij00j3fCFbzPIpRIci36IXgEhuHI/Q+xoRERElQrLjAZJJBKM9myEvePbo66FERLScjFw3Tn8evYul52IiIjKCcuMFijqV0eQvye6NbdGoUrAgiPXMW7bBWTmctmJiIjoXbHMaIm5sR
7Wj3DF/N7NoS+T4t/Xk+ETEIyohHSxoxEREek0lhktkkgkGNWxIfZ91gH1LY3xICMPg34KxcYzcSgu5rITERHR22CZEYFTPXMc9feAr1NtFBULWBR0A6O3nkd6ToHY0YiIiHQOy4xIzAz1sHqYAt/0awl9uRR/3nwMn4BgnL+XJnY0IiIincIyIyKJRILh7RrggLIDGlqZ4GFmPoZsCMPa03e47ERERFRGLDMVQIs65jgyyQN9netAVSzguxO3MGpzJFKyn4kdjYiIqMJjmakgqhnI8eMQZywd6AQDuRRnbj+Bz8pghMWlih2NiIioQmOZqUAkEgmGtKmPwxM90LhWNTx++gzDNoYh4FQMVFx2IiIieiWWmQqoqY0pDk/siIEu9VAsAMt/v42PfgnH46f5YkcjIiKqcFhmKihjfTmWDW6NHwa1hpGeDGfvpMJnZQjO3kkROxoREVGFwjJTwfm51sORSR3R1NoUKdnPMPzncCz/9y0UqYrFjkZERFQhsMzogMa1THFwQkcMbWMLQQAC/ryDYZvCkZzFZSciIiKWGR1hpC/DkoGtsHKoM0z0ZYi4m4aeK4Nx+tZjsaMRERGJimVGx/R1rosjkzzQrLYZ0nIKMOrXSCw9cZPLTkREVGWxzOigRjWr4YCyA4a3qw8AWHc6FkM3hCEpI0/kZERERNrHMqOjDPVk+KafE1YPU8DUQI7z8enwCQjGqRvJYkcjIiLSKpYZHderVR0c9feAU11zZOQW4tMt57Ho2HUUFHHZiYiIqgaWmUqgQQ0TBH7WHqM62AEANgbfxeD1oUhMyxU3GBERkRaIVmZSU1OhVCrh5uYGd3d3LFq0CEVFRa/cd8uWLfDy8oKLiwt69+6NkydPajltxWcgl2F+nxZYP8IVZoZyXErMgG9AME5eeyR2NCIiIo0SrcxMmTIFxsbGCA4ORmBgIEJDQ7F58+aX9vv777+xfv16bNq0CRcvXsTEiRMxZcoU3L9/X/uhdUD3FjY45u8JZ1sLZOUXYdy2C5h/+BqeFanEjkZERKQRopSZ+Ph4REREYMaMGTAyMoKtrS2USiW2b9/+0r5xcXEQBKHkl0wmg56eHuRyuQjJdYOtpTH2jGuPMZ4NAQCbz92D37pQxKfmiJyMiIio/InSCGJiYmBhYQFra+uSbfb29khKSkJWVhbMzMxKtvv6+mL//v3w8fGBTCaDRCLB999/DxsbG7XOqVKV/5WJ58fUxLHflUwCzOrRFG3tqmNGYDSiH2TCNyAE3/ZvCR8n9WYntoo858qEc9YOzlk7OGft0dSs1TmeKGUmJycHRkZGpbY9f5ybm1uqzBQWFsLR0RGLFi2Co6Mjjhw5grlz58Le3h5NmzYt8zmjo6PLJ7yWj/2uLAEs9bLAirAM3EwtxKRdl3A00gijWptBXyYRO55aKvKcKxPOWTs4Z+3gnLVHzFmLUmaMjY2Rl1f6A96ePzYxMSm1/euvv4aLiwtatWoFABg4cCCOHj2KAwcOYNasWWU+p5OTE2Qy2TsmL02lUiE6Olojxy5vndyL8eMfMfjpzF2cjM1DQo4cqz5wRkMrkzc/WWS6NGddxjlrB+esHZyz9mhq1s+PWxailBkHBwdkZGQgJSUFVlZWAIDY2FjY2NjA1NS01L5JSUlo2bJlqW1yuRx6enpqnVMmk2nsBa3JY5cXmUyGWT7N0c7eCtP2XMaNR0/Rd805LB7ghL7OdcWOVya6MOfKgHPWDs5ZOzhn7RFz1qLcAGxnZwdXV1csXrwY2dnZSExMxNq1a+Hn5/fSvl5eXvjtt99w7do1FBcX48SJEwgPD4ePj48IyXVf56a1cHyyJ9wbWiKnQIXJuy5h1r4ryCvgujIREekm0d6aHRAQgKKiInh7e2Pw4MHw9PSEUqkEACgUChw+fBgAMHHiRHz44YeYNGkS2rRpgw0bNmDNmjVo1qyZWNF1nrWZIbaPdoe/V2NIJMCuyET0W3MWdx4/FTsaERGR2kR7f7
OVlRUCAgJe+bOoqKiS/5bL5Zg0aRImTZqkrWhVglwmxbRuTdG2YQ1M2X0Jt5Kfoveqs/i6X0v4udYTOx4REVGZ8es
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAkQAAAHBCAYAAACIdaSsAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAADCBElEQVR4nOzdd3xUVdrA8d+d3tI7IaGE0EGqgBQRRbAhYlvF3gVFlEWwiwV1F0VpYsVXxbXrYl07oCJIUXqHEJKQXqeX+/4xZmBMQk1CIM/388muc8+59557MmSeOVVRVVVFCCGEEKIZ0xzvAgghhBBCHG8SEAkhhBCi2ZOASAghhBDNngREQgghhGj2JCASQgghRLMnAZEQQgghmj0JiIQQQgjR7ElAJIQQQohmTwIiIZoAWR9VCCGOLwmIRLO3bt06Jk+ezNChQ+nevTtnnnkmDz74INnZ2WH5rr76aq6++up6vXdFRQVTpkxh5cqV9XK9YcOGMXXq1Hq51tGYPXs2HTp0OG73P9Dy5cvp0KEDy5cvP95FqdPHH39Mhw4d2Lt37/EuynE1depUOnToEPbTpUsXBg0axOTJk8nLy6txTn5+Pv/6178YOXIkp5xyCoMGDeLWW2/l999/r/M+y5Yto0OHDpx77rkN+TjiBKU73gUQ4nhauHAh06dPp1+/fkyaNInExET27NnDq6++yjfffMOCBQvo0qVLg91/06ZNfPrpp4wZM6ZerjdnzhxsNlu9XOtE16VLF9577z3atWt3vItSp6FDh/Lee++RmJh4vIty3CUkJDBnzpzQa5/Px65du5gxYwZr1qzh888/x2QyAbBq1SrGjx9PTEwM11xzDW3atKG8vJz333+fq6++mieeeIJLLrmkxj0++ugj2rdvz9atW1mxYgWnnnpqoz2faPokIBLN1qpVq3jyyScZO3YsDzzwQOh4v379OPPMMxkzZgz33XcfixYtOo6lPDKdO3c+3kVoMmw2Gz169DjexTio2NhYYmNjj3cxmgSDwVDj99WnTx/0ej1Tpkzh+++/57zzzqOsrIyJEyfSunVrFixYgNlsDuU/++yzGTduHNOmTWPIkCFhgWZlZSXffvstDz74IG+88QbvvvuuBEQijHSZiWbrtddeIyIignvuuadGWmxsLFOnTuXss8+mqqqq1vM7dOjA7Nmzw479vcuopKSEf/7znwwcOJBu3bpx4YUX8umnnwLBLp1rrrkGgGuuuSasO+67775jzJgxdOvWjYEDB/LEE0/gcDjC7jN8+HDmzJlDv379OOussygtLQ3rMtu7dy8dOnTgq6++YsKECfTs2ZO+ffvywAMPYLfbQ9fyer3MmDGDIUOG0L17d2688UY+/fTTQ3bluN1unnrqKQYOHEjPnj257777cLvdNfKtXLmSq666ilNOOYVTTz2VKVOmUFJSEkoPBAK88MILDBs2jK5duzJs2DCee+45vF5vnfeePXs2w4YN48cffwx1mVx66aUsW7YslOfvXWbVdfbTTz9xwQUX0LVrV0aMGMEnn3wSdu0dO3Zw880306tXL0477TRmzpzJfffdF/b7cbvdzJ07l5EjR9KtWzfOPvtsXn75ZQKBQChPbV2sfy/T37vMpk6dynXXXcdHH33EiBEj6Nq1K6NGjWLx4sXHVF8QfE9deeWV9OzZk65duzJy5Ejefvvt0PP06dOH6dOnh50TCAQYNGgQ06ZNCx374IMPOO+88+jatStDhw5l9uzZ+Hy+sPN++eUXxo4dS8+ePRk0aBAPP/ww5eXlBy1fXbp16wZATk4OAJ9++ikFBQXcf//9YcEQgEajYdKkSYwdO7bGv9vPP/8cj8fDkCFDGDVqFN98803Y+1AICYhEs6SqKj///DMDBgyo8Ue12siRI7njjjuOqQtq8uTJbN++nWnTpvHyyy/TuXNnpkyZwvLly+nSpQsPP/wwAA8//DCPPPIIAJ999hnjx4+nbdu2zJ07lz
vuuINFixYxbty4sMHXubm5fPvttzz33HNMnDiRmJiYWsvwyCOPkJqayrx587jpppv46KOPmD9/fij94Ycf5v/+7/+46qqrmDt3LvHx8Tz00EOH9WzvvfceN998M88//zzl5eW88cYbYXl+//13rrvuOkwmE88//zz3338/K1as4JprrsHlcgHwyiuvsHDhQsaPH8/rr7/OFVdcwauvvhpWxtqUlJQwZcoUrrzySl544QXMZjM333wz69evr/OcwsJCHnvsMa655hpefvllWrZsydSpU9mxY0fomldddRV5eXk89dRTPPjgg3z99dd8/vnnoWuoqsptt93Gq6++yiWXXML8+fMZOXIkzz//fOh3eCzWr1/Pa6+9xoQJE5g7dy46nY4JEyaEAoqjqa+ffvqJ8ePH06VLF+bNm8fs2bNJTU3l8ccfZ/Xq1RiNRkaMGMFXX30VFtQtX76cwsJCLrzwQgBeeuklHnroIQYMGMD8+fMZO3Ysr7zySuh9DLB48WJuuukmoqOjmTlzJpMnT+aHH35gwoQJR1Ufu3btAiA9PR2ApUuXEhcXR/fu3WvNn5mZydSpU2nbtm3Y8Y8++ojTTjuNpKQkRo8eTSAQ4MMPPzyqMomTk3SZiWaptLQUt9tNy5YtG/Q+K1asYNy4cZx11llAsDsuOjoarVaLzWYLjW9p164d7dq1Q1VVZsyYweDBg5kxY0boOq1bt+a6665j8eLFDB06FAiOsZgyZQqnnXbaQctw+umnM2XKFAAGDBjAL7/8wk8//cSkSZPYs2cPn3zyCVOmTOH6668HYPDgwRQVFfHzzz/Xec1t27bxv//9j4cffpixY8eGzrvgggvYvn17KN+zzz5LmzZteOmll9BqtQCccsopnHfeeXz00UeMHTuWFStW0KVLFy6++GIATj31VMxm8yEDUafTyaOPPsro0aMB6N+/P2eddRYvv/wys2bNqvOcJ598kgEDBgDBej3jjDNYvHgxGRkZvPXWW9jtdj799FOSkpJC5R0xYkToGkuWLOHXX3/l3//+N6NGjQJg4MCBmEwmXnjhBa699tpjGrdUWVnJxx9/HAoALBYLV111Fb/99hsjRow4qvravn07o0ePDusa7tmzJ/369eP333+nV69eXHjhhXz44YesXLky1JX02Wef0apVK3r06EFlZSUvvvgil19+OQ8++CAAgwYNIjo6mgcffJDrr7+ezMxMZs2aRceOHZk7d27oXiaTieeee478/PxQvdbmwJamqqoq1q1bx1NPPUVqaiqnn346EBxMfaT/brdt28a6deuYOXMmAElJSQwcOJD333+fm2++GUVRjuh64uQkLUSiWdJogm99v9/foPfp168fs2fP5q677uLjjz8OtWr06dOn1vw7d+5k3759DBs2DJ/PF/rp27cvNpuNX375JSx/+/btD1mGv4/LSE5ODnW/LV++HFVVGTlyZFie888//6DXrJ4Vd+aZZ4aOaTSasMDB6XTy559/cvrpp6OqauhZ0tLSyMjICD1Lv379+PXXX7nyyitZsGABO3bs4KqrrgoFOnXRarWcd955odcmk4khQ4awatWqg553YH0kJycDhOrjt99+o2fPnmEf2qmpqfTs2TP0esWKFWi12hozlaqDo2Od1RYbGxsKhg4so9PpBI6uvm666SaeeeYZHA4Hmzdv5quvvuLll18GCHW19e3bl9TUVL744gsAPB4P3377bei51qxZg9PprPHeHDZsGBDsJnO5XGzYsCH0BaDaiBEj+N///nfQYCgnJ4cuXbqEfvr168dNN91EXFwc8+bNC7XkKopyxP9uP/zwQ6xWK6eeeioVFRVUVFQwcuRIsrOzDxr4i+ZFWohEsxQdHY3VaiU3N7fOPA6HA4/HQ3R09FHfZ+bMmcyfP5+vvvqKr7/+Go1Gw2mnncajjz5KWlpajfxlZWUATJs2LWzcRrWCgoKw1/Hx8YcsQ23jLKq73qrHUMTFxR3Rdau7b/4+ID
ghISH03xUVFQQCAV555RVeeeWVGtcwGo1A8MPaarXy0Ucf8cwzz/D000/Tvn177r///lBLTm1iY2PR6/Vhx+Li4g4
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pca = PCA(n_components=2)\n",
"data_pca = pca.fit_transform(df_encoded_pca)\n",
"df_encoded_pca['pca1'] = data_pca[:, 0]\n",
"df_encoded_pca['pca2'] = data_pca[:, 1]\n",
"\n",
"# Plot explained variance ratio\n",
"plt.plot(pca.explained_variance_ratio_)\n",
"plt.xlabel('Principal Component')\n",
"plt.ylabel('Explained Variance Ratio')\n",
"plt.show()\n",
"\n",
"# Plot first two principal components\n",
"sns.scatterplot(x='pca1', y='pca2', hue='Species', data=df_encoded_pca)\n",
"plt.title('Clustering des pingouins avec PCA')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 1.00\n"
]
}
],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Definire le feature e il target\n",
"X = df_encoded_pca[['pca1', 'pca2']]\n",
"y = df_encoded_pca['Species']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"knn = KNeighborsClassifier(n_neighbors=5)\n",
"\n",
"knn.fit(X_train, y_train)\n",
"\n",
"accuracy = knn.score(X_test, y_test)\n",
"print(\"Accuracy: {:.2f}\".format(accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the PCA, the prediction is very good."
]
},
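{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check how much of that accuracy survives without the label in the PCA input is to put PCA and KNN in a single pipeline and cross-validate it on the features only. The snippet below is a hedged sketch under stated assumptions: `X` and `y` are synthetic stand-in data from `make_classification`, not the penguin dataset, and 5 folds is an arbitrary choice.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# Hypothetical stand-in data: 341 samples, 6 features, 3 classes.\n",
"X, y = make_classification(n_samples=341, n_features=6, n_informative=4,\n",
"                           n_redundant=0, n_classes=3, random_state=42)\n",
"\n",
"# PCA is refitted inside every fold, on the training part only,\n",
"# and never sees the target column.\n",
"model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))\n",
"\n",
"scores = cross_val_score(model, X, y, cv=5)\n",
"print('Mean accuracy: {:.2f}'.format(scores.mean()))\n",
"```\n",
"\n",
"With the notebook data, `X` would be the six feature columns and `y` the encoded `Species` column."
]
},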
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reducing and visualizing using t-SNE"
]
},
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [],
"source": [
"df_encoded_tsne = df_encoded_pca.copy()"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAHBCAYAAACVC5o3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAABq+klEQVR4nO3deVxU5f4H8M+wDJvIIiq4pKaCIiqiuQRXk+RqbplLmmtaloY/y6XUzDJJ7V4r90oL9RaalWW5lLlUt5uhuKEglkqgpqCyK7sz5/cHzTgDs5wzG8Pweb9eva6cOXPm4bkzh+88z/f5PjJBEAQQEREROQin2m4AERERkSUxuCEiIiKHwuCGiIiIHAqDGyIiInIoDG6IiIjIoTC4ISIiIofC4IaIiIgcCoMbIiIicigMbogcCGtymqZ6vzlCPzrC70BkKgY3RA7iyJEjWLBggahzr1+/jsWLF6Nfv34ICwtD79698fzzz+PYsWNa561fvx4hISHYtm2bzussXLgQ0dHR6p+//vprhISEGPzv4sWLBtt28eJFzJkzB5GRkQgLC0NUVBReeuklpKWlifrdpKioqMDKlSuxd+9e9bHLly/jqaeesvhrqejqo44dO+Khhx7CtGnTcOrUqRrn/vXXX6KvX1RUhAULFuDkyZPWaD5RneBS2w0gIsvQF4BUd/v2bYwdOxZNmjTBnDlz0KxZM+Tl5eHLL7/E008/jXXr1uGf//yn1nNWr16N/v37o1WrVqJeY8OGDWjcuLHOxx544AG9z7t06RLGjh2LLl26YPHixQgICEB2djYSEhIwduxYfPrppwgPDxfVBjFu3bqFbdu2YeXKlepj33//Pc6cOWOx19BHs4+USiVycnKwceNGTJkyBbt27UKHDh1Muu6FCxfwzTffYOTIkZZsLlGdwuCGqJ754osvUFRUhO+//x7e3t7q4zExMRgzZgzWrl1bI7iRy+VYtGgREhIS4ORkfMC3Y8eOaNGiheS2bd26Fb6+vvj444/h6uqqPj5gwAA89thjeP/997F582bJ17VHuvooNDQUMTEx2LFjB5YtW1ZLLSOq+zgtReQAJk2ahKSkJCQlJSEkJATHjx/Xe25OTg5kMhmUSqXWcWdnZ8ybNw9PPvlkjecsXLgQp06dwqeffmrxtldvG1AzX8TT0xOLFi3CY489pnV8//79GDlyJLp27YpHHnkEq1atQkVFhfrxw4cPY/z48ejWrRvCwsIwaNAgJCQkAAD++usvPProowCARYsWITo6GuvXr8eGDRsAACEhIVi/fj2AqpGVzZs3IyYmBmFhYRg4cGCNvpg0aRLmz5+P2bNnIyIiAs8995zk379Fixbw8/PDjRs39J5z9OhRjB8/Ht27d0evXr0wb948ZGVlAQCOHz+OyZMnAwAmT56MSZMmSW4DkSNgcEPkAN544w2EhoYiNDQUn3/+OTp16qT33EceeQRlZWV48sknER8fj7S0NCgUCgBAZGQkpkyZUuM5o0aNQt++fbF69WpcuXLFaHuUSiXu3btX47/qAZWutt24cQPjxo3D9u3bkZ6erg50Bg0ahCeeeEJ97s6dOzF37lx07NgRGzZswPPPP48dO3Zg6dKlAICff/4ZsbGx6NSpE95//32sX78ezZs3R1xcHE6fPo0mTZqoA5mZM2diw4YNGDNmDEaPHg0A+PzzzzFmzBgAwNKlS7Fu3ToMHz4cH374IQYNGoQVK1Zg48aNWu3//vvv4erqio0bN6qDDCny8/ORn5+vd+ru22+/xbRp09C0aVO89957WLRoEc6cOYOxY8ciNzcXnTp1wuuvvw4AeP311/HGG29IbgORI+C0FJEDaNeuHRo0aAAARnNS+vXrh9dffx3vvfce/v3vfwMAGjRogD59+mDcuHGIiorS+by4uDgMHToUr776KhISEiCTyfS+RkxMjM7jffr0MZgbNH78eNy+fRvx8fHqaRk/Pz9ERUVh0qRJ6N
q1K4Cq4Gn9+vWIiYnB8uXL1c8vLy/H7t27UVFRgcuXL2PEiBFYvHix+vFu3bqhV69eOHHiBCIiItCxY0cAVXlAoaGhAIDAwEAA9/sxIyMDX3zxBebOnasejYmKioJMJsOmTZswfvx4+Pn5AQCcnJwQFxcHT09Pvb+jiioAVLX7ypUrWLVqFZycnDB27Fid569atQoPP/wwVq9erT4eERGBwYMHY8uWLXj55ZfRrl07AFXvCdW/ieobBjdEDkqhUGhN7zg5OanzZSZMmICRI0fi119/RWJiIpKSknDo0CEcOnQIU6dOxcKFC2tcLzAwEAsWLMBrr72GTz/91ODIxAcffKAzoVgVgBny4osv4umnn8b//vc/JCYm4vjx49i7dy/27duHRYsWYcqUKcjIyEBOTg4GDBig9dynn34aTz/9NADg2WefBQCUlJTg6tWryMjIQEpKCgCgsrLSaDtUjh07BkEQEB0drQ5GACA6OhoffPABTp06pW5HixYtRAU2gO4AsHnz5li1ahVCQkJqPJaRkYHbt29j7ty5WscfeOABdOvWzeBUJFF9w+CGyEHFxMTg+vXr6p+feOIJvP322+qfPTw8EBMTo/4je+XKFSxevBhbt27FyJEjERwcXOOaY8aMwYEDB/Dee+/hkUce0fvawcHBJiUUq/j4+GDo0KEYOnQoACAtLQ2vvPIK3nnnHQwfPhwFBQUAgEaNGum9Rl5eHt544w0cPnwYMpkMrVq1Qvfu3QFIqwGjeq0hQ4bofPzmzZvqfwcEBIi+rmYA6OrqCj8/PzRt2tRoO3S9RkBAgFWWyhPVVQxuiBzUBx98oJVc6+fnB4VCgZiYGIwYMQKzZ8/WOr9Vq1ZYvHgxRowYgcuXL+sMbgDgrbfewtChQ7F48WI0a9bMYu29efMmRo0ahRdffFGd66ISGhqKl156CbGxsbh27RoaNmwIoCqA0VRQUIDz588jPDwc8+fPR3p6OrZu3YqIiAjI5XKUlpbiyy+/lNQu1Wv95z//gZeXV43HTe0DqQGgr68vgPtJ15pu376tnhojIiYUEzmM6ku0Q0JC0LlzZ/V/LVq0gLOzM5o0aYKvvvoK+fn5Na6RkZEBAHoDGwAICgrCggULkJSUhCNHjlis/QEBAXBxccGOHTtQXl5e4/E///wTbm5uaNWqFR588EH4+fnVeP29e/di+vTpKC8vx6lTpzBw4ED07t0bcrkcAPDLL78AgDqx2dnZucbrVO/Hhx56CEBVsq9mfxYUFGDNmjXqERVra9OmDRo3bqxVcBAArl27huTkZERERADQ/TsR1TccuSFyEA0bNsSZM2eQmJiI0NBQ+Pj46Dzvtddew6RJkzBy5EhMnjwZHTt2hFKpxIkTJ7Bt2zaMGzfOaCLqk08+iQMHDuDo0aPqkQ1NFy5c0DnCAFSNdDRp0qTGcWdnZyxduhSxsbEYNWoUJkyYgLZt26K0tBRHjx7F9u3b8eKLL6p/r//7v//DsmXLsHTpUsTExCAzMxNr1qzBU089BX9/f3Tp0gV79+5Fp06dEBgYiDNnzmDTpk2QyWQoLS0FAHWdn8TERLRt2xZdu3ZV/z779u1D165dERwcjOHDh2PJkiW4fv06wsLCkJGRgdWrV6NFixZo3bq1wb6yFCcnJ8ydOxeLFi3CnDlzMGLECOTn52PDhg3w8fHB1KlTtX6nn3/+GT4+PiYXAySqyxjcEDmICRMmIDU1FdOnT8fKlSsxbNgwneeFhYXhm2++waZNm5CQkIDbt2/D2dkZ7dq1w6uvvqpeCm2ManpKl1mzZul93iuvvIJnnnlG52OPPPIIvvjiC8THx+PDDz9EXl4e5HI5QkNDsXr1aq3ighMmTICnpyfi4+Oxa9cuNG3aFNOmTVOvaHr77bcRFxeHuLg4AEDr1q3x5ptvYs+ePeqtCRo0aICpU6fi888/x88//4yjR4/in//8J7799lssXLgQo0ePxtKlS7Fy5Ups2rQJO3fuRHZ2Nh
o1aoTBgwfjpZdesulIyciRI+Hl5YVNmzYhNjYWDRo0wD/+8Q/MnTtXnb/Tvn17DB06FNu3b8f//vc/7Nu3z2btI7I
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tsne = TSNE(n_components=2)\n",
"\n",
"# fit and transform the data to two dimensions\n",
"tsne_data = tsne.fit_transform(df_encoded_tsne)\n",
"\n",
"# create a scatter plot of the t-SNE representation of the data\n",
"plt.scatter(tsne_data[:, 0], tsne_data[:, 1])\n",
"plt.xlabel('t-SNE Component 1')\n",
"plt.ylabel('t-SNE Component 2')\n",
"plt.title('t-SNE Scatter Plot')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 1.00\n"
]
}
],
"source": [
"X = tsne_data[:, :2]\n",
"y = df_encoded_tsne['Species']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"knn = KNeighborsClassifier(n_neighbors=5)\n",
"\n",
"knn.fit(X_train, y_train)\n",
"\n",
"accuracy = knn.score(X_test, y_test)\n",
"print(\"Accuracy: {:.2f}\".format(accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also with the t-SNE the prediction is very good."
]
},
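{
"cell_type": "markdown",
"metadata": {},
"source": [
"The t-SNE scatter plot is easier to read when the points are coloured by class and the run is made repeatable. The snippet below is a minimal sketch on hypothetical stand-in data (`make_blobs`), not on the penguin frame; `perplexity=30` is the scikit-learn default and is only written out to show where tuning would happen. Keep in mind that scikit-learn's `TSNE` offers `fit_transform` but no `transform` for unseen points, so a classifier trained on a t-SNE embedding cannot be applied to truly new data.\n",
"\n",
"```python\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.manifold import TSNE\n",
"\n",
"# Hypothetical stand-in: three blobs in six dimensions.\n",
"X, y = make_blobs(n_samples=341, n_features=6, centers=3, random_state=42)\n",
"\n",
"# random_state makes the embedding repeatable between runs.\n",
"embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)\n",
"\n",
"# Colour each point by its class label.\n",
"plt.scatter(embedding[:, 0], embedding[:, 1], c=y)\n",
"plt.xlabel('t-SNE Component 1')\n",
"plt.ylabel('t-SNE Component 2')\n",
"plt.title('t-SNE coloured by class')\n",
"plt.show()\n",
"```"
]
},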
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the default values the result is strange, even with this method, the clusters pop out very well.\n",
"\n",
"## 12 Model selection and evaluation\n",
"\n",
"Based on the insights gained from EDA, choose the appropriate machine learning model(s) for your problem. Evaluate the performance of the model(s) using appropriate evaluation metrics and cross-validation techniques. Iterate on the model, features, or data preprocessing steps to improve performance, if necessary.\n",
"\n",
"For this step I will use the encoded and normalized dataset. I will use the KNN classifier. I will use the cross-validation to evaluate the model.\n",
"\n",
"At this point I use the KNN because I want a distance-based classifier, and I think it's the best choice for this dataset. I will use the cross-validation to evaluate the model. Moreover, This TP is not intended to train the perfect model, but to show the process of data analysis. However, I will with other models to see how they perform.\n",
"\n",
"I will try to predict the species."
]
},
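{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cross-validation gives a steadier comparison between models than a single train/test split. The snippet below is a hedged sketch, assuming synthetic stand-in data from `make_classification` instead of the encoded penguin frame; the model list, the 5 folds and `max_iter=1000` are illustrative choices, not settings taken from this notebook.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# Hypothetical stand-in data: 341 samples, 6 features, 3 classes.\n",
"X, y = make_classification(n_samples=341, n_features=6, n_informative=4,\n",
"                           n_redundant=0, n_classes=3, random_state=42)\n",
"\n",
"models = {\n",
"    'KNN': KNeighborsClassifier(n_neighbors=5),\n",
"    'Random forest': RandomForestClassifier(random_state=42),\n",
"    'Logistic regression': LogisticRegression(max_iter=1000),\n",
"}\n",
"\n",
"# Report mean and spread of the accuracy over 5 folds for each model.\n",
"for name, model in models.items():\n",
"    scores = cross_val_score(model, X, y, cv=5)\n",
"    print('{}: {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))\n",
"```"
]
},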
{
"cell_type": "code",
"execution_count": 168,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Individual ID</th>\n",
" <th>Clutch Completion</th>\n",
" <th>Date Egg</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Culmen Depth (mm)</th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Body Mass (g)</th>\n",
" <th>Sex</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" <th>studyName_PAL0708</th>\n",
" <th>studyName_PAL0809</th>\n",
" <th>studyName_PAL0910</th>\n",
" <th>Island_Biscoe</th>\n",
" <th>Island_Dream</th>\n",
" <th>Island_Torgersen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>22</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>0.254545</td>\n",
" <td>0.666667</td>\n",
" <td>0.152542</td>\n",
" <td>0.291667</td>\n",
" <td>2</td>\n",
" <td>0.460122</td>\n",
" <td>0.412350</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>0.269091</td>\n",
" <td>0.511905</td>\n",
" <td>0.237288</td>\n",
" <td>0.305556</td>\n",
" <td>1</td>\n",
" <td>0.550450</td>\n",
" <td>0.719311</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>44</td>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>0.298182</td>\n",
" <td>0.583333</td>\n",
" <td>0.389831</td>\n",
" <td>0.152778</td>\n",
" <td>1</td>\n",
" <td>0.307537</td>\n",
" <td>0.521692</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>66</td>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>0.167273</td>\n",
" <td>0.738095</td>\n",
" <td>0.355932</td>\n",
" <td>0.208333</td>\n",
" <td>1</td>\n",
" <td>0.473964</td>\n",
" <td>0.524404</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>67</td>\n",
" <td>1</td>\n",
" <td>6.0</td>\n",
" <td>0.261818</td>\n",
" <td>0.892857</td>\n",
" <td>0.305085</td>\n",
" <td>0.263889</td>\n",
" <td>2</td>\n",
" <td>0.431532</td>\n",
" <td>0.532516</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Individual ID Clutch Completion Date Egg Culmen Length (mm) \\\n",
"0 22 1 2.0 0.254545 \n",
"1 23 1 2.0 0.269091 \n",
"2 44 1 6.0 0.298182 \n",
"3 66 1 6.0 0.167273 \n",
"4 67 1 6.0 0.261818 \n",
"\n",
" Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex \\\n",
"0 0.666667 0.152542 0.291667 2 \n",
"1 0.511905 0.237288 0.305556 1 \n",
"2 0.583333 0.389831 0.152778 1 \n",
"3 0.738095 0.355932 0.208333 1 \n",
"4 0.892857 0.305085 0.263889 2 \n",
"\n",
" Delta 15 N (o/oo) Delta 13 C (o/oo) studyName_PAL0708 studyName_PAL0809 \\\n",
"0 0.460122 0.412350 1.0 0.0 \n",
"1 0.550450 0.719311 1.0 0.0 \n",
"2 0.307537 0.521692 1.0 0.0 \n",
"3 0.473964 0.524404 1.0 0.0 \n",
"4 0.431532 0.532516 1.0 0.0 \n",
"\n",
" studyName_PAL0910 Island_Biscoe Island_Dream Island_Torgersen \n",
"0 0.0 0.0 0.0 1.0 \n",
"1 0.0 0.0 0.0 1.0 \n",
"2 0.0 0.0 0.0 1.0 \n",
"3 0.0 0.0 0.0 1.0 \n",
"4 0.0 0.0 0.0 1.0 "
]
},
"execution_count": 168,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_knn = penguin_manager.get_df(as_copy=True)[\"Species\"]\n",
"X_knn = penguin_manager.get_df(as_copy=True).drop(columns=[\"Species\"])\n",
"\n",
"cols_to_label_encode_knn = ['Clutch Completion', 'Sex', 'Individual ID']\n",
"cols_to_ordinal_encode_knn = [\"Date Egg\"]\n",
"cols_to_onehot_encode_knn = ['studyName', 'Island']\n",
"\n",
"penguin_manager_knn = EncoderManager(X_knn, cols_to_label_encode_knn, cols_to_ordinal_encode_knn, cols_to_onehot_encode_knn)\n",
"\n",
"X_knn_encoded = penguin_manager_knn.encode(inplace=False)\n",
"X_knn_encoded.head()"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {},
"outputs": [],
"source": [
"# split the data\n",
"X_train_model, X_test_model, y_train_model, y_test_model = train_test_split(X_knn_encoded, y_knn, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Knn score: 0.9130434782608695\n",
"KMeans score: 0.09537487102572052\n",
"Gauss score: 0.8840579710144928\n",
"RFC score: 0.9710144927536232\n",
"Logistic Regression: 0.9710144927536232\n",
"C-Support Vector: 1.0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\stefa\\miniconda3\\envs\\MchineLearning\\lib\\site-packages\\sklearn\\linear_model\\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n"
]
}
],
"source": [
"from sklearn.metrics import rand_score, adjusted_rand_score\n",
"from sklearn import svm\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.cluster import KMeans\n",
"\n",
"knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_model, y_train_model)\n",
"y_pred_knn = knn.predict(X_test_model)\n",
"print(f\"Knn score: {accuracy_score(y_pred_knn, y_test_model)}\")\n",
"\n",
"warnings.filterwarnings(\"ignore\")\n",
"kmeans = KMeans(n_clusters=3, random_state=42, n_init=\"auto\").fit(X_train_model, y_train_model)\n",
"print(f\"KMeans score: {adjusted_rand_score(y_train_model, kmeans.labels_)}\")\n",
"warnings.filterwarnings(\"default\")\n",
"\n",
"gnb = GaussianNB().fit(X_train_model, y_train_model)\n",
"y_pred_gnb = gnb.predict(X_test_model)\n",
"print(f\"Gauss score: {accuracy_score(y_pred_gnb, y_test_model)}\")\n",
"\n",
"rfc = RandomForestClassifier(n_estimators=50, criterion=\"gini\").fit(X_train_model, y_train_model)\n",
"y_pred_rfc = rfc.predict(X_test_model)\n",
"print(f\"RFC score: {accuracy_score(y_pred_rfc, y_test_model)}\")\n",
"\n",
"reg = LogisticRegression(max_iter=50).fit(X_train_model, y_train_model)\n",
"y_pred_reg = reg.predict(X_test_model)\n",
"print(f\"Logistic Regression: {accuracy_score(y_pred_reg, y_test_model)}\")\n",
"\n",
"svc = svm.SVC(kernel=\"linear\", C=1.0).fit(X_train_model, y_train_model)\n",
"y_pred_svc = svc.predict(X_test_model)\n",
"print(f\"C-Support Vector: {accuracy_score(y_pred_svc, y_test_model)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I think that this is already a good results for this TP. I will stick with the KNN even if it's not the best accuracy for this dataset, but it has a higher improvement margin. I will try to improve the model by tuning the hyperparameters (which i won't do).\n",
"\n",
"I will use the decision tree to find the most informative features."
]
},
{
"cell_type": "code",
"execution_count": 171,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.03380558, 0. , 0. , 0.35479631, 0.01317101,\n",
" 0.5319982 , 0. , 0. , 0. , 0.0176711 ,\n",
" 0. , 0. , 0. , 0.0485578 , 0. ,\n",
" 0. ])"
]
},
"execution_count": 171,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree_clf = DecisionTreeClassifier()\n",
"\n",
"tree_clf.fit(X_knn_encoded, y_knn)\n",
"\n",
"feature_importances = tree_clf.feature_importances_\n",
"feature_importances"
]
},
{
"cell_type": "code",
"execution_count": 172,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. feature 5 (0.531998)\n",
"2. feature 3 (0.354796)\n",
"3. feature 13 (0.048558)\n",
"4. feature 0 (0.033806)\n",
"5. feature 9 (0.017671)\n",
"6. feature 4 (0.013171)\n",
"7. feature 15 (0.000000)\n",
"8. feature 14 (0.000000)\n",
"9. feature 12 (0.000000)\n",
"10. feature 11 (0.000000)\n",
"11. feature 10 (0.000000)\n",
"12. feature 8 (0.000000)\n",
"13. feature 7 (0.000000)\n",
"14. feature 6 (0.000000)\n",
"15. feature 2 (0.000000)\n",
"16. feature 1 (0.000000)\n"
]
}
],
"source": [
"indices = feature_importances.argsort()[::-1]\n",
"\n",
"for f in range(X_knn_encoded.shape[1]):\n",
" print(\"%d. feature %d (%f)\" % (f + 1, indices[f], feature_importances[indices[f]]))"
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0oAAAKXCAYAAAC8DbqoAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAACZn0lEQVR4nOzdd1iV9eP/8RfgAtwTB2Wa2cdy4CJX7lGamuJI0zTLgeVILU3TzJlmKq7cK1dSbtPMrSnOFK2cOVEcOAIH8/eHP87XcwBBIu77xufjurquzn0OnBcHPOd+3ff7/b6dYmJiYgQAAAAAsHE2OgAAAAAAmA1FCQAAAAAcUJQAAAAAwAFFCQAAAAAcUJQAAAAAwAFFCQAAAAAcUJQAAAAAwAFFCQAAAAAcUJQAAE+Na5U/HV4vALAeihIAWFj//v1VvHjxBP9btWpVij5feHi4Ro0apTVr1qTo931aP/30k4oXL65Lly4ZmiMppk2bptmzZxsdAwDwlNIZHQAA8O/kyZNHkydPjve+5557LkWf69q1a5o3b55GjRqVot83LZswYYI++ugjo2MAAJ4SRQkALC5DhgwqU6aM0TEAAEhTGHoHAM+IX3/9Vc2aNVPJkiVVpUoVDR8+XPfu3YvzmDZt2sjLy0uvvvqqGjRooO+//16SdOnSJdWuXVuSNGDAANWqVUvSo+F/sf8f69KlSypevLh++uknSVJAQICKFy+upUuXqmbNmqpcubJ27dolSTpw4IDeffddlS5dWhUrVtRnn32mkJCQp/rZYr//nj171K5dO5UqVUo1atTQ8uXLde3aNX300Ufy8vJS9erVNW/evDhft2vXLrVt21alSpVS3bp1bT9zrIcPH2rKlClq0KCBSpYsqXr16mnGjBmKjo62PaZdu3bq27evevToobJly6pz584qXry4JGny5Mm2/0/sdXb8ed5//32VLl1alStX1tdff63IyEjb4yIiIjRlyhTVqVNHpUqVUsOGDfXjjz/aZU/s9/7w4UMNHTpUr7/+ui3LnDlznur1B4C0iKIEAGlAZGRknP8eX0BgzZo16t69u4oUKaIpU6boo48+0urVq+Xr62t73LZt29S9e3e98sormjp1qiZNmqSCBQtq2LBhOnTokPLmzWsb4tetW7cEh/s9yfjx4/XZZ5/ps88+U5kyZbR//3516NBBmTJl0oQJE/T5559r3759at++vR48ePDU3/+TTz5RrVq19N1336lw4cIaMmSI2rdvr5deekl+fn565ZVXNGrUKB09etTu63r37q0SJUpoypQpqlKlioYNG6aFCxdKerQQQ9euXTVr1iz5+Pjou+++U4MGDTRhwgQNGTLE7vv8/PPPSp8+vaZMmaL27dtr2bJlkiQfHx/b/yf2Oj+ub9++KleunL777ju99dZbmjNnjvz9/W33f/bZZ5oxY4Z8fHw0ffp0Va9eXZ9//rlWrlwpKWm/9xEjRmj79u367LPPNHv2bNWuXVtff/21reQCwLOKoXcAYHGXL1/WK6+8Emd7z549bTvE33zzjapVq6ZvvvnGdn/hwoXVoUMHbd++XTVq1NDp06fVtGlTDRw40PYYLy8veXt7a//+/Spbtqz+97//SXo096lEiRJPnbV169Zq0KCB7fa4ceP0wgsvaPr06XJxcZEklS5d2nZmpG3btk/1/Zs3b66OHTtKktzc3NSqVSuVKlVKPXr0kCS9+uqr2rx5sw4dOqRSpUrZvq5OnTq2n7tatWq6du2apk2bprZt22rnzp367bffNHbsWDVu3FiSVKVKFWXKlEkTJ07Ue++9pxdffFGS5OzsrGHDhsnNzc0ul4eHh214ZFJe51gtWrRQ9+7dJUmVKlXSr7/+qm3btql169Y6deqU1q1bp4EDB6p9+/a2xwQFBSkgIEBNmjRJ0u993759qly5sho2bChJ8vb2lpubm3LkyPFUrz0ApDUUJQCwuDx58mjatGlxtufLl0
+SdPbsWV29elVdunSxG7ZVoUIFZc6cWbt371aNGjX0wQcfSJLu3bunCxcu6O+//1ZgYKCkR0O8UsLjw8/u37+vI0eOqFOnToqJibFl8/T0VNGiRbV79+6nLkpeXl62/8+dO7ekR8UrVuzO/z///GP3dU2aNLG7Xa9ePW3evFl///239u3bJxcXF7355pt2j2ncuLEmTpyogIAAW1EqVKhQnJLk6Gle58d/HulR4YodNnfgwAFJUt26de0eM2HCBEnSmTNnkvR79/b21tKlSxUcHKyaNWuqevXqtnIGAM8yihIAWFyGDBlUsmTJBO+/ffu2JGno0KEaOnRonPuvXbsmSQoJCdGQIUP066+/ysnJSc8//7zKlSsnKeWuA5QrVy7b/9+9e1fR0dGaOXOmZs6cGeexGTNmfOrvnzlz5jjbXF1dE/26vHnzxpvz7t27unPnjnLkyKF06ew/MvPkySPJvnTFlrMneZrXOVOmTHa3nZ2dbY+J/b0+/po+Lqm/94EDB8rDw0OrV6+2Pc7Ly0uDBw9O1llDAEgrKEoAkMZlzZpVkvTpp5+qYsWKce7Pli2bpEfzYc6cOaO5c+eqbNmyypAhg+7fv6/ly5c/8fs7OTkpKirKbpvjIhHxcXd3l5OTkzp06GAb9vW4pBSclBJbKmLdvHlT0qMSki1bNt26dUuRkZF2ZSm2aDztELXkvs6OYn+vISEh8vDwsG0/e/asQkJCbL/XxH7vGTJkULdu3dStWzcFBQVp69atmjp1qvr06aOff/75qTIBQFrCYg4AkMYVKVJEuXLl0qVLl1SyZEnbfx4eHho3bpz++OMPSdLBgwdVv359vfbaa8qQIYMkaceOHZJkW90tdh7R49zd3XXr1i09fPjQts1xUYL4ZM6cWSVKlNDZs2ftchUrVkyTJ09WQEDAv/7Zk2rLli12tzds2KCCBQvqueeeU8WKFRUVFaX169fbPWb16tWSZDsblBBnZ/uP2qS8zkkR+7y//vqr3fbx48dr2LBhSfq9P3jwQPXr17etclegQAG1bdtWDRs21NWrV5OcBQDSIs4oAUAa5+Liot69e2vw4MFycXFRzZo1dffuXU2dOlXBwcG2hSBKlSqlNWvW6JVXXpGHh4cOHz6s6dOny8nJSffv35ckZcmSRZK0Z88eFS1aVKVLl1bNmjW1cOFCff7552rRooVOnTqlOXPmxFuqHH3yySfq3Lmz+vTpo8aNGysqKkpz5szRkSNH1K1bt//uRXEwb948ZcqUSWXKlNEvv/yirVu3aty4cZKk119/Xd7e3hoyZIiuXbumEiVKaN++fZo5c6befvtt2/ykhGTNmlWHDx/W/v37Vb58+SS9zknx8ssvq0GDBvrmm2/04MEDvfLKK9q1a5c2bdqkCRMmJOn3nilTJr3yyiuaPHmy0qdPr+LFi+vvv//WihUrVL9+/X/1mgKA1VGUAOAZ0KJFC7m7u2vWrFlatmyZ3NzcVLZsWX3zzTfy9PSUJI0ePVrDhg3TsGHDJD1aHW3o0KFavXq1beGAzJkzq2PHjlq2bJm2bdum3bt3q0qVKvrss8+0cOFC/fLLL7Yd79atWyeaq2rVqpo9e7YmT56sHj16KH369HrllVc0d+7cVL2I7ueff64VK1Zo+vTpKlKkiPz8/GxFwcnJSdOnT5efn58WLFigkJAQFSpUSL1797atsPckXbt21dSpU/Xhhx9q/fr1SXqdk2rs2LGaPHmyFi5cqFu3bumFF17QhAkTbCsLJuX3/tVXX2nChAmaM2eOrl+/rly5csnHx0c9e/Z8qiwAkNY4xaTUDF0AACwmICBA7du314IFC+Tt7W10HACAiTBHCQAAAAAcUJQAAAAAwAFD7wAAAADAAWeUAAAAAMABRQkAAAAAHKT55cGjo6MVGRkpZ2dnOTk5GR0HAAAAgEFiYmIUHR2tdOnSxbkguKM0X5QiIyMVGBhodAwAAAAAJlGyZEllyJDhiY9J80UptimWLF
kySVeJt4KoqCgFBgZa6meyWmar5ZWsl9lqeSUypwar5ZWsl9lqeSUypwar5ZWsl9lqeSVrZk5M7M+U2Nkk6RkoSrH
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fi = pd.DataFrame({\n",
" 'feature': X_knn_encoded.columns,\n",
" 'importance': tree_clf.feature_importances_\n",
"})\n",
"\n",
"# ordina le feature importances in ordine decrescente\n",
"fi = fi.sort_values(by='importance', ascending=False)\n",
"\n",
"# crea un grafico delle feature importances\n",
"plt.figure(figsize=(10, 6))\n",
"plt.bar(fi['feature'], fi['importance'])\n",
"plt.xticks(rotation=90)\n",
"plt.xlabel('Feature')\n",
"plt.ylabel('Importance')\n",
"plt.title('Feature Importances')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"it seems that only the 6 features are important, i will try to retrain the model only with them."
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Flipper Length (mm)</th>\n",
" <th>Culmen Length (mm)</th>\n",
" <th>Island_Biscoe</th>\n",
" <th>Individual ID</th>\n",
" <th>Delta 13 C (o/oo)</th>\n",
" <th>Delta 15 N (o/oo)</th>\n",
" <th>Sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>322</th>\n",
" <td>0.949153</td>\n",
" <td>0.618182</td>\n",
" <td>1.0</td>\n",
" <td>31</td>\n",
" <td>0.379622</td>\n",
" <td>0.429100</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>116</th>\n",
" <td>0.457627</td>\n",
" <td>0.189091</td>\n",
" <td>0.0</td>\n",
" <td>113</td>\n",
" <td>0.201729</td>\n",
" <td>0.778965</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>113</th>\n",
" <td>0.322034</td>\n",
" <td>0.272727</td>\n",
" <td>1.0</td>\n",
" <td>108</td>\n",
" <td>0.070866</td>\n",
" <td>0.491998</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>0.406780</td>\n",
" <td>0.436364</td>\n",
" <td>0.0</td>\n",
" <td>39</td>\n",
" <td>0.771173</td>\n",
" <td>0.670639</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>126</th>\n",
" <td>0.389831</td>\n",
" <td>0.341818</td>\n",
" <td>0.0</td>\n",
" <td>131</td>\n",
" <td>0.307669</td>\n",
" <td>0.373327</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Flipper Length (mm) Culmen Length (mm) Island_Biscoe Individual ID \\\n",
"322 0.949153 0.618182 1.0 31 \n",
"116 0.457627 0.189091 0.0 113 \n",
"113 0.322034 0.272727 1.0 108 \n",
"42 0.406780 0.436364 0.0 39 \n",
"126 0.389831 0.341818 0.0 131 \n",
"\n",
" Delta 13 C (o/oo) Delta 15 N (o/oo) Sex \n",
"322 0.379622 0.429100 2 \n",
"116 0.201729 0.778965 2 \n",
"113 0.070866 0.491998 1 \n",
"42 0.771173 0.670639 2 \n",
"126 0.307669 0.373327 2 "
]
},
"execution_count": 174,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train_knn_sim = X_train_model[[\"Flipper Length (mm)\", \"Culmen Length (mm)\", \"Island_Biscoe\", \"Individual ID\", \"Delta 13 C (o/oo)\", \"Delta 15 N (o/oo)\", \"Sex\"]]\n",
"\n",
"X_test_knn_sim = X_test_model[[\"Flipper Length (mm)\", \"Culmen Length (mm)\", \"Island_Biscoe\", \"Individual ID\", \"Delta 13 C (o/oo)\", \"Delta 15 N (o/oo)\", \"Sex\"]]\n",
"\n",
"X_test_knn_sim.head()"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5797101449275363\n"
]
}
],
"source": [
"knn_sim = KNeighborsClassifier(n_neighbors=5)\n",
"\n",
"# fit the model\n",
"knn.fit(X_train_knn_sim, y_train_model)\n",
"\n",
"# predict\n",
"y_pred_knn = knn.predict(X_test_knn_sim)\n",
"\n",
"# evaluate\n",
"print(\"Accuracy:\", accuracy_score(y_test_model, y_pred_knn))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I had to try, but this want a good idea. I will use the original dataset.\n",
"\n",
"## 14 Report and communicate findings\n",
"\n",
"Summarize your findings from the EDA and document any insights, patterns, or relationships discovered during the process. Use clear visualizations and concise language to communicate your results to stakeholders.\n",
"\n",
"### Summary\n",
"\n",
"This dataset contains data about penguins. Some features are useless and can be ignored, but the other are useful to some information. FOr instance the species of the penguin. But other columns could be predicted. Some correlation is present with the sex of the penguin, as well as the others like the island, the flipper length and the culmen length. Also, the BMI seems to have a good predictability index.\n",
"\n",
"I manged to predict the species with a good accuracy, but I think that the model could be improved. I think that the dataset may be too small to train a good model, and the client should sample more data.\n",
"\n",
"It would be nice to apply a classificator (class = day) on the date without the year, on the date egg date, to predict futures egg dates in the year, but it would require a lot more samples since the final classes would be 365. An interesting information to add could be the date the egg opened, this would allow the prediction the egg birthdate.\n",
"\n",
"No outliers were not found, but some missing data were present.\n",
"\n",
"Overall the missing data wasn't too much, and it was possible to manage it. Only 3 rows were dropped.\n",
"\n",
"A problem could be the imbalance to some feature, for instance, it would not be possible to predict the clutch completion because the model would be too biased, trying balance it would not be possible otherwise it would result in too few data.\n",
"\n",
"Moreover, the client should sample more data about the Chinstrap penguin.\n",
"\n",
"It could be interesting to have the data about the behaviour to see how they could help to predict others features, or how the other features could predict the behaviour."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 1
}