# Exploratory Data Analysis

The goal of this notebook is to explore the data and build a better understanding of it, with the aim of creating the best possible penguin classifier.

## Steps

- 1 Understanding the dataset
- 2 Load the dataset
- 3 Inspect the dataset
- 4 Data type conversion
- 5 Identify missing values
- 6 Detect and handle outliers
- 7 Analyze relationships between variables
- 8 Explore categorical variables
- 9 Impute missing values
- 10 Normalize and scale data
- 11 Pandas Profiling
- 12 Clustering, dimension reduction with PCA
- 13 Model selection and evaluation
- 14 Report and communicate findings

## 1) Understanding the dataset

Start by familiarizing yourself with the dataset. Understand the context, the purpose of the data, and the problem we are trying to solve. Look for documentation or metadata that provides information about the dataset, its variables, and its source.

### Columns description

- studyName: The name of the study that collected the data. (Categorical)
- Sample Number: An identifier for each sample in the dataset. (Categorical)
- Species: The species of the penguin (e.g., Adelie Penguin). (Categorical)
- Region: The region where the penguin was observed. (Categorical)
- Island: The island where the penguin was observed. (Categorical)
- Stage: The stage of the penguin's life cycle at the time of observation (e.g., Adult, 1 Egg Stage). (Categorical)
- Individual ID: An identifier for each individual penguin. (Categorical)
- Clutch Completion: Whether the penguin has completed its clutch (a set of eggs laid by a bird). (Categorical - Binary)
- Date Egg: The date the egg was observed. (Datetime)
- Culmen Length (mm): The length of the penguin's culmen (the upper ridge of a bird's beak) in millimeters. (Numerical - Continuous)
- Culmen Depth (mm): The depth of the penguin's culmen in millimeters. (Numerical - Continuous)
- Flipper Length (mm): The length of the penguin's flipper (wing) in millimeters. (Numerical - Continuous)
- Body Mass (g): The body mass of the penguin in grams. (Numerical - Continuous)
- Sex: The sex of the penguin (Male, Female). (Categorical - Binary)
- Delta 15 N (o/oo): The ratio of stable isotopes of nitrogen (15N/14N) in the penguin's blood, indicating the penguin's trophic level. (Numerical - Continuous)
- Delta 13 C (o/oo): The ratio of stable isotopes of carbon (13C/12C) in the penguin's blood, indicating the penguin's foraging habitat. (Numerical - Continuous)
- Comments: Additional comments related to the penguin or the sample. (Text)

### First impressions

The dataset contains several variables, such as:

- Species: The species of the penguin (e.g., Adelie, Gentoo, Chinstrap).
- Island: The island where the penguin was observed.
- Culmen Length and Culmen Depth: Measurements of the penguin's beak, which can be useful for differentiating between species and understanding feeding habits.
- Flipper Length and Body Mass: Measurements of the penguin's body size, which can be indicators of overall health, age, and ability to swim or dive.
- Sex: The sex of the penguin.
- Delta 15 N and Delta 13 C: Isotopic ratios that can provide information about the penguin's diet, foraging ecology, and trophic position within the food web.

I can already see that some data are missing, and I will need to deal with them later. The "Species" column contains both the common name and the (scientific name); keeping both doesn't add any value to our analysis. The "Sample Number" is just an identifier for each sample, so it will not be useful for our analysis.
The "IndividualID" may be useful to identify the penguins, but I will need to check if it is unique. The "Stage" column contains only a value, so it will not be useful for our analysis. The "Sex" feature contains a binary value, some values are missing and some are wrong. The "Date Egg" column is a trying and needs to be converted to a datetime format. Some sample are missing many values, and we will need to decide if we want to keep them or not. Overall, the dataset is a study on penguins. It seems they were studying the morphological characteristics and the isotopic signatures of different species and geographic areas. Also, they keep track of they egg and the clutch completion. ## 2) Load the dataset Importing the data into the notebook using padnas. And check the correct interpretation. ```python # Import libraries import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import warnings from sklearn.decomposition import PCA from sklearn.manifold import TSNE from ydata_profiling import ProfileReport from sklearn.model_selection import train_test_split from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import accuracy_score from sklearn.tree import DecisionTreeClassifier ``` ```python # Load the dataset df = pd.read_csv('data/penguins.csv') df.head() ```
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
The dataset seems correctly loaded.

## 3) Inspect the dataset

Use functions like head(), tail(), info() or describe() to get an overview of the dataset, its shape, and basic summary statistics, and to identify any immediate issues or patterns in the data.

```python
df.shape
```

(344, 17)

```python
df.head()
```
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
```python
df.info()
```

RangeIndex: 344 entries, 0 to 343
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   studyName            344 non-null    object
 1   Sample Number        344 non-null    int64
 2   Species              344 non-null    object
 3   Region               344 non-null    object
 4   Island               344 non-null    object
 5   Stage                344 non-null    object
 6   Individual ID        344 non-null    object
 7   Clutch Completion    344 non-null    object
 8   Date Egg             344 non-null    object
 9   Culmen Length (mm)   342 non-null    float64
 10  Culmen Depth (mm)    342 non-null    float64
 11  Flipper Length (mm)  342 non-null    float64
 12  Body Mass (g)        342 non-null    float64
 13  Sex                  334 non-null    object
 14  Delta 15 N (o/oo)    330 non-null    float64
 15  Delta 13 C (o/oo)    331 non-null    float64
dtypes: float64(6), int64(1), object(10)
memory usage: 45.8+ KB

Note: Comments has 26 non-null object values (column 16).

```python
df.describe()
```
Sample Number Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Delta 15 N (o/oo) Delta 13 C (o/oo)
count 344.000000 342.000000 342.000000 342.000000 342.000000 330.000000 331.000000
mean 63.151163 43.921930 17.151170 200.915205 4201.754386 8.733382 -25.686292
std 40.430199 5.459584 1.974793 14.061714 801.954536 0.551770 0.793961
min 1.000000 32.100000 13.100000 172.000000 2700.000000 7.632200 -27.018540
25% 29.000000 39.225000 15.600000 190.000000 3550.000000 8.299890 -26.320305
50% 58.000000 44.450000 17.300000 197.000000 4050.000000 8.652405 -25.833520
75% 95.250000 48.500000 18.700000 213.000000 4750.000000 9.172123 -25.062050
max 152.000000 59.600000 21.500000 231.000000 6300.000000 10.025440 -23.787670
```python
# get the number of unique values for each column
df.nunique()
```

studyName 3
Sample Number 152
Species 3
Region 1
Island 3
Stage 1
Individual ID 190
Clutch Completion 2
Date Egg 50
Culmen Length (mm) 164
Culmen Depth (mm) 80
Flipper Length (mm) 55
Body Mass (g) 94
Sex 3
Delta 15 N (o/oo) 330
Delta 13 C (o/oo) 331
Comments 7
dtype: int64

```python
# remove the columns with only one unique value because they are useless for the analysis,
# but keep them in a separate dataframe
df_unique = df.loc[:, df.nunique() == 1]
df_unique.head()
```
Region Stage
0 Anvers Adult, 1 Egg Stage
1 Anvers Adult, 1 Egg Stage
2 Anvers Adult, 1 Egg Stage
3 Anvers Adult, 1 Egg Stage
4 Anvers Adult, 1 Egg Stage
```python
# drop the columns with only one unique value
df.drop(df_unique.columns, axis=1, inplace=True)
```

```python
# list all comments
df['Comments'].unique()
```

array(['Not enough blood for isotopes.', nan, 'Adult not sampled.',
       'Nest never observed with full clutch.', 'No blood sample obtained.',
       'No blood sample obtained for sexing.',
       'Nest never observed with full clutch. Not enough blood for isotopes.',
       'Sexing primers did not amplify. Not enough blood for isotopes.'],
      dtype=object)

```python
# show the rows that have a comment, together with the columns the comment refers to
df[df['Comments'].notnull()][['Sex', 'Clutch Completion', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments']]
```
Sex Clutch Completion Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 MALE Yes NaN NaN Not enough blood for isotopes.
3 NaN Yes NaN NaN Adult not sampled.
6 FEMALE No 9.18718 -25.21799 Nest never observed with full clutch.
7 MALE No 9.46060 -24.89958 Nest never observed with full clutch.
8 NaN Yes NaN NaN No blood sample obtained.
9 NaN Yes 9.13362 -25.09368 No blood sample obtained for sexing.
10 NaN Yes 8.63243 -25.21315 No blood sample obtained for sexing.
11 NaN Yes NaN NaN No blood sample obtained.
12 FEMALE Yes NaN NaN Not enough blood for isotopes.
13 MALE Yes NaN NaN Not enough blood for isotopes.
15 FEMALE Yes NaN NaN Not enough blood for isotopes.
28 FEMALE No 8.38404 -25.19837 Nest never observed with full clutch.
29 MALE No 8.90027 -25.11609 Nest never observed with full clutch.
38 FEMALE No 9.41131 -25.04169 Nest never observed with full clutch.
39 MALE No NaN NaN Nest never observed with full clutch. Not enou...
41 MALE Yes NaN NaN Not enough blood for isotopes.
46 MALE Yes NaN NaN Not enough blood for isotopes.
47 NaN Yes NaN NaN Sexing primers did not amplify. Not enough blo...
68 FEMALE No 8.47781 -26.07821 Nest never observed with full clutch.
69 MALE No 8.86853 -26.06209 Nest never observed with full clutch.
120 FEMALE No 9.04296 -26.19444 Nest never observed with full clutch.
121 MALE No 9.11066 -26.42563 Nest never observed with full clutch.
130 FEMALE No 8.98460 -25.57956 Nest never observed with full clutch.
131 MALE No 8.86495 -26.13960 Nest never observed with full clutch.
138 FEMALE No 8.61651 -26.07021 Nest never observed with full clutch.
139 MALE No 9.25769 -25.88798 Nest never observed with full clutch.
```python
# drop the now-redundant Comments column
df.drop('Comments', axis=1, inplace=True)
df.head()
```
studyName Sample Number Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426
```python
# count how many times each "Sample Number" value repeats
df['Sample Number'].value_counts()
```

1      3
45     3
52     3
51     3
50     3
      ..
129    1
128    1
127    1
126    1
152    1
Name: Sample Number, Length: 152, dtype: int64

```python
# drop the Sample Number column
df.drop('Sample Number', axis=1, inplace=True)
df.head()
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN
1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454
2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302
3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN
4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426
The dataset consists of 17 features and 344 samples, of which 7 features are purely numerical. Many values are missing and appear as NaN. There is 1 date feature, and some columns are clearly useless. The numerical features have different scales, so it could be useful to apply some scaling later; the values themselves are not very large, so manipulating them should not be a problem.

The columns "Region" and "Stage" have been dropped because they contain only one value.

The "Comments" column contains 26 non-null values drawn from only 7 distinct comments. These basically say either that a sample couldn't be taken because there wasn't enough blood, or that the "Nest never observed with full clutch". In the first case the blood information is already missing, and in the second the "Clutch Completion" column is already "No", so the comments are redundant and can be dropped since they add no information. The comment "No blood sample obtained for sexing" indicates that the researchers used the blood sample to determine the sex, so if they failed to amplify the DNA or didn't take enough blood, the sex of the penguin will be empty.

The "Sample Number" column contains 152 unique values, while "Individual ID" contains 190. This means some samples come from the same penguin; this may not be a problem, but I need to keep it in mind. The total number of samples is 344 against only 152 unique sample numbers: some penguins have been sampled more than once under the same "Individual ID", and the numbering restarts for each study, so different individuals can share the same "Sample Number". This means I can drop the sample number, because it is an arbitrary value that adds no information other than a chronological order, which is already captured by the date feature; I will keep the Individual ID. The "studyName" column could probably be dropped as well, since it seems less important at this point, but I will keep it for now.

## 4) Data type conversion

Ensure that each variable is of the appropriate data type (e.g., numeric, categorical, datetime). Convert any variables with incorrect data types, for instance with LabelEncoder or OrdinalEncoder.

In step 3 I identified several columns that need to be converted to the correct data type. For this step I looked at the values of the non-numerical columns: if they are non-ordinal and only 2 values are possible (e.g., sex) I will use the LabelEncoder; if more than 2 values are possible I will use the OneHotEncoder (e.g., species); if the values are ordinal I will use the OrdinalEncoder. Since the "Individual ID" is non-ordinal but contains a lot of unique values, I will encode it with the label encoder, because the one-hot encoder would generate too many columns. I could then reduce the number of columns with a PCA, but that may be too complicated for a TP. I will also convert the "Date Egg" column to a datetime format.

To simplify the encoding and decoding I wrote a convenience class that does all the work for me.

```python
# Import the custom EncoderManager class
%load_ext autoreload
%autoreload 2
from lib.EncoderManager import EncoderManager
```

The autoreload extension is already loaded.
To reload it, use: %reload_ext autoreload

```python
# list of non-numerical columns and the count of their unique values
df.select_dtypes(exclude='number').nunique()
```

studyName 3
Species 3
Island 3
Individual ID 190
Clutch Completion 2
Date Egg 50
Sex 3
dtype: int64

```python
# Convert to date with pandas.to_datetime
df['Date Egg'] = pd.to_datetime(df['Date Egg'])

# Define the columns to encode with LabelEncoder
columns_to_label_encode = ['Clutch Completion', 'Sex', 'Individual ID']

# Define the columns to encode with OrdinalEncoder
# since the datetime type causes many problems, I'll convert it to ordinal;
# however, this column is not very interesting for the prediction
columns_to_ordinal_encode = ["Date Egg"]

# Define the columns to encode with OneHotEncoder
columns_to_onehot_encode = ['studyName', 'Species', 'Island']
```

```python
penguin_manager = EncoderManager(df, columns_to_label_encode, columns_to_ordinal_encode, columns_to_onehot_encode)
penguin_manager.encode(inplace=True).get_df()
```
Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) studyName_PAL0708 studyName_PAL0809 studyName_PAL0910 Species_Adelie Penguin (Pygoscelis adeliae) Species_Chinstrap penguin (Pygoscelis antarctica) Species_Gentoo penguin (Pygoscelis papua) Island_Biscoe Island_Dream Island_Torgersen
0 22 1 2.0 39.1 18.7 181.0 3750.0 2 NaN NaN 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1 23 1 2.0 39.5 17.4 186.0 3800.0 1 8.94956 -24.69454 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 44 1 6.0 40.3 18.0 195.0 3250.0 1 8.36821 -25.33302 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
3 45 1 6.0 NaN NaN NaN NaN 3 NaN NaN 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
4 66 1 6.0 36.7 19.3 193.0 3450.0 1 8.76651 -25.32426 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 63 0 49.0 NaN NaN NaN NaN 3 NaN NaN 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
340 64 1 45.0 46.8 14.3 215.0 4850.0 1 8.41151 -26.13832 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
341 65 1 45.0 50.4 15.7 222.0 5750.0 2 8.30166 -26.04117 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
342 74 1 45.0 45.2 14.8 212.0 5200.0 1 8.24246 -26.11969 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
343 75 1 45.0 49.9 16.1 213.0 5400.0 2 8.36390 -26.15531 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0

344 rows × 19 columns

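Since `EncoderManager` is a custom helper whose internals aren't shown in this notebook, here is a rough sklearn equivalent of what it presumably does (the function name `encode_like_manager` is mine, for illustration only):

```python
# Sketch of the encoding logic, assuming EncoderManager wraps standard sklearn encoders
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import pandas as pd

def encode_like_manager(df, label_cols, ordinal_cols, onehot_cols):
    df = df.copy()
    for col in label_cols:
        # LabelEncoder maps each distinct value (including NaN, via str) to an integer
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    # OrdinalEncoder preserves the sort order of the values (here: the dates)
    df[ordinal_cols] = OrdinalEncoder().fit_transform(df[ordinal_cols])
    # get_dummies produces one 0/1 column per category
    return pd.get_dummies(df, columns=onehot_cols)
```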
```python
penguin_manager = EncoderManager(df, columns_to_label_encode, columns_to_ordinal_encode, columns_to_onehot_encode)
df = penguin_manager.get_df()
df["Species"].unique()
```

array(['Adelie Penguin (Pygoscelis adeliae)',
       'Chinstrap penguin (Pygoscelis antarctica)',
       'Gentoo penguin (Pygoscelis papua)'], dtype=object)

```python
df = penguin_manager.get_df()
df["Island"].unique()
```

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

```python
df = penguin_manager.get_df()
df["studyName"].unique()
```

array(['PAL0708', 'PAL0809', 'PAL0910'], dtype=object)

So now I have a correctly encoded dataset, and I can easily decode it back to the original format with the penguin_manager.decode() method.

## 5) Identify missing values

Determine the extent of missing values in the dataset by calculating the percentage of missing values for each variable. Decide whether to impute, drop, or interpolate missing values based on the context and the importance of the variable.

For this step I will use the missingno library, which specializes in visualizing the *missing values* of a dataset in a very intuitive way.

```python
# if uncommented, install missingno if not already installed
# !pip install missingno
# note: it requires matplotlib 3.5.0 to work
import missingno as msno
```

```python
# visualize a matrix of the missing values
warnings.filterwarnings('ignore')
msno.matrix(penguin_manager.get_df())
warnings.filterwarnings('default')
```

![png](output_28_0.png)

The matrix plot shows which columns have missing values (x axis) as a function of the sample index (y axis). The sparkline on the right shows the number of non-missing values in each row. With it we can see whether the missing values are concentrated in a specific zone, whether they occur in several columns together, or whether they are missing at random.

In this dataset we see that the 2 blood measures are missing together, which is logical considering that those measurements can't be taken when not enough blood could be drawn from the penguin. The measures of the anatomical parts also tend to be missing together. We also see some missing values in the "Sex" column; later in this step I want to see whether this value can be inferred from another sample of the same penguin ("Individual ID"). Or maybe, since sex differences are usually quite evident, I could try to predict it with a machine learning model. Finally, among the first and last samples some rows are missing many columns; in those cases it may be simpler to drop the rows, because they are missing too many values.

```python
# get an idea of what to expect in terms of missing values
penguin_manager.get_df().isna().sum()
```

studyName 0
Species 0
Island 0
Individual ID 0
Clutch Completion 0
Date Egg 0
Culmen Length (mm) 2
Culmen Depth (mm) 2
Flipper Length (mm) 2
Body Mass (g) 2
Sex 10
Delta 15 N (o/oo) 14
Delta 13 C (o/oo) 13
dtype: int64

```python
# visualize a bar chart of the missing values
msno.bar(penguin_manager.get_df())
```

![png](output_31_1.png)

The bar chart shows which values are missing and at which percentage; overall the dataset is quite complete and quite balanced, with the missing data concentrated in the columns we already saw in the matrix plot.
```python
# heatmap of the missing values
warnings.filterwarnings('ignore')
msno.heatmap(penguin_manager.get_df())
warnings.filterwarnings('default')
```

![png](output_34_0.png)

The nullity correlation heatmap measures how strongly the presence or absence of one variable affects the presence of another. As noticed before, the blood measures are missing together, and the anatomical measures are also missing together; this is why we see a strong correlation between those columns. Between the other columns there is still a correlation of 0.4, meaning their missing values tend to co-occur, which is something to be aware of.

```python
# dendrogram of the missing values
msno.dendrogram(penguin_manager.get_df())
```

![png](output_36_1.png)

The last plot is the dendrogram, which groups together the columns that have missing values in the same samples. The blood measures are coupled first, as expected; the next grouping involves Species and Flipper Length (and in general all the anatomical measures), which may be worth looking into further.

At this point the first thing to do is to drop the rows that have too many missing values, and then decide what to do with the remaining ones. I will drop these rows because they carry too little information to infer the missing values.

```python
# return only the rows with more than 5 missing values
penguin_manager.get_df()[penguin_manager.get_df().isna().sum(axis=1) > 5]
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A2 Yes 2007-11-16 NaN NaN NaN NaN NaN NaN NaN
339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A2 No 2009-12-01 NaN NaN NaN NaN NaN NaN NaN
```python
# drop the rows with more than 5 missing values
# Note: dropna(thresh=...) would be faster, but its threshold counts the
# minimum number of NON-missing values a row must have to be kept, not the
# number of missing ones, which is easy to get wrong; this is an explicit alternative
missing_vals = penguin_manager.get_df().isna().sum(axis=1)
drop_rows = missing_vals[missing_vals > 5].index
penguin_manager.get_df().drop(drop_rows, inplace=True)
penguin_manager.get_df().reset_index(drop=True, inplace=True)
penguin_manager.get_df()[penguin_manager.get_df().isna().sum(axis=1) > 5]
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
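For reference, since `dropna(thresh=...)` counts the minimum number of *non-missing* values a row must keep, "more than 5 missing" translates to the following built-in one-liner (a sketch, not what was run above):

```python
# thresh = number of columns minus the tolerated missing count
n_cols = penguin_manager.get_df().shape[1]
penguin_manager.get_df().dropna(thresh=n_cols - 5)
```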
```python
# check the shape of the dataset
penguin_manager.get_df().shape
```

(342, 13)

This dropped 2 rows, which is not a big deal, and now we have a dataset with 342 rows and far fewer missing values.

```python
# check the missing values
penguin_manager.get_df().isna().sum()
```

studyName 0
Species 0
Island 0
Individual ID 0
Clutch Completion 0
Date Egg 0
Culmen Length (mm) 0
Culmen Depth (mm) 0
Flipper Length (mm) 0
Body Mass (g) 0
Sex 8
Delta 15 N (o/oo) 12
Delta 13 C (o/oo) 11
dtype: int64

There is still a problem with the columns "Sex", "Delta 15 N (o/oo)" and "Delta 13 C (o/oo)". First I want to see whether the sex of some penguins is recorded in another sample.

```python
# get rows with empty values and extract their "Individual ID"
empty_ind_id = penguin_manager.get_df()[penguin_manager.get_df().isna().any(axis=1)]["Individual ID"].unique()
# then get the rows with the same "Individual ID"
row_ind_id = penguin_manager.get_df()[penguin_manager.get_df()["Individual ID"].isin(empty_ind_id)]
# now group by individual id and see if there are samples with NaNs and without NaNs
row_ind_id_groups = row_ind_id.groupby("Individual ID").apply(lambda x: x.isna().any(axis=1).value_counts())
# True means that there is at least one NaN in the sample
row_ind_id_groups
```

Individual ID
N1A1   True     1
       False    1
N24A1  False    2
       True     1
N25A2  True     1
       False    1
N26A2  True     1
N29A1  False    2
       True     1
N29A2  False    2
       True     1
N46A1  False    1
       True     1
N50A1  False    1
       True     1
N51A1  False    1
       True     1
N5A1   True     1
       False    1
N5A2   True     1
       False    1
N6A1   False    2
       True     1
N6A2   False    2
       True     1
N7A1   True     1
       False    1
N7A2   True     1
       False    1
N8A2   False    2
       True     1
N96A1  True     1
dtype: int64

```python
# Now group by individual id and see if there are samples with NaNs and without NaNs in the Sex column
row_ind_id_sex = row_ind_id.groupby("Individual ID").apply(lambda x: x["Sex"].isna().value_counts())
# True means that there is at least one NaN in the Sex column
row_ind_id_sex
```

Individual ID
N1A1   False    2
N24A1  False    2
       True     1
N25A2  False    2
N26A2  False    1
N29A1  False    3
N29A2  False    2
       True     1
N46A1  False    1
       True     1
N50A1  False    2
N51A1  False    1
       True     1
N5A1   True     1
       False    1
N5A2   True     1
       False    1
N6A1   False    2
       True     1
N6A2   False    2
       True     1
N7A1   False    2
N7A2   False    2
N8A2   False    3
N96A1  False    1
Name: Sex, dtype: int64

It is interesting to see that some individuals have NaNs in some samples but not in others. This means we may already know the true value for these individuals. In the case of "Sex" I can address that directly.

```python
# get the samples with NaNs in the column "Sex"
samples_with_nans = penguin_manager.get_df()[penguin_manager.get_df().isna().any(axis=1)]
samples_with_nans = samples_with_nans[samples_with_nans["Sex"].isna()]

# now, for each individual id that has rows with NaNs, get its rows
for individual_id in empty_ind_id:
    # get rows with the same individual id
    rows = penguin_manager.get_df()[penguin_manager.get_df()["Individual ID"] == individual_id]
    # check if the Sex column has exactly one non-null value
    sex_values = rows["Sex"].dropna().unique()
    if len(sex_values) == 1:
        sex = sex_values[0]
        # assign sex to all rows with the same individual id
        penguin_manager.get_df().loc[penguin_manager.get_df()["Individual ID"] == individual_id, "Sex"] = sex

# check the empty rows
count_nans = penguin_manager.get_df()["Sex"].isna().sum()
count_nans
```

1

The imputation filled almost every row: a single NaN remains in the "Sex" column, presumably because that individual has conflicting (or no) non-null Sex values across its samples, so the `len(sex_values) == 1` guard skipped it.
Let's see the rows that still have NaNs:

```python
# print the rows that still contain empty values
penguin_manager.get_df().loc[penguin_manager.get_df().isna().any(axis=1)][["Individual ID", "Sex"]]
```
Individual ID Sex
0 N1A1 MALE
7 N5A1 FEMALE
10 N6A2 MALE
11 N7A1 FEMALE
12 N7A2 MALE
14 N8A2 FEMALE
38 N25A2 MALE
40 N26A2 MALE
45 N29A1 MALE
46 N29A2 MALE
211 N96A1 MALE
249 N50A1 MALE
323 N24A1 NaN
Apparently the Individual ID is not the ID of a single penguin, OR the samples have wrong values, OR penguins have the ability to change sex...

*Note: after some research it turned out that the Adélie penguin (Pygoscelis adeliae), the Chinstrap penguin (Pygoscelis antarctica) and the Gentoo penguin (Pygoscelis papua) actually have the ability to change sex...* This is important to keep in mind, and it explains why there are individuals with different sexes in different samples. It also means it would be useless to infer the sex from the anatomical measures, because the two cannot be reliably related, at least not in these 3 species. In our dataset, then, the only way to know the sex is through the blood sample (DNA sexing), which is missing for these rows.

## 6) Detect and handle outliers

Use techniques like box plots, Z-scores, or IQR to identify potential outliers in the data. Depending on the context, decide whether to remove, transform, or keep outliers.

Using describe() earlier, I got the impression there wasn't anything strange in the dataset, but I want to be sure, so I will look further into it.

```python
sns.boxplot(data=penguin_manager.get_df().select_dtypes(include=['float64', 'int64']))
```

![png](output_52_1.png)

The body mass values are on a different scale than the other values.

```python
numerical_cols = penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])

# Create a boxplot for each numerical column
for col in numerical_cols:
    sns.boxplot(x=penguin_manager.get_df()[col])
    plt.title(col)
    plt.show()
```

![png](output_54_0.png)

![png](output_54_1.png)

![png](output_54_2.png)

![png](output_54_3.png)

![png](output_54_4.png)

![png](output_54_5.png)

No evident outliers appear in the box plots. I will double-check using the standard deviation method.

```python
numerical_cols = penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])
means = numerical_cols.mean()
stds = numerical_cols.std()

# set the threshold to 3 standard deviations
threshold = 3

outliers = pd.DataFrame()
for col in numerical_cols.columns:
    col_outliers = penguin_manager.get_df()[(penguin_manager.get_df()[col] < means[col] - threshold * stds[col]) |
                                            (penguin_manager.get_df()[col] > means[col] + threshold * stds[col])]
    col_outliers['feature'] = col
    outliers = pd.concat([outliers, col_outliers])

outliers.head()
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) feature
Even this method finds no outliers beyond 3 standard deviations. Now computing the z-score (how many standard deviations each value lies from the mean):

```python
warnings.filterwarnings('ignore')
!pip install scipy
from scipy.stats import zscore
warnings.filterwarnings('default')
```

Requirement already satisfied: scipy in c:\users\stefa\miniconda3\envs\mchinelearning\lib\site-packages (1.9.3)
Requirement already satisfied: numpy<1.26.0,>=1.18.5 in c:\users\stefa\miniconda3\envs\mchinelearning\lib\site-packages (from scipy) (1.23.5)

```python
z_scores = np.abs(zscore(penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])))
# print the rows with z-score > 3 or < -3
penguin_manager.get_df()[(z_scores > 3).any(axis=1)]
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
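The section intro also mentions the IQR technique; for completeness, a quick IQR-based check could look like this (a sketch; it flags values outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR], the same rule the box plot whiskers use):

```python
# IQR-based outlier check on the numerical columns
num = penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
mask = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).any(axis=1)
penguin_manager.get_df()[mask]
```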
Overall, no outliers have been found with any of these methods. At this point I can say that there are no important outliers in the dataset, so I can move on.

## 7) Analyze relationships between variables

Use scatter plots, correlation matrices, and cross-tabulations to explore relationships between pairs of variables. This helps identify potential multicollinearity, interactions, or confounding variables that may need to be addressed in the analysis.

I will create some scatter plots and a bubble plot (no need for both; the bubble plot is for future reference, where I could add a third numerical dimension) to see if there are correlations between the features:

```python
numerical_cols = penguin_manager.decode().select_dtypes(include=['float64', 'int64'])
columns = numerical_cols.columns

nrows = len(columns) - 1
ncols = 2
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 5 * nrows))

for i, col1 in enumerate(columns[:-1]):
    for j, col2 in enumerate(columns[i + 1:]):
        if j * 2 < ncols:
            axs[i, j * 2].scatter(penguin_manager.get_df()[col1], penguin_manager.get_df()[col2])
            axs[i, j * 2].set_xlabel(col1)
            axs[i, j * 2].set_ylabel(col2)
        if j * 2 + 1 < ncols:
            axs[i, j * 2 + 1].scatter(penguin_manager.get_df()[col1], penguin_manager.get_df()[col2],
                                      s=penguin_manager.get_df()['Body Mass (g)'] / 20)
            axs[i, j * 2 + 1].set_xlabel(col1)
            axs[i, j * 2 + 1].set_ylabel(col2)
            axs[i, j * 2 + 1].set_title('Bubble plot - Body Mass')

plt.tight_layout()
plt.show()
```

![png](output_62_0.png)

Some pairs show a good correlation, and some points appear grouped in clusters; the grouping may come from "Sex", "Species" or "Island", which I will check. I will do the same with sns.pairplot, this time with all features, adding sex, island and species as hue.

```python
# Plot the pairwise relationships between the columns
sns.pairplot(penguin_manager.get_df(), vars=numerical_cols.columns[:-1], hue="Sex")
```

![png](output_64_1.png)

No important correlation with sex is visible in the scatter plots, which is expected since I now know these penguins can change sex.

```python
sns.pairplot(penguin_manager.get_df(), vars=numerical_cols.columns[:-1], hue="Island")
```

![png](output_66_1.png)

The grouping by island is a little more evident, but not very strong.

```python
sns.pairplot(penguin_manager.get_df(), vars=numerical_cols.columns[:-1], hue="Species")
```

![png](output_68_1.png)

The separation by species is clearly the most informative, since the clusters and the per-class distributions are more distinct. Graphically we can see a good correlation between many pairs; for instance, body mass and flipper length are strongly correlated. Comparing against the 3 nominal categories, clusters form, and some of them also show a linear correlation. This separation also reveals a pattern in the feature distributions that does not appear when separating by sex; the species separation is the most evident. Overall, species and island seem more informative than sex. Again, these penguins can change sex, and it seems the only way to determine the sex is through the blood sample. Now I will check this with the linear correlation matrix.
```python
# Get the numerical columns
numerical_cols = penguin_manager.get_df().select_dtypes(include=['float64', 'int64'])
columns = numerical_cols.columns

# Create a new dataframe without NaN values
noNan = penguin_manager.get_df().dropna()
corr_matrix = noNan.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(10, 10))
warnings.filterwarnings('ignore')
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", ax=ax)
warnings.filterwarnings('default')
ax.set_title("Linear Correlation Matrix")
plt.show()
```

![png](output_71_0.png)

This also shows the correlation between body mass and flipper length, as well as some other important pairwise correlations. Now I will check the correlation with the Spearman method, which captures non-linear (monotonic) relationships. There is also the Kendall method, but I will use only Spearman.

```python
# Create a new dataframe without NaN values
noNan = penguin_manager.get_df().dropna()
# Create the correlation matrix with the spearman method
spearmanCorrMatrix = noNan.corr(method='spearman', numeric_only=True)
sns.heatmap(spearmanCorrMatrix, annot=True)
```

![png](output_73_1.png)

This method measures the monotonic relationship between the features, so linear correlations are also captured. The results are similar to the linear correlation matrix. However, to see sharper correlations I should repeat the check after grouping by the nominal features, which I won't do here (a sketch follows the table below).

```python
penguin_manager.get_df()
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 Yes 2007-11-11 39.1 18.7 181.0 3750.0 MALE NaN NaN
1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 Yes 2007-11-11 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454
2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 Yes 2007-11-16 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302
3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 Yes 2007-11-16 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426
4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A2 Yes 2007-11-16 39.3 20.6 190.0 3650.0 MALE 8.66496 -25.29805
... ... ... ... ... ... ... ... ... ... ... ... ... ...
337 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A1 No 2009-12-01 47.2 13.7 214.0 4925.0 FEMALE 7.99184 -26.20538
338 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A1 Yes 2009-11-22 46.8 14.3 215.0 4850.0 FEMALE 8.41151 -26.13832
339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A2 Yes 2009-11-22 50.4 15.7 222.0 5750.0 MALE 8.30166 -26.04117
340 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A1 Yes 2009-11-22 45.2 14.8 212.0 5200.0 FEMALE 8.24246 -26.11969
341 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A2 Yes 2009-11-22 49.9 16.1 213.0 5400.0 MALE 8.36390 -26.15531

342 rows × 13 columns

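As a pointer for the grouped check mentioned above, computing the Spearman correlation per species would look roughly like this (a sketch, not run in this notebook):

```python
# Per-species Spearman correlation of the numerical features
for species, grp in penguin_manager.get_df().groupby('Species'):
    print(species)
    print(grp.corr(method='spearman', numeric_only=True).round(2))
```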
## 8) Explore categorical variables

Investigate the distribution of categorical variables using frequency tables, bar plots, or pie charts. Look for patterns, imbalances, or rare categories that may require special handling.

```python
# get the categorical columns
categorical_cols = penguin_manager.get_df().select_dtypes(include=['object'])
df_categorical = penguin_manager.get_df()[categorical_cols.columns]
df_categorical.describe()
```
studyName Species Island Individual ID Clutch Completion Sex
count 342 342 342 342 342 341
unique 3 3 3 190 2 3
top PAL0910 Adelie Penguin (Pygoscelis adeliae) Biscoe N69A1 Yes MALE
freq 119 151 167 3 307 171
### Species

```python
# show the species distribution
sns.countplot(x='Species', data=df_categorical)
plt.title("Species distribution")
```

Text(0.5, 1.0, 'Species distribution')

![png](output_80_1.png)

```python
# show on which island each species lives, on a count plot
sns.set_style("whitegrid")
sns.countplot(x="Island", hue="Species", data=df_categorical)
plt.title("Observations per Island and Species")
plt.xlabel("Island")
plt.ylabel("Observations")
plt.show()
```

![png](output_81_0.png)

```python
# show on which island each species lives, on a pie plot
df_categorical.groupby(['Island', 'Species']).size().unstack().plot(kind='pie', subplots=True, figsize=(20, 10), autopct='%1.1f%%')
```

![png](output_82_1.png)

```python
# show the distribution of species per study
df_categorical.groupby(['studyName', 'Species']).size().unstack().plot(kind='pie', subplots=True, figsize=(20, 10), autopct='%1.1f%%')
```

![png](output_83_1.png)

```python
# show the relationship between "Body Mass (g)" and "Species" in a swarm plot
warnings.filterwarnings('ignore')
sns.swarmplot(x="Species", y="Body Mass (g)", data=penguin_manager.get_df(), hue="Sex")
warnings.filterwarnings('default')
```

![png](output_84_0.png)

```python
# show the relationship between "Culmen Depth (mm)" and "Species" in a swarm plot
warnings.filterwarnings('ignore')
sns.swarmplot(x="Species", y="Culmen Depth (mm)", data=penguin_manager.get_df(), hue="Sex")
warnings.filterwarnings('default')
```

![png](output_85_0.png)

```python
# show the relationship between "Delta 15 N (o/oo)" and "Species" in a swarm plot
warnings.filterwarnings('ignore')
sns.swarmplot(x="Species", y="Delta 15 N (o/oo)", data=penguin_manager.get_df(), hue="Sex")
warnings.filterwarnings('default')
```

![png](output_86_0.png)

We can see that the species are imbalanced: there are more than twice as many Adelie as Chinstrap. Moreover, the Gentoo lives only on Biscoe island, the Chinstrap only on Dream, and the Adelie lives everywhere; so if we wanted to predict the island, the species feature would be very important for the first two species and almost useless for the latter. The studies seem to have a reasonable balance of species.

For comparison with the numerical features I chose body mass, culmen depth and Delta 15 N. The Gentoo sits in ranges that differ from the other two species. Also, using "Sex" as hue, this feature appears correlated with body mass (higher for males) and culmen depth (higher for males). So even if these penguins can change sex, it may be a rare event that also influences body mass and culmen depth; therefore the feature could still be useful for the model.

### Island

```python
sns.countplot(x='Island', data=df_categorical)
```

![png](output_88_1.png)

```python
sns.countplot(x='Island', hue='studyName', data=df_categorical)
```

![png](output_89_1.png)

We can also see a strong imbalance between the islands: Biscoe, the only island where the Gentoo lives, is the most represented, while Torgersen is underrepresented. The studies are slightly imbalanced when grouped by island.
### Clutch Completion

```python
sns.countplot(x='Clutch Completion', data=df_categorical)
```

![png](output_91_1.png)

### Sex

```python
sns.countplot(x='Sex', data=penguin_manager.get_df())
```

![png](output_93_1.png)

Overall, the dataset is imbalanced in the most important categorical features, which is something to keep in mind while training the model. The sex is balanced, but it is a less important category, and the other features are not balanced. In particular, I would not use the "Clutch Completion" feature, because it is strongly imbalanced; the Torgersen island is also underrepresented.

## 9) Impute missing values

At this point I know there is a correlation between species, anatomical values and sex / isotopes, so I will impute the missing values. This is not meant to be the perfect imputation, but just a way to obtain a complete dataset to work with, and a future reference. There are various approaches:

- Impute with the mean
- Impute with the median
- Impute with the mode
- Impute with a constant value
- Impute using forward filling or back-filling
- Impute with a value estimated by another predictive model (a sketch follows the table below)

I will use the average for the 2 deltas.

```python
# fill deltas with the average
df_filling = penguin_manager.get_df(as_copy=True)

delta_15n_mean = df_filling['Delta 15 N (o/oo)'].mean()
delta_13c_mean = df_filling['Delta 13 C (o/oo)'].mean()

# Fill missing values with the mean
df_filling['Delta 15 N (o/oo)'].fillna(value=delta_15n_mean, inplace=True)
df_filling['Delta 13 C (o/oo)'].fillna(value=delta_13c_mean, inplace=True)

# Print the modified dataframe
df_filling
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 Yes 2007-11-11 39.1 18.7 181.0 3750.0 MALE 8.733382 -25.686292
1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 Yes 2007-11-11 39.5 17.4 186.0 3800.0 FEMALE 8.949560 -24.694540
2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 Yes 2007-11-16 40.3 18.0 195.0 3250.0 FEMALE 8.368210 -25.333020
3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 Yes 2007-11-16 36.7 19.3 193.0 3450.0 FEMALE 8.766510 -25.324260
4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A2 Yes 2007-11-16 39.3 20.6 190.0 3650.0 MALE 8.664960 -25.298050
... ... ... ... ... ... ... ... ... ... ... ... ... ...
337 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A1 No 2009-12-01 47.2 13.7 214.0 4925.0 FEMALE 7.991840 -26.205380
338 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A1 Yes 2009-11-22 46.8 14.3 215.0 4850.0 FEMALE 8.411510 -26.138320
339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A2 Yes 2009-11-22 50.4 15.7 222.0 5750.0 MALE 8.301660 -26.041170
340 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A1 Yes 2009-11-22 45.2 14.8 212.0 5200.0 FEMALE 8.242460 -26.119690
341 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A2 Yes 2009-11-22 49.9 16.1 213.0 5400.0 MALE 8.363900 -26.155310

342 rows × 13 columns

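As a reference for the "predictive model" option from the list above, a minimal sketch using sklearn's `KNNImputer` on the numeric columns could look like this (illustrative only; this notebook sticks with the mean):

```python
# Model-based imputation sketch: each missing value is estimated
# from the 5 nearest rows in the numeric feature space
from sklearn.impute import KNNImputer

num_cols = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)',
            'Body Mass (g)', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)']
df_knn = penguin_manager.get_df(as_copy=True)
df_knn[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df_knn[num_cols])
```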
```python
# check for empties with missingno
warnings.filterwarnings('ignore')
msno.matrix(df_filling)
warnings.filterwarnings('default')
```

![png](output_97_0.png)

```python
# count the missing values
df_filling.isnull().sum()
```

studyName 0
Species 0
Island 0
Individual ID 0
Clutch Completion 0
Date Egg 0
Culmen Length (mm) 0
Culmen Depth (mm) 0
Flipper Length (mm) 0
Body Mass (g) 0
Sex 1
Delta 15 N (o/oo) 0
Delta 13 C (o/oo) 0
dtype: int64

The dataset is now almost complete: only one row is missing the "Sex" value. Training a classifier to impute a single row is not worth it, so I will drop it instead.

```python
df_filling = df_filling.dropna(subset=['Sex'])
df_filling.isnull().sum()
```

studyName 0
Species 0
Island 0
Individual ID 0
Clutch Completion 0
Date Egg 0
Culmen Length (mm) 0
Culmen Depth (mm) 0
Flipper Length (mm) 0
Body Mass (g) 0
Sex 0
Delta 15 N (o/oo) 0
Delta 13 C (o/oo) 0
dtype: int64

```python
warnings.filterwarnings('ignore')
msno.matrix(df_filling)
warnings.filterwarnings('default')
```

![png](output_101_0.png)

In total, I dropped 3 rows and imputed the rest with the mean. Now I have a dataset without missing values.

```python
# recreate the penguin manager
penguin_manager = EncoderManager(df_filling, columns_to_label_encode, columns_to_ordinal_encode, columns_to_onehot_encode)
penguin_manager.get_df()
```
studyName Species Island Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A1 Yes 2007-11-11 39.1 18.7 181.0 3750.0 MALE 8.733382 -25.686292
1 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N1A2 Yes 2007-11-11 39.5 17.4 186.0 3800.0 FEMALE 8.949560 -24.694540
2 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N2A1 Yes 2007-11-16 40.3 18.0 195.0 3250.0 FEMALE 8.368210 -25.333020
3 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A1 Yes 2007-11-16 36.7 19.3 193.0 3450.0 FEMALE 8.766510 -25.324260
4 PAL0708 Adelie Penguin (Pygoscelis adeliae) Torgersen N3A2 Yes 2007-11-16 39.3 20.6 190.0 3650.0 MALE 8.664960 -25.298050
... ... ... ... ... ... ... ... ... ... ... ... ... ...
337 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N38A1 No 2009-12-01 47.2 13.7 214.0 4925.0 FEMALE 7.991840 -26.205380
338 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A1 Yes 2009-11-22 46.8 14.3 215.0 4850.0 FEMALE 8.411510 -26.138320
339 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N39A2 Yes 2009-11-22 50.4 15.7 222.0 5750.0 MALE 8.301660 -26.041170
340 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A1 Yes 2009-11-22 45.2 14.8 212.0 5200.0 FEMALE 8.242460 -26.119690
341 PAL0910 Gentoo penguin (Pygoscelis papua) Biscoe N43A2 Yes 2009-11-22 49.9 16.1 213.0 5400.0 MALE 8.363900 -26.155310

341 rows × 13 columns

## 10) Normalize and scale data

Normalization is a common preprocessing step in machine learning that aims to scale features to similar ranges, so that no single feature dominates the others during training. By normalizing the data we ensure that each feature contributes comparably to the model's training, leading to more accurate and robust models. Normalization can also improve the convergence of some machine learning algorithms, such as gradient descent, and reduce the computational complexity.

References: TP02 - Data Preprocessing

The most common techniques are:

#### Min-max normalization

Used to scale all values into a range between 0 and 1. Function to use: sklearn.preprocessing.MinMaxScaler. In general, min-max normalization:

- Is sensitive to outliers: outliers can compress the range of the other values.
- May not preserve the original shape of the data.
- Is good for distance-based algorithms (e.g., k-means clustering, hierarchical clustering, and principal component analysis (PCA)).
- Is not good for non-linear data; it could produce values that are too small.

#### Z-score normalization

Used to standardize values by scaling them to have a mean of 0 and a standard deviation of 1. Function to use: sklearn.preprocessing.StandardScaler. In general, z-score normalization:

- Helps reduce the influence of outliers.
- Works well with normally distributed data.
- Is sensitive to EXTREME outliers.
- May not preserve the original shape of the data.

#### Clipping

This method clips the data between a minimum and a maximum value. It is used to remove outliers and to reduce the influence of extreme values. The data is not deleted, but replaced with the chosen min or max value when it falls outside the range.

- Helps reduce the influence of outliers.
- Can bias the model.
- Could remove information that would be useful for modeling the data.

#### Binning

This method bins the data into a fixed number of bins. It is used to remove outliers and to reduce the influence of extreme values. Binning transforms continuous variables into discrete ones and reduces the effects of minor observation errors.

More info on binning: https://en.wikipedia.org/wiki/Data_binning
And: TP02 - Data Preprocessing

#### L1 normalization

Used to normalize the data based on the sum of the absolute values of each observation. Function to use: sklearn.preprocessing.normalize with norm='l1'.

#### L2 normalization

Used to normalize the data based on the sum of the squared values of each observation. Function to use: sklearn.preprocessing.normalize with norm='l2'.

#### Log transformation

Used to scale the data and improve the linearity of relationships. Function to use: numpy.log or numpy.log1p.

### Normalization

For this step I will normalize with Min-Max, since I checked earlier that there are no important outliers, and it works well with the PCA I will use later. The distributions tend to be normal, but Min-Max normalization should still be fine. A plain-sklearn sketch of this step follows.
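The `normalize("min-max", ...)` call in the next cell presumably wraps something equivalent to this (a minimal sketch, assuming the numeric columns only):

```python
# Min-max scaling sketch: each numeric column is mapped to [0, 1]
from sklearn.preprocessing import MinMaxScaler

num = penguin_manager.get_df().select_dtypes(include='number')
scaled = pd.DataFrame(MinMaxScaler().fit_transform(num),
                      columns=num.columns, index=num.index)
```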
```python
df_numeric_ = penguin_manager.get_df().select_dtypes(include=[np.number])
df_numeric_.hist()
```

![png](output_105_1.png)

```python
# normalize with Min-Max
df_normalized = penguin_manager.normalize("min-max", inplace=True).get_df()
df_normalized.select_dtypes(include=["float", "int"]).hist()
```

![png](output_106_1.png)

The result looks good: the distributions keep a shape similar to the original ones, but the values are now scaled between 0 and 1.

## 11) Pandas Profiling

Before going further, I will use pandas profiling to get a quick overview of the data at this point.

```python
# pandas profiling
profile = ProfileReport(penguin_manager.get_df(), title="Pandas Profiling Report", explorative=True)
warnings.filterwarnings('ignore')
profile.to_file("profiling/profiling.html")
warnings.filterwarnings('default')
```

## 12) Clustering, dimension reduction with PCA
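The cell that builds `df_encoded_pca` is missing from this export; given its output below (341 rows, 7 columns: the six scaled measurements plus a label-encoded `Species`), it was presumably something like this reconstruction:

```python
# Presumed construction of df_encoded_pca (reconstruction, not the original cell)
from sklearn.preprocessing import LabelEncoder

cols = ['Body Mass (g)', 'Flipper Length (mm)', 'Culmen Length (mm)',
        'Culmen Depth (mm)', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)']
df_encoded_pca = penguin_manager.get_df(as_copy=True)[cols + ['Species']]
df_encoded_pca['Species'] = LabelEncoder().fit_transform(df_encoded_pca['Species'])
df_encoded_pca = df_encoded_pca.reset_index(drop=True)
df_encoded_pca.head()
```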
Body Mass (g) Flipper Length (mm) Culmen Length (mm) Culmen Depth (mm) Delta 15 N (o/oo) Delta 13 C (o/oo) Species
0 0.291667 0.152542 0.254545 0.666667 0.460122 0.412350 0
1 0.305556 0.237288 0.269091 0.511905 0.550450 0.719311 0
2 0.152778 0.389831 0.298182 0.583333 0.307537 0.521692 0
3 0.208333 0.355932 0.167273 0.738095 0.473964 0.524404 0
4 0.263889 0.305085 0.261818 0.892857 0.431532 0.532516 0
```python
df_encoded_pca.shape
```

(341, 7)

Now I have 341 rows and 7 columns. I will use the elbow method on the explained variance ratio to choose the number of principal components.

```python
pca = PCA()
data_pca = pca.fit_transform(df_encoded_pca)
df_encoded_pca['pca1'] = data_pca[:, 0]
df_encoded_pca['pca2'] = data_pca[:, 1]

# Plot explained variance ratio
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.show()

# Plot first two principal components
sns.scatterplot(x='pca1', y='pca2', hue='Species', data=df_encoded_pca)
plt.title('Penguin clustering with PCA')
plt.show()
```

![png](output_115_0.png)

![png](output_115_1.png)

I am not very confident with the PCA, but I think the best number of components is 2: the explained variance ratio drops off quickly, and this keeps the number of components minimal. As the PCA shows, after all the scaling and encoding the resulting clusters are very clear.

```python
pca = PCA(n_components=2)
# note: df_encoded_pca now also contains the pca1/pca2 columns added above
data_pca = pca.fit_transform(df_encoded_pca)
df_encoded_pca['pca1'] = data_pca[:, 0]
df_encoded_pca['pca2'] = data_pca[:, 1]

# Plot explained variance ratio
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.show()

# Plot first two principal components
sns.scatterplot(x='pca1', y='pca2', hue='Species', data=df_encoded_pca)
plt.title('Penguin clustering with PCA')
plt.show()
```

![png](output_117_0.png)

![png](output_117_1.png)

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Define the features and the target
X = df_encoded_pca[['pca1', 'pca2']]
y = df_encoded_pca['Species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print("Accuracy: {:.2f}".format(accuracy))
```

Accuracy: 1.00

Using the PCA, the prediction is perfect. Note, however, that df_encoded_pca still contains the encoded Species column, so the target leaks into the principal components and this score is optimistic.

### Reducing and visualizing using t-SNE

```python
df_encoded_tsne = df_encoded_pca.copy()
```

```python
tsne = TSNE(n_components=2)

# fit and transform the data to two dimensions
tsne_data = tsne.fit_transform(df_encoded_tsne)

# create a scatter plot of the t-SNE representation of the data
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE Scatter Plot')
plt.show()
```

![png](output_122_0.png)

```python
X = tsne_data[:, :2]
y = df_encoded_tsne['Species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print("Accuracy: {:.2f}".format(accuracy))
```

Accuracy: 1.00

Also with t-SNE the prediction is very good: even with the default values, the clusters pop out very well (the same caveat about the Species column being part of the input applies here too).

## 13) Model selection and evaluation

Based on the insights gained from EDA, choose the appropriate machine learning model(s) for your problem. Evaluate the performance of the model(s) using appropriate evaluation metrics and cross-validation techniques. Iterate on the model, features, or data preprocessing steps to improve performance, if necessary.

For this step I will use the encoded and normalized dataset, a KNN classifier, and cross-validation to evaluate the model.
At this point I use the KNN because I want a distance-based classifier, and I think it's a good choice for this dataset. Moreover, this TP is not intended to train the perfect model, but to show the process of data analysis. However, I will also try other models to see how they perform. I will try to predict the species. (A cross-validated check is sketched after the table below.)

```python
y_knn = penguin_manager.get_df(as_copy=True)["Species"]
X_knn = penguin_manager.get_df(as_copy=True).drop(columns=["Species"])

cols_to_label_encode_knn = ['Clutch Completion', 'Sex', 'Individual ID']
cols_to_ordinal_encode_knn = ["Date Egg"]
cols_to_onehot_encode_knn = ['studyName', 'Island']

penguin_manager_knn = EncoderManager(X_knn, cols_to_label_encode_knn, cols_to_ordinal_encode_knn, cols_to_onehot_encode_knn)
X_knn_encoded = penguin_manager_knn.encode(inplace=False)
X_knn_encoded.head()
```
| | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | studyName_PAL0708 | studyName_PAL0809 | studyName_PAL0910 | Island_Biscoe | Island_Dream | Island_Torgersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1 | 2.0 | 0.254545 | 0.666667 | 0.152542 | 0.291667 | 2 | 0.460122 | 0.412350 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 23 | 1 | 2.0 | 0.269091 | 0.511905 | 0.237288 | 0.305556 | 1 | 0.550450 | 0.719311 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 44 | 1 | 6.0 | 0.298182 | 0.583333 | 0.389831 | 0.152778 | 1 | 0.307537 | 0.521692 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 66 | 1 | 6.0 | 0.167273 | 0.738095 | 0.355932 | 0.208333 | 1 | 0.473964 | 0.524404 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 67 | 1 | 6.0 | 0.261818 | 0.892857 | 0.305085 | 0.263889 | 2 | 0.431532 | 0.532516 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
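Since the plan is to evaluate with cross-validation, here is a minimal sketch of a 5-fold cross-validated KNN baseline on the full encoded dataset, using the X_knn_encoded and y_knn built above (the exact scores will depend on the encoding):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the baseline KNN;
# less dependent on one particular split than a single hold-out set
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_knn_encoded, y_knn, cv=5)
print("CV accuracy: {:.3f} +/- {:.3f}".format(cv_scores.mean(), cv_scores.std()))
```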
```python
# split the data
X_train_model, X_test_model, y_train_model, y_test_model = train_test_split(X_knn_encoded, y_knn, test_size=0.2, random_state=42)
```

```python
from sklearn.metrics import rand_score, adjusted_rand_score
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_model, y_train_model)
y_pred_knn = knn.predict(X_test_model)
print(f"Knn score: {accuracy_score(y_pred_knn, y_test_model)}")

warnings.filterwarnings("ignore")
kmeans = KMeans(n_clusters=3, random_state=42, n_init="auto").fit(X_train_model, y_train_model)
print(f"KMeans score: {adjusted_rand_score(y_train_model, kmeans.labels_)}")
warnings.filterwarnings("default")

gnb = GaussianNB().fit(X_train_model, y_train_model)
y_pred_gnb = gnb.predict(X_test_model)
print(f"Gauss score: {accuracy_score(y_pred_gnb, y_test_model)}")

rfc = RandomForestClassifier(n_estimators=50, criterion="gini").fit(X_train_model, y_train_model)
y_pred_rfc = rfc.predict(X_test_model)
print(f"RFC score: {accuracy_score(y_pred_rfc, y_test_model)}")

reg = LogisticRegression(max_iter=50).fit(X_train_model, y_train_model)
y_pred_reg = reg.predict(X_test_model)
print(f"Logistic Regression: {accuracy_score(y_pred_reg, y_test_model)}")

svc = svm.SVC(kernel="linear", C=1.0).fit(X_train_model, y_train_model)
y_pred_svc = svc.predict(X_test_model)
print(f"C-Support Vector: {accuracy_score(y_pred_svc, y_test_model)}")
```

    Knn score: 0.9130434782608695
    KMeans score: 0.09537487102572052
    Gauss score: 0.8840579710144928
    RFC score: 0.9710144927536232
    Logistic Regression: 0.9710144927536232
    C-Support Vector: 1.0

    C:\Users\stefa\miniconda3\envs\MchineLearning\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
    STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
    Increase the number of iterations (max_iter) or scale the data as shown in:
        https://scikit-learn.org/stable/modules/preprocessing.html
    Please also refer to the documentation for alternative solver options:
        https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
      n_iter_i = _check_optimize_result(

As the warning shows, the logistic regression did not converge within max_iter=50, so its score should be taken with a grain of salt.

I think these are already good results for this lab. I will stick with KNN even though it does not have the best accuracy on this dataset, because it has a higher improvement margin. The model could be improved further by tuning the hyperparameters, which I won't do here (a sketch of such a grid search appears after the feature ranking below). First, I will use a decision tree to find the most informative features.

```python
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_knn_encoded, y_knn)

feature_importances = tree_clf.feature_importances_
feature_importances
```

    array([0.03380558, 0.        , 0.        , 0.35479631, 0.01317101,
           0.5319982 , 0.        , 0.        , 0.        , 0.0176711 ,
           0.        , 0.        , 0.        , 0.0485578 , 0.        ,
           0.        ])

```python
indices = feature_importances.argsort()[::-1]
for f in range(X_knn_encoded.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], feature_importances[indices[f]]))
```

    1. feature 5 (0.531998)
    2. feature 3 (0.354796)
    3. feature 13 (0.048558)
    4. feature 0 (0.033806)
    5. feature 9 (0.017671)
    6. feature 4 (0.013171)
    7. feature 15 (0.000000)
    8. feature 14 (0.000000)
    9. feature 12 (0.000000)
    10. feature 11 (0.000000)
    11. feature 10 (0.000000)
    12. feature 8 (0.000000)
    13. feature 7 (0.000000)
    14. feature 6 (0.000000)
    15. feature 2 (0.000000)
    16. feature 1 (0.000000)
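As promised above, here is a minimal sketch of what the hyperparameter tuning could look like, using scikit-learn's GridSearchCV on the training split; the grid values are illustrative assumptions, not tuned choices:

```python
from sklearn.model_selection import GridSearchCV

# a small, illustrative grid over the main KNN hyperparameters
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_model, y_train_model)
print("Best parameters:", grid.best_params_)
print("Best CV accuracy: {:.3f}".format(grid.best_score_))
```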
Mapping the importances back to the column names makes the ranking easier to read:

```python
fi = pd.DataFrame({
    'feature': X_knn_encoded.columns,
    'importance': tree_clf.feature_importances_
})

# sort the feature importances in descending order
fi = fi.sort_values(by='importance', ascending=False)

# plot the feature importances
plt.figure(figsize=(10, 6))
plt.bar(fi['feature'], fi['importance'])
plt.xticks(rotation=90)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()
```

![png](output_132_0.png)

It seems that only six features are important; I will try to retrain the model with a reduced subset of them.

```python
X_train_knn_sim = X_train_model[["Flipper Length (mm)", "Culmen Length (mm)", "Island_Biscoe", "Individual ID", "Delta 13 C (o/oo)", "Delta 15 N (o/oo)", "Sex"]]
X_test_knn_sim = X_test_model[["Flipper Length (mm)", "Culmen Length (mm)", "Island_Biscoe", "Individual ID", "Delta 13 C (o/oo)", "Delta 15 N (o/oo)", "Sex"]]
X_test_knn_sim.head()
```
| | Flipper Length (mm) | Culmen Length (mm) | Island_Biscoe | Individual ID | Delta 13 C (o/oo) | Delta 15 N (o/oo) | Sex |
|---|---|---|---|---|---|---|---|
| 322 | 0.949153 | 0.618182 | 1.0 | 31 | 0.379622 | 0.429100 | 2 |
| 116 | 0.457627 | 0.189091 | 0.0 | 113 | 0.201729 | 0.778965 | 2 |
| 113 | 0.322034 | 0.272727 | 1.0 | 108 | 0.070866 | 0.491998 | 1 |
| 42 | 0.406780 | 0.436364 | 0.0 | 39 | 0.771173 | 0.670639 | 2 |
| 126 | 0.389831 | 0.341818 | 0.0 | 131 | 0.307669 | 0.373327 | 2 |
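As an aside, the hand-picked subset above keeps Sex and Delta 15 N (o/oo), which had zero importance, and drops Culmen Depth (mm), which did not. A small sketch that derives the informative columns programmatically from the fi table built above would avoid that kind of slip:

```python
# keep exactly the columns the decision tree found informative
important_cols = fi.loc[fi['importance'] > 0, 'feature'].tolist()
print(important_cols)  # the six columns with non-zero importance
```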
```python
knn_sim = KNeighborsClassifier(n_neighbors=5)

# fit the model on the reduced feature set
knn_sim.fit(X_train_knn_sim, y_train_model)

# predict
y_pred_knn = knn_sim.predict(X_test_knn_sim)

# evaluate
print("Accuracy:", accuracy_score(y_test_model, y_pred_knn))
```

    Accuracy: 0.5797101449275363

I had to try, but this wasn't a good idea. I will use the original feature set.

## 14) Report and communicate findings

Summarize your findings from the EDA and document any insights, patterns, or relationships discovered during the process. Use clear visualizations and concise language to communicate your results to stakeholders.

### Summary

This dataset contains data about penguins. Some features are useless and can be ignored, while the others carry useful information, most notably about the species of the penguin; other columns could be predicted as well. The sex of the penguin shows some correlation with other features, such as the island, the flipper length, and the culmen length. The body mass also looks like it could be predicted well. I managed to predict the species with good accuracy, but I think the model could still be improved. The dataset may be too small to train a good model, and the client should sample more data.

It would be nice to apply a classifier (one class per day) to the egg date stripped of its year, in order to predict future egg dates within the year, but this would require many more samples, since there would be up to 365 classes. An interesting piece of information to add would be the date the egg hatched, which would make it possible to predict the hatching date.

No outliers were found, but some missing data were present. Overall the amount of missing data was not too large and could be managed; only 3 rows were dropped.

A problem could be the imbalance of some features: for instance, it would not be possible to predict the clutch completion, because the model would be too biased, and balancing the classes would leave too little data. Moreover, the client should sample more data about the Chinstrap penguins. It would also be interesting to have data about behaviour, to see how it could help predict other features, or how the other features could predict the behaviour.