Sunday, September 15, 2024

Handling Missing Values || Feature Engineering || Machine Learning (Part2)



Hey reader 👋 Hope you are doing well 😊
We know that feature engineering is a crucial step in improving the performance of a machine learning model. One of the most important tasks in feature engineering is handling missing values. In this blog we are going to discuss one technique for handling missing values in detail. So let’s get started 🔥.

Complete Case Analysis

Complete Case Analysis (CCA), also called “listwise deletion”, is a method for handling missing data. In this technique, every row that contains one or more missing values is excluded from the dataset.

So the final dataset includes only those rows that contain complete data.

Key Assumptions for CCA

  • Data should be missing completely at random (MCAR).
    Suppose you have a dataset with 1000 rows and 5 columns, and 50 of those rows have missing values. If those 50 rows are spread randomly across the dataset, with no pattern to which rows are affected, you can remove them.
    When the removed rows are a random sample, the distribution of the remaining data stays approximately unchanged.

  • When the proportion of missing data is small.

  • When simplicity and ease of implementation are prioritized.
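
The MCAR example above can be sketched with a small simulation. This is only an illustration on synthetic data: the column names `a` to `e`, the seed, and the normal distribution are assumptions made for the demo, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic dataset: 1000 rows, 5 columns (names and distribution are assumptions)
rng = np.random.default_rng(0)
n_rows = 1000
df = pd.DataFrame(
    rng.normal(loc=50, scale=10, size=(n_rows, 5)),
    columns=list("abcde"),
)

# Pick 50 rows completely at random and blank out one value in each
missing_rows = rng.choice(n_rows, size=50, replace=False)
df_missing = df.copy()
df_missing.loc[missing_rows, "c"] = np.nan

# Complete Case Analysis: keep only the rows with no missing values
df_cca = df_missing.dropna()

print("Rows before CCA:", len(df_missing))
print("Rows after CCA:", len(df_cca))

# Compare the mean and standard deviation of a column before and after
print("Mean of 'a' before:", df["a"].mean(), "after:", df_cca["a"].mean())
print("Std of 'a' before:", df["a"].std(), "after:", df_cca["a"].std())
```

Because the 50 rows were chosen at random, the mean and standard deviation of the remaining rows should stay close to the original values.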

Key Points of Complete Case Analysis

  • Simplicity: CCA is straightforward to implement and understand.

  • Bias: If the missing data are not missing completely at random (MCAR), CCA can introduce bias into the analysis.

  • Efficiency: Excluding data with missing values reduces the sample size, which can lead to a loss of statistical power.

  • Application: Commonly used in regression analysis, where only cases with complete data for all predictors are included.

As a rule of thumb, use CCA on a column when no more than 5% of its values are missing. If 95% or more of a column's values are missing, you can drop the entire column instead.
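
One way to apply this rule of thumb is sketched below. The helper `apply_cca_rule`, its parameter names, and the demo columns are all hypothetical and exist only for illustration; the 5% and 95% thresholds are the ones mentioned above.

```python
import numpy as np
import pandas as pd

def apply_cca_rule(df, row_threshold=5.0, col_threshold=95.0):
    """Hypothetical helper: drop nearly empty columns, then apply CCA
    only on columns whose share of missing values is small."""
    # Percentage of missing values in each column
    missing_pct = df.isna().mean() * 100

    # Columns that are almost entirely missing are dropped as a whole
    cols_to_drop = missing_pct[missing_pct >= col_threshold].index

    # Columns with a small share of missing values are handled with CCA
    cols_for_cca = missing_pct[
        (missing_pct > 0) & (missing_pct <= row_threshold)
    ].index

    result = df.drop(columns=cols_to_drop)
    result = result.dropna(subset=list(cols_for_cca))
    return result, missing_pct

# Small synthetic demo (column names are assumptions)
n = 100
demo = pd.DataFrame({
    "full": np.arange(n, dtype=float),
    "few_missing": np.arange(n, dtype=float),
    "many_missing": np.arange(n, dtype=float),
    "mostly_empty": np.arange(n, dtype=float),
})
demo.loc[:2, "few_missing"] = np.nan      # 3 of 100 values missing
demo.loc[10:49, "many_missing"] = np.nan  # 40 of 100 values missing
demo.loc[:95, "mostly_empty"] = np.nan    # 96 of 100 values missing

result, missing_pct = apply_cca_rule(demo)
print(missing_pct)
print("Shape before:", demo.shape, "after:", result.shape)
```

In this sketch the column with 40% missing values is left untouched, since it falls between the two thresholds and neither rule applies to it.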

Implementation

import pandas as pd

# Load your dataset
# Replace 'your_dataset.csv' with the actual file path
df = pd.read_csv('your_dataset.csv')

# Display the original data
print("Original Data:")
print(df.head())

# Filter out rows with any missing data (Complete Case Analysis)
df_complete_case = df.dropna()

# Display data after applying CCA
print("\nData after Complete Case Analysis:")
print(df_complete_case)

# Check the number of rows before and after CCA
print("\nNumber of rows before CCA:", len(df))
print("Number of rows after CCA:", len(df_complete_case))

Note that CCA can introduce bias into the data, because removing rows can discard important information.
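
To see how this bias can arise, here is a minimal sketch on synthetic data where the missing values are not random. The `age` and `income` columns and the rule "income is missing more often for people aged 60 and above" are assumptions invented for the demo.

```python
import numpy as np
import pandas as pd

# Synthetic data (column names, ranges, and seed are assumptions)
rng = np.random.default_rng(42)
n_rows = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n_rows),
    "income": rng.normal(50000, 10000, size=n_rows),
})

# Hypothetical non-random mechanism: income is missing for roughly
# 60% of the people aged 60 and above, and never for anyone else
older = df["age"] >= 60
hide = older & (rng.random(n_rows) < 0.6)
df.loc[hide, "income"] = np.nan

# Complete Case Analysis
df_cca = df.dropna()

# Compare the age distribution before and after removing rows
summary = pd.DataFrame({
    "before": df["age"].describe(),
    "after": df_cca["age"].describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

Here the rows that get removed are mostly older people, so the average age after CCA comes out lower than in the original data. Comparing summary statistics before and after like this is a quick check on whether CCA has shifted the distribution.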
Check the following notebook for an implementation of handling missing values:
https://www.kaggle.com/code/nehagupta09/handling-missing-values

I hope you now understand how missing values can be handled in a dataset. In the next blog we are going to take our discussion on feature engineering further. Till then, stay connected and don’t forget to follow me.

Thank you 💙
