Handling Missing Data in Pandas Using SimpleImputer

Filling in the Gaps: Simple Techniques for Imputing Missing Values in Your Data

Jun 04, 2024

In data science, dealing with missing data is a common challenge. Missing values can arise for various reasons, such as data entry errors, equipment malfunctions, or simply because some information was not available at the time of data collection. Regardless of the cause, handling missing data is crucial for building robust and accurate models. In this post, we’ll explore how to handle missing data in a pandas DataFrame using SimpleImputer from the sklearn.impute module.

Understanding Missing Data

Before we dive into the solution, let’s briefly understand why handling missing data is important. Missing data can distort statistical analysis and machine learning models, leading to inaccurate results. There are several strategies to handle missing data, including:

Removing missing values: Dropping the rows or columns that contain missing values. This is simple, but it also discards the observed values in those rows or columns, so it is often impractical when a significant portion of the data is missing.
Imputation: Filling in the missing values with substituted values. This can be done using various strategies, such as the mean, median, mode, or a constant value.
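
To make the two strategies above concrete, here is a minimal sketch using plain pandas. The toy DataFrame and its column names are made up for illustration and are not part of the example used later in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "age": [25.0, np.nan, 22.0],
    "city": ["Pune", "Delhi", np.nan],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: impute, here with the column mean (numeric) and a constant (text)
imputed = toy.fillna({"age": toy["age"].mean(), "city": "unknown"})

print(len(toy), len(dropped))  # row counts before and after dropping
print(imputed)
```

In this toy case, dropping keeps only one of the three rows, while imputation keeps all of them at the cost of introducing estimated values.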

Introducing SimpleImputer

SimpleImputer is a class in scikit-learn that provides basic strategies for imputing missing values. It supports four strategies: mean and median, which work only on numeric data, and most_frequent and constant, which work on both numeric and categorical data.
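
As a quick sketch of how the strategy parameter changes the result, the snippet below runs each strategy on the same single-column array. The input values are hypothetical and chosen only so the fill values are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column input with one missing entry (row index 2)
X = np.array([[1.0], [2.0], [np.nan], [9.0]])

results = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    results[strategy] = imputer.fit_transform(X).ravel()

# The 'constant' strategy takes its replacement from fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
results["constant"] = constant_imputer.fit_transform(X).ravel()

for name, filled in results.items():
    print(name, filled)
```

Only the third entry differs between the runs: it becomes the mean (4.0), the median (2.0), the most frequent value, or the constant (0.0), depending on the strategy.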

Example: Handling Missing Data in a DataFrame

Let’s walk through an example. Suppose we have a DataFrame with both numeric and categorical columns:

import pandas as pd

# Example DataFrame with numeric and categorical columns
data = {
    'age': [25, 30, None, 22, 27],
    'salary': [50000, 60000, 65000, None, 70000],
    'gender': ['male', 'female', 'female', None, 'male'],
    'department': ['IT', 'HR', None, 'Finance', 'IT']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Applying SimpleImputer

We will use SimpleImputer to fill in the missing values. For numeric columns, we will use the mean strategy, and for categorical columns, we will use the most frequent value.

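import numpy as np
from sklearn.impute import SimpleImputer

# Separate numeric and categorical columns
# (everything that is not numeric is treated as categorical here)
numeric_cols = df.select_dtypes(include='number').columns
categorical_cols = df.select_dtypes(exclude='number').columns

# Normalize None to np.nan in the categorical columns. SimpleImputer looks for
# np.nan by default, and depending on library versions a None stored in an
# object column may not be recognized as missing.
df[categorical_cols] = df[categorical_cols].where(df[categorical_cols].notna(), np.nan)

# Create imputers
numeric_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Apply imputers (fit_transform learns the fill values and applies them)
df[numeric_cols] = numeric_imputer.fit_transform(df[numeric_cols])
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])

print("\nDataFrame after imputation:")
print(df)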

Explanation

Separating Columns: We first separate the numeric and categorical columns.
Creating Imputers: We create separate SimpleImputer instances for numeric and categorical columns, specifying the imputation strategy.
Applying Imputers: We fit and transform the DataFrame columns using the respective imputers.
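
The "fit and transform" step can also be split in two, which is useful when the fill values should be learned from one dataset and reused on another. The sketch below assumes a hypothetical train/new split that is not part of the example above. It uses the statistics_ attribute, which holds the fill value learned for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/new split, for illustration only
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
new = pd.DataFrame({"age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)            # learns the mean of the observed training ages
print(imputer.statistics_)    # one learned fill value per column

# transform reuses the learned mean; it does not recompute it from `new`
new_filled = imputer.transform(new)
print(new_filled)
```

Fitting on one dataset and transforming another keeps the fill values consistent across both, which is generally what you want when the second dataset is held-out or unseen data.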

Conclusion

Handling missing data is a critical step in the data preprocessing pipeline. SimpleImputer provides a simple yet effective way to fill in missing values, ensuring that your data is complete and ready for analysis. By using different strategies for numeric and categorical data, you can tailor the imputation process to the specific needs of your dataset.

Try incorporating SimpleImputer into your data preprocessing workflow and see the difference it makes in your models’ performance. Happy coding!

shravankumar’s Substack
