Check for missing data in notebooks

Completed

Missing data refers to the lack of values in certain variables within a dataset.

Identifying and handling missing data is a crucial aspect of the data exploration and preprocessing phase in a machine learning project, and the way you handle them can significantly affect the performance of your model.

The key steps to handle missing data include evaluating the amount of missing data, identifying the nature of the missing data, and choosing the best method to handle the missing data values.

Identify missing data

To identify if there's any missing data in your dataset, you can use the functions isnull() or isna() from Pandas.

import pandas as pd
import numpy as np

# Create a sample DataFrame with some missing values
data = {
    'A': [1, 2, np.nan],
    'B': [4, np.nan, np.nan],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)

# Check for missing data
print(df.isnull())

Results:

       A      B      C
0  False  False  False
1  False   True  False
2   True   True  False

This outputs a DataFrame of the same size as df, but with True at the positions where values are missing (NaN), and False elsewhere.

To get the total number of missing values in the dataframe, you can use df.isnull().sum(). This returns the number of missing values for each column.

df.isnull().sum()

Results:

A    1
B    2
C    0
dtype: int64

Evaluate the nature of missing values

In a data science project, missing values can occur for various reasons and understanding their nature is crucial for handling them appropriately.

Here are some types of missing values:

  • Missing Completely at Random (MCAR): The missingness of data isn't related to any other variable's values and is random. This is the ideal scenario, but it's often not the case in real-world data.

  • Missing at Random (MAR): The missingness of data is related to some other variables' values but not the missing data itself. For example, if women are more likely to disclose their number of daily steps than men, then the daily steps data is MAR.

  • Missing Not at Random (MNAR): The missingness of data is related to the values of the missing data itself. For example, people with higher salaries might be less likely to disclose their income. Removing these records could introduce bias into the model, preventing it from accurately reflecting the complete information contained in the data.

Understanding the nature of missing values in your dataset can guide you on how to handle them. For MCAR and MAR, you might opt for deletion or imputation methods. For MNAR, these methods could introduce bias, so it might be better to gather more data or use model-based methods that can handle missing values.

Decide how to handle missing data

The approach to handling missing data can significantly impact the results of your analysis and the performance of your model. Here are some strategies you might consider.

  • Ignore: If only a small amount of data is missing, it might not have a significant impact on your model's performance.

  • Remove: If a particular row or column has many missing values, it might be best to remove it entirely.

  • Impute: Fill in the missing values with a specified value or estimate (like mean, median, mode, or using a machine learning algorithm like K-Nearest Neighbors (KNN)).

  • Use it as a new feature: Sometimes, the fact that a value is missing can be used as information itself. For example, in a product survey, unanswered questions about recommending the product could indicate customer dissatisfaction. In this case, the nonresponse can be a new feature indicating a likelihood of customer dissatisfaction.

There’s no one-size-fits-all solution for handling missing data. The best approach depends on the specifics of your dataset and the question you’re trying to answer.