Data cleaning is the process of fixing and preparing a dataset so it becomes accurate, consistent, and ready for analysis. In real-world projects, raw data often contains missing values, duplicates, incorrect formats, inconsistent labels, and outliers. If you analyse data without cleaning it, your results can be misleading, and even small errors can change totals, averages, trends, and conclusions.
A common starting step in data cleaning is handling missing values. You first identify which columns have missing data, and how much, and then decide what action makes sense. In some cases, you remove rows where important fields are missing. In other cases, you fill missing numeric values with the column mean or median (the median is less affected by extreme values), and fill missing text fields with a placeholder like “Unknown,” depending on the use case.
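As a rough illustration of these options, here is a minimal sketch in plain Python. The sample rows, the column names (`order_id`, `amount`, `city`), and the choice of `order_id` as the "important" field are all assumptions made up for the example; in practice a library such as pandas is often used for the same steps.

```python
from statistics import median

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"order_id": 1, "amount": 20.0, "city": "Leeds"},
    {"order_id": 2, "amount": None, "city": None},
    {"order_id": None, "amount": 35.0, "city": "York"},
    {"order_id": 4, "amount": 50.0, "city": "Leeds"},
]

# 1. Identify which columns have missing data, and how much.
missing_counts = {
    col: sum(1 for r in rows if r[col] is None) for col in rows[0]
}

# 2. Remove rows where an important field (assumed here: order_id) is missing.
kept = [r for r in rows if r["order_id"] is not None]

# 3. Fill numeric gaps with the median and text gaps with a placeholder.
amount_median = median(r["amount"] for r in kept if r["amount"] is not None)
filled = [
    {
        **r,
        "amount": amount_median if r["amount"] is None else r["amount"],
        "city": "Unknown" if r["city"] is None else r["city"],
    }
    for r in kept
]

print(missing_counts)
print(filled)
```

Whether to drop or fill is a judgement call: dropping loses rows, while filling keeps them but introduces values that were never observed, so it can help to record which values were filled.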
Another key step is removing duplicates. Duplicate rows often appear due to repeated exports or data being combined from multiple sources. You need to decide whether exact duplicates should simply be dropped or, when several rows describe the same entity, whether you should keep only the most recent record based on a date or ID rule.
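The two approaches to duplicates can be sketched as follows. This is an illustrative example only: the records, the `customer_id` key, and the `updated` date column are hypothetical, and the "most recent wins" rule is just one possible policy.

```python
from datetime import date

# Hypothetical records; the second row is an exact copy of the first.
records = [
    {"customer_id": 1, "email": "a@example.com", "updated": date(2024, 1, 5)},
    {"customer_id": 1, "email": "a@example.com", "updated": date(2024, 1, 5)},
    {"customer_id": 1, "email": "a.new@example.com", "updated": date(2024, 3, 1)},
    {"customer_id": 2, "email": "b@example.com", "updated": date(2024, 2, 10)},
]

# Option A: drop exact duplicates, keeping the first occurrence and the order.
seen = set()
unique_rows = []
for r in records:
    key = tuple(sorted(r.items()))  # hashable fingerprint of the whole row
    if key not in seen:
        seen.add(key)
        unique_rows.append(r)

# Option B: keep only the most recent record per customer_id.
latest = {}
for r in records:
    current = latest.get(r["customer_id"])
    if current is None or r["updated"] > current["updated"]:
        latest[r["customer_id"]] = r
latest_rows = list(latest.values())

print(len(unique_rows), len(latest_rows))
```

Option A only removes rows that match on every field, so the changed email for customer 1 survives as a separate row; Option B collapses each customer to a single row.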
Data type correction is also essential. Numbers may be stored as text, dates may be stored as strings, and currency values may contain symbols or commas. Converting columns into correct types ensures sorting, filtering, grouping, and calculations work properly. Text cleaning is also common, such as trimming extra spaces, standardising case, fixing spelling variations, and mapping inconsistent categories into standard labels.
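A small sketch of type conversion and text cleaning, using only the Python standard library, might look like this. The raw row, the `$` currency symbol, the `YYYY-MM-DD` date format, and the `COUNTRY_LABELS` mapping are assumptions chosen for illustration; real data will need its own formats and mapping table.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical raw row in which every value arrived as text.
raw = {
    "price": "$1,234.50",
    "quantity": " 3 ",
    "order_date": "2024-03-15",
    "country": "  u.k. ",
}

# Assumed mapping from known spelling variations to one standard label.
COUNTRY_LABELS = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def parse_currency(text):
    """Strip the currency symbol and thousands separators, then convert."""
    return Decimal(text.strip().replace("$", "").replace(",", ""))


def standardise_country(text):
    """Trim spaces, lower-case, and map to a standard label when one is known."""
    trimmed = text.strip()
    return COUNTRY_LABELS.get(trimmed.lower(), trimmed)


clean = {
    "price": parse_currency(raw["price"]),
    "quantity": int(raw["quantity"].strip()),
    "order_date": datetime.strptime(raw["order_date"], "%Y-%m-%d").date(),
    "country": standardise_country(raw["country"]),
}

print(clean)
```

`Decimal` is used for the price here to avoid floating-point rounding on money values; a plain `float` would also work where exact cents do not matter. Unmapped country values are returned trimmed but otherwise unchanged, so they can be reviewed rather than silently altered.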
Finally, data cleaning includes checking for outliers and invalid values, such as negative quantities, impossible ages, or unexpected categories. After cleaning, you validate the dataset by re-checking missing values, duplicates, and summary statistics. The output of this process is a clean dataset that supports accurate analysis and reliable reporting.
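The checks described above can be sketched as simple rules plus a statistical outlier test. Everything specific here is an assumption for the example: the sample rows, the allowed status labels, the 0 to 120 age range, and the use of the common 1.5 × IQR rule, which is one of several ways to flag outliers.

```python
from statistics import quantiles

# Hypothetical rows and allowed categories.
rows = [
    {"age": 34, "quantity": 2, "status": "shipped"},
    {"age": 150, "quantity": 1, "status": "pending"},
    {"age": 28, "quantity": -4, "status": "shipped"},
    {"age": 41, "quantity": 3, "status": "Shiped"},
]
ALLOWED_STATUS = {"shipped", "pending", "cancelled"}


def rule_violations(row):
    """Return a list of the rule-based problems found in one row."""
    problems = []
    if row["quantity"] < 0:
        problems.append("negative quantity")
    if not 0 <= row["age"] <= 120:  # assumed plausible age range
        problems.append("impossible age")
    if row["status"] not in ALLOWED_STATUS:
        problems.append("unexpected status")
    return problems


# Map row index -> problems, for rows that break at least one rule.
invalid = {}
for i, row in enumerate(rows):
    problems = rule_violations(row)
    if problems:
        invalid[i] = problems

# Flag numeric outliers with the 1.5 * IQR rule.
amounts = [10, 12, 11, 13, 12, 11, 10, 95]
q1, _, q3 = quantiles(amounts, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in amounts if a < low or a > high]

print(invalid)
print(outliers)
```

A flagged value is not automatically wrong: an outlier may be a genuine large order, so flagged rows are usually reviewed before being corrected or removed. Re-running the same checks after cleaning is one simple way to validate the result.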

