Exercise: Cleaning Data

This exercise helps you practise one of the most important skills in data analysis: cleaning messy data so it becomes accurate and usable. Most real datasets contain missing values, duplicates, inconsistent text, wrong data types, and outliers. In this exercise, you will take a raw dataset and prepare a clean version that can be analysed confidently.

Start by selecting a small dataset in CSV or Excel format. It can be sales data, survey data, student marks, or any public dataset. Load it into Python using Pandas and perform a quick inspection. Check the number of rows and columns, view the first few rows, and review column names. Then identify problems by checking missing values, duplicate rows, and obvious formatting issues.
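The loading and inspection step might look like the following minimal sketch. The inline `RAW_CSV` text and its column names are made up for illustration and stand in for your own file; with a real dataset you would pass a file path to `pd.read_csv` (or use `pd.read_excel` for Excel files).

```python
import io

import pandas as pd

# Hypothetical raw data standing in for your own CSV file.
RAW_CSV = """Order ID, Customer Name ,Order Date,Amount,Region
1001,  alice smith,2024-01-05,250.00,north
1002,Bob Jones,2024-01-06,,South
1002,Bob Jones,2024-01-06,,South
1003,carol white ,not a date,abc,NORTH
1004,,2024-01-09,9999999,south
"""

# For a real file this would be: pd.read_csv("your_file.csv")
df = pd.read_csv(io.StringIO(RAW_CSV))

# Quick inspection: size, first rows, column names, inferred types.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
print(df.head())
print(list(df.columns))
print(df.dtypes)

# Identify problems: missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())
print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

Printing the column list is worth the extra line, because stray spaces in headers (as in ` Customer Name ` here) are easy to miss in the tabular preview.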

Next, clean the structure of the dataset. Standardise column names so they are clear and consistent. Fix data types, such as converting dates to datetime format and numbers stored as text to numeric format. Handle missing values according to the context. You may remove rows that are missing critical fields, fill missing numeric values with a sensible value such as the mean or median, or fill missing text values with “Unknown” when appropriate.
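As a rough sketch of these structural fixes, assuming a small made-up frame in which the date is the critical field: the column names, the values, and the choice of the median are illustrative assumptions, not rules.

```python
import pandas as pd

# Hypothetical messy frame; names and values are made up for illustration.
df = pd.DataFrame({
    " Order Date ": ["2024-01-05", "2024-01-06", "not a date", "2024-01-09"],
    "Amount": ["250", None, "abc", "100"],
    "Customer Name": ["Alice", None, "Carol", "Dan"],
})

# Standardise column names: trim, lowercase, underscores instead of spaces.
df.columns = (
    df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# Fix data types. errors="coerce" turns unparseable entries into NaT/NaN
# instead of raising, so they show up as missing values.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Handle missing values according to context:
# 1. drop rows missing a critical field (assumed here to be the date),
df = df.dropna(subset=["order_date"]).copy()
# 2. fill missing numeric values with the median,
df["amount"] = df["amount"].fillna(df["amount"].median())
# 3. fill missing text values with a placeholder.
df["customer_name"] = df["customer_name"].fillna("Unknown")

print(df)
```

Note that coercion silently converts bad entries such as `"abc"` into missing values, so it helps to count missing values before and after the conversion and record the difference in your cleaning log.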

After that, clean text fields. Remove extra spaces, standardise case (lowercase or title case), and correct inconsistent labels. If the dataset has duplicates, decide whether to drop rows that are exact copies or, when the same record appears in several versions, to keep only the latest one based on a timestamp column.
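One possible way to do the text clean-up and both kinds of de-duplication is sketched below. The `customer_id`, `region`, and `updated_at` columns and the label mapping are hypothetical; in your own data the mapping should come from inspecting the distinct values, for example with `value_counts()`.

```python
import pandas as pd

# Hypothetical data: two versions of customer 1 and inconsistent labels.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "region": ["  north", "North ", "SOUTH", "Sth"],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-01-15", "2024-01-20"]
    ),
})

# Remove leading/trailing spaces, collapse repeated inner spaces,
# and standardise case.
df["region"] = (
    df["region"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

# Correct inconsistent labels with an explicit (assumed) mapping.
df["region"] = df["region"].replace({"Sth": "South"})

# Option A: drop rows that are identical in every column.
exact_deduped = df.drop_duplicates()

# Option B: keep only the latest record per customer, using the timestamp.
latest = (
    df.sort_values("updated_at")
    .drop_duplicates(subset="customer_id", keep="last")
    .sort_values("customer_id")
    .reset_index(drop=True)
)

print(exact_deduped)
print(latest)
```

In this sample, Option A removes nothing because the two rows for customer 1 differ in their timestamps, while Option B keeps only the newer one. That difference is the reason to decide deliberately which kind of duplicate you are dealing with.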

Finally, validate your cleaned dataset. Re-check missing values, confirm data types, and verify that key columns have sensible ranges. Save the cleaned dataset to a new CSV file and write a short summary of what you changed. The final output should be a clean dataset plus a simple cleaning log that explains your decisions.
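The validation and saving step could be sketched as follows. The sample frame, the range bounds for `amount`, the file names, and the log wording are all assumptions for illustration. The files are written to a temporary folder so the sketch has no side effects; in practice you would choose your own output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame standing in for the result of earlier steps.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-09"]),
    "amount": [250.0, 175.0, 100.0],
    "customer_name": ["Alice", "Unknown", "Dan"],
})

# Re-check missing values.
remaining_missing = int(df.isna().sum().sum())

# Confirm data types.
types_ok = bool(
    pd.api.types.is_datetime64_any_dtype(df["order_date"])
    and pd.api.types.is_numeric_dtype(df["amount"])
)

# Verify a sensible range; the bounds 0 to 10,000 are an assumption.
amount_in_range = bool(df["amount"].between(0, 10_000).all())

# A simple cleaning log: one line per decision (example wording only).
cleaning_log = [
    "Standardised column names to lowercase with underscores.",
    "Dropped rows with an unparseable order_date.",
    "Filled missing amount values with the median.",
    "Filled missing customer_name values with 'Unknown'.",
]

# Save the cleaned data and the log to new files.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "cleaned_data.csv"
log_path = out_dir / "cleaning_log.txt"
df.to_csv(out_path, index=False)
log_path.write_text("\n".join(cleaning_log) + "\n", encoding="utf-8")

print(remaining_missing, types_ok, amount_in_range)
print(f"Saved to {out_path}")
```

Saving to a new file rather than overwriting the raw data keeps the original available, so every change recorded in the log can be checked against it.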

