Data cleaning is often one of the most important and time-consuming steps in the data analysis process. It is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database: identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data, then replacing, modifying, or deleting them.
Link to the code
Dataset Used
Nashville Housing Dataset from Kaggle.
Tasks
1. Populate the property address.
2. Break out the address into individual columns (address, city, state).
3. Change Y and N to Yes and No in the “Sold as Vacant” field.
4. Remove duplicates.
5. Delete unused columns.
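To make the first three tasks concrete, here is a minimal sketch. It is not the project code: it runs the SQL through Python's built-in `sqlite3` module against a tiny in-memory table with made-up rows. The table name `housing`, the column names, and the 'street, city, state' address format are all assumptions for illustration, as is the idea that rows sharing a parcel ID share an address.

```python
import sqlite3

# In-memory SQLite database with a tiny, made-up stand-in for the housing table.
# Table and column names are hypothetical, not the Kaggle schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE housing (
    unique_id        INTEGER PRIMARY KEY,
    parcel_id        TEXT,
    property_address TEXT,   -- assumed format: 'street, city, state'
    sold_as_vacant   TEXT
);
INSERT INTO housing VALUES
    (1, 'P-001', '12 OAK ST, NASHVILLE, TN', 'N'),
    (2, 'P-001', NULL,                       'Y'),
    (3, 'P-002', '7 ELM AVE, ANTIOCH, TN',   'Yes');

-- Task 1: fill a missing address from another row with the same parcel
-- (assumes rows that share a parcel_id share an address).
UPDATE housing
SET property_address = (
    SELECT b.property_address
    FROM housing AS b
    WHERE b.parcel_id = housing.parcel_id
      AND b.unique_id <> housing.unique_id
      AND b.property_address IS NOT NULL
    LIMIT 1
)
WHERE property_address IS NULL;

-- Task 3: normalise Y/N to Yes/No, leaving any other value unchanged.
UPDATE housing
SET sold_as_vacant = CASE sold_as_vacant
                         WHEN 'Y' THEN 'Yes'
                         WHEN 'N' THEN 'No'
                         ELSE sold_as_vacant
                     END;
""")

# Task 2: split the address on its commas into address, city, and state.
split_sql = """
WITH parts AS (
    SELECT unique_id,
           substr(property_address, 1, instr(property_address, ',') - 1) AS address,
           substr(property_address, instr(property_address, ',') + 2)    AS rest
    FROM housing
)
SELECT unique_id,
       address,
       substr(rest, 1, instr(rest, ',') - 1) AS city,
       substr(rest, instr(rest, ',') + 2)    AS state
FROM parts
ORDER BY unique_id
"""
rows = conn.execute(split_sql).fetchall()
flags = [r[0] for r in conn.execute(
    "SELECT sold_as_vacant FROM housing ORDER BY unique_id")]

for row in rows:
    print(row)
print(flags)
```

The split relies on `instr` and `substr`, which are SQLite functions. Other databases have their own string helpers, so the exact expressions would change, but the idea of cutting the string at each comma stays the same.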
Techniques
Here are some of the advanced techniques I used to clean this data.
1. CTEs
2. Temp tables
3. Window functions
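As a rough illustration of how these three techniques can work together, the sketch below removes duplicate rows. Again, this is not the project code. It uses Python's built-in `sqlite3` module, a hypothetical `sales` table with invented rows, and the assumption that two rows are duplicates when they match on parcel, sale date, and sale price. Window functions need SQLite 3.25 or newer, which recent Python builds normally bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table and rows, for illustration only.
CREATE TABLE sales (
    unique_id  INTEGER PRIMARY KEY,
    parcel_id  TEXT,
    sale_date  TEXT,
    sale_price INTEGER
);
INSERT INTO sales VALUES
    (1, 'P-001', '2016-01-05', 150000),
    (2, 'P-001', '2016-01-05', 150000),  -- duplicate of row 1
    (3, 'P-002', '2016-02-10', 210000);

-- The CTE numbers the rows inside each group of identical sales using the
-- ROW_NUMBER() window function; the result is stored in a temp table.
CREATE TEMP TABLE ranked_sales AS
WITH ranked AS (
    SELECT unique_id,
           ROW_NUMBER() OVER (
               PARTITION BY parcel_id, sale_date, sale_price
               ORDER BY unique_id
           ) AS row_num
    FROM sales
)
SELECT unique_id, row_num FROM ranked;

-- Any row numbered 2 or higher repeats an earlier row, so delete it.
DELETE FROM sales
WHERE unique_id IN (
    SELECT unique_id FROM ranked_sales WHERE row_num > 1
);
""")

remaining = [r[0] for r in conn.execute(
    "SELECT unique_id FROM sales ORDER BY unique_id")]
print(remaining)
```

Storing the ranking in a temp table lets you inspect which rows are flagged before anything is deleted, which is a useful safety check when cleaning real data.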