Machine-Learning-Questions-and-answers-Srtut.com

Data cleaning methods using Python

Removing duplicates: Duplicate rows can be removed with the drop_duplicates() method of the pandas library. Here's an example:

import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace=True)
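
It can help to count duplicates before dropping them, and to decide which columns define a duplicate. The following is a minimal sketch on a small made-up DataFrame (the column names and values are illustrative assumptions, not taken from data.csv), using the subset and keep parameters of drop_duplicates():

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "city": ["Pune", "Pune", "Delhi", "Agra", "Agra "],
})

# Count rows that are exact copies of an earlier row.
n_exact = int(df.duplicated().sum())

# Treat rows as duplicates when they share the same "id" and keep the first one.
deduped = df.drop_duplicates(subset=["id"], keep="first").reset_index(drop=True)
```

Here the last row differs from the one before it only by a trailing space, so it is not an exact duplicate, but it is still dropped when duplicates are defined by "id" alone.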
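
Handling missing values: Missing values can be replaced with the mean, median, or mode using the fillna() method of pandas. Here's an example that fills the gaps in numeric columns with each column's mean:

# numeric_only=True skips text columns, which have no mean
df.fillna(df.mean(numeric_only=True), inplace=True)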
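
The median and the mode can be used in the same way. As a rough sketch, assuming a hypothetical DataFrame with one numeric column ("age") and one text column ("city"), the median might suit the numeric column and the mode the text column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Numeric column: fill the gap with the median, which is less sensitive to outliers.
df["age"] = df["age"].fillna(df["age"].median())

# Text column: fill the gap with the most frequent value (mode() returns a Series).
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

Which statistic is appropriate depends on the column; the choice here is only an illustration.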
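
Handling outliers: Outliers can be detected and treated using various methods in Python. Here's an example of using the interquartile range (IQR) method to detect and remove outliers:

# Compute the quartiles on numeric columns only
num = df.select_dtypes(include='number')
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
# Drop any row with a value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]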
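
Removing rows is not the only way to treat outliers. As an alternative sketch (a different technique from the removal shown above), values can be capped at the IQR fences with clip(), which keeps every row. The series below is made-up sample data:

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 95], name="value")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values that fall outside the fences.
is_outlier = (s < lower) | (s > upper)

# Cap values at the fences instead of dropping them.
capped = s.clip(lower=lower, upper=upper)
```

With this sample, only the value 95 is flagged, and it is replaced by the upper fence in capped.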

Removing irrelevant data: Columns that are not relevant to the analysis can be dropped using the drop() method of pandas. Here's an example:

df = df.drop(columns=['column1', 'column2'])
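
Standardizing and normalizing data: The sklearn library provides several classes for scaling and normalizing data. Here's an example of scaling the data using the StandardScaler class, assuming df contains only numeric columns. Note that fit_transform() returns a NumPy array rather than a DataFrame: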

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
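
For normalization to a fixed range, sklearn also provides MinMaxScaler, which rescales each column to [0, 1] by default. The sketch below uses a small made-up DataFrame and wraps the result back into a DataFrame so the column names are kept; the data and names are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric data.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 65.0, 80.0],
})

scaler = MinMaxScaler()

# fit_transform returns a NumPy array, so rebuild a DataFrame with the original labels.
df_norm = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
```

When the data is split into training and test sets, the scaler is usually fitted on the training set only and then applied to the test set with transform().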

Handling inconsistent data: Inconsistent data, such as stray whitespace, mixed capitalization, or different spellings of the same value, can be handled by finding and correcting these discrepancies in the dataset. For example, leading and trailing whitespace can be removed from strings using the str.strip() method:

df['column1'] = df['column1'].str.strip()
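
Whitespace is often only one of several inconsistencies in a text column. As a minimal sketch, assuming a hypothetical column in which the same city is written in several ways, the pandas string methods can be chained to trim, collapse repeated spaces, and unify capitalization:

```python
import pandas as pd

# Hypothetical column with inconsistent spacing and capitalization.
df = pd.DataFrame({"column1": ["  New York", "new york ", "NEW  YORK", "Boston"]})

cleaned = (
    df["column1"]
    .str.strip()                           # remove leading and trailing whitespace
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace to one space
    .str.title()                           # capitalize the first letter of each word
)
```

After these steps the three spellings of the first city collapse into a single value, so grouping or counting by this column gives consistent results.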

Data integration: To combine data from different sources, pandas provides methods such as concat(), merge(), and join(). Here's an example of merging two DataFrames on a common column:

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, on='column')
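
By default, merge() performs an inner join, so rows without a match in both DataFrames are dropped. The sketch below, built on two small made-up DataFrames, contrasts that with an outer join (using the how and indicator parameters) and with concat(), which stacks rows instead of matching keys:

```python
import pandas as pd

# Hypothetical inputs that share the key column "column".
df1 = pd.DataFrame({"column": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"column": [2, 3, 4], "b": [20, 30, 40]})

# Inner join (the default): only keys present in both inputs.
inner = pd.merge(df1, df2, on="column")

# Outer join: all keys; the "_merge" column records where each row came from.
outer = pd.merge(df1, df2, on="column", how="outer", indicator=True)

# concat stacks rows on top of each other; ignore_index renumbers them.
stacked = pd.concat([df1, df1], ignore_index=True)
```

Inspecting the "_merge" column is a quick way to find keys that are missing from one of the sources.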

These are just a few examples of data cleaning methods that can be implemented in Python. Many more methods and libraries are available for data cleaning and preprocessing, and the right choice depends on the specific requirements of the dataset.
