Pandas is a popular Python library for data manipulation and analysis. In machine learning projects it is typically used to load, clean, and preprocess data before modeling. The sections below cover the Pandas functions and methods that come up most often in that work:
Reading data from a file:
Pandas provides functions to read data from various sources, such as CSV files, Excel workbooks, and SQL databases. The read_csv() function reads a CSV file into a DataFrame.
import pandas as pd
data = pd.read_csv('data.csv')
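As a minimal sketch of what read_csv() produces, the example below parses a small in-memory CSV instead of a file on disk. The column names and values are made up for illustration. It relies on the default behavior that empty fields and the string NA are read as missing values.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as 'data.csv'.
csv_text = """age,city,income
34,Paris,52000
,Berlin,48000
29,Paris,NA
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (number of rows, number of columns)
print(df.isna().sum())  # missing values detected per column
```

Checking the shape and the missing-value counts right after loading is a quick way to confirm the file was parsed as expected.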
Viewing data:
To inspect a Pandas DataFrame, use the head() method to display the first rows and the tail() method to display the last rows. Both show five rows by default and accept a row count as an argument.
print(data.head())
print(data.tail())
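The following sketch shows head() and tail() with an explicit row count, plus describe() for summary statistics of the numeric columns. The DataFrame and its column names are invented for the example.

```python
import pandas as pd

# Hypothetical data: ten rows, one numeric and one text column.
df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows
summary = df.describe()   # count, mean, std, min, quartiles, max for numeric columns

print(first_three)
print(last_two)
print(summary)
```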
Selecting data:
You can select data from a Pandas DataFrame using the loc[] and iloc[] indexers. loc[] selects by label or by a boolean condition, while iloc[] selects by integer position.
# select rows that satisfy a condition (boolean mask passed to loc)
data.loc[data['column_name'] == 'value']
# select rows at positions 1-9 and columns at positions 2-4 (the end of each slice is excluded)
data.iloc[1:10, 2:5]
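One detail that is easy to miss is that label slices with loc[] include the end label, while position slices with iloc[] exclude the end position. The sketch below illustrates this on a small DataFrame. The index labels and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "score": [10, 20, 30, 40]},
    index=["r1", "r2", "r3", "r4"],
)

# loc slices by label and includes the end label: rows r1, r2, r3.
by_label = df.loc["r1":"r3", "score"]

# iloc slices by position and excludes the end position: rows at positions 1 and 2.
by_position = df.iloc[1:3, 1]

# A boolean mask keeps the rows where the condition is True.
high = df.loc[df["score"] > 15, ["name"]]

print(by_label)
print(by_position)
print(high)
```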
Handling missing values:
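Pandas provides several methods to handle missing values in a DataFrame, such as isna(), fillna(), and dropna(). fillna() and dropna() return a new DataFrame by default, so assign the result if you want to keep it.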
# check for missing values
data.isna()
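# fill missing values in numeric columns with the column mean
# (numeric_only=True avoids an error when the DataFrame has non-numeric columns)
data.fillna(data.mean(numeric_only=True))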
# drop rows with missing values
data.dropna()
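The three methods above can be sketched together on a small DataFrame with mixed column types. The column names are invented for illustration. Note that filling with the column mean only affects numeric columns, so the missing text value remains.

```python
import pandas as pd

# Hypothetical data: None becomes NaN in the numeric column.
df = pd.DataFrame({
    "age": [30.0, None, 50.0],
    "city": ["Paris", None, "Rome"],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Fill numeric columns with their mean; non-numeric columns are left untouched.
filled = df.fillna(df.mean(numeric_only=True))

# Drop only the rows where 'city' is missing.
dropped = df.dropna(subset=["city"])

print(missing_per_column)
print(filled)
print(dropped)
```

Because none of these calls modify df in place, the original DataFrame still contains its missing values afterwards.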
Grouping data:
The groupby() function in Pandas is used to group data based on certain criteria and apply a function to each group.
# group data by column and calculate the mean of each numeric column
data.groupby('column_name').mean(numeric_only=True)
# group data by multiple columns and calculate sum
data.groupby(['column1', 'column2']).sum()
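To compute several statistics per group in one pass, groupby() can be combined with agg(). The sketch below uses hypothetical team and points columns.

```python
import pandas as pd

# Hypothetical data: two teams with two scores each.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [1, 3, 10, 20],
})

# Select the column to aggregate, then compute mean and sum for each team.
per_team = df.groupby("team")["points"].agg(["mean", "sum"])

print(per_team)
```

The result is indexed by the group key, with one column per requested statistic.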
Merging data:
Pandas provides functions to merge multiple DataFrames into a single DataFrame based on a common column.
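# merge two DataFrames (data1 and data2 are placeholders) on a shared column
# the default is an inner join, which keeps only keys present in both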
merged_data = pd.merge(data1, data2, on='column_name')
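The how parameter of pd.merge() controls which rows survive the merge. As a sketch with two invented tables, the example below contrasts the default inner join with a left join.

```python
import pandas as pd

# Hypothetical tables sharing an 'id' column.
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"id": [1, 1, 4], "amount": [5.0, 7.5, 9.0]})

# Inner join (default): only ids present in both tables.
inner = pd.merge(customers, orders, on="id")

# Left join: every customer is kept; customers without orders get NaN amounts.
left = pd.merge(customers, orders, on="id", how="left")

print(inner)
print(left)
```

A left join is often the safer choice when preparing training data, since an inner join can silently drop rows that have no match.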
Encoding categorical variables:
Most machine learning algorithms require categorical variables to be encoded as numeric values. Pandas provides get_dummies() for one-hot encoding. For label encoding, the example below uses LabelEncoder from scikit-learn, which is a separate library.
# one-hot encoding
pd.get_dummies(data['column_name'])
# label encoding (LabelEncoder comes from scikit-learn, not Pandas)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['column_name'] = le.fit_transform(data['column_name'])
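Both encodings can also be done with Pandas alone. The sketch below one-hot encodes a column inside a full DataFrame and uses pd.factorize() as a Pandas-only alternative for integer codes. The color and size columns are hypothetical. Unlike LabelEncoder, factorize() numbers the categories in order of first appearance rather than in sorted order.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# One-hot encode 'color' in place of the original column.
# dtype=int gives 0/1 columns instead of booleans.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

# Integer codes assigned in order of first appearance, plus the matching categories.
codes, uniques = pd.factorize(df["color"])

print(one_hot)
print(codes)
print(uniques)
```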