Data discretization transforms a continuous variable into a categorical one by grouping its values into bins. This is useful for machine learning algorithms that accept only categorical inputs. Here is an example of how to perform data discretization in Python.
Let's say we have a dataset of student grades with a continuous variable "Grade" ranging from 0 to 100. We want to discretize this variable into three categories: "low", "medium", and "high".
Here is the Python code to perform this discretization using the pandas library:
import pandas as pd
# Create a sample dataset
data = pd.DataFrame({'Grade': [75, 80, 90, 95, 60, 70, 85, 100]})
# Define the bins for the discretization
bins = [0, 69, 80, 100]
# Define the labels for the categories
labels = ['low', 'medium', 'high']
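# Apply the discretization using the pandas cut function; include_lowest=True
# keeps a grade of exactly 0 in the 'low' bin instead of leaving it as NaN
data['Grade_Category'] = pd.cut(data['Grade'], bins=bins, labels=labels, include_lowest=True)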
# Print the resulting dataset
print(data)
Output of the code:
Grade Grade_Category
0 75 medium
1 80 medium
2 90 high
3 95 high
4 60 low
5 70 medium
6 85 high
7 100 high
In this example, we used the pd.cut() function to discretize the "Grade" variable into three categories. The bins parameter lists the bin edges, so [0, 69, 80, 100] defines three ranges: up to 69, above 69 up to 80, and above 80 up to 100, with each upper edge included in its range. The labels parameter names those ranges in order, which is why there is one fewer label than there are edges. We stored the result in a new column called "Grade_Category" and printed the dataset with the new category variable.
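One detail worth checking is how pd.cut() treats values that sit exactly on a bin edge. The sketch below reuses the same bins and labels on a few hand-picked boundary grades (illustrative values, not part of the dataset above) and compares the default behavior with the include_lowest=True option. By default the intervals are closed on the right only, so a grade of exactly 0 falls outside the first bin and becomes NaN.

```python
import pandas as pd

# Boundary grades chosen for illustration only
grades = pd.Series([0, 69, 70, 80, 81, 100])

bins = [0, 69, 80, 100]
labels = ['low', 'medium', 'high']

# Default: intervals are (0, 69], (69, 80], (80, 100], so 0 is not covered
default = pd.cut(grades, bins=bins, labels=labels)

# include_lowest=True also closes the first interval on the left, so 0 is 'low'
inclusive = pd.cut(grades, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    'Grade': grades,
    'default': default,
    'include_lowest': inclusive,
})
print(comparison)
```

If the data can contain the minimum value of the range, passing include_lowest=True is a simple way to avoid an unexpected missing category.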