DPT Week 1
Questions:
Discuss various Data Preprocessing Methods
Answer:
Data preprocessing is a crucial step in the machine learning pipeline to clean, transform, and
prepare raw data for modeling. Below are some common preprocessing methods:
6. Feature Selection
Reducing the number of input features helps improve model performance and
interpretability.
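- Variance Threshold: Remove features whose variance is near zero, since a feature that
barely changes carries little information.
- Correlation Matrix: Where two features are highly correlated, drop one of them, since
they carry largely redundant information.
- Recursive Feature Elimination (RFE): Repeatedly fits a model, ranks the features by
their coefficients or importance scores, and drops the weakest until the desired number
remains.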
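As a rough illustration of the first two filters, the sketch below uses NumPy only. The helper names (high_variance_columns, uncorrelated_columns), the thresholds, and the synthetic data are assumptions made for this example, not part of the notes. RFE is left out because it needs a fitted model.

```python
import numpy as np


def high_variance_columns(X, threshold=0.0):
    """Return the indices of columns whose variance exceeds `threshold`."""
    return np.flatnonzero(X.var(axis=0) > threshold)


def uncorrelated_columns(X, max_corr=0.9):
    """Greedily keep a column only if its absolute correlation with every
    column kept so far is at most `max_corr`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= max_corr for k in kept):
            kept.append(j)
    return np.array(kept)


# Synthetic data (hypothetical): 200 rows, 4 columns.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([
    a,                                     # column 0: informative
    2 * a + 0.01 * rng.normal(size=200),   # column 1: near-duplicate of column 0
    b,                                     # column 2: independent of column 0
    np.full(200, 3.0),                     # column 3: constant, zero variance
])

# Filter on variance first: a constant column has undefined correlation.
keep_var = high_variance_columns(X, threshold=1e-8)
keep_corr = uncorrelated_columns(X[:, keep_var], max_corr=0.9)
selected = keep_var[keep_corr]
print(selected)
```

Here the constant column is dropped by the variance filter and the near-duplicate column by the correlation filter. The order matters in this sketch because the greedy correlation filter always keeps the earlier of two correlated columns.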
7. Dimensionality Reduction
High-dimensional data can lead to overfitting and make visualization harder.
- Principal Component Analysis (PCA): Reduces dimensions by projecting the data onto the
orthogonal directions (principal components) that capture the most variance.
- t-SNE/UMAP: Useful for visualizing high-dimensional data in lower dimensions (though
not typically used for modeling).
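To make the PCA idea concrete, here is a minimal sketch that computes principal components from the singular value decomposition of the centered data. The function name pca_svd and the synthetic data are assumptions for illustration; in practice a library implementation would normally be used instead.

```python
import numpy as np


def pca_svd(X, n_components):
    """Project X onto its top `n_components` principal components.

    Returns the projected data, the component directions (as rows), and
    the fraction of total variance explained by each kept component.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio


# Synthetic data (hypothetical): 5 observed features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

Z, components, ratio = pca_svd(X, n_components=2)
print(Z.shape)       # 300 rows, 2 columns
print(ratio.sum())   # share of total variance kept by the two components
```

Because the five features are generated from only two underlying factors plus a little noise, two components should retain almost all of the variance.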
10. Binning
Binning converts continuous variables into discrete intervals or bins.
- Equal-width binning: Divide the range of values into equal-sized intervals.
- Equal-frequency binning: Place the bin edges at quantiles so that each bin contains
roughly the same number of observations.
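The difference between the two strategies is easiest to see on skewed data. The sketch below is one possible NumPy implementation; the function names and the sample ages are made up for this example.

```python
import numpy as np


def equal_width_bins(values, n_bins):
    """Assign each value to one of `n_bins` intervals of equal width."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Pass only the interior edges so the maximum lands in the last bin.
    return np.digitize(values, edges[1:-1]), edges


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of `n_bins` intervals with edges at quantiles."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    return np.digitize(values, edges[1:-1]), edges


# Hypothetical, right-skewed sample.
ages = np.array([18, 19, 21, 22, 25, 30, 38, 45, 60, 90])

width_labels, width_edges = equal_width_bins(ages, 3)
freq_labels, freq_edges = equal_frequency_bins(ages, 3)

width_counts = np.bincount(width_labels, minlength=3)
freq_counts = np.bincount(freq_labels, minlength=3)
print(width_edges, width_counts)
print(freq_edges, freq_counts)
```

With equal-width bins the edges are 18, 42, 66 and 90, so seven of the ten ages fall into the first bin. The equal-frequency bins spread the observations almost evenly, at the cost of intervals with very different widths.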
Which preprocessing steps to apply depends on the dataset and the algorithm you plan to use.
Some models, such as decision trees, are largely insensitive to feature scaling, while others
(such as SVMs or neural networks) require careful scaling and transformation.