How do I make a good classifier?
Let’s do an example with binary classification.
- Collect data (raw samples).
  - Tabular: `pandas`
- Annotate data (multiple annotators → compute IRR to check reliability).
  - Annotators are the humans labeling your data (e.g., deciding whether an instance is positive or negative).
  - Since humans can disagree, IRR (inter-rater reliability) measures how consistently annotators label the same items.
  - Common metrics:
    - Cohen’s Kappa → for two annotators, adjusts for agreement by chance. → `sklearn.metrics.cohen_kappa_score` (sketched below)
    - Fleiss’ Kappa → for multiple annotators. → `statsmodels.stats.inter_rater`
    - Krippendorff’s Alpha → general, supports missing labels and different data types. → `krippendorff`
  - High IRR means your labels are reliable and can be trusted for training a classifier.
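A minimal sketch of the Cohen’s Kappa check for two annotators; the label arrays here are made up purely for illustration:

```python
# Quick inter-rater reliability check for two annotators (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```

A common rule of thumb treats kappa above roughly 0.6 as acceptable and above 0.8 as strong, but the right bar depends on how subjective your labeling task is.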
- Preprocess (tokenization, normalization, feature engineering, embeddings).
  - Text processing: `nltk`, `spaCy`
  - Basic vectorization: `sklearn.feature_extraction.text` (`CountVectorizer`, `TfidfVectorizer`); see the sketch below.
  - Deep embeddings: `transformers`
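A minimal TF-IDF sketch (the toy corpus is invented for illustration):

```python
# Turn raw text into TF-IDF features (toy corpus, for illustration only).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the product arrived broken",
    "great quality, would buy again",
    "terrible support and broken promises",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_docs, n_terms)
print(X.shape)
print(vectorizer.get_feature_names_out())
```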
- Split into train/validation/test sets.
  - Tabular: `pandas` (a stratified split sketch follows below)
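The note only names `pandas` here; a common companion for the split itself is scikit-learn’s `train_test_split` (an assumption on my part, not something the note prescribes). Stratifying keeps the class ratio the same across all three sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy tabular data (made up); in practice this is your annotated dataset.
df = pd.DataFrame({
    "f1": range(100),
    "f2": [i % 7 for i in range(100)],
    "label": [int(i % 5 == 0) for i in range(100)],  # ~20% positives
})
X, y = df[["f1", "f2"]], df["label"]

# 60/20/20 split (an arbitrary but common choice), stratified on the label.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```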
- Handle class imbalance (resample only the training data; keep using the validation set to tune hyperparameters):
  - Downsampling → randomly reduce majority-class samples.
    - Pros: balances quickly, smaller dataset.
    - Cons: throws away information.
  - Upsampling → duplicate or synthetically generate minority-class samples (e.g., SMOTE).
    - Pros: keeps all data, improves the minority-class signal.
    - Cons: may overfit (duplicates) or add artifacts (synthetic samples).
  - Tooling: `imbalanced-learn` (`imblearn`): `RandomUnderSampler`, `RandomOverSampler`, `SMOTE`, `ADASYN`. See the sketch below.
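A minimal SMOTE sketch with `imbalanced-learn`, applied to the training split only (the synthetic dataset here stands in for real training data):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic 90/10 imbalanced data standing in for a real training split.
X_train, y_train = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=42
)

# Oversample ONLY the training data; validation/test keep their natural distribution.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))
```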
- Train a classifier on the training set (e.g., logistic regression, random forest, neural net, whatever fits your problem).
  - Classic ML: `scikit-learn` (`LogisticRegression`, `RandomForestClassifier`); see the sketch below.
  - Boosting: `xgboost`, `lightgbm`
  - Deep learning: `pytorch`, `tensorflow`
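A minimal training sketch with scikit-learn on synthetic data. Note that `class_weight="balanced"` is scikit-learn’s built-in alternative to resampling, so drop it if you already upsampled the training set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# A simple, strong baseline before reaching for anything fancier.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(clf.score(X_val, y_val))  # plain accuracy; see the metrics discussion below
```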
- Tune the hyperparameters using the validation set:
  - Run an ablation study on important hyperparameters:
    - Ablation study = sweeping or adjusting the threshold over a set of values to study its effect. Classifiers output probabilities (e.g., P(y=1|x)), and you pick a threshold to turn those probabilities into binary predictions. The default is 0.5, but that may not be optimal. For example:
      - Lower threshold → ↑ recall, ↓ precision.
      - Higher threshold → ↑ precision, ↓ recall.
    - You can…
      - Plot Precision-Recall curves.
      - Plot ROC curves (TPR vs. FPR).
      - Compare metrics at multiple thresholds to select the right trade-off (maximize F1, enforce a recall floor, minimize false positives, etc.). The best value totally depends on your problem; see the sketch below.
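A sketch of that threshold sweep on the validation set, picking the F1-maximizing threshold (just one possible criterion; the model and data are synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

probs = (
    LogisticRegression(max_iter=1000)
    .fit(X_train, y_train)
    .predict_proba(X_val)[:, 1]  # probability of the positive class
)

# Precision/recall at every candidate threshold on the VALIDATION set.
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = int(np.argmax(f1[:-1]))  # the last precision/recall pair has no threshold
print(f"best threshold = {thresholds[best]:.2f}, F1 = {f1[best]:.2f}")
```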
- AFTER all of this, evaluate on the test set using metrics robust to imbalance (precision, recall, F1, ROC-AUC). Leave the test set in its original class distribution (no up/downsampling; leave it as is). A sketch follows below.
  - `sklearn.metrics` (accuracy, precision, recall, F1, ROC-AUC).
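A final-evaluation sketch, again self-contained with synthetic data; the 0.5 threshold here stands in for whatever threshold you tuned on validation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
# The test split keeps its natural class distribution: no resampling here.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
preds = (probs >= 0.5).astype(int)  # use the threshold you tuned on validation

print(classification_report(y_test, preds, digits=3))  # per-class precision/recall/F1
print("ROC-AUC:", roc_auc_score(y_test, probs))        # threshold-independent
```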