Data Preprocessing (IB CS A4.2): HL Guide

IB Computer Science A4.2 (HL) explained: why data cleaning matters, handling missing values and outliers, feature selection, and dimensionality reduction.

Jun 11, 2026

A machine learning model is only as good as the data you feed it. Topic A4.2 is HL only, and it is the unglamorous but crucial step of turning messy, real-world data into something a model can learn from well. Examiners reward students who can name the specific problems and the specific fixes.

This guide covers all three A4.2 understandings: the significance of data cleaning, the role of feature selection, and the importance of dimensionality reduction.

What does IB CS topic A4.2 cover? (HL only)

A4.2 is HL only and has three understandings: the significance of data cleaning, the role of feature selection, and the importance of dimensionality reduction. Together they make up data preprocessing, the work done before any model is trained.

Why does data preprocessing matter?

Real-world data is messy: it has gaps, errors, duplicates, and inconsistent formats. If you train on bad data, you get a bad model, the principle known as garbage in, garbage out.

Preprocessing is the pipeline that fixes this: clean the data, transform it into a consistent numerical form, then cut it down to the most useful features. In practice, good preprocessing often improves a model more than swapping in a fancier algorithm, which is why it is a topic in its own right.

What is data cleaning?

Data cleaning finds and fixes the problems in a dataset so they do not mislead the model.

There are four common issues. Missing values (empty cells or "n/a") can be imputed, filled in with the mean, median, or a placeholder, or the record can be removed. Outliers, values far from the rest, can be trimmed (removed) or capped (clamped to a sensible threshold), and tools like box plots help spot them. Duplicates, the same record stored twice, should be removed so they do not bias the model. Noise and errors, like inconsistent date formats or an impossible negative age, are fixed by standardising formats and correcting or removing invalid values.

Worked example: cleaning a customer table

Suppose a customer table has a row with a missing email, another with a negative income, and a duplicate of customer 002.

Missing email: flag it with a placeholder or exclude that field, rather than deleting the whole customer.
Negative income: an error; correct it (for example, take the absolute value if it is a sign mistake) or remove it.
Duplicate of 002: delete the repeated row so the customer is counted once.

The cleaned table now gives the model consistent, trustworthy data.

What is feature selection?

A feature is an input variable (a column). Feature selection is the process of keeping only the features that are useful for the prediction and dropping the irrelevant or redundant ones.

Fewer, better features make the model faster to train, easier to interpret, and less likely to overfit to noise. Crucially, the features you keep are still the original ones, just a smaller set of them, so the data stays easy to understand.

What is dimensionality reduction?

Dimensionality reduction also cuts the number of features, but instead of selecting a subset it combines the original features into a smaller set of new ones.

A common method is principal component analysis (PCA), which produces new combined features (components) that capture most of the information in far fewer dimensions. This tackles the problem that too many features slow training and hurt accuracy, sometimes called the curse of dimensionality. The trade-off is interpretability: the new components are mathematical combinations, so they are harder to explain than the original columns.

What is the difference between feature selection and dimensionality reduction?

Both reduce the number of features, but in different ways. Feature selection keeps a subset of the original features and discards the rest, so the results stay interpretable. Dimensionality reduction creates new combined features from the originals, which is more powerful for very high-dimensional data but produces components that are harder to interpret. Selection is "choose the best columns"; reduction is "merge columns into fewer new ones".

Common exam mistakes for IB CS A4.2

Confusing feature selection (keep a subset of original features) with dimensionality reduction (combine into new features).
Treating data cleaning as a minor afterthought. Bad data produces a bad model.
Removing all outliers automatically; some extreme values are genuine and meaningful.
Handling missing data only by deleting rows, when imputation often keeps more useful data.
Forgetting that A4.2 is HL only.

Quick recap of A4.2

Preprocessing turns messy real-world data into clean, model-ready input: garbage in, garbage out.
Data cleaning fixes missing values (impute or remove), outliers (trim or cap), duplicates, and noise.
Feature selection keeps a subset of the original, most relevant features.
Dimensionality reduction (such as PCA) combines features into fewer new components.
Both cut features, but selection keeps originals while reduction creates new ones. This subtopic is HL only.

Frequently asked questions

What is data cleaning?

Data cleaning is the process of finding and fixing problems in a dataset, such as missing values, outliers, duplicates, and inconsistent or invalid entries, so that they do not mislead a machine learning model. It is a key part of preprocessing.

How do you handle missing values?

Missing values can be imputed, meaning filled in with a sensible substitute such as the mean, median, or a placeholder, or the affected record can be removed if it is not needed. Imputation is often preferred because it keeps more of the data.

How do you handle outliers?

Outliers, values far from the rest of the data, are usually identified with tools like box plots, then either trimmed (removed) or capped (clamped to a chosen threshold). You should keep genuine extreme values rather than deleting every outlier automatically.

What is feature selection?

Feature selection is choosing the most relevant input features for a model and discarding irrelevant or redundant ones. It keeps a subset of the original features, which makes the model faster, easier to interpret, and less prone to overfitting.

What is dimensionality reduction?

Dimensionality reduction reduces the number of features by combining the originals into a smaller set of new features, for example using principal component analysis (PCA). The new components capture most of the information in fewer dimensions, though they are harder to interpret.

What is the difference between feature selection and dimensionality reduction?

Feature selection keeps a subset of the original features and drops the rest, so the data stays interpretable. Dimensionality reduction creates new combined features from the originals, which is more powerful for high-dimensional data but produces components that are harder to explain.

Looking for a printable summary? Grab the A4.2 Shuttle Learning revision sheet, a three-page knowledge organiser covering everything above.

Looking for an IB Computer Science tutor?

Hi, I'm Yuness, the tutor behind Shuttle Learning. I work one to one with IB Computer Science students at SL and HL, and I deliberately take on only a handful each year so every student gets my full attention. Most go on to earn the 6s and 7s they were aiming for, in the final exams and the IA alike.

If you would like that kind of support, book a free 15-minute call and tell me what you are stuck on. You can press BOOK A LESSON .