Chapter 5: Representing Data for Machines

5.1 Why Representation Matters

Machines do not understand text, images, or categories directly.
Machines work only with numbers.

Every type of data must be converted into numbers before it can be used in Machine Learning.

In real-world problems, data comes in many forms such as text, categories, or labels. However, Machine Learning models work using mathematical operations, and for that, everything must be represented numerically. This process of converting raw data into numbers is called data representation, and it is a critical step before training any model.

5.2 Everything Becomes Numbers

Text → numbers
Categories → numbers
Boolean values → numbers (0/1)

If data is not numeric, the model cannot process it.

For example:

“Yes” → 1, “No” → 0
Words → numerical representations
Images → pixel values (numbers)

No matter how complex the data looks, behind the scenes, it is always converted into numbers. This is the only language a machine understands.
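The conversions listed above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (encode_yes_no, encode_bools, flatten) and the tiny 2x2 image are assumptions made up for this example, not part of any library.

```python
# Minimal sketch: turning non-numeric values into numbers.
# All names here are hypothetical and chosen only for illustration.

YES_NO = {"Yes": 1, "No": 0}


def encode_yes_no(answers):
    """Map each "Yes"/"No" answer to 1/0."""
    return [YES_NO[answer] for answer in answers]


def encode_bools(flags):
    """Map each boolean to 1 (True) or 0 (False)."""
    return [int(flag) for flag in flags]


def flatten(image):
    """Turn a grid of pixel values into one flat list of numbers."""
    return [pixel for row in image for pixel in row]


# A made-up 2x2 grayscale image: each pixel is a number from 0 to 255.
image = [
    [0, 255],
    [128, 64],
]

print(encode_yes_no(["Yes", "No", "Yes"]))  # [1, 0, 1]
print(encode_bools([True, False]))          # [1, 0]
print(flatten(image))                       # [0, 255, 128, 64]
```

Words need richer numerical representations than a simple lookup table, so they are left out of this sketch.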

5.3 Types of Data (Important for Encoding)

Numerical data
Categorical data:
- Nominal
- Ordinal

Understanding the data type helps in choosing the correct encoding method.

Before converting data into numbers, it is important to understand what type of data we are dealing with. Different types of data require different handling methods. Choosing the wrong method can confuse the model and reduce performance.

Numerical Data

Data in the form of numbers
Can be used directly

Examples:

Age = 25
Salary = 50,000
Temperature = 32.5°C

Numerical data already has mathematical meaning, so models can use it directly. However, in some cases, we still need to scale or normalize it, which we will learn later.

Categorical Data

Data that represents categories
Cannot be used directly

Examples:

Color: Red, Blue
City: Mumbai, Delhi

Categorical data does not have direct numerical meaning, so we must convert it into numbers carefully.

Nominal Data

Categories with no order

Examples:

Color: Red, Blue, Green
City: Mumbai, Delhi

There is no ranking or order between categories.

Use One-Hot Encoding

Ordinal Data

Categories with meaningful order

Examples:

Size: Small, Medium, Large
Education: School, College, PhD

Here, order matters.

Use Label Encoding (carefully: the numbers must follow the order of the categories)

Explanation

The difference between nominal and ordinal data is very important. If we treat unordered data as ordered, the model may learn incorrect patterns. For example, assigning numbers to colors may imply that one color is “greater” than another, which is not true. However, for ordered data like “Small < Medium < Large,” numerical encoding makes sense because there is a natural ranking.
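To see why a false ordering matters, here is a small sketch that compares the "distance" between colors under two representations. The integer codes for the colors are arbitrary assumptions made for this example, and the function names are hypothetical.

```python
import math

# Arbitrary integer codes for a nominal feature (assumed for illustration).
label_codes = {"Red": 0, "Blue": 1, "Green": 2}

# One binary column per color.
one_hot = {
    "Red": [1, 0, 0],
    "Blue": [0, 1, 0],
    "Green": [0, 0, 1],
}


def label_distance(a, b):
    """Distance between two colors when each is a single integer."""
    return abs(label_codes[a] - label_codes[b])


def one_hot_distance(a, b):
    """Euclidean distance between two colors as one-hot vectors."""
    return math.dist(one_hot[a], one_hot[b])


# With integer codes, Green looks twice as far from Red as Blue does.
print(label_distance("Red", "Blue"))   # 1
print(label_distance("Red", "Green"))  # 2

# With one-hot vectors, every pair of colors is equally far apart.
print(round(one_hot_distance("Red", "Blue"), 3))   # 1.414
print(round(one_hot_distance("Red", "Green"), 3))  # 1.414
```

A model that relies on numeric differences would read the first representation as "Green is further from Red than Blue is," which is a pattern that does not exist in the data.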

5.4 Label Encoding

Converts categories into numbers
Each category gets a unique number

Example:

Small → 0
Medium → 1
Large → 2

Label Encoding is simple and useful when there is a natural order in the data.

Explanation

Label encoding assigns a number to each category. It works well for ordinal data where order matters. However, it should be used carefully for nominal data, as it may introduce a false sense of ranking.
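As a minimal sketch, label encoding for ordinal data can be written with an explicit, hand-written order. The names SIZE_ORDER, make_encoder, and encode are hypothetical and used only for this example.

```python
# The order is written out by hand so the codes match Small < Medium < Large.
SIZE_ORDER = ["Small", "Medium", "Large"]


def make_encoder(ordered_categories):
    """Assign 0, 1, 2, ... to categories in the order given."""
    return {category: code for code, category in enumerate(ordered_categories)}


def encode(values, mapping):
    """Replace each category with its number."""
    return [mapping[value] for value in values]


size_to_code = make_encoder(SIZE_ORDER)
print(size_to_code)  # {'Small': 0, 'Medium': 1, 'Large': 2}

sizes = ["Medium", "Small", "Large", "Small"]
print(encode(sizes, size_to_code))  # [1, 0, 2, 0]

# Pitfall: numbering the categories alphabetically breaks the real order.
alphabetical = make_encoder(sorted(SIZE_ORDER))
print(alphabetical)  # {'Large': 0, 'Medium': 1, 'Small': 2}
```

The last lines show why the word "carefully" matters: an encoder that numbers categories alphabetically would put Large below Small, so it is worth checking the order that any tool assigns.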

5.5 One-Hot Encoding

Converts categories into binary columns
Each category becomes a separate feature

Example:

Color	Red	Blue	Green
Red	1	0	0
Blue	0	1	0
Green	0	0	1

One-Hot Encoding ensures that all categories are treated equally.

Explanation

Instead of assigning a single number to each category, one-hot encoding creates a separate column for each category. In each row, only the column for that row's category is marked as “1” while the others remain “0.” This removes any false ordering and is widely used for nominal data.
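The idea can be sketched in plain Python as follows. The function name one_hot_encode is an assumption made for this example; real projects usually rely on a library for this step.

```python
def one_hot_encode(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[column_of[value]] = 1     # mark only the matching column
        rows.append(row)
    return rows


categories = ["Red", "Blue", "Green"]
colors = ["Red", "Blue", "Green", "Red"]

for color, row in zip(colors, one_hot_encode(colors, categories)):
    print(color, row)
# Red [1, 0, 0]
# Blue [0, 1, 0]
# Green [0, 0, 1]
# Red [1, 0, 0]
```

Each row contains exactly one 1, and the number of columns equals the number of categories, so a feature with many categories produces many columns.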

5.6 Numerical Data (Revisited)

Already numeric
May need scaling later

Examples:

Age
Income
Distance

Even though numerical data can be used directly, differences in scale (like age vs salary) can affect model performance. This is why scaling becomes important, which we will explore in later chapters.
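As a preview, here is a minimal sketch of one common scaling approach, min-max scaling, which maps each column to the range 0 to 1. The sample ages and salaries are made up for illustration, the function name min_max_scale is hypothetical, and later chapters may use a different method.

```python
def min_max_scale(values):
    """Rescale numbers to the range 0..1 using (v - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Made-up sample values on very different scales.
ages = [25, 35, 45]
salaries = [50_000, 75_000, 100_000]

print(min_max_scale(ages))      # [0.0, 0.5, 1.0]
print(min_max_scale(salaries))  # [0.0, 0.5, 1.0]
```

After scaling, both columns sit on the same 0 to 1 range, so the much larger salary numbers no longer dominate the age values simply because of their size.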

5.7 Why Representation Impacts Models

Incorrect encoding → wrong learning
Proper encoding → better performance

The way data is represented directly affects how the model learns patterns.

If data is represented incorrectly, the model may learn patterns that do not exist. For example, treating categories as ordered when they are not can mislead the model. Proper representation ensures that the model learns meaningful relationships.

5.8 A Clear Mental Model

Machines understand only numbers
Data must be converted into numeric form
Type of data decides encoding method

Good representation = better learning

5.9 Why This Chapter Matters

Prepares data for ML models
Helps choose correct encoding
Prevents common mistakes

This chapter builds a strong foundation for all upcoming steps like preprocessing, scaling, and model training.

5.10 What Comes Next

Now that you understand how to represent data correctly, the next step is:

How to create better features from existing data

In the next chapter, we will explore:

Feature Engineering: one of the most powerful skills in Machine Learning

Chapter Summary

Machines understand only numbers
Data must be converted into numeric form
Numerical data can be used directly
Categorical data needs encoding
Nominal → One-Hot Encoding
Ordinal → Label Encoding
Correct representation improves model performance
