Chapter 5: Representing Data for Machines
5.1 Why Representation Matters
- Machines do not understand text, images, or categories directly
- Machines understand only numbers
Every type of data must be converted into numbers before using it in Machine Learning.
In real-world problems, data comes in many forms such as text, categories, or labels. However, Machine Learning models work using mathematical operations, and for that, everything must be represented numerically. This process of converting raw data into numbers is called data representation, and it is a critical step before training any model.
5.2 Everything Becomes Numbers
- Text → numbers
- Categories → numbers
- Boolean values → numbers (0/1)
If data is not numeric, the model cannot process it.
For example:
- “Yes” → 1, “No” → 0
- Words → numerical representations
- Images → pixel values (numbers)
No matter how complex the data looks, behind the scenes, it is always converted into numbers. This is the only language a machine understands.
5.3 Types of Data (Important for Encoding)
Numerical data
Categorical data:
- Nominal
- Ordinal
Understanding data type helps in choosing the correct encoding method.
Before converting data into numbers, it is important to understand what type of data we are dealing with. Different types of data require different handling methods. Choosing the wrong method can confuse the model and reduce performance.
Numerical Data
- Data in the form of numbers
- Can be used directly
Examples:
- Age = 25
- Salary = 50,000
- Temperature = 32.5°C
Numerical data already has mathematical meaning, so models can directly use it. However, in some cases, we still need to scale or normalize it, which we will learn later.
Categorical Data
- Data that represents categories
- Cannot be used directly
Examples:
- Color: Red, Blue
- City: Mumbai, Delhi
Categorical data does not have direct numerical meaning, so we must convert it into numbers carefully.
Nominal Data
- Categories with no order
Examples:
- Color: Red, Blue, Green
- City: Mumbai, Delhi
There is no ranking or order between categories.
Use One-Hot Encoding
Ordinal Data
- Categories with meaningful order
Examples:
- Size: Small, Medium, Large
- Education: School, College, PhD
Here, order matters.
Use Label Encoding (carefully)
Explanation
The difference between nominal and ordinal data is very important. If we treat unordered data as ordered, the model may learn incorrect patterns. For example, assigning numbers to colors may imply that one color is “greater” than another, which is not true. However, for ordered data like “Small < Medium < Large,” numerical encoding makes sense because there is a natural ranking.
5.4 Label Encoding
- Converts categories into numbers
- Each category gets a unique number
Example:
- Small → 0
- Medium → 1
- Large → 2
Label Encoding is simple and useful when there is a natural order in the data.
Explanation
Label encoding assigns a number to each category. It works well for ordinal data where order matters. However, it should be used carefully for nominal data, as it may introduce a false sense of ranking.
5.5 One-Hot Encoding
- Converts categories into binary columns
- Each category becomes a separate feature
Example:
| Color | Red | Blue | Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
One-Hot Encoding ensures that all categories are treated equally.
Explanation
Instead of assigning numbers, one-hot encoding creates separate columns for each category. Only one column is marked as “1” while others remain “0.” This removes any false ordering and is widely used for nominal data.
5.6 Numerical Data (Revisited)
- Already numeric
- May need scaling later
Examples:
- Age
- Income
- Distance
Even though numerical data can be used directly, differences in scale (like age vs salary) can affect model performance. This is why scaling becomes important, which we will explore in later chapters.
5.7 Why Representation Impacts Models
- Incorrect encoding → wrong learning
- Proper encoding → better performance
The way data is represented directly affects how the model learns patterns.
If data is represented incorrectly, the model may learn patterns that do not exist. For example, treating categories as ordered when they are not can mislead the model. Proper representation ensures that the model learns meaningful relationships.
5.8 A Clear Mental Model
- Machines understand only numbers
- Data must be converted into numeric form
- Type of data decides encoding method
Good representation = better learning
5.9 Why This Chapter Matters
- Prepares data for ML models
- Helps choose correct encoding
- Prevents common mistakes
This chapter builds a strong foundation for all upcoming steps like preprocessing, scaling, and model training.
5.10 What Comes Next
Now that you understand how to represent data correctly, the next step is:
How to create better features from existing data
In the next chapter, we will explore:
Feature Engineering — the most powerful skill in Machine Learning
Chapter Summary
- Machines understand only numbers
- Data must be converted into numeric form
- Numerical data can be used directly
- Categorical data needs encoding
- Nominal → One-Hot Encoding
- Ordinal → Label Encoding
- Correct representation improves model performance
Docs