Skip to content
On this page

Chapter 5: Representing Data for Machines


5.1 Why Representation Matters

  • Machines do not understand text, images, or categories directly
  • Machines understand only numbers

Every type of data must be converted into numbers before using it in Machine Learning.

In real-world problems, data comes in many forms such as text, categories, or labels. However, Machine Learning models work using mathematical operations, and for that, everything must be represented numerically. This process of converting raw data into numbers is called data representation, and it is a critical step before training any model.


5.2 Everything Becomes Numbers

  • Text → numbers
  • Categories → numbers
  • Boolean values → numbers (0/1)

If data is not numeric, the model cannot process it.

For example:

  • “Yes” → 1, “No” → 0
  • Words → numerical representations
  • Images → pixel values (numbers)

No matter how complex the data looks, behind the scenes, it is always converted into numbers. This is the only language a machine understands.


5.3 Types of Data (Important for Encoding)

  • Numerical data

  • Categorical data:

    • Nominal
    • Ordinal

Understanding data type helps in choosing the correct encoding method.

Before converting data into numbers, it is important to understand what type of data we are dealing with. Different types of data require different handling methods. Choosing the wrong method can confuse the model and reduce performance.


Numerical Data

  • Data in the form of numbers
  • Can be used directly

Examples:

  • Age = 25
  • Salary = 50,000
  • Temperature = 32.5°C

Numerical data already has mathematical meaning, so models can directly use it. However, in some cases, we still need to scale or normalize it, which we will learn later.


Categorical Data

  • Data that represents categories
  • Cannot be used directly

Examples:

  • Color: Red, Blue
  • City: Mumbai, Delhi

Categorical data does not have direct numerical meaning, so we must convert it into numbers carefully.


Nominal Data

  • Categories with no order

Examples:

  • Color: Red, Blue, Green
  • City: Mumbai, Delhi

There is no ranking or order between categories.

Use One-Hot Encoding


Ordinal Data

  • Categories with meaningful order

Examples:

  • Size: Small, Medium, Large
  • Education: School, College, PhD

Here, order matters.

Use Label Encoding (carefully)


Explanation

The difference between nominal and ordinal data is very important. If we treat unordered data as ordered, the model may learn incorrect patterns. For example, assigning numbers to colors may imply that one color is “greater” than another, which is not true. However, for ordered data like “Small < Medium < Large,” numerical encoding makes sense because there is a natural ranking.


5.4 Label Encoding

  • Converts categories into numbers
  • Each category gets a unique number

Example:

  • Small → 0
  • Medium → 1
  • Large → 2

Label Encoding is simple and useful when there is a natural order in the data.


Explanation

Label encoding assigns a number to each category. It works well for ordinal data where order matters. However, it should be used carefully for nominal data, as it may introduce a false sense of ranking.


5.5 One-Hot Encoding

  • Converts categories into binary columns
  • Each category becomes a separate feature

Example:

ColorRedBlueGreen
Red100
Blue010

One-Hot Encoding ensures that all categories are treated equally.


Explanation

Instead of assigning numbers, one-hot encoding creates separate columns for each category. Only one column is marked as “1” while others remain “0.” This removes any false ordering and is widely used for nominal data.


5.6 Numerical Data (Revisited)

  • Already numeric
  • May need scaling later

Examples:

  • Age
  • Income
  • Distance

Even though numerical data can be used directly, differences in scale (like age vs salary) can affect model performance. This is why scaling becomes important, which we will explore in later chapters.


5.7 Why Representation Impacts Models

  • Incorrect encoding → wrong learning
  • Proper encoding → better performance

The way data is represented directly affects how the model learns patterns.

If data is represented incorrectly, the model may learn patterns that do not exist. For example, treating categories as ordered when they are not can mislead the model. Proper representation ensures that the model learns meaningful relationships.


5.8 A Clear Mental Model

  • Machines understand only numbers
  • Data must be converted into numeric form
  • Type of data decides encoding method

Good representation = better learning


5.9 Why This Chapter Matters

  • Prepares data for ML models
  • Helps choose correct encoding
  • Prevents common mistakes

This chapter builds a strong foundation for all upcoming steps like preprocessing, scaling, and model training.


5.10 What Comes Next

Now that you understand how to represent data correctly, the next step is:

How to create better features from existing data

In the next chapter, we will explore:

Feature Engineering — the most powerful skill in Machine Learning


Chapter Summary

  • Machines understand only numbers
  • Data must be converted into numeric form
  • Numerical data can be used directly
  • Categorical data needs encoding
  • Nominal → One-Hot Encoding
  • Ordinal → Label Encoding
  • Correct representation improves model performance

Built with VitePress