Chapter 4: Understanding Data


4.1 Why Data Matters More Than Models

  • Machine Learning models learn from data, not from hand-written rules
  • Better data → better model performance

A simple model with good data can outperform a complex model with poor data.

Many beginners focus too much on algorithms and ignore data. But in real-world Machine Learning, most of the effort goes into understanding and preparing data. If the data is incorrect, incomplete, or misleading, even the best model will fail. This is why experienced ML engineers spend more time analyzing data than choosing models.


4.2 Types of Data

  • Structured data
  • Unstructured data

The type of data determines how we process and use it.


Structured Data

  • Organized in rows and columns
  • Easy to store and analyze

Examples:

  • Excel files
  • Databases
  • CSV files

Structured data follows a fixed schema: each row represents a record, and each column represents a feature. For example, a table of house prices may include columns like size, location, and price. This type of data is commonly used in Machine Learning because it is easy to process, though it can still contain errors or missing values.
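As a small illustration of the row-and-column idea, the sketch below parses a tiny CSV table with Python's standard `csv` module. The column names (`size_sqft`, `location`, `price`) and all values are invented for this example.

```python
import csv
import io

# Hypothetical house-price table; the values are invented for illustration.
CSV_TEXT = """size_sqft,location,price
1200,suburb,250000
850,city,310000
2000,rural,180000
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)


rows = load_rows(CSV_TEXT)
columns = list(rows[0].keys())

print(columns)            # the column names of the table
print(len(rows))          # the number of records
print(rows[0]["price"])   # values are read as strings until converted
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting before a model can use them.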


Unstructured Data

  • No fixed format
  • Difficult to organize directly

Examples:

  • Text
  • Images
  • Audio

Unstructured data does not follow a tabular format. For example, an image is a grid of pixel values, and text is a sequence of words. Before this data can be used in Machine Learning, it must be converted into numerical form. This conversion step is part of what makes tasks like NLP and computer vision more complex.
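One simple way to turn text into numbers is to count words, often called a bag-of-words representation. The sketch below is a minimal version under simplifying assumptions (lowercasing and splitting on whitespace only); the helper names and sample sentences are made up for illustration, and real NLP pipelines usually do much more.

```python
def build_vocabulary(texts):
    """Map each distinct lowercase word to a column index, in sorted order."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


# Invented sample sentences.
texts = ["good movie", "bad movie", "good good plot"]
vocabulary = build_vocabulary(texts)
vectors = [to_count_vector(text, vocabulary) for text in texts]

print(vocabulary)  # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(vectors)     # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 2, 0, 1]]
```

Each sentence becomes a fixed-length list of numbers, which gives it the same row-and-column shape as structured data.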


4.3 Features and Labels

  • Features → input data
  • Labels → output (target)

The model learns the relationship between features and labels.


Example

House Price Prediction:

  • Features:

    • Size of house
    • Number of rooms
    • Location
  • Label:

    • Price of house

Features are the information we give to the model, and the label is what we want the model to predict. The model’s job is to learn how features influence the label.


Explanation

Think of features as clues and the label as the answer. The model studies these clues and learns patterns that connect them to the final outcome. Choosing the right features is critical because irrelevant or poor-quality features can confuse the model.
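In code, the house price example usually means splitting each record into an input part and a target part. The sketch below assumes records stored as dictionaries; the field names and values are hypothetical.

```python
# Hypothetical records; the values are invented for illustration.
houses = [
    {"size_sqft": 1200, "rooms": 3, "location": "suburb", "price": 250000},
    {"size_sqft": 850, "rooms": 2, "location": "city", "price": 310000},
    {"size_sqft": 2000, "rooms": 4, "location": "rural", "price": 180000},
]

FEATURE_NAMES = ["size_sqft", "rooms", "location"]  # the clues
LABEL_NAME = "price"                                # the answer


def split_features_and_labels(records, feature_names, label_name):
    """Return (features, labels): one feature list and one label per record."""
    features = [[record[name] for name in feature_names] for record in records]
    labels = [record[label_name] for record in records]
    return features, labels


X, y = split_features_and_labels(houses, FEATURE_NAMES, LABEL_NAME)

print(X[0])  # [1200, 3, 'suburb']
print(y[0])  # 250000
```

The names `X` for features and `y` for labels are a common convention, not a requirement.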


4.4 Training Data vs Test Data

  • Training data → used to teach the model
  • Test data → used to evaluate the model

Never mix training and test data.


Explanation

The model learns using training data. However, to check if it has truly learned or just memorized, we test it on unseen data called test data. This helps us measure how well the model will perform in real-world scenarios.


Simple Analogy

  • Training → studying for the exam
  • Test → the actual exam

If a student only memorizes questions from practice tests, they may fail the real exam. Similarly, a model must generalize, not memorize.
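A split like this can be sketched in a few lines. The function below is a hand-written illustration, not a library API: `split_train_test`, the 20% test fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction as test data."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


records = list(range(10))  # stand-in for 10 data records
train, test = split_train_test(records, test_fraction=0.2, seed=42)

print(len(train), len(test))   # 8 2
print(set(train) & set(test))  # set(), no record appears in both
```

Shuffling before splitting matters when the records are stored in a meaningful order, such as sorted by price. Data that is ordered in time usually needs a different kind of split.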


4.5 Data Leakage (Very Important)

  • When the model gets information it would not have at prediction time
  • Leads to misleadingly high performance

Data leakage is one of the most common mistakes in ML.


Example

Suppose you are predicting whether a student will pass an exam, and your dataset includes the final result (pass/fail) as an input feature.

The model will easily “predict” the result because it already has the answer. This is not real learning—it is cheating.


Explanation

Data leakage happens when information from the future or from the test set accidentally enters the training data. This makes the model look very accurate during testing but fail in real-world use. Avoiding leakage is critical for building reliable systems.
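The pass/fail example can be turned into a simple sanity check. The sketch below flags any input column that is an exact copy of the label. It is a deliberately narrow heuristic with hypothetical field names and invented records; real leakage is often subtler and will not be caught by a check like this.

```python
# Hypothetical student records; "passed" is the label we want to predict.
students = [
    {"hours_studied": 2, "final_result": "fail", "passed": "fail"},
    {"hours_studied": 9, "final_result": "pass", "passed": "pass"},
    {"hours_studied": 5, "final_result": "pass", "passed": "pass"},
    {"hours_studied": 5, "final_result": "fail", "passed": "fail"},
]


def find_label_copies(records, feature_names, label_name):
    """Return the features whose value equals the label in every record."""
    return [
        name
        for name in feature_names
        if all(record[name] == record[label_name] for record in records)
    ]


leaky = find_label_copies(students, ["hours_studied", "final_result"], "passed")
print(leaky)  # ['final_result']
```

A column flagged this way should be dropped from the features before training, because it would not be available when making a real prediction.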


4.6 Quality of Data

  • Clean data → better results
  • Noisy data → poor performance

Garbage in → garbage out


Explanation

If your data contains errors, missing values, or irrelevant information, the model will learn incorrect patterns. For example, if house prices are recorded incorrectly, the model will make wrong predictions. Ensuring data quality is one of the most important responsibilities of an ML engineer.
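A basic quality check might look like the sketch below, which separates usable house records from ones with missing or impossible values. The validity rules (positive numeric size and price) and the sample records are assumptions made for illustration; real rules depend on the dataset.

```python
def is_valid(record):
    """Accept a record only if size and price are positive numbers."""
    size = record.get("size_sqft")
    price = record.get("price")
    return (
        isinstance(size, (int, float)) and size > 0
        and isinstance(price, (int, float)) and price > 0
    )


# Invented records, two of them deliberately broken.
raw_records = [
    {"size_sqft": 1200, "price": 250000},
    {"size_sqft": None, "price": 310000},   # missing value
    {"size_sqft": 2000, "price": -180000},  # recording error
    {"size_sqft": 900, "price": 140000},
]

clean = [record for record in raw_records if is_valid(record)]
rejected = [record for record in raw_records if not is_valid(record)]

print(len(clean), len(rejected))  # 2 2
```

Keeping the rejected records, rather than silently discarding them, makes it easier to see how much data was lost and why.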


4.7 A Clear Mental Model

  • Data is the foundation
  • Features are inputs
  • Labels are outputs
  • Model learns patterns from data

Better data usually leads to better learning.


4.8 Why This Chapter Matters

  • Helps you think like an ML engineer
  • Prevents common beginner mistakes
  • Improves model performance

Understanding data is not just a step—it is the foundation of Machine Learning. If you master this, every algorithm you learn later will make more sense and work better.


4.9 What Comes Next

Now that you understand data, the next step is:

How to convert data into numbers so that machines can understand it

In the next chapter, we will explore:

Representing Data for Machines


Chapter Summary

  • Data quality often matters more than model choice
  • Two types: structured and unstructured
  • Features are inputs, labels are outputs
  • Training and test data must be separate
  • Data leakage must be avoided
  • Data quality directly impacts results
