Introduction

DL vs ML

Deep learning is an evolution of traditional machine learning. The more high-quality data we have, the better deep learning algorithms tend to perform. This is not always the case for traditional ML, whose performance can plateau beyond a certain amount of data.

Deep learning is based on Artificial Neural Networks (ANN), inspired by how the human brain works.

ML: Linear regression / Decision trees
DL: Artificial Neural Networks (ANN)

ANNs can capture deeper patterns than traditional ML. Deep learning uses multiple layers; the more layers, the deeper the learned representation — hence the name "Deep Learning".

Deep Learning vs Others

Deep learning typically requires significant compute power; in practice, training is usually accelerated with GPUs.

Perceptron

Neural networks are built from layers. Each layer takes inputs and applies parameters to produce outputs.

A single neuron (perceptron) has: a weight vector $w$, a bias $b$, and an activation function $f$.

Dot product: $$ w^\top x \,=\, \sum_{j=1}^{n} w_j x_j $$

Perceptron output for an input $x$: $$ h \,=\, f(w^\top x + b) $$
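As a minimal sketch of this formula (NumPy and the step activation are illustrative choices, not fixed by the text):

import numpy as np

def step(z):
    # Heaviside step: 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

def perceptron(x, w, b, f=step):
    # h = f(w^T x + b)
    return f(np.dot(w, x) + b)

x = np.array([2.0, 3.0])
w = np.array([0.5, -1.0])
b = 0.25
print(perceptron(x, w, b))  # f(0.5*2 - 1.0*3 + 0.25) = f(-1.75) -> 0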

[Figure: Perceptron schema]

The perceptron is a linear model, most effective when data can be linearly separated.

Dot product

For a point $x = (x_1, x_2)$ and a weight vector $w = (w_1, w_2)$, the decision boundary is the set of points where $w^\top x = 0$; it separates the positive side ($w^\top x > 0$) from the negative side ($w^\top x < 0$).

[Figure: Dot product]

Linear separability

To use dot products for classification, the data should be linearly separable (or nearly so). Example: the Iris dataset, using sepal width and sepal length as features.

Goal: find a decision boundary that separates the species (green: Iris setosa, orange: Iris virginica).

[Figure: Iris decision boundary]

Separating hyperplane: $$ \{ x \in \mathbb{R}^2 : \langle x, w \rangle = x_1 w_1 + x_2 w_2 = 0 \} $$
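As an illustrative check (the use of scikit-learn and this feature/class selection are assumptions, not part of the original text), a linear classifier fits these two species perfectly when they are linearly separable in the sepal plane:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
# Keep only setosa (class 0) and virginica (class 2),
# using sepal length and sepal width as features
mask = np.isin(iris.target, [0, 2])
X = iris.data[mask][:, :2]
y = iris.target[mask]

clf = Perceptron().fit(X, y)
print(clf.score(X, y))  # 1.0 means every point is on the correct side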

Loss function

We count classification errors for a given $w$.

Given $n$ points $X = (x_i)_{i=1,\dots,n}$ with $x_i \in \mathbb{R}^d$, and labels $Y = (y_i)_{i=1,\dots,n}$ with $y_i \in \{0,1\}$.

Classifier: $$ f(x_i, w) = \mathbf{1}[\langle x_i, w \rangle \ge 0] $$

Empirical 0/1 loss: $$ g(w,X,Y) = \sum_{i=1}^{n} \mathbf{1}[ f(x_i, w) \ne y_i ] $$

We minimize this loss to find the best separating hyperplane.
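A minimal NumPy sketch of this classifier and its 0/1 loss (function names are illustrative):

import numpy as np

def classify(X, w):
    # f(x_i, w) = 1[<x_i, w> >= 0], applied row-wise
    return (X @ w >= 0).astype(int)

def zero_one_loss(w, X, Y):
    # g(w, X, Y): number of misclassified points
    return int(np.sum(classify(X, w) != Y))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [3.0, -1.0]])
Y = np.array([1, 0, 1])
w = np.array([1.0, -0.5])
print(zero_one_loss(w, X, Y))  # 0: this w separates the toy data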

[Figure: Loss functions list]
[Figures: Iris loss functions (1 and 2)]

Activation function

Activation functions introduce non-linearity, keep the loss differentiable (so that gradient-based optimization is possible), and map outputs to a range appropriate for the task.

Hidden layers: often ReLU / Leaky ReLU.

Output layer depends on task: binary → Sigmoid; multiclass → Softmax; regression → Linear.

[Figure: Activation functions]
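A quick NumPy sketch of these activations (illustrative implementations, not tied to a particular library):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(relu(z), sigmoid(z), softmax(z))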

Gradient descent

With differentiable activations, we can use gradient descent to optimize $w$.

Example: if $f(x)=x^2$, then $f'(x)=2x$.

[Figures: Gradient descent, starting point and $x_{\min}$]

The learning rate $\lambda$ controls the step size. Iterate until $|f'(x_k)| < \varepsilon$ for some tolerance $\varepsilon$.

Update rule: $$ x_{k+1} = x_k - \lambda f'(x_k) $$
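A minimal sketch of this update rule on $f(x) = x^2$ (the learning rate and tolerance values are arbitrary illustration choices):

def gradient_descent(df, x0, lr=0.1, tol=1e-6, max_iter=10_000):
    # x_{k+1} = x_k - lr * f'(x_k), stopped when |f'(x_k)| < tol
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:
            break
        x -= lr * g
    return x

# f(x) = x^2, so f'(x) = 2x; the minimum is at x = 0
print(gradient_descent(lambda x: 2 * x, x0=5.0))  # close to 0.0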

[Figure: Convex gradient descent]

Limitations: gradient descent is guaranteed to reach a global minimum only for convex functions. In deep learning, the loss is non-convex, so gradient methods often converge to a local minimum.

Therefore, the learning rate is a critical hyperparameter for DL model performance.

Predictions with neural networks on tabular data

Context

We will train a multilayer perceptron on the Iris dataset (3 species). The goal is to classify the species from tabular features.

Multilayer perceptron (MLP)

An MLP is a sequence of layers where each layer’s input is the previous layer’s output. For three layers and input $x$:

$$ H_1 = \text{Layer}_1(x),\quad H_2 = \text{Layer}_2(H_1),\quad O = \text{Layer}_3(H_2). $$
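As an illustrative forward pass (layer sizes and the use of NumPy are assumptions), the composition can be written directly:

import numpy as np

def dense(x, W, b, f):
    # One layer: f(W x + b)
    return f(W @ x + b)

relu = lambda z: np.maximum(0.0, z)
rng = np.random.default_rng(0)

x = rng.normal(size=4)                        # 4 input features (e.g. Iris)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(3, 8)), np.zeros(3)

H1 = dense(x, W1, b1, relu)          # H1 = Layer_1(x)
H2 = dense(H1, W2, b2, relu)         # H2 = Layer_2(H1)
O = dense(H2, W3, b3, lambda z: z)   # O = Layer_3(H2)
print(O.shape)  # (3,)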

[Figure: Layers density]

API styles: Keras offers a Sequential API (a linear stack of layers) and a Functional API (arbitrary layer graphs); the examples below use the Sequential API.

Training pipeline

  1. Architecture definition: choose number of layers/neurons and data flow
  2. Compilation: loss, metrics, optimizer (e.g., Adam)
model.compile(
    loss="name_loss_function",
    optimizer="name_optimizer",
    metrics=["name_metric"],
)

  3. Training: run fit across epochs
model.fit(
    X_train, y_train,
    validation_split=p,
    epochs=nb_epochs,
    batch_size=batch_size,
)

Key notions: batch_size is the number of samples per weight update (each batch goes through model → prediction → loss → weight update); an epoch is one full pass over the training data.

[Figure: Batch size illustration]

validation_split holds out a fraction of the training data; the model is evaluated on it at the end of each epoch.

Sequential model with Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# Iris: 4 input features, 3 output classes
model.add(Dense(units=64, activation="relu", input_shape=(4,)))
model.add(Dense(units=3, activation="softmax"))

# Compile before fitting: integer labels -> sparse categorical cross-entropy
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)

model.fit(X, y, epochs=nb_epochs, batch_size=batch_size, validation_split=p)

Model performance

Evaluate with a confusion matrix on the test set. Since the network outputs class probabilities, convert them to class IDs with NumPy's argmax before computing scikit-learn metrics. For multi-class tasks with integer labels, a common loss is sparse_categorical_crossentropy.
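A minimal evaluation sketch (assuming a held-out test set X_test, y_test and the trained model above):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Probabilities -> predicted class IDs
probs = model.predict(X_test)
y_pred = np.argmax(probs, axis=1)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))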