Neural Networks (and more!) with PyTorch

Short Course - XI WPSM

Victor Coscrato

Introduction: What is PyTorch?

What is PyTorch?

  • A high-performance numerical computing library for Python.
  • Origins: Created by Facebook’s AI Research lab (FAIR) to overcome the flexibility limitations of existing frameworks at the time.
  • Open Source: It has been an open-source library since its inception in 2016, enabling rapid adoption and contributions from the global community.
  • Governance: Although it has always been open, governance shifted in 2022 from Meta to the PyTorch Foundation (under the Linux Foundation), ensuring neutral and collaborative management across multiple companies.
  • Key differentiator: Introduced the concept of dynamic computation graphs.

Why PyTorch?

  • Automatic differentiation (autograd): can automatically compute the gradient (the derivative) of any function you define. This eliminates the need to manually derive complex loss functions, a tedious and extremely error-prone process.
  • Hardware acceleration (GPU): enables the execution of mathematical operations on tensors (multidimensional arrays, like NumPy’s) in parallel on GPUs, which is essential for training deep neural network models in a reasonable time.
  • Flexibility: It is hard to think of a parametric model that cannot be implemented in PyTorch.

Tip

Think of PyTorch as NumPy with “superpowers”.

Why PyTorch?

  • Maturity: mature and stable, with a large and active community.
  • Popularity: dominates academic research, e.g. 80% of NeurIPS 2023 papers use PyTorch.
  • Momentum and feedback loop: maturity and popularity create a virtuous cycle of development and adoption.

Examples - Reinforcement Learning

Source: thumbnail from GothamChess (YouTube).

Examples - Computer Vision

Source: demo video from Ultralytics.

Examples - Generative AI

Logos: Gemini, ChatGPT, and Claude.

OK, but why should I care?

Tip

Even if you don’t plan to train deep neural networks, PyTorch can help you – a lot.

Important

This short course will cover not only neural networks, but also everyday probabilistic and statistical applications.

Part 1: PyTorch Basics

Tensors

  • The fundamental data unit in PyTorch. Analogous to NumPy’s ndarrays, but with “superpowers”.
  • A tensor has essential attributes:
    • shape: the dimensionality of the tensor (e.g., scalar, vector, matrix).
    • dtype: the data type (torch.float32, torch.long, etc.).
    • device: where the tensor is allocated in memory (cpu or cuda).

Tensors

import torch
import numpy as np

# Creating tensors in various ways
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
print(f"Tensor from list:\n{x}\n")

# Attributes
print(f"Shape: {x.shape}")
print(f"Data type: {x.dtype}")
print(f"Device: {x.device}\n")

# Interoperability with NumPy (they share the same memory on CPU)
a_np = np.array([5, 6, 7])
a_pt = torch.from_numpy(a_np)
print(f"Tensor from NumPy: {a_pt}")
a_pt[0] = 99
print(f"Original NumPy was modified: {a_np}")
Tensor from list:
tensor([[1., 2.],
        [3., 4.]])

Shape: torch.Size([2, 2])
Data type: torch.float32
Device: cpu

Tensor from NumPy: tensor([5, 6, 7])
Original NumPy was modified: [99  6  7]

Tensor Operations

The syntax is familiar to NumPy users.

x = torch.tensor([[1., 2.], [3., 4.]])
y = torch.tensor([[5., 6.], [7., 8.]])

# Element-wise operations
print("Sum:", x + y)
print("Product (Hadamard):", x * y)

# Matrix multiplication
print("Matrix Product (@):", x @ y)
print("Matrix Product (matmul):", torch.matmul(x, y))

# Reshaping
print("Original shape:", x.shape)
print("Reshaped (view):", x.view(4, 1).shape)
Sum: tensor([[ 6.,  8.],
        [10., 12.]])
Product (Hadamard): tensor([[ 5., 12.],
        [21., 32.]])
Matrix Product (@): tensor([[19., 22.],
        [43., 50.]])
Matrix Product (matmul): tensor([[19., 22.],
        [43., 50.]])
Original shape: torch.Size([2, 2])
Reshaped (view): torch.Size([4, 1])
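One piece of NumPy-style behavior worth making explicit is broadcasting: shapes are aligned from the right, and dimensions of size 1 are expanded automatically. A minimal sketch (the tensors here are illustrative):

```python
import torch

# Shapes (1, 3) and (2, 1) broadcast to (2, 3)
row = torch.tensor([[1., 2., 3.]])
col = torch.tensor([[10.], [20.]])

result = row + col
print(result)
# tensor([[11., 12., 13.],
#         [21., 22., 23.]])
```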

GPU Acceleration

Moving tensors to the GPU is trivial and can result in speed gains of orders of magnitude.

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Create tensors on the chosen device
x_gpu = torch.randn(1000, 1000, device=device)
y_gpu = torch.randn(1000, 1000, device=device)

# The operation is executed on the device
z_gpu = x_gpu @ y_gpu

# To use with libraries like Matplotlib or NumPy, move back to CPU
z_cpu = z_gpu.to("cpu")
print(z_cpu.shape)
Using device: cuda
torch.Size([1000, 1000])

The Tensor “Superpower”

Important

autograd is the system that tracks all operations on tensors to automatically compute gradients. It is the foundation for model optimization.

  • Dynamic computation graph: PyTorch builds a graph “on-the-fly”.
  • Tracking: for a tensor x, if x.requires_grad=True, PyTorch records its operation “history”.
  • Gradient computation: by calling .backward() on a scalar (e.g., loss function), PyTorch applies the chain rule.

autograd in Practice

Let’s compute the derivative of \(y = x^2 \sin(x)\) at \(x = 2\).

Tip

The analytical derivative is \(2x \sin(x) + x^2 \cos(x)\).

# Define the tensor and enable gradient tracking
x = torch.tensor(2.0, requires_grad=True)

# Define the function
y = x**2 * torch.sin(x)

# Compute the gradient
y.backward()

# The gradient is stored in x.grad
print(f"Gradient of y with respect to x at x=2: {x.grad.item():.4f}")

# Analytical value for comparison
analytic_grad = 2 * 2 * np.sin(2) + 2**2 * np.cos(2)
print(f"Analytical value: {analytic_grad:.4f}")
Gradient of y with respect to x at x=2: 1.9726
Analytical value: 1.9726

Best Practices with Tensors

  • requires_grad=True only for parameters: input data generally does not need gradients.
  • Use with torch.no_grad() during inference to save memory and time.
w = torch.tensor(1.0, requires_grad=True)
x_obs = torch.tensor(2.0)  # observed data

y_hat = w * x_obs
y_hat.backward() # We'll see more about this later
print("gradient at w:", w.grad)

with torch.no_grad():
    pred = w * 10
print("prediction without tracking gradient:", pred)
gradient at w: tensor(2.)
prediction without tracking gradient: tensor(10.)

The Dynamic Graph and grad_fn

In PyTorch, the result of an operation on tensors that require gradients carries a pointer (grad_fn) to the operation that created it.

a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
d = a * b + a

print("d:", d)
print("grad_fn of d:", d.grad_fn)
d: tensor(20., grad_fn=<AddBackward0>)
grad_fn of d: <AddBackward0 object at 0x7feed1175900>

Visualizing the Computation Graph

from torchviz import make_dot  # requires the torchviz package
make_dot(d, params={'d': d, 'a': a, 'b': b})

  • Blue rectangles are the leaves (our parameters)
  • Gray rectangles are the intermediate operations
  • The arrow indicates the information flow (forward pass)

Computing the Derivatives

make_dot(d, params={'d': d, 'a': a, 'b': b})

We can compute gradients automatically through the backward pass: PyTorch traverses the graph in reverse and applies the chain rule to compute partial derivatives.

d.backward()
print("dd/da:", a.grad.item())
print("dd/db:", b.grad.item())
dd/da: 4.0
dd/db: 5.0

Tip

Recall that:

  • \(a = 5\) and \(b = 3\)
  • \(d = a \cdot b + a\)

It follows that:

  • \(\frac{\partial d}{\partial a} = b + 1 = 4\)
  • \(\frac{\partial d}{\partial b} = a = 5\).

Summary of the Basics

  • Tensors are PyTorch’s core data structure (shape, dtype, device).
  • requires_grad=True enables tracking for parameters that will be optimized.
  • autograd builds the dynamic graph and computes gradients with .backward().

Important

With this, we already have the basics to define models and train parameters via numerical optimization.

Part 2: Statistical Models with PyTorch

Example: Linear Regression

Let’s recall the analytical least squares solution for the linear regression model:

\[ y = X \beta + \epsilon \]

The quadratic loss is: \[ L(\beta) = \sum_{i=1}^{n} (y_i - X_i^T\beta)^2 \]

Differentiating and setting to zero, we obtain the closed-form solution:

\[ \hat{\beta} = (X^T X)^{-1}X^Ty \]

Let’s compare this solution with optimization via autograd.

Analytical Solution

# Generate synthetic data
np.random.seed(0)
n_samples = 10000
n_features = 3
X = np.random.rand(n_samples, n_features)
true_beta = np.array([2.0, -3.5, 1.0])
y = X @ true_beta + np.random.randn(n_samples) * 0.5

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).view(-1, 1)

# Analytical solution
XTX_inv = torch.inverse(X_tensor.t() @ X_tensor)
beta_analytical = XTX_inv @ X_tensor.t() @ y_tensor
print("Analytical solution for beta:")
print(beta_analytical.numpy())
Analytical solution for beta:
[[ 1.9843934]
 [-3.5015233]
 [ 1.0014043]]

Numerical Solution with autograd

beta_opt = torch.randn(n_features, 1, requires_grad=True)
optimizer = torch.optim.SGD([beta_opt], lr=0.01)

n_epochs = 10000
for epoch in range(n_epochs):
    optimizer.zero_grad()
    y_pred = X_tensor @ beta_opt
    loss = torch.mean((y_tensor - y_pred) ** 2)
    loss.backward()
    optimizer.step()

print("Optimized beta via autograd:")
print(beta_opt.detach().numpy())
# make_dot(loss, params={"loss": loss, "beta_opt": beta_opt})
Optimized beta via autograd:
[[ 1.9843494]
 [-3.5014358]
 [ 1.0013658]]

Interpretation and Practical Gain

  • In the example, autograd recovers a solution very close to the analytical one.
  • The real gain is not finding the least squares solution, but having a universal optimizer for models without closed-form solutions.
  • The same code pattern extends to logistic regression, neural networks, and custom models.

Example: Logistic Regression

For logistic regression:

\[ P(y=1|X) = \sigma(X^T\beta), \quad \sigma(z)=\frac{1}{1+e^{-z}} \]

The loss function is the negative log-likelihood:

\[ L(\beta) = -\sum_{i=1}^{n} \left[ y_i \log(\sigma(X_i^T\beta)) + (1-y_i)\log(1-\sigma(X_i^T\beta)) \right] \]

Numerical Solution with autograd

# Generate synthetic data
np.random.seed(0)
n_samples = 10000
n_features = 3
X = np.random.rand(n_samples, n_features)
true_beta = np.array([-2.0, 1.5, -1.0])
logits = X @ true_beta
probabilities = 1 / (1 + np.exp(-logits))
y = np.random.binomial(1, probabilities)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y.reshape(-1, 1), dtype=torch.float32)

Numerical Solution with autograd

beta_opt = torch.randn(n_features, 1, requires_grad=True)
bias_opt = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([beta_opt, bias_opt], lr=0.1)

n_epochs = 10000
for epoch in range(n_epochs):
    optimizer.zero_grad()
    logits = X_tensor @ beta_opt + bias_opt
    y_pred = torch.sigmoid(logits)
    loss = -torch.mean(y_tensor * torch.log(y_pred) +
        (1 - y_tensor) * torch.log(1 - y_pred))
    loss.backward()
    optimizer.step()

print("Optimized beta via autograd:")
print(beta_opt.detach().numpy(), bias_opt.detach().numpy())
# make_dot(loss, params={"loss": loss, "beta_opt": beta_opt, "bias_opt": bias_opt})
Optimized beta via autograd:
[[-1.9756057]
 [ 1.5226276]
 [-0.9161564]] [-0.06478492]

Transition to torch.nn

So far, we have worked with tensors and autograd directly. The torch.nn API organizes this into reusable modules. Let’s repeat logistic regression using nn.Module.

import torch.nn as nn

class LogisticRegressionModel(nn.Module):
    def __init__(self, n_features):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

Logistic Regression with torch.nn

import torch.optim as optim

model_logistic = LogisticRegressionModel(n_features)
criterion = nn.BCELoss()
optimizer = optim.SGD(model_logistic.parameters(), lr=0.1)

n_epochs = 10000
for epoch in range(n_epochs):
    optimizer.zero_grad()
    y_pred = model_logistic(X_tensor)
    loss = criterion(y_pred, y_tensor)
    loss.backward()
    optimizer.step()

print("Optimized beta via torch.nn for logistic regression:")
print(model_logistic.linear.weight.detach().numpy(), model_logistic.linear.bias.detach().numpy())
# make_dot(loss, params=dict(model_logistic.named_parameters()))
Optimized beta via torch.nn for logistic regression:
[[-1.9755332   1.5226943  -0.91611755]] [-0.06487209]

The torch.nn API

  1. nn.Module: standard container for parameters and forward logic.
  2. nn.Linear: pre-built module with weights and bias.
  3. nn.MSELoss / nn.BCELoss: ready-made losses.
  4. torch.optim: parameter updates + zero_grad().
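As a bridge back to Part 2, the linear regression example can also be rewritten with these building blocks. A minimal sketch (synthetic data is regenerated here so the block is self-contained):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(1000, 3)
true_beta = torch.tensor([[2.0], [-3.5], [1.0]])
y = X @ true_beta + 0.5 * torch.randn(1000, 1)

model = nn.Linear(3, 1)            # weight (beta) and bias managed by the module
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2000):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

print(model.weight.detach())  # close to [2.0, -3.5, 1.0]
```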

Part 3: Neural Networks with PyTorch

Review: Neural Networks

  • A neuron receives inputs \(\mathbf{x} \in \mathbb{R}^d\), computes a parametric linear combination, and applies a non-linear transformation \(\phi\): \[ y = \phi(\mathbf{w}^T\mathbf{x} + b) \]
  • The function \(\phi\) (e.g., ELU, ReLU, \(\tanh\)) is essential. Without it, a composition of neurons would collapse into a single affine function: \[ (W_2 (W_1 \mathbf{x} + b_1) + b_2) = (W_2 W_1)\mathbf{x} + (W_2 b_1 + b_2) = W'\mathbf{x} + b' \]

Note

Logistic regression is essentially a single neuron with \(\phi = \sigma\) (sigmoid function), where: \[ P(y=1|\mathbf{x}) = \sigma(\beta^T\mathbf{x}) \]

Review: Neural Networks

  • Universal approximation theorem: A feedforward neural network with at least one hidden layer and a non-linear activation function is capable of approximating any continuous function \(\mathbb{R}^n \to \mathbb{R}^m\), given enough neurons.
  • Deep networks: In practice, multiple layers tend to be a more efficient representation, with better generalization.

Multi-Layer Perceptron

An MLP with two hidden layers maps the data \(\mathbf{X}\) through the operations:

\[ \mathbf{h_1} = \phi_1(\mathbf{X}\mathbf{W}_1 + \mathbf{b}_1), \quad \mathbf{h_2} = \phi_1(\mathbf{h_1}\mathbf{W}_2 + \mathbf{b}_2), \quad \hat{\mathbf{y}} = \phi_2(\mathbf{h_2}\mathbf{W}_3 + \mathbf{b}_3) \]

\[ \hat{\mathbf{y}} = \phi_2 \left( \phi_1 \left( \phi_1 \left( \mathbf{X}\mathbf{W}_1 + \mathbf{b}_1 \right) \mathbf{W}_2 + \mathbf{b}_2 \right) \mathbf{W}_3 + \mathbf{b}_3 \right) \]

[Diagram: fully connected network with inputs \(x_1, x_2\), two hidden layers of three neurons each (\(h_{1,1}, \dots, h_{2,3}\)), and output \(\hat{y}\); arrows indicate the forward pass.]

Backpropagation

How do we optimize the parameters \(\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3\) of a neural network? We need the gradients \(\nabla_{\mathbf{W}_k} \ell\) for each layer.

Consider the quadratic loss as an example:

\[ \ell(\hat{\mathbf{y}}, \mathbf{y}) = \|\hat{\mathbf{y}} - \mathbf{y}\|^2 \]

Substituting \(\hat{\mathbf{y}}\) with the MLP expression, the loss becomes a composition of functions:

\[ \ell(\hat{\mathbf{y}}, \mathbf{y}) = \left\| \phi_2 \left( \underbrace{\phi_1 \left( \underbrace{\phi_1 \left( \mathbf{X}\mathbf{W}_1 + \mathbf{b}_1 \right)}_{\mathbf{h}_1} \mathbf{W}_2 + \mathbf{b}_2 \right)}_{\mathbf{h}_2} \mathbf{W}_3 + \mathbf{b}_3 \right) - \mathbf{y} \right\|^2 \]

Backpropagation

By the chain rule, gradients propagate layer by layer, from back to front:

\[ \frac{\partial \ell}{\partial \mathbf{W}_3} = \frac{\partial \ell}{\partial \hat{\mathbf{y}}} \cdot \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{W}_3}, \qquad \frac{\partial \ell}{\partial \mathbf{W}_2} = \frac{\partial \ell}{\partial \hat{\mathbf{y}}} \cdot \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{h}_2} \cdot \frac{\partial \mathbf{h}_2}{\partial \mathbf{W}_2}, \qquad \frac{\partial \ell}{\partial \mathbf{W}_1} = \frac{\partial \ell}{\partial \hat{\mathbf{y}}} \cdot \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{h}_2} \cdot \frac{\partial \mathbf{h}_2}{\partial \mathbf{h}_1} \cdot \frac{\partial \mathbf{h}_1}{\partial \mathbf{W}_1} \]

Tip

This is exactly the computation that autograd performs when we call loss.backward(). No matter how many layers the network has, the chain rule applies recursively.

Optimizers

With the gradients \(\nabla_\theta \ell\) in hand, we need an update rule for the parameters.

  • SGD (Stochastic Gradient Descent): the simplest update: \[ \theta \leftarrow \theta - \eta \, \nabla_\theta \ell \] where \(\eta\) is the learning rate.
  • Adam: combines moving averages of the gradient (\(1^{\text{st}}\) moment) and the squared gradient (\(2^{\text{nd}}\) moment), adapting the learning rate for each parameter individually. It is the default optimizer in most modern applications.
  • In practice, just swap optim.SGD(...) for optim.Adam(...); the API is identical.
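To make the SGD update rule concrete, here is a sketch comparing one manual update step with optim.SGD on a toy loss; the commented line at the end shows the Adam swap:

```python
import torch

# One parameter update "by hand" vs. torch.optim.SGD, on loss(theta) = theta^2
lr = 0.1
theta_manual = torch.tensor(1.0, requires_grad=True)
theta_optim = torch.tensor(1.0, requires_grad=True)

loss = theta_manual ** 2                    # gradient is 2 * theta
loss.backward()
with torch.no_grad():
    theta_manual -= lr * theta_manual.grad  # theta <- theta - eta * grad

optimizer = torch.optim.SGD([theta_optim], lr=lr)
loss = theta_optim ** 2
loss.backward()
optimizer.step()

print(theta_manual.item(), theta_optim.item())  # both 0.8

# Swapping optimizers changes only the constructor:
# optimizer = torch.optim.Adam([theta_optim], lr=lr)
```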

Important

There is no need to implement backpropagation or optimizers manually: PyTorch handles everything. Just define the network, the loss, and call loss.backward() + optimizer.step().

MLP for Binary Classification

class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleMLP, self).__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, output_dim)
        self.activation = nn.ReLU()
        self.output_activation = nn.Sigmoid()

    def forward(self, x):
        h1 = self.activation(self.layer1(x))
        h2 = self.activation(self.layer2(h1))
        y_pred = self.output_activation(self.output(h2))
        return y_pred

model = SimpleMLP(input_dim=3, hidden_dim=5, output_dim=1)
y_pred = model(torch.tensor([0, 0, 0], dtype=torch.float32))
# make_dot(y_pred, params=dict(model.named_parameters()))

MLP for Binary Classification

Recall…

autograd automatically differentiates through layers and activations. Just define the loss and call loss.backward().

model = SimpleMLP(input_dim=X.shape[1], hidden_dim=5, output_dim=1)
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

n_epochs = 10000
for epoch in range(n_epochs):
    optimizer.zero_grad()
    y_pred = model(X_tensor)
    loss = criterion(y_pred, y_tensor)
    loss.backward()
    optimizer.step()

Exercises: Optimization with PyTorch

1: Poisson Regression (GLM)

Objective: implement and optimize a Poisson regression with a custom loss function.

We want to model:

\[ y_i \sim \text{Poisson}(\lambda_i), \quad \log(\lambda_i) = \mathbf{x}_i^\top \mathbf{w} + b \]

The corresponding NLL is:

\[ L = \sum_i \left(e^{\mathbf{x}_i^\top \mathbf{w} + b} - y_i(\mathbf{x}_i^\top \mathbf{w} + b)\right) \]

Task

  • Generate synthetic data with torch.poisson.
  • Implement the linear model in PyTorch.
  • Implement poisson_nll_loss.
  • Train with gradient descent and visualize convergence.
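As a starting point, a hedged sketch of the NLL above, using the poisson_nll_loss name from the task (here eta denotes \(\mathbf{x}_i^\top \mathbf{w} + b\); the comparison against PyTorch's built-in version is just a sanity check):

```python
import torch
import torch.nn.functional as F

def poisson_nll_loss(eta, y):
    # NLL up to the constant log(y!): sum_i ( exp(eta_i) - y_i * eta_i )
    return torch.sum(torch.exp(eta) - y * eta)

eta = torch.tensor([0.0, 1.0, -0.5])
y = torch.tensor([1.0, 3.0, 0.0])
print(poisson_nll_loss(eta, y))
print(F.poisson_nll_loss(eta, y, log_input=True, reduction="sum"))  # same value
```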

2: Gaussian Mixture Model (GMM)

Objective: fit the parameters of a 1D GMM with two components directly via autograd.

\[ p(y_i) = \pi_1 \mathcal{N}(y_i | \mu_1, \sigma_1^2) + \pi_2 \mathcal{N}(y_i | \mu_2, \sigma_2^2) \]

To enforce constraints:

  • softmax for mixture weights.
  • exp on log-standard deviations to ensure \(\sigma_k > 0\).

Task

  • Generate data from a 1D GMM with two Gaussians.
  • Define parameters as torch.nn.Parameter.
  • Implement gmm_nll_loss with torch.logsumexp.
  • Optimize parameters and compare learned density with actual data.
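One way to sketch gmm_nll_loss under those constraints (the parameter names logits, mus, and log_sigmas are illustrative):

```python
import math
import torch

def gmm_nll_loss(y, logits, mus, log_sigmas):
    log_pi = torch.log_softmax(logits, dim=0)   # log mixture weights, sum-to-one enforced
    sigmas = torch.exp(log_sigmas)              # sigma_k > 0 by construction
    # log N(y_i | mu_k, sigma_k^2), broadcast to shape (n, K)
    log_norm = (-0.5 * ((y[:, None] - mus) / sigmas) ** 2
                - torch.log(sigmas) - 0.5 * math.log(2 * math.pi))
    # log p(y_i) = logsumexp_k [ log pi_k + log N(y_i | mu_k, sigma_k^2) ]
    return -torch.mean(torch.logsumexp(log_pi + log_norm, dim=1))

torch.manual_seed(0)
y = torch.randn(100)
logits = torch.zeros(2, requires_grad=True)
mus = torch.tensor([-1.0, 1.0], requires_grad=True)
log_sigmas = torch.zeros(2, requires_grad=True)

loss = gmm_nll_loss(y, logits, mus, log_sigmas)
loss.backward()  # gradients reach all three parameter tensors
```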

3: Heteroscedastic Linear Regression

Objective: model conditional mean and variance simultaneously.

\[ y_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \quad \mu_i = f_\mu(\mathbf{x}_i), \quad \log(\sigma_i^2) = f_\sigma(\mathbf{x}_i) \]

The NLL can be written as:

\[ L = \frac{1}{n} \sum_i \left[\frac{1}{2}s_i + \frac{(y_i - \mu_i)^2}{2e^{s_i}}\right], \quad s_i = \log(\sigma_i^2) \]

Task

  • Create an nn.Module with two outputs (mu_head and logvar_head).
  • Implement heteroscedastic_gaussian_nll.
  • Train on simulated data with increasing variance.
  • Plot predicted mean and uncertainty bands (\(\pm 2\sigma\)).
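A sketch of the loss and the two-headed module (the names mu_head and logvar_head follow the task; MeanVarianceNet and the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

def heteroscedastic_gaussian_nll(mu, log_var, y):
    # L = mean( s_i / 2 + (y_i - mu_i)^2 / (2 * exp(s_i)) ), with s_i = log(sigma_i^2)
    return torch.mean(0.5 * log_var + (y - mu) ** 2 / (2 * torch.exp(log_var)))

class MeanVarianceNet(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, 1)      # conditional mean
        self.logvar_head = nn.Linear(hidden_dim, 1)  # conditional log-variance

    def forward(self, x):
        h = self.trunk(x)
        return self.mu_head(h), self.logvar_head(h)

model = MeanVarianceNet(input_dim=1, hidden_dim=16)
x = torch.randn(32, 1)
y = torch.randn(32, 1)
mu, log_var = model(x)
loss = heteroscedastic_gaussian_nll(mu, log_var, y)
loss.backward()
```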

4: CNN (MNIST)

Objective: build and train a CNN to classify digits.

Given an image \(\mathbf{X} \in \mathbb{R}^{28 \times 28}\), classify it (0 to 9) by minimizing the Cross-Entropy Loss: \[ L = -\frac{1}{n}\sum_{i,c} y_{i,c} \log(\hat{y}_{i,c}) \]

Task

  • Load MNIST via torchvision.datasets.
  • Create a model with nn.Conv2d, nn.MaxPool2d, and nn.Linear.
  • Train in mini-batches (DataLoader) with nn.CrossEntropyLoss().
  • Evaluate accuracy on the test set.
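A minimal model sketch (channel counts and kernel sizes are illustrative; note that nn.CrossEntropyLoss expects raw logits, so there is no final softmax):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # -> (16, 28, 28)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> (16, 14, 14)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> (32, 14, 14)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> (32, 7, 7)
        )
        self.classifier = nn.Linear(32 * 7 * 7, 10)       # logits for digits 0-9

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h.flatten(1))

model = SmallCNN()
logits = model(torch.randn(4, 1, 28, 28))  # dummy batch of 4 images
print(logits.shape)  # torch.Size([4, 10])
```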

5: MF for Movie Recommendation

Objective: apply matrix factorization for recommendation systems.

We approximate the rating \(r_{u,i}\) (user \(u\), item \(i\)) by: \[ \hat{r}_{u,i} = \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i \]

Minimizing the mean squared error with \(L_2\) regularization: \[ L = \sum_{(u,i)} (r_{u,i} - \hat{r}_{u,i})^2 + \lambda \left(||\mathbf{p}_u||^2 + ||\mathbf{q}_i||^2 + b_u^2 + b_i^2\right) \]

Task

  • Use nn.Embedding for \(p_u, q_i, b_u, b_i\).
  • Compute \(\hat{r}_{u,i}\) in forward and optimize the loss with MSE.
  • Train on the movielens-100k dataset.
  • Provide top-K recommendations for a user.
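A sketch of the model with nn.Embedding (names and sizes are illustrative; the \(L_2\) penalty can either be added to the loss explicitly or approximated with the optimizer's weight_decay):

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    def __init__(self, n_users, n_items, k, global_mean=0.0):
        super().__init__()
        self.p = nn.Embedding(n_users, k)    # user factors p_u
        self.q = nn.Embedding(n_items, k)    # item factors q_i
        self.b_u = nn.Embedding(n_users, 1)  # user biases
        self.b_i = nn.Embedding(n_items, 1)  # item biases
        self.mu = global_mean

    def forward(self, users, items):
        dot = (self.p(users) * self.q(items)).sum(dim=1)
        return (self.mu + self.b_u(users).squeeze(1)
                + self.b_i(items).squeeze(1) + dot)

model = MatrixFactorization(n_users=100, n_items=50, k=8, global_mean=3.5)
users = torch.tensor([0, 1, 2])
items = torch.tensor([10, 20, 30])
print(model(users, items).shape)  # torch.Size([3])
```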

6: RNN for Time Series

Objective: predict values in a time series using recurrent networks.

Given a sequence \((x_{t-k}, \dots, x_{t})\), predict \(x_{t+1}\) by updating the hidden state \(\mathbf{h}_t\): \[ \mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}), \quad \hat{x}_{t+1} = \mathbf{w}^\top \mathbf{h}_t + b \]

Task

  • Use the Sunspots dataset (statsmodels.api.datasets.sunspots).
  • Format data using history windows.
  • Implement a model with nn.RNN or nn.LSTM and nn.Linear.
  • Train by minimizing MSE and plot predictions against original data.
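A sketch of the windowing and the model (make_windows is a hypothetical helper, and a sine wave stands in for the sunspots series):

```python
import torch
import torch.nn as nn

def make_windows(series, k):
    # Slide a window of length k over the series; the target is the next value
    xs = torch.stack([series[i:i + k] for i in range(len(series) - k)])
    ys = series[k:]
    return xs.unsqueeze(-1), ys.unsqueeze(-1)  # shapes (n, k, 1) and (n, 1)

class SeriesLSTM(nn.Module):
    def __init__(self, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.lstm(x)         # out: (batch, k, hidden_dim)
        return self.head(out[:, -1])  # predict from the last hidden state

series = torch.sin(torch.linspace(0, 20, 200))  # stand-in for the sunspots data
X, y = make_windows(series, k=10)
model = SeriesLSTM()
pred = model(X)
print(pred.shape)  # torch.Size([190, 1])
```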