A high-performance numerical computing library for Python.
Origins: Created by Facebook’s AI Research lab (FAIR) to overcome the flexibility limitations of existing frameworks at the time.
Open Source: It has been an open-source library since its inception in 2016, enabling rapid adoption and contributions from the global community.
Governance: Although it has always been open, governance shifted in 2022 from Meta to the PyTorch Foundation (under the Linux Foundation), ensuring neutral and collaborative management across multiple companies.
Key differentiator: Introduced the concept of dynamic computation graphs.
Why PyTorch?
Automatic differentiation (autograd): can automatically compute the gradient (the derivative) of any function you define. This eliminates the need to manually derive complex loss functions, a tedious and extremely error-prone process.
Hardware acceleration (GPU): enables the execution of mathematical operations on tensors (multidimensional arrays, like NumPy’s) in parallel on GPUs, which is essential for training deep neural network models in a reasonable time.
Flexibility: It is hard to think of a parametric model that cannot be implemented in PyTorch.
Tip
Think of PyTorch as NumPy with “superpowers”.
Why PyTorch?
Maturity: mature and stable, with a large and active community.
Popularity: dominates academic research, e.g. 80% of NeurIPS 2023 papers use PyTorch.
Momentum and feedback loop: maturity and popularity create a virtuous cycle of development and adoption.
Even if you don’t plan to train deep neural networks, PyTorch can help you – a lot.
Important
This short course will cover not only neural networks, but also everyday probabilistic and statistical applications.
Part 1: PyTorch Basics
Tensors
The fundamental data unit in PyTorch. Analogous to NumPy’s ndarrays, but with “superpowers”.
A tensor has essential attributes:
shape: the dimensionality of the tensor (e.g., scalar, vector, matrix).
dtype: the data type (torch.float32, torch.long, etc.).
device: where the tensor is allocated in memory (cpu or cuda).
Tensors
```python
import numpy as np
import torch

# Creating tensors in various ways
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
print(f"Tensor from list:\n{x}\n")

# Attributes
print(f"Shape: {x.shape}")
print(f"Data type: {x.dtype}")
print(f"Device: {x.device}\n")

# Interoperability with NumPy (they share the same memory on CPU)
a_np = np.array([5, 6, 7])
a_pt = torch.from_numpy(a_np)
print(f"Tensor from NumPy: {a_pt}")
a_pt[0] = 99
print(f"Original NumPy was modified: {a_np}")
```
Tensor from list:
tensor([[1., 2.],
[3., 4.]])
Shape: torch.Size([2, 2])
Data type: torch.float32
Device: cpu
Tensor from NumPy: tensor([5, 6, 7])
Original NumPy was modified: [99 6 7]
Tensor Operations
The syntax is familiar to NumPy users.
```python
x = torch.tensor([[1., 2.], [3., 4.]])
y = torch.tensor([[5., 6.], [7., 8.]])

# Element-wise operations
print("Sum:", x + y)
print("Product (Hadamard):", x * y)

# Matrix multiplication
print("Matrix Product (@):", x @ y)
print("Matrix Product (matmul):", torch.matmul(x, y))

# Reshaping
print("Original shape:", x.shape)
print("Reshaped (view):", x.view(4, 1).shape)
```
Moving tensors to the GPU is trivial and can result in speed gains of orders of magnitude.
```python
# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Create tensors on the chosen device
x_gpu = torch.randn(1000, 1000, device=device)
y_gpu = torch.randn(1000, 1000, device=device)

# The operation is executed on the device
z_gpu = x_gpu @ y_gpu

# To use with libraries like Matplotlib or NumPy, move back to CPU
z_cpu = z_gpu.to("cpu")
print(z_cpu.shape)
```
Using device: cuda
torch.Size([1000, 1000])
The Tensor “Superpower”
Important
autograd is the system that tracks all operations on tensors to automatically compute gradients. It is the foundation for model optimization.
Dynamic computation graph: PyTorch builds a graph “on-the-fly”.
Tracking: for a tensor x created with requires_grad=True, PyTorch records the “history” of operations applied to it.
Gradient computation: by calling .backward() on a scalar (e.g., loss function), PyTorch applies the chain rule.
autograd in Practice
Let’s compute the derivative of \(y = x^2 \sin(x)\) at \(x = 2\).
Tip
The analytical derivative is \(2x \sin(x) + x^2 \cos(x)\).
```python
# Define the tensor and enable gradient tracking
x = torch.tensor(2.0, requires_grad=True)

# Define the function
y = x**2 * torch.sin(x)

# Compute the gradient
y.backward()

# The gradient is stored in x.grad
print(f"Gradient of y with respect to x at x=2: {x.grad.item():.4f}")

# Analytical value for comparison
analytic_grad = 2 * 2 * np.sin(2) + 2**2 * np.cos(2)
print(f"Analytical value: {analytic_grad:.4f}")
```
Gradient of y with respect to x at x=2: 1.9726
Analytical value: 1.9726
Best Practices with Tensors
requires_grad=True only for parameters: input data generally does not need gradients.
Use with torch.no_grad() during inference to save memory and time.
```python
w = torch.tensor(1.0, requires_grad=True)
x_obs = torch.tensor(2.0)  # observed data
y_hat = w * x_obs
y_hat.backward()  # We'll see more about this later
print("gradient at w:", w.grad)

with torch.no_grad():
    pred = w * 10
print("prediction without tracking gradient:", pred)
```
gradient at w: tensor(2.)
prediction without tracking gradient: tensor(10.)
The Dynamic Graph and grad_fn
In PyTorch, the result of an operation on tensors that require gradients carries a pointer (grad_fn) to the operation that created it.
```python
a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
d = a * b + a
print("d:", d)
print("grad_fn of d:", d.grad_fn)
```
d: tensor(20., grad_fn=<AddBackward0>)
grad_fn of d: <AddBackward0 object at 0x7feed1175900>
Visualizing the Computation Graph
make_dot(d, params={'d': d, 'a': a, 'b': b})
Blue rectangles are the leaves (our parameters)
Gray rectangles are the intermediate operations
The arrow indicates the information flow (forward pass)
Computing the Derivatives
make_dot(d, params={'d': d, 'a': a, 'b': b})
We can compute gradients automatically through the backward pass: PyTorch traverses the graph in reverse and applies the chain rule to compute partial derivatives.
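Picking up the same graph from the previous slide (\(d = ab + a\)), one call to .backward() fills in both partial derivatives: \(\partial d/\partial a = b + 1\) and \(\partial d/\partial b = a\).

```python
import torch

a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
d = a * b + a

# Backward pass: traverse the graph in reverse, applying the chain rule
d.backward()

print("dd/da:", a.grad)  # b + 1 = 4
print("dd/db:", b.grad)  # a = 5
```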
Optimized beta via autograd:
[[-1.9756057]
[ 1.5226276]
[-0.9161564]] [-0.06478492]
Transition to torch.nn
So far, we have worked with tensors and autograd directly. The torch.nn API organizes this into reusable modules. Let’s repeat logistic regression using nn.Module.
```python
class LogisticRegressionModel(nn.Module):
    def __init__(self, n_features):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))
```
Logistic Regression with torch.nn
```python
model_logistic = LogisticRegressionModel(n_features)
criterion = nn.BCELoss()
optimizer = optim.SGD(model_logistic.parameters(), lr=0.1)

n_epochs = 10000
for epoch in range(n_epochs):
    optimizer.zero_grad()
    y_pred = model_logistic(X_tensor)
    loss = criterion(y_pred, y_tensor)
    loss.backward()
    optimizer.step()

print("Optimized beta via torch.nn for logistic regression:")
print(model_logistic.linear.weight.detach().numpy(), model_logistic.linear.bias.detach().numpy())
# make_dot(loss, params=dict(model_logistic.named_parameters()))
```
Optimized beta via torch.nn for logistic regression:
[[-1.9755332 1.5226943 -0.91611755]] [-0.06487209]
The torch.nn API
nn.Module: standard container for parameters and forward logic.
nn.Linear: pre-built module with weights and bias.
nn.MSELoss / nn.BCELoss: ready-made losses.
torch.optim: parameter updates + zero_grad().
Part 3: Neural Networks with PyTorch
Review: Neural Networks
A neuron receives inputs \(\mathbf{x} \in \mathbb{R}^d\), computes a parametric linear combination, and applies a non-linear transformation \(\phi\): \[ y = \phi(\mathbf{w}^T\mathbf{x} + b) \]
The function \(\phi\) (e.g., ELU, ReLU, \(\tanh\)) is essential. Without it, a composition of neurons would collapse into a single affine function: \[ (W_2 (W_1 \mathbf{x} + b_1) + b_2) = (W_2 W_1)\mathbf{x} + (W_2 b_1 + b_2) = W'\mathbf{x} + b' \]
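A quick numerical check of this collapse (a hypothetical sketch with random weights, not from the course code): composing two affine layers without an activation in between is exactly one affine layer.

```python
import torch

torch.manual_seed(0)
W1, b1 = torch.randn(4, 3), torch.randn(4)
W2, b2 = torch.randn(2, 4), torch.randn(2)
x = torch.randn(3)

# Two affine layers with no non-linearity in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal a single affine layer with W' = W2 W1 and b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(torch.allclose(two_layers, one_layer, atol=1e-6))  # True
```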
Note
Logistic regression is essentially a single neuron with \(\phi = \sigma\) (sigmoid function), where: \[ P(y=1|\mathbf{x}) = \sigma(\beta^T\mathbf{x}) \]
Review: Neural Networks
Universal approximation theorem: A feedforward neural network with at least one hidden layer and a non-linear activation function can approximate any continuous function \(\mathbb{R}^n \to \mathbb{R}^m\) on a compact domain to arbitrary accuracy, given enough neurons.
Deep networks: In practice, multiple layers tend to be a more efficient representation, with better generalization.
Multi-Layer Perceptron
An MLP with two hidden layers maps the data \(\mathbf{X}\) through the operations: \[ \mathbf{H}_1 = \phi(\mathbf{X}\mathbf{W}_1 + \mathbf{b}_1), \quad \mathbf{H}_2 = \phi(\mathbf{H}_1\mathbf{W}_2 + \mathbf{b}_2), \quad \hat{\mathbf{y}} = \mathbf{H}_2\mathbf{W}_3 + \mathbf{b}_3 \]
How do we optimize the parameters \(\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3\) of a neural network? We need the gradients \(\nabla_{\mathbf{W}_k} \ell\) for each layer.
This is exactly the computation that autograd performs when we call loss.backward(). No matter how many layers the network has, the chain rule applies recursively.
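As an illustration (a minimal sketch, not from the course code), a two-hidden-layer MLP built with nn.Sequential: a single loss.backward() call fills .grad for the parameters of every layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two hidden layers (W1, W2) plus an output layer (W3)
mlp = nn.Sequential(
    nn.Linear(3, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

X = torch.randn(32, 3)
y = torch.randn(32, 1)

loss = nn.MSELoss()(mlp(X), y)
loss.backward()  # autograd applies the chain rule through all layers

# Every parameter of every layer now has a gradient
print(all(p.grad is not None for p in mlp.parameters()))  # True
```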
Optimizers
With the gradients \(\nabla_\theta \ell\) in hand, we need an update rule for the parameters.
SGD (Stochastic Gradient Descent): the simplest update: \[ \theta \leftarrow \theta - \eta \, \nabla_\theta \ell \] where \(\eta\) is the learning rate.
Adam: combines moving averages of the gradient (\(1^{\text{st}}\) moment) and the squared gradient (\(2^{\text{nd}}\) moment), adapting the learning rate for each parameter individually. It is the default optimizer in most modern applications.
In practice, just swap optim.SGD(...) for optim.Adam(...): the API is identical.
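A small sketch of this interchangeability (the nn.Linear stand-in model and the data here are invented for illustration):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(3, 1)  # a stand-in model for illustration

# SGD and Adam expose the same interface
opt_sgd = optim.SGD(model.parameters(), lr=0.1)
opt_adam = optim.Adam(model.parameters(), lr=0.001)

X, y = torch.randn(8, 3), torch.randn(8, 1)

# The training-loop calls are identical for both optimizers
for opt in (opt_sgd, opt_adam):
    opt.zero_grad()
    loss = nn.MSELoss()(model(X), y)
    loss.backward()
    opt.step()
```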
Important
There is no need to implement backpropagation or optimizers manually: PyTorch handles everything. Just define the network, the loss, and call loss.backward() + optimizer.step().
Create an nn.Module with two outputs (mu_head and logvar_head).
Implement heteroscedastic_gaussian_nll.
Train on simulated data with increasing variance.
Plot predicted mean and uncertainty bands (\(\pm 2\sigma\)).
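One possible form of the loss (a sketch under the assumption that logvar_head outputs \(\log \sigma^2\); the function name comes from the task, but this is a hint, not the official solution): dropping the constant, the Gaussian NLL is \(\frac{1}{2}\left(\log \sigma^2 + (y - \mu)^2 / \sigma^2\right)\), averaged over the batch.

```python
import torch

def heteroscedastic_gaussian_nll(mu, logvar, y):
    """Negative log-likelihood of y under N(mu, exp(logvar)),
    averaged over the batch (constant term dropped)."""
    return 0.5 * (logvar + (y - mu) ** 2 / torch.exp(logvar)).mean()

# Sanity check: perfect mean predictions with unit variance give 0
y = torch.randn(10)
print(heteroscedastic_gaussian_nll(y, torch.zeros(10), y))  # tensor(0.)
```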
4: CNN (MNIST)
Objective: build and train a CNN to classify digits.
Given an image \(\mathbf{X} \in \mathbb{R}^{28 \times 28}\), classify it (0 to 9) by minimizing the Cross-Entropy Loss: \[ L = -\frac{1}{n}\sum_{i,c} y_{i,c} \log(\hat{y}_{i,c}) \]
Task
Load MNIST via torchvision.datasets.
Create a model with nn.Conv2d, nn.MaxPool2d, and nn.Linear.
Train in mini-batches (DataLoader) with nn.CrossEntropyLoss().
Evaluate accuracy on the test set.
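One hypothetical architecture satisfying the task (many layouts work; the channel counts and kernel sizes here are arbitrary choices, not the expected answer):

```python
import torch
import torch.nn as nn

# A minimal CNN for 28x28 single-channel digit images
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 28x28 -> 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                             # -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # logits for 10 classes
)

logits = cnn(torch.randn(4, 1, 28, 28))  # a fake mini-batch of 4 images
print(logits.shape)  # torch.Size([4, 10])
```

nn.CrossEntropyLoss() expects these raw logits, so no softmax layer is needed at the end.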
5: MF for Movie Recommendation
Objective: apply matrix factorization for recommendation systems.
Minimizing the mean squared error with \(L_2\) regularization: \[ L = \sum_{(u,i)} (r_{u,i} - \hat{r}_{u,i})^2 + \lambda \left(||\mathbf{p}_u||^2 + ||\mathbf{q}_i||^2 + b_u^2 + b_i^2\right) \]
Task
Use nn.Embedding for \(p_u, q_i, b_u, b_i\).
Compute \(\hat{r}_{u,i}\) in forward and optimize the loss with MSE.
Train on the movielens-100k dataset.
Provide top-K recommendations for a user.
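A sketch of the forward pass with nn.Embedding (the class name, latent dimension, and user/item counts are illustrative assumptions): \(\hat{r}_{u,i} = \mathbf{p}_u^\top \mathbf{q}_i + b_u + b_i\).

```python
import torch
import torch.nn as nn

class MF(nn.Module):
    """Biased matrix factorization: r_hat = p_u . q_i + b_u + b_i (sketch)."""
    def __init__(self, n_users, n_items, k=16):
        super().__init__()
        self.p = nn.Embedding(n_users, k)    # user latent factors
        self.q = nn.Embedding(n_items, k)    # item latent factors
        self.b_u = nn.Embedding(n_users, 1)  # user bias
        self.b_i = nn.Embedding(n_items, 1)  # item bias

    def forward(self, u, i):
        dot = (self.p(u) * self.q(i)).sum(dim=1)
        return dot + self.b_u(u).squeeze(1) + self.b_i(i).squeeze(1)

model = MF(n_users=100, n_items=50)
u = torch.tensor([0, 1, 2])    # batch of user ids
i = torch.tensor([10, 20, 30]) # batch of item ids
print(model(u, i).shape)  # torch.Size([3])
```

The \(L_2\) penalty can be added to the MSE loss manually, or approximated via the optimizer's weight_decay argument.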
6: RNN for Time Series
Objective: predict values in a time series using recurrent networks.
Given a sequence \((x_{t-k}, \dots, x_{t})\), predict \(x_{t+1}\) by updating the hidden state \(\mathbf{h}_t\): \[ \mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}), \quad \hat{x}_{t+1} = \mathbf{w}^\top \mathbf{h}_t + b \]
Task
Use the Sunspots dataset (statsmodels.api.datasets.sunspots).
Format data using history windows.
Implement a model with nn.RNN or nn.LSTM and nn.Linear.
Train by minimizing MSE and plot predictions against original data.
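A model skeleton for the task (hidden size and class name are illustrative assumptions): the last hidden state \(\mathbf{h}_t\) feeds a linear head that predicts \(\hat{x}_{t+1}\).

```python
import torch
import torch.nn as nn

class RNNForecaster(nn.Module):
    """Predict the next value of a series from a window of past values (sketch)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):        # x: (batch, window, 1)
        _, h = self.rnn(x)       # h: (1, batch, hidden), the last hidden state
        return self.head(h[-1])  # (batch, 1), the predicted next value

model = RNNForecaster()
window = torch.randn(8, 12, 1)   # batch of 8 history windows of length 12
print(model(window).shape)  # torch.Size([8, 1])
```

Swapping nn.RNN for nn.LSTM only changes the recurrent state (the LSTM also returns a cell state).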