Deep Learning with Torch

Credits: Soumith Chintala, Nicholas Leonard, Tyler Neylon, Adam Paszke

What is Torch?

Torch is a scientific computing framework based on Lua[JIT] with strong CPU and CUDA backends.

Strong points of Torch:

  • Efficient Tensor library (like NumPy) with an efficient CUDA backend
  • Neural Networks package -- build arbitrary acyclic computation graphs with automatic differentiation
    • also with fast CUDA and CPU backends
  • Good community and industry support - several hundred community-built and maintained packages.

Introduction - Lua

  • Lua is pretty close to JavaScript.
    • variables are global by default, unless local keyword is used
  • Only has one data structure built-in, a table: {}. Doubles as a hash-table and an array.
  • 1-based indexing.
  • foo:bar() is the same as foo.bar(foo).
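
A minimal sketch of the colon call syntax from the last bullet. The table account and its deposit method are hypothetical names chosen only for illustration:

```lua
-- Hypothetical table used only to illustrate the colon call syntax.
local account = {balance = 100}

-- Defined with a dot, so self is an explicit first parameter.
function account.deposit(self, amount)
  self.balance = self.balance + amount
  return self.balance
end

local viaColon = account:deposit(10)         -- sugar: passes account as self
local viaDot = account.deposit(account, 10)  -- explicit form of the same call
print(viaColon, viaDot)
```

Both calls run the same function on the same table; the colon only saves typing the receiver twice.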

Strings, numbers, tables

a = 'hello'
b = {}
b[1] = a
b[2] = 30
for i=1,#b do -- the # operator is the length operator in Lua (for LISTS ONLY)
  print(b[i])
end

num = 40
if num == 40 then
  print('40')
elseif num ~= 40 then  -- ~= is not equals.
  print('not 40')
end

function fib(n)
  if n < 2 then return 1 end
  return fib(n - 2) + fib(n - 1)
end

-- Closures
function adder(x)
  -- The returned function is created when adder is
  -- called, and remembers the value of x:
  return function (y) return x + y end
end
add9 = adder(9)
print(add9(16))  -- 25
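
Closures can also carry mutable state, which is the property the optimization closures later in these notes rely on. A small sketch, where makeCounter is a hypothetical helper that is not part of the original notes:

```lua
-- makeCounter is a hypothetical helper used only for illustration.
local function makeCounter()
  local count = 0              -- captured by the closure below
  return function()
    count = count + 1
    return count
  end
end

local c1 = makeCounter()
local c2 = makeCounter()
c1(); c1()
print(c1(), c2())  -- each closure keeps its own private count
```

Each call to makeCounter creates a fresh count variable, so c1 and c2 do not interfere with each other.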


a = {1, 2, ['a'] = 3}
a = {1, 2, a = 3, print = function(self) print(self) end}
b = {[1] = 'c'}  -- keys that are not plain identifiers must be written in brackets

Metatables and Metamethods

  • Metatables allow us to change the behavior of a table. For instance, using metatables, we can define how Lua computes the expression a+b, where a and b are tables.
  • Values of __add, __index, ... are called metamethods.
f1 = {a = 1, b = 2}  -- Represents the fraction a/b.
f2 = {a = 2, b = 3}
metafraction = {}
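function metafraction.__add(f1, f2)
  local sum = {}
  sum.b = f1.b * f2.b
  sum.a = f1.a * f2.b + f2.a * f1.b
  return sum
end

setmetatable(f1, metafraction)
s = f1 + f2      -- calls __add(f1, f2) from f1's metatable
print(s.a, s.b)  -- 7 6, i.e. 1/2 + 2/3 = 7/6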
-- An __index on a metatable overloads dot lookups
defaultFavs = {animal = 'gru', food = 'donuts'}
myFavs = {food = 'pizza'}
setmetatable(myFavs, {__index = defaultFavs})

Direct table lookups that fail will retry using the metatable's __index value, and this recurses.
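
The recursive retry can be seen with a chain of three tables. This is a minimal sketch; base, mammal and dog are hypothetical names:

```lua
-- Three hypothetical tables chained through __index.
local base = {legs = 4}
local mammal = setmetatable({warmBlooded = true}, {__index = base})
local dog = setmetatable({sound = 'woof'}, {__index = mammal})

print(dog.sound)            -- found directly on dog
print(dog.warmBlooded)      -- found on mammal through dog's __index
print(dog.legs)             -- found on base through mammal's __index
print(rawget(dog, 'legs'))  -- rawget skips metatables, so nothing is found
```

rawget is the standard way to check whether a key lives on the table itself rather than somewhere along the __index chain.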


  • Classes aren't built in; there are different ways to make them using tables and metatables.
Dog = {}                                   -- 1.

function Dog:new()                         -- 2.
  local newObj = {sound = 'woof'}          -- 3.
--   self.__index = self                      -- 4.
  return setmetatable(newObj, {__index = Dog})        -- 5.
end

function Dog:makeSound()                   -- 6.
  print('I say ' .. self.sound)
end

mrDog = Dog:new()                          -- 7.
-- mrDog['sound']
mrDog:makeSound()                          -- 8.
-- print(mrDog['sound'])
  1. Dog acts like a class; it's really a table.
  2. function tablename:fn(...) is the same as function tablename.fn(self, ...). The : just adds a first argument called self. Read 7 & 8 below for how self gets its value.
  3. newObj will be an instance of class Dog.
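  4. self = the class being instantiated. Often self = Dog, but inheritance can change it. The usual idiom (the commented-out line) sets both newObj's metatable and self's __index to self, so newObj gets self's functions. The code above instead uses a metatable whose __index is fixed to Dog, which behaves the same as long as Dog is not subclassed.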
  5. Reminder: setmetatable returns its first arg.
  6. The : works as in 2, but this time we expect self to be an instance instead of a class.
  7. Same as Dog.new(Dog), so self = Dog in new().
  8. Same as mrDog.makeSound(mrDog); self = mrDog.
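
Note 4 mentions that inheritance can change what self refers to. A minimal sketch of that case, using the self.__index = self idiom; Animal and LoudDog are hypothetical classes that do not appear in the notes above:

```lua
local Animal = {}

function Animal:new(sound)
  local obj = {sound = sound}
  self.__index = self            -- failed lookups on obj fall back to self
  return setmetatable(obj, self)
end

function Animal:makeSound()
  return 'I say ' .. self.sound
end

-- LoudDog is a hypothetical subclass: an Animal instance used as a class.
local LoudDog = Animal:new()

function LoudDog:makeSound()
  return Animal.makeSound(self) .. '!!!'
end

local rex = LoudDog:new('woof')  -- inside new(), self is LoudDog, not Animal
print(rex:makeSound())
```

Because new() is found on Animal but called with self = LoudDog, the instance's metatable is LoudDog, and the overridden makeSound is the one that runs.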


Tensors are the main class of objects used in Torch 7:

  • An N-dimensional array that views an underlying Storage (a contiguous 1D-array)
  • Different Tensors can share the same Storage
  • Different types: FloatTensor, DoubleTensor, IntTensor, CudaTensor, and so on
  • Implements most Basic Linear Algebra Subprograms (BLAS)
  • Supports random initialization, indexing, transposition, sub-tensor extractions, and more
  • Most operations for Float/Double are also implemented for Cuda Tensors (via cutorch)
a = torch.Tensor(5,3) -- construct a 5x3 matrix (initialized with garbage content, whatever was already there)
a = torch.rand(5,3)
b = torch.rand(3,4)
-- matrix-matrix multiplication: syntax 1
c = a * b
-- matrix-matrix multiplication: syntax 2
c = torch.mm(a,b)
-- matrix-matrix multiplication: syntax 3
c = torch.Tensor(5,4)
c:mm(a,b) -- store the result of a*b in c
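b = a                               -- b refers to the same tensor as a (no copy)
c = a:clone()                       -- c is an independent copy of a
m = torch.Tensor(a:size()):copy(a)  -- allocate a tensor of the same size, then copy a into it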
a = torch.Tensor(2, 3)
b = torch.Tensor(3, 2)
x = a:storage()
for i = 1, x:size() do
    x[i] = i  -- writing to the storage changes what a sees
end
b[{{}, {1}}]:fill(2)
b[{{}, {2}}]:fill(3)
CUDA Tensors

Tensors can be moved onto the GPU using the :cuda() function.

require 'cutorch';
a = a:cuda()
b = b:cuda()
c = c:cuda()
c:mm(a,b) -- done on GPU


a = torch.rand(5, 3)
b = a:transpose(1, 2)

Tensors a and b share the same underlying storage, so a write through one is visible through the other.

a[{2,1}] = 5
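
The sharing between a tensor and its transpose can be pictured with a toy model in plain Lua. This is only a sketch of the idea of a view (a storage plus sizes and strides), not Torch's implementation, and every function name in it is hypothetical:

```lua
-- Toy model (NOT Torch's implementation): a 2-D view is a shared storage
-- table plus sizes and strides that map (i, j) to a storage position.
local function newView(storage, rows, cols, rowStride, colStride)
  return {storage = storage, size = {rows, cols}, stride = {rowStride, colStride}}
end

local function position(v, i, j)
  return 1 + (i - 1) * v.stride[1] + (j - 1) * v.stride[2]
end

local function get(v, i, j) return v.storage[position(v, i, j)] end
local function set(v, i, j, value) v.storage[position(v, i, j)] = value end

-- Transposing swaps sizes and strides; no element is copied.
local function transpose(v)
  return newView(v.storage, v.size[2], v.size[1], v.stride[2], v.stride[1])
end

local storage = {1, 2, 3, 4, 5, 6}  -- row-major data for a 2x3 matrix
local a = newView(storage, 2, 3, 3, 1)
local b = transpose(a)              -- 3x2 view of the same storage

set(a, 2, 1, 50)                    -- write through a ...
print(get(b, 1, 2))                 -- ... and read it back through b
```

Element (2,1) of a and element (1,2) of b resolve to the same storage position, which is why the write is visible through both views.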
Neural Networks

Neural networks in Torch can be constructed using the nn package.

require 'nn';
  • implements feed-forward neural networks
  • neural networks form a computational flow-graph of transformations
  • backpropagation computes gradients using the chain rule; gradient descent then uses those gradients to update the weights

For example, consider LeNet, a network that classifies digit images.

It is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.

The container for such a network is nn.Sequential, which feeds the input through its layers in the order they were added.

net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 6, 5, 5)) -- 1 input image channel, 6 output channels, 5x5 convolution kernel
net:add(nn.SpatialMaxPooling(2,2,2,2))     -- A max-pooling operation that looks at 2x2 windows and finds the max.
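net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.SpatialMaxPooling(2,2,2,2))     -- second pooling stage: 16x10x10 -> 16x5x5, the shape nn.View expects below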
net:add(nn.View(16*5*5))                   -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120))            -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.Linear(120, 84))
net:add(nn.Linear(84, 10))                 -- 10 is the number of outputs of the network (in this case, 10 digits)
net:add(nn.LogSoftMax())                   -- converts the output to a log-probability. Useful for classification problems

mlp=nn.Parallel(2,1);     -- iterate over dimension 2 of input
mlp:add(nn.Linear(10,3)); -- apply to first slice
mlp:add(nn.Linear(10,2))  -- apply to second slice
x = torch.randn(10,2)

Other examples of nn containers are shown in the figure below: containers

Every neural network module in Torch supports automatic differentiation. It has a :forward(input) function that computes the output for a given input by flowing the input through the network. It also has a :backward(input, gradient) function that takes the gradient of the loss with respect to the module's output and computes the gradient with respect to its input and its parameters. This is done via the chain rule.
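
In symbols, for a module computing $y = f(x; w)$ inside a network with scalar loss $L$, the backward pass receives $\partial L / \partial y$ and applies the chain rule. This is the standard formulation, written with notation ($y$, $w$, $L$) that the text above does not define:

$$ \frac{\partial L}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\top} \frac{\partial L}{\partial y}, \qquad \frac{\partial L}{\partial w} = \left(\frac{\partial y}{\partial w}\right)^{\top} \frac{\partial L}{\partial y} $$

The first quantity is what :backward returns (the gradInput); the second is accumulated into the module's gradient buffers, which is why those buffers are zeroed before a new backward pass.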

input = torch.rand(1,32,32) -- pass a random tensor as input to the network
output = net:forward(input)
net:zeroGradParameters() -- zero the internal gradient buffers of the network (will come to this later)
gradInput = net:backward(input, torch.rand(10))
Criterion: Defining a loss function

When you want a model to learn to do something, you give it feedback on how well it is doing. The function that computes this objective measure of the model's performance is called a loss function.

A typical loss function takes in the model's output and the ground truth and computes a value that quantifies the model's performance.

The model then corrects itself to have a smaller loss.

In Torch, loss functions are implemented just like neural network modules and support automatic differentiation.
They have two functions: forward(input, target) and backward(input, target).

For example:

criterion = nn.ClassNLLCriterion() -- a negative log-likelihood criterion for multi-class classification
loss = criterion:forward(output, 3) -- let's say the groundtruth was class number: 3
gradients = criterion:backward(output, 3)
gradInput = net:backward(input, gradients)
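
For a single sample, the negative log-likelihood is simply the negated log-probability of the target class. A plain-Lua sketch of that computation follows; it needs no Torch, the function names are hypothetical, and it is an illustration rather than the nn source:

```lua
-- Plain-Lua sketch of what LogSoftMax followed by a negative log-likelihood
-- loss computes for one sample; the function names are hypothetical.
local function logSoftMax(scores)
  local maxScore = -math.huge
  for _, s in ipairs(scores) do maxScore = math.max(maxScore, s) end
  local sumExp = 0
  for _, s in ipairs(scores) do sumExp = sumExp + math.exp(s - maxScore) end
  local logZ = maxScore + math.log(sumExp)  -- log of the normalizer
  local out = {}
  for i, s in ipairs(scores) do out[i] = s - logZ end
  return out
end

local function classNLL(logProbs, target)
  return -logProbs[target]
end

local logProbs = logSoftMax({1.0, 2.0, 3.0})
local loss = classNLL(logProbs, 3)  -- the target is the highest-scoring class
print(loss)
```

The loss is small when the target class already has a high log-probability and grows as the model assigns it less probability mass.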


require 'optim';
  • Optimization package for nn.
  • Provides training algorithms like SGD, LBFGS, etc.
  • Uses closures
parameters, gradParameters = net:getParameters()

-- Define a closure that computes the loss and dloss/dx.
feval = function(x)
    -- reset gradients (backward accumulates into gradParameters)
    gradParameters:zero()

    -- 1. compute outputs (log probabilities) for each data point
    local output = net:forward(input)
    -- 2. compute the loss of these outputs, measured against the true labels
    local loss = criterion:forward(output, 3)
    -- 3. compute the derivative of the loss wrt the outputs of the model
    local dloss_doutput = criterion:backward(output, 3)
    -- 4. backpropagate to accumulate the gradient of the loss wrt the weights
    net:backward(input, dloss_doutput)

    -- optim expects us to return
    -- loss, (gradient of loss with respect to the weights)
    return loss, gradParameters
end

-- Define SGD parameters.
sgd_params = {
    learningRate = 1e-2,
    learningRateDecay = 1e-4,
    weightDecay = 0,
    momentum = 0
}

-- train for a number of epochs
epochs = 1e2
losses = {}
for i = 1,epochs do
    -- one step of SGD optimization (steepest descent)
    _,local_loss = optim.sgd(feval, parameters, sgd_params)
    -- accumulate error
    losses[#losses + 1] = local_loss[1]
end

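
The closure-based interface can be tried without Torch on a one-dimensional problem. The sketch below is a toy stand-in, not the optim source: toySgd is a hypothetical function that applies plain SGD with a learning rate decayed as learningRate / (1 + t * learningRateDecay), and the loss (x - 3)^2 is an arbitrary choice:

```lua
-- Toy stand-in for an SGD step on a single number (NOT the optim source).
local function toySgd(feval, x, config, state)
  state.evals = state.evals or 0
  local loss, grad = feval(x)
  local rate = config.learningRate / (1 + state.evals * config.learningRateDecay)
  state.evals = state.evals + 1
  return x - rate * grad, loss
end

-- Closure returning the loss (x - 3)^2 and its derivative 2 * (x - 3).
local function feval(x)
  return (x - 3) ^ 2, 2 * (x - 3)
end

local x, state, losses = 0, {}, {}
local config = {learningRate = 0.1, learningRateDecay = 1e-4}
for i = 1, 100 do
  local loss
  x, loss = toySgd(feval, x, config, state)
  losses[#losses + 1] = loss
end
print(x, losses[1], losses[#losses])
```

The optimizer never sees the model; it only calls the closure, which is the same separation the feval function above provides for the network.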
Network graphs

nngraph provides graphical computation for the nn library in Torch.

require 'nngraph';
  • nngraph overloads the call operator (i.e. the () operator used for function calls) on all nn.Module objects.
  • When the call operator is invoked, it wraps the nn.Module in a graph node (nngraph.Node); nn.gModule later assembles these nodes into a single module.
  • The argument to the call operator specifies which modules will feed into this one during a forward pass.
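
How a call operator can be overloaded at all is plain Lua: the __call metamethod. The sketch below is a toy illustration of recording graph edges this way, not nngraph's implementation; Module and newModule are hypothetical names:

```lua
-- Toy illustration (NOT nngraph's implementation) of overloading the call
-- operator with the __call metamethod to record graph edges.
local Module = {}
Module.__index = Module

local function newModule(name)
  return setmetatable({name = name}, Module)
end

-- Calling a module wraps it in a node that remembers its parent nodes.
function Module.__call(self, parents)
  return {module = self, parents = parents or {}}
end

local x1 = newModule('input1')()
local x2 = newModule('input2')()
local sum = newModule('add')({x1, x2})

for _, parent in ipairs(sum.parents) do
  print(sum.module.name .. ' <- ' .. parent.module.name)
end
```

The double parentheses mirror the nn.Identity()() pattern used in these notes: the first pair constructs the module and the second pair invokes the overloaded call.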
add = nn.CAddTable()
t1 = torch.Tensor{1,2,3}
t2 = torch.Tensor{4,5,6}
output = add:forward({t1, t2})
print(t1 + t2)
x1 = nn.Identity()()
x2 = nn.Identity()()
a = nn.CAddTable()({x1, x2})
m = nn.gModule({x1, x2}, {a})
print(m:forward{t1, t2})

Vanilla RNN

input_size = 3
rnn_size = 2
inputs = {}
table.insert(inputs, nn.Identity()()) -- network input
table.insert(inputs, nn.Identity()()) -- h at time t-1
input = inputs[1]
prev_h = inputs[2]
i2h = nn.Linear(input_size, rnn_size)(input) -- input to hidden
h2h = nn.Linear(rnn_size, rnn_size)(prev_h)  -- hidden to hidden
next_h = nn.Tanh()(nn.CAddTable(){i2h, h2h})
outputs = {}
table.insert(outputs, next_h)

-- packs the graph into a convenient module with standard API (:forward(), :backward())
RNN = nn.gModule(inputs, outputs)
print(RNN:forward{torch.randn(1,3), torch.randn(1,2)})
print(RNN:backward({torch.Tensor{0,0,0}, torch.Tensor{0,0}}, torch.randn(1,2)))

LSTM

Gates:

$$ i_t = g(W_{xi}x_t + W_{hi}h_{t-1} + b_i) $$
$$ f_t = g(W_{xf}x_t + W_{hf}h_{t-1} + b_f) $$
$$ o_t = g(W_{xo}x_t + W_{ho}h_{t-1} + b_o) $$

Input transform:

$$ c\_in_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c) $$

State update:

$$ c_t = f_t \cdot c_{t-1} + i_t \cdot c\_in_t $$
$$ h_t = o_t \cdot \tanh(c_t) $$

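
Before building the graph, the update equations can be checked by hand on scalars. The sketch below evaluates one step with rnn_size = 1 in plain Lua, taking g to be the sigmoid. The weight layout w.<gate> = {x, h, b} and the function lstmStep are hypothetical choices made for this illustration:

```lua
local function sigmoid(z) return 1 / (1 + math.exp(-z)) end
local function tanh(z)
  local e = math.exp(2 * z)
  return (e - 1) / (e + 1)
end

-- One LSTM step on scalars; w holds one {x, h, b} triple per gate.
local function lstmStep(w, x, prevC, prevH)
  local function pre(p) return p.x * x + p.h * prevH + p.b end
  local i = sigmoid(pre(w.i))    -- input gate
  local f = sigmoid(pre(w.f))    -- forget gate
  local o = sigmoid(pre(w.o))    -- output gate
  local cIn = tanh(pre(w.c))     -- input transform
  local c = f * prevC + i * cIn  -- state update
  local h = o * tanh(c)
  return c, h
end

-- With all-zero weights every gate is 0.5 and the input transform is 0.
local zero = {x = 0, h = 0, b = 0}
local w = {i = zero, f = zero, o = zero, c = zero}
local c, h = lstmStep(w, 1.0, 2.0, 0.0)
print(c, h)
```

With these weights the new cell state is half of the previous one, which makes the role of the forget gate easy to see in isolation.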
input_size = 3
rnn_size = 2
inputs = {}
table.insert(inputs, nn.Identity()())   -- network input
table.insert(inputs, nn.Identity()())   -- c at time t-1
table.insert(inputs, nn.Identity()())   -- h at time t-1
input = inputs[1]
prev_c = inputs[2]
prev_h = inputs[3]

$$ i_t = g(W_{xi}x_t + W_{hi}h_{t-1} + b_i) $$
$$ f_t = g(W_{xf}x_t + W_{hf}h_{t-1} + b_f) $$
$$ o_t = g(W_{xo}x_t + W_{ho}h_{t-1} + b_o) $$
$$ c\_in_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c) $$

i2h = nn.Linear(input_size, 4 * rnn_size)(input)  -- input to hidden
h2h = nn.Linear(rnn_size, 4 * rnn_size)(prev_h)   -- hidden to hidden
preactivations = nn.CAddTable()({i2h, h2h})       -- i2h + h2h

All the gates use the sigmoid, and the input preactivation uses tanh.

-- gates
pre_sigmoid_chunk = nn.Narrow(2, 1, 3 * rnn_size)(preactivations)
all_gates = nn.Sigmoid()(pre_sigmoid_chunk)

-- input
in_chunk = nn.Narrow(2, 3 * rnn_size + 1, rnn_size)(preactivations)
in_transform = nn.Tanh()(in_chunk)
in_gate = nn.Narrow(2, 1, rnn_size)(all_gates)
forget_gate = nn.Narrow(2, rnn_size + 1, rnn_size)(all_gates)
out_gate = nn.Narrow(2, 2 * rnn_size + 1, rnn_size)(all_gates)

Cell and Hidden states

$$ c_t = f_t \cdot c_{t-1} + i_t \cdot c\_in_t $$
$$ h_t = o_t \cdot \tanh(c_t) $$

-- previous cell state contribution
c_forget = nn.CMulTable()({forget_gate, prev_c})
-- input contribution
c_input = nn.CMulTable()({in_gate, in_transform})
-- next cell state
next_c = nn.CAddTable()({c_forget, c_input})
c_transform = nn.Tanh()(next_c)
next_h = nn.CMulTable()({out_gate, c_transform})

Defining the module

-- module outputs
outputs = {}
table.insert(outputs, next_c)
table.insert(outputs, next_h)

-- packs the graph into a convenient module with standard API (:forward(), :backward())
LSTM = nn.gModule(inputs, outputs)
print(LSTM:forward{torch.zeros(1,3), torch.zeros(1,2), torch.zeros(1,2)})
print(LSTM:backward({torch.randn(1,3), torch.randn(1,2), torch.randn(1,2)}, {torch.zeros(1,2), torch.zeros(1,2)}))
