# Deep Learning with Torch

Credits: Soumith Chintala, Nicholas Leonard, Tyler Neylon, Adam Paszke

## What is Torch?

Torch is a scientific computing framework based on Lua[JIT] with strong CPU and CUDA backends.

Strong points of Torch:

• Efficient Tensor library (like NumPy) with an efficient CUDA backend
• Neural Networks package: build arbitrary acyclic computation graphs with automatic differentiation, also with fast CUDA and CPU backends
• Good community and industry support: several hundred community-built and maintained packages

## Introduction - Lua

• Lua is pretty close to JavaScript.
• Variables are global by default, unless the local keyword is used.
• Only one data structure is built in, a table: {}. It doubles as a hash table and an array.
• 1-based indexing.
• foo:bar() is the same as foo.bar(foo).
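
The points above can be checked with a few lines of plain Lua. This is a small sketch; the names `counter`, `bump`, `scratch`, and `greeter` are made up for illustration.

```lua
-- Variables are global unless declared with `local`.
function bump()
  counter = (counter or 0) + 1   -- no `local`: this writes a global
  local scratch = counter * 2    -- visible only inside bump()
  return scratch
end
bump()
print(counter)   -- 1
print(scratch)   -- nil, the local did not leak out of bump()

-- One data structure: the same table holds an array part and a hash part.
local t = {'x', 'y', name = 'demo'}
print(t[1], #t, t.name)   -- x  2  demo (indexing starts at 1)

-- foo:bar() is sugar for foo.bar(foo).
local greeter = {word = 'hi'}
function greeter.say(self) return self.word end
print(greeter:say() == greeter.say(greeter))   -- true
```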

### Strings, numbers, tables

In [ ]:
a = 'hello'

In [ ]:
print(a)

In [ ]:
b = {}

In [ ]:
b[1] = a

In [ ]:
print(b)

In [ ]:
b[2] = 30

In [ ]:
for i = 1, #b do  -- # is the length operator; on tables it is reliable only for sequences (keys 1..n with no gaps)
  print(b[i])
end


### If-Else

In [ ]:
num = 40
if num == 40 then
  print('40')
elseif num ~= 40 then  -- ~= is not equals.
  print('not 40')
end


### Functions

In [ ]:
function fib(n)
  if n < 2 then return 1 end
  return fib(n - 2) + fib(n - 1)
end

In [ ]:
-- Closures
function adder(x)
  -- The returned function is created when adder is
  -- called, and remembers the value of x:
  return function (y) return x + y end
end
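
A short usage sketch for the closure: each call to `adder` creates a new function with its own captured `x`. The names `add5` and `add100` are made up for illustration.

```lua
-- Same adder as in the cell above, repeated so this snippet runs on its own.
function adder(x)
  return function (y) return x + y end
end

add5 = adder(5)       -- captures x = 5
add100 = adder(100)   -- a separate closure with its own x = 100
print(add5(3))        -- 8
print(add100(3))      -- 103
```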


### Tables

In [ ]:
a = {1, 2, ['a'] = 3}

In [ ]:
a = {1, 2, a = 3, print = function(self) print(self) end}

In [ ]:
a:print()

In [ ]:
b = {[1] = 'c'}  -- non-identifier keys need brackets; {1 = 'c'} is a syntax error

In [ ]:
b


#### Metatables and Metamethods

• Metatables allow us to change the behavior of a table. For instance, using metatables, we can define how Lua computes the expression a+b, where a and b are tables.
• Metatable entries such as __add and __index are called metamethods.

In [ ]:
f1 = {a = 1, b = 2}  -- Represents the fraction a/b.
f2 = {a = 2, b = 3}

In [ ]:
metafraction = {}
function metafraction.__add(f1, f2)
  local sum = {}  -- local, so the result does not leak into a global
  sum.b = f1.b * f2.b
  sum.a = f1.a * f2.b + f2.a * f1.b
  return sum
end

setmetatable(f1, metafraction)

In [ ]:
f1 + f2

In [ ]:
-- An __index on a metatable overloads dot lookups
defaultFavs = {animal = 'gru', food = 'donuts'}
myFavs = {food = 'pizza'}
setmetatable(myFavs, {__index = defaultFavs})
print(myFavs.animal)


Direct table lookups that fail will retry using the metatable's __index value, and this recurses.
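
The recursion can be seen with a chain of three tables. This is a plain-Lua sketch; `company`, `team`, and `mine` are made-up names.

```lua
-- Three levels: failed lookups fall through mine -> team -> company.
company = {lunch = 'noon', chat = 'irc'}
team = setmetatable({chat = 'slack'}, {__index = company})
mine = setmetatable({}, {__index = team})

print(mine.chat)     -- slack (found one level up)
print(mine.lunch)    -- noon (found two levels up)
print(mine.salary)   -- nil (not found anywhere in the chain)

-- rawget skips metatables, so it only sees keys stored in the table itself.
print(rawget(mine, 'chat'))   -- nil
```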

#### Classes

• Classes aren't built in; there are different ways to make them using tables and metatables.

In [ ]:
Dog = {}                                   -- 1.

function Dog:new()                         -- 2.
  local newObj = {sound = 'woof'}          -- 3.
  self.__index = self                      -- 4.
  return setmetatable(newObj, self)        -- 5.
end

function Dog:makeSound()                   -- 6.
  print('I say ' .. self.sound)
end

mrDog = Dog:new()                          -- 7.
-- mrDog['sound']
mrDog:makeSound()                          -- 8.
-- print(mrDog['sound'])

1. Dog acts like a class; it's really a table.
2. function tablename:fn(...) is the same as function tablename.fn(self, ...). The : just adds a first argument called self. Read 7 & 8 below for how self gets its value.
3. newObj will be an instance of class Dog.
4. self = the class being instantiated. Often self = Dog, but inheritance can change it. newObj gets self's functions when we set both newObj's metatable and self's __index to self.
5. Reminder: setmetatable returns its first arg.
6. The : works as in 2, but this time we expect self to be an instance instead of a class.
7. Same as Dog.new(Dog), so self = Dog in new().
8. Same as mrDog.makeSound(mrDog); self = mrDog.
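
Note 4 mentions that inheritance can change `self`. A minimal sketch of that case follows, using the `self.__index = self` style the notes describe. `LoudDog` and `rex` are hypothetical names, and `makeSound` returns its string here instead of printing it so the result is easy to inspect.

```lua
-- Base class.
Dog = {}
function Dog:new()
  local newObj = {sound = 'woof'}
  self.__index = self
  return setmetatable(newObj, self)
end
function Dog:makeSound()
  return 'I say ' .. self.sound
end

-- LoudDog is an instance of Dog that is then used as a class itself.
LoudDog = Dog:new()
function LoudDog:makeSound()   -- overrides Dog's method
  return 'I say ' .. self.sound:upper() .. '!'
end

rex = LoudDog:new()            -- new() is found on Dog, but self = LoudDog
print(rex:makeSound())         -- I say WOOF!
print(Dog:new():makeSound())   -- I say woof
```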

### Tensors

https://github.com/torch/torch7/blob/master/doc/tensor.md

Tensors are the main class of objects used in Torch 7:

• An N-dimensional array that views an underlying Storage (a contiguous 1D-array)
• Different Tensors can share the same Storage
• Different types: FloatTensor, DoubleTensor, IntTensor, CudaTensor, and so on
• Implements most Basic Linear Algebra Subprograms (BLAS)
• Supports random initialization, indexing, transposition, sub-tensor extractions, and more
• Most operations for Float/Double are also implemented for Cuda Tensors (via cutorch)

In [ ]:
a = torch.Tensor(5,3) -- construct a 5x3 matrix (initialized with garbage content, whatever was already there)

In [ ]:
a = torch.rand(5,3)

In [ ]:
print(a)

In [ ]:
b=torch.rand(3,4)

In [ ]:
print(b)

In [ ]:
-- matrix-matrix multiplication: syntax 1
a*b

In [ ]:
-- matrix-matrix multiplication: syntax 2
torch.mm(a,b)

In [ ]:
-- matrix-matrix multiplication: syntax 3
c=torch.Tensor(5,4)
c:mm(a,b) -- store the result of a*b in c
print(c)

In [ ]:
b = a  -- assignment does not copy: a and b now refer to the same tensor
print(a)
print(b)

In [ ]:
a:fill(1)

In [ ]:
c = a:clone()
print(a)
print(c)

In [ ]:
m = torch.Tensor(a:size()):copy(a)
print(m)

In [ ]:
c:fill(1.8)
print(a)
print(c)

In [ ]:
a = torch.Tensor(2, 3)

In [ ]:
b = torch.Tensor(3, 2)

In [ ]:
x = a:storage()
for i = 1, x:size() do
  x[i] = i
end

In [ ]:
b[{{}, {1}}]:fill(2)
b[{{}, {2}}]:fill(3)

In [ ]:
print(a)
print(b)

In [ ]:
print(a*b)


#### CUDA Tensors

Tensors can be moved onto the GPU using the :cuda() function.

In [ ]:
require 'cutorch';
a = a:cuda()
b = b:cuda()
c = c:cuda()
c:mm(a,b) -- done on GPU


#### Transpose

In [ ]:
a = torch.rand(5, 3)
print(a)

In [ ]:
b = a:transpose(1, 2)
print(b)


Tensors a and b share the same underlying storage.

In [ ]:
a[{2,1}] = 5
print(a)
print(b)

In [ ]:
print(a:storage())
print(b:storage())

In [ ]:
print(a:isContiguous())
print(b:isContiguous())
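
One way to picture why a transposed tensor shares storage but is no longer contiguous is to think in terms of sizes and strides over a flat array. The following is a plain-Lua sketch of that idea, not Torch's actual implementation; `view`, `at`, and `transposed` are hypothetical helpers.

```lua
-- A flat 1D "storage" holding a 2x3 matrix in row-major order.
storage = {1, 2, 3, 4, 5, 6}

-- A view is just sizes and strides over the storage; it owns no data.
function view(size, stride)
  return {size = size, stride = stride}
end

-- Element (i, j) lives at this 1-based position in the storage.
function at(v, i, j)
  return storage[1 + (i - 1) * v.stride[1] + (j - 1) * v.stride[2]]
end

-- Transposing swaps sizes and strides; the storage is untouched.
function transposed(v)
  return view({v.size[2], v.size[1]}, {v.stride[2], v.stride[1]})
end

a = view({2, 3}, {3, 1})   -- contiguous: walking a row moves 1 slot at a time
b = transposed(a)          -- 3x2 view with strides {1, 3}: not contiguous

print(at(a, 2, 1), at(b, 1, 2))   -- 4  4 (the same storage slot)
storage[4] = 50                   -- write through the storage
print(at(a, 2, 1), at(b, 1, 2))   -- 50  50 (both views see the change)
```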


### Neural Networks

Neural networks in Torch can be constructed using the nn package.

In [ ]:
require 'nn';

• implements feed-forward neural networks
• neural networks form a computational flow-graph of transformations
• backpropagation computes gradients using the chain rule; gradient descent then uses those gradients to update the weights

For example, look at this network that classifies digit images:

It is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.

The nn.Sequential container implements this pattern: it feeds the input through its layers in order.

In [ ]:
net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 6, 5, 5)) -- 1 input image channel, 6 output channels, 5x5 convolution kernel
net:add(nn.SpatialMaxPooling(2,2,2,2))     -- A max-pooling operation that looks at 2x2 windows and finds the max.
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5))                   -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120))            -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.Linear(120, 84))
net:add(nn.Linear(84, 10))                 -- 10 is the number of outputs of the network (in this case, 10 digits)
net:add(nn.LogSoftMax())                   -- converts the output to a log-probability. Useful for classification problems

print(net:__tostring());

In [ ]:
mlp=nn.Parallel(2,1);     -- iterate over dimension 2 of input
mlp:add(nn.Linear(10,3)); -- apply to first slice
mlp:add(nn.Linear(10,2))  -- apply to second slice
x = torch.randn(10,2)
print(x)
print(mlp:forward(x))


Other examples of nn containers are shown in the figure below:

Every neural network module in Torch supports automatic differentiation. It has a :forward(input) function that computes the output for a given input, flowing the input through the network. It also has a :backward(input, gradient) function that takes the gradient of the loss with respect to the module's output and, via the chain rule, computes the gradient with respect to its input and accumulates the gradients with respect to its parameters.
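
To make the forward/backward pattern concrete, here is a toy sketch in plain Lua with two scalar "modules". It only mimics the shape of the nn interface; `Square` and `Scale` are made-up names, not nn classes.

```lua
-- Square computes y = x^2, Scale computes y = k * x.
-- backward(x, gradOutput) returns the gradient w.r.t. the module's input.
Square = {}
function Square.forward(x) return x * x end
function Square.backward(x, gradOutput) return 2 * x * gradOutput end

Scale = {k = 3}
function Scale.forward(x) return Scale.k * x end
function Scale.backward(x, gradOutput) return Scale.k * gradOutput end

-- Forward pass: feed the input through the layers in order.
x = 2
h = Square.forward(x)   -- 4
y = Scale.forward(h)    -- 12

-- Backward pass: walk the layers in reverse, passing the gradient along.
gradY = 1                           -- dy/dy
gradH = Scale.backward(h, gradY)    -- dy/dh = 3
gradX = Square.backward(x, gradH)   -- dy/dx = 3 * 2x = 12
print(y, gradX)                     -- 12  12
```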

In [ ]:
input = torch.rand(1,32,32) -- pass a random tensor as input to the network

In [ ]:
output = net:forward(input)

In [ ]:
net:zeroGradParameters() -- zero the internal gradient buffers of the network (will come to this later)

In [ ]:
gradInput = net:backward(input, torch.rand(10))

In [ ]:
print(#gradInput)


#### Criterion: Defining a loss function

When you want a model to learn to do something, you give it feedback on how well it is doing. This function that computes an objective measure of the model's performance is called a loss function.

A typical loss function takes in the model's output and the groundtruth and computes a value that quantifies the model's performance.

The model then corrects itself to have a smaller loss.

In Torch, loss functions are implemented just like neural network modules, and have automatic differentiation.
They have two functions: forward(input, target) and backward(input, target).
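
As a rough illustration of that forward/backward pair, the sketch below computes a negative log-likelihood loss for a single sample in plain Lua. It assumes the input is a table of log-probabilities and ignores batching and class weights; `nllForward` and `nllBackward` are hypothetical helpers, not part of nn.

```lua
-- Loss: minus the log-probability assigned to the correct class.
function nllForward(logProbs, target)
  return -logProbs[target]
end

-- Gradient w.r.t. the log-probabilities: -1 at the target, 0 elsewhere.
function nllBackward(logProbs, target)
  local grad = {}
  for i = 1, #logProbs do grad[i] = 0 end
  grad[target] = -1
  return grad
end

-- Log-probabilities for 3 classes (probabilities 0.7, 0.2, 0.1).
logProbs = {math.log(0.7), math.log(0.2), math.log(0.1)}
print(nllForward(logProbs, 1))   -- small loss: class 1 was considered likely
print(nllForward(logProbs, 3))   -- larger loss: class 3 was considered unlikely
print(table.concat(nllBackward(logProbs, 3), ' '))   -- 0 0 -1
```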

For example:

In [ ]:
criterion = nn.ClassNLLCriterion() -- a negative log-likelihood criterion for multi-class classification

In [ ]:
loss = criterion:forward(output, 3) -- let's say the groundtruth was class number: 3

In [ ]:
gradients = criterion:backward(output, 3)

In [ ]:
gradInput = net:backward(input, gradients)


#### Training

In [ ]:
require 'optim';

• Optimization package for nn.
• Provides training algorithms like SGD, LBFGS, etc.
• Uses closures
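
The closure pattern can be sketched without Torch: the optimizer is handed a function that returns the loss and its gradient at the current parameters. The example below is plain Lua on a single number; `sgdStep` is a hypothetical stand-in and does not reproduce optim.sgd, which works on tensors and supports decay and momentum.

```lua
-- feval returns the loss and its gradient at x.
-- Here the "model" is a single number and the loss is (x - 3)^2.
function feval(x)
  local loss = (x - 3) ^ 2
  local grad = 2 * (x - 3)
  return loss, grad
end

-- One plain SGD step: move against the gradient.
function sgdStep(f, x, learningRate)
  local loss, grad = f(x)
  return x - learningRate * grad, loss
end

x = 0
for i = 1, 100 do
  x, loss = sgdStep(feval, x, 0.1)
end
print(x, loss)   -- x approaches 3 and the loss approaches 0
```
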
In [ ]:
parameters, gradParameters = net:getParameters()

-- Define a closure that computes the loss and dloss/dx.
feval = function(x)
  -- use the parameters the optimizer passes in
  if x ~= parameters then parameters:copy(x) end
  -- reset gradients
  gradParameters:zero()

  -- 1. compute outputs (log probabilities) for each data point
  local output = net:forward(input)
  -- 2. compute the loss of these outputs, measured against the true labels
  local loss = criterion:forward(output, 3)
  -- 3. compute the derivative of the loss wrt the outputs of the model
  local dloss_doutput = criterion:backward(output, 3)
  -- 4. backpropagate, accumulating dloss/dparameters into gradParameters
  --    (the weights themselves are updated by the optimizer, not here)
  net:backward(input, dloss_doutput)

  -- optim expects us to return
  -- loss, (gradient of loss with respect to the weights)
  return loss, gradParameters
end

In [ ]:
-- Define SGD parameters.
sgd_params = {
  learningRate = 1e-2,
  learningRateDecay = 1e-4,
  weightDecay = 0,
  momentum = 0
}

-- train for a number of epochs
epochs = 1e2
losses = {}
for i = 1,epochs do
  -- one step of SGD optimization (steepest descent)
  _,local_loss = optim.sgd(feval, parameters, sgd_params)
  -- accumulate error
  losses[#losses + 1] = local_loss[1]
end
print(losses[1])
print(losses[#losses])

In [ ]:
print(torch.exp(net:forward(input)))


### Network graphs

https://github.com/torch/nngraph

nngraph provides graphical computation for the nn library in Torch.

In [ ]:
require 'nngraph';

• nngraph overloads the call operator (i.e. the () operator used for function calls) on all nn.Module objects.
• When the call operator is invoked, it returns a graph node (nngraph.Node) that wraps the nn.Module.
• The argument to the call operator specifies which modules will feed into this one during a forward pass.
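
The mechanism behind this is Lua's `__call` metamethod. The toy sketch below, in plain Lua, shows how calling an object can build a graph node that records its inputs. It is not nngraph's implementation; `Module` and the node layout are made up for illustration.

```lua
-- A toy "module" class whose instances can be called like functions.
-- Calling one returns a node that records which nodes feed into it.
Module = {}
Module.__index = Module
Module.__call = function(self, parents)
  return {module = self, parents = parents or {}}
end

function Module.new(name)
  return setmetatable({name = name}, Module)
end

x1 = Module.new('input1')()         -- node with no parents
x2 = Module.new('input2')()
sum = Module.new('add')({x1, x2})   -- node fed by x1 and x2

print(sum.module.name)              -- add
print(#sum.parents)                 -- 2
print(sum.parents[1].module.name)   -- input1
```
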
In [ ]:
add = nn.CAddTable()
t1 = torch.Tensor{1,2,3}
t2 = torch.Tensor{4,5,6}
output = add:forward({t1, t2})

In [ ]:
print(t1 + t2)

In [ ]:
x1 = nn.Identity()()
x2 = nn.Identity()()
a = nn.CAddTable()({x1, x2})
m = nn.gModule({x1, x2}, {a})

In [ ]:
print(m:forward{t1, t2})


#### Vanilla RNN

In [ ]:
input_size = 3
rnn_size = 2

In [ ]:
inputs = {}
table.insert(inputs, nn.Identity()()) -- network input
table.insert(inputs, nn.Identity()()) -- h at time t-1
input = inputs[1]
prev_h = inputs[2]

In [ ]:
i2h = nn.Linear(input_size, rnn_size)(input) -- input to hidden
h2h = nn.Linear(rnn_size, rnn_size)(prev_h)  -- hidden to hidden

In [ ]:
next_h = nn.Tanh()(nn.CAddTable(){i2h, h2h})

In [ ]:
outputs = {}
table.insert(outputs, next_h)

-- packs the graph into a convenient module with standard API (:forward(), :backward())
RNN = nn.gModule(inputs, outputs)

In [ ]:
print(RNN:forward{torch.randn(1,3), torch.randn(1,2)})

In [ ]:
-- inputs have the same shapes as in the forward call: 1x3 input, 1x2 previous hidden state
print(RNN:backward({torch.zeros(1,3), torch.zeros(1,2)}, torch.randn(1,2)))

In [ ]:
RNN:get(1)


#### LSTM Implementation

Gates:

$$i_t = g(W_{xi}x_t + W_{hi}h_{t-1} + b_i)$$
$$f_t = g(W_{xf}x_t + W_{hf}h_{t-1} + b_f)$$
$$o_t = g(W_{xo}x_t + W_{ho}h_{t-1} + b_o)$$

Input transform:

$$c\_in_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$$

State update:

$$c_t = f_t \cdot c_{t-1} + i_t \cdot c\_in_t$$
$$h_t = o_t \cdot \tanh(c_t)$$
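
Before building the graph, it can help to see one step of these equations on plain numbers. The sketch below is a scalar version in plain Lua, taking g to be the logistic sigmoid. The weight table `W` and the function `lstmStep` are made up for illustration.

```lua
local function sigmoid(z) return 1 / (1 + math.exp(-z)) end
local function tanh(z)
  local e = math.exp(2 * z)
  return (e - 1) / (e + 1)
end

-- One LSTM step on scalars; W holds weights W.xi, W.hi, bias W.bi, and so on.
function lstmStep(W, x, prevC, prevH)
  local i = sigmoid(W.xi * x + W.hi * prevH + W.bi)   -- input gate
  local f = sigmoid(W.xf * x + W.hf * prevH + W.bf)   -- forget gate
  local o = sigmoid(W.xo * x + W.ho * prevH + W.bo)   -- output gate
  local cIn = tanh(W.xc * x + W.hc * prevH + W.bc)    -- input transform
  local c = f * prevC + i * cIn                       -- new cell state
  local h = o * tanh(c)                               -- new hidden state
  return c, h
end

-- With all weights and biases zero, every gate is 0.5 and c_in is 0,
-- so the cell state is halved and h = 0.5 * tanh(c).
zeroW = {xi = 0, hi = 0, bi = 0, xf = 0, hf = 0, bf = 0,
         xo = 0, ho = 0, bo = 0, xc = 0, hc = 0, bc = 0}
c, h = lstmStep(zeroW, 1, 2, 0)
print(c, h)   -- c = 1, h = 0.5 * tanh(1)
```
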
In [ ]:
input_size = 3
rnn_size = 2

In [ ]:
inputs = {}
table.insert(inputs, nn.Identity()())   -- network input
table.insert(inputs, nn.Identity()())   -- c at time t-1
table.insert(inputs, nn.Identity()())   -- h at time t-1
input = inputs[1]
prev_c = inputs[2]
prev_h = inputs[3]

$$i_t = g(W_{xi}x_t + W_{hi}h_{t-1} + b_i)$$
$$f_t = g(W_{xf}x_t + W_{hf}h_{t-1} + b_f)$$
$$o_t = g(W_{xo}x_t + W_{ho}h_{t-1} + b_o)$$
$$c\_in_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$$

In [ ]:
i2h = nn.Linear(input_size, 4 * rnn_size)(input)  -- input to hidden
h2h = nn.Linear(rnn_size, 4 * rnn_size)(prev_h)   -- hidden to hidden
preactivations = nn.CAddTable()({i2h, h2h})       -- i2h + h2h


All the gates use the sigmoid, and the input preactivation uses tanh.

In [ ]:
-- gates
pre_sigmoid_chunk = nn.Narrow(2, 1, 3 * rnn_size)(preactivations)
all_gates = nn.Sigmoid()(pre_sigmoid_chunk)

-- input
in_chunk = nn.Narrow(2, 3 * rnn_size + 1, rnn_size)(preactivations)
in_transform = nn.Tanh()(in_chunk)

In [ ]:
in_gate = nn.Narrow(2, 1, rnn_size)(all_gates)
forget_gate = nn.Narrow(2, rnn_size + 1, rnn_size)(all_gates)
out_gate = nn.Narrow(2, 2 * rnn_size + 1, rnn_size)(all_gates)
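
The index arithmetic in the Narrow calls is easier to follow on a plain array. The sketch below mimics slicing along one dimension in plain Lua and leaves out the sigmoid and tanh. `narrow` is a hypothetical helper and the values in `pre` are placeholders.

```lua
-- narrow(t, first, size) returns elements first .. first + size - 1.
function narrow(t, first, size)
  local out = {}
  for k = 0, size - 1 do out[#out + 1] = t[first + k] end
  return out
end

rnn_size = 2
pre = {11, 12, 21, 22, 31, 32, 41, 42}   -- stand-in for 4 * rnn_size preactivations

gates   = narrow(pre, 1, 3 * rnn_size)              -- first 6 values: the three gates
inChunk = narrow(pre, 3 * rnn_size + 1, rnn_size)   -- last 2 values: input transform

inGate     = narrow(gates, 1, rnn_size)
forgetGate = narrow(gates, rnn_size + 1, rnn_size)
outGate    = narrow(gates, 2 * rnn_size + 1, rnn_size)

print(table.concat(inGate, ' '))       -- 11 12
print(table.concat(forgetGate, ' '))   -- 21 22
print(table.concat(outGate, ' '))      -- 31 32
print(table.concat(inChunk, ' '))      -- 41 42
```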


Cell and Hidden states

$$c_t = f_t \cdot c_{t-1} + i_t \cdot c\_in_t$$
$$h_t = o_t \cdot \tanh(c_t)$$

In [ ]:
-- previous cell state contribution
c_forget = nn.CMulTable()({forget_gate, prev_c})
-- input contribution
c_input = nn.CMulTable()({in_gate, in_transform})
-- next cell state
next_c = nn.CAddTable()({c_forget, c_input})

In [ ]:
c_transform = nn.Tanh()(next_c)
next_h = nn.CMulTable()({out_gate, c_transform})


Defining the module

In [ ]:
-- module outputs
outputs = {}
table.insert(outputs, next_c)
table.insert(outputs, next_h)

-- packs the graph into a convenient module with standard API (:forward(), :backward())
LSTM = nn.gModule(inputs, outputs)

In [ ]:
-- all-zero input, previous cell state, and previous hidden state
print(LSTM:forward{torch.zeros(1,3), torch.zeros(1,2), torch.zeros(1,2)})

In [ ]:
-- first table: inputs {x, prev_c, prev_h}; second table: gradients w.r.t. the outputs {next_c, next_h}
print(LSTM:backward({torch.randn(1,3), torch.randn(1,2), torch.randn(1,2)}, {torch.zeros(1,2), torch.zeros(1,2)}))

In [ ]:
LSTM.modules