Jekyll 2018-10-21T11:30:36+05:30 https://iitmcvg.github.io/CVI
Official site for CVI-IIT Madras
Computer Vision and Intelligence (cvigroup.cfi@gmail.com)

Using our Docker Containers (2018-09-16) https://iitmcvg.github.io/docker/docker-intro

<p>In this post, we explain how you can get started with our containers - for Python, deep learning, computer vision or probabilistic modelling.</p>
<p>We are using these containers to manage session requirements for the Fall of 2018.</p>
<h2 id="getting-started-with-docker">Getting Started with Docker</h2>
<p>Firstly, download a compatible version of docker from here:</p>
<ul>
<li>For installation of Docker in Windows - <a href="https://download.docker.com/win/stable/DockerToolbox.exe">link</a></li>
<li>For installation of Docker in Linux(Ubuntu) - direct curl command available.</li>
<li>For installation of Docker in Mac - <a href="https://download.docker.com/mac/stable/Docker.dmg">link</a></li>
</ul>
<p><em>Note: while installing Docker Toolbox for Windows, check all options including UEFI and Virtualisation.</em></p>
<h2 id="installation">Installation</h2>
<h3 id="linux">Linux</h3>
<ul>
<li>
<p>Run <code class="highlighter-rouge">curl -fsSL get.docker.com | sh</code> to get the latest version of docker.</p>
</li>
<li>
<p>Open your terminal, run <code class="highlighter-rouge">docker --version</code> to output the version.</p>
</li>
</ul>
<h3 id="macos">macOS</h3>
<ul>
<li>
<p>After you have installed docker, drag it to your applications folder.</p>
</li>
<li>
<p>Run the docker app, and open a new terminal.</p>
</li>
<li>
<p>Verify your docker version with <code class="highlighter-rouge">docker --version</code>.</p>
</li>
</ul>
<h3 id="windows">Windows:</h3>
<ul>
<li>Before installation, additional software packages like Kitematic and VirtualBox need to be selected; ensure you check all of them in the installer.</li>
<li>The installer adds Docker Toolbox to your Applications folder.</li>
<li>On your Desktop, find the Docker QuickStart Terminal icon.</li>
<li>Click the Docker QuickStart icon to launch a pre-configured Docker Toolbox terminal.</li>
<li>If the system displays a User Account Control prompt asking to allow VirtualBox to make changes to your computer, choose Yes.</li>
<li>
<p>The terminal does several things to set up Docker Toolbox for you. When it is done, the terminal displays a prompt.</p>
</li>
<li>The prompt is traditionally a $ (dollar) sign. You type commands into the command line, which is the area after the prompt. Your cursor is indicated by a highlighted area or a | that appears in the command line. After typing a command, always press RETURN.</li>
</ul>
<h2 id="docker-hello-world">Docker Hello World</h2>
<ul>
<li>Type the <strong>docker run hello-world</strong> command and press RETURN.</li>
<li>The command does some work for you; if everything runs well, the command’s output looks like this:</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ docker run hello-world
Unable to find image 'hello-world:latest' locally
Pulling repository hello-world
91c95931e552: Download complete
… … … …
</code></pre></div></div>
<h2 id="running-the-container">Running the Container</h2>
<ul>
<li>
<p>If you are on Windows, run the command <strong>docker-machine ip</strong> and make a note of the IP address shown in the output.</p>
</li>
<li>
<p>Run the container:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run -it --name cvi --rm -p 8888:8888 iitmcvg/session:intro_CV bash
</code></pre></div> </div>
</li>
</ul>
<p>The image has the following tools:</p>
<ul>
<li>OpenCV 3.4.1</li>
<li>Tensorflow 1.10</li>
<li>Keras</li>
<li>Jupyter</li>
<li>
<p>Scientific python: Numpy, Scipy, Matplotlib … etc.</p>
</li>
<li>
<p>The command does some work for you. Downloading takes around 5 minutes. Be patient. Once the extraction is complete, you should see a terminal shell corresponding to the container (eg: root@xxxxxx).</p>
</li>
<li>Now, update session contents by giving the following command:</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git pull
</code></pre></div></div>
<ul>
<li>Run Jupyter with the following command.</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter notebook --ip=0.0.0.0 --allow-root
</code></pre></div></div>
<ul>
<li>If everything goes well, the command’s output should look like this:</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@5e3ca2f04d54:/Content/Sessions/CV_Intro_Session_1_2018# jupyter notebook --ip=0.0.0.0 --allow-root
[I 13:42:26.667 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 13:42:26.668 NotebookApp] No web browser found: could not locate runnable browser.
[C 13:42:26.669 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://(5e3ca2f04d54 or 127.0.0.1):8888/?token=56d48e36ca256e00823506c4f2cf1fc89264a3ba025d3307
</code></pre></div></div>
<ul>
<li>Go to <code class="highlighter-rouge">localhost:8888</code> and enter the token (everything after <code class="highlighter-rouge">token=</code>) there. Again, if on Windows, copy this URL, replace everything within the parentheses with the IP address that you noted down in the first step, and paste this new URL in your browser.</li>
</ul>
<p>For example, http://192.168.99.100:8888/?token=e99ef0776ac2c2d848d580e7e86d10a5f8e187fe20be8ae3</p>
<ul>
<li>
<p>You are good to go if Jupyter Notebook successfully opens up in your browser. One known issue on Windows is that Edge does not support using localhost; you would need Chrome or Firefox instead.</p>
</li>
<li>
<p>Feel free to raise any issues regarding installation at <a href="https://github.com/iitmcvg/Content/issues">github issues</a> with the tag <code class="highlighter-rouge">docker install issue</code>. Elaborate on the specifics of the issue and we’ll try to address them.</p>
</li>
</ul>

Problem Statements on basic OpenCV (2018-09-14) https://iitmcvg.github.io/problem_statements/Problem_statements

<p>Now that we’re done with the first session and the basics of Computer Vision and OpenCV, we have some problem statements for you to delve into. Do go through our <a href="https://github.com/iitmcvg/Content/blob/master/Sessions/CV_Intro_Session_1_2018/session_1_2018.ipynb">session notebook</a> if you missed it or don’t remember what was covered.</p>
<h2 id="task-1-ball-tracking">Task 1: Ball Tracking</h2>
<p>We are going to use OpenCV to draw bounding boxes around a ball of a specific colour. The output would finally run in real time and would look like:</p>
<p><img src="/assets/images/posts/Problem_Statements/ball_track.png" alt="" height="500px" width="500px" class="align-center" /></p>
<p>So let’s get started!!</p>
<ul>
<li>
<p>First, learn to use cv2.VideoCapture() to access images from your webcam, so that you can store each frame in a variable called ‘frame’.</p>
<p>The code will look a bit like:</p>
</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import cv2
# Create VideoCapture object here
cap = ......
while(True):
    # Read image from webcam here
    ret, frame = ………
    # frame now stores the image
    # Show output using cv2.imshow()
    .........
    # Exit loop if ‘q’ is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
# Release the VideoCapture object and destroy all windows
cap.release()
cv2.destroyAllWindows()
</code></pre></div></div>
<ul>
<li>
<p>We are going to detect only green objects. This is done easily in HSV space. In the loop, first convert the image from BGR to HSV space using cv2.cvtColor().</p>
<p>The lower and upper ranges of green in HSV space are (29, 86, 6) and (100, 255, 255) respectively. Use cv2.inRange() to create a mask such that only the green regions of the image appear in the mask. Use cv2.imshow() on the mask to see the output.</p>
</li>
<li>
<p>Do erosion followed by dilation on the mask to remove noise.</p>
</li>
<li>
<p>We are now going to find contours in our mask. Find out how the cv2.findContours() function works and use it to do so. It will look a bit like:</p>
</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)[-2]
# 'mask' is the output of erosion followed by dilation.
# 'cnts' is now a list of contours
</code></pre></div></div>
<ul>
<li>Use the max() function to choose the largest contour based on area. max() accepts two arguments: the first one is the list of contours and the second one (key) is the criteria for choosing the maximum. (Here it will be cv2.contourArea)</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C_max = max(cnts,key = cv2.contourArea)
# 'cnts' was the list of contours
# 'C_max' stores the largest contour
</code></pre></div></div>
<ul>
<li>Use cv2.boundingRect() to get the coordinates of one point on the rectangle bounding the contour and its height and width. Now use the cv2.rectangle() function to draw the rectangle on the frame. Use cv2.imshow() and see the output.</li>
</ul>
<h2 id="task-2-edge-detection-using-difference-of-gaussians">Task 2: Edge Detection using Difference of Gaussians</h2>
<p>In our last session you would have come across the Gaussian filter for blurring an image. If you use two Gaussian filters of different sizes (having different standard deviations) and find their difference, the result acts like a band-pass filter and can detect edges.</p>
<p>So try this out on the following image:
<img src="/assets/images/posts/Problem_Statements/laplacian1.jpg" alt="" height="500px" width="500px" class="align-center" /></p>
<p>Perform Gaussian Blurring with kernels of size 5x5 and 9x9 and find their difference and see the output.</p>
<h2 id="task-3-fog-removal">Task 3: Fog Removal</h2>
<p>Check out this colab <a href="https://colab.research.google.com/drive/14_1Qj9iF4RaGOSrvuahew4vxNlujC3uc#scrollTo=gK5XW9HvUci4">link</a> for this Problem Statement</p>
<p>Make a copy of it in your drive and then start working on it</p>
<h2 id="task-4-template-matching-using-histograms">Task 4: Template Matching using Histograms</h2>
<p>The task is to recognise the orange and white barrels in this <a href="https://www.youtube.com/watch?v=A9BVr7kltl8">video</a>. You will be using histogram backprojection to do so. Try to understand how cv2.calcBackProject() works for this.</p>
<p>The following are the steps you will roughly have to follow:</p>
<ul>
<li>
<p>Get an image of the region of interest (the barrel) and convert it from RGB to HSV. This is what you will be using for template matching.</p>
</li>
<li>
<p>Open your video using cv2.VideoCapture(), and convert your frame to HSV</p>
</li>
<li>
<p>Calculate the histogram of your object using the cv2.calcHist() function. Normalize your histogram, and apply histogram backprojection using cv2.normalize() and cv2.calcBackProject().</p>
</li>
<li>
<p>Let the result of the back projection be ‘res’. To visualize ‘res’ better, we shall convolve it with a circular disc.</p>
</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>disc = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5,5))
cv2.filter2D(res, -1, disc, res) #res is the matrix obtained after back projection
</code></pre></div></div>
<ul>
<li>
<p>Threshold your image. Try out different values for best results.</p>
</li>
<li>
<p>Merge the thresholded matrices to get a 3 channel image</p>
</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final = cv2.merge((thresh,thresh,thresh))
</code></pre></div></div>
<ul>
<li>Perform a bitwise OR of ‘final’ with the target image and display the output</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>result = cv2.bitwise_or(target_img, final)
</code></pre></div></div>
<p>If this gives good results you can also go ahead and try to identify the white lines bounding the path.</p>
<p>Use the comment section to ask doubts or raise any issues you face.
Kindly refrain from putting up solutions though :P</p>

Session 1: Intro to Computer Vision (2018-09-09) https://iitmcvg.github.io/sessions/intro-CV

<h1 id="session-1-introduction-to-computer-vision">Session 1: Introduction to Computer Vision</h1>
<p>Greetings from the Computer Vision and Intelligence group, CFI!</p>
<p>We are overwhelmed with the response we received for our introductory session, and it’s time to get into the nitty-gritty of Computer Vision. Our next session will cover the fundamentals of Computer Vision using OpenCV.</p>
<p><strong>Github Link:</strong> https://github.com/iitmcvg/Content/tree/master/Sessions/CV_Intro_Session_1_2018
<strong>Date :</strong> 10th September 2018 (Monday)
<strong>Venue :</strong> ESB 127
<strong>Time :</strong> 8:00 pm - 10:30 pm</p>
<iframe src="https://www.google.com/maps/embed?pb=!1m23!1m12!1m3!1d124406.89289444235!2d80.16030355909216!3d12.990045923321086!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!4m8!3e6!4m0!4m5!1s0x3a52677fdb777ceb%3A0xb9d8a78a4b0ef7d3!2sClass+Room+Complex%2C+IIT+Madras%2C+Indian+Institute+Of+Technology%2C+Chennai%2C+Tamil+Nadu+600036!3m2!1d12.9900553!2d80.2303441!5e0!3m2!1sen!2sin!4v1522947421266" width="400" height="300" frameborder="0" style="border:0" allowfullscreen=""></iframe>
<h2 id="getting-started-with-docker">Getting Started with Docker</h2>
<p>Firstly, download a compatible version of docker from here:</p>
<ul>
<li>For installation of Docker in Windows - https://download.docker.com/win/stable/DockerToolbox.exe</li>
<li>For installation of Docker in Linux(Ubuntu) - https://docs.docker.com/install/linux/docker-ce/ubuntu/</li>
<li>For installation of Docker in Mac - https://download.docker.com/mac/stable/Docker.dmg</li>
</ul>
<p><em>Note: while installing Docker Toolbox for Windows, check all options including UEFI and Virtualisation.</em></p>
<h3 id="installation">Installation</h3>
<ul>
<li>Before installation, additional software packages like Kitematic and Virtualbox can be unchecked.</li>
<li>The installer adds Docker Toolbox to your Applications folder.</li>
<li>On your Desktop, find the Docker QuickStart Terminal icon.</li>
<li>Click the Docker QuickStart icon to launch a pre-configured Docker Toolbox terminal.</li>
<li>If the system displays a User Account Control prompt asking to allow VirtualBox to make changes to your computer, choose Yes.</li>
<li>The terminal does several things to set up Docker Toolbox for you. When it is done, the terminal displays the $ prompt.</li>
<li>Make the terminal active by clicking your mouse next to the $ prompt.</li>
<li>The prompt is traditionally a $ (dollar) sign. You type commands into the command line, which is the area after the prompt. Your cursor is indicated by a highlighted area or a | that appears in the command line. After typing a command, always press RETURN.</li>
<li>Type the <strong>docker run hello-world</strong> command and press RETURN.</li>
<li>The command does some work for you; if everything runs well, the command’s output looks like this:</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ docker run hello-world
Unable to find image 'hello-world:latest' locally
Pulling repository hello-world
91c95931e552: Download complete
… … … …
</code></pre></div></div>
<h3 id="running-the-container">Running the Container</h3>
<ul>
<li>
<p>If you are on Windows, run the command <strong>docker-machine ip</strong> and make a note of the IP address shown in the output.</p>
</li>
<li>
<p>Run the container:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run -it --name cvi --rm -p 8888:8888 iitmcvg/session:intro_CV bash
</code></pre></div> </div>
</li>
</ul>
<p>The image has the following tools:</p>
<ul>
<li>OpenCV 3.4.1</li>
<li>Tensorflow 1.10</li>
<li>Keras</li>
<li>Jupyter</li>
<li>
<p>Scientific python: Numpy, Scipy, Matplotlib … etc.</p>
</li>
<li>
<p>The command does some work for you. Downloading takes around 5 minutes. Be patient. Once the extraction is complete, you should see a terminal shell corresponding to the container (eg: root@xxxxxx).</p>
</li>
<li>Now, update session contents by giving the following command:</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git pull
</code></pre></div></div>
<ul>
<li>Run Jupyter with the following command.</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter notebook --ip=0.0.0.0 --allow-root
</code></pre></div></div>
<ul>
<li>If everything goes well, the command’s output should look like this:</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@5e3ca2f04d54:/Content/Sessions/CV_Intro_Session_1_2018# jupyter notebook --ip=0.0.0.0 --allow-root
[I 13:42:26.419 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 13:42:26.667 NotebookApp] Serving notebooks from local directory: /Content/Sessions/CV_Intro_Session_1_2018
[I 13:42:26.667 NotebookApp] The Jupyter Notebook is running at:
[I 13:42:26.667 NotebookApp] http://(5e3ca2f04d54 or 127.0.0.1):8888/?token=56d48e36ca256e00823506c4f2cf1fc89264a3ba025d3307
[I 13:42:26.667 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 13:42:26.668 NotebookApp] No web browser found: could not locate runnable browser.
[C 13:42:26.669 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://(5e3ca2f04d54 or 127.0.0.1):8888/?token=56d48e36ca256e00823506c4f2cf1fc89264a3ba025d3307
</code></pre></div></div>
<ul>
<li>Go to <code class="highlighter-rouge">localhost:8888</code> and enter the token (everything after <code class="highlighter-rouge">token=</code>) there. Again, if on Windows, copy this URL, replace everything within the parentheses with the IP address that you noted down in the first step, and paste this new URL in your browser.</li>
</ul>
<p>For example, http://192.168.99.100:8888/?token=e99ef0776ac2c2d848d580e7e86d10a5f8e187fe20be8ae3</p>
<ul>
<li>
<p>You are good to go if Jupyter Notebook successfully opens up in your browser. One known issue on Windows is that Edge does not support using localhost; you would need Chrome or Firefox instead.</p>
</li>
<li>
<p>Feel free to raise any issues regarding installation at <a href="https://github.com/iitmcvg/Content/issues">github issues</a> with the tag <code class="highlighter-rouge">docker install issue</code>. Elaborate on the specifics of the issue and we’ll try to address them.</p>
</li>
</ul>

Pysangamam 2018: Workshop Content (2018-09-08) https://iitmcvg.github.io/conferences/pysangamam-content

<p>Thank you for the wonderful participation yesterday at PySangamam 2018.</p>
<p>As promised, here are the slides and repos we used:</p>
<ul>
<li><a href="https://docs.google.com/presentation/d/1fCtbC-nzSKMg63sLPqJsVc7ENmhbLrMg6Z-lrhCfE_0/edit?usp=sharing">Slides</a></li>
<li><a href="https://github.com/iitmcvg/pysangamam">Docker file and Notebook repo</a></li>
<li><a href="https://github.com/iitmcvg/Fast-image-classification">Transfer Learning Extension</a></li>
<li><a href="https://github.com/iitmcvg/Content">Our content repo, which we maintain</a></li>
</ul>
<p>Feel free to write back to us at cvigroup[dot]cfi[at]gmail[dot]com, or follow us on facebook or twitter for more updates!</p>
<p><strong>Update 9th September 2018</strong>: We have embedded the slides in this post below.</p>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSulJC8mcUd5RlMNpRHYQzghXfq8TZMou86nyIQnZUTh3WgIq0s4yTr_5oPTJqxv7ziPEqGOSODzDfW/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

Team from CVI bags runners-up in YAH! 2k18 (2018-08-22) https://iitmcvg.github.io/competitions/YAH

<h1 id="yet-another-hackathon-yah-2018">Yet Another Hackathon (YAH!) 2018</h1>
<p>A team from CVI, named “Eye in the Sky”, participated in YAH! 2k18.</p>
<p><strong>We have an inter-disciplinary project going on in CFI which is in collaboration with CVI and Aero club.</strong></p>
<p><strong>The team, won the runner up prize worth 10000 rupees.</strong></p>
<p>There were three subdomains in the competition, and each team had to work on one of them:</p>
<ul>
<li>
<p>New innovation and implementation in public services : Under the domain of Governance.</p>
</li>
<li>
<p>Gaming with a purpose : Under the domain of Education.</p>
</li>
<li>
<p>Personal Security : Under the domain of Security.</p>
</li>
</ul>
<p><img src="https://github.com/Ayushmaniar/Help_from_the_sky_2/blob/master/hack_6.jpeg?raw=true" alt="Image 1" /></p>
<p>The team chose to work on the subdomain of public services; specifically targeting Emergency Services and the use of Drones in Disaster management, employing real time Deep Learning and path planning algorithms in the codebase:</p>
<h3 id="our-unique-selling-points">Our unique selling points</h3>
<ul>
<li>
<p>Doing live person detection with the help of highly optimised YOLO architecture using Tensorflow API.</p>
</li>
<li>
<p>Achieving live video transfer - where the drone and laptop were not physically connected at all - while also sending accurate GPS coordinates to the base station whenever a person is detected, so that relief forces can go out and rescue them.</p>
</li>
<li>
<p>Achieving live activity recognition for an SOS signal made by a person during a disaster. The SOS signal consists of raising both hands and waving continuously in front of the drone to indicate an immediate need for help. To the best of our knowledge, this idea has not been implemented elsewhere to date, and it is at least 3 times faster than the fastest available method for activity recognition, which relies on 3D convolution of frames across time.</p>
</li>
<li>
<p>Making an interactive demo wherein a user can zoom in on the map of India and draw a rectangle around the disaster-affected area, after which a team of drones distributes itself so that the area surveilled by the drones is maximised.</p>
</li>
</ul>
<p>The team would also like to specially thank the CVI club heads <strong>Lokesh Kumar and Varun Sundar</strong> for their guidance and mentorship.</p>

CVI in Pysangamam 2018 (2018-08-22) https://iitmcvg.github.io/conferences/Pysangamam

<p>We are pleased to announce that 3 of our talk proposals from the Computer Vision and Intelligence group (CVI) have been selected as a part of <a href="https://pysangamam.org">PySangamam 2018</a>.</p>
<ul>
<li><strong>Lokesh Kumar</strong> would be covering an introductory talk on using numba for GPU parallelism in python.</li>
<li><strong>Nikhil Krishna</strong> would be delivering a talk titled “Gentle Introduction to Effective Parallelism via Computer Vision.”</li>
<li><strong>Rajat V D</strong> will be covering a talk on Python generators and their use cases in Deep Learning.</li>
</ul>
<p>We will also be conducting a workshop titled “Computer Vision through the Ages”, where we take you through the evolution of computer vision over the last two decades, cover the factors that most influence the field today, and provide intuition on the approaches taken in this period.</p>
<p>We’re super excited to be a part of PySangamam 2018, see you there too!</p>
<p>Follow us on facebook or twitter for more updates!</p>

Exploring Adversarial Reprogramming (2018-07-17) https://iitmcvg.github.io/papers/Exploring-Adversarial-Reprogramming

<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/cover_5iscat.png" alt="catis5" width="auto" style="margin:20px auto; display:inline-block" text-align="center" />
</figure>
<p><em>Author: Rajat V D</em></p>
<p><em>This post was also posted on medium at <a href="https://medium.com/@cvigroup.cfi/exploring-adversarial-reprogramming-c9e14bf3236a" class="btn btn--success"><i class="fa fa-medium" aria-hidden="true"></i><span> Medium</span></a></em></p>
<p><em>Original post <a href="https://rajatvd.github.io/Exploring-Adversarial-Reprogramming/">here</a></em></p>
<p>Google brain recently published a paper titled <a href="https://arxiv.org/pdf/1806.11146.pdf">Adversarial Reprogramming of Neural Networks</a> which caught my attention. It introduced a new kind of adversarial example for neural networks, those which could actually perform a useful task for the adversary as opposed to just fooling the attacked network. The attack ‘reprograms’ a network designed for a particular task to perform a completely different one. The paper showed that popular ImageNet architectures like Inception and ResNets can be successfully reprogrammed to perform quite well in different tasks like counting squares, MNIST and CIFAR-10.</p>
<p>I’m going to walk through the paper in this post, and also add some of my own small modifications to the work they presented in the paper. In particular, I experimented with a slightly different method of action of the adversarial program, different regularization techniques and also targeted different networks - ResNet 18 and AlexNet.</p>
<h1 id="paper-summary">Paper summary</h1>
<p>The paper demonstrated adversarial reprogramming of some famous ImageNet architectures like Inceptions v2, v3 and v4, along with some ResNets - 50, 101, and 152. They reprogrammed these networks to perform MNIST and CIFAR-10 classification, and also the task of counting squares in an image.</p>
<p>The gist of the reprogramming process is as follows:</p>
<ul>
<li>Take a pretrained model on ImageNet like Inception.</li>
<li>Re-assign ImageNet labels to the labels of your target task. So for example, let ‘great white shark’ = 1 for MNIST, and so on. You can assign multiple ImageNet labels to the same adversarial label as well.</li>
<li>Add an ‘adversarial program’ image to your MNIST image and pass that through the Inception model. Map the outputs of Inception using the remapping you chose above to get your MNIST predictions.</li>
<li>Train only the adversarial program image on the remapped labels, while keeping the Inception weights frozen.</li>
<li>Now you got yourself an MNIST classifier: Take an MNIST image, add on your trained adversarial program, run it through Inception, and remap its labels to get predictions for MNIST.</li>
</ul>
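<p>The label remapping in the steps above is just an index selection over the frozen model's logits. A minimal sketch, where the ten chosen ImageNet class indices and the random logits are purely hypothetical stand-ins:</p>

```python
import numpy as np

# Hypothetical mapping: MNIST digit d is read from ImageNet class imagenet_to_mnist[d]
imagenet_to_mnist = [1, 34, 130, 207, 288, 340, 385, 444, 500, 607]

batch = 4
imagenet_logits = np.random.randn(batch, 1000)   # stand-in for the frozen model's output
mnist_logits = imagenet_logits[:, imagenet_to_mnist]
mnist_preds = mnist_logits.argmax(axis=1)        # predicted digits, 0-9
```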
<p>The exact method of ‘adding’ the adversarial program is as follows. Since ImageNet models require a 224 x 224 image, we use that as the size of our program weights. Let’s call the weights image <script type="math/tex">W</script>. A nonlinear activation is applied to the weights after masking out the centre 28x28 section, which is then replaced by the MNIST image. This is the image which is passed in to our ImageNet model. Let’s define the mask with 0’s in the centre 28x28 as $M$. The adversarial input to the ImageNet model <script type="math/tex">X_{adv}</script> is:</p>
<script type="math/tex; mode=display">X_{adv} = \tanh(W \odot M)+ pad(X)</script>
<p>where $\odot$ represents element wise multiplication, and $X$ is the input MNIST image. The illustrations in the paper shown below sum it up well:</p>
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/paper_illustration1.PNG" alt="pap1" width="auto" style="margin:20px auto; display:inline-block" text-align="center" />
<figcaption>Add a masked adversarial program to an input image from the counting squares task.</figcaption>
</figure>
<figure style="margin: 20px 10%; text-align: center;">
<img src="/assets/images/posts/adv_reprog/paper_illustration2.PNG" alt="pap2" width="auto" style="margin:20px auto; display:inline-block" text-align="center" />
<figcaption>Pass through ImageNet model and remap outputs.</figcaption>
</figure>
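<p>The masked construction of <script type="math/tex">X_{adv}</script> above can be sketched in numpy (single channel for simplicity; real ImageNet inputs have three). Since <script type="math/tex">\tanh(0)=0</script>, the centre of the adversarial input is exactly the original image.</p>

```python
import numpy as np

def adversarial_input(W, x, img_size=224, patch=28):
    """X_adv = tanh(W * M) + pad(X), with the input placed in the centre."""
    s = (img_size - patch) // 2
    M = np.ones((img_size, img_size))
    M[s:s + patch, s:s + patch] = 0              # mask out the centre patch
    padded = np.zeros((img_size, img_size))
    padded[s:s + patch, s:s + patch] = x         # pad(X): input in the centre
    return np.tanh(W * M) + padded

W = np.random.randn(224, 224)                    # trainable program weights
x = np.random.rand(28, 28)                       # an MNIST-sized input
x_adv = adversarial_input(W, x)
```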
<p>The outputs of the ImageNet model are trained using the cross-entropy loss as is normal for any classification problem. L2 regularization of the weight <script type="math/tex">W</script> was also done.</p>
<p>The results of the paper were quite interesting. Here are some important observations:</p>
<ul>
<li>They observed that using pre-trained ImageNet models allowed for a much higher accuracy than untrained or randomly initialized models(some models showed a disparity of ~80% test accuracy between trained and untrained).</li>
<li>The adversarial programs for different models showed different qualitative features, meaning that they were architecture specific in some sense.</li>
<li>Adversarially trained models showed basically no reduction in accuracy. This means that they are just as vulnerable to being reprogrammed as a normally trained model.</li>
</ul>
<h1 id="my-experiments">My Experiments</h1>
<h2 id="regularization-using-blurring">Regularization using blurring</h2>
<p>One very important difference between these adversarial programs and traditional adversarial examples is that traditional examples were only deemed adversarial because the perturbation added to them was small in magnitude. However in this case, as the authors state, “the magnitude of this perturbation[adversarial program] need not be constrained”, as the adversarial perturbation is not applied on any previously true example. This fact is leveraged when training the program, as there are no limits on how large <script type="math/tex">W</script> can be. Another point to note is that the perturbation is a nonlinear function of the trained weights <script type="math/tex">W</script>. This is in contrast to other adversarial examples like the <a href="https://arxiv.org/abs/1412.6572"><em>Fast Gradient Sign Method</em></a> which are linear perturbations.</p>
<p>One point I’d like to bring up is that previous adversarial perturbations were better off containing high-frequency components which made them “look like noise” (although they are anything but noise), as this resulted in perturbations imperceptible to the human eye. In other words, the perturbations did not change the true label of the image, but successfully fooled the targeted network. For adversarial programs, however, there is no requirement that the perturbation contain only high frequencies and be imperceptible, as the goal of the program is to repurpose the targeted network rather than simply fool it. This means that we can enforce some smoothness in the trained program as a regularization technique, instead of using L2 regularization.</p>
<p>I trained an adversarial program for ResNet-18 to classify MNIST digits using regularization techniques borrowed from <a href="https://www.auduno.com/2015/07/29/visualizing-googlenet-classes/">this post</a>. The basic idea is as follows:</p>
<ul>
<li>Blur the adversarial program image <script type="math/tex">W</script> after each gradient step using a Gaussian blur, with a gradually decreasing sigma.</li>
<li>Blur the gradients as well, again using a Gaussian blur with a gradually decreasing sigma.</li>
</ul>
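The two regularization steps above can be sketched as follows. This is a minimal NumPy illustration, not the actual training code (which used PyTorch); the array shapes, learning rate, and sigma value are illustrative assumptions, and the gradient is a placeholder standing in for the real backpropagated gradient.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel with radius ~3 sigma."""
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur2d(img, sigma):
    """Separable 2-D Gaussian blur: convolve columns, then rows."""
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(np.convolve, 0, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 1, out, k, mode="same")

# One hypothetical gradient step with both regularizations applied:
rng = np.random.default_rng(0)
W = rng.standard_normal((224, 224))     # adversarial program weights
grad = rng.standard_normal((224, 224))  # placeholder for dL/dW from backprop
lr, sigma = 0.05, 2.0                   # sigma is gradually decreased over training

grad = blur2d(grad, sigma)              # 1. blur the gradient
W = W - lr * grad                       # take the gradient step
W = blur2d(W, sigma)                    # 2. blur the program weights
```

Decreasing sigma over epochs lets the program first settle on smooth, low-frequency structure and only later develop fine detail.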
<p>After training for around 20 epochs, this was the resulting adversarial program:</p>
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/RESNET18_MNIST_masked_blurredgrad_and_weight_1k.gif" alt="res18gif" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<img src="/assets/images/posts/adv_reprog/RESNET18_MNIST_masked_blurredgrad_and_weight_1k.png" alt="res18" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<figcaption>Animation and final result of training an adversarial program using masking for Resnet 18 on MNIST</figcaption>
</figure>
<p>It managed a pretty high test accuracy of <strong>96.81%</strong>, beating some of the networks described in the paper by a few points (this could be a property of ResNet 18 compared to the other networks they used). Note that I mapped the outputs to ten arbitrary ImageNet labels (the paper used the first 10 labels), which had no relation to the MNIST digits themselves. We can see interesting low-frequency artifacts in the program image, introduced by the blurring regularization. We can also see the effect of reducing the blur as the program develops finer details in the later part of the gif.</p>
<p>However, I do believe that this method of transforming the input using a mask is a bit lacking, so I tried my hand at a different input transformation.</p>
<h2 id="transforming-by-resizing">Transforming by resizing</h2>
<p>The authors of the paper mention that the input transformation and output remapping “could be any consistent transformation that converts between the input (output) formats for the two tasks and causes the model to perform the adversarial task”. In the case of the MNIST adversarial reprogramming, the input transformation was masking the weights and adding the MNIST input to the centre. The authors state that they used this masking “purely to improve visualization of the action of the adversarial program”, as one can clearly see the MNIST digit in the adversarial image. The masking is not required for this process to work. It does, however, seem a bit limiting: the network is forced to differentiate between the 10 MNIST classes using only the information in the central 28x28 pixels of the input, while the remaining part of the input, the adversarial program, stays constant across all 10 classes. Another transformation which retains visualization ability is to simply scale the MNIST input (linearly interpolate) and add it to the adversarial program weights without any masking, before applying the non-linearity. In this case, the fraction of the adversarial input which distinguishes between classes is the same as it would be for any other MNIST classifier. An example is shown below:</p>
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/scale_program_illu.png" alt="scale" width="auto" style="margin:20px auto; display:inline-block" text-align="center" />
<figcaption>Illustration of the input transformation described above.</figcaption>
</figure>
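A minimal sketch of this scaling transformation. Simple 8x pixel replication stands in for the linear interpolation used in the experiments (28 × 8 = 224), and the digit and program weights are random placeholders; the sigmoid squashes the sum back into the valid input range.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adversarial_input(digit, W):
    """Scale a 28x28 digit up to 224x224 and add it to the program weights W.

    Pixel replication (np.kron) stands in for linear interpolation here;
    the sigmoid keeps the final image in the valid input range (0, 1).
    """
    scaled = np.kron(digit, np.ones((8, 8)))  # 28 * 8 = 224
    return sigmoid(scaled + W)

rng = np.random.default_rng(0)
digit = rng.random((28, 28))            # placeholder MNIST digit
W = rng.standard_normal((224, 224))     # placeholder trained adversarial program
x_adv = adversarial_input(digit, W)
```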
<p>Again, I trained an adversarial program using this new input transformation for Resnet 18. I used the same two regularization techniques of gradually decreased weight and gradient blurring as before, with the same parameters.</p>
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/RESNET18_MNIST_blurredgrad_and_weight_1k.gif" alt="res18scalegif" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<img src="/assets/images/posts/adv_reprog/RESNET18_MNIST_blurredgrad_and_weight_1k.png" alt="res18scale" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<figcaption>Animation and final result of training an adversarial program using scaling for Resnet 18 on MNIST.</figcaption>
</figure>
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/RESNET18_blurredgw_5.png" alt="5iscat" width="auto" style="margin:20px auto; display:inline-block" text-align="center" />
<figcaption>An example of adversarial image obtained by adding a scaled version of the MNIST digit 5 to an adversarial program, and then applying a sigmoid activation. Resnet 18 classifies this image as an 'Egyptian Cat' with 0.99 confidence</figcaption>
</figure>
<p>The model obtained a test accuracy of <strong>97.87%</strong>, around 1% better than the masked transformation. It also reached this accuracy after 15 epochs of training, showing faster convergence than the masked transformation, and it doesn’t sacrifice much in terms of visualization ability.</p>
<p>For comparisons, I also trained a program for a randomly initialized ResNet 18 network, using the gradient and weight blurring regularizations, and the scaling input transformation. As expected, the model performed much worse, with a test accuracy of only <strong>44.15%</strong> after 20 epochs. The program also showed a lack of low frequency textures and features despite the use of blurring regularization:</p>
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/RESNET18_randomweights_MNIST_blurredgrad_and_weight_1k.gif" alt="res18randgif" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<img src="/assets/images/posts/adv_reprog/RESNET18_randomweights_MNIST_blurredgrad_and_weight_1k.png" alt="res18rand" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<figcaption>Animation and final result of training an adversarial program for an untrained (randomly initialized) ResNet 18 network on MNIST, even with blurring regularization.</figcaption>
</figure>
<h2 id="multiple-output-label-mappings">Multiple output label mappings</h2>
<p>The above experiments have focused on changing the input transformation and regularization methods. I also experimented with output label mappings which weren’t arbitrary. I did this with CIFAR-10, because reprogramming an ImageNet classifier into a CIFAR-10 classifier relates two closely matched tasks. Both tasks take photos of real objects as inputs, and output labels of those objects. It’s easy to find ImageNet labels which are closely related to CIFAR-10 labels. For example, the ImageNet label ‘airliner’ maps directly to the CIFAR-10 label ‘airplane’. To take a more structured approach, I ran the training images of CIFAR-10 (leaving aside a validation set) through ResNet 18 by rescaling them to 224x224. This is equivalent to using an adversarial program initialized to 0. I then compared the outputs of ResNet to the true labels of CIFAR-10. For this particular case, I performed a multi-label mapping from ImageNet labels to CIFAR labels. In particular, I greedily mapped each ImageNet label to the CIFAR label whose images ResNet assigned to it most often. In short:</p>
<ul>
<li>Run CIFAR train set through ResNet.</li>
<li>Get a histogram of ImageNet labels for each CIFAR label.</li>
<li>Map each ImageNet label to the CIFAR label with the highest value.</li>
<li>For example, let’s look at the ImageNet label <em>goldfish</em>. After passing the CIFAR training set through the model, let’s say 10 trucks were classified as goldfish, 200 birds, 80 airplanes, etc. (these numbers are just examples). Suppose the 200 birds is the largest number of a single CIFAR class which was classified as <em>goldfish</em>. I then choose to map the <em>goldfish</em> output to the CIFAR label <em>bird</em>. This is the greedy part - I map the ImageNet label <script type="math/tex">y</script> to that CIFAR label which was classified as <script type="math/tex">y</script> the most often (in the training set). I repeat this process for all 1000 ImageNet labels.</li>
</ul>
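The greedy mapping above can be sketched as a 2-D histogram followed by a row-wise argmax. A tiny synthetic array stands in for the actual ResNet predictions on the CIFAR-10 training set:

```python
import numpy as np

N_IMAGENET, N_CIFAR = 1000, 10

# Placeholder data: preds[i] is the ImageNet label ResNet predicted for
# training image i, labels[i] is that image's true CIFAR-10 label.
preds = np.array([1, 1, 1, 1, 7, 7, 7])
labels = np.array([2, 2, 2, 5, 0, 0, 3])

# counts[k, c] = number of images with true CIFAR label c that were
# predicted as ImageNet label k (np.add.at accumulates repeated indices)
counts = np.zeros((N_IMAGENET, N_CIFAR), dtype=int)
np.add.at(counts, (preds, labels), 1)

# Greedily map each ImageNet label to the CIFAR label it was assigned to most often
mapping = counts.argmax(axis=1)
```

Here ImageNet label 1 was mostly predicted for CIFAR class 2, so it gets mapped there; label 7 gets mapped to class 0.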
<p>To train the program with a multiple label mapping, I added the output probabilities of the multiple ImageNet labels corresponding to each CIFAR-10 label to get a set of 10 probabilities corresponding to each CIFAR label. I then used the negative log likelihood loss on these probabilities. Before going to the multi label mapping, I also trained an arbitrary 10 label mapping for ResNet 18 on CIFAR. It achieved a test accuracy of about <strong>61.84%</strong> after 35 epochs. I then trained the same ResNet using the above greedy multi-label mapping with the same training parameters. It achieved a test accuracy of <strong>67.63%</strong> after just 16 epochs.</p>
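The multi-label loss described above could look like this. The shapes and the small epsilon are assumptions, and the real experiments computed this inside PyTorch so gradients could flow; this NumPy version only shows the forward computation:

```python
import numpy as np

def multilabel_nll(probs, mapping, targets, n_cifar=10):
    """probs: (batch, 1000) ImageNet softmax outputs,
    mapping: (1000,) ImageNet-label -> CIFAR-label assignment,
    targets: (batch,) true CIFAR labels."""
    batch = probs.shape[0]
    cifar_probs = np.zeros((batch, n_cifar))
    for c in range(n_cifar):
        # sum the probabilities of all ImageNet labels mapped to CIFAR label c
        cifar_probs[:, c] = probs[:, mapping == c].sum(axis=1)
    # negative log likelihood of the true CIFAR label
    return -np.log(cifar_probs[np.arange(batch), targets] + 1e-12).mean()

# With uniform ImageNet probabilities and 100 labels mapped to each CIFAR
# label, every CIFAR probability is 0.1, so the loss is -log(0.1):
mapping = np.repeat(np.arange(10), 100)
probs = np.full((4, 1000), 1e-3)
targets = np.array([0, 3, 5, 9])
loss = multilabel_nll(probs, mapping, targets)
```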
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/Resnet18_CIFAR10tanh_blurredgrad_weight_zeroinit_1k_multilabelremap.gif" alt="res18cifgif" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<img src="/assets/images/posts/adv_reprog/Resnet18_CIFAR10tanh_blurredgrad_weight_zeroinit_1k_multilabelremap.png" alt="res18cif" width="auto" style="margin:20px 50px; display:inline-block" text-align="center" />
<figcaption>Animation and final result of training an adversarial program using greedily chosen multiple output label mappings for Resnet 18 on CIFAR-10.</figcaption>
</figure>
<p>I repeated the above experiment on CIFAR-10 using AlexNet instead of Resnet 18. An arbitrary single output mapping yielded a test accuracy of <strong>57.40%</strong> after 21 epochs, while using the greedy multi output mapping boosted that to <strong>61.31%</strong> after 30 epochs.</p>
<figure style="margin: 20px auto; text-align: center;">
<img src="/assets/images/posts/adv_reprog/cifar10_programmed_trailertruck.png" alt="truck" width="auto" style="margin:20px auto; display:inline-block" text-align="center" />
<figcaption>Reprogrammed input of a CIFAR-10 truck which is classified by ResNet 18 as a 'trailer truck' with 0.42 confidence. Note that calling this input 'adversarial' is a bit misleading, as this example can't really be considered one that 'fools' the target network.</figcaption>
</figure>
<p>While we can keep chasing percentage points, note that CIFAR-10 is actually a bad example for demonstrating adversarial reprogramming, as the so-called ‘adversarial’ inputs <script type="math/tex">X_{adv}</script> aren’t really adversarial: the CIFAR labels are highly related to the ImageNet labels which have been greedily mapped. For example, the above figure shows an ‘adversarial’ input of a CIFAR-10 truck which is classified by ResNet 18 as a ‘trailer truck’. This isn’t really an adversarial input, as the image could reasonably be considered one of a trailer truck. However, the examples for CIFAR discussed in the paper, in which the CIFAR image is placed in the centre with a mask, can be considered adversarial, as a large part of those images cannot be interpreted meaningfully.</p>
<h2 id="testing-transferability">Testing Transferability</h2>
<p>Another interesting thing to look at is whether these adversarial programs successfully transfer between different neural networks. In my experiments, it seemed that this was not the case. I tried using the MNIST adversarial program trained using ResNet 18 on AlexNet, and found that almost all the images were classified as ‘jigsaw puzzles’ irrespective of the digit they contained. The same was true for other ResNets like ResNet 34 and ResNet 50.</p>
<p>I also tried transferring the adversarial programs between the <em>same</em> networks but through a photograph on my phone. ResNet 18 classified a photo of the adversarial 5 as, again, a jigsaw puzzle as opposed to an Egyptian cat. This suggests that these adversarial examples are not very robust to noise, despite containing some low-frequency features. One possible reason for this could be that these examples occupy a niche in the adversarial example input space, as they were generated by training to fit a large dataset, however this remains a conjecture.</p>
<hr />
<p>I wrote the <a href="https://github.com/rajatvd/AdversarialReprogramming">code</a> for all these experiments using PyTorch. Special thanks to my brother Anjan for the numerous discussions we had about these ideas and explorations.</p>Summer School Deep Learning Session 4 (2018-07-14, https://iitmcvg.github.io/summer_school/DLSession4)<h2 id="recurrant-neural-networks">Recurrent Neural Networks</h2>
<p>Recurrent Neural Networks (RNNs) are very effective for Natural Language Processing and other sequence tasks because they have “memory”. They can read inputs <script type="math/tex">x^{\langle t \rangle}</script> (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a uni-directional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.</p>
<p>The session notebook can be found <a href="https://github.com/iitmcvg/Content/tree/master/Sessions/Summer_School_2018/Session_DL_4">here</a></p>
<p><strong>Notation</strong>:</p>
<ul>
<li>Superscript <script type="math/tex">[l]</script> denotes an object associated with the <script type="math/tex">l^{th}</script> layer.
<ul>
<li>Example: <script type="math/tex">a^{[4]}</script> is the <script type="math/tex">4^{th}</script> layer activation. <script type="math/tex">W^{[5]}</script> and <script type="math/tex">b^{[5]}</script> are the <script type="math/tex">5^{th}</script> layer parameters.</li>
</ul>
</li>
<li>Superscript <script type="math/tex">(i)</script> denotes an object associated with the <script type="math/tex">i^{th}</script> example.
<ul>
<li>Example: <script type="math/tex">x^{(i)}</script> is the <script type="math/tex">i^{th}</script> training example input.</li>
</ul>
</li>
<li>Superscript <script type="math/tex">\langle t \rangle</script> denotes an object at the <script type="math/tex">t^{th}</script> time-step.
<ul>
<li>Example: <script type="math/tex">x^{\langle t \rangle}</script> is the input x at the <script type="math/tex">t^{th}</script> time-step. <script type="math/tex">x^{(i)\langle t \rangle}</script> is the input at the <script type="math/tex">t^{th}</script> timestep of example <script type="math/tex">i</script>.</li>
</ul>
</li>
<li>Subscript <script type="math/tex">i</script> denotes the <script type="math/tex">i^{th}</script> entry of a vector.
<ul>
<li>Example: <script type="math/tex">a^{[l]}_i</script> denotes the <script type="math/tex">i^{th}</script> entry of the activations in layer <script type="math/tex">l</script>.</li>
</ul>
</li>
</ul>
<h2 id="forward-propagation-for-the-basic-recurrent-neural-network">Forward propagation for the basic Recurrent Neural Network</h2>
<p>The basic RNN that you will implement has the structure below. In this example, <script type="math/tex">T_x = T_y</script>.</p>
<p><img src="https://imgur.com/Yaa79IN.png" alt="neuron" class="align-center" /></p>
<p>This is a Basic RNN model</p>
<p>Here’s how you can implement an RNN:</p>
<p><strong>Code Instructions</strong>:</p>
<ol>
<li>Implement the calculations needed for one time-step of the RNN.</li>
<li>Implement a loop over <script type="math/tex">T_x</script> time-steps in order to process all the inputs, one at a time.</li>
</ol>
<p>Let’s go!</p>
<h3 id="rnn-cell">RNN cell</h3>
<p>A Recurrent neural network can be seen as the repetition of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell.</p>
<p><img src="https://imgur.com/vGxAY57.png" alt="neuron" class="align-center" />
This is a basic RNN cell. Takes as input <script type="math/tex">x^{\langle t \rangle}</script> (current input) and <script type="math/tex">a^{\langle t - 1\rangle}</script> (previous hidden state containing information from the past), and outputs <script type="math/tex">a^{\langle t \rangle}</script> which is given to the next RNN cell and also used to predict <script type="math/tex">y^{\langle t \rangle}</script></p>
<p><strong>Code Instructions</strong>:</p>
<ol>
<li>Compute the hidden state with tanh activation: <script type="math/tex">a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)</script>.</li>
<li>Using your new hidden state <script type="math/tex">a^{\langle t \rangle}</script>, compute the prediction <script type="math/tex">\hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y)</script>. We provided you a function: <code class="highlighter-rouge">softmax</code>.</li>
<li>Store <script type="math/tex">(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)</script> in cache</li>
<li>Return <script type="math/tex">a^{\langle t \rangle}</script> , <script type="math/tex">y^{\langle t \rangle}</script> and cache</li>
</ol>
<p>We will vectorize over <script type="math/tex">m</script> examples. Thus, <script type="math/tex">x^{\langle t \rangle}</script> will have dimension <script type="math/tex">(n_x,m)</script>, and <script type="math/tex">a^{\langle t \rangle}</script> will have dimension <script type="math/tex">(n_a,m)</script>.</p>
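The four steps above can be sketched in NumPy as follows. The parameter names (`Waa`, `Wax`, `Wya`, `ba`, `by`) follow the notation in this post, and `softmax` is written out since the notebook provides it as a helper:

```python
import numpy as np

def softmax(z):
    """Column-wise softmax, stabilized by subtracting the max."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """One RNN time-step. xt: (n_x, m), a_prev: (n_a, m)."""
    Waa, Wax, Wya = parameters["Waa"], parameters["Wax"], parameters["Wya"]
    ba, by = parameters["ba"], parameters["by"]
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)   # 1. hidden state
    yt_pred = softmax(Wya @ a_next + by)             # 2. prediction
    cache = (a_next, a_prev, xt, parameters)         # 3. cache for backprop
    return a_next, yt_pred, cache                    # 4. return

# Tiny example with n_x=3, n_a=5, n_y=2, m=4:
rng = np.random.default_rng(0)
p = {"Waa": rng.standard_normal((5, 5)), "Wax": rng.standard_normal((5, 3)),
     "Wya": rng.standard_normal((2, 5)), "ba": np.zeros((5, 1)), "by": np.zeros((2, 1))}
a_next, yt, cache = rnn_cell_forward(rng.standard_normal((3, 4)), np.zeros((5, 4)), p)
```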
<h3 id="rnn-forward-pass">RNN forward pass</h3>
<p>An RNN can be seen as the repetition of the cell you’ve just built. If your input sequence of data is carried over 10 time steps, then you will copy the RNN cell 10 times. Each cell takes as input the hidden state from the previous cell (<script type="math/tex">a^{\langle t-1 \rangle}</script>) and the current time-step’s input data (<script type="math/tex">x^{\langle t \rangle}</script>). It outputs a hidden state (<script type="math/tex">a^{\langle t \rangle}</script>) and a prediction (<script type="math/tex">y^{\langle t \rangle}</script>) for this time-step.</p>
<p><img src="https://imgur.com/YdNCgkN.png" alt="neuron" class="align-center" />
A basic RNN is shown above. The input sequence <script type="math/tex">x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})</script> is carried over <script type="math/tex">T_x</script> time steps. The network outputs <script type="math/tex">y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})</script>.</p>
<p><strong>Code Instructions</strong>:</p>
<ol>
<li>Create a vector of zeros (<script type="math/tex">a</script>) that will store all the hidden states computed by the RNN.</li>
<li>Initialize the “next” hidden state as <script type="math/tex">a_0</script> (initial hidden state).</li>
<li>Start looping over each time step, your incremental index is <script type="math/tex">t</script> :
<ul>
<li>Update the “next” hidden state and the cache by running <code class="highlighter-rouge">rnn_cell_forward</code></li>
<li>Store the “next” hidden state in <script type="math/tex">a</script> (<script type="math/tex">t^{th}</script> position)</li>
<li>Store the prediction in y</li>
<li>Add the cache to the list of caches</li>
</ul>
</li>
<li>Return <script type="math/tex">a</script>, <script type="math/tex">y</script> and caches</li>
</ol>
<h3 id="basic-rnn--backward-pass">Basic RNN backward pass</h3>
<p>We will start by computing the backward pass for the basic RNN-cell.</p>
<p><img src="https://imgur.com/3EniMu4.png" alt="neuron" class="align-center" /></p>
<p>This is an RNN-cell’s backward pass. Just like in a fully-connected neural network, the derivative of the cost function <script type="math/tex">J</script> backpropagates through the RNN by following the chain rule from calculus. The chain rule is also used to calculate <script type="math/tex">(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b_a})</script> to update the parameters <script type="math/tex">(W_{ax}, W_{aa}, b_a)</script>.</p>
<p><strong>Deriving the one step backward functions:</strong></p>
<p>To compute the <code class="highlighter-rouge">rnn_cell_backward</code> you need to compute the following equations. It is a good exercise to derive them by hand.</p>
<p>The derivative of <script type="math/tex">\tanh</script> is <script type="math/tex">1-\tanh(x)^2</script>.</p>
<p>Similarly for <script type="math/tex">\frac{ \partial a^{\langle t \rangle} } {\partial W_{ax}}, \frac{ \partial a^{\langle t \rangle} } {\partial W_{aa}}, \frac{ \partial a^{\langle t \rangle} } {\partial b}</script>, the derivative of <script type="math/tex">\tanh(u)</script> is <script type="math/tex">(1-\tanh(u)^2)du</script>.</p>
<p>The final two equations also follow the same rule and are derived using the <script type="math/tex">\tanh</script> derivative. Note that the terms are arranged so that the dimensions match.</p>
<p><strong>Backward pass through the RNN</strong></p>
<p>Computing the gradients of the cost with respect to <script type="math/tex">a^{\langle t \rangle}</script> at every time-step <script type="math/tex">t</script> is useful because it is what helps the gradient backpropagate to the previous RNN-cell. To do so, you need to iterate through all the time steps starting at the end, and at each step, you increment the overall <script type="math/tex">db_a</script>, <script type="math/tex">dW_{aa}</script>, <script type="math/tex">dW_{ax}</script> and you store <script type="math/tex">dx</script>.</p>
<p><strong>Instructions</strong>:</p>
<p>Implement the <code class="highlighter-rouge">rnn_backward</code> function. Initialize the return variables with zeros first, then loop through all the time steps while calling <code class="highlighter-rouge">rnn_cell_backward</code> at each time-step, updating the other variables accordingly.</p>
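One possible implementation of both functions, sketched in NumPy. It assumes each cache holds `(a_next, a_prev, xt, parameters)` as in the forward cell, and that `da` contains the upstream gradients of the cost with respect to each hidden state:

```python
import numpy as np

def rnn_cell_backward(da_next, cache):
    """Backprop through one cell, using d tanh(u) = (1 - tanh(u)^2) du."""
    a_next, a_prev, xt, p = cache
    dtanh = (1 - a_next ** 2) * da_next              # back through the tanh
    return {"dxt": p["Wax"].T @ dtanh,
            "da_prev": p["Waa"].T @ dtanh,
            "dWax": dtanh @ xt.T,
            "dWaa": dtanh @ a_prev.T,
            "dba": dtanh.sum(axis=1, keepdims=True)}

def rnn_backward(da, caches):
    """da: (n_a, m, T_x) upstream gradients w.r.t. each hidden state."""
    _, _, _, p = caches[0]
    n_a, m, T_x = da.shape
    grads = {"dWax": np.zeros_like(p["Wax"]), "dWaa": np.zeros_like(p["Waa"]),
             "dba": np.zeros_like(p["ba"]),
             "dx": np.zeros((p["Wax"].shape[1], m, T_x))}
    da_prevt = np.zeros((n_a, m))
    for t in reversed(range(T_x)):                   # iterate from the end
        g = rnn_cell_backward(da[:, :, t] + da_prevt, caches[t])
        grads["dx"][:, :, t] = g["dxt"]              # store, don't accumulate
        for k in ("dWax", "dWaa", "dba"):
            grads[k] += g[k]                         # accumulate over time
        da_prevt = g["da_prev"]                      # pass to earlier step
    grads["da0"] = da_prevt
    return grads
```

A finite-difference check against the forward pass is a good way to verify equations like these by hand.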
<p>In the next part, you will build a more complex LSTM model, which is better at addressing vanishing gradients. The LSTM will be better able to remember a piece of information and keep it saved for many timesteps.</p>
<h2 id="long-short-term-memory-lstm-network">Long Short-Term Memory (LSTM) network</h2>
<p>This following figure shows the operations of an LSTM-cell.</p>
<p><img src="https://imgur.com/wRyYVQ6.png" alt="neuron" class="align-center" /></p>
<p>This is a LSTM-cell. This tracks and updates a “cell state” or memory variable <script type="math/tex">c^{\langle t \rangle}</script> at every time-step, which can be different from <script type="math/tex">a^{\langle t \rangle}</script>.</p>
<p>Similar to the RNN example above, you will start by understanding the LSTM cell for a single time-step. Then you can iteratively call it from inside a for-loop to have it process an input with <script type="math/tex">T_x</script> time-steps.</p>
<h2 id="about-the-gates">About the gates</h2>
<h3 id="--forget-gate">- Forget gate</h3>
<p>For the sake of this illustration, let’s assume we are reading words in a piece of text, and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need a way to get rid of our previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this:</p>
<script type="math/tex; mode=display">\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)\tag{1}</script>
<p>Here, <script type="math/tex">W_f</script> are weights that govern the forget gate’s behavior. We concatenate <script type="math/tex">[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]</script> and multiply by <script type="math/tex">W_f</script>. The equation above results in a vector <script type="math/tex">\Gamma_f^{\langle t \rangle}</script> with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state <script type="math/tex">c^{\langle t-1 \rangle}</script>. So if one of the values of <script type="math/tex">\Gamma_f^{\langle t \rangle}</script> is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of <script type="math/tex">c^{\langle t-1 \rangle}</script>. If one of the values is 1, then it will keep the information.</p>
<h3 id="--update-gate">- Update gate</h3>
<p>Once we forget that the subject being discussed is singular, we need a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:</p>
<script type="math/tex; mode=display">\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\tag{2}</script>
<p>Similar to the forget gate, here <script type="math/tex">\Gamma_u^{\langle t \rangle}</script> is again a vector of values between 0 and 1. This will be multiplied element-wise with <script type="math/tex">\tilde{c}^{\langle t \rangle}</script>, in order to compute <script type="math/tex">c^{\langle t \rangle}</script>.</p>
<h3 id="--updating-the-cell">- Updating the cell</h3>
<p>To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is:</p>
<script type="math/tex; mode=display">\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)\tag{3}</script>
<p>Finally, the new cell state is:</p>
<script type="math/tex; mode=display">c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle}* c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} *\tilde{c}^{\langle t \rangle} \tag{4}</script>
<h3 id="--output-gate">- Output gate</h3>
<p>To decide which outputs we will use, we will use the following two formulas:</p>
<p><script type="math/tex">\Gamma_o^{\langle t \rangle}= \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)\tag{5}</script>
<script type="math/tex">a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle}* \tanh(c^{\langle t \rangle})\tag{6}</script></p>
<p>In equation 5 you decide what to output using a sigmoid function, and in equation 6 you multiply that by the <script type="math/tex">\tanh</script> of the current cell state.</p>
<h2 id="lstm-cell">LSTM cell</h2>
<p><strong>Instructions</strong>:</p>
<ol>
<li>Concatenate <script type="math/tex">a^{\langle t-1 \rangle}</script> and <script type="math/tex">x^{\langle t \rangle}</script> in a single matrix: <script type="math/tex">concat = \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}</script></li>
<li>Compute all the formulas 1-6. You can use <code class="highlighter-rouge">sigmoid()</code> and <code class="highlighter-rouge">np.tanh()</code>.</li>
<li>Compute the prediction <script type="math/tex">y^{\langle t \rangle}</script>. You can use <code class="highlighter-rouge">softmax()</code></li>
</ol>
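The three steps above, implementing equations 1-6 in NumPy. The parameter names (`Wf`, `Wu`, `Wc`, `Wo`, `Wy` and the biases) and their shapes are assumptions consistent with the concatenation described in step 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(xt, a_prev, c_prev, p):
    """One LSTM time-step. xt: (n_x, m); a_prev, c_prev: (n_a, m)."""
    concat = np.vstack([a_prev, xt])                  # step 1: stack a and x
    gf = sigmoid(p["Wf"] @ concat + p["bf"])          # (1) forget gate
    gu = sigmoid(p["Wu"] @ concat + p["bu"])          # (2) update gate
    cct = np.tanh(p["Wc"] @ concat + p["bc"])         # (3) candidate cell value
    c_next = gf * c_prev + gu * cct                   # (4) new cell state
    go = sigmoid(p["Wo"] @ concat + p["bo"])          # (5) output gate
    a_next = go * np.tanh(c_next)                     # (6) new hidden state
    z = p["Wy"] @ a_next + p["by"]                    # step 3: softmax prediction
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return a_next, c_next, e / e.sum(axis=0, keepdims=True)

# Tiny example with n_x=3, n_a=5, n_y=2, m=4 (so each gate weight is (5, 8)):
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((5, 8)) for k in ("Wf", "Wu", "Wc", "Wo")}
p.update({k: np.zeros((5, 1)) for k in ("bf", "bu", "bc", "bo")})
p.update({"Wy": rng.standard_normal((2, 5)), "by": np.zeros((2, 1))})
a_next, c_next, yt = lstm_cell_forward(rng.standard_normal((3, 4)),
                                       np.zeros((5, 4)), np.zeros((5, 4)), p)
```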
<h3 id="forward-pass-for-lstm">Forward pass for LSTM</h3>
<p>Now that you have implemented one step of an LSTM, you can iterate it with a for-loop to process a sequence of <script type="math/tex">T_x</script> inputs.</p>
<p><img src="https://imgur.com/CFEgAAx.png" alt="neuron" class="align-center" /></p>
<p>The above image shows an LSTM over multiple time-steps.</p>
<p><strong>Exercise:</strong> Implement <code class="highlighter-rouge">lstm_forward()</code> to run an LSTM over <script type="math/tex">T_x</script> time-steps.</p>
<p><strong>Note</strong>: <script type="math/tex">c^{\langle 0 \rangle}</script> is initialized with zeros.</p>
<p>That completes the forward passes for the basic RNN and the LSTM. When using a deep learning framework, implementing the forward pass is sufficient to build systems that achieve great performance. Now we will see how to do backpropagation in LSTMs and RNNs.</p>
<h2 id="lstm-backward-pass">LSTM backward pass</h2>
<h3 id="one-step-backward">One Step backward</h3>
<p>The LSTM backward pass is slightly more complicated than the forward one. We have provided you with all the equations for the LSTM backward pass below. (If you enjoy calculus exercises, feel free to try deriving these from scratch yourself.)</p>
<h3 id="gate-derivatives">Gate derivatives</h3>
<script type="math/tex; mode=display">d \Gamma_o^{\langle t \rangle} = da_{next}*\tanh(c_{next}) * \Gamma_o^{\langle t \rangle}*(1-\Gamma_o^{\langle t \rangle})\tag{7}</script>
<script type="math/tex; mode=display">d\tilde c^{\langle t \rangle} = \left(dc_{next}*\Gamma_u^{\langle t \rangle}+ \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * \Gamma_u^{\langle t \rangle} * da_{next}\right) * (1-(\tilde c^{\langle t \rangle})^2) \tag{8}</script>
<script type="math/tex; mode=display">d\Gamma_u^{\langle t \rangle} = \left(dc_{next}*\tilde c^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * \tilde c^{\langle t \rangle} * da_{next}\right)*\Gamma_u^{\langle t \rangle}*(1-\Gamma_u^{\langle t \rangle})\tag{9}</script>
<script type="math/tex; mode=display">d\Gamma_f^{\langle t \rangle} = \left(dc_{next}*c_{prev} + \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * c_{prev} * da_{next}\right)*\Gamma_f^{\langle t \rangle}*(1-\Gamma_f^{\langle t \rangle})\tag{10}</script>
<h3 id="parameter-derivatives">Parameter derivatives</h3>
<p><script type="math/tex">dW_f = d\Gamma_f^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{11}</script>
<script type="math/tex">dW_u = d\Gamma_u^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{12}</script>
<script type="math/tex">dW_c = d\tilde c^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{13}</script>
<script type="math/tex">dW_o = d\Gamma_o^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{14}</script></p>
<p>To calculate <script type="math/tex">db_f, db_u, db_c, db_o</script> you just need to sum across the horizontal axis (axis = 1) on <script type="math/tex">d\Gamma_f^{\langle t \rangle}, d\Gamma_u^{\langle t \rangle}, d\tilde c^{\langle t \rangle}, d\Gamma_o^{\langle t \rangle}</script> respectively. Note that you should use the <code class="highlighter-rouge">keepdims = True</code> option.</p>
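For instance, in NumPy (the shapes and the gate gradient here are illustrative placeholders):

```python
import numpy as np

n_a, m = 5, 10
# hypothetical gate gradient of shape (n_a, m)
dGammaf = np.random.default_rng(0).standard_normal((n_a, m))
# sum across the examples axis; keepdims preserves the (n_a, 1) bias shape
dbf = dGammaf.sum(axis=1, keepdims=True)
```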
<p>Finally, you will compute the derivative with respect to the previous hidden state, previous memory state, and input.</p>
<p><script type="math/tex">da_{prev} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle} \tag{15}</script>
Here, the weights for equation 15 are the first <script type="math/tex">n_a</script> components (i.e. <script type="math/tex">W_f = W_f[:n_a,:]</script> etc.)</p>
<p><script type="math/tex">dc_{prev} = dc_{next}\Gamma_f^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1- \tanh(c_{next})^2)*\Gamma_f^{\langle t \rangle}*da_{next} \tag{16}</script>
<script type="math/tex">dx^{\langle t \rangle} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle}\tag{17}</script>
where the weights for equation 17 are from <script type="math/tex">n_a</script> to the end (i.e. <script type="math/tex">W_f = W_f[n_a:,:]</script> etc.)</p>
<h3 id="backward-pass-through-the-lstm-rnn">Backward pass through the LSTM RNN</h3>
<p>This part is very similar to the <code class="highlighter-rouge">rnn_backward</code> function you implemented above. You will first create variables of the same dimension as your return variables. You will then iterate over all the time steps starting from the end and call the one step function you implemented for LSTM at each iteration. You will then update the parameters by summing them individually. Finally return a dictionary with the new gradients.</p>
<p><strong>Instructions</strong>: Implement the <code class="highlighter-rouge">lstm_backward</code> function. Create a for loop starting from <script type="math/tex">T_x</script> and going backward. For each step call <code class="highlighter-rouge">lstm_cell_backward</code> and update your old gradients by adding the new gradients to them. Note that <code class="highlighter-rouge">dxt</code> is not updated but is stored.</p>
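<p>A minimal sketch of that loop structure, with a stand-in stub in place of the real <code>lstm_cell_backward</code> (whose gradients would come from the equations above); all shapes and names are illustrative:</p>

```python
import numpy as np

def lstm_cell_backward_stub(da_next, dc_next, xt):
    """Stand-in for the real lstm_cell_backward: returns gradients
    of the right shapes so the accumulation loop can be shown."""
    n_a, m = da_next.shape
    n_x = xt.shape[0]
    return {"dxt": np.zeros((n_x, m)),
            "da_prev": np.zeros((n_a, m)),
            "dc_prev": np.zeros((n_a, m)),
            "dWf": np.ones((n_a, n_a + n_x)),   # placeholder parameter grads
            "dbf": np.ones((n_a, 1))}

def lstm_backward(da, x):
    """da: (n_a, m, T_x) upstream gradients; x: (n_x, m, T_x) inputs."""
    n_a, m, T_x = da.shape
    n_x = x.shape[0]
    # Create accumulators with the same dimensions as the return variables.
    dx = np.zeros((n_x, m, T_x))
    dWf = np.zeros((n_a, n_a + n_x))
    dbf = np.zeros((n_a, 1))
    da_prevt = np.zeros((n_a, m))
    dc_prevt = np.zeros((n_a, m))
    # Iterate over all time steps, starting from the end.
    for t in reversed(range(T_x)):
        g = lstm_cell_backward_stub(da[:, :, t] + da_prevt, dc_prevt, x[:, :, t])
        dx[:, :, t] = g["dxt"]          # dxt is stored, not summed
        dWf += g["dWf"]                 # parameter grads are accumulated
        dbf += g["dbf"]
        da_prevt, dc_prevt = g["da_prev"], g["dc_prev"]
    return {"dx": dx, "dWf": dWf, "dbf": dbf}

grads = lstm_backward(np.random.randn(4, 2, 6), np.random.randn(3, 2, 6))
print(grads["dWf"][0, 0])  # 6.0: one accumulated contribution per time step
```

<p>The real implementation would return and accumulate all eight parameter gradients (<code>dWf, dWu, dWc, dWo, dbf, dbu, dbc, dbo</code>) in exactly the same way.</p>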
<h2 id="activity-recognition">Activity Recognition</h2>
<p>Here’s a Github Gist to an activity recognition code using LSTM’s : <a href="https://gist.github.com/varun19299/88c94804cf63fe865d7d3e1b3e4aa630">link</a>.</p>
<h2 id="sequence-to-sequence-models---a-general-overview">Sequence to Sequence Models - a general overview</h2>
<p>Many times, we might have to convert one sequence to another. Really? Where?</p>
<p>We do this in machine translation. For this purpose we use models known as sequence to sequence models (<strong>seq2seq</strong>).</p>
<p>If we take a high-level view, a seq2seq model has an encoder, a decoder and an intermediate step as its main components:
<img src="https://cdn-images-1.medium.com/max/800/1*3lj8AGqfwEE5KCTJ-dXTvg.png" alt="alt text" class="align-center" /></p>
<p>A basic sequence-to-sequence model consists of two recurrent neural networks (RNNs): an encoder that processes the input and a decoder that generates the output. This basic architecture is depicted below.
<img src="https://www.tensorflow.org/images/basic_seq2seq.png" alt="alt text" class="align-center" /></p>
<p>Each box in the picture above represents a cell of the RNN, most commonly a GRU cell or an LSTM cell. The encoder and decoder can share weights or, as is more common, use different sets of parameters.</p>
<p>In the basic model depicted above, every input has to be encoded into a fixed-size state vector, as that is the only thing passed to the decoder. To allow the decoder more direct access to the input, an <strong>attention</strong> mechanism was introduced. We’ll look into details of the attention mechanism in the next part.</p>
<h2 id="encoder">Encoder</h2>
<p>Our input sequence is “how are you”. Each word from the input sequence is associated with a vector <script type="math/tex">w \in \mathbb{R}^d</script> (via a lookup table). In our case, we have 3 words, thus our input will be transformed into <script type="math/tex">[w_0,w_1,w_2]\in \mathbb{R}^{d\times 3}</script>. Then, we simply run an LSTM over this sequence of vectors and store the last hidden state output by the LSTM: this will be our encoder representation <script type="math/tex">e</script>. Let’s write the hidden states <script type="math/tex">[e_0,e_1,e_2]</script> (and thus <script type="math/tex">e=e_2</script>).
<img src="https://guillaumegenthial.github.io/assets/img2latex/seq2seq_vanilla_encoder.svg" alt="alt text" class="align-center" /></p>
<h2 id="decoder">Decoder</h2>
<p>Now that we have a vector <script type="math/tex">e</script> that captures the meaning of the input sequence, we’ll use it to generate the target sequence word by word. We feed another LSTM cell <script type="math/tex">e</script> as hidden state and a special start of sentence vector <script type="math/tex">w_{sos}</script> as input. The LSTM computes the next hidden state <script type="math/tex">h_0 \in \mathbb{R}^h</script>. Then, we apply some function <script type="math/tex">g : \mathbb{R}^h \mapsto \mathbb{R}^V</script> so that
<script type="math/tex">s_0 := g ( h_0 ) \in \mathbb{R}^V</script> is a vector of the same size as the vocabulary.</p>
<p>\begin{equation}
h_0 = LSTM ( e , w_{s o s} )
\end{equation}
\begin{equation}
s_0 = g ( h_0 )
\end{equation}
\begin{equation}
p_0 = softmax ( s_0 )
\end{equation}
\begin{equation}
i_0 = argmax ( p_0 )
\end{equation}</p>
<p>Then, apply a softmax to <script type="math/tex">s_0</script> to normalize it into a vector of probabilities <script type="math/tex">p_0 \in \mathbb{R}^V</script>. Now, each entry of <script type="math/tex">p_0</script> measures how likely each word in the vocabulary is. Let’s say that the word “comment” has the highest probability (and thus <script type="math/tex">i_0 = argmax ( p_0 )</script> corresponds to the index of “comment”). Get the corresponding vector <script type="math/tex">w_{i_0} = w_{comment}</script> and repeat the procedure: the LSTM will take <script type="math/tex">h_0</script> as hidden state and <script type="math/tex">w_{comment}</script> as input and will output a probability vector <script type="math/tex">p_1</script> over the second word, etc.</p>
<p>\begin{equation}
h_1 = LSTM ( h_0 , w_{i_0} )
\end{equation}
\begin{equation}
s_1 = g ( h_1 )
\end{equation}
\begin{equation}
p_1 = softmax ( s_1 )
\end{equation}
\begin{equation}
i_1 = argmax ( p_1 )
\end{equation}</p>
<p>The decoding stops when the predicted word is a special end of sentence token.</p>
<p><img src="https://guillaumegenthial.github.io/assets/img2latex/seq2seq_vanilla_decoder.svg" alt="alt text" class="align-center" /></p>
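<p>One greedy decoding step (<script type="math/tex">s_t</script>, softmax, argmax) can be sketched in NumPy. The projection <code>W_g</code> standing in for <script type="math/tex">g</script>, and the sizes, are made up for illustration:</p>

```python
import numpy as np

V, h_dim = 8, 4                       # hypothetical vocabulary and hidden sizes
rng = np.random.default_rng(0)
W_g = rng.standard_normal((V, h_dim)) # stands in for the learned function g
h_t = rng.standard_normal(h_dim)      # current decoder hidden state

s_t = W_g @ h_t                             # scores over the vocabulary
p_t = np.exp(s_t) / np.exp(s_t).sum()       # softmax: normalised probabilities
i_t = int(np.argmax(p_t))                   # greedy pick: most likely word index
print(round(p_t.sum(), 6), 0 <= i_t < V)    # 1.0 True
```

<p>The chosen index <code>i_t</code> selects the embedding fed back into the LSTM at the next step, until the end-of-sentence token is produced.</p>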
<h2 id="attention-">Attention !!!</h2>
<p><img src="https://guillaumegenthial.github.io/assets/img2latex/seq2seq_attention_mechanism_new.svg" alt="alt text" class="align-center" /></p>
<h2 id="seq2seq-with-attention">Seq2Seq with Attention</h2>
<p>The previous model has been refined over the past few years and greatly benefited from what is known as attention. Attention is a mechanism that forces the model to learn to focus (= to attend) on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder’s LSTM. One way of performing attention is as follows. We slightly modify the recurrence formula that we defined above by adding a new vector <script type="math/tex">c_t</script> to the input of the LSTM
\begin{equation}
h_t = LSTM ( h_{t − 1} , [ w_{i_{t − 1 }}, c_t ] )
\end{equation}
\begin{equation}
s_t = g ( h_t )
\end{equation}
\begin{equation}
p_t = softmax ( s_t )
\end{equation}
\begin{equation}
i_t = argmax ( p_t )
\end{equation}</p>
<p>The vector <script type="math/tex">c_t</script> is the attention (or context) vector. We compute a new context vector at each decoding step. First, with a function <script type="math/tex">f ( h_{t-1} , e_{t'} ) \mapsto \alpha_{t'} \in \mathbb{R}</script>, compute a score for each hidden state <script type="math/tex">e_{t'}</script> of the encoder. Then, normalize the sequence of <script type="math/tex">\alpha_{t'}</script> using a softmax and compute <script type="math/tex">c_t</script> as the weighted average of the <script type="math/tex">e_{t'}</script>.</p>
<p><script type="math/tex">\alpha_{t'} = f ( h_{t-1}, e_{t'} ) \in \mathbb{R}</script> for all <script type="math/tex">t'</script></p>
<script type="math/tex; mode=display">\vec{\alpha} = softmax ( α )</script>
<script type="math/tex; mode=display">c_t = \sum_{t'=0}^{n} \vec{\alpha}_{t'} e_{t'}</script>
<p>The choice of the function <script type="math/tex">f</script> varies; common choices include a simple dot product between <script type="math/tex">h_{t-1}</script> and <script type="math/tex">e_{t'}</script>, or a small feedforward network.</p>
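<p>Under the dot-product choice of <script type="math/tex">f</script>, one attention step can be sketched as follows (the dimensions and names are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
h_prev = rng.standard_normal(4)        # decoder state h_{t-1}
E = rng.standard_normal((3, 4))        # encoder states e_0, e_1, e_2 as rows

scores = E @ h_prev                              # alpha'_{t'} = f(h_{t-1}, e_{t'})
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax-normalised weights
c_t = alpha @ E                                  # weighted average of the e_{t'}
print(c_t.shape)  # (4,)
```

<p>The context vector <code>c_t</code> is then concatenated with the previous word embedding and fed into the LSTM, as in the modified recurrence above.</p>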
<h2 id="one-of-the-main-usage-of-a-sequence-to-sequence-model-is-in-neural-machine-translation">One of the main uses of a sequence to sequence model is in Neural Machine Translation</h2>
<p>Check out the session notebook for the code on how to do this.</p>Computer Vision and Intelligencecvigroup.cfi@gmail.comRecurrent Neural NetworksSummer School Deep Learning Session 32018-07-13T00:00:00+05:302018-07-13T00:00:00+05:30https://iitmcvg.github.io/summer_school/DLSession3<h2 id="an-introduction-to-convolutional-neural-networks">An introduction to Convolutional Neural Networks</h2>
<p>In this post we will be giving an introduction to Convolutional Neural Networks and will show you how to implement one.</p>
<p>The session notebook can be found <a href="https://github.com/iitmcvg/Content/tree/master/Sessions/Summer_School_2018/Session_DL_3">here</a></p>
<p><strong>Neurons as Feature Detectors</strong></p>
<p>Recall how a neuron’s value is calculated. It consists of an aggregation and an activation step. It takes several real numbers as input and outputs a single real number.
\begin{equation}
Aggregation: Agg=b_1x_1 + b_2x_2 + \cdots + b_nx_n
\end{equation}
\begin{equation}
Activation: y=\sigma(Agg)
\end{equation}
Note that the parameters of this neuron are <script type="math/tex">b_i</script> and are <script type="math/tex">n</script> in number.
You may recall from JEE math that the aggregation step can be written as a dot product:
\begin{equation}
Agg=b\cdot x= |b||x|cos(\theta)
\end{equation}</p>
<p><img src="https://i.imgur.com/2NRhbZy.png" alt="neuron" class="align-center" /></p>
<p>Where <script type="math/tex">b</script> and <script type="math/tex">x</script> are column vectors and <script type="math/tex">\theta</script> is the angle between them. For fixed magnitudes, the aggregate will be maximised when the angle between them is 0. In other words, the aggregation step measures the similarity between <script type="math/tex">x</script> and <script type="math/tex">b</script> (specifically the cosine similarity). When the aggregation is large and positive, the neuron’s activation will approach 1, while if the vectors are very dissimilar, it will approach 0. Thus, a neuron can be interpreted as a feature detector: the presence of the feature in the input is indicated by the value of the neuron’s output.</p>
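<p>This interpretation can be sketched in a few lines of NumPy; the weight vector and inputs here are made up for illustration:</p>

```python
import numpy as np

def neuron(x, b):
    """Aggregation (dot product) followed by sigmoid activation."""
    agg = np.dot(b, x)
    return 1.0 / (1.0 + np.exp(-agg))

b = np.array([1.0, 1.0, 0.0])             # the neuron's learned "feature"
x_similar = np.array([2.0, 2.0, 0.0])     # points the same way as b
x_dissimilar = np.array([-2.0, -2.0, 0.0])

print(neuron(x_similar, b) > 0.5, neuron(x_dissimilar, b) < 0.5)  # True True
```

<p>The aligned input drives the activation toward 1, the anti-aligned one toward 0: the neuron "detects" its weight vector in the input.</p>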
<p>There is a catch, however. In the preceding section, we assumed that the magnitudes of <script type="math/tex">x</script> and <script type="math/tex">b</script> were fixed. It is reasonable to ignore the magnitude of <script type="math/tex">b</script> since we are only interested in comparing the ‘strength’ of the feature’s presence across different values of <script type="math/tex">x</script>. Ignoring the magnitude of <script type="math/tex">x</script> is dubious because we could drive the aggregate to a large value simply by letting all the elements of <script type="math/tex">x</script> be large values. This is why it is important to scale your data before passing it to any learning algorithm.</p>
<p><strong>Images and Feed Forward Neural Networks</strong></p>
<p>Suppose we want to use a neural network to classify images. An image consists of pixel values arranged in a matrix whose shape matches the resolution of the image. Since each pixel has a separate value for red, green and blue levels, you can imagine them as 3 matrices stacked up (3 channels). In order to use the image as input to the neural network from the last class, we need to transform this data into a 1-D vector. You should notice that the neural network we’ve been studying so far does not care about the ordering of the input vector, as long as all samples are ordered the same way. This order-independence allows us to squash our image in any way we want as long as we use the same method of squashing for each image in our dataset.</p>
<p>For simplicity, we will use grayscale images, which only require 1 matrix to store pixel values. Our squashing method is placing rows of the image next to each other.</p>
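<p>The squashing described above is a single reshape; a minimal sketch for a 28x28 grayscale image:</p>

```python
import numpy as np

# A 28x28 grayscale image, flattened row-by-row into a 784-vector.
image = np.arange(28 * 28).reshape(28, 28)
flat = image.reshape(-1)   # rows placed next to each other (row-major order)

print(flat.shape, flat[28] == image[1, 0])  # (784,) True
```

<p>Element 28 of the flat vector is the first pixel of the second row, confirming the rows are laid end to end; any fixed ordering would work equally well, as long as it is the same for every image.</p>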
<p><img src="https://raw.githubusercontent.com/zalandoresearch/fashion-mnist/master/doc/img/fashion-mnist-sprite.png" alt="fmnist" class="align-center" /></p>
<p>The FMNIST dataset contains 60000 images in 10 classes: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag and Ankle boot.</p>
<p>Thinking back to our feature detector discussion, consider a neuron in the first hidden layer. It has 784 inputs, each associated with a weight. The output of the neuron measures the similarity of the 784 inputs to the weights. If we plot these weights in a 28x28 image, we can see what feature the neuron is detecting.
Here are the filters learned for 0 hidden layers. (green=positive, red=negative)</p>
<p><img src="https://i.imgur.com/jh5IN0J.png" alt="0 hidden layers" class="align-center" /></p>
<p>The case of zero hidden layers is equivalent to logistic regression. Let us plot the weights learned for a network with a single hidden layer.</p>
<p><img src="https://i.imgur.com/GnxXohj.png" alt="1 hidden layer" class="align-center" />
We see that the first layer learns many duplicate features.</p>
<p>Neurons in subsequent layers can be analysed similarly, however plotting them will not yield very interpretable information.</p>
<p><img src="https://i.imgur.com/Ru1idV6.png" alt="1 hidden layer" class="align-center" /></p>
<p>These neurons are actually looking for combinations of features from the first layer that result in more abstract representations of the input. For example, the first layer might detect similarity to circles and squares, while the next layer could combine them, becoming active only when both those neurons are active.</p>
<p><strong>Convolution and Spatial Proximity</strong>
In the previous section, we were measuring the similarity of the image to a learned weight matrix. But what would happen if the object we are trying to detect doesn’t overlap significantly with the learned weights? What if our object of interest is in the corner of the image? The detector neuron would not be activated and we would not detect the object.</p>
<p>We could instead break the image into smaller pieces and look for parts of the object. This is what the convolution operator attempts to do. In the previous section, we looked to match the image with a 28x28 weight matrix. Why not look at every possible 5x5 square in the image and match it to a learned ‘filter’ of the same size?</p>
<p><img src="https://i.stack.imgur.com/I7DBr.gif" alt="Convolution operator" class="align-center" /></p>
<p>The calculation we are performing is the same as before; we are simply doing it in patches of the image. The operation of performing this dot product over all patches of the image is known as convolution. Each dot product yields a real number, so the convolution operator outputs a matrix of these scalars.</p>
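<p>A minimal sketch of this patch-wise dot product (as in deep learning libraries, this is technically cross-correlation, since the filter is not flipped):</p>

```python
import numpy as np

def convolve2d_valid(img, kernel):
    """'Valid' convolution: one dot product per patch, no padding."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = img[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product with this patch
    return out

img = np.random.randn(7, 7)
kernel = np.random.randn(5, 5)
out = convolve2d_valid(img, kernel)
print(out.shape)  # (3, 3): the output is smaller than the 7x7 input
```

<p>Note how the output shrinks from 7x7 to 3x3; the padding schemes discussed below exist to counter exactly this.</p>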
<p><img src="https://i.imgur.com/FKkZKjY.png" alt="Kirsch filter" class="align-center" /></p>
<p><img src="https://i.imgur.com/ildbL7Y.png" alt="Kirsch filter" class="align-center" /></p>
<p>Above is the output of the 8 Kirsch filters that find edges oriented along the compass directions.</p>
<p>One advantage convolution offers over a feed forward network is that it greatly reduces the parameters to be learned. We have gone from learning a 28x28 filter to learning a 5x5 filter for a single neuron. That amounts to a reduction by a factor of 30. The disadvantage is that this filter, being so small, cannot capture very complex information about the image. At best we can find edges in the first layer. As before, the network will detect more complex features in the deeper layers. It could use a vertical edge and a horizontal edge to determine whether there is a T shape in the image.</p>
<p>For images, CNNs usually perform better than feed forward nets even though CNNs have fewer parameters. Further, CNNs can be represented equivalently as feed forward nets. Why can’t a feed forward net learn this equivalent weight matrix that the CNN learns? The answer probably lies in spatial connectivity. The convolution operator inherently depends only on the neighbours of a pixel and we know naturally that a pixel must always be considered in the context of its neighbours in order to characterise it appropriately. CNNs are forced into doing this while a feedforward net would have to learn this on its own and with a massive parameter search space, this is highly unlikely to happen.</p>
<h2 id="the-details-of-convolution">The Details of Convolution</h2>
<h3 id="padding">Padding</h3>
<p>Notice in the gif above that the output matrix is smaller than the input matrix. This happens because the filter is not allowed to exceed the bounds of the image, which can result in incomplete usage of features located near the edge of the image and is a waste of valuable data. We can overcome this by using padding. Padding refers to adding dummy pixels around the image so as to ensure that the convolution does not change the size of the output matrix. A few forms of padding are ‘constant’, ‘nearest’, ‘mirror’ and ‘wrap’.</p>
<p><img src="https://adeshpande3.github.io/assets/Pad.png" alt="padding" class="align-center" /></p>
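<p>The four padding schemes can be compared directly with <code>np.pad</code>, whose mode names differ slightly from those above (<code>'edge'</code> is "nearest", <code>'reflect'</code> is "mirror"):</p>

```python
import numpy as np

row = np.array([[1, 2, 3]])

# Pad 2 dummy pixels on the left and right of a single row.
print(np.pad(row, ((0, 0), (2, 2)), mode="constant")[0])  # [0 0 1 2 3 0 0]
print(np.pad(row, ((0, 0), (2, 2)), mode="edge")[0])      # [1 1 1 2 3 3 3]
print(np.pad(row, ((0, 0), (2, 2)), mode="reflect")[0])   # [3 2 1 2 3 2 1]
print(np.pad(row, ((0, 0), (2, 2)), mode="wrap")[0])      # [2 3 1 2 3 1 2]
```

<p>Constant (zero) padding is by far the most common choice in CNN layers.</p>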
<h3 id="stride">Stride</h3>
<p>In our example, we’ve been shifting the filter 1 unit right before every dot product. If we instead move it by 2 units, the size of the output would reduce by a factor of 2. The amount by which the filter moves is known as the stride and can be adjusted to decrease the amount of computation required in subsequent layers.</p>
<p><img src="https://adeshpande3.github.io/assets/Stride1.png" alt="stride1" class="align-center" /></p>
<p>Now, let us increase the stride to 2.</p>
<p><img src="https://adeshpande3.github.io/assets/Stride2.png" alt="stride2" class="align-center" /></p>
<h3 id="shift-equivariance">Shift equivariance</h3>
<p>Consider the case below where the same image is shifted and fed into some convolution operator.
<img src="https://i.imgur.com/CVwwPOe.png" alt="equivariance" class="align-center" />
Notice that the output contains the same pattern of activations shifted by the same amount as the original image was shifted. This is known as translational equivariance. Convolutions are often misrepresented as being shift invariant but this is not the case since we can see that shifting the input does indeed change the output.</p>
<h3 id="handling-multiple-channels">Handling Multiple Channels</h3>
<p>Earlier, we assumed that the input to our convolution was a grayscale image, which can be represented as a single matrix, rather than an RGB image, which must be represented as a stack of matrices. How does convolution handle this?
Instead of learning a single filter matrix, we can learn 3 filter matrices (one for each channel) and stack them, creating (for a 3x3 spatial filter) a 3x3x3 filter. This is called a tensor.</p>
<p><img src="http://machinethink.net/images/vggnet-convolutional-neural-network-iphone/ConvolutionKernel@2x.png" alt="tensor" class="align-center" /></p>
<p>The tensor moves across the image in the same way as before, performing a dot product and giving a matrix output after passing over the entire image. Convolution is therefore a tensor to matrix operation.</p>
<h3 id="handling-multiple-filters">Handling Multiple Filters</h3>
<p>Given an image, we would like to extract as much information about it as possible. We might want to find all the different oriented edges separately (similar to the Kirsch operator). This would involve using many filters at once. Since each filter outputs a matrix, it is fairly straightforward to stack these matrices, giving a tensor with <script type="math/tex">f</script> channels (where <script type="math/tex">f</script> is the number of filters).
Let’s put all of this together and see what a convolution with multiple input and output channels looks like.</p>
<p><img src="http://machinethink.net/images/vggnet-convolutional-neural-network-iphone/ConvLayer@2x.png" alt="multi in - multi out" class="align-center" /></p>
<h2 id="building-a-cnn">Building a CNN</h2>
<h3 id="series-convolution">Series Convolution</h3>
<p>Now that we’ve seen how convolutions work, we can start to construct a neural network. The key idea here is to apply several convolutions in series so as to detect more and more nuanced features as we get deeper into the network. Why would this happen at all? We can show by a simple example that series convolutions can detect shapes in an image.</p>
<p>Let’s say we want to detect squares in our image. We could first use Kirsch filters to find the edges of the square. The next set of convolutions can find whether these edges are perpendicular to each other and locate crossings. The next layer could convolve with a filter detecting 4 dots. This is a really simple example to show that features are hierarchical: we first detected edges, then combinations of edges and then shapes. Further layers can even detect textures.</p>
<h3 id="pooling">Pooling</h3>
<p>A weakness of this method is that it isn’t scale invariant. This wouldn’t be able to detect squares of all sizes. 2 solutions to this are possible. The first solution would be to increase the stride of our filter so that spatially distant features in the image are brought closer together in the resulting feature map. As we go deeper into the network, feature maps would get smaller and smaller. The second way of going about this would be to perform a down sampling of the features maps at each stage. This is known as pooling. There are various methods of pooling like average and median, but the most common is called max pooling.</p>
<p><img src="http://ufldl.stanford.edu/tutorial/images/Pooling_schematic.gif" alt="pooling" class="align-center" /></p>
<p>Max pooling involves moving a box over contiguous areas of the feature map and selecting the max value in each box. The stride of the boxes is generally equal to the size of the boxes. 2x2 max-pooling is very common in deep learning. It is also non-parametric, so it doesn’t need to learn any parameters to perform its operation, unlike filter convolutions. This is why most people prefer this method.</p>
<p><img src="http://cs231n.github.io/assets/cnn/pool.jpeg" alt="Pooling operation" class="align-center" /></p>
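<p>A minimal 2x2 max-pooling sketch, using a reshape trick that assumes the feature map has even height and width:</p>

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2: keep the max of each 2x2 box."""
    H, W = fmap.shape  # H and W are assumed even
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 0],
                 [5, 6, 3, 1]])
print(max_pool_2x2(fmap))
# [[4 8]
#  [9 3]]
```

<p>Each output entry is the maximum of one non-overlapping 2x2 box, halving both spatial dimensions with no learned parameters.</p>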
<p>In addition to bringing us closer to scale invariance, max pooling is robust to small shifts in the image, therefore we have also taken a step in the direction of translation invariance.</p>
<h3 id="fully-connected-layers">Fully Connected Layers</h3>
<p>Convolutions suffer from an inherent bias: they only take into account the neighbourhood of a pixel. If we were, for example, trying to detect a car in a high resolution image, the network would have to perform multiple down samplings until a single filter is able to cover 2 active wheel detector neurons. We could instead use fully connected layers to take note of these distributed features.</p>
<h3 id="putting-it-all-together">Putting it all together</h3>
<p><img src="https://www.mathworks.com/content/mathworks/www/en/discovery/convolutional-neural-network/_jcr_content/mainParsys/image_copy.adapt.full.high.jpg/1517522275430.jpg" alt="CNN" class="align-center" /></p>
<h2 id="cnns-over-the-years">CNNs over the years</h2>
<h3 id="lenet5">LENET5</h3>
<p>It is the year 1994, and this is one of the very first convolutional neural networks, one that propelled the field of Deep Learning.</p>
<p>This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988.</p>
<p><img src="https://cdn-images-1.medium.com/max/800/0*V1vb9SDnsU1eZQUy.jpg" alt="alt text" class="align-center" /></p>
<h3 id="imagenet">IMAGENET</h3>
<p>ImageNet Large Scale Visual Recognition Challenge (ILSVRC) evaluates algorithms for object detection and image classification at large scale. One high level motivation is to allow researchers to compare progress in detection across a wider variety of objects – taking advantage of the quite expensive labeling effort. Another motivation is to measure the progress of computer vision for large scale image indexing for retrieval and annotation.</p>
<p><img src="https://raw.githubusercontent.com/TerrenceMiao/Data-Science/master/ImageNet%20Top%205%20Error%20Rate.jpeg" alt="alt text" class="align-center" /></p>
<h3 id="alexnet">ALEXNET</h3>
<p>In 2012, Alex Krizhevsky released AlexNet, a deeper and much wider version of LeNet, which won the difficult ImageNet competition by a large margin.</p>
<p>Wide as in 20+% !</p>
<p><img src="https://cdn-images-1.medium.com/max/800/0*vsi8JJFV_O6Z34ks.png" alt="alt text" class="align-center" /></p>
<h3 id="vgg">VGG</h3>
<p>The VGG networks from Oxford were the first to use much smaller 3×3 filters in each convolutional layer, and also combined them as a sequence of convolutions.</p>
<p>This seems to be contrary to the principles of LeNet, where large convolutions were used to capture similar features in an image.</p>
<p>Instead of the 9×9 or 11×11 filters of AlexNet, filters started to become smaller, getting dangerously close to the infamous 1×1 convolutions that LeNet wanted to avoid, at least on the first layers of the network.</p>
<p>But the great advantage of VGG was the insight that multiple 3×3 convolutions in sequence can emulate the effect of larger receptive fields, for example 5×5 and 7×7. These ideas are also used in more recent network architectures such as Inception and ResNet.</p>
<p><img src="https://cdn-images-1.medium.com/max/800/0*HREIJ1hjF7z4y9Dd.jpg" alt="alt text" class="align-center" /></p>
<p><img src="https://qph.fs.quoracdn.net/main-qimg-ba81c87204be1a5d11d64a464bca39eb" alt="alt text" class="align-center" /></p>
<p>The VGG networks use multiple 3×3 convolutional layers to represent complex features.</p>
<p>Notice blocks 3, 4, 5 of VGG-E: 3×3 filters with 256 and 512 channels are used multiple times in sequence to extract more complex features and combinations of such features.</p>
<p>This is effectively like having large 512×512 classifiers with 3 layers, which are convolutional! This obviously amounts to a massive number of parameters, and also learning power.</p>
<p>But training these networks was difficult, and they had to be split into smaller networks with layers added one by one. All this because of the lack of strong ways to regularize the model, or to somehow restrict the massive search space promoted by the large number of parameters.</p>
<p>VGG used large feature sizes in many layers and thus inference was quite costly at run-time. Reducing the number of features, as done in Inception bottlenecks, will save some of the computational cost.</p>
<h3 id="googlenet-and-inception">GoogleNet and Inception</h3>
<p><em>Let the Games begin XD</em></p>
<p><a href="https://arxiv.org/abs/1409.4842">Christian Szegedy</a> from Google began a quest aimed at reducing the computational burden of deep neural networks, and devised GoogLeNet, the first Inception architecture.</p>
<p>By now, Fall 2014, deep learning models were becoming extremely useful in categorizing the content of images and video frames.</p>
<p>Most skeptics had conceded that deep learning and neural nets were back to stay this time. Given the usefulness of these techniques, internet giants like Google were very interested in efficient and large deployments of architectures on their server farms.</p>
<h3 id="inception-v3-and-v2">INCEPTION V3 (AND V2)</h3>
<p>In February 2015 Batch-normalized Inception was introduced as Inception V2.</p>
<p>Batch-normalization computes the mean and standard-deviation of all feature maps at the output of a layer, and normalizes their responses with these values. This corresponds to “whitening” the data, and thus making all the neural maps have responses in the same range, and with zero mean. This helps training as the next layer does not have to learn offsets in the input data, and can focus on how to best combine features.</p>
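<p>The normalization step can be sketched in NumPy (in a real layer, learned scale and shift parameters would follow the normalization; they are omitted here for brevity):</p>

```python
import numpy as np

def batch_norm(feature_maps, eps=1e-5):
    """Normalise each feature map (channel) to zero mean, unit variance.
    feature_maps: (batch, channels, H, W)."""
    mean = feature_maps.mean(axis=(0, 2, 3), keepdims=True)
    std = feature_maps.std(axis=(0, 2, 3), keepdims=True)
    return (feature_maps - mean) / (std + eps)

x = np.random.randn(8, 3, 4, 4) * 10 + 5   # badly scaled activations
y = batch_norm(x)
print(round(float(y.std()), 2))  # ~1.0, with the mean driven to ~0
```

<p>After normalization all channels respond in the same range, so the next layer no longer has to learn offsets in its input.</p>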
<h3 id="resnet">RESNET</h3>
<p>The revolution then came in December 2015, at about the same time as Inception v3. ResNet has a simple idea: feed the output of two successive convolutional layers AND also bypass the input to the next layers!</p>
<p><img src="https://cdn-images-1.medium.com/max/800/0*0r0vS8myiqyOb79L.jpg" alt="alt text" class="align-center" /></p>
<h3 id="inception-v4-or-inception_resnet_v2">Inception v4 or Inception_Resnet_v2</h3>
<p><em>There’s no method to this madness</em></p>
<p>And Christian and team are at it again with a new version of Inception.</p>
<p>The Inception module after the stem is rather similar to Inception V3:</p>
<p><img src="https://cdn-images-1.medium.com/max/800/0*SJ7DP_-0R1vdpVzv.jpg" alt="alt text" class="align-center" /></p>
<h2 id="convolutional-neural-networks-accuracy">Convolutional Neural Networks’ Accuracy</h2>
<p><img src="https://cdn-images-1.medium.com/max/800/1*kBpEOy4fzLiFxRLjpxAX6A.png" alt="alt text" class="align-center" /></p>Computer Vision and Intelligencecvigroup.cfi@gmail.comAn introduction to Convolutional Neural NetworksSummer School Deep Learning Session 22018-07-12T00:00:00+05:302018-07-12T00:00:00+05:30https://iitmcvg.github.io/summer_school/DLSession2<h2 id="deep-feedforward-networks---a-general-description">Deep Feedforward Networks - a general description</h2>
<p>The session notebook can be found <a href="https://github.com/iitmcvg/Content/tree/master/Sessions/Summer_School_2018/Session_DL_2">here</a></p>
<p><strong>Deep feedforward networks</strong>, also often called feedforward neural networks, or <strong>multilayer perceptrons (MLPs)</strong>, are the quintessential deep learning models. The goal of a feedforward network is to approximate some function <script type="math/tex">f^*</script>. For example, for a classifier, <script type="math/tex">y = f^∗(x)</script> maps an input <script type="math/tex">x</script> to a category <script type="math/tex">y</script>. A feedforward network defines a mapping <script type="math/tex">y = f (x; θ)</script> and learns the value of the parameters θ that result in the best function approximation.</p>
<p>These models are called feedforward because information flows through the function being evaluated from <script type="math/tex">x</script>, through the intermediate computations used to define <script type="math/tex">f</script>, and finally to the output <script type="math/tex">y</script>.</p>
<p>Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions <script type="math/tex">f^{(1)}</script>,<script type="math/tex">f^{(2)}</script>, and <script type="math/tex">f^{(3)}</script> connected in a chain, to form <script type="math/tex">f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))</script>. These chain structures are the most commonly used structures of neural networks. In this case, <script type="math/tex">f^{(1)}</script> is called the first layer of the network, <script type="math/tex">f^{(2)}</script> is called the second layer, and so on.</p>
<p>The overall <strong>length of the chain</strong> gives the <strong>depth</strong> of the model. It is from this terminology that the name <strong>“deep learning”</strong> arises. The final layer of a feedforward network is called the <strong>output layer</strong>. During neural network training, we drive <script type="math/tex">f(x)</script> to match <script type="math/tex">f^∗(x)</script>. The training data provides us with noisy, approximate examples of <script type="math/tex">f^∗(x)</script> evaluated at different training points. Each example <script type="math/tex">x</script> is accompanied by a label <script type="math/tex">y ≈ f^∗(x)</script>. The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to <script type="math/tex">y</script>. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of <script type="math/tex">f^∗</script>. Because the training data does not show the desired output for each of these layers, these layers are called <strong>hidden layers</strong>.</p>
<p>Finally, these networks are called <strong>neural</strong> because they are loosely inspired by <strong>neuroscience</strong>. Each hidden layer of the network is typically <strong>vector-valued</strong>. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron.</p>
<p><img src="https://i.imgur.com/38lpenv.png" alt="alt" class="align-center" /></p>
<h2 id="learning-the-xor-function">Learning the XOR function</h2>
<p>To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the <strong>XOR function</strong>.</p>
<p>The <strong>XOR function (“exclusive or”)</strong> is an operation on <strong>two binary values</strong>, <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script>. When <strong>exactly one</strong> of these binary values is <strong>equal to 1</strong>, the XOR function <strong>returns 1</strong>. Otherwise, it returns 0. The XOR function provides the target function <script type="math/tex">y = f^∗(x)</script> that we want to learn. Our model provides a function <script type="math/tex">y = f(x;θ)</script> and our learning algorithm will adapt the parameters θ to make f as similar as possible to <script type="math/tex">f^∗</script>.</p>
<p>In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points<br />
<script type="math/tex">\mathbb{X}=</script> { <script type="math/tex">[0,0]^T, [0,1]^T, [1,0]^T,[1,1]^T</script> }. We will train the network on all four of these points. The only challenge is to fit the training set.</p>
<p>Clearly, this is a regression problem where we can use the mean squared error loss function.</p>
<p>Evaluated on our whole training set, the MSE loss function is</p>
<p><script type="math/tex">J(\theta) = \dfrac{1}{4} \sum_{x \in \mathbb{X}} (f^{*}(x) - f(x;\theta))^2</script>.</p>
<p>Now we must choose the form of our model, <script type="math/tex">f(x; \theta)</script>. Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be</p>
<p><script type="math/tex">f(x; w, b) = x^Tw + b</script>.</p>
<p>We can minimize <script type="math/tex">J(θ)</script> in closed form with respect to w and b using the normal
equations.</p>
<p>After solving the normal equations, we obtain a closed form solution, <script type="math/tex">w = 0</script> and <script type="math/tex">b = 0.5</script>. This means that the linear model simply outputs 0.5 everywhere.</p>
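<p>We can verify this closed-form result numerically. The sketch below uses NumPy’s least-squares solver (which solves the normal equations for us) to fit the linear model to the four XOR points:</p>

```python
import numpy as np

# The four XOR training points and their labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Append a column of ones so the bias b is learned jointly with w.
A = np.column_stack([X, np.ones(len(X))])

# Least-squares solution of the normal equations A^T A theta = A^T y.
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:2], theta[2]
print(w, b)  # w ≈ [0, 0], b ≈ 0.5: the model outputs 0.5 everywhere
```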
<p><strong>But why couldn’t a linear model represent this function?</strong> A linear model applies a fixed weight to each input: when <script type="math/tex">x_1 = 0</script>, the output must increase with <script type="math/tex">x_2</script>, but when <script type="math/tex">x_1 = 1</script> it must decrease with <script type="math/tex">x_2</script>. A single fixed weight <script type="math/tex">w_2</script> cannot do both, so no linear model can represent XOR.</p>
<p>Let’s now model this with a <strong>DNN</strong>. Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units. This feedforward network has a vector of hidden units <script type="math/tex">h</script> that are computed by a function <script type="math/tex">f^{(1)}(x; W, c)</script>. The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to <script type="math/tex">h</script> rather than to <script type="math/tex">x</script>. The network now contains two functions chained together: <script type="math/tex">h = f^{(1)}(x; W, c)</script> and <script type="math/tex">y = f^{(2)}(h; w, b)</script>, with the complete model being <script type="math/tex">f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x))</script>.</p>
<p>For a better illustration of the network, let’s look at this figure.</p>
<p><img src="https://imgur.com/BQH5IG1.png" alt="alt text" class="align-center" /></p>
<p>Here <script type="math/tex">[x_1, x_2]^T</script> is the input vector, and <script type="math/tex">y</script> is the output. We need a non-linear transformation from the input feature space to the hidden feature space of <script type="math/tex">h</script>, and this is achieved by the first layer. The first layer can be represented as <script type="math/tex">h = f^{(1)}(x; W, c)</script>, where <script type="math/tex">f^{(1)}</script> is itself non-linear. Thus <script type="math/tex">h = g(W^Tx+c)</script>, where <script type="math/tex">W^Tx+c</script> is an affine transform and <script type="math/tex">g</script> is a non-linear function called the activation function. In this case (and in most cases we will see), let’s use a simple activation known as <strong>ReLU</strong>. The ReLU (Rectified Linear Unit), despite its name, is just the function <script type="math/tex">\max\{0, x\}</script>. The output layer is the linear function <script type="math/tex">w^Th+b</script>. Overall, the neural network represents the function</p>
<script type="math/tex; mode=display">f(x;W,c,w,b) = w^T \max\{0, W^Tx + c\} + b</script>
<p><strong>With this setup, let’s guess a solution.</strong></p>
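<p>Here is one such guess. These particular parameter values are the standard hand-constructed solution for this two-unit network, and a quick numerical check shows that they reproduce XOR exactly:</p>

```python
import numpy as np

# Hand-picked parameters for f(x) = w^T max{0, W^T x + c} + b.
W = np.array([[1., 1.], [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

# All four XOR inputs, one per row.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

h = np.maximum(0., X @ W + c)  # hidden layer: ReLU of the affine transform
y = h @ w + b                  # linear output layer
print(y)  # [0. 1. 1. 0.] — exactly the XOR targets
```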
<h2 id="gradient-based-learning">Gradient-Based Learning</h2>
<p><strong>Can we keep guessing solutions like this ?</strong></p>
<p>State-of-the-art neural networks have millions of parameters to tune. To optimize these parameters, we need an objective towards which to drive our model: minimizing a value defined by a <strong>cost function</strong>. There are many cost functions to choose from, depending on the purpose.</p>
<h3 id="cost-functions">Cost functions</h3>
<p>The cost functions used in neural networks are the same as those used in simpler parametric models such as the linear model. These cost functions represent a parametrized distribution whose parameters are to be optimized.</p>
<p>You would have seen some cost functions yesterday.</p>
<p>To name a few,</p>
<ul>
<li>Mean Squared Error Loss</li>
<li>Cross Entropy Loss</li>
<li>L1 Loss</li>
</ul>
<p>Sometimes we use a neural network to model our loss function as well. You’ll be learning that later.</p>
<h3 id="gradient-descent">Gradient Descent</h3>
<p>Our task is now to minimize the cost function.</p>
<p><strong>How do we do that?</strong>
In the case of the linear model, the loss function was convex w.r.t the parameters as seen in the image. This is a plot of a loss function <script type="math/tex">J</script> w.r.t a parameter <script type="math/tex">w</script>.
<img src="https://i.imgur.com/LAJ8Uag.png" alt="alt text" class="align-center" /></p>
<p>From the figure it is clear that for convex functions we can reach the minimum by descending along the gradient. Essentially, we can represent this with the update statement below. The gradient is positive if we have overshot the optimal point and negative if we have fallen short of it, so the negative of the gradient gives the direction in which we have to drive our parameter <script type="math/tex">w</script>.</p>
<p><img src="https://i.imgur.com/oBIKHky.png" alt="alt text" /></p>
<p>Here, <script type="math/tex">\alpha</script> is known as the learning rate because it decides how much to alter the parameter in each step.</p>
<p>For convex cost functions there are global convergence guarantees.</p>
<p>This step represents one iteration of gradient descent. We do this iteratively until we reach the optimal point.</p>
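<p>A minimal sketch of this iteration, using the toy convex loss <script type="math/tex">J(w) = (w-3)^2</script> (an illustrative example whose minimum is at <script type="math/tex">w = 3</script>):</p>

```python
alpha = 0.1  # learning rate
w = 0.0      # initial parameter value

for _ in range(100):
    grad = 2 * (w - 3)    # dJ/dw for J(w) = (w - 3)^2
    w = w - alpha * grad  # move against the gradient

print(w)  # converges to ~3.0, the minimum of J
```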
<p>While this is true for convex loss functions, what about neural networks?</p>
<p>The cost functions for neural networks need not be convex. Thus, there are no such guarantees.</p>
<p>So what do we do?</p>
<p>We continue using gradient descent with some precautions. <strong>We initialise the weights to be some random numbers close to 0</strong>.</p>
<h3 id="back-propagation">Back Propagation</h3>
<p>While in linear models gradients are easy to compute, what about neural networks?</p>
<p>This can be done using the chain rule.</p>
<p>In “<strong>Deep Learning</strong>”, we call it backpropagation.</p>
<p><strong>Circuit Intuition for Back Propagation</strong></p>
<p><img src="https://i.imgur.com/MjZjeSP.png" alt="alt text" class="align-center" /></p>
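<p>The circuit intuition can be made concrete with a tiny example. For the illustrative circuit <script type="math/tex">f(x, y, z) = (x + y)\,z</script>, the chain rule carries the gradient from the output back to every input:</p>

```python
# Forward pass through the circuit f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0
q = x + y   # intermediate node, q = 3
f = q * z   # output, f = -12

# Backward pass: apply the chain rule node by node.
df_dq = z            # d(q*z)/dq = z
df_dz = q            # d(q*z)/dz = q
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = df/dq * 1
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = df/dq * 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```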
<h3 id="optimization">Optimization</h3>
<p>We will now discuss 3 optimizers.</p>
<ul>
<li><strong>Batch Gradient Descent</strong></li>
<li><strong>Stochastic Gradient Descent</strong></li>
<li><strong>Mini Batch Gradient Descent</strong></li>
</ul>
<p><strong>Batch Gradient Descent Optimization</strong> involves computing the loss over the entire dataset, then backpropagating and performing a single update.</p>
<p>In <strong>Stochastic Gradient Descent Optimization</strong>, we compute the loss for a single training example at a time, backpropagating and updating after each one.</p>
<p>In <strong>Mini Batch Gradient Descent</strong>, the entire training data is divided into batches; we compute the loss for one batch at a time and backpropagate.</p>
<p>Mini Batch Gradient Descent is the technique most commonly used in practice: it smooths out the noisy updates of pure SGD while remaining far cheaper per update than full-batch descent, and it vectorises well on modern hardware.</p>
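<p>A minimal sketch of mini-batch gradient descent on a toy one-parameter linear model (the dataset and values here are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y ≈ 2x. We fit a single weight w with mini-batch updates.
x = rng.normal(size=100)
y = 2 * x + 0.01 * rng.normal(size=100)

w, alpha, batch_size = 0.0, 0.1, 10

for epoch in range(20):
    idx = rng.permutation(len(x))           # reshuffle every epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        err = w * x[batch] - y[batch]       # residuals on this mini-batch
        grad = 2 * np.mean(err * x[batch])  # gradient of the batch MSE
        w -= alpha * grad                   # one update per mini-batch

print(w)  # ≈ 2.0
```

<p>Setting <code>batch_size</code> to the full dataset size recovers batch gradient descent, and setting it to 1 recovers stochastic gradient descent.</p>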
<p>There are various complex optimization algorithms such as momentum, RMSprop, Adam etc. Of course, we won’t go into details on them.</p>
<h3 id="hyperparameters">Hyperparameters</h3>
<p>Hyperparameters are predefined constants used to train a network. For example, the learning rate, the number of epochs to train for, the batch size, and the number of layers in the network are all hyperparameters.</p>
<p>There is no simple way to decide these. In practice, we try many settings (for example via a grid or random search) and keep the one that performs best on validation data.</p>
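<p>For instance, a minimal grid search over a single hyperparameter, the learning rate, for gradient descent on the toy loss <script type="math/tex">J(w) = (w-3)^2</script>:</p>

```python
def final_loss(alpha, steps=20):
    """Loss left after running gradient descent with learning rate alpha."""
    w = 0.0
    for _ in range(steps):
        w -= alpha * 2 * (w - 3)  # dJ/dw = 2(w - 3)
    return (w - 3) ** 2

# Try every candidate and keep the one with the lowest final loss.
candidates = [1.5, 0.5, 0.1, 0.01, 0.001]
best = min(candidates, key=final_loss)
print(best)  # 0.5 — large enough to converge fast, small enough not to diverge
```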
<h2 id="a-fast-pythonic-implementation-of-a-feed-forward-neural-network-vectorised">A Fast Pythonic Implementation of a Feed Forward Neural Network (Vectorised)</h2>
<script src="https://gist.github.com/sadidaa/3be436ab1ec6966271ac71f31cb12fce.js"></script>
<p><img src="/assets/images/posts/Summer_School/DL2/im1.png" alt="alt" class="align-center" /></p>
<p>The sigmoid function “squashes” inputs to lie between 0 and 1. Unfortunately, this means that for inputs with sigmoid output close to 0 or 1, the gradient with respect to those inputs is close to zero. This leads to the phenomenon of vanishing gradients, where gradients drop close to zero and the net does not learn well.</p>
<p>On the other hand, the ReLU function (max(0, x)) does not saturate for positive inputs. Plot these functions to gain intuition.</p>
<script src="https://gist.github.com/varun19299/d2047e37d6e45d7b21b3091f23acb9f1.js"></script>
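<p>A quick numerical check of this saturation effect:</p>

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # derivative of the sigmoid

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# The sigmoid's gradient peaks at 0.25 and collapses for large inputs,
# while the ReLU gradient stays at 1 for any positive input.
print(sigmoid_grad(0))   # 0.25
print(sigmoid_grad(10))  # ~4.5e-05 — effectively zero
print(relu_grad(10))     # 1.0
```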
<h2 id="overfitting-and-underfitting">Overfitting and Underfitting</h2>
<p><img src="https://cdn-images-1.medium.com/max/1125/1*_7OPgojau8hkiPUiHoGK_w.png" alt="overfitting" class="align-center" /></p>
<p><img src="https://www.apixio.com/wp-content/uploads/2017/10/classification-with-overfitting-2.png" alt="overfittign" class="align-center" /></p>
<p><img src="https://raw.githubusercontent.com/alexeygrigorev/wiki-figures/master/ufrt/kddm/overfitting-logreg-ex.png" alt="over" class="align-center" /></p>
<p><img src="http://bioinfo.iric.ca/wpbioinfo/wp-content/uploads/2017/10/error_curves.png" alt="fitting_curves" class="align-center" /></p>
<p><img src="http://srdas.github.io/DLBook/DL_images/UnderfittingOverfitting.png" alt="overfitting" class="align-center" /></p>
<p><strong>Overfitting</strong></p>
<p>Overfitting refers to a model that models the training data too well.</p>
<ul>
<li>Overfitting happens when a model learns the <strong>detail and noise</strong> in the training data to the extent that it <strong>negatively</strong> impacts the performance of the model on new data.</li>
<li>
<p><strong>Noise or random fluctuations</strong> in the training data is picked up and <strong>learned</strong> as concepts by the model</p>
</li>
<li>Occurs when the <strong>Representation Power</strong> of the model far exceeds the <strong>actual complexity</strong> needed to solve the problem.</li>
</ul>
<p><strong>Underfitting</strong></p>
<p>Underfitting refers to a model that can neither model the training data nor generalize to new data.</p>
<ul>
<li>
<p>An underfit machine learning model is not a suitable model, and this will be obvious from its poor performance on the training data.</p>
</li>
<li>
<p>Underfitting can easily be detected, as the training performance will be low given a proper metric; such a model is obviously not suitable for deployment.</p>
</li>
<li>
<p>To fix it, increase the model’s representation power, for instance by increasing the number of parameters to optimize in the case of parametric models.</p>
<ul>
<li>In neural nets, increase the number of hidden layers and the number of neurons per hidden layer. This increases the model’s representation capability.</li>
</ul>
</li>
</ul>
<h3 id="how-to-avoid-overfitting">How to avoid overfitting?</h3>
<blockquote>
<p>Regularization</p>
</blockquote>
<p><strong>Parameter Penalties</strong></p>
<ul>
<li>Adding Parameter norm penalty <script type="math/tex">\Omega(\theta)</script> to the loss function</li>
<li><script type="math/tex">\Omega(\theta)</script> can be any function of <script type="math/tex">\theta</script>; we will look at common choices in detail below.</li>
</ul>
<p>\begin{equation}
\vec J(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta), \qquad
\alpha \in [0, \infty]
\end{equation}</p>
<p>The term <script type="math/tex">\alpha</script> decides the amount of regularization term to add.</p>
<table>
<thead>
<tr>
<th><script type="math/tex">\alpha</script></th>
<th>Regularization</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>No Regularization whatsoever</td>
</tr>
<tr>
<td><script type="math/tex">\downarrow</script></td>
<td><script type="math/tex">\downarrow</script></td>
</tr>
<tr>
<td><script type="math/tex">\uparrow</script></td>
<td><script type="math/tex">\uparrow</script></td>
</tr>
<tr>
<td><script type="math/tex">\infty</script></td>
<td>Infinite Penalty, <script type="math/tex">\theta</script> collapses to 0</td>
</tr>
</tbody>
</table>
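<p>As a small illustrative sketch of how <script type="math/tex">\alpha</script> scales the penalty term:</p>

```python
import numpy as np

def regularized_loss(data_loss, w, alpha):
    """Total objective: data loss plus alpha times an L2 norm penalty."""
    penalty = 0.5 * np.sum(w ** 2)
    return data_loss + alpha * penalty

w = np.array([3.0, -4.0])  # example weights: 0.5 * ||w||^2 = 12.5
J = 1.0                    # example unregularized data loss

print(regularized_loss(J, w, 0.0))  # 1.0  — alpha = 0: no regularization
print(regularized_loss(J, w, 0.1))  # 2.25 — the penalty contributes 0.1 * 12.5
```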
<ul>
<li>In Neural nets, only <script type="math/tex">W</script> parameters are subject to regularization, bias vectors (<script type="math/tex">b</script>) are not. This is because,
<ul>
<li>Each weight <script type="math/tex">W_{ij}</script> specifies how two variables interact. Fitting the correct weight value requires observing both variables under a variety of conditions.</li>
<li>Bias vectors control only a single variable each.</li>
<li>We might induce underfitting by including regularization on the bias values.</li>
</ul>
</li>
</ul>
<p><strong><script type="math/tex">L^2</script> Parameter Regularization</strong></p>
<blockquote>
<p>Commonly known as <strong>weight decay</strong>.</p>
</blockquote>
<ul>
<li>
<p>This regularization strategy drives weights close to <strong>origin</strong>.</p>
</li>
<li>
<script type="math/tex; mode=display">\begin{equation} \Omega(\theta) = \frac{1}{2} ||w|| _2^2 \end{equation}</script>
</li>
<li>Also known as <strong>ridge regression</strong> or <strong>Tikhonov regularization</strong></li>
</ul>
<p>Let’s assume A and B are highly correlated features.</p>
<blockquote>
<p>Correlation means A and B are coupled in a sense; the coupling can be positive or negative.</p>
</blockquote>
<p>They are so correlated that we can assume A <script type="math/tex">\approx</script> B.</p>
<p>With these two as the features of the model, the weights multiply them and we get</p>
<script type="math/tex; mode=display">Y = W_aA + W_bB</script>
<p>Let’s assume <script type="math/tex">W_a = 4, W_b = -2</script>. Since A and B are almost equal,</p>
<p>\begin{equation}
Y = 4A - 2B \approx 2A
\end{equation}</p>
<p>But,</p>
<p>\begin{equation}
Y = 10A - 8B \approx 2A
\end{equation}</p>
<p>and again,</p>
<p>\begin{equation}
Y = 1000002A - 1000000B \approx 2A
\end{equation}</p>
<p>So you can see the difficulty in optimization: infinitely many weight pairs fit the data equally well. This regularization basically says that if such a condition arises, choose the smallest <script type="math/tex">W_a, W_b</script> (closest to the origin) that satisfy the condition.</p>
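<p>We can see this effect numerically. The sketch below builds two nearly identical features and solves the closed-form ridge problem <script type="math/tex">w = (X^TX + \alpha I)^{-1}X^Ty</script>; the data and <script type="math/tex">\alpha</script> value are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two almost identical features, A ≈ B, with target Y ≈ 2A.
A = rng.normal(size=200)
B = A + 1e-3 * rng.normal(size=200)
X = np.column_stack([A, B])
y = 2 * A

# Closed-form ridge (L2-regularized) solution.
alpha = 1.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)
print(w)  # ≈ [1, 1]: the smallest weights satisfying W_a + W_b ≈ 2
```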
<p><strong><script type="math/tex">L^1</script> Regularization</strong></p>
<p>This is similar to <script type="math/tex">L^2</script> regularization, but the regularization function is different.</p>
<p>\begin{equation}
\Omega(\theta) = ||w||_1 = \sum_i|w_i|
\end{equation}</p>
<p>Here the optimal solution for some parameters will be exactly 0. This means <script type="math/tex">L^1</script> regularization favours <strong>sparse</strong> solutions.</p>
<p>It can be used as a <strong>feature selection</strong> mechanism: if the weight of some feature shrinks to 0, we can safely disregard that feature in our model.</p>
<ul>
<li>Remember, the <script type="math/tex">W</script> values indicate the importance of each feature to the prediction output. If the weight for a particular feature is 0, that feature is not important.</li>
</ul>
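<p>One way to see the sparsity concretely: the soft-thresholding operator associated with the <script type="math/tex">L^1</script> penalty (its proximal step) shrinks every weight and sets small ones exactly to zero. A minimal sketch:</p>

```python
import numpy as np

def soft_threshold(w, lam):
    """Shrink each weight toward 0 by lam; weights smaller than lam in
    magnitude become exactly 0 — the source of L1 sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.05, -0.3, 1.2, -0.02])
print(soft_threshold(w, 0.1))  # small entries are zeroed, large ones shrink
```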
<p><strong>Norm Regularizations as Constraint Optimizations</strong></p>
<p>Recall,</p>
<p>\begin{equation}
\vec J(\theta; X, y) = J(\theta; X, y) + \alpha\ \Omega(\theta)
\end{equation}</p>
<p>Also recalling Lagrange Multipliers,</p>
<p><img src="https://i.stack.imgur.com/9NIoJ.png" alt="lagrange_multipliers" class="align-center" /></p>
<p><img src="http://math.etsu.edu/multicalc/prealpha/chap2/chap2-9/10-8-20.gif" alt="lagrange_multipliers" class="align-center" /></p>
<h3 id="an-example-of-overfitting">An example of overfitting</h3>
<p>We’ll now see an example of overfitting, and then another where we try to combat it using regularization.</p>
<script src="https://gist.github.com/sadidaa/6ba23c86d55d8c0e3e887425055de26c.js"></script>
<p>Evaluation result on Test Data : Loss = 0.07914772396665067, accuracy = 0.9747</p>
<p><img src="/assets/images/posts/Summer_School/DL2/im2.png" alt="alt" class="align-center" /></p>
<p><img src="/assets/images/posts/Summer_School/DL2/im3.png" alt="alt" class="align-center" /></p>
<p><strong>There is a clear sign of OverFitting. Why do you think so?</strong></p>
<p>Look carefully at the validation and training loss curves: the validation loss decreases and then gradually increases. This means the model is memorising the training set, even though its accuracy is high.</p>
<p>How do we combat that?</p>
<p><strong>Use Regularization !</strong></p>
<script src="https://gist.github.com/sadidaa/b8eed79afaa3c579c633fa3885b659a4.js"></script>
<p>loss: 0.0722 - acc: 0.9796</p>
<p><img src="/assets/images/posts/Summer_School/DL2/im4.png" alt="alt" class="align-center" /></p>
<p><img src="/assets/images/posts/Summer_School/DL2/im5.png" alt="alt" class="align-center" /></p>
<p><strong>What do we note?</strong></p>
<ul>
<li>Validation loss is not increasing as it did before.</li>
<li>Difference between the validation and training accuracy is not that much</li>
</ul>
<p>This implies better generalisation, so the model can work well on unseen data samples.</p>
<h2 id="comparision-of-various-optimizers-stochastic-gradient-descent-rmsprop-adam-adagrad">Comparison of Various Optimizers: Stochastic Gradient Descent, RMSprop, Adam, Adagrad</h2>
<script src="https://gist.github.com/sadidaa/a278e7301d16ea311445f17e3103a0a4.js"></script>
<p><img src="/assets/images/posts/Summer_School/DL2/im6.png" alt="alt" class="align-center" /></p>Computer Vision and Intelligencecvigroup.cfi@gmail.comDeep Feedforward Networks - a general description