
Commit 0595830

Merge pull request animator#711 from iABn0rma1/main
[animator#526] CNN From Scratch
2 parents 7be03ee + 9e56fb1 commit 0595830

10 files changed: +226 -0 lines changed

contrib/machine-learning/index.md

@@ -6,6 +6,7 @@
 - [Decision Tree Learning](Decision-Tree.md)
 - [Support Vector Machine Algorithm](support-vector-machine.md)
 - [Artificial Neural Network from the Ground Up](ArtificialNeuralNetwork.md)
+- [Introduction To Convolutional Neural Networks (CNNs)](intro-to-cnn.md)
 - [TensorFlow.md](tensorFlow.md)
 - [PyTorch.md](pytorch.md)
 - [Types of optimizers](Types_of_optimizers.md)
@@ -0,0 +1,225 @@
# Understanding Convolutional Neural Networks (CNN)

## Introduction
Convolutional Neural Networks (CNNs) are a specialized type of artificial neural network designed primarily for processing structured grid data like images. CNNs are particularly powerful for tasks involving image recognition, classification, and computer vision. They have revolutionized these fields, outperforming traditional neural networks by leveraging their unique architecture to capture spatial hierarchies in images.

### Why CNNs are Superior to Traditional Neural Networks
1. **Localized Receptive Fields**: CNNs use convolutional layers that apply filters to local regions of the input image. This local connectivity lets the network learn spatial patterns, such as edges and textures, which are essential for image recognition tasks.
2. **Parameter Sharing**: In CNNs, the same filter (set of weights) is used across different parts of the input, significantly reducing the number of parameters compared to fully connected layers in traditional neural networks. This not only lowers the computational cost but also mitigates the risk of overfitting.
3. **Translation Invariance**: Because the same filter weights are applied at every position, a convolutional layer responds to a feature wherever it appears (translation equivariance), and pooling adds a degree of invariance to small shifts. As a result, CNNs can recognize objects even when they appear in different locations within the image.
4. **Hierarchical Feature Learning**: CNNs automatically learn a hierarchy of features, from low-level features like edges to high-level features like shapes and objects. Traditional neural networks, on the other hand, often rely on manual feature extraction, which is less effective and more time-consuming.

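To make the parameter-sharing point concrete, here is a rough back-of-the-envelope comparison. The 28x28 grayscale input, the 32 hidden units, and the 32 filters of size 3x3 are assumed purely for illustration and do not come from the guide.

```python
# Illustrative sizes (assumptions): a 28x28 grayscale image
height, width, channels = 28, 28, 1

# Fully connected layer: every pixel is connected to each of 32 hidden units
dense_units = 32
dense_params = (height * width * channels) * dense_units + dense_units

# Convolutional layer: 32 filters of size 3x3, each reused at every position
num_filters, k = 32, 3
conv_params = (channels * k * k * num_filters) + num_filters

print("Dense parameters:", dense_params)  # 25120
print("Conv parameters:", conv_params)    # 320
```

The convolutional layer's parameter count does not depend on the image size at all, which is where most of the saving comes from.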
### Use Cases of CNNs
- **Image Classification**: Identifying objects within an image (e.g., classifying a picture as containing a cat or a dog).
- **Object Detection**: Detecting and locating objects within an image (e.g., finding faces in a photo).
- **Image Segmentation**: Partitioning an image into segments or regions (e.g., dividing an image into different objects and background).
- **Medical Imaging**: Analyzing medical scans like MRI, CT, and X-rays for diagnosis.

> This guide will walk you through the fundamentals of CNNs and their implementation in Python. We'll build a simple CNN from scratch, explaining each component to help you understand how CNNs process images and extract features.

### Let's start by understanding the basic architecture of CNNs.

## CNN Architecture
CNNs are assembled from a small set of building blocks, chiefly convolutional layers, pooling layers, and fully connected layers, and they learn spatial hierarchies of features automatically and adaptively through backpropagation.

### Convolutional Layer
The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume.

#### Input Shape
The dimensions of the input image, including the number of channels (e.g., 3 for RGB images and 1 for grayscale images).
![image](assets/cnn-input_shape.png)

- The input matrix is a binary image of handwritten digits, where '1' marks the pixels containing the digit (ink/grayscale area) and '0' marks the background pixels (empty space).
- The first matrix shows the representation of the digits 1 and 0, which can be depicted as a vertical line and a closed loop.
- The second matrix represents 9, combining the loop and line.

#### Strides
The step size with which the filter moves across the input image.
![image](assets/cnn-strides.png)

- This visualization shows how the filter (kernel) moves across the input matrix with stride values of (3,3) and (2,2).
- A stride of 1 means the filter moves one step at a time, so it visits every possible position in the input matrix.
- With larger strides (like 3 or 2 in this example), the filter may not cover all elements, potentially missing some information.
- While this might seem like a drawback, higher strides are often used to reduce computational cost and decrease the output size, which can speed up training and may help limit overfitting.

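As a small sketch of the stride behaviour described above, the helper below lists the start indices a filter visits along one dimension when no padding is used. The function name `filter_positions` and the 7-element, 3-wide example are assumptions made for illustration.

```python
def filter_positions(input_size, kernel_size, stride):
    """Start indices the filter visits along one dimension (no padding)."""
    return list(range(0, input_size - kernel_size + 1, stride))

# Assumed example: 7 elements along the dimension and a filter of width 3
print(filter_positions(7, 3, 1))  # [0, 1, 2, 3, 4]
print(filter_positions(7, 3, 2))  # [0, 2, 4]
print(filter_positions(7, 3, 3))  # [0, 3]
```

With a stride of 3 the last window starts at index 3 and ends at index 5, so the element at index 6 is never seen by the filter.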
#### Padding
Determines whether the output size is the same as the input size ('same') or reduced ('valid').
![image](assets/cnn-padding.png)

- `Same` padding is preferred in earlier layers to preserve spatial and edge information, as it can help the network learn more detailed features.
- Choose `valid` padding when focusing on the central input region or requiring specific output dimensions.
- For `same` padding with a stride of 1 and an odd filter size, the padding on each side is $ p = { f - 1 \over 2 } $, where f is the filter size.

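The padding amount can be checked with `np.pad`. This is a minimal sketch assuming an odd filter size, a stride of 1, and a toy 7x5 input filled with ones.

```python
import numpy as np

f = 3              # filter size (assumed odd)
p = (f - 1) // 2   # zero padding on each side for 'same' output at stride 1

image = np.ones((7, 5), dtype=int)          # toy 7x5 input
padded = np.pad(image, p, mode="constant")  # pads every side with zeros

print(padded.shape)  # (9, 7)
```

Sliding a 3x3 filter with stride 1 over the 9x7 padded array gives back a 7x5 output, the same size as the unpadded input.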
#### Filters
Small matrices that slide over the input data to extract features.
![image](assets/cnn-filters.png)

- The first filter aims to detect closed loops within the input image, being highly relevant for recognizing digits with circular or oval shapes, such as '0', '6', '8', or '9'.
- The next filter helps in detecting vertical lines, crucial for identifying digits like '1', '4', '7', and parts of other digits that contain vertical strokes.
- The last filter shows how to detect diagonal lines in the input image, useful for identifying the slashes present in digits like '1', '7', or parts of '4' and '9'.

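To see how a filter responds to the feature it is built for, the sketch below slides a vertical-line filter over a toy binary image. Both the 5x5 image and the filter values are hypothetical and are not taken from the figure.

```python
import numpy as np

# Toy 5x5 binary image with a vertical stroke in the middle column
image = np.zeros((5, 5), dtype=int)
image[:, 2] = 1

# Hypothetical 3x3 vertical-line filter: rewards ink in its centre column
vertical_filter = np.array([[0, 1, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

# Slide the filter over the image (stride 1, no padding)
out = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * vertical_filter)

print(out)
# [[0 3 0]
#  [0 3 0]
#  [0 3 0]]
```

The response peaks (3) only where the filter's centre column lines up with the stroke, and is 0 elsewhere.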
#### Output
A set of feature maps that represent the presence of different features in the input.
![image](assets/cnn-ouputs.png)

- With no padding and a stride of 1, the 3x3 filter moves one step at a time across the 7x5 input matrix. The filter can only move within the original boundaries of the input, resulting in a smaller 5x3 output matrix. This configuration is useful when you want to reduce the spatial dimensions of the feature map while preserving the exact spatial relationships between features.
- By adding zero padding to the input matrix, it is expanded to 9x7, allowing the 3x3 filter to "fit" fully on the edges and corners. With a stride of 1, the filter still moves one step at a time, but now the output matrix is the same size (7x5) as the original input. Same padding is often preferred in early layers of a CNN to preserve spatial information and avoid rapid feature map shrinkage.
- Without padding, the 3x3 filter operates within the original input matrix boundaries, but now it moves two steps at a time (stride 2). This significantly reduces the output matrix size to 3x2. Larger strides are employed to decrease computational cost and the output size, which can speed up training and may help limit overfitting. However, they might miss some finer details due to the larger jumps.
- The output size of a convolutional layer along each spatial dimension is given by, $$ n_{out} = \left\lfloor { n_{in} + (2 \cdot p) - k \over s } \right\rfloor + 1 $$
  where,
  n<sub>in</sub> = input size along that dimension (height or width)
  p = padding
  k = kernel size
  s = stride

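The three configurations described above can be checked with the standard relation floor((n_in + 2p - k) / s) + 1, applied once per spatial dimension. The helper name `conv_output_size` is an assumption for this sketch.

```python
def conv_output_size(n_in, k, s=1, p=0):
    """Output size along one dimension: floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1

# 7x5 input with a 3x3 filter
print(conv_output_size(7, 3), conv_output_size(5, 3))            # 5 3  (no padding, stride 1)
print(conv_output_size(7, 3, p=1), conv_output_size(5, 3, p=1))  # 7 5  (padding 1, stride 1)
print(conv_output_size(7, 3, s=2), conv_output_size(5, 3, s=2))  # 3 2  (no padding, stride 2)
```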
- Also, the number of trainable parameters for each layer is given by, $ (n_c \cdot [k \cdot k] \cdot f) + f $
  where,
  n<sub>c</sub> = number of input channels
  k x k = kernel size
  f = number of filters
  an additional f is added for bias

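As a quick worked example of the parameter formula, the sketch below evaluates it for two layer configurations. The helper name `conv_params` and the layer sizes are assumed for illustration.

```python
def conv_params(n_c, k, f):
    """Trainable parameters: n_c * k * k weights per filter, plus one bias per filter."""
    return n_c * k * k * f + f

print(conv_params(n_c=3, k=3, f=16))  # 448 -> RGB input, sixteen 3x3 filters
print(conv_params(n_c=1, k=5, f=8))   # 208 -> grayscale input, eight 5x5 filters
```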
### Pooling Layer
Pooling layers reduce the dimensionality of each feature map while retaining the most critical information. The most common form of pooling is max pooling.
- **Input Shape:** The dimensions of the feature map from the convolutional layer.
- **Pooling Size:** The size of the pooling window (e.g., 2x2).
- **Strides:** The step size for the pooling operation.
- **Output:** A reduced feature map highlighting the most important features.
<div align='center'>
<img src='assets/cnn-pooling.png' width='800'></img>
</div>

- The high values (8) indicate that the "closed loop" filter found a strong match in those regions.
- The first matrix, of size 6x4, is a downsampled version of the input.
- The second matrix, of size 3x2, results from more aggressive downsampling.

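For the common case of non-overlapping 2x2 windows (stride 2), max pooling can also be written without explicit loops. This is a minimal sketch on a made-up 4x4 feature map, and it assumes the height and width are even.

```python
import numpy as np

# Made-up 4x4 feature map
feature_map = np.array([[1, 3, 2, 0],
                        [4, 8, 1, 1],
                        [0, 2, 5, 6],
                        [1, 1, 7, 8]])

# Split each axis into (blocks, 2) and take the maximum inside every 2x2 block
h, w = feature_map.shape
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(pooled)
# [[8 2]
#  [2 8]]
```

Each output value is the largest entry of one 2x2 block, so strong filter responses survive while the map shrinks to a quarter of its size.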
### Flatten Layer
The flatten layer converts the 2D matrix data to a 1D vector, which can be fed into a fully connected (dense) layer.
- **Input Shape:** The 2D feature maps from the previous layer.
- **Output:** A 1D vector that represents the same data in a flattened format.
![image](assets/cnn-flattened.png)

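A tiny sketch of flattening with NumPy, using a toy array of shape (height, width, channels) chosen for illustration:

```python
import numpy as np

# Toy feature maps with shape (height, width, channels) = (2, 3, 2)
feature_maps = np.arange(12).reshape(2, 3, 2)
flat = feature_maps.flatten()

print(flat.shape)  # (12,)
print(flat[:4])    # [0 1 2 3]
```

`flatten()` reads the array in row-major order by default, so the last axis (channels here) varies fastest in the resulting vector.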
### Dropout Layer
Dropout is a regularization technique to prevent overfitting in neural networks by randomly setting a fraction of input units to zero at each update during training time.
- **Input Shape:** The data from the previous layer.
- **Dropout Rate:** The fraction of units to drop (e.g., 0.5 for 50% dropout).
- **Output:** The same shape as the input, with some units set to zero.
![image](assets/cnn-dropout.png)

- The updated 0 values represent the dropped units.

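Many frameworks use a variant known as inverted dropout, which rescales the surviving units by 1 / (1 - rate) so the expected activation stays the same between training and inference. The sketch below shows that variant, not the plain masking described above. The function name `inverted_dropout` and the seeded generator are assumptions for illustration.

```python
import numpy as np

def inverted_dropout(x, rate=0.5, rng=None):
    """Zero out roughly `rate` of the units and rescale the rest by 1 / (1 - rate)."""
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True for units that are kept
    return x * mask / keep_prob

x = np.ones(10)
out = inverted_dropout(x, rate=0.5, rng=np.random.default_rng(0))
print(out)  # every entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```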
## Implementation

Below is a simple forward-pass implementation of a CNN in Python. Each method of the `CNN` class corresponds to a layer in the network.

```python
import numpy as np

class CNN:
    def __init__(self):
        pass

    def convLayer(self, input_shape, channels, strides, padding, filter_size):
        height, width = input_shape
        input_shape_with_channels = (height, width, channels)
        print("Input Shape (with channels):", input_shape_with_channels)

        # Generate random input and filter matrices
        input_matrix = np.random.randint(0, 10, size=input_shape_with_channels)
        filter_matrix = np.random.randint(0, 5, size=(filter_size[0], filter_size[1], channels))

        print("\nInput Matrix:\n", input_matrix[:, :, 0])
        print("\nFilter Matrix:\n", filter_matrix[:, :, 0])

        padding = padding.lower()

        if padding == 'same':
            # Calculate padding needed for each dimension
            # (preserves the input size for odd filter sizes and a stride of 1)
            pad_height = filter_size[0] // 2
            pad_width = filter_size[1] // 2

            # Apply zero padding to the height and width axes only
            input_matrix = np.pad(input_matrix, ((pad_height, pad_height), (pad_width, pad_width), (0, 0)), mode='constant')

            # Adjust height and width to consider the padding
            height += 2 * pad_height
            width += 2 * pad_width

        elif padding == 'valid':
            pass

        else:
            # Fail loudly instead of returning a string the caller might index into
            raise ValueError("Invalid padding: expected 'same' or 'valid'.")

        # Output dimensions
        conv_height = (height - filter_size[0]) // strides[0] + 1
        conv_width = (width - filter_size[1]) // strides[1] + 1
        # A single filter spanning all input channels produces one feature map
        output_matrix = np.zeros((conv_height, conv_width, 1))

        # Convolution Operation
        for i in range(0, height - filter_size[0] + 1, strides[0]):
            for j in range(0, width - filter_size[1] + 1, strides[1]):
                receptive_field = input_matrix[i:i + filter_size[0], j:j + filter_size[1], :]
                # Element-wise product summed over height, width, and channels
                output_matrix[i // strides[0], j // strides[1], 0] = np.sum(receptive_field * filter_matrix)

        return output_matrix

    def maxPooling(self, input_matrix, pool_size=(2, 2), strides_pooling=(2, 2)):
        input_height, input_width, input_channels = input_matrix.shape
        pool_height, pool_width = pool_size
        stride_height, stride_width = strides_pooling

        # Calculate output dimensions
        pooled_height = (input_height - pool_height) // stride_height + 1
        pooled_width = (input_width - pool_width) // stride_width + 1

        # Initialize output
        pooled_matrix = np.zeros((pooled_height, pooled_width, input_channels))

        # Perform max pooling independently on each channel
        for c in range(input_channels):
            for i in range(0, input_height - pool_height + 1, stride_height):
                for j in range(0, input_width - pool_width + 1, stride_width):
                    patch = input_matrix[i:i + pool_height, j:j + pool_width, c]
                    pooled_matrix[i // stride_height, j // stride_width, c] = np.max(patch)

        return pooled_matrix

    def flatten(self, input_matrix):
        return input_matrix.flatten()

    def dropout(self, input_matrix, dropout_rate=0.5):
        assert 0 <= dropout_rate < 1, "Dropout rate must be in [0, 1)."
        # Each unit is kept with probability (1 - dropout_rate); kept units are not rescaled
        dropout_mask = np.random.binomial(1, 1 - dropout_rate, size=input_matrix.shape)
        return input_matrix * dropout_mask
```

Run the code below to generate output from random input and filter matrices of the given sizes.

```python
input_shape = (5, 5)
channels = 1
strides = (1, 1)
padding = 'valid'
filter_size = (3, 3)

cnn_model = CNN()

conv_output = cnn_model.convLayer(input_shape, channels, strides, padding, filter_size)
print("\nConvolution Output:\n", conv_output[:, :, 0])

pool_size = (2, 2)
strides_pooling = (1, 1)

maxPool_output = cnn_model.maxPooling(conv_output, pool_size, strides_pooling)
print("\nMax Pooling Output:\n", maxPool_output[:, :, 0])

flattened_output = cnn_model.flatten(maxPool_output)
print("\nFlattened Output:\n", flattened_output)

dropout_output = cnn_model.dropout(flattened_output, dropout_rate=0.3)
print("\nDropout Output:\n", dropout_output)
```

Feel free to play around with the parameters!
