### [김건우][AI Basics] 12. Stochastic Gradient Descent (SGD) for Deep Learning Made Easy - Loss Functions, Optimization, and Coding

**Author**: 김건우

(2024-06-02)


[YouTube] [김건우][AI Basics] Stochastic Gradient Descent (SGD) for Deep Learning Made Easy - Loss Functions, Optimization, and Coding

[YouTube] Optimal Design: Introduction, Smart Design Lab @ Prof. 강남우

Question 1) What is gradient descent?

Question 2) What is an epoch?

Question 3) What is a loss function (also called a cost function or error function)?

▪ What is gradient descent?

[YouTube] Gradient Descent in 3 minutes, Visually Explained, 2022

To answer that question let's say:

For example, you can work with a neural network. Then, for each example in the data set, you compare the output of the neural network to the ideal output. Take the difference, square it, and sum over the outputs, and you get the cost of a single training example. By taking the sum over all training examples, you get the overall cost function. And the problem now becomes: how do we find the right value of theta (θ) that makes this cost function as small as possible?
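The cost function described above can be written out as a formula. This is only a sketch under assumed notation that does not appear in the source: $f_\theta$ is the network with parameters $\theta$, $(x_i, y_i)$ is the $i$-th of $N$ training examples, and $k$ indexes the $K$ output components.

$$
J(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \bigl( f_\theta(x_i)_k - y_{i,k} \bigr)^2
$$

The inner sum is the cost of a single training example, and the outer sum adds those costs over the whole data set.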

And this is where the second step, or gradient descent, comes in. Well, data usually has many more entries than just two.

So what can we do? By taking a small step in the direction opposite to the gradient, you can hope to make the function decrease at each iteration. Gradient descent as I just presented it is known as vanilla gradient descent.
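A minimal sketch of vanilla gradient descent may help make the "small step in the opposite direction" concrete. It assumes a linear model instead of a neural network so that the gradient can be written by hand. The names `cost`, `gradient`, `gradient_descent`, the synthetic data, and the step size are all illustrative choices, not taken from the video.

```python
import numpy as np

def cost(theta, X, y):
    """Sum of squared differences between predictions X @ theta and targets y."""
    residual = X @ theta - y
    return float(residual @ residual)

def gradient(theta, X, y):
    """Gradient of the cost above with respect to theta."""
    return 2.0 * X.T @ (X @ theta - y)

def gradient_descent(X, y, step_size=0.001, num_steps=1000):
    """Repeatedly step against the full gradient, starting from zeros."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        theta = theta - step_size * gradient(theta, X, y)
    return theta

# Synthetic, noise-free data (an assumption made for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

theta_hat = gradient_descent(X, y)
```

The step size has to be small enough for this particular cost. With a step that is too large, the iterates overshoot and the cost can grow instead of shrinking.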

In stochastic gradient descent, instead of taking this sum over all training examples when taking the gradient of the cost function, which can be expensive when the number of examples is big, we can take a small random subset at each iteration.
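The random-subset idea can be sketched as follows, again assuming a linear least-squares model so the gradient is easy to write. The function names, the batch size of 10, and the choice to average the gradient over the batch (so the step size does not depend on batch size) are assumptions for illustration only.

```python
import numpy as np

def minibatch_gradient(theta, X_batch, y_batch):
    """Gradient of the mean squared error on one mini-batch."""
    return 2.0 * X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

def sgd(X, y, batch_size=10, step_size=0.05, num_steps=2000, seed=0):
    """Stochastic gradient descent: each step uses a small random subset."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        # Draw a random subset of example indices for this iteration.
        idx = rng.choice(len(y), size=batch_size, replace=False)
        theta = theta - step_size * minibatch_gradient(theta, X[idx], y[idx])
    return theta

# Synthetic, noise-free data (an assumption made for illustration).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

theta_sgd = sgd(X, y)
```

Each step here touches 10 examples instead of all 200, which is the cost saving the passage describes. The price is a noisier search direction.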

Adaptive gradient descent picks a different step size for each component of θ. Momentum gradient descent keeps track of previously computed gradients to build up momentum and accelerate convergence to the minimizer.
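The two variants can be sketched on a toy quadratic. The source names neither a specific algorithm nor any hyperparameters, so this is an assumption-heavy illustration: the momentum version uses the common "heavy ball" form, and the adaptive version uses an AdaGrad-style rule as one example of a per-component step size. All names and constants are hypothetical.

```python
import numpy as np

def grad(theta):
    """Gradient of f(theta) = 0.5 * (theta[0]**2 + 100 * theta[1]**2)."""
    return np.array([1.0, 100.0]) * theta

def momentum_descent(theta0, step_size=0.01, beta=0.9, num_steps=500):
    """Accumulate past gradients in a velocity term and step along it."""
    theta = np.array(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for _ in range(num_steps):
        velocity = beta * velocity + grad(theta)
        theta = theta - step_size * velocity
    return theta

def adagrad_descent(theta0, step_size=0.5, eps=1e-8, num_steps=500):
    """Scale each component's step by its own history of squared gradients."""
    theta = np.array(theta0, dtype=float)
    grad_sq_sum = np.zeros_like(theta)
    for _ in range(num_steps):
        g = grad(theta)
        grad_sq_sum += g * g
        theta = theta - step_size * g / (np.sqrt(grad_sq_sum) + eps)
    return theta

theta_momentum = momentum_descent([1.0, 1.0])
theta_adagrad = adagrad_descent([1.0, 1.0])
```

The quadratic is deliberately much steeper in the second component than in the first. That is the kind of situation where a single shared step size is awkward and these variants are usually motivated.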

▪ What is an epoch?

[AI Wiki] Epochs, Batch Size, & Iterations

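## Epochs, Batch Size, & Iterations

In most cases, it is not possible to feed all the training data into an algorithm in one pass. This is due to the size of the dataset and the memory limitations of the compute instance used for training. Some terminology is required to better understand how data is best broken into smaller pieces. An epoch is one complete pass of the entire dataset through the network, a mini-batch is one of the smaller pieces the dataset is divided into, and an iteration is a single weight update computed on one mini-batch. So if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete one epoch.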
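The arithmetic in the 1,000-image example can be checked with a short sketch. The helper names `iterations_per_epoch` and `iterate_minibatches` are hypothetical, and the loop only counts weight updates instead of training a model.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of mini-batches (weight updates) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

def iterate_minibatches(num_samples, batch_size):
    """Yield index ranges covering the dataset; the last batch may be smaller."""
    for start in range(0, num_samples, batch_size):
        yield range(start, min(start + batch_size, num_samples))

num_epochs = 3
total_updates = 0
for epoch in range(num_epochs):
    for batch_indices in iterate_minibatches(1000, 100):
        total_updates += 1  # one iteration = one weight update on one mini-batch
```

With 1,000 samples and a batch size of 100, each epoch contributes 10 iterations, so three epochs give 30 updates in total.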

## What is the right number of epochs?

During each pass through the network, the weights are updated and the curve goes from underfitting, to optimal, to overfitting. There is no magic rule for choosing the number of epochs; this is a hyperparameter that must be determined before training begins.

## What is the right batch size?

Like the number of epochs, batch size is a hyperparameter with no magic rule of thumb. Choosing a batch size that is too small will introduce a high degree of variance (noisiness) within each batch, as it is unlikely that a small sample is a good representation of the entire dataset. Conversely, if a batch size is too large, it may not fit in the memory of the compute instance used for training, and it will have a tendency to overfit the data. It's important to note that batch size is influenced by other hyperparameters such as learning rate. A common heuristic for batch size is to use the square root of the size of the dataset. However, this is a hotly debated topic.
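As a small illustration of the square-root heuristic mentioned above, the following sketch computes a starting batch size from the dataset size. The function name `sqrt_batch_size` and the rounding to the nearest integer are assumptions. The result is a starting point to tune, not a recommendation.

```python
import math

def sqrt_batch_size(num_samples):
    """Square-root heuristic: batch size near sqrt(dataset size), at least 1."""
    return max(1, round(math.sqrt(num_samples)))

batch_for_1k = sqrt_batch_size(1_000)
batch_for_10k = sqrt_batch_size(10_000)
```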

▪ What is a loss function (also called a cost function or error function)?

[Wikipedia] Loss function

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) [1] is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy.
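The relationship between a loss function and its opposite can be shown with a toy example. The squared-error loss, the candidate values, and the target are all assumptions chosen for illustration. The point is only that minimizing a loss and maximizing its negative select the same answer.

```python
def squared_error(prediction, target):
    """Loss: maps a (prediction, target) pair to a non-negative real 'cost'."""
    return (prediction - target) ** 2

def utility(prediction, target):
    """The opposite of the loss, to be maximized instead of minimized."""
    return -squared_error(prediction, target)

candidates = [0.0, 1.5, 3.0, 4.5]
target = 3.2

best_by_loss = min(candidates, key=lambda p: squared_error(p, target))
best_by_utility = max(candidates, key=lambda p: utility(p, target))
```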