
[김건우][AI Basics] 12. Understanding Stochastic Gradient Descent (SGD) for Deep Learning - Loss Functions, Optimization, and Coding

Author: 김건우

(2024-06-02)

Views: 931

[YouTube] [김건우][AI Basics] Understanding Stochastic Gradient Descent (SGD) for Deep Learning - Loss Functions, Optimization, and Coding

[YouTube] Design Optimization: Introduction, Smart Design Lab @Prof. 강남우

 

Question 1) What is gradient descent?

Question 2) What is an epoch?

Question 3) What is a loss function (also called a cost function or error function)?

 

▪  What is gradient descent?

[YouTube] Gradient Descent in 3 minutes, Visually Explained, 2022

Gradient descent is the process by which machines learn how to generate new faces, play hide and seek, and even beat the best humans at games like Dota.
But what exactly is gradient descent?

To answer that question, let's say:
you're trying to train your computer to listen to an audio file and recognize three spoken commands (Up/Stop/Down) based on a labeled data set.

The first step is to formulate this machine learning task as a mathematical optimization problem.

For example, you can work with a neural network:
whose weights θ are the unknown variables,
whose input is audio data, and
whose output is a vector of size 3,
where each entry represents how much the neural network thinks
the audio corresponds to each command (Up/Stop/Down).

Then, for each example in the data set, you compare the output of the neural network to the ideal output. Take the difference, square it, and then sum, and you get the cost of a single training example. By taking the sum over all training examples you get the overall cost function. And the problem now becomes: how do we find the right value of theta (θ) that makes this cost function as small as possible?
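As a rough sketch of that cost in Python (my illustration; `predict(theta, x)` is a placeholder for whatever model maps audio to a length-3 output):

```python
import numpy as np

def total_cost(theta, X, Y, predict):
    """Sum of squared errors over the whole training set.

    X: inputs (e.g., audio features), Y: ideal length-3 outputs (one-hot),
    predict(theta, x): hypothetical model returning a length-3 output vector.
    """
    cost = 0.0
    for x, y in zip(X, Y):
        diff = predict(theta, x) - np.asarray(y)  # difference to the ideal output
        cost += np.sum(diff ** 2)                 # square, then sum
    return cost
```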

And this is where the second step, gradient descent, comes in.
And you might ask: 
Why do we need to invent an algorithm for this at all?
Can't we just plot the function and point to the minimizer?

Well, θ usually has many more entries than just two.
And in that setting it's hard to see what the function looks like.

So what can we do?
The insight of gradient descent is that
while we cannot get a look at the whole function all at once,
we can easily evaluate the cost function at an arbitrary point.
And with a process called backpropagation
we can also evaluate the negative gradient of this cost function
and take a small step in that direction.
Gradient descent then continues in an iterative fashion:
it computes the negative gradient at the new point,
takes a step in that direction, and so on and so forth.
The intuition behind this is that
the gradient of a function gives you the direction that increases that function the most.

So by taking a small step in the opposite direction, you can hope to make the function decrease at each iteration.
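A minimal sketch of this iterative procedure, vanilla gradient descent, assuming a function `grad(theta)` that returns the gradient of the cost at the current point (the fixed step size and step count are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, theta0, step_size=0.01, n_steps=1000):
    """Repeatedly take a small step in the direction of the negative gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - step_size * grad(theta)  # step opposite the gradient
    return theta

# Usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))  # approaches [0, 0]
```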

Gradient descent as I just presented it is known as vanilla gradient descent,
because several variants have been developed throughout the years to improve upon it.

In stochastic gradient descent, instead of taking the sum over all training examples when computing the gradient of the cost function,
which can be expensive when the number of examples is big,
we take a small random subset at each iteration.
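A sketch of that mini-batch idea under the same assumptions, where the gradient is evaluated on a small random subset of examples at each iteration (`grad_on_batch` is a hypothetical per-batch gradient function; `X` and `Y` are NumPy arrays):

```python
import numpy as np

def minibatch_sgd(grad_on_batch, theta0, X, Y, batch_size=32,
                  step_size=0.01, n_steps=1000, seed=0):
    """grad_on_batch(theta, X_batch, Y_batch): assumed gradient on one mini-batch."""
    theta = np.asarray(theta0, dtype=float)
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # random subset
        theta = theta - step_size * grad_on_batch(theta, X[idx], Y[idx])
    return theta
```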

Adaptive gradient descent picks a different step size for each component of θ.
And this can be extremely useful when the data is sparse, as with text data and image data.
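One common way to realize a per-component step size is an AdaGrad-style update (my illustration, not taken from the video): components that have accumulated large gradients get smaller steps.

```python
import numpy as np

def adagrad_step(theta, g, accum, step_size=0.01, eps=1e-8):
    """Per-component step: directions with large accumulated gradients are damped."""
    accum = accum + g ** 2                                   # running sum of squared gradients
    theta = theta - step_size * g / (np.sqrt(accum) + eps)   # per-component scaling
    return theta, accum
```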

Momentum gradient descent keeps track of previously computed gradients to build up momentum and accelerate convergence to the minimizer.
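And a sketch of the momentum update, where a running combination of past gradients accelerates progress along consistent directions (the coefficient 0.9 is a typical illustrative choice):

```python
def momentum_step(theta, g, velocity, step_size=0.01, beta=0.9):
    """Accumulate past gradients into a velocity term, then step along it."""
    velocity = beta * velocity - step_size * g   # previous gradients build up momentum
    theta = theta + velocity
    return theta, velocity
```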

 

▪  What is an epoch?

[AI Wiki] Epochs, Batch Size, & Iterations

In most cases, it is not possible to feed all the training data into an algorithm in one pass. This is due to the size of the dataset and memory limitations of the compute instance used for training. There is some terminology required to better understand how data is best broken into smaller pieces.

An epoch elapses when an entire dataset is passed forward and backward through the neural network exactly one time. If the entire dataset cannot be passed into the algorithm at once, it must be divided into mini-batches. Batch size is the total number of training samples present in a single mini-batch. An iteration is a single gradient update (update of the model's weights) during training. The number of iterations is equivalent to the number of batches needed to complete one epoch.

So if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete a single epoch.
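Written out as a short calculation (the numbers are those from the example above):

```python
dataset_size = 1000   # total training images (from the example above)
batch_size = 100      # images per mini-batch
iterations_per_epoch = dataset_size // batch_size   # gradient updates in one epoch
print(iterations_per_epoch)   # 10
```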

What is the right number of epochs?

During each pass through the network, the weights are updated and the curve goes from underfitting, to optimal, to overfitting. There is no magic rule for choosing the number of epochs — this is a hyperparameter that must be determined before training begins.

What is the right batch size?

Like the number of epochs, batch size is a hyperparameter with no magic rule of thumb. Choosing a batch size that is too small will introduce a high degree of variance (noisiness) within each batch, as it is unlikely that a small sample is a good representation of the entire dataset. Conversely, if the batch size is too large, it may not fit in the memory of the compute instance used for training, and it will have a tendency to overfit the data. It is important to note that batch size is influenced by other hyperparameters, such as the learning rate, so the combination of these hyperparameters is as important as the batch size itself.

A common heuristic for batch size is to use the square root of the size of the dataset. However, this is a hotly debated topic.
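As a sketch of that heuristic (again, just one debated rule of thumb):

```python
import math

dataset_size = 1000
batch_size = round(math.sqrt(dataset_size))   # ≈ 32 for 1,000 samples
print(batch_size)
```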

 

▪  What is a loss function (also called a cost function or error function)?

[Wikipedia] Loss function

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) [1] is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy.
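For a concrete example, the squared-error cost used earlier in this post is such a function: it maps a prediction and a target onto a single non-negative real number. A minimal sketch:

```python
import numpy as np

def squared_error_loss(y_pred, y_true):
    """Maps a prediction/target pair to a single real-valued 'cost'."""
    return float(np.sum((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

print(squared_error_loss([0.8, 0.1, 0.1], [1.0, 0.0, 0.0]))  # ≈ 0.06
```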