  • CNN Paper Review
    Paper Review 2024. 7. 30. 12:19

     

    Introduction summary:

     

     

    Simple recognition tasks can be performed well even with small labeled datasets,

    but to handle objects found in the real world, much larger training data is needed.

     

     

    Things to study:

    parameter (linear regression)

    neuron

    max-pooling

    what kinds of pooling there are (see the pooling sketch below)

    non-saturating neurons -> study

    dropout -> study

    changing the data, e.g. flipping a cat image upside down (augmentation technique) -> study
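    Since max-pooling and the kinds of pooling are on the list above, here is a minimal NumPy sketch (my own illustration, not code from the paper) comparing 2x2 max-pooling with average pooling on a small feature map:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Minimal 2-D pooling over a single-channel feature map.

    mode="max" keeps the largest value in each window (max-pooling);
    mode="avg" keeps the window mean (average pooling).
    """
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [4, 1, 3, 8]], dtype=float)
print(pool2d(feature_map, mode="max"))  # [[6. 4.] [7. 9.]]
print(pool2d(feature_map, mode="avg"))  # [[3.75 2.25] [3.5 5.]]
```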

     

     

    Source: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

     

    Title: ImageNet Classification with Deep Convolutional Neural Networks

     

    Abstract:

     

    task:

    We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.

    The task is the ImageNet LSVRC-2010 contest,

    so they trained a deep convolutional neural network to classify the 1.2 million high-resolution images.

    The images belong to 1000 different classes (1000 kinds, e.g. for animals, one class per animal name).

     

    On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art.

     

    The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax

     

     


     

     

     

     

    To make training faster, we used non-saturating neurons

    and a very efficient GPU implementation of the convolution operation.

     

     

    To reduce overfitting in the fully-connected layers

    we employed a recently-developed regularization method

    called “dropout” that proved to be very effective
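    Dropout is on my study list; the following is a minimal NumPy sketch of the idea (my own illustration, with an illustrative drop probability of 0.5): during training each unit is zeroed with some probability, and at test time nothing is dropped but the activations are scaled so their expected value matches.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Zero each unit with probability p_drop during training.

    At test time no units are dropped; instead the activations are scaled
    by (1 - p_drop) so the expected input to the next layer stays the same.
    """
    if training:
        mask = rng.random(activations.shape) >= p_drop
        return activations * mask
    return activations * (1.0 - p_drop)

hidden = np.array([0.7, 1.2, 0.3, 2.5, 0.9])
print(dropout(hidden, training=True))   # roughly half the units zeroed
print(dropout(hidden, training=False))  # all units kept, scaled by 0.5
```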

     

     

    We also entered a variant of this model in the ILSVRC-2012 competition

    (it achieved a good result)

    and achieved a winning top-5 test error rate of 15.3%,

    compared to 26.2% achieved by the second-best entry.

     

    Introduction

    One-paragraph summary: there was this problem, so they did this.

     

    1 Introduction

    Current approaches to object recognition make essential use of machine learning methods.

    Current approaches for the task of object recognition use machine learning methods.

     

    To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting.

     

    Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]).

     

    Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations.

     

    label-preserving transformations:

    Simple recognition tasks perform well even with a small labeled dataset.

    Change the data, e.g. flip a cat image upside down (augmentation technique) -> study

    Even if the cat is flipped it is still a cat, so the label is preserved (see the sketch below).
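    A minimal NumPy sketch of a label-preserving transformation (my own illustration, not the paper's augmentation code): the pixels are flipped but the label stays the same.

```python
import numpy as np

def augment(image, label):
    """Yield the original image plus flipped copies, all with the same label.

    These are label-preserving transformations: the pixels change,
    but a flipped cat is still a cat.
    """
    yield image, label             # original
    yield np.fliplr(image), label  # left-right flip
    yield np.flipud(image), label  # upside-down flip

cat_image = np.arange(9).reshape(3, 3)  # tiny stand-in for a cat photo
for img, lbl in augment(cat_image, label="cat"):
    print(lbl)
    print(img)
```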

    For example, the current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].

    On the MNIST digit-recognition task, the current best error rate approaches human performance.

     

    But objects in realistic settings exhibit considerable variability,

    so to learn to recognize them it is necessary to use much larger training sets.

     

    However, to handle objects in realistic settings, much larger training sets are needed.

     

    And indeed, the shortcomings of small image datasets

    have been widely recognized (e.g., Pinto et al. [21]),

     

    but it has only recently become possible to collect labeled datasets

    with millions of images.

    Much larger labeled datasets are needed.

     

    The new larger datasets include LabelMe [23],

    which consists of hundreds of thousands of fully-segmented images,

    and ImageNet [6], which consists of over 15 million labeled high-resolution images

    in over 22,000 categories.

    The new larger datasets include LabelMe, made up of fully-segmented images, and ImageNet.

     

    Introduction

    One-paragraph summary: there was this problem, so they did this.

     

     

     

    Paper review method

    Practice reading the introduction without running it through a translator -

    Other people's summaries

    Summarize part by part

    1. Organize a summary from the abstract

    2. The introduction explains how the research was done - keeping it brief is fine

     

    Additional things to study

    study neural networks

     

     

    1. What is the difference between a fully connected layer and a CNN (focusing on the input)?

    2. The relationship between the input, the kernel, and the output

    3. Why stride and padding are used

    4. Explain how the result changes with stride 1 versus stride 2, relating it to question 2 (see the sketch after this list).
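    A small helper (my own sketch, not from the paper) for working through questions 2-4: the spatial size of the output is tied to the input size, kernel size, stride, and padding, and doubling the stride roughly halves the output.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: floor((I - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Question 2: input, kernel, and output sizes are linked by the formula above.
print(conv_output_size(224, kernel_size=3, stride=1, padding=0))  # 222 (shrinks)
print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224 (padding keeps the size)
# Question 4: stride 2 instead of stride 1 roughly halves the output width/height.
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
```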

     

    2 The Dataset - how much data, what kind of data, what categories, whether it is labeled

    ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories

    The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. 

    In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

     

    --> Summary: ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.

     

    ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments.

    Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable

    On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.

    Example: categories = [dog, cat, cow, tiger, chicken, pigeon, hawk, monkey]. Passing an input (a dog photo) through the model gives output probabilities such as [0.6, 0.2, 0.1, 0.1, ...].

    The highest-probability class is "dog", which matches the input, so the top-1 error for this image is 0 (see the sketch below).
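    A minimal NumPy sketch of how the two error rates can be computed (my own illustration; the class names and probabilities are made up): top-k error is the fraction of images whose true label is not among the k most probable classes.

```python
import numpy as np

classes = ["dog", "cat", "cow", "tiger", "chicken", "pigeon", "hawk", "monkey"]

def top_k_error(probabilities, true_labels, k):
    """Fraction of images whose true label is NOT among the k most probable classes."""
    top_k = np.argsort(probabilities, axis=1)[:, -k:]  # indices of the k largest scores
    hits = [true_labels[i] in top_k[i] for i in range(len(true_labels))]
    return 1.0 - np.mean(hits)

# Two toy predictions over the 8 classes above; true labels: dog (0) and cat (1).
probs = np.array([
    [0.60, 0.20, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01],  # puts "dog" first -> top-1 hit
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.05, 0.03, 0.02],  # "cat" is not the top guess
])
true_labels = np.array([0, 1])
print(top_k_error(probs, true_labels, k=1))  # 0.5: one of the two top-1 guesses is wrong
print(top_k_error(probs, true_labels, k=5))  # 0.0: both true labels are inside the top 5
```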

    ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality.

    Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image.

     

    We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.
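    A minimal sketch of that preprocessing (my own illustration, assuming Pillow and NumPy; the mean values here are placeholders, and the paper actually subtracts a per-pixel mean image rather than a per-channel mean): rescale so the shorter side is 256, crop the central 256x256 patch, and subtract the training-set mean.

```python
import numpy as np
from PIL import Image  # assumes Pillow is installed

def preprocess(img, mean_rgb, size=256):
    """Rescale so the shorter side is `size`, center-crop size x size, subtract the mean."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    patch = img.crop((left, top, left + size, top + size))
    return np.asarray(patch, dtype=np.float32) - mean_rgb  # centered raw RGB values

# Random stand-in for a rectangular web image; mean_rgb stands in for the
# mean computed over the training set (placeholder numbers, per channel here).
fake_image = Image.fromarray(np.random.randint(0, 256, (300, 480, 3), dtype=np.uint8))
x = preprocess(fake_image, mean_rgb=np.float32([120.0, 115.0, 100.0]))
print(x.shape)  # (256, 256, 3)
```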

     

    3 The Architecture

     

    The architecture of our network is summarized in Figure 2. It contains eight learned layers — five convolutional and three fully-connected. Below, we describe some of the novel or unusual features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.
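    To make the layer pattern concrete, here is a PyTorch-style sketch of an AlexNet-like stack (my own illustration: kernel and channel sizes follow the paper, but the padding choices and the single-device layout are simplifications of the original two-GPU network): five convolutional layers, some followed by max-pooling, then three fully-connected layers ending in a 1000-way output.

```python
import torch
import torch.nn as nn

# Eight learned layers: five convolutional (some followed by max-pooling)
# and three fully-connected, ending in a 1000-way classifier.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 5 * 5, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # 1000-way output; softmax is applied with the loss
)

x = torch.randn(1, 3, 224, 224)       # one 224x224 RGB image
print(classifier(features(x)).shape)  # torch.Size([1, 1000])
```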

     

    3.1 ReLU Nonlinearity

    The standard way to model a neuron’s output f as a function of its input x is with f(x) = tanh(x) or f(x) = (1 + e^(-x))^(-1). In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x).

     

    Multiple layers are needed to filter out what is unimportant, e.g. when you are pricked by a needle, so that irrelevant information (such as smell) is not passed along.

     

     

    The standard way to model a neuron’s output f as a function of its input x is with f(x) = tanh(x) or f(x) = (1 + e^(-x))^(-1). In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x).

    They tried f(x) = tanh(x) (the tanh function) and f(x) = (1 + e^(-x))^(-1) (the sigmoid function), but f(x) = max(0, x) trained the fastest (see the sketch below).
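    A quick numerical check of what "saturating" means (my own sketch): for large inputs the gradients of tanh and the sigmoid collapse toward zero, while the ReLU gradient stays at 1, which is why gradient-descent training moves faster.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradients(x):
    """Gradients of the three activations discussed above at input x."""
    return {
        "tanh": 1.0 - np.tanh(x) ** 2,               # d/dx tanh(x)
        "sigmoid": sigmoid(x) * (1.0 - sigmoid(x)),  # d/dx (1 + e^(-x))^(-1)
        "relu": float(x > 0),                        # d/dx max(0, x)
    }

print(gradients(0.5))  # all three gradients are reasonably large near zero
print(gradients(5.0))  # tanh/sigmoid gradients collapse toward 0 (saturation); ReLU stays 1
```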

     

    Following Nair and Hinton [20], we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units.

    Following Nair and Hinton, neurons with this nonlinearity are called Rectified Linear Units (ReLUs); deep CNNs with ReLUs train several times faster than the tanh equivalents.

     

    This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.

    Figure 1 shows the number of iterations needed to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.

     

    This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.

    This plot shows that experiments with such large neural networks would not have been possible if the traditional saturating neuron models had been used.

     

    We are not the first to consider alternatives to traditional neuron models in CNNs.

    They are not the first to consider alternatives to the traditional neuron models in CNNs.

     

    For example, Jarrett et al. [11] claim that the nonlinearity f(x) = |tanh(x)| works particularly well

    with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset.

    For example, according to Jarrett et al. [11], the nonlinearity f(x) = |tanh(x)| works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset.

    However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.

     

    However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs.

    On that dataset the goal was to prevent overfitting, whereas the point of ReLU here is to fit the training set faster.

     

     

    However, on this dataset the primary concern is preventing overfitting

    However, the primary concern for this dataset is preventing overfitting,

    so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs.

    so the effect they are observing is different from the accelerated ability to fit the training set that is reported here when using ReLUs.

     

    Figure 1: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line).

    In the graph, the four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than the equivalent network with tanh neurons (dashed line).

     

    The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed.

    The learning rate for each network was chosen independently to make training as fast as possible, and no regularization of any kind was used.

     

    The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.

    The size of the effect varies with the network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.

     

    Summary:

    This was an experiment comparing training on the dataset with the traditional tanh function versus ReLU.

    The tanh function was not well suited to large data; the |tanh| variant worked particularly well with contrast normalization followed by local average pooling,

    which was aimed at preventing overfitting, whereas
    ReLUs learn about 6 times faster than tanh. Also, no regularization was used.

     

    In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x). 

    For training with gradient descent, the saturating nonlinearities are much slower than the non-saturating ReLU, f(x) = max(0, x).

    Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units.

    Deep convolutional neural networks with ReLUs train several times faster than the equivalent networks with tanh units.

     

    This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.

    This plot shows that, had the traditional saturating neuron models been used, it would not have been possible to experiment with such a large neural network in this work.

     

    However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.

    However, on that dataset the main concern is preventing overfitting, so the effect observed there differs from the accelerated fitting of the training set reported with ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.

     

     

    (1) ReLU

    (ReLU is widely used as an activation function today, but when this paper was written, tanh and sigmoid were the usual choices.)

    Because of the large amount of data and the deep network structure, AlexNet needs to be able to train quickly. For this it uses the ReLU activation function, which trained about 6 times faster than tanh.

    Of course, ReLU is not first proposed by this paper. Earlier work used an alternative nonlinearity to deal with overfitting (Jarrett et al.), but in AlexNet the main goal is fast learning, not preventing overfitting.

     

    -----------------------------------

     

    3.2 Training on Multiple GPUs

     

    A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it.

    One GTX 580 GPU has just 3GB of memory, so it limits the maximum size of network that can be trained on it.

    It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU.

    It turns out that the 1.2 million training examples are enough to train networks that are too big to fit on a single GPU.

    Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory.
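    The paper's two-GPU scheme is a form of model parallelism. Below is a minimal PyTorch-style sketch of the general idea only (my own illustration; the paper used its own GPU code, and its two halves communicate only at certain layers, which is not reproduced here): half of the layers live on each GPU and the activations are passed between them.

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    """Minimal model-parallel sketch: half of the layers on each of two GPUs."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations cross from GPU 0 to GPU 1 here

if torch.cuda.device_count() >= 2:  # requires two visible GPUs
    model = TwoGPUNet()
    out = model(torch.randn(8, 1024))
    print(out.shape)  # torch.Size([8, 10])
```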

     

     
