Papers By Category

Abstract - Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
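The abstract above mentions "suitably factorized convolutions" without spelling out the factorization. As a rough illustration only, the sketch below counts weights for two commonly described factorizations: replacing one 5x5 convolution with two stacked 3x3 convolutions, and replacing one 3x3 convolution with a 1x3 followed by a 3x1. The channel count `C` and the helper `conv_weights` are hypothetical and chosen for the example; biases are ignored.

```python
def conv_weights(kernel_h, kernel_w, channels_in, channels_out):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * channels_in * channels_out


C = 64  # hypothetical channel count, used for both input and output

# One 5x5 convolution versus two stacked 3x3 convolutions
# (both cover a 5x5 receptive field).
single_5x5 = conv_weights(5, 5, C, C)
stacked_3x3 = 2 * conv_weights(3, 3, C, C)

# One 3x3 convolution versus a 1x3 convolution followed by a 3x1.
single_3x3 = conv_weights(3, 3, C, C)
asymmetric = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)

print("stacked 3x3 / single 5x5:", stacked_3x3 / single_5x5)  # 18/25
print("1x3 + 3x1  / single 3x3:", asymmetric / single_3x3)    # 6/9
```

With equal input and output channels the ratios do not depend on `C`: the stacked form uses 18/25 of the weights and the asymmetric form uses 6/9.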

On the Variance of the Adaptive Learning Rate and Beyond

Abstract - The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method.
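The abstract above refers to "a term to rectify the variance of the adaptive learning rate" but does not give it. The sketch below computes that term as it is commonly stated for RAdam; the formula, the threshold of 4, and the function name `rectification_term` are not taken from the abstract and should be read as assumptions. It returns `None` for early steps where the variance is treated as intractable, in which case the optimizer would fall back to an unadapted (momentum-only) update.

```python
import math


def rectification_term(step, beta2=0.999):
    """Variance rectification factor r_t for a 1-indexed step, or None.

    Assumed form (not given in the abstract):
      rho_inf = 2 / (1 - beta2) - 1
      rho_t   = rho_inf - 2 * t * beta2**t / (1 - beta2**t)
      r_t     = sqrt((rho_t - 4)(rho_t - 2) rho_inf
                     / ((rho_inf - 4)(rho_inf - 2) rho_t))   if rho_t > 4
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None  # too few samples: skip the adaptive scaling
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


for t in (1, 100, 1000, 10000):
    print(t, rectification_term(t))
```

The factor grows toward 1 as training proceeds, which is how the term plays the role that the abstract attributes to warmup: small effective adaptive steps early, full steps later.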

Abstract - Diagnosing different retinal diseases from Spectral Domain Optical Coherence Tomography (SD-OCT) images is a challenging task. Different automated approaches such as image processing, machine learning and deep learning algorithms have been used for early detection and diagnosis of retinal diseases. Unfortunately, these are prone to error and computational inefficiency, which requires further intervention from human experts. In this paper, we propose a novel convolutional neural network architecture to successfully distinguish between different degeneration of retinal layers and their underlying causes. The proposed novel architecture outperforms other classification models while addressing the issue of gradient explosion. Our approach reaches near perfect accuracy of 99.8% and 100% for two separately available Retinal SD-OCT datasets, respectively. Additionally, our architecture predicts retinal diseases in real time while outperforming human diagnosticians. Keywords: SD-OCT, Convolutional Neural Networks, Retinal Degeneration, Residual Neural Network, Deep Learning, Computer Vision

One-Shot Instance Segmentation

Abstract - We tackle the problem of one-shot instance segmentation: Given an example image of a novel, previously unknown object category, find and segment all objects of this category within a complex scene. To address this challenging new task, we propose Siamese Mask R-CNN. It extends Mask R-CNN by a Siamese backbone encoding both reference image and scene, allowing it to target detection and segmentation towards the reference category. We demonstrate empirical results on MS Coco highlighting challenges of the one-shot setting: while transferring knowledge about instance segmentation to novel object categories works very well, targeting the detection network towards the reference category appears to be more difficult. Our work provides a first strong baseline for one-shot instance segmentation and will hopefully inspire further research into more powerful and flexible scene analysis algorithms.

Abstract - Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.42.

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Abstract - Emotion cause extraction (ECE), the task aimed at extracting the potential causes behind certain emotions in text, has gained much attention in recent years due to its wide applications. However, it suffers from two shortcomings: 1) the emotion must be annotated before cause extraction in ECE, which greatly limits its applications in real-world scenarios; 2) the way to first annotate emotion and then extract the cause ignores the fact that they are mutually indicative. In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potential pairs of emotions and corresponding causes in a document. We propose a 2-step approach to address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conducts emotion-cause pairing and filtering. The experimental results on a benchmark emotion cause corpus prove the feasibility of the ECPE task as well as the effectiveness of our approach.

BAKSA at SemEval-2020 Task 9: Bolstering CNN with Self-Attention for Sentiment Analysis of Code Mixed Text

Abstract - Sentiment Analysis of code-mixed text has diversified applications in opinion mining ranging from tagging user reviews to identifying social or political sentiments of a sub-population. In this paper, we present an ensemble architecture of convolutional neural net (CNN) and self-attention based LSTM for sentiment analysis of code-mixed tweets. While the CNN component helps in the classification of positive and negative tweets, the self-attention based LSTM helps in the classification of neutral tweets, because of its ability to identify correct sentiment among multiple sentiment bearing units. We achieved F1 scores of 0.707 (ranked 5th) and 0.725 (ranked 13th) on Hindi-English (Hinglish) and Spanish-English (Spanglish) datasets, respectively. The submissions for Hinglish and Spanglish tasks were made under the usernames ayushk and harsh6 respectively.

Image processing

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Abstract - Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.
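The abstract above describes a perceptual loss made of a content loss and an adversarial loss. The sketch below shows one way the two terms could be combined for a single image. It is illustrative only: the function names, the use of plain Python lists as stand-ins for feature maps, and the weight `adv_weight=1e-3` are assumptions, not details given in the abstract.

```python
import math


def content_loss(features_sr, features_hr):
    """Mean squared error between two equal-length feature vectors.

    The vectors stand in for feature maps of the super-resolved (SR)
    and high-resolution (HR) images taken from some fixed network.
    """
    squared = [(a - b) ** 2 for a, b in zip(features_sr, features_hr)]
    return sum(squared) / len(squared)


def adversarial_loss(disc_prob_real):
    """Generator-side loss: small when the discriminator assigns the
    super-resolved image a high probability of being a real photo."""
    return -math.log(disc_prob_real)


def perceptual_loss(features_sr, features_hr, disc_prob_real, adv_weight=1e-3):
    """Content loss plus a weighted adversarial loss (weight is assumed)."""
    return (content_loss(features_sr, features_hr)
            + adv_weight * adversarial_loss(disc_prob_real))


# Hypothetical feature vectors and discriminator outputs.
sr = [0.2, 0.8, 0.5]
hr = [0.1, 0.9, 0.5]
print(perceptual_loss(sr, hr, disc_prob_real=0.1))  # discriminator not fooled
print(perceptual_loss(sr, hr, disc_prob_real=0.9))  # discriminator fooled
```

For the same features, the loss is lower when the discriminator is fooled, which is the pressure toward the natural image manifold that the abstract describes.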

Abstract - The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics.

Evaluating the Utility of Hand-crafted Features in Sequence Labelling

Abstract - Conventional wisdom is that hand-crafted features are redundant for deep learning models, as they already learn adequate representations of text automatically from corpora. In this work, we test this claim by proposing a new method for exploiting handcrafted features as part of a novel hybrid learning approach, incorporating a feature auto-encoder loss component. We evaluate on the task of named entity recognition (NER), where we show that including manual features for part-of-speech, word shapes and gazetteers can improve the performance of a neural CRF model. We obtain an F1 of 91.89 for the CoNLL-2003 English shared task, which significantly outperforms a collection of highly competitive baseline models. We also present an ablation study showing the importance of auto-encoding, over using features as either inputs or outputs alone, and moreover, show including the autoencoder components reduces training requirements to 60%, while retaining the same predictive accuracy.

Deep Learning

Playing Atari with Deep Reinforcement Learning

Abstract - We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
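The abstract above says the network is trained with a variant of Q-learning and outputs a value function estimating future rewards. As a minimal sketch of the underlying idea, the code below computes the standard one-step Q-learning target and the squared error a network would be trained to reduce. The function names and the discount `gamma=0.99` are assumptions for illustration; the paper's specific variant is not reproduced here.

```python
def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target.

    reward        : reward observed after taking the action
    next_q_values : estimated action values in the next state
    done          : True if the episode ended at this step
    gamma         : discount factor (assumed value)
    """
    if done:
        return reward  # no future reward after a terminal state
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Squared difference between the current estimate and the target."""
    return (q_value - target) ** 2


# Hypothetical transition: reward 1.0, two actions in the next state.
target = q_learning_target(1.0, [0.5, 2.0], done=False, gamma=0.5)
print(target)                         # 1.0 + 0.5 * 2.0
print(squared_td_error(1.5, target))
```

In a deep variant, `next_q_values` would come from a forward pass of the convolutional network on the next frame stack, and the squared error would be minimized by gradient descent.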

Abstract - A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it’s substantially faster than sample-level autoregressive methods.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2)

Abstract - This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

Reinforcement Learning

Weight Uncertainty in Neural Networks