white2black

INTRODUCTION

The official code to reproduce the results of the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks

The code is divided into sub-packages:

1. ./Agents - learned adversarial attack generators
2. ./Attacks - optimization attacks such as HotFlip (see the sketch after this list)
3. ./Toxicity Classifier - a classifier that labels sentences as toxic or non-toxic
4. ./Data - data handling
5. ./Resources - resources shared by the other packages
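
HotFlip ranks single character substitutions by a first-order estimate of their effect on the classifier loss. Below is a minimal sketch of that scoring step; the function name and array shapes are illustrative assumptions, not this repository's API:

```python
import numpy as np

def best_hotflip(one_hot, grad):
    """First-order HotFlip scoring (hypothetical helper, not the repo API).

    one_hot: (seq_len, vocab) one-hot character encoding of a sentence.
    grad:    (seq_len, vocab) gradient of the classifier loss with
             respect to the one-hot input.

    Flipping position i from its current character a to character b
    changes the loss by approximately grad[i, b] - grad[i, a]; we
    return the single flip with the largest estimated increase.
    """
    # Gradient at the character currently occupying each position.
    current = (grad * one_hot).sum(axis=1, keepdims=True)   # (seq_len, 1)
    # Estimated loss change for every candidate substitution.
    scores = grad - current                                 # (seq_len, vocab)
    # Disallow "flipping" a character to itself.
    scores[one_hot.astype(bool)] = -np.inf
    pos, char = np.unravel_index(np.argmax(scores), scores.shape)
    return pos, char, scores[pos, char]
```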

ALGORITHM

As shown in the figure below, we train a classifier to predict whether a sentence is toxic or non-toxic. We attack this model with the white-box HotFlip algorithm and distill its knowledge into a second model, DistFlip, which generates attacks in a black-box manner. These attacks generalize well to the Google Perspective API (tested January 2019).

[Figure: algorithm overview]
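
As a rough illustration of the distillation stage, the sketch below trains a student to imitate the flip choices recorded from the white-box attack. The toy model, shapes, and names such as distill_step are assumptions for illustration, not the repository's code:

```python
import torch
import torch.nn as nn

# Toy imitation-learning step for the distillation stage. The student
# stands in for DistFlip; the model and shapes are placeholders.
seq_len, vocab = 40, 32
student = nn.Sequential(
    nn.Flatten(),                                 # (batch, seq_len * vocab)
    nn.Linear(seq_len * vocab, seq_len * vocab),  # one logit per flip
)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def distill_step(sentences_onehot, hotflip_choices):
    """One gradient step of imitating the white-box attack.

    sentences_onehot: (batch, seq_len, vocab) partially attacked inputs.
    hotflip_choices:  (batch,) flattened index pos * vocab + char of the
                      flip HotFlip chose for each input.
    """
    logits = student(sentences_onehot)
    loss = loss_fn(logits, hotflip_choices)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Smoke test with random data.
x = torch.rand(8, seq_len, vocab)
y = torch.randint(seq_len * vocab, (8,))
print(distill_step(x, y))
```

At attack time the student needs only the sentence itself, not the classifier's gradients, which is what makes the distilled attack black-box.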

DATA

We used the data from this Kaggle challenge by Jigsaw.

For data already flipped with HotFlip, you can download the archive from Google Drive and unzip it into: ./toxic_fool/resources/data
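
For example, assuming the downloaded archive is named hotflip_data.zip (the filename is hypothetical), the following places it where the code expects:

```python
import zipfile

# Extract the downloaded archive into the path the code expects.
# "hotflip_data.zip" is a placeholder name for the Google Drive download.
with zipfile.ZipFile("hotflip_data.zip") as zf:
    zf.extractall("./toxic_fool/resources/data")
```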

RESULTS

The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green):

[Figure: survival rate]

Some example sentences:

[Figure: examples]
