Last month, NIST released its Draft NISTIR 8269, A Taxonomy and Terminology of Adversarial Machine Learning.

The taxonomy is intended to assist researchers and practitioners in developing a common lexicon around Adversarial Machine Learning, with the goal of setting standards and best practices for managing the security of Artificial Intelligence (“AI”) systems against attackers.

Adversarial Machine Learning refers to the manipulation and exploitation of Machine Learning, defined by the document as “the components of an AI system [that] include the data, model, and processes for training, testing, and validation.” Researchers in this area study ways to design Machine Learning algorithms, or “models,” to resist security challenges and manage the potential consequences of intentional attacks.

NIST’s taxonomy is organized around three concepts that inform a risk assessment of AI systems: attacks, defenses, and consequences. Unlike previous surveys, the draft NISTIR treats “consequences” as a separate dimension of risk, because the consequences of Adversarial Machine Learning attacks depend on both the attacks themselves and the defenses in place, and may not be consistent with the original intent of an attacker.

Attacks. The taxonomy organizes attacks into two basic types, Training Attacks and Testing Attacks, each including sub-categories of threats. Training Attacks seek to obtain or influence the training data or model. For example, all or part of the training data can be stolen and used to create a substitute model, which in turn can be used to test potential inputs and attacks. Alternatively, adversaries can manipulate the data used to train the target model, or manipulate the target model itself with sufficient access.

Testing Attacks, on the other hand, do not tamper with the target model or the data used in training. Instead, they generate adversarial inputs. These inputs can confuse the model and evade proper classification: small changes in an input, when properly calibrated, can create large changes in output. Adversarial inputs can also be used to collect and infer information about the model or training data. Even without direct knowledge of the model itself, attackers can build substitute models based on observing input-output pairings.
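To make the idea of a calibrated small change concrete, here is a minimal sketch in Python. It is not taken from the NIST draft: the linear classifier, its weights, the input, and the helper names (predict, perturb) are all hypothetical, and the perturbation rule is written in the spirit of the fast gradient sign method, assuming the attacker knows the weights.

```python
# Toy linear classifier (hypothetical weights, not from the NIST draft):
# the input is labeled class 1 when the weighted sum is positive.
def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def perturb(weights, x, true_label, epsilon):
    """Move each feature by at most epsilon in the direction that
    pushes the score away from the true label."""
    direction = -1 if true_label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


weights = [0.5, -0.25, 0.75, -0.5]
bias = 0.0
x = [0.3, 0.2, 0.1, 0.2]

x_adv = perturb(weights, x, true_label=1, epsilon=0.1)

print(predict(weights, bias, x))      # original input
print(predict(weights, bias, x_adv))  # perturbed input
```

In this toy setup no feature moves by more than 0.1, yet the predicted class flips from 1 to 0, because the individual nudges all push the score in the same direction. Real models are not linear, but the same accumulation effect is what adversarial inputs exploit.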

Defenses. The taxonomy organizes defenses into two categories mirroring the attack types: defenses against Training Attacks and defenses against Testing Attacks. In both cases, NIST cautions that defensive methods can often have a negative effect on a model’s performance and accuracy. Defenses against Training Attacks seek to defend the model and the underlying training data. Traditional encryption is still a fundamental part of this defense. Further, in what the report calls Data Sanitization, adversarial data can be identified by testing the impact of potentially adversarial examples on classification performance. Alternatively, rather than trying to detect adversarial data, Robust Statistics defends against Training Attacks by using constraints and regularization techniques to reduce possible distortions of the model caused by “poisoned data.”
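The intuition behind Robust Statistics can be illustrated with a far simpler relative of the techniques the report describes. The sketch below, in Python, is an assumption-laden toy rather than anything in the draft: the data values and the trimmed_mean helper are hypothetical, and it compares an ordinary mean with two robust estimators when a single poisoned record is injected.

```python
import statistics

# Hypothetical measurements; the last value stands in for one
# poisoned record injected by an attacker.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1]
poison = [500.0]
data = clean + poison


def trimmed_mean(values, trim_fraction):
    """Average the values after dropping the given fraction of the
    smallest and of the largest observations."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


naive = statistics.mean(data)            # dragged far from 10 by the outlier
robust_median = statistics.median(data)  # stays at the bulk of the data
robust_trimmed = trimmed_mean(data, 0.2)

print(naive, robust_median, robust_trimmed)
```

The ordinary mean is pulled well above 60 by one bad record, while the median and trimmed mean stay close to 10. The cost is the one NIST flags for defenses generally: a robust estimator also discards some legitimate information.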

Defenses against Testing Attacks can include Robustness Improvements, a group that covers Adversarial Training, Gradient Masking, Defensive Distillation, Ensemble Methods, Feature Squeezing, and Reformers/Autoencoders. A central pillar of effective defense against Testing Attacks is Differential Privacy, which randomizes training data values within a statistical range to ensure that model outputs do not reveal any additional information about individual records in the training data. However, a model’s prediction accuracy is inherently degraded by this approach. Homomorphic Encryption, an alternative approach, encrypts data in a form that a neural network can process without decrypting the data. While this approach avoids the accuracy trade-offs of Differential Privacy, it demands more computing power.
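The privacy and accuracy trade-off in Differential Privacy can be sketched with the classic Laplace mechanism. This Python example is an illustration under stated assumptions, not the method described in the draft: it adds noise to a released statistic rather than to the training data itself, and the ages, the value bounds, and the function names (laplace_noise, private_mean) are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean
    # `scale` follows a Laplace distribution centered at zero.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of values clipped to [lower, upper] with
    Laplace noise calibrated to the privacy parameter epsilon."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Changing one record can move the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2019)  # fixed seed only so the sketch is repeatable
ages = [34, 45, 29, 61, 52, 38, 47, 55]  # hypothetical records

print(private_mean(ages, 0, 100, epsilon=1.0, rng=rng))  # less noise
print(private_mean(ages, 0, 100, epsilon=0.1, rng=rng))  # more noise
```

A smaller epsilon means stronger privacy but a noisier answer, which is the accuracy degradation noted above. With only eight records the noise is large relative to the signal; on larger datasets the sensitivity, and therefore the noise, shrinks.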

Consequences. The taxonomy categorizes consequences of Adversarial Machine Learning into three categories: Integrity Violations, Availability Violations, and Confidentiality Violations. In Integrity Violations, the inference process is undermined, resulting in a reduction in the confidence of outputs or misclassifications of inputs. Availability Violations involve a reduction in the speed or accessibility of the model, up to and including complete unavailability to users. Confidentiality Violations occur when an attacker obtains or infers information about the model or data. NIST includes Training Attacks that allow creation of substitute models in this category, as well as those that result in the revelation of the actual model architecture or parameters. Attacks can also expose confidential data, such as whether an individual record was included in the dataset used to train the target model, or personal information (e.g., medical records) that may be included in training data or otherwise.

Adversarial Machine Learning is and will increasingly be a tremendous challenge in securing AI systems. With its new taxonomy, NIST demonstrates continued interest in playing a large role in setting standards and best practices for managing AI security. The public comment period for this draft document is open through Monday, December 16, 2019, and comments may be submitted at