API

The spindle-token API is broken up into 4 namespaces:

  1. The top-level spindle_token module containing the most commonly used public functions.
  2. An OPPRL module containing implementations of every OPPRL protocol version, including PII normalization, token specifications, and cryptographic transformations.
  3. An OPPRL metadata module exposing Spark-free token metadata for lightweight schema inspection.
  4. A module of core abstractions that can be extended in advanced use cases to add functionality.

spindle_token

A module containing the main API of spindle-token.

Most users will only need to use the 3 main functions in this top-level module along with the provided configuration objects corresponding to OPPRL tokenization.

The 3 main functions provide tokenization and transcoding capabilities for data senders and recipients.

PiiAttribute

Bases: ABC

An attribute (aka column) of personally identifiable information (PII) to use when constructing tokens.

This abstract base class is intended to be extended by users to add support for building tokens from a custom PII attribute.

Attributes:

Name Type Description
attr_id

An identifier for the PiiAttribute. Should be unique across all logically different PiiAttributes.

__init__(attr_id)

Initializes the PiiAttribute with the given globally unique attribute ID.

transform(column, dtype) abstractmethod

Transforms the raw PII column into a normalized representation.

A normalized value has eliminated all representation or encoding differences so all instances of the same logical values have identical physical values. For example, text attributes will often be normalized by filtering to alpha-numeric characters and whitespace, standardizing all whitespace to the space character, and converting all alpha characters to uppercase to ensure that all ways of representing the same phrase normalize to the exact same string.

Parameters:

Name Type Description Default
column Column

The spark Column expression for the PII attribute being normalized.

required
dtype DataType

The spark DataType object of the column object found on the DataFrame being normalized. Can be used to delegate to different normalization logic based on different schemas of input data. For example, a subject's birth date may be a DateType, StringType, or LongType on input data and thus requires corresponding normalization into a DateType.

required

Returns:

Type Description
Column

A pyspark Column expression of normalized PII values.
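The text normalization described above can be sketched in plain Python. This is only an illustration of the rule, not the library's implementation: the real transform operates on Spark Column expressions, and the function name normalize_text and the restriction to ASCII letters and digits are assumptions made for this example. The sketch does not collapse repeated spaces or trim the ends, because the description above does not mention either step.

```python
import re


def normalize_text(raw: str) -> str:
    """Hypothetical plain-Python analogue of a text PII normalization."""
    # Step 1: keep only letters, digits, and whitespace (ASCII is an assumption).
    kept = re.sub(r"[^A-Za-z0-9\s]", "", raw)
    # Step 2: replace every whitespace character (tab, newline, ...) with a space.
    spaced = re.sub(r"\s", " ", kept)
    # Step 3: convert all alphabetic characters to uppercase.
    return spaced.upper()


# Two spellings of the same logical value end up as the same string.
assert normalize_text("O'Brien-Smith\tjr.") == normalize_text("obriensmith jr")
```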

derivatives()

A collection of PII attributes that can be derived from this PII attribute, including this PiiAttribute.

Returns:

Type Description
dict[str, 'PiiAttribute']

A dict with globally unique (typically namespaced) attribute IDs as the key. Values are instances of PiiAttribute that produce normalized values for each derivative attribute from the normalized values of this PiiAttribute.
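The shape of the returned dict may be easier to see in a toy example. The sketch below uses a hypothetical stand-in class, SketchAttribute, that works on plain strings instead of Spark Columns. The attribute IDs, the subclasses, and the default behavior of returning only the attribute itself are all assumptions made for illustration, not the library's definitions.

```python
from abc import ABC, abstractmethod


class SketchAttribute(ABC):
    """Hypothetical stand-in for PiiAttribute that works on plain strings."""

    def __init__(self, attr_id: str):
        self.attr_id = attr_id

    @abstractmethod
    def transform(self, value: str) -> str:
        """Normalizes a raw value."""

    def derivatives(self) -> dict[str, "SketchAttribute"]:
        # Assumed default: the only derivative is the attribute itself.
        return {self.attr_id: self}


class FirstInitial(SketchAttribute):
    def __init__(self):
        super().__init__("example.first_name.initial")

    def transform(self, value: str) -> str:
        # Input is the already-normalized first name, not the raw value.
        return value[:1]


class FirstName(SketchAttribute):
    def __init__(self):
        super().__init__("example.first_name")

    def transform(self, value: str) -> str:
        return value.strip().upper()

    def derivatives(self) -> dict[str, "SketchAttribute"]:
        # Keys are namespaced attribute IDs; the attribute includes itself.
        return {
            self.attr_id: self,
            "example.first_name.initial": FirstInitial(),
        }


first_name = FirstName()
normalized = first_name.transform("  alice ")
initial = first_name.derivatives()["example.first_name.initial"]
print(normalized, initial.transform(normalized))
```

Note that the derivative is applied to the normalized value of the parent attribute, matching the description above.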

Token dataclass

A specification of a token.

Attributes:

Name Type Description
name str

An identifier-safe name for the token. Will be used as the column name on dataframes. Must be unique across other Token specifications.

protocol TokenProtocolFactory

An instance of TokenProtocolFactory that, when provided encryption keys, produces an instance of the TokenProtocol that generates this kind of token.

attribute_ids Iterable[str]

A collection of attribute IDs used to lookup instances of PiiAttribute corresponding to the fields used to create the token.

TokenProtocol

Bases: ABC

An abstract base class for a specific version of the OPPRL tokenization protocol.

This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.

It is assumed that instances of the TokenProtocol will provide any configuration or other inputs (such as encryption keys) required to produce tokens. See TokenProtocolFactory for more information.

tokenize(attribute_ids) abstractmethod

Creates a Column expression for a single token.

Parameters:

Name Type Description Default
attribute_ids list[str]

A collection of PiiAttribute attribute IDs corresponding to the attributes to combine into the token.

required

Returns:

Type Description
Column

A pyspark Column expression representing token values.

transcode_out(token) abstractmethod

Transcodes the given token into an ephemeral token.

Parameters:

Name Type Description Default
token Column

A pyspark Column of tokens.

required

Returns:

Type Description
Column

A pyspark Column expression of ephemeral tokens created from the input tokens.

transcode_in(ephemeral_token) abstractmethod

Transcodes the given ephemeral token into a normal token.

Parameters:

Name Type Description Default
ephemeral_token Column

A pyspark Column of ephemeral tokens.

required

Returns:

Type Description
Column

A pyspark Column expression of tokens created from the input ephemeral tokens.

generate_pem_keys(key_size=2048)

Generates a fresh RSA key pair.

Parameters:

Name Type Description Default
key_size int

The size (in bits) of the key.

2048

Returns:

Type Description
tuple[bytes, bytes]

A tuple containing the private key and public key bytes, in that order. Both are PEM-encoded.
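Because the function returns the private key first and the public key second, a caller will usually unpack the tuple and store each part separately. The helper below is a minimal sketch of that step using only the standard library. The name save_key_pair and the file names are hypothetical and not part of spindle-token, and the owner-only file mode only takes effect on POSIX systems.

```python
import os
from pathlib import Path


def save_key_pair(keys: tuple[bytes, bytes], directory: str) -> tuple[Path, Path]:
    """Writes a (private, public) PEM key pair to two files (hypothetical helper)."""
    # Order follows the Returns description above: private key, then public key.
    private_pem, public_pem = keys
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    private_path = out_dir / "private_key.pem"  # assumed file name
    public_path = out_dir / "public_key.pem"    # assumed file name
    # Create the private key file with owner-only permissions where supported.
    fd = os.open(private_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(private_pem)
    public_path.write_bytes(public_pem)
    return private_path, public_path
```

Assuming generate_pem_keys is imported from the top-level spindle_token module as documented here, a call might look like save_key_pair(generate_pem_keys(), "keys").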

spindle_token.opprl

Standard configurations for different versions of the OPPRL protocol.

This package is Spark-backed. Importing the package itself is safe, but accessing OPPRL version objects requires the optional spark extra.

spindle_token.opprl.metadata

spindle_token.core

The core abstractions of spindle-token, including abstract base classes for extending functionality.

The spindle-token library provides base interfaces that can be extended by users to define custom token specifications to encrypt with existing versions of OPPRL cryptography protocols, or to define entirely new tokenization protocols.

Warning

Extending the base classes in this module to customize the tokenization behavior has no security or privacy guarantees. These abstractions -- like all OSS -- are "use at your own risk" and users should only use these advanced features if they understand them.

PiiAttribute

Bases: ABC

An attribute (aka column) of personally identifiable information (PII) to use when constructing tokens.

This abstract base class is intended to be extended by users to add support for building tokens from a custom PII attribute.

Attributes:

Name Type Description
attr_id

An identifier for the PiiAttribute. Should be unique across all logically different PiiAttributes.

__init__(attr_id)

Initializes the PiiAttribute with the given globally unique attribute ID.

transform(column, dtype) abstractmethod

Transforms the raw PII column into a normalized representation.

A normalized value has eliminated all representation or encoding differences so all instances of the same logical values have identical physical values. For example, text attributes will often be normalized by filtering to alpha-numeric characters and whitespace, standardizing all whitespace to the space character, and converting all alpha characters to uppercase to ensure that all ways of representing the same phrase normalize to the exact same string.

Parameters:

Name Type Description Default
column Column

The spark Column expression for the PII attribute being normalized.

required
dtype DataType

The spark DataType object of the column object found on the DataFrame being normalized. Can be used to delegate to different normalization logic based on different schemas of input data. For example, a subject's birth date may be a DateType, StringType, or LongType on input data and thus requires corresponding normalization into a DateType.

required

Returns:

Type Description
Column

A pyspark Column expression of normalized PII values.

derivatives()

A collection of PII attributes that can be derived from this PII attribute, including this PiiAttribute.

Returns:

Type Description
dict[str, 'PiiAttribute']

A dict with globally unique (typically namespaced) attribute IDs as the key. Values are instances of PiiAttribute that produce normalized values for each derivative attribute from the normalized values of this PiiAttribute.

TokenProtocol

Bases: ABC

An abstract base class for a specific version of the OPPRL tokenization protocol.

This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.

It is assumed that instances of the TokenProtocol will provide any configuration or other inputs (such as encryption keys) required to produce tokens. See TokenProtocolFactory for more information.

tokenize(attribute_ids) abstractmethod

Creates a Column expression for a single token.

Parameters:

Name Type Description Default
attribute_ids list[str]

A collection of PiiAttribute attribute IDs corresponding to the attributes to combine into the token.

required

Returns:

Type Description
Column

A pyspark Column expression representing token values.

transcode_out(token) abstractmethod

Transcodes the given token into an ephemeral token.

Parameters:

Name Type Description Default
token Column

A pyspark Column of tokens.

required

Returns:

Type Description
Column

A pyspark Column expression of ephemeral tokens created from the input tokens.

transcode_in(ephemeral_token) abstractmethod

Transcodes the given ephemeral token into a normal token.

Parameters:

Name Type Description Default
ephemeral_token Column

A pyspark Column of ephemeral tokens.

required

Returns:

Type Description
Column

A pyspark Column expression of tokens created from the input ephemeral tokens.

TokenProtocolFactory

Bases: ABC, Generic[P]

An abstract base class for factories that instantiate TokenProtocol implementations with user provided encryption keys.

This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.

Attributes:

Name Type Description
factory_id

An identifier for the TokenProtocolFactory. Should be globally unique across all logically different TokenProtocolFactory instances.

__init__(factory_id)

Initializes the TokenProtocolFactory with the given globally unique factory ID.

bind(private_key, recipient_public_key) abstractmethod

Creates an instance of the TokenProtocol with the user provided encryption keys.

Parameters:

Name Type Description Default
private_key bytes

The private RSA key to use when tokenizing PII and transcoding tokens.

required
recipient_public_key bytes | None

The public RSA key of the intended data recipient to use when transcoding tokens into ephemeral tokens. Can be None if the instance of TokenProtocol will not be transcoding tokens into ephemeral tokens.

required

Returns:

Type Description
P

An instance of a TokenProtocol implementation.
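The relationship between a factory and the protocol it binds can be sketched with toy classes. Everything below is a hypothetical stand-in: SketchProtocol and SketchFactory only mimic the shape of TokenProtocol and TokenProtocolFactory, tokenize takes plain strings rather than attribute IDs and Spark Columns, and the keyed HMAC-SHA256 digest is a placeholder that has nothing to do with OPPRL cryptography.

```python
import hashlib
import hmac
from abc import ABC, abstractmethod
from typing import Generic, Optional, TypeVar


class SketchProtocol(ABC):
    """Hypothetical stand-in for TokenProtocol, operating on plain strings."""

    @abstractmethod
    def tokenize(self, values: list[str]) -> str: ...


P = TypeVar("P", bound=SketchProtocol)


class SketchFactory(ABC, Generic[P]):
    """Hypothetical stand-in for TokenProtocolFactory."""

    def __init__(self, factory_id: str):
        self.factory_id = factory_id

    @abstractmethod
    def bind(self, private_key: bytes, recipient_public_key: Optional[bytes]) -> P: ...


class HmacProtocol(SketchProtocol):
    def __init__(self, private_key: bytes, recipient_public_key: Optional[bytes]):
        self._private_key = private_key
        # May be None when the instance never produces ephemeral tokens.
        self._recipient_public_key = recipient_public_key

    def tokenize(self, values: list[str]) -> str:
        # Placeholder only: a keyed digest over the joined normalized values.
        message = ":".join(values).encode("utf-8")
        return hmac.new(self._private_key, message, hashlib.sha256).hexdigest()


class HmacFactory(SketchFactory[HmacProtocol]):
    def __init__(self):
        super().__init__("example.hmac.v0")  # made-up factory ID

    def bind(self, private_key: bytes, recipient_public_key: Optional[bytes] = None) -> HmacProtocol:
        # The factory holds no keys itself; each bind call yields a keyed protocol.
        return HmacProtocol(private_key, recipient_public_key)


protocol = HmacFactory().bind(b"demo-private-key", None)
print(protocol.tokenize(["ALICE", "SMITH"]))
```

The point of the split is that a token specification can reference a factory without holding any key material; keys are supplied only when the protocol is bound.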

Token dataclass

A specification of a token.

Attributes:

Name Type Description
name str

An identifier-safe name for the token. Will be used as the column name on dataframes. Must be unique across other Token specifications.

protocol TokenProtocolFactory

An instance of TokenProtocolFactory that, when provided encryption keys, produces an instance of the TokenProtocol that generates this kind of token.

attribute_ids Iterable[str]

A collection of attribute IDs used to lookup instances of PiiAttribute corresponding to the fields used to create the token.