API

The spindle-token API is broken up into 3 namespaces:

The top-level spindle_token module containing the most commonly used public functions.
An OPPRL module containing implementations of every OPPRL protocol version, including PII normalization, token specifications, and cryptographic transformations.
A module of core abstractions that can be extended in advanced use cases to add functionality.

`spindle_token`

A module containing the main API of spindle-token.

Most users will only need to use the 3 main functions in this top-level module along with the provided configuration objects corresponding to OPPRL tokenization.

The 3 main functions provide tokenization and transcoding capabilities for data senders and recipients respectively.

`tokenize(df, col_mapping, tokens, private_key=None)`

Adds encrypted token columns based on PII.

All PII columns found in the DataFrame are normalized and transformed as needed according to the col_mapping. The PII attributes that make up each Token object in tokens are then hashed and encrypted together according to their respective protocol versions.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The pyspark `DataFrame` containing all PII attributes.	required
`col_mapping`	`Mapping[PiiAttribute, str]`	A dictionary that maps instances of `PiiAttribute` to the corresponding column name of `df`.	required
`tokens`	`Iterable[Token]`	A collection of `Token` objects that denotes which tokens (from which PII attributes) should be added to the dataframe.	required
`private_key`	`bytes \| None`	Your private RSA key. This argument should only be set when reading from a secrets manager or testing, otherwise it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key.	`None`

Returns:

Type	Description
`DataFrame`	The input `DataFrame` with encrypted tokens added.
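A minimal sender-side sketch of the call described above. The helper name `add_opprl_tokens` and the column names (`first_name`, `last_name`, `gender`, `birth_date`) are assumptions made for illustration. The private key is assumed to be available through the SPINDLE_TOKEN_PRIVATE_KEY environment variable unless one is passed explicitly.

```python
def add_opprl_tokens(df, private_key=None):
    """Add OPPRL v1 tokens 1-3 to a PySpark DataFrame (illustrative sketch)."""
    # Imported inside the function only so this snippet can be loaded
    # without spindle-token installed.
    from spindle_token import tokenize
    from spindle_token.opprl import OpprlV1 as v1

    # Map each PII attribute to the (hypothetical) column that holds it.
    col_mapping = {
        v1.first_name: "first_name",
        v1.last_name: "last_name",
        v1.gender: "gender",
        v1.birth_date: "birth_date",
    }
    tokens = [v1.token1, v1.token2, v1.token3]
    # With private_key=None, the key is expected to come from the
    # SPINDLE_TOKEN_PRIVATE_KEY environment variable.
    return tokenize(df, col_mapping, tokens, private_key=private_key)
```

Tokens 1 to 3 are chosen here because they only need the four attributes mapped above; other tokens would need their own attributes added to the mapping.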

`transcode_out(df, tokens, recipient_public_key=None, private_key=None)`

Transcodes token columns of a dataframe into ephemeral tokens.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The pyspark `DataFrame` containing token columns.	required
`tokens`	`Iterable[Token]`	A collection of `Token` objects that denote which columns of the input dataframe will be transcoded into ephemeral tokens.	required
`recipient_public_key`	`bytes \| None`	The public RSA key of the recipient who will be receiving the dataset with ephemeral tokens. Can also be supplied via the SPINDLE_TOKEN_RECIPIENT_PUBLIC_KEY environment variable.	`None`
`private_key`	`bytes \| None`	Your private RSA key. This argument should only be set when reading from a secrets manager or testing, otherwise it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key.	`None`

Returns:

Type	Description
`DataFrame`	The `DataFrame` with the tokens replaced by ephemeral tokens for sending to the recipient.

`transcode_in(df, tokens, private_key=None)`

Transcodes ephemeral token columns into normal tokens.

Used by the data recipient of a dataset containing ephemeral tokens produced by transcode_out to transcode the ephemeral tokens such that they will match other datasets tokenized with the same private key.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Spark `DataFrame` with ephemeral token columns to transcode.	required
`tokens`	`Iterable[Token]`	A collection of `Token` objects that denote which columns of the input dataframe will be transcoded from ephemeral tokens into normal tokens.	required
`private_key`	`bytes \| None`	Your private RSA key. Must be the corresponding private key for the public key given to the sender when calling `transcode_out`. This argument should only be set when reading from a secrets manager or testing, otherwise it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key.	`None`

Returns:

Type	Description
`DataFrame`	The `DataFrame` with the ephemeral tokens replaced with normal tokens.
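The sender and recipient roles can be sketched as a pair of helpers. The function names `send_tokens` and `receive_tokens` are hypothetical, and the example assumes the dataset carries only `token1` from OPPRL v1. Each party's own private key is assumed to come from the SPINDLE_TOKEN_PRIVATE_KEY environment variable.

```python
def send_tokens(tokenized_df, recipient_public_key):
    """Sender side: replace tokens with ephemeral tokens for one recipient."""
    # Imported lazily so the snippet loads without spindle-token installed.
    from spindle_token import transcode_out
    from spindle_token.opprl import OpprlV1 as v1

    return transcode_out(
        tokenized_df,
        [v1.token1],
        recipient_public_key=recipient_public_key,
    )


def receive_tokens(ephemeral_df):
    """Recipient side: turn ephemeral tokens into tokens under the recipient's key."""
    from spindle_token import transcode_in
    from spindle_token.opprl import OpprlV1 as v1

    return transcode_in(ephemeral_df, [v1.token1])
```

The same `tokens` collection should be used on both sides so that the recipient transcodes exactly the columns the sender produced.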

`generate_pem_keys(key_size=2048)`

Generates a fresh RSA key pair.

Parameters:

Name	Type	Description	Default
`key_size`	`int`	The size (in bits) of the key.	`2048`

Returns:

Type	Description
`tuple[bytes, bytes]`	A tuple containing the private key and public key bytes. Both in the PEM encoding.
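One way to persist a generated key pair is sketched below. The helper name `write_key_pair` and the file names `private_key.pem` and `public_key.pem` are assumptions for illustration; in practice a secrets manager may be a better home for the private key.

```python
import os
from pathlib import Path


def write_key_pair(directory, key_size=2048):
    """Generate an RSA key pair and write both PEM files to `directory`."""
    # Imported lazily so the snippet loads without spindle-token installed.
    from spindle_token import generate_pem_keys

    # The documented return order is (private key, public key).
    private_pem, public_pem = generate_pem_keys(key_size=key_size)

    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    private_path = out_dir / "private_key.pem"
    public_path = out_dir / "public_key.pem"

    # Create the private key file with owner-only permissions where the OS
    # honors them. An already existing file keeps its current mode.
    fd = os.open(private_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(private_pem)

    public_path.write_bytes(public_pem)
    return private_path, public_path
```

Only the public key file is meant to be shared with a data sender; the private key stays with its owner.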

`spindle_token.opprl`

A module of classes that provide standard configuration for different versions of the OPPRL protocol.

Each class is made up of class variables holding all instances of PiiAttribute, Token, and TokenProtocolFactory that make up the complete spec of the corresponding OPPRL version.

`OpprlV0`

All instances of PiiAttribute, Token, and TokenProtocolFactory for v0 of the OPPRL protocol.

All members are class variables, and therefore this class does not need to be instantiated.

Attributes:

Name	Type	Description
`first_name`	`NameAttribute`	The PII attribute for a subject's first name.
`last_name`	`NameAttribute`	The PII attribute for a subject's last (aka family) name.
`gender`	`GenderAttribute`	The PII attribute for a subject's gender.
`birth_date`	`DateAttribute`	The PII attribute for a subject's date of birth.
`protocol`	`TokenProtocolFactory`	The tokenization protocol for producing OPPRL version 0 tokens.
`token1`	`Token`	A token generated from first initial, last name, gender, and birth date.
`token2`	`Token`	A token generated from first soundex, last soundex, gender, and birth date.
`token3`	`Token`	A token generated from first metaphone, last metaphone, gender, and birth date.

`OpprlV1`

All instances of PiiAttribute, Token, and TokenProtocolFactory for v1 of the OPPRL protocol.

All members are class variables, and therefore this class does not need to be instantiated.

Attributes:

Name	Type	Description
`first_name`	`NameAttribute`	The PII attribute for a subject's first name.
`last_name`	`NameAttribute`	The PII attribute for a subject's last (aka family) name.
`gender`	`GenderAttribute`	The PII attribute for a subject's gender.
`birth_date`	`DateAttribute`	The PII attribute for a subject's date of birth.
`email`	`EmailAttribute`	The PII attribute for a subject's email address.
`hem`	`HashedEmail`	The PII attribute for a subject's SHA2 hashed email address.
`phone`	`PhoneNumberAttribute`	The PII attribute for a subject's phone number.
`ssn`	`SsnAttribute`	The PII attribute for a subject's social security number.
`group_number`	`GroupNumberAttribute`	The PII attribute for a subject's health plan group number.
`member_id`	`MemberIdAttribute`	The PII attribute for a subject's health plan member ID.
`protocol`	`TokenProtocolFactory`	The tokenization protocol for producing OPPRL version 1 tokens.
`token1`	`Token`	A token generated from first initial, last name, gender, and birth date.
`token2`	`Token`	A token generated from first soundex, last soundex, gender, and birth date.
`token3`	`Token`	A token generated from first metaphone, last metaphone, gender, and birth date.
`token4`	`Token`	A token generated from first initial, last name, and birth date.
`token5`	`Token`	A token generated from first soundex, last soundex, and birth date.
`token6`	`Token`	A token generated from first metaphone, last metaphone, and birth date.
`token7`	`Token`	A token generated from first name and phone number.
`token8`	`Token`	A token generated from birth date and phone number.
`token9`	`Token`	A token generated from first name and SSN.
`token10`	`Token`	A token generated from birth date and SSN.
`token11`	`Token`	A token generated from an email address.
`token12`	`Token`	A token generated from a SHA2 hashed email address.
`token13`	`Token`	A token generated from health plan group number and member ID.

`IdentityAttribute`

Bases: PiiAttribute

An implementation of PiiAttribute with no transformation (normalization) logic.

This class is useful when your data contains columns that can be used as attributes as-is with no normalization. This is typically the case when your data already has columns for PII attributes that would otherwise be derived from other PII attributes, such as the initial of the first name.

Examples:

Create an identity attribute that uses the first initial directly as opposed to deriving from the first name.

>>> attribute = IdentityAttribute("opprl.v1.first.initial")

The transform method returns the input column unchanged.

>>> attribute.transform(col("first_initial"), StringType())
Column<'first_initial'>

There are no derivatives of identity attributes aside from the attribute itself.

>>> attribute.derivatives()
{'opprl.v1.first.initial': IdentityAttribute(opprl.v1.first.initial)}

Identity attributes can be passed to the tokenize function.

>>> from spindle_token import tokenize
>>> from spindle_token.opprl import OpprlV1 as v1
>>> tokenize(
...     df,
...     {
...         IdentityAttribute("opprl.v1.first.initial"): "first_initial",
...         v1.last_name: "last_name",
...         v1.gender: "gender",
...         v1.birth_date: "birth_date",
...     },
...     [v1.token1],
... )
DataFrame[first_initial: string, last_name: string, ..., opprl_token_1v1: string]

`spindle_token.core`

The core abstractions of spindle-token, including abstract base classes for extending functionality.

The spindle-token library provides base interfaces that can be extended by users to define custom token specifications to encrypt with existing versions of OPPRL cryptography protocols, or to define entirely new tokenization protocols.

Warning

Extending the base classes in this module to customize the tokenization behavior has no security or privacy guarantees. These abstractions -- like all OSS -- are "use at your own risk" and users should only use these advanced features if they understand them.

`PiiAttribute`

Bases: ABC

An attribute (aka column) of personally identifiable information (PII) to use when constructing tokens.

This abstract base class is intended to be extended by users to add support for building tokens from a custom PII attribute.

Attributes:

Name	Type	Description
`attr_id`		An identifier for the PiiAttribute. Should be unique across all logically different PiiAttributes.

`__init__(attr_id)`

Initializes the PiiAttribute with the given globally unique attribute ID.

`transform(column, dtype)` `abstractmethod`

Transforms the raw PII column into a normalized representation.

A normalized value has eliminated all representation or encoding differences so all instances of the same logical values have identical physical values. For example, text attributes will often be normalized by filtering to alpha-numeric characters and whitespace, standardizing all whitespace to the space character, and converting all alpha characters to uppercase to ensure that all ways of representing the same phrase normalize to the exact same string.

Parameters:

Name	Type	Description	Default
`column`	`Column`	The spark `Column` expression for the PII attribute being normalized.	required
`dtype`	`DataType`	The spark `DataType` object of the `column` object found on the `DataFrame` being normalized. Can be used to delegate to different normalization logic based on different schemas of input data. For example, a subject's birth date may be a `DateType`, `StringType`, or `LongType` on input data and thus requires corresponding normalization into a `DateType`.	required

Returns:

Type	Description
`Column`	A pyspark Column expression of normalized PII values.
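The text normalization described for `transform` can be illustrated on plain Python strings. This is only a sketch of the idea: a real `transform` builds a pyspark `Column` expression, and the exact rules used by the library may differ. The function name `normalize_text`, the ASCII-only character filter, the collapsing of whitespace runs, and the trimming of leading and trailing spaces are all assumptions.

```python
import re


def normalize_text(value):
    """Normalize a text value so equivalent spellings compare equal (sketch)."""
    # Keep only ASCII letters, digits, and whitespace.
    filtered = re.sub(r"[^A-Za-z0-9\s]", "", value)
    # Standardize every run of whitespace to a single space character.
    collapsed = re.sub(r"\s+", " ", filtered)
    # Trim the ends and convert all letters to uppercase.
    return collapsed.strip().upper()


print(normalize_text(" Mary-Ann\tSmith "))  # MARYANN SMITH
print(normalize_text("o'brien"))            # OBRIEN
```

Filtering before collapsing whitespace avoids leaving a double space where a removed character sat between two spaces.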

`derivatives()`

A collection of PII attributes that can be derived from this PII attribute, including this PiiAttribute.

Returns:

Type	Description
`dict[str, PiiAttribute]`	A `dict` with globally unique (typically namespaced) attribute IDs as the key. Values are instances of PiiAttribute that produce normalized values for each derivative attribute from the normalized values of this PiiAttribute.
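A custom attribute might be sketched as follows. Everything specific here is an assumption for illustration: the class name `ZipCodeAttribute`, the attribute ID `example.zip_code`, the choice to keep the first five digits, and the assumption that the base `__init__` stores the ID as `attr_id`. The class is built inside a function only so that the snippet can be loaded without pyspark or spindle-token installed.

```python
def make_zip_code_attribute():
    """Build a custom PiiAttribute for 5-digit ZIP codes (illustrative sketch)."""
    from pyspark.sql import functions as F
    from spindle_token.core import PiiAttribute

    class ZipCodeAttribute(PiiAttribute):
        def __init__(self):
            # Hypothetical attribute ID, namespaced to avoid collisions.
            super().__init__("example.zip_code")

        def transform(self, column, dtype):
            # dtype is ignored in this sketch; the column is read as a string.
            # Remove non-digits, then keep the first five characters.
            digits = F.regexp_replace(column.cast("string"), r"[^0-9]", "")
            return F.substring(digits, 1, 5)

        def derivatives(self):
            # No derived attributes: only this attribute, keyed by its own ID.
            return {self.attr_id: self}

    return ZipCodeAttribute()
```

A ZIP code stored as a number would lose leading zeros before this code sees it, so a fuller implementation would probably branch on `dtype`.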

`TokenProtocol`

Bases: ABC

An abstract base class for a specific version of the OPPRL tokenization protocol.

This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.

It is assumed that instances of the TokenProtocol will provide any configuration or other inputs (such as encryption keys) required to produce tokens. See TokenProtocolFactory for more information.

`tokenize(attribute_ids)` `abstractmethod`

Creates a Column expression for a single token.

Parameters:

Name	Type	Description	Default
`attribute_ids`	`list[str]`	A collection of `PiiAttribute` attribute IDs corresponding to the attributes to combine into the token.	required

Returns:

Type	Description
`Column`	A pyspark `Column` expression representing token values.

`transcode_out(token)` `abstractmethod`

Transcodes the given token into an ephemeral token.

Parameters:

Name	Type	Description	Default
`token`	`Column`	A pyspark `Column` of tokens.	required

Returns:

Type	Description
`Column`	A pyspark `Column` expression of ephemeral tokens created from the input tokens.

`transcode_in(ephemeral_token)` `abstractmethod`

Transcodes the given ephemeral token into a normal token.

Parameters:

Name	Type	Description	Default
`ephemeral_token`	`Column`	A pyspark `Column` of ephemeral tokens.	required

Returns:

Type	Description
`Column`	A pyspark `Column` expression of tokens created from the input ephemeral tokens.

`TokenProtocolFactory`

Bases: ABC, Generic[P]

An abstract base class for factories that instantiate TokenProtocol implementations with user provided encryption keys.

This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.

Attributes:

Name	Type	Description
`factory_id`		An identifier for the `TokenProtocolFactory`. Should be globally unique across all logically different `TokenProtocolFactory` instances.

`__init__(factory_id)`

Initializes the TokenProtocolFactory with the given globally unique factory ID.

`bind(private_key, recipient_public_key)` `abstractmethod`

Creates an instance of the TokenProtocol with the user provided encryption keys.

Parameters:

Name	Type	Description	Default
`private_key`	`bytes`	The private RSA key to use when tokenizing PII and transcoding tokens.	required
`recipient_public_key`	`bytes \| None`	The public RSA key of the intended data recipient to use when transcoding tokens into ephemeral tokens. Can be `None` if the instance of `TokenProtocol` will not be transcoding tokens into ephemeral tokens.	required

Returns:

Type	Description
`P`	An instance of a `TokenProtocol` implementation.

`Token` `dataclass`

A specification of a token.

Attributes:

Name	Type	Description
`name`	`str`	An identifier-safe name for the token. Will be used as the column name on dataframes. Must be unique across other `Token` specifications.
`protocol`	`TokenProtocolFactory`	An instance of `TokenProtocolFactory` that, when provided encryption keys, produces an instance of the `TokenProtocol` that generates this kind of token.
`attribute_ids`	`Iterable[str]`	A collection of attribute IDs used to look up instances of `PiiAttribute` corresponding to the fields used to create the token.
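A custom token specification that reuses an existing OPPRL protocol might look like the sketch below. The helper name `make_custom_token` and the token name `example_last_name_birth_date` are hypothetical, and the example assumes that the `attr_id` of a standard attribute is a valid entry in `attribute_ids`. As the warning for this module notes, custom token specifications carry no security or privacy guarantees.

```python
def make_custom_token():
    """Define a token built from last name and birth date (illustrative sketch)."""
    # Imported lazily so the snippet loads without spindle-token installed.
    from spindle_token.core import Token
    from spindle_token.opprl import OpprlV1 as v1

    return Token(
        # Hypothetical name; it becomes the column name on the DataFrame.
        name="example_last_name_birth_date",
        # Reuse the OPPRL v1 protocol factory for hashing and encryption.
        protocol=v1.protocol,
        # IDs of the attributes that are combined into the token.
        attribute_ids=[v1.last_name.attr_id, v1.birth_date.attr_id],
    )
```

The resulting object can be passed in the `tokens` argument of the top-level functions alongside the standard tokens.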
