API

The spindle-token API is broken up into 3 namespaces:

  1. The top-level spindle_token module containing the most commonly used public functions.
  2. An OPPRL module containing implementations of every OPPRL protocol version, including PII normalization, token specifications, and cryptographic transformations.
  3. A module of core abstractions that can be extended in advanced use cases to add functionality.

spindle_token

A module containing the main API of spindle-token.

Most users will only need to use the 3 main functions in this top-level module along with the provided configuration objects corresponding to OPPRL tokenization.

The 3 main functions (tokenize, transcode_out, and transcode_in) provide tokenization and transcoding capabilities for data senders and recipients.

tokenize(df, col_mapping, tokens, private_key=None)

Adds encrypted token columns based on PII.

All PII columns found in the DataFrame are normalized and transformed as needed according to the col_mapping. The PII attributes that make up each Token object in tokens are then hashed and encrypted together according to their respective protocol versions.

Parameters:

Name Type Description Default
df DataFrame

The pyspark DataFrame containing all PII attributes.

required
col_mapping Mapping[PiiAttribute, str]

A dictionary that maps instances of PiiAttribute to the corresponding column name of df.

required
tokens Iterable[Token]

A collection of Token objects that denotes which tokens (from which PII attributes) should be added to the dataframe.

required
private_key bytes | None

Your private RSA key. This argument should only be set when reading from a secrets manager or testing; otherwise, it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key.

None

Returns:

Type Description
DataFrame

The input DataFrame with encrypted token columns added.
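A minimal sketch of how the arguments above fit together. The column names in COLUMN_NAMES and the helpers build_col_mapping and tokenize_patients are hypothetical and not part of spindle-token. The library imports are deferred into the function body so the snippet can be loaded without Spark installed.

```python
from typing import Any, Dict, Mapping, Optional

# Hypothetical column names for an input DataFrame, keyed by the attribute
# names documented on OpprlV1.
COLUMN_NAMES: Dict[str, str] = {
    "first_name": "first_name",
    "last_name": "last_name",
    "gender": "gender",
    "birth_date": "dob",
}


def build_col_mapping(spec: Any, column_names: Mapping[str, str]) -> Dict[Any, str]:
    """Map each PiiAttribute found on `spec` to its DataFrame column name."""
    return {getattr(spec, attr): column for attr, column in column_names.items()}


def tokenize_patients(df: Any, private_key: Optional[bytes] = None) -> Any:
    """Add token1 and token2 columns to `df` (illustrative helper)."""
    # Deferred imports: spindle_token is only needed when this is called.
    from spindle_token import tokenize
    from spindle_token.opprl import OpprlV1 as v1

    col_mapping = build_col_mapping(v1, COLUMN_NAMES)
    # private_key=None leaves key lookup to SPINDLE_TOKEN_PRIVATE_KEY.
    return tokenize(df, col_mapping, [v1.token1, v1.token2], private_key=private_key)
```

Keeping the column names in a plain dict makes it easy to adapt the mapping per dataset while reusing the same OPPRL attribute objects.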

transcode_out(df, tokens, recipient_public_key=None, private_key=None)

Transcodes token columns of a dataframe into ephemeral tokens.

Parameters:

Name Type Description Default
df DataFrame

The pyspark DataFrame containing token columns.

required
tokens Iterable[Token]

A collection of Token objects that denote which columns of the input dataframe will be transcoded into ephemeral tokens.

required
recipient_public_key bytes | None

The public RSA key of the recipient who will be receiving the dataset with ephemeral tokens. Can also be supplied via the SPINDLE_TOKEN_RECIPIENT_PUBLIC_KEY environment variable.

None
private_key bytes | None

Your private RSA key. This argument should only be set when reading from a secrets manager or testing; otherwise, it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key.

None

Returns:

Type Description
DataFrame

The DataFrame with the tokens replaced by ephemeral tokens for sending to the recipient.

transcode_in(df, tokens, private_key=None)

Transcodes ephemeral token columns into normal tokens.

Used by the data recipient of a dataset containing ephemeral tokens produced by transcode_out to transcode the ephemeral tokens such that they will match other datasets tokenized with the same private key.

Parameters:

Name Type Description Default
df DataFrame

Spark DataFrame with ephemeral token columns to transcode.

required
tokens Iterable[Token]

A collection of Token objects that denote which columns of the input dataframe will be transcoded from ephemeral tokens into normal tokens.

required
private_key bytes | None

Your private RSA key. Must be the corresponding private key for the public key given to the sender when calling transcode_out. This argument should only be set when reading from a secrets manager or testing; otherwise, it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key.

None

Returns:

Type Description
DataFrame

The DataFrame with the ephemeral tokens replaced with normal tokens.
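The sender and recipient steps described above could be wrapped as follows. This is a sketch under assumptions: send_tokens and receive_tokens are hypothetical helper names, the choice of token1 and token2 is arbitrary, and the library imports are deferred so the snippet loads without spindle-token installed.

```python
from typing import Any, Optional


def send_tokens(df: Any, recipient_public_key: Optional[bytes] = None) -> Any:
    """Sender side: replace token columns with ephemeral tokens."""
    from spindle_token import transcode_out
    from spindle_token.opprl import OpprlV1 as v1

    # With recipient_public_key=None the key is expected to come from
    # SPINDLE_TOKEN_RECIPIENT_PUBLIC_KEY, per the parameter description.
    return transcode_out(
        df, [v1.token1, v1.token2], recipient_public_key=recipient_public_key
    )


def receive_tokens(df: Any, private_key: Optional[bytes] = None) -> Any:
    """Recipient side: convert ephemeral tokens into normal tokens."""
    from spindle_token import transcode_in
    from spindle_token.opprl import OpprlV1 as v1

    # Lists the same tokens that send_tokens transcoded.
    return transcode_in(df, [v1.token1, v1.token2], private_key=private_key)
```

The sender calls send_tokens with the recipient's public key, and the recipient calls receive_tokens with the matching private key.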

generate_pem_keys(key_size=2048)

Generates a fresh RSA key pair.

Parameters:

Name Type Description Default
key_size int

The size (in bits) of the key.

2048

Returns:

Type Description
tuple[bytes, bytes]

A tuple containing the private key and public key bytes, both in the PEM encoding.
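One way to persist a generated key pair is sketched below. save_key_pair and the file names are hypothetical. The generator is passed in as a callable returning (private, public) bytes, in the order documented above, so the helper can be tried without the library installed.

```python
import os
from pathlib import Path
from typing import Callable, Tuple


def save_key_pair(
    out_dir: Path,
    generate: Callable[[], Tuple[bytes, bytes]],
) -> Tuple[Path, Path]:
    """Write a (private, public) PEM pair to `out_dir` and return both paths."""
    private_pem, public_pem = generate()
    out_dir.mkdir(parents=True, exist_ok=True)

    private_path = out_dir / "private_key.pem"
    public_path = out_dir / "public_key.pem"
    private_path.write_bytes(private_pem)
    public_path.write_bytes(public_pem)

    # Restrict the private key to the owning user (limited effect on Windows).
    os.chmod(private_path, 0o600)
    return private_path, public_path
```

With spindle-token installed, `save_key_pair(Path("keys"), generate_pem_keys)` would write a fresh pair using the default key size. Only the public key file should be shared with data senders.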

spindle_token.opprl

A module of classes that provide standard configuration for different versions of the OPPRL protocol.

Each class is made up of class variables holding all instances of PiiAttribute, Token, and TokenProtocolFactory that make up the complete spec of the corresponding OPPRL version.

OpprlV0

All instances of PiiAttribute, Token, and TokenProtocolFactory for v0 of the OPPRL protocol.

All members are class variables, and therefore this class does not need to be instantiated.

Attributes:

Name Type Description
first_name NameAttribute

The PII attribute for a subject's first name.

last_name NameAttribute

The PII attribute for a subject's last (aka family) name.

gender GenderAttribute

The PII attribute for a subject's gender.

birth_date DateAttribute

The PII attribute for a subject's date of birth.

protocol TokenProtocolFactory

The tokenization protocol for producing OPPRL version 0 tokens.

token1 Token

A token generated from first initial, last name, gender, and birth date.

token2 Token

A token generated from first soundex, last soundex, gender, and birth date.

token3 Token

A token generated from first metaphone, last metaphone, gender, and birth date.

OpprlV1

All instances of PiiAttribute, Token, and TokenProtocolFactory for v1 of the OPPRL protocol.

All members are class variables, and therefore this class does not need to be instantiated.

Attributes:

Name Type Description
first_name NameAttribute

The PII attribute for a subject's first name.

last_name NameAttribute

The PII attribute for a subject's last (aka family) name.

gender GenderAttribute

The PII attribute for a subject's gender.

birth_date DateAttribute

The PII attribute for a subject's date of birth.

email EmailAttribute

The PII attribute for a subject's email address.

hem HashedEmail

The PII attribute for a subject's SHA2 hashed email address.

phone PhoneNumberAttribute

The PII attribute for a subject's phone number.

ssn SsnAttribute

The PII attribute for a subject's social security number.

group_number GroupNumberAttribute

The PII attribute for a subject's health plan group number.

member_id MemberIdAttribute

The PII attribute for a subject's health plan member ID.

protocol TokenProtocolFactory

The tokenization protocol for producing OPPRL version 1 tokens.

token1 Token

A token generated from first initial, last name, gender, and birth date.

token2 Token

A token generated from first soundex, last soundex, gender, and birth date.

token3 Token

A token generated from first metaphone, last metaphone, gender, and birth date.

token4 Token

A token generated from first initial, last name, and birth date.

token5 Token

A token generated from first soundex, last soundex, and birth date.

token6 Token

A token generated from first metaphone, last metaphone, and birth date.

token7 Token

A token generated from first name and phone number.

token8 Token

A token generated from birth date and phone number.

token9 Token

A token generated from first name and SSN.

token10 Token

A token generated from birth date and SSN.

token11 Token

A token generated from an email address.

token12 Token

A token generated from a SHA2 hashed email address.

token13 Token

A token generated from health plan group number and member ID.

IdentityAttribute

Bases: PiiAttribute

An implementation of PiiAttribute with no transformation (normalization) logic.

This class is useful when your data contains columns that can be used as attributes as-is with no normalization, in particular when your data has columns that correspond to PII attributes that would typically be derived from other PII attributes, such as the initial of the first name.

Examples:

Create an identity attribute that uses the first initial directly as opposed to deriving from the first name.

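>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import StringType
>>> from spindle_token.opprl import IdentityAttribute
>>> attribute = IdentityAttribute("opprl.v1.first.initial")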

The transform method returns the input column unchanged.

>>> attribute.transform(col("first_initial"), StringType())
Column<'first_initial'>

There are no derivatives of identity attributes aside from the attribute itself.

>>> attribute.derivatives()
{'opprl.v1.first.initial': IdentityAttribute(opprl.v1.first.initial)}

Identity attributes can be passed to the tokenize function.

>>> from spindle_token import tokenize
>>> from spindle_token.opprl import OpprlV1 as v1
>>> tokenize(
...     df,
...     {
...         IdentityAttribute("opprl.v1.first.initial"): "first_initial",
...         v1.last_name: "last_name",
...         v1.gender: "gender",
...         v1.birth_date: "birth_date",
...     },
...     [v1.token1],
... )
DataFrame[first_initial: string, last_name: string, ..., opprl_token_1v1: string]

spindle_token.core

The core abstractions of spindle-token, including abstract base classes for extending functionality.

The spindle-token library provides base interfaces that can be extended by users to define custom token specifications to encrypt with existing versions of OPPRL cryptography protocols, or to define entirely new tokenization protocols.

Warning

Extending the base classes in this module to customize the tokenization behavior has no security or privacy guarantees. These abstractions -- like all OSS -- are "use at your own risk" and users should only use these advanced features if they understand them.

PiiAttribute

Bases: ABC

An attribute (aka column) of personally identifiable information (PII) to use when constructing tokens.

This abstract base class is intended to be extended by users to add support for building tokens from a custom PII attribute.

Attributes:

Name Type Description
attr_id

An identifier for the PiiAttribute. Should be unique across all logically different PiiAttributes.

__init__(attr_id)

Initializes the PiiAttribute with the given globally unique attribute ID.

transform(column, dtype) abstractmethod

Transforms the raw PII column into a normalized representation.

A normalized value has eliminated all representation or encoding differences so all instances of the same logical values have identical physical values. For example, text attributes will often be normalized by filtering to alpha-numeric characters and whitespace, standardizing all whitespace to the space character, and converting all alpha characters to uppercase to ensure that all ways of representing the same phrase normalize to the exact same string.

Parameters:

Name Type Description Default
column Column

The spark Column expression for the PII attribute being normalized.

required
dtype DataType

The spark DataType object of the column object found on the DataFrame being normalized. Can be used to delegate to different normalization logic based on different schemas of input data. For example, a subject's birth date may be a DateType, StringType, or LongType on input data and thus requires corresponding normalization into a DateType.

required

Returns:

Type Description
Column

A pyspark Column expression of normalized PII values.
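The text normalization described above (filter, standardize whitespace, uppercase) can be illustrated in plain Python. This is only a sketch of the idea on single strings: the library operates on Spark Column expressions, normalize_text is a hypothetical name, and restricting letters to ASCII is an assumption.

```python
import re


def normalize_text(value: str) -> str:
    """Normalize free text so equivalent spellings compare equal."""
    # 1. Keep only ASCII letters, digits, and whitespace (ASCII-only is an assumption).
    kept = re.sub(r"[^A-Za-z0-9\s]", "", value)
    # 2. Replace every whitespace character (tab, newline, ...) with a space.
    spaced = re.sub(r"\s", " ", kept)
    # 3. Uppercase all letters.
    return spaced.upper()
```

For example, `normalize_text("O'Brien\tJr.")` and `normalize_text("obrien jr")` both yield the same string, so the two spellings would produce the same token input.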

derivatives()

A collection of PII attributes that can be derived from this PII attribute, including this PiiAttribute.

Returns:

Type Description
dict[str, PiiAttribute]

A dict with globally unique (typically namespaced) attribute IDs as the key. Values are instances of PiiAttribute that produce normalized values for each derivative attribute from the normalized values of this PiiAttribute.
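A sketch of what extending PiiAttribute might look like for a custom attribute. Everything here is hypothetical: the ZipCodeAttribute class, the "example.zip5" attribute ID, and the normalization rule are illustrations, not part of spindle-token. The class is defined inside a factory function so that pyspark and spindle_token are only imported when the function is called.

```python
from typing import Any


def make_zip_code_attribute(attr_id: str = "example.zip5") -> Any:
    """Build a hypothetical PiiAttribute for 5-digit ZIP codes."""
    from pyspark.sql import Column, functions as F
    from pyspark.sql.types import DataType
    from spindle_token.core import PiiAttribute

    class ZipCodeAttribute(PiiAttribute):
        def transform(self, column: Column, dtype: DataType) -> Column:
            # Cast to string so the regex also applies to non-string columns
            # (leading zeros lost by integer storage are not restored).
            digits = F.regexp_replace(column.cast("string"), r"[^0-9]", "")
            # Keep the first five digits, dropping any ZIP+4 suffix.
            return F.substring(digits, 1, 5)

        def derivatives(self) -> dict:
            # No derived attributes: only this attribute, keyed by its own ID.
            return {self.attr_id: self}

    return ZipCodeAttribute(attr_id)
```

As the warning at the top of this module states, custom attributes like this one carry no security or privacy guarantees.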

TokenProtocol

Bases: ABC

An abstract base class for a specific version of the OPPRL tokenization protocol.

This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.

It is assumed that instances of the TokenProtocol will provide any configuration or other inputs (such as encryption keys) required to produce tokens. See TokenProtocolFactory for more information.

tokenize(attribute_ids) abstractmethod

Creates a Column expression for a single token.

Parameters:

Name Type Description Default
attribute_ids list[str]

A collection of PiiAttribute attribute IDs corresponding to the attributes to combine into the token.

required

Returns:

Type Description
Column

A pyspark Column expression representing token values.

transcode_out(token) abstractmethod

Transcodes the given token into an ephemeral token.

Parameters:

Name Type Description Default
token Column

A pyspark Column of tokens.

required

Returns:

Type Description
Column

A pyspark Column expression of ephemeral tokens created from the input tokens.

transcode_in(ephemeral_token) abstractmethod

Transcodes the given ephemeral token into a normal token.

Parameters:

Name Type Description Default
ephemeral_token Column

A pyspark Column of ephemeral tokens.

required

Returns:

Type Description
Column

A pyspark Column expression of tokens created from the input ephemeral tokens.

TokenProtocolFactory

Bases: ABC, Generic[P]

An abstract base class for factories that instantiate TokenProtocol implementations with user provided encryption keys.

This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.

Attributes:

Name Type Description
factory_id

An identifier for the TokenProtocolFactory. Should be globally unique across all logically different TokenProtocolFactory instances.

__init__(factory_id)

Initializes the TokenProtocolFactory with the given globally unique factory ID.

bind(private_key, recipient_public_key) abstractmethod

Creates an instance of the TokenProtocol with the user provided encryption keys.

Parameters:

Name Type Description Default
private_key bytes

The private RSA key to use when tokenizing PII and transcoding tokens.

required
recipient_public_key bytes | None

The public RSA key of the intended data recipient to use when transcoding tokens into ephemeral tokens. Can be None if the instance of TokenProtocol will not be transcoding tokens into ephemeral tokens.

required

Returns:

Type Description
P

An instance of a TokenProtocol implementation.

Token dataclass

A specification of a token.

Attributes:

Name Type Description
name str

An identifier-safe name for the token. Will be used as the column name on dataframes. Must be unique across other Token specifications.

protocol TokenProtocolFactory

An instance of TokenProtocolFactory that, when provided encryption keys, produces an instance of the TokenProtocol that generates this kind of token.

attribute_ids Iterable[str]

A collection of attribute IDs used to lookup instances of PiiAttribute corresponding to the fields used to create the token.
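A custom token specification could be assembled from the three fields above. This sketch assumes Token accepts its documented attributes as keyword arguments and that each attribute's own attr_id is a valid ID to tokenize on. The token name and the last-name-plus-phone combination are hypothetical and not part of the OPPRL spec.

```python
from typing import Any


def last_name_phone_token() -> Any:
    """Build a hypothetical Token from last name and phone number."""
    # Deferred imports: spindle_token is only needed when this is called.
    from spindle_token.core import Token
    from spindle_token.opprl import OpprlV1 as v1

    return Token(
        name="example_last_name_phone",
        protocol=v1.protocol,
        # Assumes each attribute's own attr_id is a valid ID to tokenize on.
        attribute_ids=(v1.last_name.attr_id, v1.phone.attr_id),
    )
```

The resulting object could then be passed in the tokens argument of tokenize alongside the standard OPPRL tokens, provided the col_mapping includes the last name and phone attributes.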