API
The spindle-token API is broken up into 3 namespaces:
- The top-level spindle_token module containing the most commonly used public functions.
- An OPPRL module containing implementations of every OPPRL protocol version, including PII normalization, token specifications, and cryptographic transformations.
- A module of core abstractions that can be extended in advanced use cases to add functionality.
spindle_token
A module containing the main API of spindle-token.
Most users will only need to use the 3 main functions in this top-level module along with the provided configuration objects corresponding to OPPRL tokenization.
The 3 main functions provide tokenization and transcoding capabilities for data senders and recipients respectively.
tokenize(df, col_mapping, tokens, private_key=None)
Adds encrypted token columns based on PII.
All PII columns found in the DataFrame are normalized and transformed as needed according to the col_mapping.
The PII attributes that make up each Token object in tokens are then hashed and encrypted together according to their respective protocol versions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The pyspark | required |
| col_mapping | Mapping[PiiAttribute, str] | A dictionary that maps instances of | required |
| tokens | Iterable[Token] | A collection of | required |
| private_key | bytes \| None | Your private RSA key. This argument should only be set when reading from a secrets manager or testing, otherwise it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key. | None |

Returns:
| Type | Description |
|---|---|
| DataFrame | The input |
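
The private_key parameter and the SPINDLE_TOKEN_PRIVATE_KEY environment variable described above suggest a lookup along the following lines. This is a minimal sketch, not the library's code: the helper name resolve_private_key, the precedence of the argument over the environment variable, and the UTF-8 encoding of the variable's value are all assumptions.

```python
import os
from typing import Optional


def resolve_private_key(private_key: Optional[bytes] = None) -> bytes:
    """Hypothetical helper: prefer an explicit key, else read the environment."""
    if private_key is not None:
        return private_key
    # Assumption: the environment variable holds the PEM text of the key.
    value = os.environ.get("SPINDLE_TOKEN_PRIVATE_KEY")
    if value is None:
        raise ValueError(
            "No private key was passed and SPINDLE_TOKEN_PRIVATE_KEY is not set."
        )
    return value.encode("utf-8")
```

Keeping the key in the environment means it never has to appear in notebook or job source code, which is presumably why the argument is reserved for secrets managers and tests.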
transcode_out(df, tokens, recipient_public_key=None, private_key=None)
Transcodes token columns of a dataframe into ephemeral tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The pyspark | required |
| tokens | Iterable[Token] | A collection of | required |
| recipient_public_key | bytes \| None | The public RSA key of the recipient who will be receiving the dataset with ephemeral tokens. Can also be supplied via the SPINDLE_TOKEN_RECIPIENT_PUBLIC_KEY environment variable. | None |
| private_key | bytes \| None | Your private RSA key. This argument should only be set when reading from a secrets manager or testing, otherwise it is recommended to set the SPINDLE_TOKEN_PRIVATE_KEY environment variable with your private key. | None |
Returns:
The DataFrame with the tokens replaced by ephemeral tokens for sending to the recipient.
transcode_in(df, tokens, private_key=None)
Transcodes ephemeral token columns into normal tokens.
Used by the data recipient of a dataset containing ephemeral tokens produced by transcode_out
to transcode the ephemeral tokens such that they will match other datasets tokenized with the same private key.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Spark | required |
| tokens | Iterable[Token] | A collection of | required |
| private_key | bytes \| None | Your private RSA key. Must be the corresponding private key for the public key given to the sender when calling | None |

Returns:
| Type | Description |
|---|---|
| DataFrame | The |
generate_pem_keys(key_size=2048)
Generates a fresh RSA key pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key_size | int | The size (in bits) of the key. | 2048 |

Returns:
| Type | Description |
|---|---|
| tuple[bytes, bytes] | A tuple containing the private key and public key bytes, both PEM-encoded. |
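
Because both elements of the returned tuple are PEM-encoded bytes, it can be handy to check which one is which before storing them. PEM text starts with a line of the form -----BEGIN <label>-----. The sketch below reads that label with the standard library only. The helper name pem_label is hypothetical, and which labels generate_pem_keys actually emits is not stated in this section.

```python
import re
from typing import Optional

# Matches the opening PEM boundary line and captures its label.
_PEM_BEGIN = re.compile(rb"-----BEGIN ([A-Z0-9 ]+)-----")


def pem_label(data: bytes) -> Optional[str]:
    """Return the label of the first PEM boundary line, or None if absent."""
    match = _PEM_BEGIN.match(data.lstrip())
    if match is None:
        return None
    return match.group(1).decode("ascii")
```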
spindle_token.opprl
A module of classes that provide standard configuration for different versions of the OPPRL protocol.
Each class is made up of class variables holding all instances of PiiAttribute, Token, and TokenProtocolFactory that make up the complete spec of the corresponding OPPRL version.
OpprlV0
All instances of PiiAttribute, Token, and TokenProtocolFactory for v0 of the OPPRL protocol.
All members are class variables, and therefore this class does not need to be instantiated.
Attributes:
| Name | Type | Description |
|---|---|---|
| first_name | NameAttribute | The PII attribute for a subject's first name. |
| last_name | NameAttribute | The PII attribute for a subject's last (aka family) name. |
| gender | GenderAttribute | The PII attribute for a subject's gender. |
| birth_date | DateAttribute | The PII attribute for a subject's date of birth. |
| protocol | TokenProtocolFactory | The tokenization protocol for producing OPPRL version 0 tokens. |
| token1 | Token | A token generated from first initial, last name, gender, and birth date. |
| token2 | Token | A token generated from first soundex, last soundex, gender, and birth date. |
| token3 | Token | A token generated from first metaphone, last metaphone, gender, and birth date. |
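
token2 above is built from Soundex codes of the first and last names. For readers unfamiliar with Soundex, here is a sketch of the classic American Soundex algorithm on plain Python strings. It is an illustration only: which Soundex variant OPPRL uses, and how the library computes it in Spark, are not stated in this section, and the function name soundex is chosen for this example.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    groups = (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    codes = {letter: digit for letters, digit in groups for letter in letters}

    # Keep ASCII letters only, uppercased.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    previous = codes.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            # H and W are skipped and do not separate equal codes.
            continue
        code = codes.get(letter, "")
        if code and code != previous:
            result += code
        # Vowels have no code, so they reset the "previous code" marker.
        previous = code
    # Pad with zeros and truncate to four characters.
    return (result + "000")[:4]
```

Names that sound alike map to the same code, for example soundex("Robert") and soundex("Rupert") both give "R163", which is what lets a Soundex-based token tolerate some spelling variation.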
OpprlV1
All instances of PiiAttribute, Token, and TokenProtocolFactory for v1 of the OPPRL protocol.
All members are class variables, and therefore this class does not need to be instantiated.
Attributes:
| Name | Type | Description |
|---|---|---|
| first_name | NameAttribute | The PII attribute for a subject's first name. |
| last_name | NameAttribute | The PII attribute for a subject's last (aka family) name. |
| gender | GenderAttribute | The PII attribute for a subject's gender. |
| birth_date | DateAttribute | The PII attribute for a subject's date of birth. |
| email | EmailAttribute | The PII attribute for a subject's email address. |
| hem | HashedEmail | The PII attribute for a subject's SHA2 hashed email address. |
| phone | PhoneNumberAttribute | The PII attribute for a subject's phone number. |
| ssn | SsnAttribute | The PII attribute for a subject's social security number. |
| group_number | GroupNumberAttribute | The PII attribute for a subject's health plan group number. |
| member_id | MemberIdAttribute | The PII attribute for a subject's health plan member ID. |
| protocol | TokenProtocolFactory | The tokenization protocol for producing OPPRL version 1 tokens. |
| token1 | Token | A token generated from first initial, last name, gender, and birth date. |
| token2 | Token | A token generated from first soundex, last soundex, gender, and birth date. |
| token3 | Token | A token generated from first metaphone, last metaphone, gender, and birth date. |
| token4 | Token | A token generated from first initial, last name, and birth date. |
| token5 | Token | A token generated from first soundex, last soundex, and birth date. |
| token6 | Token | A token generated from first metaphone, last metaphone, and birth date. |
| token7 | Token | A token generated from first name and phone number. |
| token8 | Token | A token generated from birth date and phone number. |
| token9 | Token | A token generated from first name and SSN. |
| token10 | Token | A token generated from birth date and SSN. |
| token11 | Token | A token generated from an email address. |
| token12 | Token | A token generated from a SHA2 hashed email address. |
| token13 | Token | A token generated from health plan group number and member ID. |
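
The hem attribute above holds a SHA2 hashed email address. A sketch of producing such a value with Python's hashlib follows. The choice of SHA-256, the strip-and-lowercase normalization, and the hex encoding are assumptions made for the example. This section does not specify the exact form the OPPRL protocol expects, and hash_email is a hypothetical name.

```python
import hashlib


def hash_email(email: str) -> str:
    """Hypothetical helper: SHA-256 hex digest of a trimmed, lowercased email."""
    # Assumption: surrounding whitespace and letter case are not significant.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```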
IdentityAttribute
Bases: PiiAttribute
An implementation of PiiAttribute with no transformation (normalization) logic.
This class is useful when your data contains columns that can be used as attributes as-is with no normalization. In particular, it covers columns that correspond to PII attributes that would typically be derived from other PII attributes, such as the initial of the first name.
Examples:
Create an identity attribute that uses the first initial directly as opposed to deriving from the first name.
>>> attribute = IdentityAttribute("opprl.v1.first.initial")
The transform method returns the input column unchanged.
>>> attribute.transform(col("first_initial"), StringType())
Column<'first_initial'>
There are no derivatives of identity attributes aside from the attribute itself.
>>> attribute.derivatives()
{'opprl.v1.first.initial': IdentityAttribute(opprl.v1.first.initial)}
Identity attributes can be passed to the tokenize function.
>>> from spindle_token import tokenize
>>> from spindle_token.opprl import OpprlV1 as v1
>>> tokenize(
...     df,
...     {
...         IdentityAttribute("opprl.v1.first.initial"): "first_initial",
...         v1.last_name: "last_name",
...         v1.gender: "gender",
...         v1.birth_date: "birth_date",
...     },
...     [v1.token1],
... )
DataFrame[first_initial: string, last_name: string, ..., opprl_token_1v1: string]
spindle_token.core
The core abstractions of spindle-token, including abstract base classes for extending functionality.
The spindle-token library provides base interfaces that can be extended by users to define custom token specifications to encrypt with existing versions of OPPRL cryptography protocols, or to define entirely new tokenization protocols.
Warning
Extending the base classes in this module to customize the tokenization behavior has no security or privacy guarantees. These abstractions -- like all OSS -- are "use at your own risk" and users should only use these advanced features if they understand them.
PiiAttribute
Bases: ABC
An attribute (aka column) of personally identifiable information (PII) to use when constructing tokens.
This abstract base class is intended to be extended by users to add support for building tokens from a custom PII attribute.
Attributes:
| Name | Type | Description |
|---|---|---|
| attr_id | | An identifier for the PiiAttribute. Should be unique across all logically different PiiAttributes. |
__init__(attr_id)
Initializes the PiiAttribute with the given globally unique attribute ID.
transform(column, dtype)
abstractmethod
Transforms the raw PII column into a normalized representation.
A normalized value has eliminated all representation or encoding differences so all instances of the same logical values have identical physical values. For example, text attributes will often be normalized by filtering to alpha-numeric characters and whitespace, standardizing all whitespace to the space character, and converting all alpha characters to uppercase to ensure that all ways of representing the same phrase normalize to the exact same string.
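
The text normalization described above can be sketched on plain Python strings as follows. The real transform method works on Spark Column expressions, so this is only an analogy. The function name normalize_text and the restriction to ASCII letters and digits are assumptions made for the example.

```python
import re


def normalize_text(value: str) -> str:
    """Apply the three normalization steps described above to one string."""
    # 1. Keep only alphanumeric characters and whitespace (ASCII, by assumption).
    kept = re.sub(r"[^A-Za-z0-9\s]", "", value)
    # 2. Standardize every whitespace character to a plain space.
    spaced = re.sub(r"\s", " ", kept)
    # 3. Convert all letters to uppercase.
    return spaced.upper()
```

With this sketch, "Mary-Ann" and "maryann" both normalize to "MARYANN", so two spellings of the same logical value end up with the same physical value.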
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | Column | The spark | required |
| dtype | DataType | The spark | required |

Returns:
| Type | Description |
|---|---|
| Column | A pyspark Column expression of normalized PII values. |
derivatives()
A collection of PII attributes that can be derived from this PII attribute, including this PiiAttribute.
Returns:
| Type | Description |
|---|---|
| dict[str, PiiAttribute] | A |
| | PiiAttribute that produce normalized values for each derivative attribute from the normalized values of this PiiAttribute. |
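
To make the shape of this interface concrete, here is a simplified pure-Python analog of an attribute with one derivative. It is not a drop-in subclass of PiiAttribute: it operates on plain strings instead of Spark Column expressions and drops the dtype parameter. The class names PlainAttribute, FirstNameSketch, and FirstInitialSketch and the example.* attribute IDs are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict


class PlainAttribute(ABC):
    """String-based stand-in for an attribute with an ID and a transform."""

    def __init__(self, attr_id: str):
        self.attr_id = attr_id

    @abstractmethod
    def transform(self, value: str) -> str:
        """Return the normalized form of the given value."""

    def derivatives(self) -> Dict[str, "PlainAttribute"]:
        # By default the only derivative is the attribute itself.
        return {self.attr_id: self}


class FirstInitialSketch(PlainAttribute):
    def transform(self, value: str) -> str:
        # Expects an already normalized first name.
        return value[:1]


class FirstNameSketch(PlainAttribute):
    def transform(self, value: str) -> str:
        return value.strip().upper()

    def derivatives(self) -> Dict[str, PlainAttribute]:
        initial = FirstInitialSketch("example.first.initial")
        return {self.attr_id: self, initial.attr_id: initial}
```

The derivative's transform is applied to the normalized value of the parent attribute, mirroring the description of the return value above.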
TokenProtocol
Bases: ABC
An abstract base class for a specific version of the OPPRL tokenization protocol.
This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.
It is assumed that instances of the TokenProtocol will provide any configuration or other inputs
(such as encryption keys) required to produce tokens. See TokenProtocolFactory
for more information.
tokenize(attribute_ids)
abstractmethod
Creates a Column expression for a single token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attribute_ids | list[str] | A collection | required |

Returns:
| Type | Description |
|---|---|
| Column | A pyspark |
transcode_out(token)
abstractmethod
Transcodes the given token into an ephemeral token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| token | Column | A pyspark | required |

Returns:
| Type | Description |
|---|---|
| Column | A pyspark |
transcode_in(ephemeral_token)
abstractmethod
Transcodes the given ephemeral token into a normal token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ephemeral_token | Column | A pyspark | required |

Returns:
| Type | Description |
|---|---|
| Column | A pyspark |
TokenProtocolFactory
Bases: ABC, Generic[P]
An abstract base class for factories that instantiate TokenProtocol implementations with user provided encryption keys.
This abstract base class is intended to be extended by users who want to implement custom tokenization protocols.
Attributes:
| Name | Type | Description |
|---|---|---|
| factory_id | | An identifier for the |
__init__(factory_id)
Initializes the TokenProtocolFactory with the given globally unique factory ID.
bind(private_key, recipient_public_key)
abstractmethod
Creates an instance of the TokenProtocol with the user provided encryption keys.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| private_key | bytes | The private RSA key to use when tokenizing PII and transcoding tokens. | required |
| recipient_public_key | bytes \| None | The public RSA key of the intended data recipient to use when transcoding tokens into ephemeral tokens. Can be | required |

Returns:
| Type | Description |
|---|---|
| P | An instance of a |
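
The factory pattern described here separates a protocol's static specification from the keys supplied at call time. The following pure-Python sketch shows that separation with stand-in classes. ProtocolFactorySketch, BoundProtocolSketch, and ExampleFactory are hypothetical names, and the bound object only stores the keys instead of performing any cryptography.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

P = TypeVar("P")


class ProtocolFactorySketch(ABC, Generic[P]):
    """Stand-in for a factory that is identified by an ID and binds keys."""

    def __init__(self, factory_id: str):
        self.factory_id = factory_id

    @abstractmethod
    def bind(self, private_key: bytes, recipient_public_key: Optional[bytes]) -> P:
        """Create a protocol instance that carries the given keys."""


@dataclass(frozen=True)
class BoundProtocolSketch:
    """Stand-in for a protocol instance; it only holds the keys."""

    private_key: bytes
    recipient_public_key: Optional[bytes]


class ExampleFactory(ProtocolFactorySketch[BoundProtocolSketch]):
    def bind(
        self, private_key: bytes, recipient_public_key: Optional[bytes]
    ) -> BoundProtocolSketch:
        return BoundProtocolSketch(private_key, recipient_public_key)
```

With this split, a token specification can reference a factory long before any key material exists, and the keys are supplied only when a job actually runs.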
Token
dataclass
A specification of a token.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | An identifier-safe name for the attribute. Will be used as the column name on dataframes. Must be unique across other |
| protocol | TokenProtocolFactory | An instance of |
| attribute_ids | Iterable[str] | A collection of attribute IDs used to look up instances of |
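
Since each token's name becomes a DataFrame column name and must be unique, a quick duplicate check before tokenizing can catch specification mistakes early. The sketch below uses a stand-in dataclass. TokenSketch and duplicate_token_names are hypothetical names, and the protocol field is omitted to keep the example free of other dependencies.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class TokenSketch:
    """Stand-in for a token specification without the protocol field."""

    name: str
    attribute_ids: Tuple[str, ...]


def duplicate_token_names(tokens: Iterable[TokenSketch]) -> List[str]:
    """Return the sorted names that appear on more than one token."""
    counts = Counter(token.name for token in tokens)
    return sorted(name for name, count in counts.items() if count > 1)
```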