Command Line Interface
The spindle-token command line interface (CLI) offers tokenization and transcryption of data files on the local file system.
Usage Guide
The spindle-token CLI is included with every installation of the spindle-token library. Install spindle-token into your Python (virtual) environment using pip. See our getting started guide for more information. Make sure the Python interpreter's directory is on your PATH.
You can test your installation and environment setup by passing the --help option. You should see documentation for the spindle-token CLI.
spindle-token --help
Once installed, you can run the tokenize and transcrypt commands with the relevant options and arguments. All commands and sub-commands follow the same general design: positional arguments are paths to the input data and the desired location to write output data, while options configure how the input data is transformed. For example, options dictate which tokens should be generated, which file format to use, and the encryption key.
This example invocation of the tokenize command illustrates the general pattern.
spindle-token tokenize \
--token opprl_token_1 --token opprl_token_2 --token opprl_token_3 \
--key private_key.pem \
--format csv \
--parallelism 1 \
pii.csv tokens.csv
Encryption Keys
The OPPRL protocol leaves the responsibility of encryption key management to the user. The spindle-token CLI assumes the public and private keys are stored in files on the local filesystem. The location of a PEM file can be passed using the corresponding option or an environment variable. The following sections describe the option names and environment variables that can be used to supply the private and public keys respectively.
Private Key
The private RSA key can be set using one of the following methods:
- Use the --key option (or the -k alias) to specify a path to a PEM file.
- Set the SPINDLE_TOKEN_PRIVATE_KEY_FILE environment variable to specify a path to the PEM file.
- Set the SPINDLE_TOKEN_PRIVATE_KEY environment variable to specify the key as a UTF-8 string.

If both environment variables are set, the _FILE variant takes precedence.
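For example, both of the following invocations supply the same private key (file paths illustrative):

# Via the option:
spindle-token tokenize --token opprl_token_1 --key private_key.pem --format csv pii.csv tokens.csv
# Via the environment:
export SPINDLE_TOKEN_PRIVATE_KEY_FILE=/path/to/private_key.pem
spindle-token tokenize --token opprl_token_1 --format csv pii.csv tokens.csv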
Public Keys
The public keys of data recipients (used in transcryption) can be set using one of the following methods:
- Use the --recipient option (or the -r alias) to specify a path to a PEM file.
- Set the SPINDLE_TOKEN_RECIPIENT_PUBLIC_KEY_FILE environment variable to specify a path to the PEM file.
- Set the SPINDLE_TOKEN_RECIPIENT_PUBLIC_KEY environment variable to specify the key as a UTF-8 string.

If both environment variables are set, the _FILE variant takes precedence.
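For example, when preparing data for a recipient (file paths illustrative; the transcrypt out command is documented below):

# Via the option:
spindle-token transcrypt out --token opprl_token_1 --recipient recipient_public_key.pem --key private_key.pem --format csv tokens.csv ephemeral.csv
# Via the environment:
export SPINDLE_TOKEN_RECIPIENT_PUBLIC_KEY_FILE=/path/to/recipient_public_key.pem
spindle-token transcrypt out --token opprl_token_1 --key private_key.pem --format csv tokens.csv ephemeral.csv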
File Formats
The spindle-token CLI supports multiple data file formats. The recommended file format is Parquet because Parquet files are efficient, compressed, and have unambiguous schemas. The spindle-token CLI also supports CSV files. The file format must be specified using the --format option or the SPINDLE_TOKEN_FORMAT environment variable.
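For example, the format can be set once per shell session instead of per invocation (file names illustrative):

export SPINDLE_TOKEN_FORMAT=parquet
spindle-token tokenize --token opprl_token_1 --key private_key.pem pii.parquet tokens.parquet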
When using CSV files, the data file(s) must meet a few assumptions:

- The first row of each CSV file must be column headers. See the next section for expectations on the column names.
- The field separator (aka delimiter) should be a pipe (|) character, as in the sample below.
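For illustration, a conforming CSV input might look like the following. The column names are explained in the next section; the values and the date format shown are purely illustrative:

first_name|last_name|gender|birth_date
Jane|Doe|F|1990-01-31
John|Smith|M|1985-07-04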
Input datasets can either be a single data file or partitioned into a directory of multiple data files. Splitting a larger dataset into multiple data files can help the CLI parallelize the work. If the input dataset is a single data file, the output dataset will be a single data file. Similarly, if the input dataset is a directory, the output location will be a directory containing a partitioned output dataset.
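For example, tokenizing a partitioned input directory (layout illustrative) produces a partitioned output directory:

# pii_parts/ contains part-0000.csv and part-0001.csv
spindle-token tokenize --token opprl_token_1 --key private_key.pem --format csv pii_parts tokens_parts
# tokens_parts/ now contains one tokenized part file per input part file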
Column Names
The OPPRL tokenization protocol requires specific PII attributes to be normalized, transformed, concatenated, hashed, and then encrypted together. Thus, the spindle-token CLI must know which columns of the input dataset correspond to each logical PII attribute (first name, last name, birth date, etc.).
The spindle-token CLI requires specific column naming for PII columns so that the proper normalization rules are applied to each attribute and the final tokens are created from the correct subset of inputs. The following list contains the exact column names the CLI will look for.
first_name
last_name
gender
birth_date
If you are only adding tokens that require a subset of these PII fields, the input dataset may omit the columns for the PII attributes that are not required. For information on which PII attributes are required by each token in the OPPRL protocol, see the official specification.
Parallelism
The spindle-token CLI supports multi-threaded parallelism. This helps when working with larger datasets that are partitioned into multiple part files within a directory. If --parallelism is set to a number greater than 1, that number of partition files will be processed at once. This option can also be set with the SPINDLE_TOKEN_PARALLELISM environment variable. If parallelism is not provided, the CLI defaults to using the same number of threads as the host machine has logical cores.
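For example, to process four part files concurrently (paths illustrative):

spindle-token tokenize --token opprl_token_1 --key private_key.pem --format csv --parallelism 4 pii_parts tokens_parts
# Equivalently, via the environment:
export SPINDLE_TOKEN_PARALLELISM=4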
Commands
The following is the help text, options, and arguments of every command and sub-command of the spindle-token CLI. You can get this documentation for the specific version of the CLI installed in your Python environment by using the --help option on any command or sub-command.
spindle-token tokenize
Add tokens to a dataset of PII.
Creates a dataset at the OUTPUT location that adds encrypted OPPRL tokens to the INPUT dataset. Does not modify the INPUT dataset.
INPUT is the path to the dataset to tokenize. If INPUT is a file, it must be of the format provided to the --format option. If INPUT is a directory, all files within the directory that match the given format will be considered a partition of the dataset.
OUTPUT is the file or directory in which the tokenized dataset will be written. If INPUT is a file, OUTPUT will be written as a single file. If INPUT is a directory, OUTPUT will be a directory containing a dataset partitioned into files.
Usage:
spindle-token tokenize [OPTIONS] INPUT OUTPUT
Options:
-t, --token [opprl_token_1|opprl_token_2|opprl_token_3]
An OPPRL token to add to the dataset. Can be
passed multiple times. [required]
-k, --key FILENAME The PEM file containing your private key.
-f, --format [parquet|csv] The file format of input and output data
files. [required]
-p, --parallelism INTEGER The number of worker threads to parallelize
over. Useful when the input dataset is
partitioned into multiple part files. If not
supplied, defaults to the number of logical
cores.
--help Show this message and exit.
spindle-token transcrypt
Prepare tokenized datasets to be sent or received.
Usage:
spindle-token transcrypt [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
in
Convert a dataset of ephemeral tokens into tokens.
INPUT is the path to the dataset of ephemeral tokens to create tokens from. If INPUT is a file, it must be of the format provided to the --format option. If INPUT is a directory, all files within the directory that match the given format will be considered a partition of the dataset.
OUTPUT is the file or directory in which the tokenized dataset will be written. If INPUT is a file, OUTPUT will be written as a single file. If INPUT is a directory, OUTPUT will be a directory containing a dataset partitioned into files.
Usage:
spindle-token transcrypt in [OPTIONS] INPUT OUTPUT
Options:
-t, --token [opprl_token_1|opprl_token_2|opprl_token_3]
The column name of an OPPRL token on the
input data to transcrypt. [required]
-k, --key FILENAME The PEM file containing your private key.
-f, --format [parquet|csv] The file format of input and output data
files. [required]
-p, --parallelism INTEGER The number of worker threads to parallelize
over. Useful when the input dataset is
partitioned into multiple part files. If not
supplied, defaults to the number of logical
cores.
--help Show this message and exit.
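For example (file names illustrative):

spindle-token transcrypt in \
--token opprl_token_1 \
--key private_key.pem \
--format parquet \
ephemeral_tokens.parquet tokens.parquet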
out
Prepare ephemeral tokens for a specific recipient.
INPUT is the path to the dataset of tokens to create ephemeral tokens from. If INPUT is a file, it must be of the format provided to the --format option. If INPUT is a directory, all files within the directory that match the given format will be considered a partition of the dataset.
OUTPUT is the file or directory in which the tokenized dataset will be written. If INPUT is a file, OUTPUT will be written as a single file. If INPUT is a directory, OUTPUT will be a directory containing a dataset partitioned into files.
Usage:
spindle-token transcrypt out [OPTIONS] INPUT OUTPUT
Options:
-t, --token [opprl_token_1|opprl_token_2|opprl_token_3]
The column name of an OPPRL token on the
input data to transcrypt. [required]
-r, --recipient FILENAME The PEM file containing the recipient's
public key.
-k, --key FILENAME The PEM file containing your private key.
-f, --format [parquet|csv] The file format of input and output data
files. [required]
-p, --parallelism INTEGER The number of worker threads to parallelize
over. Useful when the input dataset is
partitioned into multiple part files. If not
supplied, defaults to the number of logical
cores.
--help Show this message and exit.
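For example (file names illustrative):

spindle-token transcrypt out \
--token opprl_token_1 \
--recipient recipient_public_key.pem \
--key private_key.pem \
--format parquet \
tokens.parquet ephemeral_tokens.parquet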
Limitations
The spindle-token CLI has the following limitations. For a superior experience, consider using spindle-token as a Python library. If your use case requires addressing some of these limitations, please open an issue with additional details.
No Horizontal Scaling
The spindle-token CLI is built with Apache Spark to allow for data parallelism. Spark is designed to distribute workloads horizontally across a cluster of multiple machines connected to the same network. The spindle-token CLI runs Spark in "local" mode, which switches execution to a multi-threaded design on a single host machine.
The spindle-token CLI cannot be passed to spark-submit, nor is there currently a way to pass Spark Connect information for a remote Spark cluster. If you would like to use spindle-token on a Spark cluster, it is recommended that you use the spindle-token Python library.
No Remote File Systems
The spindle-token CLI reads and writes files using the local file system. This means there is no native support for files stored on remote filesystems, like S3.
You may be able to work with remote file systems if you have a method of mounting the remote file system to the local file system.
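For example, a rough sketch using the third-party s3fs FUSE tool (not part of spindle-token; assumes s3fs is installed and credentials are configured):

# Mount the bucket onto the local file system (bucket and mount point illustrative):
s3fs my-bucket /mnt/my-bucket
# Then point spindle-token at paths under the mount:
spindle-token tokenize --token opprl_token_1 --key private_key.pem --format csv /mnt/my-bucket/pii.csv /mnt/my-bucket/tokens.csv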
If you require working with datasets on remote filesystems like S3, it is recommended that you use the spindle-token Python library and configure pyspark to read from S3.