Open Privacy Preserving Record Linkage Protocol

Version 1.0

1. Overview

This document is a specification for the Open Privacy Preserving Record Linkage (OPPRL) protocol, which brings the capability to link records across datasets without sharing raw Personally Identifiable Information (PII). The protocol is designed with the following goals:

Interoperability - Implementations can be created in many data systems while ensuring that tokens generated by different implementations remain compatible.
Security - Carefully selected encryption algorithms are used at each point in the process to mitigate any chance of catastrophic failure or information leakage.
Decentralization - No trusted third parties are needed to act as centralized authorities. No single point of failure.
Scalability - Tokenization is an embarrassingly parallel task that should scale efficiently to billions of records.

Privacy Preserving Record Linkage (PPRL) is a crucial component of data de-identification systems. PPRL obfuscates identifying attributes or other sensitive information about the subjects described in the records of a dataset while still preserving the ability to link records pertaining to the same subject using an encrypted token. This practice is sometimes referred to as “tokenization”.

The task of PPRL is to replace the attributes of every record denoting Personally Identifiable Information (PII) with a token produced by a one-way cryptographic function. This prevents observers of the tokenized data from obtaining the PII. The tokens are produced deterministically such that input records with the same, or similar, PII attributes will produce an identical token. This allows practitioners to associate records across datasets that are highly likely to belong to the same data subject without having access to PII.

Tokenization is also used when data is shared between organizations to limit, or in some cases fully mitigate, the risk of subject re-identification in the event an untrusted third party gains access to a dataset containing sensitive data. Each party produces encrypted tokens using a different secret key so that any compromised data asset is, at worst, only matchable to other datasets maintained by the same party. During data sharing transactions, a specific “transcode” data flow is used to first re-encrypt the sender’s tokens into ephemeral tokens that do not match tokens in any other dataset and can only be ingested using the recipient’s secret key. At no point in the “transcode” data flow is the original PII used.

2. Glossary

Asymmetric Encryption: Encryption using a pair of keys: a public key for encryption and a private key for decryption. Allows secure communication without pre-sharing a secret key.

Attribute: A single field of a record denoting one piece of information about the subject.

Custodian: A user in possession of a data asset.

Data Asset: A collection of records with attributes. Can be a single dataset, or a collection of related datasets.

Derived Attribute: An attribute created by transforming another attribute. For example, a first initial attribute can be derived from a first name attribute.

Ephemeral Token: A token produced with non-deterministic encryption (i.e., RSA with OAEP and MGF1) that is used when data is in transit from a sender to a recipient. Ephemeral tokens cannot be used for record linkage until they are transcoded into normal tokens by the recipient.

Implementer: An individual or organization that creates a software tool that implements this specification.

Normalization: A transformation applied to an attribute to standardize the representation and remove aesthetic differences. Normalized attributes are more likely to equate across records pertaining to the same subject without increasing the likelihood of equality across records pertaining to different subjects. Example: removing leading and trailing whitespace characters.

Personally Identifiable Information (PII): Attributes of a data asset that can be used to determine the identity of a subject. Examples include name, residential address, gender, age, phone number, email, as well as other demographic or socio-economic attributes.

Recipient: A user that receives a data asset from a custodian.

Subject: A person who is being described by one or more records in a data asset.

Symmetric Encryption: Encryption using a single secret key for both encryption and decryption. Faster than asymmetric encryption but requires secure key exchange.

Token: A string of text produced by encrypting the hash of concatenated, normalized PII fields. Used deterministically for record linkage.

Transcode: The process of securely transforming tokens encrypted with a sender’s key into tokens encrypted with a recipient’s key, enabling linkage without sharing private keys or PII. Data being transcoded has ephemeral tokens while in transit.

User: The end user of an OPPRL implementation. May refer to an individual or organization.

3. Encryption Keys

The OPPRL protocol relies heavily on cryptographic functions. Some of these functions encrypt and decrypt values using an encryption key.

3.1 User Specific RSA Key Pair

A user/organization specific RSA key pair consisting of a private key and corresponding public key. Keys should be 2048 bits or larger [RFC 8017].

Under no circumstances should a user’s private key be shared with other users or any other third party. Doing so would compromise security and privacy of the user’s OPPRL tokens.

Implementers may add functionality to manage this key pair on behalf of users, or rely on users to supply their own keys. Secure methods for doing so are out of scope for this protocol.

3.2 Derived AES Key

A key used when performing AES-GCM-SIV symmetric encryption and decryption. To reduce the number of encryption keys that users must coordinate, this key is derived from the private key of the user specific RSA key pair.

During normal operation, users should not be accessing this AES key and it should never be persisted or output in any way by OPPRL implementations.

3.2.1 Version >=1.0

A 32 byte key is derived from the user’s private RSA key using the HKDF algorithm [RFC 5869].

When deriving the AES key using HKDF, no salt is used and the “info” argument is set as the UTF-8 encoding of the text “opprl.v1.aes” in order to ensure the derived key is specific to the OPPRL key derivation.
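The derivation above can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the text does not state which hash function HKDF uses or how the private RSA key is serialized to bytes, so SHA-256 and the `private_key_bytes` argument are assumptions here, and the function names are hypothetical.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF extract-and-expand per RFC 5869, using HMAC-SHA256 (assumed hash)."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output is too long for HKDF")
    # Extract: RFC 5869 treats an absent salt as hash_len zero bytes.
    prk = hmac.new(salt or bytes(hash_len), ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output has been produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_aes_key(private_key_bytes: bytes) -> bytes:
    """Derive the 32 byte AES key; the key serialization is an assumption."""
    return hkdf_sha256(private_key_bytes, info="opprl.v1.aes".encode("utf-8"), length=32)
```

Because implementations must produce identical keys to produce identical tokens, the hash function and key serialization would need to be agreed on exactly; the choices above are placeholders for whatever an implementation fixes.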

3.2.2 Version <1.0

A 32 byte key is derived from the user’s private RSA key using the SHAKE256 algorithm [FIPS 202].

4. Functions

A variety of transformations are used when normalizing PII and encoding encrypted data as text. In order to preserve token equality across implementations, this section describes the suite of relevant transformations as functions. These functions are referenced throughout the remainder of this document to describe the behavior of data flows.

`ALPHA_WS_ONLY`

Remove all characters of a string aside from alpha characters (IEEE POSIX Standard regular expression [a-zA-Z]) and whitespace.

Example: ALPHA_WS_ONLY("Hello... world!") => "Hello world"

`BASE64_NO_NEWLINE`

Convert binary data into a string of text using the base64 encoding without any newline characters in the output [RFC 4648].

Many implementations of base64 conform to the Multipurpose Internet Mail Extensions (MIME) standard which adds newline characters every 76 characters [RFC 2045]. These characters have no impact on decoding, and we remove them because writing multi-line strings to text files can cause issues if not handled carefully.

Example: BASE64_NO_NEWLINE("abc") => "YWJj"
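As an illustration of the difference between the two encodings, the sketch below uses Python's standard `base64` module. The function name `base64_no_newline` is hypothetical; `b64encode` follows RFC 4648 and emits no line breaks, while `encodebytes` follows the MIME convention.

```python
import base64


def base64_no_newline(data: bytes) -> str:
    """Encode bytes as base64 text without any newline characters."""
    # b64encode never inserts line breaks, unlike the MIME-style encodebytes.
    return base64.b64encode(data).decode("ascii")


# The MIME-style encoder wraps long output and appends a trailing newline.
mime_style = base64.encodebytes(bytes(100))
single_line = base64_no_newline(bytes(100))
```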

`COLLAPSE_WS`

Replace each sub-string of one or more whitespace characters with a single space character.

Example: COLLAPSE_WS("Hello   world  ") => "Hello world "

`DECRYPT_AES_GCM_SIV`

Decrypts input binary data using the AES-GCM-SIV algorithm and user’s secret key from Section 3.2. This specific variant of AES is resistant to nonce reuse, which is critical for tokenization applications [RFC 8452].

`DECRYPT_RSA`

Decrypts input binary data using the RSA algorithm and the private key corresponding to the public key used to encrypt the data. This function should use OAEP padding with SHA256 for both the mask generation function (MGF1) and the hashing algorithm [RFC 8017].

See Section 3.1 for more details about the RSA key pair.

`DIGITS_ONLY`

Removes all non-digit characters from a string of text.

Example: DIGITS_ONLY("abc-123 456") => "123456"

`EMPTY_TO_NULL`

Returns null if the given string is empty, otherwise returns the string unchanged.

Example:

EMPTY_TO_NULL("") => null

EMPTY_TO_NULL("abc") => "abc"

`ENCRYPT_AES_GCM_SIV`

Encrypts input binary data using the AES-GCM-SIV algorithm and user’s secret key from Section 3.2 [RFC 8452].

`ENCRYPT_RSA`

Encrypts input binary data using the RSA algorithm and a recipient’s public key, with the same OAEP padding configuration described for DECRYPT_RSA. See Section 3.1 [RFC 8017].

`FIRST_CHAR`

Returns the first character of the given string.

Example: FIRST_CHAR("abc") => "a"

`FORMAT_DATE`

Returns the given date as a formatted string in the yyyy-MM-dd format.

Example: FORMAT_DATE(date) => "1997-01-01"

`IS_VALID_SSN`

Returns TRUE if the input string is a valid US Social Security Number and FALSE otherwise [SSA 1982, SSA 2011].

The string must be exactly 9 numeric characters.
The first character must not be a “9”
The area number must not be “000” or “666”
The group number must not be “00”
The serial number must not be “0000”
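The rules above translate directly into a small predicate. The sketch below is one possible reading, assuming the input has already been reduced to digits (for example by DIGITS_ONLY) and that the area, group, and serial numbers are the first three, next two, and last four digits respectively; the function name is hypothetical.

```python
def is_valid_ssn(ssn: str) -> bool:
    """Return True if the string passes the structural SSN checks listed above."""
    # Exactly 9 ASCII digits.
    if len(ssn) != 9 or not (ssn.isascii() and ssn.isdigit()):
        return False
    area, group, serial = ssn[:3], ssn[3:5], ssn[5:]
    return (
        ssn[0] != "9"                      # first character must not be 9
        and area not in ("000", "666")     # disallowed area numbers
        and group != "00"                  # disallowed group number
        and serial != "0000"               # disallowed serial number
    )
```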

`LOWERCASE`

Convert all alpha characters of a string to lowercase. Leave digits and whitespace unchanged.

Example: LOWERCASE("ABC 123") => "abc 123"

`METAPHONE`

Returns the Metaphone phonetic encoding of a string [Philips 1990].

Example: METAPHONE("Healthcare") => "HL0KR"

`NULL_IF_NOT`

Takes a boolean condition and a value. If the condition is true, returns the value unchanged. Otherwise returns NULL.

Example:

NULL_IF_NOT(TRUE, "abc") => "abc"

NULL_IF_NOT(FALSE, "abc") => NULL

`PARSE_DATE`

Parses a date (or timestamp) from a string of text. The exact logic will vary by use case depending on the format of the input string.

`REMAP`

Remaps the input value according to a lookup table. Non-null input values that are not found in the lookup table are given a default output value.

NULL values in the input are propagated as NULL output values.

Example:

lookup = {"A" -> "X",
          "B" -> "Y",
          "C" -> "Z"}
REMAP("B", lookup, default="unknown") => "Y"
REMAP("D", lookup, default="unknown") => "unknown"
REMAP(NULL, lookup, default="unknown") => NULL

`REMOVE_WS`

Remove all whitespace characters from the given string.

Example: REMOVE_WS(" a b c d") => "abcd"

`SHA256`

Hash the input using the SHA256 function [FIPS 180-4].

`SOUNDEX`

Returns the Soundex phonetic encoding of the input text [Russell 1922].

Example: SOUNDEX("Healthcare") => "H432"
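Several Soundex variants exist, and they can disagree on edge cases, which matters when tokens must match across implementations. The sketch below implements the common American Soundex rules (adjacent letters with the same code collapse, vowels separate repeated codes, H and W do not) and is an assumption about the intended variant rather than a statement of it. The names `SOUNDEX_CODES` and `soundex` are hypothetical.

```python
# Map each consonant to its Soundex digit; vowels, H, W, and Y have no code.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code, or None if no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return None
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the previous code to ""
    # Keep the first letter plus three digits, padding with zeros.
    return (first + "".join(digits) + "000")[:4]
```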

`TO_E164`

Takes a string containing a phone number and returns a string of text representing the same phone number in the E.164 international standard [ITU E.164].

Example: TO_E164("(234) 555-6789") => "+12345556789"

`TO_STRING`

Returns the input as a string of text. If given a string, returns the input unchanged.

Example: TO_STRING(123) => "123"

`TRIM`

Remove all leading and trailing whitespace from a string.

Example: TRIM(" abc ") => "abc"

`TRUNCATE_TO_DATE`

Truncates a value representing a point in time to the start of the day. In other words, this function removes all components of the point in time smaller than a day such as hour, minute, and second. The only components present on the outputs should be year, month, and day.

If the input value is simply a date with no time components the date should be returned unchanged.

Depending on the host platform used to implement the protocol, the input values may be expressed as an instance of a data type called datetime or timestamp.

The inputs should not be a UNIX epoch unless a corresponding timezone is known. Otherwise an accurate date cannot be derived.

The output should be a local date without timezone information.

Using SQL semantics, this function should be equivalent to CAST(DATE_TRUNC("day", input_timestamp) AS DATE). For most variants of SQL, the invocation of DATE_TRUNC is redundant, but we include it here for illustrative purposes.

Example: TRUNCATE_TO_DATE(2025-09-26 14:19:38.975197) => 2025-09-26

`UNBASE64`

Decode a base64 encoded string into binary.

`UPPERCASE`

Convert all alpha characters of a string to uppercase. Leave digits and whitespace unchanged.

Example: UPPERCASE("abc 123") => "ABC 123"

5. Canonical PII Attributes

This section contains descriptions of the canonical PII attributes used to create the standard OPPRL tokens detailed in Section 6.2. Each attribute has a standard normalization that should be shared by all OPPRL implementations to improve link quality.

5.1 First Name

The subject’s first name.

Normalization:

x = ALPHA_WS_ONLY(input)
x = UPPERCASE(x)
x = COLLAPSE_WS(x)
x = TRIM(x)
x = EMPTY_TO_NULL(x)
return x
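For illustration, the normalization steps above might look as follows in Python. This is a sketch: the function name is hypothetical, and it treats a missing input as null. Note that the `[a-zA-Z]` rule removes accented letters entirely rather than transliterating them.

```python
import re


def normalize_name(value):
    """Apply ALPHA_WS_ONLY, UPPERCASE, COLLAPSE_WS, TRIM, and EMPTY_TO_NULL in order."""
    if value is None:
        return None
    x = re.sub(r"[^a-zA-Z\s]", "", value)  # ALPHA_WS_ONLY
    x = x.upper()                          # UPPERCASE
    x = re.sub(r"\s+", " ", x)             # COLLAPSE_WS
    x = x.strip()                          # TRIM
    return x or None                       # EMPTY_TO_NULL
```

With this sketch, `normalize_name("  Mary-Ann   O'Neil ")` yields `"MARYANN ONEIL"`, and an input with no letters yields null.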

5.2 First Initial

The initial of the subject’s first name.

If deriving from the subject’s full first name, perform all normalizations from Section 5.1 prior to computing this attribute.

Normalization:

x = FIRST_CHAR(input)
x = UPPERCASE(x)
return x

5.3 First Soundex

The Soundex phonetic encoding of the subject’s first name.

If deriving from the subject’s full first name, perform all normalizations from Section 5.1 prior to computing this attribute.

Normalization:

x = SOUNDEX(input)
return x

5.4 First Metaphone

The Metaphone phonetic encoding of the subject’s first name.

If deriving from the subject’s full first name, perform all normalizations from Section 5.1 prior to computing this attribute.

Normalization:

x = METAPHONE(input)
return x

5.5 Last Name

The subject’s last (aka family) name.

Normalization:

x = ALPHA_WS_ONLY(input)
x = UPPERCASE(x)
x = COLLAPSE_WS(x)
x = TRIM(x)
x = EMPTY_TO_NULL(x)
return x

5.6 Last Initial

The initial of the subject’s last name.

If deriving from the subject’s full last name, perform all normalizations from Section 5.5 prior to computing this attribute.

Normalization:

x = FIRST_CHAR(input)
x = UPPERCASE(x)
return x

5.7 Last Soundex

The Soundex phonetic encoding of the subject’s last name.

If deriving from the subject’s full last name, perform all normalizations from Section 5.5 prior to computing this attribute.

Normalization:

x = SOUNDEX(input)
return x

5.8 Last Metaphone

The Metaphone phonetic encoding of the subject’s last name.

If deriving from the subject’s full last name, perform all normalizations from Section 5.5 prior to computing this attribute.

Normalization:

x = METAPHONE(input)
return x

5.9 Gender

The subject’s gender. Normalized non-null values are either “F” for female, “M” for male, or “O” for other.

Normalization:

x = UPPERCASE(input)
x = COLLAPSE_WS(x)
x = TRIM(x)
x = EMPTY_TO_NULL(x)
x = FIRST_CHAR(x)
mapping = {
  "F" -> "F",
  "W" -> "F",
  "G" -> "F",
  "M" -> "M",
  "B" -> "M",
}
x = REMAP(x, mapping, "O")
return x
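A possible Python rendering of the gender normalization is sketched below. The names are hypothetical, and the early return for empty input reflects the assumption that a null produced by EMPTY_TO_NULL propagates through FIRST_CHAR and REMAP as null.

```python
import re

# Lookup table from the first character to the normalized value.
GENDER_LOOKUP = {"F": "F", "W": "F", "G": "F", "M": "M", "B": "M"}


def normalize_gender(value):
    """Normalize a free-text gender value to "F", "M", "O", or None."""
    if value is None:
        return None
    x = value.upper()               # UPPERCASE
    x = re.sub(r"\s+", " ", x)      # COLLAPSE_WS
    x = x.strip()                   # TRIM
    if x == "":                     # EMPTY_TO_NULL; null propagates through REMAP
        return None
    return GENDER_LOOKUP.get(x[0], "O")  # FIRST_CHAR, then REMAP with default "O"
```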

5.10 Birth Date

The subject’s date of birth.

Normalization:

x = input
if x is a string:
  x = PARSE_DATE(x)
x = TRUNCATE_TO_DATE(x)
x = FORMAT_DATE(x)
return x
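The birth date steps can be sketched with Python's `datetime` module. The input string format is an assumption (PARSE_DATE is use-case specific), so the `fmt` parameter and the function name are hypothetical. `isoformat()` is used because it always zero-pads the year to four digits.

```python
from datetime import date, datetime


def normalize_birth_date(value, fmt="%Y-%m-%d"):
    """Return the birth date as a yyyy-MM-dd string, or None for null input."""
    if value is None:
        return None
    if isinstance(value, str):
        value = datetime.strptime(value.strip(), fmt)  # PARSE_DATE (format assumed)
    if isinstance(value, datetime):
        value = value.date()                           # TRUNCATE_TO_DATE
    return value.isoformat()                           # FORMAT_DATE
```

For example, `normalize_birth_date("01/31/1997", fmt="%m/%d/%Y")` gives `"1997-01-31"`. A timezone-aware `datetime` is truncated in its own timezone in this sketch.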

5.11 Email Address

An email address associated with the subject.

Normalization:

x = LOWERCASE(input)
x = REMOVE_WS(x)
x = EMPTY_TO_NULL(x)
return x

5.12 Hashed Email Address

A SHA256 hash of an email address associated with the subject. This attribute is often called a Hashed Email Address (HEM) and is a widely used identifier in industry practice, although it alone has some privacy concerns addressed by tokenization [Liveramp 2022].

If deriving from the subject’s email address, perform all normalizations from Section 5.11 prior to computing this attribute.

Normalization:

x = LOWERCASE(input)
return x
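As a sketch of how a hashed email might be derived from a raw address, the Python below applies the Section 5.11 normalization and then SHA256. The UTF-8 encoding of the address and the hexadecimal text representation of the hash are assumptions, as are the function names.

```python
import hashlib


def normalize_email(value):
    """LOWERCASE, REMOVE_WS, and EMPTY_TO_NULL from Section 5.11."""
    if value is None:
        return None
    x = "".join(value.lower().split())  # split() with no arguments drops all whitespace
    return x or None


def hashed_email(value):
    """SHA256 of the normalized address as lowercase hex (representation assumed)."""
    email = normalize_email(value)
    if email is None:
        return None
    return hashlib.sha256(email.encode("utf-8")).hexdigest().lower()
```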

5.13 Phone Number

A phone number associated with the subject, represented as an E.164 formatted string of text [ITU E.164].

Normalization:

x = TO_STRING(input)
x = TO_E164(x)
return x

5.14 Social Security Number

The subject’s US social security number [SSA 1982, SSA 2011].

Normalization:

x = TO_STRING(input)
x = DIGITS_ONLY(x)
x = NULL_IF_NOT(IS_VALID_SSN(x), x)
return x

5.15 Health Plan Group Number

The group number of the subject’s health plan.

Normalization:

x = UPPERCASE(input)
x = REMOVE_WS(x)
x = EMPTY_TO_NULL(x)
return x

5.16 Health Plan Member ID

The subject’s health plan member ID (aka subscriber ID).

Normalization:

x = UPPERCASE(input)
x = REMOVE_WS(x)
x = EMPTY_TO_NULL(x)
return x

6. Tokenizing PII

To create tokens from a record of PII attributes, a multi-stage workflow is performed that combines multiple PII values together, hashes them, and encrypts the resulting hash with the user’s secret key. The specific subsets of PII attributes and cryptographic functions have been carefully selected to protect the security and privacy of subjects while minimizing the false negative and false positive match rates when linking.

For the remainder of this section, we assume that all PII values have been properly normalized. See Section 5 for the canonical normalization rules of every PII attribute.

6.1 Data Flow: PII to Tokens

The following subsections describe the process of generating a token from a set of PII attributes, A, from a record, R.

We denote the PII value of R corresponding to attribute a ∈ A as R_a. For example, if A = {first, last, birth_date} then it may be that R_first ↦ John and R_last ↦ Doe.

If R_a = NULL for any a ∈ A, the output token is NULL and all steps of this process should be skipped.

6.1.1 Joining PII

The PII values for the token’s attributes are joined into one plaintext string of text. Assuming the token’s attributes are independent, the cardinality of this string is the product of attribute cardinalities. Large cardinalities protect against frequency attacks.

Input: A record R, containing normalized PII values for each token attribute a ∈ A.

Process:

Let A^′ be a sorted version of A such that attribute names are arranged in lexicographic order. This ensures the joined PII plaintext is consistent irrespective of the order of the PII attributes and record fields.

Join all PII values R_a for all a ∈ A^′ in order. Each value is separated by a colon (:) character.

Result: A string of text containing all PII values delimited by a colon.

Example: "1970-01-01:John:Doe"
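The joining step, including the NULL rule from Section 6.1, can be sketched in a few lines of Python. The record is modeled as a dictionary keyed by attribute name, which is an assumption for illustration; the function name is hypothetical.

```python
def join_pii(record, attributes):
    """Join normalized PII values in lexicographic order of attribute name."""
    ordered = sorted(attributes)                 # A' in the text above
    values = [record.get(name) for name in ordered]
    if any(value is None for value in values):
        return None                              # any NULL attribute yields a NULL token
    return ":".join(values)
```

Given `{"first": "John", "last": "Doe", "birth_date": "1970-01-01"}` and the attribute set `{"first", "last", "birth_date"}`, this returns `"1970-01-01:John:Doe"` regardless of the order in which the attributes are supplied.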

6.1.2 Hashing

The joined PII plaintext is obfuscated by a cryptographic hash function (SHA512) to prevent any reversal of tokens back to PII. After the PII has been hashed, the PII plaintext is never used at any point during tokenization or linking [FIPS 180-4].

Input: A joined plaintext PII string from Section 6.1.1.

Process: Hash the input with the SHA512 hash function.

Result: A 64 byte hash value.

6.1.3 Encryption

The hash values from Section 6.1.2 are the same for all users when generated from the same PII records. It is crucial for each user to generate tokens that are unique to that user. If one user has a re-identification incident, such as a custodial leak of tokens and their plaintext PII, the subject identities for the tokens of all other users remain secure through the use of a different secret key. For this reason, the hash values are encrypted with each user’s respective secret key.

Input:

hash: A 64 byte hash value produced by Section 6.1.2.
key: The 32 byte AES key described in Section 3.2.

Process:

Encrypt the hash value with the user’s AES key using the AES-GCM-SIV encryption method. To ensure encryption is deterministic, a 12 byte “nonce” of all zeros is used. It is critical to use the Synthetic Initialization Vector (SIV) variation of AES-GCM because it protects against catastrophic data loss when reusing a nonce [RFC 8452].

In the context of tokenizing PII for PPRL, encryption must be deterministic to ensure that the same PII values result in the same token and therefore can be linked with an equality comparison.

Result: The binary data of the encrypted token.

6.1.4 Representation

The final token must be represented as text such that it can be easily written to a large variety of data files and databases.

Input: Binary data of the encrypted token produced in Section 6.1.3.

Process:

The token binary is converted into a base64 text representation using the BASE64_NO_NEWLINE function (see Section 4).

Result: The final token.
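The hashing, encryption, and representation stages can be sketched as a single function. To keep the example runnable with only the standard library, the AES-GCM-SIV step is passed in as a caller-supplied `encrypt` callable standing in for ENCRYPT_AES_GCM_SIV; that parameter, the function name, and the UTF-8 encoding of the joined plaintext are assumptions, not part of the specification.

```python
import base64
import hashlib
from typing import Callable


def token_from_plaintext(joined: str, encrypt: Callable[[bytes], bytes]) -> str:
    """Hash the joined PII, encrypt the hash, and encode the result as text."""
    # Section 6.1.2: SHA512 produces a 64 byte hash (UTF-8 input is assumed).
    digest = hashlib.sha512(joined.encode("utf-8")).digest()
    # Section 6.1.3: deterministic encryption with the user's AES key,
    # supplied here by the caller as a stand-in for ENCRYPT_AES_GCM_SIV.
    ciphertext = encrypt(digest)
    # Section 6.1.4: base64 text without newlines.
    return base64.b64encode(ciphertext).decode("ascii")
```

Passing the cipher as a parameter keeps the data flow visible while leaving the choice of a vetted AES-GCM-SIV library to the implementer.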

6.2 Token Specifications

The following lists include the subsets of PII attributes from Section 5 that are used to create each canonical token in the OPPRL specification.

All attributes are listed in the exact lexicographical order (by name) that they must appear in the joined PII string in Section 6.1.1.

Versions >=0.0

Token 1: Birth Date, First Initial, Gender, Last Name
Token 2: Birth Date, First Soundex, Gender, Last Soundex
Token 3: Birth Date, First Metaphone, Gender, Last Metaphone

Versions >=1.0

Token 4: Birth Date, First Initial, Last Name
Token 5: Birth Date, First Soundex, Last Soundex
Token 6: Birth Date, First Metaphone, Last Metaphone
Token 7: First Name, Phone Number
Token 8: Birth Date, Phone Number
Token 9: First Name, Social Security Number
Token 10: Birth Date, Social Security Number
Token 11: Email Address
Token 12: Hashed Email Address
Token 13: Health Plan Group Number, Health Plan Member ID
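Implementations may find it convenient to hold these specifications as data. The sketch below is one hypothetical representation: a Python mapping from token number to attribute names, with a helper that checks the lexicographic ordering required by Section 6.1.1. The names `TOKEN_SPECS` and `is_lexicographically_ordered` are illustrative only.

```python
# Token number -> attribute names, copied from the lists above.
TOKEN_SPECS = {
    1: ("Birth Date", "First Initial", "Gender", "Last Name"),
    2: ("Birth Date", "First Soundex", "Gender", "Last Soundex"),
    3: ("Birth Date", "First Metaphone", "Gender", "Last Metaphone"),
    4: ("Birth Date", "First Initial", "Last Name"),
    5: ("Birth Date", "First Soundex", "Last Soundex"),
    6: ("Birth Date", "First Metaphone", "Last Metaphone"),
    7: ("First Name", "Phone Number"),
    8: ("Birth Date", "Phone Number"),
    9: ("First Name", "Social Security Number"),
    10: ("Birth Date", "Social Security Number"),
    11: ("Email Address",),
    12: ("Hashed Email Address",),
    13: ("Health Plan Group Number", "Health Plan Member ID"),
}


def is_lexicographically_ordered(attributes) -> bool:
    """Check that attribute names already appear in sorted order."""
    return list(attributes) == sorted(attributes)
```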

7. Transcoding Tokens

A key function of Privacy Preserving Record Linkage (PPRL) is the ability to associate records describing the same Subject without requiring knowledge of the Subject’s identity. This task is commonly called “linking” or “matching”.

A common use case is to link datasets originating from different Custodians. As described above, every Custodian’s data assets have tokens created from their unique private key(s) and therefore tokenized data cannot be linked directly. This is important for preserving security and privacy in the event of a Custodian data breach. In the scenario where an untrusted third party obtains (potentially re-identified) OPPRL tokens of one Custodian, all other Custodians with OPPRL tokens are protected by the fact that their data assets contain different token values for the same subjects.

To enable cross-custodian data linking without introducing a universal token representation that could be used by untrusted third parties, a “transcode” cryptographic workflow is performed in coordination between Custodian and Recipient. The workflow is broken up into 2 data flows: one performed by the Custodian to prepare data to be shared, and one performed by the recipient to create linkable tokenized records.

7.1 Data Flow: Custodian Tokens to Ephemeral Tokens

Prior to this data flow, the Recipient must have shared their public RSA key with the Custodian that will be sending data. This key exchange can occur over unsecured communication channels as long as the Recipient’s private key has not been compromised and the Custodian has authenticated the source of the public key as the intended recipient. These topics are out of scope for this protocol.

Inputs:

token: A base64 encoded token (Section 6.1)
recipient_public_key: The recipient’s public RSA key (Section 3.1)
private_key: The custodian’s private RSA key (Section 3.1)

Process:

b = UNBASE64(token)
h = DECRYPT_AES_GCM_SIV(b, private_key)
e = ENCRYPT_RSA(h, recipient_public_key)
return BASE64_NO_NEWLINE(e)

See function definitions in Section 4.

Result: A base64 encoded ephemeral token.

7.2 Data Flow: Ephemeral Tokens to Recipient Tokens

This data flow is performed by a Recipient of ephemeral tokens created by a Custodian with the Recipient’s public key. Ephemeral tokens are base64 encoded.

Inputs:

ephemeral_token: A base64 encoded ephemeral token (Section 7.1)
private_key: The recipient’s private RSA key (Section 3.1)

Process:

b = UNBASE64(ephemeral_token)
h = DECRYPT_RSA(b, private_key)
t = ENCRYPT_AES_GCM_SIV(h, private_key)
return BASE64_NO_NEWLINE(t)

See function definitions in Section 4.

Result: A base64 encoded token that can be linked (via equality) to other data assets tokenized with the same Recipient’s private key.
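The two transcode data flows in Sections 7.1 and 7.2 can be sketched side by side. To stay within the standard library, the cryptographic operations are passed in as callables standing in for DECRYPT_AES_GCM_SIV, ENCRYPT_RSA, DECRYPT_RSA, and ENCRYPT_AES_GCM_SIV, each already bound to the appropriate key. These parameters and the function names are assumptions for illustration.

```python
import base64
from typing import Callable

Transform = Callable[[bytes], bytes]


def to_ephemeral(token: str, decrypt_aes: Transform, encrypt_rsa: Transform) -> str:
    """Custodian side (Section 7.1): token -> ephemeral token."""
    b = base64.b64decode(token)          # UNBASE64
    h = decrypt_aes(b)                   # recover the hash with the custodian's AES key
    e = encrypt_rsa(h)                   # encrypt with the recipient's public key
    return base64.b64encode(e).decode("ascii")


def from_ephemeral(ephemeral_token: str, decrypt_rsa: Transform, encrypt_aes: Transform) -> str:
    """Recipient side (Section 7.2): ephemeral token -> recipient token."""
    b = base64.b64decode(ephemeral_token)  # UNBASE64
    h = decrypt_rsa(b)                     # recover the hash with the recipient's private key
    t = encrypt_aes(h)                     # re-encrypt with the recipient's AES key
    return base64.b64encode(t).decode("ascii")
```

The property worth checking in any implementation is that a token transcoded through both flows equals the token the Recipient would have produced directly from the same hash.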

8. Security Engineering Considerations

This section contains information for implementers that falls outside the scope of standardized tokenization and linking but is important with respect to security and privacy. Ultimately, implementers are responsible for making informed architectural decisions based on the specific scenarios they are targeting.

8.1 Secrets Management

The security of OPPRL relies upon the secrecy of the user’s private RSA key and the derived AES key. It is common for multiple individual users (both personnel and systems) within the same user organization to collaboratively maintain tokenized data assets, and therefore shared access to the organization’s secret key is required.

It is highly recommended that users adopt a key management system with role-based access control (RBAC) such that organization administrators can enforce the principle of least privilege. In addition, using an audit logging solution to record and monitor administrative and access activity related to private encryption keys is strongly recommended.

8.2 Key Rotation

Users should consider periodically rotating encryption keys to reduce the volume of data encrypted with a single key and shorten the window of vulnerability for a compromised key.

Users can re-encrypt existing tokens under a new key by performing a one-time transcoding workflow as described in Section 7, in which the user acts as both the sending Custodian (with the old key) and the Recipient (with the new key).

After rotating encryption keys, users must share their new public RSA key with the custodians that send them data.

When data recipients use a key rotation policy, it is recommended that custodians sending data communicate which public key they are using. This communication can happen out-of-band, or the sender can include some form of checksum generated from the public key used to transcode the tokens into ephemeral tokens. This is currently out of scope for the OPPRL protocol.

References

[FIPS 202] National Institute of Standards and Technology. 2015. “SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions.” Federal Information Processing Standards Publications (FIPS) 202. Washington, D.C.: U.S. Department of Commerce. https://doi.org/10.6028/nist.fips.202.

[FIPS 180-4] National Institute of Standards and Technology. 2023. “Secure Hash Standard (SHS).” Federal Information Processing Standards Publications (FIPS) 180-4, March 07, 2023 Revision. Washington, D.C.: U.S. Department of Commerce. https://doi.org/10.6028/NIST.FIPS.180-4.

[RFC 2045] Freed, Ned, and Dr. Nathaniel S. Borenstein. 1996. “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies.” Request for Comments. RFC 2045; RFC Editor. https://doi.org/10.17487/RFC2045.

[RFC 8452] Gueron, Shay, Adam Langley, and Yehuda Lindell. 2019. “AES-GCM-SIV: Nonce Misuse-Resistant Authenticated Encryption.” Request for Comments. RFC 8452; RFC Editor. https://doi.org/10.17487/RFC8452.

[RFC 4648] Josefsson, Simon. 2006. “The Base16, Base32, and Base64 Data Encodings.” Request for Comments. RFC 4648; RFC Editor. https://doi.org/10.17487/RFC4648.

[RFC 5869] Krawczyk, Hugo, and Pasi Eronen. 2010. “HMAC-based Extract-and-Expand Key Derivation Function (HKDF).” Request for Comments. RFC 5869; RFC Editor. https://doi.org/10.17487/RFC5869.

[RFC 8017] Moriarty, Kathleen, Burt Kaliski, Jakob Jonsson, and Andreas Rusch. 2016. “PKCS #1: RSA Cryptography Specifications Version 2.2.” Request for Comments. RFC 8017; RFC Editor. https://doi.org/10.17487/RFC8017.

[Liveramp 2022] LiveRamp. 2022. “Email Hashing: The Trouble with Hashed Emails (HEMs).” https://liveramp.com/blog/the-trouble-with-hashed-emails-hems/.

[Philips 1990] Philips, Lawrence. 1990. “Hanging on the Metaphone.” In. https://api.semanticscholar.org/CorpusID:59912108.

[Russell 1922] Russell, Robert C. 1922. “Index.” U.S. Patent Office; U.S. Patent.

[ITU E.164] International Telecommunication Union. 2010. “The International Public Telecommunication Numbering Plan.” Recommendation ITU-T E.164.

[SSA 1982] Social Security Administration. 1982. “Meaning of the Social Security Number.” 11. Vol. 45. Social Security Administration. https://www.ssa.gov/policy/docs/ssb/v45n11/v45n11p29.pdf.

[SSA 2011] Social Security Administration. 2011. “Social Security Is Changing the Way SSNs Are Issued.” Social Security Administration. https://www.ssa.gov/kc/SSAFactSheet--IssuingSSNs.pdf.
