nest-data-source-api 0.7.0

NEST Data Source API Service

NEST Data Source API

The NEST Data Source API is a data sharing API that provides a way for the NEST research data platform to interact with multiple data sources in a secure, privacy-preserving manner. Each data source implements the DataSource trait and exposes the DataSourceApi, enabling NEST to communicate with it through encrypted data exchange with blind pseudonymization. Under the hood, the API uses paas-client to pseudonymize data between NEST and data sources, ensuring that different systems can share data about participants and research activities without compromising privacy.

Overview

This API implements a privacy-friendly architecture where:

  • Data subjects (participants) are represented with different domain-specific pseudonyms in each system
  • Data exchange happens through polymorphically encrypted JSON with blind (homomorphic) pseudonymization
  • Unlinkability is preserved even in case of data leakage, as each system operates in its own pseudonymization domain with minimized quasi-identifiers
  • Batch operations eliminate quasi-identifiers by reshuffling records during transcryption

Architecture

Pseudonymization Domains

Each data source, and NEST itself, operates in its own pseudonymization domain (and sometimes in several). Within each domain, participants are identified by domain-specific pseudonyms. When data is shared between domains, identifiers are transformed through blind pseudonymization, so data cannot be linked across domains without authorized transcryption.

Encryption and Transcryption Flow

The system uses a non-interactive, asynchronous protocol for data exchange:

  1. Encryption: The sending system encrypts data (including participant identifiers) using libpep and paas-client
  2. Transmission: Encrypted data is sent to the receiving system over regular HTTPS connections
  3. Transcryption: The receiving system asks the pseudonymization service to blindly transform identifiers from the sender's domain to its own domain
  4. Decryption: The receiving system decrypts the transcrypted data in its own domain

This approach ensures:

  • End-to-end encryption: Data remains encrypted during transmission
  • Blind pseudonymization: The pseudonymization service performs homomorphic operations without seeing plaintext
  • Asynchronous processing: No real-time coordination between sender and receiver required
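
The four steps can be illustrated with a toy model. The sketch below is not libpep or paas-client code: it assumes textbook ElGamal over a tiny prime-order group, and the constants P, Q, G and all function names are made up for illustration. The randomness r is passed in explicitly so the behaviour is reproducible. A full PEP transcryption would also rekey the ciphertext to the receiver's key; that step is left out to keep the sketch short.

```rust
// Toy ElGamal over the order-509 subgroup of the integers modulo 1019.
// Illustration only: the parameters are far too small for real use.
pub const P: u64 = 1019; // safe prime, P = 2 * Q + 1
pub const Q: u64 = 509; // prime order of the subgroup generated by G
pub const G: u64 = 4; // generator of that subgroup

/// Square-and-multiply modular exponentiation.
pub fn pow_mod(mut base: u64, mut exp: u64, modulus: u64) -> u64 {
    let mut acc = 1u64;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    acc
}

/// ElGamal ciphertext (b, c) = (g^r, m * y^r).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ciphertext {
    pub b: u64,
    pub c: u64,
}

pub fn public_key(secret: u64) -> u64 {
    pow_mod(G, secret, P)
}

/// Step 1 (sender): encrypt `m` under `pk` with fresh randomness `r`.
pub fn encrypt(m: u64, pk: u64, r: u64) -> Ciphertext {
    Ciphertext {
        b: pow_mod(G, r, P),
        c: m * pow_mod(pk, r, P) % P,
    }
}

/// Step 3 (pseudonymization service): raising both components to `factor`
/// turns an encryption of m into an encryption of m^factor, without decrypting.
pub fn reshuffle(ct: Ciphertext, factor: u64) -> Ciphertext {
    Ciphertext {
        b: pow_mod(ct.b, factor, P),
        c: pow_mod(ct.c, factor, P),
    }
}

/// Step 4 (receiver): decrypt with the secret key, m = c / b^secret.
pub fn decrypt(ct: Ciphertext, secret: u64) -> u64 {
    // b has order Q, so b^(Q - secret) is the inverse of b^secret.
    ct.c * pow_mod(ct.b, Q - secret % Q, P) % P
}
```

Encrypting the same value with two different r values gives two different ciphertexts, yet both decrypt to the same pseudonym after reshuffle. This is the property the transcryption step relies on.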

Polymorphic Encryption and Privacy

The system uses Polymorphic Encryption and Pseudonymization (PEP) to ensure strong privacy guarantees. A critical aspect of maintaining privacy is preventing linkability through quasi-identifiers.

Key privacy principles:

  • Domain-specific pseudonyms: Participant identifiers are domain-specific and cannot be linked across systems without authorized transcryption. Use these freely as they provide privacy protection.
  • Avoid globally unique identifiers: Whenever possible, avoid including globally unique values (GUIDs, timestamps with high precision, sequential IDs) that could serve as quasi-identifiers linking records across different queries or batches.
  • Minimal data in progress responses: Activity status and progress reports should contain minimal information. Use boolean flags and small integers rather than detailed metadata that could enable fingerprinting.
  • Polymorphic encryption: The same plaintext encrypts to different ciphertexts each time, preventing pattern matching even on encrypted data.
  • Block size considerations: PEP operates on 15-byte encryption blocks. Pseudonyms and attribute values that fit within a single block (≤15 bytes) are most efficient and provide the strongest privacy guarantees. Longer values are supported but require multiple blocks, which is computationally expensive and may have privacy implications for pseudonyms.
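
As a rough aid for the block size point, the helper below estimates how many 15-byte blocks a value occupies. It is a minimal sketch that assumes plain ceiling division over the UTF-8 byte length. BLOCK_BYTES, blocks_needed and fits_single_block are hypothetical names, and any padding or framing overhead that libpep may add is ignored.

```rust
/// Block size stated in this document (hypothetical constant name).
pub const BLOCK_BYTES: usize = 15;

/// Number of 15-byte blocks needed for the UTF-8 bytes of `value`.
pub fn blocks_needed(value: &str) -> usize {
    (value.len() + BLOCK_BYTES - 1) / BLOCK_BYTES
}

/// True when the value fits in one block, the cheapest case.
pub fn fits_single_block(value: &str) -> bool {
    value.len() <= BLOCK_BYTES
}
```

Under these assumptions a 20-byte activity name such as project2501_survey_1 needs two blocks, and a pseudonym written as a 64-character hex string needs five.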

Example - Progress Response (Good):

{
  "has_data": true,
  "data_present": 1,
  "placeholder": 0
}

This reveals only the essential boolean information without quasi-identifiers.

Example - Progress Response (Bad):

{
  "has_data": true,
  "record_id": "550e8400-e29b-41d4-a716-446655440000",  // ❌ Global unique identifier
  "timestamp": "2024-03-15T14:32:17.392847Z",           // ❌ High-precision timestamp
  "data_size_bytes": 4729                               // ❌ Unique value enabling fingerprinting
}

These fields could be used as quasi-identifiers to link records even when pseudonyms are properly protected.
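
A data source could lint its own progress responses before sending them. The sketch below is a heuristic only, not part of this crate: it assumes responses are available as (field, value) string pairs, and the function names are invented for illustration. It flags values shaped like a UUID and timestamps with fractional seconds.

```rust
/// True for the canonical 8-4-4-4-12 hexadecimal UUID layout.
pub fn looks_like_uuid(value: &str) -> bool {
    let bytes = value.as_bytes();
    bytes.len() == 36
        && bytes.iter().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => *b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// True when a 'T'-separated timestamp carries ".<digit>" in its time part.
pub fn has_fractional_seconds(value: &str) -> bool {
    match value.split_once('T') {
        Some((_, time)) => match time.split_once('.') {
            Some((_, frac)) => frac.starts_with(|c: char| c.is_ascii_digit()),
            None => false,
        },
        None => false,
    }
}

/// Names of the fields whose values match one of the heuristics above.
pub fn flag_quasi_identifiers<'a>(fields: &[(&'a str, &str)]) -> Vec<&'a str> {
    fields
        .iter()
        .filter(|(_, v)| looks_like_uuid(v) || has_fractional_seconds(v))
        .map(|(k, _)| *k)
        .collect()
}
```

Applied to the bad example, this flags record_id and timestamp but not data_size_bytes. Pattern checks of this kind catch only the obvious cases and do not replace a review of the response design.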

Data Model

The API focuses on two main entities for the NEST research platform:

  • Participants: Data subjects involved in research activities, identified by domain-specific pseudonyms. For example, a participant registered as jane_doe@example.com is represented as d42ac9c168ffc3b5856ee29e88fafc3ee321230e7df874cfc48f548a76ca4f37 or ae16865fe6a91b815791af7b1b61612b1221209964763010b603dcfb4e0c0243 in other domains
  • Research Activities: Actions or events associated with participants, such as survey responses or measurements (for example, project2501_survey_1)

Each message contains:

  • Metadata:
    • Participant being queried for (pseudonymized identifier)
    • Pseudonymization domain information
    • Activity type
    • Datetime: Timestamp of when the request/response was made (NOT the timestamp of the activity/record itself)
  • Record data: The actual encrypted JSON data being exchanged (containing the activity details, status, results, and potentially pseudonyms of multiple participants involved)
    • May include its own timestamp fields indicating when the activity occurred

Multi-Participant Records

Important: This is a data-sharing API in which a single record can relate to multiple participants (e.g., collaborative activities, teacher-student interactions). The same record can therefore be returned when querying for any of the participants involved.

The participant pseudonym appears in two distinct places:

  1. In the message metadata - identifies which participant the request/response is for
  2. Inside the record data - may contain pseudonyms of all participants involved in the activity

Examples of Multi-Participant Scenarios

  • Collaborative activities: Two participants perform a joint task or survey together
  • Hierarchical relationships: A teacher and student, where both are pseudonymous participants
  • Group activities: Multiple participants in a shared research session

In these cases:

  • The data source may return the same record when queried for different participants
  • Each participant receives the shared record, but with their own pseudonym in the metadata
  • The record data contains all involved participants' pseudonyms

Example: A teacher-student interaction record

When the teacher queries for their activities:

Request for teacher:
  Metadata: {
    participant: "ae16865fe6...",
    activity: "feedback_session",
    datetime: "2024-03-15T14:30:00Z"  ← Request/response timestamp
  }
  Record data: {
    "teacher": "ae16865fe6...",
    "student": "d42ac9c168...",
    "interaction": "...",
  }

When the student queries for their activities later:

Request for student:
  Metadata: {
    participant: "d42ac9c168...",
    activity: "feedback_session",
    datetime: "2024-03-15T16:45:00Z"  ← Different request
  }
  Record data: {
    "teacher": "ae16865fe6...",
    "student": "d42ac9c168...",
    "interaction": "...",
  }
  ↑ Same record returned
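
The teacher-student example can be modelled in a few lines. The types below are hypothetical stand-ins that work on plaintext, not the crate's ActivityRequest/ActivityResponse or any encrypted type. The field names follow the example, and query_activities is an invented helper.

```rust
/// Hypothetical message metadata, mirroring the example above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Metadata {
    pub participant: String,
    pub activity: String,
    pub datetime: String, // time of the request/response, not of the activity
}

/// Hypothetical record: pseudonyms of everyone involved plus a payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Record {
    pub activity: String,
    pub participants: Vec<(String, String)>, // (role, pseudonym)
    pub interaction: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Message {
    pub metadata: Metadata,
    pub record: Record,
}

/// Return every record of the given activity type that involves `participant`,
/// wrapped in metadata naming that participant and the time of this query.
pub fn query_activities(
    store: &[Record],
    participant: &str,
    activity: &str,
    now: &str,
) -> Vec<Message> {
    store
        .iter()
        .filter(|r| r.activity == activity && r.participants.iter().any(|(_, p)| p == participant))
        .map(|r| Message {
            metadata: Metadata {
                participant: participant.to_string(),
                activity: activity.to_string(),
                datetime: now.to_string(),
            },
            record: r.clone(),
        })
        .collect()
}
```

Querying once for the teacher and once for the student yields messages whose record is identical while the participant and datetime in the metadata differ.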

Message Types

The API supports the following operations (all available in both single and batch modes):

Participant Operations

  • Create: Register new participants
  • Read: Retrieve participant information
  • Update: Modify participant data
  • Delete: Remove participants

Activity Operations

  • Create: Record new research activities for participants
  • Read: Retrieve activity information
  • Update: Modify activity data
  • Delete: Remove activities

Additional Activity Operations

  • Request Activity Results: Retrieve resulting data from research activities for analysis or archiving
  • Report Activity Status/Progress: Check data availability and track activity progress for dashboard displays

Batch Operations

To improve efficiency and preserve privacy, every operation supports batch mode: participant and activity CRUD operations, result requests, and progress tracking.

Note: In future versions, batch operations may become the primary or required interface to maximize privacy through reshuffling.

Unlinkability Through Reshuffling

During batch transcryption, the pseudonymization service reshuffles the order of records. This critical feature eliminates quasi-identifiers by breaking any correlation between:

  • The order in which records were submitted
  • Temporal patterns
  • Other implicit linkage information

All items in a batch must have the same JSON structure to enable this privacy-preserving reshuffling while maintaining data integrity.
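
The reordering itself can be pictured as a Fisher-Yates shuffle over the batch. The sketch below assumes a small xorshift generator so that it runs with the standard library alone. XorShift64 and reshuffle_batch are illustrative names, and a real service would draw from a cryptographically secure random source and shuffle as part of transcryption.

```rust
/// Tiny xorshift generator, used here only to keep the example dependency-free.
/// It is NOT suitable for privacy-critical shuffling.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle: the output keeps every record exactly once
/// but discards the submission order.
pub fn reshuffle_batch<T>(records: &mut [T], rng: &mut XorShift64) {
    for i in (1..records.len()).rev() {
        // The modulo introduces a slight bias, acceptable for an illustration.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        records.swap(i, j);
    }
}
```

Because the receiver can no longer rely on position, each record has to carry everything needed to interpret it, which is one reason a uniform structure matters.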

JSON Structure Constraints

To maintain unlinkability in batch operations, all records must use consistent JSON structures that consume the same number of encryption blocks (15 bytes each). Different response values must not leak information through structure differences.

Note: Record data may contain participant pseudonyms as field values. Since this API is for data sharing, multiple participants may appear within the same record (e.g., "teacher": "ae16865fe6...", "student": "d42ac9c168..."). These pseudonyms are part of the encrypted record data.

Example: Activity Status Reports

When reporting whether data is available, both positive and negative responses must have identical structure.

Data available:

{
  "has_data": true,
  "data_present": 1,
  "placeholder": 0
}

No data available:

{
  "has_data": false,
  "data_present": 0,
  "placeholder": 0
}

Both responses:

  • Use the same field names
  • Have the same number of fields
  • Use numeric values of the same width (single digits in this example)
  • Consume the same number of encryption blocks

Invalid Example

This structure leaks information:

Data available (includes data_url field):

{
  "has_data": true,
  "data_url": "https://example.com/data/12345"
}

No data (missing data_url field):

{
  "has_data": false
}

The presence or absence of the data_url field would reveal information through the encrypted structure, even before decryption.
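
A batch can be checked for this kind of leak before encryption. The following sketch assumes that each field value is encrypted separately into 15-byte blocks and that a record is available as ordered (field, serialized value) pairs. Both are simplifying assumptions, and shape and same_shape are hypothetical helpers rather than crate functions.

```rust
const BLOCK_BYTES: usize = 15;

/// Blocks occupied by one serialized value (plain ceiling division).
fn blocks(value: &str) -> usize {
    (value.len() + BLOCK_BYTES - 1) / BLOCK_BYTES
}

/// The externally visible "shape" of a record: field names in order,
/// each with the number of blocks its value occupies.
pub fn shape(record: &[(&str, &str)]) -> Vec<(String, usize)> {
    record
        .iter()
        .map(|(name, value)| (name.to_string(), blocks(value)))
        .collect()
}

/// True when every record in the batch has the same shape.
/// An empty batch is trivially consistent.
pub fn same_shape(batch: &[Vec<(&str, &str)>]) -> bool {
    let mut shapes = batch.iter().map(|r| shape(r));
    match shapes.next() {
        Some(first) => shapes.all(|s| s == first),
        None => true,
    }
}
```

With these helpers the has_data / data_present / placeholder pair passes, while the pair with the optional data_url field is rejected.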

Design Guidelines

When designing record data JSON structures:

  1. Fixed field sets: All possible responses must include the same fields
  2. Consistent types: Use consistent data types (e.g., always integers for counts, not sometimes strings)
  3. Padding fields: Add placeholder fields if needed to maintain consistent size
  4. Bounded strings: If strings are necessary, use fixed-length encoded values or enums
  5. Null encoding: Represent absence with explicit values (0, false) rather than null or missing fields
  6. Participant pseudonyms: May appear as field values within the record data for data sharing scenarios
  7. Avoid globally unique values: Minimize use of GUIDs, high-precision timestamps, sequential IDs, or other values that could serve as quasi-identifiers. This is especially critical for progress/status responses where minimal information should be revealed.
  8. Use domain-specific identifiers: Prefer domain-specific pseudonyms over global identifiers, as they provide unlinkability across domains
  9. Optimize for 15-byte blocks: Keep pseudonyms and frequently-used attribute values at or below 15 bytes when possible. This maximizes efficiency and provides stronger privacy guarantees. Longer values require multiple encryption blocks, are computationally more expensive, and may leak information.
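
Guidelines 4 and 5 translate naturally into Rust types. The sketch below is illustrative only: ActivityState, its two-letter codes, and StatusReport::from_lookup are assumptions, not types from this crate. Clamping data_present to a single digit is likewise a choice made here to keep the serialized width constant.

```rust
/// Hypothetical activity states, each with a fixed-length code (guideline 4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActivityState {
    NotStarted,
    InProgress,
    Completed,
}

impl ActivityState {
    pub const ALL: [ActivityState; 3] = [
        ActivityState::NotStarted,
        ActivityState::InProgress,
        ActivityState::Completed,
    ];

    /// Every code has the same length, so the state cannot leak through size.
    pub fn code(self) -> &'static str {
        match self {
            ActivityState::NotStarted => "NS",
            ActivityState::InProgress => "IP",
            ActivityState::Completed => "CO",
        }
    }

    pub fn from_code(code: &str) -> Option<ActivityState> {
        Self::ALL.iter().copied().find(|s| s.code() == code)
    }
}

/// Status report with the three fields used in this document's examples.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StatusReport {
    pub has_data: bool,
    pub data_present: u8,
    pub placeholder: u8,
}

impl StatusReport {
    /// Absence is encoded with explicit values, never with a missing field
    /// (guideline 5). Counts are clamped to one digit (an assumption).
    pub fn from_lookup(found: Option<u8>) -> StatusReport {
        match found {
            Some(n) if n > 0 => StatusReport {
                has_data: true,
                data_present: n.min(9),
                placeholder: 0,
            },
            _ => StatusReport {
                has_data: false,
                data_present: 0,
                placeholder: 0,
            },
        }
    }
}
```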

Error Handling in Batches

Batch operations support partial success: individual records within a batch can succeed or fail independently. Each record in the response includes an error code when that specific record encounters an issue.

Example: In a batch of 10 activity creation requests:

  • 8 records may succeed normally
  • 2 records may fail with specific error codes (e.g., participant not found, invalid data)
  • The overall batch operation completes, returning status for all 10 records
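
Partial success maps onto one Result per record. The sketch below shows the idea with invented names: RecordError and its two variants, NewActivity, and the validation rules are assumptions for illustration, not the crate's actual error codes or request types.

```rust
/// Hypothetical per-record error codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecordError {
    ParticipantNotFound,
    InvalidData,
}

/// Hypothetical activity creation request.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NewActivity {
    pub participant: String,
    pub payload: String,
}

/// Validate a single record against the list of known participants.
pub fn create_one(known: &[&str], item: &NewActivity) -> Result<(), RecordError> {
    if !known.contains(&item.participant.as_str()) {
        return Err(RecordError::ParticipantNotFound);
    }
    if item.payload.is_empty() {
        return Err(RecordError::InvalidData);
    }
    Ok(())
}

/// Process every record independently: one failure never aborts the batch,
/// and the output has exactly one entry per input record.
pub fn create_batch(known: &[&str], batch: &[NewActivity]) -> Vec<Result<(), RecordError>> {
    batch.iter().map(|item| create_one(known, item)).collect()
}

/// Count (succeeded, failed) records in a batch result.
pub fn summarize(results: &[Result<(), RecordError>]) -> (usize, usize) {
    let ok = results.iter().filter(|r| r.is_ok()).count();
    (ok, results.len() - ok)
}
```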

Implementation

Technology Stack

  • Framework: Built on the actix-web framework for HTTP server functionality
  • Authentication: Uses actix-web-httpauth middleware for simple token-based authentication
  • Cryptography: Leverages libpep v0.10+ for PEP (Polymorphic Encryption and Pseudonymization)
  • Pseudonymization: Integrates with paas-client for communication with the pseudonymization service to pseudonymize data between NEST and data sources
  • Data Format: JSON with PEPJSONValue and EncryptedPEPJSONValue types

Key Types

  • PEPJSONValue: Unencrypted PEP JSON representation (marks which fields contain identifiers) - used for record data
  • EncryptedPEPJSONValue: Serializable encrypted JSON (for API transmission) - the encrypted form of record data
  • Participant: Unencrypted participant with domain-specific pseudonym (used in message metadata)
  • EncryptedParticipant: Encrypted participant ready for transmission (used in message metadata)
  • ActivityRequest/ActivityResponse: Messages for research activities
    • Metadata: Contains the participant being queried for, domain, activity type, and datetime
    • Record data: Contains the actual data, which may include pseudonyms of multiple participants
  • BatchActivityRequest/BatchActivityResponse: Batch operation messages
    • Each record includes metadata (participant, domain, activity type, datetime) and encrypted record data
    • Each record may relate to multiple participants

Data Source Implementation

Each data source must:

  1. Implement the DataSource trait - defines how the data source handles participants and activities
  2. Expose the DataSourceApi - provides HTTP endpoints for NEST to interact with
  3. Handle participant and activity CRUD operations (in both single and batch modes)
  4. Manage domain-specific pseudonyms for participants
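
To give a feel for these responsibilities, here is a deliberately simplified stand-in. ToyDataSource is NOT the crate's DataSource trait, whose real methods and signatures are not shown in this document. It works on plaintext strings, covers participants only, and every name in it is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SourceError {
    AlreadyExists,
    NotFound,
}

/// Hypothetical trait: participant CRUD in single mode, plus a batch
/// variant that reuses the single-record operation.
pub trait ToyDataSource {
    fn create_participant(&mut self, pseudonym: &str, data: &str) -> Result<(), SourceError>;
    fn read_participant(&self, pseudonym: &str) -> Result<String, SourceError>;
    fn update_participant(&mut self, pseudonym: &str, data: &str) -> Result<(), SourceError>;
    fn delete_participant(&mut self, pseudonym: &str) -> Result<(), SourceError>;

    /// Batch mode: one result per input, failures do not stop the batch.
    fn create_participants_batch(&mut self, items: &[(&str, &str)]) -> Vec<Result<(), SourceError>> {
        items.iter().map(|(p, d)| self.create_participant(p, d)).collect()
    }
}

/// In-memory implementation keyed by domain-specific pseudonym.
#[derive(Default)]
pub struct InMemorySource {
    participants: HashMap<String, String>,
}

impl ToyDataSource for InMemorySource {
    fn create_participant(&mut self, pseudonym: &str, data: &str) -> Result<(), SourceError> {
        if self.participants.contains_key(pseudonym) {
            return Err(SourceError::AlreadyExists);
        }
        self.participants.insert(pseudonym.to_string(), data.to_string());
        Ok(())
    }

    fn read_participant(&self, pseudonym: &str) -> Result<String, SourceError> {
        self.participants.get(pseudonym).cloned().ok_or(SourceError::NotFound)
    }

    fn update_participant(&mut self, pseudonym: &str, data: &str) -> Result<(), SourceError> {
        match self.participants.get_mut(pseudonym) {
            Some(slot) => {
                *slot = data.to_string();
                Ok(())
            }
            None => Err(SourceError::NotFound),
        }
    }

    fn delete_participant(&mut self, pseudonym: &str) -> Result<(), SourceError> {
        self.participants.remove(pseudonym).map(|_| ()).ok_or(SourceError::NotFound)
    }
}
```

A real data source would receive encrypted participants and records, have them transcrypted into its own domain, and only then touch its storage.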

Deployment Options

The API offers flexibility in deployment:

  • Standalone server: Run as an independent HTTP service with its own actix-web server instance
  • Embedded in larger application: Include the DataSourceApi as part of a larger actix-web application for integrated systems

Both deployment modes use the same actix-web framework and actix-web-httpauth middleware for authentication.

Privacy Guarantees

The system provides strong privacy guarantees through polymorphic encryption and careful data minimization:

  1. Domain Isolation: Data from different domains cannot be linked without authorized transcryption. Pseudonyms are domain-specific and provide no linkage across systems.
  2. Encrypted Transmission: All data exchanged is end-to-end encrypted using polymorphic encryption, where the same plaintext produces different ciphertexts each time.
  3. Blind Processing: The pseudonymization service operates on encrypted data without access to plaintexts through homomorphic operations.
  4. Unlinkability: Batch operations eliminate temporal and ordering correlations through reshuffling. Quasi-identifiers are minimized by design.
  5. No Centralized Plaintext: No single entity has access to linked data across all domains.
  6. Minimal Data Exposure: Progress and status responses contain only essential information, avoiding globally unique identifiers that could enable record linkage or fingerprinting.

Use Cases

NEST Research Platform

The primary use case is the NEST research data platform, where:

  • Multiple research sites collect and share participant data
  • Each site operates in its own pseudonymization domain
  • NEST coordinates data collection and analysis across sites
  • Participants can collaborate across domains (e.g., joint studies, teacher-student interactions)
  • The same activity record may be shared among multiple participants
  • Privacy is preserved even if one site is compromised

Generic Applications

The architecture generalizes to any scenario requiring:

  • Secure multi-party data sharing
  • Privacy-preserving data integration
  • Compliance with data protection regulations (GDPR, HIPAA)
  • Research data management with participant privacy

Security Considerations

  • Authentication: All API endpoints require valid authentication tokens
  • Transport Security: Use TLS/HTTPS for all communications
  • Key Management: Pseudonymization keys are managed by the central pseudonymization service
  • Audit Logging: Operations should be logged for compliance and debugging

Future Enhancements

Potential improvements for the redesigned architecture:

  • Enhanced batch operation APIs for better efficiency
  • Improved error handling and recovery mechanisms
  • Support for additional message types and workflows
  • Performance optimizations for large-scale deployments
  • Comprehensive audit trail functionality

References

  • PEP Framework: Verheul, E., & Jacobs, B. (2017). Polymorphic Encryption and Pseudonymisation in Identity Management and Medical Research
  • n-PEP Extension: Doesburg, J., van Gastel, B., & Poll, E. (to be published)
  • NEST Platform: NEST Documentation

License

Apache-2.0

Authors