NEST Data Source API
The NEST Data Source API is a data sharing API that lets the NEST research data platform interact with multiple data sources in a secure, privacy-preserving manner.
Each data source implements the DataSource trait and exposes the DataSourceApi, enabling NEST to communicate with it through encrypted data exchange with blind pseudonymization.
Under the hood, the API uses paas-client to pseudonymize data between NEST and data sources, ensuring that different systems can share data about participants and research activities without compromising privacy.
Overview
This API implements a privacy-friendly architecture where:
- Data subjects (participants) are represented with different domain-specific pseudonyms in each system
- Data exchange happens through polymorphically encrypted JSON with blind (homomorphic) pseudonymization
- Unlinkability is preserved even in case of data leakage, as each system operates in its own pseudonymization domain with minimized quasi-identifiers
- Batch operations remove ordering-based and temporal quasi-identifiers by reshuffling records during transcryption
Architecture
Pseudonymization Domains
Each data source, and NEST itself, operates in its own pseudonymization domain (or, in some cases, several). Within each domain, participants are identified by domain-specific pseudonyms. When data is shared between domains, identifiers are transformed through blind pseudonymization, so data cannot be linked across domains without authorized transcryption.
Encryption and Transcryption Flow
The system uses a non-interactive, asynchronous protocol for data exchange:
- Encryption: The sending system encrypts data (including participant identifiers) using libpep and paas-client
- Transmission: Encrypted data is sent to the receiving system over regular HTTPS connections
- Transcryption: The receiving system requests the pseudonymization service to blindly transform identifiers from the sender's domain to its own domain
- Decryption: The receiving system decrypts the transcrypted data in its own domain
This approach ensures:
- End-to-end encryption: Data remains encrypted during transmission
- Blind pseudonymization: The pseudonymization service performs homomorphic operations without seeing plaintext
- Asynchronous processing: No real-time coordination between sender and receiver required
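The encrypt, transcrypt, decrypt flow above can be illustrated with a toy ElGamal scheme. This is a minimal sketch under stated assumptions, not the libpep or paas-client API: the tiny group (modulus 23, subgroup order 11), the function names, and the caller-supplied randomness are all chosen for illustration and are not secure. It shows two properties the text relies on: the same plaintext encrypts to different ciphertexts, and a pseudonym can be transformed for another domain without decrypting it.

```rust
// Toy ElGamal in the order-11 subgroup of Z_23^*. Illustration only:
// NOT secure, and NOT the libpep or paas-client API.
const P: u64 = 23; // prime modulus
const Q: u64 = 11; // prime order of the subgroup generated by G
const G: u64 = 4; // generator of that subgroup

/// Modular exponentiation by repeated squaring.
fn pow_mod(mut base: u64, mut exp: u64, modulus: u64) -> u64 {
    let mut acc = 1;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    acc
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Ciphertext {
    b: u64, // G^r
    c: u64, // msg * public_key^r
}

/// Encrypts `msg` (a subgroup element) with caller-supplied randomness `r`.
fn encrypt(msg: u64, public_key: u64, r: u64) -> Ciphertext {
    Ciphertext {
        b: pow_mod(G, r, P),
        c: msg * pow_mod(public_key, r, P) % P,
    }
}

/// Decrypts by computing c / b^secret_key inside the subgroup.
fn decrypt(ct: Ciphertext, secret_key: u64) -> u64 {
    ct.c * pow_mod(ct.b, Q - secret_key % Q, P) % P
}

/// Blind transcryption: turns the hidden value m into m^s and moves the
/// ciphertext from secret key x to secret key k*x, without decrypting.
fn transcrypt(ct: Ciphertext, s: u64, k: u64) -> Ciphertext {
    let k_inv = pow_mod(k, Q - 2, Q); // inverse of k modulo Q (Fermat)
    Ciphertext {
        b: pow_mod(ct.b, s * k_inv % Q, P),
        c: pow_mod(ct.c, s, P),
    }
}
```

With secret key 3, encrypting the value 2 under randomness 5 and 7 gives two different ciphertexts that both decrypt to 2. After `transcrypt(ct, 4, 2)`, only the holder of key 6 (2 times 3, modulo 11) can decrypt, and obtains 2^4 rather than the sender-side value.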
Polymorphic Encryption and Privacy
The system uses Polymorphic Encryption and Pseudonymization (PEP) to ensure strong privacy guarantees. A critical aspect of maintaining privacy is preventing linkability through quasi-identifiers.
Key privacy principles:
- Domain-specific pseudonyms: Participant identifiers are domain-specific and cannot be linked across systems without authorized transcryption. Use these freely as they provide privacy protection.
- Avoid globally unique identifiers: Whenever possible, avoid including globally unique values (GUIDs, timestamps with high precision, sequential IDs) that could serve as quasi-identifiers linking records across different queries or batches.
- Minimal data in progress responses: Activity status and progress reports should contain minimal information. Use boolean flags and small integers rather than detailed metadata that could enable fingerprinting.
- Polymorphic encryption: The same plaintext encrypts to different ciphertexts each time, preventing pattern matching even on encrypted data.
- Block size considerations: PEP operates on 15-byte encryption blocks. Pseudonyms and attribute values that fit within a single block (≤15 bytes) are most efficient and provide the strongest privacy guarantees. Longer values are supported but require multiple blocks, which is computationally expensive and may have privacy implications for pseudonyms.
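To make the block size point concrete, the sketch below counts how many 15-byte blocks a value would occupy. It assumes plain byte-wise chunking of the value; the actual encoding used by libpep may differ, and the function names are hypothetical.

```rust
/// Block size stated in the text above.
const BLOCK_SIZE: usize = 15;

/// Number of 15-byte blocks needed to hold `value`
/// (assumption: the value is simply chunked byte by byte).
fn blocks_needed(value: &[u8]) -> usize {
    (value.len() + BLOCK_SIZE - 1) / BLOCK_SIZE
}

/// True when the value fits in a single block.
fn fits_single_block(value: &[u8]) -> bool {
    value.len() <= BLOCK_SIZE
}
```

Under this assumption, an 8-byte identifier such as `survey_1` fits in one block, while the 20-byte `project2501_survey_1` needs two.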
Example - Progress Response (Good):
This reveals only the essential boolean information without quasi-identifiers.
Example - Progress Response (Bad):
These fields could be used as quasi-identifiers to link records even when pseudonyms are properly protected.
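The contrast between a minimal and a fingerprintable progress response can be sketched as follows. All field names here (`started`, `completed`, `session_id`, `last_modified`, `record_uuid`) are hypothetical and chosen only for illustration; they are not taken from the API.

```rust
/// Minimal progress report: booleans only, no identifiers or timestamps.
/// Field names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ProgressReport {
    started: bool,
    completed: bool,
}

impl ProgressReport {
    /// Renders the report as compact JSON with a fixed field set.
    fn to_json(&self) -> String {
        format!(
            "{{\"started\":{},\"completed\":{}}}",
            self.started, self.completed
        )
    }
}

/// Hypothetical field names that would act as quasi-identifiers
/// if they appeared in a progress response.
const RISKY_FIELDS: [&str; 3] = ["session_id", "last_modified", "record_uuid"];

/// Returns the risky field names found among `field_names`.
fn risky_fields<'a>(field_names: &[&'a str]) -> Vec<&'a str> {
    field_names
        .iter()
        .copied()
        .filter(|name| RISKY_FIELDS.iter().any(|risky| *risky == *name))
        .collect()
}
```

A check like `risky_fields` could run in tests over every response type, so that a field such as a session identifier is caught before it ships.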
Data Model
The API focuses on two main entities for the NEST research platform:
- Participants: Data subjects involved in research activities, identified by domain-specific pseudonyms (for example, jane_doe@example.com for registration, but converted into d42ac9c168ffc3b5856ee29e88fafc3ee321230e7df874cfc48f548a76ca4f37 or ae16865fe6a91b815791af7b1b61612b1221209964763010b603dcfb4e0c0243 in other domains)
- Research Activities: Actions or events associated with participants (such as survey responses or measurements, for example project2501_survey_1)
Each message contains:
- Metadata:
  - Participant being queried for (pseudonymized identifier)
  - Pseudonymization domain information
  - Activity type
  - Datetime: Timestamp of when the request/response was made (NOT the timestamp of the activity/record itself)
- Record data: The actual encrypted JSON data being exchanged (containing the activity details, status, results, and potentially pseudonyms of multiple participants involved)
  - May include its own timestamp fields indicating when the activity occurred
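The message layout described above could be modelled roughly as below. This is a simplified sketch: the struct and field names are hypothetical stand-ins, all values are plain strings, and the crate's real types (such as `ActivityRequest` and `EncryptedPEPJSONValue`) are not reproduced here.

```rust
/// Metadata attached to every message; names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct MessageMetadata {
    participant: String, // pseudonym of the participant being queried for
    domain: String,      // pseudonymization domain
    activity: String,    // activity type
    datetime: String,    // time of the request/response, not of the activity
}

/// A message: metadata plus the record data being exchanged.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    metadata: MessageMetadata,
    record_data: String, // stand-in for the encrypted JSON payload
}

impl Message {
    /// True when the message metadata refers to `pseudonym`.
    fn is_about(&self, pseudonym: &str) -> bool {
        self.metadata.participant.as_str() == pseudonym
    }
}
```

Keeping the request timestamp in the metadata and any activity timestamp inside the record data mirrors the distinction the list above draws.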
Multi-Participant Records
Important: This is a data sharing API in which a single record can relate to multiple participants (e.g., collaborative activities, teacher-student interactions). The same record can be returned when querying for any of the participants involved.
The participant pseudonym appears in two distinct places:
- In the message metadata - identifies which participant the request/response is for
- Inside the record data - may contain pseudonyms of all participants involved in the activity
Examples of Multi-Participant Scenarios
- Collaborative activities: Two participants perform a joint task or survey together
- Hierarchical relationships: A teacher and student, where both are pseudonymous participants
- Group activities: Multiple participants in a shared research session
In these cases:
- The data source may return the same record when queried for different participants
- Each participant receives the shared record, but with their own pseudonym in the metadata
- The record data contains all involved participants' pseudonyms
Example: A teacher-student interaction record
When the teacher queries for their activities:
Request for teacher:
Metadata: {
  participant: "ae16865fe6...",
  activity: "feedback_session",
  datetime: "2024-03-15T14:30:00Z" ← Request/response timestamp
}
Record data: {
  "teacher": "ae16865fe6...",
  "student": "d42ac9c168...",
  "interaction": "...",
}
When the student queries for their activities later:
Request for student:
Metadata: {
  participant: "d42ac9c168...",
  activity: "feedback_session",
  datetime: "2024-03-15T16:45:00Z" ← Different request
}
Record data: {
  "teacher": "ae16865fe6...",
  "student": "d42ac9c168...",
  "interaction": "...",
}
↑ Same record returned
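On the data source side, returning the same record for each involved participant amounts to a membership lookup. The sketch below assumes an in-memory list of records with a hypothetical `SharedRecord` type; a real data source would query its own storage.

```rust
/// A record shared by one or more participants (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct SharedRecord {
    activity: String,
    participants: Vec<String>, // pseudonyms of everyone involved
}

/// Returns every record in which `pseudonym` is one of the participants.
fn records_for<'a>(records: &'a [SharedRecord], pseudonym: &str) -> Vec<&'a SharedRecord> {
    records
        .iter()
        .filter(|record| record.participants.iter().any(|p| p.as_str() == pseudonym))
        .collect()
}
```

Querying for the teacher and then for the student yields the same feedback session record in both results; only the message metadata differs between the two responses.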
Message Types
The API supports the following operations (all available in both single and batch modes):
Participant Operations
- Create: Register new participants
- Read: Retrieve participant information
- Update: Modify participant data
- Delete: Remove participants
Activity Operations
- Create: Record new research activities for participants
- Read: Retrieve activity information
- Update: Modify activity data
- Delete: Remove activities
Additional Activity Operations
- Request Activity Results: Retrieve resulting data from research activities for analysis or archiving
- Report Activity Status/Progress: Check data availability and track activity progress for dashboard displays
Batch Operations
To improve efficiency and preserve privacy, every operation supports batch mode: participant and activity CRUD operations, result requests, and progress tracking.
Note: In future versions, batch operations may become the primary or required interface to maximize privacy through reshuffling.
Unlinkability Through Reshuffling
During batch transcryption, the pseudonymization service reshuffles the order of records. This critical feature eliminates quasi-identifiers by breaking any correlation between:
- The order in which records were submitted
- Temporal patterns
- Other implicit linkage information
All items in a batch must have the same JSON structure to enable this privacy-preserving reshuffling while maintaining data integrity.
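The reshuffling step can be pictured as a uniform permutation of the batch. The sketch below uses a Fisher-Yates shuffle driven by a small xorshift generator so that it needs no external crates. Both choices are assumptions for illustration: the generator is not cryptographically secure, the modulo reduction is slightly biased, and this is not how the pseudonymization service is known to implement it.

```rust
/// Small xorshift64 generator so the sketch has no dependencies.
/// NOT cryptographically secure; a real service would use a CSPRNG.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Fisher-Yates shuffle: permutes `batch` in place so the output order
/// carries no information about the submission order.
fn reshuffle<T>(batch: &mut [T], rng: &mut XorShift64) {
    for i in (1..batch.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        batch.swap(i, j);
    }
}
```

Because every item has the same structure, the receiver cannot use record shape to undo the permutation.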
JSON Structure Constraints
To maintain unlinkability in batch operations, all records must use consistent JSON structures that consume the same number of encryption blocks (15 bytes each). Different response values must not leak information through structure differences.
Note: Record data may contain participant pseudonyms as field values. Since this API is for data sharing, multiple participants may appear within the same record (e.g., "teacher": "ae16865fe6...", "student": "d42ac9c168..."). These pseudonyms are part of the encrypted record data.
Example: Activity Status Reports
When reporting whether data is available, both positive and negative responses must have identical structure.
Data available:
No data available:
Both responses:
- Use the same field names
- Have the same number of fields
- Use numeric values of similar size
- Consume the same number of encryption blocks
Invalid Example
❌ This structure leaks information:
Data available (includes data_url field):
No data (missing data_url field):
The presence or absence of the data_url field would reveal information through the encrypted structure, even before decryption.
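A structural check along these lines can be automated. The sketch below treats a record as a flat list of (field name, value text) pairs and compares field names and per-value block counts. The flat representation, the byte-wise block counting, and the field names `has_data`, `count`, and `data_url` in the usage note are simplifying assumptions, not the library's data model.

```rust
const BLOCK_SIZE: usize = 15;

/// Shape of a flat record: sorted (field name, number of 15-byte blocks).
/// Assumes each value is chunked byte by byte into blocks.
fn shape(record: &[(&str, &str)]) -> Vec<(String, usize)> {
    let mut shape: Vec<(String, usize)> = record
        .iter()
        .map(|(name, value)| {
            (name.to_string(), (value.len() + BLOCK_SIZE - 1) / BLOCK_SIZE)
        })
        .collect();
    shape.sort();
    shape
}

/// True when both records expose the same fields and block counts,
/// so their encrypted forms are structurally indistinguishable.
fn same_shape(a: &[(&str, &str)], b: &[(&str, &str)]) -> bool {
    shape(a) == shape(b)
}
```

For example, `[("has_data", "1"), ("count", "12")]` and `[("has_data", "0"), ("count", "0")]` have the same shape, whereas adding a `data_url` field to only one of them makes `same_shape` return false.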
Design Guidelines
When designing record data JSON structures:
- Fixed field sets: All possible responses must include the same fields
- Consistent types: Use consistent data types (e.g., always integers for counts, not sometimes strings)
- Padding fields: Add placeholder fields if needed to maintain consistent size
- Bounded strings: If strings are necessary, use fixed-length encoded values or enums
- Null encoding: Represent absence with explicit values (0, false) rather than null or missing fields
- Participant pseudonyms: May appear as field values within the record data for data sharing scenarios
- Avoid globally unique values: Minimize use of GUIDs, high-precision timestamps, sequential IDs, or other values that could serve as quasi-identifiers. This is especially critical for progress/status responses where minimal information should be revealed.
- Use domain-specific identifiers: Prefer domain-specific pseudonyms over global identifiers, as they provide unlinkability across domains
- Optimize for 15-byte blocks: Keep pseudonyms and frequently-used attribute values at or below 15 bytes when possible. This maximizes efficiency and provides stronger privacy guarantees. Longer values require multiple encryption blocks, are computationally more expensive, and may leak information.
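Two of these guidelines, padding to a consistent size and encoding absence explicitly, can be sketched as small helpers. The zero-byte padding scheme and the function names are assumptions made for illustration; zero padding is ambiguous for values that themselves end in zero bytes, so a real encoding may need a length prefix or a different filler.

```rust
const BLOCK_SIZE: usize = 15;

/// Pads `value` with trailing zero bytes to exactly one 15-byte block.
/// Returns None when the value does not fit in a single block.
fn pad_to_block(value: &[u8]) -> Option<[u8; BLOCK_SIZE]> {
    if value.len() > BLOCK_SIZE {
        return None;
    }
    let mut block = [0u8; BLOCK_SIZE];
    block[..value.len()].copy_from_slice(value);
    Some(block)
}

/// Encodes an optional count with an explicit 0 instead of a missing field.
fn encode_count(count: Option<u32>) -> u32 {
    count.unwrap_or(0)
}
```

With helpers like these, every record carries the same fields at the same size regardless of whether data is present.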
Error Handling in Batches
Batch operations support partial success: individual records within a batch can succeed or fail independently. Each record in the response includes an error code when that specific record encounters an issue.
Example: In a batch of 10 activity creation requests:
- 8 records may succeed normally
- 2 records may fail with specific error codes (e.g., participant not found, invalid data)
- The overall batch operation completes, returning status for all 10 records
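Partial success maps naturally onto one result per record. The sketch below is a hypothetical model: the `RecordError` variants follow the two error examples named above, while the function name, the tuple representation of a request, and the validation rules are invented for illustration.

```rust
/// Per-record error codes (hypothetical, based on the examples above).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RecordError {
    ParticipantNotFound,
    InvalidData,
}

/// Processes a batch of (participant, data) requests and returns one
/// result per record; a failing record does not abort the others.
fn create_activities(
    batch: &[(&str, &str)],
    known_participants: &[&str],
) -> Vec<Result<(), RecordError>> {
    batch
        .iter()
        .map(|(participant, data)| {
            if !known_participants.iter().any(|known| known == participant) {
                Err(RecordError::ParticipantNotFound)
            } else if data.is_empty() {
                Err(RecordError::InvalidData)
            } else {
                Ok(())
            }
        })
        .collect()
}
```

The caller receives a status for every submitted record and can retry only the failed ones.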
Implementation
Technology Stack
- Framework: Built on the actix-web framework for HTTP server functionality
- Authentication: Uses actix-web-httpauth middleware for simple token-based authentication
- Cryptography: Leverages libpep v0.10+ for PEP (Polymorphic Encryption and Pseudonymization)
- Pseudonymization: Integrates with paas-client for communication with the pseudonymization service to pseudonymize data between NEST and data sources
- Data Format: JSON with PEPJSONValue and EncryptedPEPJSONValue types
Key Types
- PEPJSONValue: Unencrypted PEP JSON representation (marks which fields contain identifiers), used for record data
- EncryptedPEPJSONValue: Serializable encrypted JSON (for API transmission), the encrypted form of record data
- Participant: Unencrypted participant with domain-specific pseudonym (used in message metadata)
- EncryptedParticipant: Encrypted participant ready for transmission (used in message metadata)
- ActivityRequest/ActivityResponse: Messages for research activities
  - Metadata: Contains the participant being queried for, domain, activity type, and datetime
  - Record data: Contains the actual data, which may include pseudonyms of multiple participants
- BatchActivityRequest/BatchActivityResponse: Batch operation messages
  - Each record includes metadata (participant, domain, activity type, datetime) and encrypted record data
  - Each record may relate to multiple participants
Data Source Implementation
Each data source must:
- Implement the DataSource trait, which defines how the data source handles participants and activities
- Expose the DataSourceApi, which provides HTTP endpoints for NEST to interact with
- Handle participant and activity CRUD operations (in both single and batch modes)
- Manage domain-specific pseudonyms for participants
Deployment Options
The API offers flexibility in deployment:
- Standalone server: Run as an independent HTTP service with its own actix-web server instance
- Embedded in larger application: Include the DataSourceApi as part of a larger actix-web application for integrated systems
Both deployment modes use the same actix-web framework and actix-web-httpauth middleware for authentication.
Privacy Guarantees
The system provides strong privacy guarantees through polymorphic encryption and careful data minimization:
- Domain Isolation: Data from different domains cannot be linked without authorized transcryption. Pseudonyms are domain-specific and provide no linkage across systems.
- Encrypted Transmission: All data exchanged is end-to-end encrypted using polymorphic encryption, where the same plaintext produces different ciphertexts each time.
- Blind Processing: The pseudonymization service operates on encrypted data without access to plaintexts through homomorphic operations.
- Unlinkability: Batch operations eliminate temporal and ordering correlations through reshuffling. Quasi-identifiers are minimized by design.
- No Centralized Plaintext: No single entity has access to linked data across all domains.
- Minimal Data Exposure: Progress and status responses contain only essential information, avoiding globally unique identifiers that could enable record linkage or fingerprinting.
Use Cases
NEST Research Platform
The primary use case is the NEST research data platform, where:
- Multiple research sites collect and share participant data
- Each site operates in its own pseudonymization domain
- NEST coordinates data collection and analysis across sites
- Participants can collaborate across domains (e.g., joint studies, teacher-student interactions)
- The same activity record may be shared among multiple participants
- Privacy is preserved even if one site is compromised
Generic Applications
The architecture generalizes to any scenario requiring:
- Secure multi-party data sharing
- Privacy-preserving data integration
- Compliance with data protection regulations (GDPR, HIPAA)
- Research data management with participant privacy
Security Considerations
- Authentication: All API endpoints require valid authentication tokens
- Transport Security: Use TLS/HTTPS for all communications
- Key Management: Pseudonymization keys are managed by the central pseudonymization service
- Audit Logging: Operations should be logged for compliance and debugging
Future Enhancements
Potential improvements for the redesigned architecture:
- Enhanced batch operation APIs for better efficiency
- Improved error handling and recovery mechanisms
- Support for additional message types and workflows
- Performance optimizations for large-scale deployments
- Comprehensive audit trail functionality
References
- PEP Framework: Verheul, E., & Jacobs, B. (2017). Polymorphic Encryption and Pseudonymisation in Identity Management and Medical Research
- n-PEP Extension: Doesburg, J., van Gastel, B., & Poll, E. (to be published)
- NEST Platform: NEST Documentation
License
Apache-2.0
Authors
- Job Doesburg job@jobdoesburg.nl
- Julian van der Horst julian.vanderhorst@ru.nl