# MIP-004: A Hashing Standard for Input and Output Data Integrity

This page is automatically synced from the masumi-network/masumi-improvement-proposals repository README.
## Authors

Patrick Tobler, Sandro Sachier, Albina Nikiforova, Andreas Osberghaus

## Title

MIP-004: A Hashing Standard for Input and Output Data Integrity
## Abstract

This proposal introduces a foundational standard for creating deterministic, verifiable hashes for both inputs and outputs within the Masumi network. We propose two distinct but related three-step processes: one for the user's `input_data` payload and another for the agent's resulting `output`. Both processes are anchored by a shared `identifier_from_purchaser` and use a semicolon delimiter to ensure an unambiguous, verifiable pre-image. The input data is serialized using the JSON Canonicalization Scheme (JCS, RFC 8785), while the output is treated as a raw string. Both pre-images are then hashed using SHA-256. Adopting this comprehensive standard is crucial for ensuring end-to-end data integrity, enabling non-repudiation, and fostering interoperability between all participants in the Masumi ecosystem.
## Problem Statement
For the Masumi network to function as a trust-minimized system, there must be a standard way to create a unique and tamper-proof fingerprint for the entire lifecycle of a user-agent interaction. This includes both the data provided by the user and the output generated by the AI agent. Without a formally defined hashing standard, the network would face significant challenges:
- Lack of Verifiability: Users would have no way to prove exactly what input data they submitted, and agents would have no way to prove what output they generated for a specific request. This makes disputes impossible to resolve.
- No Verifiable Link: It would be difficult to cryptographically link a specific output back to the exact input that created it, weakening the chain of custody.
- Inconsistent Implementations: Different services might implement their own proprietary hashing methods, creating a fragmented ecosystem where hashes are incompatible and trust is diminished.
- Data Integrity Risks: There would be no reliable method to confirm that the data an agent receives is identical to the data the user sent, or that the output a user receives is what the agent generated. This opens the door to accidental corruption or man-in-the-middle modifications.
A clear, deterministic, and universally adopted hashing standard for both inputs and outputs is a prerequisite for a secure and functional decentralized AI network.
## Solution

We propose the adoption of a formal, dual-process hashing standard. This standard defines two separate procedures: one for generating a hash of the `input_data` and another for the `output`. Both hashes are anchored by the same `identifier_from_purchaser`. This standard uses a semicolon delimiter to create an unambiguous pre-image, which is a critical feature for robust security and verifiability.
## Technical Specification
The standard specifies two hashing functions: one for inputs and one for outputs.
### 1. Input Hashing

The input hash is generated from an `input_data` dictionary and an `identifier_from_purchaser` string.
#### Step 1.1: Canonical JSON Serialization

To ensure a deterministic representation, the `input_data` dictionary must be serialized into a byte string. This serialization must conform to the JSON Canonicalization Scheme (JCS) as specified in RFC 8785. The resulting bytes should be interpreted as a UTF-8 string.
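For simple, ASCII-only payloads the effect of JCS can be sketched with the standard library: keys are sorted lexicographically and all insignificant whitespace is removed. This is only an illustrative approximation; a compliant RFC 8785 library must be used in practice, since JCS also prescribes specific number and string serialization rules that `json.dumps` does not fully cover. The helper name `canonicalize` is hypothetical.

```python
import json


def canonicalize(input_data: dict) -> str:
    # Approximation of JCS (RFC 8785) for simple payloads: sorted keys,
    # compact separators, no ASCII escaping of non-ASCII characters.
    # NOT fully compliant -- use a real JCS library in production.
    return json.dumps(input_data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


print(canonicalize({"b": 1, "a": {"d": 2, "c": 3}}))
# → {"a":{"c":3,"d":2},"b":1}
```

Note that sorting applies recursively, so nested objects are canonicalized as well.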
#### Step 1.2: Pre-image Construction

The pre-image shall be constructed by concatenating the `identifier_from_purchaser`, a semicolon separator (`;`), and the canonical JSON string from Step 1.1.

```
string_to_hash = identifier_from_purchaser + ";" + canonical_input_json_string
```
#### Step 1.3: Hashing

The `string_to_hash` must be encoded to bytes using UTF-8 and hashed with SHA-256. The result must be a lowercase hexadecimal string.
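Steps 1.1 through 1.3 can be sketched end to end as follows. This is a minimal illustration: the `input_hash` name is hypothetical, and the `json.dumps` call stands in for a fully compliant RFC 8785 serializer, which should be used in real implementations.

```python
import hashlib
import json


def input_hash(input_data: dict, identifier_from_purchaser: str) -> str:
    # Step 1.1: canonical serialization (simplified stand-in for RFC 8785).
    canonical = json.dumps(input_data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    # Step 1.2: pre-image with the semicolon delimiter.
    string_to_hash = identifier_from_purchaser + ";" + canonical
    # Step 1.3: UTF-8 encode and hash with SHA-256; hexdigest() is lowercase hex.
    return hashlib.sha256(string_to_hash.encode("utf-8")).hexdigest()


digest = input_hash({"prompt": "hello"}, "purchaser-123")
print(len(digest))  # 64 hex characters
```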
### 2. Output Hashing

The output hash is generated from the agent's `output` string and the same `identifier_from_purchaser` string.
#### Step 2.1: Output Data

The `output` data is treated as a raw UTF-8 string.
#### Step 2.2: Pre-image Construction

The pre-image shall be constructed by concatenating the `identifier_from_purchaser`, a semicolon separator (`;`), and the raw `output` string.

```
string_to_hash = identifier_from_purchaser + ";" + output
```
#### Step 2.3: Hashing

The `string_to_hash` must be encoded to bytes using UTF-8 and hashed with SHA-256. The result must be a lowercase hexadecimal string.
## Proposed Implementation
The following Python code serves as a reference implementation. It demonstrates how to implement the standard efficiently by adhering to the DRY (Don't Repeat Yourself) principle, using a shared internal function for the common hashing logic while separating the distinct data preparation steps.
```python
import hashlib
import logging

import canonicaljson

logger = logging.getLogger(__name__)


def _create_hash_from_payload(payload_string: str, identifier_from_purchaser: str) -> str:
    """
    Internal core function that performs the standardized hashing.
    It takes the final, processed data payload string and the identifier.
    """
    # Steps 1.2, 2.2: Construct the pre-image with a semicolon delimiter.
    string_to_hash = f"{identifier_from_purchaser};{payload_string}"
    logger.debug("Pre-image for hashing: %s", string_to_hash)
    # Steps 1.3, 2.3: Encode to UTF-8 and hash with SHA-256 (lowercase hex).
    return hashlib.sha256(string_to_hash.encode("utf-8")).hexdigest()


def create_masumi_input_hash(input_data: dict, identifier_from_purchaser: str) -> str:
    """
    Creates an input hash according to MIP-004.
    This function handles the specific pre-processing for input data (JCS).
    """
    # Step 1.1: Serialize the input dict using JCS (RFC 8785).
    canonical_input_json_string = canonicaljson.encode_canonical_json(input_data).decode("utf-8")
    logger.debug("Canonical input JSON: %s", canonical_input_json_string)
    # Call the core hashing function with the processed data.
    return _create_hash_from_payload(canonical_input_json_string, identifier_from_purchaser)


def create_masumi_output_hash(output_string: str, identifier_from_purchaser: str) -> str:
    """
    Creates an output hash according to MIP-004.
    This function uses the raw output string as the payload.
    """
    # Step 2.1: The output is a raw string, so no special processing is needed.
    return _create_hash_from_payload(output_string, identifier_from_purchaser)
```
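Verification is the mirror image of hashing: a counterparty who holds the identifier and the received `output` recomputes the digest and compares it to the claimed one. A self-contained sketch (the `verify_output` helper is hypothetical, not part of the standard):

```python
import hashlib


def verify_output(output: str, identifier_from_purchaser: str, claimed_hash: str) -> bool:
    # Recompute the MIP-004 output hash (Steps 2.2 and 2.3) and compare
    # it against the value the agent committed to.
    pre_image = f"{identifier_from_purchaser};{output}"
    recomputed = hashlib.sha256(pre_image.encode("utf-8")).hexdigest()
    return recomputed == claimed_hash
```

A mismatch indicates that either the output or the identifier was altered after the hash was produced.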
## Rationale
This proposed solution is chosen for the following reasons:
- End-to-End Verifiability: Hashing both the input and output with a shared identifier creates a cryptographically verifiable record of the entire transaction, from request to result.
- Unambiguous by Design: The use of a fixed semicolon (`;`) separator between the identifier and the data payload is a critical security feature. It creates an unambiguous pre-image, preventing potential 'Concatenation Ambiguity' attacks where a malicious actor could craft inputs to cause a hash collision.
- Based on Open Standards: The standard leverages widely adopted and vetted standards: SHA-256 for cryptographic security and RFC 8785 for deterministic serialization of inputs.
- Simplicity and Clarity: The processes are straightforward and easy for developers to understand and implement across different programming languages, fostering wide adoption.
- Deterministic by Design: The use of canonical JSON for inputs is critical for ensuring that any participant in the network can reliably reproduce the input hash given the same initial data.
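The concatenation-ambiguity point can be demonstrated directly: without a delimiter, two different (identifier, payload) pairs can produce the same pre-image and therefore the same hash, while the semicolon keeps them distinct (assuming the identifier itself does not contain a `;`).

```python
import hashlib


def h(s: str) -> str:
    # SHA-256 of a UTF-8 string, lowercase hex.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


# Naive concatenation: ("ab", "c") and ("a", "bc") yield the same pre-image.
assert h("ab" + "c") == h("a" + "bc")

# With the fixed ';' delimiter the pre-images differ, so the hashes differ.
assert h("ab" + ";" + "c") != h("a" + ";" + "bc")
```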
## Risks and Considerations
- Limited Extensibility: The current proposal does not include a versioning or a flexible structure for adding more contextual data to the hash (e.g., a timestamp or nonce). Future extensions would require a new MIP.
- Implementation Correctness: The security of the entire system relies on all parties using a correct and compliant RFC 8785 library for the input hash. An incorrect implementation would lead to hash mismatches and system failure.
- Output Data Type: This specification assumes the agent `output` is a UTF-8 string. If agents need to produce structured data (like JSON) or binary data, a future MIP may be required to define a canonicalization or encoding scheme for outputs to ensure determinism.