The quest for accurate data (Part 1): “Integrity” versus “Authenticity.”
Geneva, Switzerland, November 10, 2022
“Integrity” and “Authenticity” are the cryptographic data security properties that ensure data accuracy across a digital network, connection, or exchange. Unfortunately, the two properties are not always enforced together, which leads to security degradation. For example, even if a data source is authentic, it does not necessarily follow that the metadata and data in the message have integrity. This blog post delves into the dualism of the two data security properties and how they translate to data capture and data entry processes in a digital system. Together, these properties ensure uncompromised cryptographic data security and provide the basis for structuring metadata and data to preserve the original context of messages at any interaction point in a data lifecycle.
Data accuracy, the sole standard of data quality, refers to the consistency of data with reality. A data message requires the following characteristics to be deemed accurate:
Authenticable lineage assures where the data originated and where it moves over time in a chain of interrelated events.
Incorruptible content assures that the data was not intentionally or unintentionally modified from source to target.
Comprehensible meaning provides information about the context of the data and how to interpret and comprehend it.
Nowadays, it is common in digital systems to receive uninterpretable data in its original format from a known source. Furthermore, approximately 80% of recorded data in today's digital landscape is unstructured, and the trend is towards even more unstructured data [1][2]. However, data is far easier for systems to interpret for artificial intelligence, machine learning, and statistical analysis when it is tied to structured metadata that provides comprehensible meaning.
Without “meaning”, even the most secure data won’t provide value because it is difficult to interpret for processing.
The integrity of digital objects and textual relationships
Data semantics is the study of the meaning and use of specific pieces of data in computer programming and other areas that employ data. Without a system of interpretation, data has no inherent structural, definitional, or textual meaning. This interpretation is provided by "metadata", sets of data that give meaning to any stored sequence of bytes. In computer science, an object can be a variable, a data structure, a function, or a method [3]. Metadata organises textual data by using attributes associated with a particular object. Any change in metadata could influence the meaning of the data it describes, which is why assuring metadata integrity is as important as guaranteeing the integrity of the data itself.
Data capture is the process of collecting structured and unstructured information electronically and converting it into data readable by a computer or other electronic device. The process entails applying structural, definitional, and textual definitions (“metadata”) to interpret data that adheres to those definitions. In a balanced digital network [4], connection, or exchange, data capture requires objects to be deterministically identifiable and content-addressable to ensure the integrity of the textual content of the message.
So, "Integrity" relates to objects. All objects and their relationships MUST be deterministic to ensure data accuracy, completeness, and consistency. An object is deterministic if any operation's result and final state depend solely on the initial state and the operation's arguments.
We can achieve textual integrity by combining data integrity with metadata integrity.
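To make this concrete, here is a minimal sketch in Python, assuming SHA-256 over a canonical JSON serialisation; the attribute names and the shape of the metadata are illustrative, not a prescribed format. It shows deterministic, content-addressable objects and a digest that binds the data to the metadata describing it, i.e., textual integrity as the combination of data integrity and metadata integrity.

```python
import hashlib
import json

def canonical_digest(obj: dict) -> str:
    """Deterministically serialise an object and return its SHA-256 digest.

    The same content always yields the same digest (content-addressability);
    any change to the object, however small, yields a different digest.
    """
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Metadata: structural, definitional, and textual definitions for the captured data.
metadata = {
    "attributes": {"first_name": "Text", "date_of_birth": "DateTime"},
    "labels": {"first_name": "First name", "date_of_birth": "Date of birth"},
}

# Data: a message that adheres to the definitions above.
data = {"first_name": "Robert", "date_of_birth": "1970-01-01"}

metadata_digest = canonical_digest(metadata)   # metadata integrity
data_digest = canonical_digest(data)           # data integrity

# Textual integrity: a digest that binds the data to the metadata describing it.
textual_digest = canonical_digest({"metadata": metadata_digest, "data": data_digest})

print(metadata_digest, data_digest, textual_digest)
```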
The authenticity of digital events
Data inputs are sequences of bytes that provide information to a computer at a given moment; once stored, those sequences of bytes are known as "data". In computer science, a tractual event is an action or occurrence identified by a program that has significance for system hardware or software.
Data entry is the process of transcribing information into an electronic medium such as a computer or other electronic device. The process entails storing state changes as recorded events to determine the authenticity of the data origin (the “source”), its status, and where it moves over time. In a balanced digital network [4], connection, or exchange, data entry requires append-only logs to accompany signed data inputs to identify the origin and creation of tractual events at recorded moments so that the inputted data can be considered authentic.
So, "Authenticity" relates to events where cryptographic unicity is the only available tool in the digital landscape. All recorded events MUST be associated with at least one public/private key pair to be considered authentic. Public/private key pairs provide the underpinning for all digital signatures, a mathematical scheme for certifying that event log entries are authentic.
The veracity of inputted data
With deterministic data capture and authentic data entry processes, cryptographically accurate data is possible. However, there is another data security property to consider: “Veracity”, the truthfulness of the data, which is epistemic and, therefore, comes down to human accountability. So, even though a data capture construct may facilitate semantically accurate data, a free-form text field at the capture point would still allow the user to enter erroneous data. Unfortunately, technology is not a cure for human fallibility.
Consider the following hypothetical scenario. Suppose I were to enter "Rpbwrt" instead of "Robert" as my first name. The inputted data might have textual integrity (i.e., the structural, definitional, and textual characteristics of the data capture are deterministic) and tractual authenticity (i.e., you could verify that I entered it), and yet I have still managed to introduce a typographical error into the system, compromising the veracity of the data.
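To illustrate the point in code (continuing the hypothetical in Python; the pattern below is an assumed capture constraint, not part of any standard), a structural check at the capture point accepts "Rpbwrt" just as readily as "Robert", and a valid signature would only prove who entered the value, not that it is true.

```python
import re

# A structural constraint at the capture point: the field must be plausible name text.
# "Rpbwrt" satisfies it just as well as "Robert" does.
FIRST_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z'\- ]{0,63}$")

entered_value = "Rpbwrt"  # a typographical error by the human at the keyboard

structurally_valid = bool(FIRST_NAME_PATTERN.match(entered_value))  # True
print(f"textual integrity check passed: {structurally_valid}")

# A valid signature over the entry would likewise verify: it proves who entered
# the value, not that the value is true. Veracity stays with the human.
```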
The quest for "accurate data" is indeed challenging. We will address the veracity of inputted data in part 2 of the series.
The Dynamic Data Economy and Decentralised Semantics
Although treated as separate security properties, "Integrity" and “Authenticity" are interdependent. As with any successful union, they are interconnected and counterbalancing. For example, the accuracy of data lineage refers to the authenticity of the data origin, what happens to it, and where it moves over time. The accuracy of data contextualisation refers to the integrity of the message content so that all interacting actors can understand its meaning in context.
At the Human Colossus Foundation, we created and are fostering the development of the Dynamic Data Economy (DDE), the critical infrastructure for a data-agile economy, to empower people and organisations through better-informed decisions. The DDE Principles define these fundamental security principles (“Textual Integrity” and “Tractual Authenticity”), demonstrating their interdependence. The DDE Trust Infrastructure Stack recognises the differentiation between them, offering textual integrity to the Structure infrastructure (Layer 1) and tractual authenticity to the Causality infrastructure (Layer 2). In combination, the two properties provide the necessary cryptographic assurance to data capture and entry processes within any secure digital system, providing the basis to tackle contextual veracity, the primary security principle of the Knowledge infrastructure at Layer 3.
Guiding the tooling development for the DDE Structure Layer is Decentralised Semantics [5], an exciting new data domain spearheaded by the Foundation. Decentralised Semantics refactors the definition of integrity by putting the meaning of data at the core of textual integrity. This domain provides enhanced data modelling for an evolving data-agile economy comprising distributed data ecosystems where harmonisation between competing standards is essential for data object comprehension in any bilateral exchange. As a result, it enables the structuring of unstructured data with morphologic accuracy.
If you are interested in reading more about data accuracy, sign up for our newsletter and stay tuned for part 2 of the series.
References
When writing this post, we were conscious of the need for accuracy regarding the terms used in this article. However, as excessive technical precision can blur end-user readability, we restricted ourselves to explicit and standard references, the Oxford and Cambridge dictionaries, and have collected the definitions in a table for the reader's convenience.
[1] Deep Talk. 80% of the world’s data is unstructured (October 2021).
[2] Bean, R. Why Becoming a Data-Driven Organization Is So Hard (February 2022). Harvard Business Review.
[3] Wikipedia. Object (computer science).
[4] Knowles, P. Active and Passive Identifiers.
[5] Knowles, P., Mitwicki, R., Page, P. Decentralised semantics in distributed data ecosystems: Ensuring the structural, definitional, and contextual harmonisation and integrity of deterministic objects and objectual relationships (2022).