UP-SDCG: A Method of Sensitive Data Classification for Collaborative Edge Computing in Financial Cloud Environment

1. Introduction

With the advent of the big data era, data have been recognized as essential production factors, and ensuring data security is a fundamental requirement for promoting the data factor market. In this context, sensitive data refer to information that, if disclosed or compromised, could inflict significant harm on individuals or society. Sensitive data encompass personal privacy information, such as names, phone numbers, bank account numbers, ID numbers, addresses, passwords, email addresses, educational backgrounds, and medical records. They also include enterprise data that are not suitable for public disclosure, such as a company's operational details, IP address lists, and network structure.

Effectively addressing diverse and constantly evolving compliance requirements poses a formidable challenge. As digital transformation gains momentum, numerous countries' laws and regulations, coupled with security requirements stipulated by industry organizations (e.g., PCI DSS [1], SOX [2], HIPAA, GDPR [3], and CCPA [4]), underscore the importance of identifying and classifying sensitive data as the first step in data protection. Enterprises are confronted with the task of streamlining their compliance workflows by leveraging simplified technology environments and pre-built templates. This requires knowing precisely where their data reside and determining whether additional safeguards are necessary. It also involves identifying both structured and unstructured sensitive data, both locally and in the cloud, that fall under regulatory scrutiny. These data must then be categorized and cataloged for ongoing vulnerability monitoring.
In recent years, the financial industry has witnessed a rapid acceleration of the open banking model, where data applications are shared between banks and third-party service providers. More than 30 countries and regions worldwide have already adopted or are in the process of adopting this model [5]. Open banking offers numerous advantages, including enhanced customer experiences, the creation of new revenue streams, and the establishment of sustainable service models in markets with limited access to traditional banking services [5]. However, open banking also presents significant challenges, particularly concerning data security. The shared data encompass user identity information, financial transaction details, property, and other sensitive information. This extensive data sharing deepens the risk of data leakage and misuse [6].
To enhance the security of open banking data, we propose a sensitive data processing technique in a cloud-edge collaborative environment, as depicted in Figure 1. Firstly, financial institutions in the bank's central cloud need to conduct a comprehensive assessment of their data assets to create a visual map of sensitive data before sharing them with external parties. Secondly, the data application side (third-party organizations) deploys a regulatory outpost at the edge to ensure the security and compliance of open banking data. The regulatory outpost is an independent software system that monitors every data operation performed by the application side, including storage, retrieval, and sharing. It also incorporates sensitive data identification, anonymization, and watermarking, and records all user data operations for log auditing, leakage detection, data flow mapping, and situational awareness of data security [7].
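
To make the regulatory outpost's role concrete, the sketch below shows, in simplified Python, how an edge-side monitor might flag sensitive values and record each data operation for later auditing. The class name, patterns, and log fields are illustrative assumptions for exposition, not the implementation described in this paper.

```python
import json
import re
import time

# Illustrative sketch only: the class, patterns, and log fields below are
# assumptions, not the regulatory outpost implementation from this paper.
SENSITIVE_PATTERNS = {
    "phone": re.compile(r"\b1\d{10}\b"),           # simplified mobile-number pattern
    "id_number": re.compile(r"\b\d{17}[\dXx]\b"),  # simplified ID-number pattern
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

class RegulatoryOutpost:
    """Edge-side monitor that flags sensitive values and logs every data operation."""

    def __init__(self):
        self.audit_trail = []

    def record(self, operation, payload):
        # Flag any field whose value matches a known sensitive pattern.
        hits = [name for name, pattern in SENSITIVE_PATTERNS.items()
                if any(pattern.search(str(v)) for v in payload.values())]
        self.audit_trail.append({
            "ts": time.time(),
            "operation": operation,        # e.g. "store", "retrieve", "share"
            "sensitive_types": hits,
        })
        return hits

outpost = RegulatoryOutpost()
outpost.record("store", {"name": "Alice", "phone": "13812345678"})
print(json.dumps(outpost.audit_trail, indent=2))
```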

As this description shows, the regulatory outpost is deeply integrated into the application side's data processing workflow and leverages the characteristics of edge-based data processing. To avoid compromising the overall data processing experience or imposing significant costs on the application side, the deployment of the regulatory outpost should meet the following requirements:

  • Elastic scalability of resources: Data processing on the application side requires computational resources that fluctuate with data volume, so the deployment should scale resources elastically to minimize the application side's investment costs.

  • Low bandwidth utilization cost and reduced data processing latency: Because the data traffic accessed by the application side must pass through the regulatory outpost, bandwidth utilization costs and data processing latency should be kept low to avoid degrading the application side's user experience.

  • Ensuring data compliance: In the context of open banking, the application side tends to store open banking data locally, so these data require compliance checks to prevent potential leaks. As shown in Figure 2, the application side can perform operations such as data desensitization and watermarking locally to enhance data security, with data classification and grading as the basis of these operations (see the sketch after this list).
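
As a minimal illustration of why classification and grading is the basis of local data protection, the sketch below masks each field according to an assumed sensitivity level produced by a classifier. The level scheme and masking rules are illustrative assumptions rather than rules defined by any financial industry standard.

```python
# Minimal sketch of classification-driven desensitization at the edge.
# The level scheme and masking rules are illustrative assumptions.
def desensitize(value: str, level: int) -> str:
    """Mask a field according to its assigned sensitivity level (higher = stricter)."""
    if level >= 3:                       # e.g. ID numbers, bank accounts
        return "*" * len(value)
    if level == 2:                       # e.g. phone numbers: keep prefix and suffix
        return value[:3] + "*" * max(len(value) - 7, 0) + value[-4:]
    return value                         # level 1: keep as-is

record = {"name": "Alice", "phone": "13812345678", "id_number": "110101199003074518"}
levels = {"name": 1, "phone": 2, "id_number": 3}   # output of classification and grading

masked = {field: desensitize(value, levels[field]) for field, value in record.items()}
print(masked)  # {'name': 'Alice', 'phone': '138****5678', 'id_number': '******************'}
```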

Hence, the automated classification and grading of sensitive data in the financial sector is garnering increasing attention. Firstly, financial institutions should conduct a comprehensive assessment of their own data landscape to obtain a visualized map of sensitive data assets before engaging in data sharing. Secondly, third-party service providers collaborating with financial institutions must strengthen their data security management capabilities in accordance with government regulatory requirements and their contractual agreements with financial institutions. This includes encrypting sensitive data in storage or applying data anonymization techniques, and it presupposes the prompt identification and classification of the sensitive data transmitted during collaboration.

The scenario above shows that the automated classification and grading of sensitive data in the financial domain is a fundamental capability of the technology platform. The financial industry currently employs two primary methods for data classification and grading. The first is manual classification, which spans multiple departments, making the process lengthy and inefficient; it also lacks reusability, which limits its scalability and adaptability. The second is automated classification and grading based on pattern matching against internally constructed data dictionaries; this approach suffers from low accuracy, especially when the data dictionaries are incomplete.
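
The dictionary-based baseline can be summarized by the following sketch, which matches column names against keyword lists and sample values against regular expressions. The categories, levels, keyword lists, and thresholds are illustrative assumptions; a column absent from the dictionary simply goes unrecognized, which is the accuracy problem noted above.

```python
import re

# Sketch of the pattern-matching baseline: categories, levels, keyword lists,
# and regular expressions are illustrative, not an actual industry data dictionary.
DATA_DICTIONARY = {
    "phone":   {"keywords": {"phone", "mobile", "tel"},
                "pattern": re.compile(r"^1\d{10}$"), "level": 2},
    "id_card": {"keywords": {"id_number", "idcard"},
                "pattern": re.compile(r"^\d{17}[\dXx]$"), "level": 3},
}

def classify_column(column_name, samples):
    """Match a column against the dictionary by header keyword or value pattern."""
    for category, entry in DATA_DICTIONARY.items():
        name_hit = column_name.lower() in entry["keywords"]
        value_hits = sum(bool(entry["pattern"].match(s)) for s in samples)
        if name_hit or value_hits / max(len(samples), 1) > 0.8:
            return category, entry["level"]
    return None  # unmatched column: the incomplete-dictionary case noted above

print(classify_column("mobile", ["13812345678", "13987654321"]))  # ('phone', 2)
print(classify_column("customer_ref", ["A-001", "A-002"]))        # None
```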

Building on these challenges, we present a data classification and grading framework for the financial industry that adheres to the relevant industry standards and covers both structured and unstructured data. For structured data, we introduce UP-SDCG, a novel sensitive data classification and grading algorithm that leverages self-enrichment and broadening techniques: an augmentation model expands the keywords and lexicons of the financial data hierarchical classification library, and a synonym discrimination model expands them further, which significantly improves the accuracy and recall of classification and grading. In future work, we aim to develop a scheme for classifying and grading unstructured sensitive data, supporting both coarse-grained classification of documents containing sensitive information and a fine-grained approach that identifies the types and levels of sensitive data within a document.
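
The following sketch conveys the intuition behind lexicon self-enrichment: a candidate keyword is admitted into a category only if a synonym check accepts it. The token-level Jaccard similarity used here is merely a stand-in for UP-SDCG's synonym discrimination model, and the seed lexicon, candidates, and threshold are illustrative assumptions.

```python
# Conceptual sketch of lexicon self-enrichment: a candidate keyword joins a
# category only if a synonym check accepts it. The Jaccard similarity below is
# a crude stand-in for UP-SDCG's synonym discrimination model; the seed
# lexicon, candidates, and threshold are illustrative assumptions.
def token_similarity(a, b):
    sa, sb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(sa & sb) / len(sa | sb)

def enrich_lexicon(lexicon, candidates, threshold=0.5):
    """Attach each candidate keyword to the best-matching category, if close enough."""
    for cand in candidates:
        best_cat, best_score = None, 0.0
        for category, keywords in lexicon.items():
            score = max(token_similarity(cand, kw) for kw in keywords)
            if score > best_score:
                best_cat, best_score = category, score
        if best_cat is not None and best_score >= threshold:   # synonym accepted
            lexicon[best_cat].add(cand)
    return lexicon

seed = {"phone": {"phone", "mobile_phone"}, "address": {"address", "home_address"}}
print(enrich_lexicon(seed, ["contact_phone", "mailing_address", "order_id"]))
# 'contact_phone' and 'mailing_address' are accepted; 'order_id' is rejected.
```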

Our research makes the following contributions:

  • We propose a financial data classification and grading framework and a self-enriching structured sensitive data classification and grading algorithm named UP-SDCG, which innovatively introduces a synonym discrimination model to further expand keywords and lexicons.

  • In tests on real-world financial industry data, UP-SDCG outperforms existing public-cloud algorithms in accuracy and recall for sensitive data classification and grading.

  • We further propose a design scheme and scenario analysis for unstructured sensitive data classification and grading.
