Ongoing Projects

For a list of completed projects, please visit here.

RGC/GRF 15210023: “Local Tweaks for Privacy-Preserving Training in Machine Learning at Scale” (2024-2026, HK$1,228,619)

Abstract:

In 2021, the global machine-learning-as-a-service (MLaaS) market was valued at USD 2.26 billion, and it is expected to reach USD 16.7 billion by 2027 at a CAGR of 39%.[1] MLaaS has empowered large machine learning models, especially transformer-based ones such as GPT-3, Gopher, Megatron-Turing NLG, and BEiT-3, for new applications across health, finance, manufacturing, logistics, public service, and education. As a side effect, training these large models requires huge amounts of data from both public and private sources.

However, recent privacy infringements against pretrained models on MLaaS platforms have revealed imminent threats to data owners. In particular, these models suffer from membership inference and model inversion attacks,[2][3][4] which compromise the privacy of training data and thus severely undermine the public's faith in contributing private data for training. Although some privacy-preserving techniques have been proposed, e.g., DP-SGD, they typically operate at the server end, which has good or even perfect knowledge of the training process, e.g., the model architecture and hyperparameters. Unfortunately, data owners, especially individuals who contribute sensitive training data (e.g., medical and financial records), have little insight into whether, or to what extent, their data are vulnerable once pretrained models are released.
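
To make the server-side baseline concrete, the following is a minimal sketch of the clip-and-noise step at the heart of DP-SGD, written in plain Python/NumPy. The function and parameter names are illustrative only, not taken from any particular library, and real systems additionally perform per-layer clipping and privacy accounting.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.05):
    """One simplified DP-SGD update: clip each example's gradient, average, add Gaussian noise."""
    clipped = []
    for g in per_example_grads:                                   # g: flattened gradient of one example
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # bound each example's influence
    avg = np.mean(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm / len(clipped), size=avg.shape)
    return params - lr * (avg + noise)                            # noisy gradient descent step

# Note: the clip norm, noise multiplier, and model architecture are all chosen by the *server*,
# which is exactly the knowledge an individual data owner lacks.
```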

In this project, we will comprehensively study local protection strategies for data owners who contribute sensitive training data or participate in a federated learning process, two of the most popular machine learning paradigms. For the first time, data owners are put in the spotlight to address privacy threats arising from pretrained models. Upon completion of this project, we will be able to answer two questions in privacy-preserving machine learning for data owners:

  • How can a data owner locally identify high-risk training examples in the training process?
  • What local strategies can a data owner adopt to protect her training examples from being inferred by MLaaS users who have access to a model trained on these examples?

With our extensive research experience in differential privacy and adversarial machine learning, we believe this project will contribute to the security foundation for any booming industry powered by machine learning. We also expect this project to spark off endeavors in related sectors such as computer vision, fintech, robotics, and AI health, where the privacy of training data is crucial. The outcomes will benefit the research community, the IT industry, and Hong Kong, “a world class smart city” in a blueprint drawn by the government.


[1] “Machine Learning as a Service (MLaaS) Market – Growth, Trends, COVID-19 Impact, and Forecasts (2022 – 2027)”, GlobeNewswire, June 13, 2022.

[2] N. Carlini et al. “Extracting Training Data from Large Language Models.” USENIX Security Symposium, 2021.

[3] M. Gomrokchi et al. “Where Did You Learn That From? Surprising Effectiveness of Membership Inference Attacks Against Temporally Correlated Data in Deep Reinforcement Learning.” arXiv:2109.03975, Oct 2021.

[4] J. Geiping et al. “Inverting Gradients – How easy is it to break privacy in federated learning?” NeurIPS, 2020.

RGC/GRF 15208923: “Small Leaks Sink Great Ships: Data Recovery Attacks and Defense in Local Differential Privacy” (2024-2026, HK$1,096,927)

With the boom of big data and AI, user data are collected through web, mobile, IoT, and even wearable devices for automated operations and intelligent decisions. Since such collection often puts user privacy at stake, local differential privacy (LDP) has emerged to sanitize data with guaranteed deniability. Despite its debatable accuracy, LDP has been widely adopted in many large-scale collection endeavors, such as telemetry data in Apple iOS, Google Android, and Microsoft Windows. More recently, the 2020 US Census adopted differential privacy to protect sensitive data of US residents, such as geographic location, gender, ethnicity, and age.[1]

However, LDP is not a panacea for every collection use case. Besides its low utility for non-trivial or small-scale tasks,[2] in this project we show that the sanitized data are vulnerable to recovery by attackers with strong auxiliary knowledge. Although inference attacks and information leakage have been discovered in centralized differential privacy mechanisms,[3][4] no work has identified and addressed such vulnerabilities in the more pervasive local differential privacy mechanisms.
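
As a toy illustration of this vulnerability, consider binary randomized response, a basic LDP mechanism: an attacker who knows the population prior can apply Bayes' rule to the sanitized report and recover the true value with high confidence when the prior is skewed. The snippet below is only a sketch under these assumptions; it is not one of the attacks studied in the project.

```python
import math

def rr_truth_prob(epsilon):
    """Binary randomized response: report the true bit with prob p, flip it with prob 1 - p."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1.0)

def posterior_true_is_one(report, prior_one, epsilon):
    """Attacker's posterior Pr[x = 1 | report], given the population prior Pr[x = 1]."""
    p = rr_truth_prob(epsilon)
    like_one  = p if report == 1 else 1 - p        # Pr[report | x = 1]
    like_zero = 1 - p if report == 1 else p        # Pr[report | x = 0]
    return like_one * prior_one / (like_one * prior_one + like_zero * (1 - prior_one))

# With a skewed prior (e.g., 95% of users truly have x = 1) and a typical epsilon,
# even a "sanitized" report of 1 pins the true value down with ~99% confidence.
print(posterior_true_is_one(report=1, prior_one=0.95, epsilon=2.0))  # ~0.993
```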

We will investigate data recovery attacks under two types of auxiliary knowledge: (1) probabilistic distributions, including domain distribution, user correlation, and dimension dependence, and (2) pre-trained machine learning models. Defense techniques such as collaborative sampling, dependence-quantified, and randomized-index-based LDP mechanisms will also be studied. Upon completion, this project will contribute to the privacy and security foundation of the booming big data and AI industry. The outcomes will benefit the research community, all data-driven industries, and the citizens of Hong Kong.


[1] Disclosure Avoidance Modernization – Census Bureau. https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance.html

[2] A Supreme Court Fight Over Census Data Privacy and Redistricting Is Likely Coming. Public Broadcasting Atlanta, Jun 29, 2021. https://www.wabe.org/a-supreme-court-fight-over-census-data-privacy-and-redistricting-is-likely-comin/

[3] C. Liu, S. Chakraborty, and P. Mittal. “Dependence Makes You Vulnerable: Differential Privacy Under Dependent Tuples.” NDSS, 2016.

[4] W. Huang, S. Zhou, and Y. Liao. “Unexpected Information Leakage of Differential Privacy Due to the Linear Property of Queries.” IEEE Transactions on Information Forensics and Security, 16: 3123–3137, 2021.

RGC/GRF 15209922: “Evasive Federated Learning Attacks through Differential Privacy: Mechanisms and Mitigations” (2023-2025, HK$941,434)

Abstract: Federated learning brings benefits to both data privacy and communication/computation cost savings. However, recent attacks have exposed the vulnerabilities of federated learning in the presence of malicious parties.[1] These attacks, including model/data poisoning and privacy leakage, severely undermine the public’s faith in federated learning and jeopardize innovative applications built upon it. Although detection and mitigation schemes have been proposed, they assume perfect knowledge of the attack strategy (i.e., white-box defense), leaving room for attackers to morph and evade them. In turn, such evasion can be detected and mitigated by new defense schemes, a never-ending battle in the arena of federated learning.

In this project, we break out of this loop by escalating such evasion attacks to a new level with the rigorous tool of (local) differential privacy (LDP). Differential privacy was proposed to let distributed data contributors report their sensitive information to an untrusted data collector for statistical analysis. We, on the other hand, treat attacks as data and conceal their true intention using differential privacy, which guarantees deniability and confuses the server. For example, a poison example or model update that targets the global model is perturbed by differential privacy mechanisms to disguise its true intention (i.e., the targeted class label). This will be the first time differential privacy is exploited as a malicious tool for evasive attacks.
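
A highly simplified sketch of this intuition follows: a malicious client perturbs its poisoned model update with Gaussian noise calibrated to a privacy budget before submission, so the update becomes statistically deniable and harder to flag than a raw poison. All names and parameters below are hypothetical and only convey the idea; the project studies far more careful constructions.

```python
import numpy as np

def gaussian_mechanism(update, sensitivity, epsilon, delta):
    """Perturb a vector with Gaussian noise calibrated to (epsilon, delta)-differential privacy."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return update + np.random.normal(0.0, sigma, size=update.shape)

def craft_deniable_poison(benign_update, target_direction, scale=3.0,
                          sensitivity=1.0, epsilon=4.0, delta=1e-5):
    """Toy attacker: push the update toward a target direction, then hide the push behind DP noise."""
    poisoned = benign_update + scale * target_direction
    return gaussian_mechanism(poisoned, sensitivity, epsilon, delta)
```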

Upon completion of this project, we can answer two core security questions in federated learning:

  • Can an adversary create evasive poisons without knowing the detection strategy while still retaining poisoning effectiveness?
  • Is there any mitigation strategy that still works against such strong evasive attacks?

With our extensive research experience in both differential privacy and adversarial machine learning, we believe this project will contribute to the security foundation for the booming industry of federated learning and big data. We also expect this project to spark off endeavors in related sectors such as the Internet of Things and edge computing, where the integrity of data and models is critical. The outcomes will benefit the research community, the IT industry, and Hong Kong, “a world class smart city” in a blueprint drawn by the HKSAR government.


[1] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, Dimitris S. Papailiopoulos. “Attack of the Tails: Yes, You Really Can Backdoor Federated Learning.” NeurIPS, 2020.

National Natural Science Foundation of China (NSFC) Major Research Plan Cultivation Project 92270123: “Mutual Security Analysis of Machine Learning Models on Untrusted Data Sources” (2023-2025, CNY 800,000)

Abstract: Thanks to advances in big data techniques, computational power, deep neural networks, and cloud computing, machine learning has risen to the forefront of artificial intelligence. However, solutions based on machine learning rely heavily on the correctness and accuracy of the data sources used for training. As such, in the face of unreliable sources, malicious attacks, and corporate dishonesty, machine learning can become extremely vulnerable. On the other hand, publishing a machine learning model may also impose high risk on the model itself or the underlying training data, especially for models operated in public clouds as machine learning as a service (MLaaS).

In this project, we will adopt a holistic perspective on the security needs and threats of both source data and machine learning models. We aim to investigate state-of-the-art adversarial machine learning attacks and defense schemes that target either data or models, and to use this mutual security analysis to generalize, enhance, and defend against such attacks. We believe the outcome of this project can pave the way for the secure adoption of machine learning in essential business decisions and mission-critical AI activities, where both model and data security are vulnerable yet indispensable.

RGC/GRF 15226221: “Sword of Two Edges: Adversarial Machine Learning from Privacy-Aware Users” (2022-2024, HK$838,393)

Abstract: With the emergence of more stringent data privacy legislation, such as the GDPR in the EU (in effect since May 2018) and the California Consumer Privacy Act (CCPA, in effect since Jan 2020), corporations and businesses with large amounts of customer data are willingly seeking out privacy-preserving techniques, such as differential privacy and homomorphic encryption, to comply with these laws. This endeavor has been accompanied by another effort to protect machine learning against adversarial attempts, such as adversarial samples, model extraction, and training data poisoning.[1][2]

As of now, these two lines of research are a “sword of two edges”: they are studied in parallel but seldom considered within the scope of the same research. However, for the reasons above, we believe it will soon become imperative for academia and industry to cope with them simultaneously. While this will certainly raise more challenges for the research community, we see it as an opportunity to think outside the box and reexamine the way we design privacy-protection schemes and investigate adversarial attempts in real-life machine learning applications. In particular, a core question we would like to answer is: “Will privacy-protected data make adversarial attacks on machine learning simpler or more difficult?”

With our extensive research experience in both data privacy protection and adversarial machine learning, we believe this project will contribute to the security foundation for the booming industry of machine learning and big data. We also expect this project to spark off endeavors in related sectors such as the Internet of Things and edge computing, where both the privacy and security of data and models are critical. The outcomes will benefit the research community, the IT industry, and the city of Hong Kong, “one of the world’s smart cities”, a vision quoted from the Chief Executive’s Policy Address.


[1] Will Douglas Heaven. “Facebook wants to make AI better by asking people to break it.” MIT Technology Review, Sept 24, 2020.

[2] David Danks. “How Adversarial Attacks Could Destabilize Military AI Systems.” IEEE Spectrum, Feb 26, 2020.

RGC/GRF 15225921: “Byzantine-Robust Data Collection under Local Differential Privacy Model” (2022-2024, HK$838,393)

With the proliferation of big data science, it has become essential for businesses and governments to collect individual user data through web, mobile, wearable, and IoT devices for various decision-making and analytics tasks. For example, during the recent COVID-19 pandemic, Google and Facebook tracked and published users’ mobility reports to help the public understand the relationship between population movement and the spread of the disease.[1] Since such data may contain personal information, local differential privacy (LDP) has been widely adopted by Apple, Google, and Microsoft to collect sensitive data for statistical analysis and estimation, and the 2020 US Census likewise adopted differential privacy to protect its data.[2]

However, one weakness of LDP is its vulnerability to malicious users. As pointed out by many researchers and data scientists such as Jonathan Matus,[3] individuals may intentionally provide wrong data for the following reasons: (1) dishonesty or unwillingness to share data, (2) erroneous LDP implementations or inconsistent configurations among users, and (3) compromised user devices. The last case often occurs in mobile and IoT systems, where devices are vulnerable to hacking. We collectively call such users “Byzantine users”, a term borrowed from the “Byzantine generals problem”,[4] in which a mixed group of loyal generals and traitors must reach consensus on whether to attack or retreat. Byzantine-robust statistical estimation and federated learning have received much attention recently, but no work has addressed the more challenging problem of Byzantine-robust statistical estimation under LDP, where Byzantine users can disguise their malicious inputs behind the perturbation noise of LDP, as illustrated by the sketch below.
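
To see how LDP noise can shield Byzantine users, consider binary randomized response again: the server debiases the aggregated reports to estimate the fraction of 1s, and a small coalition of users who always report 1 (ignoring the protocol) can shift that estimate well beyond their own share, while their individual reports look indistinguishable from honest noisy ones. The sketch below illustrates this under simple assumptions; it is not the estimation mechanism studied in this project.

```python
import math, random

def rr_report(x, epsilon):
    """Honest randomized response for a bit x."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return x if random.random() < p else 1 - x

def debiased_frequency(reports, epsilon):
    """Server-side unbiased estimate of Pr[x = 1] from randomized-response reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return (sum(reports) / len(reports) - (1 - p)) / (2 * p - 1)

random.seed(0)
epsilon, n, true_rate, byz_frac = 1.0, 10000, 0.10, 0.05
honest = [rr_report(1 if random.random() < true_rate else 0, epsilon)
          for _ in range(int(n * (1 - byz_frac)))]
byzantine = [1] * int(n * byz_frac)   # always report 1, indistinguishable from a noisy honest report
print(debiased_frequency(honest + byzantine, epsilon))   # ~0.17, well above the true 10%
```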

In this project, we investigate this problem by first studying the impact of Byzantine users on the estimation accuracy of state-of-the-art LDP mechanisms, based on which we will propose new LDP mechanisms that are Byzantine-robust for both snapshot and continual statistics estimation. Upon completion, this project will contribute to the privacy and security foundation of the booming big data and AI industry. The outcomes will benefit the research community, all data-driven industries, and the Hong Kong government’s Smart City Blueprint.


[1] E. Waltz. “How Facebook and Google Track Public’s Movement in Effort to Fight COVID-19.” IEEE Spectrum, 16 Apr 2020.

[2] J. Mervis. “Can a set of equations keep U.S. census data private?” Science, Jan 2019.

[3] J. Matus. “The Data Dilemma: Data Benefits The Entire Population But Depends On Trust.” Forbes, Aug 2020.

[4] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” ACM Trans. Program. Lang. Syst., vol. 4, no. 3, pp. 382–401, 1982.

RGC/GRF 15203120: “Securing Models and Data for Machine Learning at the Edge” (2021-2024, HK$845,055)

Abstract: Edge computing refers to gathering, analyzing, and acting on data at the periphery of the network, as close to data sources as possible. This is achieved by deploying edge computing devices (or simply, edge devices) that can triage and even process raw data wherever and whenever they are generated. Gartner Inc. estimated that by 2021 more than 25 billion edge devices would be connected to the Internet, and that by 2022 more than 50% of enterprise data would be created and processed outside the cloud or data center, with most of it coming from edge devices.

As machine learning (ML) has been widely adopted in computer vision, human-computer interaction, and IoT, many edge devices have become smart. Equipped with capable GPUs/CPUs, they can complete ML tasks such as object detection in video surveillance, voice control in smart homes, and intelligent sensing in industrial IoT. Compared with cloud-based ML solutions, machine learning at the edge reduces network traffic, lowers response time, and avoids a single point of failure.

However, while cloud servers are hosted in physically secured bunkers and protected by firewalls, edge devices are more exposed to hostile environments and have limited hardware/software resources. In this project, we will investigate two common threats to ML models and data, namely model extraction and sample poisoning. To resolve them under different security assumptions, we will propose a variety of protective schemes based on modern privacy and cryptographic tools such as differential privacy, homomorphic encryption, ORAM, and aggregate message authentication codes.
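
As one example of the kind of lightweight defense an edge device can apply locally, the sketch below adds calibrated noise to a model's output probabilities before they leave the device, a common way to slow down query-based model extraction. This is a generic illustration under our own assumptions (hypothetical function and parameter names), not one of the schemes proposed in the project.

```python
import numpy as np

def noisy_prediction(logits, temperature=1.0, noise_scale=0.1, rng=None):
    """Return the predicted label plus perturbed confidence scores, limiting what
    an extraction adversary learns from each query. Illustrative only."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    noisy = probs + rng.laplace(0.0, noise_scale, size=probs.shape)  # perturb confidences
    noisy = np.clip(noisy, 0.0, None)
    noisy /= noisy.sum()
    return int(np.argmax(probs)), noisy   # keep the top-1 label accurate, blur the rest
```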

National Natural Science Foundation of China (NSFC) General Program 62072390: “Integrity Assurance and Fraud Detection for Machine Learning as a Service” (2021-2024, CNY 570,000)

Abstract: Machine Learning as a Service (MLaaS), which provides a cost-effective solution to two scarce resources in machine learning, intensive CPU/GPU power and well-trained data scientists, has received tremendous attention in both research and industry. All leading public cloud vendors have started to offer MLaaS, including Amazon ML, Microsoft Azure ML Studio, Google Prediction API, and IBM Watson ML. In MLaaS, users provide training samples, and the MLaaS provider helps select a suitable machine learning model and learning algorithm, performs the training, and returns inference results to clients for new samples. Success stories have already circulated in various business domains such as natural language processing (e.g., Google Cloud Translation API), computer vision (e.g., Microsoft Azure Face API), and speech recognition (e.g., Amazon Lex). Reuters has forecast that the MLaaS industry will witness a 49 percent annual growth rate from 2017 to 2023 and reach a value of USD 7.6 billion by the end of 2023.

However, MLaaS inherits trust and integrity issues from SaaS and DBaaS, as it is built upon them. Such issues can arise from resource exhaustion, service outages, media failure, hacking attacks, or even corporate dishonesty. In the literature, adversarial machine learning has already demonstrated various attacks such as model extraction, training data poisoning, and adversarial example attacks. Unfortunately, traditional integrity assurance techniques such as digital signatures and message authentication codes do not work in MLaaS due to two unique challenges. First, machine learning models are highly complex, and their training and inference results are usually uncertain. Second, in MLaaS there is no guarantee of the integrity of training samples, which may come from distant data sources such as IP cams and smart speakers through insecure networks and untrusted delegates.

In this project, we propose integrity assurance schemes throughout the entire MLaaS cycle. We will resolve the integrity issues of MLaaS providers (against training fraud), training samples (against sample poisoning), and clients who request MLaaS inference services (against model extraction). We believe this project will contribute to the security foundation for the booming MLaaS industry.