Viopolicy-Detector: An Automated Approach to Detecting GDPR Compliance Violations in Websites

Abstract

To provide personalized services, websites collect and track users' activity data, and each website relies on a privacy policy to establish the legality of these actions. The General Data Protection Regulation (GDPR) was implemented to protect the privacy of user data. Because the GDPR sets out general principles rather than detailed rules, it gives no specific guidance on what a privacy policy should contain. As a result, websites may still commit violations that put users' private data at risk of leakage. In this paper, we define a violation as any collection of data by a website that is not declared in its privacy policy. To detect such violations, we first interpret the GDPR and analyze 1000 website privacy policies to derive a classification of personal data into eight categories. Based on this classification, we propose a privacy policy annotation scheme covering the eight categories and collect 145 related Web APIs. We then propose an automated method to detect GDPR compliance violations in websites. On the one hand, we use a multi-label text classification model to extract the data collection stated in the privacy policy, with a precision of 0.9817. On the other hand, we dynamically monitor the website's JavaScript calls related to personal data collection during user visits. Finally, we compare the two results to determine whether a violation has occurred. We apply this method to the top 500 European websites and find that 159 of them violate the GDPR. We analyze the detection results from several perspectives, including the types of data declared in privacy policies, the data actually collected by websites, and which kinds of data collection are most likely to cause violations. We then classify the violating websites by category and find that websites in the Social category exhibit the most violations. Finally, we examine the rankings of the violating websites. Surprisingly, top-ranking sites are even more prone to violations, including some globally well-known websites such as BBC, Nokia, eBay, and Google.
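
The final comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The category labels, the `API_TO_CATEGORY` mapping, and the function name `find_violations` are hypothetical and chosen only for illustration, since the abstract does not list the paper's eight categories or its 145 Web APIs. The sketch assumes the classifier has already produced a set of declared categories and the monitor has produced a list of observed JavaScript API names.

```python
# Hypothetical mapping from monitored Web APIs to personal-data categories.
# These labels are illustrative assumptions, not the paper's actual taxonomy.
API_TO_CATEGORY = {
    "navigator.geolocation.getCurrentPosition": "location",
    "navigator.userAgent": "device",
    "document.cookie": "cookies",
}


def find_violations(declared, observed_calls):
    """Return categories collected at runtime but not declared in the policy."""
    # Map each observed call to its category, ignoring unrelated APIs.
    collected = {
        API_TO_CATEGORY[call] for call in observed_calls if call in API_TO_CATEGORY
    }
    # A violation is any collected category missing from the declared set.
    return collected - set(declared)


# Example: the policy declares cookies and device data only.
declared = {"cookies", "device"}
observed = [
    "document.cookie",
    "navigator.geolocation.getCurrentPosition",
    "window.alert",  # not related to personal data, so it is ignored
]
print(sorted(find_violations(declared, observed)))  # prints ['location']
```

A set difference keeps the check simple: a website is flagged when at least one collected category is missing from the declared set.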

Publication
In The 25th International Symposium on Research in Attacks, Intrusions and Defenses
Wenbo Guo

My research interests include Open Source Software Security and Software Supply Chain Security.