Cloak of Visibility: Detecting When Machines Browse A Different Web

Posted by 石欣然 on November 25, 2018


Papers

  1. Cloak of Visibility: Detecting When Machines Browse A Different Web

Introduction

Category

Engineering research

Problem pattern

Less studied problem / Well studied problems

Idea pattern

Empirical research

Motivation

 Background

What is the research problem? Why is it an important problem? Why can existing solutions not address the problem satisfactorily?

The arms race nature of abuse has spawned a contentious battle in the realm of web cloaking. Here, miscreants seeking to short-circuit the challenge of acquiring user traffic turn to search engines and advertisement networks as a vehicle for delivering scams, unwanted software, and malware to browsing clients. Although crawlers attempt to vet content and expunge harmful URLs, there is a fundamental limitation to browsing the web: not every client observes the same content. While split views occur naturally due to personalization, geo optimization, and responsive web design, miscreants employ similar targeting techniques in a blackhat capacity to serve enticing and policy-abiding content exclusively to crawlers while simultaneously exposing victims to attacks.

Literature Review

What are the existing solutions? What is their methodology?

Indeed, earlier studies predicated their analysis on a limited set of known cloaking techniques. These include redirect cloaking in search engine results or search visitor profiling based on the User-Agent and Referer of HTTP requests. An open question remains as to which companies and crawlers blackhat cloaking software targets, the capabilities necessary for security practitioners to bypass state-of-the-art cloaking, and ultimately whether blackhat techniques generalize across traffic sources including search results and advertisements.
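To make this kind of targeting concrete, here is a minimal sketch, not taken from the paper or from any real cloaking kit, of the server-side check that User-Agent and Referer cloaking performs before deciding which page to serve; the token lists and page names are hypothetical.

```python
# Hypothetical server-side cloaking check; token lists and page names are
# illustrative, not taken from any real cloaking kit.
CRAWLER_UA_TOKENS = ("googlebot", "adsbot", "bingbot")
SEARCH_REFERERS = ("google.com/search", "bing.com/search")

def choose_content(user_agent: str, referer: str) -> str:
    ua = (user_agent or "").lower()
    ref = (referer or "").lower()
    if any(token in ua for token in CRAWLER_UA_TOKENS):
        return "benign_page"    # crawlers only ever see policy-abiding content
    if any(r in ref for r in SEARCH_REFERERS):
        return "scam_page"      # organic search visitors receive the payload
    return "benign_page"        # direct visits stay clean to avoid suspicion

print(choose_content("Mozilla/5.0 (compatible; Googlebot/2.1)", ""))  # benign_page
```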

Research Niche

What motivates this work? e.g. motivated by the limitations of existing work or the lack of study for a specific issue.

 

Work

Research Objectives

What do the authors want to achieve in this work? What is the scope of this work? What questions does this paper want to address?

The contentious battle between web services and miscreants involved in blackhat search engine optimization and malicious advertisements has driven the underground to develop increasingly sophisticated techniques that hide the true nature of malicious sites. These web cloaking techniques hinder the effectiveness of security crawlers and potentially expose Internet users to harmful content. In this work, we study the spectrum of blackhat cloaking techniques that target browser, network, or contextual cues to detect organic visitors.

Challenge

Why achieving the research objectives is difficult?

First, dynamic content remains a concern. If miscreants can limit the deviations introduced by cloaking to within typical norms (e.g., including only a small new button or URL), the system may fail to detect the attack. That said, this also constrains an attacker in a way that reduces click through from users. Additionally, there is a risk with news sites and other frequently updated pages that a crawler will serve incoming visitors a stale digest due to an outdated crawl, thus burdening users with alerts that are in fact false positives.

Insight

What is the main inspiration that led to the new solution?

 

Research summary

What is the proposed approach/framework/technique(s)?

In this work, we explored the cloaking arms race playing out between security crawlers and miscreants seeking to monetize search engines and ad networks via counterfeit storefronts and malicious advertisements. While a wealth of prior work exists in the area of understanding the prevalence of content hidden from prying eyes with specific cloaking techniques or the underlying monetization strategies, none marries both an underground and empirical perspective that arrives at precisely how cloaking operates in the wild today. We addressed this gap, developing an anti-cloaking system that covers a spectrum of browser, network, and contextual blackhat targeting techniques that we used to determine the minimum crawling capabilities required to contend with cloaking today.

We informed our system’s design by directly engaging with blackmarket specialists selling cloaking software and services to obtain ten of the most sophisticated offerings. The built-in capabilities of these packages included blacklisting clients based on their IP addresses, reverse DNS, User-Agent, HTTP headers, and the order of actions a client takes upon visiting a miscreant’s webpage. We overcame each of these techniques by fetching suspected cloaking URLs from multiple crawlers that each emulated increasingly sophisticated legitimate user behavior. We compared and classified the content returned for 94,946 labeled URLs, arriving at a system that accurately detected cloaking 95.5% of the time with a false positive rate of 0.9%.
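As a rough illustration of the multi-profile crawling idea, the sketch below fetches one URL under a few hand-picked browser profiles and applies a crude divergence check; the profile names, header values, and the length-based heuristic are assumptions for illustration, not the paper's actual crawler configurations or similarity features.

```python
# A minimal sketch using the requests library; profile names and header values
# are illustrative assumptions, not the paper's crawler configurations.
import requests

PROFILES = {
    "googlebot_desktop": {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    },
    "chrome_desktop": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    },
    "chrome_mobile_from_search": {
        "User-Agent": "Mozilla/5.0 (Linux; Android 8.0; Pixel 2) AppleWebKit/537.36 Mobile",
        "Referer": "https://www.google.com/search?q=example",
    },
}

def fetch_all(url: str) -> dict:
    """Fetch the same URL once per profile and return the raw response bodies."""
    return {name: requests.get(url, headers=headers, timeout=10).text
            for name, headers in PROFILES.items()}

def looks_cloaked(bodies: dict, min_ratio: float = 0.5) -> bool:
    """Crude divergence check on body length only; the real system compares
    content hashes, screenshots, page elements, and request trees instead."""
    sizes = [len(body) for body in bodies.values()]
    return min(sizes) < min_ratio * max(sizes)
```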

Evaluation

Evaluation summary

How do the authors compare the new solution with existing ones? What are the comparison metrics? What is the scale of the experiment? What interesting observations do the authors make from their experimental results?

We present the overall accuracy of our system in Table VIII. We correctly detect 99.1% of Alexa URLs as non-cloaked with a false positive rate of 0.9%. To achieve this degree of accuracy, we overlook 18.0% of potentially cloaked counterfeit storefronts. If we examine the trade-off our system achieves between true positives and false positives, presented in Figure 2, we find no inflection point to serve as a clear optimum. As such, operators of de-cloaking pipelines must determine an acceptable level of false positives. For the remainder of our study we rely on a false positive rate of 0.9%.
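For readers who want to reproduce this kind of trade-off analysis, here is an illustrative way to pick the decision threshold that keeps the false positive rate at or below 0.9% using scikit-learn; `y_true` and `y_score` are placeholders for labeled evaluation data and classifier scores, not the paper's dataset.

```python
# Illustrative only: selecting the operating point that keeps the false
# positive rate at or below 0.9%. y_true / y_score are placeholders for
# labeled evaluation data and classifier scores.
import numpy as np
from sklearn.metrics import roc_curve

def threshold_at_fpr(y_true, y_score, max_fpr=0.009):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    within_budget = fpr <= max_fpr            # candidate operating points
    best = np.argmax(tpr * within_budget)     # highest TPR among them
    return thresholds[best], tpr[best], fpr[best]
```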

 

Implications

What are the implications of the evaluation results? What does it mean to the practitioners?

 

Novelty

Contributions

What are the contributions of the work? 

We frame our contributions as follows:

  • We provide the first broad study of blackhat cloaking techniques and the companies affected.
  • We build a distributed crawler and classifier that detects and bypasses mobile, search, and ads cloaking, with 95.5% accuracy and a false positive rate of 0.9%.
  • We measure the most prominent search and ad cloaking techniques in the wild; we find 4.9% of ads and 11.7% of search results cloak against Google’s generic crawler.
  • We determine the minimum set of capabilities required of security crawlers to contend with cloaking today.

Limitations

Are there any unrealistic assumptions in the approach? Are there any cases where the approach does not work?

In the process, we exposed a gap between current blackhat practices and the broader set of fingerprinting techniques known within the research community which may yet be deployed. As such, we discussed future directions for breaking the cloaking arms race that included clients reporting browsing perspective to crawler operators, hindering the ability of miscreants to show benign content exclusively to search engines and ad networks.

 

Key concepts

What are the important concepts introduced in this work?

SEO

Search Engine Optimization: the practice of using search engine rules to improve a website's organic ranking within the relevant search engines. It is divided into white-hat and black-hat techniques.

White-hat techniques

In the SEO industry, using legitimate methods that comply with search engines' quality guidelines to earn a website good organic rankings for its keywords is known as white-hat SEO. It is a popular form of online marketing whose main goal is to increase the exposure of specific keywords and thereby the visibility of the website, ultimately increasing sales opportunities. It is divided into off-site SEO and on-site SEO. The core work of SEO is to understand how different search engines crawl web pages, how they index them, and how they rank results for a particular keyword, and then optimize pages accordingly so as to improve their search engine ranking, increase site traffic, and ultimately strengthen the site's ability to sell or publicize.

Black-hat techniques

Black-hat SEO refers to obtaining good search rankings in a short time through cheating-like methods, i.e., practices that violate the optimization guidelines published by mainstream search engines. Its defining trait is quick, short-term gain: these tricks are adopted for immediate profit, and the sites that use them risk being penalized whenever search engine algorithms change.

Technical content

What are the important technical details of this work?

Applications of cloaking techniques

By analyzing the mainstream cloaking software sold on the market, the authors find that it primarily targets three kinds of signals: Network Fingerprinting, Browser Fingerprinting, and Contextual Fingerprinting.

System architecture

1. Building the candidate URL set

(1) Previously labeled datasets from earlier studies of cloaked pages (TABLE V)

(2) Unlabeled datasets drawn from ordinary browsing scenarios (TABLE VI)

2. Browser configurations

As shown in TABLE VII, eleven different browser and network configurations are used to crawl each URL, and each URL is fetched three times to reduce the noise introduced by dynamic pages and network errors.

3. Feature construction

(1) Pairwise similarity features (a small code sketch of these comparisons appears at the end of this post)

1) Page content similarity: crawled pages are first deduplicated by hash value, then compared via the Hamming distance between their content hashes; a larger distance indicates a larger change between pages.

2) Screenshot similarity: page screenshots are compared pixel by pixel, counting the number of differing pixels between two images; the more pixels differ, the more the page has changed.

3) Page element similarity: dynamic sites may differ substantially between two crawls because of newly added comments or rotating ads, but the underlying template changes little, so Jaccard similarity is used to compare the elements of the two pages.

4) Request tree similarity: the network requests generated during each crawl are compared to detect divergent redirects, mismatched network errors, and other differences; Jaccard similarity counts the number of differing requests independent of any timing information, and a high score indicates that the crawls triggered different network behavior.

5) Page topic similarity: text is extracted from the pages and compared for semantic (topic) similarity.

6) Screenshot topic similarity: document topic similarity as detected by a deep convolutional neural network that takes the screenshots as input.

(2) Page dynamism features

(3) Domain-specific features

1) JavaScript, meta, and Flash redirects

2) Crawler configuration errors

3) Landing domains: if a URL's content cannot be retrieved in 33 consecutive fetches, the URL is flagged as suspicious.

4. Classifier

Extremely Randomized Trees are used as the classifier.

Analysis of false positives and false negatives

False positives come from three sources: (1) websites that modify their content between crawls, (2) connection problems, and (3) noisy labels.

False negatives mostly arise from stability issues of the cloaked websites themselves.

Future work

What further research topics can be extended from this work?

 

Discussion

What further research topics can be extended from this work?

 
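As a small appendix to the feature-construction and classifier steps above, the following sketch shows the pairwise-similarity idea (Hamming distance over content hashes, Jaccard similarity over page elements and request sets) feeding an Extremely Randomized Trees classifier; the helper names and the reduced feature set are simplified assumptions, not the paper's full feature list.

```python
# Simplified sketch of pairwise similarity features plus the classifier choice;
# helpers and the feature set are assumptions, not the paper's full feature list.
from sklearn.ensemble import ExtraTreesClassifier  # "Extremely Randomized Trees"

def jaccard(a: set, b: set) -> float:
    """Set similarity, e.g. over page elements or requested URLs."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def hamming(h1: int, h2: int) -> int:
    """Bit-level distance between two integer content hashes (e.g. simhash)."""
    return bin(h1 ^ h2).count("1")

def pairwise_features(crawl_a: dict, crawl_b: dict) -> list:
    """Turn one pair of crawls (e.g. browser view vs. Googlebot view) into features."""
    return [
        hamming(crawl_a["content_hash"], crawl_b["content_hash"]),
        1.0 - jaccard(crawl_a["elements"], crawl_b["elements"]),
        1.0 - jaccard(crawl_a["requests"], crawl_b["requests"]),
    ]

# Train on labeled feature vectors X and cloaking labels y, then predict, e.g.:
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
# clf.fit(X, y); clf.predict([pairwise_features(crawl_a, crawl_b)])
```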