1 Introduction

Phishing attacks have been relentless in recent years, with over 280,000 unique attacks in the first quarter of 2016 [2], 144,000 in 2017 [1] and 260,000 in 2018 [3]. These worrying numbers persist despite increasing public awareness and the widespread availability of tools to combat these attacks. For example, browsers such as Google Chrome, Firefox, Opera and Safari all use Google Safe Browsing to provide their users some level of built-in protection against phishing attacks. Microsoft's Internet Explorer and Edge browsers include a similar built-in defense mechanism, called SmartScreen.

Most of the literature on phishing attacks focuses on detection, e.g. by using machine learning to train a detection model [12, 22], by using the reputation of the domains hosting the attacks [13], by performing visual comparisons between the phishing site and its target [7, 14], or by using similarity measures to known attacks [9]. In this work we instead attempt to better understand the larger phishing ecosystem, and determine how often known phishing websites are hosted on legitimately owned (compromised) domains versus maliciously registered domains.

Classifying the type of domain that hosts a phishing site offers insight into how attackers commit their crimes and can suggest different remediation and mitigation options. For example, in the case of take-down strategies, the owner of a compromised domain hosting the attack is also a victim and is presumably willing to cooperate with defenders; this is not true of a malicious domain. Take-down strategies also differ based on who needs to be contacted: the sysadmin for compromised domains, or the domain name registrar for malicious ones [18]. Classifying phishing domains also offers insight into which registrars effectively prevent fraudulent registrations. For example, a 2016 study by the Anti-Phishing Working Group (APWG) [5] found that registrars such as GoDaddy had a ratio of malicious to compromised domains of 25%, whereas other registrars had a ratio well over 90%. Similarly, detecting an increase in compromised domains may offer insight into new indicators of compromise [6]. Lastly, classifying phishing domains can help advance research that specializes in studying either malicious or compromised phishing attacks. For some such researchers, the feed source already distinguished between the two cases [13, 17], whereas for others manual inspection was required [7, 16]; automatically classifying phishing domains in the latter case saves time and removes human error.

In this work we propose a domain classifier that exploits the history and internet presence of a domain with machine learning techniques to classify known phishing attacks as being hosted on either compromised domains or maliciously registered domains, using only publicly available information. This is especially relevant given the recent adoption of the General Data Protection Regulation (GDPR), which prevents certain registration information from being made publicly available. Note that our domain classifier can also be used to distinguish malicious domains from normal domains, where normal domains are those not hosting phishing attacks. However, our classifier cannot distinguish compromised domains from normal domains: the features we use to detect compromised domains are those that lend legitimacy to a domain, such as domain history, which both compromised and normal domains share.

The remainder of this paper is organized as follows. Section 2 presents a literature review, followed by a description of our domain classifier architecture in Sect. 3. Section 4 describes our feature set and machine learning algorithms. The experiment setup is described in Sect. 5, and the full experimental results using randomized evaluation are reported in Sect. 6. In Sect. 7, we present our analysis of the proportion of phishing websites hosted on compromised and malicious domains over time. In Sect. 8 we discuss runtime performance and limitations of our proposed approach. We conclude in Sect. 9.

All resources for this work can be found at http://ssrg.site.uottawa.ca/icwe2019/.

2 Related Work

In this section, we discuss different types of hosting for phishing websites, as well as work related to identifying various types of hosting for phishing websites.

The types of hosting for phishing websites are identified by Moore et al. [18] as free web-hosting services, compromised machines, rock phish and fast-flux attacks. In [18] the authors analyze the different “notice and take-down” strategies for each case.

As an example, a typical URL for a website that has been set up at a free web-hosting provider would be http://www.brand.freehostsite.com/login, where the brand name is chosen to match or closely resemble the domain name of the brand being attacked. It is usually sufficient to compile a list of known free web-hosting domains, and then use this list to determine which websites are hosted on free space. In this case, to get the phishing website removed it is necessary to contact the webspace provider and draw their attention to the fraudulent site.

On compromised machines, attackers may have restricted permissions and be limited in where files can be placed. They add their own web pages within the existing structure, leading to URLs of the typical form http://www.example.com/user/www.brand.com/ where the brand name lends legitimacy. The attacker may also find that the existing DNS configuration permits URLs of the form www.brand.com.example.com. In this case, in order to get a website removed from a compromised machine it is generally necessary to get in touch with the sysadmin who looks after it.

To further avoid suspicion, attackers sometimes go through the effort of registering their own domain name. Such domain names are usually chosen to be a variation of brand.com, such as “brand-usa.com”, or to use the brand name as a subdomain of a misleading domain, such as “brand.verysecuresite.com”. We refer to these domains as maliciously registered. If a domain name has been registered by an attacker, defenders will ask the domain name registrar to suspend the offending domain.

Rock phish and fast-flux attacks require attackers to purchase a number of cheap or free domains with meaningless names such as “vbe10.info”, containing unique identifiers in order to evade spam filters. We also consider these domains to be maliciously registered.

In this way we distinguish three types of hosting for phishing websites: free web-hosting domains, compromised domains, and malicious domains. Since free web-hosting identification simply requires a list of known hosting websites, we are left with the problem of distinguishing compromised domains from malicious ones.

In 2009, Moore et al. [17] studied so-called “evil searching”, whereby attackers search for servers running known vulnerable software in order to compromise them and upload phishing sites. The authors mention that 75.8% of their database consists of compromised servers, without explaining how they reached this number. More recently, in 2017, Corona et al. [7] proposed a method to detect phishing sites by evaluating the visual differences between the phishing page and the other pages hosted on the same domain. The authors suggest that 71% of the domains hosting phishing sites are compromised. However, the authors relied on manual checking and did not provide any reusable method.

In 2016, Catakoglu et al. [6] used honeypots to lure attackers into compromising their server, and proposed an automated technique to extract and validate indicators of compromise (IOCs) for web applications. Our work is orthogonal to this method and does not require server access to find out whether a domain hosting a phishing attack has been compromised. Also in 2016, Hao et al. [13] detected malicious domains upon registration, for the purpose of phishing as well as spamming. Their strategy also uses machine learning, in combination with designed features derived primarily from information known by registrars or registries, as well as lexical patterns of the domain name. Our approach includes most of the lexical pattern features from [13]; however, most of the information known by registrars and registries is not publicly available. In 2017, Lin et al. [16] detected domain shadowing, where compromised domains host malicious subdomains. The authors find that instead of generating subdomain names, several domain shadowing cases exploit wildcard DNS records.

The work most closely related to ours is the 2016 study [5] by the Anti-Phishing Working Group (APWG), which reports phishing trends on malicious and compromised domains. For phishing attacks launched in 2016 alone, APWG reports that almost 49% use malicious domains, while the rest are compromised. Their strategy identifies malicious domains by checking for (1) a short timeframe from domain registration to phishing report, (2) a brand name or misleading string in the domain name, and (3) batch registration of domain names. Our approach includes check (2) and a variation of check (1), but we do not have the means to perform check (3) since the required information is not publicly available. APWG's strategy focuses on properties that indicate malicious domains and otherwise simply considers the domain compromised. In contrast, our solution balances criteria suggesting malicious domains against criteria suggesting compromised ones. In addition, our solution relies only upon publicly available information, and is thus widely usable.

3 System Architecture

Figure 1 shows the overall flow of our domain classifier. The feature extractor, shared by the training and testing phases, is the core of our system, in which the values of the 15 features described below in Sect. 4.1 are automatically extracted. Specifically, the goal of the training phase is to obtain the feature values for each instance of the training domains. Those features are then used by the machine learning engine to build classifiers. The goal of the testing phase is to label real phishing websites as either a compromised or a malicious domain.

Fig. 1. System architecture of our domain classifier. Types of hosting for phishing websites include free web-hosting providers, compromised machines, and maliciously registered domains.

In the testing phase, we first apply a free web-hosting detector using a list of known hosting websites. If the phishing website is not detected as hosted on a free web-hosting site, we extract the 15 features from the domain, using the domain name itself as well as other publicly available web resources. Finally, we apply a pre-trained model to classify the phishing domain as compromised or malicious. In real-world scenarios, a sliding window can be used to include the most recent labeled phishing domains in the training data.
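As a rough illustration, the testing-phase flow of Fig. 1 could be assembled as in the minimal Python sketch below. The free-host list, helper names and label encoding (compromised as the positive class, as in Sect. 6) are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the testing-phase flow in Fig. 1 (names are
# illustrative, not the paper's actual code).
from urllib.parse import urlparse

# Hypothetical list of known free web-hosting domains.
FREE_HOSTS = {"000webhostapp.com", "weebly.com", "wixsite.com"}

def registered_domain(url: str) -> str:
    """Naive registered-domain extraction (a real system would use the
    public suffix list, e.g. via the tldextract package)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def classify_phishing_url(url, extract_features, model):
    domain = registered_domain(url)
    if domain in FREE_HOSTS:                   # free web-hosting detector
        return "free-hosting"
    x = extract_features(domain)               # the 15 features of Sect. 4.1
    return "compromised" if model.predict([x])[0] == 1 else "malicious"
```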

4 Feature-Based Machine Learning Framework

The present paper hinges on the set of high-level features detailed below in Sect. 4.1. Features 1 and 6 are inspired by work from APWG [5]. Features 2–5 are taken from [13]. Features 7–15 are the novel ones we propose here. Section 4.2 lists the machine learning algorithms we use.

4.1 Feature Set

Features are organized into two categories. The first category (features 1 through 6) deals with the domain name of the phishing website. We deal with domain names rather than full URLs because we want our system to be usable widely and early, at domain registration time or by looking at the DNS traffic. The second category (features 7 through 15) exploits the history and internet presence of a domain and involves crawling the web for public information.

In particular, we make use of the Wayback Machine, a digital archive of the Internet Archive which allows retrieving the crawl history of a URL. Specifically, the Internet Archive uses the Alexa web crawler [11], and stores the HTML of a URL as a snapshot each time it is crawled. There may be multiple crawls ongoing at any one time, and a site might be included in more than one crawl list, so the frequency with which a site is crawled varies widely. In addition, site owners who fit the Internet Archive's exclusion policy can request that their site be excluded from the Wayback Machine. For example, “quora.com” has opted out due to anonymization issues [20]. However, we did not detect any phishing websites that opted out of the Wayback Machine, possibly because opting out would itself appear suspicious.
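For illustration, the archive features described below (features 7-10) can be derived from the Wayback Machine's public CDX API. The following minimal sketch reflects our reading of the feature definitions, not the paper's exact extraction code.

```python
# A minimal sketch of querying the Wayback Machine's public CDX API
# (http://web.archive.org/cdx/search/cdx) to derive archive features.
import requests

CDX = "http://web.archive.org/cdx/search/cdx"

def archive_features(domain: str) -> dict:
    resp = requests.get(CDX, params={
        "url": domain, "output": "json", "fl": "timestamp"}, timeout=30)
    rows = resp.json()[1:]  # the first row is the column header
    if not rows:
        return {"archived": 0, "years_active": 0,
                "years_inactive": 0, "captures": 0}
    years = sorted(int(ts[0][:4]) for ts in rows)
    return {
        "archived": 1,                           # feature 7
        "years_active": years[-1] - years[0],    # feature 8
        "years_inactive": 2019 - years[-1],      # feature 9
        "captures": len(rows),                   # feature 10
    }
```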

Domain-Based Features

1. Freenom top level domain (TLD): This feature checks whether the TLD belongs to Freenom. Freenom provides free domain names on the “.cf”, “.gq”, “.ml”, “.tk” and “.ga” TLDs. It is therefore not a surprise that many of these domains are abused [5].

2. Ratio of the longest English word: This feature finds the longest English word contained in the domain name, and normalizes its length by the length of the domain name. Attackers may generate pseudo-random names to avoid conflicts with existing domains, or deliberately include readable words in the domain name to attract clicks from victims.

3. Containing digits: This feature checks whether the domain name contains digits. Hao et al. [13] observe that spam and malicious phishing domains are more likely to use numerical characters than legitimate domains. Possible reasons are that attackers add digits to generate several names from the same word, or generate random names from a character set containing digits.

4. Containing “-”: This feature checks whether the domain name contains any hyphens. Similarly, attackers can insert hyphens to break individual words or concatenate multiple words.

5. Name length: This feature records the domain name length (number of characters). Attackers may create domains using a specific template, such as random strings of a given length.

6. Partial match of brand name: This feature finds the largest partial match of a brand name contained in the domain. We use a curated list of the top 50 brand names from Alexa. Given a brand name, we find the most similar substring in the domain name, and divide the number of matching characters by the length of the brand name. The ratio ranges from 0 to 1, where 0 indicates no match to any brand name and 1 indicates an exact match. Attackers may create domains that include a brand name or a variation of it to lend legitimacy and trick victims. (A sketch of how these domain-based features can be computed follows this list.)
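The domain-based features are pure string analysis. The sketch below shows one plausible way to compute them; the word list and brand list are illustrative placeholders, and how the TLD is separated from the name is our assumption.

```python
# A minimal sketch of features 1-6 (string analysis of the domain name).
from difflib import SequenceMatcher

FREENOM_TLDS = {"cf", "gq", "ml", "tk", "ga"}
ENGLISH_WORDS = {"secure", "login", "bank", "account"}   # placeholder list
BRANDS = ["paypal", "apple", "microsoft"]                # placeholder top-50 list

def longest_word_ratio(name: str) -> float:             # feature 2
    hits = [len(w) for w in ENGLISH_WORDS if w in name]
    return max(hits, default=0) / max(len(name), 1)

def brand_partial_match(name: str) -> float:             # feature 6
    best = 0.0
    for brand in BRANDS:
        # longest block of characters shared with the brand name,
        # normalized by the brand name's length
        block = SequenceMatcher(None, name, brand).find_longest_match(
            0, len(name), 0, len(brand))
        best = max(best, block.size / len(brand))
    return best

def domain_name_features(domain: str) -> list:
    name, _, tld = domain.rpartition(".")
    return [
        int(tld in FREENOM_TLDS),                        # feature 1
        longest_word_ratio(name),                        # feature 2
        int(any(c.isdigit() for c in name)),             # feature 3
        int("-" in name),                                # feature 4
        len(name),                                       # feature 5
        brand_partial_match(name),                       # feature 6
    ]
```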

Web-Based Features

7. Archived domain: This feature checks whether the domain has been archived on the Wayback Machine. An archived domain is more likely to be legitimately owned.

8. Timespan in years the domain has been archived: This feature records the number of years the domain has been archived. The longer a domain has been archived, the more likely it is legitimately owned.

9. Timespan in years of the domain's last archive before 2019: This feature records the time in years of the domain's last archive relative to 2019. For example, if the domain's last archive was in 2017, then it has been 2 years since the last archive. A recently archived domain is more likely to be legitimately owned.

10. Number of domain archive captures: This feature records the number of archive captures for a domain. A frequently captured domain is an indication of legitimate ownership.

11. Archive redirected: This feature checks whether the latest archive capture redirects to a different domain, which implies that the archive information is no longer relevant to the original domain. This constitutes a suspicious flag adding to the other features.

12. Reachable domain: This feature checks whether the domain returns status code 200. Phishing sites usually last a short time: most pages do not last more than a few days because they are taken down by attackers to avoid tracking [4]. For this reason, legitimately owned domains are more likely to be reachable.

13. Blocked domain: This feature checks whether access to the domain returns status code 404, or whether the title or content of the page contains keywords such as “404”, “down” or “under maintenance”. Following a similar reasoning to the feature above, legitimately owned domains are less likely to be blocked.

14. Alexa rank: This feature checks whether the domain appears in the Alexa top 1 million and, if so, records its rank. Domains in the Alexa ranking are more likely to be legitimately owned.

15. Wildcard subdomain: This feature checks whether the domain is registered to accept all subdomains, known as a “wildcard” subdomain. A wildcard DNS record matches any request for an undefined subdomain, making it easy for attackers to advertise a working URL with a subdomain of their choosing despite not controlling the DNS entries of the domain itself. This may influence the determination of whether a domain is compromised or malicious. (A sketch of features 12, 13 and 15 follows this list.)
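The reachability, blocked-page and wildcard-DNS checks can be sketched as follows. The keyword list and the random label used for the wildcard probe are illustrative assumptions.

```python
# A minimal sketch of features 12, 13 and 15.
import socket
import uuid
import requests

BLOCKED_KEYWORDS = ("404", "down", "under maintenance")

def reachable_and_blocked(domain: str) -> tuple:
    try:
        resp = requests.get(f"http://{domain}", timeout=10)
    except requests.RequestException:
        return 0, 1                       # unreachable; treat as blocked
    reachable = int(resp.status_code == 200)          # feature 12
    blocked = int(resp.status_code == 404 or          # feature 13
                  any(k in resp.text.lower() for k in BLOCKED_KEYWORDS))
    return reachable, blocked

def has_wildcard_dns(domain: str) -> int:             # feature 15
    # A random label should not exist; if it still resolves, the zone
    # most likely carries a wildcard DNS record.
    probe = f"{uuid.uuid4().hex}.{domain}"
    try:
        socket.gethostbyname(probe)
        return 1
    except socket.gaierror:
        return 0
```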

4.2 Machine Learning Algorithms

We compare five learning algorithms for training the domain classifier, with the primary goal of evaluating the effectiveness of our feature set: K Neighbors (KN), Support Vector Machines (SVM), Neural Networks (NN), Random Forest (RF), and Bayesian Network (BN). All machine learning algorithm implementations were taken from the sklearn package. RF turned out to be the top performing algorithm in our experiments, so we report only on the performance of RF in the remainder of this paper.
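For reference, the five candidate learners map onto scikit-learn as sketched below. Note that plain scikit-learn ships naive Bayes rather than full Bayesian networks, so GaussianNB is used here as a stand-in for BN; the parameters shown are defaults, not the paper's tuned values.

```python
# Candidate learners in scikit-learn (a sketch, not the tuned setup).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

CANDIDATES = {
    "KN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),   # probability=True enables ROC curves
    "NN": MLPClassifier(),
    "RF": RandomForestClassifier(),
    "BN": GaussianNB(),             # stand-in for a Bayesian network
}
```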

5 Experimental Setup

5.1 Evaluation Metrics

In our experiments, we use true positive (TP) rate, false positive (FP) rate and accuracy (Acc.) as the main evaluation metrics. In tuning the machine learning models, we used Receiver Operating Characteristic (ROC) curves [10] as well as the area under the ROC curve (AUC) [8]. These evaluate binary classification performance and portray the trade-off between the TP and FP rates. Statistically, the AUC equals the probability that, given a randomly drawn positive instance and a randomly drawn negative instance, a classifier ranks the positive one higher than the negative one; it is thus a good summary statistic for model comparison.

5.2 Phishing Domain Corpus

Our phishing domain collection is summarized in Table 1 and consists of malicious cases from PhishLabs [19], and compromised cases from both PhishLabs and DeltaPhish [7].

Table 1. Compromised and malicious phishing domain collection from 2 sources.

For the domain instances from PhishLabs, we received a list of 9,475 malicious domains collected from May 2018 to October 2018. Analysts reviewing phishing attacks at PhishLabs manually classify a domain as malicious if they believe it was created by attackers for the purpose of phishing. Since this list was manually checked by the analysts at PhishLabs, we consider it accurate.

We also received from PhishLabs a list of 17,427 confirmed phishing URLs, collected from September 2018 to October 2018. As instructed by PhishLabs, intersecting the list of malicious domains with the phishing URLs on the domain gives the phishing URLs using malicious domains; the remaining phishing URLs are most likely those using compromised domains. Specifically, the 17,427 phishing URLs span 695 unique domains, 20 of which were in the list of malicious domains, leaving a list of 675 likely compromised domains. This list is less accurate, however, since the compromised domains were not manually inspected by the analysts at PhishLabs.

Indeed, a quick examination of the list of 675 likely compromised domains exposed a number of clearly malicious domains: “my-apple-id.cf”, “doocs.gq”, “my-sharepointofficedrive.tk”, “outlook-livesl.cf”. Two signs indicate that these domains are malicious: (1) the TLDs belong to Freenom, a registry that provides free domains, many of which are abused; (2) the domain names contain suspicious keywords such as “apple”, “docs”, “officedrive”, and “outlook”, mimicking legitimate brand names.

Since the list of compromised domains from PhishLabs contains some noise, we looked for other open source lists of compromised domains, and found a list shared by the authors of DeltaPhish [7]. The authors manually checked a list of 1,012 phishing websites hosted on 694 compromised domains that were retrieved from PhishTank from October 2015 to January 2016. Since this list was manually checked by the authors and used in their research, we consider it accurate.

5.3 Evaluation Method

We use randomized evaluation to inspect the overall performance on all the available data. We also adopt the standard train, validation and test methodology from machine learning.

For the train/test splits, we used the following strategy. For our train set we used the compromised domains from PhishLabs collected from September 2018 to October 2018, and a random selection of an equal number of malicious domains from the same timespan, ensuring a balanced train set, for two main reasons. First, a balanced train set is more realistic, since in real-world scenarios the volumes of malicious and compromised cases are similar; for example, APWG reports that in 2016 almost 49% of domains were malicious, while the rest were compromised [5]. Second, machine learning classifiers such as Random Forest have difficulty coping with imbalanced train sets, as they are sensitive to the proportions of the different classes; consequently, these algorithms tend to favor the majority class, yielding misleading accuracy.

For our test set, we used the compromised domains from DeltaPhish collected from October 2015 to January 2016, and the remainder of the malicious domains from PhishLabs collected from May 2018 to October 2018.

We train on the compromised domains from PhishLabs because we knew these domains contained noisy labels, as discussed in Sect. 5.2; this way we take advantage of the noise in the data to limit model over-fitting, ensuring better generalization [15]. Conversely, we test on the compromised domains from DeltaPhish [7] because they were manually checked by the authors and are therefore likely to give a better indication of the accuracy of our classifier. Note that since the compromised domains come from two different sources, they also have different distributions. This allows us to evaluate the classifier's robustness on a distribution of data it has not trained on, anticipating real-world applications once the classifier is deployed.

In optimizing the algorithm parameters, the training set was further divided via stratified sampling into a training portion and a validation portion; stratification ensures that the class distribution is preserved between the two. For the final tests with the optimal model parameters, the whole training set was used to train the classifiers. We report average statistics over 5 runs in all our experiments in order to reduce random variation and avoid lucky train/validation splits.
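A minimal sketch of this tuning loop, assuming a Random Forest and a hyperparameter grid of our own choosing (the paper does not specify the grid), could look as follows.

```python
# Stratified train/validation tuning with 5-run AUC averaging (sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def tune_rf(X, y, n_runs=5):
    best_params, best_auc = None, 0.0
    for n_estimators in (50, 100, 200):          # assumed grid
        aucs = []
        for seed in range(n_runs):
            # stratify=y preserves the class distribution in both parts
            X_tr, X_val, y_tr, y_val = train_test_split(
                X, y, test_size=0.2, stratify=y, random_state=seed)
            clf = RandomForestClassifier(
                n_estimators=n_estimators, random_state=seed).fit(X_tr, y_tr)
            aucs.append(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
        if np.mean(aucs) > best_auc:              # keep best 5-run average
            best_auc = np.mean(aucs)
            best_params = {"n_estimators": n_estimators}
    return best_params, best_auc
```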

6 Experimental Results

6.1 Randomized Evaluation

The main goal of randomized evaluation is to inspect the overall performance of our domain classifier on all our data via stratification and multiple run averaging. In this section, we show the performance of our classifier trained under a balanced data set, using different machine learning algorithms.

Fig. 2. Left: ROC curve (5-run average) and AUC of our domain classifier using five machine learning algorithms, trained with a balanced data set with compromised domains as the positive class. Right: ROC curve of our trained RF model on the test set.

For each machine learning algorithm, we found the optimal parameters with the best AUC via 5-run tuning. We show the final tuning results as ROC curves in Fig. 2(a). Random Forest (RF) was the top performing algorithm in our experiments. In the testing phase of randomized evaluation, we assigned RF the optimal parameters and tested the model on the separate test set. This reproduces a real deployment, where one tunes models offline and then employs the optimal setup online. The ROC curve on the test set is shown in Fig. 2(b).

In Table 2, we show the performance of our domain classifier in this experiment. Since noisy compromised cases from PhishLabs were used in the train set and manually verified compromised cases from DeltaPhish in the test set (as discussed in Sect. 5.3), our model generalizes well to unseen data; as a result, the test set performs better than the train set. As shown in Table 2, our domain classifier achieves a high TP rate of 91.29% with a reasonably low FP rate of 7.86%.

Table 2. Performance (5-run average) of our domain classifier using Random Forest (RF). The domain classifier was trained with a balanced data set, with compromised domains as the positive class.

6.2 Learning with Individual and Grouped Features

The statistics in the previous sections were obtained using the whole feature set (15 features in total). In this section, we evaluate the contribution of each individual feature, as well as of grouped features, to the overall performance. We use the summary statistic AUC to measure the performance of each individual feature, and report only the results with Random Forest under a balanced train set in Fig. 3. Figure 3(a) shows that the archive features, the novel features we propose (Archived, Years active, Years inactive, Number of captures and Archive redirected), stand out from the others with over 0.8 AUC. The features Reachable and Blocked also perform fairly well with over 0.75 AUC. The remaining features rank lower, with under 0.7 AUC. All features are above 0.5 AUC, although some are close to 0.5, almost amounting to random guessing. For example, for our Alexa rank feature, only a few domains in our data set have an Alexa rank in the top 1 million; the majority of domains have no rank, giving the classifier little information to distinguish between compromised and malicious domains from this feature alone. The usefulness of these lower ranked features comes when they are combined with related higher ranked features, yielding better performance.

Fig. 3. Area under the ROC curve (AUC) of RF with individual features (left) and grouped features (right) under a balanced train set. Top-performing features include the archive features, as well as the Reachable and Blocked features.

This can be seen in Fig. 3(b), where we compare the top ranked features Archived, Reachable and Contain hyphen against related features grouped together, showing that groups perform better than individual features. Here, All reachable refers to all web-based features that do not use the Internet Archive. In particular, we find that grouping the domain-based features improves the AUC from 0.66 (Contain hyphen alone) to 0.75. We can also compare the grouped rankings: All archive features perform best, followed by All reachable, and then by Domain-based features. Although the grouped domain-based features rank lowest, their performance is still impressive considering that computing them only involves string analysis of the domain name.

7 Analysis

In this section, we use our domain classifier to analyze the proportion of phishing websites hosted on compromised vs malicious domains over time. Specifically, we collected over 180,000 phishing website instances hosted on over 69,000 domains over the past 3 years. Our sources include PhishTank, IBM X-Force, and OpenPhish. Our analysis is shown for every quarter in Fig. 4. Note that our results from the first two quarters of 2016 may not be representative, since the volume of phishing websites there is much lower, our initial feed consisting of just PhishTank.

Fig. 4. Classification results of unique phishing domains hosted on either compromised or malicious domains. The proportions are shown for every quarter, in volume and ratio, over a three-year timespan from 2016 to 2018.

Over the 3 year period, we find that 73% of the websites hosting attacks are compromised while the remaining 27% belong to the attackers. This ratio of compromised vs malicious is reasonably aligned with other findings [5, 7, 17]. In particular, with regard to APWG [5], our findings agree with registrars such as GoDaddy that have only 25% malicious registrations. This may indicate that more registrars are following GoDaddy in actively defending against malicious registrations.

However, overall APWG finds that 49% of domains are malicious. A few reasons may explain this difference. First, we classify hosting sites as compromised or malicious, while APWG detects domains as malicious and assumes the rest are compromised. Second, we believe that the criteria used in [5] may be too aggressive and will, for example, classify as malicious any site that is hacked almost immediately upon creation. Finally, APWG reports that more than half of the servers flagged as malicious in their database are related to Chinese phishing attacks, whereas our data sources consist mostly of North American and European phishing attack reports.

That being said, we do see an increase in malicious registrations in 2018, in particular in the 3rd quarter, as shown in Fig. 4, where the malicious ratio reaches a maximum of almost 40%. One reason the three latest quarters of 2018 show more malicious domains may be that we performed this experiment “after the fact”: a domain that was malicious in 2016 may since have been re-registered as a legitimate domain. Overall, we find that the proportion of compromised domains remained relatively consistent over 2016 and 2017, with a 5% decrease in 2018.

8 Discussion

8.1 Runtime Performance

Our framework is composed of a training phase and a testing phase (Fig. 1). The training phase can be done periodically offline, so users experience no delay at this stage. When deployed, the testing phase runs online as each phishing domain arrives.

All our experiments were conducted on a standard computer with a 2.10 GHz processor and 14.7 GB of available RAM. The free web-hosting detector caused no apparent delay in the workflow. The time-critical module is the feature extractor: for example, fetching the archive features takes the longest, since we use a crawler and wait 10 s for the page to load. The average runtime of the feature extractor module is 25 s per domain. Various measures can improve the time performance of the feature extraction module. For example, we used 5 threads in our experiments to process 5 domains in parallel; this way we process 5 domains on average every 25 s (or 1 every 5 s). Another essential strategy we use is caching: for example, caching the queries and results of the web-based features improves runtime performance. Once the feature values have been extracted, applying the pre-trained machine learning model takes a trivial amount of time.
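The two optimizations above (a 5-thread pool and caching) can be sketched as follows; extract_features stands in for the 15-feature extractor of Sect. 4 and is assumed, not shown.

```python
# A minimal sketch of the runtime optimizations described above.
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

def extract_features(domain: str) -> tuple:
    ...  # stands in for the 15-feature extractor (~25 s per domain)

@lru_cache(maxsize=None)          # caching: repeated domains cost nothing
def cached_features(domain: str) -> tuple:
    return extract_features(domain)

def extract_all(domains):
    # 5 worker threads -> about one domain every 5 s instead of 25 s
    with ThreadPoolExecutor(max_workers=5) as pool:
        return list(pool.map(cached_features, domains))
```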

8.2 Limitations

One limitation of our approach is that the Internet Archive is unintentionally biased [21] towards North American websites. For example, our approach may not work as well for Chinese phishing websites, since the archive is less likely to have captured Chinese websites.

One way around this limitation would be to include search engine results as an indicator of domain presence and reputation. For example, in the Google search engine, one can search for “inurl:example.com”, which returns indexed URLs containing the string “example.com”. To address the international bias, one could then use several international search engines. However, this solution may not be scalable, since search engines usually place a limit on the number of requests.

Another limitation arises if attackers intentionally register a domain with history. For example, an owner may hold a domain for several years before dropping it, at which point an attacker may be ready to pick it up. However, this also limits the attacker: domains with history may be more expensive, and may take more time to identify and acquire. It also prevents attackers from choosing a misleading domain name, or one resembling a brand name to lend legitimacy.

One way around this limitation used to be checking the registration history to see whether the domain has recently changed owners; however, this has become increasingly difficult to automate, and the information may not be publicly available. Another option is to use the captures in the archive and compare the HTML of a domain over time, looking for recent and sudden changes. HTML that has suddenly changed and is not relevant to the previous content implies that the domain has probably passed to a new owner.
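A minimal sketch of such a comparison follows. The snapshot URL scheme (web.archive.org/web/&lt;timestamp&gt;/&lt;url&gt;) is the Wayback Machine's public one; the 0.3 similarity threshold is an illustrative assumption.

```python
# Comparing two Wayback Machine captures to flag a sudden content
# change, a possible sign of an ownership change (sketch).
from difflib import SequenceMatcher
import requests

def snapshot_html(domain: str, timestamp: str) -> str:
    url = f"http://web.archive.org/web/{timestamp}/http://{domain}/"
    return requests.get(url, timeout=30).text

def sudden_change(domain: str, old_ts: str, new_ts: str) -> bool:
    old = snapshot_html(domain, old_ts)
    new = snapshot_html(domain, new_ts)
    similarity = SequenceMatcher(None, old, new).ratio()
    return similarity < 0.3    # low similarity -> likely new owner
```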

9 Conclusion

In this paper, we presented a solution for classifying the domains of known phishing websites as either legitimately owned (compromised) domains or maliciously registered domains. By exploiting the generalization power of machine learning techniques, our domain classifier achieves acceptable TP and FP rates using a classification engine with a feature set of our own design, which is the main originality of the present work. In particular, these results rest on the observation that the majority of compromised domains have more internet history and presence than malicious domains.
