US Patent for Method and system for detecting malicious and/or botnet-related domain names Patent (Patent # 10,027,688 issued July 17, 2018) (2024)

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based on and derives the benefit of the filing date of U.S. Provisional Patent Application No. 61/087,873, filed Aug. 11, 2008. The entire content of this application is herein incorporated by reference in its entirety.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates a system 100 of detecting malicious and/or botnet-related domain names, according to one embodiment.

FIG. 2 illustrates a method of detecting botnet-related domain names, according to one embodiment.

FIG. 3 illustrates details of sampling DNS traffic, according to one embodiment, such as set forth in 205 of FIG. 2.

FIG. 4A illustrates details of filtering domain names for further processing in filter 130, as set forth in 210 of FIG. 2, according to one embodiment.

FIG. 4B illustrates details of a method of ranking domain names based on statistics, performed by ranker 145 and as set forth in 225 in FIG. 2, according to one embodiment.

FIG. 5 illustrates details related to searching for information about domain names by information searcher 150 as set forth in 230 of FIG. 2, according to one embodiment.

FIG. 6 illustrates details relating to classifying domain names by classifier 155 as set forth in 235 of FIG. 2, according to one embodiment.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

FIG. 1 illustrates a system 100 of detecting malicious and/or botnet-related domain names, according to one embodiment. The system 100 can comprise a DNS (Domain Name System) traffic monitor 110 that can be placed between a RDNS (Recursive Domain Name System) server 115, and a monitored network 105 with computers 101, 102, and 103. Note that one or more of these computers (e.g., 103) can be a bot. The DNS traffic monitor 110 can be connected to RDNS reconnaissance application 120. The RDNS reconnaissance application 120 can include a sampler 125, a filter 130, a statistics collector 135, a statistics database 140, a ranker 145, an information searcher 150, and a classifier 155.

The sampler 125 can sample DNS traffic between the monitored network 105 and the RDNS server 115 for further processing, according to one embodiment. The sampler 125 can sample DNS queries and their related responses according to a probability p. For example, assuming p=0.2, each DNS query and its response has a 20% chance to be included in the sample. The value of p can be varied in order to alter the desired sample-size.

The filter 130 can filter DNS traffic for further processing, according to one embodiment. For each domain name d that is in the sample of DNS traffic selected by the sampler 125, the filter 130 can determine whether to discard that domain name or accept it for further processing. According to one embodiment, the filter 130 can extract from domain d its top level domain (TLD(d)) and its second level domain (2LD(d)). Thus, for example, if d=domain.example.com, then TLD(d)=com, and 2LD(d)=example.com. The filter 130 can then check to see if TLD(d) or 2LD(d) are contained in certain lists of top level domains and second level domains. For example, 2LD(d) can be checked against a whitelist of known and legitimate second level domains. Additionally, TLD(d) can be checked against a list of suspicious top level domains. Furthermore, 2LD(d) can be checked against a list of dynamic DNS second level domain names, and then can be checked against another list of newly created second level domains. The filter 130 can then use the results of these queries to determine whether to discard domain name d or to accept it for further processing.

The statistics collector 135 can collect statistics about domain names that have been accepted for further processing by the filter 130, according to one embodiment. The statistics collector 135 can monitor these domain names over a period of time T. For example, if T=1 day, the statistics collector 135 can monitor each domain name for a period of one day. The statistics collector 135 can collect information about each domain name d over period T, such as, but not limited to: the number of queries to domain d observed during T; the number of distinct resolved IP addresses during T for the domain d; the number of distinct source IP addresses that queried d during time T; the maximum number of queries for a certain domain issued by a single source IP address in any given subinterval Ti<T; the number of error messages received as a response to queries to a certain domain name; the number of NX domain (non-existent domain) responses; and the entire set of source IP addresses and resolved IP addresses extracted from the DNS queries and related responses.
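By way of illustration only, the per-domain statistics listed above could be accumulated along the following lines. This is a minimal sketch, not the patented implementation: the record layout (domain, source IP, resolved IPs, response code) and the names DomainStats and collect_stats are assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DomainStats:
    """Counters for one domain over one period T (field names are illustrative)."""
    query_volume: int = 0
    nxdomain_count: int = 0
    source_ips: set = field(default_factory=set)
    resolved_ips: set = field(default_factory=set)


def collect_stats(records):
    """Aggregate (domain, source_ip, resolved_ips, rcode) records seen during T."""
    stats = defaultdict(DomainStats)
    for domain, source_ip, resolved_ips, rcode in records:
        entry = stats[domain]
        entry.query_volume += 1                   # number of queries to d during T
        entry.source_ips.add(source_ip)           # distinct source IP addresses
        entry.resolved_ips.update(resolved_ips)   # distinct resolved IP addresses
        if rcode == "NXDOMAIN":
            entry.nxdomain_count += 1             # non-existent domain responses
    return dict(stats)


# Hypothetical sample traffic using documentation address ranges.
records = [
    ("bot.example.biz", "10.0.0.3", ["192.0.2.10"], "NOERROR"),
    ("bot.example.biz", "10.0.0.3", ["192.0.2.11"], "NOERROR"),
    ("bot.example.biz", "10.0.0.4", [], "NXDOMAIN"),
]
stats = collect_stats(records)
```

Keeping the full sets of source and resolved IP addresses, rather than only their counts, leaves them available for the later reverse DNS and Autonomous System lookups.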

The statistics database 140 can store the data that the statistics collector 135 gathered, according to one embodiment. For example, a relational database can be used to store this data. A relational database can be a structured collection of data that uses tables comprised of rows and columns to store the desired information.
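As a hedged illustration of such a relational layout, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name domain_stats, its columns, and the sample row are assumptions made for this example and are not specified by the embodiment.

```python
import sqlite3

# In-memory database for illustration; a deployment would use persistent storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE domain_stats (
        domain           TEXT    NOT NULL,
        epoch            TEXT    NOT NULL,  -- identifies the period T
        query_volume     INTEGER NOT NULL,
        num_resolved_ips INTEGER NOT NULL,
        num_source_ips   INTEGER NOT NULL,
        PRIMARY KEY (domain, epoch)
    )
    """
)

# Hypothetical row: 100 queries from 50 distinct sources during one epoch.
conn.execute(
    "INSERT INTO domain_stats VALUES (?, ?, ?, ?, ?)",
    ("domain.example.com", "2008-08-11", 100, 3, 50),
)

row = conn.execute(
    "SELECT query_volume, num_source_ips FROM domain_stats WHERE domain = ?",
    ("domain.example.com",),
).fetchone()
```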

The ranker 145 can rank domain names based on their suspiciousness and can accept some domain names for further processing, according to one embodiment. The ranker 145 can retrieve statistics from the statistics database 140. The ranker 145 can then calculate a suspiciousness score for each domain name d over a period of time T. For example, the ranker 145 can calculate a suspiciousness score as a ratio of the number of queries to domain d observed during T to the number of distinct source IP addresses that queried domain d during time T. The ranker 145 can rank each domain name based on its suspiciousness score. The ranker can then discard domain names with low suspiciousness scores and accept domain names with high suspiciousness scores for further processing. For example, the ranker 145 can compare each domain name's suspiciousness score to a provided threshold I in order to determine if it should be discarded or accepted for further processing.

The information searcher 150 can search for further information about the domain names that were accepted for further processing by the ranker 145, according to one embodiment. The information searcher 150 can use Internet search engines to search for a given domain name d, and the top n results of each Internet search can then be collected. For example, if n=10, the top 10 search results will be collected. If any of the top n results contains a link to a known malware analysis website, then the contents of that linked page can also be collected. The information searcher 150 can also conduct reverse DNS lookups for each resolved IP address for a given domain name d. The information searcher 150 can also perform a mapping between each resolved IP address for a given domain name d and the Autonomous System (AS) that it belongs to. An autonomous system can be a set of IP addresses under the control of one network operator or organization that has a clearly defined routing policy to the Internet. An AS is uniquely identified by an AS number and an AS name.

The classifier 155 can classify domain names into categories, such as, but not limited to: malicious, suspicious, or legitimate, according to one embodiment. The malicious category can include domain names that are clearly malware-related and likely to be botnet-related. The suspicious category can represent domain names that are likely to be malware-related, but for which further analysis is required. Finally, the legitimate category can represent domain names that are not related to any suspicious activity. The classifier can classify each domain name by examining: the domain name, a set of resolved IP addresses for that domain name, the statistics for that domain name collected by statistics collector 135 and stored by statistics database 140, the Internet search results gathered by the information searcher 150, a list of known malicious IP addresses and autonomous systems, a database of domain names from a malware analysis tool, a query volume threshold and a list of known malware analysis websites.

FIG. 2 illustrates a method of detecting botnet-related domain names, according to one embodiment. In 205, a sampler 125 can pick a sample of Domain Name System (DNS) traffic to review. This process is explained in more detail with respect to FIG. 3 below. In 210, a filter 130 can filter the sample of DNS traffic to get domain names to be further processed. This process is explained in more detail with respect to FIG. 4A below. Once the domain names have been filtered in 210, the domain names that have been accepted for further processing can be monitored and statistics can be collected by statistics collector 135 in 215. Statistics can be collected over a period of time T. For example, if T=1 day, then statistics for each domain name would be collected over a one day period. The statistics collected for each domain name d can include, but are not limited to: the number of queries to domain d observed during T (query_volume(d, T)), the number of distinct resolved internet protocol (IP) addresses during T for the domain d (resolved_IPs(d, T)), and the number of distinct source IP addresses that queried domain d during T (num_source_IPs(d, T)). In 220, the statistics can then be stored in a database 140. In one embodiment, the statistics can be stored in a relational database. A relational database can be a structured collection of data that uses tables comprised of rows and columns to store the desired information. In 225, domain names can be ranked in ranker 145 based on a suspiciousness score which is computed for each domain name d and each epoch T (s(d, T)) based upon the statistics stored in 220. This process is explained in more detail with respect to FIG. 4B below. In 230, information searches can be conducted by information searcher 150 for the domain names that were considered for further processing in 225. This process is explained in more detail with respect to FIG. 5 below. In 235, the domain names can be classified in classifier 155 as either malicious, suspicious, or legitimate, based upon examination of the retrieved information. This process is explained in more detail with respect to FIG. 6 below.

FIG. 3 illustrates details of sampling DNS traffic, according to one embodiment, such as set forth in 205 of FIG. 2. In 305, a DNS query q, its related response r, and a probability p are accepted as input parameters to sampler 125. For example, the method could accept the DNS query q=www.example.com, the related response r=123.123.123.123 which could represent the IP address that corresponds to the query q, and a probability p=0.20. The probability p represents the probability that a given query q and response r will be sampled for further processing. Thus, for example, when p=0.20, an estimated 20% of traffic will be sampled for further processing, and the other 80% of traffic will be discarded. In 310, a pseudorandom number N between 0 and 1, inclusive, can be generated using a uniform distribution. For example, a pseudorandom number N=0.6 can be generated. A person having ordinary skill in the art can recognize that there are many different algorithms available to generate a pseudorandom number. For example, a Linear Congruential Generator (LCG) algorithm can be used, so that the next integer pseudorandom number is computed as X(i+1)=(aX(i)+c) mod m, where m>0, 0<=a<m, 0<=c<m and X(0) is a “seed” number between 1 and m−1. The number N=X(i)/(m−1) can then be taken as the result at each trial i. Note that this is just one example of an algorithm, and that those of ordinary skill in the art will see that many other algorithms may be used. In 315, the pseudorandom number N can be compared against the input parameter p. If N is greater than or equal to p in 315, then the method will proceed to 320. On the other hand, if N is less than p in 315, then the method will proceed to 325. For example, if N=0.6 and p=0.2, then N is greater than or equal to p, and therefore the method will proceed to 320. In another example, if N=0.1 and p=0.2, then N is less than p, and therefore the method will proceed to 325. In 320, the DNS query q and its related response r can be discarded. In 325, the DNS query q and its related response r can be accepted for further processing as indicated in FIG. 2. In 330, the sampling method of FIG. 3 ends.
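The sampling steps 305 through 330 can be sketched as follows. This is one possible illustration, not a definitive implementation: the LCG constants a, c, and m below are common textbook values chosen for the example and do not come from the embodiment, and the names LCG and sample are hypothetical.

```python
class LCG:
    """Linear Congruential Generator: X(i+1) = (a*X(i) + c) mod m."""

    def __init__(self, seed, a=1103515245, c=12345, m=2**31):
        # a, c and m are illustrative constants, not values from the patent.
        self.a, self.c, self.m = a, c, m
        self.state = seed % m

    def next_uniform(self):
        """Advance the generator and return N = X(i)/(m-1), a value in [0, 1]."""
        self.state = (self.a * self.state + self.c) % self.m
        return self.state / (self.m - 1)


def sample(query, response, p, rng):
    """Step 315: discard when N >= p (320), accept when N < p (325)."""
    n = rng.next_uniform()
    if n >= p:
        return None               # 320: query and response are discarded
    return (query, response)      # 325: accepted for further processing


rng = LCG(seed=1)
result = sample("www.example.com", "123.123.123.123", 0.20, rng)
```

Passing the generator in as a parameter keeps the accept/discard decision separate from the choice of pseudorandom algorithm, so any other uniform generator could be substituted.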

FIG. 4A illustrates details of filtering domain names for further processing in filter 130, as set forth in 210 of FIG. 2, according to one embodiment. In 405, a domain name d can be accepted as an input parameter. The domain name d can be part of the sample of domain names that was gathered in 205 of FIG. 2. The top level domain name, TLD(d), can be extracted from the domain name d. For example, if d=domain.example.com, then TLD(d)=com. Additionally, the second level domain name, 2LD(d), can be extracted from the domain name d. For example, if d=domain.example.com, then 2LD(d)=example.com.

Referring again to FIG. 4A, in 410, a set of second level domain names, which can be referred to as a 2LD Whitelist, can be checked to see if it contains 2LD(d). A 2LD Whitelist contains a list of second level domain names that are known to be legitimate (for example: ibm.com, google.com, yahoo.com, etc.). If 2LD(d) appears in the 2LD Whitelist, then d is discarded in 435 because it is considered a legitimate second level domain. If 2LD(d) does not appear in the 2LD Whitelist, then further filtration of domain d can continue.

In 415, a set of top level domains, which can be referred to as a Suspicious TLDs set, can be checked to see if it contains TLD(d). The Suspicious TLDs set can contain top level domains that are often associated with malicious and botnet-related domain names (for example: .biz, .info, etc.). If TLD(d) does not appear in the Suspicious TLDs set, then further filtration of d continues. If TLD(d) appears in the Suspicious TLDs set, then d is accepted for further processing in 430. Thus, the full domain name d, or any part of the domain name d can then be further investigated.

In 420, a set of second level domain names, referred to as the Dynamic DNS (DDNS) 2LDs set, can be checked to see if it contains 2LD(d). The DDNS 2LDs set can contain second level domain names owned by Dynamic DNS service providers that may be suspicious (for example: dyndns.org, no-ip.com, yi.org, etc.). If 2LD(d) does not appear in the DDNS 2LDs set, then further filtration of d continues. If 2LD(d) appears in the DDNS 2LDs set, then d is accepted for further processing in 430.

In 425, a set of second level domain names, which can be referred to as New 2LDs set, can be checked to see if it contains 2LD(d). The New 2LDs set can contain second level domains that have never been queried during a previous period of time. For example, if the second level domain “example.com” had not been queried in the previous week, it could be included in the New 2LDs set. If 2LD(d) does not appear in the New 2LDs set, then d is discarded in 435. If 2LD(d) appears in the New 2LDs set, then d is accepted for further processing in 430.

As discussed above, a domain d can be accepted for further processing in 430. In 435, a domain d that has been filtered out by one of the steps 410, 415, 420 or 425 can be discarded, and will not undergo further processing. In 440, the filtration method of FIG. 4A ends.
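The filtering decisions of 410 through 435 can be sketched as a short function. This is a simplified illustration under stated assumptions: the label-splitting helpers treat the last label as the TLD and the last two labels as the 2LD, which ignores multi-label public suffixes such as co.uk, and the function names and sample list contents are hypothetical.

```python
def tld(domain):
    """Return the top level domain, e.g. 'com' for 'domain.example.com'."""
    return domain.lower().rstrip(".").split(".")[-1]


def second_level(domain):
    """Return the second level domain, e.g. 'example.com' (naive label split)."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def filter_domain(d, whitelist_2lds, suspicious_tlds, ddns_2lds, new_2lds):
    """Return True to accept d for further processing (430), False to discard (435)."""
    sld = second_level(d)
    if sld in whitelist_2lds:        # 410: known legitimate 2LD, discard
        return False
    if tld(d) in suspicious_tlds:    # 415: suspicious TLD, accept
        return True
    if sld in ddns_2lds:             # 420: dynamic DNS provider 2LD, accept
        return True
    return sld in new_2lds           # 425: accept only if the 2LD is new


# Illustrative list contents drawn from the examples in the text.
whitelist = {"ibm.com", "google.com", "yahoo.com"}
suspicious = {"biz", "info"}
ddns = {"dyndns.org", "no-ip.com", "yi.org"}
new = {"fresh-example.net"}

decisions = {
    d: filter_domain(d, whitelist, suspicious, ddns, new)
    for d in ["www.google.com", "c2.example.biz", "host.dyndns.org",
              "a.fresh-example.net", "old.example.net"]
}
```

The order of the checks matters: the whitelist test comes first, so a whitelisted 2LD is discarded even if it would also match one of the later sets.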

FIG. 4B illustrates details of a method of ranking domain names based on statistics, performed by ranker 145 and as set forth in 225 in FIG. 2, according to one embodiment. In 450, the statistics stored in 220 in database 140 can be retrieved from the database. In 455, a suspiciousness score s(d, T) can be calculated from those statistics. The suspiciousness score s(d, T) can be calculated as a ratio between the number of queries to domain name d observed during the epoch T and the number of distinct source IP addresses that queried domain d during T (i.e., s(d, T)=query_volume(d, T)/num_source_IPs(d, T)). For example, assuming that domain name d was queried 100 times during epoch T (i.e., query_volume(d, T)=100) and domain name d was queried by 50 distinct source IP addresses during epoch T (i.e., num_source_IPs(d, T)=50), then s(d, T) can be calculated by dividing 100 by 50. Accordingly, in this example, s(d, T)=2. In 460, the domain names can be ranked in order based upon their suspiciousness score s(d, T). For example, assume we have three domain names d1, d2, and d3. Furthermore, assume that s(d1, T)=2, s(d2, T)=5, and s(d3, T)=3. In this example, the domain names could be ranked in order based upon their suspiciousness score; since 2 is less than 3, which is less than 5, the domain names would be ranked in the following order: d1, d3, d2. In 465, the suspiciousness score s(d, T) can be compared to a threshold I. The value of the threshold I can be varied. For example, I can equal 1, or, in another example, I can equal 50. If s(d, T) is greater than the threshold I, then d can be accepted for further processing in 475. For example, if the suspiciousness score s(d, T)=2 and the threshold I=1, then, since 2 is greater than 1, the domain name d is accepted for further processing in 475. However, if s(d, T) is less than or equal to threshold I, then d is discarded in 470. For example, if the suspiciousness score s(d, T)=2 and the threshold I=3, then, since 2 is less than or equal to 3, the domain name d is discarded in 470. In 480, the ranking method ends.
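The scoring, ranking, and threshold comparison of 455 through 475 can be sketched as below, using the d1, d2, d3 scores from the example. The function names and the input layout (a mapping from domain name to a pair of query volume and source IP count) are assumptions made for this illustration; the specific volumes were chosen only so that the scores come out to 2, 5, and 3.

```python
def suspiciousness(query_volume, num_source_ips):
    """s(d, T) = query_volume(d, T) / num_source_IPs(d, T)."""
    return query_volume / num_source_ips


def rank_and_select(stats, threshold):
    """Rank domains by ascending score (460) and keep those above the threshold (465)."""
    scores = {
        d: suspiciousness(volume, sources)
        for d, (volume, sources) in stats.items()
        if sources > 0  # a domain with no observed sources has no defined score
    }
    ranked = sorted(scores, key=scores.get)
    accepted = [d for d in ranked if scores[d] > threshold]  # 475; others go to 470
    return scores, ranked, accepted


# Hypothetical (query_volume, num_source_IPs) pairs giving scores 2, 5 and 3.
stats = {"d1": (100, 50), "d2": (50, 10), "d3": (30, 10)}
scores, ranked, accepted = rank_and_select(stats, threshold=2)
```

With a threshold of 2, d1 is discarded because its score is equal to, not greater than, the threshold, while d3 and d2 are accepted.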

FIG. 5 illustrates details related to searching for information about domain names by information searcher 150 as set forth in 230 of FIG. 2, according to one embodiment. In 505, an internet search engine can be used to query for a target domain name d. For example, the search engine google.com can be used to query for a given domain name “domain.example.com”. A person having ordinary skill in the art can recognize that there are many different internet search engines that can be used in this step, such as, but not limited to, google.com, yahoo.com, and ask.com. Once the query is complete, the top n search results can be collected. For example, the top 10 search results can be collected. The top n search results can then be compared against a list of known malware analysis websites W. The list of known malware analysis websites W could include, but is not limited to, avira.com, viruslist.com, and threatexpert.com. If any of the top n search results contain a link to a known malware analysis website listed in W, then the text of the linked webpage can also be collected. For example, if the top n search results included a link to avira.com, and if avira.com was a part of W, then the linked avira.com page would be collected.
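The comparison of the top n search results against the list W can be sketched as follows. This illustration assumes the search results are already available as a list of URLs; it does not call any search engine, and the function names and sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def host_matches(host, site):
    """True if host is the site itself or one of its subdomains."""
    return host == site or host.endswith("." + site)


def malware_site_links(result_urls, known_sites, n=10):
    """Return the URLs among the top n results that point to a site listed in W."""
    hits = []
    for url in result_urls[:n]:
        host = (urlparse(url).hostname or "").lower()
        if any(host_matches(host, site) for site in known_sites):
            hits.append(url)
    return hits


W = {"avira.com", "viruslist.com", "threatexpert.com"}
results = [
    "https://forum.example.org/thread/42",
    "https://www.avira.com/en/threats/example",
    "https://notavira.com.example.net/page",
]
hits = malware_site_links(results, W)
```

Matching on the parsed host name rather than on a substring of the URL avoids false hits such as the third sample result, whose host merely contains the text "avira.com".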

In 510, a reverse DNS lookup can be performed for each IP address that resolved for domain name d. Previously in 220 in FIG. 2, statistics were stored for the resolved IP addresses for d over epoch T (resolved_IPs(d, T)). The set of resolved IP addresses can be represented by R. In 510, for each IP address r in R, a reverse DNS lookup (e.g., PTR DNS) can be performed to retrieve the domain name that points to that address. For example, a reverse DNS lookup can be performed by conducting a DNS query for a pointer record (PTR) by supplying an IP address. The result of the reverse DNS lookup can be the host name associated with the supplied IP address. This information may help identify whether a given IP address is a dynamic IP address or related to a DSL or dial-up connection (for example, “35-201-168-192.dialup.example.net”).
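For illustration, the name queried in a PTR lookup can be derived with Python's standard ipaddress module, and the returned host name can be inspected for hints of a dynamic or dial-up address. The sketch below performs no network lookup (the standard library's socket.gethostbyaddr could be used for that), and the list of hint tokens is an assumption of this example rather than part of the embodiment.

```python
import ipaddress
import re


def ptr_query_name(ip):
    """Return the name queried in a PTR lookup, e.g. '4.3.2.1.in-addr.arpa'."""
    return ipaddress.ip_address(ip).reverse_pointer


# Hypothetical tokens that often appear in host names of dynamic, DSL or dial-up pools.
DYNAMIC_HINTS = re.compile(r"dialup|dsl|dhcp|dynamic", re.IGNORECASE)


def looks_dynamic(hostname):
    """Heuristic check on a PTR result; a sketch, not a reliable classifier."""
    return bool(DYNAMIC_HINTS.search(hostname))


query_name = ptr_query_name("192.168.201.35")
dynamic = looks_dynamic("35-201-168-192.dialup.example.net")
```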

In 515, a mapping is performed between each resolved IP address r and the Autonomous System (AS) it belongs to. Given an IP address, the AS number and the AS name to which the IP address belongs can be retrieved using information publicly available on the Internet.
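One way such a mapping could be sketched is a longest-prefix match against a table of announced prefixes. The table below is entirely hypothetical: it uses documentation address ranges and documentation AS numbers, whereas a real deployment would load prefix-to-AS data from a public routing or registry source.

```python
import ipaddress

# Hypothetical prefix table: (network, AS number, AS name).
PREFIX_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), 64496, "EXAMPLE-NET-1"),
    (ipaddress.ip_network("198.51.100.0/24"), 64497, "EXAMPLE-NET-2"),
    (ipaddress.ip_network("198.51.100.128/25"), 64498, "EXAMPLE-NET-3"),
]


def lookup_as(ip, table=PREFIX_TABLE):
    """Return (AS number, AS name) for the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [row for row in table if addr in row[0]]
    if not matches:
        return None
    _, asn, name = max(matches, key=lambda row: row[0].prefixlen)
    return asn, name


mapping = {ip: lookup_as(ip) for ip in ["198.51.100.200", "198.51.100.5", "203.0.113.1"]}
```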

FIG. 6 illustrates details relating to classifying domain names by classifier 155 as set forth in 235 of FIG. 2, according to one embodiment. Domain names can be classified into one of three broad categories: malicious, suspicious, or legitimate. The malicious category can include, but is not limited to, domain names that are clearly malware-related and likely to be botnet-related domains. The suspicious category can include, but is not limited to, domains that are likely to be malware-related, but for which further analysis is required. The legitimate category can include, but is not limited to, domain names that are not related to any suspicious activity.

In 605, the input parameters can include, but are not limited to: domain name d, a set of resolved IP addresses R, domain statistics S (as compiled and stored in 215 and 220 on FIG. 2), Internet search results G (as compiled in 230 on FIG. 2), a list of known malicious IP addresses and autonomous systems A, a database of domain names from a malware analysis tool M, a query volume threshold t, and a list of known malware analysis websites W.

In 610, Internet search results G can be checked to see if they contain a link to a malware analysis website using W. If G contains a link to a malware analysis website listed in W, then d can be classified as malicious in 620. For example, if W contained known malware analysis website avira.com, and if Internet search results G contain a link to avira.com, then d can be classified as malicious. If G does not contain a link to any known malware analysis website in W, then further classification of d continues. Thus, for example, if the only known malware analysis website in W is avira.com, and Internet search results G do not contain a link to avira.com then d could be further classified in another step.

In 615, if the Internet search results G are determined to be empty, and if the database of domain names from malware analysis tool M contains the domain name d, then d can be classified as malicious in 620. Otherwise, further classification of d can continue. For example, if d=example.com, and if Internet search results G are empty, and if the database of domain names from malware analysis tools M contains “example.com,” then d can be classified as malicious. However, by way of another example, if G is not empty or if M does not contain “example.com” then further classification of d can continue.

In 625, it can be determined whether any of the resolved IP addresses R or their related Autonomous System (AS) numbers are in the list of known malicious IP addresses and autonomous systems A. If the resolved IP addresses or their related AS numbers are found to be in A, then d can be classified as suspicious in 645. Otherwise, further classification of d can continue. For example, if resolved IP addresses R contains IP address “123.123.123.123” and A also contains “123.123.123.123,” then d could be classified as suspicious. However, if A does not contain any of the IP addresses in R or their related AS numbers, then further classification of d could continue.

In 635, Internet search results G can be checked to see if the result is empty. For example, this criterion would be satisfied if the Internet search results G did not contain any data. On the other hand, by way of example, this criterion would not be satisfied if Internet search results G did contain some search results. Additionally, R can be checked to see if it contains at least one IP address that is a home-user address in 635. For example, this criterion could be satisfied if IP address “123.123.123.123” was known to be a home-user address and the set of resolved IP addresses R contained “123.123.123.123”. On the other hand, by way of example, this criterion would not be satisfied if R did not contain “123.123.123.123.” Additionally, the query volume for d can be checked to see if it is higher than a provided query volume threshold t in 635. For example, the query volume threshold t can be set to 1,000 queries. In this example, if the query volume for d was 2,000 queries, the query volume 2,000 is greater than the threshold 1,000, and accordingly this criterion would be satisfied. However, if the query volume for d was 500 queries, the query volume of 500 would be less than the threshold of 1,000, and accordingly the criterion would not be satisfied. If all three criteria are satisfied, then the domain name d can be classified as suspicious in 645. Otherwise, d can be classified as legitimate in 640.
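The decision sequence of 610 through 645 can be summarized in a single function. This is a minimal sketch under stated assumptions: G is taken to be a list of result URLs, A is a set holding both IP addresses and AS numbers, the set home_user_ips is a hypothetical stand-in for however home-user addresses are identified, and all function and parameter names are illustrative.

```python
from urllib.parse import urlparse


def links_to_known_site(urls, known_sites):
    """True if any URL's host is, or is a subdomain of, a site listed in W."""
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == s or host.endswith("." + s) for s in known_sites):
            return True
    return False


def classify(d, R, asns, query_volume, G, A, M, t, W, home_user_ips):
    """Return 'malicious', 'suspicious' or 'legitimate' for domain name d."""
    if links_to_known_site(G, W):                 # 610 -> 620
        return "malicious"
    if not G and d in M:                          # 615 -> 620
        return "malicious"
    if (set(R) | set(asns)) & set(A):             # 625 -> 645
        return "suspicious"
    has_home_user_ip = bool(set(R) & set(home_user_ips))
    if not G and has_home_user_ip and query_volume > t:   # 635 -> 645
        return "suspicious"
    return "legitimate"                           # 640


# Example: empty search results, a home-user address, and 2,000 queries against t=1,000.
label = classify(
    d="domain.example.com", R={"123.123.123.123"}, asns={64496},
    query_volume=2000, G=[], A=set(), M=set(), t=1000,
    W={"avira.com"}, home_user_ips={"123.123.123.123"},
)
```

Because the checks are evaluated in order, a domain that meets a malicious criterion is never downgraded to suspicious by a later step.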

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail can be made therein without departing from the spirit and scope of the present invention. Thus, the present invention should not be limited by any of the above-described exemplary embodiments.

In addition, it should be understood that the figures described above, which highlight the functionality and advantages of the present invention, are presented for example purposes only. The architecture of the present invention is sufficiently flexible and configurable, such that it may be utilized in ways other than that shown in the figures.

Further, the purpose of the Abstract of the Disclosure is to enable the U.S. Patent and Trademark Office and the public generally, and especially the scientists, engineers and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The Abstract of the Disclosure is not intended to be limiting as to the scope of the present invention in any way.
