How many domains were registered yesterday? Understanding the dynamics of domain name registration and maintaining an up-to-date WHOIS database. A WhoisXML API, Inc. technical blog

In what follows we address the important question of how to determine which Internet domains have been registered, dropped or modified on a given day. Apart from being interesting from the point of view of studying trends in registration dynamics, it is also a key ingredient of maintaining an up-to-date WHOIS database. With such a database one can then address a variety of questions concerning domain statistics, marketing research, IT security, etc.

It turns out that due to the design of the domain name system it is impossible to answer these questions accurately. So we describe some good practices to provide bona fide approximations, also implemented at WhoisXML API, Inc. for creating downloadable WHOIS data sets and setting up databases serving as a basis of our API products. We elucidate the ideas behind the collection of these data and outline how they can be used to maintain a WHOIS database.

1 The domain name system

The domain name system is one of the key ingredients of the operation of the Internet: it defines the domains, assigns them names, and ensures that they can be owned by someone. We shall not go into the details here (consult our DNS primer whitepaper for those). We just mention a few aspects which are important for our goal.

It is crucial to keep in mind that the Internet is not used in the way it was designed to be used. In its early days it was a computer network of a few universities and research institutes, used by a community in which everybody knew each other, almost everybody was honest, and it was easy to find misbehaving users. The foundations of its operation were laid down at that time. And although they have undergone several changes since, we still carry many issues stemming from policies and protocols which retain features from the early days and can no longer be changed. The difficulties mentioned in this document mostly stem from these.

For the technical operation of a domain it is necessary and sufficient for the domain to resolve: it must be possible to assign IP addresses to the hostnames in the domain and vice versa. To achieve this, a second-level domain such as "whoisxmlapi.com" has to appear in the zone file of its top-level domain, "com". These files are the primary information sources on which name and IP resolution is based. This part of the domain name system, however, does not reveal any ownership information about the domain.

For keeping track of the ownership, the contact data related to domains, and the relevant dates (creation, modification, expiry) there is another protocol: WHOIS. WHOIS is again based on a distributed database, but a different one from the domain name system's zone files. So apart from being added to the zone file, a new domain also has to be included in the WHOIS subsystem.

A domain is technically in operation as soon as it is in a zone file, and will not work if it is not there. However, to tell about its ownership, its creation, modification or expiry dates, or to find out whom to contact in case of any issue, the domain has to be present in another distributed database: the WHOIS system. And here comes our key problem. The domain can work and will be resolved in the DNS regardless of whether it is present in the WHOIS system or not, and also regardless of whether its WHOIS data are accurate or not. A proper WHOIS record is not a technical requirement of a domain's operation, just a legal one. This is how the founding fathers designed it.

Of course, in the 1970s there was a single person, Mary K. Stahl, who as the hostmaster of the Internet maintained a single file, HOSTS.TXT. Meanwhile there was the ARPANET directory, the collection of the relevant contacts, which was the predecessor of WHOIS and existed even in the form of a printed brochure until 1982. The domain system was introduced soon after, but still for a relatively small and coherent community whose members prided themselves on their open collaboration and general ethics. Under such circumstances, nobody thought that there would be domains whose owners want to hide for malicious reasons, not to mention the need to hide the ownership or technical contact information of a domain for privacy reasons.

Anyway, while many protocols, practices, and policies have changed since, and a variety of issues of the old design under the new circumstances have been overcome, the technical independence of WHOIS from the rest of the domain name system still persists. And even though the archaic WHOIS is just now being replaced by RDAP, which is based on contemporary technology, and access to WHOIS data is being reorganized with significant respect for new data protection regulations, technically a domain can still operate without all these.

This immediately raises the question of when a domain is introduced and when it is dropped. At the time when it starts and ceases to resolve? Or at the dates included in its WHOIS record? So far we have seen that these two are only loosely related for technological reasons. But there are even more factors that increase this ambiguity.

2 Domain life cycle

Domains do not just appear or disappear; they have a life cycle. In the case of domains in a generic top-level domain, such as .com, this is illustrated in the following figure:

[Figure: gtld-lifecycle-700x286.jpg. Figure source: https://www.icann.org/resources/pages/gtld-lifecycle-2012-02-25-en]

Notice that in the auto-renew grace period, which can last 0-45 days, the domain may be in the zone file, so it may or may not actively operate. Notice the ambiguity in this statement. And in the case of the country-code top-level domains, managed by various registrars and authorities, the situation can be even more cumbersome…

In addition, the deadline for introducing or removing the WHOIS record can vary; there is no very strict regulation of this. So it easily happens that a domain already works but has no WHOIS record yet, or the other way around: the domain does not work because it is not in the zone file, but it already (or still) has a WHOIS record.
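The mismatch described above can be made concrete with a small sketch. It assumes you already hold two plain lists of domain names, one taken from a zone file and one from WHOIS records; the function names and the sample domains are hypothetical and serve only as an illustration.

```python
from typing import Dict, Iterable, Set


def normalize(name: str) -> str:
    """Lower-case a domain name and drop the trailing root dot, if any."""
    return name.strip().lower().rstrip(".")


def compare_sources(zone_domains: Iterable[str],
                    whois_domains: Iterable[str]) -> Dict[str, Set[str]]:
    """Split domain names by where they are visible.

    Both inputs are assumed to be plain lists of names obtained elsewhere.
    """
    zone = {normalize(d) for d in zone_domains}
    whois = {normalize(d) for d in whois_domains}
    return {
        "in_both": zone & whois,      # works and has a WHOIS record
        "zone_only": zone - whois,    # works, but no WHOIS record (yet)
        "whois_only": whois - zone,   # WHOIS record exists, but not in the zone
    }


# Hypothetical sample data.
report = compare_sources(
    ["example.com.", "new-shop.com."],
    ["EXAMPLE.COM", "expired-blog.com"],
)
print(report)
```

Neither the "zone_only" nor the "whois_only" group is an error in itself; both are ordinary consequences of the loose coupling between the two systems.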

So far we have seen that finding out how many domains were registered on a given day is rather hard even with unlimited access to WHOIS data: the registration, the start of technical operation, and the appearance of a WHOIS record can occur at different times. And access to WHOIS data, at least via the WHOIS protocol, is typically very limited.

3 Detecting changes and maintaining a WHOIS database

So to find out which, or just how many domains were registered on a given day, say, today, one can consider various strategies.

Naively, one would just query all WHOIS records in the world whose date of registration is today. There is a bit of a problem though: the WHOIS protocol does not support such a query (and neither does RDAP). So then let's collect all WHOIS data and create a local database. Using just the WHOIS protocol, this is a very hard task. Believe us, at WhoisXML API we are doing it. But we are doing it with the goal of maintaining such a database and providing it to our clients. Our quarterly releases provide a snapshot of all WHOIS records in the world (or at least those which can be found in any legal way) four times a year. Here you can read about how to download and set up such a database. (See also the manual of quarterly releases of generic top-level domains as well as the one for country-code top-level domains for a detailed specification of our WHOIS data downloads.)

But the WHOIS records of all domains amount to a tremendous volume of data, so there is no way to refresh all records every day, especially since most WHOIS servers impose strong limitations on the number of queries. A possible idea (and this is what we do at WhoisXML API in the case of most domains) is to identify those domains which have just started to resolve or have just disappeared. And indeed, this is how we collect the daily data. It sounds easy but the devil is in the details: it takes a lot of resources and a lot of expertise to do it properly. Just the parsing of WHOIS data is in itself a very involved matter; it is a subject of ongoing scientific research (see e.g. the paper of S. Liu et al. in Proceedings of the ACM Internet Measurement Conference, Tokyo, Japan, October 2015). Nevertheless, our data are available for download to our subscribers from our daily data feeds, too. See our manuals for daily generic TLD data feeds and for country-code TLD data feeds for details.
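The idea of spotting domains that have just appeared or disappeared can be sketched as a comparison of two zone file snapshots. This is a minimal illustration, not our production pipeline: it assumes a simplified zone file with one record per line and fully qualified owner names, and the function names and sample records are hypothetical.

```python
from typing import Dict, List, Set


def delegated_domains(zone_text: str) -> Set[str]:
    """Collect owner names of NS records from simplified zone file text."""
    domains = set()
    for line in zone_text.splitlines():
        fields = line.split(";", 1)[0].split()  # drop comments, then tokenize
        if len(fields) < 3:
            continue
        owner, *middle, _target = fields
        name = owner.rstrip(".").lower()
        # Skip the apex of the TLD itself (no dot left after normalization).
        if "." in name and any(f.upper() == "NS" for f in middle):
            domains.add(name)
    return domains


def diff_snapshots(yesterday: str, today: str) -> Dict[str, List[str]]:
    """Return the domains that appeared and disappeared between two snapshots."""
    before, after = delegated_domains(yesterday), delegated_domains(today)
    return {"added": sorted(after - before), "dropped": sorted(before - after)}


# Hypothetical sample snapshots.
YESTERDAY = """\
com. 172800 IN NS a.gtld-servers.net.
example.com. 172800 IN NS ns1.example.net.
old-site.com. 172800 IN NS ns1.example.net.
"""
TODAY = """\
com. 172800 IN NS a.gtld-servers.net.
example.com. 172800 IN NS ns1.example.net.
fresh-idea.com. 172800 IN NS ns2.example.net. ; newly delegated
"""
changes = diff_snapshots(YESTERDAY, TODAY)
print(changes)
```

Only the domains in the "added" and "dropped" lists would then need a fresh WHOIS lookup, which keeps the number of queries far below what refreshing every record would require.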

In the case of numerous country-code TLDs, there is no systematic DNS-based way to identify the changes in the list of domains, which poses further difficulties. For these, we are left with finding as many domain names as possible via a variety of methods, including web scraping, DNS sensors, etc. We are doing this, too: some daily feeds contain these discovered domains. They are not necessarily newly registered; however, in the absence of more accurate data, at least there will be information about them.

With our data at hand, you can set up your own local WHOIS database for the domains you are interested in. This can cover almost all domains of the Internet if you have enough computing resources; recall that it quickly turns into a big data task.

The recipe to maintain your own WHOIS database is to

  • download a quarterly release,
  • merge in the daily data every day, starting from an early enough date,
  • when a new quarterly release arrives, merge its data, too.

As for merging in daily feeds, the "early enough date" needs to be determined. The quarterly data collection primarily aims at collecting fresh data on all known domains at some point of the 3-month collection period. Since the v32 gTLD and v18 ccTLD quarterly database releases, however, the quarterly release contains the data of the daily feeds, too. This merging also takes processing time, hence the newest daily-collected data in a quarterly release are from the beginning of the last month of the collection period. Hence, the "early enough date" is the beginning of the last month of the collection period, that is, one month before the quarterly release's date. (Prior to the v32 gTLD and v18 ccTLD releases this merging was not performed, and the recommended early date used to be 6 months before the release date.)
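The date arithmetic above can be sketched as follows. This is only an illustration: it interprets "one month before" and "6 months before" as calendar-month subtraction, which is our assumption here, and the function names and release dates are hypothetical.

```python
import calendar
from datetime import date


def months_before(day: date, months: int) -> date:
    """Subtract whole calendar months, clamping to the end of shorter months."""
    index = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def first_daily_to_merge(release_date: date, dailies_included: bool = True) -> date:
    """Earliest daily feed to merge on top of a quarterly release.

    dailies_included=True models releases that already contain daily data
    (one month back); False models the older releases (six months back).
    """
    return months_before(release_date, 1 if dailies_included else 6)


# Hypothetical release dates.
print(first_daily_to_merge(date(2020, 9, 1)))
print(first_daily_to_merge(date(2020, 3, 1), dailies_included=False))
```

Starting somewhat earlier than the computed date does no harm apart from extra processing, whereas starting later risks gaps in the database.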

Whenever a new quarterly release appears, it is recommended to merge its contents into the database as well, even if all dailies have been merged in on a daily basis. In this document you can see further details on how we ourselves do it to maintain the database behind our API products. It describes a recommended way of merging the data: how to decide when to insert or update a WHOIS record in the database when it appears in a daily or new quarterly feed.
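To illustrate what such an insert-or-update decision may look like, here is a minimal sketch. The rule it uses (insert unknown domains, replace a stored record only if the incoming one is not older) is an assumption made for this example and not necessarily the rule described in the referenced document. The field names "domain", "updated" and "registrar" are hypothetical too; the actual feed formats are specified in the manuals.

```python
from datetime import datetime
from typing import Dict, Iterable

Record = Dict[str, str]


def updated_at(record: Record) -> datetime:
    """Parse the (hypothetical) ISO 8601 'updated' field of a record."""
    return datetime.fromisoformat(record["updated"])


def merge_feed(database: Dict[str, Record], feed: Iterable[Record]) -> Dict[str, int]:
    """Merge a feed into an in-memory database keyed by domain name."""
    stats = {"inserted": 0, "updated": 0, "skipped": 0}
    for record in feed:
        key = record["domain"].lower()
        current = database.get(key)
        if current is None:
            database[key] = record            # domain not seen before
            stats["inserted"] += 1
        elif updated_at(record) >= updated_at(current):
            database[key] = record            # incoming record is not older
            stats["updated"] += 1
        else:
            stats["skipped"] += 1             # stored record is fresher
    return stats


# Hypothetical sample feeds.
database: Dict[str, Record] = {}
quarterly = [
    {"domain": "example.com", "updated": "2020-06-01T00:00:00+00:00", "registrar": "Registrar A"},
    {"domain": "example.org", "updated": "2020-07-15T00:00:00+00:00", "registrar": "Registrar B"},
]
daily = [
    {"domain": "example.com", "updated": "2020-08-03T09:00:00+00:00", "registrar": "Registrar C"},
    {"domain": "example.org", "updated": "2020-05-01T00:00:00+00:00", "registrar": "Registrar B"},
    {"domain": "example.net", "updated": "2020-08-03T10:00:00+00:00", "registrar": "Registrar A"},
]
quarterly_stats = merge_feed(database, quarterly)
daily_stats = merge_feed(database, daily)
print(quarterly_stats, daily_stats)
```

A real database would of course live in a database server rather than in a dictionary, but the decision per record stays the same.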

Having a database like this, you can answer many questions including, e.g., how many domains were registered by a registrar in a top-level domain, or even the one in the title of this document.
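As an illustration of such questions, the sketch below counts registrations over a handful of in-memory records. The field names "domain", "created" and "registrar" and the sample records are hypothetical; a real setup would run equivalent queries against the database itself. Note that the day is taken as written in the record, without any time zone conversion.

```python
from collections import Counter
from typing import Dict, Iterable, List

Record = Dict[str, str]


def registered_on(records: Iterable[Record], day: str) -> List[str]:
    """Domains whose creation timestamp starts with the given YYYY-MM-DD day."""
    return sorted(r["domain"] for r in records if r["created"][:10] == day)


def registrations_per_registrar(records: Iterable[Record], tld: str) -> Counter:
    """Number of domains per registrar within one top-level domain."""
    suffix = "." + tld.lower().lstrip(".")
    return Counter(
        r["registrar"] for r in records if r["domain"].lower().endswith(suffix)
    )


# Hypothetical sample records.
records = [
    {"domain": "example.com", "created": "2020-09-01T08:00:00+00:00", "registrar": "Registrar A"},
    {"domain": "example.net", "created": "2020-09-01T12:30:00+00:00", "registrar": "Registrar B"},
    {"domain": "sample.com", "created": "2020-08-31T23:59:00+00:00", "registrar": "Registrar A"},
    {"domain": "demo.com", "created": "2020-09-01T17:45:00+00:00", "registrar": "Registrar B"},
]
new_domains = registered_on(records, "2020-09-01")
per_registrar = registrations_per_registrar(records, "com")
print(len(new_domains), per_registrar)
```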

4 How accurate will it be?

In the light of the above considerations it is clear that the domains appearing in the daily data feeds on a given day, e.g. as new domains, will not be exactly the ones found with this registration date in the WHOIS database once everything has settled down. The daily feed contains the domains which started to function technically on that day, maybe not even for the first time. (It might also happen that some property of the domain in the domain name system changes just because of some error.) The date in the WHOIS record, on the other hand, is the date when the domain was officially registered, and it is not forced to coincide with the time when the domain started to function. The number of records in the daily data feed, however, will show the same or similar main trends, even if it does not coincide with the number of domains found with the given date in the WHOIS database, which can only be found out later by querying a complete database. But maintaining a complete database is definitely more resource-intensive than counting the lines of some files.

Meanwhile, when talking about times, one should never forget about time zones. A timestamp that falls on yesterday's date in the WHOIS record can fall on today's date in your own time zone.
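A short sketch shows the effect. It assumes that the WHOIS timestamp is available as an ISO 8601 string with an explicit UTC offset, which is not guaranteed for real WHOIS records; the function name and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_calendar_date(whois_timestamp: str, utc_offset_hours: float) -> str:
    """Calendar date of a timestamp as seen at a fixed UTC offset."""
    moment = datetime.fromisoformat(whois_timestamp)
    if moment.tzinfo is None:
        # Without an offset the instant is ambiguous, so refuse to guess.
        raise ValueError("timestamp must carry an explicit UTC offset")
    local = moment.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.date().isoformat()


# Hypothetical creation time shortly before midnight UTC.
created = "2020-09-01T23:30:00+00:00"
in_budapest_summer = local_calendar_date(created, 2)    # UTC+2
in_los_angeles = local_calendar_date(created, -7)       # UTC-7
print(in_budapest_summer, in_los_angeles)
```

When counting registrations per day it therefore helps to fix one reference time zone, for instance UTC, and to convert every timestamp to it before comparing dates.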

Another systematic feature of our methodology is that the WHOIS records for domains with status codes indicating that they are in the redemption grace period or the pending delete period are not all captured. The reason is that if we detect that a domain stops resolving, it can mean two things: it is somewhere in its auto-renew grace period or it has just started its redemption grace period. The uncertainty arises because in the auto-renew grace period "the domain may be in the zone file". And it is very likely that only the status changes when the domain disappears from the domain name system, so we would probably not gain much new information from the rest of these records.

Altogether one should not forget that the WHOIS system is an archaic and largely undocumented part of the Domain Name System which is undergoing many changes nowadays. These include both technological ones (using the RDAP protocol instead of the original WHOIS) and content-related ones (implementing data protection regulations). Yet WHOIS data are still the only ones linking Internet resources to physical individuals, and thus their relevance should not be underestimated. Getting complete WHOIS data for entire top-level domains is impossible by the simple direct use of WHOIS clients, which also do not support advanced queries. The approaches described here result in what is most probably the best possible WHOIS database.

Author: (c) WhoisXML API, Inc. 2000. All rights reserved.

Created: 2020-09-02 Wed 10:19
