|Survey Methodology - an Overview|
Our surveys collect only information that is meant to be publicly available. For example, we collect web pages, and record the responses the server sends us, comparable to any request a user might make of your server. The same goes for DNS queries.
All of our reports are composed of data that anyone can access on the web in a completely legitimate manner. So what makes our reports unique? Simply put, the breadth and duration of our surveying activities. We have been collecting information since May of 1998, and we collect information from many web sites each month. Depending on what information we are seeking to collect, our crawlers visit either a subset of the web or all servers we know of.
By aggregating the information from the various collection methodologies, we have the ability to produce a wide variety of reports, some of which we publish for free on our site every month.
|What is the "HEAD" request?|
Each month, we visit all known web servers, and issue a single HTTP request that looks like this:
HEAD / HTTP/1.0
User-Agent: Mozilla/4.0 (compatible; SecuritySpace WebSurvey; http://www.securityspace.com )
Accept: text/plain,text/html
In short, this request is a request for the home page of the web site in question. But, to optimize bandwidth, we only request the HTTP header component of the web page. Hence the usage of "HEAD", where normal users will issue a "GET". This saves both our bandwidth and yours. In some cases, the web server will treat a HEAD the same as a GET request, and return the entire web page to us as well. In these cases, we discard the web page returned, keeping only the HTTP header.
The User-Agent string for most users identifies the type of browser they are using. In our case, we identify our crawler with this string.
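The request and the "keep only the header" rule described above can be sketched in Python roughly as follows. This is an illustration only, not the survey's actual code: the function names `build_head_request` and `keep_headers_only` are hypothetical, and the sample response in the usage note is made up.

```python
# Header values copied from the request shown in this section.
USER_AGENT = ("Mozilla/4.0 (compatible; SecuritySpace WebSurvey; "
              "http://www.securityspace.com )")


def build_head_request(user_agent: str = USER_AGENT) -> bytes:
    """Return the raw bytes of a HEAD request for the home page."""
    lines = [
        "HEAD / HTTP/1.0",
        f"User-Agent: {user_agent}",
        "Accept: text/plain,text/html",
        "",  # the empty line that ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def keep_headers_only(raw_response: bytes) -> dict:
    """Parse the status line and headers; drop any body the server sent."""
    head, _, _body = raw_response.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:  # skip malformed lines that have no colon
            headers[name.strip().lower()] = value.strip()
    return {"status": status_line, "headers": headers}
```

For a server that answers a HEAD as if it were a GET, `keep_headers_only` simply ignores everything after the first blank line, so only the status line and header fields are retained.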
|What sites do you visit?|
We visit what we consider well-known sites. We define a well-known site as one that is linked to from at least one other site that we consider well-known. So, if we are visiting you, it means we know about you through a link from another site.
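Because the definition is recursive, it needs some starting set of sites. A minimal sketch, assuming a hypothetical `seeds` collection and a `links` mapping from each site to the sites it links to (neither is described in this document), could compute the well-known set with a breadth-first traversal:

```python
from collections import deque


def well_known_sites(seeds, links):
    """Return every site reachable by links from the seed sites.

    seeds: iterable of site names assumed to be well-known already.
    links: dict mapping a site to the sites its pages link to.
    """
    known = set(seeds)
    queue = deque(known)
    while queue:
        site = queue.popleft()
        for target in links.get(site, ()):
            if target not in known:
                # A link from a well-known site makes the target well-known.
                known.add(target)
                queue.append(target)
    return known
```

A site that is only linked to from sites outside this set never enters it, which matches the idea that unlinked "fringe" sites are not visited.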
If a site stops responding to our request for 3 consecutive months, we automatically remove it from the survey. In this fashion, our list of known servers remains up to date.
Because of this technique, we find that we visit only about 10% of the sites on the web. Approximately 90% of all web sites are "fringe" sites, such as domain squatters and personal web sites, that the rest of the web community treats as unimportant (no one considers them important enough to link to).
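The removal rule for sites that stop responding for 3 consecutive months can be sketched as a simple counter per site. The names `MAX_MISSES` and `update_site_list` are hypothetical, and the sketch assumes one survey pass per month:

```python
MAX_MISSES = 3  # consecutive monthly surveys without a response


def update_site_list(miss_counts, responded):
    """Return the new {site: consecutive_misses} map after one survey.

    miss_counts: dict of known sites to their current miss count.
    responded: set of sites that answered this month's request.
    """
    updated = {}
    for site, misses in miss_counts.items():
        # A response resets the streak; a miss extends it.
        misses = 0 if site in responded else misses + 1
        if misses < MAX_MISSES:
            updated[site] = misses  # site stays in the survey
    return updated
```

A site with two earlier misses that fails again reaches three and is dropped, while any single response resets its count to zero.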
We run a general crawler that behaves much as any crawler does, visiting web pages on a subset of the web each month. This crawler downloads complete web pages and notes various attributes of these pages, such as the location of embedded objects (images, frames, etc.).
The purpose of this crawler is to discover links to new sites that we don't yet know about, as well as to make note of various attributes in use on web sites that allow us to publish reports such as our Web Bug report, Technology Penetration report, and more.
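As a rough illustration of the two jobs described above (finding links to other sites and noting embedded objects), the following sketch uses Python's standard `html.parser`. The class name `PageNotes` and the choice of tags to record are assumptions, not a description of the actual crawler:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageNotes(HTMLParser):
    """Collect linked host names and embedded object URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.linked_hosts = set()  # candidate sites for the known list
        self.embedded = []         # (tag, absolute URL) of embedded objects

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL before
            # extracting the host name.
            host = urlparse(urljoin(self.base_url, attrs["href"])).hostname
            if host:
                self.linked_hosts.add(host)
        elif tag in ("img", "frame", "iframe") and attrs.get("src"):
            self.embedded.append((tag, urljoin(self.base_url, attrs["src"])))
```

Feeding a downloaded page to `PageNotes(url).feed(html)` fills `linked_hosts` with hosts that could be checked against the list of known sites, and `embedded` with the objects a report might count.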
We visit sites in a random fashion. We take our complete list of known sites, randomize the order, and then start visiting the sites, one after another. We structure our crawlers to go through this list once a year, meaning that, on average, we spider your site no more than once a year.
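One way to picture this schedule is to shuffle the site list once and split it into equal monthly batches. The split into 12 batches and the name `monthly_batches` are assumptions made for the sketch; the text above only says the list is randomized and covered once a year:

```python
import random


def monthly_batches(known_sites, months=12, seed=None):
    """Shuffle the known sites and split them into per-month batches."""
    order = list(known_sites)
    random.Random(seed).shuffle(order)  # randomize the visiting order
    # Taking every `months`-th site keeps batch sizes within one of
    # each other and places every site in exactly one batch.
    return [order[i::months] for i in range(months)]
```

Each site appears in exactly one batch, so a full pass over all batches visits every known site once.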
When we generate our summary report and pie charts, we do so by grouping together all the different types of servers from a particular vendor. The following shows you the different servers that we grouped together under a single heading:
|Netscape||Netscape is the sum of the following different servers: Netscape-Enterprise, Netscape-Commerce, Netscape-Communications, Netscape-FastTrack, Netscape-Catalog, and Netscape-Administrator.|
|Microsoft||Microsoft is the sum of the following different servers: Microsoft-IIS, Microsoft-PWS, Microsoft-PWS-95, Microsoft-ELG, and Microsoft-Internet-Information-Server.|
|WebSite||WebSite (by O'Reilly) is the sum of WebSite and WebSitePro.|
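The grouping in the table above amounts to a lookup from server product name to vendor. A minimal sketch, assuming the product name is the part of the `Server` header before the first `/` and that unlisted products are counted under their own name (both are assumptions), might be:

```python
from collections import Counter

# Groupings taken from the table above.
VENDOR_GROUPS = {
    "Netscape": ["Netscape-Enterprise", "Netscape-Commerce",
                 "Netscape-Communications", "Netscape-FastTrack",
                 "Netscape-Catalog", "Netscape-Administrator"],
    "Microsoft": ["Microsoft-IIS", "Microsoft-PWS", "Microsoft-PWS-95",
                  "Microsoft-ELG", "Microsoft-Internet-Information-Server"],
    "WebSite": ["WebSite", "WebSitePro"],
}

PRODUCT_TO_VENDOR = {
    product.lower(): vendor
    for vendor, products in VENDOR_GROUPS.items()
    for product in products
}


def vendor_of(server_header):
    """Map a Server header value such as 'Microsoft-IIS/5.0' to a vendor."""
    product = server_header.split("/", 1)[0].strip()
    # Products outside the table are reported under their own name.
    return PRODUCT_TO_VENDOR.get(product.lower(), product)


def vendor_counts(server_headers):
    """Tally servers per vendor, as a summary report or pie chart would."""
    return Counter(vendor_of(h) for h in server_headers)
```

With this mapping, `Microsoft-IIS/5.0` and `Microsoft-PWS/2.0` both count toward Microsoft, and `WebSitePro/2.5.8` counts toward WebSite.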
We generate a number of DNS queries each month as a result of our survey. These include:
The above activity allows us to produce a number of reports, including our freely available ISP market share report and DNS load balancing report that we publish each month.
We publish our surveys regularly each month, releasing reports on the 1st of the month. For example, on July 1st, 2001, we published the results of the survey done during the month of June.
Please Contact us! We'd love to hear from you!