The ability to predict surfing patterns of the common Internet user has the potential to be instrumental in solving many problems facing producers and consumers of web page content. Web site designs can be evaluated and optimized by predicting how users will surf through their structures. Web client and server applications can also reduce user perceived network latency by pre-fetching content predicted to be on the surfing path of individual users or groups of users with similar surfing patterns. Systems and user interfaces can be enhanced by the ability to recommend content of interest to users, or by displaying information in a way that best matches users' interests. Proper analysis of a web site's activity is therefore an important process that supports an enhanced and intelligent design of a web site.
An important component of any ecommerce initiative is tracking the effectiveness of the marketing effort. Careful analysis of a web site's statistics yields much information that can be used to fine-tune advertising, web site content, and customer relationship management strategies and policies. These are all important elements of Internet marketing plans and strategies that can ultimately dictate the success or failure of any ecommerce initiative.
Surfing the World Wide Web involves traversing the connections among hyperlinked documents. It is one of the most common ways of accessing web pages. Theories and models are beginning to explain how observed patterns of surfing behavior emerge from fundamental human information search processes.
A common and popular source of tracking data and statistics for any website is the log file on the web server. Most web servers record all requests for web site objects in a log file. The data in the log file indicates which objects were requested, when, and by whom or what. With appropriate software to process this data, company managers and executives can measure the success of their websites and develop strategies to address weaknesses and improve their prospects by assessing the site's visibility (the ease with which customers can locate the site), navigability (the paths that customers use to move through the site), and usability (how easy it is for customers to use the site).
However, complete reliance on data collected in log files has its pitfalls, some of which will be discussed in this article. Other tools such as tracking counters help overcome some of the problems encountered with log file analysis. Therefore, an intelligent selection of site statistics software requires the ability to recognize the strengths and weaknesses of each tool in order to effectively strike a balance that realizes the missions and goals of your organization. Understanding the statistics provided by web site analysis software is critical in order to properly interpret, evaluate, and design subsequent marketing strategies.
Log file data
While web servers have the ability to record vast amounts of information, relatively few fields are typically recorded. Several formats have evolved from the Common Logfile Format (CLF), including the Extended Common Logfile Format (ECLF) as well as a variety of customized formats. For the most part, the following fields are recorded by web servers:
- the time of the request (to a resolution of one second),
- the machine making the request, recorded as either the domain name or IP address,
- the name of the requested URL as specified by the client,
- the size of the transferred URL, and
- various HTTP related information like version number, method, and return status.
Various web servers also enable other fields to be recorded, the most common of which are:
- the URL of the previously viewed page (the referrer field),
- the identity of the software used to make the request (the user agent field), and
- a unique identifier issued by the server to each client (typically a cookie).
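As a rough illustration of the fields listed above, the following Python sketch parses a single log line. It assumes the widely used Combined Log Format layout (the CLF fields followed by quoted referrer and user agent fields). Real servers can be configured differently, the cookie field is left out, and the names LOG_PATTERN and parse_line are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method url version" status size
# optionally followed by "referrer" "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The timestamp has one-second resolution plus a numeric time zone offset.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was transferred.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```

When a line carries only the basic CLF fields, the referrer and agent entries come back as None, which mirrors the fact that not every server records them.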
Understanding how all of this data is interpreted and displayed in a user readable format for subsequent decision analysis is an important component of any statistical analysis. It is therefore crucial that users be aware that there are different ways that the statistical analysis software can present the data to you. Subsequent sections of this article address some of the important decisions that the statistical analysis software must make when creating reports on your web site activity.
URLs and Referrer Fields
While these fields are useful to analyze and provide reasonable characterizations, several limitations make path reconstruction difficult. The URL recorded is the URL as requested by the user, not the location of the file returned by the server. This behavior can cause false tabulation for pages when the requested page involves relative hyperlinks, symbolic links, and/or hard-coded expansion/translation rules (for example, directories do not always translate to index.html). It can also lead to two paths being considered different when in actuality they contain the same content. While both pieces of information are useful, the canonical file system-based URL returned by the server would arguably be more useful, as it removes the ambiguity about which resource was returned to the user.
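One partial workaround is to normalize requested URLs before counting them. The Python sketch below is only an approximation under stated assumptions: it treats a trailing slash as a directory request and maps it to a default document named index.html, which, as noted above, is not true of every server. The names DEFAULT_DOCUMENT, canonical_path, and count_pages are hypothetical, and symbolic links cannot be resolved from the log alone.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit

# Assumption: the server maps directory requests to this file.
DEFAULT_DOCUMENT = "index.html"


def canonical_path(requested_url, default_document=DEFAULT_DOCUMENT):
    """Approximate the served path for a requested URL (query string dropped)."""
    path = "/" + urlsplit(requested_url).path.lstrip("/")
    is_directory = path.endswith("/")
    normalized = posixpath.normpath(path)  # collapses "/./" and "/../" segments
    if is_directory:
        normalized = posixpath.join(normalized, default_document)
    return normalized


def count_pages(requested_urls):
    """Tabulate requests by approximate canonical path."""
    return Counter(canonical_path(url) for url in requested_urls)
```

With this normalization, a request for a directory and a request for its default document are tallied as one page rather than two.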
In addition, the content of the referrer field can be quite varied. Various browsers and proxies do not send this information to the server, for privacy and other reasons. The value of the referrer field is also undefined when the user requests a page by typing in the URL, selects a page from their Favorites/Bookmarks list, or uses other interface navigational aids like the history list. Furthermore, several browsers provide conflicting values for the referrer field. To illustrate, suppose a user selects a listing for the Dell Corporation on Yahoo. In requesting the Dell splash page, the URL for the page on Yahoo is provided as the value for the referrer field. Now suppose the user clicks on the Products page, returns to the Dell splash page, and reloads the splash page. In several popular browsers, the Yahoo URL is again sent as the referrer in the second request for the Dell splash page, although the last page viewed on the user's surfing path was the Products page in the Dell site. If one chooses to reconstruct paths by relying upon the referrer field, the paths of two users may be identified instead of only one. Given these limitations, strong reliance upon the information in the referrer field may be more problematic than one would initially expect.
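The scenario above can be reproduced with a deliberately naive path reconstruction, sketched here in Python. The function reconstruct_paths and the short page labels are hypothetical. The rule it applies (extend a path only when the referrer equals that path's last page) is one simple assumption, not a description of any particular product.

```python
def reconstruct_paths(requests):
    """Group (url, referrer) pairs, in log order, into surfing paths."""
    paths = []
    for url, referrer in requests:
        for path in paths:
            # Extend the first path whose last page matches the referrer.
            if referrer is not None and path[-1] == referrer:
                path.append(url)
                break
        else:
            # No path ends at the referrer, so a new path is started.
            paths.append([url] if referrer is None else [referrer, url])
    return paths


# Hypothetical labels standing in for the pages in the example above.
requests = [
    ("dell/splash", "yahoo/listing"),    # arrival from the listing
    ("dell/products", "dell/splash"),    # click through to Products
    ("dell/splash", "yahoo/listing"),    # reload reports the old referrer
]
paths = reconstruct_paths(requests)
```

Here a single visitor yields two paths, because the reloaded splash page cannot be attached to the path that currently ends at the Products page.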
User Agent Fields
The user agent field also suffers from imprecise semantics, different implementations, and missing data. This can partially be attributed to the use of the field by browser vendors to perform content negotiation. Given that the rendering of HTML differs from browser to browser, servers have the ability to alter the HTML based upon which browser is on the other end. Consequently, the user agent field may contain the names of multiple browsers. Some proxies also append information to this field. In addition, the value of the user agent field can vary for requests made by the same user using the same Web browser. Adding to the confusion, there is no standardized manner to determine if requests are made by autonomous agents (e.g., robots), semi-autonomous agents acting on behalf of users (e.g., copying a set of pages for off-line reading), or humans following hyperlinks in real time. Clearly, it is important to be able to distinguish these classes of requests when attempting to model surfing behaviors.
Even when cookies are used, several scenarios are possible when a previously encountered cookie is processed. If the request comes from the same host, regardless of the user agent, it is treated as being issued by the same user, because a unique cookie is issued to only one browser. If the user agent field remains the same but the host changes, it is still the same user and some form of IP/domain name change is occurring. This often occurs with users behind firewalls and ISPs that load-balance proxies. However, if the same cookie arrives with a different user agent, then an error has most likely occurred, as cookies are not shared across browsers. If no cookies are present, then the site statistics software can resort to using IP addresses. If the request comes from a known host, it could be from a new user or the same user; otherwise, the request is from a different user. It is important to point out that these latter two cases could also arise from crawling software that does not support cookies.
An interesting set of scenarios occurs when a new cookie is encountered. If the request is from a host that has already been processed, the previous value of the cookie was null, and the user agent is the same, it is fair to conclude that the request is from a new user who received their first cookie from the server in the previous request. If the client is not using cookie obfuscation software, one would expect all following requests from this user to contain the same cookie. However, if the previous value from the same host and agent was a different cookie, it could be the same user obfuscating cookie requests, or a new user from the same ISP using the same browser version and platform as the user from the previous request. Barring any other piece of supporting evidence, like the referrer field or the site's topology, it is difficult to determine which scenario is correct. If the user agent is different from the previous request but accompanies a new cookie from the same host, it is fair to assume that a new user has entered the site. Of course, a new cookie from a new host, regardless of the agent, is a new user.
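The cookie scenarios described in the last two paragraphs can be written down as a small decision procedure. The Python sketch below is one possible reading, not a reference implementation. The State class, the classify function, and the verdict strings are hypothetical, and the rule order (same host first, then same agent, then "likely error") is an assumption made to reconcile the cases for a previously seen cookie.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class State:
    # cookie -> (host, agent) of the last request carrying that cookie
    cookies: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    # host -> (cookie or None, agent) of the last request from that host
    hosts: Dict[str, Tuple[Optional[str], str]] = field(default_factory=dict)


def classify(state, host, agent, cookie):
    """Classify one request and record it in the state."""
    previous = state.hosts.get(host)
    if cookie is None:
        # No cookie: fall back to the host; a known host is ambiguous.
        verdict = "ambiguous" if previous is not None else "new user"
    elif cookie in state.cookies:
        seen_host, seen_agent = state.cookies[cookie]
        if host == seen_host:
            verdict = "same user"
        elif agent == seen_agent:
            verdict = "same user (host changed)"
        else:
            verdict = "likely error"
    elif previous is None:
        verdict = "new user"  # new cookie from a new host
    else:
        prev_cookie, prev_agent = previous
        if agent != prev_agent:
            verdict = "new user"
        elif prev_cookie is None:
            verdict = "new user (first cookie)"
        else:
            verdict = "ambiguous"  # obfuscated cookie or a second user
    if cookie is not None:
        state.cookies[cookie] = (host, agent)
    state.hosts[host] = (cookie, agent)
    return verdict
```

The "ambiguous" verdicts are exactly the cases where supporting evidence such as the referrer field or the site's topology would be needed to decide.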
IP and Domain Name Counting
You can also learn something about visitors by studying their domain names. Though the log file may record IP addresses, your log analysis program can determine from many of these IP numbers the associated domain or ISP. This might tell you if your most important client -- or competitor -- has been looking at your web pages.
The simplest assumption to make about users is that each IP address or domain name represents a unique user. Using this method, all the requests made by the same host are treated as though they came from a single user. When a new host is detected, a new user profile is created and the corresponding requests are associated with the new user. Several methods that use additional information recorded in the access logs or other heuristics are also possible. One refinement is to use the user agent field. Using this method, new users are identified as above and also when requests coming from the same machine have different user agents. Another refinement is to place session timeouts on requests made from the same machine. The intuition is that if a certain amount of time has elapsed, then the old user has left the site and a new user has entered.
When using these methods for identifying users, the following situations occur when sequentially processing access logs:
- a new IP address is encountered (assume this is a new user),
- an already processed IP address is encountered and either
  - the user agent matches prior requests (assume this is the same user), or
  - the user agent field does not match any prior requests from the same IP (assume this is a new user),
- a session is terminated due to a timeout (assume a new user has entered the site).
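The situations listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the 30-minute timeout is an arbitrary placeholder (no value is prescribed here), and the names SESSION_TIMEOUT and assign_users are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed value for illustration only.
SESSION_TIMEOUT = timedelta(minutes=30)


def assign_users(requests, timeout=SESSION_TIMEOUT):
    """Assign a user id to each (timestamp, ip, agent) request.

    Requests must be sorted by timestamp, as they are in an access log.
    """
    active = {}  # (ip, agent) -> (user_id, time of last request)
    next_id = 0
    assigned = []
    for timestamp, ip, agent in requests:
        key = (ip, agent)
        record = active.get(key)
        # New IP, new agent on a known IP, or an expired session: new user.
        if record is None or timestamp - record[1] > timeout:
            user_id = next_id
            next_id += 1
        else:
            user_id = record[0]
        active[key] = (user_id, timestamp)
        assigned.append(user_id)
    return assigned
```

Keying the table on the pair of IP address and user agent covers the first three situations in one lookup; the timeout check handles the last one.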
Therefore, if a substantial part of your statistics indicates that many of the new hosts and timeouts came from hosts in the same domain/IP address space, you can infer that a large number of web site users connect to the Web via ISPs with load-balancing proxies, or that a large number of different users access the site from within the same domain (as would occur with a large company), or that some combination of both cases exists.
Regardless, a significant number of page requests can result in ambiguous cases, where it is not possible to determine the existence of new users with certainty. While the incidence rate can vary considerably from Web site to Web site, the results can be inaccurate since these IP-based methods and other IP-based derivatives are used in cases where unique identifiers like cookies are not present.
Another major problem that dilutes the quality of the data is caching. There are two major types of caching. First, browsers automatically cache files when they are downloaded, so it is not necessary to download the entire page again on a later visit. Depending on its settings, the browser may check with the server whether the page has changed; in that case, the server does learn of the visit, and a page request is recorded. However, if the browser is not set to verify whether a page has changed, then the user can read the page without any entry being recorded in the web log.
In addition, almost all ISPs now have their own cache. When a user requests a page that anyone else at the same ISP has requested recently, the cache will already hold a copy and will serve it without any request being made to the original site. Therefore, many people could request a site's pages from the same cache without the original web site (or its logs) even knowing about it.
For example, AOL uses caching extensively, and a single user with an AOL account may be reflected in your server logs by several different IP numbers as AOL uses its caching to grab the files for its user. If this happens, the logs will fail to identify a repeat customer. In addition, the logs will not be able to record whether a visitor typed a URL into their browser after seeing a particular advertisement. If a page is already cached when it is requested, no page requests at all might show up in the logs.
While there is no ideal solution for getting precise site visitor statistics, you can seek a solution that is congruent with your organization's business plan. Two of the more popular web site analysis software products are Webtrends and Site Statistics. Site Statistics is sold by NetPromoter, which has a suite of Search Engine Optimization products that are designed to optimize commercial websites according to the missions and strategies of an organization (see Figure 1). The corporate mission is realized through execution of its strategies, which are influenced by the data and statistics that are collected from the log statistics and tracking counter modules. This information in turn is used to adjust the web site's use of keywords, phrases, and navigation paths. The Site Statistics Top Ten Analyzer and Site Analyzer modules are designed to provide feedback that supports the design effort. This, in turn, is an iterative process that feeds back into re-evaluating the mission and strategies of the overall ecommerce initiative.
Figure 1 How Site Statistics Supports a Successful Web Site
The log analyzer module parses and analyzes the log files and presents the results in a useful format. It displays information on visitor IP addresses, referring pages, requested pages, user paths through the site, and more. In Figure 2 a sample report displays the number of hits (unique visitors), the number of visitors and pages viewed, and the bandwidth used by browser type.
Figure 2 A Sample report by the Site Statistics Log Analyzer
Cookie-based module (Script Generator)
This module does not require the installation of any additional software on either the web server or the client. However, potential drawbacks include visits by users who have disabled cookies and instances in which the counter cannot function correctly because the page does not load completely.
Top Ten Analyzer
The Top Ten Analyzer module queries search engines by keywords that are important to your site, and retrieves the sites that occupy the top-ranked positions for these search terms. Figure 3 is an example of a report that displays the top 10 search results for every selected keyword and search engine. This module can subsequently analyze these sites and discover the reasons for their top ranking, which include the density and prominence of keywords on these pages, embedded tags, and the referring sites.
Figure 3 An example Top Ten Report
The Site Analyzer module retrieves information from all pages on your site, generates the site structure and site map, and analyzes your site using the referrer and keyword information that is exported from the Log Statistics module. In addition, it:
- analyzes the site tree structure,
- displays the site map in the form of an interactive chart, which allows you to study all page correlations,
- conducts a site analysis by the actual keywords and key phrases (see Figure 4), and
- conducts a site analysis by referring pages.
Figure 4 Keyword Distribution Report
A comprehensive strategy
The suggested strategy of combining log analysis with a tracking counter helps overcome limitations of each data gathering tool and leads to a more comprehensive and accurate way to evaluate the success of a commercial web site. This article also suggests several other tools that are provided by NetPromoter that help to evaluate the success of your web site. The Site Analyzer tool supports the comparison of the effectiveness of your strategy with that of your competitors through an in depth analysis of the keyword distributions and other important parameters. These tools provide important information that can be iteratively used to fine tune your corporate strategies and ultimately help you to achieve your goals.