Log File Lowdown
by Michael Calore
Hey, you! What are you doing? Where are you going? More importantly, what are you clicking on?
Ah, if only it were that easy. But no, most users like to travel the Web incognito. They come to your site, poke around a few files, download a PDF or two, and then -- poof -- disappear, leaving nothing but questions in their wake: Where did they come from? Which browsers are they using? Are they experiencing any errors?
Short of asking outright, there are ways to find the answers to these questions. The most celebrated method of tracking users is by planting cookies, which some folks consider rude or invasive, and, oh yeah, you need to know how to program them. Not to worry -- there is an option that requires very little technical know-how, comes at no (or nominal) cost, and may already be a part of your site's backend. I'm talking about logs! No, not those fancy browser-based publishing automation systems with a cult-like following. I speak of log files.
Almost every Web server worth its salt has some sort of system that stores information about which pages, images, and files are requested, who requests them, and how many bytes are transferred. All of this information is dumped into a log file that is stored in a specific location on your server.
These log files are yours to explore. You can simply open the log file in an ordinary text editor and read the raw data. Or, for a more user-friendly view of the info, suck the log file into a nifty stand-alone software package or browser-based viewer, which parses the data and spits it out as charts, graphs, or tables that clearly illustrate your users' activities.
Not sure why this information is valuable? Well, if you've invested time and money in a website, one of your biggest points of interest is indubitably traffic -- whether people are exposed to advertisements or your products, traffic is directly proportional to revenue. But there's more to traffic than just eyes on pages. Sure, the numbers you get from your log files will tell you how many people visited your site in any given space of time, but traffic data can also be studied to give you a clear, precise idea of what kinds of viewing practices your users exhibit.
Let's say a user comes to your site and views a few pages. In server-speak, user actions are counted in requests. Any time the user is served an image, an HTML file, or an ad, it counts as a request. If 17 HTTP requests are served in one session, how many of those 17 requests are images? How many are ads? How many turned up as (eek!) 404 "Not Found" errors? These are the types of questions that can be answered by picking over your log files and generating in-depth reports.
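To make that concrete, here's a rough Python sketch that tallies a handful of logged requests: how many total, how many were images, and how many bombed with a 404. The hosts, paths, and log lines are all invented for the example.

```python
import re
from collections import Counter

# A few sample CLF-style lines (hypothetical hosts and paths) standing
# in for a real log file.
LOG_LINES = [
    '10.0.0.1 - - [09/May/2001:13:42:07 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [09/May/2001:13:42:08 -0700] "GET /logo.gif HTTP/1.1" 200 2048',
    '10.0.0.1 - - [09/May/2001:13:42:08 -0700] "GET /photo.jpg HTTP/1.1" 200 9301',
    '10.0.0.1 - - [09/May/2001:13:42:09 -0700] "GET /missing.html HTTP/1.1" 404 0',
]

IMAGE_EXTENSIONS = ('.gif', '.jpg', '.jpeg', '.png')

def tally(lines):
    """Count total requests, image requests, and 404s in a batch of log lines."""
    counts = Counter()
    for line in lines:
        match = re.search(r'"(?:GET|POST|HEAD) (\S+) [^"]+" (\d{3})', line)
        if not match:
            continue
        path, status = match.groups()
        counts['requests'] += 1
        if path.lower().endswith(IMAGE_EXTENSIONS):
            counts['images'] += 1
        if status == '404':
            counts['errors_404'] += 1
    return counts

print(tally(LOG_LINES))
```

Run against the four sample lines above, the tally comes back with four requests, two of them images, and one 404.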
The trick is to learn as much as possible about what is being served to your users. Vitals like location, browser version, and time spent on your site allow you to tailor your content and presentation design specifically to please the users that you're doing business with.
So what, exactly, do you look for? Let's take a closer look.
The Prizes Inside
There are a number of areas where the data housed in your logfiles can help you understand and cater to your users:
What's your traffic like to any given page? Are there certain pages that stand out as high-traffic areas? Pages that corral more viewers are hot in terms of real estate — ad space on those high-traffic pages should cost more, right? And what is the overall volume like on your site? Do you see traffic jump when you publish exciting, new content, or does it stay relatively flat throughout your publishing schedule? Do you get twice as much traffic on Fridays than Mondays? Thorough traffic reporting will present the answers to these questions if you take the time to seek them out.
Who's visiting your site? Are most of your users from the United States or Japan? You can look at the domain names or even IP addresses of your visitors and determine where they are geographically. You can also find out where your visitors are coming from demographically. Are you being visited by AOL users, university students, or workers at defense contracting firms? A site in Cancun, Mexico that sees heavy traffic from American university students should be certain that its English translation service is doing its job — the site is also especially ripe for ads pushing college spring break travel packages.
Are your users primarily Macintosh users? Linux users? Since your site probably varies in presentation between OS X and GNU/Linux, you can use the reports about platform specifics to round out your site testing and quality assurance practices. And as any savvy developer knows, the differences between how a page looks in IE 5.5 and Netscape 6 or Konqueror can be astounding. Are you using gobs of IE-specific CSS positioning on pages that are primarily being viewed by BeOS users? For your sake and theirs, I hope not.
Browser plug-ins are fun only when they work, so if you have any content that's "plug-in required," you should be sure that the majority of your users are running a platform for which the needed plug-ins are available.
What kinds of errors are your log files reporting? Are any links on your site handing out those pesky 404s? Better check those links, then. Are your redirects working or are they pointing your users out into the ether? Are any of your scripts loading incorrectly? Even if everything runs ship-shape on your workstation, a report that shows faulty scripts might lead you to test them on different browsers or from behind a firewall. Are users ditching an image before it fully loads up? There's a cause for concern — look into it. The image may have an error, or may simply need to be optimized.
A referer indicates where a user was referred from, whether it be an advertisement, a link somewhere else on your site, or a link on someone else's site. You can use your referer data to see what kind of traffic you're getting out of a plug on a message board, an ad, or even a mention on Slashdot.
Getting at the Info
So, how do you get your paws on all this valuable data? If you're hip to Unix, you can use grep and sort commands to extract data from raw log files. Or just FTP down a logfile and open it up in your favorite text editor.
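If grep isn't in your toolbox, a few lines of Python can do a similar extraction. Here's a sketch — the sample log lines are invented — that pulls out every request that came back with a 404:

```python
# A grep-like filter in Python: pull every line whose status code is 404.
# The sample data is made up; in practice you'd read lines from your log file.
RAW_LOG = """\
10.0.0.1 - - [09/May/2001:13:42:07 -0700] "GET /index.html HTTP/1.1" 200 5120
10.0.0.2 - - [09/May/2001:13:42:09 -0700] "GET /old-page.html HTTP/1.0" 404 0
10.0.0.3 - - [09/May/2001:13:43:11 -0700] "GET /style.css HTTP/1.1" 200 880
"""

def grep_status(text, code):
    """Return the lines whose HTTP status field matches the given code."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # In Common Log Format, the status code sits right after the
        # closing quote of the request field (ninth whitespace-split field).
        if len(fields) >= 9 and fields[8] == code:
            hits.append(line)
    return hits

for line in grep_status(RAW_LOG, '404'):
    print(line)
```

On the sample data, only the "/old-page.html" request turns up — that's the link you'd want to go fix.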
A Sample Log File
Next time you fire up your FTP client or log in to your Web server, take a moment and dig around for your log files. On most Web servers, you will find a directory — usually in your root directory, or the parent directory just above it — named "logs" or "stats". Inside, you will most likely see a file with a .log, .web, or .clf extension. Since Web logs are essentially text files, some will even have a .txt file extension. Download the log file, save it to a local drive, and have a look.
Most servers generate CLF (Common Log Format) files, but they also come in other flavors, like ELF (Extended Log Format) and DLF (Combined Log Format). Some servers produce files with different extensions in different formats, but most of the log file types out there are formatted much like CLF files. For this reason, we'll use the structure of a CLF file for our example.
In Common Log Format files, each line represents one request. So if a user comes to your site and is served a page with three images, it shows up as four lines of text in your CLF file — one request each for the three images and one request for the HTML file itself.
CLF files are standardized, so they almost always look the same. A normal CLF file logs the data in this format:

user's computer ident userID [date and time] "requested file" status filesize
The fields are separated by spaces. Some fields, such as the date and request information, are defined with punctuation. If any of the fields are non-existent during the session logged, the server puts a hyphen in the place of the non-active field. Let's look at these fields one by one.
- The remote host information shows the IP address and, in some cases, the domain name of the client computer requesting the file.
- The ident information is logged if your server is running IdentityCheck, an antiquated directive that was once used for thorough server logging. It was phased out of general use because it required the identification process to run every time a file is served. Because this process can sometimes take 5 or 10 seconds, most sites turn IdentityCheck off so that their pages load more quickly.
- If your site requires a password upon login, the userID that the user entered is logged in this field. If you don't have any user login features on your site, this field is no big deal.
- The date field is straightforward — the date and time of the request is logged here.
- The request field logs the type of request made by the user, as well as the path and name of the requested file.
- The status field contains a three-digit code that tells you if the file was transferred successfully or not. These codes are standard HTTP codes.
- The filesize field is also straightforward — it lists the number of bytes transferred when the requested file was served.
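Putting those seven fields together, here's a hedged sketch of how you might split a CLF line apart in Python. Real-world logs have edge cases (escaped quotes, odd host formats) that this simple pattern skips.

```python
import re

# One capture group per CLF field, in the order described above.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) '         # remote host (IP address or domain name)
    r'(?P<ident>\S+) '        # ident info (usually just "-")
    r'(?P<userid>\S+) '       # userID from password logins (usually "-")
    r'\[(?P<date>[^\]]+)\] '  # date, time, and time zone
    r'"(?P<request>[^"]*)" '  # request type, path, and protocol
    r'(?P<status>\d{3}) '     # three-digit HTTP status code
    r'(?P<size>\S+)'          # bytes transferred ("-" if none)
)

sample = '10.0.0.1 - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741'
fields = CLF_PATTERN.match(sample).groupdict()
print(fields['request'])                    # GET /about.htm HTTP/1.1
print(fields['status'], fields['size'])     # 200 3741
```

Notice the two lonely hyphens: that's the server plugging placeholders into the ident and userID fields, exactly as described above.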
For the following example, I've extracted one line from a log file that records the activity on my own personal website, snackfight.com. My hosting company serves my site using Apache, and they've tweaked a few options to provide me with more comprehensive data. (Apache's mod_log_config module allows you to customize the string that's fed into the logs.) I've divided this logged request into its separate parts for clarity — normally, all of this data would be dumped onto one single line in the log file.

adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741 "http://www.e-angelica.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"
The first part of the request shows the user's local domain. I can see that this is a DSL subscriber on the BellSouth network. The two hyphens that follow are where the IdentityCheck and userID information would normally show up, but since my site does not utilize either of these processes, I get nothing but hyphens. Next, in brackets, is the date, then the time (in 24-hour format), followed by the user's time zone offset.
The request field, displayed within quotes, shows that the user asked the server to GET a page. Other request types are POST, DELETE, and HEAD, though you don't see those nearly as often. Following the request type is the path and name of the file. In this case, the user was requesting the "about.htm" file in the root directory of snackfight.com. Also, you can see that the protocol used here was the good old Hypertext Transfer Protocol, version 1.1.
The status field shows a status code of 200, meaning that everything went through just peachy. A status code of 404, as you may know, means that the file was not found on the server. Immediately following the status code is the file size of "about.htm". It's 3,741 bytes. Hey, not bad! I'll bet it loaded nice and quick.
The next two fields are especially interesting. These are custom fields that my hosting company has added to its logging so that I can get a better idea of who's visiting my site. The first field, in quotes, is the referer field. This is where my user clicked on a link in order to arrive at the page he was just served. I can see that this particular user is a fan of the e-angelica site, because that's where he came from to arrive at my site. In some cases, referers are logged in their own log file. These referer logs usually use the same format and can also be viewed or run through an analyzer. For the full skinny on referer logs, check out Jeff's article.
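As a sketch of how you might mine that referer field yourself, here's a bit of Python that counts which referring pages sent the most traffic. The hosts and URLs are invented for the example.

```python
from collections import Counter

# Hypothetical combined-format lines; the referer is the second quoted
# field, sitting right after the byte count.
LOG_LINES = [
    '10.0.0.1 - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741 '
    '"http://www.example.com/links.html" "Mozilla/4.0"',
    '10.0.0.2 - - [09/May/2001:14:01:33 -0700] "GET /about.htm HTTP/1.1" 200 3741 '
    '"http://www.example.com/links.html" "Mozilla/4.0"',
    '10.0.0.3 - - [09/May/2001:15:12:50 -0700] "GET /index.html HTTP/1.1" 200 5120 '
    '"http://search.example.org/?q=snacks" "Mozilla/4.0"',
]

def top_referers(lines):
    """Count how many hits each referring URL sent our way."""
    referers = Counter()
    for line in lines:
        quoted = line.split('"')
        # Splitting on quotes puts the quoted fields at the odd indexes:
        # [1] request, [3] referer, [5] user agent.
        if len(quoted) >= 6:
            referers[quoted[3]] += 1
    return referers

print(top_referers(LOG_LINES).most_common())
```

On the sample lines, the links page sent two visitors and the search engine sent one — exactly the kind of breakdown that tells you which plugs are pulling their weight.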
The last field, also in quotes, shows some information about the user's browser and platform, in this case, Internet Explorer 5.0 on a Windows 98 machine. Oh, how original!
And that's about it! It's a lot of information, I know, and your log file may store even more goodies (an in-depth explanation of the syntax used in log files can be found in the massive spec for HTTP 1.1, which is also useful as a reference when looking up header fields and server status codes.)
All this data is a little overwhelming, no? Especially in its raw state. If you're not exactly thrilled about the idea of picking through thousands of lines of text and status codes to determine whether or not your users are being served in the most efficient manner, there are several software packages on the market that you can use to generate reports without getting your hands dirty (and without opening your text editor). But which one is right for you? Well, that all depends on what you're looking for.
Different Ways of Looking at It
Picking out a log file analyzer package is a bit like stocking up for a party. If you're only inviting a few close friends, maybe you can get away with a case of soda and a few bags of chips. But what happens if everyone you know unexpectedly invites seven people? All of a sudden, that case of soda isn't good enough anymore. There's a lesson here — be a good scout and Be Prepared!
For solid, intelligent reporting, it's crucial to pick out a software application that covers everything under the sun — even for a small site with moderate traffic. Sooner or later, you'll want to generate a strange or unique report, and having the tools at hand to do so is essential. The extra functionality that comes with the larger, robust applications allows you to generate just about any kind of log file report possible.
If your server produces any of the usual file types (again, CLF, DLF, ELF), you're in the green. However, there are a great many proprietary file formats out there, in which case you will need to check and see if your desired software package can understand the flavor of log file you'll be analyzing.
Almost every log file analyzer runs locally on your computer. Some of the more forward-thinking software companies, however, now offer hosted log file report generators. These options are lightweight yet powerful and they can be accessed from any computer with Web access, whether or not it's directly connected to your server. Though their functionality is limited and security issues are always a concern, a hosted solution is often less costly than a large software application.
All About the Washingtons
Speaking of price, there are a plethora of log file analyzer applications available on the Web as freeware or shareware. The open source software movement has made many of these indispensable backend utilities available for free, though finding technical support for some of them can be challenging. Most have excellent online documentation, but lack a telephone- or email-based support structure. If you don't feel comfortable walking the tightrope without a safety net but you're on a budget, there are several software companies that provide tiered pricing on their products — you only buy the level of functionality that you require.
Back to our party analogy: If you're planning a really small get together, say three or four people, you may not even need to buy anything extra. You can get by on what you've got in the fridge. The same thing goes for log files.
If all you're interested in is hits, you can grab a handy free counter like Site Meter. If you're running a Microsoft Internet Information Server (IIS), Microsoft's Site Server application has extensive logging and analyzing capabilities. Also, if your site is hosted, the good Web hosting providers will offer browser-based log file reports as part of (or as an add-on to) their basic service. Graphs, charts, numbers in a row — all a few clicks away.
Once you've managed to clearly define your needs and limitations, you're ready to go to market. There are many log file analyzers to choose from, and you may have to do some research on your own to select the one that's Cinderella-slipper-perfect for your needs. But to get the ball rolling, let's take a closer look at the most popular solutions: WebTrends and Sawmill (for pay) and Analog, Webalizer, and http-analyze (for free).
Pay Your Way: WebTrends and Sawmill
If you're looking for an application that can generate fast, effective reports of any kind and display them in a format that's easy to read and understand, you have several options. The applications we'll be looking at first are commercial products (i.e., they cost money), but they also come with technical support and they're built with the newbie user in mind.
First up is some software from WebTrends, which is the industry leader for log file reporting applications on the enterprise and small-to-medium business level. Even the single-site analysis application, WebTrends Log Analyzer, is widely used by webmasters of small sites who want to keep a close eye on their users' habits.
To a new user, the most appealing aspect of the WebTrends Log Analyzer is the user-friendly interface. When you launch the program, you are presented with a wizard that helps you build a profile for your site and locate your log files (either locally or on your Web server). As you generate reports, you can use your mouse to select or deselect different stats from a list.
This type of selective reporting is useful if you'd like to isolate certain information, such as peak usage times or advertising click-through numbers. The program crunches through your server logs and generates a local folder full of navigable HTML files that you can sift through with your browser.
The advantages of the WebTrends analyzer are its ease-of-use and ability to generate just about any report you could possibly want. The amount of information that WebTrends spits out is staggering — it took me a few tries to pare down my desired stats to something even vaguely readable. Also, WebTrends software can interface with all of the big Microsoft Office products, which means you can dump reports into MS Excel or MS Word.
There are two disadvantages to WebTrends Log Analyzer. The first is that it only runs on Windows. The second sticky point is the price. Even though WebTrends offers tiered pricing — separate packages for e-commerce, enterprise, and even a full-featured hosted solution — the low-end starts at a high price. And then you have to pay extra for technical support. The WebTrends Log Analyzer that I used is $699 US by itself, and $838 US if you purchase it along with a year of telephone support. View some sample reports, and download a trial version.
Another commercial log file analyzer with a significant amount of gusto is Sawmill. Sawmill is not as feature-rich as WebTrends, and it may not look as pretty, but it certainly gets the job done.
Sawmill's interface is entirely browser-based. Plus, they've included a quick start option for the anxious webmaster — all you need to do is tell it where your server logs are, give the report a name, and click Submit. The reports that I generated were for the entire six months that my site has been active, and the detail was surprising. The program's navigable calendar makes it simple to zoom in on a particular month, week, or day and view only the stats for a specific span of time. Every report you need is just a click away — referers, browser type, operating system, domains of the visitors, and others.
With all that functionality, Sawmill is a bargain at $99 US for a single-user license. Telephone and email support are free. It also runs on a wide variety of operating systems: not only Windows and Linux, but also Mac OS, BSD, and even BeOS.
The only disadvantage is that the interface can be obtuse. The user experience is not as professional and thorough as WebTrends. Even so, Sawmill gets the Monkey stamp of approval. Find out for yourself by perusing the fully browsable sample reports, and downloading a free trial.
There are literally hundreds of commercial log analysis packages in the marketplace. Download a few and take them for a spin.
If you are running Windows, try Surfstats, NetGenesis 5 from NetGenesis Corp., or FastStats Analyzer from Mach 5 Enterprises. For the cross-platform crew, including Mac and Unix, try FlashStats from Maximized Software, ThinWEB Technologies' WebCrumbs, or Laurent Domisse's W3Perl.
Now let's have a peek at the shareware and freeware options for analyzing those log files.
All Your Freeware Are Belong to Tux
First up in the free software world is Analog, the program that claims to be the "most popular log file analyzer in the world." That may be true, but how does it stand up against the competition? Very well, it turns out.
Analog is fully configurable, which means you can tweak it to produce referer reports, error reports, or anything that's reflected in your raw log file data. The only thing is, you have to know how to tweak it. A complete user manual is available on the Web, plus there's a user email list that you can turn to if you get stuck. This is helpful, as the learning curve is a little steep for the inexperienced user. It runs on Windows, Macintosh, Linux, Unix, and BSD. It understands almost every log file format on the Web, and it can generate reports in over 30 languages. On top of all that, it's free!
The reports generated by Analog are pumped into one large HTML file that you can drag into your browser and scroll through (here, see a sample). All sorts of highly detailed data is here, but the reports are not as easy to read as the commercial applications, which are more concerned with look and feel. Of course if you'd like to see more stylish-looking reports, you can download a free add-on called Report Magic, which gives Analog the pretty user interface you'd expect from a program that costs a whole lot more.
In the Linux crowd, many would argue that nothing beats Webalizer. The free application generates highly detailed, easy-to-read reports in HTML (check out the graphing capabilities in these sample reports). It also runs on a host of operating systems and speaks multiple languages.
Webalizer, being born of Unix-kind, is more difficult to use and customize than other applications. If you know your way around Bash or Perl, then you'll have no problem configuring reports from the command line. If that's all Greek to you, then your reports will remain on the stale side. The default settings produce such beautiful-looking reports, however, I can't help but recommend it.
The last of our free picks, HTTP-analyze, does have a few bits of bad karma. The most prominent is that it only effectively parses CLF and DLF/ELF log files. If your server uses even the slightest variation from standard format for the access logs, HTTP-analyze will generate an incomplete report. Also, HTTP-analyze is a Unix-based program, meaning that its operating system support is more limited than the others.
A few pieces of helpful documentation, such as the user manual and FAQ, aren't even available online yet, but they should be finished soon. But if you're running Windows NT or a flavor of Unix, download it and take advantage of the unique interface.
At this juncture, I encourage you to go explore your log files. You've got enough under your belt to start analyzing your site activity. But before you go dive into your logs, I have a few hard-won tips and hints to pass along.
One point that I can't stress enough is the importance of long-term logging. If your Web server is configured to erase your old log files every month, either change the server's configuration or save copies of your log files locally. It's very insightful to see the differences in site traffic over a four-year period. For example, by looking at the user agent information over time, you can see how quickly and how often your users upgrade their browsers or operating systems to the latest versions.
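As a rough sketch of that kind of trend-watching, here's some Python that buckets browser versions by month. The dates and agent strings are invented for the example.

```python
import re
from collections import defaultdict

# Invented (date, user-agent) pairs standing in for data pulled from
# several months of saved log files.
REQUESTS = [
    ('09/May/2001', 'Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)'),
    ('14/May/2001', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)'),
    ('02/Jun/2001', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)'),
    ('20/Jun/2001', 'Mozilla/5.0 (X11; Linux i686)'),
]

def browsers_by_month(requests):
    """Map each month to a count of the browser versions seen that month."""
    trend = defaultdict(lambda: defaultdict(int))
    for date, agent in requests:
        month = date.split('/', 1)[1]            # "09/May/2001" -> "May/2001"
        version = re.search(r'MSIE [\d.]+', agent)
        family = version.group(0) if version else 'Other'
        trend[month][family] += 1
    return trend

trend = browsers_by_month(REQUESTS)
for month, counts in sorted(trend.items()):
    print(month, dict(counts))
```

Even this toy data tells a story: between May and June, the IE 5.0 visitor upgraded (or left), and a Linux browser showed up. Over four years, those shifts become your upgrade curve.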
When you're looking at your log files, either in raw form or in an analyzer, you'll probably notice a file called "robots.txt" in your root directory that's getting a whole bunch of hits. Don't worry, that's not a mistake — it only means that a search engine robot was crawling your site. Search engines send out their robots, also called spiders or crawlers, every now and then to crawl the Web and see what's out there. If you include a robots.txt file in your root directory, you can give specific instructions to a robot: Tell it to go away, or point it to the information that you would like to make searchable. For more information on how the robots.txt file works, visit the Web Robots pages.
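As a sketch, a minimal robots.txt that turns one robot away entirely and steers everyone else past a private directory might look like this (the robot name and directory are made up for illustration):

```
# Turn one specific robot away from the whole site (the name is hypothetical).
User-agent: BadBot
Disallow: /

# Everyone else: crawl anything except this private directory.
User-agent: *
Disallow: /private/
```

Drop a file like this in your root directory and well-behaved robots will follow the rules; your logs will still show the hits on robots.txt itself.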
And here's a handy trick to take with you. Did you do your good Monkey deed and create a favorites icon for your site? If so, you can find out how many people are actually seeing your icon simply by running a report that counts hits on your "favicon.ico" file.
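That report boils down to almost a one-liner. Here's a sketch, with invented sample log lines standing in for the real thing:

```python
# Counting favicon.ico hits — a quick proxy for how many visitors see your
# icon. Sample lines are invented; in practice, loop over the real log file.
LOG_LINES = [
    '10.0.0.1 - - [09/May/2001:13:42:07 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [09/May/2001:13:42:08 -0700] "GET /favicon.ico HTTP/1.1" 200 318',
    '10.0.0.2 - - [09/May/2001:14:10:40 -0700] "GET /favicon.ico HTTP/1.1" 200 318',
]

favicon_hits = sum(1 for line in LOG_LINES if '"GET /favicon.ico' in line)
print(favicon_hits)
```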
That's it for Web logs! But really, it's only the beginning of the user tracking game. For in-depth information about tracking and utilizing data, read Bill's tutorial about these very topics. To understand the importance of knowing your audience, check out Lesson 2 of Josh and Oliver's Market Research on the Web tutorial. Or just give your eyes a rest and go protest the production of the wooden kind.
Michael Calore is Webmonkey's senior technical editor. He's the ringleader and publisher of Snackfight, and a part-time musician. His favorite movie is Dude, Where's My Car?
Copyright © 1994-2004 Wired Digital Inc., a Lycos Network site. All rights reserved.