

Schemas for the www Superservice

DLF Schema for WWW service

Schema ID: www

Timestamp Field: time

In this DLF schema, each record represents a request to the web server. It carries the same information as the common log format supported by most web servers.

Fields in the Schema

client_host

Type: hostname

Defaults: -

The hostname (or IP address) of the client that made the request.

who

Type: string

Defaults: -

If the request was authenticated, this field contains the name of the authenticated user. Note that there is no indication of which authentication method was used (RFC1531, WWW authentication, etc.).

http_result

Type: string

Defaults: -

The numeric result code of the request, for example 200, 301, etc.

requested_page_size

Type: bytes

Defaults: -

The number of bytes sent to the client during the request.

http_action

Type: string

Defaults: -

The method used by the client for the request. This is usually one of GET, HEAD, POST, etc.

requested_page

Type: url

Defaults: -

The URL that was requested by the client.

http_protocol

Type: string

Defaults: -

The protocol used by the client. This is usually HTTP/1.0 or HTTP/1.1.

time

Type: timestamp

Defaults: 0

The time of the request.

referer

Type: string

Defaults: -

The content of the Referer header that was sent along with the request. This usually represents the referring URL, that is, the URL which the user was browsing when this URL was requested.

useragent

Type: string

Defaults: -

The content of the User-Agent header that was sent along with the request. This usually contains information about the web browser used by the client.

gzip_result

Type: string

Defaults: -

When automatic compression is used, this field contains the result code from the compression submodule.

compression

Type: int

Defaults: 0

When automatic compression of the results is used, this field contains the compression ratio achieved.
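
To make the field list above concrete, here is a minimal Python sketch that maps one log line to a record keyed by the www schema field names. It is an illustration only, not the converter shipped with the software: the regular expression, the parse_line function, and the choice to accept the optional Referer and User-Agent columns of the combined log format are all assumptions. The gzip_result and compression fields are not present in such a line and are left out.

```python
import re
from datetime import datetime

# Assumed input: common log format, optionally followed by the quoted
# Referer and User-Agent columns of the combined format.
LOG_RE = re.compile(
    r'(?P<client_host>\S+) \S+ (?P<who>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<http_action>\S+) (?P<requested_page>\S+) (?P<http_protocol>[^"]+)" '
    r'(?P<http_result>\d{3}) (?P<requested_page_size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<useragent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict keyed by www schema field names, or None if no match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # The schema types time as a timestamp; convert to epoch seconds.
    parsed = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["time"] = int(parsed.timestamp())
    # Columns missing from the line fall back to the schema default "-".
    for key in ("referer", "useragent"):
        if rec[key] is None:
            rec[key] = "-"
    return rec
```

All values except time stay as strings here, which matches the string type of http_result; converting requested_page_size to a number is left to the caller.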

Extended Schemas for the www Superservice

Attack Extended Schema for WWW service

Schema ID: www-attack

Base Schema: www

Module: Lire::Extensions::WWW::AttackSchema

Required Fields: requested_page

This is an extended schema for the WWW service which tries to detect common web attacks based on the requested URL.

Fields in the Schema

attack

Type: string

Defaults: Unknown/No Attack

The type of attack that this request represents.
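
The idea of deriving the attack field from requested_page can be sketched as a list of URL signatures checked in order. This reference does not list the rules used by Lire::Extensions::WWW::AttackSchema, so the two patterns and the attack names below are hypothetical placeholders; only the default value comes from the schema.

```python
import re

# Hypothetical signatures: the real module's rules are not documented here.
ATTACK_PATTERNS = [
    ("Directory traversal", re.compile(r"\.\./")),
    ("Command execution", re.compile(r"cmd\.exe", re.IGNORECASE)),
]

# Default value taken from the schema definition.
DEFAULT_ATTACK = "Unknown/No Attack"

def classify_attack(requested_page):
    """Return the name of the first matching signature, or the default."""
    for name, pattern in ATTACK_PATTERNS:
        if pattern.search(requested_page):
            return name
    return DEFAULT_ATTACK
```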

Domain Extended Schema for WWW service

Schema ID: www-domain

Base Schema: www

Module: Lire::Extensions::WWW::DomainSchema

Required Fields: client_host

This is an extended schema for the WWW service which adds country and client_domain fields based on the client host.

Fields in the Schema

client_domain

Type: hostname

Defaults: -

The domain of the client host.

country

Type: string

Defaults: Unknown

The country of the client host as determined by the top-level domain.
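
A rough sketch of how both fields could be derived from client_host follows. It assumes that the client domain is simply the last two labels of the hostname (which is wrong for suffixes such as co.uk) and uses a three-entry country table that is purely illustrative; neither choice is taken from the actual module.

```python
import ipaddress

# Illustrative excerpt only; a real table would cover every country-code TLD.
TLD_COUNTRIES = {"ca": "Canada", "nl": "Netherlands", "fr": "France"}

def domain_fields(client_host):
    """Derive client_domain and country; fall back to the schema defaults."""
    fields = {"client_domain": "-", "country": "Unknown"}
    try:
        ipaddress.ip_address(client_host)
        return fields  # unresolved IP address: nothing to derive
    except ValueError:
        pass
    labels = client_host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return fields
    # Assumption: the domain is the last two labels of the hostname.
    fields["client_domain"] = ".".join(labels[-2:])
    fields["country"] = TLD_COUNTRIES.get(labels[-1], "Unknown")
    return fields
```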

Robot Extended Schema for WWW service

Schema ID: www-robot

Base Schema: www

Module: Lire::Extensions::WWW::RobotSchema

Required Fields: None

This is an extended schema for the WWW service which adds a robot field based on information from the client's domain name or the useragent string.

Fields in the Schema

robot

Type: string

Defaults: Unknown/No Robot

The name of the robot that made the request.
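
As an illustration of the two information sources mentioned above, the following sketch checks the useragent string first and the client hostname second. The signature tables and robot names are hypothetical; only the default value is taken from the schema.

```python
# Hypothetical signatures: lowercase substring of useragent -> robot name.
ROBOT_AGENTS = {"googlebot": "Googlebot", "slurp": "Slurp"}
# Hypothetical signatures: hostname suffix -> robot name.
ROBOT_DOMAINS = {".googlebot.com": "Googlebot"}

# Default value taken from the schema definition.
DEFAULT_ROBOT = "Unknown/No Robot"

def detect_robot(client_host, useragent):
    """Guess the robot name from the user agent, then from the hostname."""
    agent = useragent.lower()
    for needle, name in ROBOT_AGENTS.items():
        if needle in agent:
            return name
    host = client_host.lower()
    for suffix, name in ROBOT_DOMAINS.items():
        if host.endswith(suffix):
            return name
    return DEFAULT_ROBOT
```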

Search Engine Extended Schema for WWW service

Schema ID: www-search

Base Schema: www

Module: Lire::Extensions::WWW::SearchSchema

Required Fields: referer

This is an extended schema for the WWW service which analyzes the referrals. It extracts the referring site and also determines whether the request came from a search engine.

Fields in the Schema

referring_site

Type: string

Defaults: -

The site which referred the request. This is usually a hostname, but it can also be bookmarks when the user used a bookmark.

search_engine

Type: string

Defaults: -

The name of the search engine, when the request was referred through a search engine.

keywords

Type: string

Defaults: -

The search phrase used when the request was referred through a search engine.
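
The three fields above can be approximated by splitting the referer URL and looking up its host in a table of known search engines. The table below (host substring, engine name, query parameter) is a hypothetical two-entry example, and the bookmarks case is not handled because this reference does not say how it is detected.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical table: host substring -> (engine name, query parameter
# that carries the search phrase).
SEARCH_ENGINES = {
    "google.": ("Google", "q"),
    "search.yahoo.": ("Yahoo", "p"),
}

def search_fields(referer):
    """Derive referring_site, search_engine and keywords from a referer."""
    fields = {"referring_site": "-", "search_engine": "-", "keywords": "-"}
    if referer in ("", "-"):
        return fields
    parts = urlsplit(referer)
    if not parts.hostname:
        return fields
    fields["referring_site"] = parts.hostname
    for needle, (engine, param) in SEARCH_ENGINES.items():
        if needle in parts.hostname:
            fields["search_engine"] = engine
            values = parse_qs(parts.query).get(param)
            if values:
                fields["keywords"] = values[0]
            break
    return fields
```

A referer that is not a search engine still yields a referring_site, while search_engine and keywords keep their default value.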

URL Extended Schema for WWW service

Schema ID: www-url

Base Schema: www

Module: Lire::Extensions::WWW::URLSchema

Required Fields: requested_page

This is an extended schema for the WWW service which parses the requested URL and adds several fields based on this information.

Fields in the Schema

requested_file

Type: filename

Defaults: -

The portion of the requested URL that represents a filename, that is, everything that comes before the ? which starts the QUERY_STRING.

requested_page_ext

Type: string

Defaults: -

The extension of the requested file.

directory

Type: filename

Defaults: -

The directory portion of the URL.
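
The split described by these three fields can be sketched as follows. The sketch assumes that requested_page holds a path such as /docs/index.html?x=1 rather than an absolute URL, that the extension is stored without its leading dot, and that the directory is stored without a trailing slash; this reference does not specify these details.

```python
import posixpath

def url_fields(requested_page):
    """Split a requested URL into requested_file, extension and directory."""
    # Everything before the "?" that starts the query string.
    requested_file = requested_page.split("?", 1)[0]
    directory, name = posixpath.split(requested_file)
    # Assumption: the extension is stored without the leading dot.
    ext = posixpath.splitext(name)[1].lstrip(".")
    return {
        "requested_file": requested_file or "-",
        "requested_page_ext": ext or "-",
        "directory": directory or "-",
    }
```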

User Agent Extended Schema for WWW service

Schema ID: www-user_agent

Base Schema: www

Module: Lire::Extensions::WWW::UserAgentSchema

Required Fields: useragent

This is an extended schema for the WWW service which adds fields to access information from the useragent field.

Fields in the Schema

browser

Type: string

Defaults: Unknown

The browser that was probably used to make the request, as guessed from the useragent field.

os

Type: string

Defaults: Unknown

The client's operating system, as guessed from the useragent field.

lang

Type: string

Defaults: Unknown

The client's language, as guessed from the locale information contained in the useragent field.
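
The guessing described above might look like the sketch below. The browser and operating system signatures and the pattern used to find a locale token (such as [en] or ; en-US;) are assumptions made for illustration; the heuristics of the actual module are not given in this reference.

```python
import re

# Hypothetical ordered signatures (first match wins). Opera is tested
# before MSIE because some Opera user agents also mention MSIE.
BROWSERS = [("Opera", "Opera"), ("MSIE", "Internet Explorer"), ("Gecko", "Mozilla")]
OSES = [("Win", "Windows"), ("Linux", "Linux"), ("Mac", "MacOS")]

# Assumed locale token: two lowercase letters, optionally followed by a
# region code, delimited by brackets, parentheses or semicolons.
LANG_RE = re.compile(r"[\[;(]\s*([a-z]{2})(?:[-_][A-Za-z]{2})?\s*[\];)]")

def first_match(signatures, useragent):
    """Return the label of the first signature found, or "Unknown"."""
    for needle, label in signatures:
        if needle in useragent:
            return label
    return "Unknown"

def ua_fields(useragent):
    """Guess browser, os and lang from a User-Agent string."""
    m = LANG_RE.search(useragent)
    return {
        "browser": first_match(BROWSERS, useragent),
        "os": first_match(OSES, useragent),
        "lang": m.group(1) if m else "Unknown",
    }
```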

Derived Schemas for the www Superservice

User Session Derived Schema for WWW service

Schema ID: www-user_session

Base Schema: www

Module: Lire::Extensions::WWW::UserSessionSchema

Required Fields: time, client_host

Timestamp Field: session_start

This is a derived schema for the WWW service which represents user sessions. User sessions track the traversal of users through the web site. Users are identified by their IP address and their user agent information. This method is not foolproof: for one thing, it clearly fails when several users share a homogeneous environment and browse from behind the same proxy server.

A possible enhancement would be to use tracking information from a cookie.

A session represents all the consecutive requests made by a user. The session ends after 30 minutes during which the user made no request.
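
The grouping rule described above can be sketched as follows. The build_sessions function, the dictionary layout, and the generated session identifiers are assumptions for illustration; the sketch also assumes that time is expressed in seconds and that a gap strictly longer than 30 minutes starts a new session.

```python
from itertools import count

SESSION_TIMEOUT = 30 * 60  # seconds without a request that end a session

def build_sessions(records):
    """Group www records into sessions keyed by (client_host, useragent)."""
    open_sessions = {}  # user key -> session currently being extended
    sessions = []
    ids = count(1)
    for rec in sorted(records, key=lambda r: r["time"]):
        key = (rec["client_host"], rec.get("useragent", "-"))
        current = open_sessions.get(key)
        if current is None or rec["time"] - current["session_end"] > SESSION_TIMEOUT:
            # First request of this user, or the previous session timed out.
            current = {
                "session_id": f"s{next(ids)}",  # arbitrary identifier
                "session_start": rec["time"],
                "session_end": rec["time"],
                "req_count": 0,
            }
            open_sessions[key] = current
            sessions.append(current)
        current["session_end"] = rec["time"]
        current["req_count"] += 1
        current["session_length"] = current["session_end"] - current["session_start"]
    return sessions
```

Two users behind the same proxy with identical browsers share one key in this sketch, which is exactly the limitation noted above.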

Fields in the Schema

session_id

Type: string

Defaults: -

This field contains an arbitrary session identifier.

session_start

Type: timestamp

Defaults: 0

The time at which the session started.

session_end

Type: timestamp

Defaults: 0

The time of the last request in the session.

session_length

Type: duration

Defaults: 0

The time elapsed between the first and last requests.

page_count

Type: int

Defaults: 0

The number of pages requested by the user in this session. (This excludes requests ending in .png, .jpg, .jpeg, .gif and .css.)

req_count

Type: int

Defaults: 0

The number of requests made by the user in this session.

first_page

Type: filename

Defaults: -

The first page requested by the user. (See page_count for exclusions.)

page_2

Type: filename

Defaults: -

The 2nd page requested by the user.

page_3

Type: filename

Defaults: -

The 3rd page requested by the user.

page_4

Type: filename

Defaults: -

The 4th page requested by the user.

page_5

Type: filename

Defaults: -

The 5th page requested by the user.

last_page

Type: filename

Defaults: -

The last page requested by the user.

completed

Type: bool

Defaults: -

Whether this session was completed. A completed session is one for which we know for sure that any further request by the user would have started a new session. Concretely, all requests made in the last 30 minutes of the period covered by the log file are part of uncompleted sessions.

visit_number

Type: int

Defaults: 0

This starts at 1 for the first session of a user in the log file and is incremented for each new session started by that user in the same log file.
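
The counting and completion rules in the field descriptions above can be illustrated with a short sketch. The function names are hypothetical, and two points are assumptions rather than documented behavior: that last_page and page_2 through page_5 apply the same exclusion as page_count, and that a session counts as completed only when more than 30 minutes separate its last request from the end of the log.

```python
# Extensions excluded from page counts, as listed for page_count.
NON_PAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css")
SESSION_TIMEOUT = 30 * 60  # seconds

def page_fields(requested_files):
    """Compute the page-related session fields from the ordered requests."""
    pages = [f for f in requested_files if not f.endswith(NON_PAGE_EXTENSIONS)]
    fields = {
        "req_count": len(requested_files),
        "page_count": len(pages),
        "first_page": pages[0] if pages else "-",
        "last_page": pages[-1] if pages else "-",
    }
    # page_2 .. page_5 keep the default "-" when the session is shorter.
    for n in range(2, 6):
        fields[f"page_{n}"] = pages[n - 1] if len(pages) >= n else "-"
    return fields

def is_completed(session_end, log_end):
    """True when a later request would certainly start a new session."""
    return log_end - session_end > SESSION_TIMEOUT
```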