Database Format
---------------

Records
-------

Records are formatted like RFC 822 messages. Unless otherwise specified,
values may not contain HTML or empty lines, but may contain 8-bit values.
Where a value contains "one or more" tokens, they are to be separated
by a comma followed by a space.

Fields can be repeated and grouped by appending a number to the field
name, for example:

    robot-owner-name1: Mr A. RobotAuthor
    robot-owner-url1: http://webrobot.com/~a/a.html
    robot-owner-name2: Mr B. RobotCoAuthor
    robot-owner-url2: http://webrobot.com/~b/b.html

Fields Schema
-------------

robot-id:
    Short name for the robot, used internally as a unique reference.
    Should use [a-z-_]+
    Example: webcrawler

robot-name:
    Full name of the robot, for presentation purposes.
    Example: WebCrawler

robot-details-url:
    URL of the robot home page, containing further technical details
    on the robot, background information etc.
    Example: http://webcrawler.com/WebCrawler/Facts/HowItWorks.html

robot-cover-url:
    URL of the robot product, containing marketing details about
    either the robot, or the service to which the robot is related.
    Example: http://webcrawler.com/

robot-owner-name:
    Name of the owner. For service robots this is the person running
    the robot, who can be contacted in case of specific problems.
    For robot products this is the person maintaining the product,
    who can be contacted if the robot has bugs.
    Example: Brian Pinkerton

robot-owner-url:
    Home page of the robot-owner-name
    Example: http://info.webcrawler.com/bp/bp.html

robot-owner-email:
    Email address of owner
    Example: bp@webcrawler.com

robot-status:
    Deployment status of the robot. One of:
    - development: robot under development
    - active: robot actively in use
    - retired: robot no longer used

robot-purpose:
    Purpose of the robot. One or more of:
    - indexing: gather content for an indexing service
    - maintenance: link validation, html validation etc.
    - statistics: used to gather statistics
    Further details can be given in the description

robot-type:
    Type of robot software. One or more of:
    - standalone: a separate program
    - browser: built into a browser
    - plugin: a plugin for a browser

robot-platform:
    Platform robot runs on. One or more of:
    - unix
    - windows, windows95, windowsNT
    - os2
    - mac
    etc.

robot-availability:
    Availability of robot to general public. One or more of:
    - source: source code available
    - binary: binary form available
    - data: bulk data gathered by robot available
    - none
    Details on robot-details-url or robot-cover-url.

robot-exclusion:
    Standard for Robots Exclusion supported. yes or no

robot-exclusion-useragent:
    Substring to use in /robots.txt
    Example: webcrawler

robot-noindex:
    noindex directive supported: yes or no

robot-host:
    Host the robot is run from. Can be a pattern of DNS and/or IP.
    If the robot is available to the general public, add '*'
    Example: spidey.webcrawler.com, *.webcrawler.com, 192.216.46.*

robot-from:
    The HTTP From field as defined in RFC 1945 can be set. yes or no

robot-useragent:
    The HTTP User-Agent field as defined in RFC 1945
    Example: WebCrawler/1.0 libwww/4.0

robot-language:
    Languages the robot is written in. One or more of:
    c, c++, perl, perl4, perl5, java, tcl, python, etc.

robot-description:
    Text description of the robot's functions. More details should
    go on robot-details-url.
    Example: The WebCrawler robot is used to build the database
    for the WebCrawler search service operated by GNN (part of AOL).
    The robot runs weekly, and visits sites in a random order.

robot-history:
    Text description of the origins of the robot.
    Example: This robot finds its roots in a research project at
    the University of Washington in 1994.

robot-environment:
    The environment the robot operates in. One or more of:
    - service: builds a commercial service
    - commercial: is a commercial product
    - research: used for research
    - hobby: written as a hobby

modified-date:
    The date this record was last modified.
    Format as in HTTP
    Example: Fri, 21 Jun 1996 17:28:52 GMT
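
The record rules above (RFC 822 style "name: value" lines, numbered
field groups, and comma-plus-space token lists) can be sketched as a
small reader. This is a minimal illustration in Python, not part of the
database itself: the names parse_record, validate, MULTI_TOKEN_FIELDS
and EXAMPLE are hypothetical, the sample record is assembled from the
examples in this document, and treating indented lines as RFC 822 style
continuations is an assumption.

```python
import re
from collections import defaultdict
from email.utils import parsedate_to_datetime

# Fields the schema describes as holding "one or more" tokens.
MULTI_TOKEN_FIELDS = {
    "robot-purpose", "robot-type", "robot-platform",
    "robot-availability", "robot-language", "robot-environment",
}
STATUS_VALUES = {"development", "active", "retired"}

# A field line: lowercase name, optional group number, colon, value.
FIELD_RE = re.compile(r"^([a-z-]+)(\d*):\s*(.*)$")


def parse_record(text):
    """Return (fields, groups) for one record.

    fields maps unnumbered field names to values; groups maps a group
    number to the numbered fields sharing it (e.g. robot-owner-name1).
    """
    fields = {}
    groups = defaultdict(dict)
    last = None  # (container, key) of the previous field, for continuations
    for line in text.splitlines():
        if not line.strip():
            continue  # values may not contain empty lines; skip separators
        if line[0] in " \t" and last is not None:
            # Assumed RFC 822 style folding: indented lines extend a value.
            container, key = last
            container[key] += " " + line.strip()
            continue
        match = FIELD_RE.match(line)
        if match is None:
            raise ValueError(f"not a field line: {line!r}")
        name, number, value = match.groups()
        container = groups[int(number)] if number else fields
        container[name] = value.strip()
        last = (container, name)
    # Split token lists on a comma followed by a space, as specified.
    for name in MULTI_TOKEN_FIELDS:
        if name in fields:
            fields[name] = fields[name].split(", ")
    return fields, dict(groups)


def validate(fields):
    """Return a list of problems found in a few easily checked fields."""
    problems = []
    # [a-z_-]+ accepts the same characters as the schema's [a-z-_]+.
    if not re.fullmatch(r"[a-z_-]+", fields.get("robot-id", "")):
        problems.append("robot-id should use [a-z-_]+")
    status = fields.get("robot-status")
    if status is not None and status not in STATUS_VALUES:
        problems.append(f"unknown robot-status: {status}")
    if "modified-date" in fields:
        try:
            parsedate_to_datetime(fields["modified-date"])
        except (TypeError, ValueError):
            problems.append("modified-date is not an HTTP-style date")
    return problems


# Illustrative record built from the examples in this document.
EXAMPLE = """\
robot-id: webcrawler
robot-name: WebCrawler
robot-owner-name1: Mr A. RobotAuthor
robot-owner-url1: http://webrobot.com/~a/a.html
robot-owner-name2: Mr B. RobotCoAuthor
robot-owner-url2: http://webrobot.com/~b/b.html
robot-status: active
robot-purpose: indexing, statistics
robot-description: The WebCrawler robot is used to build the database
 for the WebCrawler search service.
modified-date: Fri, 21 Jun 1996 17:28:52 GMT
"""

if __name__ == "__main__":
    record_fields, record_groups = parse_record(EXAMPLE)
    print(record_fields)
    print(record_groups)
    print(validate(record_fields))
```

Numbered fields are kept in a separate mapping keyed by group number so
that each owner's name and URL stay together; a reader that needs a
flat list could equally keep the numbered names as they are.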