File Source
The Vector file source collects logs from files.
Configuration
[sources.my_source_id]
type = "file" # required
ignore_older = 600 # optional, no default, seconds
include = ["/var/log/**/*.log"] # required
read_from = "beginning" # optional, default
- data_dir (optional, string)
  The directory used to persist file checkpoint positions. By default, the global data_dir option is used. Make sure the Vector process has write permissions to this directory. See Checkpointing for more info. This field accepts a valid file system path.
  - Syntax: file_system_path
- encoding (optional, table)
  Configures the encoding-specific source behavior.
  - charset (optional, string)
    Encoding of the source messages. Takes one of the encoding label strings defined as part of the Encoding Standard. When set, the messages are transcoded from the specified encoding to UTF-8, which is the encoding Vector assumes internally for string-like data. Enable this transcoding operation if you need your data to be in UTF-8 for further processing. During transcoding, any malformed sequences (those that can't be mapped to UTF-8) are replaced with the replacement character and warnings are logged.
    - Syntax: literal
- exclude (optional, [string])
  Array of file patterns to exclude. Globbing is supported. Takes precedence over the include option.
- file_key (optional, string)
  The key name added to each event with the full path of the file.
  - Syntax: literal
  - Default: "file"
- fingerprint (optional, table)
  Configuration for how the file source should identify files.
  - ignored_header_bytes (optional, uint)
    The number of bytes to skip ahead (or ignore) when generating a unique fingerprint. This is helpful if all files share a common header. See fingerprint for more info.
    - Only relevant when: strategy = "checksum"
    - Default: 0 (bytes)
  - strategy (enum, optional, string)
    The strategy used to uniquely identify files. This is important for checkpointing when file rotation is used.
    - Syntax: literal
    - Default: "checksum"
    - Enum, must be one of: "checksum", "device_and_inode"
- glob_minimum_cooldown (optional, uint)
  Delay between file discovery calls. This controls the interval at which Vector searches for files. See Autodiscovery and Globbing for more info.
  - Default: 1000 (milliseconds)
- host_key (optional, string)
  The key name added to each event representing the current host. This can also be set globally via the global host_key option.
  - Syntax: literal
  - Default: "host"
- ignore_checkpoints (optional, bool)
  This causes Vector to ignore existing checkpoints when determining where to start reading a file. Checkpoints are still written normally. See Read Position for more info.
  - Default: false
- ignore_not_found (optional, bool)
  Ignore missing files when fingerprinting. This may be useful when used with source directories containing dangling symlinks.
  - Default: false
- ignore_older (common, optional, uint)
  Ignore files whose data modification date is older than this many seconds.
- include (common, required, [string])
  Array of file patterns to include. Globbing is supported. See File Read Order and File Rotation for more info.
- line_delimiter (optional, string)
  String sequence used to separate one file line from another. See Line Delimiters for more info.
  - Syntax: literal
  - Default: "\n"
- max_line_bytes (optional, uint)
  The maximum number of bytes a line can contain before being discarded. This protects against malformed lines or tailing incorrect files.
  - Default: 102400 (bytes)
- max_read_bytes (optional, uint)
  An approximate limit on the amount of data read from a single file at a given time.
- multiline (optional, table)
  Multiline parsing configuration. If not specified, multiline parsing is disabled. See Multiline Messages for more info.
  - condition_pattern (common, required, string)
    Condition regex pattern to look for. Exact behavior is configured via mode. See Multiline Messages for more info. This field accepts a valid Rust regular expression. Wrapping / characters are not required or permitted.
    - Syntax: regex
  - mode (enum, common, required, string)
    Mode of operation; specifies how the condition_pattern is interpreted. See Multiline Messages for more info.
    - Syntax: literal
    - Enum, must be one of: "continue_through", "continue_past", "halt_before", "halt_with"
  - start_pattern (common, required, string)
    Start regex pattern to look for as the beginning of the message. See Multiline Messages for more info. This field accepts a valid Rust regular expression. Wrapping / characters are not required or permitted.
    - Syntax: regex
  - timeout_ms (common, required, uint)
    The maximum time to wait for the continuation, in milliseconds. Once this timeout is reached, the buffered message is guaranteed to be flushed, even if incomplete.
- oldest_first (optional, bool)
  Instead of balancing read capacity fairly across all watched files, prioritize draining the oldest files before moving on to read data from younger files. See File Read Order for more info.
  - Default: false
- read_from (enum, common, optional, string)
  In the absence of a checkpoint, this setting tells Vector where to start reading files that are present at startup. See Read Position for more info.
  - Syntax: literal
  - Default: "beginning"
  - Enum, must be one of: "beginning", "end"
- remove_after (optional, uint)
  Timeout, measured from reaching EOF, after which the file is removed from the filesystem unless new data is written in the meantime. If not specified, files are not removed.
  - WARNING: The Vector process must have permission to delete files.
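As a rough illustration of how the flat and nested options above fit together in vector.toml, the sketch below combines a few of them. The source ID, the path, and the chosen values (including the "utf-16le" encoding label) are placeholders picked for the example, not recommendations.

```toml
# Illustrative sketch: the source ID, path, and values are placeholders.
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
ignore_older = 86400     # skip files not modified within the last day (seconds)
max_line_bytes = 102400  # the documented default, written out explicitly

# Table-typed options are written as nested TOML tables.
[sources.my_source_id.encoding]
charset = "utf-16le"     # assumed input encoding; messages are transcoded to UTF-8
```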
Output
This component outputs log events with the following fields:
{
  "file": "/var/log/apache/access.log",
  "host": "my-host.local",
  "message": "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
  "timestamp": "2020-10-10T17:07:36+00:00"
}
- file (common, required, string)
  The absolute path of the originating file. See Context for more info.
  - Syntax: literal
- host (common, required, string)
  The local hostname, equivalent to the gethostname command.
  - Syntax: literal
- message (common, required, string)
  The raw line from the file.
  - Syntax: literal
- timestamp (common, required, timestamp)
  The exact time the event was ingested into Vector.
Telemetry
This component provides the following metrics that can be retrieved through the internal_metrics source. See the metrics section in the monitoring page for more info.
- checkpoint_write_errors_total (counter)
  The total number of errors writing checkpoints. This metric includes the following tags:
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- checkpoints_total (counter)
  The total number of files checkpointed. This metric includes the following tags:
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- checksum_errors_total (counter)
  The total number of errors identifying files via checksum. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- file_delete_errors_total (counter)
  The total number of failures to delete a file. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- file_watch_errors_total (counter)
  The total number of errors encountered when watching files. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- files_added_total (counter)
  The total number of files Vector has found to watch. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- files_deleted_total (counter)
  The total number of files deleted. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- files_resumed_total (counter)
  The total number of times Vector has resumed watching a file. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- files_unwatched_total (counter)
  The total number of times Vector has stopped watching a file. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- fingerprint_read_errors_total (counter)
  The total number of times Vector failed to read a file for fingerprinting. This metric includes the following tags:
  - file: The file that produced the error.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- events_out_total (counter)
  The total number of events emitted by this component. This metric includes the following tags:
  - component_kind: The Vector component kind.
  - component_name: The Vector component ID.
  - component_type: The Vector component type.
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
- glob_errors_total (counter)
  The total number of errors encountered when globbing paths. This metric includes the following tags:
  - instance: The Vector instance identified by host and port.
  - job: The name of the job producing Vector metrics.
  - path: The path that produced the error.
Examples
Given the following input:
53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308
And the following configuration:
[sources.file]
type = "file"
include = ["/var/log/**/*.log"]
The following Vector log event will be output:
{
  "file": "/var/log/apache/access.log",
  "host": "my-host.local",
  "message": "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
  "timestamp": "2020-10-10T17:07:36.452332Z"
}
How It Works
Autodiscovery
Vector will continually look for new files matching any of your include patterns. The frequency is controlled via the glob_minimum_cooldown option. If a new file is added that matches any of the supplied patterns, Vector will begin tailing it. Vector maintains a unique list of files and will not tail a file more than once, even if it matches multiple patterns. You can read more about how Vector identifies files in the fingerprint section.
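For instance, a configuration along the following lines would slow discovery down to one scan every five seconds. This is only a sketch: the source ID, the path, and the 5000 ms value are arbitrary choices made for the example.

```toml
# Illustrative sketch: source ID, path, and interval are placeholders.
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
glob_minimum_cooldown = 5000  # search for new files every 5000 ms (default: 1000)
```

A longer interval reduces how often the include patterns are evaluated, at the cost of new files being picked up later.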
Checkpointing
Vector checkpoints the current read position after each successful read. This ensures that Vector resumes where it left off if restarted, preventing data from being read twice. The checkpoint positions are stored in the data directory, which is specified via the global data_dir option but can be overridden via the data_dir option in the file source directly.
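A minimal sketch of the override described above might look like this. Both directory paths and the source ID are assumptions made for illustration; whichever directory is used must be writable by the Vector process.

```toml
# Illustrative sketch: both directories and the source ID are placeholders.

# Global data directory, used by default for checkpoints.
data_dir = "/var/lib/vector"

[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
# Per-source override: checkpoints for this source are stored here instead.
data_dir = "/var/lib/vector/my_source_id"
```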
Compressed Files
Vector will transparently detect files which have been compressed using Gzip and decompress them for reading. This detection process looks for the unique sequence of bytes in the Gzip header and does not rely on the compressed files adhering to any kind of naming convention.
One caveat with reading compressed files is that Vector is not able to efficiently seek into them. Rather than implement a potentially expensive full scan as a seek mechanism, Vector currently will not attempt to make further reads from a file for which it has already stored a checkpoint in a previous run. For this reason, users should take care to allow Vector to fully process any compressed files before shutting the process down or moving the files to another location on disk.
Context
By default, the file source will augment events with helpful context keys as shown in the "Output" section.
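If the default key names clash with fields used elsewhere in a pipeline, they can be renamed through the file_key and host_key options. The sketch below assumes the replacement names "source_file" and "source_host", which are made up for this example.

```toml
# Illustrative sketch: the source ID, path, and key names are placeholders.
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
file_key = "source_file"  # default: "file"
host_key = "source_host"  # default: "host"
```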
File Deletion
When a watched file is deleted, Vector will maintain its open file handle and continue reading until it reaches EOF. When a file can no longer be found via the include option and the reader has reached EOF, that file's reader is discarded.
File Read Order
By default, Vector attempts to allocate its read bandwidth fairly across all of the files it's currently watching. This prevents a single very busy file from starving other independent files from being read. In certain situations, however, this can lead to interleaved reads from files that should be read one after the other.
For example, consider a service that logs to timestamped files, creating a new one at an interval and leaving the old one as-is. Under normal operation, Vector would follow writes as they happen to each file and there would be no interleaving. In an overload situation, however, Vector may pick up and begin tailing newer files before catching up to the latest writes from older files. This would cause writes from a single logical log stream to be interleaved in time and potentially slow down ingestion as a whole, since the fixed total read bandwidth is allocated across an increasing number of files.
To address this type of situation, Vector provides the oldest_first option. When set, Vector will not read from any file younger than the oldest file that it hasn't yet caught up to. In other words, Vector will continue reading from older files as long as there is more data to read. Only once it hits the end will it then move on to read from younger files.
Whether or not to use the oldest_first flag depends on the organization of the logs you're configuring Vector to tail. If your include option contains multiple independent logical log streams (e.g. Nginx's access.log and error.log, or logs from multiple services), you are likely better off with the default behavior. If you're dealing with a single logical log stream or if you value per-stream ordering over fairness across streams, consider setting the oldest_first option to true.
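For the single-stream case described above, a configuration could be sketched as follows. The source ID and the directory of timestamped files are hypothetical.

```toml
# Illustrative sketch: source ID and path are placeholders for a service
# that writes one logical log stream to a series of timestamped files.
[sources.my_service_logs]
type = "file"
include = ["/var/log/my-service/*.log"]
oldest_first = true  # drain older files before reading younger ones (default: false)
```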
File Rotation
Vector supports tailing across a number of file rotation strategies. The default behavior of logrotate is simply to move the old log file and create a new one. This requires no special configuration of Vector, as it will maintain its open file handle to the rotated log until it has finished reading and it will find the newly created file normally.
A popular alternative strategy is copytruncate, in which logrotate will copy the old log file to a new location before truncating the original. Vector will also handle this well out of the box, but there are a couple of configuration options that will help reduce the very small chance of missed data in some edge cases. We recommend a combination of delaycompress (if applicable) on the logrotate side and including the first rotated file in Vector's include option. This allows Vector to find the file after rotation, read it uncompressed to identify it, and then ensure it has all of the data, including any written in a gap between Vector's last read and the actual rotation event.
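On the Vector side, the copytruncate recommendation could be sketched as below. The file names are assumptions: they presume logrotate names the first rotated copy by appending ".1", which depends on how logrotate is configured.

```toml
# Illustrative sketch: source ID and file names are placeholders. The ".1"
# suffix assumes logrotate's numbering for the first rotated copy.
[sources.my_app_logs]
type = "file"
include = [
  "/var/log/my-app.log",    # the live log file
  "/var/log/my-app.log.1",  # the first rotated copy, left uncompressed by delaycompress
]
```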
Globbing
Globbing is supported in all provided file paths. Files will be autodiscovered continually at a rate defined by the glob_minimum_cooldown option.
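As a small sketch of glob patterns in practice, the configuration below watches every .log file under a directory tree while skipping a subset of them. The source ID and both patterns are invented for the example; as noted in the option reference, exclude takes precedence over include.

```toml
# Illustrative sketch: source ID and patterns are placeholders.
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]         # all .log files, at any depth
exclude = ["/var/log/**/debug-*.log"]   # except these; exclude wins over include
```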
Line Delimiters
Each line is read until a new line delimiter (by default, "\n", i.e. the 0xA byte) or EOF is found. If needed, the default line delimiter can be overridden via the line_delimiter option.
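For example, files written with Windows-style line endings could be handled with a sketch like the one below. The source ID and path are placeholders, and the assumption is that every line in the watched files ends in CRLF.

```toml
# Illustrative sketch: source ID and path are placeholders.
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
line_delimiter = "\r\n"  # split on CRLF instead of the default "\n"
```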
Multiline Messages
Sometimes a single log event will appear as multiple log lines. To handle this, Vector provides a set of multiline options. These options are designed to cover both simple and complex cases. Let's look at a few examples:
Example 1: Ruby Exceptions
Ruby exceptions, when logged, consist of multiple lines:
foobar.rb:6:in `/': divided by 0 (ZeroDivisionError)
        from foobar.rb:6:in `bar'
        from foobar.rb:2:in `foo'
        from foobar.rb:9:in `<main>'
To consume these lines as a single event, use the following Vector configuration:
[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '^[^\s]'
mode = "continue_through"
condition_pattern = '^[\s]+from'
timeout_ms = 1000
- start_pattern, set to ^[^\s], tells Vector that new multi-line events should not start with white-space.
- mode, set to continue_through, tells Vector to continue aggregating lines until the condition_pattern no longer matches (excluding the non-matching line).
- condition_pattern, set to ^[\s]+from, tells Vector to continue aggregating lines if they start with white-space followed by from.
Example 2: Line Continuations
Some programming languages use the backslash (\) character to signal that a line will continue on the next line:
First line\
second line\
third line
To consume these lines as a single event, use the following Vector configuration:
[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '\\$'
mode = "continue_past"
condition_pattern = '\\$'
timeout_ms = 1000
- start_pattern, set to \\$, tells Vector that new multi-line events start with lines that end in \.
- mode, set to continue_past, tells Vector to continue aggregating lines, plus one additional line, until condition_pattern no longer matches.
- condition_pattern, set to \\$, tells Vector to continue aggregating lines if they end with a \ character.
Example 3: Timestamps
Activity logs from services such as Elasticsearch typically begin with a timestamp, followed by information on the specific activity, as in this example:
[2015-08-24 11:49:14,389][ INFO ][env ] [Letha] using [1] data paths, mounts [[/
(/dev/disk1)]], net usable_space [34.5gb], net total_space [118.9gb], types [hfs]
To consume these lines as a single event, use the following Vector configuration:
[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
mode = "halt_before"
condition_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
timeout_ms = 1000
- start_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector that new multi-line events start with a timestamp sequence.
- mode, set to halt_before, tells Vector to continue aggregating lines as long as the condition_pattern does not match.
- condition_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector to continue aggregating up until a line starts with a timestamp sequence.
Read Position
By default, Vector will read from the beginning of newly discovered files. You can change this behavior by setting the read_from option to "end".
Previously discovered files will be checkpointed, and the read position will resume from the last checkpoint. To disable this behavior, you can set the ignore_checkpoints option to true. This will cause Vector to disregard existing checkpoints when determining the starting read position of a file.
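Combining the two options, a source that only picks up data written after startup and disregards earlier checkpoints could be sketched like this. The source ID and path are placeholders; whether skipping existing data is acceptable depends on the use case.

```toml
# Illustrative sketch: source ID and path are placeholders.
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
read_from = "end"          # without a checkpoint, start at the end of files present at startup
ignore_checkpoints = true  # disregard existing checkpoints (they are still written)
```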
State
This component is stateless, meaning its behavior is consistent across each input.
fingerprint
By default, Vector identifies files by creating a cyclic redundancy check (CRC) on the first 256 bytes of the file. This serves as a fingerprint to uniquely identify the file. The number of bytes read can be controlled via the fingerprint_bytes and ignored_header_bytes options.
This strategy avoids the common pitfalls of identifying files by device and inode, since inodes can be reused across files. This enables Vector to properly tail files across various rotation strategies.
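When every watched file begins with the same header, the checksum can be told to skip it. The sketch below assumes a 128-byte shared header; that figure, the source ID, and the path are made up for illustration.

```toml
# Illustrative sketch: source ID, path, and header size are placeholders.
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]

[sources.my_source_id.fingerprint]
strategy = "checksum"       # the default; the alternative is "device_and_inode"
ignored_header_bytes = 128  # skip an assumed 128-byte header shared by all files
```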