Collectord

Setting up comprehensive centralized logging with AWS Services for Kubernetes

March 13, 2019

We are happy to announce our new integration with Amazon Web Services, which allows you to implement a comprehensive centralized logging solution with AWS services.

Why you might be interested in reading this blog post:

  1. You use the kubectl logs ... command to access the logs, and you have already found out that it doesn't scale. You are looking for a comprehensive log management solution.
  2. You like your existing log management solution, but you are not satisfied with the available retention policies. Most service providers offer only 30 days of retention, and on-premises solutions require additional hardware resources to support a longer retention policy for logs.
  3. You are not satisfied with the price of your current log management solution. Below you can find a calculator to estimate the cost with AWS Services.
  4. You want to have a backup log management solution.

In this blog post we will use an EKS cluster on AWS as an example and guide you through the setup of a centralized logging solution with AWS services using Collectord, a container-native log-forwarding software built by Outcold Solutions. We will use AWS Athena with S3, Glue and QuickSight as the log management and analysis tools with a long retention period, and AWS CloudWatch Logs as a log aggregation tool for building real-time alerts.

Let's get started.

Architecture overview

We will forward all the logs and events to S3 by default, so that we can build analytics and discover any log event from any pod over the long term. The logs are forwarded to S3 in a compressed format, in chunks of 10 minutes (or a maximum of 100 MB), partitioned by namespace, workload, container and date.

For AWS CloudWatch Logs we disable forwarding of logs by default (opt-in behavior) and choose to forward only logs for crucial services. We will use CloudWatch Logs as a real-time log management system that allows us to set up alerts. We are also going to sample the logs and forward only a small portion of them to AWS CloudWatch.

collectord

Prerequisites

We already have access to an EKS cluster running on AWS in region us-east-1. You can use this guide for any Kubernetes cluster running in the AWS cloud, other cloud providers or on premises, whether self-provisioned or managed.
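If you use EKS and have not yet configured kubectl access, a minimal sketch with the AWS CLI (the cluster name devel-cluster is hypothetical; use your own):

aws eks update-kubeconfig --name devel-cluster --region us-east-1
kubectl get nodes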

Configuring AWS

We are going to work in the AWS region us-east-1.

First, we need to choose the bucket where we want to store the logs. We will create a new bucket collectord.example.logs.

aws s3api create-bucket --bucket collectord.example.logs --region us-east-1

We will also enable default encryption to keep the logs encrypted at rest.

aws s3api put-bucket-encryption --bucket collectord.example.logs --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm": "AES256"}}]}'
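To double-check that default encryption is now in place, you can query the bucket's encryption configuration:

aws s3api get-bucket-encryption --bucket collectord.example.logs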

After that we will create a user account with programmatic access that will be able to PUT objects to S3, create databases, tables and partitions in the Glue Catalog, and put logs into CloudWatch Logs. We will use this user later for the Collectord deployments. An example of the policy is below (make sure to change the bucket name in the s3:PutObject Resource ARN).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::collectord.example.logs/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable",
                "glue:CreateDatabase",
                "glue:CreatePartition"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:PutRetentionPolicy"
            ],
            "Resource": [
                "arn:aws:logs:*:*:log-group:/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "*"
        }
    ]
}

You can create the policy and the user with the awscli tool. First, create the policy.

aws iam create-policy --policy-name collectord --policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::collectord.example.logs/*"}, {"Effect": "Allow", "Action": ["glue:CreateTable", "glue:CreateDatabase", "glue:CreatePartition"], "Resource": "*"}, {"Effect": "Allow", "Action": ["logs:CreateLogStream", "logs:PutLogEvents", "logs:PutRetentionPolicy"], "Resource": ["arn:aws:logs:*:*:log-group:/*"] }, {"Effect": "Allow", "Action": "logs:CreateLogGroup", "Resource": "*"} ] }'

Create the user

aws iam create-user --user-name collectord

Attach the created policy to the user (make sure to change the policy ARN to the one returned by the create-policy command above)

aws iam attach-user-policy --user-name collectord \
  --policy-arn arn:aws:iam::999999999999:policy/collectord

Create access key and secret key with

aws iam create-access-key --user-name collectord

Record the AccessKeyId and SecretAccessKey.

Install Collectord

Create a file that will be used as a secret with the AWS credentials.

Save the file 100-general.conf with the following content, where <AccessKeyId> and <SecretAccessKey> are taken from the output of the aws iam create-access-key --user-name collectord command.

[aws]
accessKeyID = <AccessKeyId>
secretAccessKey = <SecretAccessKey>

We will use this file later to create secrets.
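If you prefer to script this step, here is a minimal sketch that captures the keys from the create-access-key output and writes the file (assumes jq is installed; the intermediate file name collectord-key.json is hypothetical):

# Create the access key and keep the JSON output
aws iam create-access-key --user-name collectord > collectord-key.json
# Generate 100-general.conf from the captured keys
cat > 100-general.conf <<EOF
[aws]
accessKeyID = $(jq -r '.AccessKey.AccessKeyId' collectord-key.json)
secretAccessKey = $(jq -r '.AccessKey.SecretAccessKey' collectord-key.json)
EOF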

Collectord-s3

To start, we will install the Collectord deployment that forwards data to S3 and creates the Glue database and tables by following the installation instructions (there is no need for the prerequisite steps, as we have already created everything we need, and the EKS cluster in our case has the Docker logging driver set to json-file with max-file and max-size configured).

Just save collectord-s3.yaml from the installation instructions and modify the lines under ConfigMap. You need to review and accept the license agreement and specify the license key (you can request a trial license key with our automated form or receive the license after subscribing to Collectord from the AWS Marketplace). We will also specify the cluster name devel, the AWS region us-east-1 and the target bucket collectord.example.logs.

[general]
# Review SLA at https://www.outcoldsolutions.com/docs/license-agreement/ and accept the license
acceptLicense = true
# Request the trial license with automated form https://www.outcoldsolutions.com/trial/request/
license = Qkc1MTgzUTQ0SUUyTTowOjoz....
# If you are planning to setup log aggregation for multiple cluster, name the cluster
fields.cluster = devel

[aws]
# Specify AWS Region
region = us-east-1

[output.s3]
# Specify Bucket Name
bucket = collectord.example.logs

Apply the file to Kubernetes

kubectl apply -f collectord-s3.yaml

Create the secret with the AWS credentials from the file 100-general.conf

kubectl create secret generic collectord-s3 --from-file=./100-general.conf --namespace collectord-s3

Verify that the pods are running

kubectl get pods -n collectord-s3

If you see an issue, follow our troubleshooting steps.

Collectord-cloudwatch

Now we will deploy Collectord with the CloudWatch output. Similarly to the S3 output, we will follow the installation instructions from our website. Again, there is no need to run any prerequisite steps, as we have everything ready.

Save collectord-cloudwatch.yaml from the installation instructions. Similarly to the collectord-s3 deployment, we need to review and accept the license agreement and specify the license key (use the same license key as for collectord-s3), the cluster name devel and the AWS region us-east-1. After that, slightly modify the configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: collectord-cloudwatch
  namespace: collectord-cloudwatch
  labels:
    app: collectord-cloudwatch
data:
  101-general.conf: |
    [general]
    # Review SLA at https://www.outcoldsolutions.com/docs/license-agreement/ and accept the license
    acceptLicense = true
    # Request the trial license with automated form https://www.outcoldsolutions.com/trial/request/
    license = Qkc1MTgzUTQ0SUUyTTowOjoz.asO2Um5SbttfK8hBFrSUjR/ErvojY8NtuIFKfw.PzhWbxRG6icd/jdwn1Y++ZBlt3S1Qyidp9ZH0A
    # If you are planning to setup log aggregation for multiple cluster, name the cluster
    fields.cluster = devel

    [aws]
    # Specify AWS Region
    region = us-east-1

    [output.cloudwatch.logs]
    retentionInDays = 7

  102-daemonset.conf: |
    [input.files::logs]
    disabled = true
    [input.files::syslog]
    disabled = true
    [input.journald]
    disabled = true

    [input.files]
    output = devnull
    samplingPercent = 5

Apply the configuration

kubectl apply -f collectord-cloudwatch.yaml

Again, create a secret, this time for collectord-cloudwatch, from the same file

kubectl create secret generic collectord-cloudwatch --from-file=./100-general.conf --namespace collectord-cloudwatch

And verify that pods are running

kubectl get pods -n collectord-cloudwatch

If you see an issue, follow our troubleshooting steps

Create a "production-like" workload

We will use an nginx deployment from our documentation on how to work with QuickSight, with a small change to make it work with CloudWatch (reminder: we disabled collection of container logs by default).

We just need to add one annotation to tell collectord-cloudwatch to start forwarding logs from this deployment, and only from stdout, as stderr contains error messages about not being able to read a file from the file system, which we can avoid in this example (reminder: by default we sample only 5% of the logs). You can read more about annotations for CloudWatch to learn what you can control with them (hiding sensitive information, discovering application logs, escaping terminal colors, specifying multi-line log patterns and more). Collectord with S3 supports all the same annotations.

In the following deployment we applied the annotation cloudwatch.collectord.io/stdout-logs-output: 'cloudwatch', which tells the collectord-cloudwatch deployment to forward logs from stdout of all containers created by the nginx-with-clients workload.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-with-clients
  labels:
    app: nginx-with-clients
  annotations:
    cloudwatch.collectord.io/stdout-logs-output: 'cloudwatch'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-with-clients
  template:
    metadata:
      labels:
        app: nginx-with-clients
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: nginx-with-clients
spec:
  selector:
    app: nginx-with-clients
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-client-get-200
  annotations:
    s3.collectord.io/logs-output: devnull
  labels:
    app: nginx-client-get-200
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx-client-get-200
  template:
    metadata:
      labels:
        app: nginx-client-get-200
    spec:
      containers:
      - name: busybox
        image: busybox
        args: [/bin/sh, -c,
               'while true; do wget -qO- http://nginx-with-clients.default.svc:80; sleep 5; done']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-client-post
  annotations:
    s3.collectord.io/logs-output: devnull
  labels:
    app: nginx-client-post
spec:
  replicas: 8
  selector:
    matchLabels:
      app: nginx-client-post
  template:
    metadata:
      labels:
        app: nginx-client-post
    spec:
      containers:
      - name: busybox
        image: busybox
        args: [/bin/sh, -c,
               'while true; do wget -qO- --post-data=foo=x http://nginx-with-clients.default.svc:80; sleep 8; done']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-client-get-404
  annotations:
    s3.collectord.io/logs-output: devnull
  labels:
    app: nginx-client-get-404
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx-client-get-404
  template:
    metadata:
      labels:
        app: nginx-client-get-404
    spec:
      containers:
      - name: busybox
        image: busybox
        args: [/bin/sh, -c,
               'while true; do wget -qO- http://nginx-with-clients.default.svc:80/404; sleep 10; done']

This manifest includes an nginx server container and client deployments that generate activity against the server.

Save this file as nginx-example.yaml and apply it

kubectl apply -f nginx-example.yaml

Verify that pods are running, and give it a few moments.
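For example, a quick check (assuming the deployments were created in the default namespace of your current context; the label selector matches the manifest above):

kubectl get deployments
kubectl get pods -l app=nginx-with-clients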

Review the logs and events with CloudWatch Logs

You can now navigate to CloudWatch Logs and find events and container logs from the pods for which we enabled the cloudwatch output, with a default retention period of 7 days.

CloudWatch

Create metrics from logs

We can create metric filters from the logs to count how many 200 and non-200 responses the nginx server returns (considering that we are sampling only 5% of the logs).

The filter pattern for status codes 2xx should be

[ip, user, username, timestamp, request, status_code = 2*, bytes_sent, http_referer, http_user_agent, gzip_ratio]

The metric filter pattern for non-2xx status codes should be

[ip, user, username, timestamp, request, status_code != 2*, bytes_sent, http_referer, http_user_agent, gzip_ratio]
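If you prefer the CLI to the console, here is a minimal sketch of creating both filters (the container log group name is a placeholder; we use the metric names Status200 and StatusNot200 in the LogMetrics namespace, matching the next step):

aws logs put-metric-filter \
  --log-group-name "<nginx-container-log-group>" \
  --filter-name Status200 \
  --filter-pattern '[ip, user, username, timestamp, request, status_code = 2*, bytes_sent, http_referer, http_user_agent, gzip_ratio]' \
  --metric-transformations metricName=Status200,metricNamespace=LogMetrics,metricValue=1

aws logs put-metric-filter \
  --log-group-name "<nginx-container-log-group>" \
  --filter-name StatusNot200 \
  --filter-pattern '[ip, user, username, timestamp, request, status_code != 2*, bytes_sent, http_referer, http_user_agent, gzip_ratio]' \
  --metric-transformations metricName=StatusNot200,metricNamespace=LogMetrics,metricValue=1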

CloudWatch Metric

We saved both metrics in the metric namespace LogMetrics as Status200 and StatusNot200. Now we can build a dashboard comparing status codes 200 vs not 200.

CloudWatch Dashboard

The JSON code of this dashboard is below.

{
    "widgets": [
        {
            "type": "metric",
            "x": 0,
            "y": 0,
            "width": 24,
            "height": 6,
            "properties": {
                "metrics": [
                    [ "LogMetrics", "Status200", { "period": 300, "stat": "Sum" } ],
                    [ ".", "StatusNot200", { "period": 300, "stat": "Sum" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "title": "Status = 200 vs Status != 200 (sampled 5%)",
                "period": 300
            }
        }
    ]
}
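You can also create this dashboard from the CLI by saving the JSON above to a file; a sketch, assuming the file is named dashboard.json and the dashboard is called nginx-status:

aws cloudwatch put-dashboard \
  --dashboard-name nginx-status \
  --dashboard-body file://dashboard.json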

Just as an example, we can also create an alarm that will trigger if the number of non-2xx requests crosses a chosen threshold.

CloudWatch Alarm

As for the Kubernetes events, they are forwarded in JSON format, so you can use CloudWatch Logs Insights to extract the fields.

CloudWatch Log Insights

Similarly to how we created an alert for the logs, you can create one for events in a specific namespace.

For example, for the log group /kubernetes/devel/events/default/ we can create a metric from Warning events in the namespace that have been raised 3 or more times.

{ ($.object.type = "Warning") && ($.object.count >= 3) }
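From the CLI this could look like the following sketch (the filter and metric names are hypothetical):

aws logs put-metric-filter \
  --log-group-name /kubernetes/devel/events/default/ \
  --filter-name RepeatedWarningEvents \
  --filter-pattern '{ ($.object.type = "Warning") && ($.object.count >= 3) }' \
  --metric-transformations metricName=RepeatedWarningEvents,metricNamespace=LogMetrics,metricValue=1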

CloudWatch Alarm for Events

Based on this metric we can generate an alert later.
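A sketch of such an alarm, assuming the hypothetical metric above and an existing SNS topic named alerts for notifications:

aws cloudwatch put-metric-alarm \
  --alarm-name devel-default-warning-events \
  --namespace LogMetrics \
  --metric-name RepeatedWarningEvents \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:999999999999:alerts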

Analyze logs with Athena and QuickSight

Querying data with Athena

If you switch to Athena, you will find that the list of databases already contains a database kubernetes with three tables: container_logs, host_logs and events. You can read more about best practices and query examples in the documentation Querying data with Athena, including how to enable default encryption for the results bucket and how to scan less data and pay less with Athena.

With Athena you can query the logs and, using the rich Presto SQL syntax, extract fields and perform various transformations on the logs. For example, we can extract the fields from the nginx logs using the regexp_extract function.

With the following SQL we will query all logs from the nginx-with-clients workload (which is a Deployment in our case) for the last 7 days, and only from stdout.

Athena
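A minimal sketch of running such a query from the CLI (the full field-extraction version of the SQL is shown in the QuickSight section below; the results location s3://collectord.example.logs/athena-results/ is an assumption, use your own Athena results bucket):

aws athena start-query-execution \
  --query-execution-context Database=kubernetes \
  --result-configuration OutputLocation=s3://collectord.example.logs/athena-results/ \
  --query-string "select timestamp, message from container_logs where workload = 'nginx-with-clients' and stream = 'stdout' and dt >= date_format(date_add('day', -7, now()), '%Y%m%d') limit 100"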

Similarly, we can also query events by extracting fields from the JSON objects.

Athena Events

Analyzing data with QuickSight

Looking just at SQL results is not fun. So we will use QuickSight to build dashboards.

If you have not used QuickSight before, it will ask you to set up an account with AWS. Just make sure to give it access to Athena and the S3 buckets (both the data and the results).

QuickSight Setup

After that, click Manage Data and choose New Data Set. Choose Athena and give it a unique name, for example athena-nginx-container-logs. On the next step choose Use custom SQL and input the SQL statement

select 
  to_unixtime(from_iso8601_timestamp(timestamp)) as timestamp, 
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 1) as nginx_remote_addr,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 2) as nginx_remote_user,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 4) as nginx_request_method,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 5) as nginx_request_path,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 6) as nginx_request_http_version,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 7) as nginx_response_status,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 8) as nginx_bytes_sent,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 9) as nginx_referer,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 10) as nginx_user_agent
from kubernetes.container_logs 
where workload = 'nginx-with-clients' and stream='stdout' and dt>=date_format(date_add('day', -7, now()), '%Y%m%d');

Click Edit/Preview Data. At this step you can run the query and adjust some of the types: timestamp should be a Date and nginx_bytes_sent can be a Decimal.

QuickSight Preview

You can choose to keep the data in SPICE or always execute the query. You can read more about SPICE at Importing Data into SPICE. We will choose SPICE in our example, and after that schedule a SPICE refresh every day at 00:01 UTC, so we always have a comprehensive report for the last 7 days.

QuickSight SPICE refresh

With this data we can build a dashboard.

QuickSight Dashboard

And with the power of QuickSight we can schedule a daily email delivery of this report.

QuickSight Report Delivery

An example of this email

QuickSight Dashboard Email

Cost estimate

Now the interesting part. You can play with the calculator below to estimate the cost of managing a centralized logging solution with AWS Services and Collectord. There are a lot of moving parts.

  1. The most important one is the daily volume of logs produced by your workloads. The good part is that Collectord compresses the logs before forwarding them to S3. Logs compress very well, and in most cases we see 50x compression; in this example we assume only 10x compression (compression ratio = 0.10). A back-of-the-envelope storage calculation is sketched after this list.
  2. Number of Kubernetes worker nodes. We license our Collectord by the number of nodes. Changing the number of nodes changes Collectord license price.
  3. With S3 we forward logs every 10 minutes by default. That helps us reduce the number of files, which means Collectord needs to make fewer PUT requests and Athena needs to make fewer GET requests to S3. Forwarding data more often generates more files, but at the same time provides access to the logs with less delay.
  4. Retention on S3 defines how much storage is going to be used in the long-term.
  5. The number of unique containers daily defines how many files will be stored in S3, which in turn defines the number of requests (PUT for Collectord, GET and LIST for Athena) and also the number of objects and requests in the Glue Catalog. If you have around 100 unique containers running on Kubernetes every day, you will get 100 new partitions in Glue daily.
  6. The number of searches performed with Athena defines how much data will be scanned, and also the number of GET and LIST requests to S3.
  7. If you decide to use QuickSight for reporting and dashboards, there are two different roles in QuickSight Enterprise: the author, who creates the dashboards, and the reader. An organization usually has a limited number of authors with write access to dashboards and a much higher number of readers.
  8. In the case of CloudWatch Logs, you can decide to sample the logs (we used 5% sampling in this example), and you can define the retention period for the logs stored in CloudWatch. Considering that you have long-term storage in S3, you don't need a very long retention here.
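As promised above, a back-of-the-envelope sketch of the S3 storage part, assuming the hypothetical numbers of 100 GB of raw logs per day, 10x compression and 12 months of retention:

# 100 GB/day * 0.10 compression ratio * 365 days of retention, converted to TB
echo "scale=2; 100 * 0.10 * 365 / 1024" | bc   # ~3.56 TB stored in S3 at steady state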

A lot can be done with the configuration to reduce the number of files and objects in the Glue Catalog and to reduce the amount of data scanned by Athena. If you have any questions, please ask. There are also limits for the Glue Catalog that need to be taken into account, the most important being the number of partitions per table (1,000,000) and the number of partitions per account (10,000,000). You can request a limit increase per region.

Estimate monthly costs calculator

Estimates are based on data provided by the AWS pricing guide for each service.
Monthly charges will be based on your actual usage of AWS services and may vary from the estimates the calculator provides.

(Interactive calculator. Inputs: daily log volume in GB, number of Docker hosts and Kubernetes nodes, Collectord license, compression ratio, frequency of ingesting logs in minutes, S3 retention in months, number of unique containers per day, number of searches per day looking 1 day and 30 days back, number of QuickSight authors and readers, and CloudWatch Logs sampling percentage and retention in days. Outputs: estimated monthly costs for S3 standard storage and requests, Glue storage objects and requests, Athena data scanned, QuickSight, CloudWatch Logs data ingestion and storage, and the total.)

Do you have any questions? Feel free to leave a comment or send us an email with any feedback or questions at contact@outcoldsolutions.com.

collectord, kubernetes, eks, aws, s3, glue, athena, quicksight, cloudwatch logs

About Outcold Solutions

Outcold Solutions provides solutions for building centralized logging infrastructure and monitoring Kubernetes, OpenShift and Docker clusters. We provide an easy-to-set-up centralized logging infrastructure with AWS services. We offer Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers.