Collectord

Geo-IP lookups with Athena

March 18, 2019

Last week we shared with you how to set up log aggregation with AWS Services for your Kubernetes clusters, where one of the destinations is S3 with a Glue Catalog, which allows you to run analytic queries with Athena and build reports with QuickSight. One of the advantages of running SQL queries, and especially of using relational schemas, is that you can join data from the Kubernetes logs with reference tables that you define. A good example is a Geo IP table and the ability to perform Geo IP lookups against it.

Geo IP table

Load GeoLite2 tables

MaxMind provides a GeoIP database available under the Creative Commons Attribution-ShareAlike 4.0 International License. You can download it from their website, GeoLite2 Free Downloadable Databases. Download the GeoLite2 City database in CSV format and unzip it

curl -O https://geolite.maxmind.com/download/geoip/database/GeoLite2-City-CSV.zip
unzip GeoLite2-City-CSV.zip

We will use only two files: GeoLite2-City-Blocks-IPv4.csv, which contains blocks of IP addresses, and GeoLite2-City-Locations-en.csv, which maps geoname IDs to positions on the map.

Gzip these files

gzip */GeoLite2-City-Locations-en.csv */GeoLite2-City-Blocks-IPv4.csv

And after that upload them to one of the S3 buckets that you have access to, for example

aws s3 cp */GeoLite2-City-Locations-en.csv.gz s3://example.com.logs/tmp/geolite2-citi-locations/data.csv.gz
aws s3 cp */GeoLite2-City-Blocks-IPv4.csv.gz s3://example.com.logs/tmp/geolite2-citi-blocks-ipv4/data.csv.gz

Go to Athena and create temporary tables, so we can look at the data and work with it. For now we give them the prefix tmp_, as we will need to change the format of the fields a little, and we will use Athena to do that.

We will do all the operations under the kubernetes database that we created with Collectord in last week's blog post.

We will define the table for the IPv4 blocks with LazySimpleSerDe, as this file does not have any quoted fields.

CREATE EXTERNAL TABLE IF NOT EXISTS kubernetes.tmp_geolite_blocks_ipv4 (
  network STRING,
  geoname_id INT,
  registered_country_geoname_id INT,
  represented_country_geoname_id INT,
  is_anonymous_proxy INT,
  is_satellite_provider INT,
  postal_code STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  accuracy_radius INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
LOCATION 's3://example.com.logs/tmp/geolite2-citi-blocks-ipv4/'
TBLPROPERTIES ('skip.header.line.count'='1');

And the table for locations with OpenCSVSerDe, as this file does have quoted fields.

CREATE EXTERNAL TABLE IF NOT EXISTS kubernetes.tmp_geolite_locations (
  geoname_id INT,
  locale_code STRING,
  continent_code STRING,
  continent_name STRING,
  country_iso_code STRING,
  country_name STRING,
  subdivision_1_iso_code STRING,
  subdivision_1_name STRING,
  subdivision_2_iso_code STRING,
  subdivision_2_name STRING,
  city_name STRING,
  metro_code STRING,
  time_zone STRING,
  is_in_european_union INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
)
LOCATION 's3://example.com.logs/tmp/geolite2-citi-locations/'
TBLPROPERTIES ('skip.header.line.count'='1');

Run a simple query to verify that the tables are created and that you have access to them

select * 
from  kubernetes.tmp_geolite_blocks_ipv4 t1
  inner join kubernetes.tmp_geolite_locations t2 on t1.geoname_id = t2.geoname_id
limit 10

Athena verify tables

Converting tables for CIDR lookups

As you can see, the table tmp_geolite_blocks_ipv4 defines blocks using CIDR notation, like 1.0.0.0/24, which means that this block includes all IP addresses from 1.0.0.0 to 1.0.0.255. Unfortunately, Presto does not currently support CIDR lookups, which means Athena does not support them either. We will have to convert these CIDR representations to something else. One way is to use the integer representation of the IP addresses, so that we can run queries like ip_start <= ip_address <= ip_end. Converting an IP address to an integer is easy: we just need to compute ipv4[1]*256*256*256 + ipv4[2]*256*256 + ipv4[3]*256 + ipv4[4]. And we will need to convert the prefix length (the /24 part) into the last IPv4 address of the range.
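
To make the arithmetic concrete, here is a small standalone Athena query (just a sketch for the example block 1.0.0.0/24, not part of the conversion pipeline below); for a /24 prefix only the last octet needs to be filled with ones.

select
  -- integer value of 1.0.0.0, the first address of the block
  cast(ip[1] as BIGINT)*256*256*256 + cast(ip[2] as BIGINT)*256*256 +
    cast(ip[3] as BIGINT)*256 + cast(ip[4] as BIGINT) as ip_start,
  -- for /24 the host part is the last 8 bits, so we OR the last octet with 255
  cast(ip[1] as BIGINT)*256*256*256 + cast(ip[2] as BIGINT)*256*256 +
    cast(ip[3] as BIGINT)*256 + bitwise_or(cast(ip[4] as BIGINT), 255) as ip_end
from (
  select split('1.0.0.0', '.') as ip
)
-- expected: ip_start = 16777216, ip_end = 16777471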

With the following query, we will parse the network field, convert it into the fields ip_start and ip_end, and create a new table kubernetes.geolite_blocks_ipv4 from the result in Parquet format.

CREATE TABLE kubernetes.geolite_blocks_ipv4
WITH (
  format='PARQUET',
  external_location='s3://example.com.logs/geolite2-citi-blocks-ipv4/'
) AS
select 
  cast(ip[1] as BIGINT)*256*256*256 + cast(ip[2] as BIGINT)*256*256 + cast(ip[3] as BIGINT)*256 + cast(ip[4] as BIGINT) as ip_start,
  (
    bitwise_or(cast(ip[1] as BIGINT), bitwise_and(255, cast(power(2, greatest(8 - range, 0)) as BIGINT)-1))
  )*256*256*256 +
  (
    bitwise_or(cast(ip[2] as BIGINT), bitwise_and(255, cast(power(2, greatest(16 - range, 0)) as BIGINT)-1))
  )*256*256 +
  (
    bitwise_or(cast(ip[3] as BIGINT), bitwise_and(255, cast(power(2, greatest(24 - range, 0)) as BIGINT)-1))
  )*256+
  (
    bitwise_or(cast(ip[4] as BIGINT), bitwise_and(255, cast(power(2, greatest(32 - range, 0)) as BIGINT)-1))
  ) as ip_end,
  network,
  geoname_id,
  registered_country_geoname_id,
  represented_country_geoname_id,
  cast(is_anonymous_proxy as BOOLEAN) as is_anonymous_proxy,
  cast(is_satellite_provider as BOOLEAN) as is_satellite_provider,
  postal_code,
  latitude,
  longitude,
  accuracy_radius
from 
(
  select 
    network,
    geoname_id,
    registered_country_geoname_id,
    represented_country_geoname_id,
    is_anonymous_proxy,
    is_satellite_provider,
    postal_code,
    latitude,
    longitude,
    accuracy_radius,
    split(network_array[1], '.') as ip,
    cast(network_array[2] as BIGINT) as range
  from
  (
    select 
      network,
      geoname_id,
      registered_country_geoname_id,
      represented_country_geoname_id,
      is_anonymous_proxy,
      is_satellite_provider,
      postal_code,
      latitude,
      longitude,
      accuracy_radius,
      split(network, '/') as network_array
    from kubernetes.tmp_geolite_blocks_ipv4
  )
)
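
Once the conversion has finished, a quick spot check (a sketch; the exact values depend on the GeoLite2 release you downloaded) should show every CIDR block next to its integer range

select network, ip_start, ip_end, geoname_id
from kubernetes.geolite_blocks_ipv4
order by ip_start
limit 10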

Also, let's convert kubernetes.tmp_geolite_locations into a new table in Parquet format as well

CREATE TABLE kubernetes.geolite_locations
WITH (
  format='PARQUET',
  external_location='s3://example.com.logs/geolite2-citi-locations/'
) AS
select 
 geoname_id,
 locale_code,
 continent_code,
 continent_name,
 country_iso_code,
 country_name,
 subdivision_1_iso_code,
 subdivision_1_name,
 subdivision_2_iso_code,
 subdivision_2_name,
 city_name,
 metro_code,
 time_zone,
 cast(is_in_european_union as BOOLEAN) as is_in_european_union
from kubernetes.tmp_geolite_locations

After that you can drop the tmp_ tables

drop table kubernetes.tmp_geolite_blocks_ipv4
drop table kubernetes.tmp_geolite_locations

And remove the CSV files from S3.

Verify tables with GeoIP lookup

You can now test the tables as follows. Let's use the IP address 8.8.8.8 and convert it to an integer as 8*256*256*256 + 8*256*256 + 8*256 + 8, which is 134744072.

Let's find the location of this IP address

select t2.continent_name, t2.country_name, t2.city_name
from  kubernetes.geolite_blocks_ipv4 t1
  inner join kubernetes.geolite_locations t2 on t1.geoname_id = t2.geoname_id
where t1.ip_start <= 134744072 and 134744072 <= t1.ip_end

Athena test IP lookup
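
If you prefer not to convert the address by hand, the same lookup can also be written with the conversion inlined (a sketch; the literal '8.8.8.8' stands for whatever address you want to look up)

select t2.continent_name, t2.country_name, t2.city_name
from kubernetes.geolite_blocks_ipv4 t1
  inner join kubernetes.geolite_locations t2 on t1.geoname_id = t2.geoname_id
  cross join (
    -- convert the dotted notation to the integer representation
    select
      cast(ip[1] as BIGINT)*256*256*256 + cast(ip[2] as BIGINT)*256*256 +
      cast(ip[3] as BIGINT)*256 + cast(ip[4] as BIGINT) as ip_int
    from (select split('8.8.8.8', '.') as ip)
  ) addr
where t1.ip_start <= addr.ip_int and addr.ip_int <= t1.ip_end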

Pre-loading the data

In our last week's blog post Setting up comprehensive centralized logging with AWS Services for Kubernetes we used an nginx deployment as an example. We also had several deployments that generated traffic on this nginx deployment. To be able to test IP addresses from public networks, we will need to generate activity from outside. For that we will change the deployment and expose it via a NodePort Service

We also opened port 30036 in the Security Groups

apiVersion: v1
kind: Service
metadata:  
  name: nginx
spec:
  externalTrafficPolicy: Local
  selector:    
    app: nginx
  type: NodePort
  ports:  
  - name: http
    port: 80
    targetPort: 80
    nodePort: 30036
    protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

After that we will use the AWS Route53 HealthCheck service to generate traffic from various locations.

Route 53 HealthChecks

Analyzing the data

After running it for a while, we can check the data and create a View from the nginx logs, very similar to the View we created last week in Setting up comprehensive centralized logging with AWS Services for Kubernetes, just with a different where clause to reflect the different workload. Similarly, we extract all the nginx_ fields from the message

CREATE OR REPLACE VIEW "container_logs_nginx_public_7d" AS
select 
  to_unixtime(from_iso8601_timestamp(timestamp)) as timestamp, 
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 1) as nginx_remote_addr,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 2) as nginx_remote_user,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 4) as nginx_request_method,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 5) as nginx_request_path,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 6) as nginx_request_http_version,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 7) as nginx_response_status,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 8) as nginx_bytes_sent,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 9) as nginx_referer,
  regexp_extract(message, '^([^\s]+) - ([^\s]+) \[(.+)\] "([^\s]+) ([^\s]+) ([^\s]+)" ([^\s]+) ([^\s]+) "(.+)" "(.+)" "(.+)"$', 10) as nginx_user_agent
from kubernetes.container_logs 
where cluster='eks-devel' and workload='nginx' and stream='stdout' and dt>=date_format(date_add('day', -7, now()), '%Y%m%d')
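
Before heading to QuickSight, it is worth running a quick query against the view to confirm that the nginx_ fields are extracted as expected (a sketch; the addresses you will see are the HealthCheck source IPs)

select nginx_remote_addr, nginx_request_method, nginx_response_status, count(*) as requests
from kubernetes.container_logs_nginx_public_7d
group by 1, 2, 3
order by requests desc
limit 10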

After that we can go to QuickSight and build a new Data Set, where we will use the just-created view, join it with the GeoLite blocks table (converting the IP address to an integer and looking it up in the block ranges), and join it with the locations table, so we can get the City and Country names.

select
  sum(ip_activities.count) as count,
  kubernetes.geolite_locations.city_name,
  kubernetes.geolite_locations.country_name,
  kubernetes.geolite_blocks_ipv4.latitude,
  kubernetes.geolite_blocks_ipv4.longitude,
  kubernetes.geolite_blocks_ipv4.postal_code
from (
  select 
    cast(ip[1] as BIGINT)*256*256*256 + 
    cast(ip[2] as BIGINT)*256*256 + 
    cast(ip[3] as BIGINT)*256 + 
    cast(ip[4] as BIGINT) as ip,
    count as count
  from (
    select split(nginx_remote_addr, '.') as ip, count(*) as count
    from kubernetes.container_logs_nginx_public_7d
    group by 1
  )
) as ip_activities
inner join kubernetes.geolite_blocks_ipv4 on 
  ip_activities.ip >= kubernetes.geolite_blocks_ipv4.ip_start and ip_activities.ip <= kubernetes.geolite_blocks_ipv4.ip_end
inner join kubernetes.geolite_locations on
  kubernetes.geolite_blocks_ipv4.geoname_id = kubernetes.geolite_locations.geoname_id
group by 2,3,4,5,6

QuickSight automatically recognized all the column types

QuickSight data source

Let's build a visualization with a Map and a Table showing the number of received connections.

QuickSight visualization

It matches the list from the Route53 HealthCheck endpoints!

HealthChecks List

Summary

As you can see, having relational reference tables is a very powerful tool for analysis. You can easily upload your own reference tables and start building analytics on top of them.

collectord, athena, geoip, logs, quicksight

About Outcold Solutions

Outcold Solutions provides solutions for building centralized logging infrastructure and monitoring Kubernetes, OpenShift and Docker clusters. We provide easy-to-set-up centralized logging infrastructure with AWS services. We offer Splunk applications, which give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers.