I recently spent some time configuring and analysing the logs of an Amazon Redshift warehouse. I am summarizing the experience here so that others can achieve the same result faster.
Amazon Redshift is the analytical data warehouse platform on the AWS cloud, with a rapidly growing user base. It is optimized to work with the S3 storage service. Redshift is focused on performance (columnar architecture, Massively Parallel Processing (MPP) using compute nodes and slices; more here). It is based on PostgreSQL 8.0.2 and speaks a similar SQL dialect, so it can connect to any BI tool that is PostgreSQL compliant. The typical reason to set up Amazon Redshift is to address data warehousing scenarios (analytics, various types of aggregation and reporting) for data derived from Amazon S3.
Redshift compute time sells for $0.25 per hour and is available for a free trial. It is technically not part of the AWS Free Tier (the basic no-cost offering from AWS); however, Amazon currently offers a certain amount of free compute hours to users who start their Redshift adventure. It is currently 750 hours over 2 months, under certain limitations: a DC2.Large node with 160 GB of compressed SSD storage.
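For orientation, those 750 free hours translate to roughly one month of a single node running around the clock, which a quick back-of-the-envelope check confirms:

```python
free_hours = 750      # the free trial allowance mentioned above
hours_per_day = 24

# One DC2.Large node running continuously:
days_covered = free_hours / hours_per_day
print(days_covered)   # 31.25, i.e. about a month of 24/7 uptime
```

So within the 2-month window you can keep one node up continuously for about half of it, or run it intermittently for longer.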
When using Redshift (or any database) in production, the database logging feature is always useful and should be turned on early on. User activities and system events should be logged for security, audit and debugging purposes. Redshift offers three different ways of logging that data:
- Event log files in S3 buckets
- Event logs in STL tables
- AWS CloudTrail
Typically, you do not need all three features but just one of them. This briefing will help you pick a logging method for Amazon AWS Redshift, enable it and use it. I will start with setting up Redshift from scratch.
Setting up an Amazon AWS Redshift cluster
I will cover this only briefly, since there is a lot of good material elsewhere. The document Getting Started with Amazon Redshift explains how to set up the cluster… except that it has a glitch.
The document says you need to set up the cluster security group. Correct: you won’t be able to access the cluster from SQL Workbench/J (or any external tool) unless the cluster security group has been set. However, in the Cluster Properties section of the Redshift console, the Security Group field described in the document (highlighted in yellow in the screenshot below) is actually missing until you create one. You still need to set it up, though. Don’t look for it in Cluster Properties; it is not there. Instead, click Cluster/Modify Cluster (yellow highlight below), where you will find the security group box. Pick one. To modify those groups or create new ones, go to the EC2 console and click Security on the left. Other than this, the instructions are fine.
Option 1: log files in S3 buckets
Amazon Redshift can store the logging information in flat text files located in Amazon S3 bucket of your choice. Log files will be split in three categories:
- Connection log — logs authentication attempts, and connections and disconnections.
- User log — logs information about changes to database user definitions.
- User activity log — logs each query before it is run on the database.
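Each category ends up in its own set of files in the bucket, distinguishable by a suffix in the file name. The `connectionlog` suffix is visible in my S3 listings later in this post; I am assuming here that `userlog` and `useractivitylog` follow the same naming convention. A small helper to sort downloaded files by category might look like this:

```python
import re

# Map a file name suffix to its log category. "connectionlog" appears in
# the S3 listings in this post; "userlog" and "useractivitylog" are
# assumed to follow the same naming convention.
LOG_CATEGORIES = {
    "connectionlog": "Connection log",
    "userlog": "User log",
    "useractivitylog": "User activity log",
}

def classify_log_file(filename):
    """Return the log category for a Redshift audit log file name, or None."""
    # File names look like:
    #   <account>_redshift_<region>_<cluster>_connectionlog_<timestamp>.gz
    m = re.search(r"_(useractivitylog|userlog|connectionlog)_", filename)
    return LOG_CATEGORIES[m.group(1)] if m else None
```

This comes in handy once you start syncing whole log directories and want to process each category differently.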
This logging option is best if you are used to working with text logfiles and you are comfortable with the Amazon/Big Data ecosystem of tools. For instance, you may set up processes (using EC2, ssh, Kinesis, Kafka etc.) that copy, synchronize or process your S3 logs. The disadvantage of this scenario is the processing delay on Amazon’s side. As I observed, the log files appear in the bucket in hourly batches, and sometimes even later; the typical delay is 30-120 minutes. In other words, S3 logging is useful for post-mortem analytics, but not for real-time pipelines with SLA-imposed reaction time requirements.
Setting up the logging
In the AWS Redshift console, go to Clusters -> your cluster -> click Database -> Configure Audit Logging. The feature is disabled by default; enable it.
Decide where you want the log – optimally, a new, separate S3 bucket. Go to S3 console and create a new bucket if necessary.
This is not enough: you still need to enable user activity logging, in the Parameter Groups. Find Parameter Groups on the left of the Redshift dashboard. However, you cannot edit the default parameter group. Create a separate parameter group and set enable_user_activity_logging to true. Then make sure the new, non-default parameter group is associated with your cluster.
That’s it, at this point you have enabled the logging. In the cluster console you should see the note:
Whether logs are being successfully delivered to the S3 bucket: first delivery pending
How S3-based Redshift logging works
Database audit logging provides the Connection log, the User log and the User activity log. The first one records login attempts, the last one records all user activity such as SELECT * FROM. The logs are stored as compressed files (a separate file for each hourly interval of logging) in a hierarchical file structure in the S3 bucket you have provided for this purpose. Here are the details on which columns are in the log files.
The log files are dumped at irregular intervals; as I observed, between 30 and 120 minutes. If you are interested in recent events, you may sometimes need to wait up to two hours until the related log becomes accessible.
Accessing the Redshift audit logs
The logs contain a lot of information. In particular, the user activity log records all the queries that automatic processes issue every few minutes. Hence, even though you can in principle access the logs manually with the S3 console, that is not what you want to do. Instead, use AWS CLI, a command-line tool that allows you to copy (or synchronize) your entire logging directory from S3 to your local system. Here and here is how to install it; below is how I did it (this is the bash shell inside Windows 10 WSL; I use pip).
$ python3 -m pip install wheel
$ python3 -m pip install awscli --upgrade --user
# This did not work for me; I got errors.
# Some online advice tells you to sudo pip at this point. Don't do that!
$ python3 -m pip uninstall aws
$ python3 -m pip install awscli
$ aws --version
aws-cli/1.16.135 Python/3.5.2 Linux/4.4.0-17134-Microsoft botocore/1.12.125
Now, you need to set up the Administrator user in the AWS console, as described here. Then you need to configure AWS CLI using that user’s credentials:
$ aws configure
AWS Access Key ID [None]: ****
AWS Secret Access Key [None]: *****
Default region name [None]: us-east-1
Default output format [None]: text
Note 1: you might configure aws for various user profiles, using the --profile option, which I don’t do in the example above; we will use the default profile for simplicity. Note 2: AWS CLI is good for the testing phase; for production pipelines you’re better off pushing your logs to Kinesis, Kafka and the like.
Bingo: at this point you should be able to access the remote S3 logs:
$ aws s3 ls […]
2019-03-31 11:03:28 altanova-redshiftlog
Here is how the log directory looks:
$ aws s3 ls altanova-redshiftlog/redlog/AWSLogs/632551500903/redshift/us-east-2/2019/03/30/
2019-03-31 11:55:58   309 632551500903_redshift_us-east-2_redshift-cluster-1_connectionlog_2019-03-30T00:51.gz
2019-03-31 11:55:59   379 632551500903_redshift_us-east-2_redshift-cluster-1_connectionlog_2019-03-30T01:51.gz
2019-03-31 11:56:00   504 632551500903_redshift_us-east-2_redshift-cluster-1_connectionlog_2019-03-30T02:51.gz
As you can see, the logs are separated hourly. For instance, this is how I accessed the most recent one:
$ aws s3 cp s3://altanova-redshiftlog/redlog/AWSLogs/632551500903/redshift/us-east-2/2019/04/01/632551500903_redshift_us-east-2_redshift-cluster-1_connectionlog_2019-04-01T11:30.gz .
download: s3://altanova-redshiftlog/redlog/AWSLogs/632551500903/redshift/us-east-2/2019/04/01/632551500903_redshift_us-east-2_redshift-cluster-1_connectionlog_2019-04-01T11:30.gz to ./632551500903_redshift_us-east-2_redshift-cluster-1_connectionlog_2019-04-01T11:30.gz
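Since the key structure is date-based, you can compute the prefix for a given day and list or sync only that day’s logs instead of the whole directory. A small sketch (the bucket name, prefix, account id and region below are the ones from my setup; substitute your own):

```python
from datetime import date

def log_prefix(bucket, prefix, account, region, day):
    """Build the S3 key prefix under which Redshift drops that day's logs."""
    return (f"s3://{bucket}/{prefix}/AWSLogs/{account}/redshift/"
            f"{region}/{day:%Y/%m/%d}/")

# Example with the values used in this post:
print(log_prefix("altanova-redshiftlog", "redlog",
                 "632551500903", "us-east-2", date(2019, 4, 1)))
# -> s3://altanova-redshiftlog/redlog/AWSLogs/632551500903/redshift/us-east-2/2019/04/01/
```

You can then pass the resulting prefix to `aws s3 sync` to pull only the most recent day.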
Testing the Redshift logging
Now let’s create a user event and verify that the logs correctly record it. Connect to your Redshift cluster with SQL Workbench/J and execute some SQL queries, such as SELECT, INSERT or CREATE TABLE. Wait a few minutes (or hours), download the log from the S3 bucket and see if it contains the event.
Indeed, here is the snippet from my Connection log where the test login event was recorded (for a description of the fields, refer to the documentation here).
authenticated |Mon, 1 Apr 2019 13:35:30:370|::ffff:220.127.116.11 |25269 |13371|dev |awsuser |password |0|TLSv1.2 |ECDHE-RSA-AES256-SHA384 |0| | | |
initiating session |Mon, 1 Apr 2019 13:35:30:370|::ffff:18.104.22.168 |25269 |13371|dev |awsuser |password |0|TLSv1.2 |ECDHE-RSA-AES256-SHA384 |0| | | |
set application_name |Mon, 1 Apr 2019 13:35:30:898|::ffff:22.214.171.124 |25269 |13371|dev |awsuser |password |1752455|TLSv1.2 |ECDHE-RSA-AES256-SHA384 |0| | | |JDBC-126.96.36.1993
disconnecting session |Mon, 1 Apr 2019 13:49:37:917|::ffff:188.8.131.52 |25269 |13371|dev |awsuser |password |848771598|TLSv1.2 |ECDHE-RSA-AES256-SHA384 |0| | | |JDBC-184.108.40.2063
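Each connection log record is a single pipe-delimited line, so it is easy to work with programmatically. A minimal sketch follows; the field names for the leading columns are taken from the connection log documentation, while the trailing SSL and application columns are left unnamed here:

```python
# Field names for the leading columns of the Redshift connection log,
# per the AWS documentation; trailing columns are kept unnamed.
FIELDS = ["event", "recordtime", "remotehost", "remoteport",
          "pid", "dbname", "username", "authmethod"]

def parse_connection_record(line):
    """Split one pipe-delimited connection log line into a dict."""
    parts = [p.strip() for p in line.split("|")]
    record = dict(zip(FIELDS, parts))
    record["extra"] = parts[len(FIELDS):]  # remaining columns, unnamed
    return record

# A line in the shape shown above (IP address here is a placeholder):
rec = parse_connection_record(
    "authenticated |Mon, 1 Apr 2019 13:35:30:370|::ffff:10.0.0.1 "
    "|25269 |13371|dev |awsuser |password |0|TLSv1.2 ")
```

After parsing, filtering for e.g. failed logins or a particular user becomes a one-liner.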
In addition, here is the section of the User activity log which recorded all the user actions. As you can see, we used the SHOES table provided as the dummy example in the Redshift tutorial.
'2019-04-01T13:45:12Z UTC [ db=dev user=awsuser pid=14185 userid=100 xid=86255 ]' LOG: insert into shoes values ('sandals', 'blue')
'2019-04-01T13:45:27Z UTC [ db=dev user=awsuser pid=14185 userid=100 xid=86264 ]' LOG: select * from shoes
'2019-04-01T13:45:27Z UTC [ db=dev user=awsuser pid=14185 userid=100 xid=86264 ]' LOG: SELECT public.shoes.shoetype AS shoetype, public.shoes.color AS color FROM public.shoes;
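The user activity log prefixes each query with a quoted context header, as in the lines above. A hedged sketch of a parser for this format (the regex is written against the lines shown in this post, not against a formal grammar):

```python
import re

# Matches lines of the form:
#   '2019-04-01T13:45:27Z UTC [ db=dev user=awsuser pid=14185
#    userid=100 xid=86264 ]' LOG: select * from shoes
PATTERN = re.compile(
    r"'(?P<ts>\S+) UTC \[ db=(?P<db>\S+) user=(?P<user>\S+) "
    r"pid=(?P<pid>\d+) userid=(?P<userid>\d+) xid=(?P<xid>\d+) \]' "
    r"LOG: (?P<query>.*)")

def parse_activity_line(line):
    """Return the context fields and query text of one log line, or None."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None

row = parse_activity_line(
    "'2019-04-01T13:45:27Z UTC [ db=dev user=awsuser pid=14185 "
    "userid=100 xid=86264 ]' LOG: select * from shoes")
```

This makes it straightforward to, say, group the queries by user or by transaction id (xid).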
That concludes Option 1: log files in S3 buckets. As you can see, if you can live with the delay, it can be quite useful.
Option 2: Logging with STL tables
Redshift stores quite a lot of log information by default, in STL system tables. Some of them are accessible only by the superuser (which can be changed by granting the SELECT privilege). The logged data is not meant for permanent storage: the STL log tables are periodically emptied. Here is the reference page listing all the tables that can be queried.
As an example, to find the recent authentication failure events, we can execute the following query from SQL Workbench/J:
select event, recordtime, dbname, username, authmethod, pid, remotehost
from stl_connection_log
where event = 'authentication failure'
order by recordtime
Here is the requested output:
Use this feature if you want the most up-to-date information. The STL tables contain the most immediate data, and they also contain more detail than the S3 logfiles. Also, the data is already structured, which can be an advantage. This feature is also better if you feel more at home with SQL than with Python text processing.
The disadvantage here is that the STL tables do not store data indefinitely. For audit purposes, you would need to constantly copy the newest data over to a location outside AWS, to make sure it is protected from being overwritten. This concludes Option 2: logging with STL tables.
Option 3: the AWS CloudTrail
AWS has one more way of logging data: CloudTrail, which contains all the AWS user events. CloudTrail is not specific to Redshift; it includes events and calls to other AWS APIs too. Here is a document about creating a trail for an AWS account.
CloudTrail is useful if you need a uniform logging method covering not just Redshift but the entire AWS service range. If you want to log Redshift in particular, it is not optimal, as the trail contains less Redshift-specific data and is not as up-to-date as the STL tables.
To summarize, Amazon AWS Redshift provides three ways to log user and system events for audit, security and debugging purposes:
- logging with S3 buckets is best for archive, analytical, text-based processing
- logging with STL tables is best for real-time processing
- logging with CloudTrail lets you extend the logging to AWS beyond Redshift
I hope it makes sense now!