Data Warehousing Essentials: Loading Data From Amazon S3 Using Amazon Redshift

Pronay Ghosh
Published in Accredian
6 min read · Apr 6, 2022


by Pronay Ghosh and Hiren Rupchandani

  • In the previous article, we learned about how to set up Amazon Redshift.
  • In this article, we will learn how to load data from Amazon S3 into Amazon Redshift.
  • We will first download the CSV data files.
  • Then we will create an Amazon S3 bucket and then upload the data files to the S3 bucket.
  • Create database tables in an Amazon Redshift cluster.
  • To load the tables from the data files on Amazon S3, use COPY commands.
  • Troubleshoot load issues and make changes to your COPY commands to fix them.
  • You may use either an INSERT command or a COPY command to add data to your Amazon Redshift tables.
  • The COPY command is several times quicker and more efficient than INSERT statements at the scale and speed of an Amazon Redshift data warehouse.
  • The COPY command reads and loads data from many data sources in parallel using the Amazon Redshift massively parallel processing (MPP) architecture.
  • You may load data from Amazon S3, Amazon EMR, or any remote server that can be reached through Secure Shell (SSH).
  • You may also use an Amazon DynamoDB table to load data.

Steps for Loading Data from Amazon S3 into Amazon Redshift:

Step 1: Creating a cluster

  • Log in to the AWS Management Console and open the Amazon Redshift console.
  • In the upper right corner, choose the AWS Region in which you want to create the cluster.
  • Choose CLUSTERS from the navigation menu, then Create cluster.
  • The Create cluster page appears.
  • Fill in the parameters for your cluster on the Create cluster page.
  • Except for the following settings, you can set your own values for the parameters:
  • For the node type, choose dc2.large.
  • Set the number of nodes to 4.
  • Select an IAM role from the Available IAM roles list in the Cluster permissions section.
  • This role should be one that you’ve already created and that has Amazon S3 access.
  • Then choose Associate IAM role to add it to the cluster’s list of Associated IAM roles.
  • Click Create cluster.

Step 2: Downloading the data files

  • Download the bundled file containing the sample data files.
  • Extract the files to a folder on your PC.
  • Cross-check the size of the downloaded zip file against the expected size to confirm the download completed.

Step 3: Creating an Amazon S3 bucket

  • Create an Amazon S3 bucket.
  • Log in to the AWS Management Console and open the Amazon S3 console.
  • Create a bucket by clicking the Create Bucket button.
  • Type a bucket name in the Bucket Name box of the Create a Bucket dialog box.
  • Your bucket name must be distinct from all other bucket names in Amazon S3.
  • Prefixing your bucket names with your organization’s name is one approach to help ensure uniqueness.

Note: Bucket names must follow a set of guidelines. See Bucket restrictions and limitations in the Amazon Simple Storage Service User Guide for additional details.

  • Choose a Region.
  • Create the bucket in the same Region as your cluster, for example, the US West (Oregon) Region (us-west-2).
  • Select Create.
  • After Amazon S3 successfully creates your bucket, the console displays your empty bucket in the Buckets panel.
  • Finally, upload the extracted data files to the new bucket.

Step 4: Creating the sample tables

  • In this step, we create the sample Star Schema Benchmark (SSB) tables.
  • It’s possible that the SSB tables already exist in the present database.
  • If this is the case, drop the tables to remove them from the database before continuing with the CREATE TABLE procedures.

The tables created in this tutorial might have different attributes than existing tables of the same names, which is another reason to drop any existing versions first.

  • In your SQL client, run the following CREATE TABLE statements.
  • To create the part table, one can use the set of commands sketched below.
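The following DDL is a sketch based on the public Star Schema Benchmark definitions used in the AWS Redshift load tutorial; the column names and types are assumptions taken from that spec rather than from this article, so verify them against your data files before running.

create table part
(
  p_partkey   integer     not null,  -- part key of the dimension row
  p_name      varchar(22) not null,
  p_mfgr      varchar(6),            -- left nullable in the SSB sample spec
  p_category  varchar(7)  not null,
  p_brand1    varchar(9)  not null,
  p_color     varchar(11) not null,
  p_type      varchar(25) not null,
  p_size      integer     not null,
  p_container varchar(10) not null
);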
  • To create the supplier table, one can use the following set of commands.
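Likewise, a sketch of the supplier table under the same SSB-schema assumptions:

create table supplier
(
  s_suppkey integer     not null,  -- supplier key of the dimension row
  s_name    varchar(25) not null,
  s_address varchar(25) not null,
  s_city    varchar(10) not null,
  s_nation  varchar(15) not null,
  s_region  varchar(12) not null,
  s_phone   varchar(15) not null
);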
  • To create the customer table, one can use the following set of commands.
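A sketch of the customer dimension table, again assuming the SSB spec:

create table customer
(
  c_custkey    integer     not null,  -- customer key of the dimension row
  c_name       varchar(25) not null,
  c_address    varchar(25) not null,
  c_city       varchar(10) not null,
  c_nation     varchar(15) not null,
  c_region     varchar(12) not null,
  c_phone      varchar(15) not null,
  c_mktsegment varchar(10) not null
);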
  • To create the dwdate table, one can use the following set of commands.
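A sketch of the dwdate (date dimension) table from the same spec:

create table dwdate
(
  d_datekey          integer     not null,  -- date in yyyymmdd integer form
  d_date             varchar(19) not null,
  d_dayofweek        varchar(10) not null,
  d_month            varchar(10) not null,
  d_year             integer     not null,
  d_yearmonthnum     integer     not null,
  d_yearmonth        varchar(8)  not null,
  d_daynuminweek     integer     not null,
  d_daynuminmonth    integer     not null,
  d_daynuminyear     integer     not null,
  d_monthnuminyear   integer     not null,
  d_weeknuminyear    integer     not null,
  d_sellingseason    varchar(13) not null,
  d_lastdayinweekfl  varchar(1)  not null,  -- flag columns hold '1' or '0'
  d_lastdayinmonthfl varchar(1)  not null,
  d_holidayfl        varchar(1)  not null,
  d_weekdayfl        varchar(1)  not null
);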
  • To create the lineorder table, one can use the following set of commands.
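Finally, a sketch of the lineorder fact table, the largest of the five, from the same spec:

create table lineorder
(
  lo_orderkey        integer     not null,  -- order key of the fact row
  lo_linenumber      integer     not null,
  lo_custkey         integer     not null,  -- joins to customer.c_custkey
  lo_partkey         integer     not null,  -- joins to part.p_partkey
  lo_suppkey         integer     not null,  -- joins to supplier.s_suppkey
  lo_orderdate       integer     not null,  -- joins to dwdate.d_datekey
  lo_orderpriority   varchar(15) not null,
  lo_shippriority    varchar(1)  not null,
  lo_quantity        integer     not null,
  lo_extendedprice   integer     not null,
  lo_ordertotalprice integer     not null,
  lo_discount        integer     not null,
  lo_revenue         integer     not null,
  lo_supplycost      integer     not null,
  lo_tax             integer     not null,
  lo_commitdate      integer     not null,
  lo_shipmode        varchar(10) not null
);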

Step 5: Running the COPY command

  • To load each of the tables in the SSB schema, you use COPY commands.
  • The general COPY command syntax is as follows.
COPY table_name [ column_list ] FROM data_source CREDENTIALS access_credentials [options]
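For example, here is a sketch of a COPY command for the part table. The bucket name, file prefix, AWS account ID, and role name are placeholders, not values from this article, so substitute your own:

copy part from 's3://<your-bucket-name>/load/part-csv.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;

The csv option tells COPY to parse the files as comma-separated values. Run a similar command for each of the five tables, pointing FROM at the matching files in your bucket.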

Step 6: Vacuuming and analyzing the database

  • You should use a VACUUM command followed by an ANALYZE command whenever you add, remove, or edit a large number of rows.
  • A vacuum restores the sort order and recovers the space from removed rows.
  • The ANALYZE command modifies the statistics information, allowing the query optimizer to create more precise query plans.
  • See Vacuuming Tables for further details.
  • To run VACUUM and ANALYZE, execute the following two commands.
vacuum;
analyze;

Step 7: Clean Up the Resources

  • Run the following statements on your SQL client to remove the SSB tables.
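A sketch of those cleanup statements, assuming the five SSB tables created in Step 4:

drop table part;
drop table supplier;
drop table customer;
drop table dwdate;
drop table lineorder;

If you are finished with the cluster itself, you can also delete it from the Amazon Redshift console so that it stops incurring charges.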

Conclusion:

  • So far in this article, we covered a high-level overview of how to load data into Amazon Redshift from an S3 bucket.
  • In the next article, we will learn how to unload data from Amazon Redshift.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science & AI, as it covers your foundations, machine learning algorithms, and deep neural networks (basic to advanced).

Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you found this article useful, as your encouragement catalyzes inspiration and helps us create more cool stuff like this.


Pronay Ghosh
Accredian

Data Scientist at Aidetic | Former Data Science researcher at The International School of AI and Data Science