Thorn Technologies helps a leading global telecommunications solutions provider deploy and optimize Amazon Redshift to exceed its data warehousing needs.
A global telecommunications firm that provides voice, mobile data, IoT, and other services to mobile network operators and digital companies was in need of a new data warehousing solution.
The company used the IBM Netezza data warehouse product to analyze their over 438 terabytes (TB) of data. The product was reaching its end of life and IBM would no longer support it. Thus, the company was in search of an alternative data warehouse product.
The company wanted to see if Amazon Redshift could provide the performance they needed for their applications. The target metric they wanted to achieve was that all queries needed to run in under three (3) seconds.
Thorn Technologies helped perform an in-depth proof of concept to help the client determine whether Redshift would meet their data warehouse performance needs.
Here are the steps we took to do so.
1) Set up AWS Environment and Redshift Instance
The first thing we did was set up the overall AWS architecture to deploy the Redshift instance. We designed this architecture with security in mind from the beginning. The environment included:
- A dedicated virtual private cloud (VPC)
- Private subnets for database cluster nodes
- Public subnets for bastion hosts and publicly accessible services
- A CloudFormation template for repeatable deployments
- A standardized change control process to facilitate Infrastructure as Code
We then determined which Redshift instance would best achieve the client’s needs for the project.
We compared ds2.8xlarge and dc2.8xlarge instances.
The ds2 instances use magnetic disks, which are cheaper and can hold more data. But when we ran queries with test data, these instances did not meet the client’s goal of queries executing in under 3 seconds.
The dc2 instances use SSD, which is much more performant than ds2’s magnetic disk technology. While these are more expensive, the dc2 instances were able to exceed the client’s requirements, so they were the best option.
2) Adapt and Execute the Data Definition Language (DDL)
After we set up the AWS environment, our next step was to modify the syntax of the Netezza DDL exports and translate it to syntax usable by Redshift.
This entailed modifying the distribution keys, Datetime default formatting, and other syntax.
After these changes were made, we ran the DDL on Redshift to build the data warehouse tables.
3) Export Netezza Data and Transfer it to AWS S3
The next step was to export sample data from the Netezza data warehouse and move it to S3.
We received over 6,500 files that amounted to approximately 1.5 terabytes (TB) of compressed data. This data represented six (6) months of GPRS roaming exchange data.
4) Load Data from S3 to Redshift
The data in S3 now could be loaded into Redshift.
We leveraged the Redshift COPY command to load the data in parallel, and temporarily scaled up the number of nodes to reduce the data load time.
Here is a snapshot of the amount of data we worked with:
5) Run Queries to Obtain Baseline Benchmarks
Once the data was loaded into Redshift, we ran initial queries to determine baseline performance of the data warehouse before any optimizations were made. (See the Results section for these benchmark times.)
6) Edit Tables to Optimize Query Response Time
After determining the query benchmarks, we optimized the table parameters to make the queries run faster.
We analyzed the queries for ideal sort keys, distribution keys, and compression.
Some of the tactics we used included:
Changed the Distribution Style to ALL for the Dimension tables.
The Dimension tables were small enough to justify putting duplicate copies of each table on every single Redshift node. This improved performance by eliminating the network overhead of a single node needing to fetch dimension data from another node during a query.
For the Fact Tables:
- Changed Sort Key to match query conditions:
- Change the Distribution Style to EVEN
- Ran ANALYZE COMPRESSION to get compression estimates
We then recreated the optimized tables, copied the data from the original tables to the optimized tables, and reran the queries to compare the performance.
7) Inflate Data from 6 Months to 36 Months
In order to better model the client’s real-world conditions, we inflated the 6-month daily data set, fact_xdr_daily, to represent approximately 36 months of data.
The data set was expanded from 1.6 TB to 13.2 TB, and we reran the queries to check performance.
The results of the queries were as follows:
We achieved significant improvement on all but one of the queries.
In the case of the 36-month data set, the time to run those two queries (xdr_daily_1 and xdr_daily_2) was faster than that of the original 6-month data set. This is because when the proper sort keys are identified, Redshift is able to immediately zero in on the right subset of data needed by the query regardless of how much total data is in the table.
In the case of the xdr_hourly_2 query where an improvement was not achieved, we surmise that this was a very complex query that just didn’t respond to our optimizations. Regardless, the query execution time met the 3-second goal that the client set.
We helped the client determine that Redshift is certainly capable of running queries that meet or exceed the speed of their old Netezza data warehouse queries.
As a result, the client is planning on moving ahead with a full migration to Redshift for all of their data warehousing needs.
Need help with your cloud big data initiatives? Reach out to us at email@example.com for a free assessment.