In this article I will focus on Performance and Cost for these three solutions.

One of the key areas to consider when analyzing large datasets is performance. Using the right data analysis tool can mean the difference between waiting a few seconds and (annoyingly) having to wait many minutes for a result. Also, good performance usually translates to fewer compute resources to deploy and, as a result, lower cost.

In this article I'll use the data and queries from the TPC-H Benchmark, an industry standard for measuring database performance. It consists of a dataset of 8 tables and 22 queries that are executed against this dataset. TPC-H offers a consistent way to measure performance against […]. This means I used the same dataset and queries when testing Starburst Presto, Redshift and Redshift Spectrum, so we can have a fair comparison. I executed the standard TPC-H set of 22 queries, […]

Infrastructure and Data Setup

A 1TB TPC-H dataset consists of approximately 8.66 billion records, for all 8 tables combined. Due to its size, querying a 1TB TPC-H dataset requires a significant amount of resources. Therefore I set up a fairly powerful cluster for each solution:

10 Worker nodes (r4.8xlarge) and 1 Coordinator node. Each r4.8xlarge EC2 instance is memory-optimized and has 32 vCPUs, 244 GB RAM and 10 Gb/s network performance. In addition to Worker and Coordinator nodes, this Presto cluster uses an RDS-backed Hive Metastore in AWS EMR, consisting […]. A Hive Metastore stores metadata for tables, such as their schema, location, size and other internal characteristics. Starburst Presto's Cost-Based Optimizer uses this metadata to determine the most efficient query execution plan.

One 10-node cluster with dc2.8xlarge instances. Each dc2.8xlarge instance has 36 vCPUs, 244 GB RAM and 2 volumes of 2 TB SSD Instance Storage each. The main difference between DC2 and R4 EC2 instances - besides the extra 4 vCPUs - is the way they handle storage. R4 instances are EBS-backed, while DC2 instances use internal Instance Storage, which makes them a […]. It's worth noting that the Starburst Presto cluster in this test reads data from S3 and not from EBS.

I used the same 10-node Redshift cluster, but made it point to the same ORC-formatted files in S3 that were also accessed by Starburst Presto.

Launching a Redshift cluster of this size is very straightforward and only takes a few clicks. However, it can take 20 minutes or more for the cluster to be ready. Resizing an existing cluster can take about the same amount of time, most likely due to data being redistributed across nodes. EC2 can be launched using a CloudFormation template, and it can take literally a couple of minutes to have a cluster up and running, but you'll also have to launch an EMR Hive Metastore.
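The record count quoted for the 1TB dataset can be sanity-checked with a short calculation. The sketch below is not part of the original test setup: it assumes the base table cardinalities from the TPC-H specification at scale factor 1, with a 1TB dataset corresponding to scale factor 1000. The `lineitem` figure is the nominal estimate (the generated row count varies slightly), and the function name `total_rows` is only illustrative.

```python
# Nominal row counts per table at scale factor 1 (TPC-H specification).
# These six tables grow linearly with the scale factor.
# Assumption: lineitem uses the nominal 6,000,000 rows per scale-factor unit;
# the data generator produces a slightly different number in practice.
SCALED_TABLES = {
    "supplier": 10_000,
    "part": 200_000,
    "partsupp": 800_000,
    "customer": 150_000,
    "orders": 1_500_000,
    "lineitem": 6_000_000,
}

# These two tables have a fixed size regardless of the scale factor.
FIXED_TABLES = {"nation": 25, "region": 5}


def total_rows(scale_factor: int) -> int:
    """Return the nominal number of rows across all 8 TPC-H tables."""
    scaled = sum(base * scale_factor for base in SCALED_TABLES.values())
    return scaled + sum(FIXED_TABLES.values())


if __name__ == "__main__":
    # Scale factor 1000 corresponds to the 1TB dataset.
    print(f"{total_rows(1000) / 1e9:.2f} billion rows")
```

Run as a script, this prints `8.66 billion rows`, which matches the approximate figure given for all 8 tables combined. Almost 70% of those rows belong to `lineitem` alone.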