April 05, 2017

Apache Hive: Moving Beyond Analytics Offload with SQL MERGE

Author:
Carter Shanklin

HDP 2.6 takes a huge step forward toward true data management by introducing SQL-standard ACID Merge to Apache Hive.

As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Updates using hand-written Apache Hive or Apache Spark jobs are extremely complex. Not only are developers responsible for the update logic, they must also implement all rollback logic, detect and resolve write conflicts, and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems, and people who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates were in progress.

This approach is too complicated and can’t meet reasonable SLAs for most applications. For many, Hadoop became just a place for analytics offload — a place to copy data and run complex analytics where they can’t interfere with the “real” work happening in the EDW.

Hive’s ACID Merge handles all the hard work of Hadoop data maintenance. Because Merge handles inserts, updates and deletes in a single pass, developers don’t need to implement multi-stage update pipelines or develop complex rollback/retry logic. In addition, Hive’s ACID system protects you against write conflicts and isolates readers from in-progress updates, meaning you can update and query data at the same time.

 Figure 1: Complexity of the most common data maintenance tasks with and without MERGE

As Figure 1 shows, Merge lets you do the most common data maintenance tasks like upserts, SCD updates and data restatements in just one SQL query, far simpler than the complex and non-scalable approaches required with traditional Hive or Spark pipelines. Merge is 100% compatible with Hive LLAP which means fast updates and queries are possible on the same engine.
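To make the single-query upsert concrete, here is a minimal sketch of a Hive MERGE statement. The table and column names are illustrative, not from the benchmark; the target must be a transactional (ACID) table.

```sql
-- Hypothetical target; in Hive the target of a MERGE must be ACID-enabled:
--   CREATE TABLE customer (id INT, name STRING, email STRING)
--   CLUSTERED BY (id) INTO 8 BUCKETS
--   STORED AS ORC TBLPROPERTIES ('transactional' = 'true');

-- Single-pass upsert: rows that match on the key are updated,
-- everything else is inserted, all in one atomic statement.
MERGE INTO customer AS t
USING customer_staging AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.name, s.email);
```

Because the whole statement runs inside one ACID transaction, readers either see the table entirely before or entirely after the merge, never a partial update.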

Manage Billions of Rows Per Day with ACID Merge

To get a sense of Hive Merge’s scalability, we ran a benchmark based on the standard TPC-H benchmark. TPC-H includes data maintenance via a process it calls Refresh Functions, described in the TPC-H specification document. There are two Refresh processes: New Sales Refresh, which inserts new records, and Old Sales Refresh, which removes old data. The TPC-H schema has a table called orders to track all sales orders and a table called lineitem which contains up to 7 entries per order. The New Sales Refresh involves inserting net-new orders and line items. The Old Sales Refresh deletes items based on their primary key.

TPC-H supplies a data generator that can generate batches of these inserts and deletes. Our data maintenance approach is to generate a large batch of inserts and deletes, stage them into a staging table, and use Hive Merge to perform a Full Table Upsert, applying the inserts and deletes in a single pass. This pattern lines up well with real-world scenarios: it’s common for an upstream operational system to deliver periodic database dumps every hour or every day.
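A Full Table Upsert of this kind might be sketched as follows. This is not the benchmark’s actual DDL/DML (which isn’t shown in the post); the `op` flag on the staging table and the reduced column list are assumptions for illustration.

```sql
-- Hypothetical sketch: orders_staging holds both net-new orders and
-- the keys of orders to remove, tagged with an op flag ('insert'/'delete').
-- Inserts and deletes are applied against orders in a single pass.
MERGE INTO orders AS t
USING orders_staging AS s
ON t.o_orderkey = s.o_orderkey
WHEN MATCHED AND s.op = 'delete' THEN
  DELETE
WHEN NOT MATCHED THEN
  INSERT VALUES (s.o_orderkey, s.o_custkey, s.o_orderstatus, s.o_totalprice);
```

The same statement shape, with a `WHEN MATCHED ... UPDATE` clause added, handles restatements of existing rows as well.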

Figure 2: Merge rate for Full Table Upserts using 25 and 50 GB staging tables.

Note that this benchmark measures only the Merge rate; no analytics queries ran concurrently while the data was being kept up-to-date. Of course, Merge and queries can run simultaneously: supporting updates alongside analytics is simply a matter of provisioning capacity beyond what the analytics workload itself needs.

Figure 3 below shows the complete details of the cluster used for the test. As you can see, even though the cluster size is quite modest, we can still upsert more than 20 billion records per day.

Figure 3: Cluster details used for the benchmark.


Ian says:

Can you share the DDL for these tables? I’m curious to see the approach for clustering and bucketing in these scenarios. Also, it appears the staging tables are represented as ~500, 262 million. What is the size of the destination table?

Alisa says:

Is MERGE really available in HDP 2.6? In the release notes for 2.6.1 there is still Hive 2.1, but the MERGE feature has only been available since Hive 2.2.

Carter Shanklin says:

HDP ships 2 versions of Hive, both based on Apache Hive but including backported features and stabilization fixes. One of these is based on Hive 1.2.1 (more stable) and one is based on Hive 2.1 (more cutting edge).

MERGE is available in both the 1.2.1 based version and the 2.1 based version.

Gopala says:

Is this version enterprise ready?

zzeng says:

I think you can find sample DDL / DML here.
