Hive or Impala ?
Hive and Impala both support SQL operation, but the performance of Impala is far superior than that of Hive. Although now with Spark SQL engine and use of HiveContext the performance of hive queries is also significantly fast, impala still has a better performance. The reason that impala has better performance is that it already has daemons running on the worker nodes and thus it avoids the overhead that is incurred during the creation of map and reduce jobs.
The query that I will mention later ran almost 10X faster on impala than on Hive (61 seconds vs around 600 seconds): Impala is known to give even better performance.
Schema on read vs Schema on write
Schema on read differs from schema on write as data is not validated till it is read. Although schema on read offers flexibility of defining multiple schemas for the same data, it can cause nasty runtime errors. As an example Hive and Impala are very particular about the timestamp format that they recognize and support, one workaround to avoid such bad records is to use a trick where rather than specifying the data type as timestamp, you specify the datatype as String and then use the cast operator to transform the records to timestamp format, this way bad records are skipped and the query does not error out.
cast(field_name as timestamp)
Window Functions, Top-N Queries, PL/SQL
Hive and Impala do not support update queries, but they do support
select * from insert into operation. Hive and impala also support window functions. The latter makes life easier because both Impala and Hive do not support PL/SQL procedures.
In the example below, I am using the dataset of NYC Yellow Taxi from the month of January 2015. The query below filters out invalid timestamp records and selects first 500 records per hour for 1st january 2015.
1 2 3 4 5
/*Top-N Subquery selects first 500 records per hour for a day*/ insert into nyc_taxi_data_limited select VendorID, tpep_pickup_datetime , tpep_dropoff_datetime , passenger_count ,trip_distance ,pickup_longitude ,pickup_latitude,RateCodeID ,store_and_fwd_flag ,dropoff_longitude ,dropoff_latitude ,payment_type ,fare_amount ,extra,mta_tax ,tip_amount,tolls_amount,improvement_surcharge,total_amount from ( select *, row_number() over (partition by trunc(cast(tpep_pickup_datetime as timestamp), 'HH') order by trunc(cast(tpep_pickup_datetime as timestamp), 'HH') desc) as rownumb from nyc_taxi_data where cast(tpep_pickup_datetime as timestamp) between cast('2015-01-01 00:00:00' as timestamp) and cast('2015-01-01 23:59:59' as timestamp) ) as q where rownumb<=500;
Note the use of window function row_number and ordering by truncated timestamp, and cast operator to avoid invalid records.
What's the catch ?
Given the benefits of Impala why would one ever use Hive ? The answer lies in the fact that impala queries are not fault tolerant.
Although, Impala and Hive do not offer entire repertoire of functionality supported by traditional RDBMS's, they are closest wrt to functionality offered by traditional RDBMS's in the world of distributed systems and offer scalable and large scale data analysis capability.