Deep dive into enterprise data lake through Impala

A Hadoop near real-time solution: Impala


Deep dive into enterprise data lake through Impala
Evans Ye, 2014.7.28

Agenda
- Introduction to Impala
- Impala Architecture
- Query Execution
- Getting Started
- Parquet File Format
- ACL via Sentry

Introduction to Impala

Impala
- General-purpose SQL query engine
- Supports queries that take from milliseconds to hours (near real-time)
- Supports most of the Hadoop file formats: Text, SequenceFile, Avro, RCFile, Parquet
- Suitable for data scientists or business analysts

Why do we need it?

Current ad-hoc query solution: Pig
Do an hourly count on the Akamai log:
A = load '/test_data/2014/07/20/00' using PigStorage();
B = foreach (group A all) generate COUNT_STAR(A);
dump B;
0% complete ... 100% complete
(194202349)

Using Impala
Without any memory cache:
> select count(*) from test_data where day=20140720 and hour=0;
194202349
With the OS cache warm, a further query:
> select count(*) from test_data where day=20140720 and hour=0 and c='US';
41118019

Status quo
- Developed by Cloudera
- Open source under the Apache License
- 1.0 available in 05/2013; current version is 1.4
- Connect via ODBC/JDBC/Hue/impala-shell
- Authenticate via Kerberos or LDAP
- Fine-grained ACL via Apache Sentry

Benefits
- High performance: C++, direct access to data (no MapReduce), in-memory query execution
- Flexibility: query across existing data (no duplication), support for multiple Hadoop file formats
- Scalability: scale out by adding nodes

Impala Architecture

Impala Architecture (diagram)
- An impala daemon runs on every worker node alongside the Datanode, Tasktracker, and Regionserver
- NameNode, JobTracker, and HBase Master run as active/standby pairs
- The state store, catalog service, and Hive Metastore are separate services

Components
- Impala daemon: collocated with the datanodes; handles all Impala internal requests related to query execution. Users can submit a request to the impala daemon running on any node, and that node serves as the coordinator node for the query.
- State store daemon: communicates with the impala daemons to confirm which nodes are healthy and can accept new work.
- Catalog daemon: broadcasts metadata changes made through Impala SQL statements to all the impala daemons.

Fault tolerance
- Impala daemons: no fault tolerance; if a node fails, the query fails
- State store offline: query execution still functions normally, but metadata cannot be updated (create, alter); if another impala daemon goes down, the entire cluster can no longer execute queries
- Catalog offline: metadata cannot be updated

Query Execution
(Diagram-only slides: query execution flow.)

Getting Started

Impala-shell (secure cluster)
$ yum install impala-shell
$ kinit -kt evans_ye.keytab evans_ye
$ impala-shell --kerberos --impalad IMPALA_HOST

Create and insert
> create table t1 (col1 string, col2 int);
> insert into t1 values ('foo', 10);
- Only supports writing to TEXT and PARQUET tables
- Every insert creates one tiny HDFS file
- By default the file is stored under /user/hive/warehouse/t1/
- Use it for setting up small dimension tables, for experiments, or with HBase tables

Create external table to read existing files
> create external table t2 (col1 string, col2 int)
  row format delimited fields terminated by '\t'
  location '/user/evans_ye/test_data';
- location must be a directory (for example, a Pig output directory)
- Files to read:
  V /user/evans_ye/test_data/part-r-00000
  X /user/evans_ye/test_data/_logs/history
  X /user/evans_ye/test_data/20140701/00/part-r-00000
- Not recursive

No recursion? Then how do we add external data with a folder structure like this?
/user/evans_ye/test_data/20140701/00
/user/evans_ye/test_data/20140701/01
/user/evans_ye/test_data/20140701/02
/user/evans_ye/test_data/20140702/00

Partitioning

Create the table with partitions
> create external table t3 (col1 string, col2 int)
  partitioned by (`date` int, hour tinyint)
  row format delimited fields terminated by '\t';
- No need to specify the location on create

Add partitions into the table
> alter table t3 add partition (`date`=20130330, hour=0)
  location '/user/evans_ye/test_data/20130330/00';
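Putting the pieces together for the nested layout shown a few slides earlier: a minimal sketch (not from the deck; it assumes the 20140701/20140702 hour directories contain tab-delimited data matching t3's schema) that maps one hour directory to one partition, so queries can prune on the partition columns:

> alter table t3 add partition (`date`=20140701, hour=0)
  location '/user/evans_ye/test_data/20140701/00';
> alter table t3 add partition (`date`=20140701, hour=1)
  location '/user/evans_ye/test_data/20140701/01';
> alter table t3 add partition (`date`=20140702, hour=0)
  location '/user/evans_ye/test_data/20140702/00';
-- only the single matching partition directory is scanned
> select count(*) from t3 where `date`=20140701 and hour=0;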
Partition number
- Thousands of partitions per table: OK
- Tens of thousands of partitions per table: not OK

Compute table statistics
> compute stats t3;
> show table stats t3;
- Helps Impala optimize join queries: broadcast join vs. partitioned join

Parquet File Format

Parquet
- Apache Incubator project
- Column-oriented file format
- Compression is better, since all the values in a column are of the same type
- Encoding is better, since values are often the same and repeated
- SQL projection can avoid unnecessary reads and decoding of columns
- Supported by Pig, Impala, Hive, MapReduce, and Cascading
- Impala uses Snappy compression with Parquet by default
- Impala + Parquet = Google Dremel
  - Dremel doesn't support joins
  - Impala doesn't support nested data structures (yet)

Transform a text-file table into Parquet format
> create table t4 like t3 stored as parquet;
> insert overwrite t4 partition (`date`, hour) select * from t3;

Using Parquet in Pig
$ yum install parquet
$ pig
> A = load '/user/hive/warehouse/t4' using parquet.pig.ParquetLoader as (x: chararray, y: int);
> store A into '/user/evans_ye/parquet_out' using parquet.pig.ParquetStorer;

ACL via Sentry

Sentry
- Apache Incubator project
- Provides fine-grained, role-based authorization
- Currently integrates with Hive and Impala
- Requires strong authentication such as Kerberos

Enable Sentry for Impala
- Turn on Sentry authorization for Impala by adding two lines to the impala daemons' configuration file (/etc/default/impala)
- auth-policy.ini: the Sentry policy file
- server1: a symbolic server name used in the policy file; all impala daemons must specify the same server name

Sentry policy file example
- Roles: on server1, spn_user_role has permission to read (SELECT) all tables in the spn database
- Groups: the evans_ye group has the role spn_user_role

Sentry policy file example
- Roles: evans_data has permission to access /user/evans_ye, which lets you add data under /user/evans_ye as partitions
- foo_db_role can do anything in the foo database: create, alter, insert, drop
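The configuration and policy slides above appear as screenshots in the deck. As a hedged reconstruction (the flag names, file locations, and the extra group assignments below are assumptions, not taken from the deck), the two configuration lines and the policy file might look like this:

# /etc/default/impala -- the two authorization flags, appended to the existing daemon arguments
IMPALA_SERVER_ARGS="-server_name=server1 -authorization_policy_file=/user/impala/auth-policy.ini"

# auth-policy.ini
[groups]
# the evans_ye group gets the read-only role; the other assignments are illustrative
evans_ye = spn_user_role, evans_data
analysts = foo_db_role

[roles]
# read (SELECT) on every table in the spn database, on server1
spn_user_role = server=server1->db=spn->table=*->action=select
# access to /user/evans_ye, so data there can be added as partitions
evans_data = server=server1->uri=hdfs://namenode:8020/user/evans_ye
# no action restriction: create, alter, insert, drop in the foo database
foo_db_role = server=server1->db=foo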
Impala 2014 Roadmap
- 1.4 (now available): ORDER BY without LIMIT
- 2.0: nested data types (structs, arrays, maps); disk-based joins and aggregations

Q&A