

Author: 林子系  Source: IT165  Published: 2016-12-14 20:32:52

Apache Sqoop - Overview


Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Users must consider details like ensuring data consistency, the consumption of production system resources, and preparing the data for provisioning downstream pipelines. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within MapReduce applications complicates the applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from external systems onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture which supports plugins that provide connectivity to new external systems.

What happens underneath the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced up into different partitions and a map-only job is launched, with individual mappers responsible for transferring a slice of the dataset. Each record of the data is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types.

In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal of this post is to give an overview of Sqoop operation without going into much detail or advanced functionality.

Importing Data

The following command is used to import all data from a table called ORDERS in a MySQL database into the cluster:
---
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
---

In this command, the various options are as follows:

  • import: This is the sub-command that instructs Sqoop to initiate an import.
  • --connect <connect string>, --username <user name>, --password <password>: These are connection parameters used to connect with the database. They are no different from the connection parameters you would use when connecting to the database via JDBC.
  • --table <table name>: This parameter specifies the table to be imported.

The import is done in two steps, as depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. The second step is a map-only Hadoop job that Sqoop submits to the cluster; it is this job that does the actual data transfer, using the metadata captured in the previous step.

Figure 1: Sqoop Import Overview

The imported data is saved in a directory on HDFS based on the table being imported. As with most aspects of Sqoop operation, the user can specify an alternative directory where the files should be placed.
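
For instance, the standard --target-dir option redirects the import to a directory of your choosing. A minimal sketch, reusing the article's example database (the target path here is illustrative, not from the original post):

----
# --target-dir is a standard Sqoop import option; the path below is illustrative
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --target-dir /user/arvind/orders_import
----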

By default these files contain comma-delimited fields, with new lines separating different records. You can easily override the format in which data is copied over by explicitly specifying the field separator and record terminator characters.
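
For example, the following sketch switches to tab-separated fields using the standard --fields-terminated-by and --lines-terminated-by options:

----
# Override the default comma/newline delimiters
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --fields-terminated-by '\t' --lines-terminated-by '\n'
----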

Sqoop also supports different data formats for importing data. For example, you can easily import data in Avro format by simply specifying the option --as-avrodatafile with the import command.
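
A sketch of such an Avro import, again against the example ORDERS table:

----
# Store the imported records as Avro data files instead of delimited text
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --as-avrodatafile
----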

There are many other options that Sqoop provides which can be used to further tune the import operation to suit your specific requirements.

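One common tuning knob is the degree of parallelism: the standard --num-mappers (or -m) option controls how many slices, and therefore map tasks, are used, while --split-by names the column used to partition the table. A sketch, assuming a hypothetical ORDER_ID primary key column:

----
# ORDER_ID is a hypothetical split column for this example
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --num-mappers 8 --split-by ORDER_ID
----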

Importing Data into Hive

In most cases, importing data into Hive is the same as running the import task and then using Hive to create and load a certain table or partition. Doing this manually requires that you know the correct type mapping between the data and other details like the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table, and also invokes the necessary commands to load the table or partition as the case may be. All of this is done by simply specifying the option --hive-import with the import command.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password **** --hive-import
----

When you run a Hive import, Sqoop converts the data from the native datatypes of the external datastore into the corresponding types in Hive, and automatically chooses the native delimiter set used by Hive. If the data being imported has new line or other Hive delimiter characters in it, Sqoop allows you to remove such characters and get the data correctly populated for consumption in Hive.
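
One such option is the standard --hive-drop-import-delims flag, which strips \n, \r, and \01 characters from string fields during the import. A sketch:

----
# Drop characters that would break Hive's default row format
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password **** --hive-import
  --hive-drop-import-delims
----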

Once the import is complete, you can see and operate on the table just like any other table in Hive.

Importing Data into HBase

You can use Sqoop to populate data in a particular column family within an HBase table. Much like the Hive import, this can be done by specifying additional options that identify the HBase table and column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --hbase-create-table --hbase-table ORDERS --column-family mysql
----

In this command, the various options are as follows:

  • --hbase-create-table: This option instructs Sqoop to create the HBase table.
  • --hbase-table: This option specifies the name of the HBase table to use.
  • --column-family: This option specifies the column family name to use.

The rest of the options are the same as for a regular import operation.

Exporting Data

In some cases, data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing our example from above: if data generated by the pipeline on Hadoop corresponded to the ORDERS table in a database somewhere, you could populate it using the following command:

----
$ sqoop export --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --export-dir /user/arvind/ORDERS
----

In this command, the various options are as follows:

  • export: This is the sub-command that instructs Sqoop to initiate an export.
  • --connect <connect string>, --username <user name>, --password <password>: These are connection parameters used to connect with the database. They are no different from the connection parameters you would use when connecting to the database via JDBC.
  • --table <table name>: This parameter specifies the table to be populated.
  • --export-dir <directory path>: This is the directory from which data will be exported.

The export is done in two steps, as depicted in Figure 2 below. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

Figure 2: Sqoop Export Overview

Some connectors support staging tables that help isolate production tables from possible corruption in case of job failures. Staging tables are first populated by the map tasks and then merged into the target table once all of the data has been delivered.
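
With connectors that support it, staging is requested through the standard --staging-table option (optionally with --clear-staging-table); the staging table must already exist and match the target table's schema. A sketch, assuming a hypothetical ORDERS_STAGE table:

----
# ORDERS_STAGE is a hypothetical, pre-created staging table
$ sqoop export --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --export-dir /user/arvind/ORDERS
  --staging-table ORDERS_STAGE --clear-staging-table
----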

Sqoop Connectors

Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or that do not support native JDBC. Connectors are plugin components based on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.

By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also includes fast-path connectors for MySQL and PostgreSQL. Fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.
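
For MySQL and PostgreSQL, the fast path is selected with the standard --direct option, which delegates the transfer to database-specific bulk tools (for example, mysqldump in the MySQL case). A sketch:

----
# Use the fast-path (direct mode) connector instead of plain JDBC
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
  --direct
----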

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop, ranging from specialized connectors for enterprise data warehouse systems to NoSQL datastores.

Wrapping Up

In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries instead of tables. We encourage you to try out Sqoop and give us your feedback.
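
As one example of those features, a free-form query import uses the standard --query option, whose SQL must contain the $CONDITIONS placeholder that Sqoop substitutes when splitting the work; --query also requires an explicit --target-dir, plus either --split-by or a single mapper, and adding --compress enables compression of the output. The split column and target path below are hypothetical:

----
# ORDER_ID and the target path are hypothetical for this sketch
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --username test --password ****
  --query 'SELECT * FROM ORDERS WHERE $CONDITIONS'
  --split-by ORDER_ID --target-dir /user/arvind/orders_query
  --compress
----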
     

More information regarding Sqoop can be found at:

Project Website: http://incubator.apache.org/sqoop

Wiki: http://cwiki.apache.org/confluence/display/SQOOP

Project Status: http://incubator.apache.org/projects/sqoop.html

Mailing Lists: http://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists
