Nutch+Hbase

Source: collected by IT165  Published: 2016-11-17 20:01:02

This article covers two topics: setting up ant and ivy, and setting up Nutch + HBase.

1. Setting up ant and ivy

1-1) Download ant from http://ant.apache.org/bindownload.cgi

1-2) Configure the environment variables: edit /etc/profile on Linux and add the following:

 

export ANT_HOME=/usr/ant
export PATH=$ANT_HOME/bin:$PATH
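
After reloading the profile, a quick check along the following lines can confirm that the variables took effect. This is only a sketch: it assumes ant was unpacked to /usr/ant as in the exports above, and `check_ant_env` is a hypothetical helper name, not part of ant.

```shell
#!/bin/sh
# Hypothetical helper: report whether $ANT_HOME/bin appears on PATH.
check_ant_env() {
  case ":$PATH:" in
    *":$ANT_HOME/bin:"*) echo "ok: $ANT_HOME/bin is on PATH" ;;
    *) echo "missing: $ANT_HOME/bin is not on PATH"; return 1 ;;
  esac
}

# Same exports as in /etc/profile (assumes ant lives in /usr/ant).
export ANT_HOME=/usr/ant
export PATH=$ANT_HOME/bin:$PATH

check_ant_env
```

On a real installation, running `source /etc/profile` followed by `ant -version` is the more direct test, since it also proves the ant launcher itself works.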

1-3) Download the ivy build.xml from http://ant.apache.org/ivy/history/latest-milestone/samples/build.xml

1-4) Run the ant command in the directory where build.xml was downloaded. On success, an ivy folder is created under the ant installation path; copy ivy.jar from that ivy folder into the ANT_HOME/lib directory.

2. Setting up Nutch + HBase

2-1) Download Nutch from http://nutch.apache.org/downloads.html and choose the appropriate version; this article uses apache-nutch-2.3.1-src.tar.gz.

2-2) Edit conf/nutch-site.xml as follows:

 

<configuration>
    <property>
        <name>http.agent.name</name>
        <value>hbase_nutch</value>
    </property>
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
</configuration>

2-3) conf/regex-urlfilter.txt holds the URL rules that decide which site URLs are crawled; readers can customize it to their own needs.
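
As an illustration of how such rules read, the sketch below writes a small rule set to a scratch file and checks two URLs against it. Each rule is a `+` (accept) or `-` (reject) followed by a regular expression, and the first matching rule decides. The csdn.net pattern and the `url_allowed` helper are assumptions made for this example; Nutch evaluates the rules with Java regular expressions, so the `grep -E` emulation here is only an approximation.

```shell
#!/bin/sh
# Write a sample rule set to a scratch file (not the real conf/ directory).
rules=$(mktemp)
cat > "$rules" <<'EOF'
# skip non-HTTP schemes
-^(file|ftp|mailto):
# accept csdn.net and its subdomains (example domain, matching the seed URL)
+^https?://([a-z0-9-]+\.)*csdn\.net/
# reject everything else
-.
EOF

# Hypothetical helper: rough first-match-wins evaluation using grep -E.
url_allowed() {
  while IFS= read -r rule; do
    case "$rule" in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    sign=${rule%"${rule#?}"}        # first character: + or -
    pattern=${rule#?}               # remainder: the regular expression
    if printf '%s\n' "$1" | grep -Eq -- "$pattern"; then
      [ "$sign" = "+" ] && return 0
      return 1
    fi
  done < "$rules"
  return 1                          # no rule matched: the URL is ignored
}

url_allowed "http://www.csdn.net/" && echo "accepted"
url_allowed "http://example.com/" || echo "rejected"
```

Ending the file with `-.` keeps the crawl inside the listed domain; ending it with `+.` instead would accept every URL that no earlier rule rejected.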

2-4) Edit ivy/ivy.xml, which mainly sets the versions of the dependencies. The changes are as follows:

Add HBase support. Note that because of version compatibility issues, version 0.98.13 is used here; the author tested with HBase 1.2 and ran into errors.

 

    <dependency org="org.apache.hbase" name="hbase-client" rev="0.98.13-hadoop2" conf="*->default"/>
    <dependency org="org.apache.hbase" name="hbase-common" rev="0.98.13-hadoop2" conf="*->default"/>
    <dependency org="org.apache.hbase" name="hbase-protocol" rev="0.98.13-hadoop2" conf="*->default"/>
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />

Readers can remove or change the other jar dependencies as needed.

 

2-5) Copy the HBase cluster configuration file: cp $HBASE_HOME/conf/hbase-site.xml $NUTCH_HOME/conf/

2-6) Edit conf/gora.properties and add the following setting:

 

     gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

2-7) Configure the crawl URLs: create a urls directory under conf to hold them, then initialize the seed file seed.txt with the following content:

 

 

     http://www.csdn.net/
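
The directory and seed file can be created by hand or with a couple of commands. The sketch below assumes `NUTCH_HOME` points at the unpacked Nutch source tree and falls back to the current directory when it is unset; that fallback is a choice made for this example.

```shell
#!/bin/sh
# NUTCH_HOME is assumed to be the Nutch root; default to the current directory.
NUTCH_HOME=${NUTCH_HOME:-$PWD}

# Create conf/urls and write the single seed URL used in this article.
mkdir -p "$NUTCH_HOME/conf/urls"
printf '%s\n' 'http://www.csdn.net/' > "$NUTCH_HOME/conf/urls/seed.txt"

cat "$NUTCH_HOME/conf/urls/seed.txt"
```

Additional seeds go on separate lines of the same file, one URL per line.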
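
2-8) Compile and build the crawler: run ant runtime in the Nutch root directory. The result is as follows:

The first run takes a fairly long time because the jar packages and other dependencies have to be downloaded.

2-9) Crawl the content: run the following command under runtime/local/bin: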

 

    ./crawl /usr/apache-nutch-2.3.1/conf/urls/ numberOfRounds 10
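
The arguments of the crawl command are as follows:
    Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
    <seedDir>: directory containing the seed files
    <crawlID>: ID of the crawl job
    <solrUrl>: Solr address used for indexing and search (optional)
    <numberOfRounds>: number of iterations, i.e. the crawl depth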
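
Mapping the command from step 2-9 onto this usage line can avoid a common confusion: there, the word numberOfRounds is passed literally as the crawl ID, and 10 is the actual number of rounds. The sketch below only spells that mapping out; the variable names are illustrative, and the table-name line reflects the numberOfRounds_webpage table this article reports in HBase, which suggests the crawl ID is used as a prefix.

```shell
#!/bin/sh
# Values taken from the command in step 2-9; variable names are illustrative.
seed_dir=/usr/apache-nutch-2.3.1/conf/urls/
crawl_id=numberOfRounds   # a literal string used as the crawl ID, not a count
rounds=10                 # number of crawl iterations

# No Solr URL is passed, so the optional third argument is simply omitted.
crawl_cmd="./crawl $seed_dir $crawl_id $rounds"

# Table name as observed in this article: the crawl ID followed by _webpage.
table_name="${crawl_id}_webpage"

echo "$crawl_cmd"
echo "$table_name"
```

Choosing a more descriptive crawl ID, for example one named after the target site, would presumably yield a correspondingly named table and make the HBase listing easier to read.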

 

2-10) Open the HBase monitoring page at http://lenovo1:16010/master-status to find the table name, which here is numberOfRounds_webpage, then read the table from Spark driver code as follows:

 

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HBaseAdmin, HTablePool, ResultScanner, Scan}
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    object ReadNutchTable {
      def main(cmdArgs: Array[String]): Unit = {
        // please ensure HBASE_CONF_DIR is on classpath of spark driver
        // e.g: set it through spark.driver.extraClassPath property
        // in spark-defaults.conf or through --driver-class-path
        // command line option of spark-submit
        val conf = HBaseConfiguration.create()

        val args = Array[String]("numberOfRounds_webpage")
        // Other options for configuring scan behavior are available. More information available at
        // http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
        conf.set(TableInputFormat.INPUT_TABLE, args(0))

        // Stop early if the table created by the crawl is not available
        val admin = new HBaseAdmin(conf)
        if (!admin.isTableAvailable(args(0))) {
          println("Table does not exist")
          admin.close()
          return
        }
        // HTablePool is the HBase 0.98 client API used in this article
        val pool = new HTablePool(conf, 1000)
        val table = pool.getTable(args(0))

        try {
          // Full table scan: print every row key and each (family,qualifier):value
          val rs: ResultScanner = table.getScanner(new Scan())
          var r = rs.next()
          while (r != null) {
            println("rowkey: " + Bytes.toString(r.getRow))
            for (keyValue <- r.raw()) {
              println("(" + Bytes.toString(keyValue.getFamily) + "," + Bytes.toString(keyValue.getQualifier) + "):" + Bytes.toString(keyValue.getValue))
            }
            r = rs.next()
          }
          rs.close()
        } catch {
          case e: Exception => e.printStackTrace()
        } finally {
          // Release the table handle and the admin connection
          table.close()
          admin.close()
        }
      }
    }

The output is as follows:

 

This completes a simple example; readers can add more complex business logic on top of it.
