Saturday, July 22, 2017

How to make Apache Nutch index its results into ElasticSearch over the REST API

At work, we use Apache Nutch to fetch and parse data from various websites. There is a lot of it, so we need to index it in ElasticSearch. In my initial prototypes, I discovered that the stock setup doesn't work with AWS ElasticSearch because AWS doesn't expose the native transport protocol; we can only talk to it via its RESTful API. In the official Nutch releases (v2.3.1 and v1.13, as of July 22, 2017), the plugin for ES doesn't support the RESTful API. When I checked the source code, though, I was surprised to find another plugin that does (indexer-elastic-rest). Unfortunately, there's zero documentation on it, so I hope these notes help others.

Note: the URL regex config and HTTP options need to be set too, the IP addresses need to be whitelisted, and so on. Just leave a comment in case I missed something.

OS used for this experiment: Ubuntu 16.04.2 LTS

Install Java

add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer

Install ant

apt-get install ant
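
Before moving on, it may be worth confirming that the tools are actually on the PATH. This is only a minimal sketch: need_cmd is a hypothetical helper name, and git is assumed to be needed for the clone step further down.

```shell
# need_cmd is a hypothetical helper (not part of Nutch or Ubuntu):
# it succeeds only if the given command is found on the PATH.
need_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}
```

Something like `need_cmd java && need_cmd ant && need_cmd git` should then pass, and `java -version` and `ant -version` show which versions got installed.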

Increase the ulimit settings

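(TODO: these settings don't survive a reboot or a new login session; to make them permanent, edit /etc/security/limits.conf rather than /etc/sysctl.conf, which holds kernel parameters and not per-user limits)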
ulimit  -f unlimited \
        -t unlimited \
        -v unlimited \
        -n 64000 \
        -m unlimited \
        -u 64000
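
To keep these limits across logins, one option on Ubuntu is a pam_limits drop-in file. The sketch below only prints the entries. The helper name print_limits, the user name nutch, and the file name in the usage note are all assumptions, so adjust them to the account that actually runs the crawl.

```shell
# print_limits is a hypothetical helper: it prints pam_limits entries
# that mirror the ulimit flags above (-f fsize, -t cpu, -v as,
# -m rss, -n nofile, -u nproc) for the given user.
# The "-" type sets both the soft and the hard limit.
print_limits() {
  user="$1"
  for item in fsize cpu as rss; do
    printf '%s - %s unlimited\n' "$user" "$item"
  done
  for item in nofile nproc; do
    printf '%s - %s 64000\n' "$user" "$item"
  done
}
```

For example, `print_limits nutch | sudo tee /etc/security/limits.d/nutch.conf` would write the file; the values should apply from the next login session, which `ulimit -a` can confirm.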

Install Apache Nutch

Download the source code

git clone https://github.com/apache/nutch.git

Build

cd nutch
ant clean runtime
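
Since the whole point is the REST indexer, it may be worth checking that the build actually produced it. A minimal sketch, assuming the build places plugins under runtime/local/plugins inside the checkout; has_plugin is a hypothetical helper name.

```shell
# has_plugin is a hypothetical helper: it checks that a plugin directory
# exists under <nutch checkout>/runtime/local/plugins.
has_plugin() {
  nutch_dir="$1"
  plugin="$2"
  if [ -d "$nutch_dir/runtime/local/plugins/$plugin" ]; then
    echo "found: $plugin"
  else
    echo "not built: $plugin" >&2
    return 1
  fi
}
```

Run from the root of the checkout, `has_plugin . indexer-elastic-rest` should report the plugin as found.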

Put this inside runtime/local/conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Spiderman</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|query-(basic|site|url|lang)|indexer-elastic|indexer-elastic-rest|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|tika|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|index-(basic|anchor|more|metadata)</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>elastic.rest.host</name>
    <value>aws-elastic-search-endpoint.example.org</value>
  </property>

  <property>
    <name>elastic.rest.port</name>
    <value>443</value>
  </property>

  <property>
    <name>elastic.rest.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>elastic.rest.type</name>
    <value>doc</value>
  </property>

  <property>
    <name>elastic.rest.https</name>
    <value>true</value>
  </property>

  <property>
    <name>elastic.rest.trustallhostnames</name>
    <value>false</value>
  </property>

</configuration>
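
Before kicking off an indexing run, it may help to confirm that the endpoint answers over REST from this machine. The sketch below assumes curl is installed and reuses the placeholder host, port, and index from the config above; es_url and es_get are hypothetical helper names.

```shell
# Values mirror the elastic.rest.* properties in nutch-site.xml above;
# the host is the same placeholder, not a real endpoint.
ES_HOST="${ES_HOST:-aws-elastic-search-endpoint.example.org}"
ES_PORT="${ES_PORT:-443}"
ES_INDEX="${ES_INDEX:-nutch}"

# es_url is a hypothetical helper that builds an HTTPS URL for a path.
es_url() {
  printf 'https://%s:%s/%s\n' "$ES_HOST" "$ES_PORT" "$1"
}

# es_get is a hypothetical helper: GET a path and fail on HTTP errors
# (for example a 403 when the access policy rejects the request).
es_get() {
  curl -sS --fail "$(es_url "$1")"
}
```

With a real endpoint, `es_get ""` should return the cluster info, and after an indexing run `es_get "$ES_INDEX/_count"` should report how many documents landed in the index. An HTTP 403 here usually points at the access policy rather than at Nutch.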

Make sure to whitelist your IP address in the AWS ElasticSearch access policy first. Yikes, it's 7:11 AM now... need to sleep. Hahaha... :p
