Saturday, July 22, 2017

How to have Apache Nutch index crawl results into ElasticSearch over the REST API.

At work, we used Apache Nutch to fetch and parse data from various websites. The data set is quite large, so we needed to index it in ElasticSearch. In my initial prototypes, I discovered that we couldn't use AWS ElasticSearch directly: AWS doesn't expose the native transport protocol, so we can only talk to it via its RESTful API. In the official Nutch releases (v2.3.1 and v1.13, as of July 22, 2017), the ES plugin doesn't support the RESTful API. When I checked the source code, though, I was pleasantly surprised to find that there's another plugin that does. Unfortunately, there's zero documentation on it, so I hope these notes help others. Note: the URL regex config and HTTP options need to be set too, the IP addresses need to be whitelisted, etc. Just leave a comment in case I missed something.

OS used for this experiment: Ubuntu 16.04.2 LTS

Install Java

add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer

Install ant

apt-get install ant

Increase the ulimit settings

(TODO: make these settings survive a reboot; ulimits are normally persisted via /etc/security/limits.conf, not /etc/sysctl.conf)
ulimit  -f unlimited \
        -t unlimited \
        -v unlimited \
        -n 64000 \
        -m unlimited \
        -u 64000
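These limits only apply to the current shell. To make them survive a reboot, the usual route on Ubuntu is pam_limits via /etc/security/limits.conf; a sketch of the equivalent entries, with the syntax and values inferred from the ulimit flags above (-n for open files, -u for processes):

```
*    soft    nofile    64000
*    hard    nofile    64000
*    soft    nproc     64000
*    hard    nproc     64000
```

The unlimited file-size/CPU/memory limits are typically already the defaults on stock Ubuntu, so only the file-descriptor and process counts need entries.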

Install Apache Nutch

Download the source code

git clone https://github.com/apache/nutch.git

Build

ant clean runtime

Put this inside runtime/local/conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Spiderman</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|indexer-elastic-rest|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>elastic.rest.host</name>
    <value>aws-elastic-search-endpoint.example.org</value>
  </property>

  <property>
    <name>elastic.rest.port</name>
    <value>443</value>
  </property>

  <property>
    <name>elastic.rest.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>elastic.rest.type</name>
    <value>doc</value>
  </property>

  <property>
    <name>elastic.rest.https</name>
    <value>true</value>
  </property>

  <property>
    <name>elastic.rest.trustallhostnames</name>
    <value>false</value>
  </property>

</configuration>

Make sure to whitelist your IP address first in AWS ElasticSearch’s policy settings. Yikes it's 7:11AM now... need to sleep. Hahaha... :p
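For context, a crawl-and-index round with this build looks roughly like the following. The nutch invocations are commented out since they assume the runtime built above (run from runtime/local/); the seed URL and paths are only examples:

```shell
# prepare a seed list (the URL is just an example)
mkdir -p urls
printf 'https://example.org/\n' > urls/seed.txt
cat urls/seed.txt

# then, the usual 1.x cycle (requires the built runtime):
# bin/nutch inject crawl/crawldb urls
# bin/nutch generate crawl/crawldb crawl/segments
# bin/nutch fetch crawl/segments/<segment>
# bin/nutch parse crawl/segments/<segment>
# bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
# bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/<segment>
```

With indexer-elastic-rest in plugin.includes, the index step should push the parsed documents to the REST endpoint configured above.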

Thursday, July 13, 2017

Setting up Apache Nutch v2.3.1 on Ubuntu with MongoDB and ElasticSearch

These are my notes for installing Apache Nutch v2.3.1 on Ubuntu at work. We probably won't use Nutch 2.3 because AWS ElasticSearch doesn't support the native transport protocol (it only supports the REST API, and it exposes it in a non-standard manner on port 80), so we'd have a hard time indexing the data unless we maintain the ElasticSearch servers ourselves. We could probably use AWS CloudSearch instead, but Nutch 2.3 doesn't have a plugin for it, unlike Nutch 1.13.

Anyway, there are lots of similar HOWTOs out there, but this one points to specific versions and URLs so you can just copy and paste these to get your system up and running in a few minutes.

OS used for this experiment: Ubuntu 16.04.2 LTS

Make sure that Ubuntu is up to date

apt-get update
apt-get dist-upgrade
reboot #if necessary

Updating to the latest version is needed to fix the glibc bug that affects MongoDB. It’s probably not required anymore, but it’s good to be safe.

Install Java

add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer

Install ant

apt-get install ant

Increase the ulimit settings

(TODO: make these settings survive a reboot, e.g., via /etc/security/limits.conf)
ulimit  -f unlimited \
        -t unlimited \
        -v unlimited \
        -n 64000 \
        -m unlimited \
        -u 64000

Install MongoDB

apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6

echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list

apt-get update
apt-get install mongodb-org

service mongod start

echo "use nutch" | mongo

Install Apache Nutch

Download and extract

wget http://apache.cs.utah.edu/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
tar xvf apache-nutch-2.3.1-src.tar.gz

Edit the configs

Uncomment this line in apache-nutch-2.3.1/ivy.xml:
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />

Uncomment these lines in apache-nutch-2.3.1/runtime/local/conf/gora.properties:
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch

Build Apache Nutch

ant runtime #run this from the top-level source directory

Install ElasticSearch

Setup ElasticSearch on AWS
Create a new AWS ElasticSearch domain and whitelist the server(s) where Apache Nutch is installed. AWS ElasticSearch doesn’t support the native transport protocol (port 9300). We may be able to use AWS if Nutch supports the HTTP REST protocol. Needs more research.

Manual Installation
# It's possible to use the latest version, ES5, but you'd need to update the indexer.

wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list
apt-get update && sudo apt-get install elasticsearch

Edit /etc/elasticsearch/elasticsearch.yml and put this:
network.host: 127.0.0.1
cluster.name: nutch
node.name: nutch1

Finish up
update-rc.d elasticsearch defaults
/etc/init.d/elasticsearch restart

Install Kibana (optional; I didn't install this since curl works fine)

echo "deb http://packages.elastic.co/kibana/4.4/debian stable main" | sudo tee -a /etc/apt/sources.list.d/kibana-4.4.x.list

apt-get update && sudo apt-get install kibana

Put this inside apache-nutch-2.3.1/runtime/local/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Spiderman</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>127.0.0.1</value>
  </property>

  <property>
    <name>elastic.port</name>
    <value>9300</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>nutch</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>

  <property>
    <name>elastic.max.bulk.docs</name>
    <value>2000</value>
  </property>

  <property>
    <name>elastic.max.bulk.size</name>
    <value>2500500</value>
  </property>

</configuration>
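Once everything above is built and configured, a crawl can be kicked off with the 2.x crawl script. A sketch follows; the seed URL and crawl ID are my own choices, and the crawl command is commented out since it needs the built runtime:

```shell
# prepare a seed list (the URL is just an example)
mkdir -p seed
printf 'https://example.org/\n' > seed/urls.txt
cat seed/urls.txt

# from runtime/local/, the 2.x crawl script takes a seed directory,
# a crawl ID, and the number of rounds:
# bin/crawl seed/ TestCrawl 2
```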

Wednesday, July 05, 2017

Automatically scaling and deploying Server Density agent on AWS Elastic Beanstalk

We're using AWS Elastic Beanstalk in one of our projects at work. EB is an abstraction on top of Elastic Load Balancing and EC2 (which is itself yet another abstraction on top of Xen, but I digress hehe...), and I really appreciate how it simplifies a lot of routine sysadmin tasks. EB allows us to scale the website up and down depending on the traffic.

One problem that we encountered during development was a memory leak in PHP. To prevent another downtime (or at least help predict when the servers are about to go offline), I decided to use Server Density to implement an early warning system that gives us a heads-up when a server's memory is about to run out. Server Density is a pretty great service, and we've been using it for quite some time now. It works by letting you install an agent on your servers; the agent continuously pushes metrics to their service, which collects them and displays them in nice graphs. If a metric crosses the threshold in one of your alert conditions (e.g., CPU load >= 90% or free disk space <= 5GB), Server Density sends you a notification via e-mail, SMS, or Slack.

The Server Density agent can be installed during each EB deployment by executing this inside your deployment script at .ebextensions/:

curl https://archive.serverdensity.com/agent-install.sh | bash -s -- -a ACCOUNTNAME -t PUTYOURKEYHERE -g GroupNameForYourEB -p amazon -i

Determining the amount of free memory in Linux is somewhat tricky. I can't use the memory usage metric to trigger a notification because Linux always reports that nearly all of the memory is in use, even though in reality much of it is just being used as cache. To get a more reliable metric, I used swap space instead. The theory is that the instances are supposed to have enough RAM for their tasks; when memory runs out, Linux falls back to swap as a last resort. Thus, we can trigger a notification when swap usage is >= 1MB.
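For reference, the number being alerted on can be derived by hand; a sketch of computing swap usage on Linux (the field names come from /proc/meminfo):

```shell
# swap used, in kB: SwapTotal minus SwapFree from /proc/meminfo
awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t - f}' /proc/meminfo
```

A value of 1024 or more corresponds to the >= 1MB threshold above.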

Next problem: since EB regularly deploys and terminates the instances, Server Density ends up monitoring servers that no longer exist. We needed a way to automatically stop Server Density from monitoring instances that have been terminated.

Solution: I made a CloudWatch rule that triggers whenever instances are stopped or terminated. The events are then pushed to a Lambda function which calls Server Density's API to remove the monitoring.

Here's the architecture that I came up with:
I think CloudWatch has a way to monitor swap space, but the last time I checked, AWS SNS (a separate AWS service that sends notifications) couldn't send SMS messages to Philippine numbers, so it can't wake me up (not joking, unfortunately haha) whenever there are server problems.

Update: It turns out that Linux's default swappiness value is 60, which means it will use swap ahead of time even though around half (or 80%? the docs have conflicting calculations) of the RAM is still available. To avoid this situation, set the swappiness to 1. You can even set it to 0 if you want:
sysctl vm.swappiness=1
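That sysctl call lasts only until the next reboot. A sketch of checking the current value and persisting the change; the /etc/sysctl.d file name is my own choice:

```shell
# current value
cat /proc/sys/vm/swappiness

# persist across reboots (file name assumed; any *.conf under /etc/sysctl.d/ works):
# echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-swappiness.conf
# sudo sysctl --system
```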

Thursday, April 27, 2017

How to run AWS Lambda on VPC

I'm trying to run an AWS Lambda function inside a VPC because I need it to access ElastiCache. Problem: if you put a Lambda function inside a VPC, it loses Internet access. There is some documentation online, but it's too complicated and involves unnecessary steps. Here is the simplest way to run AWS Lambda inside a VPC.

1. Create a simple NodeJS function that connects to an external site via HTTP:
'use strict';
const http = require('http');

exports.handler = (event, context, callback) => {
  http.get('http://www.google.com', res => {
    callback(null, `Success, with: ${res.statusCode}`);
  }).on('error', err => callback(`Error with: ${err.message}`));
};

2. Run the above function without a VPC to verify that it's working correctly (i.e., it returns an HTTP 200).

3. In the AWS Console, go to the VPC page and click "Elastic IPs". Then, click the "Allocate new address" button and select the "VPC" scope.


4. Next, go to the VPC Dashboard and click the "Start VPC Wizard" button.


5. Select the "VPC with Public and Private Subnets" option.


6. On the next page, enter your "VPC name", and in the "Elastic IP Allocation ID" field, enter the Elastic IP that you created in Step 3. Click the "Create VPC" button.


7. Finally, go back to the Lambda page and configure your function. Click the "Configuration" tab and go to the "Advanced settings" section. Select the VPC that you created in Step 6, and select the private subnet that you created. This is important; otherwise, outgoing Internet connections won't work.


8. Click the "Save and test" button to test your setup. That's it! For a proper setup, use at least two subnets in different availability zones to run your function in high-availability mode.

Some SEO keywords to help other people: aws lambda run vpc nat gateway howto tutorial

Thursday, March 23, 2017

How to run arbitrary binary executables on AWS Lambda 6.10+

After I posted yesterday's hack, AWS coincidentally released the Node.js 6.10 platform. The old method from the 0.10 platform now works again:

1. Create a bin/ directory inside your project directory.
2. Statically compile your binary as much as possible. Put it inside your bin/ directory. Put any shared libraries that it needs inside the bin/ directory too.
3. chmod +x bin/hello-world-binary
4. Do something like:

'use strict';

const exec = require('child_process').exec;

process.env['PATH'] = process.env['PATH'] + ':' + process.env['LAMBDA_TASK_ROOT'] + '/bin';
process.env['LD_LIBRARY_PATH'] = process.env['LAMBDA_TASK_ROOT'] + '/bin';

exports.handler = (event, context, callback) => {
    const cmd = 'hello-world-binary';
    const child = exec(cmd, (error) => {
        callback(error, 'Process complete!');
    });

    child.stdout.on('data', console.log);
    child.stderr.on('data', console.error);
};

5. ZIP your project directory and upload to AWS Lambda. Enjoy!

Wednesday, March 22, 2017

How to run arbitrary binary executables on AWS Lambda 4.3+

I've been enjoying AWS Lambda at work lately. I appreciate how we can create a serverless architecture (or a nearly serverless one, anyway) on the AWS platform. Yeah, it's fun to set up Linux and/or FreeBSD servers myself, but oftentimes this "cloud" thingy makes perfect sense in many scenarios.

Anyway, AWS Lambda has various constraints, such as not being able to install packages yourself. These constraints, according to their documentation, allow them to "perform operational and administrative activities on [our] behalf, including provisioning capacity, monitoring fleet health, applying security patches, deploying [our] code, and monitoring and logging [our] Lambda functions."

OK, fine; but how do you deal with code that depends on an external binary? In previous versions of the Lambda platform (Node.js v0.10), you could do this by statically compiling the binary, putting it inside a bin/ folder in your project, and doing something like:

const exec = require('child_process').exec;

// make the bundled binaries and shared libraries visible to the child process
process.env['PATH'] = process.env['PATH'] + ':' + process.env['LAMBDA_TASK_ROOT'] + '/bin';
process.env['LD_LIBRARY_PATH'] = process.env['LAMBDA_TASK_ROOT'] + '/bin';

exec('hello-world-binary', function () {});

With the recent update, however, the above code no longer works. On the AWS Lambda platform that uses Node.js v4.3+, we need to copy the binaries to the /tmp folder, chmod +x them, and run them from there.
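A sketch of that workaround in shell terms; a tiny stand-in script plays the role of the bundled binary here (on Lambda the source path would be $LAMBDA_TASK_ROOT/bin):

```shell
# stand-in for the statically compiled binary shipped in bin/
mkdir -p bin
printf '#!/bin/sh\necho from-tmp\n' > bin/hello-world-binary

# the package directory is read-only on Lambda, but /tmp is writable:
cp bin/hello-world-binary /tmp/hello-world-binary
chmod +x /tmp/hello-world-binary
/tmp/hello-world-binary
```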

I couldn't find a cleaner solution, so I just wrapped these in a Node.js package:

https://www.npmjs.com/package/lambda_exec

Enjoy!