Computer Science and Technology: Solr 4.2 on EC2 (Part 1)

Saturday, 13 April 2013


The Solr distribution comes with a couple of sample applications. I will focus on two of them: one is in the example directory (solr-4.2.0/example) and the other is in the example-DIH directory (solr-4.2.0/example/example-DIH). The Data Import Handler (DIH) is used to index database contents.

Deploying both the Solr examples with Jetty (source)

Download and untar Solr :

cd /opt
wget http://apache.mirror.vexxhost.com/lucene/solr/4.2.0/solr-4.2.0.tgz
tar -xvf solr-4.2.0.tgz

Change security groups :

open port 8983 on your EC2 security group
you can access this from your online EC2 console
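
If you prefer the command line to the console, the same rule can be added with the AWS CLI. This is only a sketch, assuming the AWS CLI is installed and configured on your machine; the security group ID and the CIDR range below are placeholders, not values from my setup. The helper just prints the command so you can review it before running it yourself.

```shell
# Print the AWS CLI command that opens one TCP port on a security group.
# Arguments: <security-group-id> <port> <cidr>
open_port_cmd() {
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --cidr %s\n' \
    "$1" "$2" "$3"
}

# Example with placeholder values (both the group ID and the CIDR are hypothetical).
open_port_cmd sg-0123456789abcdef0 8983 203.0.113.0/24
```

Restricting the CIDR to your own address range is safer than opening the port to everyone.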

Start the Solr sample example :

cd /opt/solr-4.2.0/example
java -jar start.jar

If you want to run the DIH Solr example instead :

java -Dsolr.solr.home="/opt/solr-4.2.0/example/example-DIH/solr/" -jar start.jar

You should be able to see the Solr manager at :

foo.com:8983/solr/admin/ or (your-ec2-ip):8983/solr/admin/

Note that your Solr home is /opt/solr-4.2.0/example/solr/ for the example and /opt/solr-4.2.0/example/example-DIH/solr/ for the DIH example.
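
To confirm from the instance itself that Solr is answering, you can query the CoreAdmin status handler. This is a rough sketch, assuming curl is installed and that the stock solr.xml (which exposes CoreAdmin at /admin/cores) is in use; the function names are my own.

```shell
# Build the CoreAdmin STATUS URL for a Solr host and port.
# Arguments: <host> <port>
solr_status_url() {
  printf 'http://%s:%s/solr/admin/cores?action=STATUS&wt=json\n' "$1" "$2"
}

# Succeed if Solr answers the status request within 5 seconds.
solr_is_up() {
  curl -sf --max-time 5 "$(solr_status_url "$1" "$2")" >/dev/null
}

# Usage on the instance:
#   solr_is_up localhost 8983 && echo "Solr is up"
```

If this works locally but the browser cannot reach the page, the security group is the first thing to check.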

Deploying both the Solr examples with Tomcat 7

Set up Tomcat 7 on your EC2 instance :

Check out my blog post on this (here)

Change server.xml :

sudo vim /opt/apache-tomcat-7.0.34/conf/server.xml
add the following

<Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000" 
     redirectPort="8443" URIEncoding="UTF-8" />

Create solr.xml :

sudo vim /opt/apache-tomcat-7.0.34/conf/Catalina/localhost/solr.xml
add the following

<?xml version="1.0" encoding="utf-8"?>
<Context path="/solr" docBase="/opt/apache-tomcat-7.0.34/webapps/solr.war" debug="0" crossContext="true">
        <Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/solr" override="true"/>
</Context>

The next thing is the Solr deployment. Copy the /opt/solr-4.2.0/dist/solr-4.2.0.war file, which contains the files and libraries needed to run Solr, to the Tomcat webapps directory and rename it solr.war.
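
The copy-and-rename step can be sketched as a small helper. The function name is my own, and depending on who owns the Tomcat directory the cp may need sudo.

```shell
# Copy the Solr WAR into a Tomcat webapps directory under the name solr.war.
# Arguments: <path-to-solr-war> <tomcat-webapps-dir>
deploy_solr_war() {
  if [ ! -f "$1" ]; then
    echo "WAR not found: $1" >&2
    return 1
  fi
  cp "$1" "$2/solr.war"
}

# Usage with the paths from this post:
#   deploy_solr_war /opt/solr-4.2.0/dist/solr-4.2.0.war /opt/apache-tomcat-7.0.34/webapps
```
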
Start the Solr sample example :

sudo service tomcat restart

If you want to run the Data Import Handler (DIH) Solr example instead :

in the solr.xml context file created above (conf/Catalina/localhost/solr.xml), replace

<Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/solr" override="true"/>

with

<Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/example-DIH/solr" override="true"/>

You should be able to see the Solr manager at :

foo.com:8983/solr/admin/ or (your-ec2-ip):8983/solr/admin/

Note that your Solr home is /opt/solr-4.2.0/example/solr/ for the example and /opt/solr-4.2.0/example/example-DIH/solr/ for the DIH example.

Deploying a custom DIH application (backed by PostgreSQL) with Tomcat 7

A bit of background - I'm going to assume a simple database table whose contents we would like to index. My Solr home here is /opt/solr-4.2.0/example/solr/.

Let's assume a simple table (if you want to set up PostgreSQL on your EC2 instance, refer to my blog post on this here) :

create table users(
    uid serial primary key,
    firstname varchar(255) not null
);

Start off by copying the db directory which the Solr DIH example uses to the Solr example directory (I would rather run a Solr cluster on the example, as opposed to example-DIH) :

cd /opt/solr-4.2.0/example/solr
cp -r /opt/solr-4.2.0/example/example-DIH/solr/db .

Modify solr.xml :

sudo vim /opt/solr-4.2.0/example/solr/solr.xml
Add <core name="db" instanceDir="db" /> inside the <cores> element

Modify solrconfig.xml :

sudo vim /opt/solr-4.2.0/example/solr/db/conf/solrconfig.xml
change <lib dir="../../../../dist/" regex="solr-dataimporthandler-.*\.jar" /> to <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

Modify db-data-config.xml :

sudo vim /opt/solr-4.2.0/example/solr/db/conf/db-data-config.xml
remove everything and add your own DB configuration e.g.,

<dataConfig>
  <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://foo.com:5432/fooDB" user="power_user" password="pass" />
  <document>
    <entity name="user" query="SELECT uid,firstname from users">
      <field column="uid" name="id" />
      <field column="firstname" name="name" />
    </entity>
  </document>
</dataConfig>

The /opt/solr-4.2.0/example/solr/db/conf/schema.xml file doesn't need to be changed for my example, but you will likely have to change it if you're going to index other stuff in your DB.
You will also need to include your JDBC driver :

cd /opt/solr-4.2.0/example/solr/db/lib
copy your postgresql-9.1-902.jdbc4.jar into that folder

Restart Tomcat :

sudo service tomcat restart
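
Once Tomcat is back up, an import can be started through the DIH request handler. The sketch below assumes curl is installed and that the db core still registers the handler at /dataimport, as the stock example solrconfig.xml does; the helper name is my own.

```shell
# Build a DataImportHandler URL for a core and a command (full-import, status, ...).
# Arguments: <host:port> <core> <command>
dih_url() {
  printf 'http://%s/solr/%s/dataimport?command=%s\n' "$1" "$2" "$3"
}

# Usage on the instance:
#   curl "$(dih_url localhost:8983 db full-import)"   # start indexing the users table
#   curl "$(dih_url localhost:8983 db status)"        # check how the import is going
```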

When you begin indexing things, you might encounter permission problems creating the db/data/index folder. In this case, I just change all the permissions of the db folder and its files :

sudo chmod -R 777 /opt/solr-4.2.0/example/solr/db
this might not be a great idea security-wise; a narrower alternative is to give ownership of the folder to the user Tomcat runs as

ZooKeeper

To run SolrCloud, the distributed Solr installation, you need to have Apache ZooKeeper installed. ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization. SolrCloud uses ZooKeeper to synchronize configuration and cluster state (such as elected shard leaders), which is why it is crucial to have a highly available and fault-tolerant ZooKeeper installation. If you have a single ZooKeeper instance and it fails, your SolrCloud cluster will fail with it.

Download and untar ZooKeeper :

cd /opt
wget http://apache.mirror.vexxhost.com/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
sudo tar -xvf zookeeper-3.4.5.tar.gz

Create and modify your zoo.cfg file :

cd /opt/zookeeper-3.4.5/conf/
sudo cp zoo_sample.cfg zoo.cfg
sudo vim /opt/zookeeper-3.4.5/conf/zoo.cfg
Add/change the following

dataDir=/opt/zookeeper-3.4.5/data
server.1=<ec2-instance-ip>:2888:3888 (e.g., server.1=192.168.1.1:2888:3888)
server.2=<ec2-instance-ip>:2888:3888 (e.g., server.2=192.168.1.2:2888:3888)
server.3=<ec2-instance-ip>:2888:3888 (e.g., server.3=192.168.1.3:2888:3888)

In each server.N line, the first part is the IP address of that ZooKeeper server, and the second and third parts are the ports the ZooKeeper instances use to communicate with each other (2888 for followers connecting to the leader, 3888 for leader election).
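
When more than one server.N line is listed, each ZooKeeper server also needs a file named myid in its dataDir containing only its own number, so it knows which entry it is. A small sketch follows; the helper name is my own, and writing under /opt may need sudo.

```shell
# Write the myid file that tells a ZooKeeper server which server.N entry it is.
# Arguments: <dataDir> <id>
write_myid() {
  mkdir -p "$1" && printf '%s\n' "$2" > "$1/myid"
}

# Usage on the first server (use 2 and 3 on the other two):
#   write_myid /opt/zookeeper-3.4.5/data 1
```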

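Open port 2181 in your EC2 security group from the EC2 console. If the ZooKeeper servers run on separate instances, ports 2888 and 3888 must also be reachable between them.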
Start Zookeeper :

sudo /opt/zookeeper-3.4.5/bin/zkServer.sh start

If successful, the following message will be displayed :

JMX enabled by default
Using config: /opt/zookeeper-3.4.5/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
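
The STARTED message only says the process was launched. To check that the server is actually answering, you can send it the "ruok" four-letter command, to which a healthy server replies "imok". This is a sketch assuming nc (netcat) is installed; the function names are my own.

```shell
# Interpret the reply to ZooKeeper's "ruok" command: a healthy server answers "imok".
zk_reply_ok() {
  [ "$1" = "imok" ]
}

# Ask a ZooKeeper server whether it is running.
# Arguments: <host> <port>
zk_is_ok() {
  zk_reply_ok "$(echo ruok | nc "$1" "$2" 2>/dev/null)"
}

# Usage on the instance:
#   zk_is_ok localhost 2181 && echo "ZooKeeper is running"
#   sudo /opt/zookeeper-3.4.5/bin/zkServer.sh status
```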

Notes

On Solr Caches :

Caches play a major role in a Solr deployment. There are three Solr caches :

Filter cache : This stores the results of filter queries (the fq query parameter) and is also used for faceting, mainly by the enum facet method
Document cache : This is used for storing Lucene documents which hold stored fields
Query result cache : This is used for storing results of queries

There is a fourth cache - Lucene's internal cache - which is a field cache, but you can't control its behaviour. It is managed by Lucene and created when it is first used by the Searcher object.
With the help of these caches we can tune the behaviour of the Solr searcher instance. Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.
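
Before changing any cache size, it helps to look at how the caches are doing right now. The sketch below builds the URL of the MBeans handler restricted to the CACHE category, which reports figures such as lookups, hits, hit ratio and evictions per cache. It assumes curl is installed and that the core exposes the standard admin handlers; the helper name is my own.

```shell
# Build the URL that returns cache statistics for one core.
# Arguments: <host:port> <core>
cache_stats_url() {
  printf 'http://%s/solr/%s/admin/mbeans?stats=true&cat=CACHE&wt=json\n' "$1" "$2"
}

# Usage on the instance (collection1 is the core of the stock example, db the custom one):
#   curl "$(cache_stats_url localhost:8983 collection1)"
```

A low hit ratio together with many evictions usually suggests the cache is too small for the queries it serves.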

On Solr Directory Implementation :

One of the most crucial properties of Apache Lucene, and thus Solr, is the Lucene directory implementation.
The directory interface provides an abstraction layer for Lucene on all the I/O operations. This can affect the performance of your Solr setup in a drastic way.
If you want Solr to make the decision for you, you should use solr.StandardDirectoryFactory. This is a filesystem-based directory factory that tries to choose the best implementation based on your current operating system and Java virtual machine used.
If you are implementing a small application, which won't use many threads, you can use solr.SimpleFSDirectoryFactory which stores the index file on your local filesystem, but it doesn't scale well with a high number of threads.
solr.NIOFSDirectoryFactory scales well with many threads, but it doesn't work well on Microsoft Windows (it's much slower there) because of a JVM bug, so keep that in mind.
solr.MMapDirectoryFactory was the default directory factory for Solr for the 64-bit Linux systems from Solr 3.1 till 4.0. This directory implementation uses virtual memory and a kernel feature called mmap to access index files stored on disk. This allows Lucene (and thus Solr) to directly access the I/O cache. This is desirable and you should stick to that directory if near real-time searching is not needed.
If you need near real-time indexing and searching, you should use solr.NRTCachingDirectoryFactory. It is designed to store some parts of the index in memory (small chunks) and thus speed up some near real-time operations greatly.
solr.RAMDirectoryFactory is the only one that is not persistent. The whole index is stored in RAM, so you'll lose your index after a restart or server crash. Also remember that replication won't work when using solr.RAMDirectoryFactory. So why would you use that factory? Imagine a volatile index for autocomplete functionality or for unit tests of your queries' relevancy: anything where you don't need persistent and replicated data. However, please remember that this directory is not designed to hold large amounts of data.
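
To see which factory a core is currently configured with, you can pull the class attribute of the directoryFactory element out of its solrconfig.xml. This is a minimal sketch that assumes the element sits on a single line, as it does in the example configs; the helper name is my own.

```shell
# Print the class attribute of the <directoryFactory> element in a solrconfig.xml.
# Argument: <path-to-solrconfig.xml>
directory_factory_class() {
  sed -n 's/.*<directoryFactory[^>]*class="\([^"]*\)".*/\1/p' "$1"
}

# Usage:
#   directory_factory_class /opt/solr-4.2.0/example/solr/db/conf/solrconfig.xml
```

If the printed value is a placeholder that reads the solr.directoryFactory system property (I believe the example configs are set up this way, but check your own file), the factory can be switched at startup without editing the file, e.g. java -Dsolr.directoryFactory=solr.MMapDirectoryFactory -jar start.jar.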