Saturday, 13 April 2013

Solr 4.2 on EC2 (Part 1)

The Solr distribution ships with several sample applications. I will focus on two of them: one lives in the example directory (solr-4.2.0/example) and the other in the example-DIH directory (solr-4.2.0/example/example-DIH). The Data Import Handler (DIH) is used to index database contents.

Deploying both the Solr examples with Jetty (source)
  1. Download and untar Solr :
    • cd /opt
    • wget http://apache.mirror.vexxhost.com/lucene/solr/4.2.0/solr-4.2.0.tgz
    • tar -xvf solr-4.2.0.tgz
  2. Change security groups :
    • open port 8983 on your EC2 security group
    • you can access this from your online EC2 console
  3. Start the Solr sample example :
    • cd /opt/solr-4.2.0/example
    • java -jar start.jar
  4. If you want to run the DIH Solr example instead :
    • java -Dsolr.solr.home="/opt/solr-4.2.0/example/example-DIH/solr/" -jar start.jar
  5. You should be able to see the Solr manager at :
    • foo.com:8983/solr/admin/ or (your-ec2-ip):8983/solr/admin/
  6. Note that your Solr home is /opt/solr-4.2.0/example/solr/ for the example and /opt/solr-4.2.0/example/example-DIH/solr/ for the DIH example.
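Steps 3 and 4 differ only in the solr.solr.home property, so a small helper can print the right command for either example. This is only a sketch: the function name solr_start_cmd and the SOLR_ROOT variable are my own, and the default path assumes the /opt install used above.

```shell
#!/usr/bin/env bash
# Print the Jetty start command for one of the two bundled examples.
# SOLR_ROOT is an assumption matching the install path used in this post.
SOLR_ROOT="${SOLR_ROOT:-/opt/solr-4.2.0}"

solr_start_cmd() {
  case "$1" in
    example) echo "java -jar start.jar" ;;
    dih)     echo "java -Dsolr.solr.home=$SOLR_ROOT/example/example-DIH/solr/ -jar start.jar" ;;
    *)       echo "usage: solr_start_cmd example|dih" >&2; return 1 ;;
  esac
}

# Usage: print the command, then run it from $SOLR_ROOT/example
#   solr_start_cmd dih
```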
Deploying both the Solr examples with Tomcat 7
  1. Setup Tomcat 7 on your EC2 instance :
    • Check out my blog post on this (here)
  2. Change server.xml :
    • sudo vim /opt/apache-tomcat-7.0.34/conf/server.xml
    • add the following
    • <Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000" 
           redirectPort="8443" URIEncoding="UTF-8" />
  3. Create solr.xml :
    • sudo vim /opt/apache-tomcat-7.0.34/conf/Catalina/localhost/solr.xml
    • add the following
    • <?xml version="1.0" encoding="utf-8"?>
      <Context path="/solr" docBase="/opt/apache-tomcat-7.0.34/webapps/solr.war" debug="0" crossContext="true">
              <Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/solr" override="true"/>
      </Context>
  4. Deploy Solr : copy /opt/solr-4.2.0/dist/solr-4.2.0.war, which contains the files and libraries needed to run Solr, to the Tomcat webapps directory (/opt/apache-tomcat-7.0.34/webapps) and rename it to solr.war.
  5. Start the Solr sample example : 
    • sudo service tomcat restart
  6. If you want to run the Data Import Handler (DIH) Solr example instead :
    • in step 3, replace
    • <Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/solr" override="true"/>
    • with
    • <Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/example-DIH/solr" override="true"/>
  7. You should be able to see the Solr manager at :
    • foo.com:8983/solr/admin/ or (your-ec2-ip):8983/solr/admin/
  8. Note that your Solr home is /opt/solr-4.2.0/example/solr/ for the example and /opt/solr-4.2.0/example/example-DIH/solr/ for the DIH example.
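The WAR copy in step 4 can be scripted. The sketch below assumes the Solr and Tomcat paths used in this post; the function name deploy_solr_war is my own, and you may need sudo depending on who owns the Tomcat directory.

```shell
#!/usr/bin/env bash
# Copy the Solr WAR into Tomcat's webapps directory as solr.war (step 4).
# Arguments: <solr-root> <tomcat-home>; both default to the paths used in this post.
deploy_solr_war() {
  local solr_root="${1:-/opt/solr-4.2.0}"
  local tomcat_home="${2:-/opt/apache-tomcat-7.0.34}"
  local war="$solr_root/dist/solr-4.2.0.war"

  # Fail early with a clear message if the WAR is not where we expect it.
  if [ ! -f "$war" ]; then
    echo "WAR not found: $war" >&2
    return 1
  fi
  cp "$war" "$tomcat_home/webapps/solr.war"
}

# Usage:
#   deploy_solr_war /opt/solr-4.2.0 /opt/apache-tomcat-7.0.34
```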

Deploying a custom DIH application (backed by PostgreSQL) with Tomcat 7

A bit of background - I'm going to assume a simple database table whose contents we would like to index. My Solr home here is /opt/solr-4.2.0/example/solr/.
  1. Let's assume a simple table (if you want to set up PostgreSQL on your EC2 instance, refer to my blog post on this here) :
    • create table users(
          uid serial primary key,
          firstname varchar(255) not null
      );
  2. Start off by copying the db directory which the Solr DIH example uses to the Solr example directory (I would rather run a Solr cluster on the example, as opposed to example-DIH) :
    • cd /opt/solr-4.2.0/example/solr
    • cp -r /opt/solr-4.2.0/example/example-DIH/solr/db .
  3. Modify solr.xml :
    • sudo vim /opt/solr-4.2.0/example/solr/solr.xml
    • Add <core name="db" instanceDir="db" />
  4. Modify solrconfig.xml : 
    • sudo vim /opt/solr-4.2.0/example/solr/db/conf/solrconfig.xml 
    • change <lib dir="../../../../dist/" regex="solr-dataimporthandler-.*\.jar" /> to <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
  5. Modify db-data-config.xml  : 
    • sudo vim /opt/solr-4.2.0/example/solr/db/conf/db-data-config.xml 
    • remove everything and add your own DB configuration e.g.,
    • <dataConfig>
        <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://foo.com:5432/fooDB" user="power_user" password="pass" />
        <document>
          <entity name="user" query="SELECT uid,firstname from users">
            <field column="uid" name="id" />
            <field column="firstname" name="name" />
          </entity>
        </document>
      </dataConfig>
  6. The /opt/solr-4.2.0/example/solr/db/conf/schema.xml file doesn't need to be changed for my example, but you will likely have to change it if you're going to index other stuff in your DB.
  7. You will also need to include your JDBC driver :
    • cd /opt/solr-4.2.0/example/solr/db/lib
    • paste your postgresql-9.1-902.jdbc4.jar in that folder
  8. Restart Tomcat : 
    • sudo service tomcat restart
  9. When you begin indexing things, you might encounter permission problems creating the db/data/index folder. In this case, I just change all the permissions of the db folder and its files :
    • sudo chmod -R 777 /opt/solr-4.2.0/example/solr/db
    • this might not be a great idea security wise
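Once Tomcat is back up, indexing is started by calling the Data Import Handler over HTTP. The sketch below only builds the URL. It assumes the handler is registered at /dataimport, as it is in the db core's solrconfig.xml copied from the DIH example, and that Solr listens on port 8983; the function name dih_url is my own.

```shell
#!/usr/bin/env bash
# Build a Data Import Handler URL for a given host, core and command.
# Assumes the handler path /dataimport and port 8983 (see the note above).
dih_url() {
  local host="$1" core="$2" command="$3"
  echo "http://$host:8983/solr/$core/dataimport?command=$command"
}

# Usage:
#   curl "$(dih_url foo.com db full-import)"   # start a full import
#   curl "$(dih_url foo.com db status)"        # check progress
```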
ZooKeeper

To run SolrCloud (the distributed Solr installation) you need to have Apache ZooKeeper installed. ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization. SolrCloud uses ZooKeeper to synchronize configuration and cluster state (such as elected shard leaders), which is why it is crucial to have a highly available and fault-tolerant ZooKeeper installation. If you have a single ZooKeeper instance and it fails, your SolrCloud cluster will fail too.
  1. Download and untar ZooKeeper :
    • cd /opt
    • wget http://apache.mirror.vexxhost.com/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
    • sudo tar -xvf zookeeper-3.4.5.tar.gz
  2. Create and modify your zoo.cfg file :
    • cd /opt/zookeeper-3.4.5/conf/
    • sudo cp zoo_sample.cfg zoo.cfg
    • sudo vim /opt/zookeeper-3.4.5/conf/zoo.cfg 
    • Add/change the following
    • dataDir=/opt/zookeeper-3.4.5/data
      server.1=<ec2-instance-ip>:2888:3888 (e.g., server.1=192.168.1.1:2888:3888)
      server.2=<ec2-instance-ip>:2888:3888 (e.g., server.2=192.168.1.2:2888:3888)
      server.3=<ec2-instance-ip>:2888:3888 (e.g., server.3=192.168.1.3:2888:3888)
    • In each server.N line, N is the ID of that ZooKeeper server and the first part after the equals sign is its IP address. The second part (2888) is the port followers use to connect to the leader, and the third part (3888) is the port used for leader election.
  3. Open port 2181 (the ZooKeeper client port) in your EC2 security group. If the ZooKeeper servers run on separate instances, they also need to reach each other on ports 2888 and 3888.
  4. Start ZooKeeper :
    • sudo /opt/zookeeper-3.4.5/bin/zkServer.sh start
  5. If successful, the following message will be displayed : 
    • JMX enabled by default
      Using config: /opt/zookeeper-3.4.5/bin/../conf/zoo.cfg
      Starting zookeeper ... STARTED
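One thing to watch when zoo.cfg lists several server.N entries: each ZooKeeper server also needs a file named myid in its dataDir containing just its own N, otherwise it cannot tell which entry it is. A minimal sketch of that step follows; the function name write_myid is my own, and the path in the usage note is the dataDir from step 2.

```shell
#!/usr/bin/env bash
# Write the myid file that tells a ZooKeeper server which server.N entry it is.
# Arguments: <dataDir> <id>, where <id> matches the N of this machine's server.N line.
write_myid() {
  local data_dir="$1" id="$2"
  # Reject empty or non-numeric ids before touching the filesystem.
  case "$id" in
    ''|*[!0-9]*) echo "id must be a number" >&2; return 1 ;;
  esac
  mkdir -p "$data_dir"
  echo "$id" > "$data_dir/myid"
}

# Usage on the machine listed as server.1 (may need sudo):
#   write_myid /opt/zookeeper-3.4.5/data 1
# After starting, check whether this node is a leader or follower:
#   sudo /opt/zookeeper-3.4.5/bin/zkServer.sh status
```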
Notes

On Solr Caches :
  • Caches play a major role in a Solr deployment. There are three Solr caches :
    • Filter cache : This is used for storing filter results (query parameter fq) and facets, mainly those of the enum type
    • Document cache : This is used for storing Lucene documents which hold stored fields
    • Query result cache : This is used for storing results of queries
  • There is a fourth cache - Lucene's internal cache - which is a field cache, but you can't control its behaviour. It is managed by Lucene and created when it is first used by the Searcher object.
  • With the help of these caches we can tune the behaviour of the Solr searcher instance. Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.
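The three caches are configured in solrconfig.xml through the filterCache, queryResultCache and documentCache elements. Before tuning anything it helps to see what a core currently defines. The sketch below just greps for those elements; the function name list_solr_caches is my own, and the path in the usage note assumes the db core set up earlier.

```shell
#!/usr/bin/env bash
# List the cache definitions found in a solrconfig.xml, with line numbers.
# Commented-out definitions also match, so treat the output as a starting point.
list_solr_caches() {
  grep -nE '<(filterCache|queryResultCache|documentCache)([[:space:]]|$)' "$1"
}

# Usage:
#   list_solr_caches /opt/solr-4.2.0/example/solr/db/conf/solrconfig.xml
```

The size, initialSize and autowarmCount attributes on the matched elements are the values to adjust.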
On Solr Directory Implementation :
  • One of the most crucial properties of Apache Lucene, and thus Solr, is the Lucene directory implementation. 
  • The directory interface provides an abstraction layer for Lucene on all the I/O operations. This can affect the performance of your Solr setup in a drastic way. 
  • If you want Solr to make the decision for you, you should use solr.StandardDirectoryFactory. This is a filesystem-based directory factory that tries to choose the best implementation based on your current operating system and Java virtual machine used. 
  • If you are implementing a small application, which won't use many threads, you can use solr.SimpleFSDirectoryFactory which stores the index file on your local filesystem, but it doesn't scale well with a high number of threads. 
  • solr.NIOFSDirectoryFactory scales well with many threads, but it doesn't work well on Microsoft Windows platforms (it's much slower) because of a JVM bug, so keep that in mind.
  • solr.MMapDirectoryFactory was the default directory factory for Solr on 64-bit Linux systems from Solr 3.1 through 4.0. This directory implementation uses virtual memory and a kernel feature called mmap to access index files stored on disk. This allows Lucene (and thus Solr) to directly access the I/O cache. This is desirable, and you should stick to this directory factory if near real-time searching is not needed.
  • If you need near real-time indexing and searching, you should use solr.NRTCachingDirectoryFactory. It is designed to store some parts of the index in memory (small chunks) and thus speed up some near real-time operations greatly.
  • solr.RAMDirectoryFactory is the only one that is not persistent. The whole index is stored in RAM, so you'll lose your index after a restart or server crash. Also remember that replication won't work when using solr.RAMDirectoryFactory. One might ask, why should I use that factory? Imagine a volatile index for an autocomplete feature, or for unit tests of your queries' relevancy: anything where you don't need persistent and replicated data. However, please remember that this directory is not designed to hold large amounts of data.
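Switching between these factories does not always require editing solrconfig.xml. The sketch below assumes the directoryFactory element reads its class from the solr.directoryFactory system property, which, as far as I know, is how the stock example config is written; if your solrconfig.xml hard-codes the class, edit that element instead. The function name start_cmd_with_directory is my own.

```shell
#!/usr/bin/env bash
# Print a Jetty start command that selects a directory factory by system property.
# Assumes solrconfig.xml uses ${solr.directoryFactory:...} (see the note above).
start_cmd_with_directory() {
  case "$1" in
    Standard|SimpleFS|NIOFS|MMap|NRTCaching|RAM)
      echo "java -Dsolr.directoryFactory=solr.${1}DirectoryFactory -jar start.jar" ;;
    *)
      echo "unknown directory factory: $1" >&2; return 1 ;;
  esac
}

# Usage: print the command, then run it from /opt/solr-4.2.0/example
#   start_cmd_with_directory MMap
```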
(This post is based on the Apache Solr 4 Cookbook by Rafal Kuc)
