Saturday 13 April 2013

Solr 4.2 on EC2 (Part 1)

The Solr distribution comes with a couple of sample applications. I will focus on two of them: one is found in the example directory (solr-4.2.0/example) and the other in the example-DIH directory (solr-4.2.0/example/example-DIH). The Data Import Handler (DIH) is used to index database contents.

Deploying both the Solr examples with Jetty (source)
  1. Download and untar Solr :
    • cd /opt
    • wget
    • tar -xvf solr-4.2.0.tgz
  2. Change security groups :
    • open port 8983 on your EC2 security group
    • you can access this from your online EC2 console
  3. Start the Solr sample example :
    • cd /opt/solr-4.2.0/example
    • java -jar start.jar
  4. If you want to run the DIH Solr example instead :
    • java -Dsolr.solr.home="/opt/solr-4.2.0/example/example-DIH/solr/" -jar start.jar
  5. You should be able to see the Solr manager at :
    • or (your-ec2-ip):8983/solr/admin/
  6. Note that your Solr home is /opt/solr-4.2.0/example/solr/ for the example and /opt/solr-4.2.0/example/example-DIH/solr/ for the DIH example.
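A quick way to confirm step 5 from the command line is to request the admin page and look at the HTTP status code. This is only a minimal sketch: the helper names solr_admin_url and solr_status are made up for illustration, and it assumes curl is installed and that the admin path is the one given in step 5.

```shell
# Hypothetical helper: build the admin URL from step 5 for a given host.
solr_admin_url() {
  printf 'http://%s:8983/solr/admin/\n' "$1"
}

# Hypothetical helper: print the HTTP status code returned by that URL.
# curl prints 000 when it cannot connect at all, which usually means Solr
# is not running or port 8983 is still closed in the security group.
solr_status() {
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 "$(solr_admin_url "$1")"
}

# Usage, from the instance itself or from your own machine:
#   solr_status localhost
#   solr_status <your-ec2-ip>
```

Any code other than 000 shows that the port is reachable and something is answering on it.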
Deploying both the Solr examples with Tomcat 7
  1. Set up Tomcat 7 on your EC2 instance :
    • Check out my blog post on this (here)
  2. Change server.xml :
    • sudo vim /opt/apache-tomcat-7.0.34/conf/server.xml
    • add the following
    • <Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000" 
           redirectPort="8443" URIEncoding="UTF-8" />
  3. Create solr.xml :
    • sudo vim /opt/apache-tomcat-7.0.34/conf/Catalina/localhost/solr.xml
    • add the following
    • <?xml version="1.0" encoding="utf-8"?>
      <Context path="/solr" docBase="/opt/apache-tomcat-7.0.34/webapps/solr.war" debug="0" crossContext="true">
              <Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/solr" override="true"/>
      </Context>
  4. Next, deploy Solr itself: copy /opt/solr-4.2.0/dist/solr-4.2.0.war, which contains the files and libraries needed to run Solr, to the Tomcat webapps directory and rename it solr.war.
  5. Start the Solr sample example : 
    • sudo service tomcat restart
  6. If you want to run the Data Import Handler (DIH) Solr example instead :
    • in step 3, replace
    • <Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/solr" override="true"/>
    • with
    • <Environment name="solr/home" type="java.lang.String" value="/opt/solr-4.2.0/example/example-DIH/solr" override="true"/>
  7. You should be able to see the Solr manager at :
    • or (your-ec2-ip):8983/solr/admin/
  8. Note that your Solr home is /opt/solr-4.2.0/example/solr/ for the example and /opt/solr-4.2.0/example/example-DIH/solr/ for the DIH example.
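Step 4 above can be sketched as a small shell function. deploy_solr_war is a hypothetical helper name; the paths in the usage comment are the ones used in this post.

```shell
# Hypothetical helper: copy the Solr WAR into Tomcat's webapps directory
# under the name solr.war, which is what the docBase in solr.xml points at.
deploy_solr_war() {
  src="$1"       # e.g. /opt/solr-4.2.0/dist/solr-4.2.0.war
  webapps="$2"   # e.g. /opt/apache-tomcat-7.0.34/webapps
  [ -f "$src" ] || { echo "missing WAR: $src" >&2; return 1; }
  cp "$src" "$webapps/solr.war"
}

# Usage (from a shell that is allowed to write to the webapps directory):
#   deploy_solr_war /opt/solr-4.2.0/dist/solr-4.2.0.war /opt/apache-tomcat-7.0.34/webapps
```

A shell function cannot be prefixed with sudo, so if the webapps directory is owned by root, either run this from a root shell or use the equivalent one-line sudo cp.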

Deploying a custom DIH application (backed by PostgreSQL) with Tomcat 7

A bit of background - I'm going to assume a simple database table whose contents we would like to index. My Solr home here is /opt/solr-4.2.0/example/solr/.
  1. Let's assume a simple table (if you want to set up PostgreSQL on your EC2 instance, refer to my blog post on this here) :
    • create table users(
          uid serial primary key,
          firstname varchar(255) not null
      );
  2. Start off by copying the db directory which the Solr DIH example uses to the Solr example directory (I would rather run a Solr cluster on the example, as opposed to example-DIH) :
    • cd /opt/solr-4.2.0/example/solr
    • cp -r /opt/solr-4.2.0/example/example-DIH/solr/db .
  3. Modify solr.xml :
    • sudo vim /opt/solr-4.2.0/example/solr/solr.xml
    • Add <core name="db" instanceDir="db" />
  4. Modify solrconfig.xml : 
    • sudo vim /opt/solr-4.2.0/example/solr/db/conf/solrconfig.xml 
    • change <lib dir="../../../../dist/" regex="solr-dataimporthandler-.*\.jar" /> to <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
  5. Modify db-data-config.xml  : 
    • sudo vim /opt/solr-4.2.0/example/solr/db/conf/db-data-config.xml 
    • remove everything and add your own DB configuration e.g.,
    • <dataConfig>
        <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://" user="power_user" password="pass" />
        <document>
          <entity name="user" query="SELECT uid,firstname from users">
            <field column="uid" name="id" />
            <field column="firstname" name="name" />
          </entity>
        </document>
      </dataConfig>
  6. The /opt/solr-4.2.0/example/solr/db/conf/schema.xml file doesn't need to be changed for my example, but you will likely have to change it if you're going to index other stuff in your DB.
  7. You will also need to include your JDBC driver :
    • cd /opt/solr-4.2.0/example/solr/db/lib
    • paste your postgresql-9.1-902.jdbc4.jar in that folder
  8. Restart Tomcat : 
    • sudo service tomcat restart
  9. When you begin indexing things, you might encounter permission problems creating the db/data/index folder. In this case, I just change all the permissions of the db folder and its files :
    • sudo chmod -R 777 /opt/solr-4.2.0/example/solr/db
    • this might not be a great idea security-wise
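Once Tomcat is back up, an import is started by calling the db core's DIH request handler over HTTP. The sketch below assumes the handler is registered at /dataimport, as it is in the db example's solrconfig.xml, and that the core is named db as in step 3. dih_url is a hypothetical helper name.

```shell
# Hypothetical helper: build a DIH command URL for a host, core and command.
dih_url() {
  host="$1"; core="$2"; cmd="$3"
  printf 'http://%s:8983/solr/%s/dataimport?command=%s\n' "$host" "$core" "$cmd"
}

# Usage (needs a running Solr, so it is left commented out here):
#   curl "$(dih_url localhost db full-import)"   # index the whole users table
#   curl "$(dih_url localhost db status)"        # check progress of the import
```

The full-import command returns immediately and the import runs in the background, so polling status is how you find out when it has finished or failed.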

In order to run SolrCloud, the distributed Solr installation, you need to have Apache ZooKeeper installed. ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization. SolrCloud uses ZooKeeper to synchronize configuration and cluster states (such as elected shard leaders), and that's why it is crucial to have a highly available and fault-tolerant ZooKeeper installation. If you have a single ZooKeeper instance and it fails, your SolrCloud cluster will crash too.
  1. Download and untar ZooKeeper :
    • wget
    • sudo tar -xvf zookeeper-3.4.5.tar.gz
  2. Create and modify your zoo.cfg file :
    • cd /opt/zookeeper-3.4.5/conf/
    • sudo cp zoo_sample.cfg zoo.cfg
    • sudo vim /opt/zookeeper-3.4.5/conf/zoo.cfg 
    • Add/change the following
    • dataDir=/opt/zookeeper-3.4.5/data
      server.1=<ec2-instance-ip>:2888:3888
      server.2=<ec2-instance-ip>:2888:3888
      server.3=<ec2-instance-ip>:2888:3888
    • In each server.N line, the first part is the IP address of that ZooKeeper server. The second part (2888) is the port followers use to connect to the leader, and the third part (3888) is the port used for leader election.
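  3. Open port 2181 from the EC2 console. For an ensemble spread over several instances, ports 2888 and 3888 must also be open between the ZooKeeper instances.
  4. Start ZooKeeper :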
    • sudo /opt/zookeeper-3.4.5/bin/ start
  5. If successful, the following message will be displayed : 
    • JMX enabled by default
      Using config: /opt/zookeeper-3.4.5/bin/../conf/zoo.cfg
      Starting zookeeper ... STARTED
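With more than one server.N line in zoo.cfg, each ZooKeeper instance also needs a myid file in its dataDir before it is started. The file contains just that instance's own number (1, 2 or 3 here). A minimal sketch, where write_myid is a hypothetical helper name:

```shell
# Hypothetical helper: write the myid file ZooKeeper reads at startup.
write_myid() {
  data_dir="$1"   # the dataDir from zoo.cfg
  id="$2"         # N from this machine's server.N line
  case "$id" in
    ''|*[!0-9]*) echo "server id must be a number: $id" >&2; return 1 ;;
  esac
  mkdir -p "$data_dir" && printf '%s\n' "$id" > "$data_dir/myid"
}

# Usage on the first instance (as a user that can write to dataDir):
#   write_myid /opt/zookeeper-3.4.5/data 1
# Once started, a healthy server answers "imok" to:
#   echo ruok | nc localhost 2181
```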

On Solr Caches :
  • Caches play a major role in a Solr deployment. There are three Solr caches :
    • Filter cache : This is used for storing the results of filters (the fq query parameter) and, mainly, enum-type facets
    • Document cache : This is used for storing Lucene documents which hold stored fields
    • Query result cache : This is used for storing results of queries
  • There is a fourth cache - Lucene's internal cache - which is a field cache, but you can't control its behaviour. It is managed by Lucene and created when it is first used by the Searcher object.
  • With the help of these caches we can tune the behaviour of the Solr searcher instance. Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.
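To judge whether the cache sizes suit your queries, it helps to look at hit ratios and evictions. Solr 4 exposes these through the per-core admin/mbeans handler. The sketch below only builds the URL. cache_stats_url is a hypothetical helper name, and the core name collection1 in the usage comment assumes the stock example core.

```shell
# Hypothetical helper: URL for the cache statistics of one core.
cache_stats_url() {
  host="$1"; core="$2"
  printf 'http://%s:8983/solr/%s/admin/mbeans?stats=true&cat=CACHE&wt=json\n' "$host" "$core"
}

# Usage (needs a running Solr, so it is left commented out here):
#   curl "$(cache_stats_url localhost collection1)"
#   curl "$(cache_stats_url localhost db)"
```

A cache with a low hit ratio and many evictions is a candidate for a larger size, while one that never fills up can be made smaller.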
On Solr Directory Implementation :
  • One of the most crucial properties of Apache Lucene, and thus Solr, is the Lucene directory implementation. 
  • The directory interface provides an abstraction layer for Lucene on all the I/O operations. This can affect the performance of your Solr setup in a drastic way. 
  • If you want Solr to make the decision for you, you should use solr.StandardDirectoryFactory. This is a filesystem-based directory factory that tries to choose the best implementation based on your current operating system and Java virtual machine used. 
  • If you are implementing a small application that won't use many threads, you can use solr.SimpleFSDirectoryFactory, which stores the index files on your local filesystem but doesn't scale well with a high number of threads. 
  • solr.NIOFSDirectoryFactory scales well with many threads, but it doesn't work well on Microsoft Windows platforms (it's much slower) because of a JVM bug, so keep that in mind.
  • solr.MMapDirectoryFactory was the default directory factory for Solr for the 64-bit Linux systems from Solr 3.1 till 4.0. This directory implementation uses virtual memory and a kernel feature called mmap to access index files stored on disk. This allows Lucene (and thus Solr) to directly access the I/O cache. This is desirable and you should stick to that directory if near real-time searching is not needed.
  • If you need near real-time indexing and searching, you should use solr.NRTCachingDirectoryFactory. It is designed to store some parts of the index in memory (small chunks) and thus speed up some near real-time operations greatly.
  • solr.RAMDirectoryFactory is the only one that is not persistent. The whole index is stored in RAM, so you'll lose your index after a restart or server crash. Also remember that replication won't work when using solr.RAMDirectoryFactory. So why use this factory at all? Imagine a volatile index for autocomplete functionality or for unit tests of your queries' relevancy: anything where you don't need persistent and replicated data. However, please remember that this directory is not designed to hold large amounts of data.
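With the Jetty setup, one way to try a different factory without editing any files is a system property. This is a sketch under an assumption: it only works if the directoryFactory element in your solrconfig.xml reads its class from the solr.directoryFactory property, as the stock example configuration does. Check your own file first. solr_start_cmd is a hypothetical helper name that just prints the command to run.

```shell
# Hypothetical helper: print the start command for one of the factories
# discussed above. The short name is the part before "DirectoryFactory".
solr_start_cmd() {
  case "$1" in
    Standard|SimpleFS|NIOFS|MMap|NRTCaching|RAM) ;;
    *) echo "unknown directory factory: $1" >&2; return 1 ;;
  esac
  printf 'java -Dsolr.directoryFactory=solr.%sDirectoryFactory -jar start.jar\n' "$1"
}

# Usage, from /opt/solr-4.2.0/example:
#   solr_start_cmd MMap         # prints the command
#   $(solr_start_cmd MMap)      # runs it
```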
(This post is based on the Apache Solr 4 Cookbook by Rafal Kuc)

