
In the Cloud


In this article, Rafał Kuć, author of the book Solr Cookbook - Third Edition, covers the cloud side of Solr—SolrCloud: setting up collections, configuring replicas, distributed indexing and searching, as well as aliasing and shard manipulation. We will also learn how to create a cluster.


Creating a new SolrCloud cluster

Imagine a situation where one day you have to set up a distributed cluster with Solr because the amount of data is just too much for a single server to handle. Of course, you could just add a second server, or another master server with its own set of data. However, before Solr 4.0, you would have had to take care of data distribution yourself, and in addition you would also have had to set up replication, handle data duplication, and so on. With SolrCloud you don't have to do this—you can just set up a new cluster, and this article will show you how to do that.

Getting ready

Before starting, you should have a ZooKeeper cluster set up and ready for production use.

How to do it...

Let's assume that we want to create a cluster that will have four Solr servers. We would also like to have our data divided between the four Solr servers in such a way that the original data lives on two machines and, in addition, a copy of each shard is available in case something happens to one of the Solr instances. I also assume that we already have our ZooKeeper cluster set up, ready, and available at the address 192.168.1.10 on port 9983. For this article, we will set up four SolrCloud nodes on the same physical machine:

  1. We will start by running an empty Solr server (without any configuration) on port 8983. We do this by running the following command (for Solr 4.x):
    java -DzkHost=192.168.1.10:9983 -jar start.jar
  2. For Solr 5, we will run the following command:
    bin/solr start -c -z 192.168.1.10:9983
  3. Now we start another three nodes, each on a different port (note that different Solr instances can run on the same port, but they should be installed on different machines). We do this by running one command for each installed Solr server (for Solr 4.x):
    java -Djetty.port=6983 -DzkHost=192.168.1.10:9983 -jar start.jar
    java -Djetty.port=4983 -DzkHost=192.168.1.10:9983 -jar start.jar
    java -Djetty.port=2983 -DzkHost=192.168.1.10:9983 -jar start.jar
  4. For Solr 5, the commands will be as follows:
    bin/solr start -c -p 6983 -z 192.168.1.10:9983
    bin/solr start -c -p 4983 -z 192.168.1.10:9983
    bin/solr start -c -p 2983 -z 192.168.1.10:9983
  5. Now we need to upload our collection configuration to ZooKeeper. Assuming that we have our configuration in /home/conf/solrconfiguration/conf, we will run the following command from the home directory of the first Solr server we started (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):
    ./zkcli.sh -cmd upconfig -zkhost 192.168.1.10:9983 -confdir /home/conf/solrconfiguration/conf/ -confname collection1
  6. Now we can create our collection using the following command:
    curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=collection1'
  7. If we now go to http://localhost:8983/solr/#/~cloud, we will see the following cluster view:

    cloud-img-0

As we can see, Solr has created a new collection with a proper deployment. Let's now see how it works.

How it works...

We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collections, because we haven't created any yet.

For Solr 4.x, we started by running Solr and telling it that we want it to run in SolrCloud mode. We did that by specifying the -DzkHost property and setting its value to the IP address of our ZooKeeper instance. Of course, in the production environment, you would point Solr to a cluster of ZooKeeper nodes—this is done using the same property, but the IP addresses are separated using the comma character.
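
For example, with three hypothetical ZooKeeper node addresses, the Solr 4.x start command could look like this:

java -DzkHost=192.168.1.10:9983,192.168.1.11:9983,192.168.1.12:9983 -jar start.jar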

For Solr 5, we used the solr script provided in the bin directory. By adding the -c switch, we told Solr that we want it to run in the SolrCloud mode. The -z switch works exactly the same as the -DzkHost property for Solr 4.x—it allows you to specify the ZooKeeper host that should be used.

Of course, the other three Solr nodes are run in exactly the same manner. For Solr 4.x, we add the -DzkHost property that points Solr to our ZooKeeper. Because we are running all four nodes on the same physical machine, we need to specify the -Djetty.port property, because we can run only a single Solr server on a single port. For Solr 5, we use the -z property of the bin/solr script and the -p property to specify the port on which Solr should start.

The next step is to upload the collection configuration to ZooKeeper. We do this because Solr will fetch this configuration from ZooKeeper when we request the collection creation. To upload the configuration, we use the zkcli.sh script provided with the Solr distribution. We use the upconfig command (the -cmd switch), which means that we want to upload the configuration. We specify the ZooKeeper host using the -zkhost switch. After that, we say in which directory our configuration is stored (the -confdir switch). The directory should contain all the needed configuration files, such as schema.xml, solrconfig.xml, and so on. Finally, we specify the name under which we want to store our configuration using the -confname switch.
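
If we want to verify that the configuration was actually stored, the same zkcli.sh script provides a downconfig command that downloads a named configuration from ZooKeeper to a local directory; the target directory used below is just an example:

./zkcli.sh -cmd downconfig -zkhost 192.168.1.10:9983 -confname collection1 -confdir /tmp/collection1-check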

After we have our configuration in ZooKeeper, we can create the collection. We do this by sending a command to the Collections API, which is available at the /admin/collections endpoint. First, we tell Solr that we want to create the collection (action=CREATE) and that we want our collection to be named firstCollection (name=firstCollection). Remember that collection names are case sensitive, so firstCollection and firstcollection are two different collections. We specify that we want our collection to be built of two primary shards (numShards=2) and we want each shard to be present in two copies (replicationFactor=2). This means that we will have a primary shard and a single replica. Finally, we specify which configuration should be used to create the collection by specifying the collection.configName property.
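
Apart from the graphical cloud view, we can also ask the Collections API itself about the state of the cluster. For example, in recent Solr 4.x and Solr 5 releases, the CLUSTERSTATUS action returns the shards and replicas of a given collection:

curl 'localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=firstCollection&wt=json'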

As we can see in the cloud view, our collection has been created and spread across all the nodes.

There's more...

There are a few things that I would like to mention—the possibility of running a ZooKeeper server embedded into Apache Solr and specifying the Solr server name.

Starting an embedded ZooKeeper server

You can also start an embedded ZooKeeper server shipped with Solr for your test environment. In order to do this, you should pass the -DzkRun parameter instead of -DzkHost=192.168.1.10:9983, and only for the node that should run the embedded ZooKeeper server. So the final command for Solr 4.x should look similar to this:

java -DzkRun -jar start.jar

In Solr 5.0, the same command will be as follows:

bin/solr start -c

By default, ZooKeeper will start on a port 1,000 higher than the one Solr is started on. So, if you are running your Solr instance on port 8983, ZooKeeper will be available on port 9983.
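
For example, starting a Solr 5 node with the following command would make the embedded ZooKeeper listen on port 8574:

bin/solr start -c -p 7574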

The thing to remember is that the embedded ZooKeeper should only be used for development purposes and only one node should start it.

Specifying the Solr server name

Solr needs each instance of SolrCloud to have a name. By default, that name is set using the IP address or the hostname, appended with the port the Solr instance is running on and the _solr postfix. For example, if our node is running on 192.168.56.1 and port 8983, it will be called 192.168.56.1:8983_solr. Of course, Solr allows you to change that behavior by specifying the hostname yourself. To do this, start Solr with the -Dhost property or add the host property to solr.xml.

For example, if we would like one of our nodes to have the name of server1, we can run the following command to start Solr:

java -DzkHost=192.168.1.10:9983 -Dhost=server1 -jar start.jar

In Solr 5.0, the same command would be:

bin/solr start -c -h server1
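
If you prefer to keep the name in the configuration files instead of on the command line, the host value can also be set in the solrcloud section of solr.xml. A minimal sketch, assuming the solr.xml layout shipped with Solr 5, could look like this:

<solr>
  <solrcloud>
    <str name="host">server1</str>
    <int name="hostPort">${jetty.port:8983}</int>
  </solrcloud>
</solr>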

Setting up multiple collections on a single cluster

Having a single collection inside the cluster is nice, but there are multiple use cases where we want to have more than a single collection running on the same cluster. For example, we might want users and books in different collections, or the logs from each day to be stored in their own collection. This article will show you how to create multiple collections on the same cluster.


Getting ready

The Creating a new SolrCloud cluster section earlier in this article shows you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10, that it is listening on port 2181, and that we already have four SolrCloud nodes running as a cluster.

How to do it...

As we already have all the prerequisites, such as ZooKeeper and Solr up and running, we need to upload our configuration files to ZooKeeper to be able to create collections:

  1. Assuming that we have our configurations in /home/conf/firstcollection/conf and /home/conf/secondcollection/conf, we will run the following commands from the home directory of the first Solr server we started to upload the configurations to ZooKeeper (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):
    ./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/firstcollection/conf/ -confname firstcollection

    ./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/secondcollection/conf/ -confname secondcollection
  2. We have pushed our configurations into ZooKeeper, so now we can create the collections we want. In order to do this, we use the following commands:
    curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=firstcollection'

    curl 'localhost:8983/solr/admin/collections?action=CREATE&name=secondcollection&numShards=4&replicationFactor=1&collection.configName=secondcollection'
  3. Now, just to test whether everything went well, we will go to http://localhost:8983/solr/#/~cloud. As a result, we will see the following cluster topology:
    cloud-img-1

As we can see, both the collections were created the way we wanted. Now let's see how that happened.

How it works...

We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collections, because we haven't created any yet. We also assume that we have our SolrCloud cluster configured and started.

We start by uploading two configurations to ZooKeeper, one called firstcollection and the other called secondcollection. After that we are ready to create our collections.

We start by creating the collection named firstCollection, which is built of two primary shards and one replica. The second collection, called secondcollection, is built of four primary shards and doesn't have any replicas. We can see that easily in the cloud view of the deployment. The firstCollection collection has two shards—shard1 and shard2. Each of the shards has two physical copies—one green (which means active) and one with a black dot, which is the primary shard. The secondcollection collection is built of four physical shards—each shard has a black dot near its name, which means that they are primary shards.
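
Because each collection is addressed by its own name in the URL, a quick way to check that both collections respond independently is to query each of them (assuming no documents have been indexed yet, both queries simply return zero results):

curl 'localhost:8983/solr/firstCollection/select?q=*:*&wt=json'
curl 'localhost:8983/solr/secondcollection/select?q=*:*&wt=json'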

Splitting shards

Imagine a situation where you reach the limits of your current deployment—the number of shards is just not enough. For example, the indexing throughput gets lower and lower, because the disks are not able to keep up. Of course, one possible solution is to spread the index across more shards; however, you already have a collection and you want to keep the data, and reindexing is not an option because you don't have the original data. Solr can help you in such situations by allowing you to split shards of already created collections. This article will show you how to do it.

Getting ready

The Creating a new SolrCloud cluster section earlier in this article shows you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on port 2181 and that we already have four SolrCloud nodes running as a cluster.

How to do it...

Let's assume that we already have a SolrCloud cluster up and running and it has one collection called books. So our cloud view (which is available at http://localhost:8983/solr/#/~cloud) looks as follows:

cloud-img-2

We have four nodes, but we are not utilizing them fully: the two nodes that host our shards are almost fully utilized, while the other two are idle. What we can do is create a new collection and reindex the data, or we can split the shards of the already created collection. Let's go with the second option:

  1. We start by splitting the first shard. It is as easy as running the following command:
    curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard1'
  2. After this, we can split the second shard by running a similar command to the one we just used:
    curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard2'
  3. Let's take a look at the cluster cloud view now (which is available at http://localhost:8983/solr/#/~cloud):
    cloud-img-3

As we can see, both shards were split—shard1 was divided into shard1_0 and shard1_1 and shard2 was divided into shard2_0 and shard2_1. Of course, the data was copied as well, so everything is ready.

However, the last step should be to delete the original shards. Solr doesn't delete them, because sometimes applications use shard names to connect to a given shard. However, in our case, we can delete them by running the following commands:

curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard1'
curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard2'

If we now look at the cloud view of the cluster again, we will see the following:

cloud-img-4

How it works...

We start with a simple collection called books that is built of two primary shards and no replicas. This is the collection whose shards we will try to split without stopping Solr.

Splitting shards is very easy. We just need to run a simple command in the Collections API (the /admin/collections endpoint) and specify that we want to split a shard (action=SPLITSHARD). We also need to provide additional information such as which collection we are interested in (the collection parameter) and which shard we want to split (the shard parameter). You can see the name of the shard by looking at the cloud view or by reading the cluster state from ZooKeeper. After sending the command, Solr might force us to wait for a substantial amount of time—shard splitting takes time, especially on large collections. Of course, we can run the same command for the second shard as well.
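
If waiting for the synchronous call to return is a problem, newer Solr releases also let us submit the split asynchronously: passing an async parameter with a request identifier of our choice makes the call return immediately, and the REQUESTSTATUS action can then be polled to check whether the split has finished. A sketch of this approach (the identifier split-shard1 is just an arbitrary label) could look as follows:

curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard1&async=split-shard1'
curl 'http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=split-shard1'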

Finally, we end up with six shards—four new ones and two old ones. The original shards will still contain data, but they will start to re-route requests to the newly created shards. The data was split evenly between the new shards. The old shards are left in place, but they are marked as inactive and won't have any more data indexed to them. Because we don't need them, we can just delete them using the action=DELETESHARD command sent to the same Collections API. Similar to the split shard command, we need to specify the collection name and the name of the shard we want to delete. After we delete the initial shards, our cluster view shows only four shards, which is what we were aiming at.

We can now spread the shards across the cluster.
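
One way to do that, using the same Collections API, is to add a replica of a shard on one of the less loaded nodes and, once it becomes active, remove the copy we no longer need. A minimal sketch, assuming a hypothetical target node named 192.168.1.11:8983_solr and a replica named core_node1, could look like this:

curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=books&shard=shard1_0&node=192.168.1.11:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=books&shard=shard1_0&replica=core_node1'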

Summary

In this article, we learned how to create a new SolrCloud cluster, how to set up multiple collections on a single cluster, and how to split the shards of an existing collection.
