Skip to content

How rails sharding connection handling works

Henrique Gubert edited this page Jun 16, 2017 · 13 revisions

First of all, to understand what the rails-sharding gem does, it is necessary to understand how ActiveRecord handles its connection to the database, and how we can use its existing interface to connect to shards.

ActiveRecord v5 defines the following class hierarchy:

ConnectionHandler -[has many]-> ConnectionPool -[has many]-> Connection (Adapter-specific classes)

Let's understand what each one of these classes do.

This is a high-level interface to database connection management. A common running Rails app has only one instance ConnectionHandler, that can be accessed via ActiveRecord::Base.connection_handler. The actual connection_handler getter is defined in ActiveRecord::Core#L134, which is included into ActiveRecord::Base.

The ConnectionHandler can manage several ConnectionPool, usually one per database. In a vanilla Rails app the ConnectionHandler has a single ConnectionPool. Besides holding several ConnectionPool, the ConnectionHandler also has the feature of automatically dealing with forked processes. If a Rails app is forked, the child (new) process cannot use the same ConnectionPool of the parent process. The ConnectionHandler detects the fork automatically and instanciates a new ConnectionPool for the child process, with the same specification as the parent's. This is completely transparent to the ConnectionHandler user.

To create a new ConnectionPool you need to call ConnectionHandler#establish_connection, which is a poorly named method for me, as it doesn't actually establish a connection to the database. It simply creates an empty ConnectionPool associated with a spec_name (just a plain string, which is "primary" by default).

Once the pool is created, you can retrieve it by spec_name using ConnectionHandler#retrieve_connection_pool, or you can even retrieve a DB connection directly just passing the spec_name using ConnectionHandler#retrieve_connection

Of course, Rails users don't usually know about the ConnectionHandler, and they also don't know things such as the spec_name of the database they're using. If you needed to get a connection or connection_pool through the ConnectionHandler, you'd call:

ActiveRecord::Base.connection_handler.retrieve_connection_pool("primary")
ActiveRecord::Base.connection_handler.retrieve_connection("primary")

However, Rails sugar-coats this process by including the ConnectionHandling module to ActiveRecord::Base, which makes accessing DB connections much simpler:

ActiveRecord::Base.connection_pool
ActiveRecord::Base.connection
ActiveRecord::Base.with_connection { |connection| ... }

Also, when you have different models being saved in different databases, the ConnectionHandling already does the trick for you of returning the connection or the connection pool to the right database, depending on which model you're calling the methods:

ModelA::Base.connection_pool  #=> connection pool of the DB where ModelA is
ModelB::Base.connection_pool  #=> connection pool of the DB where ModelB is

The case above is one of those cases where a single ConnectionHandler will have multiple ConnectionPool.

The ConnectionPool is responsible for managing and allocating DB connections to threads. When you define your connection in the config/database.yml you specify the pool size of for each database using the pool option (if not, the default is 5). This is the maximum number of database connections a ConnectionPool supports.

The ConnectionPool is lazy. It starts with 0 available connections, and as connections are checked-out, they are created and removed from the available pool. When the pool runs out of connections it will put any thread that is requesting another connection to wait for a given timeout. If it timeouts, it will then raise a ConnectionTimeoutError exception here. You can control the timeout time with the checkout_timeout option in your database specification in config/database.yml

If you're into this kind of stuff you can play with the ConnectionPool directly using the the checkout and checkin methods:

connection_pool = ActiveRecord::Base.connection_pool
connection_1 = connection_pool.checkout
connection_2 = connection_pool.checkout
connection_pool.checkin connection_1
connection_pool.checkin connection_2

But be warned! If you forget to check-in the connection it will stay reserved and unavailable for checkout, and you run the risk of running out of connections in your pool.

This is one of the reasons you should probably not use the checkout and checkin methods directly, but rather the higher-level methods connection and release_connection (these are the methods used by Rails to retrieve a connection). The main difference is that connection and release_connection will associate a checked-out connection to the current thread, and will avoid reserving more than one connection per thread. This means that if you forget to release_connection after you have finished using the connection, at least, the next time you request for a connection you will get the same one as before, instead of checking-out a new one:

connection_pool = ActiveRecord::Base.connection_pool
connection_1 = connection_pool.connection
connection_2 = connection_pool.connection
connection_1 == connection_2 #=> true

In fact, that is how ActiveRecord::Base uses the ConnectionPool. When you load some record like User.first ActiveRecord will request a connection to the ConnectionPool, and it will not call release_connection later. This is actually a sensible thing to do when you have a single database, but might be a problem if you're accessing thousands of different databases randomly.

In terms of using the ConnectionPool directly, it is more memory-efficient to use the with_connection method instead of the pair connection, release_connection. This ensures the connection is checked-in to the poll after the block runs, even if the block raised some exception:

connection_pool = ActiveRecord::Base.connection_pool
connection_pool.with_connection do |connection|
  # do your work here
end
# connection has been checked-in again when we get here

Just remember that a checked-in connection is not a closed connection, it just means the connection has returned to the pool of available connections to be checked-out by another thread. This leads to better memory utilization but does not free memory directly.

Another interesting feature of the ConnectionPool is the Reaper. It runs periodically (with frequency configurable through the reaping_frequency config) and it identifies connections checked-out by threads that are no longer alive, and checks them in again. The Reaper does not run by default, only if you specify the reaping_frequency option.

As of now, the ConnectionPool doesn't have any methods to actually disconnect and clear stale connections sitting in the pool for too long. This is relevant as if you are accessing a large number of databases, this means at least one connection will be kept open for each database ever queried. This results in a larger memory footprint of your Rails process and of your database process.

Clone this wiki locally