Data clumps refer to groups of variables that are frequently found together in different parts of the codebase. These variables often represent related pieces of information or attributes of an entity. When data clumps occur, it suggests that there might be an underlying concept or abstraction that is not yet explicitly represented in the code.

Problems

Duplication: If the same group of variables is passed around in multiple places, any change to the structure or meaning of those variables must be made in several locations, which increases the likelihood of errors and makes maintenance more difficult.
Code Readability: When a group of variables appears frequently throughout the codebase, the code becomes harder to read and understand, because developers have to track the relationships between these variables themselves.
Maintainability: As the codebase grows, managing multiple instances of the same group of variables becomes more complex and error-prone.
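
To make these problems concrete, here is a small sketch of what a data clump can look like. The method names and the street/city/zip_code fields are hypothetical and chosen only for illustration. Notice how the same three values travel together through every signature:

```ruby
# Hypothetical example: street, city and zip_code always travel together.
def format_label(name, street, city, zip_code)
  "#{name}\n#{street}\n#{zip_code} #{city}"
end

# The clump is passed twice here, even though street is never used.
def same_city?(street_a, city_a, zip_a, street_b, city_b, zip_b)
  city_a == city_b && zip_a == zip_b
end

# Any new field (for example a country) would have to be added to
# every one of these signatures and to every caller.
def valid_address?(street, city, zip_code)
  [street, city, zip_code].none? { |part| part.to_s.strip.empty? }
end
```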

Solution

Identify Related Variables: Look for groups of variables that tend to appear together in various parts of your code.
Encapsulate into a Class or Data Structure: Create a new class or data structure that represents the related variables as a single entity. This class can provide methods for accessing and manipulating the data as needed.
Update References: Replace instances of the data clump with instances of the new class or data structure throughout your codebase.
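
The three steps above can be sketched as follows. This is a minimal illustration, assuming a hypothetical clump of street, city and zip_code values; the `Address` struct and the `format_label` method are invented names, not part of any library:

```ruby
# Hypothetical value object that names the concept behind the clump.
Address = Struct.new(:street, :city, :zip_code, keyword_init: true) do
  # Behaviour that belongs to the data can live next to it.
  def valid?
    to_a.none? { |part| part.to_s.strip.empty? }
  end

  def to_s
    "#{street}, #{zip_code} #{city}"
  end
end

# Callers now pass one object instead of three loose values.
def format_label(name, address)
  "#{name}\n#{address}"
end

home = Address.new(street: "1 Main St", city: "Springfield", zip_code: "12345")
puts format_label("Ann", home)
```

A `Struct` is only one possible choice here. A plain class or a frozen hash would work too, depending on how much behaviour the new abstraction needs.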

Real World Example

A good example of a data clump is the set of connection settings for a database.

require "mysql2"

client = Mysql2::Client.new(
  host: "127.0.0.1",
  username: "root",
  password: "12345",
  port: 3306,
  database: "development"
)

client.query("SELECT * FROM users;")

As we can see in the example above, we need to provide appropriate data to connect to the database. The problem with data clumps occurs when such code appears in many places.

Of course, no one remembers all of these values, so every time we want to create a database client, we copy them from a previous place and paste them wherever we need them.

What if someone changes the port that this database is on? In this case, we have to change the data everywhere.

To solve this problem, we can move all the data to a new module/class that will be responsible for storing it.

module Database
  CREDENTIALS = {
    host: "127.0.0.1",
    username: "root",
    password: "12345",
    port: 3306,
    database: "development"
  }
end

Now, instead of looking for the credentials every time and copying them from another place in the code, we can use our newly created module:

require "mysql2"

client = Mysql2::Client.new(Database::CREDENTIALS)

client.query("SELECT * FROM users;")
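
If some callers need slightly different settings (for example a test database), the same idea can be taken one step further. The following is only a sketch under assumptions: `DatabaseSettings`, `DEFAULTS` and `build` are hypothetical names chosen for illustration and are not part of mysql2.

```ruby
# Hypothetical module; the names are illustrative and not part of mysql2.
module DatabaseSettings
  # Frozen so that no caller can change the shared defaults by accident.
  DEFAULTS = {
    host: "127.0.0.1",
    username: "root",
    password: "12345",
    port: 3306,
    database: "development"
  }.freeze

  # Returns a new hash in which the given overrides replace the defaults.
  def self.build(overrides = {})
    DEFAULTS.merge(overrides)
  end
end

test_settings = DatabaseSettings.build(database: "test")
# The resulting hash can be handed to the client the same way as before,
# e.g. Mysql2::Client.new(test_settings).
```

The settings still live in exactly one place, so a changed port is still a one-line edit, while callers that need a variation state only what differs.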

Pros & Cons

Pros

Improved Code Readability: Encapsulating related variables into a single class or data structure makes the code more self-explanatory. It provides a clear abstraction for the group of variables, making it easier for other developers to understand their purpose and relationships.
Reduced Duplication: By centralizing the group of variables into a single entity, you eliminate the need to duplicate their declarations and usage throughout the codebase. This reduces redundancy and makes the code more maintainable.
Enhanced Maintainability: Changes to the structure or behavior of the related variables can be made in one place, rather than scattered across multiple parts of the code. This simplifies maintenance and reduces the likelihood of introducing bugs due to inconsistent updates.
Encourages Modular Design: Encapsulating related variables promotes a more modular design, where each component of the codebase is responsible for a specific concern. This makes the codebase easier to manage, test, and evolve over time.

Cons

Increased Complexity: Introducing a new class or data structure adds an additional layer of abstraction to the codebase, which can sometimes increase complexity. Developers need to understand the relationships between the encapsulated variables and how they interact with other parts of the system.
Over-Engineering: In some cases, addressing a data clump by introducing a new class or data structure may be unnecessary and can lead to over-engineering. If the group of variables is simple and unlikely to change, it may be more appropriate to leave them as they are.
Potential Performance Overhead: Depending on the implementation, encapsulating related variables into a new class or data structure may introduce a slight performance overhead compared to directly accessing primitive variables. However, in most cases, the performance impact is negligible and outweighed by the benefits of improved maintainability and readability.
