Jaa


7 - Implementing a Graph Database

patterns & practices Developer Center

On this page: Download:
What is a Graph? | Designing a Graph Database | Modeling Data as a Graph | Retrieving Data from a Graph Database - Neo4j: Cypher Query Examples, Indexing Nodes and Relationships | Common Use Cases for a Graph Database | How Adventure Works Used a Graph Database to Store Product Recommendations | Designing the Nodes and Relationships for the Product Recommendations Database | Retrieving Product Recommendations from the Database | Implementing a Graph Database to Maximize Scalability, Availability, and Consistency | Maximizing Scalability and Availability | Ensuring Consistency | How Adventure Works Implemented Scalability and Availability for the Product Recommendations Database | Summary | More Information

Download code samples

Download PDF

Order Paperback

The driving force behind most NoSQL databases is to enable you to store information in a large scale, highly available, and optimized database, and then retrieve this information again extremely quickly. The focus of these databases is on putting data in and getting it out again in as efficient a manner as possible. Graph databases operate in a different functional space; their main aim is to enable you to perform queries based on the relationships between data items write application that can analyze these relationships.

If you store information in a relational database, you can use SQL to perform complex and intricate data analyses, but the code for these SQL statements can sometimes be difficult to develop and test, cryptic to maintain, and require considerable resources to run. For this reason, many relational database vendors also offer data warehousing solutions that enable a system to take the data held in a relational database and restructure this data into a form optimized for performing data analysis. However, many such solutions can be expensive, and they often require that you run them on powerful hardware that is tuned to support the demands of the data warehouse software.

A graph database enables you to take a different approach. Rather than defining your data as a series of entities or records and implementing costly run-time processes to deduce the relationships between them, you explicitly design the database around the relationships that your application requires. This strategy enables you to optimize the database for your application. Starting at a given point, your application can easily traverse the network of connections between entities without requiring that you write complex code. Additionally, most graph databases do not require lavish hardware, often running on clusters of commodity computers.

This chapter describes how to design a graph database to support the analytical processing performed by an application, and discusses how Adventure Works used a graph database to implement product recommendations for customers buying goods using the Shopping application.

What is a Graph?

In the world of graph databases, a graph is a connected network of nodes. Each node contains information about a specific entity, and each connection specifies a relationship between entities.

A graph can contain many different types of nodes, and a node can contain properties that store information about that node. For example, in a database that models the departmental structure of an organization, you might create nodes for departments and nodes for employees that work in these departments. The properties of a department node could include the department name and location, while the properties of an employee node could include the employee's name, payroll number, and other relevant details.

In theory, you could implement a graph database using almost any technology, including tables in a relational database. In a relational database you could create tables to hold the information for the different types of entities and use SQL to perform queries that join rows retrieved from multiple tables. This approach is very flexible from a query perspective but it can consume considerable resources at runtime (joining tables is a costly operation) and tends not to scale well. Additionally, if business requirements change and you need to restructure the information that you store about entities, or even add new types of entities, the schema-oriented nature of a relational database can make it difficult to accommodate these changes.

NoSQL graph databases are specifically designed to store information about the relationships between nodes, and enable applications to use this information to traverse a graph extremely efficiently. In a graph database, each connection is directional; it has a start point (a node) and an end point (another, or possibly the same node), and the direction of the connection determines how an application can traverse from the start point to the end point. Additionally, connections can have properties that provide additional information about the relationship between the nodes at the start and end points. These are an important feature because they enable your application to quickly filter out paths that are unnecessary or irrelevant for a specific query. Furthermore, like most NoSQL systems, graph databases do not enforce any form of rigid schema, enabling you to easily incorporate additional types of entities into your solutions without requiring that you restructure the database.

Dn313282.note(en-us,PandP.10).gifPoe says:
Poe When you query a graph database, you specify a node that acts as the starting point and then traverse the relationships from that node. The performance of any given query does not depend on the overall size of the database, only the number of nodes that the query finds. Queries can run in near-linear time even if the volume of data in the database increases exponentially.

Figure 1 shows an example graph database implementing a refinement of the departmental structure for the fictitious organization first presented in Chapter 1, "Data Storage for Modern High-Performance Business Applications." This example separates out the Works For and Manages connections because these are actually separate explicit relationships in a graph database (in a relational database, these connections would be a single bidirectional relationship that could be inferred by using foreign and primary key fields.)

Additionally, this example includes the Employs relationship that indicates the departments in which each employee works. This relationship enables the database to support queries such as "Who works in Marketing?" without having to fetch each employee and examine the Works In relationship.

Finally, this example also defines the property Worked In Since for the Works In relationship, which specifies the starting date of an employee within a department (if you were using a relational database, you would probably need to store this information in a separate table, and any queries would need to join with data from this table). This property belongs to the relationship rather than an Employee node, and this structure enables the database to model the situation where a single employee works for more than one department (2 days a week in Sales, and 3 days a week in Marketing, for example).

Figure 1 - A departmental organization chart, structured as a graph database

Figure 1 - A departmental organization chart, structured as a graph database

Designing a Graph Database

Clearly there are two primary aspects that you need to consider when you design a graph database:

  • The nodes that your application needs to retrieve and process, together with the relevant properties for these entities.
  • The relationships between these entities and the properties of these relationships.

You also need to give careful thought to the mechanics of how the queries that your application performs will use your graph database. All queries must specify a starting point, or query root, and your graph database must enable an application to quickly find the node that acts as the query root for any given query.

Finally, while there are some scenarios that are eminently suited to graph databases, there are other situations that are best handled by using another form of database. You should not attempt to force the data for an application into a graph structure simply because you have a graph database available.

The following sections discuss these concerns in more detail.

Modeling Data as a Graph

When you design a graph database, try to avoid thinking in terms of tables and rows, and instead concentrate on the queries and operations that your application needs to perform. Examine the business use cases and try to identify the questions that these uses cases ask of your database, such as "In which department does <employee name> work?," or "Who does <manager name> manage?" This process can help to identify the most common relationships ("Works In," "Manages") that the database needs to support. You can then analyze these relationships to determine the objects on which they should operate (Employees, Departments, Managers). You should also be aware that some of these objects might actually be instances of the same type of data (Employees and Managers, for example), and that it is their role in a relationship that determines how the application uses them. As part of this process you will probably discover other relationships that you should model, such as "Works For" and "Employs."

Note

Some relationships might be bidirectional, while others might only be unidirectional. For example, Manages is clearly a unidirectional relational. A manager manages an employee, but the converse is not true. An example of a bidirectional relationship could be "Colleague Of" for two employees that work in the same department. If employee A is a colleague of employee B, then employee B will also be a colleague of employee A.

The queries that your application needs to perform will also help to define the properties that the nodes and relationships should include. In the organization example in Figure 1, the application that uses this data displays employee names and identifies employees by using their unique payroll number, so these attributes are included as properties. You should note that most graph databases do not provide the richness for defining the structure of data in a node that other types of NoSQL databases supply. For example, you might be limited to using simple types such as strings, numbers, and dates, and you might not be able to create complex denormalized structures such as nested documents or handle large binary data values. If you need to store and retrieve highly-structured information you can build a polyglot system; save the basic, essential information about entities as nodes in a graph database, and store the remainder in a more appropriate NoSQL database such as a document database. The information that you store in the graph database should include a key that the application can use to quickly lookup the complete document in the document database. Chapter 8, "Building a Polyglot Solution,"**describes this scenario in more detail.

Dn313282.note(en-us,PandP.10).gifPoe says:
Poe Don't try and store highly structured documents in a graph database.

You should pay special attention to the properties that you add to relationships because they can have a significant impact on the performance of the queries performed against the database. When you perform a query, a graph database allows you to qualify the relationships that the query should navigate across by using these properties, enabling the database server to quickly avoid traversing paths that are not relevant to the query and effectively pruning the tree of nodes that it needs to search.

Dn313282.note(en-us,PandP.10).gifPoe says:
Poe In general, you should design the graph database to optimize the queries performed by the application. Focus on the relationships as these will be the most important items that influence the performance of the database and your application.

Retrieving Data from a Graph Database

All graph databases provide a means to enable an application to walk through a set of connected nodes based on the relationships between these nodes. The programmatic interfaces that graph databases provide vary from vendor to vendor, ranging from the simple imperative approach through to more declarative mechanisms. In the simple imperative approach you select a node as a starting point, examine the relationships that this node has with other nodes, and then traverse each relevant relationship to find related nodes, and then repeat the process for each related node until you have found the data that your require or there is no more data to search. In the more declarative approach you select a starting point, specify criteria that filters the relationships to traverse and nodes to match, and then let the database server implement its own graph-traversal algorithm to return a collection of matching nodes. If possible, adopt the declarative approach because this mechanism can help to avoid tying the structure of your code too closely to the structure of the database.

Neo4j: Cypher Query Examples

As an illustration of the declarative approach, this section contains some examples based on the Cypher query language implemented by Neo4j, a popular NoSQL graph database.

This section is not intended to explain how the Cypher query language works. If you require more information, review the Neo4j documentation available online at The Neo4j Manual.

The examples are based on the organization graph shown in Figure 1. Neo4j assigns a unique identity number to each node and relationship when you create them. You can specify the identity number of a node (or nodes) as the starting point for a Cypher query. The examples that follow assume that the employee nodes have the following identity numbers:

1: Sarah
2: Walter
3: Mark
4: Lisa
5: Louise
6: Susan
7: John

The following Cypher query retrieves the nodes for all the employees that report directly to Sarah. The query starts at node 1 (Sarah), and follows all instances of the Manages relationship from that node to find the employees connected directly to node 1:

START manager=node(1)
MATCH manager-[:Manages]->employee
RETURN employee

This query returns a list containing nodes 2 (Walter), 3 (Mark), and 6 (Susan).

The next example shows how to find all employees that report directly or indirectly to Sarah. In this case, the Cypher query uses the "*" modifier to specify that it should traverse all instances of the Manages relationship not just for Sarah, but for every node that it finds connected to Sarah, and for each of the nodes connected to these nodes, and so on:

START manager=node(1)
MATCH manager-[:Manages*]->employee
RETURN employee

This query returns a list containing every employee node except for employee 1 (Sarah does not manage herself).

You can specify multiple starting points for a query by providing a comma-separated list of nodes in the START clause. You can also specify the wildcard character "*" as a shorthand for all nodes (this is not recommended if you have a large database). The following example shows how to find the department in which each employee works by using the Works_In relationship:

START employee=node(*)
MATCH employee-[:Works_In]->department
RETURN department

You can also filter data by using the properties of nodes and relationships. If you wish to limit the previous query to employees that work in departments located in New York, you can perform a query such as this:

START employee=node(*)
MATCH employee-[:Works_In]->department
WHERE department.Location="New York"
RETURN employee

This query returns node 1 (Sarah), and node 3 (Mark). These employees work in different departments (Head Office and Sales), but both departments are based in New York.

If you need to find out which employees work in a specific department, such as Manufacturing (assume that this department has the identity number 11 in the graph database shown in Figure 1), you can make use of the Employs relationship:

START department=node(11)
MATCH department-[:Employs]->employee
RETURN employee

This query returns a list containing nodes 5 (Louise), 6 (Susan), and 7 (John).

Indexing Nodes and Relationships

All queries against a graph database require a node (or a collection of nodes) as a starting point. Rather than referencing nodes through their identity numbers, you typically select nodes by specifying a value for one or more properties. For example, in the organization database, if you want to find all the employees that report directly to a particular manager, you would provide a property value (such as the payroll number) that identifies that manager as the starting point.

In most graph databases, you can define indexes over properties to enable the database server to quickly locate a node based on the values of these properties. In particular, you should create indexes over each of the properties that you use to specify the starting criteria for the searches that your application performs. In the organization example, if you wish to retrieve the employees that work for the manager with the Payroll_Number property of 1004, you should create an index over this property. If you are using Neo4j and the Cypher query language, you can then use the following code to retrieve the employees (this example assumes that the index is called Payroll_Number_Index):

START manager=node:Payroll_Number_Index(Payroll_Number="1004")
MATCH employee-[:Works_For]->manager
RETURN employee

Note

The indexing capabilities of a graph database vary from vendor to vendor. In some cases, the database might maintain indexes automatically, while in others it may be necessary to manually add references to nodes directly to an index as new nodes are inserted into the database. In the latter case, it is important that your application code keeps the indexes up to date otherwise the queries that the application performs might return inaccurate data, or they could omit data that it should have returned.

You may also find indexes beneficial over properties used to filter data in queries (you can think of an index as a sorted key/value store where the value references the node, and an index can quickly help to identify a set of nodes with a property value that falls into a given range). However, you should avoid creating indexes over other properties because they will be unlikely to improve the speed of your queries, and may even be detrimental to the performance of the database (these indexes have to be maintained as nodes are inserted, deleted, or modified, and the more indexes you have the greater the maintenance overhead).

You may also be able to define indexes over the properties of relationships. These indexes can prove extremely useful for queries that filter traversal paths based on these properties, especially if a node has a large number of relationships with other nodes.

Note

Some graph databases support full-text indexing. These indexes are useful if your database stores large textual values and you need to quickly find information based on patterns and regular expressions in these values.

Common Use Cases for a Graph Database

Remember that the purpose of a graph database is to store and maintain information about relationships between nodes, and enable applications to traverse these relationships quickly. Common scenarios that match the model implemented by most graph databases include:

  • Social Networking. A typical social networking database contains a significant number of contacts together with the connections that these contacts have with each other, either directly or indirectly. The database may also reference other types of information such as shared documents or photographs (these items might be physically stored outside of the graph database, following a polyglot architecture as described in Chapter 8). The result is a typically a complex network of friends (connections between contacts) and likes (connections between contacts and documents/photographs).
  • Calculating Routes. You can use a graph database to help solve complex routing problems that would require considerable resources to resolve by using an algorithmic approach. For example, you can easily determine the shortest path between two nodes in a highly connected graph.
  • Managing Networks. A graph database is an ideal repository for modeling complex telecommunications or data networks. Using a graph database can help to spot weak points in the network, and perform failure analysis by examining what might happen if one or more nodes becomes unavailable.
  • Generating Recommendations. You can use a graph database as a recommendations engine. For example, in a retail system, you can store information about which products are frequently purchased together. You can use the resulting graph to generate "other customers that bought xyz also bought abc" recommendations when the customer views the details for a product.
  • Mining Data. You can traverse the relationships in a graph database to investigate the direct and indirect interactions between nodes. These interactions can help you to spot trends in your data and enable you to act on these trends.
  • Determining Security and Access Rights. Many distributed systems need to be able to query and track the access rights that have been granted to a large number of users over an equally large volume of resources. When an application requests access to a resource on behalf of a user, it is important to be able to resolve this request quickly and accurately. A simple and efficient solution is to create graph database that stores information about users and resources, and implements access rights as the relationships between these entities.
Dn313282.note(en-us,PandP.10).gifPoe says:
Poe Some graph database servers incorporate common graph traversal algorithms such as "find the shortest path between nodes," "find all paths between nodes," and "find the cheapest path between nodes" (implementing the Dijkstra algorithm that adds weighting and costs to the paths between neighboring nodes, so the cheapest path from one node to another is not necessarily the shortest). You specify the node to use as the start and end points, and any parameters that the algorithm requires, and the database server returns a list of paths.

In a typical graph database, neither relationships nor nodes are necessarily static. Graph databases excel at queries that traverse connections between nodes, but they are not designed to support high levels of online transaction processing. So, ideally, the volume updates to nodes should be small compared to the number of queries being run. Performing an update that modifies a large number of unrelated nodes can be expensive because the graph database needs to locate each node; nodes are not stored in the well-defined aggregations common to other types of NoSQL databases. In the organization example shown earlier in this chapter, if the graph database stores employee details as nodes and includes the salary as part of this information, performing an update operation that raises the salary for a large proportion of the employees requires that you find each employee individually (this is a different process from finding nodes that have a relationship with other nodes). As described previously, defining indexes can help to speed up the process of finding nodes that match a specific pattern, but the database will also need to maintain these indexes as nodes are added, modified, or deleted.

Modifications to relationships tend to be less expensive than making changes to nodes, usually due to the nature of the business logic that drives these modifications. In the organization example shown earlier, if a manager moves to a new department, it is relatively easy to traverse the Manages relationships for that manager to find all the employees that reported to that manager and assign them to a new manager instead.

Dn313282.note(en-us,PandP.10).gifPoe says:
Poe If you need to combine the ability to frequently modify the data in nodes with queries that involve complex, dynamic relationships, then use two different NoSQL databases. Store the data as entities in a document or column-family database, and reference these entities from the nodes in the graph database. Chapter 8, "Building a Polyglot Solution,"provides an example.

It is also important to realize that a graph database is typically structured to optimize specific queries, and you define the relationships between the nodes in the database that support these queries. This means that a graph database might not be the best type of repository to support truly ad hoc searches; such queries tend to be based on relationships that are discovered once the data has been loaded into the database. Similarly, a graph database may not suitable for performing calculations that involve aggregating data unless these aggregations have been predefined by using map/reduce functionality (if the graph database supports it). For example, in the organization database, if you need to calculate the average salary of every employee in the company, or even in a particular department, then you may be able to implement map/reduce functions and store the results directly in the database. However, if you need to calculate results from ad hoc aggregations, such as determining the average salary for every employee who is based in Boston and has started working for the company since 01/01/2013, then storing the information in a relational database or a column-family database might be more efficient.

Note

You can define indexes over properties in nodes and relationships to help speed up direct access to these items. Using these indexes might improve the performance of some ad hoc aggregate calculations by enabling you to iterate through data in a specific sequence rather than traversing a graph, but you should avoid defining too many indexes because the maintenance overhead that they incur may slow your database down.
Some graph databases provide built-in query facilities for calculating simple aggregate values such as Count, Sum, Average, Max, and Min, without writing your own map/reduce functions. However, it can still be expensive to use these features in an application.

How Adventure Works Used a Graph Database to Store Product Recommendations

In the Shopping application, when a customer views the details of a product, the application displays a list of items that other customers who bought this product also purchased. Figure 2 shows an example.

Figure 2 - The Shopping application showing the details of a product and a list of related recommended products

Figure 2 - The Shopping application showing the details of a product and a list of related recommended products

To keep the list of product recommendations manageable and relevant, the list only contains the five most frequently purchased related products.

It is important to ensure that the application can find recommended products for any given product quickly and easily, so the developers at Adventure Works decided to store product recommendations in a graph database.

Note

The developers selected Neo4j to implement the graph database. However, this choice was simply because this is the graph database with which they are most familiar, and they could have selected one of a number of graph databases from other manufacturers.

Designing the Nodes and Relationships for the Product Recommendations Database

Initially, the developers at Adventure Works implemented the database as a graph containing Order and Product nodes connected by Contains and IncludedIn relationships. The database was prepopulated with Product nodes containing the details of each product. When a customer placed an order, an Order node was added and a Contains relationship was established from order to each product in the order.

Note

An Order node is denormalized structure that contains a subset of the order header information together with some details of each item in the order. Remember, that the full details of each order are recorded in the document database; the graph database only contains the information necessary to generate the product recommendations.

Similarly, an IncludedIn relationship was created from the Product node back to the order, as shown in Figure 3. When a customer viewed a product, the Shopping application used an index created over the Product ID property to quickly find the Product node in the graph database, and then followed the IncludedIn relationships to find all orders for that product. The Shopping application then used the Contains relationship to find any other products that these orders contain.

Figure 3 - The initial version of the product recommendations database

Figure 3 - The initial version of the product recommendations database

After evaluating this approach, the developers realized that although it is easy to find products that were purchased together, it was not easy to determine the frequency of these purchases. This made it difficult to ascertain which products the Shopping application should display as its top five recommendations for any given product. To perform this calculation, the Shopping application has to visit each order and count the number of times each product is referenced.

In an attempt to optimize this approach, the developers added a ReferencedBy property to each Product node. This property is initialized to zero when the database is populated initially, and incremented each time an order is placed for that product. However, this approach still meant that the Shopping application had to visit each product to find the number of times it has been ordered, and the overhead of maintaining the ReferencedBy property degraded the performance and caused contention.

Note

Graph databases typically require you to wrap updates in transactions, which lock the data being updated while the transaction is being performed. For more information, see the section "Ensuring Consistency" later in this chapter.

The developers realized that they did not actually need to store information about orders in the database, and could simply hold the product recommendations as a relatively simple network of RecommendedProduct nodes containing the product ID and the name of the product. The nodes are connected by a ProductRecommendation relationship. For example, if product B is purchased as part of the same order as product A, the RecommendedProduct node for product A has a ProductRecommendation relationship with product B. The developers also realized that they could store information in the ProductRecommendation relationship about the frequency with which product B is purchased as part of an order with product A compared to other products. This frequency is calculated as a percentage. For example, 20% (0.2) of the orders for product B might also occur in orders for product A, whereas only 10% (0.1) of the orders for another product (product C) might occur in orders for product A. The result is a network similar to that shown in Figure 4. Storing this information in the ProductRecommendation relationships enables the Shopping application to quickly determine the top five related products for any given product.

Figure 4 - A network of recommended products in the graph database

Figure 4 - A network of recommended products in the graph database

Note

The ProductRecommendation relationship is unidirectional because of the Percentage property. 20% of the orders for product B might occur with product A, but the converse is not necessarily true and only 5% of the orders for product A might occur with product B.

When a customer browses a product, the Shopping application uses the product ID to find the node for this product in the graph database and then it retrieves the list of recommended products by traversing the ProductRecommendation relationships from this node. To ensure that the application can locate the node for any given product quickly, the developers at Adventure Works created an index called product_id_index over the ProductId property in the RecommendedProduct nodes in the database.

Note

Due to the volume of orders being continually placed by customers, the Shopping application does not attempt to update the product recommendations database in real time. For more information, see the section "How Adventure Works Implemented Scalability and Availability for the Product Recommendations Database" later in this chapter.

Retrieving Product Recommendations from the Database

As described in Appendix A, "How the MvcWebApi Web Service Works," the developers followed the Repository pattern to connect to the graph databases and isolate the database-specific code from the rest of the application. The ProductsController class in the web service creates a ProductRecommendationRepository object, and the GetRecommendations method in the ProductsController class uses this object to retrieve product recommendations. The ProductRecommendationRepository class implements the IProductRecommendationRepository interface shown in the following code example:

public interface IProductRecommendationRepository
{
    ICollection<RecommendedProduct> GetProductRecommendations(int productId);
}

The GetProductRecommendations method takes the ID of a product and returns a list of RecommendedProduct domain entity objects associated with this product. The RecommendedProduct class contains the product ID and product name from the corresponding RecommendedProduct node in the graph database, but it also contains the value of the Percentage property from the ProductRecommendation relationship that links the product specified in the productId parameter with the product identified by the RecommendedProduct object. The RecommendedProduct class looks like this:

public class RecommendedProduct
{
    public int ProductId { get; set; }
    public string Name { get; set; }
    public decimal Percentage { get; set; }
}

The ProductRecommendationRepository class uses a series of APIs provided by Neo4j to query the database. Specifically, the code uses a programmatic interface to Cypher provided by the Neo4jClient library for the .NET Framework. The URL of the database to use is specified by using the constructor.

You can find more information about the Neo4jClient library for the .NET Framework online at the Neo4jClient wiki.

The GetProductRecommendations method utilizes the product_id_index in the database to find the node that matches the value in the productId parameter, and then traverses the ProductRecommendation relationship from this node to find all product recommendations for this product. Each product recommendation is returned from Neo4j as a ProductGraphEntity object. This class is specific to the implementation in the Neo4j database (Neo4j saves the data as JSON formatted objects). The GetProductRecommendations method reformats the data and converts it into a collection of RecommendedProduct domain objects. The following code sample shows the details of the GetProductRecommendations method:

Note

In the database, the physical name of the logical ProductRecommendation relationship is PRODUCT_RECOMMENDATION.

public class ProductRecommendationRepository : IProductRecommendationRepository
{
    ...
    private static GraphClient client;

    ...

    public ICollection<RecommendedProduct> GetProductRecommendations(
        int productId)
    {
    ...

        var recommendedProducts = new List<RecommendedProduct>();

        var graphResults = ProductRecommendationRepository.client
            .Cypher
            .Start(new { product = 
                Node.ByIndexLookup("product_id_index", "productId", productId) })
            .Match("product-[r:PRODUCT_RECOMMENDATION]->recommended")
            .Return<ProductGraphEntity>("recommended").Results;

        foreach (var graphProduct in graphResults)
        {
            recommendedProducts.Add(new RecommendedProduct()
            {
                Name = graphProduct.Name,
                Percentage = graphProduct.Percentage,
                ProductId = graphProduct.ProductId
            });
        }

        return recommendedProducts;
    ...

    }
}

Note that the graph database contains only the top five recommendations for each product. This approach simplifies the query that the GetProductRecommendations method has to perform. The section, "How Adventure Works Implemented Scalability and Availability for the Product Recommendations Database" later in this chapter describes how the graph database is populated.

Implementing a Graph Database to Maximize Scalability, Availability, and Consistency

In common with other forms of NoSQL database, an important concern when using a graph database is utilizing the features that it provides to ensure good performance. Graph databases are frequently employed in large-scale environments such as web search engines and social networking applications, and they have to handle queries that can span vast amounts of data quickly and efficiently.

Maximizing Scalability and Availability

The usual approach to maximize scalability in a NoSQL database is to implement sharding*.* With a graph database, sharding is less useful. In theory, any node in the database can be related to any other node, and splitting the data across servers may slow down the performance of queries rather than speeding them up. Furthermore, many graph databases do not permit nodes connected by a relationship to be distributed across different servers. If you have domain-specific knowledge about nodes, the relationships between them, and the way in which applications query these relationships, you might be able to partition the nodes into isolated groups and store each group on a separate server. For example, in the organization graph described earlier in this chapter, if the organization is a multi-national corporation, you could store the employee and department nodes for departments for each territory in separate databases hosted by servers located in those territories. Strictly speaking, this is not sharding because each server holds a stand-alone database, and the application that queries and maintains the information in the database has to be able to direct operations to the most appropriate database. Figure 5 illustrates this architecture.

Figure 5 - Partitioning organization information across territories

Figure 5 - Partitioning organization information across territories

In this example, the Manufacturing department is located in Western Europe (London) while the Marketing and Sales departments and Head Office are based in the eastern United States (Boston and New York). Users view the data through an application that runs in the cloud using Windows Azure. The application has been deployed to Windows Azure datacenters in the Western Europe and Eastern United States regions. The databases and database servers are also deployed to the cloud as Windows Azure Virtual Machines. To reduce network latency, a user connects to an instance of the application that is the same region as the user. Most of the operations performed by the application send requests to the database server in the same datacenter, but if the application requires data that is not held locally it can connect to the database server in the other, remote datacenter.

Note

If a single manager manages employees in different departments across different regions, then the database in each region may need to contain a node for that manager, depending on the queries that the application needs to run. In the example in Figure 5 a node for Sarah has been created in both databases.

You can use replication to improve read scaling. Most commercial graph databases provide support for primary/secondary replication. Write operations are directed to a single primary database, and the replicas provide support for concurrent read operations, although there may be some latency while the data is replicated to the secondary databases. Of course, replication is also a useful technology for ensuring the availability of a graph database. If the computer hosting the primary database fails, one of the secondary servers can be promoted to the primary role, and the original primary server can be enlisted as a secondary server once it has been recovered.

Ensuring Consistency

Most graph databases implement a limited form of internal consistency between nodes and relationships. The reason for this concerns the need to maintain consistency between nodes and relationships. Most graph databases enforce referential integrity checks between nodes and relationships. For example, you cannot create a relationship between nonexistent nodes, and you cannot remove a node that is involved in a relationship. Therefore, unlike other types of NoSQL databases, when you add or modify information in a graph database, you may need to do so in the context of an ACID transaction. If any of the operations in the transaction fail then they are all undone.

Note

In common with many other NoSQL databases, to maximize throughput, a large number of graph databases prefer to cache data in memory and only write to disk when it is absolutely necessary. To prevent data loss, graph databases that follow this approach implement append-only logging, whereby the details of each transaction are added to the end of a transaction log (on disk) when the transaction completes. The transaction is not marked as successful until the transaction log record has been saved. The database file on disk may eventually be updated when records are flushed from memory when the system starts to run short of memory or when the database server is shut down. Appending data to a log file is significantly faster than seeking a location within a database file, making the necessary change, and moving other data around if the change has caused a node to grow. The cost is the increased recovery time in the event of database failure when the database server has to roll the database forward and apply all changes recorded in the log file that had not been written to the database.

To ensure consistency during transactions, many graph databases lock the data being modified, inserted, or deleted. Therefore, it is possible for concurrent transactions to cause deadlock. You should minimize the chances of this happening by designing your applications carefully; always lock resources in a prescribed sequence, and keep transactions as short as possible. Short transactions can also help to alleviate lengthy response times caused by blocking. If two applications attempt to modify the same data, the second application will be blocked by the first, and will only be able to continue once the first application has completed its transaction and released its locks.

If your graph database is subject to a high volume of insert and delete operations, you may be able to improve scalability by configuring a primary/secondary replication topology with writable secondary servers. In this case, write operations can be directed towards the primary server or any writable secondary server, but the transaction surrounding this operation will usually be marked as complete only when the secondary server has successfully synchronized the data with the primary server. Be warned that using writable secondary servers can lead to conflict if two concurrent transactions attempt to modify the same data; the transaction that finishes last may fail if the primary server detects that the data the transaction wishes to modify has already been changed by another transaction using a different secondary server.

How Adventure Works Implemented Scalability and Availability for the Product Recommendations Database

In the Shopping application, product recommendations are displayed when the user browses to a product. This function is performed very frequently, so it is important that product recommendations are retrieved quickly to avoid degrading the customer's experience. The previous chapters have all described how the Shopping application is deployed across multiple Windows Azure datacenters. The system implements read scaling and high availability by using Neo4j Enterprise to replicate the product recommendations database across a collection of virtual machines hosted at each datacenter. Neo4j implements the primary/secondary replication model, and an instance of the Shopping application running in a given datacenter will be directed towards one of the replicas in the same datacenter when it needs to retrieve and display product recommendations.

The developers at Adventure Works decided not to update product recommendations in real time when customers place orders. Instead, they implemented a separate offline batch process to perform this task. This process runs monthly and is outside the scope of the Shopping application. It trawls the list of orders placed that month, and uses this information to generate a list of new product recommendations based on sales recorded in the SQL Server database used by the warehousing system inside Adventure Works. In addition, the batch process determines the top five product recommendations for any given product and only stores these recommendations. This strategy simplifies the queries that the Shopping application performs as it does not have to do any additional filtering other than selecting the recommendations for a product.

The new recommendations replace the existing recommendations in the graph database managed by the primary server in the replication cluster, so the data reflects only the most recent trends in product purchases (as a future extension, the developers might archive the existing recommendations in a separate database before overwriting them, enabling Adventure Works to perform an analysis of historical trends in which products customers purchase). The results are subsequently propagated to the secondary servers. The secondary servers can continue to support read requests (possibly with old data) while the product recommendations are being updated, so there is no down time required.

Summary

This chapter has described how to use a graph database to store information about entities and their relationships.

A graph database is an ideal repository for applications that need to retrieve information about collections of related objects very quickly. You design a graph database to support the specific queries that your applications perform. If you need to perform more generalized, ad hoc queries then an alternative such as a relational database or column-family database may be more appropriate.

You should avoid trying to use a graph database as a store for large, structured entities. If necessary, use a graph database as part of a polyglot solution in conjunction with a document database or key/value store; record information about relationships in the graph database, but save the detailed information in the document database or key/value store.

This chapter also described how Adventure Works used a graph database to hold product recommendations for the Shopping application. It discussed the information that is stored for each product recommendation, and how the product recommendations entities are connected to enable the Shopping application to fetch recommendations for any valid product quickly and easily.

More Information

All links in this book are accessible from the book's online bibliography on MSDN at: https://msdn.microsoft.com/en-us/library/dn320459.aspx.

Next Topic | Previous Topic | Home | Community