Analysis of commonly used together ingredients

Analysis of commonly used together ingredients (Preparations)


The database you start with must contain all of the data you loaded in the setup for this course.

This is what you will see when you click the database icon database icon.

DatabaseBeforeIngredientAnalysis

If you do not see this in your Neo4j Browser, you will need to perform the setup, Load Data, and Node Similarity steps again.

Analysis of commonly used together ingredients (Overview)


Now that you have familiarized yourself with graph algorithms and how to execute them in the Graph Data Science library, you will do a quick graph analysis of the Recipes Dataset. In this exercise, you will find communities of commonly used together ingredients and the representatives of those communities:

  • Part 1: Project the ingredients graph to Graph Catalog with native projection.

  • Part 2: Run the Node Similarity algorithm to evaluate the distribution of the Jaccard index similarity score.

  • Part 3: Run the Node Similarity algorithm to mutate a relationship between a pair of commonly used together ingredients.

  • Part 4: Run the Louvain Modularity algorithm to add a community property to each node.

  • Part 5: Write the relationships stored in a named in-memory graph back to Neo4j.

  • Part 6: Investigate the community detection results.

  • Part 7: Inspect the top five largest communities of ingredients.

  • Part 8: Find representatives of the top five communities with the PageRank algorithm.

Go to the next page to start this exercise.

Part 1: Project the ingredients graph to Graph Catalog with native projection. (Instructions)


This time, you will compare how often the ingredients are used in the same recipes instead of how similar the recipes are. The direction of relationships in the projected graph determines which set of nodes will be compared. In the previous section, you used the default NATURAL orientation of the CONTAINS_INGREDIENT relationships to compare how similar are the recipes. In this section, you will use the REVERSE orientation of the CONTAINS_INGREDIENT relationships to compare how often are the ingredients used in a common recipe.

Write Cypher code to project the Ingredient graph to Graph Catalog using these guidelines:

  • Name the graph ingredient.

  • Project all nodes labeled Ingredient and Recipe.

  • Project all relationships with a type CONTAINS_INGREDIENT.

  • Relationships must use the REVERSE orientation.

Hint: You will call gds.graph.create.

Part 1: Project the ingredients graph to Graph Catalog with native projection. (Solution)


Write Cypher code to project the Ingredient graph to Graph Catalog using these guidelines:

  • Name the graph ingredient.

  • Project all nodes labeled Ingredient and Recipe.

  • Project all relationships with a type CONTAINS_INGREDIENT.

  • Relationships must use the REVERSE orientation.

Hint: You will call gds.graph.create.

Here is the solution code:

CALL gds.graph.create('ingredient',
  ['Ingredient','Recipe'],
  {CONTAINS_INGREDIENT:{
      type:'CONTAINS_INGREDIENT',
      orientation:'REVERSE'}})

You will see these results:

EXIA.2

Part 2: Run the Node Similarity algorithm to evaluate the distribution of the Jaccard index similarity score. (Instructions)


You will evaluate the Jaccard similarity score distribution to help you fine-tune the topK and similarityCutoff parameters of the Node Similarity algorithm in the next step.

Write Cypher code to examine the similarity score distribution between ingredients following these guidelines:

  • The algorithm must use the projected graph ingredient, which is stored in the graph catalog.

  • Use the stats mode of the Node Similarity algorithm.

  • YIELD the following result: nodesCompared, similarityDistribution.

Hint: You will call gds.nodeSimilarity.stats.

Part 2: Run the Node Similarity algorithm to evaluate the distribution of the Jaccard index similarity score. (Solution)


You will evaluate the Jaccard similarity score distribution to better fine-tune the topK and similarityCutoff parameters of the Node Similarity algorithm in the next step.

  • The algorithm must use the projected graph recipes, which is stored in the graph catalog.

  • Use the stats mode of the Node Similarity algorithm.

  • YIELD the following result: nodesCompared, similarityDistribution.

Hint: You will call gds.nodeSimilarity.stats.

Here is the solution code:

CALL gds.nodeSimilarity.stats('ingredient')
YIELD nodesCompared, similarityDistribution

The results returned will look like this:

EXIA.2

 

The average Jaccard similarity score is relatively low at 0.15 value. Due to the low average similarity score, we will have to select a low similarity cutoff value in the next step. Otherwise, we might infer too sparse of a similarity network, which will not yield relevant results.

Part 3: Run the Node Similarity algorithm to mutate a relationship between a pair of commonly used together ingredients. (Instructions)


You will use the mutate mode of the Node Similarity algorithm to store the results back to the named in-memory graph. The COMMONLY_USED_TOGETHER relationships will be stored back to the projected in-memory named graph instead of the Neo4j stored graph. This way, you can use the results of the Node Similarity algorithm as an input to the Community detection algorithm in the next step without having to recreate another in-memory named graph.

Write Cypher code to execute the Node Similarity algorithm on the Ingredients graph using these guidelines:

  • The algorithm must use the projected graph ingredient, which is stored in the graph catalog.

  • Use the mutate mode of the Node Similarity algorithm.

  • The algorithm will mutate a relationship with a type COMMONLY_USED_TOGETHER between a pair of ingredients.

  • The algorithm will mutate a property named score to each relationship with the computed value.

  • Specify a similarity cutoff threshold of 0.3.

  • Specify the topK parameter of 10.

  • YIELD the following result: nodesCompared, relationshipsWritten.

Hint: You will call gds.nodeSimilarity.mutate.

Part 3: Run the Node Similarity algorithm to mutate a relationship between a pair of commonly used together ingredients. (Solution)


Write Cypher code to execute the Node Similarity algorithm on the Ingredients graph using these guidelines:

  • The algorithm must use the projected graph ingredient, which is stored in the graph catalog.

  • Use the mutate mode of the Node Similarity algorithm.

  • The algorithm will mutate a relationship with a type COMMONLY_USED_TOGETHER between a pair of ingredients.

  • The algorithm will mutate a property named score to each relationship with the computed value.

  • Specify a similarity cutoff threshold of 0.3.

  • Specify the topK parameter of 10.

  • YIELD the following result: nodesCompared, relationshipsWritten.

Hint: You will call gds.nodeSimilarity.mutate.

Here is the solution code:

CALL gds.nodeSimilarity.mutate('ingredient',{
   mutateProperty:'score',
   mutateRelationshipType:'COMMONLY_USED_TOGETHER',
   similarityCutoff:0.30,
   topK:10})
YIELD nodesCompared, relationshipsWritten

The results returned will look like this:

EXIA.3

 

The algorithm has written 1260 similarity relationships between 1384 nodes. Even with the low similarity cutoff value, the inferred network is still relatively sparse. You could lower the similarity cutoff value even more, but then some relationships between not so commonly used together ingredients will be inferred.

Part 4: Run the Louvain Modularity algorithm to add a community property to each node. (Instructions)


You will use the Louvain Modularity algorithm to inspect the community structure of the inferred similarity network between Ingredient nodes. Because you used the mutate mode of the Node Similarity algorithm, the COMMONLY_USED_TOGETHER relationship is already available in the ingredients named graph.

Write Cypher code to perform the weighted Louvain Modularity algorithm on the Ingredient graph using these guidelines:

  • The algorithm must use the projected graph ingredient, which is stored in the graph catalog.

  • Use the write mode of the Louvain Modularity algorithm.

  • The algorithm will consider only relationships with a type COMMONLY_USED_TOGETHER.

  • The algorithm will consider only nodes with a label Ingredient.

  • The algorithm will write a property named ingredient_community to each node with the computed value.

  • The relationship weight property name is score.

  • YIELD the following results: modularity, ranLevels, communityCount.

Hint: You will call gds.louvain.write.

Part 4: Run the Louvain Modularity algorithm to add a community property to each node. (Solution)


Write Cypher code to perform the weighted Louvain Modularity algorithm on the Ingredient graph using these guidelines:

  • The algorithm must use the projected graph ingredient, which is stored in the graph catalog.

  • Use the write mode of the Louvain Modularity algorithm.

  • The algorithm will consider only relationships with a type COMMONLY_USED_TOGETHER.

  • The algorithm will consider only nodes with a label Ingredient.

  • The algorithm will write a property named ingredient_community to each node with the computed value.

  • The relationship weight property name is score.

  • YIELD the following results: modularity, ranLevels, communityCount.

Hint: You will call gds.louvain.write.

Here is the solution code:

CALL gds.louvain.write('ingredient',
  {nodeLabels:['Ingredient'],
   relationshipTypes:['COMMONLY_USED_TOGETHER'],
   writeProperty:'ingredient_community',
   relationshipWeightProperty:'score'})
YIELD modularity, ranLevels, communityCount

The results returned will look like this:

EXIA.4

 

The algorithm found three hierarchical levels of communities with a total of 952 communities on the last level. You already knew that the inferred similarity network is sparse, so a high number of communities is not surprising. Probably a lot of communities consist of only a single node. Next, you will investigate single node communities to test our hypothesis.

Part 5: Write the relationships stored in a named in-memory graph back to Neo4j. (Instructions)


To investigate the inferred similarity network, you have to write the mutated relationships stored in named graph back to Neo4j.

Write Cypher code to write the mutated COMMONLY_USED_TOGETHER relationship back to Neo4j using these guidelines:

  • The procedure must use the projected graph ingredient, which is stored in the graph catalog.

  • The procedure will write relationships with a type COMMONLY_USED_TOGETHER back to Neo4j.

Hint: You will call gds.graph.writeRelationship.

Part 5: Write the relationships stored in a named in-memory graph back to Neo4j. (Solution)


Write Cypher code to write the mutated COMMONLY_USED_TOGETHER relationship back to Neo4j using these guidelines:

  • The procedure must use the projected graph ingredient, which is stored in the graph catalog.

  • The procedure will write relationships with a type COMMONLY_USED_TOGETHER back to Neo4j.

Hint: You will call gds.graph.writeRelationship.

Here is the solution code:

CALL gds.graph.writeRelationship('ingredient', 'COMMONLY_USED_TOGETHER', 'score')

You will see these results:

EXIA.5

Part 6: Investigate the community detection results. (Instructions)


Now, you can investigate the single node communities hypothesis by looking at Ingredient nodes that have no COMMONLY_USED_TOGETHER relationship.

Write Cypher code to find the count of Ingredient nodes that have no COMMONLY_USED_TOGETHER relationship.

Part 6: Investigate the community detection results. (Solution)


Write Cypher code to find the count of Ingredient nodes that have no COMMONLY_USED_TOGETHER relationship.

Here is the solution code:

MATCH (i:Ingredient)
WHERE NOT (i)-[:COMMONLY_USED_TOGETHER]-()
RETURN count(*) AS count

The results returned will look like this:

EXIA.6

 

There are 743 nodes without any COMMONLY_USED_TOGETHER relationship, and consequently, there are 743 communities that contain only a single node. This means that out of a total of 952 communities found by the Louvain Algorithm, only 209 of them consist of more than a single node.

Part 7: Inspect the top five largest communities of ingredients. (Instructions)


In this section, you will inspect the five largest communities of commonly used together ingredients.

Write a query to return all ingredient_community values of the Ingredient nodes. For each community id, return the size of the community, and the list of Ingredient names.

  • Return a list of 3 ingredients for each community.

  • Order the results by component size descending.

  • Limit it to the top five results.

Part 7: Inspect the top five largest communities of ingredients. (Solution)


Write a query to return all ingredient_community values of the Ingredient nodes. For each community id, return the size of the community, and the list of Ingredient names.

  • Return a list of 3 ingredients for each community.

  • Order the results by component size descending.

  • Limit it to the top five results.

Here is the solution code:

MATCH (i:Ingredient)
RETURN i.ingredient_community as community,
       count(*) as communitySize,
       collect(i.name)[..3] as ingredients
ORDER BY communitySize DESC LIMIT 5

The results returned will look like this:

EXIA.7

 

The largest community contains 18 ingredients, while the second largest community consists of 15 ingredients. On average, the communities are relatively small. It seems that our recipes dataset contains a variety of dishes that do not have many ingredients in common. You have returned random three members for each community, which might not be the most representative members.

Part 8: Find representatives of the top five communities with the PageRank algorithm. (Instructions)


The PageRank algorithm regards each relationship as a vote of importance or influence. In the context of our ingredients analysis, each relationship represents a commonly used together connections. Ingredients that are the most commonly used together with other common ingredients will rank the highest. You can regard them as the representatives of those communities. When calculating each community’s representatives, you want to consider only nodes and links of a specific community. You will use an anonymous graph with cypher projection to filter only nodes and links between each specific community. This way, you will run the PageRank algorithm for each community separately.

The query below provides a template for computing representatives of each community with PageRank. Update the query to:

  • Compute the representatives for the top five largest communities.

  • Update the relationshipQuery to match all the COMMONLY_USED_TOGETHER relationships between Ingredient nodes in a specific community.

  • Return the community id, community size, and the top three representatives for each community.

MATCH (i:Ingredient)
WITH i.ingredient_community as community,
     count(*) as communitySize

// Order by community size and limit the top five largest communities

CALL gds.pageRank.stream({
  nodeQuery:'MATCH (i:Ingredient) WHERE i.ingredient_community = $community
             RETURN id(i) as id',

  relationshipQuery: // Match all COMMONLY_USED_TOGETHER relationships between nodes in a specific community

  relationshipWeightProperty:'weight',
  parameters:{community:community}})
YIELD nodeId, score
WITH community, communitySize, nodeId, score
ORDER BY score DESC
// Return the community, communitySize, and the top three representatives for each community

Part 8: Find representatives of the top five communities with the PageRank algorithm. (Solution)


The query below provides a template for computing representatives of each community with PageRank. Update the query to:

  • Compute the representatives for the top five largest communities.

  • Update the relationshipQuery to match all the COMMONLY_USED_TOGETHER relationships between Ingredient nodes in a specific community.

  • Return the community id, community size, and the top three representatives for each community.

Here is the solution code:

MATCH (i:Ingredient)
WITH i.ingredient_community as community,count(*) as communitySize
ORDER BY communitySize DESC LIMIT 5
CALL gds.pageRank.stream({
  nodeQuery:'MATCH (i:Ingredient) WHERE i.ingredient_community = $community
             RETURN id(i) as id',
  relationshipQuery:'MATCH (s:Ingredient)-[r:COMMONLY_USED_TOGETHER]->(t:Ingredient)
                     WHERE s.ingredient_community = $community AND t.ingredient_community = $community
                     RETURN id(s) as source, id(t) as target,r.score as weight',
  relationshipWeightProperty:'weight',
  parameters:{community:community}})
YIELD nodeId,score
WITH community, communitySize, nodeId, score
ORDER BY score DESC
RETURN community, communitySize, collect(gds.util.asNode(nodeId).name)[..3] as representatives
ORDER BY communitySize DESC

The results returned will look like this:

EXIA.8

Analysis of commonly used together ingredients: Taking it further


  1. Change the similarityCutoff and topK parameters to see how it affects the results.

  2. Try using Overlap Similarity instead of Node Similarity algorithm.

  3. Try doing the same analysis for recipes instead of ingredients.

Analysis of commonly used together ingredients (Summary)


In the exercise you used a number of graph algorithms to analyze data.