28th March 2021
I suddenly had the urge to look into recommender systems seeing as they're literally everywhere these days. I wanted to find out what techniques were used to get them running. A fundamental technique I found is a machine learning approach with the use of matrix factorisation.
However, once I started looking for datasets that I could play around with, I found that there were a lot of relationships that existed between data points within these datasets. A song may feature in a playlist, and other songs may feature in the same playlist. The playlist itself would have particular artists that feature often in it.
This instantly made me think of graphs and how they could be used in this situation. What if each entity could be stored as a node, and any other entities that node is linked to has its relationship stored too? We could then traverse the graph to exploit these existing relationships to get recommendations.
graphs == Neo4j
That's when I started straying away from looking into ML methods and remembered a graph database I had heard about a while ago but never got round to trying, called Neo4j. It's the most popular graph database around. Neo4j is a schema-less NoSQL database (while being ACID compliant).
It basically is what it says it is. A graph database, which can be made up of nodes and directional relationships that go between those nodes. On top of this, each node and relationship can have key-value properties.
With just nodes and relationships to be stored, once your existing data is interpreted with a graph-based approach in mind, it becomes quite easy to store the data as a graph DB. Once you do so, that's when you can really leverage the relationships to perform interesting queries with ease that would otherwise require complex statements (e.g. many JOINs) in an SQL database.
Interestingly, reading the use cases for Neo4j proved my intuition to be right. Graph DBs are well suited to act as recommendation systems. Relationships between items can be used to perform content-based filtering, such as showing songs that often appear together in playlists. Meanwhile, relationships between users can be used to perform collaborative-filtering. Like the seminal hit song Despacito? Then use the graph to show me other songs in the same genre that are liked by other users who also like Despacito! That's a bit of a handful but trust me it makes sense...
Just like other database engines, Neo4j comes with its own query language called Cypher. In my opinion, it's a very intuitive query language with which you can get started really quickly.
Similar to SQL, there are a number of keywords used to perform operations. Some of them are carried over from SQL as is, but the most notable difference is MATCH
which is the equivalent of SQL's SELECT
.
To show how Cypher works with Neo4j, I'll be using flight data that I got from OpenFlights.
Below you can see what nodes and relationships look in Cypher:
// Nodes are defined inside parenthesises, with their label/type placed after the colon. The value before the colon is used to represent the node using a variable for the remainder of the statement.
(a:Airport)
// Nodes (and relationships) can also be given properties which are defined as key-value pairs placed within curly brackets.
(c:City {name: "London"})
// Relationships are defined within square brackets surrounded by dashes. Either dash is appended with < or > to determine the direction of the relationship.
(a:Airport {name: "London Heathrow"}) -[:IN]-> (c:City {name: "London"})
Notice how relationship names are defined in a natural way. You could say that the word, or phrase, that you would describe the connection as in speech, is used as the relationship name. This is intentional, as Neo4j encourages a human-readable structure to its graphs.
Creating a node for London Heathrow airport, which has a relationship to its location is as simple as the following queries:
// Initialise nodes
CREATE (ap:Airport {name: "London Heathrow", iata: "LHR"})
CREATE (ci:City {name: "London"})
CREATE (co:Country {name: "United Kingdom"})
// Create relationships between nodes while referring to them with their variable names
CREATE (ap) -[:IN]-> (ci) -[:IN]-> (co)
Neo4j has an in-built GUI that can be accessed via a desktop client or at a pre-defined local port which allows you to perform queries and visualise your graph DB. This is really handy with graphs being an inherently visual data structure.
Below is a screenshot of a visualisation when querying for all airports that are located in a city named "London". By querying each airport's respective city and country, we can see below an expanded form of our results.
The image shows that the query has returned three subgraphs, which may be a bit surprising at first if you're expecting to see just one city named London 🇬🇧. However, the results tell us that there are three cities named as such, found in the UK, US and in Canada. Furthermore, we can see every airport that exists in each of these London(s).
Nodes and relationships can be selected to view their properties. Each node can also be expanded to show all the other relationships they have to other nodes (⚠️ be careful when expanding a highly connected node, displaying all those relationships makes the fans go wild!). The GUI is a great tool to discover patterns in your data or to find and fix problems.
Of course, we haven't really exploited the graph structure of the database much. To do this, we need to make use of Cypher to perform some more complex queries. But these queries that are "complex" in nature are quite easy to write. I think as long as can get your head around the representation of your data as a graph, that's great. Figuring out how to use Cypher is then just a means to getting to what you want and where you need to be.
Below are just a few types of queries you can use to make use of a graph DB. Some of the keywords might be new but I won't be going through them. If you want to learn more, the Neo4j documentation on Cypher is a brilliant resource.
// Get the shortest path from London Heathrow airport to Auckland International Airport
MATCH p=shortestPath(
(a:Airport {iata:"LHR"})-[*]-(b:Airport {iata:"AKL"})
)
RETURN p
// Get the top 10 airports with the most incoming routes
MATCH (:Airport) -[r:FLIES_TO]-> (a:Airport)
RETURN a, count(r) as num_routes
ORDER BY num_routes DESC
LIMIT 10
// Get top 5 recommendations of country to visit for a user by finding other users who like the same countries. It also ensures that the recommendation is a new country that is not already liked by the user
MATCH (u:User {user_id:’142’})-[:LIKES]->(c:Country) <- [:LIKES]-(:User)-[:LIKES]->(rec:Country)
WHERE not (u)-[:LIKES]->(rec)
RETURN rec as Recommendation, count(*) as Frequency
ORDER BY Frequency DESC LIMIT 5;