
3 RDBMS & Graph Database Deployment Strategies (Polyglot & More)

Whether you’re ready to move your entire legacy RDBMS into a graph database, you’re syncing databases for polyglot persistence or you’re just conducting a brief proof of concept, at some point you’ll want to bring a graph database into your organization or architecture.

Once you’ve decided on your deployment strategy, you’ll then need to move some (or all) of your data from your relational database into a graph database. In this blog post, we’ll show you how to make that process as smooth and seamless as possible.

Your first step is to ensure you have a proper understanding of the native graph property model (i.e., nodes, relationships, labels, properties and relationship-types), particularly as it applies to your given domain.

In fact, you should at least complete a basic graph model on a whiteboard before you begin your data import. Knowing your data model ahead of time – and the deployment strategy in which you’ll use it – makes the import process significantly less painful.

In this RDBMS & Graphs blog series, we’ll explore how relational databases compare to their graph counterparts, including data models, query languages, deployment strategies and more. In previous weeks, we’ve explored why RDBMS aren’t always enough, graph basics for the RDBMS developer, relational vs. graph data modeling and SQL vs. Cypher as query languages.

This week, we’ll discuss three different database deployment strategies for relational and graph databases – as well as how to import your RDBMS data into a graph.

Three Database Deployment Strategies for Graphs and RDBMS


There are three main strategies to deploying a graph database relative to your RDBMS. Which strategy is best for your application or architecture depends on your particular goals.

Below, you can see each of the deployment strategies for both a relational and graph database:

Learn the Different Deployment Strategies for RDBMS & Graph Databases, Such as Polyglot Persistence

The three most common database deployment strategies for relational and graph databases.

First, some development teams decide to abandon their relational database altogether and migrate all of their data into a graph database. This is typically a one-time, bulk migration.

Second, other developers continue to use their relational database for any use case that relies on non-graph, tabular data. Then, for any use cases that involve a lot of JOINs or data relationships, they store that data in a graph database.

Third, some development teams duplicate all of their data into both a relational database and a graph database. That way, data can be queried in whatever form is the most optimal for the queries they’re trying to run.

The second and third strategies are considered polyglot persistence, since both approaches use a data store according to its strengths. While this introduces additional complexity into an application’s architecture, it often results in getting the most optimized results from the best database for the query.

None of these is the “correct” strategy for deploying an RDBMS and a graph. Your team should consider your application goals, frequent use cases and most common queries and choose the appropriate solution for your particular environment.

Extracting Your Data from an RDBMS


No matter your given strategy, if you decide you need to import your relational data into a graph database, the first step is to extract it from your existing RDBMS.

Almost all relational databases allow you to dump whole tables or whole datasets, as well as the results of specific queries, to CSV. These exports usually rely on a built-in copy function of the database itself. Of course, in many cases the CSV file ends up on the database server, so you have to download it from there, which can be a challenge.

Another option is to access your relational database through a driver such as JDBC and extract the datasets you want to pull out directly.

Also, if you want to set up a syncing mechanism between your relational and graph databases, then it makes sense to regularly pull the given data according to a timestamp or another updated flag so that data is synced into your graph.
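As a rough sketch of the graph side of such a sync (the file name, label and properties here are illustrative, and the LOAD CSV mechanism is described later in this post), using MERGE keeps the operation idempotent, so re-running the sync with overlapping rows updates rather than duplicates data:

// Hypothetical incremental sync: only rows changed since the last run are exported,
// then upserted so repeated syncs don't create duplicates.
LOAD CSV WITH HEADERS FROM 'file:///changed_persons.csv' AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name, p.updatedAt = row.updated_at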

Another facet to consider is that many relational databases aren’t designed or optimized for exporting large amounts of data within a short time period. So if you’re trying to migrate data directly from an RDBMS to a graph, the process might stall significantly.

For example, in one case a Neo4j customer had a large social network stored in a MySQL cluster. Exporting the data from the MySQL database took three days; importing it into Neo4j took just three hours.

One final tip before you begin: When you write to disk, be sure to disable virus scanners and check your disk schedule so you get the highest disk performance possible. It’s also worth checking any other options that might increase performance during the import process.

Importing Data via LOAD CSV


The easiest way to import data from your relational database is to create a CSV dump of individual entity tables and JOIN tables. The CSV format is the lowest common denominator of data formats between a variety of different applications. While the CSV format itself is unglamorous, it’s also the easiest to work with when it comes to importing data into a graph database.

In Neo4j specifically, LOAD CSV is a Cypher keyword that allows you to load CSV files from HTTP or file URLs into your database. Each row of data is made available to your Cypher statement and then from those rows, you can actually create or update nodes and relationships within your graph.

The LOAD CSV command is a powerful way of converting flat data (i.e. CSV files) into connected graph data. LOAD CSV works both with single-table CSV files as well as with files that contain a fully denormalized table or a JOIN of several tables.

LOAD CSV allows you to convert, filter or de-structure import data during the import process. You can also use this command to split fields, pull out individual values or iterate over a list of attributes and then set them as properties.

Finally, with LOAD CSV you can control the size of transactions using the PERIODIC COMMIT keyword so you don’t run into memory issues, and you can run LOAD CSV via the Neo4j shell (and not just the Neo4j Browser), which makes it easier to script your data imports.
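As a minimal sketch (the file name and batch size are illustrative), prefixing the import with PERIODIC COMMIT batches the work into smaller transactions:

// Commit every 10,000 rows instead of building one huge transaction.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///data/persons.csv' AS line
MERGE (p:Person {email: line.email})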

In summary, you can use Cypher’s LOAD CSV command to:
    • Ingest data, accessing columns by header name or offset
    • Convert values from strings to different formats and structures (toFloat, split, …​)
    • Skip rows to be ignored
    • MATCH existing nodes based on attribute lookups
    • CREATE or MERGE nodes and relationships with labels and attributes from the row data
    • SET new labels and properties or REMOVE outdated ones

A LOAD CSV Example


Here’s a brief example of importing a CSV file into Neo4j using the LOAD CSV Cypher command.

Example file: persons.csv

name;email;dept
"Lars Higgs";"lars@higgs.com";"IT-Department"
"Maura Wilson";"maura@wilson.com";"Procurement"


Cypher statement:

LOAD CSV WITH HEADERS FROM 'file:///data/persons.csv' AS line
FIELDTERMINATOR ";"
MERGE (person:Person {email: line.email}) ON CREATE SET person.name = line.name
WITH person, line
MATCH (dept:Department {name: line.dept})
CREATE (person)-[:EMPLOYEE]->(dept)

You can import multiple CSV files from one or more data sources (including your RDBMS) to enrich your core domain model with other information that might add interesting insights and capabilities.

Other, dedicated import tools help you import larger volumes (10M+ rows) of data efficiently, as described below.

The Command-Line Bulk Loader


The neo4j-import command is a scalable tool for bulk inserts. It takes CSV files and spreads the import across all of your available CPUs and disk bandwidth, using a staged architecture in which each input step is parallelized where possible. The tool then builds the new graph structures stage by stage, using advanced in-memory compression.

The command-line bulk loader is lightning fast, able to import up to one million records per second and handle large datasets of several billion nodes, relationships and properties. Note that because of these performance optimizations the neo4j-import tool can only be used for initial database population.

Loading Data Using Cypher


For importing data into a graph database, you can also use the Neo4j REST API to run Cypher statements yourself. With this API, you can run create, update and merge statements using Cypher.

The transactional Cypher HTTP endpoint is available to all drivers. You can also use the HTTP endpoint directly from an HTTP client or an HTTP library in your language.

Using the HTTP endpoint (or another API), you can pull the data out of your relational database (or other data source) and convert it into parameters for Cypher statements. Then you can batch and control import transactions from there.

From Neo4j 2.2 onwards, Cypher also works really well with highly concurrent writes. In one test, one million nodes and relationships per second were inserted with highly concurrent Cypher statements using this method.

The Cypher-based loading method works with a number of different drivers, including the JDBC driver. If you have an ETL tool or Java program that already uses JDBC, you can use Neo4j’s JDBC driver to import data into Neo4j, because Cypher statements are just query strings (more on the JDBC driver next week). In this scenario, you can provide parameters to your Cypher statements as well.
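As an illustration of that last point, here is a hedged sketch in which the calling program (via JDBC, the HTTP endpoint or another driver) hands over a whole batch of rows as a single parameter; the parameter and property names are invented for this example:

// {rows} is supplied by the driver as a list of maps,
// e.g. [{email: "lars@higgs.com", name: "Lars Higgs"}, ...]
UNWIND {rows} AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name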

Other RDBMS-to-Graph Import Resources:


This blog post has only covered the three most common methods for importing (or syncing) data into a graph database from a relational store.

The following are further resources on additional methods for data import, as well as more in-depth guides on the three methods discussed above.

Next week, we’ll take a look at connecting to a graph database via drivers and other integrations.


Want to learn more on how relational databases compare to their graph counterparts? Download this ebook, The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.



Catch up with the rest of the RDBMS & Graphs series:

Analyzing the Panama Papers with Neo4j: Data Models, Queries & More

As the world has seen, the International Consortium of Investigative Journalists (ICIJ) has exposed highly connected networks of offshore tax structures used by the world’s richest elites.

These structures were uncovered from leaked financial documents and analyzed by the journalists. They extracted the metadata of the documents using Apache Solr and Tika, connected all the information together using the leaked databases, created a graph of nodes and edges in Neo4j and made it accessible using Linkurious’ visualization application.

In this post, we look at the graph data model used by the ICIJ and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.

Discover How the Panama Papers Can be Analyzed Using Neo4j with Example Data Models, Queries & More


The Steps Involved in the Document Analysis


  1. Acquire documents
  2. Classify documents
    1. Scan / OCR
    2. Extract document metadata
  3. Whiteboard domain
    1. Determine entities and their relationships
    2. Determine potential entity and relationship properties
    3. Determine sources for those entities and their properties
  4. Work out analyzers, rules, parsers and named entity recognition for documents
  5. Parse and store document metadata and document and entity relationships
    1. Parse by author, named entities, dates, sources and classification
  6. Infer entity relationships
  7. Compute similarities, transitive cover and triangles
  8. Analyze data using graph queries and visualizations
A Data Model of Implied Company Connections


Finding triads in the graph can reveal inferred connections. Here, Bob has an inferred connection to CompanyB through CompanyA.

From Documents to Graph


A model of the organizational domain of business inter-relationships in a holding structure is simple and similar to the model used in business registries, a common use case for Neo4j. At a minimum you have:
    • Clients
    • Companies
    • Addresses
    • Officers (both natural people and companies)
With these relationships:
    • (:Officer)-[:is officer of]->(:Company)
      • With these classifications:
        • protector
        • beneficiary, shareholder, director
        • beneficiary
        • shareholder
    • (:Officer)-[:registered address]->(:Address)
    • (:Client)-[:registered]->(:Company)
    • (:Officer)-[:has similar name and address]->(:Address)
All these entities have a lot of properties, like document numbers, share amounts, start- and end-dates of involvements, addresses, citizenship and much more. Two entities of the same name can have very different amounts of information attached to them, though this depends on the relevant information that was extracted from the sources, e.g., some officers have only a name, others have a full record with more than 15 attributes.

These entities have specific relationships, such as a person being the “officer of” a company. This is a basic domain that you can populate from documents about a tax haven shell company holding, a.k.a. the #PanamaPapers.

Initially you classify the raw documents by type and subtype (like contract or invitation). Then you attach as much direct and indirect metadata as you can. Direct metadata comes from the documents themselves (like the senders and receivers of an email or the parties of a contract). Inferred metadata is gained from the content of the documents, using techniques like natural language processing, named entity recognition or plain-text search for well-known terms such as distinctive names or roles.

The first step to build your graph model is to extract those named entities from the documents and their metadata. This includes companies, persons and addresses. These entities become nodes in the graph. For example, from a company registration document, we can extract the company and officer entities.

Some relationships can be directly inferred from the documents. In the previous example, we would model the officer as directly connected to the company:

(:Officer)-[:IS_OFFICER_OF]->(:Company)

Other relationships can be inferred by analyzing email records. If we see several emails between a person and a company we can infer that the person is a client of that company:

(:Client)-[:IS_CLIENT_OF]->(:Company)

We can use similar logic to create relationships between entities that share the same address, have family ties or business relationships or that regularly communicate.
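As a hedged sketch of that idea (the SHARES_ADDRESS_WITH relationship type is invented for illustration, and the registered-address relationship is written in simplified form):

// Infer a link between officers registered at the same address.
MATCH (o1:Officer)-[:REGISTERED_ADDRESS]->(a:Address)<-[:REGISTERED_ADDRESS]-(o2:Officer)
WHERE id(o1) < id(o2)
MERGE (o1)-[:SHARES_ADDRESS_WITH]->(o2)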

    • Direct metadata -> entities -> relationships to documents
      • author, receivers, account-holder, attached to, mentioned, co-located
      • Turn plain entities / names into full records using registries and profile documents
    • Inferred metadata and information from other sources -> Relationships between entities
      • Related to people or organizations from the direct metadata
      • Same addresses / organizations
      • Find peer groups / rings within fraudulent activities
      • Family ties, business relationships
      • Part of the communication chain
The Graph Data Model Used by the ICIJ to Analyze the Panama Papers

The graph data model used by the ICIJ

Issues with the ICIJ Data Model


There are some modeling and data quality issues with the ICIJ data model.

The ICIJ data contains a lot of duplicates, only a few of which are connected by a “has similar name or address” relationship. Most of the others can be inferred from the first and last parts of a name together with addresses and family ties. It would also be beneficial to actually merge those duplicates in the data model; certain duplicate relationships could then be merged as well.

In the ICIJ data model, shareholder information like the number of shares, issue dates, etc. is stored on the “Officer,” even though an officer can be a shareholder in any number of companies. It would be better to store that shareholder information on the “is officer of – Shareholder” relationship.
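A minimal sketch of that refactoring, assuming the share details currently sit as properties on the Officer node (the property names here are hypothetical):

// Move hypothetical share details from the Officer node onto the shareholder relationship.
MATCH (o:Officer)-[r:IOO_SHAREHOLDER]->(c:Company)
SET r.shares = o.shares, r.issueDate = o.share_issue_date
REMOVE o.shares, o.share_issue_date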

Some of the Boolean properties could be represented as labels, e.g., “citizenship=yes” could become a Person label.
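For example, a sketch of that change, assuming a citizenship property as described:

// Replace the boolean-style property with a label.
MATCH (o:Officer)
WHERE o.citizenship = "yes"
SET o:Person
REMOVE o.citizenship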

How Could You Extend the Basic Graph Model Used by the ICIJ?


The domain model used by the ICIJ is really basic, just containing four types of entities (Officer, Client, Company, Address) and four relationships between them. It is more or less a static view on the organizational relationships but doesn’t include interactions or activities. Looking at the source documents and the other activities outlined in the report, there are many more things which can enrich this graph model to make it more expressive.

We can model the original documents and their metadata and the relationships to people. Part of those relationships are also inferred relationships from being part of conversations or being mentioned or the subject of documents. Other interesting relationships are aliases and interpretations of entities that were used during the analysis, which allows other journalists to reproduce the original thought processes.

Also, the sources for additional information like business registries, watch-lists, census records or other journalistic databases can be added. Human relationships like family or business ties can be created explicitly as well as implicit relationships that infer that the actors are part of the same fraudulent group or ring.

Other aspects that are missing are the activities and the money flow. Examples of activities are the opening or closing of accounts, the creation or merger of companies, filing records for those companies or assigning responsibilities. For the money flow, we could track the banks, accounts and intermediaries used in the monetary transactions mentioned, so you can get an overview of the amounts transferred and the patterns of transfers. Those patterns can then be applied to extract additional fraudulent money flows from other transaction systems.

Graph data is very flexible and malleable. As soon as you have a single connection point, you can integrate new sources of data and start finding additional patterns and relationships that you couldn’t trace before.

    • New Entities:
      • Documents: E-Mail, PDF, Contract, DB-Record, …
      • Money Flow: Accounts / Banks / Intermediaries
    • New Relationships
      • Family / business ties
      • Conversations
      • Peer Groups / Rings
      • Similar Roles
      • Mentions / Topic-Of
      • Money Flow

Let’s Look at a Concrete Example


Let’s look at the family of Azerbaijan’s President Ilham Aliyev, which was already the topic of a Linkurious GraphGist in the past. We see his wife, two daughters and son depicted in the graphic below.

The Azerbaijan President's Fraud Ring Analyzed by Linkurious


Quoting the ICIJ “The Power Players” Publication (emphasis for names added):

The family of Azerbaijan President Ilham Aliyev leads a charmed, glamorous life, thanks in part to financial interests in almost every sector of the economy. His wife, Mehriban, comes from the privileged and powerful Pashayev family that owns banks, insurance and construction companies, a television station and a line of cosmetics. She has led the Heydar Aliyev Foundation, Azerbaijan’s pre-eminent charity behind the construction of schools, hospitals and the country’s major sports complex. Their eldest daughter, Leyla, editor of Baku magazine, and her sister, Arzu, have financial stakes in a firm that won rights to mine for gold in the western village of Chovdar and Azerfon, the country’s largest mobile phone business. Arzu is also a significant shareholder in SW Holding, which controls nearly every operation related to Azerbaijan Airlines (“Azal”), from meals to airport taxis. Both sisters and brother Heydar own property in Dubai valued at roughly $75 million in 2010; Heydar is the legal owner of nine luxury mansions in Dubai purchased for some $44 million.
We took the data from the ICIJ visualization and converted the 2d graph visualization into graph patterns in the Cypher query language. If you squint, you can still see the same structure as in the visualization. We only compressed the “is officer of – Beneficiary, Shareholder, Director” to IOO_BSD and prefixed the other “is officer of” relationships with IOO.

We didn’t add shares, citizenship, reg-numbers or addresses that were properties of the entities or relationships. You can see them when clicking on the elements of the embedded original visualization.



Cypher Statement to Set Up the Visualized Entities and Relationships


CREATE
(leyla: Officer {name:"Leyla Aliyeva"})-[:IOO_BSD]->(ufu:Company {name:"UF Universe Foundation"}),
(mehriban: Officer {name:"Mehriban Aliyeva"})-[:IOO_PROTECTOR]->(ufu),
(arzu: Officer {name:"Arzu Aliyeva"})-[:IOO_BSD]->(ufu),
(mossack_uk: Client {name:"Mossack Fonseca & Co (UK)"})-[:REGISTERED]->(ufu),
(mossack_uk)-[:REGISTERED]->(fm_mgmt: Company {name:"FM Management Holding Group S.A."}),

(leyla)-[:IOO_BSD]->(kingsview:Company {name:"Kingsview Developents Limited"}),
(leyla2: Officer {name:"Leyla Ilham Qizi Aliyeva"}),
(leyla3: Officer {name:"LEYLA ILHAM QIZI ALIYEVA"})-[:HAS_SIMILIAR_NAME]->(leyla),
(leyla2)-[:HAS_SIMILIAR_NAME]->(leyla3),
(leyla2)-[:IOO_BENEFICIARY]->(exaltation:Company {name:"Exaltation Limited"}),
(leyla3)-[:IOO_SHAREHOLDER]->(exaltation),
(arzu2:Officer {name:"Arzu Ilham Qizi Aliyeva"})-[:IOO_BENEFICIARY]->(exaltation),
(arzu2)-[:HAS_SIMILIAR_NAME]->(arzu),
(arzu2)-[:HAS_SIMILIAR_NAME]->(arzu3:Officer {name:"ARZU ILHAM QIZI ALIYEVA"}),
(arzu3)-[:IOO_SHAREHOLDER]->(exaltation),
(arzu)-[:IOO_BSD]->(exaltation),
(leyla)-[:IOO_BSD]->(exaltation),
(arzu)-[:IOO_BSD]->(kingsview),

(redgold:Company {name:"Redgold Estates Ltd"}),
(:Officer {name:"WILLY & MEYRS S.A."})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"LONDEX RESOURCES S.A."})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"FAGATE MINING CORPORATION"})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"GLOBEX INTERNATIONAL LLP"})-[:IOO_SHAREHOLDER]->(redgold),
(:Client {name:"Associated Trustees"})-[:REGISTERED]->(redgold)

Linked Entities in the Panama Papers Data Visualized in Neo4j


Interesting Queries


Family Ties via Last Name:

MATCH (o:Officer) 
WHERE toLower(o.name) CONTAINS "aliyev"
RETURN o

Family Ties by Last Name in the Azerbaijan Data


Family Involvements:

MATCH (o:Officer) WHERE toLower(o.name) CONTAINS "aliyev"
MATCH (o)-[r]-(c:Company)
RETURN o,r,c

Cypher Example for Family Involvements in the Azerbaijan Data


Who Are the Officers of a Company and Their Roles:

MATCH (c:Company)-[r]-(o:Officer) WHERE c.name = "Exaltation Limited"
RETURN *

Company Officers and Roles in the Azerbaijan Data


Show Joint Company Involvements of Family Members

MATCH (o1:Officer)-[r1]->(c:Company)<-[r2]-(o2:Officer)
WITH o1.name AS first, o2.name AS second, count(*) AS count,
     collect({company: c.name, kind1: type(r1), kind2: type(r2)}) AS involvements
WHERE count > 1 AND first < second
RETURN first, second, involvements, count

Joint Company Involvement of Family Members in the Azerbaijan Data


Resolve Duplicate Entities

MATCH (o:Officer) 
RETURN toLower(split(o.name," ")[0]), collect(o.name) as names, count(*) as count

Resolving Duplicate Entities in the Azerbaijan Data


Resolve Duplicate Entities by First and Last Part of the Name

MATCH (o:Officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0] + " " + name_parts[-1] as name,  collect(o.name) AS names, count(*) AS count
WHERE count > 1
RETURN name, names, count
ORDER BY count DESC

Resolve Duplicate Data Entities by First and Last Part of Name


Transitive Path from Mossack to the Officers in that Example

MATCH path=(:Client {name: "Mossack Fonseca & Co (UK)"})-[*]-(o:Officer)
WHERE none(r IN relationships(path) WHERE type(r) = "HAS_SIMILIAR_NAME")
RETURN [n in nodes(path) | n.name] as hops, length(path)

The Transitive Path between Mossack Fonseca and Company Officers in the Panama Papers Data


Shortest Path between Two People

MATCH (a:Officer {name: "Mehriban Aliyeva"})
MATCH (b:Officer {name: "Arzu Aliyeva"}) 
MATCH p=shortestPath((a)-[*]-(b))
RETURN p

Finding a Shortest Path in Neo4j in the Azerbaijan Data


Further Work – Extension of the Model


Merge Duplicates

Create a person node and connect all officers to that single person. Reuse our statement from the duplicate detection.

MATCH (o:Officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0]+ " " + name_parts[-1] AS name, collect(o) AS officers


// originally natural people have a “citizenship” property
WHERE name CONTAINS "aliyev"

CREATE (p:Person { name:name })
FOREACH (o IN officers | CREATE (o)-[:IDENTITY]->(p))

Introduce Family Ties between Those People


CREATE (ilham:Person {name:"ilham aliyev"})
CREATE (heydar:Person {name:"heydar aliyev"})
WITH ilham, heydar
MATCH (mehriban:Person {name:"mehriban aliyeva"})

MATCH (leyla:Person {name:"leyla aliyeva"})
MATCH (arzu:Person {name:"arzu aliyeva"})

FOREACH (child IN [leyla, arzu, heydar] | CREATE (child)-[:CHILD_OF]->(ilham) CREATE (child)-[:CHILD_OF]->(mehriban))
CREATE (leyla)-[:SIBLING_OF]->(arzu)
CREATE (leyla)-[:SIBLING_OF]->(heydar)
CREATE (arzu)-[:SIBLING_OF]->(heydar)
CREATE (ilham)-[:MARRIED_TO]->(mehriban)

Show the Family

MATCH (p:Person) RETURN p

The Aliyev Family in the Azerbaijan Data


Family Ties to Companies

MATCH (p:Person) WHERE p.name CONTAINS "aliyev"
OPTIONAL MATCH (c:Company)<--(o:Officer)-[:IDENTITY]-(p) 
RETURN c,o,p

Family Ties to Companies in the Azerbaijan Data


GraphGist


You can explore the example dataset yourself in this interactive graph model document (called a GraphGist). You can find many more for various use-cases and industries on our GraphGist portal.

Related Information





Want to start your own project like this using Neo4j? Click below to get your free copy of O’Reilly’s Graph Databases ebook and get started with graph databases today.

The 5-Minute Interview: Tom Zeppenfeldt, Founder of Graphileon

For this week’s 5-Minute Interview, I chatted with Tom Zeppenfeldt, Director and Founder at Graphileon in the Netherlands. We spoke this past summer about what’s new at Graphileon.

Here’s what we covered:

Tell us a bit about yourself and about how you use Neo4j.


Tom Zeppenfeldt: I’m the Founder and Owner of Graphileon. We became a Neo4j solutions partner last April, but I already had a lot of experience working with Neo4j.

The project that we worked on before becoming a Neo4j partner was to create a platform for investigative journalists — the type of reporters who work on stories like the Panama Papers. And our main product at Graphileon is what we call the InterActor, which is a heavily enhanced user interface that communicates with Neo4j.

Can you share a bit more technical details about how that product works?


Tom: Of course. With the journalism project we’re working on, we ran into some limitations because we aren’t what I would call “hardcore ITers.” And we were looking for a user interface that people like us — who had been using Excel — could easily use. We needed a tool that would allow us to create and browse networks; create new nodes, tables and charts; and all different kinds of graph data.

Although we were working on the journalism project, we realized that if we made the tool generic, everyone who uses Neo4j could have this useful add-on. It’s always good to have some tool at your side that allows you to browse and do discovery and exploration in your Neo4j store in order to build prototype applications or applications that only have a short lifetime.

What made Neo4j stand out when you were exploring different technology solutions?


Tom: One of the main draws was Cypher; that was crucial. As I mentioned, we are not hardcore IT people, but Cypher — in terms of all the ASCII art-like pattern matching it allows you to do — was really easy to use. We’ve become more advanced and now consider ourselves power users of Neo4j.

The database is very easy to work with. You don’t have to go through a lot of technical studying to be able to create good data models or write your queries. But a user interface was still lacking.

If you compare it to the standard user interface that comes with Neo4j, the Neo4j Browser, we have multiple panels. We can copy nodes from one panel to another, and we can also access different graph stores at the same time. We have shortcuts and even dynamic Cypher, which is very interesting.

For instance, imagine that you want to select a number of nodes that are linked in a node set. From that, you can automatically derive a kind of pattern and then send that to the database to give what we call isomorphs, or similar structures. This allows you to query and return all the places in your data where you have the same structure on the same path.

What have been some of the most surprising or interesting results you’ve seen using Neo4j?


Tom: The moment we started playing with dynamic Cypher was very interesting, especially once we found the correct division between software tiers such as the database and front-end tiers.

We started working with Cypher results as result maps that look and smell like a node, so it’s treated as a node by the database. That allowed us to make nice visualizations of aggregations of soft nodes, virtual nodes and virtual relationships. The fact that you can merge them into a result — even if you are combining your Cypher query data from different node types or nodes with different labels — makes it very easy to work with.

If you could take everything you know about Neo4j now and go back to the beginning, is there anything you would do differently?


Tom: Once we knew Cypher really well, we saw pitfalls in some of our models. My advice would be to try and limit the scope of your search with your Cypher statements as early as possible. In the first month or two, we struggled because we didn’t understand Cypher completely, which led to some mistakes. But if you can optimize those queries you can achieve huge improvements in performance.

For example, if you are doing traversals, opening a node to see what is inside is time consuming. Sometimes it’s better to use a node value instead of property because it allows you to use nodes instead of real values so you can search for ranges between those relationships.

As in any database, whether it’s a relational database, a document database or a graph database, you always have to consider the type of queries you want to perform. You can’t just make the model without knowing what kinds of questions you want to ask.

Anything else you’d like to add? Any closing thoughts?


Tom: It’s very interesting for us to see how quickly Neo4j develops. We started with version 1.0 and the difference between that version and what we have now is huge.

Since we build a lot of prototype applications, we are really pleased with the new functions every time they’re added. For instance, at a certain stage you added the keys functions and the properties functions, which makes developing a lot faster for us. Of course, we are also interested in what openCypher will bring because as more people start to use Cypher this will push further development of the language.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Using graph databases for journalism or investigation?
Read this white paper The Power of Graph-Based Search, and learn to leverage graph database technology for more insight and relevant database queries.


Discover Graph-Based Search

Detecting Fake News with Neo4j & KeyLines

Fake news is one of the more troubling trends of 2017. The term is liberally applied to discredit everything, from stories with perceived bias through to ‘alternative facts’ and downright lies. It has a warping effect on public opinion and spreads misinformation.

Fake news is nothing new – bad journalism and propaganda have always existed – but what is new is its ability to spread through social media.

In this post, we’ll see how graph analysis and visualization techniques can help social networking sites stop the spread of fake news. We’ll see how, like fraud, fake news detection is about understanding networks. We’ll discuss how Neo4j and the KeyLines graph visualization toolkit can power a comprehensive fake news detection process.

A quick note: For simplicity, in this post we’ll limit the term ‘fake news’ to describe completely fictitious and unsubstantiated articles (see examples like PizzaGate and the Ohio lost votes story).

How Is Fake News a Graph Problem?


To detect fake news, it’s essential to understand how it spreads online – between accounts, posts, pages, timestamps, IP addresses, websites, etc. Once we model these connections as a graph, we can differentiate between normal behaviors and abnormal activity where fake content could be shared.

Let’s get started.

Building Our Graph Data Model


When we’re detecting fraud, we usually rely on verifiable, watch-list friendly, demographic data like real names, addresses or credit card details. With fake news, we don’t have this luxury, but we do have data on social networking sites. This can give us useful information, including:
    • Accounts (or Pages)
    • Age, history, connections to other accounts, pages and groups
    • Posts
    • Device fingerprint / IP address, timestamp, number of shares, comments, reactions and ‘Reported’ flags
    • Articles
    • Host URL, WHOIS data, title, content and author
There are many ways to model this data as a graph. We usually start by mapping the main items to nodes: account, post and article. We know that IP addresses are important, so we can add those as nodes too. Everything else is added as a property:

Graph model for fake news detection

Our fake news detection graph model

Detection vs. Investigation


Fake news spreaders are just as determined as regular fraudsters. They’ll adapt their behavior to avoid detection, and employ bots to run brute-force attacks.

Relying on algorithmic or manual detection isn’t enough. We need a hybrid approach that combines automated detection and manual investigation:

The process model for fake news detection between Neo4j and KeyLines

A simplified model showing fake news detection powered by Neo4j and KeyLines

    • Automated detection
    • This uses a Neo4j graph database as the engine for an automated, rule-based detection process. It isolates posts and accounts that match patterns of behavior previously associated with fake news (‘known fraud’).
    • Manual investigation
    • At the same time, a manual investigation process, powered by a KeyLines graph visualization tool, helps uncover new behaviors (‘unknown fraud’).
New behaviors are fed back into the automated process, so automated detection rules can adapt and become more sophisticated.

Detecting Fake News with Neo4j


Once we’ve created our data store, we can run complex queries to detect high-risk content and accounts.

Here’s where graph databases like Neo4j offer huge advantages over traditional SQL or relational databases. Queries that could take hours now take seconds and can be expressed using intuitive and clean Cypher queries.

For example, we know that fake news botnets tend to share content in short bursts, using recently registered accounts with few connections. We can run a Cypher query that:
  • Returns all accounts:
    • that have fewer than 20 friend connections
    • that shared a link to www.example.com
    • between 12.07pm – 12.37pm on 13 February 2017
In Cypher, we’d simply express this as:

MATCH (account:Account)--(ip:IP)--(post:Post)--(article:Article)
      WHERE account.friends < 20 AND article.url = 'www.example.com'
            AND post.timestamp > 1486987620000 
            AND post.timestamp < 1486989420000
RETURN account

Investigating Fake News with KeyLines


To seek out ‘unknown fraud’ – cases that follow patterns that can’t be fully defined yet – our manual investigation process looks for anomalous connections.

Learn how to use KeyLines and Neo4j to detect fake news on social media through graph visualization

Visual investigation tools provide an intuitive way to uncover unusual connections that could indicate fake content

A graph data visualization tool like KeyLines is essential for this.

Building a Visual Graph Model


Let’s define a visual graph model so we can start to load data from Neo4j into KeyLines.

It’s not a great idea to load every node and link in our source data. Instead we should focus on the minimal viable elements that tell the story, and then add other properties as tooltips or nodes later.

We want to see our four key node types, with glyphs to highlight higher-risk data points like:
    • New accounts
    • Posts that have been reported by users
    • URLs that have previously been associated with fake content
This gives us a visual model that looks like this:

A visual data model of high-risk data points using glyphs in KeyLines


Loading the Data


To find anomalies, we need to define normal behavior. Graph visualization is the simplest way to do this.

Here’s what happens when we load 100 Post IDs into KeyLines:

Data loading of Facebook post IDs into KeyLines

Loading the metadata of 100 Facebook posts into KeyLines to identify anomalous patterns

Our synthesized dataset is simplified, with a lower rate of sharing activity and more anomalies than real-world data. But even in this example we can see both normal and unusual behavior:

Normal social media user news sharing behavior

Normal user sharing behavior, visualized as a graph

Normal posts look similar to our data model – featuring an account, IP, post and article. Popular posts may be attached to many accounts, each with their own IP, but generally this linear graph with no red glyphs indicates a low-risk post.

Other structures in the graph stand out as unusual. Let’s take a look at some examples.

1. Monitoring New Users


New users should always be treated as higher risk than established users. Without a known history, it’s difficult to understand a user’s intentions. Using the New User glyph, we can easily pick them out:

Non-suspicious user behavior social media post

A non-suspicious post being shared by a new user

A pattern of unusual user sharing behavior that might be fake news

This structure is much more suspicious, with a new user sharing flagged posts to articles on known fake news domains

2. Identifying Unusual Sharing Behavior


We can also use a graph view to uncover suspicious user behavior. Here’s one strange structure:

A deviant pattern of user sharing behavior that might be fake news

An anomalous structure for investigation

We can see one article has been shared multiple times, seemingly by three accounts with the same IP address. By expanding both the IP and article nodes, we can get a full view of the accounts associated with the link farm.

3. Finding New Fake News Domains


In addition to monitoring users, social networks should monitor links to domains known for sharing fake news. We’ve represented this in our visual model using red glyphs on the domain node. We’ve also used the combine feature to merge articles on the same domain:

Combining domain tracking using graph visualization

Using combos to see patterns in article domains

This view shows the websites being shared, rather than just the individual articles. We can pick out suspicious domains:

Suspicious fake news website domain sharing pattern


Try It for Yourself


This post is just an illustration of how you can use graph visualization techniques to understand the complex connected data associated with social media. We’ve used simplified data and examples to show how graph analysis could become part of the crackdown on fake news.

We’d love to see how this approach works using real-world data. Catch my lightning talk or stop by our table at GraphConnect on 11th May to see how we could work together!

References


The dataset we used here was synthesized from two sources.

Cambridge Intelligence is a Silver sponsor of GraphConnect Europe. Use discount code CAMBRIDGE30 to get 30% off your tickets and trainings.


Join us at Europe's premier graph technology event: Get your ticket to GraphConnect Europe and we'll see you on 11 May 2017 at the QEII Centre in central London!

Sign Me Up

Visualizing This Week in Tech

Discover how knowledge graphs make deep connections.


Every week there seems to be an overwhelming amount happening in the tech world: new startups, products, funding rounds, acquisitions and scientific breakthroughs. All of this results in a great deal of news content being produced.

TechCrunch publishes about 250 stories every week – for just one publication amongst many, that’s already a lot of information to take in.

Since I really love graphs and machine learning – even more so when they are combined – I thought it would be fascinating to represent This Week in Tech as a graph: a way to visualize and explore the connections between the stories.

In this post, we’ll look at using machine learning and graph algorithms to explore a data visualization of clusters, connections and insights into what helped shape tech news this week.

Articles to Knowledge Graph


To make sense of all these articles, we can try to analyze and organize them in a process that produces additional insights.

Knowledge representation is a field in AI that focuses on representing information so that a computer program can access it autonomously for solving complex tasks. We represent the stories in a knowledge graph, which consists of a set of interconnected entities and their attributes. We’ll then create an ordered and connected version of the same information that’s otherwise isolated and disorganized.

To analyze the stories, we are going to make use of several AI and machine learning techniques.

The first technique is Named Entity Recognition (NER), to extract entities from the articles. Entities are things like people names, locations, organizations, startups, etc. The extracted entities allow us to recognize relations between stories by analyzing which content mentions the same or similar entities. IBM Watson also provides a relevance score that reflects how important an entity is to a text as well as the sentiment toward the entity.

To identify more complex connections that would require a better understanding of the world, we can make use of external knowledge bases. This helps our knowledge graph build a hierarchy of concepts that connects more stories.

For example, let’s say you find articles describing investments in two autonomous vehicle companies. NER can easily identify the two startups as companies, but it may not identify that these companies are related since they are in the same industry. Adding external knowledge can help make these connections.

Lastly, to capture some of the meaning of the text itself, we are going to make use of Doc2Vec embeddings.

The goal of Doc2Vec is to create a numeric representation of a document, regardless of its length. While a word vector represents the concept of a word, the document vector intends to represent the concept of a document.

This vector is used for things like cosine similarity, where the similarity between two documents is measured. I’m not going to delve into too much detail about the theory behind Doc2Vec, but the original paper gives a great overview.

For this project, I used a Doc2Vec model trained on a corpus of ~15,000 TechCrunch articles.

To get started, we are going to run one week’s (July 13 – July 20th) worth of TechCrunch stories (about 250) through the processing pipeline. This pipeline will first use IBM Watson’s Natural Language Understanding to extract the entities, concepts and topics from each story. We can then set up another API to process the raw text (tokenize and stem) and then use the TechCrunch Doc2Vec model to generate an embedding vector capturing the meaning of the story.

To manage all this information in the knowledge graph we can use the following graph model.

Each article node has an embedding property that stores the embedding vector (this will be important later). Each article node also has relationships to various concept, topic and entity nodes. These edges have a relevance and a sentiment property.

Entity nodes can also have a relationship to a meta-category, such as companies or athletes. To visualize this model, we can use the built-in schema command.



To actually insert the articles we can use Py2neo to write a Python script that runs the following Cypher queries for the data on each article:

Creating article nodes:


MERGE (article:Article {url: {URL}})
ON CREATE SET article.title = {TITLE}, article.summary = {TEXT}, article.date = {DATE}, article.embedding = {EMBEDDING}

Related Entities:


MATCH(article:Article {title:{TITLE}})

                MERGE (entity:Entity {label: {ENTITY_LABEL}})

                MERGE (article)-[r:RELATED_ENTITY]->(entity)
                SET r.type = {ENTITY_TYPE}
                SET r.count = {COUNT}
                SET r.score = {RELEVANCE}
                SET r.sentiment = {SENTIMENT}

                FOREACH (category IN {ENTITY_CATEGORIES} |
                    MERGE (cat:Category {label: category})
                    MERGE (entity)-[:IN_CATEGORY]->(cat)
                )

Here is a snippet of what analyzed articles in the knowledge graph look like:



The Magic of Weighted Graphs


By adding a vector embedding property to each article node, we have created a very convenient way to measure the similarity between two stories. This allows us to add a weight property to the relationship between each pair of article nodes. This weight reflects how similar the two stories are.

Weighted graphs are incredibly useful for doing things like community detection, PageRank and finding shortest paths.

We can use the cosine similarity procedure from the Neo4j Graph Algorithms library to compute the weight between two article nodes with the following query. Once again we’ll create a Python script that uses Py2neo to run the following query on each article node.

MATCH (a: Article{title:{TITLE}})
MATCH (b: Article)
WHERE NOT EXISTS((a)-[:SIMILARITY]->(b)) AND a <> b

WITH a, b, algo.similarity.cosine(a.embedding, b.embedding) as similarity

CREATE (a)-[r:SIMILARITY]->(b)
SET r.cosine_similarity = similarity

The results? A weighted graph where each edge between article nodes has a cosine_similarity property.



To get started with analyzing the weighted graph, we will once again use the Neo4j Graph Algorithms library to run the Louvain Community detection algorithm.

The algorithm detects communities in networks by maximizing modularity. Modularity is a metric that quantifies how well nodes are assigned to communities. It does this by evaluating how much more densely connected the nodes within a community are compared to how connected they would be in a random network.
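For reference, the modularity score Q that Louvain maximizes is commonly written as

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where A_{ij} is the weight of the edge between nodes i and j, k_i is the sum of the weights of the edges attached to node i, m is the total edge weight in the graph, and \delta(c_i, c_j) is 1 when the two nodes are in the same community and 0 otherwise.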

The following snippet queries a subgraph using Cypher. This subgraph is composed of article nodes and their similarity relationships.

CALL algo.louvain(
  'MATCH (p:Article) RETURN id(p) AS id',
  'MATCH (a1:Article)-[sim:SIMILARITY {most_related: true}]-(a2:Article) RETURN id(a1) AS source, id(a2) AS target, sim.cosine_similarity AS weight',
  {graph: 'cypher', weightProperty: 'weight', defaultValue: 0.0, write: true, writeProperty: 'louvain'})

After running the algorithm, each article node will have an integer property (“louvain”) that specifies the community of articles it belongs to.

The Louvain algorithm performs very well here. The visualization reveals various clusters, for example security breaches, banking startups and space exploration. You can check out the web app for a visualization of these clusters.

Result


To explore the weighted graph of stories I put together this web app. It uses React as well as Vis.js for the actual graph visualization.

Since each article node has an integer specifying its community, we can give each of these “topic clusters” a unique color.



You can click on any article node in the graph to explore it in more detail. The web app highlights the most related stories and displays them in the sidebar. You can then use the slider to walk the graph, traversing more related stories.



The cool part is that all these clusters were automatically determined using the techniques described earlier. You can see a large light blue cluster of articles related to space, or the yellow cluster of stories related to security.

Another interesting thing is exploring how two topic clusters are connected by a common story.

For example, the cluster of stories related to space exploration mentions a story about a rover used to explore a distant asteroid. This is related to a robotics cluster containing a story on an MIT robot mirroring a human, which in turn, is related to Elon Musk’s Neuralink developing human computer interfaces.

Conclusion


The techniques described here show how graphs can be used as a powerful tool for extracting insights from unstructured data. Neo4j’s Cypher and comprehensive graph libraries provide developers with amazing tools to implement some of these ideas.

Using some of these techniques to analyze this week’s tech stories resulted in a visualization of clusters, connections and insights into what helped shape This Week in Tech.

I’d love to hear what you thought of this project! I’m also always happy to discuss how you can use graphs to transform your data into actionable knowledge, create new services and reduce costs in your business.


Want to take your Neo4j skills up a notch? Take our online training class, Neo4j in Production, and learn how to scale the world’s leading graph database to unprecedented levels.

Take the Class

Creating an Intelligent Recommendation Framework

Create an intelligent recommendation framework.


It is very likely that you are reading this blog post because a recommendation algorithm decided that it belonged on your feed. It’s also likely that your decision to read this is now being fed back into the very same algorithm that brought you here.

This, along with the rest of your data, is used to populate the ads decorating the sides of your feed and generating the revenue that keeps the servers running.

In simple terms, a recommender system is an algorithm that suggests items or decisions to a user or another system, scoring them based on predicted relevance.

While recommender systems are most commonly thought of in the context of product or content recommendations, many organizations have found innovative ways to apply these concepts to all aspects of their business.

Such examples are found everywhere. For instance, a manufacturer may leverage recommendations for alternate parts and materials when a vendor goes out of business. Or, an HR firm may recommend employees with a high potential for flight risk and then perform subsequent succession planning.

Contextual Recommendations


Neo4j is a native graph database built and designed for connected data. Especially considering recommendations’ increasing prevalence, many have found the underlying graph model helpful in supporting rich contextual recommendations that are understood intuitively and developed rapidly.

There are several reasons why graphs lend themselves so well to recommendations:

    • Explainability: Graph models are easy for non-technical users to understand and visualize. Using graph traversals and pattern matching with Cypher makes graph-based recommendations easier to understand and dissect than black-box statistical approaches.
    • Rapid Development: Requirements change rapidly, and models need to adapt to fit these requirements. Neo4j is schemaless, and you can refactor your graph to easily consider new data or different access patterns.
    • Personalization & Contextualization: Neo4j is often used as a storage layer that connects data from multiple silos. Having quick access to everything you know about a user or product allows you to make more relevant recommendations.
    • Performance: Recommender systems typically need to analyze a large amount of data to produce meaningful results. Neo4j’s native graph engine uses index-free adjacency to eliminate the need for complex joins and enable real-time traversals.
    • Graph Algorithms: Neo4j’s graph algorithms library lets you run similarity, community detection, and centrality algorithms directly on the database to enrich your graph and enhance your recommendations.
This blog post aims to reach both technical and nontechnical audiences alike. Luckily, Neo4j’s whiteboard-friendly data model and visual, pattern-based query language Cypher are both easy to understand.

Explainability


Neo4j is a native graph database, meaning the graph is not simply an abstraction or a layer on top of a relational database. The data is actually stored as a graph.

The graph data model is “whiteboard-friendly.” When designing a data model, it’s quite intuitive to draw out your data in the form of connected entities.

In the relational world, the whiteboard model is then reformatted to fit the tables of a relational model. With Neo4j, however, no restructuring is necessary because the data is stored in the form of labeled nodes and relationships.

Take the following data model for a movie streaming service:

Recommendations and graph database technology.


There are a few entities (or “nodes,” in common graph nomenclature) in our graph: user, movie, genre, year, actor and director.

Based on this data model, it’s fairly easy to see how our entities interact with one another. Movies have genres, users rate movies, etc. This model is simple and easy to understand, and in turn, easy to query. There is no need for any join tables or views, and our queries can be understood by anyone.

Neo4j developed and has now open-sourced the popular query language Cypher. Cypher is a declarative language, meaning you specify what results you want back, not how to compute them.

More concretely, Cypher is centered around the concept of pattern matching. When using Cypher, the user specifies the graph pattern they would like to retrieve from the database, and Neo4j handles the rest.

For example, let’s take the data model from before. Perhaps we would like to find all of the movies in a particular genre. The pattern we’re looking for looks like this:

genre recommendations


Or in Cypher, it’ll look like this:

MATCH (genre:Genre{name:"Comedy"})<-[:IN_GENRE]-(movie:Movie)
RETURN movie

Even without knowing what the MATCH and RETURN keywords do, it is pretty clear what is going on. This holds true for even more advanced use cases. Say we are building a recommendation engine to recommend the most popular movies in a specific genre.

We can drive the entire recommendation in two lines of Cypher:

MATCH (genre:Genre{name:"Comedy"})<-[:IN_GENRE]-(movie:Movie)<-[rating:RATED]-(user:User)
RETURN movie, avg(rating.rating) AS score ORDER BY score DESC

Rapid Development


Many recommender systems rely heavily on statistical approaches. In order to adapt to changing requirements, models may need to be tweaked and retrained, which can be time intensive. In contrast, Cypher patterns and graph algorithms can be edited and adapted quickly without retraining.

Furthermore, graph databases are schemaless and thus more flexible than their relational counterparts. Graph models can grow and change along with requirements, unlike relational models where tables need to be dropped and rebuilt.

Let's look at adding an additional data point, like IMDb ratings to our movies graph. The recommendation query can be quickly extended to account for this new data, resulting in:

MATCH (genre:Genre{name:"Comedy"})<-[:IN_GENRE]-(movie:Movie)<-[rating:RATED]-(user:User)
MATCH (movie)-[:HAS_IMDB_RATING]->(imdbRating:ImdbRating)
RETURN movie, avg(rating.rating) + avg(imdbRating.rating) AS score ORDER BY score DESC

We have now enriched our recommendation score with a single additional MATCH clause. What would take multiple joins in SQL is accomplished by traversing a couple of relationships. This makes our system easier to manage and maintain, and helps us avoid costly mistakes.

Personalization & Contextualization


The graph model inherently supports rich, contextualized information. If a graph is modeled well, then each node should have neighbors that provide valuable context about that node.

For example, a user in our movies graph is connected to the movies they rated. From those movies, we can infer their favorite genre and actors, and even predict other users with similar movie preferences.

Using the same data model from before, we can learn a user’s favorite genres using the following query:

MATCH (:User {id: "1"})-[r:RATED]->(m)-[:IN_GENRE]->(g)
WHERE toInteger(r.rating) = 5.0
RETURN g.name AS genre, count(m) AS score
ORDER BY score DESC

The value of personalized context when curating recommendations is extremely high. Different users likely have varying tastes, and our recommendation system needs to take that into account in order to maximize its success.
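
Building on the same pattern, here is a minimal sketch of surfacing users with similar tastes by counting co-rated movies. It reuses the hypothetical user id "1" from the query above and assumes the User, Movie and RATED elements of the earlier data model:

MATCH (me:User {id: "1"})-[:RATED]->(m:Movie)<-[:RATED]-(other:User)
WHERE other <> me
RETURN other.id AS similarUser, count(DISTINCT m) AS sharedMovies
ORDER BY sharedMovies DESC
LIMIT 10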

Performance


Recommendations are often only useful in real time.

A classic example is online retail. Online shoppers often browse through items extremely quickly, so recommendations need to be computed within milliseconds.

A less obvious example is situational planning for manufacturers. If a part or product is recalled or contaminated, everything related to that part or product must be immediately recalled, and alternatives must be identified as well. Time is money.

Neo4j’s native graph engine makes it an exceptional tool for producing real-time recommendations. Neo4j uses index-free adjacency to traverse from a node to its neighbors at a constant cost per hop, avoiding the computationally expensive joins of a relational database. This allows us to anchor onto a single node in the graph (ideally with an index lookup) and then traverse outwards, analyzing relationship patterns on the fly to generate recommendations.

In our previous movie recommendation query, we’re traversing three different relationships, which would likely equate to three joins in a relational database.

Perhaps a better example to illustrate the usefulness of index-free adjacency is identifying fraud rings.

We can use Neo4j to identify accounts that are linked through various data points, like their SSN, email or phone number. Often, these chains are dozens of hops long, making this kind of analysis painfully slow, or effectively impossible, in a relational database, whereas a graph can return results in under a second.
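
To make that concrete, here is a hedged sketch of such a query. The Account label, the account id, and the relationship types (HAS_SSN, HAS_EMAIL, HAS_PHONE) are hypothetical and not part of the movie data model above:

// Hypothetical fraud-ring model: (:Account)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->(:Identifier)
MATCH path = (a:Account {id: "A-1001"})-[:HAS_SSN|HAS_EMAIL|HAS_PHONE*..8]-(b:Account)
WHERE a <> b
RETURN DISTINCT b.id AS linkedAccount, length(path) AS hops
ORDER BY hops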

Graph Algorithms


Graphs are not new; in fact, the concept dates back almost 300 years. So it comes as no surprise that graphs have a rich ecosystem of valuable proven algorithms.

Leveraging a graph pattern known as a triadic closure is a simple yet powerful way to generate recommendations using Neo4j. A triadic closure is the inference of a weak connection between two nodes that are not directly connected but are instead indirectly connected through one or more intermediary nodes.

For example, two movies that share a common genre have a weak connection and thus some degree of similarity. In practice, this might look something like this:

MATCH (m1:Movie{name:"Iron Man"})-[:IN_GENRE]->(g)<-[:IN_GENRE]-(m2)
RETURN m2 AS recommendation, count(g) AS score ORDER BY score DESC

Here, the more of these weak connections two nodes share, the higher their implied similarity. Furthermore, because index-free adjacency allows us to traverse relationships at a constant cost per hop, we can traverse paths of arbitrary length to infer connections between nodes.

Triadic closures are a great way to develop a simple recommendation system extremely quickly, but only scratch the surface of what can be done leveraging graph algorithms.

Without going into too much detail, here are some other algorithmic paradigms which could also be used:

    • Centrality: Centrality algorithms identify important nodes within a graph, discovering nodes which are popular or influential.
    • Community Detection: Community detection algorithms evaluate how a group is clustered or partitioned, as well as its tendency to strengthen or break apart. This might reveal things like user cohorts or clusters of content.
    • Path Finding: Path finding algorithms help find the shortest path or evaluate the availability and quality of routes from, to or between nodes.
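
As a hedged illustration only, assuming the Graph Data Science library (GDS 2.x) is installed and projected onto the movie model above (procedure names differ in older releases, e.g. gds.graph.create), a centrality algorithm such as PageRank could be run like this:

// Project Movie and Genre nodes plus IN_GENRE relationships into an in-memory graph
CALL gds.graph.project('movieGraph', ['Movie', 'Genre'],
  {IN_GENRE: {orientation: 'UNDIRECTED'}});

// Stream PageRank scores and keep only the movies (the name property is assumed, as in the earlier queries)
CALL gds.pageRank.stream('movieGraph')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS n, score
WHERE n:Movie
RETURN n.name AS movie, score
ORDER BY score DESC
LIMIT 10;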

Neo4j’s Intelligent Recommendation Framework


Keymaker, Neo4j's Intelligent Recommendations Framework, is a data-model-agnostic tool designed to help organizations build and manage their graph-based recommender systems.

Keymaker includes an admin console where users can build out their recommendation pipelines, and exposes a GraphQL API where recommendations can be accessed. By focusing on the five qualities discussed above, Keymaker aims to minimize development efforts while helping users maximize the value of graphs.

For additional information on Keymaker, Neo4j's Intelligent Recommendations Framework, reach out to solutions@neo4j.com or tune in to our next blog post where we'll use Keymaker to build out a real-world recommender system.


Find the patterns in your connected data
Learn about the power of graph algorithms in the O'Reilly book,
Graph Algorithms: Practical Examples in Apache Spark and Neo4j by the authors of this article. Click below to get your free ebook copy.


Get the O'Reilly Ebook


Model Matters: Graphs, Neo4j and the Future


OpenCredo, a consulting partner in the Neo4j Partner Graph, discusses their experiences with graph databases and Neo4j, and the potential of graph-based applications.

As part of our work, we often help our customers choose the right datastore for a project. There are usually a number of considerations involved in that process, such as performance, scalability, the expected size of the data set, and the suitability of the data model to the problem at hand.

This blog post is about my experience with graph database technologies, specifically Neo4j. I would like to share some thoughts on when Neo4j is a good fit but also what challenges Neo4j faces now and in the near future. I would like to focus on the data model, which for me is the crux of the matter. Why? Simply because if you don’t choose the appropriate data model, there are things you won’t be able to do efficiently and other things you won’t be able to do at all. Ultimately, all the considerations I mentioned earlier influence each other, and it boils down to finding the most acceptable trade-off rather than picking a database technology for one specific feature one might fancy.

So when is a graph model suitable? In a nutshell, when the domain consists of semi-structured, highly connected data. That being said, it is important to understand that semi-structured doesn’t imply an absence of structure; there needs to be some order in your data to make any domain model purposeful. What it actually means is that the database doesn’t enforce a schema explicitly at any given point in time. This makes it possible for entities of different types to cohabit – usually in different dimensions – in the same graph without the need to make them all fit into a single rigid structure. It also means that the domain can evolve and be enriched over time when new requirements are discovered, mostly with no fear of breaking the existing structure.

Effectively, you can start taking a more fluid view of your domain as a number of superimposed layers or dimensions, each one representing a slice of the domain, and each layer can potentially be connected to nodes in other layers. More importantly, the graph becomes the single place where the full domain representation can be consolidated in a meaningful and coherent way. This is something I have experienced on several projects, because modeling for the graph gives developers the opportunity to think about the domain in a natural and holistic way. The alternative is often a data-centric approach that usually results from integrating different data flows together into a rigidly structured form which is convenient for databases but not for the domain itself.

As an example to illustrate these points, let’s take a very simple Social Network model. Initially the graph simply consists of nodes representing users of the service together with relationships that link users who know each other. Those nodes and relationships represent one self-contained concept and therefore live in one “dimension” of the domain. Later, if the service evolves to allow users to express their preferences on TV shows, another dimension with the appropriate nodes and relationships can be added into the graph to capture this new concept. Now, whenever a user expresses a preference for a particular TV show, a relationship from the Social Network dimension through to the TV Shows dimension can be created.

Read the full article.

I MapReduced a Neo4j store


Creating and using a Neo4j Graph Model for specific use cases

Lately I’ve been busy talking at conferences to tell people about our way of creating large Neo4j databases. Large means some tens of millions of nodes, hundreds of millions of relationships and billions of properties. Although the technical description is already on the Xebia blog (part 1 and part 2), I would like to give a more functional view of what we did and why we started doing it in the first place.

Our use case consisted of exploring our data to find interesting patterns. The data we want to explore is about financial transactions between people, so the Neo4j graph model is a good fit for us. Because we don’t know upfront what we are looking for, we create a Neo4j database with some parts of the data and explore that. When there is nothing interesting to find, we enhance our data to contain new information and possibly new connections, and create a new Neo4j database with the extra information.

This means it’s not about a one-time load of the current data that we then keep up to date by adding some more nodes and edges. It’s really about building a new database from the ground up every time we think of some new way to look at the data.

First try without hadoop

Before we created our Hadoop-based solution, we used the batch import framework provided with Neo4j (the batch inserter API). This allows you to insert a large number of nodes and edges without the transactional overhead (Neo4j is ACID compliant). The batch inserter API is a very good fit for medium-sized graphs or one-time imports of large datasets, but in our case, recreating multiple databases a day, the running time was too long.

Scaling out

To speed up the process we wanted to use our Hadoop cluster. If we could make the process of creating a Neo4j database work in a distributed way, we could make use of all the machines in the cluster instead of a single machine running the batch inserter. But how do you go about that? The batch import framework was built upon the idea of having a single place to store the data, and having a server running somewhere the cluster could connect to had multiple downsides:
  • How to handle downtime of the Neo4j server
  • You’re back to being transactional
  • You need to check whether nodes already exist
So the idea became to build the database truly from the ground up. Would it be possible to build the underlying file structure without needing Neo4j running somewhere at all? That would be cool, right? Read the full article. Here is the accompanying video:

Graph Model of Facebook Post Reactions in Neo4j Part 1

Being in Control and Staying Agile with Graph Requires Shifting Left at ING

Stay agile with graph by shifting left and designing consistent architecture that eliminates migrations.
Editor’s Note: This presentation was given by Gary Stewart and Will Bleker at GraphConnect New York in September 2018.

Presentation Summary


With challenging requirements in availability, scalability and global reach, Gary Stewart and Will Bleker at ING needed to reconsider their architecture by designing one that would remove throughput as a challenge, eliminate migrations and ensure consistency over time by means of redeployments.

In this post, Stewart and Bleker discuss the concepts they used when designing their architecture, such as pets vs. cattle (or trains vs. cars, as they call it at ING). They also detail their journey to NoSQL, which ultimately resulted in what they called the cache cattle pipeline, a paradigm they adopted to bring graph databases to life in their organization.

Furthermore, they delve into a use case for managing their Cassandra platform, including its architecture overview. Lastly, they share with us a few features and graph learnings they obtained from NoSQL thinking.

Full Presentation: Being in Control and Staying Agile with Graph Requires Shifting Left at ING


Being in control and staying agile with graph requires shifting left. I’ll break that title down in a few minutes, but before that, I want to give a bit of inspiration.



Below is my LEGO set, shown on the left. It’s called the Architecture Set Kit and it only includes white and see-through blocks. Specifically, it’s designed to help you abstract buildings away. You get to build up a building, break it down, find another inspiration and build it back up. This LEGO kit gave us a lot of inspiration for the architecture we’re going to be talking about.



To bring a little bit of the Dutch inspiration, there’s this beautiful building in Amsterdam, shown on the top right. It’s called the Amsterdam Eye, and it’s actually a film museum. I tried to make this building using the Architecture Kit. It took me about two days of breaking it up and building it. In the end, I was quite proud of my results (shown on the bottom right).

Going back to the title, being in control is all about security, data quality and availability/scalability. In the financial industry, we think we are unique in this, but in actuality, everyone needs to be in control of their data.

Staying agile, on the other hand, means obtaining new insights. Time to market is particularly important. For example, we get new features and need to build them quickly. This means our design also needs to take this into consideration.

We also need to have a pluggable architecture, and one way to manage this very well is to make a pipeline.

Being in control also requires shifting left, which isn’t a new concept per se. Shifting left effectively entails taking as much of your operational work and moving it to the design and build phases. This way, you do less on your operations side and start designing for no migrations.

One of the things we also realized is that systems are best designed not to last long. This will help you be more agile and more in control.

About Us


A short introduction: we’re Gary Stewart and Will Bleker. Gary is a data store platform architect and Will is primarily responsible for NoSQL databases within ING. Will is also the chapter lead for the events and data store team within ING, and takes care of platforms for Apache Kafka and Apache Cassandra data storage.

For those unfamiliar with ING, we’re a global financial institution based in the Netherlands, but we operate in over 40 countries worldwide with around 52,000 employees. Everything we do, design and build might be in Europe, but also has to work in Asia, America or any part of the world. Therefore, what we build needs to be easily adjustable and agile.

With that in mind, let’s get on with the post. We’re going to break it up into four main sections, starting with the concepts of how we think. Then, we’ll discuss the pipelines we use to actually make this work. Next, we’ll go into one of the use cases that we’ve used Neo4j in. As a final bit, we’ll show some of the graph learnings we’ve come up with in the last couple of months.

Concepts


Before I go any further, let’s briefly discuss the analogy of pets versus cattle.

Pets vs. Cattle


Back in 2011 or 2012, Bill Baker was struggling to explain the difference between scaling up and scaling out. He basically came up with the analogy of pets and cattle.

Pets are your unique servers. When you need more throughput from them, you generally scale up. These things tend to have a rather long lifespan.

But moving to the cloud, we generally move towards more cattle-like architecture, things that are disposable or one of a herd. Quite often, these are pipelined, so it’s simple to redeploy when something goes wrong. You scale them out and see that they’re mostly based on an active-active type of architecture.

The databases and the data stores that we support in our team – Apache Kafka and Apache Cassandra – fall mostly under this cattle type of architecture.

Our Twist: Trains vs. Cars


We kept screwing up and saying pets versus cats instead of cattle, so we decided to come up with trains versus cars as our own analogy.

When we look at our data stores, they’re just like the transport industry. In the last couple of years, they’ve gone through a couple of transformations.

There’s still a place for all of them, but in general, if we look at trains, they’re unique, long-lived items built on demand. You generally pack trains full of people, who all leave at the same time as part of a fixed schedule. If something should happen to this train halfway through our journey, then all the people on the train are delayed.

Cars, on the other hand, have gone from a manual to a fully automated process. They’ve become a mass-produced, consumable product. When something happens to your car – whether you’ve had a bad accident or it’s just old – you simply replace it for a new one, unlike a train, which has a bit longer lifespan.

In our IT landscape, we started noticing more and more that sometimes it’s okay for certain parts to not arrive at the same time. We needed that bit of flexibility and hence, we scaled out towards cars, because you have smaller amounts of people in a car. Should one car have a problem, only a part of your data will be delayed.

Our Journey to NoSQL


Our journey started from a relational database onto NoSQL, and this was quite a big mind change. We chose Cassandra purely for its high availability and scalability options. It also brought along things like tunable consistency, replicas and a whole bunch of other things that seemed quite daunting at first.

However, after using it for roughly four years now, we’ve grown so accustomed to how these databases work, as well as the added benefits you get from them. For example, lifecycle management has now become a breeze.

We’ve actually got a lot more weekends free, because in a typical master-slave, active-passive type of environment, you end up having to schedule downtime to switch between your databases. But, in an active-active solution like Cassandra, we simply do our patches during the day time now.

But we now have some liquid expectations of any new technology we want to adopt in the future: it has to follow the same kinds of rules and offer the same kinds of features we already rely on. After all, we don’t want to go back to working weekends.

About a year ago, we noticed there was a big gap in our landscape that could easily be filled by a graph database in order to get to grips with our connected data. However, we didn’t want to go down the route of another active-passive or leader-follower type of solution, so we had to figure out ways to work around these challenges and keep our weekends free.

There were also other interesting observations: if you look at other databases, for example the key-value store, it’s quite easy to mess up a partition. In a relational database, you can mess up a table with a simple query. But, in a graph database it’s pretty easy to screw up your whole database. With these – and our weekends – in mind, we had to think about ways to work around this problem.

When you take any architecture, you always have a system of records. Generally, the path is that the demand or types of queries increase and then you start having performance problems.

What we decided was we were going to put a cache in between, as shown in the image below. That cache can be any database used to solve the problem, such as Neo4j, Cassandra or any other database.



Now, you’ve got a cache, but the data might not be there yet when you query it, so you have to build a batch loader or an export. This can be an off-the-shelf tool (it doesn’t really matter), but however you do it, you have to manage that as well.

Then, of course, you need to build your APIs to use this cache database. Generally speaking, you don’t want to write to your cache – you normally just read from your cache and write to your system of record.

Finally, you realize, hang on, I only have data at a moment in time, so I need to add a real-time sync. So, we put the real-time sync in. This was a lot of work.

Then, of course, you start noticing that we also need a resilience pattern between each component, so that the APIs are resilient on the cache databases. Moreover, the cache database, batch loader and real-time sync have to be resilient. It gets more ugly when you want to be in multiple data centers.

In this way, you actually start realizing that lifecycle management is hard. Each thing has versioning and requires cross data center resilience. We decided we didn’t really like this very much anymore. Again, this is a per-use case, so you have to figure out what fits where. But, we decided that we needed to simplify this.

What if we were to put all of these components into one car and manage that on its own? This way, we wouldn’t have APIs calling different incidents of databases.



Here, you get comfortable with rebuilding – you actually have all your code there to rebuild fast. You also don’t have to do migrations, and if you want availability, you just deploy an instance in each data center. This is a lot easier to reason about. When you’re using technology like Neo4j, all the data must fit on one node anyway, so you’re actually no more restricted than before.

This is all fine and well, but we can make this even simpler, as shown in the image below. Kafka has a feature called compacted topics, which lets you get rid of the batch files and design that step away. Managing files is also a security concern, and otherwise you have to build in all the mechanisms for that yourself.



We’re not going to go into too many details about Kafka because that’s a separate post on its own, but just know that you can design the components and even take some components out by using Kafka.

You can even take it one step further, as shown below. We like cars: we see Kafka as a car, and our cache database is also a car. A lot of use cases can actually use Kafka, or any other technology, as the system of record. Then, you can get comfortable with a materialized view.



When you break a database down into two parts – a commit log and a materialized view – you basically use one technology for the commit log and another for the view of that data. Now, we’ll talk about how to build that up.

Pipeline


We’re building a whole lot of things here. We want to build them a lot, and quite often. So, how do we do that? Well, first of all, let’s put a pipeline in, because we don’t want to be building things up manually everyday.

In the example below, we’ve got two cars running. The car is the consumer on the one side, with the batch loader, the real time sync, the API for the customers and also the Neo4j database. This can run in a docker container, a VM or whatever technology you use within your company. The load balancer on the right hand side basically distributes the load across these two machines.



Let’s say we’ve got a new feature, or one of the cars broke down and we want to put a new one in. We can add a new VM or docker container, block all the ports to make sure that no client connections can come in and query the database or application while it’s being built up.

The second step is to get all the artifacts, libraries, binaries and everything else you need to build up a car. Similarly, in the car industry, they make sure they’ve got all the components in the factory to build up a car from A to B.

Then, we start with a pre-processing phase, where we source all of our data from the system of records, which is Kafka on this side. From there, we convert the data into whichever format we need (in our case it was a CSV file). We then use the Neo4j admin import tool, which I’ll cover next, to import the data into our application, or our car.

We can then start up the database, and after that, we’ve got a couple of post-processing steps, which includes things like creating indexes and running other complex queries to start up our applications and the rest of our car.

Once that’s all done successfully, we unblock the ports and perform the switch. We take out the old one, and the new one’s running without any downtime for the application.

Neo4j Admin Import


So, I touched briefly on the admin import above, but now we’ll go more in-depth. Admin import is a key tool for when you want to rebuild quite often.

In this example here, we’ve got a simple model with a source, which in our case was a host. We also have a server, and we’ve got a bunch of messages that come in and relate to this source.



Here, you’d use the admin import tool, as shown below. Do note that this can only be used once, while the database is still empty, because it bypasses the transaction manager – which is also what gives it its speed when importing loads of data.



So, we import two CSV files with nodes, the sources and the messages, and then a third one with all the relationships.

Here, the nodes have an ID, a name and a LABEL, as shown below. The relationships need a START_ID, an END_ID and the TYPE of the relationship you’re importing.



We ran this on our use case and managed to import around 11 million nodes and relationships in just over four minutes. This is quite good, especially if you’re rebuilding frequently. But, seeing as we started rebuilding four or five times per day, our clusters and environments kept on growing and getting more and more data. We needed to optimize this.

Basically, in a third of the time, we managed to import more than double the amount of nodes and relationships. This is shown below.



An easy way to get that was to shorten the IDs, as shown on the right. An ID is only used during the import – it’s not real data – so just make sure you don’t create duplicates. Shrinking the IDs decreases the size of the files, which reduces the amount of data flowing through the admin import tool and dramatically increases performance.

Use Case: Managing Our Cassandra Platform


Our particular use case was for managing our Cassandra platform. It’s not the first place you would think to use graph technology, but we wanted to stay in control of our clustered environments.

We started playing with graph and Neo4j about a year ago, and we manage a Cassandra environment that consists of a multitude of nodes and clusters. On those clusters, we again have loads of customers that all consume keyspaces. We know there are lots of clients, components and other departments within our company that we have to deal with, but only when you start putting data into a graph database are you able to visualize it. From there, you realize how many components you actually have to deal with.

Now, this is pretty much business as usual, because every company has rapid growth and dependencies. But we wanted to stay in control of our clusters. What we were seeing was that as soon as an incident occurred on our multi-tenant environment, it affected loads of customers, all with their own needs and use cases.

Moreover, if something happened on one of our clusters, we would see the operators log into a multitude of nodes, run all kinds of queries to find where the problem was, look at all the different components and try to piece together – in their own heads – the links between the problems.

We thought there must be a better way and wanted to pipeline this. Because we no longer valued one-off bursts of heroic effort, we needed to capture what our heroes did in a pipeline and find a way to work around this problem.

Architecture


What we did was create this simple architecture, where we have a producer on one side, a queue in the middle and a consumer on the other side.



The producer is something really simple and efficient. It’s a small binary that runs on all of our nodes, and basically runs all of the commands that the operators would do once they log into the machines to try and get information about them.

The producer emits just raw messages; we don’t do any filtering or anything fancy on the producer side. All the raw events are pushed onto the Kafka topic and consumed by the consumer on the other side, which is a much more complex piece of technology: that is where all the parsing of the rules and data happens.

In the image below, we’ve got our producers on the left side running on all the nodes. Should we need to add new commands, we just update a configuration file on the node that contains the list of commands and the interval at which we want them to run. The raw messages are then put onto a Kafka topic.



On the other side, we’ve got the materialized view: the “car” part of our architecture that we explained earlier. The consumer reads it, transforms the data into the format we can use for Neo4j and imports it into Neo4j.

Before we have an incident, we still need to manage the configuration for certain aspects of our car. We store this in GitLab. The consumer then consumes from both the Kafka topic and GitLab to obtain the desired state of the environment.

Now, when they get an incident on their environment, instead of logging into all the nodes and trying to find relationships between different components within the environment, the operator can now log into the Neo4j database. They can do so either via the web interface, Python script or whatever method they choose, and run queries to find relationships between different components.

All of this grew quite fast and we empowered a lot of people with this, where everybody had their own Cypher queries. This became a bit unmanageable, because we still had heroes, the guys with the Cypher queries.

So, we ended up going for an option using GraphGists: we would copy and paste commands and they would run. But it’s wasteful if five people keep running the same commands every day. So, we thought, let’s stick that into the consumer, ask the consumer to run these commands every 10 minutes and output the results to a static HTML page. This way, when something happens in our environment, you don’t need heroes with Cypher queries anymore – you simply need someone who can look at an HTML page.

Now that we have all this data, have set the rules and have stored a bunch of configuration values and thresholds for our environment in GitLab, we can also start generating alerts while processing them.

Through this process, we’ve gone from a simple producer to the consumer where everything is parsed and even produces notifications to the end points.

Consumer: Graph Model


If we narrow down into the consumer – because the consumer’s not only where a lot of the logic is, but also the one that gets rebuilt all the time – we would have to parse the message. Because we’re doing nothing on the producer side to parse the message, we leave that to the consumer. This is the first step.

Generally speaking, you’ll want to parse, though you might have to change the rules later if you have new insights or want to take some more data out. That would normally break down to key-value pairs, which we can get into the graph database very quickly. A message on its own is mostly meaningless, but one combined with all the other messages from other hosts, machines and message types holds value.

Once you have your data in the graph database, you can start doing observations, calculations and recommendations, because you can now automate this. We started doing this repeatedly, and while we were doing this, we came up with a few principles that helped guide us to build this correctly.

One main principle is that you definitely need an inventory. You need a system of truth and you need to know what you’re looking for. If you have a self-discovery system, then you can always find out what’s going on. You have to have expectations so you can check what’s going on.

Second, you need to define agreements in code. If you can’t make agreements in code and have that checked once it’s deployed or rolled out, then it starts becoming very difficult to check. Every agreement – such as configuration for types of clusters – needs to be put in code so that you can actually use them in your graph model.

We’ve been doing databases for years, and one of the things we don’t like about graph – sorry, Neo4j – is that it’s schemaless, or schema-optional at best. On one hand, we love it, and on the other, we hate it. Schemaless doesn’t force you to document the data model, and requires you to have another tool if you’re rapidly building stuff. So, we actually ended up documenting a data model.

This is something we had never done. One of the things that we found missing is that you can’t easily add comments or metadata about the model in the graph – or at least we haven’t found out how to do that yet.

So, another key principle is that you need to get into the comfort of rebuilding, since no data migrations are allowed. When you repeatedly rebuild your model, the next person who comes into the project will be able to learn that project very quickly, because they now understand how things get to where they are. Because a graph model is very functional, you not only need to understand graph as a technology, but also understand the data that you’re working with.

So What Happened?


What actually happened was we wanted to learn graph. We weren’t really fully convinced it was the best fit for our use case, but it actually turned out to work really well.

Where did this actually come from? We started with four or five commands in the producer, and within two or three months, we ended up adding 50. This happened so quickly because we were like little kids in a toy store getting data out, linking stuff and figuring out what we could add.

Then, we kept adding configurations to the producer, and the momentum kept building. We even, for example, fed in netstat information and tried to check which applications were connecting to the cluster and whether they were connecting to all the nodes correctly. By just looking at the behavior of the operational data, we managed to find misconfigured customers.

From there, we realized our architecture helped us with this way of thinking. Because it became easy to rebuild, we created a learning environment for learning graph. Since we started, it has taken over 300 cars, destroyed and rebuilt, to actually learn how to use graph and gather our findings after each iteration.

Graph Learnings from NoSQL Thinking


This moves us into the last section.

The first feature we’ll discuss is one of the lesser-known ones. Traditionally, when you get a message, parse it and try to get it into the database, you have to write SET followed by “property name equals” for every property, and you end up with long lists. However, you can also SET with a map: using +=, you get all of the properties in one go, without having to repeat the list of properties mentioned in the WITH.
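
A minimal sketch of that pattern (the Source label and the property names are hypothetical, chosen to match the simple model described later in this post):

// The map could come straight from the parsed key-value pairs of a raw message
WITH {host: "cass-node-07", status: "UP", datacenter: "dc1"} AS props
MERGE (s:Source {host: props.host})
SET s += props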



The reason this is also interesting is that we can now change the parsing rules without having to change the query that writes the data into Neo4j. Essentially, you break everything down into key-value pairs, get them into your database and write your queries. If you change the parsing rules, the derived data also changes and you need to rebuild. That’s something we did quite often, so we simply rebuilt and destroyed the old instance.

Let’s move onto the next feature we learned. As we mentioned above, we actually started documenting a data model, which is shown in the image below. It’s blurred out, but I wanted to point out that the blue source is in a different place in my model than the yellow source. The two are not the same.



Because we were building so fast, we’d add another label or node and kept adding and adding. Then we realized we had accidentally used the same label for two different objects. So, when we ran CALL db.schema(), we’d get a weird linking mess. That’s why we documented the data model: to make sure we don’t make these mistakes again.

What are solutions to this? You can use the property existence constraint feature, which protects you in a lot of cases. However, in the Cassandra world, we always say you need to solve your reads with your writes. Overall, the theme is that you can simply fix the code and rebuild or destroy the instances.
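
For reference, a hedged sketch of such a constraint (property existence constraints require Neo4j Enterprise Edition, and the syntax depends on the version; the Source label and host property are hypothetical):

// Neo4j 4.4+ syntax
CREATE CONSTRAINT source_host_exists FOR (s:Source) REQUIRE s.host IS NOT NULL
// Older releases used: CREATE CONSTRAINT ON (s:Source) ASSERT exists(s.host)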

Now, let’s look at a very simple model, a source HAS_MSG.



It’s not rocket science, but what we forgot here was that we were suddenly adding 50 produced messages every hour. Over time, the number of nodes attached to the source became very large. We’d always joke that if you’re too scared to open or expand a node in the browser – when you press plus and your browser starts getting a little bit wobbly – you might want to rethink your model to reduce the density of those nodes.

So, how would you do that? Instead of the source having many HAS_MSG relationships, you can say that a source has a LAST_MSG, which has a PREV_MSG, which has a PREV_MSG, and so on, forming a chain.
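
Sketched in Cypher (hypothetical labels and properties, consistent with the simple Source/Message model above, and assuming at most one LAST_MSG per source), pushing a new message onto the head of such a chain might look like this:

MATCH (s:Source {host: "cass-node-07"})
OPTIONAL MATCH (s)-[old:LAST_MSG]->(prev:Message)
CREATE (s)-[:LAST_MSG]->(new:Message {text: $text, ts: timestamp()})
// Link the new head to the previous head, if there was one
FOREACH (p IN CASE WHEN prev IS NULL THEN [] ELSE [prev] END |
  CREATE (new)-[:PREV_MSG]->(p))
DELETE old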



In this use case, we mostly cared about the last message, not all the other ones. Though we needed them, they weren’t the main driver. We actually learned this at a presentation in GraphConnect London. We came back to our office, simply changed our code and rebuilt.

Finally, we talked about never doing a migration. At first, we thought we might as well attempt a migration because there was no point in saying migrations were bad if we’d never actually tried. And we’d often write the following query:
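
The original query was only shown as a screenshot in the talk, so the following is an illustrative reconstruction with hypothetical labels, not the exact query that was run:

// The WHERE here filters only the OPTIONAL MATCH, not the first MATCH,
// so the delete runs against every Source in the database.
MATCH (s:Source)
OPTIONAL MATCH (s)-[:HAS_MSG]->(m:Message)
WHERE s.host = "cass-node-07"
DETACH DELETE s, m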



Every time I wrote one of these queries against the database, I basically had to destroy and rebuild it. Essentially, the outcome was that I destroyed my database purely because of one simple mistake. The problem with the WHERE coming after the OPTIONAL MATCH was that the first MATCH was left unfiltered, so the query operated on a much bigger dataset and did a lot of things we never intended with the first query. Instead, we needed to modify our code to look like this:
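
Again as an illustrative reconstruction (the corrected query was also shown only as a screenshot), the fix is to move the WHERE up so it constrains the first MATCH:

MATCH (s:Source)
WHERE s.host = "cass-node-07"
OPTIONAL MATCH (s)-[:HAS_MSG]->(m:Message)
DETACH DELETE s, m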



Conclusion


We started off this post by saying that in order to stay in control and be agile with graph, you need to shift left. Shifting left is not an easy thing to do and requires a lot of upfront work, but once you nail the shift left and design for rebuild, you ensure agility.

By designing for rebuild, you can simply add new features as often as you want and just rebuild. You create a learning environment in which people aren’t afraid to make changes, because if a change doesn’t work, they can simply rebuild to the old version.

Ultimately, this ensures that you can keep control over your environment. The more often you rebuild your data, the better your data quality will be. You don’t have to work with data migrations, so there are fewer places to mess up your data. Moreover, when you want to scale out, you simply spin up the extra nodes you need. That’s how we stay in control of our environment.


Think you have what it takes to be Neo4j certified?
Show off your graph database skills to the community and employers with the official Neo4j Certification. Click below to get started and you could be done in less than an hour.


Get Certified

Graph Data Platforms: From a Napkin Sketch to a Category Leader

Learn about Neo4j as a graph data platform leader.
A famous Chinese proverb says: “A journey of a thousand miles begins with a single step.”

On its path to being named a leader in “The Forrester Wave™: Graph Data Platforms, Q4 2020” by Forrester Research, Neo4j learned useful lessons for developers at every single step.

Michael Hunger, Neo4j’s head of developer relations, recounts some of the experiences that forged Neo4j into an industry leader as it advanced from an idea to a sketch, to a new database type, to a category, and on to leadership in a market that Amazon and Microsoft have now entered as challengers.

Blaise James: Michael, we often hear the Neo4j Napkin origin story. For those unfamiliar, could you relay what this is?

Michael Hunger: Neo4j was extracted from a production application in the DMS/CMS SaaS space in Sweden in the early 2000’s. Originally, two use cases could not be satisfied by relational databases despite much effort: real-time permission resolution for a SaaS business and complex semantic networks of hierarchical, translated keywords. That led the engineers on the project to explore other means, one of which was an in-memory graph model.

To gather help building that, our co-founder Emil flew to Mumbai and on that flight sketched the basic building blocks of the pragmatic property graph model – nodes and relationships with properties. And yes, this is documented on a literal airplane napkin.

Later on, that graph model approach was implemented successfully in Sweden and formed the kernel of the Neo4j graph database platform that we know today.



James: Can you share more about what the Neo4j technology journey has looked like between now and then?

Michael: At the starting point, we were a bit naive: “Building a database… how hard can that be?” It turned out – pretty hard. But we invested thousands of person-years into building Neo4j from nimble beginnings into the broad platform it is today. Since the beginning, Neo4j’s focus has always been on making developers’ lives easier; that’s why we decided to use the pragmatic property graph model and not more scientific models like RDF. Now we are doing the same for data scientists, by making graph algorithms approachable and easy to use.

For our customers and users, having a graph database at their disposal enables them to solve problems and gain insights they would otherwise not be able to.

Our technology has evolved significantly, growing from a core library with the graph model to handling transactions, memory and I/O like a proper database. Early on, the library was wrapped into a server with APIs, which then enabled the first graphical user interfaces. Shortly after, we started implementing the Cypher query language to make working with the database available to all kinds of programming languages and environments. Meanwhile, the clustered Neo4j solution took several iterations from Zookeeper to Paxos and now (Multi-)Raft which powers our causal clusters.

Like many other successful data-intensive services (Cassandra, Kafka, Spark) we rely on the JVM for scalability and portability. To make it easier to build applications, we devised a binary protocol (bolt) and with that official drivers for .Net, JavaScript, Python, Go and Java. To improve the usability of the platform, Neo4j Browser, Neo4j Desktop and Neo4j Bloom form our initial set of developer and end-user facing applications that bring modern web-application (React and GraphQL) feel to database interactions. These are supplemented by an ever-growing list of extensions in the form of graph apps.

While Neo4j has been available as an official Docker image for a long time, our cloud offering – Neo4j Aura – utilizes Kubernetes operators. As part of our Neo4j Labs efforts, a large number of user-defined utility procedures (APOC), as well as the GRANDstack GraphQL integration, make application development easier. Our first-class integrations for Kafka, Spark and JDBC enable Neo4j to fit into modern data architectures. A more recent addition to the graph platform is the Graph Data Science Library, which uses resource-efficient computation to enable large-scale graph computation on complex connected data.

At the recent NODES conference, Emil spoke of making the impossible possible, then usable and then magical. As you can see in our journey so far, we already made good progress on that path and the widespread use and enthusiasm of developers for our capabilities, features and tools confirm that.

Graphs are already magical and Neo4j puts that into your hands, which is why folks often declare their “love” for Neo4j or Cypher.

While developers have been excited about using graphs for a long time, making the case within an organization can be a bit more challenging. While our customers regularly demonstrate in presentations and articles how they could achieve new capabilities, speed up their development or just save money, it’s often not enough to convince your boss.

That’s why the Forrester Wave for Graph Data Platforms is a really important publication. It is the first major analyst report that covers what we have been working on for more than 10 years. For one, it confirms the maturity of the graph space to have a leading analyst firm take the time and effort to evaluate this segment of data platforms. Having an independent third party compare and evaluate the different offerings objectively also lends credibility and trust to the results.

Many organizations try to reduce the risk in adopting new technologies so they either look to their peers or independent sources to confirm their choices. With the Forrester Wave, you can provide that information to your decision-makers to support the graph projects you are convinced are the right choice for your organization.

James: We’re excited that Neo4j outperformed 12 other vendors in the space. Can you share your perspectives on why that leadership is warranted?

Michael: There are several interesting areas that speak to the strength of our engineering team and the elegance of the core product architecture. In the case of performance which is critical, especially to transaction-based use cases, our scalable core architecture and Cypher engine enable users to achieve the performant results they expect in their production environments, while enabling us to constantly improve many aspects of the platform.

Our scalability is top-rated in the report, both for transactional and data science workloads. From scaled up single instances to clustered or sharded environments you can choose the scale aspects that are the best fit for your needs.

That said, graph databases don’t need to be gigantic to deliver a lot of value – even small graphs of a few thousand nodes can be worth millions of dollars. But Neo4j is used by some of the largest companies in the world to run real-time, customer-facing production applications, which fortunately take much fewer compute resources than other comparable database solutions. Instead of running clusters of 100 or 60 or 12 machines, often a three-instance cluster of Neo4j is enough to serve the workload. At the same time, Neo4j is proven to scale out for many billions of elements in the graph both for transactional and analytics workloads.

Since day one, Neo4j has been transactional and has never given up on that guarantee. The fine-grained graph model requires transactionality to safely handle complex network updates without compromising data consistency.

James: Another thing that struck me – as a relative newcomer to Neo4j – is the fact that leadership is about so much more than the actual technology.

Michael: Yes, that’s right. Our active, supportive community is often cited as one of the main reasons for choosing Neo4j. I’ve been involved in growing and supporting our community for the last 10 years and am proud to say it’s the best community I’ve ever been part of – extremely helpful, friendly and knowledgeable, and a place for both newcomers and experts alike.

I think our community is one of the main reasons why, if you talk graph databases, you talk Neo4j. We created and grew the graph database category over the last 10 years and invested a lot of effort into educating developers, data scientists and other users. When searching for any topic related to graph databases or graph data science, our technology and resources are at the top of the list. This is also reflected in our leading position in the graph database category on the popular db-engines.com site, which takes as many as 12 metrics into account for its scoring. But that’s a conversation for a future discussion!

Another thing I’ve observed is that we don’t compromise on quality. For instance, the Neo4j customer support team exceeds expectations every time. We have a super high satisfaction rating combined with a quick turnaround time on tickets, and if need be, they bring in core engineering for fast and thorough resolution of issues.

The last key contributor to our leadership is that we have a truly global perspective. Neo4j grew from its Swedish roots and is now headquartered in the Bay Area with engineering in London (UK) and Malmö (Sweden) and everyone else distributed across the globe including APAC. This means that we both serve our customers locally and globally, depending on their needs. Similar to the versatility of use cases, our users work and operate all over the world.

James: What do you think that says about Neo4j and its approach to innovation?

Michael: Like every mature company, you need to balance customer commitment with pure innovation. In our case, we carefully pick the areas where we innovate a lot – like graph data science, Cypher runtime, cloud operations and developer tools – versus the parts where we are quite conservative – data safety and security, transactions, operations and core database features.

We can’t and don’t want to risk the trust of our users and customers. At the same time, there are a lot of exciting developments in many areas that we are participating in. For instance, modern application development with GraphQL, integrations with Apache Spark or Kafka. We also try to push the envelope by providing the world with the best, open graph query language, GQL, which was recently approved by the ISO committee as the first new query language effort since the inception of SQL.

As part of our engagement with users, we gather firsthand qualitative and quantitative feedback from discussions, training classes, customer engagements and community forums. This feeds into our product roadmap prioritization. With a renewed focus on developer experience (DX), especially in our cloud offering Neo4j Aura, we want to make the new user experience as friction free as possible.

James: What does Neo4j make of behemoths like Amazon and Microsoft entering the graph market space?

Michael: Usually you would be concerned about such a development, given the deep pockets and engineering prowess of the giants. But in this case, we welcome their entry for two reasons: they validate our market segment and grow the visibility of graph technology – “a rising tide lifts all boats” situation. But given their size, graphs are not a focus for them, it is just one of the many topics they juggle.

For us, graph technology is at the heart of what we do, every hour of every day. So we can excel at it and provide the best service to our users and customers while the big vendors only provide a “checkbox” implementation.

James: Given that Neo4j earned the highest possible score on its roadmap, would you give some insight into what new features developers might see from Neo4j in the not-too-distant future?

Michael: As Emil also alluded to in his NODES keynote, we put a lot of focus on making Neo4j available on all cloud platforms for both transactional and data science uses, supporting needs from individual developers to large enterprises. Democratizing graph data science is another one of our big goals.

And last but not least, continuing to make it easier to build applications, import data and integrate in cloud native services. All these topics are serving our ultimate goal – helping people make sense of data.


Get your free copy of the report The Forrester Wave™: Graph Data Platforms, Q4 2020, to learn more about new and emerging database technologies that allow enterprises to solve complex problems and create meaningful insights quickly.

Get My Free Copy